CN111626414A - Dynamic multi-precision neural network acceleration unit - Google Patents

Dynamic multi-precision neural network acceleration unit

Info

Publication number
CN111626414A
Authority
CN
China
Prior art keywords
neural network
data
unit
memory
computing unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010747687.0A
Other languages
Chinese (zh)
Other versions
CN111626414B (en)
Inventor
伍元聪
刘洋
田野
刘祎鹤
刘爽
于奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010747687.0A priority Critical patent/CN111626414B/en
Publication of CN111626414A publication Critical patent/CN111626414A/en
Application granted granted Critical
Publication of CN111626414B publication Critical patent/CN111626414B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 - Interprocessor communication
    • G06F 15/173 - Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F 15/17356 - Indirect interconnection networks
    • G06F 15/17368 - Indirect interconnection networks non hierarchical topologies
    • G06F 15/17381 - Two dimensional, e.g. mesh, torus
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/177 - Initialisation or configuration control

Abstract

The invention relates to the fields of integrated circuits and artificial intelligence, in particular to system-level integrated circuit chips, and specifically to a dynamic multi-precision neural network acceleration unit. In the invention, the PE units are first arranged into arrays, and these PE arrays are in turn used as array elements, forming a dual-array design that supports the operations of different neural network layers and meets the requirements of different algorithms and different data precisions. The neural network computing unit is designed for hardware acceleration, and a pipeline design is added to the overall design, greatly improving the data throughput and operation speed; the multiply-add units of the neural computing unit are optimized, increasing the reuse of hardware resources and greatly reducing the hardware area. Data types of different precisions can be computed according to the requirements of the neural network model, and the basic operations in a neural network, such as convolution, pooling, nonlinear mapping and matrix multiplication, are supported.

Description

Dynamic multi-precision neural network acceleration unit
Technical Field
The invention relates to the fields of integrated circuits and artificial intelligence, in particular to system-level integrated circuit chips, and specifically to a dynamic multi-precision neural network acceleration unit.
Background
With the explosive growth of data and the rapid increase in computing performance, machine learning has become increasingly popular and has gradually penetrated every corner of our lives. Common applications include voice assistants such as Apple's Siri and Microsoft's Cortana; face recognition such as Apple's iPhoto and Google's Picasa; and artificial intelligence such as AlphaGo from DeepMind, a Google subsidiary, which has defeated top human players. Currently, the most prominent area of machine learning is deep learning, represented by neural networks.
A deep neural network simulates the neural connection structure of the human brain by building a model, and has achieved major breakthroughs in fields such as image processing and speech recognition. However, as neural network models become more and more complex, the weight data grow larger and the required precision becomes higher, and, starting with convolutional neural networks, ever larger image inputs place higher demands on the hardware. Commonly used general-purpose processor chips are not entirely suitable for neural network acceleration because of their general-purpose features, such as generality, branch prediction and address merging, which occupy chip area and resources. Existing general-purpose processors have gradually become unable to meet the requirements of neural network acceleration, so dedicated neural network acceleration hardware has been designed to accelerate neural networks.
Because field-programmable gate arrays (FPGAs) are highly programmable and have short development cycles when used as accelerators, research on using them for neural network acceleration is increasing. However, current deep neural network computation still relies heavily on dense floating-point matrix multiplication rather than on specialized data types (such as sparsely compressed data), which makes it better suited to mapping onto GPUs with their conventional parallelism; GPUs are therefore still widely used in the market to accelerate deep neural networks. Although FPGAs offer excellent energy efficiency, they cannot yet match GPU performance and are not yet suitable for widespread adoption.
Disclosure of Invention
In order to solve problems of the existing neural network acceleration technology such as power consumption, cost and area, the invention provides a dynamic multi-precision neural network acceleration unit, which is an application-specific integrated circuit (ASIC). Compared with a common ASIC, the invention is more flexible because of its programmability and multi-precision support. Such an ASIC makes low-power, low-cost, high-performance online-learning artificial intelligence chips possible.
The specific technical scheme is as follows:
a dynamic multi-precision neural network acceleration unit, comprising: the device comprises a data interface module, a configuration interface module, a memory interface, an input buffer module, a neural network computing unit, a memory scheduling unit, an output buffer module and a system control module.
The configuration interface module configures the entire system by conveying configuration information to the system control module.
The data interface module is responsible for transmitting input data to the system.
The memory interface is connected with the input buffer module, the memory scheduling unit and the output buffer module. Under the control and monitoring of the memory scheduling unit, it transmits the weight data stored in the memory to the input buffer module, receives the operation result data from the output buffer module, or temporarily stores the intermediate operation results of the neural network computing unit as directed by the memory scheduling unit.
The input buffer module is connected with the data interface module, the memory interface and the neural network computing unit; it preprocesses the weight data from the memory and the input data obtained through the data interface, and sends them to the neural network computing unit.
The output buffer module is connected with the memory scheduling unit and the memory interface. After the neural network computing unit obtains the final operation result, the output buffer module organizes the result under the scheduling of the memory scheduling unit and stores it at the corresponding memory address.
The memory scheduling unit is responsible for scheduling data and interacting with a memory, and is used for fetching data from a corresponding address, storing a final calculation result into a corresponding memory address or temporarily storing an operation result of the neural network calculation unit into the memory; the system control module controls the input buffer module, the neural network computing unit, the memory scheduling unit and the output buffer module to cooperatively and synchronously work by monitoring the current data flow states of the neural network computing unit and the memory scheduling unit.
According to the requirements of the neural network model, the acceleration unit supports data-type operations at different precisions. It also supports the basic operations in a neural network, such as convolution, pooling, nonlinear mapping and matrix multiplication.
The neural network computing unit comprises N parallel processing element (PE) arrays, with N ≥ 2. A corresponding number of PE arrays are programmed according to the neural network algorithm model and run in parallel, improving computing efficiency and speed, and perform the convolution and matrix multiplication operations. Each PE array is an n1 × m1 two-dimensional matrix composed of S processing elements (PEs), with n1 ≥ 2 and m1 ≥ 2, and the connection paths between the PEs are programmable so as to support neural network algorithm models of different scales.
When the neural network computing unit performs a convolution operation, the number of PEs is configured according to the size of the convolution kernel to meet the computing requirement. At the same time, the number of PE arrays is configured for parallelism in the output-channel dimension and the input-data dimension: different PE arrays are controlled to produce different output channels, and multiple PE arrays are controlled to process different parts of the same input image simultaneously, improving parallelism and accelerating the operation.
When the neural network computing unit performs matrix multiplication, the PEs are configured according to the size of the input matrix, and PE arrays are spliced in the row or column dimension to meet the computing requirement.
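As a rough illustration of this configurability, the following Python sketch (not from the patent; the helper names and the per-dimension tiling rule are assumptions based on the splicing scheme described above) estimates how many PEs a convolution kernel needs and over how many 3 × 3 PE arrays a matrix-multiplication result would be tiled.

```python
import math

# Hypothetical helpers (assumptions, not the patent's configuration logic):
# they only mirror the sizing rules described in the text above.

def pes_for_convolution(kernel_size):
    """For convolution, a PE array is sized to the kernel (e.g. a 3x3 kernel -> 9 PEs)."""
    return kernel_size * kernel_size

def pe_arrays_for_matmul(result_rows, result_cols, array_rows=3, array_cols=3):
    """PE arrays spliced in the row/column dimensions to cover a result matrix."""
    return math.ceil(result_rows / array_rows) * math.ceil(result_cols / array_cols)

print(pes_for_convolution(3))        # 9 PEs for a 3x3 kernel
print(pe_arrays_for_matmul(3, 6))    # 2 arrays spliced in parallel for a [3, 6] result
```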
Furthermore, the data interface, the memory interface, the input buffer module and the output buffer module can be attached to an on-chip bus, which improves transmission efficiency and optimizes the structure.
Furthermore, in the neural network computing unit, an M-level addition chain is connected after each PE array, and a K-level addition tree is connected after all the M-level addition chains to reduce the critical-path delay, thereby improving the system speed and computing efficiency.
Furthermore, each PE of the neural network computing unit can perform multiply-add operations on floating-point and fixed-point numbers of multiple data precisions, namely int-16, int-8, int-4, int-2, int-1 and floating-16.
Furthermore, the PEs of the neural network computing unit multiplex their hardware resources: high-bit-width multiplications reuse the low-bit-width multipliers, and additions at all data bit widths reuse the same adder.
Further, the neural network algorithm model is ResNet, VggNet, GoogleNet, YoLo or RCNN.
In the dynamic multi-precision neural network acceleration unit provided by the invention, the PE units are first arranged into arrays, and these PE arrays are in turn used as array elements, forming a dual-array design that supports the operations of different neural network layers and meets the requirements of different algorithms and different data precisions. The neural network computing unit is designed for hardware acceleration, and a pipeline design is added to the overall design, greatly improving the data throughput and operation speed; the multiply-add units of the neural computing unit are optimized, increasing the reuse of hardware resources and greatly reducing the hardware area. Data types of different precisions can be computed according to the requirements of the neural network model, and the basic operations in a neural network, such as convolution, pooling, nonlinear mapping and matrix multiplication, are supported.
Drawings
FIG. 1 is an architectural block diagram of the present invention.
FIG. 2 is a block diagram of the architecture of the neural network computational unit of the present invention.
FIG. 3 is a block diagram illustrating the structure of a single PE according to the present invention.
Fig. 4 is a structural block diagram of building a high-bit-width multiplier by using a low-bit-width multiplier according to the embodiment of the present invention.
FIG. 5 is a schematic diagram illustrating data flow during a convolution operation according to an embodiment of the present invention.
FIG. 6 is a schematic diagram illustrating data flow during mapping a matrix operation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example one
The dynamic multi-precision neural network acceleration unit provided by this embodiment is applied to the design of an AI acceleration chip in the field of intelligent driver assistance. Its architecture is shown in fig. 1 and includes: a data interface module, a configuration interface module, a memory interface, an input buffer module, a neural network computing unit, a memory scheduling unit, an output buffer module and a system control module.
The system control module controls the input buffer module, the memory interface, the neural network computing unit, the memory scheduling unit and the output buffer module, and controls and coordinates the whole system by monitoring the process of the whole neural network.
The configuration interface module may configure the entire system as desired by conveying configuration information to the system control module.
The system control module can be attached to a peripheral bus, such as the APB bus of AMBA, and interacts with the configuration interface and the external system through this bus.
The data interface module is responsible for transmitting input data to the system, supports interfaces such as USB, MIPI, RGB888, BT1120 and BT656, and transmits image data and other types of data.
The memory interface is connected with the input buffer module, the memory scheduling unit and the output buffer module; the memory interface can transmit the weight data stored in the memory to the input buffer module or receive the operation result data of the output buffer module under the control and monitoring of the memory scheduling unit, or control and temporarily store the intermediate operation result of the neural network computing unit through the memory scheduling unit when the data scale is large.
The input buffer module is connected with the data interface module, the memory interface and the neural network computing unit; it preprocesses the weight data from the memory and the input data obtained through the data interface, and sends them to the neural network computing unit.
The output buffer module is connected with the memory scheduling unit and the memory interface. After the neural network computing unit obtains the final operation result, the output buffer module organizes the result under the scheduling of the memory scheduling unit and stores it at the corresponding memory address.
The data interface, the memory interface, the input buffer module and the output buffer module can be attached to an on-chip bus, for example one using the AXI bus protocol of AMBA, which improves transmission efficiency and optimizes the structure.
The memory scheduling unit is responsible for scheduling data and interacting with the memory: for example, data is fetched from a corresponding address, a final calculation result is stored in a corresponding memory address, or when data processed by a neural network calculation unit is too large, a part of calculation results are temporarily stored in a memory and then are fetched.
The structure of the neural network computing unit is shown in fig. 2, and includes: an array of N parallel Processing Elements (PEs), an addition chain connected to the array, and an addition tree.
The single PE arrays can be arranged into a desired size, for example, 3 × 3, 5 × 5 (square array in general), according to the practical application and the size requirement of the convolution kernel.
N PE arrays can be arranged in parallel to improve the efficiency and speed of calculation processing.
After each single PE array, an M-level addition chain can be connected, where M is the order of the PE array. The chain adds the computation results of the PE array in stages rather than adding all results at once, which improves the efficiency and speed of the computation at the hardware level.
After all the addition chains, a K-level addition tree can be connected, which processes the N results of the preceding M-level addition chains, where

$$K = \lceil \log_2 N \rceil$$

and N is the number of parallel PE arrays. This improves the efficiency and speed of the computation and reduces the critical-path delay, thereby improving system speed and computational efficiency.
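For illustration, a minimal Python sketch of such a pairwise reduction (a software analogue, not the patent's RTL) is given below; the number of tree levels it uses matches the reconstructed relation K = ⌈log₂ N⌉.

```python
import math

def adder_tree_reduce(partial_sums):
    """Reduce the N partial sums from the addition chains with a balanced adder tree."""
    values = list(partial_sums)
    levels = 0
    while len(values) > 1:
        # one tree level: neighbouring pairs are added in parallel
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels

total, K = adder_tree_reduce([3, 1, 4, 1, 5])          # N = 5 partial sums
assert total == 14 and K == math.ceil(math.log2(5))    # K = 3 tree levels
```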
The PE of the invention mainly completes multiplication and accumulation operations, supports the calculation of multi-precision data types, and supports the calculation of fixed point numbers as well as floating point numbers.
Specifically, the PEs in the neural network computing unit may dynamically support data type operations with different precisions in the neural network model according to configurations, such as: data types such as int-16, int-8, int-4, int-2, int-1, floating-16, etc.
Fig. 3 shows the structure of a PE according to this embodiment. When the input data and the weight data are floating-point numbers, the multiplication operation requires the multiplier and the adder to cooperate in order to complete the exponent addition and mantissa multiplication of the floating-point numbers; the addition operation requires the comparator and shift register shown in the figure to complete the alignment and sign-handling steps of floating-point addition. When the input data and the weight data are integers, the multiplication operation only requires the multiplier, and the addition operation can also skip the floating-point modules and feed the data directly into the adder.
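The following Python sketch illustrates only the decomposition mentioned above (exponent addition plus mantissa multiplication); it is an assumption-level software analogue, not the PE's actual floating-16 datapath.

```python
import math

def float_mul_decomposed(x, y):
    """Multiply two floats by adding exponents and multiplying mantissas."""
    mx, ex = math.frexp(x)      # x = mx * 2**ex, with 0.5 <= |mx| < 1
    my, ey = math.frexp(y)
    mantissa_product = mx * my  # mantissa multiplication (the PE's multiplier)
    exponent_sum = ex + ey      # exponent addition (the PE's adder)
    return math.ldexp(mantissa_product, exponent_sum)

assert float_mul_decomposed(3.5, -1.25) == 3.5 * -1.25
```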
The PE structure also includes a data transmission interface: when data flows through the PE array without being computed in a given PE, that PE simply outputs the data to the next PE after one clock cycle.
Alternatively, the multipliers employed in the PE structure of fig. 3 are built using the lowest precision data type. Taking int-16, int-8, int-4 and floating-16 data types as examples, all multipliers are built by adopting 4-bit multipliers, and the multiplication with higher data width is formed by splicing a plurality of 4-bit multipliers.
Specifically, for example, if a and b are 16-bit data types, each is split into two 8-bit numbers a1, a2 and b1, b2, so that $a = a_1 \cdot 2^{8} + a_2$ and $b = b_1 \cdot 2^{8} + b_2$. The 16-bit multiplication of a and b can then be split into 8-bit multiplications:

$$a \cdot b = a_1 b_1 \cdot 2^{16} + (a_1 b_2 + a_2 b_1) \cdot 2^{8} + a_2 b_2$$

Similarly, each 8-bit number can be split into two 4-bit numbers, so that the 8-bit multiplications are in turn split into 4-bit multiplications.
In the same way, 16-bit numbers c and d can be split into the 4-bit parts c1, c2, c3, c4 and d1, d2, d3, d4, with $c = \sum_{i=1}^{4} c_i \cdot 2^{4(4-i)}$ and $d = \sum_{j=1}^{4} d_j \cdot 2^{4(4-j)}$. The 16-bit multiplication of c and d can then be split into 4-bit multiplications:

$$c \cdot d = \sum_{i=1}^{4} \sum_{j=1}^{4} c_i d_j \cdot 2^{32 - 4(i+j)}$$
specifically, as shown in fig. 4, the 8-bit multiplier is built up from 4 bits, and the 16 bits are built up from 4 8-bit multipliers.
Optionally, the adder in the PE hardware structure of the present invention multiplexes the same adder, so as to achieve multiplexing of hardware resources as much as possible, and reduce the area required for design.
Specifically, if N fixed-point numbers of T bits are added, the required adder bit width is

$$T + \lceil \log_2 N \rceil$$

so the same adder can be multiplexed.
For convenience of explanation, each single PE array mentioned below in this embodiment is 3 × 3 in size. Taking convolution as an example, the data flow of a convolution operation according to the invention is described below in conjunction with fig. 5.
As shown in fig. 5, taking a convolution kernel size of 3 × 3 as an example, the input buffer module loads the weight values corresponding to the convolution kernel into the array in reverse order, in the manner shown in the figure.
When there is input data, the input buffer module obtains the input data from the data interface module and then streams the input feature data into the array row by row, i.e. in the order R11, R21, R31, ..., R1n, R2n, R3n, where n is the total number of columns of the input data.
After two unit clock cycles of data flow, the first column of data R11, R21, R31 reaches the far right of the array and the first convolution operation starts. With each operation the data moves one step to the right, which corresponds to the convolution kernel moving one stride to the right across the input image; the data then moves one step downward, until the last row of data has entered the array and the operation finishes, and this process continues until the whole input image has been traversed.
Namely, the array completes the calculation of the formula:

$$Z[c_o][x][y] = f\left(\sum_{c_i=0}^{C-1}\sum_{i=0}^{K-1}\sum_{j=0}^{K-1} W[c_o][c_i][i][j] \cdot D[c_i][x \cdot s + i][y \cdot s + j] + B[c_o]\right)$$

where Z represents the result of the neuron in three-dimensional space; C is the number of input channels; c_o is one of the output channels; x and y are the coordinate axes of the output feature map on the 2-D plane; K is the size of the convolution kernel; c_i, i and j are the coordinates on the three coordinate axes; s is the step length (stride) of the convolution operation; D is the input data; W represents the weight values in the convolution kernel; B is the bias of the neuron; and f is the activation function.
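As a functional reference for the formula above (a software sketch only, not the systolic dataflow of fig. 5; the ReLU activation and zero bias in the example call are assumptions), the convolution can be written as:

```python
import numpy as np

def conv_layer(D, W, B, s=1, f=lambda x: np.maximum(x, 0.0)):
    """Z[co][x][y] = f( sum over ci,i,j of W[co][ci][i][j] * D[ci][x*s+i][y*s+j] + B[co] )."""
    C_in, H, W_in = D.shape          # input channels, height, width
    C_out, _, K, _ = W.shape         # output channels, K x K kernels
    H_out = (H - K) // s + 1
    W_out = (W_in - K) // s + 1
    Z = np.zeros((C_out, H_out, W_out))
    for co in range(C_out):
        for x in range(H_out):
            for y in range(W_out):
                patch = D[:, x * s:x * s + K, y * s:y * s + K]
                Z[co, x, y] = f(np.sum(W[co] * patch) + B[co])
    return Z

Z = conv_layer(np.random.rand(3, 8, 8), np.random.rand(4, 3, 3, 3), np.zeros(4))
print(Z.shape)   # (4, 6, 6)
```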
For example, if the available resources are 3 PE arrays in total, i.e. 3 × 9 = 27 PEs, each PE array may be responsible for one output channel, so that 3 output channels are produced in parallel. If there is only one output channel, the 3 PE arrays may instead be used to process the input image simultaneously, e.g. processing rows 1 to 3, rows 4 to 6, rows 7 to 9, ..., rows n2 to n2+2 respectively, where n2 < (m-1) and m is the total number of rows of the input data.
Optionally, in addition to convolution operations, the present invention can also perform matrix operations. Fig. 6 shows the data flow during a matrix operation according to the invention.
Optionally, the weight matrix flows in from one side of the PE array and the input matrix flows in from the other side. As shown in fig. 6, a [3, n3] input matrix is multiplied by an [n3, 3] weight matrix, and the result is a [3, 3] matrix. The weight matrix flows downward from the top of the PE array, and the input matrix flows in from the left side of the PE array toward the right. In the weight matrix the data are input column by column, and in the input matrix the data are input row by row; adjacent rows or columns are separated by one clock cycle so that no part of the matrix-multiplication result is lost.
Specifically, consider the PE located in the second row and the first column: if the inputs were not separated by one clock cycle, the input data of the second row would arrive one cycle earlier than the weight data of the first column. With the one-cycle separation, the input data of the second row and the weight data of the first column arrive at this PE at the same time, ensuring the correctness of the data and the result.
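The timing described above can be modelled with a short Python sketch (an assumed output-stationary interpretation of the dataflow, not the patent's RTL): operand k of row i and of column j meet in PE(i, j) at cycle t = i + j + k, which is exactly the one-clock-cycle skew between adjacent rows and columns.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle sketch of an output-stationary PE array computing A @ B."""
    M, K = A.shape
    _, N = B.shape
    acc = np.zeros((M, N))
    for t in range(M + N + K - 2):          # total number of clock cycles
        for i in range(M):
            for j in range(N):
                k = t - i - j               # the skew: one cycle per row and per column
                if 0 <= k < K:
                    acc[i, j] += A[i, k] * B[k, j]
    return acc

A, B = np.random.rand(3, 5), np.random.rand(5, 3)
assert np.allclose(systolic_matmul(A, B), A @ B)
```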
Each single PE calculates the product of one row and one column in the matrix multiplication; for example, the PE in the first row and first column outputs the element S11 of the result matrix. In other words, each PE at a given coordinate computes the result at the corresponding position in the result matrix, so the PE array corresponds in size to the result matrix.
Optionally, multiple 3 × 3 PE arrays can be spliced to handle multiplications of matrices of different sizes. For example, to multiply a [3, n3] input matrix by an [n3, 6] weight matrix, giving a [3, 6] result, the computational requirement can be met by placing two arrays in parallel in the row dimension to form a 3 × 6 PE array; similarly, a [6, 3] result can be obtained by placing two arrays in parallel in the column dimension to form a 6 × 3 PE array. If an [m3, n3] matrix is to be multiplied by an [n3, m3] matrix, giving an [m3, m3] result, (m3/3) arrays are needed in parallel in each of the row and column dimensions.
Finally, the result of the matrix multiplication is calculated according to the formula

$$M[x][y] = f\left(\sum_{i} D[x][i] \cdot W[i][y] + Bias\right), \qquad 0 \le y < O,$$

where M represents the result of the neuron in two-dimensional space; x and y are the horizontal and vertical coordinates of the output result; O is the number of neurons in the output layer; D is the input data; W is the weight data; Bias is the bias of the neuron; and f is the activation function.
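A functional reference for this formula (software sketch only; the ReLU activation in the example is an assumption) is:

```python
import numpy as np

def fc_layer(D, W, Bias, f=lambda x: np.maximum(x, 0.0)):
    """M[x, y] = f( sum_i D[x, i] * W[i, y] + Bias[y] )."""
    return f(D @ W + Bias)

M = fc_layer(np.random.rand(3, 5), np.random.rand(5, 3), np.zeros(3))
print(M.shape)   # (3, 3)
```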
Throughout the convolution and matrix operations, the system control module monitors the neural network computing unit and the memory scheduling unit to control data access and data flow within the array, and it controls the input buffer module, the neural network computing unit, the memory scheduling module and the output buffer module to work cooperatively and synchronously.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
the dynamic multi-precision neural network acceleration unit provided by the invention controls the units to work cooperatively to accelerate the neural network operation through the system control module. The neural network computing unit adopts a general hardware acceleration design, and can be used for controlling the operation type and the operation scale of the neural network in a programmable manner. The streamline design is added in the design, so that the data throughput rate and the operation speed are greatly improved, the multiply-add unit of the neural operation unit is optimized at the dead point, the multiplexing of hardware resources is improved, and the hardware area is greatly reduced. The invention can meet the data type calculation with different precisions according to the requirements of the neural network model. And meanwhile, basic operations in the neural network are supported, such as convolution, pooling, nonlinear mapping, matrix multiplication and the like. Different neural network hierarchical operations can be supported by the combination of different computing components, and the requirements of different algorithms are met.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent replacements, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A dynamic multi-precision neural network accelerating unit comprises a data interface module, a configuration interface module, a memory interface, an input buffer module, a neural network computing unit, a memory scheduling unit, an output buffer module and a system control module, and is characterized in that:
the configuration interface module is used for configuring the whole system by transmitting configuration information to the system control module;
the data interface module is responsible for transmitting input data to the system;
the memory interface is connected with the input buffer module, the memory scheduling unit and the output buffer module; under the control and monitoring of the memory scheduling unit, transmitting the weight data stored in the memory to the input buffer module, or receiving the operation result data of the output buffer module, or temporarily storing the intermediate operation result of the neural network computing unit under the control of the memory scheduling unit;
the input buffer module is connected with the data interface module, the memory interface and the neural network computing unit, and, after data preprocessing, sends the weight data in the memory and the input data obtained through the data interface into the neural network computing unit;
the output buffer module is connected with the memory scheduling unit and the memory interface, and after the neural network computing unit obtains a final operation result, the operation result is arranged and prepared under the scheduling of the memory scheduling unit and is stored in a corresponding memory address;
the memory scheduling unit is responsible for scheduling data and interacting with a memory, and is used for fetching data from a corresponding address, storing a final calculation result into a corresponding memory address or temporarily storing an operation result of the neural network calculation unit into the memory; the system control module controls the input buffer module, the neural network computing unit, the memory scheduling unit and the output buffer module to cooperatively and synchronously work by monitoring the current data flow states of the neural network computing unit and the memory scheduling unit;
the neural network computing unit comprises N parallel processing element (PE) arrays, with N ≥ 2, and a corresponding number of PE arrays are programmed according to the neural network algorithm model and run in parallel to perform convolution and matrix multiplication operations; each PE array is an n1 × m1 two-dimensional matrix composed of S processing elements PE, with n1 ≥ 2 and m1 ≥ 2, and the connection paths between the processing elements PE are programmable so as to support neural network algorithm models of different scales.
2. The dynamic multi-precision neural network acceleration unit of claim 1, characterized in that: the data interface, the memory interface, the input buffer module and the output buffer module are attached to an on-chip bus.
3. The dynamic multi-precision neural network acceleration unit of claim 1, characterized in that: the neural network computing unit is also connected with an M-level addition chain behind each processing unit PE array, and is also connected with a K-level addition tree behind all the M-level addition chains.
4. The dynamic multi-precision neural network acceleration unit of claim 1, characterized in that: each processing element PE of the neural network computing unit can perform multiply-add operations on floating-point and fixed-point numbers of multiple data precisions, the data precisions being int-16, int-8, int-4, int-2, int-1 and floating-16.
5. The dynamic multi-precision neural network acceleration unit of claim 1, characterized in that: the processing elements PE of the neural network computing unit multiplex their hardware resources: high-bit-width multiplications multiplex the low-bit-width multipliers, and additions at all data bit widths multiplex the same adder.
CN202010747687.0A 2020-07-30 2020-07-30 Dynamic multi-precision neural network acceleration unit Active CN111626414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010747687.0A CN111626414B (en) 2020-07-30 2020-07-30 Dynamic multi-precision neural network acceleration unit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010747687.0A CN111626414B (en) 2020-07-30 2020-07-30 Dynamic multi-precision neural network acceleration unit

Publications (2)

Publication Number Publication Date
CN111626414A true CN111626414A (en) 2020-09-04
CN111626414B CN111626414B (en) 2020-10-27

Family

ID=72271557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010747687.0A Active CN111626414B (en) 2020-07-30 2020-07-30 Dynamic multi-precision neural network acceleration unit

Country Status (1)

Country Link
CN (1) CN111626414B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112153347A (en) * 2020-09-27 2020-12-29 北京天地玛珂电液控制系统有限公司 Coal mine underground intelligent visual perception terminal, perception method, storage medium and electronic equipment
CN112269757A (en) * 2020-09-30 2021-01-26 北京清微智能科技有限公司 Computational array in coarse-grained reconfigurable processor and reconfigurable processor
CN112712173A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Method and system for acquiring sparse operation data based on MAC (media Access control) multiply-add array
CN113298245A (en) * 2021-06-07 2021-08-24 中国科学院计算技术研究所 Multi-precision neural network computing device and method based on data flow architecture
CN113592066A (en) * 2021-07-08 2021-11-02 深圳市易成自动驾驶技术有限公司 Hardware acceleration method, apparatus, device, computer program product and storage medium
CN114239816A (en) * 2021-12-09 2022-03-25 电子科技大学 Reconfigurable hardware acceleration architecture of convolutional neural network-graph convolutional neural network
CN114239815A (en) * 2021-11-15 2022-03-25 电子科技大学 Reconfigurable neural network computing chip
CN114237551A (en) * 2021-11-26 2022-03-25 南方科技大学 Multi-precision accelerator based on pulse array and data processing method thereof
CN114997388A (en) * 2022-06-30 2022-09-02 北京知存科技有限公司 Linear programming-based neural network bias processing method for memory and computation integrated chip
WO2023124361A1 (en) * 2021-12-30 2023-07-06 上海商汤智能科技有限公司 Chip, acceleration card, electronic device and data processing method

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102665049A (en) * 2012-03-29 2012-09-12 中国科学院半导体研究所 Programmable visual chip-based visual image processing system
US20170316312A1 (en) * 2016-05-02 2017-11-02 Cavium, Inc. Systems and methods for deep learning processor
CN107423816A (en) * 2017-03-24 2017-12-01 中国科学院计算技术研究所 A kind of more computational accuracy Processing with Neural Network method and systems
CN107480782A (en) * 2017-08-14 2017-12-15 电子科技大学 Learn neural network processor on a kind of piece
CN107657316A (en) * 2016-08-12 2018-02-02 北京深鉴科技有限公司 The cooperative system of general processor and neural network processor designs
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN108564168A (en) * 2018-04-03 2018-09-21 中国科学院计算技术研究所 A kind of design method to supporting more precision convolutional neural networks processors
CN108665063A (en) * 2018-05-18 2018-10-16 南京大学 Two-way simultaneous for BNN hardware accelerators handles convolution acceleration system
CN109634558A (en) * 2018-12-12 2019-04-16 上海燧原科技有限公司 Programmable mixed-precision arithmetic element
CN110245748A (en) * 2018-03-09 2019-09-17 北京深鉴智能科技有限公司 Convolutional neural networks implementation method, device, hardware accelerator, storage medium
US20190303757A1 (en) * 2018-03-29 2019-10-03 Mediatek Inc. Weight skipping deep learning accelerator
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110705703A (en) * 2019-10-16 2020-01-17 北京航空航天大学 Sparse neural network processor based on systolic array
CN111199277A (en) * 2020-01-10 2020-05-26 中山大学 Convolutional neural network accelerator
CN111433851A (en) * 2017-09-29 2020-07-17 科洛斯巴股份有限公司 Arithmetic memory architecture

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102665049A (en) * 2012-03-29 2012-09-12 中国科学院半导体研究所 Programmable visual chip-based visual image processing system
US20170316312A1 (en) * 2016-05-02 2017-11-02 Cavium, Inc. Systems and methods for deep learning processor
CN107657316A (en) * 2016-08-12 2018-02-02 北京深鉴科技有限公司 The cooperative system of general processor and neural network processor designs
CN107423816A (en) * 2017-03-24 2017-12-01 中国科学院计算技术研究所 A kind of more computational accuracy Processing with Neural Network method and systems
CN107480782A (en) * 2017-08-14 2017-12-15 电子科技大学 Learn neural network processor on a kind of piece
CN111433851A (en) * 2017-09-29 2020-07-17 科洛斯巴股份有限公司 Arithmetic memory architecture
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN110245748A (en) * 2018-03-09 2019-09-17 北京深鉴智能科技有限公司 Convolutional neural networks implementation method, device, hardware accelerator, storage medium
US20190303757A1 (en) * 2018-03-29 2019-10-03 Mediatek Inc. Weight skipping deep learning accelerator
CN108564168A (en) * 2018-04-03 2018-09-21 中国科学院计算技术研究所 A kind of design method to supporting more precision convolutional neural networks processors
CN108665063A (en) * 2018-05-18 2018-10-16 南京大学 Two-way simultaneous for BNN hardware accelerators handles convolution acceleration system
CN109634558A (en) * 2018-12-12 2019-04-16 上海燧原科技有限公司 Programmable mixed-precision arithmetic element
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110705703A (en) * 2019-10-16 2020-01-17 北京航空航天大学 Sparse neural network processor based on systolic array
CN111199277A (en) * 2020-01-10 2020-05-26 中山大学 Convolutional neural network accelerator

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨维科: "Research on the Design Method of a Convolutional Neural Network Accelerator Based on an Open-Source RISC-V Processor", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112153347A (en) * 2020-09-27 2020-12-29 北京天地玛珂电液控制系统有限公司 Coal mine underground intelligent visual perception terminal, perception method, storage medium and electronic equipment
WO2022068206A1 (en) * 2020-09-30 2022-04-07 北京清微智能科技有限公司 Computation array and related processor
CN112269757A (en) * 2020-09-30 2021-01-26 北京清微智能科技有限公司 Computational array in coarse-grained reconfigurable processor and reconfigurable processor
CN112269757B (en) * 2020-09-30 2023-10-27 北京清微智能科技有限公司 Computing array in coarse-grained reconfigurable processor and reconfigurable processor
CN112712173A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Method and system for acquiring sparse operation data based on MAC (media Access control) multiply-add array
CN113298245A (en) * 2021-06-07 2021-08-24 中国科学院计算技术研究所 Multi-precision neural network computing device and method based on data flow architecture
CN113592066A (en) * 2021-07-08 2021-11-02 深圳市易成自动驾驶技术有限公司 Hardware acceleration method, apparatus, device, computer program product and storage medium
CN113592066B (en) * 2021-07-08 2024-01-05 深圳市易成自动驾驶技术有限公司 Hardware acceleration method, device, equipment and storage medium
CN114239815A (en) * 2021-11-15 2022-03-25 电子科技大学 Reconfigurable neural network computing chip
CN114239815B (en) * 2021-11-15 2023-05-12 电子科技大学 Reconfigurable neural network computing chip
CN114237551A (en) * 2021-11-26 2022-03-25 南方科技大学 Multi-precision accelerator based on pulse array and data processing method thereof
WO2023092669A1 (en) * 2021-11-26 2023-06-01 南方科技大学 Multi-precision accelerator based on systolic array and data processing method therefor
CN114239816B (en) * 2021-12-09 2023-04-07 电子科技大学 Reconfigurable hardware acceleration architecture of convolutional neural network-graph convolutional neural network
CN114239816A (en) * 2021-12-09 2022-03-25 电子科技大学 Reconfigurable hardware acceleration architecture of convolutional neural network-graph convolutional neural network
WO2023124361A1 (en) * 2021-12-30 2023-07-06 上海商汤智能科技有限公司 Chip, acceleration card, electronic device and data processing method
CN114997388A (en) * 2022-06-30 2022-09-02 北京知存科技有限公司 Linear programming-based neural network bias processing method for memory and computation integrated chip

Also Published As

Publication number Publication date
CN111626414B (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN111626414B (en) Dynamic multi-precision neural network acceleration unit
US10831862B2 (en) Performing matrix multiplication in hardware
CN109886400B (en) Convolution neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof
CN110263925B (en) Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
Guo et al. Software-hardware codesign for efficient neural network acceleration
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN110738308B (en) Neural network accelerator
CN104915322B (en) A kind of hardware-accelerated method of convolutional neural networks
US11847553B2 (en) Parallel computational architecture with reconfigurable core-level and vector-level parallelism
CN107609641A (en) Sparse neural network framework and its implementation
US20140344203A1 (en) Neural network computing apparatus and system, and method therefor
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
US20200356809A1 (en) Flexible pipelined backpropagation
Ma et al. FPGA-based AI smart NICs for scalable distributed AI training systems
AU2020395435B2 (en) Flexible precision neural inference processing units
CN113516236A (en) VGG16 network parallel acceleration processing method based on ZYNQ platform
CN113988280B (en) Array computing accelerator architecture based on binarization neural network
US20230195836A1 (en) One-dimensional computational unit for an integrated circuit
US11847072B2 (en) Ai accelerator apparatus using in-memory compute chiplet devices for transformer workloads
US20240094986A1 (en) Method and apparatus for matrix computation using data conversion in a compute accelerator
US20240037379A1 (en) Server system with ai accelerator apparatuses using in-memory compute chiplet devices for transformer workloads
US20230168892A1 (en) Risc-v-based 3d interconnected multi-core processor architecture and working method thereof
Tiwari et al. Design of a Low Power and Area Efficient Bfloat16 based Generalized Systolic Array for DNN Applications
Zhao et al. Deep learning accelerators
CN115587613A (en) Hardware architecture and design space exploration method for accelerating multichannel convolution

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant