CN111626414A - Dynamic multi-precision neural network acceleration unit - Google Patents

Dynamic multi-precision neural network acceleration unit

Info

Publication number
CN111626414A
Authority
CN
China
Prior art keywords
neural network
data
unit
memory
computing unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010747687.0A
Other languages
Chinese (zh)
Other versions
CN111626414B (en)
Inventor
伍元聪
刘洋
田野
刘祎鹤
刘爽
于奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010747687.0A priority Critical patent/CN111626414B/en
Publication of CN111626414A publication Critical patent/CN111626414A/en
Application granted granted Critical
Publication of CN111626414B publication Critical patent/CN111626414B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 - Interprocessor communication
    • G06F 15/173 - Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F 15/17356 - Indirect interconnection networks
    • G06F 15/17368 - Indirect interconnection networks non hierarchical topologies
    • G06F 15/17381 - Two dimensional, e.g. mesh, torus
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/177 - Initialisation or configuration control

Abstract

The invention relates to the fields of integrated circuits and artificial intelligence, in particular to system-level integrated circuit chips, and specifically to a dynamic multi-precision neural network acceleration unit. In the invention, the PE units are first arranged into arrays, and these PE arrays are in turn used as array elements, forming a dual-array design that supports the operations of different neural network layers and meets the requirements of different algorithms and different data precisions. The neural network computing unit is designed for hardware acceleration, and a pipeline design is added to the overall design, greatly improving the data throughput and operation speed; the multiply-add units of the neural computing unit are optimized, increasing the reuse of hardware resources and greatly reducing the hardware area. Data types of different precisions can be computed according to the requirements of the neural network model, and the basic operations in a neural network, such as convolution, pooling, nonlinear mapping and matrix multiplication, are supported.

Description

Dynamic multi-precision neural network acceleration unit
Technical Field
The invention relates to the fields of integrated circuits and artificial intelligence, in particular to system-level integrated circuit chips, and specifically to a dynamic multi-precision neural network acceleration unit.
Background
With the explosive growth of data and the rapid increase in computing performance, machine learning has become increasingly popular and has gradually penetrated every corner of our lives. Common applications include voice assistants such as Apple's Siri and Microsoft's Cortana; face recognition such as Apple's iPhoto and Google's Picasa; and artificial intelligence such as AlphaGo from DeepMind, a Google subsidiary, which has defeated top human players. Currently, the most prominent area of machine learning is deep learning, represented by neural networks.
A deep neural network simulates the neural connection structure of the human brain by building a model, and has achieved major breakthroughs in fields such as image processing and speech recognition. However, as neural network models become more and more complex, the weight data grow larger and the required precision becomes higher, and, starting with convolutional neural networks, ever larger image inputs place higher demands on the hardware. Commonly used general-purpose processor chips are not entirely suitable for neural network acceleration because of their general-purpose features, such as generality, branch prediction and address merging, which occupy chip area and resources. Existing general-purpose processors have gradually become unable to meet the requirements of neural network acceleration, so dedicated neural network acceleration hardware has been designed to accelerate neural networks.
Because field-programmable gate arrays (FPGAs) are highly programmable and have short development cycles when used as accelerators, research on using them for neural network acceleration is increasing. However, current deep neural network computation still relies heavily on dense floating-point matrix multiplication rather than on specialized data types (such as sparsely compressed data), which makes it better suited to mapping onto GPUs with their conventional parallelism; GPUs are therefore still widely used in the market to accelerate deep neural networks. Although FPGAs offer excellent energy efficiency, they cannot yet match GPU performance and are not yet suitable for widespread adoption.
Disclosure of Invention
In order to solve problems of the existing neural network acceleration technology such as power consumption, cost and area, the invention provides a dynamic multi-precision neural network acceleration unit, which is an application-specific integrated circuit (ASIC). Compared with a common ASIC, the invention is more flexible because of its programmability and multi-precision support. Such an ASIC makes low-power, low-cost, high-performance online-learning artificial intelligence chips possible.
The specific technical scheme is as follows:
a dynamic multi-precision neural network acceleration unit, comprising: the device comprises a data interface module, a configuration interface module, a memory interface, an input buffer module, a neural network computing unit, a memory scheduling unit, an output buffer module and a system control module.
The configuration interface module configures the entire system by conveying configuration information to the system control module.
The data interface module is responsible for transmitting input data to the system.
The memory interface is connected with the input buffer module, the memory scheduling unit and the output buffer module. Under the control and monitoring of the memory scheduling unit, it transmits the weight data stored in the memory to the input buffer module, receives the operation result data from the output buffer module, or temporarily stores the intermediate operation results of the neural network computing unit as directed by the memory scheduling unit.
The input buffer module is connected with the data interface module, the memory interface and the neural network computing unit; it preprocesses the weight data from the memory and the input data obtained through the data interface, and sends them to the neural network computing unit.
The output buffer module is connected with the memory scheduling unit and the memory interface. After the neural network computing unit obtains the final operation result, the output buffer module organizes the result under the scheduling of the memory scheduling unit and stores it at the corresponding memory address.
The memory scheduling unit is responsible for scheduling data and interacting with a memory, and is used for fetching data from a corresponding address, storing a final calculation result into a corresponding memory address or temporarily storing an operation result of the neural network calculation unit into the memory; the system control module controls the input buffer module, the neural network computing unit, the memory scheduling unit and the output buffer module to cooperatively and synchronously work by monitoring the current data flow states of the neural network computing unit and the memory scheduling unit.
According to the requirements of the neural network model, the acceleration unit supports data-type operations at different precisions. It also supports the basic operations in a neural network, such as convolution, pooling, nonlinear mapping and matrix multiplication.
The neural network computing unit comprises N parallel processing element (PE) arrays, with N ≥ 2. A corresponding number of PE arrays are programmed according to the neural network algorithm model and run in parallel, improving computing efficiency and speed, and perform the convolution and matrix multiplication operations. Each PE array is an n1 × m1 two-dimensional matrix composed of S processing elements (PEs), with n1 ≥ 2 and m1 ≥ 2, and the connection paths between the PEs are programmable so as to support neural network algorithm models of different scales.
When the neural network computing unit performs a convolution operation, the number of PEs is configured according to the size of the convolution kernel to meet the computing requirement. At the same time, the number of PE arrays is configured for parallelism in the output-channel dimension and the input-data dimension: different PE arrays are controlled to produce different output channels, and multiple PE arrays are controlled to process different parts of the same input image simultaneously, improving parallelism and accelerating the operation.
When the neural network computing unit performs matrix multiplication, the PEs are configured according to the size of the input matrix, and PE arrays are spliced in the row or column dimension to meet the computing requirement.
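As a rough illustration of this configurability, the following Python sketch (not from the patent; the helper names and the per-dimension tiling rule are assumptions based on the splicing scheme described above) estimates how many PEs a convolution kernel needs and over how many 3 × 3 PE arrays a matrix-multiplication result would be tiled.

```python
import math

# Hypothetical helpers (assumptions, not the patent's configuration logic):
# they only mirror the sizing rules described in the text above.

def pes_for_convolution(kernel_size):
    """For convolution, a PE array is sized to the kernel (e.g. a 3x3 kernel -> 9 PEs)."""
    return kernel_size * kernel_size

def pe_arrays_for_matmul(result_rows, result_cols, array_rows=3, array_cols=3):
    """PE arrays spliced in the row/column dimensions to cover a result matrix."""
    return math.ceil(result_rows / array_rows) * math.ceil(result_cols / array_cols)

print(pes_for_convolution(3))        # 9 PEs for a 3x3 kernel
print(pe_arrays_for_matmul(3, 6))    # 2 arrays spliced in parallel for a [3, 6] result
```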
Furthermore, the data interface, the memory interface, the input buffer module and the output buffer module can be attached to an on-chip bus, which improves transmission efficiency and optimizes the structure.
Furthermore, in the neural network computing unit, an M-level addition chain is connected after each PE array, and a K-level addition tree is connected after all the M-level addition chains to reduce the critical-path delay, thereby improving the system speed and computing efficiency.
Furthermore, each PE of the neural network computing unit can perform multiply-add operations on floating-point and fixed-point numbers of multiple data precisions, namely int-16, int-8, int-4, int-2, int-1 and floating-16.
Furthermore, the PEs of the neural network computing unit multiplex their hardware resources: high-bit-width multiplications reuse the low-bit-width multipliers, and additions at all data bit widths reuse the same adder.
Further, the neural network algorithm model is ResNet, VggNet, GoogleNet, YoLo or RCNN.
In the dynamic multi-precision neural network acceleration unit provided by the invention, the PE units are first arranged into arrays, and these PE arrays are in turn used as array elements, forming a dual-array design that supports the operations of different neural network layers and meets the requirements of different algorithms and different data precisions. The neural network computing unit is designed for hardware acceleration, and a pipeline design is added to the overall design, greatly improving the data throughput and operation speed; the multiply-add units of the neural computing unit are optimized, increasing the reuse of hardware resources and greatly reducing the hardware area. Data types of different precisions can be computed according to the requirements of the neural network model, and the basic operations in a neural network, such as convolution, pooling, nonlinear mapping and matrix multiplication, are supported.
Drawings
FIG. 1 is an architectural block diagram of the present invention.
FIG. 2 is a block diagram of the architecture of the neural network computational unit of the present invention.
FIG. 3 is a block diagram illustrating the structure of a single PE according to the present invention.
Fig. 4 is a structural block diagram of building a high-bit-width multiplier by using a low-bit-width multiplier according to the embodiment of the present invention.
FIG. 5 is a schematic diagram illustrating data flow during a convolution operation according to an embodiment of the present invention.
FIG. 6 is a schematic diagram illustrating data flow during mapping a matrix operation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example one
The dynamic multi-precision neural network acceleration unit provided by this embodiment is applied to the design of an AI acceleration chip in the field of intelligent driver assistance. Its architecture is shown in fig. 1 and includes: a data interface module, a configuration interface module, a memory interface, an input buffer module, a neural network computing unit, a memory scheduling unit, an output buffer module and a system control module.
The system control module controls the input buffer module, the memory interface, the neural network computing unit, the memory scheduling unit and the output buffer module, and controls and coordinates the whole system by monitoring the process of the whole neural network.
The configuration interface module may configure the entire system as desired by conveying configuration information to the system control module.
The system control module can be attached to a peripheral bus, such as the APB bus of AMBA, and interacts with the configuration interface and the external system through this bus.
The data interface module is responsible for transmitting input data to the system, supports interfaces such as USB, MIPI, RGB888, BT1120 and BT656, and transmits image data and other types of data.
The memory interface is connected with the input buffer module, the memory scheduling unit and the output buffer module; the memory interface can transmit the weight data stored in the memory to the input buffer module or receive the operation result data of the output buffer module under the control and monitoring of the memory scheduling unit, or control and temporarily store the intermediate operation result of the neural network computing unit through the memory scheduling unit when the data scale is large.
The input buffer module is connected with the data interface module, the memory interface and the neural network computing unit; it preprocesses the weight data from the memory and the input data obtained through the data interface, and sends them to the neural network computing unit.
The output buffer module is connected with the memory scheduling unit and the memory interface. After the neural network computing unit obtains the final operation result, the output buffer module organizes the result under the scheduling of the memory scheduling unit and stores it at the corresponding memory address.
The data interface, the memory interface, the input buffer module and the output buffer module can be attached to an on-chip bus, for example one using the AXI bus protocol of AMBA, which improves transmission efficiency and optimizes the structure.
The memory scheduling unit is responsible for scheduling data and interacting with the memory: for example, data is fetched from a corresponding address, a final calculation result is stored in a corresponding memory address, or when data processed by a neural network calculation unit is too large, a part of calculation results are temporarily stored in a memory and then are fetched.
The structure of the neural network computing unit is shown in fig. 2, and includes: an array of N parallel Processing Elements (PEs), an addition chain connected to the array, and an addition tree.
The single PE arrays can be arranged into a desired size, for example, 3 × 3, 5 × 5 (square array in general), according to the practical application and the size requirement of the convolution kernel.
N PE arrays can be arranged in parallel to improve the efficiency and speed of calculation processing.
After each single PE array, an M-level addition chain can be connected, where M is the order of the PE array. The chain adds the computation results of the PE array in stages rather than adding all results at once, which improves the efficiency and speed of the computation at the hardware level.
After all the addition chains, a K-level addition tree can be connected, which processes the N results of the preceding M-level addition chains, where

$$K = \lceil \log_2 N \rceil$$

and N is the number of parallel PE arrays. This improves the efficiency and speed of the computation and reduces the critical-path delay, thereby improving system speed and computational efficiency.
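For illustration, a minimal Python sketch of such a pairwise reduction (a software analogue, not the patent's RTL) is given below; the number of tree levels it uses matches the reconstructed relation K = ⌈log₂ N⌉.

```python
import math

def adder_tree_reduce(partial_sums):
    """Reduce the N partial sums from the addition chains with a balanced adder tree."""
    values = list(partial_sums)
    levels = 0
    while len(values) > 1:
        # one tree level: neighbouring pairs are added in parallel
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels

total, K = adder_tree_reduce([3, 1, 4, 1, 5])          # N = 5 partial sums
assert total == 14 and K == math.ceil(math.log2(5))    # K = 3 tree levels
```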
The PE of the invention mainly completes multiplication and accumulation operations, supports the calculation of multi-precision data types, and supports the calculation of fixed point numbers as well as floating point numbers.
Specifically, the PEs in the neural network computing unit may dynamically support data type operations with different precisions in the neural network model according to configurations, such as: data types such as int-16, int-8, int-4, int-2, int-1, floating-16, etc.
Fig. 3 shows the structure of a PE according to this embodiment. When the input data and the weight data are floating-point numbers, the multiplication operation requires the multiplier and the adder to cooperate in order to complete the exponent addition and mantissa multiplication of the floating-point numbers; the addition operation requires the comparator and shift register shown in the figure to complete the alignment and sign-handling steps of floating-point addition. When the input data and the weight data are integers, the multiplication operation only requires the multiplier, and the addition operation can also skip the floating-point modules and feed the data directly into the adder.
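The following Python sketch illustrates only the decomposition mentioned above (exponent addition plus mantissa multiplication); it is an assumption-level software analogue, not the PE's actual floating-16 datapath.

```python
import math

def float_mul_decomposed(x, y):
    """Multiply two floats by adding exponents and multiplying mantissas."""
    mx, ex = math.frexp(x)      # x = mx * 2**ex, with 0.5 <= |mx| < 1
    my, ey = math.frexp(y)
    mantissa_product = mx * my  # mantissa multiplication (the PE's multiplier)
    exponent_sum = ex + ey      # exponent addition (the PE's adder)
    return math.ldexp(mantissa_product, exponent_sum)

assert float_mul_decomposed(3.5, -1.25) == 3.5 * -1.25
```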
The PE structure also includes a data transmission interface: when data flows through the PE array without being computed in a given PE, that PE simply outputs the data to the next PE after one clock cycle.
Alternatively, the multipliers employed in the PE structure of fig. 3 are built using the lowest precision data type. Taking int-16, int-8, int-4 and floating-16 data types as examples, all multipliers are built by adopting 4-bit multipliers, and the multiplication with higher data width is formed by splicing a plurality of 4-bit multipliers.
Specifically, for example, if a and b are 16-bit data types, each is split into two 8-bit numbers a1, a2 and b1, b2, so that $a = a_1 \cdot 2^{8} + a_2$ and $b = b_1 \cdot 2^{8} + b_2$. The 16-bit multiplication of a and b can then be split into 8-bit multiplications:

$$a \cdot b = a_1 b_1 \cdot 2^{16} + (a_1 b_2 + a_2 b_1) \cdot 2^{8} + a_2 b_2$$

Similarly, each 8-bit number can be split into two 4-bit numbers, so that the 8-bit multiplications are in turn split into 4-bit multiplications.
In the same way, 16-bit numbers c and d can be split into the 4-bit parts c1, c2, c3, c4 and d1, d2, d3, d4, with $c = \sum_{i=1}^{4} c_i \cdot 2^{4(4-i)}$ and $d = \sum_{j=1}^{4} d_j \cdot 2^{4(4-j)}$. The 16-bit multiplication of c and d can then be split into 4-bit multiplications:

$$c \cdot d = \sum_{i=1}^{4} \sum_{j=1}^{4} c_i d_j \cdot 2^{32 - 4(i+j)}$$
specifically, as shown in fig. 4, the 8-bit multiplier is built up from 4 bits, and the 16 bits are built up from 4 8-bit multipliers.
Optionally, the adder in the PE hardware structure of the present invention multiplexes the same adder, so as to achieve multiplexing of hardware resources as much as possible, and reduce the area required for design.
Specifically, if N fixed-point numbers of T bits are added, the required adder bit width is

$$T + \lceil \log_2 N \rceil$$

so the same adder can be multiplexed.
For convenience of explanation, each single PE array mentioned below in this embodiment is 3 × 3 in size. Taking convolution as an example, the data flow of a convolution operation according to the invention is described below in conjunction with fig. 5.
As shown in fig. 5, taking a convolution kernel size of 3 × 3 as an example, the input buffer module loads the weight values corresponding to the convolution kernel into the array in reverse order, in the manner shown in the figure.
When there is input data, the input buffer module obtains the input data from the data interface module and then streams the input feature data into the array row by row, i.e. in the order R11, R21, R31, ..., R1n, R2n, R3n, where n is the total number of columns of the input data.
After two unit clock cycles of data flow, the first column of data R11, R21, R31 reaches the far right of the array and the first convolution operation starts. With each operation the data moves one step to the right, which corresponds to the convolution kernel moving one stride to the right across the input image; the data then moves one step downward, until the last row of data has entered the array and the operation finishes, and this process continues until the whole input image has been traversed.
Namely, the array completes the calculation of the formula:

$$Z[c_o][x][y] = f\left(\sum_{c_i=0}^{C-1}\sum_{i=0}^{K-1}\sum_{j=0}^{K-1} W[c_o][c_i][i][j] \cdot D[c_i][x \cdot s + i][y \cdot s + j] + B[c_o]\right)$$

where Z represents the result of the neuron in three-dimensional space; C is the number of input channels; c_o is one of the output channels; x and y are the coordinate axes of the output feature map on the 2-D plane; K is the size of the convolution kernel; c_i, i and j are the coordinates on the three coordinate axes; s is the step length (stride) of the convolution operation; D is the input data; W represents the weight values in the convolution kernel; B is the bias of the neuron; and f is the activation function.
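As a functional reference for the formula above (a software sketch only, not the systolic dataflow of fig. 5; the ReLU activation and zero bias in the example call are assumptions), the convolution can be written as:

```python
import numpy as np

def conv_layer(D, W, B, s=1, f=lambda x: np.maximum(x, 0.0)):
    """Z[co][x][y] = f( sum over ci,i,j of W[co][ci][i][j] * D[ci][x*s+i][y*s+j] + B[co] )."""
    C_in, H, W_in = D.shape          # input channels, height, width
    C_out, _, K, _ = W.shape         # output channels, K x K kernels
    H_out = (H - K) // s + 1
    W_out = (W_in - K) // s + 1
    Z = np.zeros((C_out, H_out, W_out))
    for co in range(C_out):
        for x in range(H_out):
            for y in range(W_out):
                patch = D[:, x * s:x * s + K, y * s:y * s + K]
                Z[co, x, y] = f(np.sum(W[co] * patch) + B[co])
    return Z

Z = conv_layer(np.random.rand(3, 8, 8), np.random.rand(4, 3, 3, 3), np.zeros(4))
print(Z.shape)   # (4, 6, 6)
```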
For example, if the available resources are 3 PE arrays in total, i.e. 3 × 9 = 27 PEs, each PE array may be responsible for one output channel, so that 3 output channels are produced in parallel. If there is only one output channel, the 3 PE arrays may instead be used to process the input image simultaneously, e.g. processing rows 1 to 3, rows 4 to 6, rows 7 to 9, ..., rows n2 to n2+2 respectively, where n2 < (m-1) and m is the total number of rows of the input data.
Optionally, in addition to convolution operations, the present invention can also perform matrix operations. Fig. 6 shows the data flow during a matrix operation according to the invention.
Optionally, the weight matrix flows in from one side of the PE array and the input matrix flows in from the other side. As shown in fig. 6, a [3, n3] input matrix is multiplied by an [n3, 3] weight matrix, and the result is a [3, 3] matrix. The weight matrix flows downward from the top of the PE array, and the input matrix flows in from the left side of the PE array toward the right. In the weight matrix the data are input column by column, and in the input matrix the data are input row by row; adjacent rows or columns are separated by one clock cycle so that no part of the matrix-multiplication result is lost.
Specifically, consider the PE located in the second row and the first column: if the inputs were not separated by one clock cycle, the input data of the second row would arrive one cycle earlier than the weight data of the first column. With the one-cycle separation, the input data of the second row and the weight data of the first column arrive at this PE at the same time, ensuring the correctness of the data and the result.
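The timing described above can be modelled with a short Python sketch (an assumed output-stationary interpretation of the dataflow, not the patent's RTL): operand k of row i and of column j meet in PE(i, j) at cycle t = i + j + k, which is exactly the one-clock-cycle skew between adjacent rows and columns.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle sketch of an output-stationary PE array computing A @ B."""
    M, K = A.shape
    _, N = B.shape
    acc = np.zeros((M, N))
    for t in range(M + N + K - 2):          # total number of clock cycles
        for i in range(M):
            for j in range(N):
                k = t - i - j               # the skew: one cycle per row and per column
                if 0 <= k < K:
                    acc[i, j] += A[i, k] * B[k, j]
    return acc

A, B = np.random.rand(3, 5), np.random.rand(5, 3)
assert np.allclose(systolic_matmul(A, B), A @ B)
```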
Each single PE calculates the product of one row and one column in the matrix multiplication; for example, the PE in the first row and first column outputs the element S11 of the result matrix. In other words, each PE at a given coordinate computes the result at the corresponding position in the result matrix, so the PE array corresponds in size to the result matrix.
Optionally, multiple 3 × 3 PE arrays can be spliced to handle multiplications of matrices of different sizes. For example, to multiply a [3, n3] input matrix by an [n3, 6] weight matrix, giving a [3, 6] result, the computational requirement can be met by placing two arrays in parallel in the row dimension to form a 3 × 6 PE array; similarly, a [6, 3] result can be obtained by placing two arrays in parallel in the column dimension to form a 6 × 3 PE array. If an [m3, n3] matrix is to be multiplied by an [n3, m3] matrix, giving an [m3, m3] result, (m3/3) arrays are needed in parallel in each of the row and column dimensions.
Finally, the result of the matrix multiplication is calculated according to the formula

$$M[x][y] = f\left(\sum_{i} D[x][i] \cdot W[i][y] + Bias\right), \qquad 0 \le y < O,$$

where M represents the result of the neuron in two-dimensional space; x and y are the horizontal and vertical coordinates of the output result; O is the number of neurons in the output layer; D is the input data; W is the weight data; Bias is the bias of the neuron; and f is the activation function.
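A functional reference for this formula (software sketch only; the ReLU activation in the example is an assumption) is:

```python
import numpy as np

def fc_layer(D, W, Bias, f=lambda x: np.maximum(x, 0.0)):
    """M[x, y] = f( sum_i D[x, i] * W[i, y] + Bias[y] )."""
    return f(D @ W + Bias)

M = fc_layer(np.random.rand(3, 5), np.random.rand(5, 3), np.zeros(3))
print(M.shape)   # (3, 3)
```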
Throughout the convolution and matrix operations, the system control module monitors the neural network computing unit and the memory scheduling unit to control data access and data flow within the array, and it controls the input buffer module, the neural network computing unit, the memory scheduling module and the output buffer module to work cooperatively and synchronously.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
the dynamic multi-precision neural network acceleration unit provided by the invention controls the units to work cooperatively to accelerate the neural network operation through the system control module. The neural network computing unit adopts a general hardware acceleration design, and can be used for controlling the operation type and the operation scale of the neural network in a programmable manner. The streamline design is added in the design, so that the data throughput rate and the operation speed are greatly improved, the multiply-add unit of the neural operation unit is optimized at the dead point, the multiplexing of hardware resources is improved, and the hardware area is greatly reduced. The invention can meet the data type calculation with different precisions according to the requirements of the neural network model. And meanwhile, basic operations in the neural network are supported, such as convolution, pooling, nonlinear mapping, matrix multiplication and the like. Different neural network hierarchical operations can be supported by the combination of different computing components, and the requirements of different algorithms are met.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent replacements, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A dynamic multi-precision neural network accelerating unit comprises a data interface module, a configuration interface module, a memory interface, an input buffer module, a neural network computing unit, a memory scheduling unit, an output buffer module and a system control module, and is characterized in that:
the configuration interface module is used for configuring the whole system by transmitting configuration information to the system control module;
the data interface module is responsible for transmitting input data to the system;
the memory interface is connected with the input buffer module, the memory scheduling unit and the output buffer module; under the control and monitoring of the memory scheduling unit, transmitting the weight data stored in the memory to the input buffer module, or receiving the operation result data of the output buffer module, or temporarily storing the intermediate operation result of the neural network computing unit under the control of the memory scheduling unit;
the input buffer module is connected with the data interface module, the memory interface and the neural network computing unit, and, after data preprocessing, sends the weight data in the memory and the input data obtained through the data interface into the neural network computing unit;
the output buffer module is connected with the memory scheduling unit and the memory interface, and after the neural network computing unit obtains a final operation result, the operation result is arranged and prepared under the scheduling of the memory scheduling unit and is stored in a corresponding memory address;
the memory scheduling unit is responsible for scheduling data and interacting with a memory, and is used for fetching data from a corresponding address, storing a final calculation result into a corresponding memory address or temporarily storing an operation result of the neural network calculation unit into the memory; the system control module controls the input buffer module, the neural network computing unit, the memory scheduling unit and the output buffer module to cooperatively and synchronously work by monitoring the current data flow states of the neural network computing unit and the memory scheduling unit;
the neural network computing unit comprises N parallel processing element (PE) arrays, with N ≥ 2, and a corresponding number of PE arrays are programmed according to the neural network algorithm model and run in parallel to perform convolution and matrix multiplication operations; each PE array is an n1 × m1 two-dimensional matrix composed of S processing elements PE, with n1 ≥ 2 and m1 ≥ 2, and the connection paths between the processing elements PE are programmable so as to support neural network algorithm models of different scales.
2. The dynamic multi-precision neural network acceleration unit of claim 1, characterized in that: the data interface, the memory interface, the input buffer module and the output buffer module are attached to an on-chip bus.
3. The dynamic multi-precision neural network acceleration unit of claim 1, characterized in that: the neural network computing unit is also connected with an M-level addition chain behind each processing unit PE array, and is also connected with a K-level addition tree behind all the M-level addition chains.
4. The dynamic multi-precision neural network acceleration unit of claim 1, characterized in that: each processing element PE of the neural network computing unit can perform multiply-add operations on floating-point and fixed-point numbers of multiple data precisions, the data precisions being int-16, int-8, int-4, int-2, int-1 and floating-16.
5. The dynamic multi-precision neural network acceleration unit of claim 1, characterized in that: the processing elements PE of the neural network computing unit multiplex their hardware resources: high-bit-width multiplications multiplex the low-bit-width multipliers, and additions at all data bit widths multiplex the same adder.
CN202010747687.0A 2020-07-30 2020-07-30 Dynamic multi-precision neural network acceleration unit Active CN111626414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010747687.0A CN111626414B (en) 2020-07-30 2020-07-30 Dynamic multi-precision neural network acceleration unit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010747687.0A CN111626414B (en) 2020-07-30 2020-07-30 Dynamic multi-precision neural network acceleration unit

Publications (2)

Publication Number Publication Date
CN111626414A true CN111626414A (en) 2020-09-04
CN111626414B CN111626414B (en) 2020-10-27

Family

ID=72271557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010747687.0A Active CN111626414B (en) 2020-07-30 2020-07-30 Dynamic multi-precision neural network acceleration unit

Country Status (1)

Country Link
CN (1) CN111626414B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112153347A (en) * 2020-09-27 2020-12-29 北京天地玛珂电液控制系统有限公司 Coal mine underground intelligent visual perception terminal, perception method, storage medium and electronic equipment
CN112269757A (en) * 2020-09-30 2021-01-26 北京清微智能科技有限公司 Computational array in coarse-grained reconfigurable processor and reconfigurable processor
CN112712173A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Method and system for acquiring sparse operation data based on MAC (media Access control) multiply-add array
CN113298245A (en) * 2021-06-07 2021-08-24 中国科学院计算技术研究所 Multi-precision neural network computing device and method based on data flow architecture
CN113592066A (en) * 2021-07-08 2021-11-02 深圳市易成自动驾驶技术有限公司 Hardware acceleration method, apparatus, device, computer program product and storage medium
CN114239816A (en) * 2021-12-09 2022-03-25 电子科技大学 Reconfigurable hardware acceleration architecture of convolutional neural network-graph convolutional neural network
CN114239815A (en) * 2021-11-15 2022-03-25 电子科技大学 Reconfigurable neural network computing chip
CN114237551A (en) * 2021-11-26 2022-03-25 南方科技大学 Multi-precision accelerator based on pulse array and data processing method thereof
CN114997388A (en) * 2022-06-30 2022-09-02 北京知存科技有限公司 Linear programming-based neural network bias processing method for memory and computation integrated chip
WO2023124361A1 (en) * 2021-12-30 2023-07-06 上海商汤智能科技有限公司 Chip, acceleration card, electronic device and data processing method

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102665049A (en) * 2012-03-29 2012-09-12 中国科学院半导体研究所 Programmable visual chip-based visual image processing system
US20170316312A1 (en) * 2016-05-02 2017-11-02 Cavium, Inc. Systems and methods for deep learning processor
CN107423816A (en) * 2017-03-24 2017-12-01 中国科学院计算技术研究所 A kind of more computational accuracy Processing with Neural Network method and systems
CN107480782A (en) * 2017-08-14 2017-12-15 电子科技大学 Learn neural network processor on a kind of piece
CN107657316A (en) * 2016-08-12 2018-02-02 北京深鉴科技有限公司 The cooperative system of general processor and neural network processor designs
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN108564168A (en) * 2018-04-03 2018-09-21 中国科学院计算技术研究所 A kind of design method to supporting more precision convolutional neural networks processors
CN108665063A (en) * 2018-05-18 2018-10-16 南京大学 Two-way simultaneous for BNN hardware accelerators handles convolution acceleration system
CN109634558A (en) * 2018-12-12 2019-04-16 上海燧原科技有限公司 Programmable mixed-precision arithmetic element
CN110245748A (en) * 2018-03-09 2019-09-17 北京深鉴智能科技有限公司 Convolutional neural networks implementation method, device, hardware accelerator, storage medium
US20190303757A1 (en) * 2018-03-29 2019-10-03 Mediatek Inc. Weight skipping deep learning accelerator
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110705703A (en) * 2019-10-16 2020-01-17 北京航空航天大学 Sparse neural network processor based on systolic array
CN111199277A (en) * 2020-01-10 2020-05-26 中山大学 Convolutional neural network accelerator
CN111433851A (en) * 2017-09-29 2020-07-17 科洛斯巴股份有限公司 Arithmetic memory architecture

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102665049A (en) * 2012-03-29 2012-09-12 中国科学院半导体研究所 Programmable visual chip-based visual image processing system
US20170316312A1 (en) * 2016-05-02 2017-11-02 Cavium, Inc. Systems and methods for deep learning processor
CN107657316A (en) * 2016-08-12 2018-02-02 北京深鉴科技有限公司 The cooperative system of general processor and neural network processor designs
CN107423816A (en) * 2017-03-24 2017-12-01 中国科学院计算技术研究所 A kind of more computational accuracy Processing with Neural Network method and systems
CN107480782A (en) * 2017-08-14 2017-12-15 电子科技大学 Learn neural network processor on a kind of piece
CN111433851A (en) * 2017-09-29 2020-07-17 科洛斯巴股份有限公司 Arithmetic memory architecture
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN110245748A (en) * 2018-03-09 2019-09-17 北京深鉴智能科技有限公司 Convolutional neural networks implementation method, device, hardware accelerator, storage medium
US20190303757A1 (en) * 2018-03-29 2019-10-03 Mediatek Inc. Weight skipping deep learning accelerator
CN108564168A (en) * 2018-04-03 2018-09-21 中国科学院计算技术研究所 A kind of design method to supporting more precision convolutional neural networks processors
CN108665063A (en) * 2018-05-18 2018-10-16 南京大学 Two-way simultaneous for BNN hardware accelerators handles convolution acceleration system
CN109634558A (en) * 2018-12-12 2019-04-16 上海燧原科技有限公司 Programmable mixed-precision arithmetic element
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110705703A (en) * 2019-10-16 2020-01-17 北京航空航天大学 Sparse neural network processor based on systolic array
CN111199277A (en) * 2020-01-10 2020-05-26 中山大学 Convolutional neural network accelerator

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨维科: "Research on the Design Method of a Convolutional Neural Network Accelerator Based on an Open-Source RISC-V Processor", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112153347A (en) * 2020-09-27 2020-12-29 北京天地玛珂电液控制系统有限公司 Coal mine underground intelligent visual perception terminal, perception method, storage medium and electronic equipment
WO2022068206A1 (en) * 2020-09-30 2022-04-07 北京清微智能科技有限公司 Computation array and related processor
CN112269757A (en) * 2020-09-30 2021-01-26 北京清微智能科技有限公司 Computational array in coarse-grained reconfigurable processor and reconfigurable processor
CN112269757B (en) * 2020-09-30 2023-10-27 北京清微智能科技有限公司 Computing array in coarse-grained reconfigurable processor and reconfigurable processor
CN112712173A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Method and system for acquiring sparse operation data based on MAC (media Access control) multiply-add array
CN113298245A (en) * 2021-06-07 2021-08-24 中国科学院计算技术研究所 Multi-precision neural network computing device and method based on data flow architecture
CN113592066A (en) * 2021-07-08 2021-11-02 深圳市易成自动驾驶技术有限公司 Hardware acceleration method, apparatus, device, computer program product and storage medium
CN113592066B (en) * 2021-07-08 2024-01-05 深圳市易成自动驾驶技术有限公司 Hardware acceleration method, device, equipment and storage medium
CN114239815A (en) * 2021-11-15 2022-03-25 电子科技大学 Reconfigurable neural network computing chip
CN114239815B (en) * 2021-11-15 2023-05-12 电子科技大学 Reconfigurable neural network computing chip
CN114237551A (en) * 2021-11-26 2022-03-25 南方科技大学 Multi-precision accelerator based on pulse array and data processing method thereof
WO2023092669A1 (en) * 2021-11-26 2023-06-01 南方科技大学 Multi-precision accelerator based on systolic array and data processing method therefor
CN114239816B (en) * 2021-12-09 2023-04-07 电子科技大学 Reconfigurable hardware acceleration architecture of convolutional neural network-graph convolutional neural network
CN114239816A (en) * 2021-12-09 2022-03-25 电子科技大学 Reconfigurable hardware acceleration architecture of convolutional neural network-graph convolutional neural network
WO2023124361A1 (en) * 2021-12-30 2023-07-06 上海商汤智能科技有限公司 Chip, acceleration card, electronic device and data processing method
CN114997388A (en) * 2022-06-30 2022-09-02 北京知存科技有限公司 Linear programming-based neural network bias processing method for memory and computation integrated chip

Also Published As

Publication number Publication date
CN111626414B (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN111626414B (en) Dynamic multi-precision neural network acceleration unit
US10831862B2 (en) Performing matrix multiplication in hardware
CN109886400B (en) Convolution neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof
CN110263925B (en) Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
Guo et al. Software-hardware codesign for efficient neural network acceleration
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN110738308B (en) Neural network accelerator
CN104915322B (en) A kind of hardware-accelerated method of convolutional neural networks
US11847553B2 (en) Parallel computational architecture with reconfigurable core-level and vector-level parallelism
CN107609641A (en) Sparse neural network framework and its implementation
US20140344203A1 (en) Neural network computing apparatus and system, and method therefor
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
US20200356809A1 (en) Flexible pipelined backpropagation
Ma et al. FPGA-based AI smart NICs for scalable distributed AI training systems
AU2020395435B2 (en) Flexible precision neural inference processing units
CN113516236A (en) VGG16 network parallel acceleration processing method based on ZYNQ platform
CN113988280B (en) Array computing accelerator architecture based on binarization neural network
US20230195836A1 (en) One-dimensional computational unit for an integrated circuit
US11847072B2 (en) Ai accelerator apparatus using in-memory compute chiplet devices for transformer workloads
US20240094986A1 (en) Method and apparatus for matrix computation using data conversion in a compute accelerator
US20240037379A1 (en) Server system with ai accelerator apparatuses using in-memory compute chiplet devices for transformer workloads
US20230168892A1 (en) Risc-v-based 3d interconnected multi-core processor architecture and working method thereof
Tiwari et al. Design of a Low Power and Area Efficient Bfloat16 based Generalized Systolic Array for DNN Applications
Zhao et al. Deep learning accelerators
CN115587613A (en) Hardware architecture and design space exploration method for accelerating multichannel convolution

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant