CN209708122U - Computing unit, array, module, and hardware system - Google Patents

Computing unit, array, module, and hardware system

Info

Publication number
CN209708122U
CN209708122U (Application CN201920827602.2U)
Authority
CN
China
Prior art keywords
data
module
computing unit
piece
array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201920827602.2U
Other languages
Chinese (zh)
Inventor
李丽
陈沁雨
傅玉祥
曹华锋
何书专
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Ningqi Intelligent Computing Chip Research Institute Co Ltd
Original Assignee
Nanjing Ningqi Intelligent Computing Chip Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Ningqi Intelligent Computing Chip Research Institute Co Ltd filed Critical Nanjing Ningqi Intelligent Computing Chip Research Institute Co Ltd
Priority to CN201920827602.2U priority Critical patent/CN209708122U/en
Application granted granted Critical
Publication of CN209708122U publication Critical patent/CN209708122U/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The utility model discloses a computing unit, an array, a module, and a hardware system, belonging to the field of hardware acceleration for artificial intelligence algorithms. To address the problems in the prior art of the huge data volume and long computation time of sparse convolutional neural network algorithms, the utility model designs an invalid-data elimination mechanism inside the computing unit that removes invalid (zero-valued) weights or input image data, shortening computation time and reducing the power consumed by multiplication and accumulation. It also designs a multi-channel computing unit that completes convolution using a multiplexed accumulation-channel mechanism, reducing resource consumption. Because invalid-data elimination can starve the datapath, the utility model further designs a turn-taking (rotation) mechanism that keeps the computing unit supplied with sufficient operands. With its low power consumption, small area, high throughput, and fast recognition speed, the utility model is suitable for mobile and terminal applications such as smart homes and smart cities, and can efficiently perform license plate recognition, face recognition, and similar tasks.

Description

Computing unit, array, module, and hardware system
Technical field
The present invention relates to the field of hardware acceleration for artificial intelligence algorithms, and in particular to a computing unit, an array, a module, and a hardware system.
Background technique
Convolutional neural networks (Convolutional Neural Network, CNN) are feedforward neural networks widely used in artificial intelligence, including image recognition, big-data processing, and natural language processing. To improve algorithmic accuracy, the model structures of convolutional neural networks have become increasingly complex and their depth keeps growing; the resulting huge parameter counts and excessive computation times hinder deployment of these algorithms on terminal devices, such as smart-home and intelligent-transportation Internet-of-Things applications. These problems have motivated intensive research into both the algorithms and the hardware design of convolutional neural networks, in pursuit of low power consumption and high throughput.
On the algorithm side, one approach is parameter pruning: structured and unstructured pruning sparsify the weights, and activation functions such as ReLU also sparsify the output activation image data of each layer. Another approach is parameter sharing: a particular quantization method is used to train the network into a quantized neural network, such as a binarized or ternarized network, while ensuring that the algorithm's accuracy does not compromise the target application.
In recent years there have been more and more hardware designs for sparse convolutional neural network algorithms, but most of this research concentrates on hardware for conventional sparse convolutional neural networks, for example encoding and decoding the weight matrices and input image matrices. The present invention instead proposes a hardware implementation method targeting sparse quantized convolutional neural networks, for which the cost of conventional coding techniques far exceeds the benefit they bring.
Chinese patent application CN201811486547.1, published on May 3, 2019, discloses an acceleration method for hardware implementations of sparse convolutional neural network inference, including a grouped-pruning parameter determination method for sparse hardware acceleration architectures, a grouped-pruning training method for such architectures, and a deployment method for forward inference of sparse convolutional neural networks: the block length and pruning rate of grouped pruning are determined by the number of multipliers in the hardware architecture; weights beyond the compression ratio are cut with a magnitude-based pruning method; the accuracy and compression ratio of the pruned network are improved through incremental training; and after fine-tuning, the weights and index parameters of the unpruned positions are saved and sent to the computing units of the hardware architecture, which obtain activation values one block length at a time and complete forward inference of the sparse network. That invention sets algorithm-level pruning parameters and strategies based on the hardware architecture, which helps reduce the logical complexity of the sparse accelerator and improves the overall efficiency of forward inference. However, although it improves overall efficiency at the architectural level, it does not process invalid data, so its power consumption remains high and its computation time is insufficiently optimized.
Summary of the invention
1. technical problems to be solved
In view of the huge data volume and long computation time of sparse convolutional neural network algorithms in the prior art, and the fact that current research focuses mostly on conventional hardware design, the present invention provides a computing unit, an array, a module, and a hardware system that flexibly support the implementation of a variety of binarized or ternarized sparse convolutional neural networks, with high resource utilization, large throughput, low power consumption, and small area, suitable for terminal applications.
2. technical solution
The purpose of the present invention is achieved through the following technical solutions.
In a first aspect, a computing unit is provided, including an invalid-data module, a buffer unit group, an adder, a multi-channel partial-sum register group, and multiple gates. Input data is transferred to the buffer unit group after being processed by the invalid-data module; after buffering, the data forms a valid data source for the adder; after passing through the adder, the data goes through the multi-channel partial-sum register group, where it is split into a positive-weight partial sum and a negative-weight partial sum and routed back to the adder through the gates.
Further, the invalid-data module judges whether the input data is zero; input data equal to zero is judged to be invalid and skips the computing unit.
Further, the input data of the invalid-data module is the system's input image data or weight data.
Further, the buffer unit group includes multiple sub-buffer units for buffering the data that has passed through the invalid-data elimination module; after the sub-buffer units, a turn-taking (rotation) mechanism ensures that a sufficient supply of valid data is provided to the adder. Without the elimination mechanism of the invalid-data elimination module and the turn-taking mechanism, the data would be handled in the entirely conventional way: input image data and weights are fed to a multiplier, multiplied, and the products accumulated. Although zero-valued operands have no effect on the final result, they still undergo multiplication and accumulation, wasting resources, time, and energy.
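The contrast drawn above — conventional multiply-accumulate versus invalid-data elimination — can be sketched as a small software model. This is illustrative only, not the patent's circuit; the function names and sample data are invented for the example.

```python
def conventional_mac(pixels, weights):
    """Conventional path: every pair is multiplied and accumulated, zeros included."""
    acc, macs = 0, 0
    for x, w in zip(pixels, weights):
        acc += x * w
        macs += 1
    return acc, macs

def zero_skipping_mac(pixels, weights):
    """Invalid-data elimination: a zero pixel or zero weight skips the compute stage."""
    acc, macs = 0, 0
    for x, w in zip(pixels, weights):
        if x == 0 or w == 0:   # invalid-data elimination module: judged invalid
            continue           # skipped pair cannot change the final sum
        acc += x * w
        macs += 1
    return acc, macs

pixels  = [3, 0, 1, 0, 5]
weights = [1, -1, 0, 1, -1]
assert conventional_mac(pixels, weights) == (-2, 5)
assert zero_skipping_mac(pixels, weights) == (-2, 2)  # same result, fewer MACs
```

The saved multiply-accumulate operations are exactly the resource, time, and energy waste the mechanism targets.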
Further, exploiting the network characteristics of binary or ternary weights, the computing unit performs multiply-accumulate operations using a reconfigurable multiplexed accumulation-channel mechanism. The multi-channel partial-sum register group stores the partial-sum accumulation result corresponding to each weight value and, according to the weight value, selects the accumulation source data supplied to the adder; the gates reconstruct the adder's accumulation path according to the weight value.
In a second aspect, an array is provided. The array is composed of computing units arranged in an array of Z slices, each slice having Y rows and each row having X computing units, forming an X*Y*Z three-dimensional structure; the array is a three-dimensional convolution computing array used for multiply-accumulate calculation.
Further, every computing unit in a row of the array receives the same data, and every computing unit in a column receives the same data: each row receives input image data, and each column receives the weight data of a convolution kernel.
In a third aspect, a module is provided, including the above array and an aggregation processing unit; the aggregation processing unit handles the addition of the partial sums produced by the computing array's convolution, the normalization-layer processing, and the activation-function processing.
Further, the aggregation processing unit includes multiple sub-aggregation processing units, each of which contains two constant multipliers and one adder; each sub-aggregation processing unit processes the output data of one slice of the array. The partial sums computed by the array are added by the aggregation processing unit to obtain the final multiply-accumulate result; and, exploiting the linearity of the normalization layer, the normalization computation is merged into the partial-sum increment.
In a fourth aspect, a hardware system is provided, including on-chip equipment and off-chip equipment. The on-chip equipment includes the above module as its computing module, together with a control module, a configuration module, a storage module, and a bus interface; the off-chip equipment includes a CPU and an external storage module. The off-chip CPU is electrically connected to the on-chip control module; the off-chip external memory is electrically connected to the on-chip storage module; the on-chip control module is electrically connected to the on-chip configuration module, storage module, and computing module; the on-chip configuration module is electrically connected to the on-chip storage module and computing module; and the on-chip storage module is electrically connected to the computing module.
In an implementation method of the hardware system, instructions from the off-chip CPU are read and decoded by the on-chip control module through the bus interface, the decoded configuration instructions are delivered to the on-chip configuration module, and the data path is reconfigured according to the configuration information. Execution instructions issued by the on-chip control module control the start and end of each module's tasks. Data from the off-chip storage unit is delivered to the on-chip storage module through the bus interface. According to the execution and configuration instructions it receives, the on-chip computing module reads data from the on-chip storage unit, performs the computation, and writes the results back to the on-chip storage unit as instructed.
3. beneficial effect
Compared with the prior art, the present invention has the advantages that
(1) The present invention designs an invalid-data elimination mechanism: compared with conventional data processing, and exploiting the sparsity of binary or ternary convolutional neural networks, it removes invalid weights or input image data, shortening computation time, reducing the power consumed by multiplication and accumulation, and cutting resource, time, and energy consumption;
(2) The present invention designs a multi-channel computing unit: exploiting the low bit-width of binary or ternary weights, it performs convolution with a reconfigurable multiplexed accumulation-channel mechanism; a register group stores, in separate channels, the partial-sum accumulation result corresponding to each weight value, the accumulation source data supplied to the adder is selected according to the weight value, and the gates reconstruct the adder's accumulation path according to the weight value, reducing resource consumption;
(3) For the case where invalid data is eliminated, the present invention further designs a turn-taking (rotation) mechanism, which keeps some sub-buffer units of the buffer unit group receiving data while one sub-buffer unit supplies valid data to the multiplexed accumulation channel, maintaining a sufficient supply of operands to the computing unit;
(4) The present invention proposes a hardware-oriented optimization: the convolution and normalization-layer computations are merged according to their common linearity, and the simplified linear computation is executed instead, reducing computational redundancy, power consumption, and area overhead.
In conclusion, the present invention effectively improves throughput and computing-resource utilization when binarized or ternarized sparse convolutional neural networks are accelerated in hardware; its design is small in area and low in power consumption and has good practical applications, especially in terminal Internet-of-Things scenarios.
Detailed description of the invention
Fig. 1 is a schematic diagram of the overall hardware architecture of the present invention;
Fig. 2 is a schematic diagram of the sub-computing unit and the multiplexed accumulation-channel paths of the present invention;
Fig. 3 is a schematic diagram of the elimination mechanism of the invalid-data elimination module and the turn-taking working mechanism of the present invention.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment.
Embodiment
As shown in Fig. 1, a hardware system includes on-chip equipment and off-chip equipment. The on-chip equipment includes a control module, a configuration module, a storage module, a computing module, and a bus interface; the off-chip equipment includes a CPU and an external storage module. The off-chip CPU is electrically connected to the on-chip control module; the off-chip external memory is electrically connected to the on-chip storage module; the on-chip control module is electrically connected to the on-chip configuration module, storage module, and computing module; the on-chip configuration module is electrically connected to the on-chip storage module and computing module; and the on-chip storage module is electrically connected to the computing module.
When the system performs convolutional neural network computation, the off-chip CPU transfers instructions through the bus interface to the on-chip control module for decoding, the decoded configuration instructions are transferred to the on-chip configuration module, and the system's data path is reconfigured according to the configuration information. Data from the off-chip storage unit is delivered to the on-chip storage module through the bus interface. The on-chip control module issues execution instructions that control the configuration module, the storage module, and the computing module; according to the execution and configuration instructions it receives, the on-chip computing module reads data from the on-chip storage unit, performs the computation, and writes the results back to the on-chip storage unit as instructed. According to the configuration and control information, the on-chip storage control unit determines whether result data is transmitted off-chip through the interface or retained in the on-chip storage unit, and the generation of data and structured data addresses controls the interaction between the on-chip storage unit and the computing module.
The on-chip computing module is heterogeneous and includes a three-dimensional convolution array and an aggregation processing unit. The three-dimensional convolution array consists of 4*4*8 computing units: the array has 8 slices, each slice has 4 rows, and each row has 4 computing units. The three-dimensional convolution array performs multiply-accumulate calculation on the system's input data, which includes input image data, weight data, and activation values. Every computing unit in a row of the array receives the same input image data, and every computing unit in a column receives the same weight data. After the system's input data has been processed by the three-dimensional convolution array, the aggregation processing unit performs the multiplication and addition of the convolution partial sums, the normalization-layer processing, and the activation-function processing; the image data output at the right edge of the array is transferred into the aggregation processing unit. The aggregation processing unit includes multiple sub-aggregation processing units, each containing two constant multipliers and one adder; each sub-aggregation processing unit processes the output data of one slice of the three-dimensional convolution array. The partial sums computed by the array undergo addition in the aggregation processing unit to obtain the final multiply-accumulate result; and, exploiting the linearity of the normalization layer, the normalization computation is merged into the partial-sum increment.
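The row/column data distribution described above can be modeled in software: within one slice, each row broadcasts one input pixel and each column broadcasts one kernel weight, so one slice of PEs accumulates an outer product of partial products per step. The dimensions and names below are illustrative assumptions, not taken from the patent's RTL.

```python
X, Y = 4, 4  # columns (one weight each) and rows (one pixel each) per slice

def slice_cycle(row_pixels, col_weights, partials):
    """One step of one array slice: broadcast along rows/columns, accumulate per PE."""
    for y in range(Y):
        for x in range(X):
            # PE (y, x) sees the pixel broadcast on row y and the weight on column x
            partials[y][x] += row_pixels[y] * col_weights[x]
    return partials

partials = [[0] * X for _ in range(Y)]
partials = slice_cycle([1, 2, 3, 4], [1, 0, -1, 2], partials)
assert partials[0] == [1, 0, -1, 2]   # row 0: pixel 1 times each column weight
assert partials[3] == [4, 0, -4, 8]   # row 3: pixel 4 times each column weight
```

Calling `slice_cycle` repeatedly with successive pixel/weight broadcasts models the accumulation of convolution partial sums across steps.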
As shown in Fig. 2, each computing unit in the three-dimensional convolution array includes an invalid-data elimination module, a buffer unit group, an adder, a multi-channel partial-sum register group, and multiple gates. The invalid-data module applies an invalid-data elimination mechanism that removes invalid weights or input image data, shortening computation time and reducing the power consumed by multiplication and accumulation. The system's input data is transferred to the buffer unit group after being processed by the invalid-data module; after buffering, the data forms a valid data source for the adder, and a turn-taking mechanism within the buffer unit group ensures that enough valid data reaches the adder, keeping the computing unit fully supplied. After passing through the adder, the buffered data goes through the multi-channel partial-sum register group, where it is split into the positive-weight partial sum and the negative-weight partial sum of the binarized or ternarized sparse convolutional neural network; the gates route the positive-weight and negative-weight partial sums along separate channels back to the adder, reconstructing the accumulation channel. This multi-channel computing-unit design reduces resource consumption.
The invalid-data elimination mechanism of the invalid-data elimination module in the computing unit works as follows: the invalid-data module judges whether the input image data or weight data is zero; if the input data is zero, it is judged invalid and skips the computing unit's calculation. The system's input data enters the buffer unit group after the invalid-data module; the buffer unit group includes three sub-buffer units that buffer the data supplied to the multiplexed accumulation channel, and the turn-taking mechanism across the three sub-buffer units ensures that enough valid source data is provided to the adder. Without the elimination mechanism of the invalid-data elimination module and the turn-taking mechanism, the data would be handled in the entirely conventional way: input image data and weights are fed to a multiplier, multiplied, and the products accumulated. Although zero-valued operands have no effect on the final result, they still undergo multiplication and accumulation, wasting resources, time, and energy; using the invalid-data module instead shortens the system's computation time, reduces the number of calculations, and lowers the system's power consumption.
As shown in Fig. 3, under the invalid-data elimination mechanism and the turn-taking working mechanism, the computing unit fetches two weight values and two input image values from the on-chip storage module at a time, ensuring as far as possible that at any moment two values are being deposited into two of the sub-buffer units of the buffer unit group. In the prior art, convolutional neural network computation multiplies image data by weights and then accumulates in the adder — the conventional convolution operation; the present invention instead eliminates data before the convolution operation, so that invalid data is rejected and does not participate in the computing module's calculation, shortening computation time and reducing computation power. To guarantee that, after invalid-data elimination, enough data still enters the multiplexed accumulation channel for the next calculation step, a buffer unit group is added to buffer the data remaining after elimination; its three sub-buffer units supply source data to the adder through the turn-taking mechanism, guaranteeing that enough source data participates in the calculation. As shown in Fig. 3, at time 0 the input image data is invalid and is filtered out; at time 1, no invalid data is present, and input image data 2 and 1 are transmitted into sub-buffer unit 0 and sub-buffer unit 1 respectively; at time 2, no invalid data is present, and input image data 1 and 3 are again transmitted into sub-buffer unit 0 and sub-buffer unit 1; at time 3, input image data 5 is transmitted into sub-buffer unit 2, and at this moment the data 1 in sub-buffer unit 0 has already been brought into the multiplexed accumulation channel. The rotation mechanism keeps two sub-buffer units receiving data while one sub-buffer unit provides valid data to the multiplexed accumulation channel, ensuring that enough valid source data is supplied to the adder.
As shown in Fig. 2, exploiting the network characteristics of binary or ternary weights, the computing unit of the present invention performs multiply-accumulate operations using a reconfigurable multiplexed accumulation-channel mechanism. The multi-channel partial-sum register group stores the partial-sum accumulation result corresponding to each weight value and, according to the weight value, selects the accumulation source data supplied to the adder; the gates reconstruct the adder's accumulation path according to the weight value. The weights of a binarized convolutional neural network take two values, -w1 or w2; the weights of a ternarized convolutional neural network take three values, -w1, 0, or w2. Since the weight-equals-zero case has already been eliminated in the invalid-data elimination module, the multiplexed accumulation-channel mechanism needs only two cases: a positive-weight accumulation channel and a negative-weight accumulation channel. Positive-weight data and negative-weight data are computed separately in their channels, and the convolution operation is completed by continually accumulating and reconstructing the computation channel, keeping the calculation simple and reducing resource consumption.
Taking the positive-weight accumulation channel as an example: when the weight input to the computing unit at a given moment is positive and the input image data is not 0, the input image data passes the invalid-data elimination module, enters the buffer unit group, and is buffered; at this moment the value in the positive-weight partial-sum register of the multi-channel partial-sum register group is selected and delivered to the adder, where it is added to the input image data; the result is written back to the positive-weight partial-sum register, completing one pass of the positive-weight accumulation channel. When the weight is negative, the unit reconstructs the negative-weight accumulation channel by controlling the gates.
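The two-channel accumulation just described can be sketched as follows: one logical adder, two partial-sum registers, and a gate that routes each zero-filtered pixel by the sign of its weight. This is a behavioral model under stated assumptions, not the patent's circuit.

```python
def convolve_mux_channel(pixels, weight_signs):
    """weight_signs[i] is +1 (weight +w1) or -1 (weight -w2); zeros were removed earlier."""
    pos_partial = 0  # partial-sum register for the positive-weight channel
    neg_partial = 0  # partial-sum register for the negative-weight channel
    for x, s in zip(pixels, weight_signs):
        if s > 0:                 # gate reconstructs the positive-weight channel
            pos_partial += x      # same adder, positive partial-sum register
        else:                     # gate reconstructs the negative-weight channel
            neg_partial += x      # same adder, negative partial-sum register
    return pos_partial, neg_partial

Y1, Y2 = convolve_mux_channel([3, 1, 4, 2], [+1, -1, +1, -1])
assert (Y1, Y2) == (7, 3)  # the weighted sum w1*Y1 - w2*Y2 is formed downstream
```

Note that no multiplier is needed inside the unit: the multiplications by w1 and w2 are deferred to the aggregation processing unit's constant multipliers.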
Suppose the positive weight of the computing module's convolution kernel is +w1 with participating partial-sum result Y1, and the negative weight is -w2 with participating partial-sum result Y2. Then Y1 and Y2 are stored in the corresponding registers of the different channels of the register group and transferred to the aggregation processing unit for the next calculation step, whose result is denoted Y: Y = w1*Y1 - w2*Y2 + bias, where bias is the bias. The normalization layer that follows the convolution simplifies to the linear calculation Y = k*x + b. Merging the two formulas gives Y = k*w1*Y1 - k*w2*Y2 + (k*bias + b). This calculation is completed by the aggregation processing unit, where k*w1, k*w2, and (k*bias + b) can be computed offline.
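A quick numerical check of the convolution/normalization merge, with the constants k*w1, k*w2, and (k*bias + b) treated as values precomputed offline; all numbers here are arbitrary examples.

```python
w1, w2, bias = 0.5, 0.25, 1.0   # ternary weight magnitudes and convolution bias
k, b = 2.0, 0.1                 # normalization-layer scale and shift
Y1, Y2 = 7, 3                   # positive / negative channel partial sums

# Two-step reference: convolution first, then the normalization layer.
conv = w1 * Y1 - w2 * Y2 + bias
two_step = k * conv + b

# Fused form: constants folded offline, one linear step at run time.
kw1, kw2, c = k * w1, k * w2, k * bias + b
merged = kw1 * Y1 - kw2 * Y2 + c

assert abs(two_step - merged) < 1e-12
assert abs(merged - 7.6) < 1e-9
```

Folding the constants removes the separate normalization pass, which is the redundancy reduction the patent attributes to the aggregation processing unit.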
During operation of the hardware system of the present invention, the off-chip external memory transmits data through the bus interface to the on-chip storage module, and the off-chip CPU sends instructions through the bus interface to the on-chip control module; the on-chip control module controls the operation of the other on-chip modules, decoding the received instructions and transferring them to the on-chip configuration module, storage module, and computing module. The on-chip configuration module receives the configuration instructions sent by the control module and reconfigures the system's data path according to the configuration information. The on-chip computing module performs the convolutional neural network computation, reading data from the on-chip storage unit according to the execution and configuration instructions it receives.
The on-chip computing module performs its calculation through the computing units of the three-dimensional convolution array. The input data, including input image data and weight data, first passes through the invalid-data elimination module of the computing unit: data judged to be zero is eliminated and skips the computing unit rather than being transferred to the buffer unit group. By skipping invalid data, the invalid-data elimination module reduces the amount of data computed, reduces the power consumed by multiplication and accumulation, and shortens computation time. The buffer unit group has three sub-buffer units, which store the data output by the invalid-data elimination module and supply it to the adder in rotation; the turn-taking mechanism ensures that the adder has enough valid source data and does not run short because the elimination mechanism of the invalid-data module reduces the input data. According to the sparsity characteristics of the convolutional neural network, the network is binarized or ternarized: binarized weights take two values, -w1 or w2, and ternarized weights take three values, -w1, 0, or w2, where the weight-equals-zero case is eliminated in the invalid-data elimination module. The data after the adder's calculation is therefore split into a positive-weight partial sum and a negative-weight partial sum; the gates deposit the data into different channels and registers according to the weight value and also reconstruct the adder's accumulation path according to the weight value, so that the reconfigurable multiplexed accumulation-channel mechanism carries out the convolution operation with reduced resource loss. The data from the three-dimensional convolution array is transferred to the aggregation processing unit of the computing module, where the multipliers and adders of the sub-aggregation processing units produce the final accumulated result; exploiting the linearity of the normalization layer, the normalization computation is merged into the partial-sum increment. The hardware system of the present invention reduces the amount of data computed, reduces the system's resource consumption, shortens computation time, recognizes quickly, and has high throughput; it can efficiently complete operations such as license plate recognition and face recognition and has wide application.
The present invention and its embodiments have been described above schematically, and the description is not limiting; the present invention may be realized in other specific forms without departing from its spirit or essential characteristics. What is shown in the drawings is only one embodiment of the present invention, the actual structure is not limited thereto, and any reference signs in the claims shall not limit the claims concerned. Therefore, if those of ordinary skill in the art, inspired by this disclosure and without departing from its purpose, design structures similar to this technical solution and embodiment without inventive effort, they shall all fall within the protection scope of this patent. In addition, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Multiple elements recited in a product claim may also be realized by a single element in software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.

Claims (10)

1. A computing unit, characterized by comprising an invalid-data module, a buffer cell group, an adder, a multichannel partial-sum register group and a plurality of gates; input data are transferred to the buffer cell group after being processed by the invalid-data module, the buffer cell group provides a valid data source to the adder after buffering the data, data output by the adder pass through the multichannel partial-sum register group, and the final data, divided into a positive-weight part and a negative-weight part, are transferred back to the adder through the gates.
2. The computing unit according to claim 1, characterized in that the invalid-data module judges whether the input data are zero; input data that are zero are judged to be invalid data and skip the computing unit.
3. The computing unit according to claim 2, characterized in that the input data of the invalid-data module are the input image data or the weight data of the system.
4. The computing unit according to claim 1, characterized in that the buffer cell group comprises a plurality of sub-buffer cells for buffering the data that have passed through the invalid-data elimination module; after the sub-buffer cells, the input data are rotated by a data-rotation mechanism so as to ensure that a sufficient valid data source is provided to the adder.
5. The computing unit according to claim 1, characterized in that the computing unit performs multiply-accumulate operations through a reconfigurable mechanism of multiplexed accumulation channels; the multichannel partial-sum register group stores the partial sum and accumulation result corresponding to each weight, the accumulation source data provided to the adder are selected according to the value of the weight so as to reconstruct the accumulation channel, and the gates reconstruct the accumulation path of the adder according to the value of the weight.
6. An array, comprising the computing unit according to any one of claims 1-5, characterized in that the array is formed by computing units arranged in an array: the array has Z slices, each slice is composed of Y rows, and each row has X computing units; the array is used for multiply-accumulate calculation.
7. The array according to claim 6, characterized in that each row of computing units of the array receives the same input image data, and each column of computing units of the array receives the same weight data.
8. A module, comprising the array according to claim 7, characterized by further comprising an aggregation processing unit; the aggregation processing unit processes the addition of the partial sums of the convolution calculation of the array, the normalization-layer processing and the activation-function processing.
9. The module according to claim 8, characterized in that the aggregation processing unit comprises a plurality of sub-aggregation processing units, each sub-aggregation processing unit comprises two constant multipliers and one adder, and each sub-aggregation processing unit processes the data output by one corresponding slice of computing units of the array; the partial sums calculated by the array are added by the aggregation processing unit to obtain the final multiply-accumulate result; and, exploiting the linearity of the normalization layer, the computation of the normalization layer is folded into the partial sum and an increment term.
10. A hardware system, comprising an on-chip device and an off-chip device, characterized in that the on-chip device comprises the module according to any one of claims 8-9 as a computing module; the on-chip device further comprises a control module, a configuration module, a storage module and a bus interface, and the off-chip device comprises a CPU and an external storage module; the CPU of the off-chip device is electrically connected with the control module of the on-chip device, the external memory of the off-chip device is electrically connected with the storage module of the on-chip device, the control module of the on-chip device is electrically connected with the configuration module, the storage module and the computing module of the on-chip device, the configuration module of the on-chip device is electrically connected with the storage module and the computing module of the on-chip device, and the storage module of the on-chip device is electrically connected with the computing module.
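The normalization-layer folding mentioned in claim 9 relies on batch normalization being an affine map, so it can be absorbed into a constant multiplier and one extra addend per output channel. The sketch below is an assumed numeric model: the parameter names gamma, beta, mu and var are the standard batch-normalization parameters, not symbols taken from the patent.

```python
import math

def fold_batchnorm(gamma, beta, mu, var, eps=1e-5):
    """Fold batch-norm parameters into a scale and an increment.

    Because normalization is linear (affine) in its input,
    BN(y) = gamma * (y - mu) / sqrt(var + eps) + beta
    rewrites as scale * y + increment, so the aggregation unit only
    needs a constant multiplier and a single added constant.
    """
    sigma = math.sqrt(var + eps)
    scale = gamma / sigma
    increment = beta - gamma * mu / sigma
    return scale, increment
```

Applying `scale * y + increment` to a convolution partial sum y then yields the same value as evaluating the normalization layer directly, which is the sense in which the normalization computation is "incorporated into the partial sum and an increment part".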
CN201920827602.2U 2019-06-03 2019-06-03 A kind of computing unit, array, module, hardware system Active CN209708122U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201920827602.2U CN209708122U (en) 2019-06-03 2019-06-03 A kind of computing unit, array, module, hardware system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201920827602.2U CN209708122U (en) 2019-06-03 2019-06-03 A kind of computing unit, array, module, hardware system

Publications (1)

Publication Number Publication Date
CN209708122U true CN209708122U (en) 2019-11-29

Family

ID=68651308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201920827602.2U Active CN209708122U (en) 2019-06-03 2019-06-03 A kind of computing unit, array, module, hardware system

Country Status (1)

Country Link
CN (1) CN209708122U (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160545A (en) * 2019-12-31 2020-05-15 北京三快在线科技有限公司 Artificial neural network processing system and data processing method thereof
CN115459896A (en) * 2022-11-11 2022-12-09 北京超摩科技有限公司 Control method, control system, medium and chip for multi-channel data transmission

Similar Documents

Publication Publication Date Title
CN110069444A (en) A kind of computing unit, array, module, hardware system and implementation method
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN109328361B (en) Accelerator for deep neural network
CN110070178A (en) A kind of convolutional neural networks computing device and method
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN111414994B (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN110378468A (en) A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN108932548A (en) A kind of degree of rarefication neural network acceleration system based on FPGA
CN209708122U (en) A kind of computing unit, array, module, hardware system
CN111723947A (en) Method and device for training federated learning model
CN112286864B (en) Sparse data processing method and system for accelerating operation of reconfigurable processor
CN112836813B (en) Reconfigurable pulse array system for mixed-precision neural network calculation
CN112257844B (en) Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof
CN113361695A (en) Convolutional neural network accelerator
CN103577161A (en) Big data frequency parallel-processing method
Shu et al. High energy efficiency FPGA-based accelerator for convolutional neural networks using weight combination
CN113313244B (en) Near-storage neural network accelerator for addition network and acceleration method thereof
CN109993293A (en) A kind of deep learning accelerator suitable for stack hourglass network
CN116431562B (en) Multi-head attention mechanism fusion calculation distribution method based on acceleration processor
CN107831823B (en) Gaussian elimination method for analyzing and optimizing power grid topological structure
CN116012657A (en) Neural network-based 3D point cloud data processing method and accelerator
Kuang et al. Entropy-based gradient compression for distributed deep learning
CN110245756A (en) Method for handling the programming device of data group and handling data group
CN112905954A (en) CNN model convolution operation accelerated calculation method using FPGA BRAM

Legal Events

Date Code Title Description
GR01 Patent grant