CN209708122U - A kind of computing unit, array, module, hardware system - Google Patents
- Publication number
- CN209708122U (application CN201920827602.2U)
- Authority
- CN
- China
- Prior art keywords
- data
- module
- computing unit
- piece
- array
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The utility model discloses a computing unit, array, module, and hardware system, belonging to the field of hardware acceleration for artificial-intelligence algorithms. To address the problems of huge data volumes and long computation times in existing sparse convolutional neural network algorithms, the utility model designs an invalid-data elimination mechanism in the computing unit that removes invalid weights or input image data, shortening computation time and reducing the power consumed by multiplication and accumulation. A multi-channel pipelined computing unit is designed that completes convolution through a multiplexed accumulation-channel mechanism, reducing resource consumption. To compensate for invalid-data elimination, a source-number rotation mechanism is also devised that keeps the computing unit adequately supplied with operands. Owing to its low power consumption, small area, high throughput, and fast recognition speed, the utility model is suitable for mobile-terminal applications such as smart homes and smart cities, and can efficiently complete license-plate recognition, face recognition, and the like.
Description
Technical field
The present invention relates to the field of hardware acceleration for artificial-intelligence algorithms, and in particular to a computing unit, array, module, and hardware system.
Background art
Convolutional neural networks (Convolutional Neural Network, CNN) are a class of feedforward neural networks with wide application in artificial intelligence, including image recognition, big-data processing, and natural language processing. To improve algorithm accuracy, CNN model structures have grown more complex and deeper, and the resulting huge parameter counts and long computation times hinder deployment of these algorithms on terminal devices, for example in smart-home and intelligent-transportation IoT applications. These problems have prompted further research into CNN algorithms and hardware design in pursuit of low power consumption and high throughput.
Algorithmically, one approach is parameter pruning: structured and unstructured pruning sparsify the weights, and activation functions such as ReLU also sparsify each layer's output activation data. Another approach is parameter sharing: a particular quantization method is used to train the network into a quantized neural network, such as a binarized or ternarized network, while ensuring that the algorithm's quality does not compromise the realization of the application.
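The weight format that the later sections target — three values {-w1, 0, +w2} — can be illustrated with a minimal sketch. The threshold rule below is a common ternarization heuristic chosen for illustration only; the patent does not specify the quantization method, so `threshold` and the symmetric levels are assumptions.

```python
def ternarize(weights, w1, w2, threshold):
    """Map real-valued weights to the set {-w1, 0, +w2}.

    The threshold rule is an illustrative assumption, not the
    patent's method; it only shows the sparse ternary weight
    format that the hardware described later exploits.
    """
    out = []
    for w in weights:
        if w > threshold:
            out.append(w2)       # positive quantization level
        elif w < -threshold:
            out.append(-w1)      # negative quantization level
        else:
            out.append(0)        # pruned entry -> sparsity
    return out

print(ternarize([0.8, -0.05, -0.6, 0.1], w1=1.0, w2=1.0, threshold=0.3))
# → [1.0, 0, -1.0, 0]
```

The zero entries produced this way are exactly the "invalid data" that the elimination mechanism described below skips.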
In recent years, hardware designs for sparse CNN algorithms have multiplied, but most related research concentrates on hardware for conventional sparse CNNs, for example encoding and decoding the weight matrix and the input-image matrix; for such conventional coding techniques, the cost incurred far exceeds the benefit obtained. The present invention instead designs a hardware implementation method targeting sparse quantized CNNs.
Chinese patent application CN201811486547.1, published on May 3, 2019, discloses an acceleration method for hardware inference of sparse CNNs, including a grouped-pruning parameter determination method for a sparsity-oriented hardware acceleration architecture, a grouped-pruning training method for that architecture, and a deployment method for forward inference of sparse CNNs: the group length and pruning rate of grouped pruning are determined from the number of multipliers in the hardware architecture; a magnitude-based pruning method prunes the weights beyond the compression ratio; incremental training restores the network's accuracy and compression ratio after pruning; after the pruned network is fine-tuned, the weights of the unpruned positions and their index parameters are saved and sent to the computing units of the hardware architecture, and the computing units fetch activation values of the group length and complete forward inference of the sparse network. That invention derives algorithm-level pruning parameters and pruning strategies from the hardware architecture, which helps reduce the logic complexity of the sparse accelerator and improves its overall forward-inference efficiency. However, although it improves overall efficiency from the architectural side, it does not process invalid data, so power consumption remains high and computation time is not sufficiently optimized.
Summary of the invention
1. Technical problems to be solved
Existing sparse CNN algorithms involve huge data volumes and long computation times, and current research focuses mostly on related hardware design. The present invention provides a computing unit, array, module, and hardware system that flexibly support the realization of a variety of binarized or ternarized sparse CNNs, with high resource utilization, high throughput, low power consumption, and small area, suitable for terminal applications.
2. Technical solution
The purpose of the present invention is achieved through the following technical solutions.
In a first aspect, a computing unit is provided, including an invalid-data module, a buffer-unit group, an adder, a multi-channel partial-sum register group, and multiple selectors. Input data is processed by the invalid-data module and then transferred to the buffer-unit group, which buffers the data and provides a valid data source to the adder; after the adder, the data passes through the multi-channel partial-sum register group, where it is split into a positive-weight partial sum and a negative-weight partial sum and routed back to the adder through the selectors.
Further, the invalid-data module judges whether the input data is zero; zero-valued input data is judged invalid and skips the computing unit.
Further, the input data of the invalid-data module is the system's input image data or weight data.
Further, the buffer-unit group includes multiple sub-buffer units for buffering data that has passed through the invalid-data elimination module; after the sub-buffer units, a source-number rotation mechanism ensures that a sufficient supply of valid data is provided to the adder. Without the elimination mechanism and the source-number rotation mechanism, data would be handled in the entirely conventional way: input image data and weights are fed to a multiplier and the products accumulated. Although zero-valued operands have no effect on the final result, such invalid data still undergoes multiplication and accumulation, wasting resources, time, and energy.
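The contrast between the conventional path and the elimination mechanism can be sketched as follows. This is a behavioral model, not the hardware itself; the operation counter is added only to make the saved work visible.

```python
def mac_with_elimination(inputs, weights):
    """Multiply-accumulate that skips invalid (zero) operand pairs.

    Models the invalid-data elimination idea: a pair is dropped
    before it reaches the multiplier and adder, so zero operands
    cost no arithmetic. Returns (accumulated sum, multiplies done).
    """
    acc = 0
    performed = 0
    for x, w in zip(inputs, weights):
        if x == 0 or w == 0:     # judged invalid -> skip the unit
            continue
        acc += x * w             # only valid pairs are multiplied
        performed += 1
    return acc, performed

acc, ops = mac_with_elimination([3, 0, 1, 5, 0], [2, 4, 0, 1, 7])
# acc = 3*2 + 5*1 = 11; only 2 of the 5 pairs performed a multiply
```

The conventional path would perform all five multiplies for the same result, which is the waste of resources, time, and energy described above.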
Further, exploiting the characteristics of binary- or ternary-weight networks, the computing unit performs multiply-accumulate operations through a reconfigurable multiplexed-accumulation-channel mechanism: the multi-channel partial-sum register group stores the partial-sum accumulation result corresponding to each weight value and selects, according to the value of the weight, the accumulation source provided to the adder; the selectors reconfigure the adder's accumulation path according to the value of the weight.
In a second aspect, an array is provided in which the computing units are arranged in an array: the array has Z slices, each slice consists of Y rows, and each row has X computing units, forming an X*Y*Z three-dimensional convolution array used for multiply-accumulate computation.
Further, every computing unit in a row of the array receives the same data and every computing unit in a column of the array receives the same data: each row's received data is input image data, and each column's input data is convolution-kernel weight data.
In a third aspect, a module is provided, including the array and further including an aggregation processing unit, which handles the addition of the partial sums of the array's convolution computation, normalization-layer processing, and activation-function processing. Further, the aggregation processing unit includes multiple sub-aggregation processing units, each containing two constant multipliers and an adder; each sub-aggregation processing unit processes the output data of one slice of the array. The partial sums computed by the array undergo addition in the aggregation processing unit to obtain the final multiply-accumulate result, and, exploiting the linearity of the normalization layer, the normalization-layer computation is folded into the partial-sum addition.
In a fourth aspect, a hardware system is provided, including on-chip equipment and off-chip equipment. The on-chip equipment includes the above module as its computing module, and further includes a control module, a configuration module, a memory module, and a bus interface; the off-chip equipment includes a CPU and an external memory module. The off-chip CPU is electrically connected to the on-chip control module; the off-chip external memory is electrically connected to the on-chip memory module; the on-chip control module is electrically connected to the on-chip configuration module, memory module, and computing module; the on-chip configuration module is electrically connected to the on-chip memory module and computing module; and the on-chip memory module is electrically connected to the computing module.
In an implementation method of the hardware system, instructions from the off-chip CPU are read through the bus interface into the on-chip control module and decoded; the decoded configuration instructions are delivered to the on-chip configuration module, and the data path is reconfigured according to the configuration information. Execution instructions issued by the on-chip control module start or end the task execution of each module. Data from the off-chip storage unit is delivered through the bus interface into the on-chip memory module. The on-chip computing module reads data from the on-chip storage unit and computes according to the received execution and configuration instructions, and the computed results are stored back to the on-chip storage unit as instructed.
3. Beneficial effects
Compared with the prior art, the present invention has the following advantages:
(1) The present invention devises an invalid-data elimination mechanism. Compared with conventional data processing, it exploits the sparsity of binary or ternary CNNs to remove invalid weights or input image data, shortening computation time, lowering the power consumed by multiplication and accumulation, and reducing the consumption of resources, time, and energy.
(2) The present invention devises a multi-channel pipelined computing unit. Exploiting the low bit-width of binary or ternary weights, it performs convolution through a reconfigurable multiplexed-accumulation-channel mechanism: a register group stores the partial-sum accumulation result of each weight value on its corresponding channel, the accumulation source provided to the adder is selected according to the value of the weight, and the selectors reconfigure the adder's accumulation path according to the value of the weight, reducing resource consumption.
(3) For operation under invalid-data elimination, the present invention also devises a source-number rotation mechanism, which keeps some sub-buffer units of the buffer-unit group receiving data while one sub-buffer unit provides valid data to the multiplexed accumulation channel, maintaining an adequate operand supply for the computing unit.
(4) The present invention proposes a hardware-oriented optimization: the convolution and normalization-layer computations are merged according to their common linearity, and the simplified linear calculation is executed instead, reducing computational redundancy, power consumption, and area overhead.
In conclusion, the present invention effectively improves throughput and computing-resource utilization when binarized or ternarized sparse CNNs are accelerated in hardware. Its design area is small and its power consumption low, giving it good practical applicability, especially in terminal IoT applications.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall hardware architecture of the invention;
Fig. 2 is a schematic diagram of the sub-computing unit of the invention and the paths of its multiplexed accumulation-channel mechanism;
Fig. 3 is a schematic diagram of the elimination mechanism of the invalid-data elimination module of the invention and of the source-number rotation mechanism.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
Embodiment
As shown in Fig. 1, a hardware system includes on-chip equipment and off-chip equipment. The on-chip equipment includes a control module, a configuration module, a memory module, a computing module, and a bus interface; the off-chip equipment includes a CPU and an external memory module. The off-chip CPU is electrically connected to the on-chip control module; the off-chip external memory is electrically connected to the on-chip memory module; the on-chip control module is electrically connected to the on-chip configuration module, memory module, and computing module; the on-chip configuration module is electrically connected to the on-chip memory module and computing module; and the on-chip memory module is electrically connected to the computing module.
When the system performs CNN data processing, the off-chip CPU transfers instructions through the bus interface to the on-chip control module for decoding, and the decoded configuration instructions are transferred to the on-chip configuration module; the system's data path is reconfigured according to the configuration information. Data from the off-chip storage unit is delivered through the bus interface into the on-chip memory module. The on-chip control module issues execution instructions that control the configuration module, memory module, and computing module; the on-chip computing module reads data from the on-chip storage unit according to the received execution and configuration instructions and computes, and the computed results are stored back to the on-chip storage unit as instructed. The on-chip storage control unit determines, according to the configuration and control information, whether result data is transmitted off-chip through the interface or retained in the on-chip storage unit, and it controls the generation of input-data and weight-data addresses and the interaction between the on-chip storage unit and the computing module.
The on-chip computing module is a heterogeneous computing module comprising a three-dimensional convolution array and an aggregation processing unit. The three-dimensional convolution array has 4*4*8 computing units arranged as an array of 8 slices, each slice consisting of 4 rows with 4 computing units per row. The three-dimensional convolution array performs multiply-accumulate computation on the system's input data, which includes input image data, weight data, and activation values. Every computing unit in a row of the array receives the same input image data, and every computing unit in a column receives the same weight data. After the system's input data passes through the three-dimensional convolution array, the aggregation processing unit performs the multiplication and addition of the convolution's partial sums, normalization-layer processing, and activation-function processing; the output image data is transferred to the aggregation processing unit at the right side of the array. The aggregation processing unit includes multiple sub-aggregation processing units, each containing two constant multipliers and an adder; each sub-aggregation processing unit processes the output data of one slice of the three-dimensional convolution array. The partial-sum data computed by the array undergoes addition in the aggregation processing unit to obtain the final multiply-accumulate result, and, exploiting the linearity of the normalization layer, the normalization-layer computation is folded into the partial-sum addition.
As shown in Fig. 2, each computing unit in the three-dimensional convolution array includes an invalid-data elimination module, a buffer-unit group, an adder, a multi-channel partial-sum register group, and multiple selectors. The invalid-data module applies an invalid-data elimination mechanism that removes invalid weights or input image data, shortening computation time and reducing the power consumed by multiplication and accumulation. The system's input data is processed by the invalid-data module and transferred to the buffer-unit group, which buffers it and provides a valid data source to the adder; a source-number rotation mechanism in the buffer-unit group ensures that enough valid data sources are supplied to the adder, guaranteeing an adequate operand supply for the computing unit. After the adder, the buffered data passes through the multi-channel partial-sum register group and, for binarized or ternarized sparse CNNs, is split into a positive-weight partial sum and a negative-weight partial sum; the selectors route the two partial sums through separate channels back to the adder, reconfiguring the accumulation channel. This multi-channel computing-unit design reduces resource consumption.
The invalid-data elimination mechanism of the computing unit's invalid-data elimination module works as follows: the invalid-data module judges whether the input image data or weight data is zero; if the input data is zero, it is judged invalid and skips the computing unit's calculation. The system's input data then enters the buffer-unit group, which contains three sub-buffer units that buffer data for supply to the multiplexed accumulation channel; through the source-number rotation mechanism, the three sub-buffer units ensure that enough valid source data is provided to the adder. Without the elimination mechanism and the source-number rotation mechanism, data would be handled in the entirely conventional way: input image data and weights are fed to a multiplier and the products accumulated. Although zero-valued operands have no effect on the final result, such invalid data still undergoes multiplication and accumulation, wasting resources, time, and energy; the invalid-data module therefore shortens the system's computation time, reduces its operation count, and lowers its power consumption.
Fig. 3 shows the invalid-data elimination mechanism and the source-number rotation mechanism of the invalid-data elimination module in the computing unit. Two weight data and two input image data are fetched from the on-chip memory module at a time, guaranteeing to the greatest extent that, at the same moment, two source numbers are deposited into two sub-buffer units of the buffer-unit group. In prior-art CNN computation, image data is multiplied by weights and then accumulated in an adder — the conventional convolution operation. The present invention instead eliminates data before the convolution operation, so that invalid data is rejected and never participates in the computing module's calculation, shortening computation time and reducing computation power. To guarantee that enough data enters the multiplexed accumulation channel for the next computation step after invalid data has been eliminated, the buffer-unit group buffers the filtered data, and its three sub-buffer units supply data sources to the adder through the source-number rotation mechanism, guaranteeing that enough data sources participate in computation. As shown in Fig. 3, at moment 0 the input image data is invalid and is filtered out; at moment 2 no invalid data is present, and input image data 2 and 1 are transmitted into sub-buffer unit 0 and sub-buffer unit 1 respectively; at moment 3 no invalid data is present, and input image data 1 and 3 are again transmitted into sub-buffer unit 0 and sub-buffer unit 1; also at moment 3, input image data 5 is transmitted into sub-buffer unit 2, by which time the datum 1 in sub-buffer unit 0 has been brought into the multiplexed accumulation channel. The rotation mechanism keeps two sub-buffer units receiving data while one sub-buffer unit provides valid data to the multiplexed accumulation channel, ensuring that enough valid source data is provided to the adder.
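The rotation idea can be sketched in software. The exact fill/drain schedule here (strict round-robin, fill then drain) is an illustrative assumption; the hardware overlaps receiving and feeding, as the Fig. 3 timeline shows.

```python
from collections import deque

def rotate_feed(stream):
    """Sketch of the source-number rotation mechanism: three
    sub-buffers take turns, so that while one feeds the multiplexed
    accumulation channel the others keep receiving filtered
    (nonzero) data. Returns the sequence handed to the adder.
    """
    bufs = [deque(), deque(), deque()]
    fill = 0
    for x in stream:
        if x == 0:               # invalid data never enters a buffer
            continue
        bufs[fill].append(x)     # receive into the current sub-buffer
        fill = (fill + 1) % 3    # rotate the receiving role
    fed, drain = [], 0
    while any(bufs):
        if bufs[drain]:
            fed.append(bufs[drain].popleft())  # feed the adder
        drain = (drain + 1) % 3  # rotate the feeding role
    return fed

print(rotate_feed([0, 2, 1, 0, 3, 5]))  # zeros filtered, rest rotated
```

Because the buffers are filled and drained in the same rotating order, valid data reaches the adder without starvation even though zeros have been removed from the stream.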
As shown in Fig. 2, for the characteristics of binary- or ternary-weight networks, the computing unit of the invention performs multiply-accumulate operations through a reconfigurable multiplexed-accumulation-channel mechanism. The multi-channel partial-sum register group stores the partial-sum accumulation result corresponding to each weight value, and the accumulation source provided to the adder is selected according to the value of the weight; the selectors reconfigure the adder's accumulation path according to the value of the weight. The weights of a binarized CNN take two values, -w1 or w2; the weights of a ternarized CNN take three values, -w1, 0, or w2. Since the weight-equals-zero case has already been eliminated by the invalid-data elimination module, the multiplexed accumulation channel only needs to handle two cases: a positive-weight accumulation channel and a negative-weight accumulation channel. Positive-weight data and negative-weight data are computed separately on different channels, and convolution is completed through the accumulation-channel mechanism by continually reconfiguring the computation channel, keeping the computation simple and reducing resource consumption.
Taking the positive-weight accumulation channel as an example: when the computing unit's input weight at a given moment is positive and the input image data is not 0, the input image data passes through the invalid-data elimination module and enters the buffer-unit group to be buffered; at this moment the value in the positive-weight partial-sum register of the multi-channel partial-sum register group is selected and delivered to the adder, where it is added to the input image data, and the result is stored back into the positive-weight partial-sum register, completing one computation pass of the positive-weight accumulation channel. When the weight is negative, the computing unit reconfigures the negative-weight accumulation channel by controlling the selectors.
Suppose the positive weight of the computing module's convolution kernel is +w1, with participating partial-sum result Y1, and the negative weight is -w2, with participating partial-sum result Y2. Y1 and Y2 are stored in the corresponding registers of different channels in the register group and transferred to the aggregation processing unit for the next computation step; the obtained result is Y = w1*Y1 - w2*Y2 + bias, where bias is the bias term. The normalization layer following the convolution computation performs, after simplification, a linear calculation Y = k*x + b. Merging the two formulas gives Y = k*w1*Y1 - k*w2*Y2 + bias + b. This calculation is completed by the aggregation processing unit, where k*w1, k*w2, and (bias + b) can be computed offline.
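The folding can be checked with a small sketch. It follows the merged formula given in the text, which places the convolution bias outside the normalization scale k (i.e. the bias is added after the k*x + b step) — that ordering is taken from the patent's formula, not assumed independently.

```python
def staged(Y1, Y2, w1, w2, k, b, bias):
    """Three stages as separate steps: channel combination,
    normalization (linear form k*x + b), then the bias term."""
    s = w1 * Y1 - w2 * Y2    # combine the two channel partial sums
    s = k * s + b            # normalization layer, simplified linear form
    return s + bias          # bias term, per the merged formula

def fused(Y1, Y2, w1, w2, k, b, bias):
    """Folded form: k*w1, k*w2 and (bias + b) are constants that can
    be computed offline, leaving two constant multiplies and adds —
    exactly what each sub-aggregation unit provides."""
    kw1, kw2, c = k * w1, k * w2, bias + b   # precomputed offline
    return kw1 * Y1 - kw2 * Y2 + c

# Integer example: Y1=7, Y2=1, w1=1, w2=2, k=2, b=-1, bias=3
# staged: (7 - 2) -> 2*5 - 1 = 9 -> 9 + 3 = 12
# fused:  2*7 - 4*1 + (3 - 1) = 12
assert staged(7, 1, 1, 2, 2, -1, 3) == fused(7, 1, 1, 2, 2, -1, 3) == 12
```

The agreement of the two paths is what lets the two constant multipliers of each sub-aggregation processing unit absorb the normalization layer at no extra online cost.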
During operation of the hardware system of the invention, the external memory of the off-chip equipment transmits data through the bus interface to the on-chip memory module, and the off-chip CPU sends instructions through the bus interface to the on-chip control module, which controls the operation of the other on-chip modules; the control module decodes the received instructions and forwards them to the on-chip configuration module, memory module, and computing module. The on-chip configuration module receives the configuration instructions sent by the control module and reconfigures the system's data path according to the configuration information. The on-chip computing module performs the CNN computation, reading data from the on-chip storage unit according to the received execution and configuration instructions.
The on-chip computing module computes with the computing units of the three-dimensional convolution array. Input data, including input image data and weight data, first passes through the computing unit's invalid-data elimination module: data judged to be zero is eliminated and skips the computing unit, while the remaining data is transferred to the buffer-unit group. By skipping the computation of invalid data, the elimination-module design reduces the amount of data computed, lowers the power consumed by multiplication and accumulation, and shortens computation time. The buffer-unit group has three sub-buffer units, which store the data emerging from the invalid-data elimination module and rotate their roles, supplying the buffered data to the adder; the source-number rotation mechanism ensures that the adder always has enough valid source data and does not starve merely because the elimination mechanism reduces the input stream. According to the CNN's sparsity characteristic, the network is binarized, with weights taking the two values -w1 or w2, or ternarized, with weights taking the three values -w1, 0, or w2; since the weight-equals-zero case is already eliminated by the invalid-data elimination module, the data after the adder's computation is split into a positive-weight partial sum and a negative-weight partial sum. The selectors deposit the data into the registers of different channels according to the value of the weight, and also reconfigure the adder's accumulation path according to the value of the weight, so that the reconfigurable multiplexed-accumulation-channel mechanism carries out the convolution operation while reducing resource loss. The data from the three-dimensional convolution array is transferred to the computing module's aggregation processing unit, whose sub-aggregation processing units use their multipliers and adders to obtain the final accumulation result, and, exploiting the linearity of the normalization layer, fold the normalization computation into the partial-sum addition. The hardware system of the invention reduces the amount of data computed, lowers the system's resource consumption, shortens computation time, recognizes quickly, and has high throughput; it can efficiently complete operations such as license-plate recognition and face recognition and has wide applicability.
The invention and its embodiments have been described above schematically, and the description is not limiting; the invention may be realized in other specific forms without departing from its spirit or essential characteristics. What is shown in the drawings is only one embodiment of the invention, and the actual structure is not limited to it; any reference signs in the claims shall not limit the claims concerned. Therefore, if those of ordinary skill in the art, inspired by this disclosure, design structures similar to this technical solution and embodiments without inventive effort and without departing from the purpose of this creation, they shall all fall within the protection scope of this patent. In addition, the word "comprising" does not exclude other elements or steps, and the word "a" before an element does not exclude a plurality of such elements. Multiple elements recited in a product claim may also be realized by a single element in software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.
Claims (10)
1. A computing unit, characterized by comprising an invalid-data module, a buffer-unit group, an adder, a multi-channel partial-sum register group, and multiple selectors; input data is processed by the invalid-data module and then transferred to the buffer-unit group, which buffers the data and provides a valid data source to the adder; after the adder, the data passes through the multi-channel partial-sum register group, is split into a positive-weight partial sum and a negative-weight partial sum, and is routed back to the adder through the selectors.
2. The computing unit according to claim 1, wherein the invalid-data module judges whether the input data is zero; input data equal to zero is judged to be invalid data and skips the computing unit.
3. The computing unit according to claim 2, wherein the input data of the invalid-data module is the input image data or the weight data of the system.
4. The computing unit according to claim 1, wherein the buffer unit group comprises a plurality of sub buffer units for buffering the data that has passed through the invalid-data removal module; after the sub buffer units, the input data passes through a data-rotation mechanism that ensures a sufficient supply of valid data to the adder.
5. The computing unit according to claim 1, wherein the computing unit performs multiply-accumulate operations using a reconfigurable multiplexed accumulation-channel mechanism; the multiplexer section and the register group store the partial sums and accumulation results corresponding to each weight, and the accumulation source data supplied to the adder is selected according to the value of the weight so as to reconfigure the accumulation channel; the gating selectors reconfigure the accumulation path of the adder according to the value of the weight.
6. An array, comprising the computing unit according to any one of claims 1 to 5, wherein the computing units are distributed as an array; the array has Z slices, each slice consists of Y rows, and each row has X computing units; the array is used to perform multiply-accumulate calculations.
7. The array according to claim 6, wherein every row of computing units in the array receives the same input image data, and every column of computing units in the array receives the same weight data.
8. A module, comprising the array according to claim 7, characterized by further comprising an aggregation processing unit, which handles the addition of the partial sums of the array's convolution calculation, the normalization-layer processing, and the activation-function processing.
9. The module according to claim 8, wherein the aggregation processing unit comprises a plurality of sub aggregation processing units, each comprising two constant multipliers and one adder; each sub aggregation processing unit processes the data output by one slice of computing units of the array; the partial sums computed by the array are added by the aggregation processing unit to obtain the final multiply-accumulate result; and, exploiting the linear character of the normalization layer, the calculation of the normalization layer is merged into the partial sum and its increment.
10. A hardware system, comprising on-chip equipment and off-chip equipment, wherein the on-chip equipment comprises the module according to any one of claims 8 to 9 as a computing module; the on-chip equipment further comprises a control module, a configuration module, a storage module, and a bus interface; the off-chip equipment comprises a CPU and an external storage module; the CPU of the off-chip equipment is electrically connected to the control module of the on-chip equipment; the external memory of the off-chip equipment is electrically connected to the storage module of the on-chip equipment; the control module of the on-chip equipment is electrically connected to the configuration module, the storage module, and the computing module of the on-chip equipment; the configuration module of the on-chip equipment is electrically connected to the storage module and the computing module of the on-chip equipment; and the storage module of the on-chip equipment is electrically connected to the computing module.
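As an illustration of the zero-skipping and data-sharing scheme of claims 1 to 3 and 6 to 7, the following is a behavioural Python model: operand pairs containing a zero are skipped (saving the multiply and the accumulate), every unit in a row sees the same input image data, and every unit in a column sees the same weight data. The function names and list-based structure are assumptions for illustration; the actual hardware streams data through buffers, selectors, and the data-rotation mechanism, none of which this model captures.

```python
def multiply_accumulate(inputs, weights):
    """One computing unit with invalid-data removal: any operand pair
    in which either value is zero is treated as invalid and skipped,
    so no multiply or accumulate is spent on it."""
    acc = 0
    for x, w in zip(inputs, weights):
        if x == 0 or w == 0:   # invalid-data module: skip zeros
            continue
        acc += x * w
    return acc

def array_mac(image_rows, weight_cols):
    """Behavioural model of one slice of the array: each entry of the
    result is computed by the unit at (row, column), where the row
    shares its input image data and the column shares its weights."""
    return [[multiply_accumulate(row, col) for col in weight_cols]
            for row in image_rows]
```

On sparse (pruned or ReLU-activated) data, most operand pairs contain a zero, which is where the claimed reduction in computation time and multiply-accumulate power comes from.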
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201920827602.2U CN209708122U (en) | 2019-06-03 | 2019-06-03 | A kind of computing unit, array, module, hardware system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201920827602.2U CN209708122U (en) | 2019-06-03 | 2019-06-03 | A kind of computing unit, array, module, hardware system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN209708122U true CN209708122U (en) | 2019-11-29 |
Family
ID=68651308
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201920827602.2U Active CN209708122U (en) | 2019-06-03 | 2019-06-03 | A kind of computing unit, array, module, hardware system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN209708122U (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111160545A (en) * | 2019-12-31 | 2020-05-15 | 北京三快在线科技有限公司 | Artificial neural network processing system and data processing method thereof |
CN115459896A (en) * | 2022-11-11 | 2022-12-09 | 北京超摩科技有限公司 | Control method, control system, medium and chip for multi-channel data transmission |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110069444A (en) | A kind of computing unit, array, module, hardware system and implementation method | |
CN112465110B (en) | Hardware accelerator for convolution neural network calculation optimization | |
CN109328361B (en) | Accelerator for deep neural network | |
CN110070178A (en) | A kind of convolutional neural networks computing device and method | |
CN108108809B (en) | Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof | |
CN111414994B (en) | FPGA-based Yolov3 network computing acceleration system and acceleration method thereof | |
CN110378468A (en) | A kind of neural network accelerator quantified based on structuring beta pruning and low bit | |
CN111062472B (en) | Sparse neural network accelerator based on structured pruning and acceleration method thereof | |
CN108932548A (en) | A kind of degree of rarefication neural network acceleration system based on FPGA | |
CN209708122U (en) | A kind of computing unit, array, module, hardware system | |
CN111723947A (en) | Method and device for training federated learning model | |
CN112286864B (en) | Sparse data processing method and system for accelerating operation of reconfigurable processor | |
CN112836813B (en) | Reconfigurable pulse array system for mixed-precision neural network calculation | |
CN112257844B (en) | Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof | |
CN113361695A (en) | Convolutional neural network accelerator | |
CN103577161A (en) | Big data frequency parallel-processing method | |
Shu et al. | High energy efficiency FPGA-based accelerator for convolutional neural networks using weight combination | |
CN113313244B (en) | Near-storage neural network accelerator for addition network and acceleration method thereof | |
CN109993293A (en) | A kind of deep learning accelerator suitable for stack hourglass network | |
CN116431562B (en) | Multi-head attention mechanism fusion calculation distribution method based on acceleration processor | |
CN107831823B (en) | Gaussian elimination method for analyzing and optimizing power grid topological structure | |
CN116012657A (en) | Neural network-based 3D point cloud data processing method and accelerator | |
Kuang et al. | Entropy-based gradient compression for distributed deep learning | |
CN110245756A (en) | Method for handling the programming device of data group and handling data group | |
CN112905954A (en) | CNN model convolution operation accelerated calculation method using FPGA BRAM |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
GR01 | Patent grant | ||