CN106127302A - Circuit for processing data, image processing system, and method and apparatus for processing data - Google Patents
- Publication number
- CN106127302A CN106127302A CN201610480591.6A CN201610480591A CN106127302A CN 106127302 A CN106127302 A CN 106127302A CN 201610480591 A CN201610480591 A CN 201610480591A CN 106127302 A CN106127302 A CN 106127302A
- Authority
- CN
- China
- Prior art keywords
- processing unit
- unit
- data
- circuit
- shift register
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
- G06F9/30047—Prefetch instructions; cache control instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
- G06F9/3893—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Neurology (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
An embodiment of the invention discloses a circuit for processing data, an image processing system, and a method and apparatus for processing data. The circuit includes a control unit and N processing units, where the control unit and each of the N processing units are connected to a data transmission unit. Each processing unit processes the same data output in sequence by the data transmission unit, and the output of the i-th of the N processing units is connected to the input of the (i+1)-th processing unit, where N is an integer greater than 1 and i ranges over the integers from 1 to N-1. The control unit is configured to place all N processing units in a closed state when it determines that the pending data output by the data transmission unit is 0. The circuit, system, method, and apparatus provided by embodiments of the invention can reduce circuit power consumption when performing convolution operations.
Description
Technical field
The present invention relates to convolutional neural networks (Convolutional Neural Network, CNN), and in particular to a circuit for processing data, an image processing system, and a method and apparatus for processing data in a CNN.
Background art
Neural networks and deep learning algorithms have already been applied with great success and are still developing rapidly; the industry broadly expects these new computational methods to enable ever more widespread and sophisticated intelligent applications. In recent years CNNs have achieved particularly prominent results in image recognition, so the industry has begun to focus on optimizing CNN algorithms and implementing them efficiently: companies such as Facebook, Qualcomm, Baidu, and Google have all invested in research on CNN algorithm optimization.
Conventionally, the convolution operation in a CNN is implemented with a pipelined scheme. The concrete steps are: N multipliers perform multiplications on N data items in parallel in one clock cycle, and a tree of adder units then accumulates the products. In this conventional convolution scheme, when the input data is a sparse vector, that is, one in which a large number of zeros are mixed with the valid data, all of the circuitry must nevertheless remain in a working state because the zeros and the valid data are interleaved. Closing the individual circuits separately carries a large control-circuit overhead and yields comparatively little benefit, so a more efficient circuit arrangement is needed to reduce power consumption.
Summary of the invention
In view of this, the invention provides a circuit for processing data, an image processing system, and a method and apparatus for processing data that can reduce circuit power consumption when performing convolution operations.
In a first aspect, a circuit for processing data is provided. The circuit includes a control unit and N processing units. The control unit and each of the N processing units are connected to a data transmission unit. Each processing unit processes the same data output in sequence by the data transmission unit, and the output of the i-th of the N processing units is connected to the input of the (i+1)-th processing unit, where N is an integer greater than 1 and i ranges over the integers from 1 to N-1. The control unit is configured to place all N processing units in a closed state when it determines that the pending data output by the data transmission unit is 0.
Because the multiple processing units receive the same data at the same moment, all of them can be closed simultaneously whenever the input data is 0, thereby reducing power consumption.
With reference to the first aspect, in a first possible implementation of the first aspect, the circuit further includes N shift registers. The input of the i-th shift register of the N shift registers is connected to the output of the i-th processing unit, and the output of the i-th shift register is connected to the input of the (i+1)-th processing unit. The control unit is further configured to: when it determines that the pending data is 0, cause the value held in the i-th shift register to be output to the (i+1)-th processing unit; and when it determines that the pending data is not 0, cause the N processing units to process the pending data and cause the value output by the i-th processing unit to be stored in the i-th shift register.
When the input data is 0, only the shift registers need to retain their last values. In other words, when the input data is 0 all processing units are closed and every register simply forwards its stored value to the next-stage processing unit; under sparse-vector input the circuit is therefore essentially in a closed state, with only the shift registers working, and its power consumption is extremely low.
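The zero-skipping cascade described above can be sketched as a cycle-accurate behavioral model (illustrative Python, not the patented RTL; the names `weights`, `regs`, and `stream` are assumptions). Each clock, stage i would compute w[i]*x plus the value forwarded from shift register i-1; when the input sample is 0, the stages are closed and the registers simply shift their partial sums forward:

```python
def fir_cascade(weights, stream):
    """Behavioral model of the N-stage multiply-accumulate cascade.

    Each clock, stage i computes weights[i] * x + (value from register
    i-1) and stores the result in register i; the last register emits
    the running convolution result. When x == 0 the processing units
    are 'closed' and the registers only shift, which is equivalent."""
    n = len(weights)
    regs = [0] * n                    # shift registers between stages
    outputs = []
    for x in stream:
        if x == 0:
            # Zero-skip: multipliers/adders idle, registers shift only.
            regs = [0] + regs[:-1]
        else:
            prev = [0] + regs[:-1]    # value forwarded from stage i-1
            regs = [weights[i] * x + prev[i] for i in range(n)]
        outputs.append(regs[-1])      # last stage emits the result
    return outputs
```

Because a zero sample contributes weights[i]*0 = 0 to every stage, the pure register shift is mathematically equivalent to the full multiply-accumulate step, which is why the processing units can be powered down without losing results.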
With reference to some implementations of the first aspect, in a second possible implementation of the first aspect, the circuit further includes a storage unit for storing weights. The (i+1)-th processing unit is specifically configured to multiply the pending data by the weight, output by the storage unit, that corresponds to the (i+1)-th processing unit, and to add the product to the value output by the i-th processing unit.
By storing the weights inside the circuit, the same hardware can perform convolution when many data items are input and can also perform fully-connected operations by rotating the weights, so the hardware resources are shared and the utilization of chip resources is improved.
With reference to some implementations of the first aspect, in a third possible implementation of the first aspect, the N processing units include N multipliers and N accumulators, where the output of the i-th multiplier of the N multipliers is connected to the input of the i-th accumulator of the N accumulators, and the output of the i-th accumulator is connected to the input of the i-th shift register. The i-th multiplier multiplies the pending data by the weight, output by the storage unit, that corresponds to the i-th multiplier; the (i+1)-th accumulator adds the value output by the (i+1)-th multiplier to the value output by the i-th shift register.
With reference to some implementations of the first aspect, in a fourth possible implementation of the first aspect, the pending data and/or the weights stored by the storage unit are powers of 2, and the N processing units include N shift-processing units and N accumulators, where the output of the i-th shift-processing unit is connected to the input of the i-th accumulator and the output of the i-th accumulator is connected to the input of the i-th shift register. The i-th shift-processing unit performs a shift operation on the weight corresponding to it according to the pending data, or performs a shift operation on the pending data according to that weight, so as to complete the multiplication; the (i+1)-th accumulator adds the value output by the (i+1)-th shift-processing unit to the value output by the i-th shift register.
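A minimal sketch of the shift-based multiplication, assuming the weight is stored as its base-2 exponent (a representation the source does not specify):

```python
def shift_multiply(value, weight_exp):
    """Multiply `value` by a power-of-2 weight w = 2**weight_exp using
    only a left shift, so no hardware multiplier is needed.
    Assumes integer data and a non-negative exponent."""
    return value << weight_exp
```

For example, with a weight of 8 = 2**3, the product 5 * 8 = 40 is obtained as 5 << 3, replacing a multiplier with a much cheaper shifter.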
With reference to some implementations of the first aspect, in a fifth possible implementation of the first aspect, the control unit is further configured to close the storage unit when it determines that the pending data is 0.
Closing the storage unit as well when the input data is 0 reduces power consumption further.
With reference to some implementations of the first aspect, in a sixth possible implementation of the first aspect, the control unit includes a switch unit and a routing unit. The switch unit controls the opening and closing of the N processing units; the routing unit selects the working mode of the N shift registers, the working modes including a shift-accumulate mode and a shift-register mode.
With reference to some implementations of the first aspect, in a seventh possible implementation of the first aspect, the circuit further includes a decompression unit for compressing and decompressing the weights stored by the storage unit.
In many application scenarios the weights are sparsely distributed, so compressed storage improves the efficiency of both storage and processing. During use, the received data can likewise first be stored in compressed form and then output in sequence from the storage unit before being processed by the processing units, which reduces the rate of change of the exported data and hence the power consumption of the circuit.
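One simple way to realize such compressed storage of sparse weights is run-length encoding of the zeros; the patent does not fix a particular compression format, so this is only an illustrative scheme:

```python
def compress(weights):
    """Run-length encode the zeros of a sparse weight vector: each
    nonzero weight is stored with the count of zeros preceding it."""
    pairs, zeros = [], 0
    for w in weights:
        if w == 0:
            zeros += 1
        else:
            pairs.append((zeros, w))
            zeros = 0
    return pairs, zeros  # trailing zero count keeps the length recoverable

def decompress(pairs, trailing):
    """Expand the (zero-run, weight) pairs back to the dense vector."""
    out = []
    for zeros, w in pairs:
        out.extend([0] * zeros)
        out.append(w)
    out.extend([0] * trailing)
    return out
```

The sparser the weight vector, the fewer (zero-run, weight) pairs must be stored and read out, which is where the storage and processing savings come from.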
In a second aspect, an image processing system is provided. The system includes M circuits according to any implementation of the first aspect, an image input unit, a nonlinear mapping unit, and an output unit. The output of the k-th of the M circuits is connected to the input of the (k+1)-th circuit, where M is a positive integer and k ranges over the integers from 1 to M-1. The image input unit applies delay processing to the data of the different image rows and outputs the data in sequence; the nonlinear mapping unit applies a nonlinear operation to the result output by the N-th processing unit of the M-th circuit; and the output unit outputs the result produced by the nonlinear mapping unit.
With reference to the second aspect, in a second possible implementation of the second aspect, the system further includes at least one buffer unit, each buffer unit corresponding to a plurality of the circuits and storing the value output by the N-th processing unit of the corresponding circuits.
Buffering the intermediate results makes flexible use of the hardware resources possible. The convolution and fully-connected operations of a CNN contain a very high degree of parallelism; by expanding the data in parallel through the buffers, the subsequent circuits can process large volumes of data simultaneously, improving peak performance.
In a third aspect, a method for processing data is provided. The method uses a circuit according to any implementation of the first aspect to process input data, and includes: the control unit judging whether the pending data output by the data transmission unit is 0; and, when it determines that the pending data is 0, placing all N processing units in a closed state.
With reference to the third aspect, in a first possible implementation of the third aspect, the method further includes: when the control unit determines that the pending data is 0, causing the value of the i-th shift register to be output to the (i+1)-th processing unit; and when it determines that the pending data is not 0, causing the N processing units to process the pending data and causing the value of the i-th processing unit to be stored in the i-th shift register.
In a fourth aspect, an apparatus for processing data is provided. The apparatus includes a processor, a transceiver, a memory, N multipliers, N accumulators, and a bus system, where the memory, the processor, the transceiver, the N multipliers, and the N accumulators are connected by the bus system. The memory stores instructions, and the processor executes the instructions stored in the memory to control the transceiver to receive or send signals; when the processor executes those instructions, it implements the control unit of the first aspect or of any possible implementation of the first aspect.
In a fifth aspect, a computer storage medium is provided for storing the computer software instructions used by the above method; it contains the program designed for performing the above aspects.
These and other aspects of the invention will be more clearly understood from the following description.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the embodiments are briefly described below. Clearly, the drawings described below show only some embodiments of the invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 shows a circuit logic diagram of a conventional pipelined scheme for implementing a convolution operation.
Fig. 2 shows a schematic block diagram of a circuit for processing data provided by an embodiment of the present invention.
Fig. 3 shows another schematic block diagram of a circuit for processing data provided by an embodiment of the present invention.
Fig. 4 shows a schematic block diagram of an image processing system provided by an embodiment of the present invention.
Fig. 5 shows a schematic block diagram of a method for processing data provided by an embodiment of the present invention.
Fig. 6 shows a schematic block diagram of an apparatus for processing data provided by an embodiment of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Clearly, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
A convolutional neural network is a kind of artificial neural network and has become a research focus in present-day speech analysis and image recognition. Its weight-sharing network structure makes it more similar to a biological neural network, reduces the complexity of the network model, and reduces the number of weights. This advantage is even more apparent when the input of the network is a multidimensional image: the image can be used directly as the input of the network, avoiding the complicated feature extraction and data reconstruction of traditional recognition algorithms. A convolutional network is a multilayer perceptron specially designed for recognizing two-dimensional shapes, and this network structure is highly invariant to translation, scaling, tilting, and other forms of deformation.
The convolution operation in the prior art is implemented with the conventional pipelined scheme shown in Fig. 1. The concrete steps are: using a pipeline, N multipliers perform multiplications on N data items in parallel in one clock cycle, and the products are then accumulated by adders; a tree of adder units and several pipeline stages are needed before the final accumulated sum of all terms is obtained. For example, as shown in Fig. 1, when the input data are D00 to D08, nine multipliers perform the multiplications on D00 to D08 simultaneously, outputting W0*D00, W1*D01, ..., W8*D08 respectively; after passing through the tree of adder units, the output is R0 = W0*D00 + W1*D01 + ... + W8*D08. Similarly, when the input data are D10 to D18, the output is R1 = W0*D10 + W1*D11 + ... + W8*D18.
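The conventional scheme above can be modeled as one parallel multiply stage followed by a log-depth adder tree (illustrative Python; each `while` iteration corresponds to one pipeline level of the tree):

```python
def tree_dot(weights, data):
    """Conventional pipelined dot product: all N products are formed
    in one 'clock', then a tree of adders reduces them pairwise."""
    partial = [w * d for w, d in zip(weights, data)]
    while len(partial) > 1:  # one adder-tree level per pipeline stage
        partial = [sum(partial[i:i + 2]) for i in range(0, len(partial), 2)]
    return partial[0]
```

Note that in this model every multiplier and adder is busy on every cycle even when many of the inputs are 0, which is exactly the power problem the embodiments address.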
As mentioned above, the point of the convolution operation is the sharing of the weights; in the existing conventional convolution circuit, the weight corresponding to each multiplier is therefore fixed by configuration. On the basis of this conventional circuit structure, when the simultaneously input data mixes zeros with valid data, the zeros are scattered, which makes power-reducing control difficult to implement, or makes the additional cost of implementing such control high.
It should be understood that the technical solutions of the embodiments of the present invention can be applied in various signal-processing fields, such as speech recognition, seismic exploration, ultrasonic diagnosis, optical imaging, system identification, and image recognition.
Fig. 2 shows a schematic structural diagram of a circuit 100 for processing data provided by an embodiment of the present invention. As shown in Fig. 2, the circuit includes a control unit 120 and N processing units 110. The control unit 120 and each processing unit 110 of the N processing units are connected to a data transmission unit, and each processing unit 110 processes the same data output in sequence by the data transmission unit. The output of the i-th processing unit 110 of the N processing units is connected to the input of the (i+1)-th processing unit 110, where N is an integer greater than 1 and i ranges over the integers from 1 to N-1. The control unit 120 is configured to place all N processing units in a closed state when it determines that the pending data output by the data transmission unit is 0.
In a CNN the input image data is typically a sparse vector, that is, a large number of zeros are mixed with the valid data. In the conventional convolution circuit, all multiply-accumulate modules must remain working regardless of whether the input data is 0; if the different computation modules were closed separately to reduce power consumption, the control-circuit overhead would be large and the benefit relatively small. For this reason, keeping the control-circuit overhead in check, the circuit provided by the embodiment of the present invention has the N processing units receive the same data synchronously and closes all of them whenever the input data is 0, so as to reduce circuit power consumption while implementing the convolution operation.
Therefore, the circuit for processing data provided by the embodiment of the present invention can reduce circuit power consumption when implementing a convolution operation.
A person skilled in the art will appreciate that the work a convolution operation must complete is y = w1*x1 + w2*x2 + ... + wn*xn, i.e. the dot product of two vectors. There are generally two ways to perform this computation. The first is to complete the multiplications with n parallel units in a pipelined fashion, which can usually deliver the results of n multiplications in parallel in one clock cycle; accumulating all these results then requires a tree of adder units and several pipeline stages before the final accumulated sum of all terms is obtained. The second is to complete the multiplications and the accumulation step by step on a single computation unit; after n clock cycles the multiply-accumulate sum is complete and is finally output. This second way uses one computation unit and needs n clock beats in total.
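The second way reduces to a single multiply-accumulate unit reused over n beats (illustrative sketch):

```python
def sequential_dot(weights, data):
    """One MAC unit reused over n clock beats: each beat forms one
    product and adds it to the running accumulator."""
    acc = 0
    for w, x in zip(weights, data):
        acc += w * x  # one multiply-accumulate per beat
    return acc
```

It produces the same dot product as the parallel adder-tree scheme, but with one unit over n beats instead of n units in one beat.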
Regarding the first way, we have already noted its shortcoming: the power consumption of such a circuit is relatively large, which is unfavorable for improving the efficiency of the image recognition network. Regarding the second way, since the point of the convolution operation is the sharing of the weights, the n computation units can be cascaded and made to accumulate. With this adjustment, each multiplication is still completed as required, but the n multiplications corresponding to one output are completed on different time beats. The adjustment does not complicate the computation; instead, the accumulation is completed step by step and the result is output on the last clock. For example, the first group of data shown in Fig. 1 can be given a corresponding delay so that each clock outputs one data item.
Specifically, it is known that in a basic convolution operation one group of weights corresponds to different data items; in the original data organization the data items are therefore essentially different and random, which makes it difficult to design a normalized processing unit oriented toward sparse data. After the adjustment, the different data items have been moved onto different time positions, which makes it possible, under several different computation patterns, to place identical data on the same time position, while the corresponding weights are kept in the storage unit and can be adjusted arbitrarily as needed. This way of organizing the data allows the data-processing links corresponding to the data to be normalized and designed uniformly, which makes a low-power circuit design for the case where the data is 0 much easier.
It should be understood that, for the data of the different image rows involved in one convolution, row-by-row delay processing can guarantee that the data is output gradually in parallel and input to the N processing units; the same can also be achieved with different data pointers. Through this processing, identical data is multiplexed by all processing units simultaneously, which improves the data-reuse rate and simplifies the design of the power-reducing control circuit.
Optionally, the circuit further includes N shift registers. The input of the i-th shift register of the N shift registers is connected to the output of the i-th processing unit of the N processing units, and the output of the i-th shift register is connected to the input of the (i+1)-th processing unit. The control unit is further configured to, when it determines that the pending data is not 0, cause the N processing units to process the pending data and cause the value output by the i-th processing unit to be stored in the i-th shift register.
By the principle of a shift register, a shift register can not only store data but can also, under the action of a clock signal, move the data it holds left or right in sequence. The shift register in the embodiment of the present invention has two working modes: one is the shift-accumulate mode, in which the shift register outputs its stored value to the next-stage processing unit; the other is the shift-storage mode, in which it stores the value output by the corresponding processing unit.
When the input data is 0, only the shift registers need to retain their last values. In other words, when the input data is 0 all processing units are closed and every register forwards its stored value to the next-stage processing unit; under sparse-vector input the circuit is therefore essentially in a closed state, with only the shift registers working, and the energy consumption is extremely low.
Optionally, the circuit further includes a storage unit for storing weights. The (i+1)-th processing unit is specifically configured to multiply the pending data by the weight, output by the storage unit, that corresponds to the (i+1)-th processing unit, and to add the product obtained by this multiplication to the value output by the i-th processing unit.
Specifically, if the storage unit stores one group of weights, i.e. N weights, then when the pending data is not 0 the control unit causes the storage unit to output these N weights, each weight corresponding one-to-one to one processing unit. When many groups of data are input, for example M groups, i.e. M*N data items, each group of data corresponds to the same N weights, and every N data items can output one convolution result.
If the storage unit stores M groups of weights, i.e. M*N weights, then when the corresponding input is M*N data items, N data items are taken as one group and each group of data corresponds to its own group of weights: every N data items rotate to the next group of weights, and every N data items output one result, which is the output of a fully-connected operation.
Accordingly, because each processing unit in the circuit provided by the embodiment of the present invention receives all of the input data, and the weights stored in the storage unit can be adjusted so that each group of input data corresponds to different weights, a fully-connected operation can also be realized. This avoids the problem of using different circuits to implement the convolution operation and the fully-connected operation separately, allows the fully-connected network and the convolutional network to share hardware, and thus improves the utilization of chip resources.
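The sharing of one datapath between the convolution and fully-connected cases can be sketched as follows (illustrative Python; the `rotate` flag and all names are assumptions, selecting between reusing one weight group and rotating through M groups):

```python
def shared_mac(weight_groups, data, rotate):
    """Consume `data` in chunks of N. With rotate=False one weight
    group is reused for every chunk (convolution-style sharing);
    with rotate=True each chunk of N inputs uses the next weight
    group in turn (fully-connected-style), on the same hardware."""
    n = len(weight_groups[0])
    results = []
    for g in range(len(data) // n):
        ws = weight_groups[g % len(weight_groups)] if rotate else weight_groups[0]
        chunk = data[g * n:(g + 1) * n]
        results.append(sum(w * x for w, x in zip(ws, chunk)))
    return results
```

The same N multiply-accumulate units serve both modes; only the weight-readout pattern of the storage unit changes, which is the source of the claimed hardware sharing.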
Fig. 3 shows a circuit 200 for processing data provided by an embodiment of the present invention. As shown in Fig. 3, the circuit 200 includes: N multipliers 210, N accumulators 220, N shift registers 230, a storage unit 240 and a control unit 250, where N is, for example, 9. One processing unit includes one multiplier and one accumulator; the output terminal of each multiplier is connected to the input terminal of the corresponding accumulator, the output terminal of the accumulator is connected to the input terminal of the corresponding shift register, and the output terminal of the shift register is connected to the input terminal of the next-stage accumulator. The multiplier multiplies the weight output by the storage unit and corresponding to that multiplier with the data to be processed; the accumulator adds the value output by the multiplier to the value output by the previous-stage shift register.
It should be understood that the number of multipliers and accumulators included in a processing unit in the embodiment of the present invention is not limited. For example, the first-stage processing unit may include only a multiplier and no adder; alternatively, each processing unit may include two multipliers, one input of one of the multipliers being fixed to 1; other combinations are also possible, and the present invention is not limited thereto.
When the circuit performs a convolution operation or a fully-connected operation, the input data serve as the first input value of each multiplier 210, and the N weights output by the storage unit 240 serve as the second input values of the N multipliers 210, each of the N weights corresponding one-to-one to a multiplier. When the control unit 250 determines that the input data are not 0, each multiplier 210 multiplies the first input value by the corresponding second input value; likewise, the corresponding second input value is output by the storage unit 240 under the control of the control unit 250 only when the input data are not 0. When the input data are not 0, each accumulator 220 adds the value output by the previous-stage shift register 230 to the value output by the corresponding multiplier 210, where one input of the first accumulator may be set to 0. Meanwhile, when the input data are not 0, each shift register 230 stores the value output by the corresponding accumulator 220. When an input datum is 0, the control unit 250 controls all multipliers 210, all accumulators 220, the storage unit 240 and so on to be in the closed state, and controls each shift register 230 to output its stored value to the next-stage accumulator 220.
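The behavior of this broadcast multiply-accumulate chain, including the zero-skipping, can be sketched in software as follows (an illustrative model, not the patent's circuit; names are ours). Each cycle one datum is broadcast to all N multipliers and the partial sums shift down the chain; after N non-zero cycles the last register holds the dot product of the last N inputs with the weights. When the datum is 0, forwarding the registers unchanged yields the same result as multiplying by 0, which is why the gating is functionally transparent.

```python
def mac_chain_step(x, weights, regs):
    """One cycle of the N-stage chain: x is broadcast to all multipliers,
    regs[i] (shift register i) feeds accumulator i+1, and the first
    accumulator's second input is fixed to 0."""
    if x == 0:
        # Multipliers, accumulators and storage closed; each shift register
        # simply forwards its stored value (shift-load mode).
        return [0] + regs[:-1]
    # Shift-accumulate mode: accumulator i adds x*w_i to the previous stage.
    return [x * weights[i] + (regs[i - 1] if i else 0)
            for i in range(len(weights))]

def sliding_dot(stream, weights):
    """Feed a data stream through the chain; after each cycle the last
    register holds the sliding dot product of the inputs with the weights."""
    regs = [0] * len(weights)
    outputs = []
    for x in stream:
        regs = mac_chain_step(x, weights, regs)
        outputs.append(regs[-1])
    return outputs
```

For example, streaming [1, 2, 3] against weights [10, 100, 1000] leaves 1*10 + 2*100 + 3*1000 in the last register after three cycles.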
Optionally, the data to be processed and/or the weights stored in the storage unit are powers of 2, and the N processing units include N shift processing units and N accumulators, where the output terminal of the i-th shift processing unit among the N shift processing units is connected to the input terminal of the i-th accumulator among the N accumulators, and the output terminal of the i-th accumulator is connected to the input terminal of the i-th shift register. The i-th shift processing unit is configured to shift the weight corresponding to the i-th shift processing unit according to the data to be processed, so as to complete the multiplication; or the i-th shift processing unit is configured to shift the data to be processed according to the weight corresponding to the i-th shift processing unit, so as to complete the multiplication. The (i+1)-th accumulator adds the value output by the (i+1)-th shift processing unit to the value output by the i-th shift register.
Compared with weights and data in floating-point representation or high-precision fixed-point representation, the weights and data can be quantized; quantization reduces the storage requirement and at the same time greatly reduces the amount of computation. In particular, the weights or data can be quantized to powers of 2, which reduces the subsequent multiplication to a shift-and-add computation. Specifically, the corresponding weight can be shifted according to the input datum to complete the multiplication; alternatively, each shift processing unit can shift the input datum according to the corresponding weight to realize the multiplication. The weights or data can also be non-linearly quantized to reduce the bit rate, of which power-of-2 quantization is the simplest and most effective case. A limiting case of power-of-2 quantization is {-1, 0, 1} or {-1, 1}; in this limiting case, even if the data are high-precision fixed-point numbers, the subsequent processing only requires addition.
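A minimal software sketch of the shift-for-multiply reduction described above (illustrative only, assuming integer data and a weight of the form ±2^k):

```python
def shift_multiply(x, weight):
    """Multiply integer x by a weight of the form +/-2^k using only a shift.

    abs(weight) & (abs(weight) - 1) == 0 tests that |weight| is a power of 2;
    bit_length() - 1 recovers the exponent k, so the multiply becomes x << k.
    """
    assert weight != 0 and abs(weight) & (abs(weight) - 1) == 0, \
        "weight must be a signed power of 2"
    shifted = x << (abs(weight).bit_length() - 1)  # x * 2^k
    return -shifted if weight < 0 else shifted
```

In hardware the shift amount would come directly from the quantized weight code, so no multiplier array is needed at all.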
Many-bit feature images can also be decomposed by bit plane into different feature planes; this directly realizes the simplest binary quantization of the input data, so that a network designed for binary computation can process multi-bit data.
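The bit-plane decomposition can be sketched as follows (illustrative only; a flat list of pixel values stands in for the feature image). Each plane is binary, and the original values are recovered by summing the planes weighted by powers of 2.

```python
def bit_planes(image, bits):
    """Decompose a multi-bit feature image (list of ints) into `bits` binary
    planes, least-significant plane first."""
    return [[(pix >> b) & 1 for pix in image] for b in range(bits)]
```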
It should be understood that in the embodiments of the present invention, the processing unit may be any unit that performs a multiply-accumulate operation on the input data; as long as a unit can realize the multiply-accumulate operation, it can implement the scheme of the embodiments. The embodiments merely take a multiplier or a shift processing unit realizing the multiplication as examples, and the invention is not limited thereto.
Optionally, the control unit is further configured to: when determining that the data to be processed are 0, control the storage unit to be in the closed state.
When the input data are 0, closing the storage unit can further reduce power consumption.
It should be understood that the control unit 250 may include a simple switch unit or a routing unit, and may also include a clock generating unit, a clock control unit and the like. For example, the clock generating unit generates a clock signal whose period should be greater than the time for one processing unit to process one datum, one datum being output per clock cycle. The clock control unit can control the clock generating unit according to the input data: when a datum is 0, the clock generating unit is closed; when a datum is not 0, the clock generating unit is opened. The routing unit selects the working mode of the shift registers according to the input data: when a datum is 0, the shift registers work only in the shift-load mode; when a datum is not 0, the shift registers work in the shift-accumulate mode. The switch unit may also control the opening or closing of all processing units; for example, when a datum is 0, all processing units are closed, and when the input datum is not 0, all processing units are opened.
Optionally, as an embodiment of the present invention, the storage unit is a dynamic random access memory (DRAM). The storage unit may also be used to store data: the data involved in one convolution can be stored in the storage unit, and the circuit may further include a delay unit which, in use, outputs the data stored in the storage unit one by one as the inputs of the N processing units.
Specifically, in CNN convolution and fully-connected computation, both the data and the weights are reused, and both are stored and used in a regular order. In actual use the data therefore reside in random access memory (RAM); since the data and weights may be logically very large, a large-capacity DRAM can be used, thereby realizing large networks and high-performance computation.
Optionally, as an embodiment of the present invention, the circuit further includes a decompression unit configured to compress and decompress the data and/or the weights stored in the storage unit.
According to the sparsity of the data, the data cache can also store data in compressed form, improving the utilization of the data cache; in use, the data sequence is decompressed when output from the data cache. In many application scenarios the weights are also sparsely distributed, so compressed storage can improve the efficiency of storage and processing. Using the storage unit to store the data and the weights allows them to be reused, and the reuse of data or weights can further reduce power consumption.
In the embodiments of the present invention, the foregoing circuit can be applied in a cloud computing scenario, where it may be realized by a certain unit in a cloud device, for example by a processor unit in the cloud device. The foregoing circuit can also be applied in a terminal device, where it may be realized by a component connected to, or close to, an image sensor in the terminal device, for example by a processing chip of the terminal device. The terminal device here includes smart devices with an image recognition function, such as tablet computers, mobile phones, e-readers, remote controllers, personal computers, notebook computers, vehicle-mounted devices, network televisions and wearable devices.
An embodiment of the present invention further provides an image processing system. As shown in Fig. 4, the system includes: M of the above circuits, an image input unit, a nonlinear mapping unit and an output unit, where the output terminal of the k-th circuit among the M circuits is connected to the input terminal of the (k+1)-th circuit, M is a positive integer, and k ranges over the integers from 1 to M-1. The image input unit delays the data of different image rows and outputs the data in sequence; the nonlinear mapping unit performs a nonlinear operation on the result output by the N-th processing unit of the M-th circuit; the output unit outputs the result output by the nonlinear mapping unit.
Each circuit includes a cascade input and a cascade output and can be cascaded in multiple stages. Computations of different scales are embodied as accumulations of different lengths; by using the cascade inputs and outputs, the accumulations themselves can be cascaded, so that a fixed array can handle accumulation operations of different lengths.
It should be understood that in the image input unit, a large image plane can be converted directly into multiple small image planes. For example, an image of 256*256 can be converted into 4 images of 64*64, where the 4 pixels at the same position of the small images correspond to the data of one 2*2 window in the large image. This mapping converts the convolution operation of a large image directly into the convolution operations of multiple small images, narrows the range of per-layer parameter variation of the CNN convolution, and benefits the efficiency of the hardware implementation.
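A minimal sketch of this window-to-plane mapping (illustrative only; it assumes a 2x2 window, under which a W x H image yields four planes of size (W/2) x (H/2), the 4 pixels at the same position of the planes coming from one 2x2 window of the original image):

```python
def split_planes(image):
    """Split an image (list of rows) into 4 sub-planes, one per position
    (dr, dc) inside a 2x2 window of the original image."""
    return [[[image[2 * r + dr][2 * c + dc]
              for c in range(len(image[0]) // 2)]
             for r in range(len(image) // 2)]
            for dr in (0, 1) for dc in (0, 1)]
```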
It should be understood that a nonlinear mapping unit can be arranged after the last processing unit of each circuit, or only after the last processing unit of the whole system. The nonlinear mapping unit can store efficiently-realized mappings used in neural computation, such as the Sigmoid mapping, the ReLU mapping, or other mappings, for example logarithmic, exponential or histogram-based image enhancement mappings.
Optionally, as an embodiment of the present invention, the system further includes at least one buffer unit, each buffer unit among the at least one buffer unit corresponding to multiple circuits and storing the value output by the N-th processing unit of the corresponding circuit.
The amount of convolution computation involving multiple features is very large and often exceeds the capacity of one physical array; the intermediate results must be buffered and accumulated over multiple passes before the computation is complete. This is supported as follows: in each pass, the accumulation input is taken from a buffer, and the computation result is likewise written to a buffer. Multiple buffers may exist; as required, a pass can read from a designated accumulation buffer and write to a different designated accumulation buffer.
This buffer-to-buffer processing mode makes flexible use of the hardware resources possible; a unified buffer structure ensures that the same hardware structure supports both CNN convolution and fully-connected computation; buffer writes and reads can proceed in different manners, adapting to the needs of the data processing rules and reducing buffer consumption. Moreover, a buffer unit can correspond to multiple circuits, which increases the parallelism of the data input and makes large-scale, high-performance parallelism possible.
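The multi-pass accumulation through buffers can be sketched as follows (a hypothetical software model; the buffer names and function signature are ours): each pass reads partial sums from one designated buffer, adds the array's output for that pass, and writes the updated sums to a designated (possibly different) buffer.

```python
def multi_pass_accumulate(partial_results, buffers, reads, writes):
    """partial_results[p] is the array output of pass p; reads[p] and
    writes[p] name the accumulation buffers used by that pass."""
    for p, result in enumerate(partial_results):
        acc = buffers[reads[p]]                     # accumulation input
        buffers[writes[p]] = [a + r for a, r in zip(acc, result)]
    return buffers
```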
The system shown in Fig. 4 can be obtained by extension on the basis of the circuit 100 shown in Fig. 2 or the circuit 200 shown in Fig. 3. The operations and functions realized by the modules in circuit 100 or circuit 200 are identical with those of the above technical scheme and, for brevity, are not repeated here.
The method 300 for processing data provided by an embodiment of the present invention is described below with reference to Fig. 5. The method processes input data using the circuit 100 or the circuit 200 described above, and may for example be performed by the control unit. As shown in Fig. 5, the method 300 includes:
S310, judging whether the data to be processed output by the data transmission unit are 0;
S320, when determining that the data to be processed are 0, controlling the N processing units to all be in the closed state.
Therefore, the method for the process data that the embodiment of the present invention provides, it is possible to reduce circuit when realizing convolution algorithm
Power consumption.
Optionally, as an embodiment of the invention, the method further includes: when determining that the data to be processed are 0, the control unit controls the value of the i-th shift register to be output to the (i+1)-th processing unit; when determining that the data to be processed are not 0, the control unit controls the N processing units to process the data to be processed, and controls the value of the i-th processing unit to be output to the i-th shift register.
Fig. 6 shows a device 500 according to an embodiment of the present invention. The device includes: a processor 520, a transceiver 530, a memory 510, N multipliers 540, N accumulators 550 and a bus system 560. The memory, the processor, the transceiver, the N multipliers and the N accumulators are connected by the bus system; the memory stores instructions, and the processor executes the instructions stored in the memory so as to control the transceiver to output the same data in sequence. When the processor executes the instructions stored in the memory, the processor is configured to: when determining that the data to be processed received by the transceiver are 0, control the N multipliers and the N accumulators to all be in the closed state.
Therefore, the device for processing data provided by the embodiment of the present invention can reduce the power consumption of the circuit when realizing a convolution operation.
It should be understood that in the embodiments of the present invention the processor 520 may be a CPU, and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), an FPGA or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor.
The memory 510 may include read-only memory and random access memory, and provides instructions and data to the processor 520. A part of the memory 510 may also include non-volatile random access memory; for example, the memory 510 may also store information about the device type.
In addition to a data bus, the bus system 560 may also include a power bus, a control bus, a status signal bus and the like; for clarity of illustration, the various buses are all designated as the bus system 560 in the figure.
In the course of implementation, each step of the above method 300 can be completed by an integrated logic circuit of hardware in the processor 520 or by instructions in the form of software. The steps of the method disclosed in connection with the embodiments of the present invention can be embodied directly as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor. The software module may be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, or a register. The storage medium is located in the memory 510; the processor 520 reads the information in the memory 510 and completes the steps of the above method 300 in combination with its hardware. To avoid repetition, details are not given here.
It should be understood that in the embodiments of the present invention, the device 500 according to the embodiment of the present invention can be used to realize the circuit 200 shown in Fig. 3; for brevity, this is not repeated here.
It should also be understood that in the embodiments of the present invention, "B corresponding to A" means that B is associated with A and that B can be determined according to A. However, determining B according to A does not mean determining B only according to A; B can also be determined according to A and/or other information.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical scheme. Skilled persons may use different methods to realize the described functions for each specific application, but such realization should not be considered to be beyond the scope of the present invention.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above can refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods can be realized in other ways. For example, the device embodiments described above are merely schematic: the division into units is only a division by logical function, and other division manners are possible in actual realization; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed between parts may be indirect couplings or communication connections through interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place, or may be distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the scheme of the embodiments of the present invention.
In addition, the functional units in each embodiment of the present invention may be integrated into one processing unit, may exist separately and physically, or two or more units may be integrated into one unit. The integrated unit can be realized in the form of hardware, or in the form of a software functional unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical scheme of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical scheme, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, and so on) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a portable hard drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk or an optical disk.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any person familiar with the technical field can, within the technical scope disclosed by the invention, readily conceive of various equivalent modifications or replacements.
Claims (12)
1. A circuit for processing data, characterized in that the circuit includes a control unit and N processing units, the control unit being connected respectively to a data transmission unit and to each processing unit among the N processing units, each processing unit being configured to process the same data output in sequence by the data transmission unit, the output terminal of the i-th processing unit among the N processing units being connected to the input terminal of the (i+1)-th processing unit, where N is an integer greater than 1 and i ranges over the integers from 1 to N-1,
the control unit being configured to:
when determining that the data to be processed output by the data transmission unit are 0, control the N processing units to all be in the closed state.
2. The circuit according to claim 1, characterized in that the circuit further includes N shift registers, the input terminal of the i-th shift register among the N shift registers being connected to the output terminal of the i-th processing unit, and the output terminal of the i-th shift register being connected to the input terminal of the (i+1)-th processing unit,
the control unit being further configured to:
when determining that the data to be processed are 0, control the value of the i-th shift register to be output to the (i+1)-th processing unit;
when determining that the data to be processed are not 0, control the N processing units to process the data to be processed, and control the value of the i-th processing unit to be output to the i-th shift register.
3. The circuit according to claim 2, characterized in that the circuit further includes a storage unit configured to store weights, the (i+1)-th processing unit being specifically configured to:
multiply the weight output by the storage unit and corresponding to the (i+1)-th processing unit with the data to be processed, and add the value obtained by the multiplication to the value output by the i-th processing unit.
4. The circuit according to claim 3, characterized in that the N processing units include N multipliers and N accumulators, wherein the output terminal of the i-th multiplier among the N multipliers is connected to the input terminal of the i-th accumulator among the N accumulators, and the output terminal of the i-th accumulator is connected to the input terminal of the i-th shift register,
the i-th multiplier being configured to multiply the weight output by the storage unit and corresponding to the i-th multiplier with the data to be processed;
the (i+1)-th accumulator being configured to add the value output by the (i+1)-th multiplier to the value output by the i-th shift register.
5. The circuit according to claim 3, characterized in that the data to be processed and/or the weights stored in the storage unit are powers of 2, and the N processing units include N shift processing units and N accumulators, wherein the output terminal of the i-th shift processing unit among the N shift processing units is connected to the input terminal of the i-th accumulator among the N accumulators, and the output terminal of the i-th accumulator is connected to the input terminal of the i-th shift register,
the i-th shift processing unit being configured to shift the weight corresponding to the i-th shift processing unit according to the data to be processed, so as to complete the multiplication, or
the i-th shift processing unit being configured to shift the data to be processed according to the weight corresponding to the i-th shift processing unit, so as to complete the multiplication;
the (i+1)-th accumulator being configured to add the value output by the (i+1)-th shift processing unit to the value output by the i-th shift register.
6. The circuit according to any one of claims 3 to 5, characterized in that the control unit is further configured to:
when determining that the data to be processed are 0, control the storage unit to be in the closed state.
7. The circuit according to any one of claims 2 to 6, characterized in that the control unit includes a switch unit and a routing unit, the switch unit being configured to control the opening and closing of the N processing units, and the routing unit being configured to select the working mode of the N shift registers, the working modes including a shift-accumulate mode and a shift-load mode.
8. The circuit according to any one of claims 3 to 7, characterized in that the circuit further includes a compression processing unit configured to compress and decompress the weights stored in the storage unit.
9. An image processing system, characterized by including:
M circuits according to any one of claims 1 to 8, an image input unit, a nonlinear mapping unit and an output unit, the output terminal of the k-th circuit among the M circuits being connected to the input terminal of the (k+1)-th circuit, where M is a positive integer and k ranges over the integers from 1 to M-1,
the image input unit being configured to delay the data of different image rows and output the data in sequence;
the nonlinear mapping unit being configured to perform a nonlinear operation on the result output by the N-th processing unit in the M-th circuit;
the output unit being configured to output the result output by the nonlinear mapping unit.
10. The system according to claim 9, characterized in that the system further includes:
at least one buffer unit, each buffer unit among the at least one buffer unit corresponding to multiple circuits and being configured to store the value output by the N-th processing unit of the corresponding circuit.
11. A method for processing data, characterized in that the method processes input data using the circuit according to any one of claims 1 to 8, the method including:
judging, by the control unit, whether the data to be processed output by the data transmission unit are 0;
when determining that the data to be processed are 0, controlling, by the control unit, the N processing units to all be in the closed state.
12. The method according to claim 11, characterized in that the method further includes:
when determining that the data to be processed are 0, controlling, by the control unit, the value of the i-th shift register to be output to the (i+1)-th processing unit;
when determining that the data to be processed are not 0, controlling, by the control unit, the N processing units to process the data to be processed, and controlling the value of the i-th processing unit to be output to the i-th shift register.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610480591.6A CN106127302A (en) | 2016-06-23 | 2016-06-23 | Process the circuit of data, image processing system, the method and apparatus of process data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106127302A true CN106127302A (en) | 2016-11-16 |
Family
ID=57265725
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610480591.6A Pending CN106127302A (en) | 2016-06-23 | 2016-06-23 | Process the circuit of data, image processing system, the method and apparatus of process data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106127302A (en) |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106951961A (en) * | 2017-02-24 | 2017-07-14 | 清华大学 | The convolutional neural networks accelerator and system of a kind of coarseness restructural |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110029471A1 (en) * | 2009-07-30 | 2011-02-03 | NEC Laboratories America, Inc. | Dynamically configurable, multi-ported co-processor for convolutional neural networks |
CN103716039A (en) * | 2013-12-04 | 2014-04-09 | Zhejiang University City College | Enhanced dynamic full adder design based on floating-gate MOS transistors |
CN203645649U (en) * | 2013-12-20 | 2014-06-11 | Zhejiang University City College | Ternary dynamic BiCMOS OR gate design based on neuron MOS transistors |
CN105260773A (en) * | 2015-09-18 | 2016-01-20 | Huawei Technologies Co., Ltd. | Image processing device and image processing method |
2016-06-23: Application CN201610480591.6A filed in China (CN); published as CN106127302A, status Pending
Non-Patent Citations (2)
Title |
---|
C. FARABET: "CNP: An FPGA-based processor for Convolutional Networks", Field Programmable Logic and Applications * |
R. G. SHOUP: "Parameterized Convolution Filtering in a Field Programmable Gate Array", Selected Papers from the Oxford 1993 International Workshop on Field Programmable Logic and Applications on More FPGAs * |
Cited By (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110036368B (en) * | 2016-12-06 | 2023-02-28 | Arm Ltd. | Apparatus and method for performing arithmetic operations to accumulate floating-point numbers |
CN110036368A (en) * | 2016-12-06 | 2019-07-19 | Arm Ltd. | Apparatus and method for performing arithmetic operations to accumulate floating-point numbers |
CN106951961A (en) * | 2017-02-24 | 2017-07-14 | Tsinghua University | Coarse-grained reconfigurable convolutional neural network accelerator and system |
CN106951961B (en) * | 2017-02-24 | 2019-11-26 | Tsinghua University | Coarse-grained reconfigurable convolutional neural network accelerator and system |
CN108629405A (en) * | 2017-03-22 | 2018-10-09 | Hangzhou Hikvision Digital Technology Co., Ltd. | Method and device for improving convolutional neural network computation efficiency |
CN108629405B (en) * | 2017-03-22 | 2020-09-18 | Hangzhou Hikvision Digital Technology Co., Ltd. | Method and device for improving convolutional neural network computation efficiency |
CN108564169B (en) * | 2017-04-11 | 2020-07-14 | Shanghai Zhaoxin Semiconductor Co., Ltd. | Hardware processing unit, neural network unit, and computer-usable medium |
CN108564169A (en) * | 2017-04-11 | 2018-09-21 | Shanghai Zhaoxin Semiconductor Co., Ltd. | Hardware processing unit, neural network unit, and computer-usable medium |
CN110692038B (en) * | 2017-05-24 | 2023-04-04 | Microsoft Technology Licensing, LLC | Multi-function vector processor circuit |
CN110692038A (en) * | 2017-05-24 | 2020-01-14 | Microsoft Technology Licensing, LLC | Multi-function vector processor circuit |
CN109032704A (en) * | 2017-06-12 | 2018-12-18 | Shenzhen ZTE Microelectronics Technology Co., Ltd. | Data processing method and device |
CN107491416A (en) * | 2017-08-31 | 2017-12-19 | PLA Information Engineering University | Reconfigurable computation structure, and computation scheduling method and device, for convolutions of arbitrary dimensions |
CN108229668B (en) * | 2017-09-29 | 2020-07-07 | Beijing SenseTime Technology Development Co., Ltd. | Deep-learning-based operation implementation method and device, and electronic device |
CN108229668A (en) * | 2017-09-29 | 2018-06-29 | Beijing SenseTime Technology Development Co., Ltd. | Deep-learning-based operation implementation method and device, and electronic device |
CN107818367A (en) * | 2017-10-30 | 2018-03-20 | Institute of Computing Technology, Chinese Academy of Sciences | Processing system and processing method for neural networks |
CN107832841A (en) * | 2017-11-14 | 2018-03-23 | Fuzhou Rockchip Electronics Co., Ltd. | Power consumption optimization method and circuit for a neural network chip |
CN108304925B (en) * | 2018-01-08 | 2020-11-03 | Institute of Computing Technology, Chinese Academy of Sciences | Pooling computing device and method |
CN108304925A (en) * | 2018-01-08 | 2018-07-20 | Institute of Computing Technology, Chinese Academy of Sciences | Pooling computing device and method |
CN108345934A (en) * | 2018-01-16 | 2018-07-31 | Institute of Computing Technology, Chinese Academy of Sciences | Activation device and method for a neural network processor |
CN108345934B (en) * | 2018-01-16 | 2020-11-03 | Institute of Computing Technology, Chinese Academy of Sciences | Activation device and method for a neural network processor |
CN108388333A (en) * | 2018-01-25 | 2018-08-10 | Fuzhou Rockchip Electronics Co., Ltd. | Method and device for adjusting power consumption by setting multiplier precision according to battery level |
CN110222833B (en) * | 2018-03-01 | 2023-12-19 | Huawei Technologies Co., Ltd. | Data processing circuit for a neural network |
CN110222833A (en) * | 2018-03-01 | 2019-09-10 | Huawei Technologies Co., Ltd. | Data processing circuit for a neural network |
WO2019165989A1 (en) * | 2018-03-01 | 2019-09-06 | Huawei Technologies Co., Ltd. | Data processing circuit for use in a neural network |
CN108416388A (en) * | 2018-03-13 | 2018-08-17 | Wuhan Jiule Technology Co., Ltd. | State correction method and device, and wearable device |
CN108416388B (en) * | 2018-03-13 | 2022-03-11 | Wuhan Jiule Technology Co., Ltd. | State correction method and device, and wearable device |
CN108647184B (en) * | 2018-05-10 | 2022-04-12 | Hangzhou Xiongmai Integrated Circuit Technology Co., Ltd. | Method for implementing dynamic-bit convolution multiplication |
CN108647184A (en) * | 2018-05-10 | 2018-10-12 | Hangzhou Xiongmai Integrated Circuit Technology Co., Ltd. | Fast implementation method for high-precision dynamic-bit convolution multiplication |
CN109409514A (en) * | 2018-11-02 | 2019-03-01 | Guangzhou Baiguoyuan Information Technology Co., Ltd. | Fixed-point computation method, apparatus, device and storage medium for convolutional neural networks |
CN111209244A (en) * | 2018-11-21 | 2020-05-29 | Shanghai Cambricon Information Technology Co., Ltd. | Data processing device and related product |
CN111209244B (en) * | 2018-11-21 | 2022-05-06 | Shanghai Cambricon Information Technology Co., Ltd. | Data processing device and related product |
CN109635940A (en) * | 2019-01-28 | 2019-04-16 | DeepBlue AI Chips Research Institute (Jiangsu) Co., Ltd. | Image processing method and image processing apparatus based on convolutional neural networks |
WO2021073053A1 (en) * | 2019-10-15 | 2021-04-22 | Baidu Online Network Technology (Beijing) Co., Ltd. | Device and method for convolution operation |
US11556614B2 (en) | 2019-10-15 | 2023-01-17 | Apollo Intelligent Driving Technology (Beijing) Co., Ltd. | Apparatus and method for convolution operation |
CN111064912A (en) * | 2019-12-20 | 2020-04-24 | Jiangsu Xinsheng Intelligent Technology Co., Ltd. | Frame format conversion circuit and method |
CN111064912B (en) * | 2019-12-20 | 2022-03-22 | Jiangsu Xinsheng Intelligent Technology Co., Ltd. | Frame format conversion circuit and method |
CN111610846A (en) * | 2020-05-08 | 2020-09-01 | Shanghai Anlogic Information Technology Co., Ltd. | DSP block inside an FPGA and method for reducing its power consumption |
CN111814972A (en) * | 2020-07-08 | 2020-10-23 | Shanghai Xuehu Technology Co., Ltd. | FPGA-based neural network convolution operation acceleration method |
CN111814972B (en) * | 2020-07-08 | 2024-02-02 | Shanghai Xuehu Technology Co., Ltd. | FPGA-based neural network convolution operation acceleration method |
CN113222859A (en) * | 2021-05-27 | 2021-08-06 | Xidian University | Low-illumination image enhancement system and method based on a logarithmic image processing model |
CN113222859B (en) * | 2021-05-27 | 2023-04-21 | Xidian University | Low-illumination image enhancement system and method based on a logarithmic image processing model |
CN113821701A (en) * | 2021-10-14 | 2021-12-21 | Xiamen Semiconductor Industry Technology R&D Co., Ltd. | Method and device for improving circuit access efficiency |
CN113821701B (en) * | 2021-10-14 | 2023-09-26 | Xiamen Semiconductor Industry Technology R&D Co., Ltd. | Method and device for improving circuit access efficiency |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106127302A (en) | Circuit for processing data, image processing system, and method and apparatus for processing data | |
CN111242282B (en) | Deep learning model training acceleration method based on end edge cloud cooperation | |
CN107169563B (en) | Processing system and method applied to binary-weight convolutional networks | |
CN106203621B (en) | Processor for convolutional neural network computation | |
CN109543832B (en) | Computing device and board card | |
US20190087713A1 (en) | Compression of sparse deep convolutional network weights | |
US11847553B2 (en) | Parallel computational architecture with reconfigurable core-level and vector-level parallelism | |
CN109409510B (en) | Neuron circuit, chip, system and method thereof, and storage medium | |
WO2020074989A1 (en) | Data representation for dynamic precision in neural network cores | |
CN109934336A (en) | Design method for a neural network dynamic acceleration platform based on optimal structure search, and neural network dynamic acceleration platform | |
CN115880132B (en) | Graphics processor, matrix multiplication task processing method, device and storage medium | |
CN101576961B (en) | High-speed image matching method and device thereof | |
CN114239859B (en) | Power consumption data prediction method and device based on transfer learning and storage medium | |
CN113792621B (en) | FPGA-based target detection accelerator design method | |
Tu et al. | A power efficient neural network implementation on heterogeneous FPGA and GPU devices | |
CN110991608A (en) | Convolutional neural network quantized computation method and system | |
WO2021036362A1 (en) | Method and apparatus for processing data, and related product | |
CN112529477A (en) | Credit evaluation variable screening method, device, computer equipment and storage medium | |
JP2020077066A (en) | Learning device and method for learning | |
CN112308335A (en) | Short-term electricity price prediction method and device based on xgboost algorithm | |
CN111831354A (en) | Data precision configuration method, device, chip array, equipment and medium | |
Zhan et al. | Field programmable gate array‐based all‐layer accelerator with quantization neural networks for sustainable cyber‐physical systems | |
CN116822600A (en) | Neural network search chip based on RISC-V architecture | |
CN114254740B (en) | Convolution neural network accelerated calculation method, calculation system, chip and receiver | |
CN109582911B (en) | Computing device for performing convolution and computing method for performing convolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 2016-11-16 |