CN106127302A - Circuit for processing data, image processing system, and method and apparatus for processing data - Google Patents


Info

Publication number
CN106127302A
Authority
CN
China
Prior art keywords
processing unit
unit
data
circuit
shift register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610480591.6A
Other languages
Chinese (zh)
Inventor
费旭东
袁宏辉
徐斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Huawei Digital Technologies Co Ltd
Original Assignee
Hangzhou Huawei Digital Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Huawei Digital Technologies Co Ltd
Priority to CN201610480591.6A
Publication of CN106127302A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/06 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 — Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 — Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 — Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 — Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 — Sum of products
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 — Arrangements for program control, e.g. control units
    • G06F9/06 — Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 — Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 — Arrangements for executing specific machine instructions
    • G06F9/30007 — Arrangements for executing specific machine instructions to perform operations on data operands
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 — Arrangements for program control, e.g. control units
    • G06F9/06 — Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 — Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 — Arrangements for executing specific machine instructions
    • G06F9/3004 — Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047 — Prefetch instructions; cache control instructions
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 — Arrangements for program control, e.g. control units
    • G06F9/06 — Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 — Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 — Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885 — Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3893 — Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 — General purpose image data processing
    • G06T1/20 — Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Embodiments of the invention disclose a circuit for processing data, an image processing system, and a method and apparatus for processing data. The circuit includes a control unit and N processing units. The control unit and each of the N processing units are separately connected to a data transmission unit, and each processing unit processes the same data output in sequence by the data transmission unit. The output terminal of the i-th processing unit among the N processing units is connected to the input terminal of the (i+1)-th processing unit, where N is an integer greater than 1 and i takes integer values from 1 to N-1. The control unit is configured to switch all N processing units into the closed state when it determines that the to-be-processed data output by the data transmission unit is 0. The circuit, system, method, and apparatus provided by the embodiments of the invention can reduce circuit power consumption while performing convolution operations.

Description

Circuit for processing data, image processing system, and method and apparatus for processing data
Technical field
The present invention relates to convolutional neural networks (Convolutional Neural Network, CNN), and in particular to a circuit for processing data, an image processing system, and a method and apparatus for processing data in a CNN.
Background art
Neural networks and deep-learning algorithms have already been applied with great success and are developing rapidly, and the industry widely expects such new computing approaches to enable broader and more sophisticated intelligent applications. In recent years CNNs have achieved outstanding results in image recognition, so the industry has begun to focus on optimizing CNN algorithms and implementing them efficiently; companies such as Facebook, Qualcomm, Baidu, and Google have all invested in research on CNN algorithm optimization.
Usually, the convolution operation in a CNN is implemented with a conventional pipeline scheme. The concrete steps are: N multipliers perform multiplications on N data in parallel within one clock cycle, and a tree-structured adder unit then accumulates the multiplication results. In this conventional convolution operation, when the input data is a sparse vector, that is, a large number of 0s are mixed with the valid data, all of the circuitry must nevertheless remain in the working state. Closing the different sub-circuits individually incurs a large control-circuit overhead, so the benefit of such an implementation is relatively small. A more efficient circuit arrangement for power-reducing control is therefore urgently needed.
Summary of the invention
In view of this, the present invention provides a circuit for processing data, an image processing system, and a method and apparatus for processing data, capable of reducing circuit power consumption while performing convolution operations.
According to a first aspect, a circuit for processing data is provided. The circuit includes a control unit and N processing units. The control unit and each of the N processing units are separately connected to a data transmission unit, and each processing unit processes the same data output in sequence by the data transmission unit. The output terminal of the i-th processing unit among the N processing units is connected to the input terminal of the (i+1)-th processing unit, where N is an integer greater than 1 and i takes integer values from 1 to N-1. The control unit is configured to switch all N processing units into the closed state when it determines that the to-be-processed data output by the data transmission unit is 0.
Because the multiple processing units all receive the same datum at the same moment, when the input datum is 0 all of them can be closed simultaneously, thereby reducing power consumption.
In conjunction with the first aspect, in a first possible implementation of the first aspect, the circuit further includes N shift registers. The input terminal of the i-th shift register among the N shift registers is connected to the output terminal of the i-th processing unit, and the output terminal of the i-th shift register is connected to the input terminal of the (i+1)-th processing unit. The control unit is further configured to: when it determines that the to-be-processed data is 0, control the i-th shift register to output its stored value to the (i+1)-th processing unit; and when it determines that the to-be-processed data is not 0, control the N processing units to process the to-be-processed data and control the i-th processing unit to output its value to the i-th shift register.
When the input datum is 0, the circuit only needs to keep the shift registers running: all processing units are closed, and each register passes its stored value on to the next-stage processing unit. In other words, under sparse-vector input the circuit is essentially in the closed state, with only the shift registers working, so power consumption is extremely low.
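The zero-skip behaviour of this cascade can be sketched as a short behavioural model. The following is an illustrative Python simulation under the embodiment's assumptions (the datum is broadcast to all stages, units close on a 0 datum, and the shift registers forward their values); the function and variable names are illustrative, not the patent's, and the sketch models behaviour only, not the actual hardware.

```python
def cascade_convolve(weights, stream):
    """Behavioural model of the N-stage cascade with zero-skip.

    Each clock the datum is broadcast to all stages. When it is 0,
    the processing units are 'closed' and the shift registers merely
    pass their stored values to the next stage; otherwise stage i
    computes weights[i] * d and adds the previous stage's register
    value. Returns the value emitted by the last stage each clock.
    """
    n = len(weights)
    regs = [0] * n            # shift register i holds the output of stage i
    outputs = []
    for d in stream:
        if d == 0:                     # zero-skip: all units closed
            regs = [0] + regs[:-1]     # registers just shift one stage forward
        else:
            new = [0] * n
            for i in range(n):
                prev = regs[i - 1] if i > 0 else 0
                new[i] = weights[i] * d + prev
            regs = new
        outputs.append(regs[-1])       # last stage drives the output
    return outputs
```

Because a closed unit would have contributed w*0 = 0, merely shifting the register values leaves the final accumulation unchanged, which is why the shortcut is lossless: `cascade_convolve([1, 2, 3], [4, 0, 5])` ends with 1*4 + 3*5 = 19.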
In conjunction with some implementations of the first aspect, in a second possible implementation of the first aspect, the circuit further includes a storage unit for storing weights. The (i+1)-th processing unit is specifically configured to multiply the weight output by the storage unit and corresponding to the (i+1)-th processing unit by the to-be-processed data, and to add the result of that multiplication to the value output by the i-th processing unit.
By storing the weights inside the circuit, a convolution operation can be performed when there is much input data, and a fully-connected operation can also be performed by rotating the weights, so the hardware resources are shared and chip-resource utilization is improved.
In conjunction with some implementations of the first aspect, in a third possible implementation of the first aspect, the N processing units include N multipliers and N accumulators, where the output terminal of the i-th multiplier among the N multipliers is connected to the input terminal of the i-th accumulator among the N accumulators, and the output terminal of the i-th accumulator is connected to the input terminal of the i-th shift register. The i-th multiplier is configured to multiply the weight output by the storage unit and corresponding to the i-th multiplier by the to-be-processed data; the (i+1)-th accumulator is configured to add the value output by the (i+1)-th multiplier to the value output by the i-th shift register.
In conjunction with some implementations of the first aspect, in a fourth possible implementation of the first aspect, the to-be-processed data and/or the weights stored by the storage unit are powers of 2, and the N processing units include N shift-processing units and N accumulators, where the output terminal of the i-th shift-processing unit among the N shift-processing units is connected to the input terminal of the i-th accumulator among the N accumulators, and the output terminal of the i-th accumulator is connected to the input terminal of the i-th shift register. The i-th shift-processing unit is configured to perform a shift operation on the weight corresponding to it according to the to-be-processed data, or to perform a shift operation on the to-be-processed data according to that weight, so as to complete the multiplication; the (i+1)-th accumulator is configured to add the value output by the (i+1)-th shift-processing unit to the value output by the i-th shift register.
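When the weights (or data) are constrained to powers of 2, the multiplier in each stage reduces to a barrel shifter, since multiplying by 2^k is a left shift by k. A minimal sketch of this equivalence (illustrative Python, not the patented circuit; the function name is ours):

```python
def shift_multiply(value, weight):
    """Replace the multiplier with a shift when the weight is a
    power of two: value * 2**k == value << k."""
    k = weight.bit_length() - 1        # k such that 2**k == weight
    if weight != 1 << k:
        raise ValueError("weight must be a positive power of two")
    return value << k
```

For example, `shift_multiply(5, 8)` shifts 5 left by three positions and returns 40, the same result a multiplier would produce at a fraction of the hardware cost.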
In conjunction with some implementations of the first aspect, in a fifth possible implementation of the first aspect, the control unit is further configured to close the storage unit when it determines that the to-be-processed data is 0.
Closing the storage unit as well when the input datum is 0 reduces power consumption further.
In conjunction with some implementations of the first aspect, in a sixth possible implementation of the first aspect, the control unit includes a switch unit and a routing unit. The switch unit is configured to control the opening and closing of the N processing units; the routing unit is configured to select the working mode of the N shift registers, the working modes including a shift-accumulate mode and a shift-store mode.
In conjunction with some implementations of the first aspect, in a seventh possible implementation of the first aspect, the circuit further includes a decompression unit configured to compress and decompress the weights stored by the storage unit.
In many application scenarios the weights are sparsely distributed, so compressed storage improves both storage and processing efficiency.
In use, the received data may also first be stored in compressed form and then output in sequence from the storage unit before being processed by the processing units; this lowers the rate of change of the derived data and reduces the power consumption of the circuit.
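The compressed storage of sparse weights described above can be illustrated by a simple scheme that keeps only the non-zero entries as (position, value) pairs. This is one possible encoding for illustration, not necessarily the one used by the decompression unit; the function names are ours:

```python
def compress_weights(weights):
    """Keep only (position, value) pairs for the non-zero weights."""
    return [(i, w) for i, w in enumerate(weights) if w != 0]

def decompress_weights(pairs, length):
    """Rebuild the dense weight vector for the processing units."""
    dense = [0] * length
    for i, w in pairs:
        dense[i] = w
    return dense
```

With a mostly-zero weight vector such as `[0, 3, 0, 0, 5]`, only two pairs need storing, and decompression recovers the original vector exactly.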
According to a second aspect, an image processing system is provided. The system includes M circuits according to any implementation of the first aspect, an image input unit, a non-linear mapping unit, and an output unit. The output terminal of the k-th circuit among the M circuits is connected to the input terminal of the (k+1)-th circuit, where M is a positive integer and k takes integer values from 1 to M-1. The image input unit is configured to apply delay processing to the data of different image rows and output the data in sequence; the non-linear mapping unit is configured to perform a non-linear operation on the result output by the N-th processing unit of the M-th circuit; the output unit is configured to output the result of the non-linear mapping unit.
In conjunction with the second aspect, in a second possible implementation of the second aspect, the system further includes at least one buffer unit, each buffer unit among the at least one buffer unit corresponding to multiple circuits and configured to store the value output by the N-th processing unit of the corresponding circuits.
The buffered processing mode makes flexible use of the hardware resources possible. CNN convolution and fully-connected operations contain a very high degree of parallelism; by expanding the data in parallel through the buffers, the subsequent circuits can process large amounts of data simultaneously, improving peak performance.
According to a third aspect, a method for processing data is provided. The method processes input data using a circuit according to any implementation of the first aspect, and includes: the control unit judging whether the to-be-processed data output by the data transmission unit is 0; and, when the to-be-processed data is determined to be 0, the control unit controlling all N processing units into the closed state.
In conjunction with the third aspect, in a first possible implementation of the third aspect, the method further includes: when the control unit determines that the to-be-processed data is 0, controlling the i-th shift register to output its value to the (i+1)-th processing unit; and when the control unit determines that the to-be-processed data is not 0, controlling the N processing units to process the to-be-processed data and controlling the i-th processing unit to output its value to the i-th shift register.
According to a fourth aspect, an apparatus for processing data is provided. The apparatus includes a processor, a transceiver, a memory, N multipliers, N accumulators, and a bus system, where the memory, the processor, the transceiver, the N multipliers, and the N accumulators are connected through the bus system. The memory is configured to store instructions; the processor is configured to execute the instructions stored in the memory to control the transceiver to receive or send signals, and, when the processor executes the instructions stored in the memory, the execution causes the processor to implement the control unit in the first aspect or any possible implementation of the first aspect.
According to a fifth aspect, a computer storage medium is provided for storing the computer software instructions used by the above method, including a program designed to perform the above aspects.
These and other aspects of the invention will be more clearly understood from the following description.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required for the embodiments are briefly introduced below. Apparently, the drawings described below show merely some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from them without creative effort.
Fig. 1 shows the circuit logic diagram of a conventional pipeline scheme for implementing a convolution operation.
Fig. 2 shows a schematic block diagram of a circuit for processing data provided by an embodiment of the present invention.
Fig. 3 shows another schematic block diagram of a circuit for processing data provided by an embodiment of the present invention.
Fig. 4 shows a schematic block diagram of an image processing system provided by an embodiment of the present invention.
Fig. 5 shows a schematic block diagram of a method for processing data provided by an embodiment of the present invention.
Fig. 6 shows a schematic block diagram of an apparatus for processing data provided by an embodiment of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
A convolutional neural network is a kind of artificial neural network and has become a research focus in speech analysis and image recognition. Its weight-sharing network structure makes it more similar to a biological neural network, reducing the complexity of the network model and the number of weights. This advantage is more obvious when the network input is a multi-dimensional image: the image can be used directly as the network input, avoiding the complicated feature extraction and data reconstruction of traditional recognition algorithms. A convolutional network is a multilayer perceptron specially designed for recognizing two-dimensional shapes, and this network structure is highly invariant to translation, scaling, tilting, and other forms of deformation.
The convolution operation in the prior art is implemented with the conventional pipeline scheme shown in Fig. 1. The concrete steps are: in pipelined fashion, N multipliers perform multiplications on N data in parallel within one clock cycle, and the multiplication results are then accumulated by adders; a tree-structured adder unit with several pipeline stages is needed before the accumulated sum of all terms can be obtained. For example, as shown in Fig. 1, when the input data are D00 to D08, nine multipliers multiply D00 to D08 simultaneously, i.e. the multipliers output W0*D00, W1*D01, …, W8*D08, and the tree-structured adder unit outputs the result R0 = W0*D00 + W1*D01 + … + W8*D08. Similarly, when the input data are D10 to D18, the output is R1 = W0*D10 + W1*D11 + … + W8*D18.
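The conventional scheme can be sketched as follows, with the pairwise-reduction loop standing in for the levels of the adder tree. This is an illustrative Python model with our own function name, not circuitry from the patent:

```python
def tree_convolve(weights, data):
    """One clock of the conventional scheme: parallel multiplies
    followed by levels of pairwise adders (the tree)."""
    products = [w * d for w, d in zip(weights, data)]
    while len(products) > 1:              # each loop iteration = one adder level
        nxt = [products[i] + products[i + 1]
               for i in range(0, len(products) - 1, 2)]
        if len(products) % 2:             # odd element is carried to the next level
            nxt.append(products[-1])
        products = nxt
    return products[0]
```

Note that every multiplier and adder participates on every clock, even when many of the data are 0, which is the power-consumption drawback the embodiments address.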
As mentioned above, the significance of the convolution operation lies in weight sharing; therefore, in the existing conventional convolution circuit, the weight corresponding to each multiplier is a fixed configuration. On the basis of this conventional circuit structure, when the simultaneously input data mix 0s with valid data, the scattered distribution of the 0s makes power-reducing control difficult to realize, or makes the additional cost of realizing such control high.
It should be understood that the technical solutions of the embodiments of the present invention may be applied in various signal-processing fields, for example speech recognition, seismic prospecting, ultrasonic diagnosis, optical imaging, system identification, and image recognition.
Fig. 2 shows a schematic structural diagram of a circuit 100 for processing data provided by an embodiment of the present invention. As shown in Fig. 2, the circuit includes a control unit 120 and N processing units 110. The control unit 120 and each processing unit 110 among the N processing units are separately connected to a data transmission unit, and each processing unit 110 processes the same data output in sequence by the data transmission unit. The output terminal of the i-th processing unit 110 among the N processing units is connected to the input terminal of the (i+1)-th processing unit 110, where N is an integer greater than 1 and i takes integer values from 1 to N-1. The control unit 120 is configured to switch all N processing units into the closed state when it determines that the to-be-processed data output by the data transmission unit is 0.
In a CNN the input image data is usually a sparse vector, that is, a large number of 0s are mixed with the valid data; yet in the conventional convolution circuit all multiply-add modules remain in the working state regardless of whether the input datum is 0. If the different computing modules were closed individually to reduce circuit power consumption, the control-circuit overhead would be large and the benefit relatively small. Therefore, on the premise of controlling circuit overhead, the circuit provided by the embodiment of the present invention has the N processing units receive the same datum synchronously, and closes all processing units when the input datum is 0, so as to reduce circuit power consumption while performing convolution.
Therefore, the circuit for processing data provided by the embodiment of the present invention can reduce circuit power consumption while performing convolution operations.
Those skilled in the art will appreciate that the work to be completed by the convolution operation is y = w1*x1 + w2*x2 + … + wn*xn, i.e. the dot product of two vectors. There are generally two ways to perform this computation. The first is to complete the multiplications with n parallel units: in pipelined fashion, one clock cycle can usually provide the results of n multiplications in parallel, and then accumulating all these results requires a tree-structured adder unit with several pipeline stages before the accumulated sum of all terms is obtained. The second is to complete the multiplications and accumulation step by step on a single computing unit: after n clock cycles the multiply-accumulate is finished and the result is output; this way uses one computing unit and needs n beats in total.
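The second way, a single multiply-accumulate unit reused over n beats, can be sketched as follows (illustrative Python, function name ours):

```python
def serial_mac(weights, data):
    """Second scheme: one multiply-accumulate unit reused for n beats."""
    acc = 0
    for w, d in zip(weights, data):   # one beat per product
        acc += w * d
    return acc
```

Both schemes compute the same dot product; they differ only in how the n products are distributed over hardware units and clock beats.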
Regarding the drawback of the first way, it has already been noted that the power consumption of that circuit is relatively large, which is unfavorable to the efficiency of an image-recognition network. Regarding the second way, since the significance of convolution lies in weight sharing, the n computing units can be cascaded and made to accumulate. After such an adjustment, the corresponding multiplications are still completed as required, except that the n multiplications corresponding to one output are completed on different time beats. This adjustment does not complicate the computation; on the contrary, the accumulation can be completed step by step and the result output on the last clock. For example, the first group of data shown in Fig. 1 can be given corresponding delay processing so that each clock outputs one datum.
It is also known that in a basic convolution operation one group of weights corresponds to different data. In the original data organization the data are essentially different and random, which makes it difficult to produce a normalized design of processing units oriented to sparse data. After the adjustment, different data are placed at different time positions, which makes it possible, under several different computing patterns, to place identical data at the same time position, while the corresponding weights are kept in the storage unit and can be adjusted arbitrarily as required. This way of organizing data allows the data-processing links corresponding to the data to be normalized and designed uniformly, making a low-power circuit design for the case where the datum is 0 much easier.
It should be understood that, for the data of the different image rows involved in one convolution, the row-by-row delay processing can guarantee that the data are output one by one in parallel to the N processing units; this can also be realized by staggered data pointers. Through this processing, identical data are multiplexed by all processing units simultaneously, which improves the data-reuse rate and simplifies the design of the power-reducing control circuit.
Optionally, the circuit further includes N shift registers. The input terminal of the i-th shift register among the N shift registers is connected to the output terminal of the i-th processing unit among the N processing units, and the output terminal of the i-th shift register is connected to the input terminal of the (i+1)-th processing unit. The control unit is further configured to, when it determines that the to-be-processed data is not 0, control the N processing units to process the to-be-processed data and control the i-th processing unit to output its value to the i-th shift register.
According to the principle of a shift register, a shift register can not only store data but also shift the stored data left or right in turn under the action of the clock signal. The shift registers in the embodiments of the present invention have two working modes: one is the shift-accumulate mode, in which the shift register outputs its stored value to the next-stage processing unit; the other is the shift-store mode, in which it stores the value output by the corresponding processing unit.
When the input datum is 0, the circuit only needs to keep the shift registers running: all processing units are closed, and each register passes its stored value on to the next-stage processing unit. In other words, under sparse-vector input the circuit is essentially in the closed state, with only the shift registers working, so energy consumption is extremely low.
Optionally, the circuit further includes a storage unit for storing weights. The (i+1)-th processing unit is specifically configured to multiply the weight output by the storage unit and corresponding to the (i+1)-th processing unit by the to-be-processed data, and to add the result of that multiplication to the value output by the i-th processing unit.
Specifically, if one group of weights, i.e. N weights, is stored in the storage unit, then when the to-be-processed data is not 0 the control unit controls the storage unit to output the N weights, each weight among the N weights corresponding one-to-one to a processing unit. When there are many groups of input data, for example M groups, i.e. M*N data, every group of data corresponds to the same N weights, and every N data produce one convolution result.
If M groups of weights, i.e. M*N weights, are stored in the storage unit, then when the corresponding input is M*N data, every N data form one group, the same group of data corresponds to the same group of weights, every N data rotate to the next group of weights, and every N data produce one result; this result is the fully-connected operation output.
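The rotation of weight groups for the fully-connected case can be sketched as follows (illustrative Python; `weight_groups` holds the M groups of N weights, and the names are ours, not the patent's):

```python
def fully_connected(weight_groups, data):
    """Apply the g-th group of N weights to the g-th block of N data,
    yielding one output per block (M outputs for M*N data)."""
    n = len(weight_groups[0])
    results = []
    for g, start in enumerate(range(0, len(data), n)):
        block = data[start:start + n]          # the next N input data
        results.append(sum(w * d for w, d in zip(weight_groups[g], block)))
    return results
```

With a single repeated weight group the same loop reduces to convolution over successive blocks, which is the sense in which the one circuit serves both operations.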
Accordingly, because each processing unit in the circuit provided by the embodiments of the present invention can receive all of the input data, and the weights stored in the storage unit can be adjusted so that each group of input data corresponds to different weights, a fully-connected operation can be realized. This avoids the problem of using different circuits to implement the convolution operation and the fully-connected operation separately, allows the fully-connected network and the convolutional network to share hardware, and thereby improves chip resource utilization.
Fig. 3 shows a circuit 200 for processing data provided by an embodiment of the present invention. As shown in Fig. 3, the circuit 200 includes N multipliers 210, N accumulators 220, N shift registers 230, a storage unit 240 and a control unit 250; for example, N is 9. One processing unit includes one multiplier and one accumulator. The output terminal of each multiplier is connected to the input terminal of the corresponding accumulator, the output terminal of the accumulator is connected to the input terminal of the corresponding shift register, and the output terminal of the shift register is connected to the input terminal of the next-stage accumulator. The multiplier is configured to multiply the weight output by the storage unit and corresponding to the multiplier by the data to be processed; the accumulator is configured to add the value output by the multiplier to the value output by the previous-stage shift register.
It should be understood that the number of multipliers and accumulators included in a processing unit in the embodiments of the present invention is not limited. For example, the first-stage processing unit may include only a multiplier and no adder, or each processing unit may include two multipliers with one input of one multiplier fixed to 1; other combinations are also possible, and the present invention is not limited thereto.
When the circuit performs a convolution operation or a fully-connected operation, the input data serve as the first input value of each multiplier 210, and the N weights output by the storage unit 240 serve as the second input values of the N multipliers 210, each of the N weights corresponding one-to-one to a multiplier. When the control unit 250 determines that the input data is not 0, each multiplier 210 multiplies its first input value by the corresponding second input value; likewise, the storage unit 240 outputs the corresponding second input value only when the control unit 250 determines that the input data is not 0. When the input data is not 0, each accumulator 220 adds the value output by the previous-stage shift register 230 to the value output by the corresponding multiplier 210, where one input of the first accumulator may be set to 0. Meanwhile, when the input data is not 0, each shift register 230 stores the value output by the corresponding accumulator 220. When the input data is 0, the control unit 250 places all multipliers 210, all accumulators 220, the storage unit 240 and so on in the off state, and controls each shift register 230 to output its stored value to the next-stage accumulator 220.
Optionally, the data to be processed and/or the weights stored in the storage unit are powers of 2, and the N processing units include N shift-processing units and N accumulators, where the output terminal of the i-th shift-processing unit among the N shift-processing units is connected to the input terminal of the i-th accumulator among the N accumulators, and the output terminal of the i-th accumulator is connected to the input terminal of the i-th shift register. The i-th shift-processing unit is configured to shift the weight corresponding to the i-th shift-processing unit according to the data to be processed so as to complete the multiplication, or the i-th shift-processing unit is configured to shift the data to be processed according to the weight corresponding to the i-th shift-processing unit so as to complete the multiplication. The (i+1)-th accumulator is configured to add the value output by the (i+1)-th shift-processing unit to the value output by the i-th shift register.
Compared with weights and data in floating-point or high-precision fixed-point representation, the weights and data can be quantized; quantization reduces the storage requirement and greatly reduces the amount of computation. When the weights or data are quantized to powers of 2, the subsequent multiplication reduces to a shift-and-add computation. Specifically, the corresponding weight can be shifted according to the input data to complete the multiplication; alternatively, each shift-processing unit can shift the input data according to the corresponding weight to realize the multiplication. The weights or data can also reduce the code rate through nonlinear quantization, of which power-of-2 quantization is the simplest case. A limiting case of power-of-2 quantization is {-1, 0, 1}, or {-1, 1}; in this limiting case, even if the data are high-precision fixed-point numbers, the subsequent processing requires only addition.
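The shift-and-add idea can be sketched as follows. This is illustrative only: a real quantizer would typically minimize a network-level error metric rather than round each weight independently, and the `(sign, exponent)` encoding is an assumption of this example. A weight quantized to a signed power of 2 turns the multiplication into a single shift plus a sign.

```python
import math

def quantize_pow2(w):
    """Quantize a weight to the nearest signed power of 2 (0 stays 0).

    Returns (sign, exponent) so that w is approximately sign * 2**exponent.
    """
    if w == 0:
        return (0, 0)
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    return (sign, exp)

def shift_multiply(x, sign_exp):
    """Multiply integer x by a power-of-2 weight using shifts only."""
    sign, exp = sign_exp
    if sign == 0:
        return 0
    y = x << exp if exp >= 0 else x >> -exp  # arithmetic shift
    return y if sign > 0 else -y
```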
A multi-bit feature image can also be decomposed directly by bit plane into different feature planes. This directly realizes the simplest binary quantization of the input data, so that a network designed for binary computation can process multi-bit data.
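Bit-plane decomposition can be sketched as follows; the helper `bit_planes` is hypothetical. An image of B-bit pixels becomes B binary planes, plane k carrying weight 2**k, so a binary-oriented network can process each plane and the results can be recombined with shifts.

```python
def bit_planes(image, bits):
    """Split a multi-bit image (flat list of ints) into `bits` binary
    planes, least-significant plane first (plane k carries weight 2**k)."""
    return [[(p >> k) & 1 for p in image] for k in range(bits)]
```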
It should be understood that in the embodiments of the present invention, a processing unit may be any unit that performs a multiply-add operation on the input data; any unit capable of realizing a multiply-add operation can realize the solutions of the embodiments of the present invention. The embodiments of the present invention take only multipliers and shift-processing units as examples of realizing the multiplication, and the present invention is not limited thereto.
Optionally, the control unit is further configured to: when determining that the data to be processed is 0, control the storage unit to be in an off state.
When the input data is 0, turning off the storage unit further reduces power consumption.
It should be understood that the control unit 250 may include a simple switch unit or a routing unit, and may also include a clock generation unit, a clock control unit, and the like. For example, the clock generation unit is configured to generate a clock signal whose period should be greater than the time for one processing unit to process one piece of data, with one piece of data output per clock cycle. The clock control unit can control, according to the input data, the clock signal generated by the clock generation unit: when the data is 0, the clock generation unit is turned off; when the data is not 0, the clock generation unit is turned on. The routing unit is configured to select the operating mode of the shift registers according to the input data: when the data is 0, the shift registers operate only in the shift-accumulate mode, passing their stored values to the next stage; when the data is not 0, the shift registers operate in the shift-storage mode. The switch unit may also control the turning on or off of all processing units; for example, when the data is 0, all processing units are turned off, and when the input data is not 0, all processing units are turned on.
Optionally, as an embodiment of the present invention, the storage unit is a dynamic random access memory (DRAM). The storage unit may also be used to store data; the data involved in one convolution can be stored in it. The circuit may further include a delay unit: in use, the delay unit outputs the data stored in the storage unit one by one as the input of the N processing units.
Specifically, in CNN convolution and fully-connected computation, both data and weights are reused, and both are stored and used in regular order. The data actually in use at any moment reside in a random access memory (RAM), but the data and weights can be logically very large, so a large-capacity DRAM can be used, thereby realizing a large-capacity network and high-performance computation.
Optionally, as an embodiment of the present invention, the circuit further includes a decompression unit configured to compress and decompress the data and/or weights stored in the storage unit.
Exploiting the sparseness of the data, the data cache can also store data in compressed form, improving the utilization of the data cache; in use, the data sequence is decompressed as it is output from the data cache. In many application scenarios the weights are also sparsely distributed, so compressed storage can improve the efficiency of storage and processing. Using the storage unit to store data and weights allows the data and weights to be reused, and the multiplexing of data or weights can further reduce power consumption.
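One simple way to realize such compressed storage of a sparse stream is zero run-length coding. This is a sketch for illustration only: the patent does not fix a particular compression scheme, and the pair encoding below, including the trailing-zeros marker, is an assumption of this example.

```python
def rle_compress(seq):
    """Encode a sparse stream as (zero_run, value) pairs.

    A trailing run of zeros is marked with a pair whose value is 0.
    """
    out, run = [], 0
    for v in seq:
        if v == 0:
            run += 1
        else:
            out.append((run, v))
            run = 0
    if run:
        out.append((run, 0))  # trailing zeros
    return out

def rle_decompress(pairs):
    """Invert rle_compress, restoring the original stream."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        if v != 0:
            out.append(v)
    return out
```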
In the embodiments of the present invention, the foregoing circuit can be applied to a cloud computing scenario, in which the circuit may be realized by a unit in a cloud device, for example by a processor unit in the cloud device. The foregoing circuit can also be applied to a terminal device, in which case it may be realized by a component in the terminal device connected or close to the image sensor, for example by a processing chip of the terminal device. Terminal devices here include smart devices with an image recognition function, such as tablet computers, mobile phones, electronic readers, remote controls, personal computers, notebook computers, in-vehicle devices, Internet TVs and wearable devices.
An embodiment of the present invention further provides an image processing system. As shown in Fig. 4, the system includes M of the above circuits, an image input unit, a nonlinear mapping unit and an output unit, where the output terminal of the k-th circuit among the M circuits is connected to the input terminal of the (k+1)-th circuit, M is a positive integer, and k ranges over the integers from 1 to M-1. The image input unit is configured to delay the data of different image rows and output the data in sequence; the nonlinear mapping unit is configured to perform a nonlinear operation on the result output by the N-th processing unit of the M-th circuit; the output unit is configured to output the result output by the nonlinear mapping unit.
Each circuit includes a cascade input and a cascade output and can be cascaded in application. Computations of different scales manifest concretely as accumulations of different lengths; by using the cascade inputs and outputs, the accumulations themselves can be cascaded, so that a fixed array can handle accumulation operations of different lengths.
It should be understood that in the image input unit, one large image plane can be converted directly into multiple small image planes. For example, an image of 256*256 can be converted into 4 images of 128*128, where the pixels at the same position of the 4 small images correspond to the data of one 2*2 window in the large image. This mapping directly converts the convolution of a large image into the convolutions of multiple small images, reduces the range of variation of the per-layer parameters of the CNN convolution, and benefits the efficiency of the hardware implementation.
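The window-scatter mapping can be sketched as follows; the function `split_planes` and the row-major flat-list representation are assumptions of this illustration. The win*win pixels of each window are scattered to the same position of win*win sub-images, so a 2*2 window turns one plane into four quarter-size planes.

```python
def split_planes(image, size, win=2):
    """Rearrange a size x size image (row-major flat list) into win*win
    sub-images of (size//win) x (size//win): the pixels of each win x win
    window land at the same position of the win*win sub-images."""
    sub = size // win
    planes = [[0] * (sub * sub) for _ in range(win * win)]
    for r in range(size):
        for c in range(size):
            p = (r % win) * win + (c % win)              # which sub-image
            planes[p][(r // win) * sub + (c // win)] = image[r * size + c]
    return planes
```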
It should be understood that a nonlinear mapping unit can be placed after the last processing unit of each circuit, or only after the last processing unit of the whole system. The nonlinear mapping unit can store mappings, efficiently realized, used in neural computation: the Sigmoid mapping, the ReLU mapping, or other mappings, for example logarithmic, exponential, or histogram-based image enhancement mappings.
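A common way for such a mapping unit to store a nonlinearity efficiently is a lookup table. The sketch below is illustrative only; the table size, input range and rounding policy are assumptions of this example. It tabulates an arbitrary function over a fixed input range and clamps out-of-range inputs.

```python
def make_lut(fn, lo, hi, entries=256):
    """Tabulate a nonlinearity fn over [lo, hi] with `entries` samples."""
    step = (hi - lo) / (entries - 1)
    table = [fn(lo + i * step) for i in range(entries)]
    return table, lo, step

def apply_lut(lut, x):
    """Evaluate the tabulated nonlinearity, clamping out-of-range inputs."""
    table, lo, step = lut
    i = round((x - lo) / step)
    i = max(0, min(len(table) - 1, i))
    return table[i]
```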
Optionally, as an embodiment of the present invention, the system further includes at least one buffer unit, each buffer unit among the at least one buffer unit corresponding to multiple circuits, and each buffer unit being configured to store the value output by the N-th processing unit of the corresponding circuits.
The amount of convolutional computation involving multiple features is very large, often beyond the capability of one physical array, and can only be completed by caching intermediate results and accumulating over multiple passes. This computation is supported as follows: the accumulation input of each pass is taken from the cache, and the computation result is written back to the cache at the same time. Multiple caches may exist; as needed, a pass can read from a designated accumulation cache and write its output to a different designated accumulation cache.
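The read-accumulate-write-back scheme can be sketched as follows. This is illustrative only: `multi_pass_conv` models each pass's partial products as a precomputed list, abstracting away the array itself, and the function name is an assumption of this sketch.

```python
def multi_pass_conv(passes, cache=None):
    """Accumulate partial results over several passes through the array.

    Each pass reads the running sums from the cache, adds its own
    partial products, and writes the updated sums back.
    """
    for partial in passes:
        if cache is None:
            cache = list(partial)           # first pass seeds the cache
        else:
            cache = [a + b for a, b in zip(cache, partial)]
    return cache
```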
The cache-to-cache processing mode makes flexible use of the hardware resources possible; a unified cache structure ensures that the same hardware structure supports both CNN convolution and fully-connected computation; cache writes and reads can proceed in different manners, adapting to the regularities of the data processing and reducing cache consumption. Moreover, a buffer unit can correspond to multiple circuits, which can increase the parallelism of the data input and makes large-scale, high-performance parallelism possible.
The system shown in Fig. 4 can be obtained by extension on the basis of the circuit 100 shown in Fig. 2 or the circuit 200 shown in Fig. 3. The operations and functions realized by the modules in the circuit 100 or the circuit 200 are the same as in the foregoing technical solutions and, for brevity, are not repeated here.
The method 300 for processing data provided by an embodiment of the present invention is described below with reference to Fig. 5. The method processes the input data using the circuit 100 or the circuit 200 described above and may be executed, for example, by the control unit. As shown in Fig. 5, the method 300 includes:
S310: judging whether the data to be processed output by the data transmission unit is 0;
S320: when determining that the data to be processed is 0, controlling the N processing units to be all in an off state.
Therefore, the method for processing data provided by the embodiment of the present invention can reduce the power consumption of the circuit when realizing a convolution operation.
Optionally, as an embodiment of the present invention, the method further includes: the control unit, when determining that the data to be processed is 0, controlling the value of the i-th shift register to be output to the (i+1)-th processing unit; and the control unit, when determining that the data to be processed is not 0, controlling the N processing units to process the data to be processed and controlling the value of the i-th processing unit to be output to the i-th shift register.
Fig. 6 shows a device 500 according to an embodiment of the present invention. The device includes a processor 520, a transceiver 530, a memory 510, N multipliers 540, N accumulators 550 and a bus system 560. The memory, the processor, the transceiver, the N multipliers and the N accumulators are connected through the bus system; the memory is configured to store instructions, and the processor is configured to execute the instructions stored in the memory and to control the transceiver to output the same data in sequence. When the processor executes the instructions stored in the memory, the processor is configured to: when determining that the data to be processed received by the transceiver is 0, control the N multipliers and the N accumulators to be all in an off state.
Therefore, the device for processing data provided by the embodiment of the present invention can reduce circuit power consumption when realizing a convolution operation.
It should be understood that in the embodiments of the present invention, the processor 520 may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), an FPGA or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor.
The memory 510 may include a read-only memory and a random access memory, and provides instructions and data to the processor 520. A part of the memory 510 may also include a nonvolatile random access memory; for example, the memory 510 may also store device-type information.
The bus system 560 may include, in addition to a data bus, a power bus, a control bus, a status signal bus, and the like. For clarity of description, however, the various buses are all labeled as the bus system 560 in the figure.
In the implementation process, each step of the above method 300 may be completed by an integrated logic circuit of hardware in the processor 520 or by instructions in the form of software. The steps of the method disclosed in the embodiments of the present invention may be directly embodied as being executed by a hardware processor, or executed by a combination of hardware in the processor and a software module. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 510; the processor 520 reads the information in the memory 510 and completes the steps of the above method 300 in combination with its hardware. To avoid repetition, details are not given here.
It should be understood that in the embodiments of the present invention, the device 500 according to the embodiment of the present invention may be used to realize the circuit 200 shown in Fig. 3. For brevity, details are not repeated here.
It should be understood that in the embodiments of the present invention, "B corresponding to A" means that B is associated with A, and B can be determined according to A. It should also be understood, however, that determining B according to A does not mean determining B only according to A; B may also be determined according to A and/or other information.
Those of ordinary skill in the art may realize that the units and algorithm steps of each example described in connection with the embodiments disclosed herein can be implemented with electronic hardware, computer software, or a combination of the two. In order to clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed by hardware or software depends on the particular application and the design constraints of the technical solution. Skilled persons may use different methods for each particular application to realize the described functions, but such realization should not be considered beyond the scope of the present invention.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be realized in other ways. For example, the device embodiments described above are merely illustrative; for example, the division of the units is only a logical functional division, and in actual implementation there may be other division manners: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may also be an electrical, mechanical or other form of connection.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be realized in the form of hardware or in the form of a software functional unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a portable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk or an optical disk.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any person familiar with the technical field can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention.

Claims (12)

1. A circuit for processing data, characterized in that the circuit includes a control unit and N processing units, the control unit and each processing unit among the N processing units being connected respectively to a data transmission unit, each processing unit being configured to process the same data output in sequence by the data transmission unit, and the output terminal of the i-th processing unit among the N processing units being connected to the input terminal of the (i+1)-th processing unit, where N is an integer greater than 1 and i ranges over the integers from 1 to N-1,
the control unit being configured to:
when determining that the data to be processed output by the data transmission unit is 0, control the N processing units to be all in an off state.
2. The circuit according to claim 1, characterized in that the circuit further includes N shift registers, the input terminal of the i-th shift register among the N shift registers being connected to the output terminal of the i-th processing unit, and the output terminal of the i-th shift register being connected to the input terminal of the (i+1)-th processing unit,
the control unit being further configured to:
when determining that the data to be processed is 0, control the value of the i-th shift register to be output to the (i+1)-th processing unit;
when determining that the data to be processed is not 0, control the N processing units to process the data to be processed, and control the value of the i-th processing unit to be output to the i-th shift register.
3. The circuit according to claim 2, characterized in that the circuit further includes a storage unit configured to store weights, the (i+1)-th processing unit being specifically configured to:
multiply the weight output by the storage unit and corresponding to the (i+1)-th processing unit by the data to be processed, and add the value obtained by the multiplication to the value output by the i-th processing unit.
4. The circuit according to claim 3, characterized in that the N processing units include N multipliers and N accumulators, wherein the output terminal of the i-th multiplier among the N multipliers is connected to the input terminal of the i-th accumulator among the N accumulators, and the output terminal of the i-th accumulator is connected to the input terminal of the i-th shift register,
the i-th multiplier being configured to multiply the weight output by the storage unit and corresponding to the i-th multiplier by the data to be processed;
the (i+1)-th accumulator being configured to add the value output by the (i+1)-th multiplier to the value output by the i-th shift register.
5. The circuit according to claim 3, characterized in that the data to be processed and/or the weights stored in the storage unit are powers of 2, and the N processing units include N shift-processing units and N accumulators, wherein the output terminal of the i-th shift-processing unit among the N shift-processing units is connected to the input terminal of the i-th accumulator among the N accumulators, and the output terminal of the i-th accumulator is connected to the input terminal of the i-th shift register,
the i-th shift-processing unit being configured to shift the weight corresponding to the i-th shift-processing unit according to the data to be processed, so as to complete the multiplication, or
the i-th shift-processing unit being configured to shift the data to be processed according to the weight corresponding to the i-th shift-processing unit, so as to complete the multiplication;
the (i+1)-th accumulator being configured to add the value output by the (i+1)-th shift-processing unit to the value output by the i-th shift register.
6. The circuit according to any one of claims 3 to 5, characterized in that the control unit is further configured to:
when determining that the data to be processed is 0, control the storage unit to be in an off state.
7. The circuit according to any one of claims 2 to 6, characterized in that the control unit includes a switch unit and a routing unit, the switch unit being configured to control the turning on and off of the N processing units, and the routing unit being configured to select the operating mode of the N shift registers, the operating modes including a shift-accumulate mode and a shift-storage mode.
8. The circuit according to any one of claims 3 to 7, characterized in that the circuit further includes a compression processing unit configured to compress and decompress the weights stored in the storage unit.
9. An image processing system, characterized by including:
M circuits according to any one of claims 1 to 8, an image input unit, a nonlinear mapping unit and an output unit, the output terminal of the k-th circuit among the M circuits being connected to the input terminal of the (k+1)-th circuit, where M is a positive integer and k ranges over the integers from 1 to M-1,
the image input unit being configured to delay the data of different image rows and output the data in sequence;
the nonlinear mapping unit being configured to perform a nonlinear operation on the result output by the N-th processing unit of the M-th circuit;
the output unit being configured to output the result output by the nonlinear mapping unit.
10. The system according to claim 9, characterized in that the system further includes:
at least one buffer unit, each buffer unit among the at least one buffer unit corresponding to multiple circuits, and each buffer unit being configured to store the value output by the N-th processing unit of the corresponding circuits.
11. A method for processing data, characterized in that the method processes input data using the circuit according to any one of claims 1 to 8, the method including:
the control unit judging whether the data to be processed output by the data transmission unit is 0;
when determining that the data to be processed is 0, the control unit controlling the N processing units to be all in an off state.
12. The method according to claim 11, characterized in that the method further includes:
the control unit, when determining that the data to be processed is 0, controlling the value of the i-th shift register to be output to the (i+1)-th processing unit;
the control unit, when determining that the data to be processed is not 0, controlling the N processing units to process the data to be processed, and controlling the value of the i-th processing unit to be output to the i-th shift register.
CN201610480591.6A 2016-06-23 2016-06-23 Circuit for processing data, image processing system, and method and device for processing data Pending CN106127302A (en)

Publications (1)

Publication Number: CN106127302A; Publication Date: 2016-11-16


Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951961A (en) * 2017-02-24 2017-07-14 清华大学 The convolutional neural networks accelerator and system of a kind of coarseness restructural
CN107491416A (en) * 2017-08-31 2017-12-19 中国人民解放军信息工程大学 Reconfigurable Computation structure and calculating dispatching method and device suitable for Arbitrary Dimensions convolution demand
CN107818367A (en) * 2017-10-30 2018-03-20 中国科学院计算技术研究所 Processing system and processing method for neutral net
CN107832841A (en) * 2017-11-14 2018-03-23 福州瑞芯微电子股份有限公司 The power consumption optimization method and circuit of a kind of neural network chip
CN108229668A (en) * 2017-09-29 2018-06-29 北京市商汤科技开发有限公司 Operation implementation method, device and electronic equipment based on deep learning
CN108304925A (en) * 2018-01-08 2018-07-20 中国科学院计算技术研究所 Pooling computing device and method
CN108345934A (en) * 2018-01-16 2018-07-31 中国科学院计算技术研究所 Activation device and method for neural network processor
CN108388333A (en) * 2018-01-25 2018-08-10 福州瑞芯微电子股份有限公司 Power consumption adjustment method and device that set multiplier precision based on battery level
CN108416388A (en) * 2018-03-13 2018-08-17 武汉久乐科技有限公司 State correction method, apparatus and wearable device
CN108564169A (en) * 2017-04-11 2018-09-21 上海兆芯集成电路有限公司 Hardware processing element, neural network unit and computer usable medium
CN108629405A (en) * 2017-03-22 2018-10-09 杭州海康威视数字技术股份有限公司 Method and apparatus for improving the computational efficiency of convolutional neural networks
CN108647184A (en) * 2018-05-10 2018-10-12 杭州雄迈集成电路技术有限公司 Fast implementation method for high-accuracy dynamic-bit convolution multiplication
CN109032704A (en) * 2017-06-12 2018-12-18 深圳市中兴微电子技术有限公司 Data processing method and apparatus
CN109409514A (en) * 2018-11-02 2019-03-01 广州市百果园信息技术有限公司 Fixed-point computation method, apparatus, device and storage medium for convolutional neural networks
CN109635940A (en) * 2019-01-28 2019-04-16 深兰人工智能芯片研究院(江苏)有限公司 Image processing method and image processing apparatus based on convolutional neural networks
CN110036368A (en) * 2016-12-06 2019-07-19 Arm有限公司 Apparatus and method for performing arithmetic operations to accumulate floating point numbers
WO2019165989A1 (en) * 2018-03-01 2019-09-06 华为技术有限公司 Data processing circuit for use in neural network
CN110692038A (en) * 2017-05-24 2020-01-14 微软技术许可有限责任公司 Multi-functional vector processor circuit
CN111064912A (en) * 2019-12-20 2020-04-24 江苏芯盛智能科技有限公司 Frame format conversion circuit and method
CN111209244A (en) * 2018-11-21 2020-05-29 上海寒武纪信息科技有限公司 Data processing device and related product
CN111610846A (en) * 2020-05-08 2020-09-01 上海安路信息科技有限公司 FPGA internal DSP and power consumption reduction method thereof
CN111814972A (en) * 2020-07-08 2020-10-23 上海雪湖科技有限公司 Neural network convolution operation acceleration method based on FPGA
WO2021073053A1 (en) * 2019-10-15 2021-04-22 百度在线网络技术(北京)有限公司 Device and method for convolution operation
CN113222859A (en) * 2021-05-27 2021-08-06 西安电子科技大学 Low-illumination image enhancement system and method based on logarithmic image processing model
CN113821701A (en) * 2021-10-14 2021-12-21 厦门半导体工业技术研发有限公司 Method and device for improving circuit access efficiency

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110029471A1 (en) * 2009-07-30 2011-02-03 Nec Laboratories America, Inc. Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN103716039A (en) * 2013-12-04 2014-04-09 浙江大学城市学院 Floating gate MOS tube-based enhanced dynamic full adder design
CN203645649U (en) * 2013-12-20 2014-06-11 浙江大学城市学院 Neuron MOS tube-based three-valued dynamic BiCMOS OR gate design
CN105260773A (en) * 2015-09-18 2016-01-20 华为技术有限公司 Image processing device and image processing method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110029471A1 (en) * 2009-07-30 2011-02-03 Nec Laboratories America, Inc. Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN103716039A (en) * 2013-12-04 2014-04-09 浙江大学城市学院 Floating gate MOS tube-based enhanced dynamic full adder design
CN203645649U (en) * 2013-12-20 2014-06-11 浙江大学城市学院 Neuron MOS tube-based three-valued dynamic BiCMOS OR gate design
CN105260773A (en) * 2015-09-18 2016-01-20 华为技术有限公司 Image processing device and image processing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
C. FARABET: "CNP: An FPGA-based processor for Convolutional Networks", 《FIELD PROGRAMMABLE LOGIC AND APPLICATIONS》 *
R. G. SHOUP: "Parameterized Convolution Filtering in a Field Programmable Gate Array", 《SELECTED PAPERS FROM THE OXFORD 1993 INTERNATIONAL WORKSHOP ON FIELD PROGRAMMABLE LOGIC AND APPLICATIONS ON MORE FPGAS》 *

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110036368B (en) * 2016-12-06 2023-02-28 Arm有限公司 Apparatus and method for performing arithmetic operations to accumulate floating point numbers
CN110036368A (en) * 2016-12-06 2019-07-19 Arm有限公司 Apparatus and method for performing arithmetic operations to accumulate floating point numbers
CN106951961A (en) * 2017-02-24 2017-07-14 清华大学 Coarse-grained reconfigurable convolutional neural network accelerator and system
CN106951961B (en) * 2017-02-24 2019-11-26 清华大学 Coarse-grained reconfigurable convolutional neural network accelerator and system
CN108629405A (en) * 2017-03-22 2018-10-09 杭州海康威视数字技术股份有限公司 Method and apparatus for improving the computational efficiency of convolutional neural networks
CN108629405B (en) * 2017-03-22 2020-09-18 杭州海康威视数字技术股份有限公司 Method and device for improving calculation efficiency of convolutional neural network
CN108564169B (en) * 2017-04-11 2020-07-14 上海兆芯集成电路有限公司 Hardware processing unit, neural network unit, and computer usable medium
CN108564169A (en) * 2017-04-11 2018-09-21 上海兆芯集成电路有限公司 Hardware processing element, neural network unit and computer usable medium
CN110692038B (en) * 2017-05-24 2023-04-04 微软技术许可有限责任公司 Multi-functional vector processor circuit
CN110692038A (en) * 2017-05-24 2020-01-14 微软技术许可有限责任公司 Multi-functional vector processor circuit
CN109032704A (en) * 2017-06-12 2018-12-18 深圳市中兴微电子技术有限公司 Data processing method and apparatus
CN107491416A (en) * 2017-08-31 2017-12-19 中国人民解放军信息工程大学 Reconfigurable computing structure, and computation scheduling method and device, suitable for convolutions of arbitrary dimensions
CN108229668B (en) * 2017-09-29 2020-07-07 北京市商汤科技开发有限公司 Operation implementation method and device based on deep learning and electronic equipment
CN108229668A (en) * 2017-09-29 2018-06-29 北京市商汤科技开发有限公司 Operation implementation method, device and electronic equipment based on deep learning
CN107818367A (en) * 2017-10-30 2018-03-20 中国科学院计算技术研究所 Processing system and processing method for a neural network
CN107832841A (en) * 2017-11-14 2018-03-23 福州瑞芯微电子股份有限公司 Power consumption optimization method and circuit for a neural network chip
CN108304925B (en) * 2018-01-08 2020-11-03 中国科学院计算技术研究所 Pooling computing device and method
CN108304925A (en) * 2018-01-08 2018-07-20 中国科学院计算技术研究所 Pooling computing device and method
CN108345934A (en) * 2018-01-16 2018-07-31 中国科学院计算技术研究所 Activation device and method for neural network processor
CN108345934B (en) * 2018-01-16 2020-11-03 中国科学院计算技术研究所 Activation device and method for neural network processor
CN108388333A (en) * 2018-01-25 2018-08-10 福州瑞芯微电子股份有限公司 Power consumption adjustment method and device that set multiplier precision based on battery level
CN110222833B (en) * 2018-03-01 2023-12-19 华为技术有限公司 Data processing circuit for neural network
CN110222833A (en) * 2018-03-01 2019-09-10 华为技术有限公司 Data processing circuit for a neural network
WO2019165989A1 (en) * 2018-03-01 2019-09-06 华为技术有限公司 Data processing circuit for use in neural network
CN108416388A (en) * 2018-03-13 2018-08-17 武汉久乐科技有限公司 State correction method, apparatus and wearable device
CN108416388B (en) * 2018-03-13 2022-03-11 武汉久乐科技有限公司 State correction method and device, and wearable device
CN108647184B (en) * 2018-05-10 2022-04-12 杭州雄迈集成电路技术股份有限公司 Method for realizing dynamic bit convolution multiplication
CN108647184A (en) * 2018-05-10 2018-10-12 杭州雄迈集成电路技术有限公司 Fast implementation method for high-accuracy dynamic-bit convolution multiplication
CN109409514A (en) * 2018-11-02 2019-03-01 广州市百果园信息技术有限公司 Fixed-point computation method, apparatus, device and storage medium for convolutional neural networks
CN111209244A (en) * 2018-11-21 2020-05-29 上海寒武纪信息科技有限公司 Data processing device and related product
CN111209244B (en) * 2018-11-21 2022-05-06 上海寒武纪信息科技有限公司 Data processing device and related product
CN109635940A (en) * 2019-01-28 2019-04-16 深兰人工智能芯片研究院(江苏)有限公司 Image processing method and image processing apparatus based on convolutional neural networks
WO2021073053A1 (en) * 2019-10-15 2021-04-22 百度在线网络技术(北京)有限公司 Device and method for convolution operation
US11556614B2 (en) 2019-10-15 2023-01-17 Apollo Intelligent Driving Technology (Beijing) Co., Ltd. Apparatus and method for convolution operation
CN111064912A (en) * 2019-12-20 2020-04-24 江苏芯盛智能科技有限公司 Frame format conversion circuit and method
CN111064912B (en) * 2019-12-20 2022-03-22 江苏芯盛智能科技有限公司 Frame format conversion circuit and method
CN111610846A (en) * 2020-05-08 2020-09-01 上海安路信息科技有限公司 FPGA internal DSP and power consumption reduction method thereof
CN111814972A (en) * 2020-07-08 2020-10-23 上海雪湖科技有限公司 Neural network convolution operation acceleration method based on FPGA
CN111814972B (en) * 2020-07-08 2024-02-02 上海雪湖科技有限公司 Neural network convolution operation acceleration method based on FPGA
CN113222859A (en) * 2021-05-27 2021-08-06 西安电子科技大学 Low-illumination image enhancement system and method based on logarithmic image processing model
CN113222859B (en) * 2021-05-27 2023-04-21 西安电子科技大学 Low-illumination image enhancement system and method based on logarithmic image processing model
CN113821701A (en) * 2021-10-14 2021-12-21 厦门半导体工业技术研发有限公司 Method and device for improving circuit access efficiency
CN113821701B (en) * 2021-10-14 2023-09-26 厦门半导体工业技术研发有限公司 Method and device for improving circuit access efficiency

Similar Documents

Publication Publication Date Title
CN106127302A (en) Circuit for processing data, image processing system, and method and apparatus for processing data
CN111242282B (en) Deep learning model training acceleration method based on end edge cloud cooperation
CN107169563B (en) Processing system and method applied to two-value weight convolutional network
CN106203621B (en) Processor for convolutional neural network computation
CN109543832B (en) Computing device and board card
US20190087713A1 (en) Compression of sparse deep convolutional network weights
US11847553B2 (en) Parallel computational architecture with reconfigurable core-level and vector-level parallelism
CN109409510B (en) Neuron circuit, chip, system and method thereof, and storage medium
WO2020074989A1 (en) Data representation for dynamic precision in neural network cores
CN109934336A (en) Design method for a neural network dynamic acceleration platform based on optimal structure search, and neural network dynamic acceleration platform
CN115880132B (en) Graphics processor, matrix multiplication task processing method, device and storage medium
CN101576961B (en) High-speed image matching method and device thereof
CN114239859B (en) Power consumption data prediction method and device based on transfer learning and storage medium
CN113792621B (en) FPGA-based target detection accelerator design method
Tu et al. A power efficient neural network implementation on heterogeneous FPGA and GPU devices
CN110991608A (en) Convolutional neural network quantized computation method and system
WO2021036362A1 (en) Method and apparatus for processing data, and related product
CN112529477A (en) Credit evaluation variable screening method, device, computer equipment and storage medium
JP2020077066A (en) Learning device and method for learning
CN112308335A (en) Short-term electricity price prediction method and device based on xgboost algorithm
CN111831354A (en) Data precision configuration method, device, chip array, equipment and medium
Zhan et al. Field programmable gate array‐based all‐layer accelerator with quantization neural networks for sustainable cyber‐physical systems
CN116822600A (en) Neural network search chip based on RISC-V architecture
CN114254740B (en) Convolutional neural network accelerated computation method, computation system, chip and receiver
CN109582911B (en) Computing device for performing convolution and computing method for performing convolution

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161116