CN110517183A

CN110517183A - A high-speed, low-power image processor based on retinal mechanisms

Info

Publication number: CN110517183A
Application number: CN201910684793.6A
Authority: CN
Inventors: 周军; 项晓强; 刘丽丽
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2019-07-26
Filing date: 2019-07-26
Publication date: 2019-11-29
Anticipated expiration: 2039-07-26
Also published as: CN110517183B

Abstract

The invention discloses a high-speed, low-power image processor based on retinal mechanisms and belongs to the field of image processing. To address the problems described in the background, this patent proposes a high-speed, low-power, self-configurable tone mapping processor based on human retinal mechanisms. Applied in embedded and other smart devices, it can not only handle the problem of displaying high-dynamic-range images on low-dynamic-range devices, but can also enhance the brightness of dark night-time images. It processes more pixel data per unit of power, improving the data-processing efficiency of intelligent Internet of Things (IoT) devices; it achieves a processing speed of 189 frames per second, well beyond existing hardware implementations, which fully meets the application demands of devices such as unmanned vehicles and drones; and its functions can be reprogrammed in real time according to the demands of the current operating environment or technique, which greatly enhances the functional flexibility of the processor.

Description

A high-speed, low-power image processor based on retinal mechanisms

Technical field

The invention belongs to the field of image processing, and in particular relates to an embedded image processor.

Background

Existing related techniques are currently used mainly at the algorithm level, for scientific or other academic research, and mostly run on CPUs. This results in low running speed and high power consumption; images cannot be processed in real time, let alone high-definition video streams. Running on a CPU is also costly, so these techniques cannot be applied in small embedded IoT terminals. Some scholars have implemented current tone mapping, and image processing algorithms improved from it, on FPGA hardware platforms. Although these implementations use different innovative techniques and achieve considerable results in power consumption and processing speed, they still fall short where speed, power and cost requirements are more demanding, such as in unmanned vehicles, drones, or unmanned-detection intelligent IoT terminals. In addition, for some of the FPGA implementations that scholars have proposed, the hardware-processed images lose some image quality compared with the software-processed images.

As IoT and artificial intelligence applications are combined, numerous intelligent processing techniques will be embedded in IoT terminals to make IoT devices intelligent. At the same time, IoT devices place higher demands on power consumption and cost. A matching low-power, low-cost, high-speed dedicated embedded intelligent FPGA processor is therefore needed. Compared with earlier FPGA image processors, it needs a new processing architecture and new techniques to meet the required technical specifications.

At present, the most widely used tone-mapping algorithms are based on mathematical methods. To our knowledge, only one image processing method based on retinal mechanisms has been implemented in hardware: the algorithm proposed by Raquel et al. in the article "Real-time tone mapping on GPU and FPGA", which is implemented on an FPGA and combines part of the retinal mechanism with traditional histogram equalization. Compared with more traditional, purely mathematical algorithms, the images it produces look more natural. However, it still has defects. The retinal processing mechanism is not applied completely: only the horizontal-cell part of the mechanism is used, so the algorithm is not applied comprehensively. Moreover, it is not fast enough and its power consumption is not low enough, so it cannot be used in scenarios that require high-speed video stream processing, such as drones and autonomous driving, nor in power-sensitive devices such as IoT equipment.

Summary of the invention

To address the problems described in the background, this patent proposes a high-speed, low-power, self-configurable tone mapping processor based on human retinal mechanisms. Applied in embedded and other smart devices, it can not only handle the problem of displaying high-dynamic-range images on low-dynamic-range devices, but can also enhance the brightness of dark night-time images.

The technical solution of the present invention is a high-speed, low-power image processor based on retinal mechanisms. The processor processes an image by passing it successively through: a photoreceptor module, a horizontal cell processing module, a second parallel data blocking processing module, a pipeline processing module, and a bipolar cell processing module;

The photoreceptor module comprises an adjacent-frame feature sharing module and a first parallel data blocking processing module; the input image is fed simultaneously to both;

The horizontal cell processing module comprises a convolution kernel selector, a convolution kernel decompression module, and a first convolution module. The output data of the adjacent-frame feature sharing module in the photoreceptor module is processed successively by the convolution kernel selector and the convolution kernel decompression module; the output data of the convolution kernel decompression module and of the first parallel data blocking processing module in the photoreceptor module are input simultaneously to the first convolution module;

The bipolar cell processing module comprises a second convolution module and a convolution kernel compression module; the output data of the multilayer convolution pipeline processing module and of the convolution kernel compression module are input simultaneously to the second convolution module; the output of the second convolution module is the output of the image processor.

Further, the first or second parallel data blocking processing module comprises a data partition controller, a BRAM array, and a data array module. The data partition controller controls both the BRAM array and the data array module. Input data is first stored in the BRAM array; the BRAM array then passes the data to the data array module according to the instructions of the data partition controller, and the output of the data array module is the output of the parallel data blocking processing module.

Further, the BRAM array comprises 15 small BRAMs, and the input image is processed in blocks as follows:

The pixel data of the first 15 rows of the input image is cached in the 15 small BRAMs, each holding 1024 input data values. Each time an output pixel is to be computed, 15 pixel values are updated and output within one clock cycle, so that the subsequent dot product with the convolution kernel and the summation can be completed;

When the sliding window changes line, an S-shaped sliding window is used so that the data at the end of each row's operation is reused. By the time the convolution has output one row of data, a new row of the input image has been cached in one of the 15 BRAMs, ready for the next row of the output image;

Each time 15 pixel values are read from the 15 BRAMs, new data from the 16th row of the input image is written into the first BRAM. For the dot product between the convolution kernel and the data to be convolved, data from the different BRAMs is routed through a multiplexer into the data registers of a 15*15 multiplier array. The data partition controller dynamically configures the multiplexer and the data registers so that the 15*15 data fed to the multiplier array is rearranged to allow a one-to-one dot product with the convolution kernel.

Further, the adjacent-frame feature sharing module comprises a BRAM, a mean module, a variance module, a kernel selection module, a convolution module, a shift register, and a feedback module. The mean module, variance module, convolution module and shift register obtain data directly from the BRAM. The data computed by the mean module is output to both the variance module and the kernel selection module; the data computed by the variance module is also output to the kernel selection module; the data computed by the kernel selection module is output to the convolution module; the convolution module outputs data to the shift register; and the shift register outputs data to the feedback module;

The first frame of a high-speed video stream is first input to the mean module, which computes its mean. The second frame is input to the variance module, which computes the variance of the second frame using the mean computed from the first frame. The third frame is input to the convolution module; the mean and variance computed from the first and second frames are sent to the kernel selection module, which chooses a different convolution kernel for the convolution of each pixel of the third frame. While the third frame is being convolved, the image data fetched for the convolution is also used to compute the mean and variance of the third frame. Every subsequent frame is input directly to the convolution module; the mean and variance obtained from the previous frame are sent to the kernel selection module, which chooses a different convolution kernel for each pixel of the current frame, while the fetched convolution data is used to compute the mean and variance of the current frame in preparation for the next frame.

Further, the pipeline processing module comprises a storage array composed of 7 BRAMs, from which data is read in an S-shaped sliding-window manner.

Further, the horizontal cell processing module also includes a zero detection module before the convolution module, and the convolution kernel compression module stores only non-duplicate data on chip. In the 15*15 convolution, the algorithm provides a different convolution kernel for each data value to be convolved, according to that value's range, so that each kernel is well matched to its data and the convolved image looks more natural. However, because there are many convolution kernels, they would occupy a large amount of on-chip storage. The 15*15 kernels in the algorithm have only some nonzero entries and are centrally symmetric, so much of their data is repeated. Exploiting this property, the convolution kernel compression technique stores only the non-duplicate kernel data on chip, which reduces the on-chip storage occupied by kernel data and also reduces the power consumed in fetching data from RAM. To reduce power further, when zeros are present in the convolution window data or in the kernel data, the zero detection module skips the zero elements so that they take no part in the calculation, which further reduces the power consumption of the DSP circuits.

The processor has the following technical advantages and features:

1. High energy efficiency: more pixel data can be processed per unit of power, improving the data-processing efficiency of intelligent IoT devices.

2. High throughput: compared with existing techniques, this patent achieves a processing speed of 189 frames per second, well beyond existing hardware implementations, and fully meets the application demands of devices such as unmanned vehicles and drones.

3. Programmable: because this patent uses an FPGA as the design platform for the dedicated processor, its functions or techniques can be reprogrammed in real time according to the demands of the current operating environment, which greatly enhances the functional flexibility of the processor.

Brief description of the drawings

Fig. 1 is a schematic diagram of the overall architecture of the retina-mechanism-based tone mapping image processor proposed in this patent;

Fig. 2 is a schematic diagram of the parallel data blocking module using the S-shaped sliding window technique;

Fig. 3 is a circuit block diagram without the adjacent-frame feature sharing technique;

Fig. 4 is a circuit block diagram without the adjacent-frame feature sharing technique;

Fig. 5 is a schematic diagram of the multilayer convolution pipeline technique of the present invention;

Fig. 6 is a schematic diagram of the convolution kernel decompression module.

Detailed description of the embodiments

Fig. 1 shows the overall architecture of the retina-mechanism-based tone mapping image processor. The horizontal cell and bipolar cell processing modules in the figure are inspired by the corresponding processing stages of the human retina. To achieve high throughput and high energy efficiency, this patent proposes a series of innovative techniques: parallel data blocking; adjacent-frame feature sharing (when processing a high-definition video stream, adjacent frames have similar features, so this technique reduces power and resource consumption and increases processing speed); multilayer convolution pipelining; and convolution kernel compression. Each technique is explained in detail below.

1. Parallel data blocking technique

The retina-mechanism-based tone mapping image processor of this patent contains two convolutions (15*15 and 7*7). For the 15*15 convolution, the input image (1024*768) is stored in one large on-chip BRAM, and the dot product is computed for each pixel in turn to complete the convolution. To reduce read time and improve throughput, the input image is processed in blocks: the pixel data of the first 15 rows is first cached in 15 small BRAMs, each holding 1024 input data values, as shown in Fig. 2. When an output pixel is to be computed, only 15 data values need to be updated, and they can be output within one clock cycle so that the subsequent dot product with the convolution kernel and the summation can be completed. This saves a large number of clock cycles. When the sliding window changes line, an S-shaped rather than a Z-shaped sliding window is used, so that the data at the end of each row's operation is reused, which saves the time and power needed to re-read data. By the time the convolution has output one row of data (1*1024), a new row of the input image has already been cached in one of the 15 BRAMs, ready for the next row of the output image. This saves a waiting time of 1024 clock cycles.
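The benefit of the S-shaped window at a line change can be illustrated with a small software model. The sketch below is not taken from the patent: it assumes the previous 15*15 window stays in the data registers and simply counts how many pixels of the next window are not already held. The function names and the small 2*20 grid of window positions are hypothetical.

```python
def scan_order(rows, cols, serpentine):
    """Window positions in Z-shaped (raster) or S-shaped (serpentine) order."""
    order = []
    for r in range(rows):
        columns = range(cols)
        if serpentine and r % 2 == 1:
            columns = reversed(range(cols))  # odd rows run right to left
        order.extend((r, c) for c in columns)
    return order


def window_pixels(r, c, k):
    """Set of input pixel coordinates covered by a k*k window at (r, c)."""
    return {(r + i, c + j) for i in range(k) for j in range(k)}


def new_pixels_per_step(rows, cols, k, serpentine):
    """Pixels that must be newly fetched at each step (modelling assumption:
    only the immediately preceding window is still held in registers)."""
    previous = set()
    counts = []
    for r, c in scan_order(rows, cols, serpentine):
        current = window_pixels(r, c, k)
        counts.append(len(current - previous))
        previous = current
    return counts


s_counts = new_pixels_per_step(rows=2, cols=20, k=15, serpentine=True)
z_counts = new_pixels_per_step(rows=2, cols=20, k=15, serpentine=False)
print(s_counts[20])  # 15: the line change only needs one new row of pixels
print(z_counts[20])  # 225: jumping back to column 0 reloads the whole window
```

In this model every S-shaped step after the first needs 15 new pixels, which matches the 15 values per clock described above, whereas the Z-shaped line change needs a full window.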

To save waiting time, each time 15 pixel values are read from the 15 BRAMs, new data from the 16th row of the input image is written into the first BRAM. As a result, when the first row of output pixels is complete, the next row of input data is already in place in the first BRAM, and the convolution of the next row can start directly without waiting. For the dot product between the convolution kernel and the data to be convolved, data from the different BRAMs is routed through a multiplexer into the data registers of a 15*15 multiplier array. The data partition controller dynamically configures the multiplexer and the data registers so that the 15*15 data fed to the multiplier array is rearranged to allow a one-to-one dot product with the convolution kernel. For example, at the first line change, the data of rows 2 to 15 of the input image moves up, and the data of row 16 (stored in the first BRAM) is placed at the bottom. When row 17 of the input image starts to be written into the BRAMs, it is written into the second BRAM. Meanwhile, the data of rows 3 to 15 moves up, and the data of rows 16 and 17 (stored in the 1st and 2nd BRAMs) is placed at the bottom. The remaining rows follow the same pattern until the convolution of the whole image has been output. The convolution process also involves edge zero-padding. The same design technique is applied to the 7*7 convolution, reducing processing time and power consumption.
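The row rotation described above can be sketched in software as a ring of 15 line buffers. This is a minimal behavioural model under stated assumptions, not the patent's circuit: `banks` stands in for the 15 small BRAMs, `order` for the multiplexer setting chosen by the data partition controller, the scan is left to right only (the S-shaped order is left out for brevity), and no zero padding is applied. All names are hypothetical.

```python
from collections import deque

K = 15  # kernel size and number of line buffers, as in the text


def convolve_with_line_buffers(image, kernel):
    """Sliding dot product using K row buffers that are refilled in place."""
    height, width = len(image), len(image[0])
    banks = [list(row) for row in image[:K]]  # first 15 rows are preloaded
    output = []
    for r in range(height - K + 1):
        # Multiplexer setting: bank r % K holds the oldest (top) row, and the
        # most recently written bank ends up at the bottom of the window.
        order = [(r + i) % K for i in range(K)]
        oldest = order[0]
        window = deque(maxlen=K)  # last K columns, one pixel per bank each
        out_row = []
        for c in range(width):
            window.append([banks[b][c] for b in order])  # read 15 pixels
            if r + K < height:
                # The top row's pixel at column c is no longer needed in the
                # bank, so the next input row is written into the same place.
                banks[oldest][c] = image[r + K][c]
            if len(window) == K:
                out_row.append(sum(window[j][i] * kernel[i][j]
                                   for i in range(K) for j in range(K)))
        output.append(out_row)
    return output


def convolve_direct(image, kernel):
    """Reference result computed straight from the full image."""
    height, width = len(image), len(image[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(K) for j in range(K))
             for c in range(width - K + 1)]
            for r in range(height - K + 1)]


image = [[(7 * y + 3 * x) % 11 for x in range(17)] for y in range(18)]
kernel = [[(i - j) % 3 for j in range(K)] for i in range(K)]
print(convolve_with_line_buffers(image, kernel) == convolve_direct(image, kernel))
```

The comparison with `convolve_direct` checks that rotating the bank order, rather than physically moving rows, still presents the window rows to the kernel in the right order.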

2. Adjacent-frame feature sharing technique

Based on the processing principle of the basic circuit and algorithm, and on the characteristics of the adjacent-frame feature sharing technique, the improved circuit shown in Fig. 3 was designed. For the first frame of a high-speed video stream, its mean is computed first. For the second frame, the variance of the second frame is computed using the mean just computed from the first frame. When the third frame arrives, the principle that adjacent frames have similar features is applied: the mean and variance computed from the previous two frames are used as the mean and variance of this frame and are sent to the kernel selection module for the third frame, which chooses a different convolution kernel for the convolution of each pixel of the third frame. While the third frame is being convolved, the image data fetched for the convolution is also used to compute the mean and variance of the third frame (this variance is computed using the mean obtained from the second frame), which serve as the mean and variance for kernel selection when the fourth frame is convolved. Subsequent frames follow the same principle. For any single frame, data only needs to be fetched from the BRAM once, which greatly increases processing speed and significantly reduces power consumption.
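The frame-by-frame schedule above can be sketched as a short software model. This is only an illustration under assumptions: each frame is a flat list of pixel values, the statistics handed to kernel selection for frame n are the mean of frame n-1 and the variance of frame n-1 measured around the mean of frame n-2, and the kernel selection rule itself, which the text does not spell out, is left out. The function name is hypothetical.

```python
def shared_stats_schedule(frames):
    """For each frame, return the (mean, variance) available for kernel
    selection, or None while the statistics pipeline is still filling."""
    prev_mean = None  # mean of the previous frame
    prev_var = None   # variance of the previous frame, around an older mean
    available = []
    for pixels in frames:
        ready = prev_mean is not None and prev_var is not None
        available.append((prev_mean, prev_var) if ready else None)

        # Single pass over the current frame: accumulate its mean and, using
        # the previous frame's mean, its (approximate) variance.
        total = 0.0
        squared = 0.0
        for p in pixels:
            total += p
            if prev_mean is not None:
                squared += (p - prev_mean) ** 2
        mean = total / len(pixels)
        var = squared / len(pixels) if prev_mean is not None else None
        prev_mean, prev_var = mean, var
    return available


frames = [[1, 2, 3, 4]] * 4  # four identical toy frames
print(shared_stats_schedule(frames))
# [None, None, (2.5, 1.25), (2.5, 1.25)]
```

The first two frames only fill the pipeline; from the third frame on, every frame is read once and its statistics are reused by the next frame, which is the source of the speed and power savings claimed above.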

Based on the processing principle of the basic circuit and algorithm, and on the characteristics of the adjacent-frame feature sharing technique, we designed the improved circuit shown in Fig. 4. For the first frame of a high-speed video stream, its mean is computed first. For the second frame, the variance of the second frame is computed using the mean just computed from the first frame. When the third frame arrives, the principle that adjacent frames have similar features is applied: the mean and variance computed from the previous two frames are used as the mean and variance of this frame and are sent to the kernel selection module for the third frame, which chooses a different convolution kernel for the convolution of each pixel of the third frame. While the third frame is being convolved, the image data fetched for the convolution is also used to compute the mean and variance of the third frame (this variance is computed using the mean obtained from the second frame), which serve as the mean and variance for kernel selection when the fourth frame is convolved. Subsequent frames follow the same principle. For any single frame, data only needs to be fetched from the BRAM once, which greatly increases processing speed and significantly reduces power consumption.

3. Multilayer convolution pipeline

As mentioned above, the overall architecture of our retina-mechanism-based tone mapping image processor contains two convolutions, 15*15 and 7*7. One is the 15*15 convolution in the horizontal cell processing part; the other is the 7*7 convolution in the bipolar cell processing part. To perform the two convolutions consecutively, a BRAM buffer is needed to store the intermediate data between them. However, repeatedly reading and writing this data would add substantial power consumption.

To reduce power consumption, this patent designs a multilayer convolution pipeline. As shown in Fig. 5, once the 15*15 convolution has produced part of its data, the 7*7 convolution can begin immediately. Because zero padding is used in the convolution, the 7*7 convolution can start as soon as 3 rows and 4 pixels of data are available. This multilayer convolution architecture greatly reduces the power consumed by data reads and writes, and also considerably reduces the size of the BRAM buffer.
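The "3 rows and 4 pixels" figure follows from the zero padding. A small sketch, assuming the intermediate data arrives in row-major order and that a k*k kernel is padded by (k - 1) / 2 zeros on each side (both are assumptions, as is the function name):

```python
def data_needed_before_start(kernel_size, row_width):
    """Intermediate data a zero-padded kernel_size*kernel_size convolution
    needs before its first output can be computed.

    With padding pad = (kernel_size - 1) // 2, the first window is centred on
    pixel (0, 0). Its rows and columns above and left of the image are zeros,
    so the last real pixel it needs is (pad, pad): pad complete rows plus the
    first pad + 1 pixels of the following row.
    """
    pad = (kernel_size - 1) // 2
    full_rows = pad
    extra_pixels = pad + 1
    total_pixels = full_rows * row_width + extra_pixels
    return full_rows, extra_pixels, total_pixels


rows, pixels, total = data_needed_before_start(kernel_size=7, row_width=1024)
print(rows, pixels, total)   # 3 4 3076
print(1024 * 768)            # 786432 pixels in a full intermediate frame
```

Under these assumptions the second convolution can begin after 3076 intermediate pixels rather than a full 786432-pixel frame, which is consistent with the smaller BRAM buffer described above.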

4. Convolution kernel decompression module

In the 15*15 convolution, the algorithm provides a different convolution kernel for each data value to be convolved, according to that value's range, so that each kernel is well matched to its data and the convolved image looks more natural. However, because there are many convolution kernels, they would occupy a large amount of on-chip storage. The 15*15 kernels in the algorithm have only some nonzero entries and are centrally symmetric, so much of their data is repeated. Exploiting this property, the convolution kernel compression technique stores only the non-duplicate kernel data on chip, which reduces the on-chip storage occupied by kernel data and also reduces the power consumed in fetching data from RAM. To reduce power further, when zeros are present in the convolution window data or in the kernel data, the zero detection module skips the zero elements so that they take no part in the calculation, which further reduces the power consumption of the DSP circuits.
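One way to picture the compression and zero skipping is the sketch below. It is an assumption-laden illustration rather than the patent's storage format: it exploits only point symmetry about the kernel centre (entry k of the flattened kernel equals entry n*n - 1 - k), the example kernel values are invented, and all function names are hypothetical.

```python
def compress_kernel(kernel):
    """Keep only the first half (plus centre) of a centrally symmetric kernel."""
    n = len(kernel)
    flat = [v for row in kernel for v in row]
    total = n * n
    if any(flat[k] != flat[total - 1 - k] for k in range(total)):
        raise ValueError("kernel is not centrally symmetric")
    return flat[:(total + 1) // 2]


def decompress_kernel(stored, n):
    """Rebuild the full n*n kernel by mirroring the stored half."""
    total = n * n
    flat = [stored[k] if k < len(stored) else stored[total - 1 - k]
            for k in range(total)]
    return [flat[r * n:(r + 1) * n] for r in range(n)]


def dot_skip_zeros(window, kernel):
    """Dot product that skips any pair containing a zero.
    Returns the result and the number of multiplications actually done."""
    acc = 0
    multiplications = 0
    for window_row, kernel_row in zip(window, kernel):
        for w, k in zip(window_row, kernel_row):
            if w == 0 or k == 0:
                continue  # zero detection: this pair contributes nothing
            acc += w * k
            multiplications += 1
    return acc, multiplications


N = 15
# Invented example kernel: nonzero only in the central 7*7 region.
example = [[max(0, 4 - max(abs(i - 7), abs(j - 7))) for j in range(N)]
           for i in range(N)]
stored = compress_kernel(example)
print(len(stored), N * N)                        # 113 225
print(decompress_kernel(stored, N) == example)   # True
ones = [[1] * N for _ in range(N)]
print(dot_skip_zeros(ones, example))             # (84, 49)
```

For this example, storage drops from 225 to 113 values and only 49 of the 225 multiplications are performed. A real kernel with additional mirror symmetries could be stored more compactly still; that is not modelled here.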

The advantages of the present invention include:

1. High energy efficiency: more pixel data can be processed per unit of power.

2. High processing speed: the dedicated retina-mechanism-based tone mapping image processor designed in this patent can output up to 189 frames of 1024*768 high-definition images per second. This fully meets the demand of current unmanned intelligent devices, such as drones and unmanned vehicles, for a high-speed, low-power image processor.

3. Low hardware resource usage and high flexibility through reconfigurable programming: the dedicated image processor designed in this patent avoids wasting resources and achieves a very high level of resource reuse.

The technical advantages above arise from the techniques described earlier. Each technique is aimed at optimizing power consumption, speed, hardware resources, and so on.
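To put the stated speed in context, the pixel rate implied by 189 frames per second at 1024*768 can be worked out directly. The constant names below are illustrative; the figures are those given in the text.

```python
FRAME_WIDTH = 1024
FRAME_HEIGHT = 768
FRAMES_PER_SECOND = 189

pixels_per_frame = FRAME_WIDTH * FRAME_HEIGHT
pixels_per_second = pixels_per_frame * FRAMES_PER_SECOND
frame_budget_ms = 1000 / FRAMES_PER_SECOND

print(pixels_per_frame)           # 786432
print(pixels_per_second)          # 148635648
print(round(frame_budget_ms, 2))  # 5.29
```

That is roughly 148.6 million output pixels per second, or about 5.29 ms available per frame.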

Claims

1. A high-speed, low-power image processor based on retinal mechanisms, wherein the processor processes an image by passing it successively through: a photoreceptor module, a horizontal cell processing module, a second parallel data blocking processing module, a pipeline processing module, and a bipolar cell processing module;

The photoreceptor module comprises an adjacent-frame feature sharing module and a first parallel data blocking processing module; the input image is fed simultaneously to both;

The horizontal cell processing module comprises a convolution kernel selector, a convolution kernel decompression module, and a first convolution module; the output data of the adjacent-frame feature sharing module in the photoreceptor module is processed successively by the convolution kernel selector and the convolution kernel decompression module; the output data of the convolution kernel decompression module and of the first parallel data blocking processing module in the photoreceptor module are input simultaneously to the first convolution module;

The bipolar cell processing module comprises a second convolution module and a convolution kernel compression module; the output data of the multilayer convolution pipeline processing module and of the convolution kernel compression module are input simultaneously to the second convolution module; the output of the second convolution module is the output of the image processor.

2. The high-speed, low-power image processor based on retinal mechanisms according to claim 1, characterized in that the first or second parallel data blocking processing module comprises a data partition controller, a BRAM array, and a data array module; the data partition controller controls both the BRAM array and the data array module; input data is first stored in the BRAM array; the BRAM array then passes the data to the data array module according to the instructions of the data partition controller; and the output of the data array module is the output of the parallel data blocking processing module.

3. The high-speed, low-power image processor based on retinal mechanisms according to claim 1, characterized in that the adjacent-frame feature sharing module comprises a BRAM, a mean module, a variance module, a kernel selection module, a convolution module, a shift register, and a feedback module; the mean module, variance module, convolution module and shift register obtain data directly from the BRAM; the data computed by the mean module is output to both the variance module and the kernel selection module; the data computed by the variance module is also output to the kernel selection module; the data computed by the kernel selection module is output to the convolution module; the convolution module outputs data to the shift register; and the shift register outputs data to the feedback module;

The first frame of a high-speed video stream is first input to the mean module, which computes its mean; the second frame is input to the variance module, which computes the variance of the second frame using the mean computed from the first frame; the third frame is input to the convolution module, and the mean and variance computed from the first and second frames are sent to the kernel selection module, which chooses a different convolution kernel for the convolution of each pixel of the third frame; while the third frame is being convolved, the image data fetched for the convolution is also used to compute the mean and variance of the third frame; every subsequent frame is input directly to the convolution module, and the mean and variance obtained from the previous frame are sent to the kernel selection module, which chooses a different convolution kernel for each pixel of the current frame, while the fetched convolution data is used to compute the mean and variance of the current frame in preparation for the next frame.

4. The high-speed, low-power image processor based on retinal mechanisms according to claim 1, characterized in that the pipeline processing module comprises a storage array composed of 7 BRAMs, from which data is read in an S-shaped sliding-window manner.

5. The high-speed, low-power image processor based on retinal mechanisms according to claim 1, characterized in that the horizontal cell processing module further includes a zero detection module before the convolution module, and the convolution kernel compression module stores only non-duplicate data on chip.

6. The high-speed, low-power image processor based on retinal mechanisms according to claim 2, characterized in that the BRAM array comprises 15 small BRAMs, and the input image is processed in blocks as follows:

The pixel data of the first 15 rows of the input image is cached in the 15 small BRAMs, each holding 1024 input data values; each time an output pixel is to be computed, 15 pixel values are updated and output within one clock cycle, so that the subsequent dot product with the convolution kernel and the summation can be completed;

When the sliding window changes line, an S-shaped sliding window is used so that the data at the end of each row's operation is reused; by the time the convolution has output one row of data, a new row of the input image has been cached in one of the 15 BRAMs, ready for the next row of the output image;

Each time 15 pixel values are read from the 15 BRAMs, new data from the 16th row of the input image is written into the first BRAM; for the dot product between the convolution kernel and the data to be convolved, data from the different BRAMs is routed through a multiplexer into the data registers of a 15*15 multiplier array; the data partition controller dynamically configures the multiplexer and the data registers so that the 15*15 data fed to the multiplier array is rearranged to allow a one-to-one dot product with the convolution kernel.