A CUDA-based acceleration and optimization method for guided filtering
Technical field
The present invention relates to the fields of computer application technology and image processing, and more particularly to an acceleration and optimization method for guided filtering based on CUDA (Compute Unified Device Architecture).
Background art
Image filtering is an important means of image processing and has great significance and research value. Because imaging systems, transmission media, and recording equipment are imperfect, digital images are often contaminated by various kinds of noise during formation, transmission, and recording. Image filtering, that is, suppressing the noise of a target image while preserving image detail as far as possible, is an indispensable operation in image preprocessing, and the quality of its result directly affects the validity and reliability of subsequent image processing and analysis.
Image filtering methods can be divided into two kinds. One is linear shift-invariant filtering, in which the filter kernel weights are independent of the content of the input image; representatives are Gaussian filtering, mean filtering, and Laplacian filtering. The other is linear shift-variant filtering, represented by guided filtering, in which the filtering process uses the content information contained in an image, referred to as guidance image information. The bilateral filter kernel takes into account both the spatial distance between pixels in the image template and the difference between pixel values; its guidance image and input image are the same image, so the bilateral filter can be regarded as a simple form of guided filtering. In the joint bilateral filter the guidance image and the input image differ, which can give a better filtering result. However, the bilateral filter and the joint bilateral filter also have some obvious defects. For example, in detail enhancement and high dynamic range image compression, the bilateral filter produces an obvious gradient reversal artifact at edges, so the filter algorithm itself and its structure need further improvement. The concept of guided filtering was formally proposed in 2010. It has the edge-preserving denoising property of bilateral filtering while overcoming the influence of artifacts. In addition, because of its close relationship with the Laplacian matrix, guided filtering has been widely applied in image denoising, image enhancement, HDR (high dynamic range) compression, flash/no-flash denoising [1], matting, dehazing, and joint upsampling. The algorithm is simple and effective, but it needs to compute complicated matrices and solve large linear systems, so the guided filtering algorithm consumes a great deal of computing time and space and cannot meet the needs of practical applications.
In summary, the guided image filtering algorithm is computationally expensive, and it is difficult to improve its execution efficiency while guaranteeing its accuracy. A traditional CPU-based architecture therefore can hardly meet the requirements on both algorithm accuracy and real-time processing, and the needs of practical applications can only be met by using a graphics processing unit (GPU).
Summary of the invention
The present invention provides a CUDA-based acceleration and optimization method for guided filtering which, while guaranteeing image filtering quality, improves computational efficiency and reduces computational complexity, as described below:
A CUDA-based acceleration and optimization method for guided filtering, comprising the following steps:
reading the input image p and the guidance image I from host memory into global memory and, by constructing a first kernel function, obtaining the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I, respectively;
constructing a second kernel function to compute, in turn, the covariance of the images (I, p) and the variance of the guidance image I, and then computing the key filtering parameters a and b;
calling the first kernel function to compute the neighborhood mean mean_a of parameter a and the neighborhood mean mean_b of parameter b, thereby obtaining the final filtering result, saving the result in the corresponding global memory, and transferring it out to host memory.
The step of obtaining, by constructing the first kernel function, the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I specifically comprises:
converting the computation of the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I into the computation of sums of pixel values over the image neighborhood window, respectively;
computing each sum of neighborhood-window pixel values by constructing the first kernel function.
The step of computing each sum of neighborhood-window pixel values by constructing the first kernel function specifically comprises:
computing the sum of neighborhood-window pixel values with an integral image, with CUDA parallel optimization carried out by the 4 kernel functions contained in the first kernel function.
The acceleration and optimization method further comprises: calling the 4 kernel functions of the first kernel function to obtain the number of pixels N in the neighborhood window, and saving it in constant memory.
The beneficial effects of the technical solution provided by the present invention are as follows:
On the basis of an in-depth study of the guided filtering algorithm, the present invention implements the algorithm with CUDA programming for four application examples (image smoothing, image feathering, image enhancement, and flash denoising) and compares it experimentally with a C program and a Matlab program. The advantages of the present invention over the prior art are:
(1) The approach is novel. The guided filtering algorithm is designed on the CUDA architecture, which breaks through the time limitation of serial programming and is of considerable innovative significance.
(2) The execution efficiency is high, and real-time processing can be reached to a certain extent. The method uses the floating-point and parallel computing capability of the GPU to raise the execution efficiency of the guided filtering algorithm effectively while guaranteeing the filtering result, so that the algorithm runs quickly.
(3) The implementation is simple and the hardware requirement is low. The GPU parallel architecture is invoked from a C language environment, the code is easy to write, and large-scale data can be processed on consumer-grade GPU hardware.
Brief description of the drawings
Fig. 1 is a flowchart of the guided filtering algorithm of the present invention;
Fig. 2 is a comparison of image smoothing results:
(a) input image, (b) output image of the C program, (c) output image of the CUDA program;
Fig. 3 is a comparison of image feathering results:
(a) input image, (b) guidance image, (c) output image of the C program, (d) output image of the CUDA program;
Fig. 4 is a comparison of image enhancement results:
(a) input image, (b) output image of the C program, (c) output image of the CUDA program;
Fig. 5 is a comparison of flash denoising results:
(a) input image, (b) guidance image, (c) output image of the C program, (d) output image of the CUDA program.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below.
Over the past decade, the graphics processing unit (GPU) has developed from a special-purpose device that only processed computer graphics into a highly parallel, multithreaded, many-core processor. The computing power of mainstream GPUs has long exceeded that of mainstream general-purpose CPUs, and the trend suggests that the gap will keep growing. The Compute Unified Device Architecture (CUDA), released by NVIDIA, is a software and hardware architecture that uses the GPU as a data-parallel computing device, and it is a complete GPGPU (general-purpose computing on graphics processing units) solution. The appearance of CUDA has reduced the difficulty for programmers of using the GPU for general-purpose computation. Owing to CUDA's programming model and memory model, a large number of complicated but similar operations can be handled simultaneously by many threads, which greatly reduces the execution time of a program.
For this purpose, the present invention proposes a CUDA-based acceleration and optimization method for guided filtering. CUDA is used to build a cooperative CPU and GPU working environment, with the CPU acting as the host responsible for logic-heavy transaction processing and serial computation, and the GPU acting as the coprocessor responsible for highly threaded parallel processing. CUDA parallel programming is used to compute the sum of pixel values over the image neighborhood window and from it the image neighborhood mean. Registers and texture memory are used and the algorithm steps are optimized to obtain the key guided filtering parameters, thereby optimizing the algorithm as a whole. The technical solution is as follows:
Embodiment 1
A CUDA-based acceleration and optimization method for guided filtering, referring to Fig. 1, comprises the following steps:
101: The input image p and the guidance image I are read from host memory into global memory, and a first kernel function is constructed to obtain the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I, respectively;
102: A second kernel function is constructed to compute, in turn, the covariance of the images (I, p) and the variance of the guidance image I, and then the key filtering parameters a and b;
103: The first kernel function is called to compute the neighborhood mean mean_a of parameter a and the neighborhood mean mean_b of parameter b, the final filtering result is obtained, and the result is saved in the corresponding global memory and transferred out to host memory.
The step of obtaining, by constructing the first kernel function, the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I is specifically:
the computation of the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I is converted into the computation of sums of pixel values over the image neighborhood window, respectively;
each sum of neighborhood-window pixel values is computed by constructing the first kernel function.
Further, the step of computing each sum of neighborhood-window pixel values by constructing the first kernel function is specifically:
the sum of neighborhood-window pixel values is computed with an integral image, with CUDA parallel optimization carried out by the 4 kernel functions contained in the first kernel function.
The acceleration and optimization method further comprises:
calling the 4 kernel functions of the first kernel function to obtain the number of pixels N in the neighborhood window, and saving it in constant memory.
The method uses CUDA programming to optimize the guided filtering algorithm in parallel. While preserving the quality of the filtering result, it greatly improves the execution efficiency of the algorithm and achieves real-time processing to a certain extent.
The method of Embodiment 1 is described in detail below with reference to specific formulas and computation steps:
Embodiment 2
The guided filtering algorithm is based on a local linear model. In this model, let the input image be p, the guidance image be I, and the filtered output image be q. The local linear model assumes that the following linear relationship holds in the neighborhood window $\omega_k$ centered at pixel k:

$$q_i = a_k I_i + b_k, \quad \forall i \in \omega_k \tag{1}$$

where $\omega_k$ is a square window with side length r, $a_k$ and $b_k$ are the linear coefficients in the neighborhood window $\omega_k$, $I_i$ is the pixel value of the guidance image in $\omega_k$, and $q_i$ is the filtered output in $\omega_k$. The coefficients $a_k$ and $b_k$ are determined by minimizing the difference between the input image p and the output image q, that is, by minimizing formula (2):

$$E(a_k, b_k) = \sum_{i \in \omega_k} \left( (a_k I_i + b_k - p_i)^2 + \epsilon a_k^2 \right) \tag{2}$$

In formula (2), $E(a_k, b_k)$ is the cost function of the neighborhood window $\omega_k$, $p_i$ is the pixel value of the input image in $\omega_k$, and $\epsilon$ is a regularization parameter whose purpose is to prevent $a_k$ from becoming too large. Solving by linear regression gives:

$$a_k = \frac{\frac{1}{|\omega|}\sum_{i \in \omega_k} I_i p_i - \mu_k \bar{p}_k}{\sigma_k^2 + \epsilon} \tag{3}$$

$$b_k = \bar{p}_k - a_k \mu_k \tag{4}$$

where $\mu_k$ and $\sigma_k^2$ are the mean and variance of the guidance image I in the neighborhood window $\omega_k$, $|\omega|$ is the number of pixels in $\omega_k$, and $\bar{p}_k = \frac{1}{|\omega|}\sum_{i \in \omega_k} p_i$ is the mean of the input image p in $\omega_k$.
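The closed-form solution referred to as formulas (3) and (4) can be checked with a short derivation. The following is a sketch of the standard least-squares argument, using only the symbols defined above; it is not text from the original specification. Setting the partial derivatives of the cost function of formula (2) to zero gives:

```latex
\begin{aligned}
\frac{\partial E}{\partial b_k} &= 2\sum_{i\in\omega_k}\left(a_k I_i + b_k - p_i\right) = 0
  \;\Longrightarrow\; b_k = \bar{p}_k - a_k\mu_k, \\
\frac{\partial E}{\partial a_k} &= 2\sum_{i\in\omega_k}\left[\left(a_k I_i + b_k - p_i\right) I_i + \epsilon a_k\right] = 0 .
\end{aligned}
% Substitute b_k into the second equation and divide by |\omega|:
\begin{aligned}
a_k\left(\frac{1}{|\omega|}\sum_{i\in\omega_k} I_i^2 - \mu_k^2 + \epsilon\right)
  &= \frac{1}{|\omega|}\sum_{i\in\omega_k} I_i p_i - \mu_k\bar{p}_k, \\
\sigma_k^2 &= \frac{1}{|\omega|}\sum_{i\in\omega_k} I_i^2 - \mu_k^2
  \;\Longrightarrow\;
a_k = \frac{\frac{1}{|\omega|}\sum_{i\in\omega_k} I_i p_i - \mu_k\bar{p}_k}{\sigma_k^2+\epsilon}.
\end{aligned}
```

Every quantity on the right-hand side is a window mean, which is why the rest of the method concentrates on computing neighborhood means quickly.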
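Since each pixel is contained in multiple neighborhood windows $\omega_k$, and the $q_i$ computed in different windows differ, $q_i$ must be averaged. After $a_k$ and $b_k$ have been computed for all windows, the filtered output is given by formula (5):

$$q_i = \frac{1}{|\omega|}\sum_{k:\, i \in \omega_k} \left(a_k I_i + b_k\right) = \bar{a}_i I_i + \bar{b}_i \tag{5}$$

where $\bar{a}_i = \frac{1}{|\omega|}\sum_{k \in \omega_i} a_k$ and $\bar{b}_i = \frac{1}{|\omega|}\sum_{k \in \omega_i} b_k$ are the averages of $a_k$ and $b_k$ over all overlapping neighborhood windows at point i.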
Analysis of formulas (3) and (4) shows that $\mu_k$, $\bar{p}_k$, and $\frac{1}{|\omega|}\sum_{i \in \omega_k} I_i p_i$ are the means of the guidance image I, the input image p, and I × p in the neighborhood window $\omega_k$, respectively, and that $\sigma_k^2$ is the variance of I in $\omega_k$. Variance and mean are related by $D(X) = E(X^2) - (E(X))^2$, so the variance can also be computed from means. In the guided filtering algorithm the image neighborhood mean therefore has to be computed repeatedly and is the most time-consuming part of the whole algorithm. How to compute the neighborhood-window mean of an image quickly is thus the key to implementing the guided filtering algorithm, and it is a main focus of the CUDA optimization of the present invention.
The present invention constructs the first kernel function according to formula (6) to compute the image neighborhood mean.
mean_p = boxfilter(p, r) / N (6)
where mean_p is the neighborhood mean of the input image p, boxfilter(p, r) is the sum of the pixel values of the input image p within the neighborhood window, N is the number of pixels in the neighborhood window, and r is the side length of the neighborhood window. The number of pixels N can be obtained by computing the neighborhood-window pixel sum of an all-ones matrix of the same size as the image. This computation step is well known to those skilled in the art and is not described further in the embodiments of the present invention.
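Formula (6) and the all-ones trick for N can be illustrated with a small CPU reference. The sketch below is plain Python, not the CUDA kernel of the invention. The function names are hypothetical, and it assumes that r is used as a half-width, so the window spans 2r+1 pixels and is clipped at the image border.

```python
def boxfilter_naive(img, r):
    """Sum of pixel values in the window around each pixel.

    Assumption of this sketch: r is a half-width, so the window covers
    rows y-r..y+r and columns x-r..x+r, clipped at the image border.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for yy in range(max(0, y - r), min(h, y + r + 1)):
                for xx in range(max(0, x - r), min(w, x + r + 1)):
                    total += img[yy][xx]
            out[y][x] = total
    return out


def neighborhood_mean(img, r):
    """mean = boxfilter(img, r) / N, with N taken from an all-ones image."""
    h, w = len(img), len(img[0])
    ones = [[1.0] * w for _ in range(h)]
    n = boxfilter_naive(ones, r)   # per-pixel window size, smaller near borders
    s = boxfilter_naive(img, r)    # per-pixel window sum
    return [[s[y][x] / n[y][x] for x in range(w)] for y in range(h)]
```

Because N depends only on the image size and r, it can be computed once and reused for every mean, which matches the choice in the text to keep it in constant memory.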
With the above method, the computation of the image neighborhood mean is converted into the computation of the sum of pixel values over the image neighborhood window, which is convenient for CUDA parallel processing. The present invention computes the neighborhood-window pixel sum with an integral image, with CUDA parallel optimization carried out by the 4 kernel functions contained in the first kernel function. The specific steps are as follows (assuming the data used is already in GPU device memory):
(I) The 1st kernel function computes in parallel, for the i-th column of the image (1 ≤ i ≤ image width), the sum of the pixels from row 1 to row j (1 ≤ j ≤ image height). Its launch parameters are a block dimension of 1024 × 1 and a grid dimension of 1 × 1. Each thread computes one column of the image in a loop and keeps intermediate data in registers during the loop; the data reads then satisfy coalesced global memory access.
(II) The data produced by the 1st kernel function requires boundary processing. The 2nd kernel function handles the boundary problem of the data row by row. Its launch parameters are a block dimension of 16 × 16 and a grid dimension of ((image width + dimBlock.x - 1)/dimBlock.x) × ((image height + dimBlock.y - 1)/dimBlock.y) blocks, where dimBlock.x is the dimension of the thread block along the x axis and dimBlock.y is the dimension of the thread block along the y axis.
(III) The 3rd kernel function computes in parallel, for the j-th row of the image (1 ≤ j ≤ image height), the sum of the pixels from column 1 to column i (1 ≤ i ≤ image width). To avoid the penalty of non-coalesced access when reading data, the input data of this kernel function is first transposed, the 1st kernel function is then called to perform the computation, and the result is written back column by column.
(IV) The data produced by the 3rd kernel function also requires boundary processing. The 4th kernel function handles the boundary problem of the data column by column, with the same launch parameters as the 2nd kernel function. Its output is the neighborhood-window pixel sum of the image that was input to the 1st kernel function, and it is saved in the corresponding global memory.
In the same way, the neighborhood mean mean_I of the guidance image I, the neighborhood mean mean_Ip of the image I*p, and the neighborhood mean mean_II of the image I*I can be obtained in turn.
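The four passes above can be mirrored by a sequential CPU reference, which is useful for checking GPU output. The sketch below is plain Python with a hypothetical function name. It assumes r is a half-width with the window clipped at the border, and it reads "boundary processing" as taking differences of cumulative sums. It does not model the transpose, registers, or launch parameters of the CUDA kernels.

```python
def boxfilter_cumsum(img, r):
    """Window sums via two cumulative-sum passes, each followed by a
    difference pass that handles the image border."""
    h, w = len(img), len(img[0])

    # Pass 1 (cf. kernel 1): running sum down every column.
    col = [row[:] for row in img]
    for y in range(1, h):
        for x in range(w):
            col[y][x] += col[y - 1][x]

    # Pass 2 (cf. kernel 2): vertical window sum as a difference of two
    # running sums; rows outside the image contribute nothing.
    vert = [[0.0] * w for _ in range(h)]
    for y in range(h):
        top = y - r - 1            # last row above the window
        bot = min(h - 1, y + r)    # last row inside the window
        for x in range(w):
            vert[y][x] = col[bot][x] - (col[top][x] if top >= 0 else 0.0)

    # Pass 3 (cf. kernel 3): running sum along every row.
    rowc = [row[:] for row in vert]
    for y in range(h):
        for x in range(1, w):
            rowc[y][x] += rowc[y][x - 1]

    # Pass 4 (cf. kernel 4): horizontal window sum as a difference.
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            left = x - r - 1
            right = min(w - 1, x + r)
            out[y][x] = rowc[y][right] - (rowc[y][left] if left >= 0 else 0.0)
    return out
```

The cost per pixel is independent of r, which is the reason for preferring cumulative sums over direct window summation. In passes 1 and 3 each column or row is independent of the others, which is what allows one GPU thread per column.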
It is worth noting that the CUDA programming model is one of cooperation between CPU and GPU. A traditional CPU, constrained by its hardware architecture, cannot use its resources efficiently for this kind of general-purpose computation, whereas with CUDA the GPU can execute not only traditional graphics computation but also general-purpose computation efficiently. To reduce data transfer time as far as possible and raise computation speed, the present invention performs only 2 data transfers between CPU memory and GPU device memory: the input image p and the guidance image I are transferred from host memory to device memory, and the output image q is transferred from device memory to host memory. The specific steps are as follows:
(I) Build the cooperative CPU and GPU working environment with CUDA.
(II) Read the input image p and the guidance image I from host memory into the global memory of the device, and bind them to texture memory.
(III) Allocate threads. The kernel launch parameters are set to 16 × 16 threads per block and ((image width + dimBlock.x - 1)/dimBlock.x) × ((image height + dimBlock.y - 1)/dimBlock.y) blocks per grid, which divides the image into a checkerboard of tiles. Here dimBlock.x is the dimension of the thread block along the x axis and dimBlock.y is the dimension of the thread block along the y axis.
(IV) Call the first kernel function (comprising 4 kernel functions) to compute the neighborhood-window pixel sum of an all-ones matrix of the same size as the image, obtaining the number of pixels N in the neighborhood window, and save it in constant memory.
(V) Call the first kernel function and, using N from constant memory, compute in turn the neighborhood means of the input image p and the guidance image I, the neighborhood mean mean_Ip of the image I*p, and the neighborhood mean mean_II of the image I*I, and store each result in the corresponding global memory.
(VI) Construct the second kernel function to compute, in turn, the covariance cov_Ip of the images (I, p) and the variance var_I of the image I, and then construct kernel functions to compute the key filtering parameters a and b.
That is, a covariance kernel function is constructed according to the formula cov_Ip = mean_Ip - mean_I.*mean_p to compute the covariance of the images (I, p);
a variance kernel function is constructed according to the formula var_I = mean_II - mean_I.*mean_I to compute the variance of the image I;
a parameter-a kernel function is constructed according to the formula a = cov_Ip./(var_I + ε) to compute the key filtering parameter a;
a parameter-b kernel function is constructed according to the formula b = mean_p - a.*mean_I to compute the key filtering parameter b.
(VII) Call the first kernel function to compute the neighborhood-window means of the key filtering parameters a and b, compute the final filtering result q, and save the result in the corresponding global memory.
That is, the first kernel function is called according to the formula mean_a = boxfilter(a, r)/N to obtain the neighborhood mean of the key parameter a;
the first kernel function is called according to the formula mean_b = boxfilter(b, r)/N to obtain the neighborhood mean of the key parameter b;
an output-q kernel function is constructed according to the formula q = mean_a.*I + mean_b to obtain the final filtering result q, and the result is saved in the corresponding global memory.
(VIII) Transfer the filtered image stored in the global memory of the device out to host memory.
In addition, when the guided filtering algorithm is used for image feathering, the process differs from the above as follows:
(I) Copying the r, g, b component data of the input image p and the guidance image I from CPU memory into GPU device memory involves multiple data transfers, so the present invention uses CUDA streams. Data copy operations and kernel function execution are then interleaved, which improves the utilization of GPU resources. The advantage of CUDA streams is especially clear when the amount of data is large.
(II) When solving for the key parameter a, the present invention uses a single kernel function to compute the three components r, g, b of a. The launch parameters are set to a block dimension of 16 × 16, and the blocks are arranged with two-dimensional addressing. Each thread first loads the data in global memory var_I_rr, var_I_rg, var_I_rb, var_I_gg, var_I_gb, and var_I_bb into registers in turn, builds the 3 × 3 Sigma matrix, computes its determinant, and keeps the result in a register. Each thread then performs the inversion of the Sigma matrix and the multiplication of the cov_Ip vector by the inverse matrix in one combined computation, which increases the computational intensity of the program and makes full use of the computing performance of the GPU.
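The per-pixel work of step (II) can be sketched as follows: the six distinct entries of the symmetric 3 × 3 Sigma matrix are read once, the determinant and cofactors are formed, and the product of cov_Ip with the inverse is computed in a single pass. This is a plain-Python illustration with a hypothetical function name. Adding eps to the diagonal of Sigma is an assumption of the sketch, made by analogy with the single-channel formula a = cov_Ip./(var_I + ε).

```python
def solve_a_rgb(var_rr, var_rg, var_rb, var_gg, var_gb, var_bb,
                cov_r, cov_g, cov_b, eps):
    """Return (a_r, a_g, a_b) = cov_Ip * inverse(Sigma + eps * Identity)."""
    # Regularized symmetric matrix; only 6 distinct entries are needed.
    s00, s01, s02 = var_rr + eps, var_rg, var_rb
    s11, s12 = var_gg + eps, var_gb
    s22 = var_bb + eps

    # Cofactors of a symmetric 3x3 matrix (the adjugate is symmetric too).
    c00 = s11 * s22 - s12 * s12
    c01 = s02 * s12 - s01 * s22
    c02 = s01 * s12 - s02 * s11
    c11 = s00 * s22 - s02 * s02
    c12 = s01 * s02 - s00 * s12
    c22 = s00 * s11 - s01 * s01

    # Determinant by expansion along the first row.
    det = s00 * c00 + s01 * c01 + s02 * c02
    inv_det = 1.0 / det

    # Row vector cov_Ip times the inverse (adjugate / determinant).
    a_r = (cov_r * c00 + cov_g * c01 + cov_b * c02) * inv_det
    a_g = (cov_r * c01 + cov_g * c11 + cov_b * c12) * inv_det
    a_b = (cov_r * c02 + cov_g * c12 + cov_b * c22) * inv_det
    return a_r, a_g, a_b
```

Keeping all intermediate values in local variables corresponds to the register usage described above: each thread touches global memory only for the nine inputs and the three outputs.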
Embodiment 3
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solution of the present invention is described in further detail below with reference to specific examples.
The examples use the Windows 7 operating system. The CPU is an Intel Core i5-3470 with a clock frequency of 3.2 GHz, and the system memory is 4 GB. The GPU is an NVIDIA GeForce GTX660, which contains 5 streaming multiprocessors (SMs), each with 192 CUDA cores; the onboard global memory is 2048 MB, the memory bus width is 192 bits, and the supported CUDA Compute Capability is 3.0. The Visual Profiler shipped with the CUDA Toolkit is used to collect and analyze all data, giving a quantitative analysis of program performance.
To verify the validity of the method, the examples apply CUDA parallel optimization to the guided filtering algorithm in 4 application fields: image smoothing, image feathering, image enhancement, and flash denoising. The filtering results and the speedup table are as follows:
Example 1: image smoothing
In this example, the filter radius r is 16 and the filter parameter eps is 0.04. The input image p and the guidance image I are the same image, and the output image q is the final result. The result of Example 1 is shown in Fig. 2. As can be seen from Fig. 2, the details, abrupt changes, edges, and noise in the input image p are all suppressed to a certain degree, giving a satisfactory smoothing result.
Example 2: image feathering
In this example, the filter radius r is 60 and the filter parameter eps is 0.000001. The input image p and the guidance image I are two different images, and the output image q is the final result. The result of Example 2 is shown in Fig. 3. As can be seen from Fig. 3, the feathering effect in the output image is clear, and the edge region changes gradually, giving a natural transition.
Example 3: image enhancement
In this example, the filter radius r is 16 and the filter parameter eps is 0.01. The input image p and the guidance image I are the same image, and the output image q is the final result. The result of Example 3 is shown in Fig. 4. As can be seen from Fig. 4, both the global and the local features of the output image are clearly enhanced, which effectively improves the discernibility of image detail.
Example 4: flash denoising
In this example, the filter radius r is 8 and the filter parameter eps is 0.0004. The input image p and the guidance image I are two different images, and the output image q is the final result. The result of Example 4 is shown in Fig. 5. As can be seen from Fig. 5, the color and the denoising result of the output image q look natural and consistent, giving a satisfactory result.
In addition, Figs. 2 to 5 show that in all 4 applications (smoothing, feathering, enhancement, and flash denoising) the results of the present invention are essentially the same as those of the original algorithm implemented as a C program, which demonstrates the accuracy of the present invention. To compare the acceleration achieved by the present invention, the guided filtering algorithm was implemented in Matlab, in C, and in CUDA, and the implementations were compared experimentally. The time consumption and speedup for images of different resolutions are shown in Table 1:
Table 1: Time consumption (ms) and speedup of the guided filtering algorithm implemented with different programs
As can be seen from Table 1, compared with the Matlab and C implementations of the guided filtering algorithm, the time consumption of the CUDA parallel implementation of the present invention is greatly reduced. The acceleration of image feathering is especially pronounced, with a speedup of more than 60 times. It can also be seen that the acceleration of the present invention becomes more pronounced as the image resolution increases.
References:
[1] Petschnigg G, Szeliski R, Agrawala M, et al. Digital photography with flash and no-flash image pairs[J]. ACM Transactions on Graphics (TOG), 2004, 23(3): 664-672.
Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.