Summary of the invention
The invention provides a CUDA-based guided filtering acceleration and optimization method. While preserving image filtering quality, the method improves computational efficiency and reduces computational complexity, as described below:
A CUDA-based guided filtering acceleration and optimization method, comprising the following steps:
reading the input image p and the guidance image I from host memory into global memory, and building a first kernel function to obtain the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I;
building a second kernel function to compute, in turn, the covariance of the image pair (I, p) and the variance of the guidance image I, and then computing the key filtering parameters a and b;
calling the first kernel function to compute the neighborhood mean mean_a of parameter a and the neighborhood mean mean_b of parameter b, obtaining the final filtering result, saving the result to the corresponding global memory, and transferring it back to host memory.
The step of building the first kernel function to obtain the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I is specifically:
converting the computation of each neighborhood mean into a summation of pixel values over the neighborhood window, for the input image p, the guidance image I, the image I*p, and the image I*I respectively;
building the first kernel function to compute each of these neighborhood-window pixel-value summations.
The step of building the first kernel function to compute the neighborhood-window pixel-value summations is specifically:
using an integral image to realize the summation of neighborhood-window pixel values, with the CUDA parallel optimization carried out by 4 kernel functions within the first kernel function.
The guided filtering acceleration and optimization method further comprises: calling the 4 kernel functions within the first kernel function to obtain the neighborhood-window pixel count N, and saving it in constant memory.
The beneficial effects of the technical scheme provided by the invention are:
Building on an in-depth study of the guided filtering algorithm, the present invention implements guided filtering with CUDA programming and compares it experimentally against C-program and Matlab-program implementations in four application instances: image smoothing, image feathering, image enhancement, and flash denoising. Compared with the prior art, the present invention has the following advantages:
(1) The approach is novel: it applies the CUDA architecture to the design of the guided filtering algorithm, breaking through the time limitations of serial programming, which is of considerable innovative significance.
(2) Execution efficiency is high, achieving real-time processing to a certain extent. The method exploits the GPU's strengths in floating-point computation and parallel computation; while preserving the image filtering effect, it effectively improves the execution efficiency of the guided filtering algorithm and realizes fast guided filtering.
(3) It is simple to implement and has low hardware requirements: the GPU parallel architecture is invoked entirely from a C-language environment, the code is easy to write, and large-scale data can be processed on consumer-grade GPU hardware.
Embodiment
To make the objects, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below.
Over the past decade or so, the graphics processing unit (Graphics Processing Unit, GPU) has evolved from a specialized device for processing computer graphics into a highly parallel, multithreaded, many-core processor. The arithmetic capability of current mainstream GPUs already exceeds that of mainstream general-purpose CPUs, and judging from the development trend, the gap will keep widening. The Compute Unified Device Architecture (CUDA) is a hardware and software architecture released by NVIDIA that uses the GPU for data-parallel computation; it is a complete GPGPU (general-purpose computing on graphics processing units) solution. The appearance of CUDA lowers the difficulty for programmers of using the GPU for general-purpose computing. Because of CUDA's particular programming model and data-storage methods, large amounts of similar and complex operations can be processed by threads simultaneously, greatly reducing program execution time.
To this end, the present invention proposes a CUDA-based guided filtering acceleration and optimization method. CUDA is used to build a cooperative CPU-GPU working environment, with the CPU as the host responsible for logic-heavy transaction processing and serial computation, and the GPU as the coprocessor responsible for highly threaded parallel processing. CUDA parallel programming is used to realize the summation of image neighborhood-window pixel values and thereby obtain image neighborhood means; registers and texture memory are used at the same time to optimize the algorithm steps and obtain the key filtering parameters, thereby achieving a global optimization of the algorithm. The technical scheme of the present invention is as follows:
Embodiment 1
A CUDA-based guided filtering acceleration and optimization method, referring to Fig. 1, comprises the following steps:
101: reading the input image p and the guidance image I from host memory into global memory, and building a first kernel function to obtain the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I;
102: building a second kernel function to compute, in turn, the covariance of the image pair (I, p) and the variance of the guidance image I, and then computing the key filtering parameters a and b;
103: calling the first kernel function to compute the neighborhood mean mean_a of parameter a and the neighborhood mean mean_b of parameter b, obtaining the final filtering result, saving the result to the corresponding global memory, and transferring it back to host memory.
Wherein, the step of building the first kernel function to obtain the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I is specifically:
converting the computation of each neighborhood mean into a summation of pixel values over the neighborhood window, for the input image p, the guidance image I, the image I*p, and the image I*I respectively;
building the first kernel function to compute each of these neighborhood-window pixel-value summations.
Further, the step of building the first kernel function to compute the neighborhood-window pixel-value summations is specifically:
using an integral image to realize the summation of neighborhood-window pixel values, with the CUDA parallel optimization carried out by 4 kernel functions within the first kernel function.
This guided filtering acceleration and optimization method further comprises:
calling the 4 kernel functions within the first kernel function to obtain the neighborhood-window pixel count N, and saving it in constant memory.
This method uses CUDA programming to parallelize and optimize the guided filtering algorithm; while preserving the filtering effect, it greatly improves the execution efficiency of the guided filtering algorithm and, to a certain extent, achieves real-time guided filtering.
Below in conjunction with concrete computing formula, calculation procedure, the method in embodiment 1 is described, described below:
Embodiment 2
The guided filtering algorithm is based on a local linear model. In the local linear model, let the input image be p, the guidance image be I, and the filtering output image be q. The local linear model assumes that, within the neighborhood window $\omega_k$ centered at pixel k, the following linear relationship holds:

$q_i = a_k I_i + b_k, \quad \forall i \in \omega_k$ (1)

where $\omega_k$ is a square window of side length r, $a_k$ and $b_k$ are linear coefficients in the neighborhood window $\omega_k$, $I_i$ is the pixel value of the guidance image in $\omega_k$, and $q_i$ is the filtering output in $\omega_k$. The coefficients $a_k$ and $b_k$ are determined by minimizing the difference between the input image p and the output image q, i.e. by making formula (2) reach its minimum:

$E(a_k, b_k) = \sum_{i \in \omega_k} \left( (a_k I_i + b_k - p_i)^2 + \varepsilon a_k^2 \right)$ (2)

In formula (2), $E(a_k, b_k)$ is the output of the cost function in the neighborhood window $\omega_k$, $p_i$ is the pixel value of the input image in $\omega_k$, and $\varepsilon$ is a penalty variance-adjustment parameter whose purpose is to prevent the value of $a_k$ from becoming too large. Solving the above formula by linear regression gives:

$a_k = \dfrac{\frac{1}{|\omega|} \sum_{i \in \omega_k} I_i p_i - \mu_k \bar{p}_k}{\sigma_k^2 + \varepsilon}$ (3)

$b_k = \bar{p}_k - a_k \mu_k$ (4)

In these formulas, $\mu_k$ and $\sigma_k^2$ are respectively the mean and variance of the guidance image I in the neighborhood window $\omega_k$, $|\omega|$ is the number of pixels in $\omega_k$, and $\bar{p}_k$ is the mean of the input image p in $\omega_k$.

Because each pixel can be contained in multiple neighborhood windows $\omega_k$, and the $q_i$ computed in different windows $\omega_k$ differ, the $q_i$ must be averaged: after computing $a_k$ and $b_k$ in all windows, the filtering output is given by formula (5):

$q_i = \bar{a}_i I_i + \bar{b}_i$ (5)

where $\bar{a}_i$ and $\bar{b}_i$ are the mean values of $a_k$ and $b_k$ over all neighborhood windows overlapping pixel i, i.e. $\bar{a}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} a_k$ and $\bar{b}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} b_k$.

Analysis of formulas (3) and (4) shows that $\mu_k$, $\bar{p}_k$ and $\frac{1}{|\omega|}\sum_{i \in \omega_k} I_i p_i$ are respectively the means of the guidance image I, the input image p, and the product image I×p in the neighborhood window $\omega_k$, and $\sigma_k^2$ is the variance of I in $\omega_k$. Variance and mean satisfy the relation $D(X) = E(X^2) - (EX)^2$, so the variance can also be computed from means. Therefore, in the guided filtering algorithm the image neighborhood means must be computed repeatedly, which is the most time-consuming part of the whole algorithm. How to compute the neighborhood-window mean of an image quickly thus becomes the key to implementing the guided filtering algorithm, and is also the focus of the CUDA optimization of the present invention.
The present invention builds the first kernel function according to formula (6) to realize the computation of the image neighborhood mean:
mean_p = boxfilter(p, r) / N (6)
Here, mean_p denotes the neighborhood mean of the input image p, boxfilter(p, r) denotes the sum of the pixel values of the input image p within the neighborhood window, N denotes the number of pixels in the neighborhood window, and r denotes the neighborhood window side length. The neighborhood-window pixel count N is obtained by computing the neighborhood-window pixel sum of an all-ones matrix of the same size as the image. This computation step is well known to those skilled in the art and is not repeated in this embodiment of the present invention.
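As a hedged illustration of how N is obtained, the following Python sketch box-filters an all-ones matrix with a brute-force window sum (the function name and the border-clipping behavior are assumptions for illustration, not the CUDA kernel itself):

```python
def box_sum(img, r):
    """Sum of pixel values in the (2r+1) x (2r+1) window around each
    pixel, clipped at the image borders (brute force, for clarity)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = 0.0
            for j in range(max(0, y - r), min(h, y + r + 1)):
                for i in range(max(0, x - r), min(w, x + r + 1)):
                    s += img[j][i]
            out[y][x] = s
    return out

# N: window pixel count, obtained by box-filtering an all-ones matrix
h, w, r = 5, 5, 1
ones = [[1.0] * w for _ in range(h)]
N = box_sum(ones, r)
# interior pixel: full (2r+1)^2 = 9 window; corner: only 4 pixels
assert N[2][2] == 9.0 and N[0][0] == 4.0
```

Dividing any box-filtered image by this N, as in formula (6), yields a correctly normalized mean even at the image borders, where the window is clipped.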
With the above method, the computation of the image neighborhood mean is converted into the summation of image neighborhood-window pixel values, which is convenient for CUDA parallel processing. The present invention uses an integral image to realize the summation of neighborhood-window pixel values, with the CUDA parallel optimization carried out by 4 kernel functions within the first kernel function. The specific implementation steps are as follows (assuming the data used have already been placed in GPU device memory):
(I) The 1st kernel function is responsible for computing in parallel, for the i-th column of the image (1 ≤ i ≤ image width), the pixel sums from the 1st row to the j-th row (1 ≤ j ≤ image height). Its launch parameters are a block dimension of 1024 × 1 and a grid dimension of 1 × 1. Each thread completes the computation of one column of image data through a loop, registers are used to hold intermediate data inside the loop, and the data reads at this point satisfy coalesced global-memory access.
(II) The data produced by the 1st kernel function require boundary processing. The 2nd kernel function handles the boundary problem row by row; its launch parameters are a block dimension of 16 × 16 and a grid dimension of ((image width + dimBlock.x − 1)/dimBlock.x) × ((image height + dimBlock.y − 1)/dimBlock.y) blocks, where dimBlock.x denotes the dimension of a thread block along the x-axis and dimBlock.y denotes the dimension of a thread block along the y-axis.
(III) The 3rd kernel function is responsible for computing in parallel, for the j-th row of the image (1 ≤ j ≤ image height), the pixel sums from the 1st column to the i-th column (1 ≤ i ≤ image width). To eliminate the penalty of uncoalesced access during data reads, the input data of this kernel function are first transposed, the 1st kernel function is then called to perform the computation, and the results are written back by rows when stored.
(IV) The data produced by the 3rd kernel function also require boundary processing. The 4th kernel function handles the boundary problem column by column, with the same launch parameters as the 2nd kernel function. The output data at this point are the neighborhood-window pixel sums of the input image of the 1st kernel function, and are saved to the corresponding global memory.
In the same way, the neighborhood mean mean_I of the guidance image I, the neighborhood mean mean_Ip of the image I*p, and the neighborhood mean mean_II of the image I*I can be obtained in turn.
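The four-step decomposition above (column prefix sums, boundary handling, row prefix sums after transposition, boundary handling) can be sketched serially in Python; this is a CPU illustration of the integral-image idea under simplified assumptions, not the CUDA kernels themselves:

```python
def column_prefix_sums(img):
    # analogue of kernel 1: each column accumulated top-to-bottom
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for x in range(w):
        for y in range(1, h):
            out[y][x] += out[y - 1][x]
    return out

def row_prefix_sums(img):
    # analogue of kernel 3: each row accumulated left-to-right
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(1, w):
            out[y][x] += out[y][x - 1]
    return out

def window_sum_from_integral(ii, y0, y1, x0, x1):
    # analogue of the boundary kernels 2/4: boundary-aware differences
    # of the integral image give any window sum in O(1) per pixel
    s = ii[y1][x1]
    if y0 > 0: s -= ii[y0 - 1][x1]
    if x0 > 0: s -= ii[y1][x0 - 1]
    if y0 > 0 and x0 > 0: s += ii[y0 - 1][x0 - 1]
    return s

img = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 9.0]]
ii = row_prefix_sums(column_prefix_sums(img))
# sum of the 2x2 bottom-right block: 5 + 6 + 8 + 9 = 28
assert window_sum_from_integral(ii, 1, 2, 1, 2) == 28.0
```

In the CUDA version, the two accumulation passes run one thread per column (or per transposed row) so that global-memory reads stay coalesced, which a serial sketch cannot show.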
It is worth explaining here that the CUDA programming model is one of CPU-GPU cooperation. Constrained by its hardware architecture, a traditional CPU cannot use its resources effectively for general-purpose computation, whereas with CUDA the GPU can not only perform traditional graphics computation but also perform general-purpose computation efficiently. To reduce the time consumed by data transfers as far as possible and improve computation speed, the present invention performs only 2 data transfers between CPU memory and GPU device memory: the input image p and the guidance image I are transferred from host memory into device memory, and the output image q is transferred from device memory back into host memory. The specific steps are as follows:
(I) CUDA is used to build a cooperative CPU-GPU working environment;
(II) The input image p and the guidance image I are read from host memory into the global memory of the device, and bound to texture memory.
(III) Thread counts are allocated: the kernel launch parameters assign 16 × 16 threads to each block and ((image width + dimBlock.x − 1)/dimBlock.x) × ((image height + dimBlock.y − 1)/dimBlock.y) blocks to each grid, dividing the image in a checkerboard pattern. Here dimBlock.x denotes the dimension of a thread block along the x-axis and dimBlock.y denotes the dimension of a thread block along the y-axis.
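The grid-dimension expression above is an ordinary ceiling division; a small Python check (the 16 × 16 block size follows the text, the image size is an arbitrary example):

```python
def grid_dim(image_extent, block_extent):
    # (extent + block - 1) // block: number of thread blocks needed
    # to cover the image along one axis (ceiling division)
    return (image_extent + block_extent - 1) // block_extent

# e.g. a 1920 x 1080 image covered by 16 x 16 thread blocks
assert grid_dim(1920, 16) == 120
assert grid_dim(1080, 16) == 68   # 1080/16 = 67.5, rounded up
```

The rounding up is why the boundary kernels are needed: threads in the last partial block along each axis fall outside the image and must be masked off.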
(IV) The first kernel function (i.e. comprising the 4 kernel functions) is called to compute the neighborhood-window pixel sum of an all-ones matrix of the same size as the image, obtaining the neighborhood-window pixel count N, which is saved in constant memory.
(V) The first kernel function and the N in constant memory are used to compute, in turn, the neighborhood means of the input image p and the guidance image I, the neighborhood mean mean_Ip of the image I*p, and the neighborhood mean mean_II of the image I*I, and the results are saved in turn to the corresponding global memory.
(VI) The second kernel function is built to compute, in turn, the covariance cov_Ip of the image pair (I, p) and the variance var_I of the image I, and kernel functions are then built to compute the key filtering parameters a and b.
That is, a covariance kernel function is built according to the formula cov_Ip = mean_Ip − mean_I .* mean_p to compute the covariance of the image pair (I, p);
a variance kernel function is built according to the formula var_I = mean_II − mean_I .* mean_I to compute the variance of the image I;
a parameter-a kernel function is built according to the formula a = cov_Ip ./ (var_I + ε) to compute the key filtering parameter a;
a parameter-b kernel function is built according to the formula b = mean_p − a .* mean_I to compute the key filtering parameter b.
(VII) The first kernel function is called to compute the neighborhood-window means of the key filtering parameters a and b, the final filtering result q is obtained, and the result is saved to the corresponding global memory.
That is, the first kernel function is called according to the formula mean_a = boxfilter(a, r)/N to obtain the neighborhood mean of the key parameter a;
the first kernel function is called according to the formula mean_b = boxfilter(b, r)/N to obtain the neighborhood mean of the key parameter b;
an output-q kernel function is built according to the formula q = mean_a .* I + mean_b to obtain the final filtering result q, and the result is saved to the corresponding global memory.
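Steps (V)-(VII) can be summarized as a serial Python sketch of the grayscale guided filter, using a brute-force box mean in place of the CUDA kernels (variable names follow the formulas above; this is a reference sketch under simplified assumptions, not the invention's GPU code):

```python
def box_mean(img, r):
    # neighborhood mean with a boundary-clipped brute-force box filter
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            vals = [img[j][i] for j in ys for i in xs]
            out[y][x] = sum(vals) / len(vals)
    return out

def guided_filter(I, p, r, eps):
    h, w = len(I), len(I[0])
    mean_I = box_mean(I, r)
    mean_p = box_mean(p, r)
    mean_Ip = box_mean([[I[y][x] * p[y][x] for x in range(w)] for y in range(h)], r)
    mean_II = box_mean([[I[y][x] * I[y][x] for x in range(w)] for y in range(h)], r)
    # cov_Ip = mean_Ip - mean_I .* mean_p ;  var_I = mean_II - mean_I .* mean_I
    a = [[(mean_Ip[y][x] - mean_I[y][x] * mean_p[y][x])
          / (mean_II[y][x] - mean_I[y][x] ** 2 + eps)
          for x in range(w)] for y in range(h)]
    b = [[mean_p[y][x] - a[y][x] * mean_I[y][x] for x in range(w)] for y in range(h)]
    mean_a, mean_b = box_mean(a, r), box_mean(b, r)
    # q = mean_a .* I + mean_b
    return [[mean_a[y][x] * I[y][x] + mean_b[y][x] for x in range(w)] for y in range(h)]

# sanity check: a constant image passes through unchanged
# (cov_Ip = var_I = 0, so a = 0 and b = mean_p)
flat = [[0.5] * 4 for _ in range(4)]
q = guided_filter(flat, flat, 1, 0.04)
assert all(abs(v - 0.5) < 1e-12 for row in q for v in row)
```

Every stage here is an elementwise formula or a box mean, which is what makes the algorithm map so cleanly onto per-pixel CUDA kernels plus the shared box-filter kernel.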
(VIII) The filtering result image held in the global memory of the device is transferred back to host memory.
In addition, when the guided filtering algorithm is used to realize the image feathering algorithm, the flow differs from the above:
(I) Copying the r, g, b component data of the input image p and the guidance image I from CPU memory into GPU device memory involves many data transfers, so the present invention uses CUDA streams; data-copy operations and kernel execution are then overlapped, improving the utilization of GPU resources. Especially when the data volume is large, the advantage of CUDA streams is obvious;
(II) When solving for the key parameter a, the present invention uses one kernel function to realize the computation of the 3 components r, g, b of a; its launch parameters are set to a block dimension of 16 × 16, with the blocks arranged with two-dimensional addressing. First, each thread saves the data in the global memories var_I_rr, var_I_rg, var_I_rb, var_I_gg, var_I_gb, var_I_bb into registers in turn, builds the 3 × 3 Sigma matrix, computes its determinant by the determinant formula, and stores the result in a register; then each thread inverts the Sigma matrix and carries out the multiplication of the cov_Ip matrix with this inverse matrix in a unified computation, increasing the computational intensity of program execution and making full use of the computing performance of the GPU.
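The per-thread 3 × 3 inversion described above can be sketched in Python via the adjugate; the Sigma entries below are arbitrary test values and the function names are illustrative, not the invention's kernel code:

```python
def det3(m):
    # determinant of a 3x3 matrix by cofactor expansion along row 0
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def inv3(m):
    # inverse via adjugate / determinant; each CUDA thread would do
    # this with the Sigma entries held in registers
    d = det3(m)
    cof = [[(m[(i + 1) % 3][(j + 1) % 3] * m[(i + 2) % 3][(j + 2) % 3]
           - m[(i + 1) % 3][(j + 2) % 3] * m[(i + 2) % 3][(j + 1) % 3]) / d
            for j in range(3)] for i in range(3)]
    # the adjugate is the transpose of the cofactor matrix
    return [[cof[j][i] for j in range(3)] for i in range(3)]

# arbitrary symmetric positive-definite Sigma for a sanity check
Sigma = [[2.0, 0.5, 0.1],
         [0.5, 1.0, 0.2],
         [0.1, 0.2, 1.5]]
inv = inv3(Sigma)
prod = [[sum(Sigma[i][k] * inv[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)]
assert all(abs(prod[i][j] - (1.0 if i == j else 0.0)) < 1e-9
           for i in range(3) for j in range(3))
```

A closed-form adjugate inverse avoids branching and loops, which suits a per-pixel GPU kernel better than a general elimination routine would.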
Embodiment 3
To make the objects, technical solutions and advantages of the present invention clearer, the technical scheme of the present invention is described in further detail below with reference to concrete examples.
The example of the present invention uses the Windows 7 operating system; the CPU is an Intel Core i5-3470 with a main frequency of 3.2 GHz, and the system memory is 4 GB. The GPU is an NVIDIA GeForce GTX 660, which contains 5 streaming multiprocessors (SMs), each with 192 CUDA cores; the on-board global memory is 2048 MB with a 192-bit memory bus width, and CUDA Compute Capability 3.0 is supported. The present invention also uses the Visual Profiler supplied with the CUDA Toolkit to analyze the various data and carry out a quantitative analysis of program performance.
To verify the validity of the method, the example of the present invention carries out CUDA parallel optimization of the guided filtering algorithm in 4 applications: image smoothing, image feathering, image enhancement, and flash denoising. The filtering-effect figures and speed-up tables are as follows:
Example 1: image smoothing
In this example, the filter radius r is 16 and the filtering parameter eps is 0.04. The input image p and the guidance image I are set to the same image, and the output image q is the final output result; the effect of example 1 is shown in Fig. 2. As can be seen from Fig. 2, the details, abrupt changes, edges and noise in the input image p are all suppressed to a certain degree, and a fairly ideal image smoothing effect is obtained.
Example 2: image feathering
In this example, the filter radius r is 60 and the filtering parameter eps is 0.000001. The input image p and the guidance image I are set to two different images, and the output image q is the final output result; the effect of example 2 is shown in Fig. 3. As can be seen from Fig. 3, the feathering effect of the output image is obvious, and the edge portions change gradually, achieving a natural transition effect.
Example 3: image enhancement
In this example, the filter radius r is 16 and the filtering parameter eps is 0.01. The input image p and the guidance image I are set to the same image, and the output image q is the final output result; the effect of example 3 is shown in Fig. 4. As can be seen from Fig. 4, the overall and local features of the output image are obviously enhanced, effectively improving the recognizability of image details.
Example 4: flash denoising
In this example, the filter radius r is 8 and the filtering parameter eps is 0.0004. The input image p and the guidance image I are set to two different images, and the output image q is the final output result; the effect of example 4 is shown in Fig. 5. As can be seen from Fig. 5, the coloring and denoising of the output image q are naturally coordinated, and an ideal processing effect is obtained.
In addition, as can be seen from Figs. 2-5, in the 4 applications of smoothing, feathering, enhancement and flash denoising, the present invention is essentially identical in effect to the original C-program-based algorithm, verifying the accuracy of the present invention. To compare the acceleration effect of the present invention, the guided filtering algorithm was implemented with Matlab programming, with C programming, and with CUDA programming, and experimental comparisons were carried out. The time consumption and speed-up ratios for processing images of different resolutions are shown in Table 1:
Table 1: time consumption (ms) and speed-up ratio of the guided filtering algorithm under different programming implementations
As can be seen from Table 1, compared with implementing the guided filtering algorithm with Matlab and C programs, the time consumption of the CUDA parallel implementation of the present invention is greatly shortened. The acceleration effect for image feathering is especially evident, achieving a speed-up ratio of more than 60 times. It can also be seen that as the image resolution increases, the acceleration effect of the present invention becomes more obvious.
It will be appreciated by those skilled in the art that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the sequence numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.