Summary of the invention
The invention provides a CUDA-based guided filtering acceleration and optimization method. While preserving image filtering quality, the method improves computational efficiency and reduces computational complexity, as described below:
A CUDA-based guided filtering acceleration and optimization method, comprising the following steps:
reading the input image p and the guidance image I from host memory into global memory, and building a first kernel function to obtain the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I;
building a second kernel function to compute, in turn, the covariance of the image pair (I, p) and the variance of the guidance image I, and then computing the key filtering parameters a and b;
calling the first kernel function to compute the neighborhood mean mean_a of parameter a and the neighborhood mean mean_b of parameter b, obtaining the final filtering result, saving the result to the corresponding global memory, and transferring it back to host memory.
The step of building the first kernel function to obtain the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I is specifically:
converting the computation of each neighborhood mean into a summation of pixel values over the neighborhood window, for the input image p, the guidance image I, the image I*p, and the image I*I respectively;
building the first kernel function to compute each of these neighborhood-window pixel-value summations.
The step of building the first kernel function to compute the neighborhood-window pixel-value summations is specifically:
using an integral image to realize the summation of neighborhood-window pixel values, with the CUDA parallel optimization carried out by 4 kernel functions within the first kernel function.
The guided filtering acceleration and optimization method further comprises: calling the 4 kernel functions within the first kernel function to obtain the neighborhood-window pixel count N, and saving it in constant memory.
The beneficial effects of the technical scheme provided by the invention are:
Building on an in-depth study of the guided filtering algorithm, the present invention implements guided filtering with CUDA programming and compares it experimentally against C-program and Matlab-program implementations in four application instances: image smoothing, image feathering, image enhancement, and flash denoising. Compared with the prior art, the present invention has the following advantages:
(1) The approach is novel: it applies the CUDA architecture to the design of the guided filtering algorithm, breaking through the time limitations of serial programming, which is of considerable innovative significance.
(2) Execution efficiency is high, achieving real-time processing to a certain extent. The method exploits the GPU's strengths in floating-point computation and parallel computation; while preserving the image filtering effect, it effectively improves the execution efficiency of the guided filtering algorithm and realizes fast guided filtering.
(3) It is simple to implement and has low hardware requirements: the GPU parallel architecture is invoked entirely from a C-language environment, the code is easy to write, and large-scale data can be processed on consumer-grade GPU hardware.
Embodiment
To make the objects, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below.
Over the past decade or so, the graphics processing unit (Graphics Processing Unit, GPU) has evolved from a specialized device for processing computer graphics into a highly parallel, multithreaded, many-core processor. The arithmetic capability of current mainstream GPUs already exceeds that of mainstream general-purpose CPUs, and judging from the development trend, the gap will keep widening. The Compute Unified Device Architecture (CUDA) is a hardware and software architecture released by NVIDIA that uses the GPU for data-parallel computation; it is a complete GPGPU (general-purpose computing on graphics processing units) solution. The appearance of CUDA lowers the difficulty for programmers of using the GPU for general-purpose computing. Because of CUDA's particular programming model and data-storage methods, large amounts of similar and complex operations can be processed by threads simultaneously, greatly reducing program execution time.
To this end, the present invention proposes a CUDA-based guided filtering acceleration and optimization method. CUDA is used to build a cooperative CPU-GPU working environment, with the CPU as the host responsible for logic-heavy transaction processing and serial computation, and the GPU as the coprocessor responsible for highly threaded parallel processing. CUDA parallel programming is used to realize the summation of image neighborhood-window pixel values and thereby obtain image neighborhood means; registers and texture memory are used at the same time to optimize the algorithm steps and obtain the key filtering parameters, thereby achieving a global optimization of the algorithm. The technical scheme of the present invention is as follows:
Embodiment 1
A CUDA-based guided filtering acceleration and optimization method, referring to Fig. 1, comprises the following steps:
101: reading the input image p and the guidance image I from host memory into global memory, and building a first kernel function to obtain the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I;
102: building a second kernel function to compute, in turn, the covariance of the image pair (I, p) and the variance of the guidance image I, and then computing the key filtering parameters a and b;
103: calling the first kernel function to compute the neighborhood mean mean_a of parameter a and the neighborhood mean mean_b of parameter b, obtaining the final filtering result, saving the result to the corresponding global memory, and transferring it back to host memory.
Wherein, the step of building the first kernel function to obtain the neighborhood-window means of the input image p, the guidance image I, the image I*p, and the image I*I is specifically:
converting the computation of each neighborhood mean into a summation of pixel values over the neighborhood window, for the input image p, the guidance image I, the image I*p, and the image I*I respectively;
building the first kernel function to compute each of these neighborhood-window pixel-value summations.
Further, the step of building the first kernel function to compute the neighborhood-window pixel-value summations is specifically:
using an integral image to realize the summation of neighborhood-window pixel values, with the CUDA parallel optimization carried out by 4 kernel functions within the first kernel function.
This guided filtering acceleration and optimization method further comprises:
calling the 4 kernel functions within the first kernel function to obtain the neighborhood-window pixel count N, and saving it in constant memory.
This method uses CUDA programming to parallelize and optimize the guided filtering algorithm; while preserving the filtering effect, it greatly improves the execution efficiency of the guided filtering algorithm and, to a certain extent, achieves real-time guided filtering.
Below in conjunction with concrete computing formula, calculation procedure, the method in embodiment 1 is described, described below:
Embodiment 2
The guided filtering algorithm is based on a local linear model. In the local linear model, let the input image be p, the guidance image be I, and the filtering output image be q. The local linear model assumes that, within the neighborhood window $\omega_k$ centered at pixel k, the following linear relationship holds:

$q_i = a_k I_i + b_k, \quad \forall i \in \omega_k$ (1)

where $\omega_k$ is a square window of side length r, $a_k$ and $b_k$ are linear coefficients in the neighborhood window $\omega_k$, $I_i$ is the pixel value of the guidance image in $\omega_k$, and $q_i$ is the filtering output in $\omega_k$. The coefficients $a_k$ and $b_k$ are determined by minimizing the difference between the input image p and the output image q, i.e. by making formula (2) reach its minimum:

$E(a_k, b_k) = \sum_{i \in \omega_k} \left( (a_k I_i + b_k - p_i)^2 + \varepsilon a_k^2 \right)$ (2)

In formula (2), $E(a_k, b_k)$ is the output of the cost function in the neighborhood window $\omega_k$, $p_i$ is the pixel value of the input image in $\omega_k$, and $\varepsilon$ is a penalty variance-adjustment parameter whose purpose is to prevent the value of $a_k$ from becoming too large. Solving the above formula by linear regression gives:

$a_k = \dfrac{\frac{1}{|\omega|} \sum_{i \in \omega_k} I_i p_i - \mu_k \bar{p}_k}{\sigma_k^2 + \varepsilon}$ (3)

$b_k = \bar{p}_k - a_k \mu_k$ (4)

In these formulas, $\mu_k$ and $\sigma_k^2$ are respectively the mean and variance of the guidance image I in the neighborhood window $\omega_k$, $|\omega|$ is the number of pixels in $\omega_k$, and $\bar{p}_k$ is the mean of the input image p in $\omega_k$.

Because each pixel can be contained in multiple neighborhood windows $\omega_k$, and the $q_i$ computed in different windows $\omega_k$ differ, the $q_i$ must be averaged: after computing $a_k$ and $b_k$ in all windows, the filtering output is given by formula (5):

$q_i = \bar{a}_i I_i + \bar{b}_i$ (5)

where $\bar{a}_i$ and $\bar{b}_i$ are the mean values of $a_k$ and $b_k$ over all neighborhood windows overlapping pixel i, i.e. $\bar{a}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} a_k$ and $\bar{b}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} b_k$.

Analysis of formulas (3) and (4) shows that $\mu_k$, $\bar{p}_k$ and $\frac{1}{|\omega|}\sum_{i \in \omega_k} I_i p_i$ are respectively the means of the guidance image I, the input image p, and the product image I×p in the neighborhood window $\omega_k$, and $\sigma_k^2$ is the variance of I in $\omega_k$. Variance and mean satisfy the relation $D(X) = E(X^2) - (EX)^2$, so the variance can also be computed from means. Therefore, in the guided filtering algorithm the image neighborhood means must be computed repeatedly, which is the most time-consuming part of the whole algorithm. How to compute the neighborhood-window mean of an image quickly thus becomes the key to implementing the guided filtering algorithm, and is also the focus of the CUDA optimization of the present invention.
The present invention builds the first kernel function according to formula (6) to realize the computation of the image neighborhood mean:
mean_p = boxfilter(p, r) / N (6)
Here, mean_p denotes the neighborhood mean of the input image p, boxfilter(p, r) denotes the sum of the pixel values of the input image p within the neighborhood window, N denotes the number of pixels in the neighborhood window, and r denotes the neighborhood window side length. The neighborhood-window pixel count N is obtained by computing the neighborhood-window pixel sum of an all-ones matrix of the same size as the image. This computation step is well known to those skilled in the art and is not repeated in this embodiment of the present invention.
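As a hedged illustration of how N is obtained, the following Python sketch box-filters an all-ones matrix with a brute-force window sum (the function name and the border-clipping behavior are assumptions for illustration, not the CUDA kernel itself):

```python
def box_sum(img, r):
    """Sum of pixel values in the (2r+1) x (2r+1) window around each
    pixel, clipped at the image borders (brute force, for clarity)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = 0.0
            for j in range(max(0, y - r), min(h, y + r + 1)):
                for i in range(max(0, x - r), min(w, x + r + 1)):
                    s += img[j][i]
            out[y][x] = s
    return out

# N: window pixel count, obtained by box-filtering an all-ones matrix
h, w, r = 5, 5, 1
ones = [[1.0] * w for _ in range(h)]
N = box_sum(ones, r)
# interior pixel: full (2r+1)^2 = 9 window; corner: only 4 pixels
assert N[2][2] == 9.0 and N[0][0] == 4.0
```

Dividing any box-filtered image by this N, as in formula (6), yields a correctly normalized mean even at the image borders, where the window is clipped.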
With the above method, the computation of the image neighborhood mean is converted into the summation of image neighborhood-window pixel values, which is convenient for CUDA parallel processing. The present invention uses an integral image to realize the summation of neighborhood-window pixel values, with the CUDA parallel optimization carried out by 4 kernel functions within the first kernel function. The specific implementation steps are as follows (assuming the data used have already been placed in GPU device memory):
(I) The 1st kernel function is responsible for computing in parallel, for the i-th column of the image (1 ≤ i ≤ image width), the pixel sums from the 1st row to the j-th row (1 ≤ j ≤ image height). Its launch parameters are a block dimension of 1024 × 1 and a grid dimension of 1 × 1. Each thread completes the computation of one column of image data through a loop, registers are used to hold intermediate data inside the loop, and the data reads at this point satisfy coalesced global-memory access.
(II) The data produced by the 1st kernel function require boundary processing. The 2nd kernel function handles the boundary problem row by row; its launch parameters are a block dimension of 16 × 16 and a grid dimension of ((image width + dimBlock.x − 1)/dimBlock.x) × ((image height + dimBlock.y − 1)/dimBlock.y) blocks, where dimBlock.x denotes the dimension of a thread block along the x-axis and dimBlock.y denotes the dimension of a thread block along the y-axis.
(III) The 3rd kernel function is responsible for computing in parallel, for the j-th row of the image (1 ≤ j ≤ image height), the pixel sums from the 1st column to the i-th column (1 ≤ i ≤ image width). To eliminate the penalty of uncoalesced access during data reads, the input data of this kernel function are first transposed, the 1st kernel function is then called to perform the computation, and the results are written back by rows when stored.
(IV) The data produced by the 3rd kernel function also require boundary processing. The 4th kernel function handles the boundary problem column by column, with the same launch parameters as the 2nd kernel function. The output data at this point are the neighborhood-window pixel sums of the input image of the 1st kernel function, and are saved to the corresponding global memory.
In the same way, the neighborhood mean mean_I of the guidance image I, the neighborhood mean mean_Ip of the image I*p, and the neighborhood mean mean_II of the image I*I can be obtained in turn.
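The four-step decomposition above (column prefix sums, boundary handling, row prefix sums after transposition, boundary handling) can be sketched serially in Python; this is a CPU illustration of the integral-image idea under simplified assumptions, not the CUDA kernels themselves:

```python
def column_prefix_sums(img):
    # analogue of kernel 1: each column accumulated top-to-bottom
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for x in range(w):
        for y in range(1, h):
            out[y][x] += out[y - 1][x]
    return out

def row_prefix_sums(img):
    # analogue of kernel 3: each row accumulated left-to-right
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(1, w):
            out[y][x] += out[y][x - 1]
    return out

def window_sum_from_integral(ii, y0, y1, x0, x1):
    # analogue of the boundary kernels 2/4: boundary-aware differences
    # of the integral image give any window sum in O(1) per pixel
    s = ii[y1][x1]
    if y0 > 0: s -= ii[y0 - 1][x1]
    if x0 > 0: s -= ii[y1][x0 - 1]
    if y0 > 0 and x0 > 0: s += ii[y0 - 1][x0 - 1]
    return s

img = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 9.0]]
ii = row_prefix_sums(column_prefix_sums(img))
# sum of the 2x2 bottom-right block: 5 + 6 + 8 + 9 = 28
assert window_sum_from_integral(ii, 1, 2, 1, 2) == 28.0
```

In the CUDA version, the two accumulation passes run one thread per column (or per transposed row) so that global-memory reads stay coalesced, which a serial sketch cannot show.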
It is worth explaining here that the CUDA programming model is one of CPU-GPU cooperation. Constrained by its hardware architecture, a traditional CPU cannot use its resources effectively for general-purpose computation, whereas with CUDA the GPU can not only perform traditional graphics computation but also perform general-purpose computation efficiently. To reduce the time consumed by data transfers as far as possible and improve computation speed, the present invention performs only 2 data transfers between CPU memory and GPU device memory: the input image p and the guidance image I are transferred from host memory into device memory, and the output image q is transferred from device memory back into host memory. The specific steps are as follows:
(I) CUDA is used to build a cooperative CPU-GPU working environment;
(II) The input image p and the guidance image I are read from host memory into the global memory of the device, and bound to texture memory.
(III) Thread counts are allocated: the kernel launch parameters assign 16 × 16 threads to each block and ((image width + dimBlock.x − 1)/dimBlock.x) × ((image height + dimBlock.y − 1)/dimBlock.y) blocks to each grid, dividing the image in a checkerboard pattern. Here dimBlock.x denotes the dimension of a thread block along the x-axis and dimBlock.y denotes the dimension of a thread block along the y-axis.
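The grid-dimension expression above is an ordinary ceiling division; a small Python check (the 16 × 16 block size follows the text, the image size is an arbitrary example):

```python
def grid_dim(image_extent, block_extent):
    # (extent + block - 1) // block: number of thread blocks needed
    # to cover the image along one axis (ceiling division)
    return (image_extent + block_extent - 1) // block_extent

# e.g. a 1920 x 1080 image covered by 16 x 16 thread blocks
assert grid_dim(1920, 16) == 120
assert grid_dim(1080, 16) == 68   # 1080/16 = 67.5, rounded up
```

The rounding up is why the boundary kernels are needed: threads in the last partial block along each axis fall outside the image and must be masked off.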
(IV) The first kernel function (i.e. comprising the 4 kernel functions) is called to compute the neighborhood-window pixel sum of an all-ones matrix of the same size as the image, obtaining the neighborhood-window pixel count N, which is saved in constant memory.
(V) The first kernel function and the N in constant memory are used to compute, in turn, the neighborhood means of the input image p and the guidance image I, the neighborhood mean mean_Ip of the image I*p, and the neighborhood mean mean_II of the image I*I, and the results are saved in turn to the corresponding global memory.
(VI) The second kernel function is built to compute, in turn, the covariance cov_Ip of the image pair (I, p) and the variance var_I of the image I, and kernel functions are then built to compute the key filtering parameters a and b.
That is, a covariance kernel function is built according to the formula cov_Ip = mean_Ip − mean_I .* mean_p to compute the covariance of the image pair (I, p);
a variance kernel function is built according to the formula var_I = mean_II − mean_I .* mean_I to compute the variance of the image I;
a parameter-a kernel function is built according to the formula a = cov_Ip ./ (var_I + ε) to compute the key filtering parameter a;
a parameter-b kernel function is built according to the formula b = mean_p − a .* mean_I to compute the key filtering parameter b.
(VII) The first kernel function is called to compute the neighborhood-window means of the key filtering parameters a and b, the final filtering result q is obtained, and the result is saved to the corresponding global memory.
That is, the first kernel function is called according to the formula mean_a = boxfilter(a, r)/N to obtain the neighborhood mean of the key parameter a;
the first kernel function is called according to the formula mean_b = boxfilter(b, r)/N to obtain the neighborhood mean of the key parameter b;
an output-q kernel function is built according to the formula q = mean_a .* I + mean_b to obtain the final filtering result q, and the result is saved to the corresponding global memory.
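Steps (V)-(VII) can be summarized as a serial Python sketch of the grayscale guided filter, using a brute-force box mean in place of the CUDA kernels (variable names follow the formulas above; this is a reference sketch under simplified assumptions, not the invention's GPU code):

```python
def box_mean(img, r):
    # neighborhood mean with a boundary-clipped brute-force box filter
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            vals = [img[j][i] for j in ys for i in xs]
            out[y][x] = sum(vals) / len(vals)
    return out

def guided_filter(I, p, r, eps):
    h, w = len(I), len(I[0])
    mean_I = box_mean(I, r)
    mean_p = box_mean(p, r)
    mean_Ip = box_mean([[I[y][x] * p[y][x] for x in range(w)] for y in range(h)], r)
    mean_II = box_mean([[I[y][x] * I[y][x] for x in range(w)] for y in range(h)], r)
    # cov_Ip = mean_Ip - mean_I .* mean_p ;  var_I = mean_II - mean_I .* mean_I
    a = [[(mean_Ip[y][x] - mean_I[y][x] * mean_p[y][x])
          / (mean_II[y][x] - mean_I[y][x] ** 2 + eps)
          for x in range(w)] for y in range(h)]
    b = [[mean_p[y][x] - a[y][x] * mean_I[y][x] for x in range(w)] for y in range(h)]
    mean_a, mean_b = box_mean(a, r), box_mean(b, r)
    # q = mean_a .* I + mean_b
    return [[mean_a[y][x] * I[y][x] + mean_b[y][x] for x in range(w)] for y in range(h)]

# sanity check: a constant image passes through unchanged
# (cov_Ip = var_I = 0, so a = 0 and b = mean_p)
flat = [[0.5] * 4 for _ in range(4)]
q = guided_filter(flat, flat, 1, 0.04)
assert all(abs(v - 0.5) < 1e-12 for row in q for v in row)
```

Every stage here is an elementwise formula or a box mean, which is what makes the algorithm map so cleanly onto per-pixel CUDA kernels plus the shared box-filter kernel.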
(VIII) The filtering result image held in the global memory of the device is transferred back to host memory.
In addition, when the guided filtering algorithm is used to realize the image feathering algorithm, the flow differs from the above:
(I) Copying the r, g, b component data of the input image p and the guidance image I from CPU memory into GPU device memory involves many data transfers, so the present invention uses CUDA streams; data-copy operations and kernel execution are then overlapped, improving the utilization of GPU resources. Especially when the data volume is large, the advantage of CUDA streams is obvious;
(II) When solving for the key parameter a, the present invention uses one kernel function to realize the computation of the 3 components r, g, b of a; its launch parameters are set to a block dimension of 16 × 16, with the blocks arranged with two-dimensional addressing. First, each thread saves the data in the global memories var_I_rr, var_I_rg, var_I_rb, var_I_gg, var_I_gb, var_I_bb into registers in turn, builds the 3 × 3 Sigma matrix, computes its determinant by the determinant formula, and stores the result in a register; then each thread inverts the Sigma matrix and carries out the multiplication of the cov_Ip matrix with this inverse matrix in a unified computation, increasing the computational intensity of program execution and making full use of the computing performance of the GPU.
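The per-thread 3 × 3 inversion described above can be sketched in Python via the adjugate; the Sigma entries below are arbitrary test values and the function names are illustrative, not the invention's kernel code:

```python
def det3(m):
    # determinant of a 3x3 matrix by cofactor expansion along row 0
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def inv3(m):
    # inverse via adjugate / determinant; each CUDA thread would do
    # this with the Sigma entries held in registers
    d = det3(m)
    cof = [[(m[(i + 1) % 3][(j + 1) % 3] * m[(i + 2) % 3][(j + 2) % 3]
           - m[(i + 1) % 3][(j + 2) % 3] * m[(i + 2) % 3][(j + 1) % 3]) / d
            for j in range(3)] for i in range(3)]
    # the adjugate is the transpose of the cofactor matrix
    return [[cof[j][i] for j in range(3)] for i in range(3)]

# arbitrary symmetric positive-definite Sigma for a sanity check
Sigma = [[2.0, 0.5, 0.1],
         [0.5, 1.0, 0.2],
         [0.1, 0.2, 1.5]]
inv = inv3(Sigma)
prod = [[sum(Sigma[i][k] * inv[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)]
assert all(abs(prod[i][j] - (1.0 if i == j else 0.0)) < 1e-9
           for i in range(3) for j in range(3))
```

A closed-form adjugate inverse avoids branching and loops, which suits a per-pixel GPU kernel better than a general elimination routine would.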
Embodiment 3
To make the objects, technical solutions and advantages of the present invention clearer, the technical scheme of the present invention is described in further detail below with reference to concrete examples.
The example of the present invention uses the Windows 7 operating system; the CPU is an Intel Core i5-3470 with a main frequency of 3.2 GHz, and the system memory is 4 GB. The GPU is an NVIDIA GeForce GTX 660, which contains 5 streaming multiprocessors (SMs), each with 192 CUDA cores; the on-board global memory is 2048 MB with a 192-bit memory bus width, and CUDA Compute Capability 3.0 is supported. The present invention also uses the Visual Profiler supplied with the CUDA Toolkit to analyze the various data and carry out a quantitative analysis of program performance.
To verify the validity of the method, the example of the present invention carries out CUDA parallel optimization of the guided filtering algorithm in 4 applications: image smoothing, image feathering, image enhancement, and flash denoising. The filtering-effect figures and speed-up tables are as follows:
Example 1: image smoothing
In this example, the filter radius r is 16 and the filtering parameter eps is 0.04. The input image p and the guidance image I are set to the same image, and the output image q is the final output result; the effect of example 1 is shown in Fig. 2. As can be seen from Fig. 2, the details, abrupt changes, edges and noise in the input image p are all suppressed to a certain degree, and a fairly ideal image smoothing effect is obtained.
Example 2: image feathering
In this example, the filter radius r is 60 and the filtering parameter eps is 0.000001. The input image p and the guidance image I are set to two different images, and the output image q is the final output result; the effect of example 2 is shown in Fig. 3. As can be seen from Fig. 3, the feathering effect of the output image is obvious, and the edge portions change gradually, achieving a natural transition effect.
Example 3: image enhancement
In this example, the filter radius r is 16 and the filtering parameter eps is 0.01. The input image p and the guidance image I are set to the same image, and the output image q is the final output result; the effect of example 3 is shown in Fig. 4. As can be seen from Fig. 4, the overall and local features of the output image are obviously enhanced, effectively improving the recognizability of image details.
Example 4: flash denoising
In this example, the filter radius r is 8 and the filtering parameter eps is 0.0004. The input image p and the guidance image I are set to two different images, and the output image q is the final output result; the effect of example 4 is shown in Fig. 5. As can be seen from Fig. 5, the coloring and denoising of the output image q are naturally coordinated, and an ideal processing effect is obtained.
In addition, as can be seen from Figs. 2-5, in the 4 applications of smoothing, feathering, enhancement and flash denoising, the present invention is essentially identical in effect to the original C-program-based algorithm, verifying the accuracy of the present invention. To compare the acceleration effect of the present invention, the guided filtering algorithm was implemented with Matlab programming, with C programming, and with CUDA programming, and experimental comparisons were carried out. The time consumption and speed-up ratios for processing images of different resolutions are shown in Table 1:
Table 1: time consumption (ms) and speed-up ratio of the guided filtering algorithm under different programming implementations
As can be seen from Table 1, compared with implementing the guided filtering algorithm with Matlab and C programs, the time consumption of the CUDA parallel implementation of the present invention is greatly shortened. The acceleration effect for image feathering is especially evident, achieving a speed-up ratio of more than 60 times. It can also be seen that as the image resolution increases, the acceleration effect of the present invention becomes more obvious.
It will be appreciated by those skilled in the art that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the sequence numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.