CN104899840B

CN104899840B - CUDA-based acceleration and optimization method for guided filtering

Info

Publication number: CN104899840B
Application number: CN201510324806.0A
Authority: CN
Inventors: 何凯; 王新磊; 王晓文; 葛云峰
Original assignee: Tianjin University
Current assignee: TIANJIN BOHUA ANCHUANG TECHNOLOGY Co.,Ltd.
Priority date: 2015-06-12
Filing date: 2015-06-12
Publication date: 2018-12-18
Anticipated expiration: 2035-06-12
Also published as: CN104899840A

Abstract

The invention discloses a CUDA-based acceleration and optimization method for guided filtering, comprising the following steps. The input image p and the guidance image I are read from host memory into global memory, and a first kernel function is constructed to obtain the neighborhood means of the input image p, the guidance image I, the image I*p and the image I*I over the neighborhood window. A second kernel function is constructed to compute, in turn, the covariance of images (I, p) and the variance of the guidance image I, from which the key filtering parameters a and b are obtained. The first kernel function is then called to obtain the neighborhood mean mean_a of parameter a and the neighborhood mean mean_b of parameter b, from which the final filtering result q is obtained; the result is saved in the corresponding global memory and transferred back to host memory. By exploiting the floating-point and parallel computing capabilities of the GPU, the method effectively improves the execution efficiency of the guided filtering algorithm and implements it quickly while preserving the image filtering quality.

Description

CUDA-based acceleration and optimization method for guided filtering

Technical field

The present invention relates to the fields of computer application technology and image processing, and more particularly to a guided filtering acceleration and optimization method based on CUDA (Compute Unified Device Architecture).

Background art

Image filtering is an important means of image processing and has considerable practical significance and research value. Because imaging systems, transmission media and recording equipment are imperfect, digital images are often contaminated by various kinds of noise during formation, transmission and recording. Image filtering, i.e. suppressing the noise in the target image while preserving image detail as far as possible, is an indispensable operation in image preprocessing, and its quality directly affects the validity and reliability of subsequent image processing and analysis.

Image filtering methods fall into two classes. The first is linear shift-invariant filtering, whose kernel weights are independent of the input image content; typical examples are Gaussian filtering, mean filtering and Laplacian filtering. The second is linear shift-variant filtering, typified by guided filtering, which uses the content of an image, called the guidance image, during filtering. The bilateral filter kernel considers both the spatial distance between pixels in the template and the difference between their values; its guidance image and input image are the same image, so the bilateral filter can be regarded as a simple form of guided filtering. The joint bilateral filter uses a guidance image that differs from the input image and can achieve better filtering results. However, both the bilateral filter and the joint bilateral filter have obvious defects: in detail enhancement and high-dynamic-range image compression, for example, they produce noticeable gradient-reversal artifacts at edges, so the algorithm and structure of the filter need further improvement. Guided filtering was formally proposed in 2010. It shares the edge-preserving smoothing property of bilateral filtering while avoiding these artifacts, and, owing to its close connection with the matting Laplacian matrix, it has been widely applied to image denoising, image enhancement, HDR (high dynamic range) compression, flash/no-flash denoising [1], image matting, dehazing, joint upsampling and other fields. The algorithm is simple and effective, but it requires complex matrix computations and the solution of large linear systems, so guided filtering consumes a large amount of computation time and memory and cannot meet the needs of practical applications.

In summary, the guided image filtering algorithm is computationally expensive, and it is difficult to improve its execution efficiency while guaranteeing its accuracy. Traditional CPU-based architectures therefore struggle to meet the requirements for both accuracy and real-time processing, and the needs of practical applications can only be met with a graphics processing unit (GPU).

Summary of the invention

The present invention provides a CUDA-based acceleration and optimization method for guided filtering that improves computational efficiency and reduces computational complexity while preserving image filtering quality, as described below:

A CUDA-based acceleration and optimization method for guided filtering comprises the following steps:

The input image p and the guidance image I are read from host memory into global memory, and a first kernel function is constructed to obtain the neighborhood means of the input image p, the guidance image I, the image I*p and the image I*I over the neighborhood window;

A second kernel function is constructed to compute, in turn, the covariance of images (I, p) and the variance of the guidance image I, from which the key filtering parameters a and b are obtained;

The first kernel function is called to obtain the neighborhood mean mean_a of parameter a and the neighborhood mean mean_b of parameter b, from which the final filtering result is obtained; the result is saved in the corresponding global memory and transferred back to host memory.

The step of constructing the first kernel function to obtain the neighborhood means of the input image p, the guidance image I, the image I*p and the image I*I over the neighborhood window specifically comprises:

converting the computation of each of these neighborhood means (for p, I, I*p and I*I) into a summation of pixel values over the image neighborhood window;

computing each of these neighborhood window sums with the first kernel function.

The step of computing the neighborhood window sums with the first kernel function specifically comprises:

computing the neighborhood window sums with an integral image, with CUDA parallel optimization carried out by four kernel functions within the first kernel function.

The method further comprises: calling the four kernel functions of the first kernel function to obtain the number of pixels N in the neighborhood window and storing it in constant memory.

The technical solution provided by the present invention has the following beneficial effects:

Building on an in-depth study of the guided filtering algorithm, the present invention implements it with CUDA programming for four applications, namely image smoothing, image feathering, image enhancement and flash image denoising, and compares it experimentally with C and Matlab implementations. Compared with the prior art, the invention has the following advantages:

(1) Novel approach: the guided filtering algorithm is designed around the CUDA architecture, breaking through the time limits of serial programming, which is of considerable innovative significance.

(2) High execution efficiency, reaching real-time processing to a certain extent: by exploiting the floating-point and parallel computing capabilities of the GPU, the method effectively improves the execution efficiency of the guided filtering algorithm while preserving the filtering quality, so that guided filtering is implemented quickly.

(3) Simple implementation and low hardware requirements: the GPU parallel architecture is called from a C language environment, the code is easy to write, and large-scale data can be processed on consumer-grade GPU hardware.

Description of the drawings

Fig. 1 is a flow chart of the guided filtering algorithm of the present invention;

Fig. 2 compares image smoothing results;

(a) input image, (b) output image of the C program, (c) output image of the CUDA program;

Fig. 3 compares image feathering results;

(a) input image, (b) guidance image, (c) output image of the C program, (d) output image of the CUDA program;

Fig. 4 compares image enhancement results of the present invention;

Fig. 5 compares flash image denoising results of the present invention;

(a) input image, (b) guidance image, (c) output image of the C program, (d) output image of the CUDA program.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below.

Over roughly the past decade, the graphics processing unit (GPU) has evolved from a special-purpose device that only processed computer graphics into a highly parallel, multithreaded, many-core processor. The computing power of mainstream GPUs has long exceeded that of mainstream general-purpose CPUs, and current trends suggest the gap will keep widening. CUDA (Compute Unified Device Architecture), released by NVIDIA, is a hardware and software architecture that uses the GPU as a data-parallel computing device; it is a complete GPGPU (general-purpose computing on graphics processing units) solution. CUDA lowers the difficulty for programmers of developing general-purpose computations on the GPU. Thanks to CUDA's particular programming model and data storage scheme, large numbers of complex, similar operations can be processed simultaneously by threads, which greatly reduces program execution time.

To this end, the present invention proposes a CUDA-based acceleration and optimization method for guided filtering. CUDA is used to build a cooperative CPU-GPU working environment in which the CPU acts as the host, responsible for logic-intensive transaction processing and serial computation, and the GPU acts as a coprocessor, responsible for highly threaded parallel processing. CUDA parallel programming is used to sum pixel values over image neighborhood windows and thereby obtain the neighborhood means, while registers and texture memory are used to optimize the steps that compute the key guided filtering parameters, achieving an overall optimization of the algorithm. The technical solution of the present invention is as follows:

Embodiment 1

A CUDA-based acceleration and optimization method for guided filtering, referring to Fig. 1, comprises the following steps:

101: The input image p and the guidance image I are read from host memory into global memory, and a first kernel function is constructed to obtain the neighborhood means of the input image p, the guidance image I, the image I*p and the image I*I over the neighborhood window;

102: A second kernel function is constructed to compute, in turn, the covariance of images (I, p) and the variance of the guidance image I, from which the key filtering parameters a and b are obtained;

103: The first kernel function is called to obtain the neighborhood mean mean_a of parameter a and the neighborhood mean mean_b of parameter b, from which the final filtering result is obtained; the result is saved in the corresponding global memory and transferred back to host memory.

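The step of constructing the first kernel function to obtain the neighborhood means of the input image p, the guidance image I, the image I*p and the image I*I over the neighborhood window specifically comprises: converting each of these neighborhood mean computations into a summation of pixel values over the neighborhood window, and computing these sums with the first kernel function.

Further, the step of computing the neighborhood window sums with the first kernel function specifically comprises: computing the sums with an integral image, with CUDA parallel optimization carried out by four kernel functions within the first kernel function.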

The method further comprises:

calling the four kernel functions of the first kernel function to obtain the number of pixels N in the neighborhood window and storing it in constant memory.

The method uses CUDA programming to parallelize the guided filtering algorithm, which greatly improves its execution efficiency while preserving the filtering quality and achieves real-time guided filtering to a certain extent.

The method of Embodiment 1 is described in detail below with reference to specific formulas and calculation steps:

Embodiment 2

The guided filtering algorithm is based on a local linear model. Let $p$ be the input image, $I$ the guidance image and $q$ the filtered output image. The local linear model assumes that, within the neighborhood window $\omega_k$ centered at pixel $k$, the following linear relationship holds:

$$q_i = a_k I_i + b_k, \qquad \forall i \in \omega_k \qquad (1)$$

where $\omega_k$ is a square window of radius $r$ centered at pixel $k$ (that is, $(2r+1) \times (2r+1)$ pixels), $a_k$ and $b_k$ are linear coefficients assumed constant in $\omega_k$, $I_i$ is the pixel value of the guidance image in $\omega_k$, and $q_i$ is the filtered output in $\omega_k$. The coefficients $a_k$ and $b_k$ are determined by minimizing the difference between the input image $p$ and the output image $q$, i.e. by minimizing formula (2):

$$E(a_k, b_k) = \sum_{i \in \omega_k} \left( (a_k I_i + b_k - p_i)^2 + \epsilon a_k^2 \right) \qquad (2)$$

In formula (2), $E(a_k, b_k)$ is the cost function over $\omega_k$, $p_i$ is the pixel value of the input image in $\omega_k$, and $\epsilon$ is a regularization parameter that prevents $a_k$ from becoming too large. Solving formula (2) by linear regression gives:

$$a_k = \frac{\frac{1}{|\omega|} \sum_{i \in \omega_k} I_i p_i - \mu_k \bar{p}_k}{\sigma_k^2 + \epsilon} \qquad (3)$$

$$b_k = \bar{p}_k - a_k \mu_k \qquad (4)$$

where $\mu_k$ and $\sigma_k^2$ are the mean and variance of the guidance image $I$ in $\omega_k$, $|\omega|$ is the number of pixels in $\omega_k$, and $\bar{p}_k = \frac{1}{|\omega|} \sum_{i \in \omega_k} p_i$ is the mean of the input image $p$ in $\omega_k$.

Since each pixel $i$ is covered by several windows $\omega_k$, and the values of $q_i$ computed in different windows differ, $q_i$ is averaged: after $a_k$ and $b_k$ have been computed for all windows, the filtered output is given by formula (5):

$$q_i = \bar{a}_i I_i + \bar{b}_i \qquad (5)$$

where $\bar{a}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} a_k$ and $\bar{b}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} b_k$ are the averages of $a_k$ and $b_k$ over all windows overlapping pixel $i$.

Formulas (3) and (4) show that $\mu_k$, $\bar{p}_k$ and $\frac{1}{|\omega|} \sum_{i \in \omega_k} I_i p_i$ are the means of the guidance image $I$, the input image $p$ and the image $I \times p$ in the window $\omega_k$, and that $\sigma_k^2$ is the variance of $I$ in $\omega_k$. Since variance and mean satisfy $D(X) = E(X^2) - (E X)^2$, the variance can also be computed from means. The neighborhood mean therefore has to be computed many times in the guided filtering algorithm and is its most time-consuming part. Computing neighborhood means quickly is thus the key to implementing guided filtering, and it is the main focus of the CUDA optimization in the present invention.
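Substituting the identity $D(X) = E(X^2) - (EX)^2$ into formulas (3) and (4) makes explicit that each window needs only four neighborhood means, those of $I$, $p$, $I p$ and $I^2$, which is why the whole algorithm reduces to repeated neighborhood-mean computations:

$$\sigma_k^2 = \frac{1}{|\omega|} \sum_{i \in \omega_k} I_i^2 - \mu_k^2, \qquad \mathrm{cov}_k(I, p) = \frac{1}{|\omega|} \sum_{i \in \omega_k} I_i p_i - \mu_k \bar{p}_k$$

$$a_k = \frac{\mathrm{cov}_k(I, p)}{\sigma_k^2 + \epsilon}, \qquad b_k = \bar{p}_k - a_k \mu_k$$

In the notation used later, the four means correspond to mean_II, mean_Ip, mean_I and mean_p.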

The present invention constructs the first kernel function according to formula (6) to compute the image neighborhood mean:

$$\mathrm{mean\_p} = \mathrm{boxfilter}(p, r) / N \qquad (6)$$

where mean_p is the neighborhood mean of the input image p, boxfilter(p, r) is the sum of the pixel values of p within the neighborhood window, N is the number of pixels in the neighborhood window, and r is the radius of the neighborhood window. N can be obtained by computing the neighborhood window sums of an all-ones matrix of the same size as the image. This calculation is well known to those skilled in the art and is not described further here.
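As an illustration of formula (6), the following is a minimal sequential C sketch, not the patent's CUDA code. It assumes, as in the commonly used reference boxfilter, that the window is clipped at the image border, so N is itself an image (smaller near the edges) obtained by box-summing an all-ones matrix. The function names box_sum_naive, window_count and box_mean are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical CPU reference: sum of src over the (2r+1)x(2r+1) window centred
 * at each pixel, clipped at the image border. Images are row-major, w x h. */
static void box_sum_naive(const double *src, double *dst, int w, int h, int r)
{
    assert(src != NULL && dst != NULL && w > 0 && h > 0 && r >= 0);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            double s = 0.0;
            for (int yy = y - r; yy <= y + r; ++yy) {
                if (yy < 0 || yy >= h) continue;      /* row outside the image */
                for (int xx = x - r; xx <= x + r; ++xx) {
                    if (xx < 0 || xx >= w) continue;  /* column outside the image */
                    s += src[yy * w + xx];
                }
            }
            dst[y * w + x] = s;
        }
    }
}

/* N = boxfilter(ones, r): the number of pixels actually inside each window. */
static void window_count(double *n, int w, int h, int r)
{
    double *ones = malloc(sizeof *ones * (size_t)w * (size_t)h);
    assert(ones != NULL);
    for (int i = 0; i < w * h; ++i) ones[i] = 1.0;
    box_sum_naive(ones, n, w, h, r);
    free(ones);
}

/* Formula (6): mean = boxfilter(src, r) ./ N, element by element. */
static void box_mean(const double *src, const double *n, double *mean,
                     int w, int h, int r)
{
    box_sum_naive(src, mean, w, h, r);
    for (int i = 0; i < w * h; ++i) mean[i] /= n[i];
}
```

With r = 1, a corner pixel sees a 2 × 2 window (N = 4) and an interior pixel a 3 × 3 window (N = 9). Direct summation costs O(r²) per pixel, which is the cost the integral-image scheme is meant to remove.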

In this way, the computation of the image neighborhood mean is converted into a summation of pixel values over the neighborhood window, which lends itself to CUDA parallel processing. The present invention computes the neighborhood window sums with an integral image, with CUDA parallel optimization carried out by the four kernel functions of the first kernel function. The specific steps are as follows (assuming that the data are already in GPU video memory):

(I) The 1st kernel function computes in parallel, for each column i of the image (1 ≤ i ≤ image width), the sum of the pixels from row 1 to row j (1 ≤ j ≤ image height). Its launch parameters are a block dimension of 1024 × 1 and a grid dimension of 1 × 1. Each thread computes one column of the image by iterative accumulation in a loop, keeping intermediate data in registers; the data reads then satisfy the coalesced-access requirement of global memory.

(II) The data produced by the 1st kernel function require boundary processing. The 2nd kernel function handles the boundaries row by row. Its launch parameters are a block dimension of 16 × 16 and a grid of ((image width + dimBlock.x - 1) / dimBlock.x) × ((image height + dimBlock.y - 1) / dimBlock.y) blocks, where dimBlock.x and dimBlock.y are the dimensions of the thread block along the x and y axes.

(III) The 3rd kernel function computes in parallel, for each row j of the image (1 ≤ j ≤ image height), the sum of the pixels from column 1 to column i (1 ≤ i ≤ image width). To avoid non-coalesced accesses when reading the data, the input data of this kernel are first transposed, the 1st kernel function is then called to perform the computation, and the results are written back column-wise.

(IV) The data produced by the 3rd kernel function also require boundary processing. The 4th kernel function handles the boundaries column by column, with the same launch parameters as the 2nd kernel function. Its output is the set of neighborhood window sums of the image that was input to the 1st kernel function, and it is stored in the corresponding global memory.

And so on, it can successively acquire the neighboring mean value mean_I of navigational figure I, the neighboring mean value mean_ of image I*P The neighboring mean value mean_II of Ip, image I*I.

It should be noted that the CUDA programming model is based on CPU-GPU cooperation. Constrained by its hardware architecture, a traditional CPU cannot use its resources effectively for general-purpose computation, whereas CUDA enables the GPU to perform not only traditional graphics computations but also efficient general-purpose computations. To minimize data-transfer time and increase computing speed, the present invention performs only two data transfers between CPU memory and GPU video memory: the input image p and the guidance image I are transferred from host memory to device video memory, and the output image q is transferred from device video memory back to host memory. The specific steps are as follows:

(I) Build a cooperative CPU-GPU working environment with CUDA;

(II) Read the input image p and the guidance image I from host memory into the global memory of the device video memory, and bind them to texture memory;

(III) Allocate threads: set the kernel launch parameters to 16 × 16 threads per block and ((image width + dimBlock.x - 1) / dimBlock.x) × ((image height + dimBlock.y - 1) / dimBlock.y) blocks per grid, dividing the image into a checkerboard of tiles, where dimBlock.x and dimBlock.y are the dimensions of the thread block along the x and y axes;

(IV) Call the first kernel function (comprising the four kernel functions) to compute the neighborhood window sums of an all-ones matrix of the same size as the image, obtaining the number of pixels N in each neighborhood window, and store it in constant memory;

(V) Call the first kernel function together with N in constant memory to compute, in turn, the neighborhood means of the input image p and the guidance image I, the neighborhood mean mean_Ip of the image I*p and the neighborhood mean mean_II of the image I*I, storing each result in the corresponding global memory;

(VI) Construct the second kernel function to compute, in turn, the covariance cov_Ip of images (I, p) and the variance var_I of image I, and then construct kernel functions to compute the key filtering parameters a and b.

Specifically, a covariance kernel function computes the covariance of images (I, p) according to cov_Ip = mean_Ip - mean_I .* mean_p;

a variance kernel function computes the variance of image I according to var_I = mean_II - mean_I .* mean_I;

a parameter-a kernel function computes the key filtering parameter a according to a = cov_Ip ./ (var_I + ε);

a parameter-b kernel function computes the key filtering parameter b according to b = mean_p - a .* mean_I.

(VII) Call the first kernel function to compute the neighborhood window means of the key filtering parameters a and b, obtain the final filtering result q, and store the result in the corresponding global memory.

Specifically, the first kernel function is called to obtain the neighborhood mean of key parameter a according to mean_a = boxfilter(a, r) / N;

the first kernel function is called to obtain the neighborhood mean of key parameter b according to mean_b = boxfilter(b, r) / N;

an output kernel function computes the final filtering result q according to q = mean_a .* I + mean_b, and the result is stored in the corresponding global memory.

(VIII) Transfer the filtered image stored in the global memory of the device video memory back to host memory.
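The host-side flow in steps (III) to (VII) can be summarised by the grayscale C sketch below. It is a sequential stand-in under stated assumptions: box_sum replaces the first kernel function with direct summation over a border-clipped window, each per-pixel loop body corresponds to what one CUDA thread would compute in the element-wise kernels, and grid_dim reproduces the ceiling division used for the 16 × 16 block grid. The names grid_dim, box_sum and guided_filter are hypothetical, and memory transfers, texture binding and constant memory are omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* Step (III): number of blocks needed to cover `extent` pixels. */
static int grid_dim(int extent, int block) { return (extent + block - 1) / block; }

/* Border-clipped (2r+1)x(2r+1) window sum; stand-in for the first kernel function. */
static void box_sum(const double *s, double *d, int w, int h, int r)
{
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            double acc = 0.0;
            for (int yy = y - r; yy <= y + r; ++yy)
                for (int xx = x - r; xx <= x + r; ++xx)
                    if (yy >= 0 && yy < h && xx >= 0 && xx < w) acc += s[yy * w + xx];
            d[y * w + x] = acc;
        }
}

/* Grayscale guided filter following steps (IV) to (VII). */
static void guided_filter(const double *I, const double *p, double *q,
                          int w, int h, int r, double eps)
{
    size_t n = (size_t)w * (size_t)h;
    double *buf = malloc(9 * n * sizeof *buf);
    assert(buf != NULL && r >= 0 && eps > 0.0);
    double *N = buf, *mI = buf + n, *mp = buf + 2 * n, *mIp = buf + 3 * n,
           *mII = buf + 4 * n, *a = buf + 5 * n, *b = buf + 6 * n,
           *t = buf + 7 * n, *ma = buf + 8 * n;

    for (size_t i = 0; i < n; ++i) t[i] = 1.0;
    box_sum(t, N, w, h, r);                          /* (IV) N = boxfilter(ones, r) */

    box_sum(I, mI, w, h, r);                         /* (V) window sums of I, p, I*p, I*I */
    box_sum(p, mp, w, h, r);
    for (size_t i = 0; i < n; ++i) t[i] = I[i] * p[i];
    box_sum(t, mIp, w, h, r);
    for (size_t i = 0; i < n; ++i) t[i] = I[i] * I[i];
    box_sum(t, mII, w, h, r);

    for (size_t i = 0; i < n; ++i) {                 /* (VI) element-wise kernels */
        mI[i] /= N[i]; mp[i] /= N[i]; mIp[i] /= N[i]; mII[i] /= N[i];
        double cov_Ip = mIp[i] - mI[i] * mp[i];
        double var_I = mII[i] - mI[i] * mI[i];
        a[i] = cov_Ip / (var_I + eps);
        b[i] = mp[i] - a[i] * mI[i];
    }

    box_sum(a, ma, w, h, r);                         /* (VII) window sums of a and b */
    box_sum(b, t, w, h, r);
    for (size_t i = 0; i < n; ++i)
        q[i] = (ma[i] / N[i]) * I[i] + t[i] / N[i];  /* q = mean_a .* I + mean_b */
    free(buf);
}
```

For a 946 × 756 image, grid_dim gives 60 × 48 blocks. When p is constant, cov_Ip is zero, so a = 0 and q reproduces p regardless of I, which makes a convenient sanity check.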

In addition, when guided filtering is used to implement image feathering, the procedure differs from the above as follows:

(I) Copying the r, g and b component data of the input image p and the guidance image I from CPU memory into GPU video memory involves multiple data transfers. The present invention therefore uses CUDA streams so that data transfers and kernel executions are interleaved, which improves the utilization of GPU resources; the advantage of CUDA streams is especially evident when the amount of data is large;

(II) When solving for the key parameter a, the present invention uses a single kernel function to compute all three components r, g and b, with a block dimension of 16 × 16 and blocks arranged with two-dimensional addressing. Each thread first loads var_I_rr, var_I_rg, var_I_rb, var_I_gg, var_I_gb and var_I_bb from global memory into registers, builds the 3 × 3 Sigma matrix and computes its determinant, storing the result in registers. Each thread then computes the inverse of the Sigma matrix and multiplies it by the cov_Ip matrix in one unified step, which increases the arithmetic intensity of the program and makes full use of the GPU's computing performance.
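For the colour feathering case, the per-thread work in step (II) can be sketched as the C function below. It is an assumption-laden illustration, not the patent's kernel: it follows the reference colour guided filter in adding ε to the diagonal of Sigma, inverts the symmetric 3 × 3 matrix through its adjugate and determinant (consistent with the text's use of a determinant), and forms a = cov_Ip · inv(Sigma) for the r, g and b channels. The name color_a and its argument order are hypothetical; in a kernel, the six variance terms and three covariances would be read once from global memory into registers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-pixel routine: Sigma = var_I + eps*U (symmetric 3x3),
 * a = cov * inv(Sigma) with inv(Sigma) = adj(Sigma) / det(Sigma).
 * Returns 1 on success, 0 if Sigma is singular. */
static int color_a(double vrr, double vrg, double vrb,
                   double vgg, double vgb, double vbb,
                   const double cov[3], double eps, double a[3])
{
    assert(cov != NULL && a != NULL);
    double s00 = vrr + eps, s01 = vrg, s02 = vrb;
    double s11 = vgg + eps, s12 = vgb, s22 = vbb + eps;

    /* Cofactors; for a symmetric matrix the adjugate is symmetric too. */
    double c00 = s11 * s22 - s12 * s12;
    double c01 = s02 * s12 - s01 * s22;
    double c02 = s01 * s12 - s02 * s11;
    double c11 = s00 * s22 - s02 * s02;
    double c12 = s01 * s02 - s00 * s12;
    double c22 = s00 * s11 - s01 * s01;

    double det = s00 * c00 + s01 * c01 + s02 * c02; /* expansion along row 0 */
    if (det == 0.0) return 0;

    a[0] = (cov[0] * c00 + cov[1] * c01 + cov[2] * c02) / det;
    a[1] = (cov[0] * c01 + cov[1] * c11 + cov[2] * c12) / det;
    a[2] = (cov[0] * c02 + cov[1] * c12 + cov[2] * c22) / det;
    return 1;
}
```

Keeping the determinant and the six distinct cofactors in registers avoids writing an explicit inverse matrix to global memory.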

Embodiment 3

To make the objectives, technical solutions and advantages of the present invention clearer, the technical solution of the present invention is described in further detail below with reference to specific examples.

The examples use the Windows 7 operating system, an Intel Core i5-3470 CPU with a clock frequency of 3.2 GHz and 4 GB of system memory, and an NVIDIA GeForce GTX 660 GPU, which contains 5 streaming multiprocessors (SMs) with 192 CUDA cores each, 2048 MB of on-board global memory and a 192-bit memory interface, and supports CUDA compute capability 3.0. The Visual Profiler included in the CUDA Toolkit is used to analyze all data, giving a quantitative analysis of program performance.

To verify the effectiveness of the method, the guided filtering algorithm is parallelized with CUDA for four applications: image smoothing, image feathering, image enhancement and flash image denoising. The filtering results and speed-up ratios are as follows:

Example 1: image smoothing

In this example, the filter radius r is 16 and the filtering parameter eps is 0.04. The input image p and the guidance image I are the same image, and the output image q is the final output; the results of Example 1 are shown in Fig. 2. As Fig. 2 shows, details, abrupt changes, edges and noise in the input image p are all suppressed to some degree, giving a satisfactory smoothing result.

Example 2: image feathering

In this example, the filter radius r is 60 and the filtering parameter eps is 0.000001. The input image p and the guidance image I are two different images, and the output image q is the final output; the results of Example 2 are shown in Fig. 3. As Fig. 3 shows, the feathering effect of the output image is evident: the edge regions change gradually, achieving a natural transition.

Example 3: image enhancement

In this example, the filter radius r is 16 and the filtering parameter eps is 0.01. The input image p and the guidance image I are the same image, and the output image q is the final output; the results of Example 3 are shown in Fig. 4. As Fig. 4 shows, both the global and the local features of the output image are clearly enhanced, which effectively improves the visibility of image details.

Example 4: flash image denoising

In this example, the filter radius r is 8 and the filtering parameter eps is 0.0004. The input image p and the guidance image I are two different images, and the output image q is the final output; the results of Example 4 are shown in Fig. 5. As Fig. 5 shows, the output image q has harmonious, natural colors and good noise suppression, giving an ideal result.

Furthermore, Figs. 2 to 5 show that the results of the present invention for smoothing, feathering, enhancement and flash denoising are essentially identical to those of the original C implementation, which demonstrates the accuracy of the invention. To compare the acceleration achieved, the guided filtering algorithm was implemented in Matlab, in C and in CUDA, and the implementations were compared experimentally. The time consumption and speed-up ratios for images of different resolutions are shown in Table 1:

Table 1. Time consumption (ms) and speed-up ratio of guided filtering implemented in different programming environments

As Table 1 shows, compared with the Matlab and C implementations, the CUDA parallel implementation of the present invention greatly reduces the time consumption of guided filtering. The acceleration is particularly pronounced for image feathering, where a speed-up of more than 60 times is achieved, and the acceleration becomes more pronounced as the image resolution increases.

References:

[1] Petschnigg G, Szeliski R, Agrawala M, et al. Digital photography with flash and no-flash image pairs[J]. ACM Transactions on Graphics (TOG), 2004, 23(3): 664-672.

Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments are for description only and do not indicate their relative merits.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A CUDA-based acceleration and optimization method for guided filtering, characterized in that the method comprises the following steps:

reading the input image p and the guidance image I from host memory into global memory, and constructing a first kernel function to obtain the neighborhood means of the input image p, the guidance image I, the image I*p and the image I*I over the neighborhood window;

constructing a second kernel function to compute, in turn, the covariance of images (I, p) and the variance of the guidance image I, and from these the key filtering parameters a and b;

calling the first kernel function to obtain the neighborhood mean mean_a of parameter a and the neighborhood mean mean_b of parameter b, thereby obtaining the final filtering result, saving the result in the corresponding global memory and transferring it back to host memory;

when implementing image feathering, the image resolution is 946 × 756, the time consumed is 62.3 ms and the speed-up ratio is 60.2;

when implementing image smoothing, the image resolution is 942 × 659, the time consumed is 21.2 ms and the speed-up ratio is 11.0;

when implementing image enhancement, the image resolution is 1024 × 1024, the time consumed is 59.7 ms and the speed-up ratio is 16.2;

when implementing flash denoising, the image resolution is 1024 × 1024, the time consumed is 89.7 ms and the speed-up ratio is 10.5;

the step of constructing the first kernel function to obtain the neighborhood means of the input image p, the guidance image I, the image I*p and the image I*I over the neighborhood window specifically comprises:

converting each of these neighborhood mean computations into a summation of pixel values over the image neighborhood window;

computing these neighborhood window sums with the first kernel function;

computing the neighborhood window sums with an integral image, with CUDA parallel optimization carried out by four kernel functions within the first kernel function;

the method further comprises:

calling the four kernel functions of the first kernel function to obtain the number of pixels N in the neighborhood window and storing it in constant memory;

the method further comprises:

computing the neighborhood window sums with an integral image, with CUDA parallel optimization carried out by the four kernel functions of the first kernel function, as follows:

(I) the 1st kernel function computes in parallel, for the i-th column of the image, the sum of the pixels from row 1 to row j, with a block dimension of 1024 × 1 and a grid dimension of 1 × 1;

each thread computes one column of the image by iterative accumulation in a loop, keeping intermediate data in registers, so that data reads satisfy the coalesced-access requirement of global memory;

(II) the data produced by the 1st kernel function require boundary processing; the 2nd kernel function handles the boundaries row by row, with a block dimension of 16 × 16 and a grid of ((image width + dimBlock.x - 1) / dimBlock.x) × ((image height + dimBlock.y - 1) / dimBlock.y) blocks, where dimBlock.x and dimBlock.y are the dimensions of the thread block along the x and y axes;

(III) the 3rd kernel function computes in parallel, for the j-th row of the image, the sum of the pixels from column 1 to column i; the input data of this kernel are first transposed, the 1st kernel function is then called to perform the computation, and the results are written back column-wise;

(IV) the data produced by the 3rd kernel function also require boundary processing; the 4th kernel function handles the boundaries column by column, with the same launch parameters as the 2nd kernel function, and its output, the neighborhood window sums of the image input to the 1st kernel function, is stored in the corresponding global memory;

and so on, so that the neighborhood mean mean_I of the guidance image I, the neighborhood mean mean_Ip of the image I*p and the neighborhood mean mean_II of the image I*I are obtained in turn.