CN107316324A

CN107316324A - Method for real-time stereo matching and optimization implemented on CUDA

Info

Publication number: CN107316324A
Application number: CN201710352967.XA
Authority: CN
Inventors: 陈龙; 谢国栋; 崔明月; 黄凯
Original assignee: National Sun Yat Sen University
Current assignee: Sun Yat Sen University; National Sun Yat Sen University
Priority date: 2017-05-18
Filing date: 2017-05-18
Publication date: 2017-11-03
Anticipated expiration: 2037-05-18
Also published as: CN107316324B

Abstract

The present invention relates to the technical field of computer vision, and more particularly to a method for real-time stereo matching and optimization implemented on CUDA. The invention performs dense stereo matching on left and right input images and uses CUDA parallelization to generate a disparity map in real time. The method includes: applying a census transform to the left and right images to generate a bit string for each pixel; computing the Hamming distance between bit strings to obtain an initial cost; aggregating costs by dynamic programming along 8 paths per pixel, taking the minimum-cost path to obtain the final cost and an initial dense disparity map; and performing superpixel segmentation on the left image with the k-means algorithm to obtain superpixel planar blocks, then refining the initial disparity by superpixel plane fitting. The invention further relates to multi-task GPU parallel acceleration, specifically the parallel implementation of multiple tasks on the NVIDIA CUDA architecture, and belongs to the field of GPGPU computing. GPU multithreading optimization greatly reduces computation time and yields a real-time disparity map.

Description

Method for real-time stereo matching and optimization implemented on CUDA

Technical field

The present invention relates to the technical field of computer vision, and more particularly to a method for real-time stereo matching and optimization implemented on CUDA.

Background art

Stereo vision is an important topic in computer vision. It allows depth information about objects and scenes to be obtained and is the basis for later 3D reconstruction and content analysis. Stereo matching is one of the key technologies of stereo vision: given left and right images taken by a calibrated binocular camera and aligned by rectification, stereo matching is the final step that directly yields disparity information.

The direct purpose of stereo matching is to obtain accurate disparity information. Accuracy and timeliness are the criteria for evaluating a matching method, and balancing the two is an important problem. Current mainstream stereo matching methods fall into three classes: local matching, semi-global matching, and global matching. Accuracy increases in that order, and so does the time consumed.

Semi-global matching strikes a good balance between accuracy and timeliness, so we adopt semi-global matching and focus on improving accuracy and reducing run time.

At the boundaries between foreground and background, accuracy is hard to maintain. How to trade off foreground dilation against noise to obtain a good result is a key factor in improving accuracy.

Superpixel plane segmentation of the image is an important means of separating foreground and background edges. Segmentation methods are a research focus in computer vision and are widely used in image processing and object recognition. Fast, accurate superpixel planar-block segmentation and fitting significantly improves disparity accuracy.

GPGPU is the technique of using GPUs for large-scale computation, and CUDA is the GPGPU architecture provided by NVIDIA. A GPU has far higher floating-point throughput and memory bandwidth than a CPU and is highly parallel. Unlike a CPU program, which executes its instructions one after another, GPU hardware is designed for many tasks to share the work, so a program can be run by many threads simultaneously.

Running the program on a GPU through CUDA can therefore greatly shorten run time and improve timeliness.

No patent or publication has yet been found that both achieves high accuracy and outputs a disparity map in real time.

Summary of the invention

To overcome at least one of the defects of the prior art described above, the present invention provides a method for real-time stereo matching and optimization implemented on CUDA. It is a stereo matching method that balances accuracy and real-time performance: superpixel segmentation is added to fit disparity planes, which improves accuracy and reduces the probability of mismatches, and a CUDA multithreaded parallel design reduces the computation time of the original algorithm from the order of seconds to the order of milliseconds, achieving real-time dense matching.

Technical solution: a method for real-time stereo matching and optimization implemented on CUDA, comprising the following steps:

S1. Select a window size and apply the census transform to every pixel of the left and right images except those in the border region, converting each pixel to a bit string;

S2. Within the given disparity range, for each selected left-image pixel, traverse the bit strings of the right-image pixels in that range and compute the Hamming distance to obtain the initial cost;

S3. Path aggregation: use dynamic programming to find the minimum aggregated cost of each point along each path, obtaining the energy function;

S4. Use WTA (winner-take-all) to select the disparity value at which the energy function is minimal;

S5. Apply a left-right consistency check as post-verification, obtaining the initial dense disparity map;

S6. Convert each pixel of the left image from the RGB color space to the CIELab color space;

S7. Partition superpixel planar blocks with the k-means clustering algorithm, iterating several times until convergence, and merge superpixel planar blocks;

S8. Sample each superpixel plane multiple times and compute the plane parameters;

S9. Compute new disparity values from the obtained plane parameters;

S10. Interpolate the new disparity to reduce black blocks and smooth the disparity map.

CUDA-based programming is applied in all of the above steps.

Compared with the prior art, the beneficial effects are as follows. The invention makes repeated use of CUDA streams, shared memory blocks, thread synchronization, and a parallel design in which many threads share the work, and it optimizes the data structures, all of which greatly improve computational efficiency. A census transform is applied to the left and right images, the Hamming distance is computed to obtain the cost, and the energy aggregated along paths optimized by dynamic programming yields the initial dense disparity map. Building on superpixel planes, k-means clustering is iterated to convergence to obtain the superpixel blocks; each block is treated as a plane, and its plane parameters are computed and used to fit new disparity values. Adding superpixel segmentation and plane fitting to the pipeline improves the accuracy of the disparity map, and CUDA parallel programming secures real-time performance. The invention therefore achieves a good balance between accuracy and timeliness, which, as far as is currently known, no other publication or patent has achieved.

Brief description of the drawings

Fig. 1 is a flow chart of the complete algorithm.

Fig. 2 shows the thread grid layout and the computation scheme when path aggregation runs in parallel on 8 CUDA streams.

Fig. 3 shows the segmentation obtained by k-means clustering.

Fig. 4 shows a rendering of the final disparity map.

Embodiment

The drawings are for illustrative purposes only and shall not be construed as limiting this patent. To better illustrate the embodiment, some parts of the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of an actual product. Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings. The positional relationships depicted in the drawings are likewise illustrative only and shall not be construed as limiting this patent.

CUDA manages threads using a grid of thread blocks. Each copy of a kernel function can use the built-in variable blockIdx to determine the index of its thread block, gridDim to obtain the number of thread blocks, and blockDim to obtain the number of threads per block. Choosing different gridDim and blockDim values is the key to obtaining different degrees of parallelism. When multiple thread blocks are needed and each block in turn contains multiple threads, the index is computed as: int tid = threadIdx.x + blockIdx.x * blockDim.x; With the correct index tid, each thread can locate the element it must process and the mapping is established; see Fig. 2 for details.
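
The index formula can be checked on the CPU. The following Python sketch is not CUDA code; it enumerates every (blockIdx.x, threadIdx.x) pair for an assumed one-dimensional launch and collects the resulting global indices. The function name global_thread_ids and the launch sizes are illustrative assumptions.

```python
def global_thread_ids(grid_dim, block_dim):
    """Enumerate tid = threadIdx.x + blockIdx.x * blockDim.x for a 1D launch."""
    ids = []
    for block_idx in range(grid_dim):        # plays the role of blockIdx.x
        for thread_idx in range(block_dim):  # plays the role of threadIdx.x
            ids.append(thread_idx + block_idx * block_dim)
    return ids


# Assumed launch: 4 blocks of 8 threads each.
ids = global_thread_ids(grid_dim=4, block_dim=8)
print(ids)
```

Every element index from 0 to gridDim * blockDim - 1 appears exactly once, which is what lets each thread own one element.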

In the GPU memory hierarchy, the readable and writable memories are registers, shared memory, and global memory. CUDA's context-switching mechanism is very efficient, with almost zero overhead, and register access is very fast, so registers should be used in preference wherever possible. CUDA shared memory buffers reside physically on the GPU, so access is also fast, and shared memory lets the threads running in one thread block communicate with each other. Global memory, which both the GPU and the CPU can write to, is slower. In CUDA programming, using each kind of memory appropriately therefore improves efficiency and running speed.

CUDA provides the thread synchronization mechanism __syncthreads() to synchronize the threads within a thread block. The call guarantees that every thread in the same block has finished all statements before it before any thread executes the statement after it. It is typically used together with shared memory.

We also use CUDA streams. A CUDA stream represents a queue of GPU operations, and the operations in the queue execute in the order in which they were added. CUDA streams provide task-level parallelism: kernel functions can run in parallel with one another, and data can be exchanged between host and device while the GPU is executing a kernel.

In step 1, we choose a window of n*m pixels for the census transform. In the CUDA program, a 16*16 thread block is chosen so that shared memory can be used later to improve efficiency. With image width width and image height height, the grid size for the kernel launch is ((width+15)/16, (height+15)/16). The shared memory size is (n+16)*(m+16). Data is first copied from global memory into shared memory, the census transform is then performed, and after the threads are synchronized the resulting bit strings are stored in GPU global memory. The idea of the census transform is to compare the gray value of the selected pixel (x, y) with the gray value of every pixel in its neighborhood window: a neighbor larger than (x, y) is encoded as 0 and a smaller one as 1.
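
The transform and the grid-size formula can be illustrated with a small CPU reference in Python. This is a sketch under stated assumptions, not the CUDA kernel: it uses no shared memory, and the names census_transform and grid_size, the treatment of equal gray values (encoded as 0), and the use of None for skipped border pixels are all illustrative choices.

```python
def census_transform(img, n, m):
    """Census transform with an n-wide, m-high window (n and m odd).

    img is a list of rows of gray values. Border pixels whose window would
    leave the image are skipped and left as None.
    Bit convention follows the text: neighbour larger than the centre -> 0,
    smaller -> 1. Equal neighbours get 0 here, which is an assumption.
    """
    height, width = len(img), len(img[0])
    rx, ry = n // 2, m // 2
    out = [[None] * width for _ in range(height)]
    for y in range(ry, height - ry):
        for x in range(rx, width - rx):
            centre = img[y][x]
            bits = 0
            for dy in range(-ry, ry + 1):
                for dx in range(-rx, rx + 1):
                    if dx == 0 and dy == 0:
                        continue  # the centre is not compared with itself
                    bit = 1 if img[y + dy][x + dx] < centre else 0
                    bits = (bits << 1) | bit
            out[y][x] = bits
    return out


def grid_size(width, height, block=16):
    """Thread blocks needed to cover the image: ((w+15)/16, (h+15)/16)."""
    return ((width + block - 1) // block, (height + block - 1) // block)
```

Storing the bit string as an integer keeps the later Hamming distance computation to a single XOR followed by a bit count.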

In step 2, the disparity range we choose is d. In CUDA, a thread block of size 32*8 is chosen, with shared memory of size (32+d)*8. The bit string obtained for each pixel in step 1 is first copied from GPU global memory into shared memory; the corresponding bit strings within the d-disparity range are then looked up in the right image, and the Hamming distance of each pair of bit strings is computed with an XOR operation. The Hamming distance is the number of positions at which two words of equal length differ; we write d(x, y) for the Hamming distance between two words x and y. The two bit strings are XORed and the 1 bits in the result are counted: that count is the Hamming distance. The d disparities are traversed, the disparity with the smallest Hamming distance is selected and stored in global memory, and the initial cost values we need are obtained.
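
The XOR-and-count computation can be sketched in Python as follows. This is a CPU reference for one image row, not the CUDA kernel. It keeps the cost for every candidate disparity so that later aggregation has a full cost table to work on; the names hamming and cost_row and the large border cost of 255 are assumptions made for illustration.

```python
def hamming(a, b):
    """Hamming distance between two census bit strings stored as ints."""
    return bin(a ^ b).count("1")  # XOR, then count the 1 bits


def cost_row(census_left, census_right, max_disp):
    """Initial cost C[x][d] for one image row.

    The left pixel at column x is compared with the right pixel at x - d.
    Candidates that fall outside the image get a large cost (an assumption;
    the text does not say how the border is handled).
    """
    big = 255
    width = len(census_left)
    cost = [[big] * max_disp for _ in range(width)]
    for x in range(width):
        for d in range(max_disp):
            if x - d >= 0:
                cost[x][d] = hamming(census_left[x], census_right[x - d])
    return cost
```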

In step 3, path aggregation optimizes the energy function: the minimum of the energy function is sought in order to determine the disparity.

$$L_r(p,d) = C(p,d) + \min\Bigl(L_r(p-r,d),\; L_r(p-r,d-1)+P_1,\; L_r(p-r,d+1)+P_1,\; \min_i L_r(p-r,i)+P_2\Bigr) - \min_k L_r(p-r,k)$$

In the formula above, C(p, d) is the cost obtained in the previous step for point p at disparity d, and Lr(p, d) is the accumulated path cost at point p for disparity d along path r. The aggregated cost of each point is expressed as: current cost + min(the aggregated cost of the preceding point on the path at the current disparity; its aggregated cost at a disparity differing by 1, plus P1; its minimum aggregated cost at disparities differing by more than 1, plus P2) minus the minimum aggregated cost of the preceding point on the path at disparities differing by more than 1. The last term is subtracted to keep the values from growing too large. P1 and P2 are the penalty coefficients for neighboring pixels whose disparities differ by 1 and by more than 1, respectively; these two regularization terms keep the disparity map smooth while preserving edges. P1 adapts to uneven object surfaces, and P2 adapts to gray-level discontinuities (possibly several objects).
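
To make the recurrence concrete, here is a Python sketch that aggregates costs along a single left-to-right path for one image row. It is a CPU reference, not the CUDA kernel. As an assumption, both the P2 term and the subtracted term use the minimum over all disparities of the preceding pixel, as is usual in semi-global matching; the function name aggregate_left_to_right is illustrative.

```python
def aggregate_left_to_right(cost, p1, p2):
    """Aggregate costs along one path (left to right) for a single row.

    cost[x][d] is the initial cost; the result agg[x][d] is the path cost
    L_r(p, d) with r pointing from left to right.
    """
    width, ndisp = len(cost), len(cost[0])
    agg = [list(cost[0])]  # the first pixel on the path has no predecessor
    for x in range(1, width):
        prev = agg[-1]
        prev_min = min(prev)
        row = []
        for d in range(ndisp):
            best = min(prev[d], prev_min + p2)
            if d > 0:
                best = min(best, prev[d - 1] + p1)
            if d + 1 < ndisp:
                best = min(best, prev[d + 1] + p1)
            # subtracting prev_min keeps the values from growing along the path
            row.append(cost[x][d] + best - prev_min)
        agg.append(row)
    return agg
```

The same routine run over the other directions, for example on a reversed row for right-to-left, gives the remaining path costs.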

Finding the disparity map that minimizes the aggregated cost over the 2D image is an NP-hard problem. If instead the matching points are chosen along a single direction r (for example left to right, or right to left) so that the aggregated cost along that direction (the final L) is minimal, the problem becomes solvable in polynomial time. We use a dynamic programming algorithm, which has a global character, to compute the minimum. For the cost of p to be minimal, the cost of its neighboring point q must be minimal; for the cost of q to be minimal, the cost of q's neighbor m must be minimal; passing this requirement along the path, one subproblem at a time, yields the minimum.

If we solved only along each row, the constraints between rows would be ignored entirely: the neighbor q of p would be reduced to p's left or right neighbor, the optimization would perform poorly, and streaking artifacts would appear. We therefore optimize along paths in 8 directions, that is, r = 8. The final result is the aggregated value over the 8 paths, which is the energy function we require.

In our implementation, 8 CUDA streams are used, so that data can be transferred while kernel functions execute and all paths can be computed at the same time to improve efficiency. For the 4 horizontal and vertical paths, blocks of size 32*8 and a grid of size (height/8, 1) are used; for the 4 diagonal paths, blocks of size 32*8 and a grid of size ((height+width)/8, 1) are used. Shared memory of size 32*8 stores the path-aggregated cost values of the pixels. The minimum is found by parallel reduction together with __syncthreads() thread synchronization, which greatly reduces warp divergence, and #pragma unroll directives are applied to the loops so that they are fully unrolled and run in parallel, improving the efficiency of the compiled code. Fig. 2 shows how the kernels run in parallel across the 8 CUDA streams and how memory is allocated.
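
The parallel reduction used to find the minimum can be imitated sequentially. The Python sketch below is not CUDA code: the list data stands in for the shared memory buffer, the inner loop stands in for the active threads, and each pass of the outer loop corresponds to the work between two __syncthreads() barriers. The function name tree_reduce_min and the power-of-two length requirement are assumptions of this sketch.

```python
def tree_reduce_min(values):
    """Minimum by pairwise (tree) reduction; len(values) must be a power of two."""
    data = list(values)  # plays the role of the shared-memory buffer
    stride = len(data) // 2
    while stride > 0:
        for tid in range(stride):  # threads with tid < stride are active
            data[tid] = min(data[tid], data[tid + stride])
        # a real kernel would call __syncthreads() here
        stride //= 2
    return data[0]


print(tree_reduce_min([7, 3, 9, 1, 4, 8, 2, 6]))
```

Because every active thread in a pass executes the same min operation on a contiguous range of indices, threads in the same warp follow the same branch, which is the property the text refers to.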

In step 4, WTA (winner-take-all) is a competitive learning rule used in unsupervised learning. Here it means that only the smallest energy value is activated and used, that is, the disparity with the smallest energy is chosen. The program again uses thread blocks of size 32*8 and a shared memory block of size 128*8. The energy function obtained in the previous step is first copied from global memory into shared memory and the values are compared in turn; the minimum is found by parallel reduction to reduce branching, and the resulting disparity d is stored in a new global memory buffer.
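
The selection itself is a per-pixel argmin, as the following Python sketch shows. It is a CPU reference only; the function name winner_take_all and the tie-breaking rule (the smaller disparity wins) are assumptions, since the text does not specify how ties are resolved.

```python
def winner_take_all(energy):
    """Return, for each pixel, the disparity index with the smallest energy.

    energy[x][d] is the aggregated energy of pixel x at disparity d.
    Ties go to the smaller disparity because min() returns the first minimum.
    """
    return [min(range(len(e)), key=e.__getitem__) for e in energy]
```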

In step 5, a post-processing measure commonly applied to disparity maps is used: the left-right consistency check (LRC). Occluded points are points that appear in only one image and cannot be seen in the other. Occlusions usually form a continuous region, and the purpose of the LRC check is to detect occlusions and obtain the occlusion image corresponding to the left image. For a point p in the left image with computed disparity d1, the corresponding point in the right image is (p - d1), whose disparity is denoted d2. If |d1 - d2| > threshold, p is marked as an occluded point. The program uses 16*16 blocks and a grid of size (width/16, height/16) to traverse the left and right images, with each thread handling one pixel; if |d1 - d2| > 1, the disparity is set to 0 and saved to global memory.
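
The check can be written out for a single image row in Python. This is a CPU sketch of what each thread would do for its pixel, not the CUDA kernel; the function name left_right_check and the decision to invalidate matches that fall outside the row are assumptions.

```python
def left_right_check(disp_left, disp_right, threshold=1):
    """Set left disparities that fail the consistency check to 0.

    For left pixel x with disparity d1, the matching right pixel is x - d1
    and its disparity is d2. The value is kept only if |d1 - d2| <= threshold.
    """
    out = []
    for x, d1 in enumerate(disp_left):
        xr = x - d1
        if 0 <= xr < len(disp_right) and abs(d1 - disp_right[xr]) <= threshold:
            out.append(d1)
        else:
            out.append(0)  # occluded or mismatched pixel
    return out
```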

In step 6, the superpixel plane segmentation stage begins. The invention uses the SLIC method, simple linear iterative clustering. Pixels are clustered according to their color similarity and spatial proximity: the color image is converted into 5-dimensional feature vectors made up of the CIELAB color components and the XY coordinates, a distance metric is constructed on these vectors, and the image pixels are clustered locally. The image must therefore first be converted from the RGB color space to the CIELab color space. First, the RGB to XYZ conversion:

[X, Y, Z] = [M] * [R, G, B], where M is a 3x3 matrix.

Then the XYZ to Lab conversion:

L = 116*f(Y1) - 16

a = 500*(f(X1) - f(Y1))

b = 200*(f(Y1) - f(Z1))

where f is a correction function similar to a gamma function:

when x > 0.008856, f(x) = x^(1/3);

when x <= 0.008856, f(x) = (7.787*x) + (16/116).

X1, Y1, Z1 are the linearly normalized X, Y, Z values, respectively.
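
The per-pixel conversion can be sketched in Python as below. This is a CPU reference of the work one thread would do, not the CUDA kernel. The text does not reproduce the matrix M or the normalization constants, so the sketch assumes the commonly used linear sRGB to XYZ matrix with a D65 reference white and input channels in [0, 1]; the names M, WHITE, f, and rgb_to_lab are illustrative.

```python
# Assumed matrix: linear sRGB -> XYZ with a D65 white point. The patent's own
# matrix is not reproduced in the text, so this choice is illustrative.
M = (
    (0.4124564, 0.3575761, 0.1804375),
    (0.2126729, 0.7151522, 0.0721750),
    (0.0193339, 0.1191920, 0.9503041),
)
WHITE = (0.95047, 1.0, 1.08883)  # assumed D65 reference white (Xn, Yn, Zn)


def f(t):
    """Piecewise correction function from the text."""
    if t > 0.008856:
        return t ** (1.0 / 3.0)
    return 7.787 * t + 16.0 / 116.0


def rgb_to_lab(r, g, b):
    """Convert one pixel with r, g, b in [0, 1] to (L, a, b)."""
    xyz = [row[0] * r + row[1] * g + row[2] * b for row in M]
    x1, y1, z1 = (v / w for v, w in zip(xyz, WHITE))  # normalised X1, Y1, Z1
    lightness = 116.0 * f(y1) - 16.0
    a_out = 500.0 * (f(x1) - f(y1))
    b_out = 200.0 * (f(y1) - f(z1))
    return lightness, a_out, b_out
```

Under these assumptions, white maps to a lightness close to 100 and black to 0, with both color components close to 0 in each case.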

In the CUDA program, thread blocks of size 16*16 are used so that each pixel corresponds to one thread. Each thread applies the conversion above, and computational efficiency essentially reaches the theoretical peak. The resulting values are then stored in the lab global variable on the GPU.

In step 7, the k-means algorithm is a simple, efficient, and widely used clustering method that groups data by repeatedly assigning each point to the nearest mean (seed point). The image is initially divided into n superpixels, and the Lab values and xy coordinates of the n cluster centers are stored in GPU global memory. Blocks of size w*h are used, with each thread handling one pixel. Each thread searches the cluster centers within a 2w*2h region, computes the Euclidean distance in Lab color and in position between the pixel and each center, and writes the center with the smallest distance to global memory. This is iterated until convergence:

D_lab=(P_lab.x-C_lab.x)²+(P_lab.y-C_lab.y)²+(P_lab.z-C_lab.z)²

D_xy=(P_xy.x-C_xy.x)²+(P_xy.y-C_xy.y)²

D=D_lab+w*D_xy

In the formulas above, D_lab is the squared Euclidean distance in Lab color space between point P and a cluster center C within the 2w*2h region, D_xy is the squared Euclidean distance between the image coordinates of P and C, and w is a weight parameter. The cluster center assigned to each pixel after the update serves as its plane label. A threshold is set: each superpixel plane smaller than the threshold computes its mean Lab color distance to the surrounding planes and is merged with the one at the smallest distance. Fig. 3 shows the superpixel segmentation after k-means clustering and merging.
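
The distance and the assignment step can be illustrated in Python. This is a CPU sketch, not the CUDA kernel: for brevity it scans every cluster center instead of only those inside the local search region, and the names slic_distance and nearest_centre and the tuple layout (L, a, b, x, y) are assumptions.

```python
def slic_distance(pixel, centre, w):
    """D = D_lab + w * D_xy for a pixel and a cluster centre.

    pixel and centre are tuples (L, a, b, x, y); w is the weight parameter.
    """
    d_lab = sum((p - c) ** 2 for p, c in zip(pixel[:3], centre[:3]))
    d_xy = sum((p - c) ** 2 for p, c in zip(pixel[3:], centre[3:]))
    return d_lab + w * d_xy


def nearest_centre(pixel, centres, w):
    """Index (label) of the centre with the smallest distance D."""
    return min(range(len(centres)),
               key=lambda k: slic_distance(pixel, centres[k], w))
```

A larger w favors spatially compact superpixels, while a smaller w lets color similarity dominate the assignment.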

In step 8, every pixel in the image was labeled in step 7, so each pixel has a corresponding superpixel block. Three points determine a plane, so in each superpixel block 3 distinct points with nonzero disparity are sampled at random, and the plane equation is determined from the three sampled points:

A=(x1 × z2-x2 × z1) × (y2 × z3-y3 × z2)-(x2 × z3-x3 × z2) × (y1 × z2-y2 × z1)

B=y1 × z2-y2 × z1；

P0=((z2 × d1-z1 × d2) × (y2 × z3-y3 × z2)-(z3 × d2-z2 × d3) × (y1 × z2-y2 × z1))/A；

P1=(z3 × d2-z2 × d3-P0 × (x2 × z3-x3 × z2))/B；

P2=(d1-P0 × x_i-P1×y_i)/z_i；

The formulas above compute the plane parameters. If A<0, the plane parameters are P0=0, P1=0, P2=-1. If B<0, then

P1=(z3 × d2-z2 × d3-P0 × (x2 × z3-x3 × z2))/B；

The parameter P2 depends on whether z_i is less than 0. When the plane parameters are computed for the first time at initialization, z is 1. (x, y, d) denote the x-coordinate, y-coordinate, and disparity value of a pixel, respectively.

Once the plane parameters are obtained, the plane equation follows: S = P0 × x + P1 × y + P2. The remaining points of the block are tested against the equation and the number of points that fit the plane is counted. This is repeated several times, and the plane parameters with the most fitting points are kept as the result. All fitting points are then accumulated to derive new parameters, a new round of plane parameter computation is carried out, and the final result is stored in the data structure.
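
The sample-and-count loop can be sketched in Python. This is a CPU reference under several assumptions, not the patented computation: the plane through three samples is solved with Cramer's rule instead of the elimination formulas given in the text, the inlier tolerance tol, the number of trials, and the fixed random seed are illustrative, and the final refit over all fitting points is left out.

```python
import random


def fit_plane(p1, p2, p3):
    """Plane d = P0*x + P1*y + P2 through three (x, y, d) samples.

    Returns None when the three points are collinear in (x, y).
    """
    (x1, y1, d1), (x2, y2, d2), (x3, y3, d3) = p1, p2, p3
    det = (x1 - x3) * (y2 - y3) - (x2 - x3) * (y1 - y3)
    if det == 0:
        return None
    a = ((d1 - d3) * (y2 - y3) - (d2 - d3) * (y1 - y3)) / det
    b = ((x1 - x3) * (d2 - d3) - (x2 - x3) * (d1 - d3)) / det
    c = d3 - a * x3 - b * y3
    return a, b, c


def count_inliers(plane, points, tol=1.0):
    """Number of (x, y, d) points whose disparity is within tol of the plane."""
    a, b, c = plane
    return sum(1 for x, y, d in points if abs(a * x + b * y + c - d) <= tol)


def best_plane(points, trials=20, seed=0):
    """Repeat random 3-point sampling and keep the plane with most inliers."""
    rng = random.Random(seed)
    valid = [p for p in points if p[2] != 0]  # zero disparity marks invalid pixels
    best, best_count = None, -1
    for _ in range(trials):
        plane = fit_plane(*rng.sample(valid, 3))
        if plane is None:
            continue
        n = count_inliers(plane, valid)
        if n > best_count:
            best, best_count = plane, n
    return best
```

In a GPU version each superpixel could be handled independently, since the sampling and counting for one block do not depend on any other block.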

In step 9, we already have the label of each pixel and the plane parameters from step 8. Superpixel planar-block segmentation lets us assume that disparity varies very little within one planar block, so we replace the initial disparity value with the value given by the corresponding plane equation to refine it. The new disparity is d = P0 × x + P1 × y + P2. If d lies in the interval [0, 255], the original initial disparity value is replaced with this new d. Note that pixels on the boundary lines of superpixel planar blocks are not replaced and keep their initial disparity value.
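
The replacement rule for one pixel can be written as a short Python function. This is a CPU sketch; the name refine_disparity and the on_edge flag, which stands for "this pixel lies on a superpixel boundary line", are illustrative assumptions.

```python
def refine_disparity(x, y, initial, plane, on_edge=False):
    """Replace the initial disparity with the fitted plane value when valid.

    plane is (P0, P1, P2). Pixels on a superpixel boundary keep their initial
    disparity, and so do pixels whose plane value falls outside [0, 255].
    """
    if on_edge:
        return initial
    p0, p1, p2 = plane
    d = p0 * x + p1 * y + p2
    return d if 0 <= d <= 255 else initial
```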

In step 10, the disparity map is interpolated. To make the disparity map smoother, we interpolate the pixels whose disparity is 0. First, an interval of zero disparity is found, say [start, end]; the disparity values at start-1 and end+1 are then compared, and the smaller one is written into [start, end]. After interpolation, the transition across zero-disparity areas is smoother, the black borders caused by foreground dilation at foreground/background boundaries are smoothed reasonably well, and the accuracy of the disparity map improves.
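
For a single row, the interpolation can be sketched in Python as follows. This is a CPU reference, not the CUDA kernel. The text does not say what happens when a zero run touches the image border or when a whole row is zero; the sketch assumes the single available neighbor is used in the first case and the row is left unchanged in the second. The function name fill_zero_runs is illustrative.

```python
def fill_zero_runs(row):
    """Fill each run of zero disparities with the smaller neighbouring value.

    For a run [start, end], the values at start-1 and end+1 are compared and
    the smaller one is written into the run.
    """
    out = list(row)
    n = len(out)
    x = 0
    while x < n:
        if out[x] != 0:
            x += 1
            continue
        start = x
        while x < n and out[x] == 0:
            x += 1
        end = x - 1
        neighbours = []
        if start > 0:
            neighbours.append(out[start - 1])
        if end + 1 < n:
            neighbours.append(out[end + 1])
        if neighbours:
            fill = min(neighbours)
            for i in range(start, end + 1):
                out[i] = fill
    return out
```

Choosing the smaller neighbor fills occluded gaps with the background disparity, which counteracts foreground dilation at object boundaries.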

In the present invention, we make repeated use of CUDA streams, shared memory blocks, thread synchronization, and a parallel design in which many threads share the work, and we optimize the data structures, all of which greatly improve computational efficiency. A census transform is applied to the left and right images, the Hamming distance is computed to obtain the cost, and the energy aggregated along paths optimized by dynamic programming yields the initial dense disparity map. Building on superpixel planes, k-means clustering is iterated to convergence to obtain the superpixel blocks; each block is treated as a plane, and its plane parameters are computed and used to fit new disparity values. Adding superpixel segmentation and plane fitting to the pipeline improves the accuracy of the disparity map, and CUDA parallel programming secures real-time performance. The invention therefore achieves a good balance between accuracy and timeliness, which, as far as is currently known, no other publication or patent has achieved.

Obviously, the above embodiments are merely examples given to illustrate the invention clearly and do not limit its embodiments. Those of ordinary skill in the art can make other changes in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the claims of the invention.

Claims

1. A method for real-time stereo matching and optimization implemented on CUDA, characterized by comprising the following steps:

S1. Select a window size and apply the census transform to every pixel of the left and right images except those in the border region, converting each pixel to a bit string;

S2. Within the given disparity range, for each selected left-image pixel, traverse the bit strings of the right-image pixels in that range and compute the Hamming distance to obtain the initial cost;

S7. Partition superpixel planar blocks with the k-means clustering algorithm, iterating several times until convergence, and merge superpixel planar blocks;

S9. Compute new disparity values from the obtained plane parameters;

2. The method for real-time stereo matching and optimization implemented on CUDA according to claim 1, characterized in that CUDA-based programming is applied in all of the above steps.