CN102647588B - GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation - Google Patents


Info

Publication number: CN102647588B
Application number: CN201110040025.0A
Authority: CN (China)
Prior art keywords: search, thread, GPU, image layer, coordinate
Legal status: Active
Other versions: CN102647588A (Chinese)
Inventors: 王振宇, 王荣刚, 董胜富, 高文
Original and current assignee: Peking University Shenzhen Graduate School
Priority/filing date: 2011-02-17
Publication of application: CN102647588A
Publication of grant: CN102647588B


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a hierarchical-search motion estimation method accelerated by the parallel computing capability of a GPU (Graphics Processing Unit). The method comprises the following steps: generating the images of the different image layers used by the hierarchical search algorithm; adaptively allocating threads; computing the SAD of each search image block at each search point in parallel; and finding the smallest SAD in parallel in cooperation with the CPU (Central Processing Unit). The adaptive thread allocation scheme of the invention accommodates different image resolutions and search-region sizes; image downsampling is performed in parallel on the GPU, which both improves acceleration and reduces data transfer between the CPU and the GPU; and the GPU/CPU cooperative minimum-SAD lookup effectively prevents GPU threads from idling.

Description

A GPU acceleration method for hierarchical-search motion estimation
Technical field
The present invention relates to motion estimation methods in the field of video processing, and specifically to a hierarchical-search motion estimation acceleration method in which a GPU assists the CPU.
Technical background
Motion estimation is an important technique in video processing, playing a key role in applications such as hybrid video coding, block-matching-based motion detection, and object tracking. For each block, a motion estimation algorithm computes the SAD (sum of absolute differences) at all or some of the candidate search points within a search range, and obtains the optimal or near-optimal motion vector by finding the minimum SAD. The full-search motion estimation algorithm computes the SAD at every candidate point in the search range and thus obtains the best motion vector, but at a very high computational cost. Many fast search algorithms have therefore appeared, such as three-step search, hierarchical search, and hexagon search. Among them, the hierarchical-search motion estimation algorithm is highly regular and easy to accelerate in hardware.
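As a rough illustration of the SAD criterion and the full search it motivates (a plain sequential Python sketch, not the patented GPU method; the names `sad` and `full_search` are our own):

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def full_search(cur_block, ref_frame, top, left, r):
    """Evaluate the SAD at every offset within +/- r of (left, top) in the
    reference frame; return (min_sad, (dx, dy))."""
    n = len(cur_block)
    h, w = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= h - n and 0 <= x <= w - n:
                cand = [row[x:x + n] for row in ref_frame[y:y + n]]
                s = sad(cur_block, cand)
                if best is None or s < best[0]:
                    best = (s, (dx, dy))
    return best
```

The quadratic number of candidate offsets is exactly the cost that the hierarchical search below reduces.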
The basic hierarchical-search motion estimation algorithm divides the current frame and the reference frame into several image layers. The original image is the lowest layer, and each higher image layer is obtained by downsampling the adjacent layer below it. During hierarchical-search motion estimation, a full-search-based estimation is first performed on the top image layer within a specified search region centered on a point o; from the SAD values of all search points in that region, an optimal match point a is obtained on that layer. The next layer down is then searched within a specified region centered on a, yielding a new optimal match point b. This proceeds layer by layer until the lowest layer, i.e. the original image layer, produces the final optimal match point c (see Fig. 1), which gives the motion vector of the hierarchical search. With only one image layer, the algorithm reduces to full search, so full-search motion estimation can be regarded as a special case of hierarchical search. The quality of hierarchical-search motion estimation depends on the number of image layers and the downsampling rate of each layer. Typically the original image is divided into 2-3 image layers, each obtained by 1/2 downsampling of the layer below.
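The layer construction described above can be sketched in plain Python (names are illustrative; the averaging filter and integer rounding are one selectable choice, as the embodiment later notes):

```python
def downsample(img):
    """Produce the next image layer: halve width and height by averaging
    each 2x2 neighborhood."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*y][2*x] + img[2*y][2*x+1] +
              img[2*y+1][2*x] + img[2*y+1][2*x+1]) // 4
             for x in range(w)]
            for y in range(h)]

def build_layers(original, L):
    """Layer 1 is the original image; layers 2..L are successive 1/2
    downsamples. Index 0 of the returned list holds layer 1."""
    layers = [original]
    for _ in range(L - 1):
        layers.append(downsample(layers[-1]))
    return layers
```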
Compared with a CPU, a GPU has many computing units and high memory bandwidth, making it well suited to accelerating hierarchical-search motion estimation. The emergence of general-purpose GPUs (GPGPU) in particular has made developing high-performance algorithms on the GPU relatively easy. A general-purpose GPU consists of multiple multiprocessors; each multiprocessor has several arithmetic units, and each arithmetic unit has its own registers. All arithmetic units in one multiprocessor execute the same instruction in each clock cycle, enabling large-scale data-parallel processing. On a general-purpose GPU, all concurrent threads are organized into thread blocks; a thread block runs on only one multiprocessor, while several thread blocks may share one multiprocessor. Threads within a thread block can exchange data through the shared memory of the multiprocessor, whereas threads in different concurrently executing thread blocks can neither communicate nor synchronize. Furthermore, the threads in a thread block are subdivided into warps, and the threads in one warp always execute the same instruction (see Fig. 2). Because resources such as registers and arithmetic units on each multiprocessor are limited, the number of threads a thread block can hold is limited, and the warp size is an integer multiple of the number of processing units in a multiprocessor. When only some threads in a warp need to execute an instruction, all of its threads still execute it, and the results of the threads that did not need it are discarded. Thread idling caused by branching should therefore be avoided as far as possible.
A search of the prior art shows that existing methods using GPUs to accelerate motion estimation only accelerate full search or global elimination search. Moreover, because they are constrained by the number of threads a thread block can hold, these methods are severely limited in search range and cannot meet practical needs.
Summary of the invention
The object of the present invention is to provide an acceleration method for hierarchical-search motion estimation based on GPU/CPU cooperative processing, which makes full use of the computing resources of the GPU, offloads part of the CPU's computation burden, and significantly improves the speed of hierarchical-search motion estimation.
To achieve the above object, the hierarchical-search motion estimation method based on GPU/CPU cooperative processing proposed by the present invention comprises the following steps:
Step 1: Transfer the current frame and the reference frame to GPU video memory, and divide them into L layers, layer 1 being the original current and reference images. Use the GPU to generate the remaining L-1 higher image layers and keep them in video memory; each layer is a 1/2-downsampled image of the layer below in both width and height.
When the GPU downsamples image layer K to generate image layer K+1, layer K is divided into downsampled image blocks of size 2D x 2D. One thread block is assigned to each such block, and all thread blocks execute concurrently in arbitrary order. A two-dimensional thread group of D x D is created in each thread block. The thread at coordinate (x, y) in each group generates one downsampled point from the four points at coordinates (2x, 2y), (2x+1, 2y), (2x, 2y+1), and (2x+1, 2y+1) within its downsampled image block. The downsampled points computed by all threads of one group form the D x D image block at the corresponding position in layer K+1.
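The block/thread decomposition of step 1 can be modeled sequentially; each (block, thread) pair below stands for one GPU thread, and the 2x2 integer mean is the embodiment's (assumed) filter:

```python
D = 16  # each thread block turns a 2D x 2D source tile into a D x D output tile

def thread_downsample(src, bx, by, tx, ty):
    """Value computed by the thread at (tx, ty) of thread block (bx, by):
    the mean of the 2x2 source neighborhood at (2*(bx*D+tx), 2*(by*D+ty))."""
    ox, oy = bx * D + tx, by * D + ty   # output pixel this thread writes
    sx, sy = 2 * ox, 2 * oy             # top-left of its 2x2 source patch
    return (src[sy][sx] + src[sy][sx + 1] +
            src[sy + 1][sx] + src[sy + 1][sx + 1]) // 4
```

Because every output pixel is owned by exactly one thread, the thread blocks can run in any order, which matches the "concurrent, out-of-order" execution the method relies on.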
Starting from image layer L (the top image layer), steps 2 to 6 are performed layer by layer until image layer 1 (the original image layer) has been processed.
Step 2: Partition image layer K into blocks and allocate threads. Denote the size of layer K as M x N. Layer K is divided into (M/n) x (N/n) search image blocks of the specified block size n x n, and one thread block is assigned to the motion-vector search of each search image block. All thread blocks execute concurrently in arbitrary order.
The thread allocation for the search points of the search region is then set according to the search region configured on layer K and the GPU's computing capability. Denote the search-region width as H, its height as V, and the maximum number of threads per thread block on the GPU as T. If V x H <= T, a two-dimensional thread group of H x V is created in the thread block; the thread at coordinate (x, y) in the group computes the SAD at the search point with coordinate (x, y) in the search region. This is denoted thread allocation scheme A. If V x H > T, a one-dimensional thread group of size H is created in the thread block; the thread with index x computes the SAD at every search point with coordinate (x, k) in the search region, where 0 <= k < V. This is denoted thread allocation scheme B.
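The scheme selection reduces to one comparison; a minimal sketch (our own helper name):

```python
def choose_scheme(H, V, T):
    """Pick the step-2 thread allocation: scheme A (one thread per search
    point, 2-D group of H x V) when the whole region fits in one thread
    block, otherwise scheme B (1-D group of H threads, one per column
    of V search points)."""
    if V * H <= T:
        return ('A', H * V)
    return ('B', H)
```

With the embodiment's GPU (T = 512), a 16 x 16 region selects scheme A and a 64 x 64 region selects scheme B, exactly as in the later steps.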
Step 3: The thread block described in step 2 reads from GPU video memory the search result of its search image block on image layer K+1 and computes the block's search center on layer K. If K = L, the current layer is the top image layer, and the search center is taken as the position of the first pixel in the upper-left corner of the search image block. If K < L, denote the coordinate of the center pixel of the search image block on layer K as (x, y); the corresponding coordinate of this pixel on layer K+1 is (x/2, y/2). With the search image block size on layer K+1 denoted m x m, this pixel falls inside some search image block of layer K+1. The motion vector of that layer-K+1 block (obtained in step 6 of layer K+1) multiplied by 2 is taken as the search center of the search image block on layer K.
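Step 3's center computation can be sketched as follows (assuming, consistently with the embodiment, that layer-K coordinates halve on layer K+1; `motion_vectors` is a hypothetical lookup of the layer-K+1 results):

```python
def refine_center(cx, cy, m, motion_vectors):
    """Search center on layer K for the block whose center pixel is
    (cx, cy): locate the corresponding pixel on layer K+1, find the
    m x m search block enclosing it, and double that block's motion
    vector (the result of the layer-K+1 search)."""
    px, py = cx // 2, cy // 2            # corresponding pixel on layer K+1
    bx, by = px // m, py // m            # index of the enclosing search block
    mvx, mvy = motion_vectors[(bx, by)]  # layer-K+1 search result
    return (2 * mvx, 2 * mvy)
```

The embodiment's worked example (block center (68, 68), layer-3 block size 4, layer-3 motion vector (5, -4)) yields the center (10, -8) it reports.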
Step 4: Each thread described in step 2 computes the SAD of its assigned search points and stores the results in a linear array in shared memory. Under thread allocation scheme A, each thread directly stores its SAD value and thread coordinate to shared memory. Under thread allocation scheme B, each thread uses a two-stage search. In the first stage, the thread with index x starts from the point with in-region coordinate (x, V/2) and searches upward until the point (x, 0), where V is the search-region height, recording the minimum SAD value and the ordinate of the corresponding search point; the minimum SAD and ordinate of the first stage are denoted SAD1 and y1. In the second stage, symmetrically, the thread starts from the point (x, V/2+1) and searches downward until the point (x, V-1); the minimum SAD and ordinate of the second stage are denoted SAD2 and y2. Finally SAD1 and SAD2 are compared, and the smaller SAD value and its coordinate are written to shared memory. If the two SAD values are equal, the coordinate of the search point closer to the search center is written.
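One scheme-B thread can be modeled like this (`sad_at` is a hypothetical callback returning the SAD at a search point); note how scanning outward from the center row with a strict '<' makes each stage keep, on a tie, the point it visited first, i.e. the one nearest the center:

```python
def two_stage_column_search(sad_at, x, V):
    """Scheme-B thread for column x: stage 1 scans upward from the middle
    row, stage 2 scans downward from just below it; return (sad, y)."""
    def scan(ys):
        best = None
        for y in ys:
            s = sad_at(x, y)
            if best is None or s < best[0]:
                best = (s, y)
        return best
    s1 = scan(range(V // 2, -1, -1))   # stage 1: (x, V/2) up to (x, 0)
    s2 = scan(range(V // 2 + 1, V))    # stage 2: (x, V/2+1) down to (x, V-1)
    if s1[0] != s2[0]:
        return s1 if s1[0] < s2[0] else s2
    c = V // 2                         # tie: prefer the ordinate nearer center
    return s1 if abs(s1[1] - c) <= abs(s2[1] - c) else s2
```

This positional tie-breaking is what lets the method avoid comparing motion-vector magnitudes on the GPU, as advantage 5 below explains.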
Step 5: Find the minimum SAD of the search image block over all search points in the search region, together with its coordinate. When step 4 completes, each thread block described in step 2 holds TN linearly arranged SAD values and their coordinates (TN being the number of threads in the thread block). The minimum SAD and its coordinate are now found by a reduction. First, threads 0 through TN/2-1 compare the TN SAD values pairwise: thread x compares SAD values x and x+TN/2 and writes the smaller value and its coordinate to position x. If the two SAD values are equal, the search-point coordinate closer to the search center is written to position x. This leaves TN/2 SAD values and coordinates. Threads 0 through TN/4-1 then compare those TN/2 values, and so on, until the number of remaining SAD values does not exceed the warp size of the GPU.
Because all threads in one warp of a thread block always execute the same instruction, thread idling occurs once the number of remaining SAD values falls below twice the warp thread count (denoted WN). When the number of remaining SAD values is no greater than WN, the GPU finishes its work and passes the remaining SAD values back to the CPU, which traverses them to find the minimum SAD and its coordinate. That coordinate is the motion vector of the current search image block.
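Steps 5 and 6 together can be modeled sequentially (the inner loop stands for the threads of one block running in parallel; the tie-breaking by distance to center is omitted here for brevity):

```python
def min_sad_reduce(entries, warp_size):
    """Halve the candidate list each round: thread x compares entries
    x and x + n/2 and keeps the smaller. Once no more than warp_size
    entries remain, the 'CPU' finishes with a plain linear traversal.
    Returns ((min_sad, coordinate), number_of_gpu_rounds)."""
    entries = list(entries)          # (sad, coordinate) pairs
    n = len(entries)
    rounds = 0
    while n > warp_size:             # GPU reduction rounds
        half = n // 2
        for x in range(half):        # conceptually parallel
            if entries[x + half][0] < entries[x][0]:
                entries[x] = entries[x + half]
        n = half
        rounds += 1
    return min(entries[:n]), rounds  # CPU traverses the remainder
```

With 64 candidates and a warp size of 16, the reduction runs exactly two rounds before the CPU takes over, matching the embodiment.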
Step 6: If image layer K is not the original image layer, i.e. K != 1, the CPU passes the motion vectors of all search image blocks of layer K, obtained by traversal, back to GPU video memory for the search on layer K-1. If K = 1, the hierarchical-search motion estimation algorithm ends.
The present invention has the following advantages:
1. Through sensible data partitioning and thread allocation, the present invention effectively utilizes the hundreds of arithmetic units on the GPU, maximizes parallelism, and greatly improves the speed of the hierarchical-search motion estimation algorithm.
2. The present invention performs the downsampling that produces each image layer of the hierarchical search in parallel on the GPU and keeps the layers in GPU video memory, reducing communication between the CPU and the GPU and improving acceleration efficiency.
3. On each image layer of the hierarchical search, the present invention adaptively performs sensible data partitioning and thread allocation according to the specified search-region size, block size, and GPU processing capability.
4. The column-wise (per pixel column) data partitioning and thread allocation scheme proposed by the present invention handles large search regions whose number of search points exceeds the number of threads a single GPU thread block can hold, extending the range of application of the algorithm.
5. When two search points have the same SAD value, the magnitudes of the two candidate motion vectors must be compared to determine the better motion vector. In the column-wise scheme of the present invention, the two-stage search exploits the positional information of the search points, avoiding the motion-vector-magnitude comparison that GPUs handle poorly when SAD values are equal, and thus improving processing speed.
6. In the reduction that finds the minimum SAD and the best motion vector, the present invention lets the CPU take over the final comparisons once no more than a warp's worth of SAD values remain, avoiding GPU thread idling, exploiting the computing power of the CPU, and improving search speed.
Brief description of the drawings
Fig. 1 is a schematic diagram of the hierarchical-search motion estimation algorithm;
Fig. 2 is a schematic diagram of general-purpose GPU hardware structure, thread organization, and their mapping;
Fig. 3 is a flowchart of the embodiment of the present invention;
Fig. 4 is a schematic diagram of the GPU downsampling image layer 1 to generate image layer 2 in the embodiment;
Fig. 5 is a schematic diagram of the thread allocation for the GPU's parallel search on image layer 3 in the embodiment;
Fig. 6 is a schematic diagram of the two-stage search algorithm of the GPU's parallel search on image layer 3 in the embodiment;
Fig. 7 is a schematic diagram of all threads in a thread block using the reduction algorithm to find the minimum SAD among all SADs obtained by the search in the embodiment;
Fig. 8 is a schematic diagram of the thread allocation for the GPU's parallel search on image layers 2 and 1 in the embodiment;
Fig. 9 is a schematic diagram of the GPU computing the search centers before the parallel search on image layer 2 in the embodiment.
Embodiment
The present invention is described in detail below with reference to the drawings and an embodiment. The present embodiment is only one embodiment of the present invention, not all possible embodiments.
Hierarchical-search motion estimation is performed on a video sequence of 1920 x 1080 resolution; on the original image layer, the search image block size is 8 x 8, and the search region must cover an area 256 pixels wide and high centered on the current block. Each thread block of the GPU used can hold 512 threads, and the warp size of a thread block is 16 threads.
Under this application scenario, the following embodiment with three image layers can be adopted. Image layer 1 is the original image layer; the search image block size is 8 x 8 and the search region is an 8 x 8 area centered on the current block. Image layer 2 is the 960 x 540 image obtained by downsampling the original; the search block size is 8 x 8 and the search region is a 16 x 16 area centered on the current block. Image layer 3 is the 480 x 270 image obtained by downsampling layer 2; the search block size is 4 x 4 and the search region is a 64 x 64 area centered on the current block. The search block size on layer 1 must match the application requirements; the search-region size of every layer and the search block sizes of the other layers can be chosen freely as needed. The search regions and block sizes specified in this embodiment are one selectable set of values, not the only possible ones.
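A quick sanity check (our own arithmetic, not stated in the patent) that this three-layer configuration covers the required range: an offset found on layer L is doubled at every step down to layer 1, so the reachable half-ranges add up as follows.

```python
# half-widths of the centered search regions: 64 x 64, 16 x 16, 8 x 8
half_range = {3: 32, 2: 8, 1: 4}

# a layer-L offset is doubled (L - 1) times on the way down to layer 1
coverage = sum(r * 2 ** (layer - 1) for layer, r in half_range.items())

assert coverage == 32 * 4 + 8 * 2 + 4   # = 148 original-resolution pixels
assert coverage >= 128                  # covers the 256-pixel-wide requirement
```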
In this embodiment, the steps of the present invention are as shown in Fig. 3:
Step 1: The CPU transfers the reference frame and the current frame to GPU video memory. The original 1920 x 1080 image is divided into 60 x 34 downsampled image blocks of 32 x 32 pixels. 60 x 34 thread blocks are created on the GPU, each containing a 16 x 16 two-dimensional thread group. Each thread group processes one 32 x 32 downsampled image block and generates the 16 x 16 image block at the corresponding position of image layer 2. All threads perform the downsampling computation in parallel. For example, the thread at coordinate (5, 5) in a group computes the average of the four points at coordinates (10, 10), (10, 11), (11, 10), and (11, 11) in its 32 x 32 block, generating the point at coordinate (5, 5) of the corresponding 16 x 16 block on image layer 2, as shown in Fig. 4. The remaining threads work likewise. The downsampling formula (filter) used in this embodiment is one selectable option among all embodiments, not the only one.
By the same method, the 960 x 540 image layer 2 is divided into 30 x 17 downsampled image blocks of 32 x 32 pixels, and 30 x 17 thread blocks are assigned to downsample it in parallel into the 480 x 270 image layer 3.
Step 2: Determine the thread allocation scheme on image layer 3. The search region on layer 3 is 64 x 64, i.e. 4096 search points, which exceeds the maximum of 512 threads a GPU thread block can hold. Therefore, on this layer each search block is assigned a thread group of size 64, and each thread in the group computes all SAD values of one column of search points in the 64 x 64 search region, as shown in Fig. 5. Since layer 3 is the top image layer, the search center is the coordinate of the first pixel in the upper-left corner of the current block.
Step 3: Following the two-stage search method, each thread finds the point with the minimum SAD in its column of search points. As shown in Fig. 6, the thread with index x in a group evaluates the points with in-region coordinates (x, 0) through (x, 63). The thread first searches upward from the point (x, 32) to the point (x, 0), and then downward from the point (x, 33) to the point (x, 63), recording the minimum SAD value and coordinate of each stage. Finally the minima of the two stages are compared, and the smaller SAD value and its coordinate are written to shared memory.
Step 4: As shown in Fig. 7, the 64 SAD values stored in shared memory are compared by the reduction algorithm. After two iterations, 16 smaller SAD values remain.
Step 5: Since the warp size of this GPU is 16, once 16 values remain the GPU returns the remaining 16 SAD values and coordinates to the CPU, which traverses them to obtain the minimum SAD value and coordinate, and thus the motion vector.
Step 6: The CPU passes the motion vectors of all search image blocks back to GPU video memory in preparation for the search on image layer 2.
Step 7: Determine the thread allocation scheme on image layer 2. The search region on layer 2 is 16 x 16, i.e. 256 search points, which does not exceed the 512 threads a GPU thread block can hold. Therefore each search image block is assigned a 16 x 16 two-dimensional thread group, and each thread in the group computes the SAD value of the search point at the same coordinate, as shown in Fig. 8.
Step 8: Compute the search center of each search image block on image layer 2. For example, for search image block A whose first (upper-left) pixel is at coordinate (64, 64), the center is the pixel at coordinate (68, 68); the corresponding coordinate of this pixel on image layer 3 is (34, 34), which lies inside the 4 x 4 search image block B whose first pixel is at (32, 32) on layer 3. Supposing the motion vector obtained by searching block B on layer 3 is (5, -4), the search center of block A on layer 2 is set to (10, -8), as shown in Fig. 9.
Step 9: The remaining steps of the search on image layer 2 are the same as on image layer 3. Finally the motion vectors of all search image blocks on layer 2 are obtained and kept in video memory.
Step 10: The search is performed on image layer 1; the thread allocation and search scheme on layer 1 are the same as on layer 2. After the search on layer 1 completes and the motion vector of each search image block is obtained, the result is no longer passed back to the GPU but is output directly as the final result.
At this point all steps of this embodiment are complete, and the final computation result is obtained.
Through the above steps, the present invention makes full use of the parallel computing power of the GPU to complete hierarchical-search motion estimation quickly. With the adaptive thread allocation scheme, GPUs of different computing capabilities can provide high-performance acceleration for application requirements with different image resolutions, search-region sizes, and search block sizes.

Claims (4)

1. A GPU-accelerated hierarchical-search motion estimation algorithm, characterized in that it comprises the following steps:
Step 1: The CPU transfers the reference frame and the current frame to video memory; the GPU performs the downsampling of the image layers in parallel, generating the image layers used for the hierarchical search; the top image layer is set as the current image layer;
Step 2: Each search image block of the current image layer is assigned a thread block for its search; the thread block selects a thread allocation scheme according to the GPU processing capability and the search-region size specified on the current image layer, and allocates the required threads;
Step 3: Using the search results of the layer above, each thread initializes the search center of each search image block on the current image layer;
Step 4: Each thread computes the SAD at each of its assigned search points and stores the results in shared memory;
Step 5: Each thread block uses parallel reduction to find the minimum SAD recorded in shared memory, stopping the reduction when the number of remaining SAD values is less than or equal to the warp size of the thread block;
Step 6: Each thread block passes the SAD values remaining from step 5 and the corresponding search-point coordinates back to the CPU, which traverses them to find the minimum SAD and obtain the motion vector;
Step 7: If the current image layer is image layer 1, the motion vector of each search image block is output; otherwise the motion vectors are passed back to video memory, the current image layer is set to the next layer down, and the procedure returns to step 2.
2. The algorithm as claimed in claim 1, characterized in that, in step 1, the GPU's parallel downsampling of the image layers specifically comprises: dividing the image to be downsampled into downsampled image blocks of size 2D x 2D; assigning one thread block to each downsampled image block, all thread blocks executing concurrently in arbitrary order; creating a two-dimensional thread group of D x D in each thread block; the thread at coordinate (x, y) in each thread group generating one downsampled point from the four points at coordinates (2x, 2y), (2x+1, 2y), (2x, 2y+1), and (2x+1, 2y+1) within the downsampled image block.
3. The algorithm as claimed in claim 1, characterized in that, in step 2, the thread block's selection of a thread allocation scheme according to the GPU processing capability and the search-region size specified on the current image layer specifically comprises: with the search-region width being H, its height being V, and the maximum number of threads one GPU thread block can allocate being T: if V x H <= T, a two-dimensional thread group of H x V is created in the thread block, and the thread at coordinate (x, y) in the group computes the SAD at the search point with in-region coordinate (x, y); this scheme is denoted thread allocation scheme A; if V x H > T, a one-dimensional thread group of size H is created in the thread block, and the thread with index x computes the SAD at the search points with in-region coordinate (x, k), where 0 <= k < V; this scheme is denoted thread allocation scheme B.
4. The algorithm as claimed in claim 3, characterized in that thread allocation scheme B adopts a two-stage search: in the first stage, the thread with index x in the thread group starts from the point with in-region coordinate (x, V/2) and searches upward until the point (x, 0), where V is the search-region height; in the second stage, symmetrically, the thread starts from the point (x, V/2+1) and searches downward until the point (x, V-1).
CN201110040025.0A 2011-02-17 2011-02-17 GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation Active CN102647588B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110040025.0A CN102647588B (en) 2011-02-17 2011-02-17 GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110040025.0A CN102647588B (en) 2011-02-17 2011-02-17 GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation

Publications (2)

Publication Number Publication Date
CN102647588A CN102647588A (en) 2012-08-22
CN102647588B true CN102647588B (en) 2014-09-24

Family

ID=46660136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110040025.0A Active CN102647588B (en) 2011-02-17 2011-02-17 GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation

Country Status (1)

Country Link
CN (1) CN102647588B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10296340B2 (en) 2014-03-13 2019-05-21 Arm Limited Data processing apparatus for executing an access instruction for N threads

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103747262B (en) * 2014-01-08 2017-10-10 中山大学 A kind of method for estimating based on GPU
CN105448273A (en) * 2014-09-01 2016-03-30 扬智科技股份有限公司 Image processing method and system
CN106878737B (en) * 2017-03-02 2019-10-08 西安电子科技大学 Estimation accelerated method in efficient video coding
CN107872674A (en) * 2017-11-23 2018-04-03 上海交通大学 A kind of layering motion estimation method and device for ultra high-definition Video Applications
CN110837395B (en) * 2018-08-17 2022-03-25 北京图森智途科技有限公司 Normalization processing method, device and system for multi-GPU parallel training
CN110009551A (en) * 2019-04-09 2019-07-12 浙江大学 A kind of real-time blood vessel Enhancement Method of CPUGPU collaboration processing
CN116739884B (en) * 2023-08-16 2023-11-03 北京蓝耘科技股份有限公司 Calculation method based on cooperation of CPU and GPU

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1852442A (en) * 2005-08-19 2006-10-25 深圳市海思半导体有限公司 Layering motion estimation method and super farge scale integrated circuit
CN101472181A (en) * 2007-12-30 2009-07-01 英特尔公司 Configurable performance motion estimation for video encoding

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100994773B1 (en) * 2004-03-29 2010-11-16 삼성전자주식회사 Method and Apparatus for generating motion vector in hierarchical motion estimation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Gan Xinbiao et al.; "A CUDA-based parallel full-search motion estimation algorithm"; Journal of Computer-Aided Design & Computer Graphics; 2010-03-31; vol. 22, no. 3; pp. 457-460 *

Also Published As

Publication number Publication date
CN102647588A (en) 2012-08-22

Similar Documents

Publication Publication Date Title
CN102647588B (en) GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation
CN111178519B (en) Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
Yin et al. A high energy efficient reconfigurable hybrid neural network processor for deep learning applications
Kowalczuk et al. Real-time stereo matching on CUDA using an iterative refinement method for adaptive support-weight correspondences
Kim et al. A novel zero weight/activation-aware hardware architecture of convolutional neural network
Budden et al. Deep tensor convolution on multicores
US20190213439A1 (en) Switchable propagation neural network
CN102141976A (en) Method for storing diagonal data of sparse matrix and SpMV (Sparse Matrix Vector) realization method based on method
CN111738433A (en) Reconfigurable convolution hardware accelerator
CN105550974A (en) GPU-based acceleration method of image feature extraction algorithm
Sun et al. Optimizing SpMV for diagonal sparse matrices on GPU
Zlateski et al. ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs
Li et al. High throughput hardware architecture for accurate semi-global matching
CN110135569A (en) Heterogeneous platform neuron positioning three-level flow parallel method, system and medium
CN103198451A (en) Method utilizing graphic processing unit (GPU) for achieving rapid wavelet transformation through segmentation
CN109165733A (en) Multi-input multi-output matrix maximum pooling vectorization implementation method
CN104537278A Hardware acceleration method for prediction of RNA secondary structure with pseudoknots
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
CN110533710A (en) A kind of method and processing unit of the binocular ranging algorithm based on GPU
CN107316324B (en) Method for realizing real-time stereo matching and optimization based on CUDA
Ling et al. Lite-stereo: a resource-efficient hardware accelerator for real-time high-quality stereo estimation using binary neural network
CN102799564A (en) Fast fourier transformation (FFT) parallel method based on multi-core digital signal processor (DSP) platform
Li et al. Fast convolution operations on many-core architectures
CN109448092B (en) Load balancing cluster rendering method based on dynamic task granularity
CN115859011A (en) Matrix operation method, device and unit, and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant