CN103873874A - Full search motion estimation method based on programmable parallel processor - Google Patents


Info

Publication number
CN103873874A
Authority
CN
China
Prior art date
Legal status
Granted
Application number
CN201410056657.XA
Other languages
Chinese (zh)
Other versions
CN103873874B (en)
Inventor
隆刚
金明
史方
Current Assignee
Tong Wei Technology (shenzhen) Co Ltd
Original Assignee
Tong Wei Technology (shenzhen) Co Ltd
Priority date
Filing date
Publication date
Application filed by Tong Wei Technology (shenzhen) Co Ltd filed Critical Tong Wei Technology (shenzhen) Co Ltd
Priority to CN201410056657.XA
Publication of CN103873874A
Application granted
Publication of CN103873874B
Status: Expired - Fee Related


Abstract

The invention discloses a full-search motion estimation method based on a programmable parallel processor. The method comprises the following steps: a thread group comprising threads is established in the programmable parallel processor; an N×N current-frame macroblock is divided into multiple current-frame sub-blocks, and a one-to-one correspondence is established between each current-frame sub-block and a thread; the current-frame macroblock is loaded into the shared memory of the programmable parallel processor; and when the current-frame macroblock is searched, the threads call the current-frame sub-block data directly from shared memory, where N is a natural number. The processing of each current-frame macroblock is mapped to one GPU (Graphics Processing Unit) thread group, so the macroblock data can conveniently be loaded into the GPU's on-chip shared memory and the current macroblock data can be shared within the thread group. In this way, the data need not be loaded repeatedly from video memory while the whole search region corresponding to the macroblock is processed, saving precious off-chip memory bandwidth.

Description

Full-search motion estimation method based on a programmable parallel processor
Technical field
The present invention relates to the field of image compression, and in particular to a full-search motion estimation method based on a programmable parallel processor.
Background art
In video coding systems based on motion compensation, motion estimation is the link with the largest amount of computation and the highest memory bandwidth requirement. It reduces the temporal redundancy of the video data by finding, in a reference frame, the optimal matching block for the current macroblock, thereby achieving compression. Because frames are large, search ranges are wide, and multiple reference frames are used, motion search in a high-quality video coding system requires enormous memory bandwidth.
The graphics processing unit (GPU) is an emerging parallel computing device. Compared with the limited parallelism of a common multi-core CPU, its massive data parallelism helps solve the huge computational load of high-quality video encoding algorithms, especially the full-search algorithm for whole-pixel motion estimation. Relative to the CPU, its advantage is theoretical computing capability per unit time; its disadvantage is the bottleneck of off-chip memory bandwidth.
Take whole-pixel motion estimation in H.264 as an example. For each 16×16 macroblock (MB) of the current frame, there is a corresponding motion estimation search range in its reference frame, and each search position corresponds to one motion vector (MV) of the macroblock. Each macroblock can further be divided into sub-blocks of size 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4. Taking the whole-pixel full search of a single 4×4 sub-block as an example, and assuming only one reference frame, a video frame of width width, height height and search range search_range requires N = width/4 × height/4 × search_range² search positions. For 1080P HD video with a typical search range of 32, the full search volume of one frame reaches more than 132,710,400!
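The search-volume arithmetic above can be reproduced with a short sketch (plain Python, not part of the patent; the function name is illustrative):

```python
def full_search_positions(width, height, search_range, block=4):
    """Search positions for whole-pixel full search of 4x4 sub-blocks:
    one position per sub-block per displacement in the search range."""
    return (width // block) * (height // block) * search_range ** 2

# 1080P frame, search range 32, a single reference frame:
n = full_search_positions(1920, 1080, 32)  # 480 * 270 * 1024 = 132,710,400
```

This confirms the figure quoted in the text for a 1920×1080 frame.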
Because the essence of the full-search algorithm is macroblock-based, it is very suitable for parallel implementation on a GPU. However, the algorithm must read the reference frame repeatedly in large volume, and because GPU memory bandwidth is limited, it is difficult for the algorithm to reach high efficiency without optimization.
Summary of the invention
The technical problem to be solved by the present invention is to provide a full-search motion estimation method based on a programmable parallel processor that can reduce the amount of memory-access data of the programmable parallel processor.
The object of the invention is achieved through the following technical solutions:
A full-search motion estimation method based on a programmable parallel processor comprises the steps of:
establishing a thread group in the programmable parallel processor, the thread group comprising threads;
dividing an N×N current-frame macroblock into multiple current-frame sub-blocks, and establishing a one-to-one mapping between each current-frame sub-block and a thread;
each thread loading its corresponding current-frame sub-block into the shared memory of the programmable parallel processor;
where N is a natural number.
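The one-to-one sub-block/thread mapping of the steps above can be sketched as follows (an illustrative stand-in, assuming the preferred embodiment in which each sub-block is a single pixel):

```python
def build_thread_mapping(n):
    """Return {thread_id: (row, col)}: one thread per pixel of an n x n
    current-frame macroblock, establishing the one-to-one mapping."""
    return {row * n + col: (row, col) for row in range(n) for col in range(n)}

mapping = build_thread_mapping(16)  # 256 threads for a 16x16 macroblock
```

Each thread then loads exactly the pixel it is mapped to into shared memory, so no two threads load the same data.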
Further, the full-search motion estimation method also comprises a reference-frame reading step:
each current-frame macroblock has a corresponding co-located macroblock (co-located MB) in the reference frame; an M×M search block is established centered on the co-located macroblock, M being a natural number greater than N;
the search block is divided into multiple search sub-blocks, and a one-to-one mapping is established between each search sub-block and a thread;
each thread loads its corresponding search sub-block into the shared memory of the programmable parallel processor. The threads share the reading of the M×M search block equally; each thread runs independently without interfering with the others, so the speed and effect of reading the data are significantly improved.
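A minimal sketch of this cooperative load (plain Python stands in for the thread group; it assumes M is an exact multiple of N, as in the 48 = 3 × 16 embodiment):

```python
def cooperative_load(search_block, n):
    """Each of the n x n 'threads' copies its own (m/n) x (m/n) tile of the
    m x m search block into the shared array; tiles are disjoint and equal."""
    m = len(search_block)
    tile = m // n  # e.g. 48 // 16 = 3 pixels per thread per dimension
    shared = [[None] * m for _ in range(m)]
    for tid in range(n * n):          # each iteration models one thread
        i, j = tid % n, tid // n
        for dy in range(tile):
            for dx in range(tile):
                x, y = i * tile + dx, j * tile + dy
                shared[y][x] = search_block[y][x]
    return shared
```

Because the tiles are disjoint and cover the block exactly, the threads are "independent of each other" as the text puts it.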
Further, the full-search motion estimation method also comprises a macroblock full-search operation step:
the co-located macroblock has multiple two-dimensional displacements within the search block; each displacement forms a reference macroblock, and the set of all reference macroblocks forms a search range; each thread performs the search operation on at least one reference macroblock and obtains a local optimum result; the set of reference macroblocks handled by one thread forms a search section;
the local optimum results of all threads are gathered, and the optimal search result of the current-frame macroblock within the whole search block is then computed. Once the loading of the current-frame macroblock and its corresponding reference-frame search block data is complete, the search phase begins.
The co-located macroblock is now shifted within the range of the search block; each displacement of the co-located macroblock produces a new macroblock of data (called a reference macroblock). By permutation and combination, each co-located macroblock yields (M−N)×(M−N) different reference macroblocks within the search block. The current-frame macroblock must be compared against each reference macroblock, computing the residual (SAD) value, to finally obtain the optimal cost and motion vector MV. To improve efficiency, the search operations over all reference macroblocks can be shared equally among the threads: each thread reads the whole current-frame macroblock data and compares it one by one against the reference macroblocks it is responsible for, finally obtaining a local optimum result. The threads thus operate concurrently without interfering with each other, which significantly improves efficiency.
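The SAD full search described above, reduced to a sequential sketch (plain Python stands in for the per-thread work; function names are illustrative):

```python
def sad(cur, ref, dy, dx, n):
    """Sum of absolute differences between the n x n current block and the
    reference macroblock at displacement (dy, dx) in the search block."""
    return sum(abs(cur[y][x] - ref[y + dy][x + dx])
               for y in range(n) for x in range(n))

def full_search(cur, search_block, n, m):
    """Best (cost, (dy, dx)) over the (m - n) x (m - n) displacements."""
    best = (float("inf"), None)
    for dy in range(m - n):
        for dx in range(m - n):
            cost = sad(cur, search_block, dy, dx, n)
            if cost < best[0]:
                best = (cost, (dy, dx))
    return best
```

On the GPU, the double loop over displacements is instead partitioned into per-thread search sections, each thread keeping its own running best.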
Further, the programmable parallel processor comprises a first compute kernel and a second compute kernel. The first compute kernel is started in the programmable parallel processor; it comprises multiple threads, each of which reads the current-frame macroblock and search block data from shared memory and computes the local optimum result within its search section.
The local optimum result of each thread is stored to external memory; the second compute kernel is then started in the programmable parallel processor, reads each local optimum result from external memory, and computes the optimal search result of the current-frame macroblock within the whole search block. Registers are visible only to their own thread, while shared memory and external memory can be accessed by all threads; because shared memory does not consume external memory bandwidth, the whole current-frame macroblock is stored into shared memory.
Further, each thread traverses all possible partition modes in the full search, and each computed search result is stored in a corresponding register; the search result comprises a motion vector and a coding cost.
If the newly computed coding cost is less than the coding cost in the register, the register's coding cost is replaced by the currently computed one.
When each thread has completed the search operation of its search section, the motion vector and coding cost held in its registers are its local optimum result. In general terms, the memory-bandwidth hierarchy is: registers are clearly faster than shared memory, and shared memory is much faster than external memory. Each intermediate result of a thread therefore lives in registers and is only written from registers to external memory when the traversal is complete, saving memory bandwidth.
After the first compute kernel has loaded the search block data of the corresponding current-frame macroblock and reference frame into the thread group, each thread first traverses its 2×2 search section (i.e. 4 reference macroblocks) in turn, computing the SAD value of each 4×4 sub-block and writing it into temporary registers without touching the off-chip global memory. Once the 4×4 SAD values are obtained, the SAD values of the other modes — 8×4, 4×8, 16×8, 8×16, 16×16 and so on — are derived by accumulation and likewise kept in temporary registers. As the traversal continues, each mode separately updates its best motion vector (MV) and coding cost (initialized to infinity; whenever the cost computed at the current search position is less than the best cost known so far, it is updated). When the traversal is complete, each thread has thereby obtained the best MV and coding cost within its 2×2 search range, and this data is written to off-chip global memory as intermediate data. This data can be regarded as a two-dimensional array whose dimensions are identical to the original video frame; each element contains the local optimum MV and corresponding search cost of the macroblock at its corresponding current-frame pixel.
The second compute kernel is designed so that each thread is responsible for the further processing of one macroblock: each thread reads in 16×16 elements of the intermediate data, traverses the data in this region, compares the local optimum results one by one, and finally obtains the globally optimal MV of the current macroblock within the whole 32×32 search region and its coding cost. After that, optimal mode selection and refined sub-pixel motion estimation can be carried out.
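The second kernel's job — a minimum over local optima — can be sketched as follows (an assumption-laden stand-in: Python tuples model the (cost, MV) elements of the intermediate array):

```python
def reduce_local_optima(local_results):
    """Model of the second compute kernel: scan the local optima written by
    the first kernel and return the global optimum (cost, mv) for a macroblock."""
    return min(local_results, key=lambda r: r[0])

# Three of the 256 local optima a 16x16 thread group might have written:
best = reduce_local_optima([(120, (3, -1)), (85, (0, 2)), (97, (-4, 4))])
# best == (85, (0, 2))
```

In the real kernel, each thread performs this scan over the 16×16 intermediate elements of its macroblock.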
Further, two adjacent search blocks overlap, forming an overlap region; the non-overlapping portion forms an increment region.
For each row of current-frame macroblocks, each thread group first reads from texture memory the search block data corresponding to the first current-frame macroblock;
for subsequent current-frame macroblocks, only the increment-region data of the search block is read from texture memory; at the same time, the data of this increment region is stored into shared memory as the overlap-region data for the search of the next current-frame macroblock. Suppose the search block (SW) in the reference frame is a rectangle of width S_width and height S_height centered on the corresponding current-frame position, the number of macroblocks per row is MB_H, and the number of macroblocks per column is MB_V. Adjacent search blocks overlap and are no longer independent: the overlapping search region of two adjacent search blocks has size S_height × (S_width − MB_width), where MB_width is the width of the current macroblock.
The off-chip memory bandwidth required by the prior art to load the search window data is:
BW_normal = MB_H × (S_height × S_width) × MB_V
The memory bandwidth after applying the data-reuse optimization of the present technical solution is:
BW_proposed = (S_height × S_width + (MB_H − 1) × S_height × MB_width) × MB_V
Taking a macroblock size of 16×16, VGA resolution (i.e. MB_H = 40, MB_V = 30) and a 48×48 search block as an example, this optimization saves nearly a factor of three in memory bandwidth:
BW_proposed / BW_normal = (48×48 + 39×48×16) / (40×48×48) = 1/2.86
It is not hard to see that for high-resolution video frames, as the number of macroblocks in the horizontal direction increases, this optimization ratio becomes even higher. In today's world, where HD video increasingly dominates, the value of this optimization is self-evident.
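The two bandwidth formulas and the quoted ratio can be checked directly (a sketch under the stated VGA assumptions; function names are illustrative):

```python
def bw_normal(mb_h, mb_v, s_w, s_h):
    """Prior art: every macroblock reloads its full search window."""
    return mb_h * (s_h * s_w) * mb_v

def bw_proposed(mb_h, mb_v, s_w, s_h, mb_width):
    """Data reuse: one full window per row, then only the increment region."""
    return (s_h * s_w + (mb_h - 1) * s_h * mb_width) * mb_v

# 16x16 macroblocks, VGA (MB_H = 40, MB_V = 30), 48x48 search blocks:
ratio = bw_proposed(40, 30, 48, 48, 16) / bw_normal(40, 30, 48, 48)
# ratio == 0.35, i.e. 1/2.86 of the prior-art bandwidth
```

The 0.35 ratio matches the 1/2.86 figure in the text.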
Further, the programmable parallel processor comprises a first compute kernel and a second compute kernel; the first compute kernel is started in the programmable parallel processor and runs the thread groups corresponding to one row of macroblocks simultaneously.
Further, the current frame and the reference frame are first pre-stored in texture memory; the threads read the current-frame and reference-frame data from texture memory and then store it into the shared memory of the programmable parallel processor. When the current-frame macroblock is searched, the macroblock data corresponding to the reference frame is read directly from shared memory. Unlike the common one-dimensional global-memory access pattern, texture memory on a programmable parallel processor (the following takes a GPU as an example) is optimized for local access in two-dimensional space, so when adjacent GPU threads access nearby regions of two-dimensional space, memory-access efficiency is higher. Exploiting this feature, the present technical solution first loads the current-frame macroblock and the reference frame as texture memory instead of keeping them off-chip in ordinary global-memory form; as long as the data required by the current thread is in the on-chip texture cache, precious memory bandwidth is saved. In addition, because texture memory has a built-in border-handling mechanism, no computing resources need be spent on special border processing, which simplifies the algorithm implementation.
Physically, texture memory is in fact external memory: for a GPU, it is the video memory chip on the graphics card, separate from the GPU chip and connected to it by a chip-level data bus, whose communication speed is slower than the GPU compute units' direct communication with on-chip shared storage. What distinguishes it from ordinary external-memory access is that when data residing in external memory is accessed via the texture-memory path, the GPU compute cores can use the texture cache inside the processor chip when reading it, producing an optimized (2D spatially coherent) memory-access effect.
Further, the programmable parallel processor is a GPU.
Further, the number of threads in each thread group equals the number of pixels of the corresponding current-frame macroblock, and each current-frame sub-block represents the data of one pixel.
Research shows that the prior art reloads the current-frame macroblock data every time the SAD value of the current-frame macroblock and a co-located macroblock is computed, which wastes memory bandwidth. The present invention maps the processing of each current-frame macroblock to one GPU thread group, so the macroblock's data can very conveniently be loaded into the GPU's on-chip shared memory, sharing the current macroblock data within the thread group. In this way, throughout the processing of the whole search region corresponding to this macroblock, the data need not be loaded from video memory again, saving precious off-chip memory bandwidth. In addition, the threads share the reading of the current-frame macroblock data equally; each thread runs independently without interfering with the others, so the speed and effect of reading the data are significantly improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of the full-search motion estimation method based on a programmable parallel processor according to the present invention;
Fig. 2 is a schematic diagram of the overlap of the search blocks of two adjacent search positions in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the data-reuse principle for reference-frame search blocks in an embodiment of the present invention;
Fig. 4 is a schematic diagram of the intermediate-data compression method of an embodiment of the present invention.
Embodiment
A full-search motion estimation method based on a programmable parallel processor comprises the steps of:
establishing a thread group in the programmable parallel processor, the thread group comprising threads;
dividing an N×N current-frame macroblock into multiple current-frame sub-blocks, and establishing a one-to-one mapping between each current-frame sub-block and a thread;
each thread loading its corresponding current-frame sub-block into the shared memory of the programmable parallel processor;
where N is a natural number.
Research shows that the prior art reloads the current-frame macroblock data every time the SAD value of the current-frame macroblock and a co-located macroblock is computed, which wastes memory bandwidth. The present invention maps the processing of each current-frame macroblock to one GPU thread group, so the macroblock's data can very conveniently be loaded into the GPU's on-chip shared memory, sharing the current macroblock data within the thread group. In this way, throughout the processing of the whole search region corresponding to this macroblock, the data need not be loaded from video memory again, saving precious off-chip memory bandwidth. In addition, the threads share the reading of the current-frame macroblock data equally; each thread runs independently without interfering with the others, so the speed and effect of reading the data are significantly improved.
In the following, the invention is further described with preferred embodiments in conjunction with the accompanying drawings, taking a GPU as the programmable parallel processor and letting each current-frame sub-block represent the data of one pixel.
A full-search motion estimation method based on a programmable parallel processor comprises the steps of:
establishing a thread group in the GPU, the thread group comprising threads;
dividing each current-frame macroblock into multiple current-frame sub-blocks, and establishing a one-to-one mapping between each current-frame sub-block and a thread; the number of threads in each thread group equals the pixel count of the corresponding current-frame macroblock, and each current-frame sub-block represents the data of one pixel;
each current-frame macroblock has a corresponding co-located macroblock in the reference frame; an M×M search block is established centered on the co-located macroblock, M being a natural number greater than N; the search block is divided into multiple search sub-blocks, and a one-to-one mapping is established between each search sub-block and a thread;
each thread loads its current-frame sub-block and corresponding search sub-block into the shared memory of the programmable parallel processor;
when the current-frame macroblock is searched, the threads call the current-frame macroblock and search block data directly from shared memory;
the current frame and the reference frame are first pre-stored in texture memory; the threads read the current-frame and reference-frame data from texture memory and then store it into the shared memory of the programmable parallel processor.
Consider a video frame of width width, height height and search range search_range: the number of current-frame macroblocks is width/16 × height/16, and each current-frame macroblock must be searched search_range² times, so the memory-access amount is width/16 × height/16 × search_range² × 16 × 16. For 1080P HD video, a single reference frame, and a typical search range of 32, the memory-access amount of one frame reaches more than 2,123,366,400 — truly enormous. As shown in Fig. 2, the search blocks corresponding to two horizontally adjacent search positions of the current macroblock in the same reference frame, SA1 and SA2, overlap significantly (in fact only a small part is independent); if two parallel threads on the GPU process them separately, data loading is obviously highly repetitive. Therefore, the present embodiment first loads the reference frame as texture memory instead of keeping it off-chip in ordinary global-memory form; texture memory has a built-in border-handling mechanism, so no computing resources need be spent on special border processing, simplifying the algorithm implementation. Finally, the reference-frame data is read from texture memory into the shared memory of the programmable parallel processor; when the current frame is compared with the reference frame, reference-frame data is uniformly read from shared memory without frequently accessing the off-chip texture memory, saving precious memory bandwidth.
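The prior-art memory-access figure above can be checked with a back-of-envelope sketch (an illustration, not patent text; note that the 16×16 factors cancel against the macroblock counts):

```python
def naive_access(width, height, search_range):
    """Pixels loaded per frame when the 16x16 current macroblock is reloaded
    for every search position: (w/16)(h/16) * sr^2 * 16 * 16 = w * h * sr^2."""
    return width * height * search_range ** 2

amount = naive_access(1920, 1080, 32)  # 1080P, search range 32
```

This reproduces the 2,123,366,400 figure quoted for one 1080P frame.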
Physically, texture memory is in fact external memory: for a GPU, it is the video memory chip on the graphics card, separate from the GPU chip and connected to it by a chip-level data bus, whose communication speed is slower than the GPU compute units' direct communication with on-chip shared storage. What distinguishes it from ordinary external-memory access is that when data residing in external memory is accessed via the texture-memory path, the GPU compute cores can use the texture cache inside the processor chip when reading it, producing an optimized (2D spatially coherent) memory-access effect.
Further, the present invention can also be used to solve the problem of repeated reading of the reference frame.
Since the pixel region that must be loaded from the reference frame for the search is larger than the co-located macroblock, the search blocks overlap within the reference frame: two adjacent search blocks overlap, forming an overlap region, while the non-overlapping portion forms an increment region. The current-frame macroblock and the co-located macroblock are equal in size, so the search block also contains more pixels than the current-frame macroblock, and each thread must read multiple pixels.
For each row of current-frame macroblocks, each thread group first reads from texture memory the search block data corresponding to the first current-frame macroblock;
for subsequent current-frame macroblocks, only the increment-region data of the search block is read from texture memory; at the same time, the data of this increment region is stored into shared memory as the overlap-region data for the search of the next current-frame macroblock. The first compute kernel is started in the programmable parallel processor and runs the thread groups corresponding to one row of macroblocks simultaneously.
Each GPU thread reads the pixels of its own rectangular region. As shown in Fig. 3, with a current macroblock size of 16×16 and a search region of 32×32, the reference-frame search block SW is 48×48 (SW_width × SW_height), so each thread must read a rectangular region of 3×3 pixels.
Let (x, y) be the world coordinate in the current frame of the macroblock corresponding to a given GPU thread (its top-left corner), and (i, j) the thread's local coordinate within the macroblock; the pixels this thread must read from texture memory are:
P{(x − 16 + 3i + m, y − 16 + 3j + n), m = 0, 1, 2; n = 0, 1, 2}
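The coordinate formula above can be sanity-checked in a sketch (an assumption: (x, y) is the macroblock's top-left corner, so the 256 threads' 3×3 tiles must exactly cover the 48×48 search block, offsets −16 through +31):

```python
def thread_tile(x, y, i, j):
    """The 3x3 set of texture pixels read by thread (i, j) of the macroblock
    whose top-left world coordinate is (x, y), per the formula above."""
    return {(x - 16 + 3 * i + m, y - 16 + 3 * j + n)
            for m in range(3) for n in range(3)}

def covered(x, y):
    """Union of all 16 x 16 threads' tiles."""
    pixels = set()
    for i in range(16):
        for j in range(16):
            pixels |= thread_tile(x, y, i, j)
    return pixels

region = covered(0, 0)  # 256 threads x 9 pixels = 2304 = 48 x 48 distinct pixels
```

The tiles are disjoint and tile the 48×48 search block exactly, consistent with the per-thread 3×3 read described in the text.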
Suppose the search block (SW) in the reference frame is a rectangle of width S_width and height S_height centered on the corresponding current-frame position, the number of macroblocks per row is MB_H, and the number of macroblocks per column is MB_V. Adjacent search blocks overlap and are no longer independent: the overlapping search region of two adjacent search blocks has size S_height × (S_width − MB_width), where MB_width is the width of the current macroblock.
The off-chip memory bandwidth required by the prior art to load the search block data is:
BW_normal = MB_H × (S_height × S_width) × MB_V
The memory bandwidth after applying the data-reuse optimization of the present technical solution is:
BW_proposed = (S_height × S_width + (MB_H − 1) × S_height × MB_width) × MB_V
Taking a macroblock size of 16×16, VGA resolution (i.e. MB_H = 40, MB_V = 30) and a 48×48 search block as an example, this optimization saves nearly a factor of three in memory bandwidth:
BW_proposed / BW_normal = (48×48 + 39×48×16) / (40×48×48) = 1/2.86
It is not hard to see that for high-resolution video frames, as the number of macroblocks in the horizontal direction increases, this optimization ratio becomes even higher. In today's world, where HD video increasingly dominates, the value of this optimization is self-evident.
The present invention further solves the problem of repeated reading of intermediate data.
Step 1: compute the local optimum results.
The current-frame macroblock has a corresponding co-located macroblock in its reference frame; centered on the co-located macroblock, a search block is established whose range is larger than and fully covers the co-located macroblock;
the co-located macroblock has multiple two-dimensional displacements within the search block; each displacement forms a reference macroblock, and the set of all reference macroblocks forms a search range;
the first compute kernel is started in the programmable parallel processor; it comprises multiple threads, and each thread reads the corresponding current-frame macroblock and search block data from the shared memory;
each thread performs the search operation on at least one reference macroblock; the set of reference macroblocks handled by one thread forms a search section; each computed search result is stored in a corresponding register, the search result comprising a motion vector and a coding cost;
if the newly computed coding cost is less than the coding cost in the register, the register's coding cost is replaced by the currently computed one; when each thread has completed the search operation of its search section, the motion vector and coding cost held in its registers are its local optimum result;
the local optimum result is stored in the registers corresponding to the thread;
when the first compute kernel has completed the search of the whole search block, the local optimum result of each thread is stored to external memory.
If the co-located macroblock corresponding to the current-frame macroblock is 16×16 and the search range is 32×32, the search block is 32+16 = 48 (×48) in size; in the 16×16 thread group, each thread on average handles a 1×1 current-frame sub-block and a 3×3 search sub-block, and its search section is 2×2 (4 reference macroblocks).
Step 2: compute the optimal search result of the whole current-frame macroblock.
The second compute kernel reads all the local optimum results from external memory and computes the optimal search result of the whole current-frame macroblock.
In general terms, the memory-bandwidth hierarchy is: registers are clearly faster than shared memory, and shared memory is much faster than external memory. Each intermediate result of a thread therefore lives in registers and is only written from registers to external memory when the traversal is complete, saving memory bandwidth.
Registers are visible only to their own thread, while shared memory and external memory can be accessed by all threads; because shared memory does not consume external memory bandwidth, the whole current-frame macroblock is stored into shared memory.
Referring to Fig. 4, once the first compute kernel has loaded the current-frame macroblock and its corresponding search block into the thread group, each thread first traverses its 2×2 search region in turn, computing the SAD value of each 4×4 sub-block and writing it into temporary registers without touching the off-chip global memory.
Once the 4×4 SAD values are obtained, the SAD values of the other modes — 8×4, 4×8, 16×8, 8×16, 16×16 and so on — are derived by accumulation and likewise kept in temporary registers.
As the traversal continues, each mode separately updates its best motion vector (MV) and coding cost (initialized to infinity; whenever the cost computed at the current search position is less than the best cost known so far, it is updated). When the traversal is complete, each thread has thereby obtained the best MV and coding cost within its 2×2 search range, and this data is written to off-chip global memory as intermediate data. This data can be regarded as a two-dimensional array whose dimensions are identical to the original video frame; each element contains the local optimum MV and corresponding search cost of the macroblock at its corresponding current-frame pixel.
The second compute kernel is designed so that each thread is responsible for the further processing of one macroblock: each thread reads in 16×16 elements of the intermediate data, traverses the data in this region, compares the local optimum results one by one, and finally obtains the globally optimal MV of the current macroblock within the whole 32×32 search range and its coding cost. After that, optimal mode selection and refined sub-pixel motion estimation can be carried out.
The present invention is not limited to the GPU; other programmable parallel processors may also be employed.
The above content is a further detailed description of the present invention in conjunction with specific preferred embodiments, and it cannot be held that the specific implementation of the invention is limited to these descriptions. For those of ordinary skill in the technical field of the invention, a number of simple deductions or substitutions may also be made without departing from the inventive concept, and all of these shall be deemed to fall within the protection scope of the present invention.

Claims (10)

1. A full-search motion estimation method based on a programmable parallel processor, characterized in that it comprises the steps of:
establishing a thread group in the programmable parallel processor, the thread group comprising threads;
dividing an N×N current-frame macroblock into multiple current-frame sub-blocks, and establishing a one-to-one mapping between each current-frame sub-block and a thread;
each thread loading its corresponding current-frame sub-block into the shared memory of the programmable parallel processor;
wherein N is a natural number.
2. The full-search motion estimation method based on a programmable parallel processor according to claim 1, characterized in that the method further comprises a reference-frame reading step:
each current-frame macroblock has a corresponding co-located macroblock in the reference frame, and an M×M search block is established centered on the co-located macroblock, M being a natural number greater than N;
the search block is divided into multiple search sub-blocks, and a one-to-one mapping is established between the search sub-blocks and the threads;
each thread loads its corresponding search sub-block into the shared memory of the programmable parallel processor.
3. The full-search motion estimation method based on a programmable parallel processor according to claim 2, characterized in that the method further comprises a full-search computation step for the macroblock:
the co-located macroblock has multiple two-dimensional displacements within the search block, each displacement forming a reference macroblock, and the set of all reference macroblocks forming a search range; each thread performs the search computation on at least one reference macroblock to obtain a local optimum result, the set of its corresponding reference macroblocks forming a search section;
the local optimum results of all threads are gathered, and the optimal search result of the current-frame macroblock within the whole search block is then computed.
4. The full-search motion estimation method based on a programmable parallel processor according to claim 3, characterized in that:
the programmable parallel processor comprises a first compute kernel and a second compute kernel; the first compute kernel is launched in the programmable parallel processor and comprises multiple threads, each thread reading the current-frame macroblock and the search block data from shared memory and computing said local optimum result within its search section;
the local optimum result of each thread is stored to external memory; the second compute kernel is then launched in the programmable parallel processor, reads each local optimum result from external memory, and computes the optimal search result of the current-frame macroblock within the whole search block.
5. The full-search motion estimation method based on a programmable parallel processor according to claim 4, characterized in that:
each thread traverses all possible modes during the full search, and each computed search result is stored in a corresponding register, the search result comprising a motion vector and a coding cost;
if a newly computed coding cost is lower than the coding cost held in the register, the register's coding cost is replaced with the newly computed one;
when each thread has completed the search computation of its search section, the motion vector and coding cost held in the registers constitute the local optimum result.
6. The full-search motion estimation method based on a programmable parallel processor according to claim 2, characterized in that two adjacent search blocks overlap, the overlapping portions forming an overlap region and the non-overlapping portions forming an increment region;
for each row of current-frame macroblocks, each thread group first reads from texture memory the search block data corresponding to the first current-frame macroblock;
for each subsequent current-frame macroblock, only the increment-region data of its search block are read from texture memory, and the data of this increment region are simultaneously stored into shared memory to serve as the overlap-region data for the search of the next current-frame macroblock.
7. The full-search motion estimation method based on a programmable parallel processor according to claim 6, characterized in that:
the programmable parallel processor comprises a first compute kernel and a second compute kernel, the first compute kernel being launched in the programmable parallel processor and simultaneously running the thread groups corresponding to a row of macroblocks.
8. The full-search motion estimation method based on a programmable parallel processor according to claim 2, characterized in that the current frame and the reference frame are first pre-stored in texture memory; the threads read the current-frame and reference-frame data from texture memory and then store them into the shared memory of the programmable parallel processor.
9. The full-search motion estimation method based on a programmable parallel processor according to claim 1, characterized in that the programmable parallel processor is a GPU.
10. The full-search motion estimation method based on a programmable parallel processor according to claim 1, characterized in that the number of threads in each thread group equals the number of pixels in the corresponding current-frame macroblock, and each current-frame sub-block represents the data of one pixel.
CN201410056657.XA 2014-02-19 2014-02-19 Full-search motion estimation method based on programmable parallel processor Expired - Fee Related CN103873874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410056657.XA CN103873874B (en) 2014-02-19 2014-02-19 Full-search motion estimation method based on programmable parallel processor


Publications (2)

Publication Number Publication Date
CN103873874A true CN103873874A (en) 2014-06-18
CN103873874B CN103873874B (en) 2017-06-06

Family

ID=50911948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410056657.XA Expired - Fee Related CN103873874B (en) 2014-02-19 2014-02-19 Full-search motion estimation method based on programmable parallel processor

Country Status (1)

Country Link
CN (1) CN103873874B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537657A (en) * 2014-12-23 2015-04-22 西安交通大学 Laser speckle image depth perception method implemented through parallel search GPU acceleration
CN105516726A (en) * 2015-11-27 2016-04-20 传线网络科技(上海)有限公司 Motion compensation matching method and system of video coding
CN105847810A (en) * 2016-01-29 2016-08-10 西安邮电大学 High efficiency video coding adder tree parallel implementation method
CN108012156A (en) * 2017-11-17 2018-05-08 深圳市华尊科技股份有限公司 A kind of method for processing video frequency and control platform
CN105578189B (en) * 2015-12-27 2018-05-25 西安邮电大学 Efficient video coding add tree Parallel Implementation method based on asymmetric partition mode
CN108259912A (en) * 2018-03-28 2018-07-06 天津大学 A kind of Parallel Implementation method of point of pixel motion estimation
CN108495138A (en) * 2018-03-28 2018-09-04 天津大学 A kind of integer pixel motion estimation method based on GPU
CN110072107A (en) * 2019-04-25 2019-07-30 南京理工大学 A kind of haze video-frequency compression method shared based on estimation
CN110895549A (en) * 2019-09-04 2020-03-20 成都四方伟业软件股份有限公司 Quantized data retrieval method and system
WO2020199050A1 (en) * 2019-03-29 2020-10-08 深圳市大疆创新科技有限公司 Video encoding method and device, and movable platform
CN114390117A (en) * 2021-12-01 2022-04-22 中电科思仪科技股份有限公司 High-speed continuous data stream storage processing device and method based on FPGA

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1568014A (en) * 2003-06-30 2005-01-19 杭州高特信息技术有限公司 Quick movement prediction method and structure thereof
CN1852442A (en) * 2005-08-19 2006-10-25 深圳市海思半导体有限公司 Layering motion estimation method and super farge scale integrated circuit
CN101212682A (en) * 2007-12-22 2008-07-02 深圳市同洲电子股份有限公司 Data loading device and method for motion search area
CN101370143A (en) * 2008-09-17 2009-02-18 华为技术有限公司 Image motion estimating method and apparatus
CN102197652A (en) * 2009-10-19 2011-09-21 松下电器产业株式会社 Decoding apparatus, decoding method, program and integrated circuit



Also Published As

Publication number Publication date
CN103873874B (en) 2017-06-06

Similar Documents

Publication Publication Date Title
CN103873874A (en) Full search motion estimation method based on programmable parallel processor
TWI656508B (en) Block operation for image processor with two-dimensional array of arrays and two-dimensional displacement register
TWI586149B (en) Video encoder, method and computing device for processing video frames in a block processing pipeline
CN107438860B (en) Architecture for high performance power efficient programmable image processing
US9292899B2 (en) Reference frame data prefetching in block processing pipelines
CN107533750B (en) Virtual image processor, and method and system for processing image data thereon
US9224186B2 (en) Memory latency tolerance in block processing pipelines
KR101971657B1 (en) Energy-efficient processor core architecture for image processors
Du et al. An accelerator for high efficient vision processing
KR102253027B1 (en) Statistical operation on a two-dimensional image processor
TWI533209B (en) Parallel hardware and software block processing pipelines
US20160021385A1 (en) Motion estimation in block processing pipelines
US20150092843A1 (en) Data storage and access in block processing pipelines
US10706006B2 (en) Image processor I/O unit
CN109710558A (en) SLAM arithmetic unit and method
TW201901483A (en) Circuit for performing absolute value of two input values and summing operation
Park et al. Programmable multimedia platform based on reconfigurable processor for 8K UHD TV
CN101951521A (en) Video image motion estimation method for extent variable block
CN101729903A (en) Method, system and multimedia processor for reading reference frame data
Li et al. A cache-based bandwidth optimized motion compensation architecture for video decoder
Ren et al. A parallel streaming motion estimation for real-time HD H. 264 encoding on programmable processors
Liu et al. A high parallel motion compensation implementation on a coarse-grained reconfigurable processor supporting H. 264 high profile decoding
Massanes et al. Cuda implementation of a block-matching algorithm for multiple gpu cards
Apewokin et al. Cat-tail DMA: Efficient image data transport for multicore embedded mobile systems
Yuejian et al. Heterogeneous associative cache for multimedia applications

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170606

Termination date: 20210219

CF01 Termination of patent right due to non-payment of annual fee