CN103873874A - Full search motion estimation method based on programmable parallel processor - Google Patents


Info

Publication number
CN103873874A
Authority
CN
China
Prior art date
Legal status
Granted
Application number
CN201410056657.XA
Other languages
Chinese (zh)
Other versions
CN103873874B (en)
Inventor
隆刚
金明
史方
Current Assignee
Tong Wei Technology (shenzhen) Co Ltd
Original Assignee
Tong Wei Technology (shenzhen) Co Ltd
Priority date
Filing date
Publication date
Application filed by Tong Wei Technology (shenzhen) Co Ltd filed Critical Tong Wei Technology (shenzhen) Co Ltd
Priority to CN201410056657.XA
Publication of CN103873874A
Application granted
Publication of CN103873874B
Status: Expired - Fee Related


Abstract

The invention discloses a full-search motion estimation method based on a programmable parallel processor. The method comprises the following steps: a thread group comprising threads is established in the programmable parallel processor; an N×N current-frame macroblock is divided into multiple current-frame sub-blocks, and a one-to-one correspondence is established between each current-frame sub-block and a thread; the current-frame macroblock is loaded into the shared memory of the programmable parallel processor; and when the current-frame macroblock is searched, the threads call the current-frame sub-block data directly from shared memory, where N is a natural number. The processing of each current-frame macroblock is mapped to one GPU (Graphics Processing Unit) thread group, so the macroblock data can conveniently be loaded into the GPU's on-chip shared memory and the current macroblock data can be shared within the thread group. In this way, the data need not be loaded repeatedly from video memory while the whole search region corresponding to the macroblock is processed, saving precious off-chip memory bandwidth.

Description

Full-search motion estimation method based on a programmable parallel processor
Technical field
The present invention relates to the field of image compression, and in particular to a full-search motion estimation method based on a programmable parallel processor.
Background art
In video coding systems based on motion compensation, motion estimation is the link with the largest amount of computation and the highest memory bandwidth requirement. It reduces the temporal redundancy of the video data by finding, in a reference frame, the optimal matching block for the current macroblock, thereby achieving compression. Because frames are large, search ranges are wide, and multiple reference frames are used, motion search in a high-quality video coding system requires enormous memory bandwidth.
The graphics processing unit (GPU) is an emerging parallel computing device. Compared with the limited parallelism of a common multi-core CPU, its massive data parallelism helps solve the huge computational load of high-quality video encoding algorithms, especially the full-search algorithm for whole-pixel motion estimation. Relative to the CPU, its advantage is theoretical computing capability per unit time; its disadvantage is the bottleneck of off-chip memory bandwidth.
Take whole-pixel motion estimation in H.264 as an example. For each 16×16 macroblock (MB) of the current frame, there is a corresponding motion estimation search range in its reference frame, and each search position corresponds to one motion vector (MV) of the macroblock. Each macroblock can further be divided into sub-blocks of size 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4. Taking the whole-pixel full search of a single 4×4 sub-block as an example, and assuming only one reference frame, a video frame of width width, height height and search range search_range requires N = width/4 × height/4 × search_range² search positions. For 1080P HD video with a typical search range of 32, the full search volume of one frame reaches more than 132,710,400!
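The search-volume arithmetic above can be reproduced with a short sketch (plain Python, not part of the patent; the function name is illustrative):

```python
def full_search_positions(width, height, search_range, block=4):
    """Search positions for whole-pixel full search of 4x4 sub-blocks:
    one position per sub-block per displacement in the search range."""
    return (width // block) * (height // block) * search_range ** 2

# 1080P frame, search range 32, a single reference frame:
n = full_search_positions(1920, 1080, 32)  # 480 * 270 * 1024 = 132,710,400
```

This confirms the figure quoted in the text for a 1920×1080 frame.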
Because the essence of the full-search algorithm is macroblock-based, it is very suitable for parallel implementation on a GPU. However, the algorithm must read the reference frame repeatedly in large volume, and because GPU memory bandwidth is limited, it is difficult for the algorithm to reach high efficiency without optimization.
Summary of the invention
The technical problem to be solved by the present invention is to provide a full-search motion estimation method based on a programmable parallel processor that can reduce the amount of memory-access data of the programmable parallel processor.
The object of the invention is achieved through the following technical solutions:
A full-search motion estimation method based on a programmable parallel processor comprises the steps of:
establishing a thread group in the programmable parallel processor, the thread group comprising threads;
dividing an N×N current-frame macroblock into multiple current-frame sub-blocks, and establishing a one-to-one mapping between each current-frame sub-block and a thread;
each thread loading its corresponding current-frame sub-block into the shared memory of the programmable parallel processor;
where N is a natural number.
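The one-to-one sub-block/thread mapping of the steps above can be sketched as follows (an illustrative stand-in, assuming the preferred embodiment in which each sub-block is a single pixel):

```python
def build_thread_mapping(n):
    """Return {thread_id: (row, col)}: one thread per pixel of an n x n
    current-frame macroblock, establishing the one-to-one mapping."""
    return {row * n + col: (row, col) for row in range(n) for col in range(n)}

mapping = build_thread_mapping(16)  # 256 threads for a 16x16 macroblock
```

Each thread then loads exactly the pixel it is mapped to into shared memory, so no two threads load the same data.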
Further, the full-search motion estimation method also comprises a reference-frame reading step:
each current-frame macroblock has a corresponding co-located macroblock (co-located MB) in the reference frame; an M×M search block is established centered on the co-located macroblock, M being a natural number greater than N;
the search block is divided into multiple search sub-blocks, and a one-to-one mapping is established between each search sub-block and a thread;
each thread loads its corresponding search sub-block into the shared memory of the programmable parallel processor. The threads share the reading of the M×M search block equally; each thread runs independently without interfering with the others, so the speed and effect of reading the data are significantly improved.
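A minimal sketch of this cooperative load (plain Python stands in for the thread group; it assumes M is an exact multiple of N, as in the 48 = 3 × 16 embodiment):

```python
def cooperative_load(search_block, n):
    """Each of the n x n 'threads' copies its own (m/n) x (m/n) tile of the
    m x m search block into the shared array; tiles are disjoint and equal."""
    m = len(search_block)
    tile = m // n  # e.g. 48 // 16 = 3 pixels per thread per dimension
    shared = [[None] * m for _ in range(m)]
    for tid in range(n * n):          # each iteration models one thread
        i, j = tid % n, tid // n
        for dy in range(tile):
            for dx in range(tile):
                x, y = i * tile + dx, j * tile + dy
                shared[y][x] = search_block[y][x]
    return shared
```

Because the tiles are disjoint and cover the block exactly, the threads are "independent of each other" as the text puts it.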
Further, the full-search motion estimation method also comprises a macroblock full-search operation step:
the co-located macroblock has multiple two-dimensional displacements within the search block; each displacement forms a reference macroblock, and the set of all reference macroblocks forms a search range; each thread performs the search operation on at least one reference macroblock and obtains a local optimum result; the set of reference macroblocks handled by one thread forms a search section;
the local optimum results of all threads are gathered, and the optimal search result of the current-frame macroblock within the whole search block is then computed. Once the loading of the current-frame macroblock and its corresponding reference-frame search block data is complete, the search phase begins.
The co-located macroblock is now shifted within the range of the search block; each displacement of the co-located macroblock produces a new macroblock of data (called a reference macroblock). By permutation and combination, each co-located macroblock yields (M−N)×(M−N) different reference macroblocks within the search block. The current-frame macroblock must be compared against each reference macroblock, computing the residual (SAD) value, to finally obtain the optimal cost and motion vector MV. To improve efficiency, the search operations over all reference macroblocks can be shared equally among the threads: each thread reads the whole current-frame macroblock data and compares it one by one against the reference macroblocks it is responsible for, finally obtaining a local optimum result. The threads thus operate concurrently without interfering with each other, which significantly improves efficiency.
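The SAD full search described above, reduced to a sequential sketch (plain Python stands in for the per-thread work; function names are illustrative):

```python
def sad(cur, ref, dy, dx, n):
    """Sum of absolute differences between the n x n current block and the
    reference macroblock at displacement (dy, dx) in the search block."""
    return sum(abs(cur[y][x] - ref[y + dy][x + dx])
               for y in range(n) for x in range(n))

def full_search(cur, search_block, n, m):
    """Best (cost, (dy, dx)) over the (m - n) x (m - n) displacements."""
    best = (float("inf"), None)
    for dy in range(m - n):
        for dx in range(m - n):
            cost = sad(cur, search_block, dy, dx, n)
            if cost < best[0]:
                best = (cost, (dy, dx))
    return best
```

On the GPU, the double loop over displacements is instead partitioned into per-thread search sections, each thread keeping its own running best.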
Further, the programmable parallel processor comprises a first compute kernel and a second compute kernel. The first compute kernel is started in the programmable parallel processor; it comprises multiple threads, each of which reads the current-frame macroblock and search block data from shared memory and computes the local optimum result within its search section.
The local optimum result of each thread is stored to external memory; the second compute kernel is then started in the programmable parallel processor, reads each local optimum result from external memory, and computes the optimal search result of the current-frame macroblock within the whole search block. Registers are visible only to their own thread, while shared memory and external memory can be accessed by all threads; because shared memory does not consume external memory bandwidth, the whole current-frame macroblock is stored into shared memory.
Further, each thread traverses all possible partition modes in the full search, and each computed search result is stored in a corresponding register; the search result comprises a motion vector and a coding cost.
If the newly computed coding cost is less than the coding cost in the register, the register's coding cost is replaced by the currently computed one.
When each thread has completed the search operation of its search section, the motion vector and coding cost held in its registers are its local optimum result. In general terms, the memory-bandwidth hierarchy is: registers are clearly faster than shared memory, and shared memory is much faster than external memory. Each intermediate result of a thread therefore lives in registers and is only written from registers to external memory when the traversal is complete, saving memory bandwidth.
After the first compute kernel has loaded the search block data of the corresponding current-frame macroblock and reference frame into the thread group, each thread first traverses its 2×2 search section (i.e. 4 reference macroblocks) in turn, computing the SAD value of each 4×4 sub-block and writing it into temporary registers without touching the off-chip global memory. Once the 4×4 SAD values are obtained, the SAD values of the other modes — 8×4, 4×8, 16×8, 8×16, 16×16 and so on — are derived by accumulation and likewise kept in temporary registers. As the traversal continues, each mode separately updates its best motion vector (MV) and coding cost (initialized to infinity; whenever the cost computed at the current search position is less than the best cost known so far, it is updated). When the traversal is complete, each thread has thereby obtained the best MV and coding cost within its 2×2 search range, and this data is written to off-chip global memory as intermediate data. This data can be regarded as a two-dimensional array whose dimensions are identical to the original video frame; each element contains the local optimum MV and corresponding search cost of the macroblock at its corresponding current-frame pixel.
The second compute kernel is designed so that each thread is responsible for the further processing of one macroblock: each thread reads in 16×16 elements of the intermediate data, traverses the data in this region, compares the local optimum results one by one, and finally obtains the globally optimal MV of the current macroblock within the whole 32×32 search region and its coding cost. After that, optimal mode selection and refined sub-pixel motion estimation can be carried out.
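The second kernel's job — a minimum over local optima — can be sketched as follows (an assumption-laden stand-in: Python tuples model the (cost, MV) elements of the intermediate array):

```python
def reduce_local_optima(local_results):
    """Model of the second compute kernel: scan the local optima written by
    the first kernel and return the global optimum (cost, mv) for a macroblock."""
    return min(local_results, key=lambda r: r[0])

# Three of the 256 local optima a 16x16 thread group might have written:
best = reduce_local_optima([(120, (3, -1)), (85, (0, 2)), (97, (-4, 4))])
# best == (85, (0, 2))
```

In the real kernel, each thread performs this scan over the 16×16 intermediate elements of its macroblock.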
Further, two adjacent search blocks overlap, forming an overlap region; the non-overlapping portion forms an increment region.
For each row of current-frame macroblocks, each thread group first reads from texture memory the search block data corresponding to the first current-frame macroblock;
for subsequent current-frame macroblocks, only the increment-region data of the search block is read from texture memory; at the same time, the data of this increment region is stored into shared memory as the overlap-region data for the search of the next current-frame macroblock. Suppose the search block (SW) in the reference frame is a rectangle of width S_width and height S_height centered on the corresponding current-frame position, the number of macroblocks per row is MB_H, and the number of macroblocks per column is MB_V. Adjacent search blocks overlap and are no longer independent: the overlapping search region of two adjacent search blocks has size S_height × (S_width − MB_width), where MB_width is the width of the current macroblock.
The off-chip memory bandwidth required by the prior art to load the search window data is:
BW_normal = MB_H × (S_height × S_width) × MB_V
The memory bandwidth after applying the data-reuse optimization of the present technical solution is:
BW_proposed = (S_height × S_width + (MB_H − 1) × S_height × MB_width) × MB_V
Taking a macroblock size of 16×16, VGA resolution (i.e. MB_H = 40, MB_V = 30) and a 48×48 search block as an example, this optimization saves nearly a factor of three in memory bandwidth:
BW_proposed / BW_normal = (48×48 + 39×48×16) / (40×48×48) = 1/2.86
It is not hard to see that for high-resolution video frames, as the number of macroblocks in the horizontal direction increases, this optimization ratio becomes even higher. In today's world, where HD video increasingly dominates, the value of this optimization is self-evident.
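The two bandwidth formulas and the quoted ratio can be checked directly (a sketch under the stated VGA assumptions; function names are illustrative):

```python
def bw_normal(mb_h, mb_v, s_w, s_h):
    """Prior art: every macroblock reloads its full search window."""
    return mb_h * (s_h * s_w) * mb_v

def bw_proposed(mb_h, mb_v, s_w, s_h, mb_width):
    """Data reuse: one full window per row, then only the increment region."""
    return (s_h * s_w + (mb_h - 1) * s_h * mb_width) * mb_v

# 16x16 macroblocks, VGA (MB_H = 40, MB_V = 30), 48x48 search blocks:
ratio = bw_proposed(40, 30, 48, 48, 16) / bw_normal(40, 30, 48, 48)
# ratio == 0.35, i.e. 1/2.86 of the prior-art bandwidth
```

The 0.35 ratio matches the 1/2.86 figure in the text.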
Further, the programmable parallel processor comprises a first compute kernel and a second compute kernel; the first compute kernel is started in the programmable parallel processor and runs the thread groups corresponding to one row of macroblocks simultaneously.
Further, the current frame and the reference frame are first pre-stored in texture memory; the threads read the current-frame and reference-frame data from texture memory and then store it into the shared memory of the programmable parallel processor. When the current-frame macroblock is searched, the macroblock data corresponding to the reference frame is read directly from shared memory. Unlike the common one-dimensional global-memory access pattern, texture memory on a programmable parallel processor (the following takes a GPU as an example) is optimized for local access in two-dimensional space, so when adjacent GPU threads access nearby regions of two-dimensional space, memory-access efficiency is higher. Exploiting this feature, the present technical solution first loads the current-frame macroblock and the reference frame as texture memory instead of keeping them off-chip in ordinary global-memory form; as long as the data required by the current thread is in the on-chip texture cache, precious memory bandwidth is saved. In addition, because texture memory has a built-in border-handling mechanism, no computing resources need be spent on special border processing, which simplifies the algorithm implementation.
Physically, texture memory is in fact external memory: for a GPU, it is the video memory chip on the graphics card, separate from the GPU chip and connected to it by a chip-level data bus, whose communication speed is slower than the GPU compute units' direct communication with on-chip shared storage. What distinguishes it from ordinary external-memory access is that when data residing in external memory is accessed via the texture-memory path, the GPU compute cores can use the texture cache inside the processor chip when reading it, producing an optimized (2D spatially coherent) memory-access effect.
Further, the programmable parallel processor is a GPU.
Further, the number of threads in each thread group equals the number of pixels of the corresponding current-frame macroblock, and each current-frame sub-block represents the data of one pixel.
Research shows that the prior art reloads the current-frame macroblock data every time the SAD value of the current-frame macroblock and a co-located macroblock is computed, which wastes memory bandwidth. The present invention maps the processing of each current-frame macroblock to one GPU thread group, so the macroblock's data can very conveniently be loaded into the GPU's on-chip shared memory, sharing the current macroblock data within the thread group. In this way, throughout the processing of the whole search region corresponding to this macroblock, the data need not be loaded from video memory again, saving precious off-chip memory bandwidth. In addition, the threads share the reading of the current-frame macroblock data equally; each thread runs independently without interfering with the others, so the speed and effect of reading the data are significantly improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of the full-search motion estimation method based on a programmable parallel processor according to the present invention;
Fig. 2 is a schematic diagram of the overlap of the search blocks of two adjacent search positions in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the data-reuse principle for reference-frame search blocks in an embodiment of the present invention;
Fig. 4 is a schematic diagram of the intermediate-data compression method of an embodiment of the present invention.
Embodiment
A full-search motion estimation method based on a programmable parallel processor comprises the steps of:
establishing a thread group in the programmable parallel processor, the thread group comprising threads;
dividing an N×N current-frame macroblock into multiple current-frame sub-blocks, and establishing a one-to-one mapping between each current-frame sub-block and a thread;
each thread loading its corresponding current-frame sub-block into the shared memory of the programmable parallel processor;
where N is a natural number.
Research shows that the prior art reloads the current-frame macroblock data every time the SAD value of the current-frame macroblock and a co-located macroblock is computed, which wastes memory bandwidth. The present invention maps the processing of each current-frame macroblock to one GPU thread group, so the macroblock's data can very conveniently be loaded into the GPU's on-chip shared memory, sharing the current macroblock data within the thread group. In this way, throughout the processing of the whole search region corresponding to this macroblock, the data need not be loaded from video memory again, saving precious off-chip memory bandwidth. In addition, the threads share the reading of the current-frame macroblock data equally; each thread runs independently without interfering with the others, so the speed and effect of reading the data are significantly improved.
In the following, the invention is further described with preferred embodiments in conjunction with the accompanying drawings, taking a GPU as the programmable parallel processor and letting each current-frame sub-block represent the data of one pixel.
A full-search motion estimation method based on a programmable parallel processor comprises the steps of:
establishing a thread group in the GPU, the thread group comprising threads;
dividing each current-frame macroblock into multiple current-frame sub-blocks, and establishing a one-to-one mapping between each current-frame sub-block and a thread; the number of threads in each thread group equals the pixel count of the corresponding current-frame macroblock, and each current-frame sub-block represents the data of one pixel;
each current-frame macroblock has a corresponding co-located macroblock in the reference frame; an M×M search block is established centered on the co-located macroblock, M being a natural number greater than N; the search block is divided into multiple search sub-blocks, and a one-to-one mapping is established between each search sub-block and a thread;
each thread loads its current-frame sub-block and corresponding search sub-block into the shared memory of the programmable parallel processor;
when the current-frame macroblock is searched, the threads call the current-frame macroblock and search block data directly from shared memory;
the current frame and the reference frame are first pre-stored in texture memory; the threads read the current-frame and reference-frame data from texture memory and then store it into the shared memory of the programmable parallel processor.
Consider a video frame of width width, height height and search range search_range: the number of current-frame macroblocks is width/16 × height/16, and each current-frame macroblock must be searched search_range² times, so the memory-access amount is width/16 × height/16 × search_range² × 16 × 16. For 1080P HD video, a single reference frame, and a typical search range of 32, the memory-access amount of one frame reaches more than 2,123,366,400 — truly enormous. As shown in Fig. 2, the search blocks corresponding to two horizontally adjacent search positions of the current macroblock in the same reference frame, SA1 and SA2, overlap significantly (in fact only a small part is independent); if two parallel threads on the GPU process them separately, data loading is obviously highly repetitive. Therefore, the present embodiment first loads the reference frame as texture memory instead of keeping it off-chip in ordinary global-memory form; texture memory has a built-in border-handling mechanism, so no computing resources need be spent on special border processing, simplifying the algorithm implementation. Finally, the reference-frame data is read from texture memory into the shared memory of the programmable parallel processor; when the current frame is compared with the reference frame, reference-frame data is uniformly read from shared memory without frequently accessing the off-chip texture memory, saving precious memory bandwidth.
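The prior-art memory-access figure above can be checked with a back-of-envelope sketch (an illustration, not patent text; note that the 16×16 factors cancel against the macroblock counts):

```python
def naive_access(width, height, search_range):
    """Pixels loaded per frame when the 16x16 current macroblock is reloaded
    for every search position: (w/16)(h/16) * sr^2 * 16 * 16 = w * h * sr^2."""
    return width * height * search_range ** 2

amount = naive_access(1920, 1080, 32)  # 1080P, search range 32
```

This reproduces the 2,123,366,400 figure quoted for one 1080P frame.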
Physically, texture memory is in fact external memory: for a GPU, it is the video memory chip on the graphics card, separate from the GPU chip and connected to it by a chip-level data bus, whose communication speed is slower than the GPU compute units' direct communication with on-chip shared storage. What distinguishes it from ordinary external-memory access is that when data residing in external memory is accessed via the texture-memory path, the GPU compute cores can use the texture cache inside the processor chip when reading it, producing an optimized (2D spatially coherent) memory-access effect.
Further, the present invention can also be used to solve the problem of repeated reading of the reference frame.
Since the pixel region that must be loaded from the reference frame for the search is larger than the co-located macroblock, the search blocks overlap within the reference frame: two adjacent search blocks overlap, forming an overlap region, while the non-overlapping portion forms an increment region. The current-frame macroblock and the co-located macroblock are equal in size, so the search block also contains more pixels than the current-frame macroblock, and each thread must read multiple pixels.
For each row of current-frame macroblocks, each thread group first reads from texture memory the search block data corresponding to the first current-frame macroblock;
for subsequent current-frame macroblocks, only the increment-region data of the search block is read from texture memory; at the same time, the data of this increment region is stored into shared memory as the overlap-region data for the search of the next current-frame macroblock. The first compute kernel is started in the programmable parallel processor and runs the thread groups corresponding to one row of macroblocks simultaneously.
Each GPU thread reads the pixels of its own rectangular region. As shown in Fig. 3, with a current macroblock size of 16×16 and a search region of 32×32, the reference-frame search block SW is 48×48 (SW_width × SW_height), so each thread must read a rectangular region of 3×3 pixels.
Let (x, y) be the world coordinate in the current frame of the macroblock corresponding to a given GPU thread (its top-left corner), and (i, j) the thread's local coordinate within the macroblock; the pixels this thread must read from texture memory are:
P{(x − 16 + 3i + m, y − 16 + 3j + n), m = 0, 1, 2; n = 0, 1, 2}
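The coordinate formula above can be sanity-checked in a sketch (an assumption: (x, y) is the macroblock's top-left corner, so the 256 threads' 3×3 tiles must exactly cover the 48×48 search block, offsets −16 through +31):

```python
def thread_tile(x, y, i, j):
    """The 3x3 set of texture pixels read by thread (i, j) of the macroblock
    whose top-left world coordinate is (x, y), per the formula above."""
    return {(x - 16 + 3 * i + m, y - 16 + 3 * j + n)
            for m in range(3) for n in range(3)}

def covered(x, y):
    """Union of all 16 x 16 threads' tiles."""
    pixels = set()
    for i in range(16):
        for j in range(16):
            pixels |= thread_tile(x, y, i, j)
    return pixels

region = covered(0, 0)  # 256 threads x 9 pixels = 2304 = 48 x 48 distinct pixels
```

The tiles are disjoint and tile the 48×48 search block exactly, consistent with the per-thread 3×3 read described in the text.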
Suppose the search block (SW) in the reference frame is a rectangle of width S_width and height S_height centered on the corresponding current-frame position, the number of macroblocks per row is MB_H, and the number of macroblocks per column is MB_V. Adjacent search blocks overlap and are no longer independent: the overlapping search region of two adjacent search blocks has size S_height × (S_width − MB_width), where MB_width is the width of the current macroblock.
The off-chip memory bandwidth required by the prior art to load the search block data is:
BW_normal = MB_H × (S_height × S_width) × MB_V
The memory bandwidth after applying the data-reuse optimization of the present technical solution is:
BW_proposed = (S_height × S_width + (MB_H − 1) × S_height × MB_width) × MB_V
Taking a macroblock size of 16×16, VGA resolution (i.e. MB_H = 40, MB_V = 30) and a 48×48 search block as an example, this optimization saves nearly a factor of three in memory bandwidth:
BW_proposed / BW_normal = (48×48 + 39×48×16) / (40×48×48) = 1/2.86
It is not hard to see that for high-resolution video frames, as the number of macroblocks in the horizontal direction increases, this optimization ratio becomes even higher. In today's world, where HD video increasingly dominates, the value of this optimization is self-evident.
The present invention further solves the problem of repeated reading of intermediate data.
Step 1: compute the local optimum results.
The current-frame macroblock has a corresponding co-located macroblock in its reference frame; centered on the co-located macroblock, a search block is established whose range is larger than and fully covers the co-located macroblock;
the co-located macroblock has multiple two-dimensional displacements within the search block; each displacement forms a reference macroblock, and the set of all reference macroblocks forms a search range;
the first compute kernel is started in the programmable parallel processor; it comprises multiple threads, and each thread reads the corresponding current-frame macroblock and search block data from the shared memory;
each thread performs the search operation on at least one reference macroblock; the set of reference macroblocks handled by one thread forms a search section; each computed search result is stored in a corresponding register, the search result comprising a motion vector and a coding cost;
if the newly computed coding cost is less than the coding cost in the register, the register's coding cost is replaced by the currently computed one; when each thread has completed the search operation of its search section, the motion vector and coding cost held in its registers are its local optimum result;
the local optimum result is stored in the registers corresponding to the thread;
when the first compute kernel has completed the search of the whole search block, the local optimum result of each thread is stored to external memory.
If the co-located macroblock corresponding to the current-frame macroblock is 16×16 and the search range is 32×32, the search block is 32+16 = 48 (×48) in size; in the 16×16 thread group, each thread on average handles a 1×1 current-frame sub-block and a 3×3 search sub-block, and its search section is 2×2 (4 reference macroblocks).
Step 2: compute the optimal search result of the whole current-frame macroblock.
The second compute kernel reads all the local optimum results from external memory and computes the optimal search result of the whole current-frame macroblock.
In general terms, the memory-bandwidth hierarchy is: registers are clearly faster than shared memory, and shared memory is much faster than external memory. Each intermediate result of a thread therefore lives in registers and is only written from registers to external memory when the traversal is complete, saving memory bandwidth.
Registers are visible only to their own thread, while shared memory and external memory can be accessed by all threads; because shared memory does not consume external memory bandwidth, the whole current-frame macroblock is stored into shared memory.
Referring to Fig. 4, once the first compute kernel has loaded the current-frame macroblock and its corresponding search block into the thread group, each thread first traverses its 2×2 search region in turn, computing the SAD value of each 4×4 sub-block and writing it into temporary registers without touching the off-chip global memory.
Once the 4×4 SAD values are obtained, the SAD values of the other modes — 8×4, 4×8, 16×8, 8×16, 16×16 and so on — are derived by accumulation and likewise kept in temporary registers.
As the traversal continues, each mode separately updates its best motion vector (MV) and coding cost (initialized to infinity; whenever the cost computed at the current search position is less than the best cost known so far, it is updated). When the traversal is complete, each thread has thereby obtained the best MV and coding cost within its 2×2 search range, and this data is written to off-chip global memory as intermediate data. This data can be regarded as a two-dimensional array whose dimensions are identical to the original video frame; each element contains the local optimum MV and corresponding search cost of the macroblock at its corresponding current-frame pixel.
The second compute kernel is designed so that each thread is responsible for the further processing of one macroblock: each thread reads in 16×16 elements of the intermediate data, traverses the data in this region, compares the local optimum results one by one, and finally obtains the globally optimal MV of the current macroblock within the whole 32×32 search range and its coding cost. After that, optimal mode selection and refined sub-pixel motion estimation can be carried out.
The present invention is not limited to the GPU; other programmable parallel processors may also be employed.
The above content is a further detailed description of the present invention in conjunction with specific preferred embodiments, and it cannot be held that the specific implementation of the invention is limited to these descriptions. For those of ordinary skill in the technical field of the invention, a number of simple deductions or substitutions may also be made without departing from the inventive concept, and all of these shall be deemed to fall within the protection scope of the present invention.

Claims (10)

1. A full-search motion estimation method based on a programmable parallel processor, characterized in that it comprises the steps of:
establishing a thread group in the programmable parallel processor, the thread group comprising threads;
dividing an N×N current-frame macroblock into multiple current-frame sub-blocks, and establishing a one-to-one mapping between each current-frame sub-block and a thread;
each thread loading its corresponding current-frame sub-block into the shared memory of the programmable parallel processor;
wherein N is a natural number.
2. The full-search motion estimation method based on a programmable parallel processor according to claim 1, characterized in that the method further comprises a reference-frame reading step:
each current-frame macroblock has a corresponding co-located macroblock in the reference frame, and an M×M search block is established centered on the co-located macroblock, M being a natural number greater than N;
the search block is divided into multiple search sub-blocks, and a one-to-one mapping is established between the search sub-blocks and the threads;
each thread loads its corresponding search sub-block into the shared memory of the programmable parallel processor.
3. The full-search motion estimation method based on a programmable parallel processor according to claim 2, characterized in that the method further comprises a full-search computation step for the macroblock:
the co-located macroblock has multiple two-dimensional displacements within the search block, each displacement forming a reference macroblock, and the set of all reference macroblocks forming a search range; each thread performs the search computation on at least one reference macroblock to obtain a local optimum result, the set of its corresponding reference macroblocks forming a search section;
the local optimum results of all threads are gathered, and the optimal search result of the current-frame macroblock within the whole search block is then computed.
4. The full-search motion estimation method based on a programmable parallel processor according to claim 3, characterized in that:
the programmable parallel processor comprises a first compute kernel and a second compute kernel; the first compute kernel is launched in the programmable parallel processor and comprises multiple threads, each thread reading the current-frame macroblock and the search block data from shared memory and computing said local optimum result within its search section;
the local optimum result of each thread is stored to external memory; the second compute kernel is then launched in the programmable parallel processor, reads each local optimum result from external memory, and computes the optimal search result of the current-frame macroblock within the whole search block.
5. The full-search motion estimation method based on a programmable parallel processor according to claim 4, characterized in that:
each thread traverses all possible modes during the full search, and each computed search result is stored in a corresponding register, the search result comprising a motion vector and a coding cost;
if a newly computed coding cost is lower than the coding cost held in the register, the register's coding cost is replaced with the newly computed one;
when each thread has completed the search computation of its search section, the motion vector and coding cost held in the registers constitute the local optimum result.
6. The full-search motion estimation method based on a programmable parallel processor according to claim 2, characterized in that two adjacent search blocks overlap, the overlapping portions forming an overlap region and the non-overlapping portions forming an increment region;
for each row of current-frame macroblocks, each thread group first reads from texture memory the search block data corresponding to the first current-frame macroblock;
for each subsequent current-frame macroblock, only the increment-region data of its search block are read from texture memory, and the data of this increment region are simultaneously stored into shared memory to serve as the overlap-region data for the search of the next current-frame macroblock.
7. The full-search motion estimation method based on a programmable parallel processor according to claim 6, characterized in that:
the programmable parallel processor comprises a first compute kernel and a second compute kernel, the first compute kernel being launched in the programmable parallel processor and simultaneously running the thread groups corresponding to a row of macroblocks.
8. The full-search motion estimation method based on a programmable parallel processor according to claim 2, characterized in that the current frame and the reference frame are first pre-stored in texture memory; the threads read the current-frame and reference-frame data from texture memory and then store them into the shared memory of the programmable parallel processor.
9. The full-search motion estimation method based on a programmable parallel processor according to claim 1, characterized in that the programmable parallel processor is a GPU.
10. The full-search motion estimation method based on a programmable parallel processor according to claim 1, characterized in that the number of threads in each thread group equals the number of pixels in the corresponding current-frame macroblock, and each current-frame sub-block represents the data of one pixel.
CN201410056657.XA 2014-02-19 2014-02-19 Full-search motion estimation method based on programmable parallel processor Expired - Fee Related CN103873874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410056657.XA CN103873874B (en) 2014-02-19 2014-02-19 Full-search motion estimation method based on programmable parallel processor


Publications (2)

Publication Number Publication Date
CN103873874A true CN103873874A (en) 2014-06-18
CN103873874B CN103873874B (en) 2017-06-06

Family

ID=50911948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410056657.XA Expired - Fee Related CN103873874B (en) 2014-02-19 2014-02-19 Full-search motion estimation method based on programmable parallel processor

Country Status (1)

Country Link
CN (1) CN103873874B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537657A (en) * 2014-12-23 2015-04-22 西安交通大学 Laser speckle image depth perception method implemented through parallel search GPU acceleration
CN105516726A (en) * 2015-11-27 2016-04-20 传线网络科技(上海)有限公司 Motion compensation matching method and system of video coding
CN105847810A (en) * 2016-01-29 2016-08-10 西安邮电大学 High efficiency video coding adder tree parallel implementation method
CN108012156A (en) * 2017-11-17 2018-05-08 深圳市华尊科技股份有限公司 A kind of method for processing video frequency and control platform
CN105578189B (en) * 2015-12-27 2018-05-25 西安邮电大学 Efficient video coding add tree Parallel Implementation method based on asymmetric partition mode
CN108259912A (en) * 2018-03-28 2018-07-06 天津大学 A kind of Parallel Implementation method of point of pixel motion estimation
CN108495138A (en) * 2018-03-28 2018-09-04 天津大学 A kind of integer pixel motion estimation method based on GPU
CN110072107A (en) * 2019-04-25 2019-07-30 南京理工大学 A kind of haze video-frequency compression method shared based on estimation
CN110895549A (en) * 2019-09-04 2020-03-20 成都四方伟业软件股份有限公司 Quantized data retrieval method and system
WO2020199050A1 (en) * 2019-03-29 2020-10-08 深圳市大疆创新科技有限公司 Video encoding method and device, and movable platform
CN114390117A (en) * 2021-12-01 2022-04-22 中电科思仪科技股份有限公司 High-speed continuous data stream storage processing device and method based on FPGA

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1568014A (en) * 2003-06-30 2005-01-19 杭州高特信息技术有限公司 Quick movement prediction method and structure thereof
CN1852442A (en) * 2005-08-19 2006-10-25 深圳市海思半导体有限公司 Layering motion estimation method and super farge scale integrated circuit
CN101212682A (en) * 2007-12-22 2008-07-02 深圳市同洲电子股份有限公司 Data loading device and method for motion search area
CN101370143A (en) * 2008-09-17 2009-02-18 华为技术有限公司 Image motion estimating method and apparatus
CN102197652A (en) * 2009-10-19 2011-09-21 松下电器产业株式会社 Decoding apparatus, decoding method, program and integrated circuit



Also Published As

Publication number Publication date
CN103873874B (en) 2017-06-06

Similar Documents

Publication Publication Date Title
CN103873874A (en) Full search motion estimation method based on programmable parallel processor
TWI656508B (en) Block operation for image processor with two-dimensional array of arrays and two-dimensional displacement register
TWI586149B (en) Video encoder, method and computing device for processing video frames in a block processing pipeline
CN107438860B (en) Architecture for high performance power efficient programmable image processing
US9292899B2 (en) Reference frame data prefetching in block processing pipelines
CN107533750B (en) Virtual image processor, and method and system for processing image data thereon
US9224186B2 (en) Memory latency tolerance in block processing pipelines
KR101971657B1 (en) Energy-efficient processor core architecture for image processors
Du et al. An accelerator for high efficient vision processing
KR102253027B1 (en) Statistical operation on a two-dimensional image processor
TWI533209B (en) Parallel hardware and software block processing pipelines
US20160021385A1 (en) Motion estimation in block processing pipelines
US20150092843A1 (en) Data storage and access in block processing pipelines
US10706006B2 (en) Image processor I/O unit
CN109710558A (en) SLAM arithmetic unit and method
TW201901483A (en) Circuit for performing absolute value of two input values and summing operation
Park et al. Programmable multimedia platform based on reconfigurable processor for 8K UHD TV
CN101951521A (en) Video image motion estimation method for extent variable block
CN101729903A (en) Method, system and multimedia processor for reading reference frame data
Li et al. A cache-based bandwidth optimized motion compensation architecture for video decoder
Ren et al. A parallel streaming motion estimation for real-time HD H. 264 encoding on programmable processors
Liu et al. A high parallel motion compensation implementation on a coarse-grained reconfigurable processor supporting H. 264 high profile decoding
Massanes et al. Cuda implementation of a block-matching algorithm for multiple gpu cards
Apewokin et al. Cat-tail DMA: Efficient image data transport for multicore embedded mobile systems
Yuejian et al. Heterogeneous associative cache for multimedia applications

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170606

Termination date: 20210219

CF01 Termination of patent right due to non-payment of annual fee