CN102647588B - GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation - Google Patents


Info

Publication number: CN102647588B
Application number: CN201110040025.0A
Authority: CN (China)
Prior art keywords: search, thread, GPU, image layer, coordinate
Legal status: Active
Other versions: CN102647588A (Chinese)
Inventors: 王振宇, 王荣刚, 董胜富, 高文
Original and current assignee: Peking University Shenzhen Graduate School
Priority/filing date: 2011-02-17
Publication of application: CN102647588A
Publication of grant: CN102647588B


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a hierarchical-search motion estimation method accelerated by the parallel computing capability of a GPU (Graphics Processing Unit). The method comprises the following steps: generating the images of the different image layers used by the hierarchical search algorithm; adaptively allocating threads; computing the SAD of each search image block at each search point in parallel; and finding the smallest SAD in parallel in cooperation with the CPU (Central Processing Unit). The adaptive thread allocation scheme of the invention accommodates different image resolutions and search-region sizes; image downsampling is performed in parallel on the GPU, which both improves acceleration and reduces data transfer between the CPU and the GPU; and the GPU/CPU cooperative minimum-SAD lookup effectively prevents GPU threads from idling.

Description

A GPU acceleration method for hierarchical-search motion estimation
Technical field
The present invention relates to motion estimation methods in the field of video processing, and specifically to a hierarchical-search motion estimation acceleration method in which a GPU assists the CPU.
Technical background
Motion estimation is an important technique in video processing, playing a key role in applications such as hybrid video coding, block-matching-based motion detection, and object tracking. For each block, a motion estimation algorithm computes the SAD (sum of absolute differences) at all or some of the candidate search points within a search range, and obtains the optimal or near-optimal motion vector by finding the minimum SAD. The full-search motion estimation algorithm computes the SAD at every candidate point in the search range and thus obtains the best motion vector, but at a very high computational cost. Many fast search algorithms have therefore appeared, such as three-step search, hierarchical search, and hexagon search. Among them, the hierarchical-search motion estimation algorithm is highly regular and easy to accelerate in hardware.
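As a rough illustration of the SAD criterion and the full search it motivates (a plain sequential Python sketch, not the patented GPU method; the names `sad` and `full_search` are our own):

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def full_search(cur_block, ref_frame, top, left, r):
    """Evaluate the SAD at every offset within +/- r of (left, top) in the
    reference frame; return (min_sad, (dx, dy))."""
    n = len(cur_block)
    h, w = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= h - n and 0 <= x <= w - n:
                cand = [row[x:x + n] for row in ref_frame[y:y + n]]
                s = sad(cur_block, cand)
                if best is None or s < best[0]:
                    best = (s, (dx, dy))
    return best
```

The quadratic number of candidate offsets is exactly the cost that the hierarchical search below reduces.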
The basic hierarchical-search motion estimation algorithm divides the current frame and the reference frame into several image layers. The original image is the lowest layer, and each higher image layer is obtained by downsampling the adjacent layer below it. During hierarchical-search motion estimation, a full-search-based estimation is first performed on the top image layer within a specified search region centered on a point o; from the SAD values of all search points in that region, an optimal match point a is obtained on that layer. The next layer down is then searched within a specified region centered on a, yielding a new optimal match point b. This proceeds layer by layer until the lowest layer, i.e. the original image layer, produces the final optimal match point c (see Fig. 1), which gives the motion vector of the hierarchical search. With only one image layer, the algorithm reduces to full search, so full-search motion estimation can be regarded as a special case of hierarchical search. The quality of hierarchical-search motion estimation depends on the number of image layers and the downsampling rate of each layer. Typically the original image is divided into 2-3 image layers, each obtained by 1/2 downsampling of the layer below.
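The layer construction described above can be sketched in plain Python (names are illustrative; the averaging filter and integer rounding are one selectable choice, as the embodiment later notes):

```python
def downsample(img):
    """Produce the next image layer: halve width and height by averaging
    each 2x2 neighborhood."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*y][2*x] + img[2*y][2*x+1] +
              img[2*y+1][2*x] + img[2*y+1][2*x+1]) // 4
             for x in range(w)]
            for y in range(h)]

def build_layers(original, L):
    """Layer 1 is the original image; layers 2..L are successive 1/2
    downsamples. Index 0 of the returned list holds layer 1."""
    layers = [original]
    for _ in range(L - 1):
        layers.append(downsample(layers[-1]))
    return layers
```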
Compared with a CPU, a GPU has many computing units and high memory bandwidth, making it well suited to accelerating hierarchical-search motion estimation. The emergence of general-purpose GPUs (GPGPU) in particular has made developing high-performance algorithms on the GPU relatively easy. A general-purpose GPU consists of multiple multiprocessors; each multiprocessor has several arithmetic units, and each arithmetic unit has its own registers. All arithmetic units in one multiprocessor execute the same instruction in each clock cycle, enabling large-scale data-parallel processing. On a general-purpose GPU, all concurrent threads are organized into thread blocks; a thread block runs on only one multiprocessor, while several thread blocks may share one multiprocessor. Threads within a thread block can exchange data through the shared memory of the multiprocessor, whereas threads in different concurrently executing thread blocks can neither communicate nor synchronize. Furthermore, the threads in a thread block are subdivided into warps, and the threads in one warp always execute the same instruction (see Fig. 2). Because resources such as registers and arithmetic units on each multiprocessor are limited, the number of threads a thread block can hold is limited, and the warp size is an integer multiple of the number of processing units in a multiprocessor. When only some threads in a warp need to execute an instruction, all of its threads still execute it, and the results of the threads that did not need it are discarded. Thread idling caused by branching should therefore be avoided as far as possible.
A search of the prior art shows that existing methods using GPUs to accelerate motion estimation only accelerate full search or global elimination search. Moreover, because they are constrained by the number of threads a thread block can hold, these methods are severely limited in search range and cannot meet practical needs.
Summary of the invention
The object of the present invention is to provide an acceleration method for hierarchical-search motion estimation based on GPU/CPU cooperative processing, which makes full use of the computing resources of the GPU, offloads part of the CPU's computation burden, and significantly improves the speed of hierarchical-search motion estimation.
To achieve the above object, the hierarchical-search motion estimation method based on GPU/CPU cooperative processing proposed by the present invention comprises the following steps:
Step 1: Transfer the current frame and the reference frame to GPU video memory, and divide them into L layers, layer 1 being the original current and reference images. Use the GPU to generate the remaining L-1 higher image layers and keep them in video memory; each layer is a 1/2-downsampled image of the layer below in both width and height.
When the GPU downsamples image layer K to generate image layer K+1, layer K is divided into downsampled image blocks of size 2D x 2D. One thread block is assigned to each such block, and all thread blocks execute concurrently in arbitrary order. A two-dimensional thread group of D x D is created in each thread block. The thread at coordinate (x, y) in each group generates one downsampled point from the four points at coordinates (2x, 2y), (2x+1, 2y), (2x, 2y+1), and (2x+1, 2y+1) within its downsampled image block. The downsampled points computed by all threads of one group form the D x D image block at the corresponding position in layer K+1.
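The block/thread decomposition of step 1 can be modeled sequentially; each (block, thread) pair below stands for one GPU thread, and the 2x2 integer mean is the embodiment's (assumed) filter:

```python
D = 16  # each thread block turns a 2D x 2D source tile into a D x D output tile

def thread_downsample(src, bx, by, tx, ty):
    """Value computed by the thread at (tx, ty) of thread block (bx, by):
    the mean of the 2x2 source neighborhood at (2*(bx*D+tx), 2*(by*D+ty))."""
    ox, oy = bx * D + tx, by * D + ty   # output pixel this thread writes
    sx, sy = 2 * ox, 2 * oy             # top-left of its 2x2 source patch
    return (src[sy][sx] + src[sy][sx + 1] +
            src[sy + 1][sx] + src[sy + 1][sx + 1]) // 4
```

Because every output pixel is owned by exactly one thread, the thread blocks can run in any order, which matches the "concurrent, out-of-order" execution the method relies on.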
Starting from image layer L (the top image layer), steps 2 to 6 are performed layer by layer until image layer 1 (the original image layer) has been processed.
Step 2: Partition image layer K into blocks and allocate threads. Denote the size of layer K as M x N. Layer K is divided into (M/n) x (N/n) search image blocks of the specified block size n x n, and one thread block is assigned to the motion-vector search of each search image block. All thread blocks execute concurrently in arbitrary order.
The thread allocation for the search points of the search region is then set according to the search region configured on layer K and the GPU's computing capability. Denote the search-region width as H, its height as V, and the maximum number of threads per thread block on the GPU as T. If V x H <= T, a two-dimensional thread group of H x V is created in the thread block; the thread at coordinate (x, y) in the group computes the SAD at the search point with coordinate (x, y) in the search region. This is denoted thread allocation scheme A. If V x H > T, a one-dimensional thread group of size H is created in the thread block; the thread with index x computes the SAD at every search point with coordinate (x, k) in the search region, where 0 <= k < V. This is denoted thread allocation scheme B.
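The scheme selection reduces to one comparison; a minimal sketch (our own helper name):

```python
def choose_scheme(H, V, T):
    """Pick the step-2 thread allocation: scheme A (one thread per search
    point, 2-D group of H x V) when the whole region fits in one thread
    block, otherwise scheme B (1-D group of H threads, one per column
    of V search points)."""
    if V * H <= T:
        return ('A', H * V)
    return ('B', H)
```

With the embodiment's GPU (T = 512), a 16 x 16 region selects scheme A and a 64 x 64 region selects scheme B, exactly as in the later steps.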
Step 3: The thread block described in step 2 reads from GPU video memory the search result of its search image block on image layer K+1 and computes the block's search center on layer K. If K = L, the current layer is the top image layer, and the search center is taken as the position of the first pixel in the upper-left corner of the search image block. If K < L, denote the coordinate of the center pixel of the search image block on layer K as (x, y); the corresponding coordinate of this pixel on layer K+1 is (x/2, y/2). With the search image block size on layer K+1 denoted m x m, this pixel falls inside some search image block of layer K+1. The motion vector of that layer-K+1 block (obtained in step 6 of layer K+1) multiplied by 2 is taken as the search center of the search image block on layer K.
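Step 3's center computation can be sketched as follows (assuming, consistently with the embodiment, that layer-K coordinates halve on layer K+1; `motion_vectors` is a hypothetical lookup of the layer-K+1 results):

```python
def refine_center(cx, cy, m, motion_vectors):
    """Search center on layer K for the block whose center pixel is
    (cx, cy): locate the corresponding pixel on layer K+1, find the
    m x m search block enclosing it, and double that block's motion
    vector (the result of the layer-K+1 search)."""
    px, py = cx // 2, cy // 2            # corresponding pixel on layer K+1
    bx, by = px // m, py // m            # index of the enclosing search block
    mvx, mvy = motion_vectors[(bx, by)]  # layer-K+1 search result
    return (2 * mvx, 2 * mvy)
```

The embodiment's worked example (block center (68, 68), layer-3 block size 4, layer-3 motion vector (5, -4)) yields the center (10, -8) it reports.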
Step 4: Each thread described in step 2 computes the SAD of its assigned search points and stores the results in a linear array in shared memory. Under thread allocation scheme A, each thread directly stores its SAD value and thread coordinate to shared memory. Under thread allocation scheme B, each thread uses a two-stage search. In the first stage, the thread with index x starts from the point with in-region coordinate (x, V/2) and searches upward until the point (x, 0), where V is the search-region height, recording the minimum SAD value and the ordinate of the corresponding search point; the minimum SAD and ordinate of the first stage are denoted SAD1 and y1. In the second stage, symmetrically, the thread starts from the point (x, V/2+1) and searches downward until the point (x, V-1); the minimum SAD and ordinate of the second stage are denoted SAD2 and y2. Finally SAD1 and SAD2 are compared, and the smaller SAD value and its coordinate are written to shared memory. If the two SAD values are equal, the coordinate of the search point closer to the search center is written.
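One scheme-B thread can be modeled like this (`sad_at` is a hypothetical callback returning the SAD at a search point); note how scanning outward from the center row with a strict '<' makes each stage keep, on a tie, the point it visited first, i.e. the one nearest the center:

```python
def two_stage_column_search(sad_at, x, V):
    """Scheme-B thread for column x: stage 1 scans upward from the middle
    row, stage 2 scans downward from just below it; return (sad, y)."""
    def scan(ys):
        best = None
        for y in ys:
            s = sad_at(x, y)
            if best is None or s < best[0]:
                best = (s, y)
        return best
    s1 = scan(range(V // 2, -1, -1))   # stage 1: (x, V/2) up to (x, 0)
    s2 = scan(range(V // 2 + 1, V))    # stage 2: (x, V/2+1) down to (x, V-1)
    if s1[0] != s2[0]:
        return s1 if s1[0] < s2[0] else s2
    c = V // 2                         # tie: prefer the ordinate nearer center
    return s1 if abs(s1[1] - c) <= abs(s2[1] - c) else s2
```

This positional tie-breaking is what lets the method avoid comparing motion-vector magnitudes on the GPU, as advantage 5 below explains.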
Step 5: Find the minimum SAD of the search image block over all search points in the search region, together with its coordinate. When step 4 completes, each thread block described in step 2 holds TN linearly arranged SAD values and their coordinates (TN being the number of threads in the thread block). The minimum SAD and its coordinate are now found by a reduction. First, threads 0 through TN/2-1 compare the TN SAD values pairwise: thread x compares SAD values x and x+TN/2 and writes the smaller value and its coordinate to position x. If the two SAD values are equal, the search-point coordinate closer to the search center is written to position x. This leaves TN/2 SAD values and coordinates. Threads 0 through TN/4-1 then compare those TN/2 values, and so on, until the number of remaining SAD values does not exceed the warp size of the GPU.
Because all threads in one warp of a thread block always execute the same instruction, thread idling occurs once the number of remaining SAD values falls below twice the warp thread count (denoted WN). When the number of remaining SAD values is no greater than WN, the GPU finishes its work and passes the remaining SAD values back to the CPU, which traverses them to find the minimum SAD and its coordinate. That coordinate is the motion vector of the current search image block.
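Steps 5 and 6 together can be modeled sequentially (the inner loop stands for the threads of one block running in parallel; the tie-breaking by distance to center is omitted here for brevity):

```python
def min_sad_reduce(entries, warp_size):
    """Halve the candidate list each round: thread x compares entries
    x and x + n/2 and keeps the smaller. Once no more than warp_size
    entries remain, the 'CPU' finishes with a plain linear traversal.
    Returns ((min_sad, coordinate), number_of_gpu_rounds)."""
    entries = list(entries)          # (sad, coordinate) pairs
    n = len(entries)
    rounds = 0
    while n > warp_size:             # GPU reduction rounds
        half = n // 2
        for x in range(half):        # conceptually parallel
            if entries[x + half][0] < entries[x][0]:
                entries[x] = entries[x + half]
        n = half
        rounds += 1
    return min(entries[:n]), rounds  # CPU traverses the remainder
```

With 64 candidates and a warp size of 16, the reduction runs exactly two rounds before the CPU takes over, matching the embodiment.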
Step 6: If image layer K is not the original image layer, i.e. K != 1, the CPU passes the motion vectors of all search image blocks of layer K, obtained by traversal, back to GPU video memory for the search on layer K-1. If K = 1, the hierarchical-search motion estimation algorithm ends.
The present invention has the following advantages:
1. Through sensible data partitioning and thread allocation, the present invention effectively utilizes the hundreds of arithmetic units on the GPU, maximizes parallelism, and greatly improves the speed of the hierarchical-search motion estimation algorithm.
2. The present invention performs the downsampling that produces each image layer of the hierarchical search in parallel on the GPU and keeps the layers in GPU video memory, reducing communication between the CPU and the GPU and improving acceleration efficiency.
3. On each image layer of the hierarchical search, the present invention adaptively performs sensible data partitioning and thread allocation according to the specified search-region size, block size, and GPU processing capability.
4. The column-wise (per pixel column) data partitioning and thread allocation scheme proposed by the present invention handles large search regions whose number of search points exceeds the number of threads a single GPU thread block can hold, extending the range of application of the algorithm.
5. When two search points have the same SAD value, the magnitudes of the two candidate motion vectors must be compared to determine the better motion vector. In the column-wise scheme of the present invention, the two-stage search exploits the positional information of the search points, avoiding the motion-vector-magnitude comparison that GPUs handle poorly when SAD values are equal, and thus improving processing speed.
6. In the reduction that finds the minimum SAD and the best motion vector, the present invention lets the CPU take over the final comparisons once no more than a warp's worth of SAD values remain, avoiding GPU thread idling, exploiting the computing power of the CPU, and improving search speed.
Brief description of the drawings
Fig. 1 is a schematic diagram of the hierarchical-search motion estimation algorithm;
Fig. 2 is a schematic diagram of general-purpose GPU hardware structure, thread organization, and their mapping;
Fig. 3 is a flowchart of the embodiment of the present invention;
Fig. 4 is a schematic diagram of the GPU downsampling image layer 1 to generate image layer 2 in the embodiment;
Fig. 5 is a schematic diagram of the thread allocation for the GPU's parallel search on image layer 3 in the embodiment;
Fig. 6 is a schematic diagram of the two-stage search algorithm of the GPU's parallel search on image layer 3 in the embodiment;
Fig. 7 is a schematic diagram of all threads in a thread block using the reduction algorithm to find the minimum SAD among all SADs obtained by the search in the embodiment;
Fig. 8 is a schematic diagram of the thread allocation for the GPU's parallel search on image layers 2 and 1 in the embodiment;
Fig. 9 is a schematic diagram of the GPU computing the search centers before the parallel search on image layer 2 in the embodiment.
Embodiment
The present invention is described in detail below with reference to the drawings and an embodiment. The present embodiment is only one embodiment of the present invention, not all possible embodiments.
Hierarchical-search motion estimation is performed on a video sequence of 1920 x 1080 resolution; on the original image layer, the search image block size is 8 x 8, and the search region must cover an area 256 pixels wide and high centered on the current block. Each thread block of the GPU used can hold 512 threads, and the warp size of a thread block is 16 threads.
Under this application scenario, the following embodiment with three image layers can be adopted. Image layer 1 is the original image layer; the search image block size is 8 x 8 and the search region is an 8 x 8 area centered on the current block. Image layer 2 is the 960 x 540 image obtained by downsampling the original; the search block size is 8 x 8 and the search region is a 16 x 16 area centered on the current block. Image layer 3 is the 480 x 270 image obtained by downsampling layer 2; the search block size is 4 x 4 and the search region is a 64 x 64 area centered on the current block. The search block size on layer 1 must match the application requirements; the search-region size of every layer and the search block sizes of the other layers can be chosen freely as needed. The search regions and block sizes specified in this embodiment are one selectable set of values, not the only possible ones.
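A quick sanity check (our own arithmetic, not stated in the patent) that this three-layer configuration covers the required range: an offset found on layer L is doubled at every step down to layer 1, so the reachable half-ranges add up as follows.

```python
# half-widths of the centered search regions: 64 x 64, 16 x 16, 8 x 8
half_range = {3: 32, 2: 8, 1: 4}

# a layer-L offset is doubled (L - 1) times on the way down to layer 1
coverage = sum(r * 2 ** (layer - 1) for layer, r in half_range.items())

assert coverage == 32 * 4 + 8 * 2 + 4   # = 148 original-resolution pixels
assert coverage >= 128                  # covers the 256-pixel-wide requirement
```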
In this embodiment, the steps of the present invention are as shown in Fig. 3:
Step 1: The CPU transfers the reference frame and the current frame to GPU video memory. The original 1920 x 1080 image is divided into 60 x 34 downsampled image blocks of 32 x 32 pixels. 60 x 34 thread blocks are created on the GPU, each containing a 16 x 16 two-dimensional thread group. Each thread group processes one 32 x 32 downsampled image block and generates the 16 x 16 image block at the corresponding position of image layer 2. All threads perform the downsampling computation in parallel. For example, the thread at coordinate (5, 5) in a group computes the average of the four points at coordinates (10, 10), (10, 11), (11, 10), and (11, 11) in its 32 x 32 block, generating the point at coordinate (5, 5) of the corresponding 16 x 16 block on image layer 2, as shown in Fig. 4. The remaining threads work likewise. The downsampling formula (filter) used in this embodiment is one selectable option among all embodiments, not the only one.
By the same method, the 960 x 540 image layer 2 is divided into 30 x 17 downsampled image blocks of 32 x 32 pixels, and 30 x 17 thread blocks are assigned to downsample it in parallel into the 480 x 270 image layer 3.
Step 2: Determine the thread allocation scheme on image layer 3. The search region on layer 3 is 64 x 64, i.e. 4096 search points, which exceeds the maximum of 512 threads a GPU thread block can hold. Therefore, on this layer each search block is assigned a thread group of size 64, and each thread in the group computes all SAD values of one column of search points in the 64 x 64 search region, as shown in Fig. 5. Since layer 3 is the top image layer, the search center is the coordinate of the first pixel in the upper-left corner of the current block.
Step 3: Following the two-stage search method, each thread finds the point with the minimum SAD in its column of search points. As shown in Fig. 6, the thread with index x in a group evaluates the points with in-region coordinates (x, 0) through (x, 63). The thread first searches upward from the point (x, 32) to the point (x, 0), and then downward from the point (x, 33) to the point (x, 63), recording the minimum SAD value and coordinate of each stage. Finally the minima of the two stages are compared, and the smaller SAD value and its coordinate are written to shared memory.
Step 4: As shown in Fig. 7, the 64 SAD values stored in shared memory are compared by the reduction algorithm. After two iterations, 16 smaller SAD values remain.
Step 5: Since the warp size of this GPU is 16, once 16 values remain the GPU returns the remaining 16 SAD values and coordinates to the CPU, which traverses them to obtain the minimum SAD value and coordinate, and thus the motion vector.
Step 6: The CPU passes the motion vectors of all search image blocks back to GPU video memory in preparation for the search on image layer 2.
Step 7: Determine the thread allocation scheme on image layer 2. The search region on layer 2 is 16 x 16, i.e. 256 search points, which does not exceed the 512 threads a GPU thread block can hold. Therefore each search image block is assigned a 16 x 16 two-dimensional thread group, and each thread in the group computes the SAD value of the search point at the same coordinate, as shown in Fig. 8.
Step 8: Compute the search center of each search image block on image layer 2. For example, for search image block A whose first (upper-left) pixel is at coordinate (64, 64), the center is the pixel at coordinate (68, 68); the corresponding coordinate of this pixel on image layer 3 is (34, 34), which lies inside the 4 x 4 search image block B whose first pixel is at (32, 32) on layer 3. Supposing the motion vector obtained by searching block B on layer 3 is (5, -4), the search center of block A on layer 2 is set to (10, -8), as shown in Fig. 9.
Step 9: The remaining steps of the search on image layer 2 are the same as on image layer 3. Finally the motion vectors of all search image blocks on layer 2 are obtained and kept in video memory.
Step 10: The search is performed on image layer 1; the thread allocation and search scheme on layer 1 are the same as on layer 2. After the search on layer 1 completes and the motion vector of each search image block is obtained, the result is no longer passed back to the GPU but is output directly as the final result.
At this point all steps of this embodiment are complete, and the final computation result is obtained.
Through the above steps, the present invention makes full use of the parallel computing power of the GPU to complete hierarchical-search motion estimation quickly. With the adaptive thread allocation scheme, GPUs of different computing capabilities can provide high-performance acceleration for application requirements with different image resolutions, search-region sizes, and search block sizes.

Claims (4)

1. A GPU-accelerated hierarchical-search motion estimation algorithm, characterized in that it comprises the following steps:
Step 1: The CPU transfers the reference frame and the current frame to video memory; the GPU performs the downsampling of the image layers in parallel, generating the image layers used for the hierarchical search; the top image layer is set as the current image layer;
Step 2: Each search image block of the current image layer is assigned a thread block for its search; the thread block selects a thread allocation scheme according to the GPU processing capability and the search-region size specified on the current image layer, and allocates the required threads;
Step 3: Using the search results of the layer above, each thread initializes the search center of each search image block on the current image layer;
Step 4: Each thread computes the SAD at each of its assigned search points and stores the results in shared memory;
Step 5: Each thread block uses parallel reduction to find the minimum SAD recorded in shared memory, stopping the reduction when the number of remaining SAD values is less than or equal to the warp size of the thread block;
Step 6: Each thread block passes the SAD values remaining from step 5 and the corresponding search-point coordinates back to the CPU, which traverses them to find the minimum SAD and obtain the motion vector;
Step 7: If the current image layer is image layer 1, the motion vector of each search image block is output; otherwise the motion vectors are passed back to video memory, the current image layer is set to the next layer down, and the procedure returns to step 2.
2. The algorithm as claimed in claim 1, characterized in that, in step 1, the GPU's parallel downsampling of the image layers specifically comprises: dividing the image to be downsampled into downsampled image blocks of size 2D x 2D; assigning one thread block to each downsampled image block, all thread blocks executing concurrently in arbitrary order; creating a two-dimensional thread group of D x D in each thread block; the thread at coordinate (x, y) in each thread group generating one downsampled point from the four points at coordinates (2x, 2y), (2x+1, 2y), (2x, 2y+1), and (2x+1, 2y+1) within the downsampled image block.
3. The algorithm as claimed in claim 1, characterized in that, in step 2, the thread block's selection of a thread allocation scheme according to the GPU processing capability and the search-region size specified on the current image layer specifically comprises: with the search-region width being H, its height being V, and the maximum number of threads one GPU thread block can allocate being T: if V x H <= T, a two-dimensional thread group of H x V is created in the thread block, and the thread at coordinate (x, y) in the group computes the SAD at the search point with in-region coordinate (x, y); this scheme is denoted thread allocation scheme A; if V x H > T, a one-dimensional thread group of size H is created in the thread block, and the thread with index x computes the SAD at the search points with in-region coordinate (x, k), where 0 <= k < V; this scheme is denoted thread allocation scheme B.
4. The algorithm as claimed in claim 3, characterized in that thread allocation scheme B adopts a two-stage search: in the first stage, the thread with index x in the thread group starts from the point with in-region coordinate (x, V/2) and searches upward until the point (x, 0), where V is the search-region height; in the second stage, symmetrically, the thread starts from the point (x, V/2+1) and searches downward until the point (x, V-1).
CN201110040025.0A 2011-02-17 2011-02-17 GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation Active CN102647588B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110040025.0A CN102647588B (en) 2011-02-17 2011-02-17 GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110040025.0A CN102647588B (en) 2011-02-17 2011-02-17 GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation

Publications (2)

Publication Number Publication Date
CN102647588A CN102647588A (en) 2012-08-22
CN102647588B true CN102647588B (en) 2014-09-24

Family

ID=46660136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110040025.0A Active CN102647588B (en) 2011-02-17 2011-02-17 GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation

Country Status (1)

Country Link
CN (1) CN102647588B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10296340B2 (en) 2014-03-13 2019-05-21 Arm Limited Data processing apparatus for executing an access instruction for N threads

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103747262B (en) * 2014-01-08 2017-10-10 中山大学 A kind of method for estimating based on GPU
CN105448273A (en) * 2014-09-01 2016-03-30 扬智科技股份有限公司 Image processing method and system
CN106878737B (en) * 2017-03-02 2019-10-08 西安电子科技大学 Estimation accelerated method in efficient video coding
CN107872674A (en) * 2017-11-23 2018-04-03 上海交通大学 A kind of layering motion estimation method and device for ultra high-definition Video Applications
CN110837395B (en) * 2018-08-17 2022-03-25 北京图森智途科技有限公司 Normalization processing method, device and system for multi-GPU parallel training
CN110009551A (en) * 2019-04-09 2019-07-12 浙江大学 A kind of real-time blood vessel Enhancement Method of CPUGPU collaboration processing
CN116739884B (en) * 2023-08-16 2023-11-03 北京蓝耘科技股份有限公司 Calculation method based on cooperation of CPU and GPU

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1852442A (en) * 2005-08-19 2006-10-25 深圳市海思半导体有限公司 Layering motion estimation method and super farge scale integrated circuit
CN101472181A (en) * 2007-12-30 2009-07-01 英特尔公司 Configurable performance motion estimation for video encoding

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100994773B1 (en) * 2004-03-29 2010-11-16 삼성전자주식회사 Method and Apparatus for generating motion vector in hierarchical motion estimation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Gan Xinbiao et al.; "A CUDA-based parallel full-search motion estimation algorithm"; Journal of Computer-Aided Design & Computer Graphics; 2010-03-31; vol. 22, no. 3; pp. 457-460 *

Also Published As

Publication number Publication date
CN102647588A (en) 2012-08-22

Similar Documents

Publication Publication Date Title
CN102647588B (en) GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation
CN111178519B (en) Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
Yin et al. A high energy efficient reconfigurable hybrid neural network processor for deep learning applications
Kowalczuk et al. Real-time stereo matching on CUDA using an iterative refinement method for adaptive support-weight correspondences
Kim et al. A novel zero weight/activation-aware hardware architecture of convolutional neural network
Budden et al. Deep tensor convolution on multicores
US20190213439A1 (en) Switchable propagation neural network
CN102141976A (en) Method for storing diagonal data of sparse matrix and SpMV (Sparse Matrix Vector) realization method based on method
CN111738433A (en) Reconfigurable convolution hardware accelerator
CN105550974A (en) GPU-based acceleration method of image feature extraction algorithm
Sun et al. Optimizing SpMV for diagonal sparse matrices on GPU
Zlateski et al. ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs
Li et al. High throughput hardware architecture for accurate semi-global matching
CN110135569A (en) Heterogeneous platform neuron positioning three-level flow parallel method, system and medium
CN103198451A (en) Method utilizing graphic processing unit (GPU) for achieving rapid wavelet transformation through segmentation
CN109165733A (en) Multi-input multi-output matrix maximum pooling vectorization implementation method
CN104537278A Hardware acceleration method for prediction of RNA secondary structure with pseudoknots
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
CN110533710A (en) A kind of method and processing unit of the binocular ranging algorithm based on GPU
CN107316324B (en) Method for realizing real-time stereo matching and optimization based on CUDA
Ling et al. Lite-stereo: a resource-efficient hardware accelerator for real-time high-quality stereo estimation using binary neural network
CN102799564A (en) Fast fourier transformation (FFT) parallel method based on multi-core digital signal processor (DSP) platform
Li et al. Fast convolution operations on many-core architectures
CN109448092B (en) Load balancing cluster rendering method based on dynamic task granularity
CN115859011A (en) Matrix operation method, device and unit, and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant