CN107231558B - Implementation method of a CUDA-based H.264 parallel encoder - Google Patents
Implementation method of a CUDA-based H.264 parallel encoder
- Publication number
- CN107231558B CN107231558B CN201710368717.5A CN201710368717A CN107231558B CN 107231558 B CN107231558 B CN 107231558B CN 201710368717 A CN201710368717 A CN 201710368717A CN 107231558 B CN107231558 B CN 107231558B
- Authority
- CN
- China
- Prior art keywords
- encoder
- variable
- gpu
- cuda
- global variable
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/157—Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
- H04N19/159—Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/176—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
- H04N19/436—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation using parallelised computational arrangements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/80—Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/90—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
- H04N19/91—Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- Theoretical Computer Science (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The present invention relates to a method for implementing a CUDA-based H.264 parallel encoder. The method comprises optimizing the overall encoder structure and parallelizing each functional module on CUDA. The overall structure optimization includes frame-level separation of the encoder's functional modules and task division between the CPU and GPU. At the module level, the GPU parallelizes four processes of the encoder's functional modules: inter prediction, intra prediction coding, entropy coding, and deblocking filtering, realizing the encoder's parallelization on CUDA through parallel-model design, storage-model design, and related aspects.
Description
[technical field]
The invention belongs to the field of video coding, and more particularly relates to a method for implementing a CUDA-based H.264 parallel encoder.
[background technique]
H.264/AVC, currently the most popular video coding standard, is widely adopted for its high image quality and high compression ratio. However, the improved image quality and coding efficiency greatly increase the computational complexity of H.264. Existing serial encoders on general-purpose processors cannot reach real-time high-definition coding performance, while dedicated hardware has high development costs, long development cycles, and poor portability, making it unsuitable for large-scale use. An efficient implementation method for the H.264 encoder is therefore needed.
[summary of the invention]
To solve the above problems in the prior art, the present invention proposes a method for implementing a CUDA-based H.264 parallel encoder.
The technical solution adopted by the present invention is as follows:
A method for implementing a CUDA-based H.264 parallel encoder, comprising the following steps:
(1) adjusting the H.264 encoder structure, including performing frame-level separation of the encoder's functional modules and dividing the encoder's tasks between the CPU and GPU;
(2) running each functional module of the encoder in parallel on CUDA, i.e., at the module level, parallelizing four processes of the H.264 encoder's functional modules: inter prediction, intra prediction coding, entropy coding, and deblocking filtering.
Further, the frame-level separation of functional modules comprises the following steps:
(1.1) according to the functionality of the encoder's core function, separating each functional unit in the core function into an independent loop body, so that each functional unit loops independently at the frame level;
(1.2) splitting the large data structures in the encoder into multiple simple data structures according to their life cycles, and localizing them according to their actual life cycles.
Further, step (1.2) specifically comprises:
classifying the large data structures into three types: local variables, pseudo-global variables, and true global variables;
(a) if a large data structure is a local variable, it is left unchanged;
(b) if a large data structure is a pseudo-global variable, splitting the pseudo-global variable into different variables according to its actual life cycles by renaming;
(c) if a large data structure is a true global variable, examining its data structure to determine whether any member variables are pseudo-global or local variables, and if so, separating those variables out of the true global variable and processing any separated pseudo-global variables as in step (b) above.
Further, the task division between CPU and GPU comprises:
(2.1) the CPU completes the input of the video file and preprocesses it;
(2.2) the CPU transfers the original frames and reference frames of the video file to the GPU, which performs the subsequent coding operations;
(2.3) the GPU performs inter prediction;
(2.4) the GPU performs intra prediction coding;
(2.5) the GPU performs parallelized entropy coding;
(2.6) the GPU performs deblocking filtering.
Further, the inter prediction uses the multi-resolution multi-window (MRMW) algorithm.
Further, during intra prediction coding, data are loaded in a read-once, process-many manner: each thread block loads into its shared memory the data needed by multiple macroblocks, and the CUDA kernel applies predictive coding to these data through one level of looping; after the data read in one batch have been processed, the reconstructed data are written back, and then new data are loaded for processing. The corresponding kernel is organized as a double loop: the outer loop variable controls the number of loads, and the inner loop variable controls the number of processing passes over each loaded batch.
Further, processing inside the kernel is performed in units of macroblocks, each macroblock comprising multiple sub-macroblocks, and intra prediction coding comprises three stages:
First stage: each sub-macroblock is assigned to one thread in the intra prediction thread block for intra prediction processing;
Second stage: one thread in the DCT thread block performs DCT processing on one row or column of pixels in a sub-macroblock;
Third stage: one thread in the quantization thread block quantizes one pixel.
Further, during parallelized entropy coding, each CUDA thread block processes 8 consecutive macroblocks, and each thread processes the entropy coding of one sub-macroblock.
Further, the deblocking filtering operates in units of frames and includes boundary strength calculation and filtering.
Further, the preprocessing includes separating the video into its YUV components and setting the encoder's basic parameters.
The beneficial effects of this method are that it improves the execution efficiency of the H.264 encoder, reducing the computational complexity of coding and increasing coding speed without degrading coding performance.
[Brief description of the drawings]
The drawings described here are provided for a further understanding of the invention and constitute a part of this application, but do not constitute an improper limitation of the invention. In the drawings:
Fig. 1 is a schematic diagram of the invention's division of the core function's loop body.
Fig. 2 is a schematic diagram of the invention's simplification and localization of data structures.
Fig. 3 is the invention's CPU-GPU task division diagram.
Fig. 4 is the invention's intra prediction coding storage model.
Fig. 5 is the invention's CAVLC coding stage CUDA parallel model.
Fig. 6 is a schematic diagram of the separation of the deblocking filtering function.
[Detailed description of the embodiments]
The present invention is described in detail below with reference to the drawings and specific embodiments; the illustrative examples and explanations therein serve only to explain the invention and are not to be taken as limiting it.
Based on the serial H.264 program x264 and an analysis of that program, the present invention proposes, in accordance with the CUDA architecture, a parallel H.264 encoder framework and a method for implementing the parallel H.264 encoder on CUDA. The method comprises the following two aspects:
(1) Overall structure optimization
The overall structure optimization adjusts the H.264 encoder structure and designs the framework of the CUDA-based H.264 parallel encoder. The adjustment and design mainly comprise two aspects: performing frame-level separation of the encoder's functional modules, and dividing tasks between the CPU and GPU.
(2) Parallelization of each functional module on CUDA
At the module level, four processes of the H.264 encoder's functional modules are parallelized: inter prediction, intra prediction coding, entropy coding, and deblocking filtering. The encoder's parallelization on CUDA is realized through parallel-model design, storage-model design, and related aspects.
These two aspects of the method are described in detail below.
Frame-level separation of functional modules:
The frame-level separation of functional modules proceeds as follows:
(1.1) Loosening the coupling between functions
In the H.264 encoder, the core function (the main function) is one large loop body. As shown in the upper part of Fig. 1, A is the main function; it contains all of the functions below it (D1', ..., D5, D6, ..., D7, E1, E2, E3, E4, E5) inside one big loop body, and each iteration of the main function executes all of these functions. This makes the loop body path long; developing a parallel program directly from it would make the function load too heavy.
Therefore, according to the functionality of the core function, the invention divides the entire loop of the core function into multiple relatively independent loop bodies, as shown in the lower part of Fig. 1, so that each function loops independently at the frame level. Each of D1', ..., D5, D6, ..., D7, E1, E2, E3, E4, E5 becomes an independent loop body: for example, the D1' function is one loop body, the D5 function is one loop body, the D7 function is one loop body, the E1 function is one loop body, and so on. In this way, each function independently focuses on one task and loops on its own; during the execution of each loop body, instruction locality is better and the cache miss count is lower.
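The split described above is ordinary loop fission applied at the frame level. A minimal sketch of the transformation (in Python, with generic stage functions standing in for D1', D5, etc.; the real encoder stages also carry cross-frame reference data, so this is an illustration under the assumption that each stage consumes only the previous stage's per-frame output):

```python
# Before: one big loop body -- every stage runs inside the per-frame loop,
# so the loop path is long and hard to offload as a whole.
def encode_monolithic(frames, stages):
    results = []
    for frame in frames:
        data = frame
        for stage in stages:          # all stages coupled in one loop body
            data = stage(data)
        results.append(data)
    return results

# After frame-level separation: each stage gets its own independent
# frame-level loop, so each short loop body can be offloaded
# (e.g. to a CUDA kernel) on its own.
def encode_separated(frames, stages):
    data = list(frames)
    for stage in stages:              # one independent frame-level loop per stage
        data = [stage(x) for x in data]
    return data
```

Both versions produce the same per-frame results; only the loop structure changes.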
(1.2) Simplifying and localizing the data structures in the H.264 encoder
Referring to Fig. 2, in order to reduce data transfer time, the invention splits the large data structures in the encoder into multiple simple data structures according to their life cycles and localizes them according to their actual life cycles. Specifically, the large data structures can be classified into three types: local variables, pseudo-global variables, and true global variables.
A local variable, such as variable A inside function 0 in Fig. 2, is left unchanged.
A pseudo-global variable such as B is declared as a global variable, but its scope of action can be split into multiple actual life cycles; by renaming, the pseudo-global variable is split into different variables according to its actual life cycles. As shown in Fig. 2, the values of pseudo-global variable B in function 0 and function 1 are unrelated, so B can be split into two life cycles: the pseudo-global variable is renamed B0 in function 1, and since variable B is not used in function 2, it need not be redefined there.
For a true global variable C, its data structure must be examined to determine whether any member variables are pseudo-global or local; if so, those variables are separated out of C, and any separated pseudo-global variables are processed as above. As shown in Fig. 2, the true global variable C can be split into a pseudo-global variable and a local variable; the scope of the pseudo-global variable is then restricted to function 0 and function 1, and the scope of the local variable C0 is restricted to function 2.
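The pseudo-global case can be illustrated with a small hypothetical sketch (the names function0, function1, function2 and variables A, B, B0 follow Fig. 2; this is not the patent's actual code):

```python
# Before localization, B would be declared global even though the value
# written in function0 is never read by function1 -- two unrelated life
# cycles, which makes B only a "pseudo-global" variable.
#
# After localization by renaming, each life cycle becomes its own local
# variable, so no state needs to persist in global storage (and, on CUDA,
# nothing needs to round-trip through global memory) between the functions.

def function0():
    B = 10          # first life cycle of the former global B
    A = B + 1       # A was already local: left unchanged
    return A

def function1():
    B0 = 20         # second life cycle, renamed B -> B0
    return B0 * 2

def function2():
    return 7        # B was never used here, so nothing is defined
```

After the split, each function is self-contained and can be compiled or offloaded independently.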
Task division between CPU and GPU
Referring to Fig. 3, which shows the invention's division of the H.264 encoder's functional modules between CPU and GPU, as well as the data movement between CPU and GPU:
(2.1) First, the CPU completes the input of the video file and preprocesses it, including separating the video into its YUV components and setting the encoder's basic parameters.
(2.2) The CPU transfers the original frames and reference frames to the GPU, which performs the subsequent coding operations.
The GPU processes frames by executing four modules, one frame at a time. The basic flow is: after the inter prediction of a frame finishes, the corresponding intra prediction coding is performed, then entropy coding is applied to the resulting transform coefficients, and so on; only after the entropy coding and deblocking filtering of the whole frame have finished are the result data passed back to the CPU.
(2.3) The GPU performs inter prediction.
Inter prediction is the most computationally demanding part of the H.264 encoder: conventional inter prediction accounts for about 70% of the computation of the entire encoder, and although it yields good image quality, it is complex. The invention performs inter prediction using the prior-art multi-resolution multi-window (MRMW) algorithm. Because the invention applies frame-level separation to the functional modules, using the MRMW algorithm can greatly reduce the inter prediction time relative to the prior art.
(2.4) The GPU performs intra prediction coding.
The degree of parallelism in intra prediction is not high: the maximum data volume each CUDA thread block can process at once is one macroblock (256 pixels), so the pressure on shared memory is small, but there are producer-consumer relations between adjacent macroblocks. To reduce the number of accesses to related data in global memory, the invention loads data in a read-once, process-many manner: each thread block loads into its shared memory the data needed by multiple macroblocks, and the CUDA kernel applies predictive coding to these data through one level of looping; after the data read in one batch have been processed, the reconstructed data are written back, and then new data are loaded for processing. The corresponding kernel is organized as a double loop: the outer loop variable controls the number of loads, and the inner loop variable controls the number of processing passes over each loaded batch.
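The read-once, process-many double loop can be sketched as a simplified host-side model (Python; the `BATCH` size of macroblocks per shared-memory load and the `predict` callback are assumed parameters, not values from the patent):

```python
BATCH = 4  # macroblocks loaded into "shared memory" per outer iteration (assumed)

def intra_code_frame(macroblocks, predict):
    """Model of the kernel's double loop.
    Outer loop: one iteration per shared-memory load.
    Inner loop: one iteration per macroblock in the loaded batch."""
    reconstructed = []
    n_loads = (len(macroblocks) + BATCH - 1) // BATCH
    for load in range(n_loads):                               # outer: number of loads
        shared = macroblocks[load * BATCH:(load + 1) * BATCH]  # stand-in for shared memory
        for mb in shared:                                     # inner: per-macroblock passes
            reconstructed.append(predict(mb))                 # predict + reconstruct
        # here the real kernel writes the reconstructed batch back to
        # global memory before loading the next strip of the frame
    return reconstructed
```

The point of the pattern is that each macroblock's data crosses the global-to-shared-memory boundary once, regardless of how many processing passes touch it.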
Referring to Fig. 4, which illustrates the storage model of intra prediction coding: the upper left of Fig. 4 shows an image frame composed of multiple macroblocks (MB). Each time frame data are read, one strip is read from the original image frame and stored in shared memory (as shown in the upper right of Fig. 4), and the strip is processed inside the kernel in units of macroblocks.
The middle and lower parts of Fig. 4 show the kernel's processing of one macroblock. The left part of Fig. 4 shows a 4x4 macroblock comprising sub-macroblocks 0 through 15, each sub-macroblock containing 4x4 pixels. Intra prediction coding comprises three stages:
First stage: as in the left and lower-left parts of Fig. 4, each sub-macroblock is assigned to one thread of the intra prediction thread block (prediction thread block) for intra prediction processing, requiring 16 threads in total (thread 0 to thread 15).
Second stage: as in the middle and lower-middle parts of Fig. 4, one thread of the DCT thread block performs DCT processing on one row or column of pixels in a sub-macroblock, requiring 64 threads in total (thread 0 to thread 63).
Third stage: as in the right and lower-right parts of Fig. 4, one thread of the quantization thread block (quant thread block) quantizes one pixel (in row-major order), requiring 256 threads in total (thread 0 to thread 255).
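The thread counts of the three stages follow directly from the macroblock geometry (a 16x16-pixel macroblock split into 16 sub-macroblocks of 4x4 pixels). A small sketch of the index arithmetic, reproducing the counts from Fig. 4 (the helper `quant_thread_for_pixel` is illustrative, not from the patent):

```python
MB_SIZE = 16                               # macroblock is 16x16 pixels
SUB_SIZE = 4                               # sub-macroblock is 4x4 pixels
SUBS_PER_MB = (MB_SIZE // SUB_SIZE) ** 2   # 16 sub-macroblocks per macroblock

# Stage 1: one thread per sub-macroblock.
prediction_threads = SUBS_PER_MB                 # 16 threads

# Stage 2: one thread per row-or-column of each sub-macroblock.
dct_threads = SUBS_PER_MB * SUB_SIZE             # 64 threads

# Stage 3: one thread per pixel of the macroblock.
quant_threads = MB_SIZE * MB_SIZE                # 256 threads

def quant_thread_for_pixel(x, y):
    """Row-major mapping of pixel (x, y) in the macroblock to a thread id."""
    return y * MB_SIZE + x
```

Each stage therefore widens the parallelism by a factor of four over the previous one.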
(2.5) The GPU performs parallelized entropy coding.
Referring to Fig. 5, the CUDA parallel model of the CAVLC coding stage, which shows the mapping between data and threads in the luma AC component entropy coding stage: each CUDA thread block processes 8 consecutive macroblocks, i.e., thread block B0 processes MB0 to MB7 in row 0, thread block B14 processes MB112 to MB119, and so on. Within a thread block, every 16 consecutive threads process the 16 sub-macroblocks of one macroblock. Fig. 5 contains 1020 thread blocks of 128 threads each, so the thread count reaches 130560; each thread processes the entropy coding of one sub-macroblock, realizing 130560-thread parallel entropy coding.
Although entropy coding is branch-intensive, the frame-level separation of the functional modules has separated the various components and eliminated some single paths, and the large-scale data parallelism realized by the many threads is enough to compensate for the impact of the branch operations.
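For a 1080p frame of 120x68 = 8160 macroblocks (an assumed geometry, consistent with Fig. 5's "B14 processes MB112 to MB119" implying 120 macroblocks per row), the macroblock-to-thread mapping and the total thread count work out as follows:

```python
MBS_PER_BLOCK = 8          # macroblocks per CUDA thread block
THREADS_PER_MB = 16        # one thread per 4x4 sub-macroblock
MB_COLS, MB_ROWS = 120, 68 # macroblock grid of a 1080p luma frame (assumed)

def thread_for_submb(mb_index, sub_index):
    """Map (macroblock index, sub-macroblock index) to
    (thread block index, thread index within the block)."""
    block = mb_index // MBS_PER_BLOCK
    thread = (mb_index % MBS_PER_BLOCK) * THREADS_PER_MB + sub_index
    return block, thread

total_mbs = MB_COLS * MB_ROWS                        # 8160 macroblocks
total_blocks = total_mbs // MBS_PER_BLOCK            # 1020 thread blocks
threads_per_block = MBS_PER_BLOCK * THREADS_PER_MB   # 128 threads per block
total_threads = total_blocks * threads_per_block     # 130560 threads
```

The mapping reproduces the figure's layout: MB112 lands at the start of block 14, and 1020 blocks of 128 threads give the 130560-thread total stated for the CAVLC stage.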
(2.6) The GPU performs deblocking filtering. As shown in Fig. 6, the deblocking filtering operates in units of frames and includes boundary strength calculation and filtering.
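The boundary strength (BS) calculation is defined by the H.264 standard, not by this patent; a simplified per-edge sketch of its decision cascade (single reference picture, quarter-pel motion vectors; the real standard also compares reference indices per list and handles field coding):

```python
def boundary_strength(p, q, edge_on_mb_boundary):
    """Simplified H.264 boundary strength for the edge between 4x4 blocks
    p and q. Each block is a dict with keys 'intra' (bool),
    'nonzero_coeffs' (bool), 'ref' (int), and 'mv' (tuple, quarter-pel
    units). Returns an integer BS in 0..4."""
    if p['intra'] or q['intra']:
        return 4 if edge_on_mb_boundary else 3      # intra: strongest filtering
    if p['nonzero_coeffs'] or q['nonzero_coeffs']:
        return 2                                     # residual present
    if (p['ref'] != q['ref']
            or abs(p['mv'][0] - q['mv'][0]) >= 4
            or abs(p['mv'][1] - q['mv'][1]) >= 4):
        return 1                                     # motion discontinuity
    return 0                                         # BS == 0: edge not filtered
```

Because BS for each edge depends only on the two adjacent blocks' coding data, the per-edge calculation is naturally data-parallel across a frame, which is what makes frame-level GPU deblocking practical.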
Through the above process, the invention realizes the parallelization of H.264 on CUDA at both the system level and the module level, reducing the computational complexity of coding and increasing coding speed without degrading coding performance.
The above describes only preferred embodiments of the invention; all equivalent changes or modifications made according to the structure, features, and principles described in the scope of the invention's patent claims are included within that scope.
Claims (6)
1. A method for implementing a CUDA-based H.264 parallel encoder, characterized in that the method comprises the following steps:
(1) adjusting the H.264 encoder structure, including performing frame-level separation of the encoder's functional modules and dividing the encoder's tasks between the CPU and GPU;
(2) running each functional module of the encoder in parallel on CUDA, i.e., at the module level, parallelizing four processes of the H.264 encoder's functional modules: inter prediction, intra prediction coding, entropy coding, and deblocking filtering;
wherein the frame-level separation of functional modules comprises the following steps:
(1.1) according to the functionality of the encoder's core function, separating each functional unit in the core function into an independent loop body, so that each functional unit loops independently at the frame level;
(1.2) splitting the large data structures in the encoder into multiple simple data structures according to their life cycles, and localizing them according to their actual life cycles;
wherein step (1.2) specifically comprises:
classifying the large data structures into three types: local variables, pseudo-global variables, and true global variables, a pseudo-global variable being a variable that is declared as a global variable but whose scope of action can be split into multiple actual life cycles;
(a) if a large data structure is a local variable, it is left unchanged;
(b) if a large data structure is a pseudo-global variable, splitting the pseudo-global variable into different variables according to its actual life cycles by renaming;
(c) if a large data structure is a true global variable, examining its data structure to determine whether any member variables are pseudo-global or local variables, and if so, separating those variables out of the true global variable and processing any separated pseudo-global variables as in step (b) above;
wherein during intra prediction coding, data are loaded in a read-once, process-many manner: each thread block loads into its shared memory the data needed by multiple macroblocks, and the CUDA kernel applies predictive coding to these data through one level of looping; after the data read in one batch have been processed, the reconstructed data are written back, and then new data are loaded for processing; the corresponding kernel is organized as a double loop, with the outer loop variable controlling the number of loads and the inner loop variable controlling the number of processing passes over each loaded batch;
wherein processing inside the kernel is performed in units of macroblocks, each macroblock comprising multiple sub-macroblocks, and intra prediction coding comprises three stages:
first stage: each sub-macroblock is assigned to one thread in the intra prediction thread block for intra prediction processing;
second stage: one thread in the DCT thread block performs DCT processing on one row or column of pixels in a sub-macroblock;
third stage: one thread in the quantization thread block quantizes one pixel.
2. The method according to claim 1, characterized in that the task division between CPU and GPU comprises:
(2.1) the CPU completes the input of the video file and preprocesses it;
(2.2) the CPU transfers the original frames and reference frames of the video file to the GPU, which performs the subsequent coding operations;
(2.3) the GPU performs inter prediction;
(2.4) the GPU performs intra prediction coding;
(2.5) the GPU performs parallelized entropy coding;
(2.6) the GPU performs deblocking filtering.
3. The method according to claim 2, characterized in that the inter prediction uses the multi-resolution multi-window (MRMW) algorithm.
4. The method according to any one of claims 2-3, characterized in that during parallelized entropy coding, each CUDA thread block processes 8 consecutive macroblocks, and each thread processes the entropy coding of one sub-macroblock.
5. The method according to any one of claims 2-3, characterized in that the deblocking filtering operates in units of frames and includes boundary strength calculation and filtering.
6. The method according to any one of claims 2-3, characterized in that the preprocessing includes separating the video into its YUV components and setting the encoder's basic parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710368717.5A CN107231558B (en) | 2017-05-23 | 2017-05-23 | A kind of implementation method of the H.264 parallel encoder based on CUDA |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710368717.5A CN107231558B (en) | 2017-05-23 | 2017-05-23 | A kind of implementation method of the H.264 parallel encoder based on CUDA |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107231558A CN107231558A (en) | 2017-10-03 |
CN107231558B true CN107231558B (en) | 2019-10-22 |
Family
ID=59933794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710368717.5A Active CN107231558B (en) | 2017-05-23 | 2017-05-23 | A kind of implementation method of the H.264 parallel encoder based on CUDA |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107231558B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108012156B (en) * | 2017-11-17 | 2020-09-25 | 深圳市华尊科技股份有限公司 | Video processing method and control platform |
WO2021042232A1 (en) * | 2019-09-02 | 2021-03-11 | Beijing Voyager Technology Co., Ltd. | Methods and systems for improved image encoding |
CN110677646B (en) * | 2019-09-24 | 2022-01-11 | 杭州当虹科技股份有限公司 | Intra-frame coding prediction method based on CPU + GPU hybrid coding |
CN114765684B (en) * | 2021-01-12 | 2023-05-09 | 四川大学 | JPEG parallel entropy coding method based on GPU |
CN115802055B (en) * | 2023-01-30 | 2023-06-20 | 孔像汽车科技(武汉)有限公司 | Image defogging processing method and device based on FPGA, chip and storage medium |
CN116483545B (en) * | 2023-06-19 | 2023-09-29 | 支付宝(杭州)信息技术有限公司 | Multitasking execution method, device and equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2192781A2 (en) * | 2008-11-28 | 2010-06-02 | Thomson Licensing | Method for video decoding supported by graphics processing unit |
CN102404561A (en) * | 2010-09-14 | 2012-04-04 | 盛乐信息技术(上海)有限公司 | Method for achieving moving picture experts group (MPEG) 4I frame encoding on compute unified device architecture (CUDA) |
CN104022756A (en) * | 2014-06-03 | 2014-09-03 | 西安电子科技大学 | Modified particle filter method based on GPU (Graphic Processing Unit) architecture |
CN105491377A (en) * | 2015-12-15 | 2016-04-13 | 华中科技大学 | Video decoding macro-block-grade parallel scheduling method for perceiving calculation complexity |
CN105956021A (en) * | 2016-04-22 | 2016-09-21 | 华中科技大学 | Automated task parallel method suitable for distributed machine learning and system thereof |
-
2017
- 2017-05-23 CN CN201710368717.5A patent/CN107231558B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2192781A2 (en) * | 2008-11-28 | 2010-06-02 | Thomson Licensing | Method for video decoding supported by graphics processing unit |
CN102404561A (en) * | 2010-09-14 | 2012-04-04 | 盛乐信息技术(上海)有限公司 | Method for achieving moving picture experts group (MPEG) 4I frame encoding on compute unified device architecture (CUDA) |
CN104022756A (en) * | 2014-06-03 | 2014-09-03 | 西安电子科技大学 | Modified particle filter method based on GPU (Graphic Processing Unit) architecture |
CN105491377A (en) * | 2015-12-15 | 2016-04-13 | 华中科技大学 | Video decoding macro-block-grade parallel scheduling method for perceiving calculation complexity |
CN105956021A (en) * | 2016-04-22 | 2016-09-21 | 华中科技大学 | Automated task parallel method suitable for distributed machine learning and system thereof |
Non-Patent Citations (2)
Title |
---|
A Parallel H.264 Encoder with CUDA: Mapping and Evaluation; Nan Wu, et al.; 2012 IEEE 18th International Conference on Parallel and Distributed Systems; 2013-01-17; pp. 276-281 *
High-Definition Video Coding Technology and Implementation Based on TMS320DM8168; Jiang Zhongbing, et al.; Journal of Data Acquisition and Processing; 2012-11-30; Vol. 27, No. 6; pp. 690-694 *
Also Published As
Publication number | Publication date |
---|---|
CN107231558A (en) | 2017-10-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107231558B (en) | A kind of implementation method of the H.264 parallel encoder based on CUDA | |
CN104869398B (en) | A kind of CABAC realized based on CPU+GPU heterogeneous platforms in HEVC parallel method | |
CN105491377B (en) | A kind of video decoded macroblock grade Method of Scheduling Parallel of computation complexity perception | |
CN102547296B (en) | Motion estimation accelerating circuit and motion estimation method as well as loop filtering accelerating circuit | |
CN100586180C (en) | Be used to carry out the method and system of de-blocking filter | |
CN101115207B (en) | Method and device for implementing interframe forecast based on relativity between future positions | |
CN108449603B (en) | Based on the multi-level task level of multi-core platform and the parallel HEVC coding/decoding method of data level | |
CN109495743A (en) | A kind of parallelization method for video coding based on isomery many places platform | |
CN105791829B (en) | A kind of parallel intra-frame prediction method of HEVC based on multi-core platform | |
CN101971633A (en) | A video coding system with reference frame compression | |
CN102625108B (en) | Multi-core-processor-based H.264 decoding method | |
CN105516728B (en) | A kind of parallel intra-frame prediction method of H.265/HEVC middle 8x8 sub-macroblock | |
CN102970531A (en) | Method for implementing near-lossless image compression encoder hardware based on joint photographic experts group lossless and near-lossless compression of continuous-tone still image (JPEG-LS) | |
CN103297777A (en) | Method and device for increasing video encoding speed | |
CN103747250A (en) | Method for 4*4 sub-macroblock parallel intraframe prediction in H.264/AVC | |
CN110337002A (en) | The multi-level efficient parallel decoding algorithm of one kind HEVC in multi-core processor platform | |
CN101635849B (en) | Loop filtering method and loop filter | |
CN1306826C (en) | Loop filter based on multistage parallel pipeline mode | |
CN107483948A (en) | Pixel macroblock processing method in a kind of webp compressions processing | |
CN109391816B (en) | Parallel processing method for realizing entropy coding link in HEVC (high efficiency video coding) based on CPU (Central processing Unit) and GPU (graphics processing Unit) heterogeneous platform | |
CN108540797A (en) | HEVC based on multi-core platform combines WPP coding methods within the frame/frames | |
CN111669580B (en) | Method, decoding end, encoding end and system for encoding and decoding | |
CN102196272A (en) | P frame encoding method and device | |
CN110446043A (en) | A kind of HEVC fine grained parallel coding method based on multi-core platform | |
CN104396246B (en) | Video compressing and encoding method and encoder |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
CB02 | Change of applicant information |
Address after: Rooms 307, 309 and 311, No. 959 Jiayuan Road, Yuanhe Street, Xiangcheng District, Suzhou City, Jiangsu Province Applicant after: Jiangsu Fire Interactive Technology Co., Ltd. Address before: No. 209 Zhuyuan Road, Suzhou High-tech Zone, Jiangsu Province, 215000 Applicant before: Jiangsu Fire Interactive Technology Co., Ltd. |
|
GR01 | Patent grant | ||
GR01 | Patent grant |