CN115866268A - Ultra-high-definition video layered decoding method based on processor heterogeneous parallel computing - Google Patents

Ultra-high-definition video layered decoding method based on processor heterogeneous parallel computing

Info

Publication number
CN115866268A
CN115866268A
Authority
CN
China
Prior art keywords
thread
decoding
block
bit
scheduling unit
Prior art date
Legal status
Pending
Application number
CN202211518947.2A
Other languages
Chinese (zh)
Inventor
Chen Xiang (陈祥)
Bi Jiang (毕江)
Wang Lidong (王立冬)
Sun Haifeng (孙海峰)
Li Cheng (李程)
Zhang Ning (张宁)
Zheng Jianhong (郑建宏)
Zhang Jinsha (张金沙)
Current Assignee
Beijing Radio And Television Station
Chengdu Sobey Digital Technology Co Ltd
Original Assignee
Beijing Radio And Television Station
Chengdu Sobey Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Radio And Television Station, Chengdu Sobey Digital Technology Co Ltd filed Critical Beijing Radio And Television Station
Priority to CN202211518947.2A priority Critical patent/CN115866268A/en
Publication of CN115866268A publication Critical patent/CN115866268A/en
Pending legal-status Critical Current

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses an ultra-high-definition video layered decoding method based on processor heterogeneous parallel computing, which belongs to the field of video decoding and comprises the following steps: S1, concurrent scheduling lets the CPU decode the base layer while the GPU decodes the enhancement layer, and the enhancement layer and base layer decoded data of the same frame are obtained synchronously; S2, the base layer video data are uploaded to the GPU; S3, the GPU adds the base layer data to the LL band of the enhancement layer data, obtaining the enhanced decoded picture of the base layer, again denoted LL; S4, the GPU performs an H1V1 inverse wavelet transform on the obtained LL, LH, HL and HH to obtain the decoded picture; and S5, different processing modes are selected according to the input decoding output parameters. The invention is compatible with the production systems currently used by television stations, effectively reduces storage bandwidth pressure while realizing multi-resolution layered decoding, reduces PCI-E bandwidth pressure, and realizes multi-layer real-time editing at higher resolutions.

Description

Ultra-high definition video layered decoding method based on processor heterogeneous parallel computing
Technical Field
The invention relates to the field of video decoding, in particular to an ultra-high definition video layered decoding method based on processor heterogeneous parallel computing.
Background
The professional codec formats widely used for ultra-high-definition content production in China's broadcasting and television industry are generally XAVC, ProRes and DNxHR. Through years of practice and exploration, industry technology providers represented by Sobey can already achieve real-time editing of 11 layers of 4K source material, but, limited by the computing power of hardware such as the CPU and GPU and by PCI-E bandwidth, multi-layer real-time editing at larger frame sizes is difficult to achieve. The common solution to this problem at present is proxy resolution: entity files at multiple resolutions are generated simultaneously during production, and the low-resolution file is used during editing. This works around the hardware limits, but it increases storage and computing overhead and system complexity, and therefore raises the construction cost of the production environment.
For layered decoding, wavelet-transform-based codecs represented by JPEG XS come to mind readily, but JPEG XS is a specific codec format: it is difficult to make it compatible with the proprietary formats currently used for ultra-high-definition content production in China's broadcasting and television industry, and it is difficult for its layers to occupy different amounts of storage bandwidth during layered decoding.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an ultra-high-definition video layered decoding method based on processor heterogeneous parallel computing.
The purpose of the invention is realized by the following scheme:
An ultra-high-definition video layered decoding method based on processor heterogeneous parallel computing comprises the following steps:
S1, concurrent scheduling lets the CPU decode the base layer while the GPU decodes the enhancement layer, so that the enhancement layer and base layer decoded data of the same frame are obtained synchronously; the enhancement layer decodes into data of four frequency bands: the first band, located at the upper left corner of the data, is denoted LL; the second band, at the upper right corner, LH; the third band, at the lower left corner, HL; and the last band, at the lower right corner, HH; the base layer decodes into base layer video data;
S2, uploading the base layer video data to the GPU through the CUDA API;
S3, using the GPU to add the base layer data to the LL band of the enhancement layer data, obtaining the enhanced decoded picture of the base layer, again denoted LL;
S4, using the GPU to perform an inverse wavelet transform on the obtained LL, LH, HL and HH to obtain the final decoded picture;
S5, selecting different processing modes according to the input decoding output parameters: if the output is a DirectX or OpenGL two-dimensional texture object, data exchange is completed through the corresponding CUDA interoperability interface; if it is a CUDA buffer, the data are copied directly with the CUDA API; and if it is a CPU buffer, the data are downloaded to the CPU with the CUDA API.
Further, in step S1, GPU decoding of the enhancement layer comprises the following sub-steps:
step 3.1: using the CPU to parse the structure of the enhancement layer code stream and complete frame header parsing, obtaining the parameters necessary for decoding, including width and height information, the slice height, the width and height of each block in a slice, the number of blocks contained in each block group in a slice, and the quantization table;
step 3.2: using the CPU to parse the code stream and obtain the byte start position of each independently decodable block group and the quantization coefficient of each slice;
step 3.3: uploading the code stream data, the block group byte start positions and the slice quantization coefficients to the GPU;
step 3.4: CUDA concurrent scheduling starts multiple thread concurrent scheduling units; each unit completes the full decoding of one block group, and the units cooperate to decode the whole frame of data;
step 3.5: calling the CUDA API to complete device synchronization, waiting for the GPU to finish decoding all block groups, and finally obtaining the wavelet coefficients in CUDA video memory.
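The scheduling shape of steps 3.2–3.5 — independently decodable block groups dispatched to concurrent units, then a synchronization barrier — can be modelled on the CPU as below. This is a sketch, not the CUDA implementation: each "thread concurrent scheduling unit" is stood in for by one thread-pool task, and the entropy decode is replaced by a trivial slice of the payload.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_block_group(stream, start, end):
    # Stand-in for the real per-block-group entropy decode (step 3.4):
    # here we just materialize the group's byte range.
    return list(stream[start:end])

def decode_frame_groups(stream, group_offsets):
    """group_offsets[i] is the byte start position of block group i,
    as produced by the CPU-side parse of step 3.2."""
    bounds = list(group_offsets) + [len(stream)]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(decode_block_group, stream, bounds[i], bounds[i + 1])
                   for i in range(len(group_offsets))]
        # Collecting all results is the analogue of the device
        # synchronization in step 3.5.
        return [f.result() for f in futures]
```

Because each block group carries its own byte start position and quantization coefficients, no unit ever waits on another unit's bitstream position — which is what makes the per-group parallelism of step 3.4 possible.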
Further, in step 3.4, it is assumed that all threads in one thread concurrent scheduling unit are laid out one-dimensionally and that the thread ID variable threadIdx.x within the unit ranges over [0,31], i.e. there are 32 threads per unit; the method comprises the following sub-steps:
step 3.4.1: hole coefficient decoding: the hole coefficients have two levels. One is the block level: a block-level hole coefficient indicates whether a block in the block group is fully valid, fully invalid, or partially valid. The other is the intra-block code group level, which indicates whether a code group within a block is valid. Throughout hole coefficient decoding, the decoding results of the two levels are stored in shared memory, recorded as block_mode and codegroup_mode respectively, and remain valid for the whole subsequent decoding process. Hole coefficient decoding proceeds according to the following steps, block by block in block-group order, until all blocks in the block group have finished; after the hole coefficients of the whole block group are decoded, the block group's decoded-bit count is byte-aligned to skip the padding bits;
step 3.4.2: counting the number of valid coefficients in the block group;
step 3.4.3: prefix coefficient decoding: this is a bitwise parse of the code stream; each bit with value 1 marks one valid coefficient, and the number of consecutive 0 bits before it is that coefficient's prefix. Each thread reads a fixed number of bits in parallel, and decoding is completed through thread communication within the unit until the count of decoded valid coefficients reaches the total number of valid coded coefficients in the block group; after prefix decoding of the block group is complete, the decoded-bit count is byte-aligned to skip the padding bits;
step 3.4.4: suffix coefficient decoding and inverse quantization: suffix coefficients are decoded with each thread in the unit decoding one coefficient; each thread independently takes its threadIdx.x variable as the starting index and loops with a stride of 32 until all valid coefficients are decoded;
step 3.4.5: inverse-quantized coefficient post-processing: an inverse Hadamard transform is performed on each group of four components within the block, and the data are written back from shared memory to global memory.
Further, in step 3.4.1, the following sub-steps are included:
step 3.4.1.1: reading a bit by a thread with a threadadIdx.x variable equal to 0 in the thread concurrent scheduling unit, updating the block _ mode and the block group decoded read bit quantity, judging the block _ mode state after the thread concurrent scheduling unit is synchronized, if the block is invalid, completely setting all block hole coefficients corresponding to the current block to be 0, and then executing the step 3.4.1.1 again; otherwise, executing the next step;
step 3.4.1.2: reading a bit by a thread with a threadadIdx.x variable equal to 0 in the thread concurrent scheduling unit, updating the block _ mode and the block group decoded read bit quantity, judging the block _ mode state after the thread concurrent scheduling unit is synchronized, if the block is completely effective, completely setting the block hole coefficient corresponding to the current block to be 1, and then re-executing 3.4.1.1, otherwise, executing the next step;
step 3.4.1.3: reading a bit by a thread with a threadadIdx.x variable equal to 0 in the thread concurrent scheduling unit, updating the number of block _ mode and block group decoded read bits, judging the state of the block _ mode after the thread concurrent scheduling unit is synchronized, executing a step 3.4.1.4 if the block hole coefficient is represented to have a tree structure, and otherwise executing a step 3.4.1.5;
step 3.4.1.4: the current step shows that the current block hole coefficient is composed of a multi-level tree structure, the tree structure is decoded, and after the tree structure decoding is completed, the step 3.4.1.1 is executed again;
step 3.4.1.5: the current step indicates that each code group of the current block corresponds to a hole coefficient, each thread in the thread concurrent scheduling unit decodes a bit in parallel until all hole coefficients are decoded, the thread with the thrededdx.x variable equal to 31 updates the number of bits already read by block group decoding, and finally the step 3.4.1.1 is executed again.
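The control flow of steps 3.4.1.1–3.4.1.5 can be illustrated with a sequential sketch (the GPU version parallelizes the per-bit reads across the unit's threads). The bit polarities and the omission of the tree mode of step 3.4.1.4 are assumptions made for illustration only; the patent does not specify the actual bit values.

```python
class BitReader:
    """Minimal bit-level reader over a '0'/'1' string."""
    def __init__(self, bits):
        self.bits, self.pos = bits, 0

    def read(self, n=1):
        v = int(self.bits[self.pos:self.pos + n], 2)
        self.pos += n
        return v

def decode_hole_coeffs(reader, n_blocks, groups_per_block):
    """Per-block hole coefficient decode: one bit 'block invalid?',
    else one bit 'block fully valid?', else one tree/per-group flag bit
    (tree mode assumed absent here), then one bit per code group."""
    block_mode, codegroup_mode = [], []
    for _ in range(n_blocks):
        if reader.read():                  # step 3.4.1.1: block invalid
            block_mode.append(0)
            codegroup_mode.append([0] * groups_per_block)
        elif reader.read():                # step 3.4.1.2: block fully valid
            block_mode.append(1)
            codegroup_mode.append([1] * groups_per_block)
        else:
            reader.read()                  # step 3.4.1.3: tree flag (ignored)
            block_mode.append(2)           # partially valid block
            # step 3.4.1.5: one hole coefficient bit per code group
            codegroup_mode.append([reader.read() for _ in range(groups_per_block)])
    return block_mode, codegroup_mode
```

After this loop the decoder byte-aligns the decoded-bit count, as step 3.4.1 requires, before moving on to counting valid coefficients.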
Further, in step 3.4.2, the following sub-steps are included:
step 3.4.2.1: each thread in the thread concurrent scheduling unit accesses codetrip _ mode in a way of 32 span each time by taking a threedidX.x variable as an initial state of loop statistics, and accumulates the access result each time to obtain the non-zero number codetrip _ count _ thread in the hole coefficient of the statistic code block in each thread until the statistics of all the hole coefficients of the code block is completed;
step 3.4.2.2: and (3) counting the sum of codeegroup _ count _ thread in all threads in the thread concurrent scheduling unit by using a __ shfl _ up _ sync method in the thread concurrent scheduling unit, and multiplying the final sum by 4 to obtain the total number of effective coefficients in the block group.
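The warp-wide summation of step 3.4.2.2 can be emulated on the CPU. __shfl_up_sync lets lane i read a register from lane i − delta; repeated with doubling deltas it produces an inclusive prefix sum over the 32 lanes (a Hillis–Steele scan), whose last lane holds the total. The sketch below models the lanes as a plain list.

```python
WARP = 32

def warp_inclusive_scan(vals):
    """Emulates the log2(32) = 5 shuffle-up rounds of a warp scan.
    vals holds one value per lane."""
    assert len(vals) == WARP
    lanes = list(vals)
    delta = 1
    while delta < WARP:
        # Each lane adds the value held by the lane `delta` positions
        # below it; lanes below `delta` receive nothing (shuffle-up
        # leaves them unchanged).
        lanes = [lanes[i] + (lanes[i - delta] if i >= delta else 0)
                 for i in range(WARP)]
        delta *= 2
    return lanes

def count_valid_coeffs(codegroup_count_thread):
    """Step 3.4.2.2: warp-wide sum of the per-thread counts, times 4
    (each code group hole coefficient covers 4 coefficients)."""
    return warp_inclusive_scan(codegroup_count_thread)[-1] * 4
```

The same scan primitive also underlies steps 3.4.3.5, 3.4.3.8 and 3.4.4.2, which need per-thread offsets (the scan's intermediate lanes) rather than just the total.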
Further, in step 3.4.3, the following sub-steps are included:
step 3.4.3.1: each thread in the unit reads 4 consecutive bytes of the block group code stream as prefix_data_read, starting from the position given by the block group's decoded-bit count plus threadIdx.x times 32; the read position must not exceed the total size of the current block group code stream;
step 3.4.3.2: each thread in the unit parses the number of consecutive 0 bits from the low end of prefix_data_read, recorded as bit0_left; these bits do not participate in the current thread's decoding and are counted by the next thread;
step 3.4.3.3: the thread whose threadIdx.x variable equals 0 left-shifts away the high bits of prefix_data_read that have already participated in decoding, filling 0 from the low end; the other threads do nothing, since the bits they read are all valid;
step 3.4.3.4: each thread in the unit counts the number of 1 bits in prefix_data_read, recorded as coef_count_thread, i.e. the number of valid coefficients the current thread may decode;
step 3.4.3.5: each thread in the unit uses the __shfl_up_sync method to sum coef_count_thread over all threads before the current thread, recorded as coef_offset_thread, and updates its own coef_count_thread count;
step 3.4.3.6: the __shfl_up_sync method is used within the unit to pass each thread's bit0_left on to the next thread, and bit0_left of the thread with threadIdx.x equal to 0 is set to 0;
step 3.4.3.7: each thread in the unit independently loops with coef_index_thread as the counter to complete the decoding of prefix_data_read. Each loop iteration proceeds as follows: first, the number of consecutive 0 bits in prefix_data_read from high to low is counted as bit0_length, and bit0_length + bit0_left is recorded into the shared memory prefix_decode_data at index coef_index_thread + coef_offset_thread + coef_decode_count; then bit0_length + 1 + bit0_left is added to the number of valid bits read by the current thread, recorded as bits_read_thread; finally bit0_left is reset to 0 and the high bit0_length + 1 bits of prefix_data_read are discarded by a left shift;
step 3.4.3.8: the __shfl_up_sync method is used within the unit to sum bits_read_thread over all threads in the unit; the thread with threadIdx.x equal to 31 updates the result into the block group's decoded-bit count, and the current thread's coef_offset_thread + coef_count_thread is updated into coef_decode_count;
step 3.4.3.9: the unit synchronizes and checks whether coef_decode_count is greater than or equal to the total number of valid coded coefficients in the block group; if so, all threads of the unit have finished decoding the block group's prefix coefficients and the loop exits; otherwise steps 3.4.3.1 to 3.4.3.8 are repeated.
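Functionally, steps 3.4.3.1–3.4.3.9 decode a chain of unary codes: each run of consecutive 0 bits terminated by a 1 is the prefix coefficient of one valid coefficient. A sequential reference implementation makes that contract explicit; the GPU version merely splits the bitstream into 32-bit chunks, one per thread, and stitches runs that cross chunk borders via bit0_left, producing the same values.

```python
def decode_prefix_coeffs(bits, n_coeffs):
    """Sequential reference for step 3.4.3.
    bits: '0'/'1' string; a 1 bit marks a valid coefficient, and the
    number of consecutive 0 bits before it is its prefix coefficient.
    Stops once n_coeffs prefixes (the block group's valid-coefficient
    total from step 3.4.2) have been decoded."""
    prefixes, run = [], 0
    for b in bits:
        if b == '0':
            run += 1
        else:
            prefixes.append(run)
            run = 0
        if len(prefixes) == n_coeffs:
            break
    return prefixes
```

The per-thread chunking only changes who counts which zeros, not the decoded runs — which is why the parallel version can hand leftover low-end zeros to the next thread and still match this reference.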
Further, in step 3.4.4, the following sub-steps are included:
step 3.4.4.1: each thread in the unit, taking its threadIdx.x variable as the loop start, reads prefix_decode_data with a stride of 32 to obtain a prefix coefficient, and derives the bit count of the suffix coefficient corresponding to that prefix;
step 3.4.4.2: the __shfl_up_sync method is used within the unit to sum the suffix coefficient bit counts of all threads before the current thread, yielding the current thread's suffix coefficient bit offset;
step 3.4.4.3: each thread in the unit reads from the block group code stream the consecutive bits of its suffix coefficient, starting at the position given by the block group's decoded-bit count plus the current thread's suffix coefficient bit offset, obtaining the suffix coefficient;
step 3.4.4.4: each thread in the unit computes the decoded coefficient from its prefix and suffix coefficients, performs inverse quantization according to the inverse quantization parameters, and stores the result in prefix_decode_data at the position the prefix coefficient was read from;
step 3.4.4.5: the unit synchronizes, and the thread with threadIdx.x equal to 31 updates its suffix coefficient bit offset plus its own suffix coefficient bit count into the block group's decoded-bit count.
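A sequential sketch of step 3.4.4 follows. The patent only says the suffix bit count is derived from the prefix, so an Exp-Golomb-style rule is assumed here purely for illustration: a prefix p is followed by p suffix bits, the quantized level is (1 << p) − 1 + suffix, and inverse quantization multiplies by the slice quantization step. On the GPU, the per-thread suffix bit offsets come from the __shfl_up_sync prefix sum of step 3.4.4.2 instead of the running `pos` below.

```python
def decode_suffix_and_dequant(bits, prefixes, qstep):
    """Sequential sketch of suffix decoding + inverse quantization.
    bits: '0'/'1' string of the suffix portion of the code stream;
    prefixes: decoded prefix coefficients (from step 3.4.3);
    qstep: slice inverse-quantization step (assumed uniform here).
    Returns (dequantized coefficients, bits consumed)."""
    pos, coeffs = 0, []
    for p in prefixes:
        # Assumption: p suffix bits per prefix p (Exp-Golomb-like).
        suffix = int(bits[pos:pos + p], 2) if p else 0
        pos += p
        level = (1 << p) - 1 + suffix
        coeffs.append(level * qstep)
    return coeffs, pos   # pos is added to the decoded-bit count (3.4.4.5)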
Further, in step 3.4.5, the following sub-steps are included:
step 3.4.5.1: whether the current block is valid is judged from the block_mode obtained in step 3.4.1; if valid, the following steps are executed, otherwise the next block is processed;
step 3.4.5.2: the mapping between relative and absolute positions within the block is computed from the codegroup_mode obtained in step 3.4.1 and recorded in shared memory, and the number of valid coefficients in the current block is counted as block_coef_valid; when the current block's codegroup_mode is accessed by index coef_index_in_block, the relative position is the sum of the counts of all valid coefficients before the position corresponding to coef_index_in_block, and the absolute position is coef_block_index;
step 3.4.5.3: each thread in the unit takes coef_index = threadIdx.x as the loop start and completes the mapping from relative position to absolute position within the block with a stride of 32;
step 3.4.5.4: the unit synchronizes, and the thread with threadIdx.x equal to 0 updates block_coef_valid into coef_out_count;
step 3.4.5.5: each thread in the unit takes coef4_index = threadIdx.x as the loop start and, with a stride of 32, reads the 4 consecutive components at [coef4_index*4, coef4_index*4+3] in output_decode_data and writes them back to the same positions after an inverse Hadamard transform;
step 3.4.5.6: each thread in the unit reads data from the corresponding position in output_decode_data and writes it to global memory according to the correspondence between the current block and the image.
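The transform of step 3.4.5.5 can be written out directly. A 4-point Hadamard matrix is its own inverse up to a factor of 4, so synthesis is the same ± pattern followed by a division; the division is exact here because the forward transform produces sums of four integers.

```python
def inverse_hadamard4(c):
    """Step 3.4.5.5: 4-point inverse Hadamard on one group of four
    components. The Hadamard matrix is self-inverse up to a factor
    of 4, so the inverse applies the same sign pattern and divides."""
    a, b, d, e = c
    return [(a + b + d + e) // 4,
            (a - b + d - e) // 4,
            (a + b - d - e) // 4,
            (a - b - d + e) // 4]
```

On the GPU each thread applies this transform to one group of four consecutive components in output_decode_data per loop iteration, with a stride of 32 groups, before the final write-back to global memory.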
Further, in step 3.4.1.4, the tree structure decoding includes the following sub-steps:
step a: among the first 16 threads of the unit, each thread whose threadIdx.x variable is divisible by 4 reads one bit at the position given by threadIdx.x/4 plus the block group's decoded-bit count, and the thread with threadIdx.x equal to 0 updates the block group's decoded-bit count;
step b: thread communication via __shfl_up_sync within the unit shares the bit value read in step a among each group of 4 threads, recorded as bit_level0;
step c: thread communication within the unit sums bit_level0 over all threads before the current thread in step b, recorded as bit_level0_offset;
step d: if the bit_level0 value obtained after step b is 1, one bit is read at the position given by bit_level0_offset plus the block group's decoded-bit count, recorded as bit_level1; the thread with threadIdx.x equal to 15 updates bit_level0_offset + bit_level0 into the block group's decoded-bit count;
step e: thread communication within the unit sums bit_level1 over all threads before the current thread, recorded as bit_level1_offset, and bit_level1_offset is multiplied by 4;
step f: for the first 16 threads of the unit, if both bit_level0 and bit_level1 are 1, 4 consecutive bits of the block group code stream are read starting at bit_level1_offset plus the block group's decoded-bit count, the result is written from high to low into the positions threadIdx.x*4, threadIdx.x*4+1, threadIdx.x*4+2 and threadIdx.x*4+3 of the current block's code group hole coefficients, and bit_level1_offset is increased by 4; otherwise the corresponding positions in the current block's codegroup_mode are set to 0; finally the thread with threadIdx.x equal to 15 updates bit_level1_offset into the block group's decoded-bit count.
Further, in step 3.4.5.3, the mapping comprises the following sub-step: prefix_decode_data is read in the order coef_index + coef_out_count, and the read result is stored at its absolute position in another block of shared memory, recorded as output_decode_data.
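The relative-to-absolute mapping of steps 3.4.5.2–3.4.5.3 is a scatter of densely packed decoded values into their sparse block positions. The sketch below simplifies to one validity flag per coefficient position (in the patent each codegroup_mode flag covers a group of four coefficients, so the real index arithmetic is coarser); names are illustrative.

```python
def scatter_coeffs(valid_flags, prefix_decode_data):
    """valid_flags: one 0/1 flag per absolute coefficient position in
    the block (derived from codegroup_mode); prefix_decode_data holds
    the decoded values densely, in relative order. Returns the block
    with each value at its absolute position and zeros elsewhere."""
    out, rel = [0] * len(valid_flags), 0
    for abs_pos, valid in enumerate(valid_flags):
        if valid:
            out[abs_pos] = prefix_decode_data[rel]  # relative -> absolute
            rel += 1
    return out
```

The GPU version computes `rel` for each position as a prefix sum over the validity flags, so all 32 threads can scatter concurrently instead of walking the block sequentially as this sketch does.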
The beneficial effects of the invention include:
the invention provides a novel wavelet transform-based CPU + GPU layered decoding scheme, which can overcome the defects of the scheme in the background, not only can be compatible with a production system used by a television station at present, but also can effectively reduce the pressure of storage bandwidth during multi-resolution layered decoding, and simultaneously can reduce the bandwidth pressure of PCI-E by using a CUDA and graphic image API interaction mode during rendering, thereby realizing multilayer real-time editing with higher resolution.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart illustrating decoding of only a base layer;
FIG. 2 is a flow chart illustrating decoding of a base layer and an enhancement layer;
fig. 3 is a schematic diagram of a decoding process of the enhancement layer.
Detailed Description
All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
In recent years GPUs have developed rapidly. The inventors of the present invention found that CPU + GPU heterogeneous parallel computing can be fully exploited to distribute different computing demands: tasks that are highly parallel and heavily repetitive are assigned to the GPU, while tasks with low parallelism and high computational complexity are assigned to the CPU, so that hardware resources are used reasonably and fully for higher production efficiency.
General-purpose computing frameworks represented by Nvidia's CUDA are easy to adopt and have a good ecosystem; they can interoperate with graphics APIs such as DirectX and OpenGL to reduce data exchange between the CPU and GPU, and they are a suitable parallel computing framework for codecs. The usual decoding and rendering framework for 4K professional formats in the broadcasting industry is CPU decoding followed by uploading the decoded image data over PCI-E for rendering on the GPU; but when the resolution reaches 8K or even 16K, PCI-E bandwidth can no longer meet the demands of real-time multi-layer rendering.
Example 1
The embodiments of the present invention include the following. If only the base layer is decoded, as shown in fig. 1, i.e. when the editing format does not require enhancement layer decoding, file reading only needs to read the base layer code stream; the CPU is then called directly to decode the base layer, which is exactly as efficient as the base layer's original format and is not described further. If both the base layer and the enhancement layer need to be decoded, as shown in fig. 2, an embodiment of the present invention comprises the following steps:
step S1: concurrent scheduling allows the CPU decoding base layer and the GPU decoding enhancement layer to be performed simultaneously; the enhancement layer and base layer decoded data for the same frame are synchronized. The enhancement layer decodes to obtain data of 4 frequency bands, the first frequency band located at the top left corner of the data is denoted as LL, the second frequency band located at the top right corner is denoted as LH, the third frequency band located at the bottom left corner is denoted as HL, and the last frequency band located at the bottom right corner is denoted as HH. The base layer decodes to obtain base layer video data.
Step S2: and uploading the base layer video data to the GPU through a CUDA API.
And step S3: and performing addition operation on the obtained GPU base layer data and LLl of the GPU enhancement layer data by using a GPU to obtain an enhancement decoding picture of the base layer, and recording the enhancement decoding picture as LL.
And step S4: and performing one-time inverse wavelet transform (such as H1V1 inverse wavelet transform) on the obtained LL, LH, HL and HH by using a GPU to obtain a final decoding picture.
Step S5: different processing modes are selected according to the input decoding output parameters. If the texture object is a DirectX or OpenGL two-dimensional texture object, using a CUDA corresponding interaction interface to complete data communication; if the buffer is the CUDA buffer, directly copying data by using a CUDA API; and if the CPU buffer is the CPU buffer, using a CUDA API to carry out data downlink.
Example 2
In a specific implementation, on the basis of embodiment 1, the process of decoding the enhancement layer on the GPU, as shown in fig. 3, comprises the following sub-steps:
Step 3.1: the CPU parses the structure of the enhancement layer code stream and completes frame header parsing, obtaining the parameters necessary for decoding, including width and height information, the slice height, the width and height of the blocks in a slice, the number of blocks contained in each block group in a slice, the quantization table, etc.
Step 3.2: the CPU parses the code stream to obtain the byte start position of each independently decodable block group and the quantization coefficient of each slice.
Step 3.3: the code stream data, block group byte start positions, slice quantization coefficients and other data are uploaded to the GPU.
Step 3.4: CUDA concurrent scheduling starts multiple thread concurrent scheduling units; each unit completes the full decoding of one block group, and the units cooperate to decode the whole frame of data.
Step 3.5: the CUDA API is called to complete device synchronization, waiting for the GPU to finish decoding all block groups, finally obtaining the wavelet coefficients in CUDA video memory.
Example 3
On the basis of embodiment 2, in step 3.4, it is assumed that all threads in one thread concurrent scheduling unit are one-dimensionally distributed, and the range of the thread ID variable threadidx.x built in that one thread concurrent scheduling unit is [0,31], that is, each thread concurrent scheduling unit has 32 threads, and the method includes the following sub-steps:
step 3.4.1: and (3) decoding the hole coefficient: the cavity coefficient is divided into two levels, one is at a block level, and the block-level cavity coefficient represents that a certain block in the block group is completely effective, completely ineffective or partially effective; the other is the intra block coding group level, which indicates whether a block in the block is valid; in the whole cavity coefficient decoding process, the decoding results of the two-stage cavity coefficients are respectively stored in shared storage and are respectively recorded as block _ mode and codegorup _ mode, and the decoding results are effective in the whole subsequent decoding process; the decoding of the hole coefficient in the whole block group is carried out according to the following steps in the order of the blocks in the block group until all the blocks in the block finish the decoding of the hole coefficient, and after the decoding of the hole coefficient of all the block groups is finished, the number of the block group decoded read bits is aligned according to bytes so as to skip the filling bit;
step 3.4.2: counting the number of effective coefficients in the block group;
step 3.4.3: prefix coefficient decoding: the prefix coefficient decoding process parses the code stream bit by bit; each bit whose value is 1 in the code stream represents one effective coefficient, and the number of consecutive 0 bits before that bit is its prefix coefficient; in a parallel decoding mode, each thread reads a fixed-length number of bits, and the decoding of these bits is completed through thread communication within the thread concurrent scheduling unit until the counted number of decoded effective coefficients reaches the total number of effective coding coefficients in the block group; after the prefix coefficient decoding of the block group is completed, the number of bits already read by block group decoding is aligned to a byte boundary so as to skip the padding bits;
step 3.4.4: suffix coefficient decoding and inverse quantization: the suffix coefficients are decoded in a manner that each thread in the thread concurrent scheduling unit decodes one coefficient; each thread in the thread concurrent scheduling unit independently takes the threadIdx.x variable (a built-in variable of the CUDA thread concurrent scheduling unit) as its initial loop state, and all effective coefficients are decoded in a loop with a stride of 32;
step 3.4.5: inverse quantization coefficient post-processing: an inverse Hadamard transform is performed on every group of four components within the block, and the data is written back from the shared memory to the global memory.
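Several of the sub-steps above (and steps 3.4.2.2, 3.4.3.5 and 3.4.4.2 below) use `__shfl_up_sync` to turn per-thread counts into per-thread offsets within a 32-thread scheduling unit. The following is a minimal serial CPU sketch of that pattern; the function name is hypothetical, and the real implementation runs as warp shuffles on the GPU:

```python
def warp_exclusive_scan(values):
    """Emulate the __shfl_up_sync-based scan over a 32-lane unit:
    each lane receives the sum of the values of all lanes before it,
    and the unit-wide total is also returned."""
    assert len(values) == 32
    offsets, running = [], 0
    for v in values:
        offsets.append(running)   # exclusive prefix sum for this lane
        running += v
    return offsets, running       # per-lane offsets, unit-wide total

# e.g. per-thread effective-coefficient counts -> per-thread write offsets
counts = [3, 0, 5, 1] + [0] * 28
offsets, total = warp_exclusive_scan(counts)   # offsets[3] == 8, total == 9
```

The same scan also serves for bit-offset accumulation, with `values` holding per-thread consumed-bit counts instead of coefficient counts.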
Example 4
On the basis of example 2, in step 3.4.1, the following substeps are included:
step 3.4.1.1: the thread whose threadIdx.x variable equals 0 in the thread concurrent scheduling unit reads one bit and updates block_mode and the number of bits already read by block group decoding; after the thread concurrent scheduling unit synchronizes, the block_mode state is judged; if the block is ineffective, all codegroup hole coefficients corresponding to the current block are set to 0 and step 3.4.1.1 is executed again; otherwise, the next step is executed;
step 3.4.1.2: the thread whose threadIdx.x variable equals 0 in the thread concurrent scheduling unit reads one bit and updates block_mode and the number of bits already read by block group decoding; after the thread concurrent scheduling unit synchronizes, the block_mode state is judged; if the block is completely effective, all codegroup hole coefficients corresponding to the current block are set to 1 and step 3.4.1.1 is executed again; otherwise, the next step is executed;
step 3.4.1.3: the thread whose threadIdx.x variable equals 0 in the thread concurrent scheduling unit reads one bit and updates block_mode and the number of bits already read by block group decoding; after the thread concurrent scheduling unit synchronizes, the block_mode state is judged; if it indicates that the block hole coefficients have a tree structure, step 3.4.1.4 is executed, otherwise step 3.4.1.5 is executed;
step 3.4.1.4: this step indicates that the hole coefficients of the current block are organized as a multi-level tree structure; the tree structure is decoded, and after the tree structure decoding is completed, step 3.4.1.1 is executed again;
step 3.4.1.5: this step indicates that each codegroup of the current block corresponds to one hole coefficient; each thread in the thread concurrent scheduling unit decodes one bit in parallel until all hole coefficients are decoded; the thread whose threadIdx.x equals 31 updates the number of bits already read by block group decoding, and finally step 3.4.1.1 is executed again.
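The three flag reads in steps 3.4.1.1 to 3.4.1.3 form a small per-block state machine executed by the threadIdx.x == 0 thread. A serial Python sketch of it follows; the bit polarities are an assumption, since the source fixes only the order of the tests:

```python
def decode_block_mode(bits, pos):
    """Classify one block from up to three flag bits (steps
    3.4.1.1-3.4.1.3). Returns (mode, new_bit_position). The meaning
    of a 1 vs 0 at each test is assumed, not taken from the source."""
    if bits[pos] == 1:                 # step 3.4.1.1: block ineffective
        return "invalid", pos + 1      # -> all hole coefficients set to 0
    if bits[pos + 1] == 1:             # step 3.4.1.2: completely effective
        return "all_valid", pos + 2    # -> all hole coefficients set to 1
    if bits[pos + 2] == 1:             # step 3.4.1.3: tree-structured
        return "tree", pos + 3         # -> go to step 3.4.1.4
    return "per_codegroup", pos + 3    # -> go to step 3.4.1.5
```

In the kernel, the returned bit position corresponds to the shared "number of bits already read by block group decoding".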
Example 5
On the basis of embodiment 3, in step 3.4.2, the following substeps are included:
step 3.4.2.1: each thread in the thread concurrent scheduling unit takes its threadIdx.x variable as the initial state of a statistics loop and accesses codegroup_mode with a stride of 32, accumulating each access result to obtain codegroup_count_thread (the number of nonzero codegroup hole coefficients counted in each thread) until all codegroup hole coefficients have been counted;
step 3.4.2.2: __shfl_up_sync (a method for thread communication within a CUDA thread concurrent scheduling unit) is used within the thread concurrent scheduling unit to sum codegroup_count_thread over all threads in the unit, and the final sum is multiplied by 4 to obtain the total number of effective coefficients in the block group.
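Steps 3.4.2.1 and 3.4.2.2 can be sketched serially as follows: each simulated thread strides over the codegroup flags with span 32, and the per-thread counts are then reduced and multiplied by 4, one codegroup covering four coefficients.

```python
def count_effective_coefficients(codegroup_mode, num_threads=32):
    """Serial sketch of steps 3.4.2.1-3.4.2.2: stride-32 per-thread
    counting of nonzero codegroup hole flags, followed by the
    __shfl_up_sync reduction and the x4 expansion to coefficients."""
    per_thread = [0] * num_threads
    for tid in range(num_threads):                     # simulated threadIdx.x
        for i in range(tid, len(codegroup_mode), num_threads):
            if codegroup_mode[i] != 0:
                per_thread[tid] += 1                   # codegroup_count_thread
    return sum(per_thread) * 4                         # 4 coefficients per codegroup
```

On the GPU the outer loop is the 32 concurrent threads and the final `sum` is the shuffle-based reduction; the serial form is only for checking the arithmetic.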
Example 6
On the basis of embodiment 3, in step 3.4.3, the following substeps are included:
step 3.4.3.1: each thread in the thread concurrent scheduling unit takes the threadIdx.x variable plus the number of bits already read by block group decoding divided by 32 as its initial position and reads 4 consecutive bytes of the block group code stream as prefix_data_read; the reading position must not exceed the total size of the current block group code stream;
step 3.4.3.2: each thread in the thread concurrent scheduling unit parses the number of consecutive 0 bits from the low end of prefix_data_read and records it as bit0_left; these bits do not participate in the decoding of the current thread and are counted by the next thread;
step 3.4.3.3: the thread whose threadIdx.x variable equals 0 in the thread concurrent scheduling unit shifts out and discards the high bits of prefix_data_read that have already participated in decoding, padding with 0 from the low end; the other threads need no processing because all the bits they read are effective;
step 3.4.3.4: each thread in the thread concurrent scheduling unit parses the number of 1 bits in prefix_data_read and records it as coef_count_thread, that is, the number of effective coefficients that may be decoded in the current thread;
step 3.4.3.5: each thread in the thread concurrent scheduling unit uses the __shfl_up_sync method to compute the cumulative sum of coef_count_thread over all threads before the current thread, recorded as coef_offset_thread, and at the same time updates the coef_count_thread of the current thread;
step 3.4.3.6: the __shfl_up_sync method is used within the thread concurrent scheduling unit to pass bit0_left to the next adjacent thread, and the bit0_left of the thread whose threadIdx.x equals 0 is set to 0;
step 3.4.3.7: each thread in the thread concurrent scheduling unit independently executes a loop with coef_index_thread as its loop counter to complete the decoding of prefix_data_read; each iteration of the loop proceeds as follows: first, the number of consecutive 0 bits in prefix_data_read, counted from the high end, is recorded as bit0_length, and bit0_length + bit0_left is written into the shared memory prefix_decode_data at the position coef_index_thread + coef_offset_thread + coef_decode_count; then, bit0_length + 1 + bit0_left is accumulated into the number of effective bits read by the current thread, recorded as bits_read_thread; finally, bit0_left is reset to 0 and the high bit0_length + 1 bits of prefix_data_read are discarded by a bitwise left shift;
step 3.4.3.8: the __shfl_up_sync method is used within the thread concurrent scheduling unit to sum bits_read_thread over all threads in the unit; the thread whose threadIdx.x equals 31 updates the statistical result into the number of bits already read by block group decoding, and coef_offset_thread + coef_count_thread of the current thread is updated into coef_decode_count;
step 3.4.3.9: the thread concurrent scheduling unit synchronizes and judges whether coef_decode_count is greater than or equal to the total number of effective coding coefficients in the block group; if so, all threads of the thread concurrent scheduling unit have finished decoding the prefix coefficients of the block group and the loop is exited; otherwise, the process from step 3.4.3.1 to step 3.4.3.8 is repeated.
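The bit-level rule of step 3.4.3 (each 1 bit marks an effective coefficient whose prefix coefficient is the length of the zero-run immediately before it) has a simple serial reference, against which the 32-thread parallel version above can be checked:

```python
def decode_prefix_coefficients(bits, total_effective):
    """Serial reference for step 3.4.3: scan the code stream bit by
    bit; a 1 bit emits one prefix coefficient equal to the number of
    consecutive 0 bits seen since the previous 1 bit."""
    prefixes, run, pos = [], 0, 0
    while len(prefixes) < total_effective and pos < len(bits):
        if bits[pos] == 0:
            run += 1                 # extend the current zero-run
        else:
            prefixes.append(run)     # a 1 bit terminates one prefix
            run = 0
        pos += 1
    return prefixes, pos             # decoded prefixes, bits consumed
```

The parallel kernel partitions `bits` into 32-bit reads per thread and stitches the zero-runs across thread boundaries with `bit0_left`; the result must match this serial scan.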
Example 7
On the basis of example 2, in step 3.4.4, the following substeps are included:
step 3.4.4.1: each thread in the thread concurrent scheduling unit, taking the threadIdx.x variable as its initial loop state, cyclically reads prefix_decode_data with a stride of 32 to obtain a prefix coefficient, and parses the number of bits of the suffix coefficient corresponding to that prefix coefficient;
step 3.4.4.2: the __shfl_up_sync method is used within the thread concurrent scheduling unit to compute the cumulative sum of the suffix coefficient bit counts over all threads before the current thread, giving the suffix coefficient bit offset of the current thread;
step 3.4.4.3: each thread in the thread concurrent scheduling unit takes the number of bits already read by block group decoding plus the suffix coefficient bit offset of the current thread as its initial position, and reads the consecutive suffix coefficient bits from the block group code stream to obtain the suffix coefficient;
step 3.4.4.4: each thread in the thread concurrent scheduling unit computes the decoded coefficient corresponding to its prefix coefficient and suffix coefficient, performs inverse quantization according to the inverse quantization parameters, and stores the inverse quantization result into prefix_decode_data at the position from which the prefix coefficient was read;
step 3.4.4.5: the thread concurrent scheduling unit synchronizes, and the thread whose threadIdx.x equals 31 adds the suffix coefficient bit offset and the suffix coefficient bit count into the number of bits already read by block group decoding.
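Steps 3.4.4.1 to 3.4.4.4 combine each prefix coefficient with freshly read suffix bits and dequantize the result. The exact prefix-to-suffix-length mapping and reconstruction formula are codec details the text does not give, so the sketch below assumes an exp-Golomb-style rule (suffix length equals the prefix value) and a scalar quantization step; treat both as placeholders:

```python
def decode_suffix_and_dequantize(bits, pos, prefix, qstep):
    """Illustrative sketch of steps 3.4.4.1-3.4.4.4. ASSUMED rule:
    suffix length == prefix, value = (1 << prefix) - 1 + suffix,
    dequantized by a scalar step. The real mapping is codec-specific."""
    suffix = 0
    for i in range(prefix):              # read `prefix` suffix bits, MSB first
        suffix = (suffix << 1) | bits[pos + i]
    value = (1 << prefix) - 1 + suffix if prefix > 0 else 0
    return value * qstep, pos + prefix   # dequantized coefficient, new position
```

The warp version computes each thread's `pos` from the shared bit counter plus the shuffle-derived suffix bit offset, so all 32 suffix reads proceed independently.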
Example 8
On the basis of example 2, in step 3.4.5, the following substeps are included:
step 3.4.5.1: whether the current block is effective is judged according to the block_mode obtained in step 3.4.1; if so, the following steps are executed, otherwise the next block is processed;
step 3.4.5.2: the mapping relation between relative positions and absolute positions within the block is calculated according to the codegroup_mode obtained in step 3.4.1 and recorded into shared memory, and the number of effective coefficients in the current block is counted and recorded as block_coef_valid; when the codegroup_mode corresponding to the current block is accessed with the access index coef_index_in_block, the relative position is the sum of the numbers of all effective coefficients before the position corresponding to coef_index_in_block, and the absolute position is coef_block_index;
step 3.4.5.3: each thread in the thread concurrent scheduling unit takes coef_index = threadIdx.x as the initial state of a loop and completes the mapping from relative positions within the block to absolute positions of the block with a stride of 32;
step 3.4.5.4: the thread concurrent scheduling unit synchronizes, and the thread whose threadIdx.x equals 0 updates block_coef_valid into coef_out_count;
step 3.4.5.5: each thread in the thread concurrent scheduling unit takes coef4_index = threadIdx.x as the initial state of a loop, reads the 4 consecutive components corresponding to [coef4_index*4, coef4_index*4+3] in output_decode_data with a stride of 32, and writes them back to the same position after the inverse Hadamard transform;
step 3.4.5.6: each thread in the thread concurrent scheduling unit reads data from the corresponding position in output_decode_data and writes it into the global memory according to the correspondence between the current block and the image.
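Step 3.4.5.5 applies an inverse Hadamard transform to each group of four components. A serial sketch using the standard 4-point Hadamard butterfly follows; the 1/4 scaling convention placed on the inverse side is an assumption:

```python
def inverse_hadamard4(c):
    """Sketch of step 3.4.5.5: 4-point inverse Hadamard transform on
    one group of four components. H4 is symmetric with H4*H4 = 4*I,
    so the inverse is the same butterfly divided by 4 (assumed here)."""
    a, b, cc, d = c
    return [
        (a + b + cc + d) // 4,
        (a - b + cc - d) // 4,
        (a + b - cc - d) // 4,
        (a - b - cc + d) // 4,
    ]
```

Because the transform is only 4 points, each thread can process one whole group in registers, which is what the stride-32 loop over `coef4_index` achieves.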
Example 9
On the basis of embodiment 4, in step 3.4.1.4, the tree structure decoding includes the following sub-steps:
step a: among the first 16 threads in the thread concurrent scheduling unit, the threads whose (threadIdx.x & 3) equals 0 each read one bit at the position given by the threadIdx.x variable / 4 plus the number of bits already read by block group decoding, and the thread whose threadIdx.x equals 0 updates the number of bits already read by block group decoding;
step b: __shfl_up_sync is used within the thread concurrent scheduling unit to complete thread communication, and the bit value read in step a is shared among every 4 threads, recorded as bit_level0;
step c: thread communication within the thread concurrent scheduling unit is used to count the cumulative sum of bit_level0 over all threads before the current thread in step b, recorded as bit_level0_offset;
step d: if the bit_level0 value obtained after step b is 1, one bit is read at the position given by bit_level0_offset plus the number of bits already read by block group decoding, recorded as bit_level1; the thread whose threadIdx.x equals 15 updates the value of bit_level0_offset + bit_level0 into the number of bits already read by block group decoding;
step e: thread communication within the thread concurrent scheduling unit counts the cumulative sum of bit_level1 over all threads before the current thread, recorded as bit_level1_offset, and bit_level1_offset is multiplied by 4;
step f: for the first 16 threads in the thread concurrent scheduling unit, if bit_level0 and bit_level1 are both 1, bit_level1_offset plus the number of bits already read by block group decoding is taken as the initial position, 4 consecutive bits of the block group code stream are read, the read result is updated, in order from high to low, into the positions corresponding to threadIdx.x*4, threadIdx.x*4+1, threadIdx.x*4+2 and threadIdx.x*4+3 in the codegroup hole coefficients of the current block, and bit_level1_offset is increased by 4; otherwise, the corresponding positions in the codegroup_mode of the current block are set to 0; finally, the thread whose threadIdx.x equals 15 updates bit_level1_offset into the number of bits already read by block group decoding.
Example 10
On the basis of embodiment 8, in step 3.4.5.3, the mapping is performed as follows: prefix_decode_data is read in the order of coef_index + coef_out_count, and the read result is stored into another block of shared memory at the absolute position, recorded as output_decode_data.
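The relative-to-absolute mapping described above (steps 3.4.5.2 and 3.4.5.3) amounts to listing the absolute indices of the nonzero hole flags, so that densely stored decoded coefficients can be scattered back into place. A serial sketch at single-flag granularity follows; the real code works per codegroup of four coefficients:

```python
def build_position_map(codegroup_mode):
    """mapping[relative] = absolute: the i-th effective coefficient in
    decode order lands at the index of the i-th nonzero hole flag, i.e.
    the relative position is the count of effective flags before it."""
    mapping = []
    for absolute, flag in enumerate(codegroup_mode):
        if flag:
            mapping.append(absolute)
    return mapping
```

Scattering is then `output_decode_data[mapping[i]] = prefix_decode_data[i]` for each relative index `i`, which the unit performs with a stride-32 loop over `coef_index`.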
According to an aspect of the application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described above.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
The parts not described in detail in the present invention are the same as the prior art or can be implemented using the prior art.
The above-described embodiments are only some embodiments of the present invention. It will be apparent to those skilled in the art that various modifications and variations can easily be made based on the application and principle of the invention disclosed herein, and the invention is not limited to the methods described in the above embodiments; therefore, the above embodiments are only preferred and not restrictive.
Other embodiments than the above examples may be devised by those skilled in the art based on the foregoing disclosure, or by adapting knowledge or techniques of the relevant art, and features of the various embodiments may be interchanged or substituted; such modifications and variations made without departing from the spirit and scope of the present invention are intended to fall within the scope of the appended claims.

Claims (10)

1. An ultra-high-definition video layered decoding method based on processor heterogeneous parallel computing is characterized by comprising the following steps of:
s1, concurrent scheduling enables a CPU decoding basic layer and a GPU decoding enhancement layer to be simultaneously carried out, and the enhancement layer and the basic layer decoding data of the same frame are synchronously obtained; the enhancement layer decodes and gets the data of four frequency bands, the first frequency band locates at the upper left corner of the data and is marked as LL, the second frequency band locates at the upper right corner and is marked as LH, the third frequency band locates at the lower left corner and is marked as HL, the last frequency band locates at the lower right corner and is marked as HH; decoding the base layer to obtain base layer video data;
s2, uploading the base layer video data to a GPU through a CUDA API;
s3, performing addition operation on the obtained LL of the GPU basic layer data and the GPU enhancement layer data by using a GPU to obtain an enhanced decoding picture of the basic layer and recording the enhanced decoding picture as LL;
s4, performing inverse wavelet transform on the obtained LL, LH, HL and HH by using a GPU to obtain a final decoding picture;
s5, selecting different processing modes according to the input decoding output parameters: if the texture object is a DirectX or OpenGL two-dimensional texture object, using a CUDA and an interactive interface corresponding to the CUDA to complete data communication; if the buffer is CUDA buffer, directly copying data by using a CUDA API; and if the CPU buffer is the CPU buffer, using the CUDA API to carry out data downlink.
2. The ultra high definition video layered decoding method based on processor heterogeneous parallel computing according to claim 1, wherein in step S1, the GPU decoding the enhancement layer comprises the sub-steps of:
step 3.1: analyzing the structure of the enhancement layer code stream by using a CPU, and completing frame header parsing of the enhancement layer code stream to obtain the parameters necessary for decoding, including the width and height information, the slice height, the width and height of each block in a slice, the number of blocks contained in each block group in a slice, and the quantization table;
step 3.2: analyzing the code stream by using the CPU to obtain the byte start position corresponding to each independently decodable block group and the quantization coefficients of each slice;
step 3.3: uploading the code stream data, the block group byte start positions and the slice quantization coefficient data to the GPU;
step 3.4: the CUDA concurrent scheduling starts a plurality of thread concurrent scheduling units, each thread concurrent scheduling unit completes the complete decoding of one block group, and the thread concurrent scheduling units cooperate to complete the decoding of the whole frame data;
step 3.5: and calling a CUDA API to complete equipment synchronization, waiting for the GPU to complete decoding of all the block groups, and finally obtaining the wavelet coefficients positioned in the CUDA video memory.
3. The ultra high definition video layered decoding method based on processor heterogeneous parallel computing according to claim 2, wherein in step 3.4, it is assumed that all threads in one thread concurrent scheduling unit are one-dimensionally distributed, and the range of the built-in thread ID variable threadIdx.x in that thread concurrent scheduling unit is [0,31], that is, each thread concurrent scheduling unit has 32 threads; the method includes the following sub-steps:
step 3.4.1: hole coefficient decoding: the hole coefficients are divided into two levels: one is the block level, where a block-level hole coefficient indicates that a certain block in the block group is completely effective, completely ineffective or partially effective; the other is the intra-block coding group (codegroup) level, which indicates whether a codegroup in the block is effective; during hole coefficient decoding, the decoding results of the two levels of hole coefficients are stored in shared memory, recorded as block_mode and codegroup_mode respectively, and remain valid throughout the subsequent decoding process; the hole coefficients of the whole block group are decoded according to the following steps, in the order of the blocks in the block group, until all blocks in the block group have finished hole coefficient decoding; after the hole coefficient decoding of the whole block group is finished, the number of bits already read by block group decoding is aligned to a byte boundary so as to skip the padding bits;
step 3.4.2: counting the number of effective coefficients in the block group;
step 3.4.3: prefix coefficient decoding: the prefix coefficient decoding process parses the code stream bit by bit; each bit whose value is 1 in the code stream represents one effective coefficient, and the number of consecutive 0 bits before that bit is its prefix coefficient; in a parallel decoding mode, each thread reads a fixed-length number of bits, and the decoding of these bits is completed through thread communication within the thread concurrent scheduling unit until the counted number of decoded effective coefficients reaches the total number of effective coding coefficients in the block group; after the prefix coefficient decoding of the block group is completed, the number of bits already read by block group decoding is aligned to a byte boundary so as to skip the padding bits;
step 3.4.4: suffix coefficient decoding and inverse quantization: the suffix coefficients are decoded in a manner that each thread in the thread concurrent scheduling unit decodes one coefficient; each thread in the thread concurrent scheduling unit independently takes the threadIdx.x variable as its initial loop state, and all effective coefficients are decoded in a loop with a stride of 32;
step 3.4.5: inverse quantization coefficient post-processing: an inverse Hadamard transform is performed on every group of four components within the block, and the data is written back from the shared memory to the global memory.
4. The ultra high definition video layered decoding method based on processor heterogeneous parallel computing according to claim 2, characterized by comprising, in step 3.4.1, the following sub-steps:
step 3.4.1.1: the thread whose threadIdx.x variable equals 0 in the thread concurrent scheduling unit reads one bit and updates block_mode and the number of bits already read by block group decoding; after the thread concurrent scheduling unit synchronizes, the block_mode state is judged; if the block is ineffective, all codegroup hole coefficients corresponding to the current block are set to 0 and step 3.4.1.1 is executed again; otherwise, the next step is executed;
step 3.4.1.2: the thread whose threadIdx.x variable equals 0 in the thread concurrent scheduling unit reads one bit and updates block_mode and the number of bits already read by block group decoding; after the thread concurrent scheduling unit synchronizes, the block_mode state is judged; if the block is completely effective, all codegroup hole coefficients corresponding to the current block are set to 1 and step 3.4.1.1 is executed again; otherwise, the next step is executed;
step 3.4.1.3: the thread whose threadIdx.x variable equals 0 in the thread concurrent scheduling unit reads one bit and updates block_mode and the number of bits already read by block group decoding; after the thread concurrent scheduling unit synchronizes, the block_mode state is judged; if it indicates that the block hole coefficients have a tree structure, step 3.4.1.4 is executed, otherwise step 3.4.1.5 is executed;
step 3.4.1.4: this step indicates that the hole coefficients of the current block are organized as a multi-level tree structure; the tree structure is decoded, and after the tree structure decoding is completed, step 3.4.1.1 is executed again;
step 3.4.1.5: this step indicates that each codegroup of the current block corresponds to one hole coefficient; each thread in the thread concurrent scheduling unit decodes one bit in parallel until all hole coefficients are decoded; the thread whose threadIdx.x variable equals 31 updates the number of bits already read by block group decoding, and finally step 3.4.1.1 is executed again.
5. The ultra high definition video layered decoding method based on processor heterogeneous parallel computing according to claim 3, wherein in step 3.4.2, the following sub-steps are included:
step 3.4.2.1: each thread in the thread concurrent scheduling unit takes its threadIdx.x variable as the initial state of a statistics loop and accesses codegroup_mode with a stride of 32, accumulating each access result to obtain codegroup_count_thread, the number of nonzero codegroup hole coefficients counted in each thread, until all codegroup hole coefficients have been counted;
step 3.4.2.2: the __shfl_up_sync method is used within the thread concurrent scheduling unit to sum codegroup_count_thread over all threads in the unit, and the final sum is multiplied by 4 to obtain the total number of effective coefficients in the block group.
6. The ultra high definition video layered decoding method based on processor heterogeneous parallel computing according to claim 3, characterized by comprising, in step 3.4.3, the following sub-steps:
step 3.4.3.1: each thread in the thread concurrent scheduling unit takes the threadIdx.x variable plus the number of bits already read by block group decoding divided by 32 as its initial position and reads 4 consecutive bytes of the block group code stream as prefix_data_read; the reading position must not exceed the total size of the current block group code stream;
step 3.4.3.2: each thread in the thread concurrent scheduling unit parses the number of consecutive 0 bits from the low end of prefix_data_read and records it as bit0_left; these bits do not participate in the decoding of the current thread and are counted by the next thread;
step 3.4.3.3: the thread whose threadIdx.x variable equals 0 in the thread concurrent scheduling unit shifts out and discards the high bits of prefix_data_read that have already participated in decoding, padding with 0 from the low end; the other threads need no processing because all the bits they read are effective;
step 3.4.3.4: each thread in the thread concurrent scheduling unit parses the number of 1 bits in prefix_data_read and records it as coef_count_thread, that is, the number of effective coefficients that may be decoded in the current thread;
step 3.4.3.5: each thread in the thread concurrent scheduling unit uses the __shfl_up_sync method to compute the cumulative sum of coef_count_thread over all threads before the current thread, recorded as coef_offset_thread, and at the same time updates the coef_count_thread of the current thread;
step 3.4.3.6: the __shfl_up_sync method is used within the thread concurrent scheduling unit to pass bit0_left to the next thread, and the bit0_left of the thread whose threadIdx.x equals 0 is set to 0;
step 3.4.3.7: each thread in the thread concurrent scheduling unit independently executes a loop with coef_index_thread as its loop counter to complete the decoding of prefix_data_read; each iteration of the loop proceeds as follows: first, the number of consecutive 0 bits in prefix_data_read, counted from the high end, is recorded as bit0_length, and bit0_length + bit0_left is written into the shared memory prefix_decode_data at the position coef_index_thread + coef_offset_thread + coef_decode_count; then, bit0_length + 1 + bit0_left is accumulated into the number of effective bits read by the current thread, recorded as bits_read_thread; finally, bit0_left is reset to 0 and the high bit0_length + 1 bits of prefix_data_read are discarded by a bitwise left shift;
step 3.4.3.8: the __shfl_up_sync method is used within the thread concurrent scheduling unit to sum bits_read_thread over all threads in the unit; the thread whose threadIdx.x equals 31 updates the statistical result into the number of bits already read by block group decoding, and coef_offset_thread + coef_count_thread of the current thread is updated into coef_decode_count;
step 3.4.3.9: the thread concurrent scheduling unit synchronizes and judges whether coef_decode_count is greater than or equal to the total number of effective coding coefficients in the block group; if so, all threads of the thread concurrent scheduling unit have finished decoding the prefix coefficients of the block group and the loop is exited; otherwise, the process from step 3.4.3.1 to step 3.4.3.8 is repeated.
7. The ultra high definition video layered decoding method based on processor heterogeneous parallel computing according to claim 2, characterized in that step 3.4.4 comprises the following sub-steps:
step 3.4.4.1: each thread in the thread concurrent scheduling unit, starting from its threadIdx.x value and looping with a stride of 32, reads prefix_decode_data to obtain a prefix coefficient and parses the number of bits of the suffix coefficient corresponding to that prefix coefficient;
step 3.4.4.2: the __shfl_up_sync method is used within the thread concurrent scheduling unit to accumulate the suffix-coefficient bit counts of all threads preceding the current thread, yielding the suffix-coefficient bit offset of the current thread;
step 3.4.4.3: each thread in the thread concurrent scheduling unit takes the number of bits already read for block-group decoding plus its own suffix-coefficient bit offset as the start position, and reads from the block-group code stream the consecutive bits whose count equals the suffix-coefficient bit count, obtaining the suffix coefficient;
step 3.4.4.4: each thread in the thread concurrent scheduling unit computes the decoded coefficient from its prefix coefficient and suffix coefficient, performs inverse quantization according to the inverse quantization parameters, and stores the inverse quantization result into prefix_decode_data at the position from which the prefix coefficient was read;
step 3.4.4.5: the thread concurrent scheduling unit synchronizes, and the thread whose threadIdx.x equals 31 adds its suffix-coefficient bit offset and suffix-coefficient bit count to the number of bits already read for block-group decoding.
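The offset computation in steps 3.4.4.2 and 3.4.4.5 is an exclusive prefix sum over per-lane bit counts, which a warp would build from __shfl_up_sync rounds. A host-side Python sketch modeling the result of that scan (`suffix_bit_offsets` is an illustrative name; the serial running sum stands in for the warp shuffle):

```python
def suffix_bit_offsets(bit_counts, bits_read):
    """bit_counts[i]: suffix bit count parsed by lane i (step 3.4.4.1).
    Returns each lane's bit offset (sum of the counts of all earlier lanes,
    step 3.4.4.2) and the advanced block-group read pointer that lane 31
    would publish in step 3.4.4.5."""
    offsets = []
    acc = 0
    for n in bit_counts:      # lane order 0..31
        offsets.append(acc)   # exclusive sum: bits consumed by earlier lanes
        acc += n
    return offsets, bits_read + acc
```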
8. The ultra high definition video layered decoding method based on processor heterogeneous parallel computing according to claim 2, characterized in that step 3.4.5 comprises the following sub-steps:
step 3.4.5.1: judging whether the current block is valid according to the block_mode obtained in step 3.4.1; if so, executing the following steps, otherwise processing the next block;
step 3.4.5.2: calculating the mapping between relative positions and absolute positions within the block according to the codegroup_mode obtained in step 3.4.1, recording the mapping into shared memory, and counting the number of valid coefficients in the current block, denoted block_coef_valid; when the codegroup_mode of the current block is traversed with the access index coef_index_in_block, the relative position is the sum of all valid coefficients before the position corresponding to coef_index_in_block, and the absolute position is coef_block_index;
step 3.4.5.3: each thread in the thread concurrent scheduling unit takes coef_index = threadIdx.x as the initial state of the loop and completes the mapping from relative positions to absolute positions within the block with a stride of 32 per iteration;
step 3.4.5.4: the thread concurrent scheduling unit synchronizes, and the thread whose threadIdx.x equals 0 updates block_coef_valid into coef_out_count;
step 3.4.5.5: each thread in the thread concurrent scheduling unit takes coef4_index = threadIdx.x as the initial state of the loop, reads with a stride of 32 the 4 consecutive components at positions [coef4_index*4, coef4_index*4+3] of output_decode_data, and writes them back to the same positions after an inverse Hadamard transform;
step 3.4.5.6: each thread in the thread concurrent scheduling unit reads data from the corresponding position in output_decode_data and writes it into global memory according to the correspondence between the current block and the image.
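The 4-point inverse Hadamard transform of step 3.4.5.5 is a single butterfly per group of 4 components. A sketch with Sylvester ordering; the 1/4 normalization is an assumption, since the claims do not state the scaling:

```python
def hadamard4(c):
    """Forward 4-point Hadamard transform (Sylvester ordering)."""
    a, b, c2, d = c
    return [a + b + c2 + d, a - b + c2 - d, a + b - c2 - d, a - b - c2 + d]

def inverse_hadamard4(c):
    """Inverse transform: the same butterfly scaled by 1/4, since H4*H4 = 4*I.
    Integer division is exact because each sum is a multiple of 4."""
    a, b, c2, d = c
    return [(a + b + c2 + d) // 4, (a - b + c2 - d) // 4,
            (a + b - c2 - d) // 4, (a - b - c2 + d) // 4]
```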
9. The ultra high definition video layered decoding method based on processor heterogeneous parallel computing according to claim 4, wherein in step 3.4.1.4 the tree-structure decoding comprises the following sub-steps:
step a: among the first 16 threads of the thread concurrent scheduling unit, each thread whose threadIdx.x modulo 4 equals 0 reads one bit, in the order given by threadIdx.x / 4, starting from the number of bits already read for block-group decoding, and the thread whose threadIdx.x equals 0 updates the number of bits already read for block-group decoding;
step b: thread communication is completed within the thread concurrent scheduling unit using __shfl_up_sync, so that the bit value read in step a is shared within each group of 4 threads and denoted bit_level0;
step c: thread communication within the thread concurrent scheduling unit accumulates bit_level0 over all threads preceding the current thread in step b, denoted bit_level0_offset;
step d: if the bit_level0 value obtained after step b is 1, one bit is read at the position given by bit_level0_offset plus the number of bits already read for block-group decoding, denoted bit_level1; the thread whose threadIdx.x equals 15 updates bit_level0_offset + bit_level0 into the number of bits already read for block-group decoding;
step e: thread communication within the thread concurrent scheduling unit accumulates bit_level1 over all threads preceding the current thread, denoted bit_level1_offset, and bit_level1_offset is multiplied by 4;
step f: for the first 16 threads of the thread concurrent scheduling unit, if bit_level0 and bit_level1 are both 1, bit_level1_offset plus the number of bits already read for block-group decoding is taken as the start position, 4 consecutive bits are read from the block-group code stream, the read result is written, from high bit to low bit, into the positions threadIdx.x*4, threadIdx.x*4+1, threadIdx.x*4+2 and threadIdx.x*4+3 of the codegroup_mode of the current block, and bit_level1_offset is increased by 4; otherwise the corresponding positions in the codegroup_mode of the current block are set to 0; finally, the thread whose threadIdx.x equals 15 updates bit_level1_offset into the number of bits already read for block-group decoding.
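Read sequentially, steps a–f describe a two-level significance tree: one level-0 bit per group of 4 codegroups, one level-1 bit per codegroup inside a flagged group, then 4 mode bits per flagged codegroup. A serial Python model of that bit layout (the warp performs this in parallel via the shuffles above; `decode_tree` is an illustrative name, and the 16-codegroup / 64-position block size is inferred from the thread indices in step f):

```python
def decode_tree(bits):
    """Decode the two-level significance tree for one block: 4 level-0 bits
    (one per group of 4 codegroups), a level-1 bit for every codegroup in a
    flagged group, then 4 mode bits per flagged codegroup (high to low).
    Returns the 64-entry codegroup_mode and the number of bits consumed."""
    pos = 0
    def read(n):
        nonlocal pos
        v = bits[pos:pos + n]
        pos += n
        return v
    level0 = [read(1)[0] for _ in range(4)]                       # step a
    level1 = [read(1)[0] if level0[cg // 4] else 0                # step d
              for cg in range(16)]
    codegroup_mode = [0] * 64                                     # step f
    for cg in range(16):
        if level1[cg]:
            codegroup_mode[cg * 4:cg * 4 + 4] = read(4)
    return codegroup_mode, pos
```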
10. The ultra high definition video layered decoding method based on processor heterogeneous parallel computing according to claim 8, wherein in step 3.4.5.3 the mapping specifically comprises: reading prefix_decode_data in the order of coef_index + coef_out_count, and storing the read result, according to the absolute position, into another block of shared memory denoted output_decode_data.
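Claim 10's mapping is a scatter from dense relative order into absolute block positions. A host-side Python sketch, using a 0/1 validity mask in place of the per-codegroup mode words (`scatter_to_absolute` and the mask representation are assumptions for illustration):

```python
def scatter_to_absolute(prefix_decode_data, coef_out_count, valid_mask):
    """valid_mask[i] == 1 marks absolute position i as holding a valid
    coefficient; valid positions are filled in relative order from
    prefix_decode_data starting at offset coef_out_count."""
    output_decode_data = [0] * len(valid_mask)
    coef_index = 0                 # relative position within the block
    for abs_pos, valid in enumerate(valid_mask):
        if valid:
            output_decode_data[abs_pos] = (
                prefix_decode_data[coef_index + coef_out_count])
            coef_index += 1
    return output_decode_data
```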
CN202211518947.2A 2022-11-30 2022-11-30 Ultra-high-definition video layered decoding method based on processor heterogeneous parallel computing Pending CN115866268A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211518947.2A CN115866268A (en) 2022-11-30 2022-11-30 Ultra-high-definition video layered decoding method based on processor heterogeneous parallel computing

Publications (1)

Publication Number Publication Date
CN115866268A true CN115866268A (en) 2023-03-28

Family

ID=85668236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211518947.2A Pending CN115866268A (en) 2022-11-30 2022-11-30 Ultra-high-definition video layered decoding method based on processor heterogeneous parallel computing

Country Status (1)

Country Link
CN (1) CN115866268A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117440167A (en) * 2023-09-28 2024-01-23 书行科技(北京)有限公司 Video decoding method, device, computer equipment, medium and product
CN117440167B (en) * 2023-09-28 2024-05-28 书行科技(北京)有限公司 Video decoding method, device, computer equipment, medium and product

Similar Documents

Publication Publication Date Title
US20210160504A1 (en) Method and device for video coding and decoding
US9491469B2 (en) Coding of last significant transform coefficient
US8395634B2 (en) Method and apparatus for processing information
CN105120293A (en) Image cooperative decoding method and apparatus based on CPU and GPU
KR101710001B1 (en) Apparatus and Method for JPEG2000 Encoding/Decoding based on GPU
US9894371B2 (en) Video decoder memory bandwidth compression
BRPI0607204A2 (en) coding and decoding devices and methods and transmission system
CN109196866A (en) For showing the subflow multiplexing of stream compression
EP2747434A1 (en) Video image compression/decompression device
CN102986217B (en) The method and apparatus to process frame of video by using the difference between pixel value
CN115866268A (en) Ultra-high-definition video layered decoding method based on processor heterogeneous parallel computing
US20170359591A1 (en) Method and device for entropy encoding or entropy decoding video signal for high-capacity parallel processing
CN113382265B (en) Hardware implementation method, apparatus, medium, and program product for video data entropy coding
US10110896B2 (en) Adaptive motion JPEG encoding method and system
CN113613004A (en) Image encoding method, image encoding device, electronic device, and storage medium
CN110290384A (en) Image filtering method, device and Video Codec
US9451257B2 (en) Method and apparatus for image encoding and/or decoding and related computer program products
US20230209052A1 (en) Systems and methods for block division in video processing
De Cea-Dominguez et al. GPU-oriented architecture for an end-to-end image/video codec based on JPEG2000
CN112422984B (en) Code stream preprocessing device, system and method of multi-core decoding system
US5666115A (en) Shifter stage for variable-length digital code decoder
CN104159106B (en) Method for video coding and video encoding/decoding method and its device
US7440630B2 (en) Image compressing apparatus that achieves desired code amount
CN105635731A (en) Intra-frame prediction reference point preprocessing method for high efficiency video coding
CN114071161A (en) Image encoding method, image decoding method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination