CN110337002B - HEVC (high efficiency video coding) multi-level parallel decoding method on a multi-core processor platform


Info

Publication number
CN110337002B
Authority
CN
China
Prior art keywords
decoding
ctu
thread
decoded
hevc
Prior art date
Legal status
Active
Application number
CN201910752152.XA
Other languages
Chinese (zh)
Other versions
CN110337002A (en)
Inventor
胡栋
张文祥
李毅
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910752152.XA
Publication of CN110337002A
Application granted
Publication of CN110337002B

Classifications

    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/172: using adaptive coding characterised by the coding unit, the unit being an image region, the region being a picture, frame or field
    • H04N 19/44: decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • H04N 19/70: characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H04N 19/91: entropy coding, e.g. variable length coding [VLC] or arithmetic coding
    • H04N 19/96: tree coding, e.g. quad-tree coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Aiming at the huge data volume of high-definition video and the very high processing complexity of HEVC decoding, the invention makes full use of the data dependencies in HEVC and provides an HEVC multi-level parallel decoding method on a multi-core processor platform. First, a CTU-based wavefront parallel algorithm is implemented in the pixel decoding and reconstruction module by exploiting the data dependencies among CTU units; second, in the fast loop filtering module, a fused loop filtering algorithm is implemented by fully exploiting the data dependency between deblocking filtering and sample adaptive offset; finally, pipeline parallelism is introduced between the two modules, yielding a multi-level, efficient parallel decoding algorithm for the decoder. During decoding, each task is executed by an independent thread bound to a dedicated core, so that the parallel computing capability of the multi-core processor is fully utilized and decoding efficiency is improved.

Description

HEVC (high efficiency video coding) multi-level parallel decoding method on multi-core processor platform
Technical Field
The invention relates to the field of digital video signal coding and decoding, and in particular to a multi-level, multi-task, efficient parallel decoding method on a multi-core processor platform.
Background
With the development of the mobile Internet and the continuous progress of Internet video applications, and to meet people's growing demand for high-definition (HD) video, the new-generation international video coding standard HEVC (High Efficiency Video Coding) was developed by JCT-VC, a joint collaborative team established by MPEG and VCEG in 2010, and formally became an international standard in January 2013. The goal of HEVC is to improve video coding efficiency, doubling the compression ratio compared with the High Profile of H.264/AVC at the same image quality. Considering the huge data volume of high-definition video coding and decoding and the complexity of the codec, the HEVC standard introduces several parallel processing tools, such as Tiles, which use blocks as parallel granules, and wavefront parallel processing (WPP), which effectively improve coding and decoding performance. Meanwhile, multi-core processors have developed rapidly in recent years, and the effective combination of the two has become an important factor in the successful application of HEVC.
Scholars at home and abroad have carried out research on combining multi-core processors with video coding and decoding standards. Internationally, Hyuneki Baik et al., in the paper "A Complexity-based Adaptive Tile Partitioning Algorithm for HEVC Decoder Parallelization" (2015), proposed an efficient adaptive tile partitioning algorithm for HEVC decoder parallelization that divides video frames into independent tiles processed simultaneously on multiple cores; however, the encoder must adopt an independent tile-partitioned coding mode, and the high coupling between the decoder and the encoder greatly limits the method. HyunMi Kim et al., in "An Efficient HEVC Loop Filter (ILF) Architecture for a Multicore Scalable HEVC Hardware Decoder" (2018), provide effective multi-core utilization for ultra-high-definition video applications, and the novel memory organization and management technique they propose resolves the data dependencies among multiple processing units.
Domestic scholars have also proposed decoding methods for multi-core platforms. In 2014, Maedi et al. of a school of information and communication engineering proposed an HEVC parallel decoder based on a hybrid CPU+GPU platform, which adopts the CUDA hardware platform and exploits hardware advantages to optimize the system. Based on an analysis of the data dependencies among CTU units in the decoding process, Didi et al. (2016) proposed an HEVC multi-level parallel decoding method based on the Tilera multi-core processor. Han et al. (2018) proposed an HEVC parallel decoding method combining task-level and data-level parallelism on a multi-core processor, based on the relationship between data and tasks in the decoding process.
Summarizing previous research on multi-core platforms, it includes overlapped wavefront (OWF) parallelism, pipelined thread-pool techniques, the 3D-WPP algorithm, parallel algorithms combining data-level and task-level parallelism, and fast deblocking filtering algorithms. Although great improvements have been achieved in some respects, the large differences in computational complexity among CTUs, caused by differences in local texture complexity within each frame of a video sequence, are not considered; nor is the cost of interactive communication between the cores and the cache in the deblocking filtering and sample adaptive offset modules; at the same time, the core resources of the multi-core platform are not fully utilized.
Disclosure of Invention
The technical problem to be solved by the invention is to decode an HEVC high-definition single bitstream in real time and to further improve decoding efficiency on the premise of guaranteeing the quality of the decoded images.
The technical solution adopted by the invention is as follows: an HEVC multi-level parallel decoding method on a multi-core processor platform, in which a main thread controls the decoding process, a plurality of slave threads decode CTU units independently in parallel, and each thread is bound to an available core of the multi-core processor, thereby realizing efficient parallel decoding on the multi-core processor platform. The method comprises the following steps:
Step 1: the main thread completes the initialization work, including initializing the HEVC decoder, allocating storage units, initializing the cache, initializing the decoding task queue, and emptying the task queue;
Step 2: reading the HEVC-coded sequence bitstream, calling the network abstraction layer (NAL) parsing function, and parsing the encapsulated parameter information to obtain the profile, level, picture frame type, picture size parameters, and loop filtering parameters required for decoding;
Step 3: performing entropy decoding according to the parameter information obtained from the network abstraction layer parsing in step 2, to obtain the syntax elements representing the video sequence;
Step 4: obtaining the number of CTU rows in the current frame according to the syntax elements obtained by entropy decoding in step 3, creating as many threads in a thread pool as there are CTU rows in the current frame, and binding each thread to a different core through the multi-core function library;
Step 5: establishing and initializing a CTU dependency table according to the total number of CTU units in the current frame, wherein the coordinates of each element in the dependency table correspond to the coordinates of a CTU unit in the frame, and the value of each element represents the number of adjacent CTU units on which the CTU at that position depends; for a given CTU unit, each time one of the CTU units it depends on finishes decoding, one of its dependencies is satisfied, so the corresponding element in the CTU dependency table is decremented by 1; when the element value reaches 0, the CTU unit has no remaining dependencies on its adjacent CTUs;
Step 6: adding the first CTU unit in the dependency table to the task queue to be decoded;
Step 7: judging whether the queue to be decoded is empty; if it is not empty, judging whether the thread pool is empty; if the thread pool is not empty, an idle slave thread receives a task assignment from the main thread and executes a decoding task from the task queue to be decoded; if the thread pool is empty, waiting for a slave thread to return to the idle state and for task assignment; if the queue to be decoded is empty, executing step 9;
Step 8: the main thread obtains the decoded CTU unit according to the information returned by the slave thread, updates the dependency table, and adds any CTU unit whose element value in the dependency table becomes 0 to the task queue to be decoded;
Step 9: when a slave thread completes the pixel decoding and reconstruction task of a CTU unit, pipeline parallelism is adopted: the slave thread takes a new decoding task from the task queue to be decoded and continues performing pixel decoding and reconstruction, while another idle slave thread is scheduled from the thread pool to perform fused loop filtering on the CTU unit that has just completed pixel decoding and reconstruction, once the data dependencies are satisfied; steps 8 and 9 are repeated until the current frame has been decoded, and then step 10 is executed;
Step 10: checking whether the video bitstream has been completely decoded; if decoding is complete, releasing all resources and destroying the thread pool; otherwise, reading the next frame and executing step 5.
Further, the HEVC decoder in step 1 uses coding units (CUs) organized in a recursive quadtree hierarchy.
Further, in step 2, the parameter information includes a picture parameter set PPS, a video parameter set VPS, a sequence parameter set SPS, supplemental enhancement information SEI, and Slice header information of the picture.
Further, in step 7, the slave thread receiving the task assignment from the main thread and executing the decoding task from the task queue to be decoded specifically comprises the following steps: obtaining the CTU unit information to be decoded and the required information of neighbouring CTU units from the external memory through Cache communication; decoding the CTU unit; writing the decoded CTU unit back to the external memory through Cache communication; and notifying the main thread that decoding has finished.
Further, if the CTU unit to be decoded is an inter-frame CTU unit, the pixel data of the reference CTU unit is obtained from the external memory through Cache communication.
Further, the fused loop filtering in step 9 comprises deblocking filtering and SAO filtering, and specifically comprises: according to the data dependencies among the luma component, the chroma components, and the sample adaptive offset (SAO) in the deblocking filtering process, constructing a new CTU-like decoding object as the processing object of the loop filtering, wherein the CTU-like decoding object is composed of samples from the current CTU unit and from its upper-left and upper CTU units; after deblocking filtering is finished, the range of the current CTU-like decoding object is re-partitioned for processing the pixel samples in SAO filtering, with the extent of the CTU-like decoding object shifted left by one column and up by one row of pixel samples for the SAO-filtered samples.
Further, the task queue to be decoded in step 6 is a first-in first-out queue that stores the CTU units to be decoded; after a slave thread completes a CTU unit decoding task, a new decoding task is taken from the head of the task queue to be decoded and assigned to the decoding thread.
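A minimal C++ sketch of such a first-in first-out task queue, protected by a mutex and a condition variable as described in the beneficial effects below, is given here; the CtuTask type, the close() shutdown mechanism, and all identifier names are assumptions made for this illustration only, not part of the claimed method.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>

// Hypothetical task descriptor: the (x, y) position of a CTU that is ready to decode.
struct CtuTask { int ctuX; int ctuY; };

class FifoTaskQueue {
public:
    // Main thread: append a CTU whose dependencies have all been resolved.
    void push(CtuTask task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(task);
        }
        notEmpty_.notify_one();
    }

    // Slave thread: block until a task is available or the queue is shut down.
    std::optional<CtuTask> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return std::nullopt;   // closed and drained
        CtuTask task = queue_.front();
        queue_.pop();
        return task;
    }

    // Called once the whole bitstream is decoded, so slave threads can exit.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        notEmpty_.notify_all();
    }

private:
    std::queue<CtuTask> queue_;
    std::mutex mutex_;
    std::condition_variable notEmpty_;
    bool closed_ = false;
};
```

In this sketch the main thread would push() a CTU once its dependency count reaches zero, while each slave thread blocks in pop() until work arrives or the queue is closed.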
Beneficial effects: by using a multi-threaded design, the invention improves the real-time responsiveness of the program, improves its design structure, exploits the processor more effectively, reduces frequent scheduling and switching of system resources, and reduces the overhead of creating and destroying thread objects. When multiple threads access shared resources, correct concurrent operation is coordinated through lock and unlock operations together with condition variables, so that the decoding efficiency of the system is improved as a whole. Compared with the prior art, the invention has the following beneficial effects:
the invention relates to a multi-level high-efficiency parallel decoding method on a multi-core processor platform, which divides an HEVC decoder into a pixel decoding reconstruction part and a loop filtering part under the limit of an original parallel framework, wherein a pixel decoding reconstruction module is designed and realizes a wave front parallel algorithm based on a CTU unit, the loop filtering module is designed and realizes a fused loop filtering algorithm, a pipeline parallel mode is utilized between the two modules, and the high performance of parallel computation of the multi-core processor is utilized.
Drawings
Fig. 1 is a HEVC decoding flow block diagram;
FIG. 2 is a schematic diagram of the CTU dependency table;
FIG. 3 is a schematic diagram of a CTU unit based wavefront parallel decoding system design modeling;
FIG. 4 is a schematic diagram of the buffer interaction based on the CTU unit wavefront parallel decoding algorithm;
FIG. 5 is a diagram illustrating dependency of luma, chroma and sample adaptive compensation data in deblocking filtering;
FIG. 6 is a schematic diagram of deblocking filtering and sample adaptive compensation data dependency;
FIG. 7 is a schematic diagram of a fused loop filtering process;
FIG. 8 is a schematic diagram of a pixel decoding reconstruction module and a loop filter module using a pipeline parallel technique;
FIG. 9 is a flow chart of a multi-level efficient parallel decoding algorithm;
FIG. 10 shows the experimental comparison between the multi-level efficient parallel decoding algorithm and a decoding algorithm combining task-level and data-level parallelism.
Detailed Description
The basic idea of the invention is as follows: exploiting the high parallel computing performance of a multi-core processor, and fully taking into account both the differences in computational complexity among the local CTUs of each frame of a video sequence and the cost of interactive communication between the cores and the cache in the deblocking filtering and sample adaptive offset modules, HEVC decoding is divided into two parts, pixel decoding and reconstruction and loop filtering, and multi-level efficient parallel decoding is adopted.
Embodiment:
the embodiment aims at the ultrahigh parallel computing performance of the multi-core processor to realize the real-time parallel decoding of the high-definition video of the HEVC.
Fig. 1 shows the block diagram of an HEVC decoder: first, entropy decoding is performed on the coded binary bitstream to obtain the quantized coefficients and the control information, and then inverse quantization and inverse transform are applied to the quantized coefficients to obtain the residual information. The decoder then performs intra prediction and inter prediction using the control information, combines the prediction with the recovered residual information, and applies the loop filtering of deblocking filtering and sample adaptive offset to obtain the output picture.
The basic structure of HEVC coding and decoding is essentially the same as that of H.264/AVC, but its performance improvement comes from deep optimization at the module level and from innovations in several design elements. The new features most relevant to the performance of high-definition video coding and decoding are: coding units (CUs) organized in a recursive quadtree hierarchy, and several parallelization tools introduced to cope with the huge data volume of high-definition video. In this embodiment, the CTU units of an image frame are used as the parallel granules: a CTU dependency table and a decoding method are designed, each CTU unit is decoded in parallel, a task queue is created, and threads bound to the tile cores perform multi-core parallel processing.
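To illustrate the recursive quadtree CU structure mentioned above, the following sketch walks one CTU down to its leaf CUs; readSplitFlag and decodeCu are hypothetical callbacks standing in for the entropy decoder and the per-CU reconstruction, so the fragment is an illustration rather than the decoder's actual implementation.

```cpp
#include <functional>

// Minimal sketch of the recursive quadtree CU traversal inside one CTU.
void parseCuTree(int x0, int y0, int size, int minCuSize,
                 const std::function<bool(int, int, int)>& readSplitFlag,
                 const std::function<void(int, int, int)>& decodeCu) {
    if (size > minCuSize && readSplitFlag(x0, y0, size)) {
        int half = size / 2;
        // Recurse into the four quadrants in raster order.
        parseCuTree(x0,        y0,        half, minCuSize, readSplitFlag, decodeCu);
        parseCuTree(x0 + half, y0,        half, minCuSize, readSplitFlag, decodeCu);
        parseCuTree(x0,        y0 + half, half, minCuSize, readSplitFlag, decodeCu);
        parseCuTree(x0 + half, y0 + half, half, minCuSize, readSplitFlag, decodeCu);
    } else {
        decodeCu(x0, y0, size);   // leaf CU: prediction, inverse transform, reconstruction
    }
}
```

For a 64x64 CTU with an 8x8 minimum CU size, it would be invoked as parseCuTree(x, y, 64, 8, readSplitFlag, decodeCu).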
Fig. 2 shows the CTU dependency table and its initial values. The size of the table equals the total number of CTU units in one frame, and each element records the number of adjacent CTU units that the corresponding CTU depends on. Whenever one of the CTU units that a CTU depends on finishes decoding, the corresponding element in the table is decremented by 1; when the value reaches 0, the CTU unit has no remaining dependencies on its adjacent CTUs, is ready to be decoded, and can be added to the decoding task queue by the main thread. The table is filled as follows: the top-left element is the first CTU of the frame and does not depend on any other CTU of the current frame; a CTU in the first row depends only on the CTU to its left; a CTU in the first column depends on the CTUs above and above-right of it, but the above-right CTU is always decoded later than the above CTU, so it can be recorded as depending only on the above-right CTU, i.e., decoding can start once its above-right CTU has been decoded; a CTU in the last column depends on its upper-left, upper, and left CTUs, but the left CTU is always decoded later than the upper CTUs, so it can be recorded as depending only on its left CTU, i.e., decoding can start once its left CTU has been decoded; every other CTU depends on its upper-left, upper, upper-right, and left CTUs, and since the upper-right and left CTUs are always decoded later than the upper-left and upper CTUs but have no fixed order between themselves, both are recorded as dependencies of the current CTU.
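A minimal sketch of building and updating this dependency table, following the rules of Fig. 2 as described above, could look as follows; the data layout and function names are illustrative assumptions.

```cpp
#include <utility>
#include <vector>

// Dependency table for one frame: entry (r, c) counts how many neighbouring CTUs
// must finish before CTU (r, c) may start decoding, per the rules of Fig. 2:
// top-left CTU 0; rest of first row 1 (left); rest of first column 1 (above-right);
// rest of last column 1 (left); all other CTUs 2 (above-right and left).
std::vector<std::vector<int>> initCtuDependencyTable(int rows, int cols) {
    std::vector<std::vector<int>> dep(rows, std::vector<int>(cols, 2));
    dep[0][0] = 0;
    for (int c = 1; c < cols; ++c) dep[0][c] = 1;          // first row: left only
    for (int r = 1; r < rows; ++r) dep[r][0] = 1;          // first column: above-right only
    for (int r = 1; r < rows; ++r) dep[r][cols - 1] = 1;   // last column: left only
    return dep;
}

// Called by the main thread when CTU (r, c) finishes pixel reconstruction.
// Decrements the counters of the CTUs that depend on it and appends any CTU
// whose counter reaches zero to the ready list.
void onCtuDecoded(std::vector<std::vector<int>>& dep, int r, int c,
                  std::vector<std::pair<int, int>>& ready) {
    const int rows = static_cast<int>(dep.size());
    const int cols = static_cast<int>(dep[0].size());
    auto release = [&](int rr, int cc) {
        if (rr < 0 || rr >= rows || cc < 0 || cc >= cols) return;
        if (--dep[rr][cc] == 0) ready.emplace_back(rr, cc);
    };
    release(r, c + 1);       // right neighbour waits on its left CTU
    release(r + 1, c - 1);   // below-left neighbour waits on its above-right CTU
}
```

Each completed CTU releases at most its right neighbour (which waits on its left CTU) and its below-left neighbour (which waits on its above-right CTU), which is exactly the bookkeeping the main thread performs in step 8.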
Fig. 3 shows the design model of the CTU-based wavefront parallel decoding system: the upper-left part is the CTU dependency table that is created and continuously maintained, the lower-left part is the task queue we design, and also shown is the cache structure we design; M and P1, P2, P3, ..., Pn are the threads bound to cores, where M is the main thread (core) and the P's are the slave threads (cores). The main thread (core) maintains the CTU dependency table to track the dependencies among CTU units; once all the CTUs that a CTU unit depends on have been decoded, the CTU unit enters the ready state and can be added to the task queue as a decoding task. Whenever a slave thread (core) becomes idle, a decoding task is taken from the task queue and assigned to it for decoding. A slave thread (core) is dedicated to decoding CTU units; when it finishes decoding one CTU unit it enters the waiting state and waits for the main thread (core) to assign a new decoding task. The CTU decoding task queue is a first-in first-out queue: a CTU unit is added to the task queue to be decoded when it is ready, and after a slave thread (core) finishes a CTU decoding task, a new decoding task is taken from the head of the queue and assigned to the decoding thread.
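The patent binds each thread to a core through the Tilera multi-core function library; purely as a generic illustration, the sketch below shows the equivalent pinning on a Linux system using pthread_setaffinity_np, with one slave thread per CTU row and core 0 assumed to be reserved for the main thread M. The core numbering and the ctuRowWorker callback are assumptions made for this example.

```cpp
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pin the calling thread to a single core so the scheduler does not migrate it.
bool bindCurrentThreadToCore(int core) {
    cpu_set_t cpuSet;
    CPU_ZERO(&cpuSet);
    CPU_SET(core, &cpuSet);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpuSet), &cpuSet) == 0;
}

// One slave thread (P1..Pn) per CTU row, each pinned to its own core;
// ctuRowWorker is a placeholder for the decoding loop of Fig. 4.
void launchSlaveThreads(int numCtuRows, void (*ctuRowWorker)(int)) {
    std::vector<std::thread> pool;
    for (int i = 0; i < numCtuRows; ++i) {
        pool.emplace_back([=] {
            bindCurrentThreadToCore(i + 1);   // core 0 assumed reserved for the main thread M
            ctuRowWorker(i);
        });
    }
    for (auto& t : pool) t.join();
}
```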
Fig. 4 is a schematic diagram of the cache interaction in the CTU-based wavefront parallel decoding algorithm. Because the CTU-based wavefront parallel algorithm takes a single CTU unit as its minimum decoding unit, data exchange and cache communication are more frequent than in traditional algorithms. The algorithm specifically comprises the following steps (a code sketch of the resulting slave-thread loop follows the steps):
step 1, initializing, wherein a slave thread in a thread pool waits for a main thread (core) to start the thread to perform a new decoding task;
step 2, obtaining CTU unit information needing to be decoded from an external memory through Cache communication;
step 3, obtaining relevant information of peripheral CTU units required by the current CTU unit from an external memory through Cache communication;
step 4, if the inter-frame CTU unit is the inter-frame CTU unit, acquiring pixel data of a reference CTU unit from an external memory through Cache communication;
step 5, decoding (reconstructing) the CTU unit;
step 6, writing the decoded CTU back to an external memory through Cache communication;
step 7, notifying the main thread (core) that decoding has finished.
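As referenced above, a minimal sketch of this slave-thread loop is given below. The Cache/external-memory helpers and the CtuData type are hypothetical stubs that only mirror steps 1 to 7; the queue type is expected to behave like the FIFO queue sketched earlier, with its CtuTask extended by an inter-prediction flag.

```cpp
#include <optional>

// Hypothetical types and Cache-layer helpers; the stub bodies below only stand in
// for the real decoder and the Cache/external-memory access of steps 2 to 7.
struct CtuTask { int ctuX; int ctuY; bool interPredicted; };
struct CtuData {};

inline CtuData fetchCtuFromExternalMemory(int, int)       { return {}; }  // steps 2-3: CTU + neighbour info via Cache
inline CtuData fetchReferenceCtu(int, int)                { return {}; }  // step 4: reference pixels for inter CTUs
inline CtuData decodeCtu(const CtuData&, const CtuData*)  { return {}; }  // step 5: decode (reconstruct) the CTU
inline void writeCtuBack(int, int, const CtuData&)        {}              // step 6: write back via Cache
inline void notifyMainThread(int, int)                    {}              // step 7: completion message to the main thread

// One slave thread (core): loop over tasks handed out by the main thread.
// TaskQueue is expected to provide pop() returning std::optional<CtuTask>.
template <typename TaskQueue>
void slaveThreadLoop(TaskQueue& tasks) {
    while (true) {
        std::optional<CtuTask> task = tasks.pop();      // step 1: wait for a new decoding task
        if (!task) break;                               // queue closed: frame/stream finished
        CtuData input = fetchCtuFromExternalMemory(task->ctuX, task->ctuY);
        CtuData reference;
        if (task->interPredicted)
            reference = fetchReferenceCtu(task->ctuX, task->ctuY);
        CtuData reconstructed =
            decodeCtu(input, task->interPredicted ? &reference : nullptr);
        writeCtuBack(task->ctuX, task->ctuY, reconstructed);
        notifyMainThread(task->ctuX, task->ctuY);
    }
}
```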
fig. 5 is a diagram illustrating dependency of luminance components, chrominance components and sample adaptive compensation data in deblocking filtering. According to the data relationship between deblocking filtering and sample adaptive compensation, the invention proposes a re-partitioning scheme to make new CTU-like decoding objects. The CTU-like decoding object contains all the data required for its deblocking filter, including the luminance and the two chrominance components. This class of CTU decoding objects includes samples from the current CTU and its upper left and upper CTUs. FIG. 5 shows that the size of the CTU-like decoded object remains at 64 × 64; however, the sample range is shifted to the left by four columns and up by four rows. As in fig. 5, the new 64 x 64 class CTU decoding object is the new actual processing object of the deblocking filter we have designed.
Fig. 6 illustrates the data dependency between deblocking filtering and sample adaptive offset. For the SAO filter, the EO (edge offset) mode requires reference to neighbouring samples, and the SAO source data is taken from the output of the deblocking filter. Therefore, to implement the HEVC loop filter at the CTU level and couple the deblocking filter with the SAO filter, the current CTU-like decoding object must be re-partitioned after deblocking filtering: the range of samples processed by the SAO filter is shifted one column to the left and one row up, as indicated by the dashed box in Fig. 6.
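The shifted sample ranges can be expressed as simple window arithmetic. In the sketch below, the deblocking window follows the four-column/four-row shift of Fig. 5, and the SAO window is interpreted as a further one-column/one-row shift of that window per Fig. 6; this interpretation, the clipping at picture borders, and all names are assumptions made for illustration.

```cpp
// Axis-aligned sample window inside a frame, in luma-sample coordinates.
struct SampleWindow { int x0, y0, width, height; };

// Clip a window against the picture boundaries.
SampleWindow clipToPicture(SampleWindow w, int picWidth, int picHeight) {
    if (w.x0 < 0) { w.width  += w.x0; w.x0 = 0; }
    if (w.y0 < 0) { w.height += w.y0; w.y0 = 0; }
    if (w.x0 + w.width  > picWidth)  w.width  = picWidth  - w.x0;
    if (w.y0 + w.height > picHeight) w.height = picHeight - w.y0;
    return w;
}

// The 64x64 CTU-like decoding object for deblocking: same size as the CTU,
// but shifted four columns left and four rows up (Fig. 5), so it covers samples
// of the current CTU plus its upper-left and upper neighbours.
SampleWindow deblockWindow(int ctuX, int ctuY, int ctuSize = 64) {
    return {ctuX * ctuSize - 4, ctuY * ctuSize - 4, ctuSize, ctuSize};
}

// After deblocking, the SAO stage re-partitions the object; interpreted here as a
// further one-column/one-row shift of the same window (Fig. 6), an assumption of this sketch.
SampleWindow saoWindow(int ctuX, int ctuY, int ctuSize = 64) {
    SampleWindow w = deblockWindow(ctuX, ctuY, ctuSize);
    w.x0 -= 1;
    w.y0 -= 1;
    return w;
}
```

For example, for the CTU at grid position (1, 1) of a 1080p picture, deblockWindow covers luma samples (60, 60) through (123, 123).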
Fig. 7 shows the processing flow of the fused loop filtering scheme. Step 1: the luma boundary decisions for the vertical edges of multiple 8x8 luma blocks are made simultaneously. Step 2: deblocking filtering of the 8x8 Cb blocks is performed. Step 3: SAO of the 8x8 Cb blocks is performed. Step 4: horizontal filtering of the vertical edges of multiple 8x8 luma blocks is performed simultaneously. Step 5: the luma boundary decisions for the horizontal edges of multiple 8x8 luma blocks are made simultaneously. Step 6: deblocking filtering of the 8x8 Cr blocks is performed. Step 7: SAO of the 8x8 Cr blocks is performed. Step 8: vertical filtering of the horizontal edges of multiple 8x8 luma blocks is performed simultaneously. Step 9: SAO of the 8x8 luma blocks is performed. This process is executed in a loop.
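Expressed as code, the nine steps above reduce to a fixed per-object call sequence; the filter kernels below are empty placeholders named only to mirror Fig. 7, not the decoder's real functions.

```cpp
// Hypothetical placeholders for the individual filter kernels; each operates
// on the 8x8 blocks inside one CTU-like decoding object.
struct CtuLikeObject {};
void lumaBoundaryDecisionVertical(CtuLikeObject&)   {}
void deblockCb(CtuLikeObject&)                      {}
void saoCb(CtuLikeObject&)                          {}
void lumaFilterVerticalEdges(CtuLikeObject&)        {}
void lumaBoundaryDecisionHorizontal(CtuLikeObject&) {}
void deblockCr(CtuLikeObject&)                      {}
void saoCr(CtuLikeObject&)                          {}
void lumaFilterHorizontalEdges(CtuLikeObject&)      {}
void saoLuma(CtuLikeObject&)                        {}

// Fused loop filtering of one CTU-like decoding object, in the order of Fig. 7:
// chroma deblocking and SAO are interleaved with the luma boundary decisions
// and edge filtering so that each sample is touched as soon as its inputs are ready.
void fusedLoopFilter(CtuLikeObject& obj) {
    lumaBoundaryDecisionVertical(obj);   // step 1
    deblockCb(obj);                      // step 2
    saoCb(obj);                          // step 3
    lumaFilterVerticalEdges(obj);        // step 4
    lumaBoundaryDecisionHorizontal(obj); // step 5
    deblockCr(obj);                      // step 6
    saoCr(obj);                          // step 7
    lumaFilterHorizontalEdges(obj);      // step 8
    saoLuma(obj);                        // step 9
}
```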
Fig. 8 shows the pixel decoding and reconstruction module and the loop filtering module using pipeline parallelism. Adopting the pipeline parallelism shown in Fig. 8 between the two modules effectively reduces thread waiting time and improves decoding efficiency: once the data dependencies are satisfied, the pixel decoding and reconstruction module and the loop filter can work at the same time, without the filtering thread having to wait for all decoding tasks of the pixel decoding and reconstruction module to finish, which further reduces thread waiting time and improves decoding efficiency.
Fig. 9 shows the flow chart of the multi-level efficient parallel decoding algorithm. The method specifically comprises the following steps (a simplified code sketch of the main-thread scheduling follows step 10):
Step 1: the main thread completes the initialization work, including initializing the HEVC decoder, allocating storage units, initializing the cache, initializing the decoding task queue, and emptying the task queue;
Step 2: reading the HEVC-coded sequence bitstream, calling the network abstraction layer (NAL) parsing function, and parsing the encapsulated parameter information, including parameter sets such as the picture parameter set (PPS), sequence parameter set (SPS), video parameter set (VPS), and supplemental enhancement information (SEI), as well as the slice header information of the picture; this information includes the profile, level, picture frame type, picture width and height, and loop filtering parameters required for decoding, and is then stored in the decoded picture object structure;
Step 3: performing entropy decoding according to the parameter information obtained from the network abstraction layer parsing in step 2. The picture frame type is checked first: if an I frame or a P frame is detected, the frame is entropy decoded; if mutually independent B frames of the same level are detected, threads in the thread pool are invoked to perform frame-level parallel entropy decoding. Entropy decoding converts the input binary sequence into the syntax elements representing the video sequence, according to which the subsequent modules perform pixel reconstruction, filtering, and so on.
Step 4: obtaining the number of CTU rows of the current frame according to the syntax elements obtained by entropy decoding in step 3, creating as many threads in the thread pool as there are CTU rows in the current frame, binding each thread to a different core through the multi-core function library, and then entering the main decoding loop; steps 5 to 8 are executed to perform pixel decoding and reconstruction and obtain the reconstructed frame;
Step 5: creating and initializing the CTU dependency table according to the total number of CTU units in the current frame (the dependency table must be re-initialized at the beginning of each frame); the coordinates of each element in the dependency table correspond to the coordinates of a CTU unit in the frame, and the value of each element represents the number of adjacent CTU units on which the CTU at that position depends; for a given CTU unit, each time one of the CTU units it depends on finishes decoding, one of its dependencies is satisfied, so the corresponding element in the CTU dependency table is decremented by 1; when the element value reaches 0, the CTU unit has no remaining dependencies on its adjacent CTUs; the first CTU unit is then added to the queue to be decoded;
Step 6: if the queue to be decoded is not empty, there is a CTU unit waiting to be decoded; if the thread pool is not empty, there is an idle thread (core); in that case the CTU unit to be decoded can be assigned to an idle thread (core) for decoding;
Step 7: checking whether any thread has returned a message; a returned message indicates that decoding has finished, and the thread returns to the idle state and rejoins the thread pool to wait for the next task assignment; the decoded CTU unit is obtained from the returned message and the CTU dependency table is updated; if an updated entry drops to 0, the corresponding CTU can start decoding and is therefore added to the queue to be decoded;
Step 8: whenever the pixel decoding and reconstruction of a CTU unit in the image frame is completed, a slave thread is arranged to perform the fused loop filtering operation on it, i.e., deblocking filtering and sample adaptive offset (SAO) are coupled, taking a CTU-like decoding object as the processing unit, as shown in Fig. 7. After the processing is finished, the thread returns to the task queue and waits until the next CTU unit completes pixel decoding and reconstruction, and then performs fused loop filtering on that CTU unit;
Step 9: a thread is arranged to perform pixel decoding and reconstruction on the CTU units of the next image frame whose dependencies are satisfied, and the above steps are repeated;
Step 10: after a frame of the video bitstream has been decoded, checking whether the whole video bitstream has been decoded; if decoding is complete, all resources are released and the thread pool is destroyed; otherwise, return to step 5.
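As referenced in the step list above, the per-frame scheduling of steps 5 to 8 can be summarized by the following sketch. It reuses initCtuDependencyTable and onCtuDecoded from the earlier dependency-table sketch, while dispatchReconstruction, dispatchFusedLoopFilter, and collectFinishedCtus are hypothetical hooks into the thread pool and the message mechanism, assumed only for this illustration.

```cpp
#include <utility>
#include <vector>

// Dependency-table helpers from the earlier sketch.
std::vector<std::vector<int>> initCtuDependencyTable(int rows, int cols);
void onCtuDecoded(std::vector<std::vector<int>>& dep, int r, int c,
                  std::vector<std::pair<int, int>>& ready);

// Hypothetical hooks into the decoder and the thread pool; only the scheduling
// skeleton below is meant to be illustrative.
struct Ctu { int row, col; };
void dispatchReconstruction(Ctu ctu);      // hand a ready CTU to an idle slave thread
void dispatchFusedLoopFilter(Ctu ctu);     // schedule filtering of a reconstructed CTU
std::vector<Ctu> collectFinishedCtus();    // completion messages returned by slave threads

void decodeOneFrame(int ctuRows, int ctuCols) {
    // Step 5: build the dependency table and seed the ready set with the first CTU.
    auto dep = initCtuDependencyTable(ctuRows, ctuCols);
    std::vector<std::pair<int, int>> ready = {{0, 0}};
    int remaining = ctuRows * ctuCols;

    while (remaining > 0) {
        // Steps 6-7: assign every ready CTU to an idle slave thread.
        for (auto [r, c] : ready) dispatchReconstruction({r, c});
        ready.clear();

        // Steps 7-8: when a CTU finishes reconstruction, update the table,
        // release its neighbours, and pipeline it into fused loop filtering.
        for (Ctu done : collectFinishedCtus()) {
            onCtuDecoded(dep, done.row, done.col, ready);
            dispatchFusedLoopFilter(done);
            --remaining;
        }
    }
}
```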
Fig. 10 shows the decoding frame rates obtained by multi-core parallel decoding of different full-high-definition 1080p video sequences on the multi-core processor with QP = 32; decoding performance is measured by the frame rate (fps, the number of frames decoded per second).
Specific implementation case: the Tilera GX36 multi-core processor, which consists of 36 tile cores, is used as the experimental platform; the Tilera multi-core processor comes with a complete set of multi-core development tools, which facilitates implementing multi-core parallel programs. To verify the effect of the method of the invention, the following experiments were performed: three video sequences with a resolution of 1920x1080, namely "BasketballDrive", "Cactus", and "Kimono", were decoded with the proposed method at QP = 32. The most complex random access (RA) configuration was selected for video coding, and the CTU size was set to 64x64. The decoding method achieves efficient multi-core parallel decoding on the Tilera multi-core processor, and the experimental results are shown in Table 1. Meanwhile, a comparative analysis was made against the HEVC parallel decoding algorithm combining task-level and data-level parallelism proposed in 2018 by Han Feng of the Image Processing and Image Communication Laboratory of Nanjing University of Posts and Telecommunications. As shown in Fig. 10, MLP denotes the parallel decoding algorithm combining task-level and data-level parallelism, and SMLP denotes the multi-level parallel decoding method implemented by the invention.
TABLE 1 results of the experiment
(Table 1 is reproduced as an image in the original publication: decoding frame rates for different numbers of cores.)
As can be seen from Table 1, with single-core decoding the high-definition video decoding speed is limited and real-time decoding cannot be achieved. With 10 cores, the multi-level parallel efficient decoding algorithm meets the real-time decoding requirement. As the number of cores continues to increase, the decoding speed keeps rising, reaching a maximum of more than 59 fps.
The experimental comparison in Fig. 10 shows that, compared with the parallel decoding algorithm combining task-level and data-level parallelism, the CTU-based multi-level parallel decoding algorithm greatly improves decoding efficiency in the multi-core case; with 24 cores, the average decoding efficiency is improved by about 10%.
Combining the experimental results of Table 1 and Fig. 10, it can be seen that:
(1) the multi-level parallel efficient decoding algorithm provided by the invention can decode high-definition video in real time on a multi-core processor;
(2) compared with a parallel decoding algorithm combining task-level and data-level parallelism, the multi-level parallel efficient decoding algorithm provided by the invention greatly improves decoding efficiency in the multi-core case.

Claims (7)

1. An HEVC multi-level parallel decoding method on a multi-core processor platform, characterized in that a main thread controls the decoding process, a plurality of slave threads decode CTU units independently in parallel, and each thread is bound to an available core of the multi-core processor, thereby realizing efficient parallel decoding on the multi-core processor platform, the method comprising the following steps:
Step 1: the main thread completes the initialization work, including initializing the HEVC decoder, allocating storage units, initializing the cache, initializing the decoding task queue, and emptying the task queue;
Step 2: reading the HEVC-coded sequence bitstream, calling the network abstraction layer (NAL) parsing function, and parsing the encapsulated parameter information to obtain the profile, level, picture frame type, picture size parameters, and loop filtering parameters required for decoding;
Step 3: performing entropy decoding according to the parameter information obtained from the network abstraction layer parsing in step 2, to obtain the syntax elements representing the video sequence;
Step 4: obtaining the number of CTU rows in the current frame according to the syntax elements obtained by entropy decoding in step 3, creating as many threads in a thread pool as there are CTU rows in the current frame, and binding each thread to a different core through the multi-core function library;
Step 5: establishing and initializing a CTU dependency table according to the total number of CTU units in the current frame, wherein the coordinates of each element in the dependency table correspond to the coordinates of a CTU unit in the frame, and the value of each element represents the number of adjacent CTU units on which the CTU at that position depends; for a given CTU unit, each time one of the CTU units it depends on finishes decoding, one of its dependencies is satisfied, so the corresponding element in the CTU dependency table is decremented by 1; when the element value reaches 0, the CTU unit has no remaining dependencies on its adjacent CTUs;
Step 6: adding the first CTU unit in the dependency table to the task queue to be decoded;
Step 7: judging whether the queue to be decoded is empty; if it is not empty, judging whether the thread pool is empty; if the thread pool is not empty, an idle slave thread receives a task assignment from the main thread and executes a decoding task from the task queue to be decoded; if the thread pool is empty, waiting for a slave thread to return to the idle state and for task assignment; if the queue to be decoded is empty, executing step 9;
Step 8: the main thread obtains the decoded CTU unit according to the information returned by the slave thread, updates the dependency table, and adds any CTU unit whose element value in the dependency table becomes 0 to the task queue to be decoded;
Step 9: when a slave thread completes the pixel decoding and reconstruction task of a CTU unit, pipeline parallelism is adopted: the slave thread takes a new decoding task from the task queue to be decoded and continues performing pixel decoding and reconstruction, while another idle slave thread is scheduled from the thread pool to perform fused loop filtering on the CTU unit that has just completed pixel decoding and reconstruction, once the data dependencies are satisfied; steps 8 and 9 are repeated until the current frame has been decoded, and then step 10 is executed;
Step 10: checking whether the video bitstream has been completely decoded; if decoding is complete, releasing all resources and destroying the thread pool; otherwise, reading the next frame and executing step 5.
2. The HEVC multi-level parallel decoding method on a multi-core processor platform according to claim 1, wherein the HEVC decoder in step 1 uses coding units (CUs) organized in a recursive quadtree hierarchy.
3. The HEVC multilevel parallel decoding method on the multi-core processor platform according to claim 1, wherein: in step 2, the parameter information includes a picture parameter set PPS, a video parameter set VPS, a sequence parameter set SPS, supplemental enhancement information SEI, and Slice header information of the picture.
4. The HEVC multi-level parallel decoding method on a multi-core processor platform according to claim 1, wherein the slave thread in step 7 receiving the task assignment from the main thread and executing the decoding task from the task queue to be decoded specifically comprises the following steps: obtaining the CTU unit information to be decoded and the required information of neighbouring CTU units from the external memory through Cache communication; decoding the CTU unit; writing the decoded CTU unit back to the external memory through Cache communication; and notifying the main thread that decoding has finished.
5. The HEVC multilevel parallel decoding method on the multi-core processor platform according to claim 4, wherein: and if the CTU unit needing to be decoded is the inter-frame CTU unit, acquiring the pixel data of the reference CTU unit from an external memory through Cache communication.
6. The HEVC multi-level parallel decoding method on a multi-core processor platform according to claim 1, wherein the fused loop filtering in step 9 comprises deblocking filtering and SAO filtering, and specifically comprises: according to the data dependencies among the luma component, the chroma components, and the sample adaptive offset (SAO) in the deblocking filtering process, constructing a new CTU-like decoding object as the processing object of the loop filtering, wherein the CTU-like decoding object is composed of samples from the current CTU unit and from its upper-left and upper CTU units; after deblocking filtering is finished, the range of the current CTU-like decoding object is re-partitioned for processing the pixel samples in SAO filtering, with the extent of the CTU-like decoding object shifted left by one column and up by one row of pixel samples for the SAO-filtered samples.
7. The HEVC multi-level parallel decoding method on a multi-core processor platform according to claim 1, wherein the task queue to be decoded in step 6 is a first-in first-out queue that stores the CTU units to be decoded; after a slave thread completes a CTU unit decoding task, a new decoding task is taken from the head of the task queue to be decoded and assigned to the decoding thread.
CN201910752152.XA 2019-08-15 2019-08-15 HEVC (high efficiency video coding) multi-level parallel decoding method on multi-core processor platform Active CN110337002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910752152.XA CN110337002B (en) 2019-08-15 2019-08-15 HEVC (high efficiency video coding) multi-level parallel decoding method on multi-core processor platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910752152.XA CN110337002B (en) 2019-08-15 2019-08-15 HEVC (high efficiency video coding) multi-level parallel decoding method on multi-core processor platform

Publications (2)

Publication Number Publication Date
CN110337002A CN110337002A (en) 2019-10-15
CN110337002B true CN110337002B (en) 2022-03-29

Family

ID=68149626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910752152.XA Active CN110337002B (en) 2019-08-15 2019-08-15 HEVC (high efficiency video coding) multi-level parallel decoding method on multi-core processor platform

Country Status (1)

Country Link
CN (1) CN110337002B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111562948B (en) * 2020-06-29 2020-11-10 深兰人工智能芯片研究院(江苏)有限公司 System and method for realizing parallelization of serial tasks in real-time image processing system
CN111986070B (en) * 2020-07-10 2021-04-06 中国人民解放军战略支援部队航天工程大学 VDIF format data heterogeneous parallel framing method based on GPU
CN112468821B (en) * 2020-10-27 2023-02-10 南京邮电大学 HEVC core module-based parallel decoding method, device and medium
CN116841739B (en) * 2023-06-30 2024-04-19 沐曦集成电路(杭州)有限公司 Data packet reuse system for heterogeneous computing platforms

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103974081A (en) * 2014-05-08 2014-08-06 杭州同尊信息技术有限公司 HEVC coding method based on multi-core processor Tilera
CN105992008A (en) * 2016-03-30 2016-10-05 南京邮电大学 Multilevel multitask parallel decoding algorithm on multicore processor platform
CN107454406A (en) * 2017-08-18 2017-12-08 深圳市佳创视讯技术股份有限公司 The live high-speed decoding method of VR panoramic videos and system based on AVS+

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160191935A1 (en) * 2014-04-22 2016-06-30 Mediatek Inc. Method and system with data reuse in inter-frame level parallel decoding

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103974081A (en) * 2014-05-08 2014-08-06 杭州同尊信息技术有限公司 HEVC coding method based on multi-core processor Tilera
CN105992008A (en) * 2016-03-30 2016-10-05 南京邮电大学 Multilevel multitask parallel decoding algorithm on multicore processor platform
CN107454406A (en) * 2017-08-18 2017-12-08 深圳市佳创视讯技术股份有限公司 The live high-speed decoding method of VR panoramic videos and system based on AVS+

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Efficient In-Loop Filtering Across Tile Boundaries for Multi-Core HEVC Hardware Decoders With 4K/8K-UHD Video Applications; Seunghyun Cho et al.; IEEE Transactions on Multimedia; 2015-04-01; full text *
Design and Implementation of HEVC Parallel Video Encoding Based on the TILE-Gx36 Multi-core Processor; Gu Tao; China Master's Theses Full-text Database; 2019-02-28; full text *
HEVC Parallel Decoding Technique and Implementation Combining Task-level and Data-level Parallelism Based on a Multi-core Processor; Han Feng; China Master's Theses Full-text Database; 2019-02-28; full text *

Also Published As

Publication number Publication date
CN110337002A (en) 2019-10-15

Similar Documents

Publication Publication Date Title
CN110337002B (en) HEVC (high efficiency video coding) multi-level parallel decoding method on multi-core processor platform
US8218641B2 (en) Picture encoding using same-picture reference for pixel reconstruction
US10230986B2 (en) System and method for decoding using parallel processing
JP7483035B2 (en) Video decoding method and video encoding method, apparatus, computer device and computer program thereof
US8213518B1 (en) Multi-threaded streaming data decoding
CN105992008B (en) A kind of multi-level multi-task parallel coding/decoding method in multi-core processor platform
US20090010337A1 (en) Picture decoding using same-picture reference for pixel reconstruction
CN108449603B (en) Based on the multi-level task level of multi-core platform and the parallel HEVC coding/decoding method of data level
CN112468821B (en) HEVC core module-based parallel decoding method, device and medium
Wang et al. Paralleling variable block size motion estimation of HEVC on multi-core CPU plus GPU platform
CN105791829A (en) HEVC parallel intra-frame prediction method based on multi-core platform
CN104521234B (en) Merge the method for processing video frequency and device for going block processes and sampling adaptive migration processing
Wang et al. Intra block copy in AVS3 video coding standard
CN101841722B (en) Detection method of detection device of filtering boundary strength
CN112422986B (en) Hardware decoder pipeline optimization method and application
WO2024098821A1 (en) Av1 filtering method and apparatus
Gudumasu et al. Software-based versatile video coding decoder parallelization
CN116600134A (en) Parallel video compression method and device adapting to graphic engine
CN102595137A (en) Fast mode judging device and method based on image pixel block row/column pipelining
Jiang et al. GPU-based intra decompression for 8K real-time AVS3 decoder
CN102075753B (en) Method for deblocking filtration in video coding and decoding
Yan et al. Parallel deblocking filter for H. 264/AVC implemented on Tile64 platform
CN102090064A (en) High performance deblocking filter
WO2012171401A1 (en) Parallel filtering method and apparatus
Han et al. A real-time ultra-high definition video decoder of AVS3 on heterogeneous systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant