CN110337002A - Multi-level efficient parallel HEVC decoding algorithm on a multi-core processor platform - Google Patents

Multi-level efficient parallel HEVC decoding algorithm on a multi-core processor platform

Info

Publication number
CN110337002A
Authority
CN
China
Prior art keywords
decoding
ctu
thread
unit
decoded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910752152.XA
Other languages
Chinese (zh)
Other versions
CN110337002B (en)
Inventor
胡栋
张文祥
李毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201910752152.XA
Publication of CN110337002A
Application granted
Publication of CN110337002B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/44 Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91 Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/96 Tree coding, e.g. quad-tree coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a multi-level efficient parallel decoding algorithm based on a multi-core processor. Addressing the huge data volume of HD video and the very high processing complexity of HEVC decoding, the method makes full use of the dependencies in HEVC data and proposes a multi-level parallel decoding algorithm on a multi-core processor platform. First, in the pixel decoding and reconstruction module, a wavefront parallel algorithm based on CTU units is implemented by exploiting the data dependencies between CTU units. Second, in the loop filtering module, a fused loop filtering algorithm is implemented by exploiting the data dependencies between deblocking filtering and sample adaptive offset (SAO). Finally, pipeline parallelism is introduced between the two modules, yielding the decoder's multi-level efficient parallel decoding algorithm. During decoding, each task is executed by an independent thread bound to its own core, which takes full advantage of the parallel computing performance of the multi-core processor and improves decoding efficiency.

Description

A multi-level efficient parallel HEVC decoding algorithm on a multi-core processor platform
Technical field
The present invention relates to the field of digital video signal encoding and decoding, and in particular to a multi-level, multi-task efficient parallel decoding algorithm on a multi-core processor platform.
Background art
With the development of the mobile Internet and the continuing progress of Internet video applications, and in order to meet the growing demand for high-definition (HD) and similar video, the Joint Collaborative Team on Video Coding (JCT-VC), formed by MPEG and VCEG in 2010, jointly developed the new international video coding standard HEVC (High Efficiency Video Coding), which formally became an international standard in January 2013. HEVC aims to improve video coding efficiency: at the same picture quality, its compression ratio is double that of the H.264/AVC High profile. In view of the huge data volume and the codec complexity of HD video, the HEVC standard introduces several parallel processing tools, such as Tiles, which use a block as the parallel granularity, and WPP (wavefront parallel processing), which effectively improve codec performance. Meanwhile, multi-core processors have developed considerably in recent years, and the effective combination of the two is an important factor in the successful application of HEVC technology.
Scholars at home and abroad have studied video codec standards on multi-core processors. Internationally, Hyunki Baik et al., in the paper "A Complexity-based Adaptive Tile Partitioning Algorithm For HEVC Decoder Parallelization" (2015), proposed an effective adaptive tile partitioning algorithm for HEVC decoder parallelization, in which a video frame is divided into tiles that are processed independently and simultaneously on multiple cores. However, this method requires the encoder to use a coding mode with independent tile partitions, so the decoder is tightly coupled to the encoder, which is a significant limitation. HyunMi Kim et al., in the paper "An Efficient Architecture of In-Loop Filters for Multicore Scalable HEVC Hardware Decoders" (2018), proposed an efficient HEVC in-loop filter (ILF) architecture that makes effective use of multiple cores for ultra-high-definition video applications; the novel storage organization and management techniques they proposed solve the data dependency problems between multiple processing units.
Domestic scholars have also proposed decoding methods for multi-core platforms. Ma Aidi et al. (2014), of the School of Information and Communication Engineering, Dalian University of Technology, proposed an HEVC parallel decoder based on a hybrid CPU+GPU platform, which uses the CUDA hardware platform and exploits hardware advantages for system optimization. Fang Di et al. (2016), based on an analysis of the data dependencies between CTU units during decoding, proposed a multi-level parallel HEVC decoding method based on the Tilera multi-core processor. Han Feng et al. (2018), based on the relationship between data and task levels during decoding, proposed an HEVC parallel decoding method on a multi-core processor that combines task-level and data-level parallelism.
Previous research on multi-core platforms includes overlapped wavefront (OWF) parallelism, pipelined thread pool techniques, the 3D-WPP algorithm, algorithms combining data-level and task-level parallelism, and fast deblocking filtering algorithms. Although these bring large improvements in certain respects, they do not take into account the large differences in computational complexity between CTUs caused by differences in local texture complexity within each frame of a video sequence, nor the cost of communication between cores and caches for the deblocking filtering and sample adaptive offset modules, and they do not make full use of the core resources of a multi-core platform.
Summary of the invention
The technical problem to be solved by the present invention is to further improve decoding efficiency in the real-time decoding of a single HEVC HD bitstream while preserving decoded image quality.
The technical solution adopted by the present invention is a multi-level efficient parallel HEVC decoding algorithm on a multi-core processor platform. A main thread controls the decoding process, and multiple slave threads decode CTU units independently and in parallel. Each thread is bound to an available core of the multi-core processor, achieving efficient parallel decoding on the multi-core platform. The algorithm comprises the following steps:
Step 1: The main thread performs initialization, including initializing the HEVC decoder, allocating register units, initializing the cache, initializing the decoding task queue, and emptying the task queue;
Step 2: Read in the HEVC-encoded sequence bitstream and call the network abstraction layer (NAL) parsing function to parse the encapsulated parameter information, obtaining the profile, level, image frame type, image size parameters, and loop filtering parameters required for decoding;
Step 3: Perform entropy decoding according to the parameter information produced by NAL parsing in step 2, obtaining the syntax elements that represent the video sequence;
Step 4: From the syntax elements obtained by entropy decoding in step 3, obtain the number of CTU rows in the current frame, create in the thread pool as many threads as there are CTU rows in the current frame, and bind each thread to a different core through the multi-core function library;
Step 5: Build and initialize a CTU dependency table according to the total number of CTU units in the current frame. The coordinates of each element in the table correspond to the coordinates of a CTU unit in the frame, and the value of each element is the number of adjacent CTU units on which the CTU at that position depends. For a given CTU unit, whenever one of the CTU units it depends on finishes decoding, one of its dependencies is satisfied, so its element value in the CTU dependency table is decremented by 1. When the element value reaches 0, the CTU unit no longer depends on any adjacent CTU unit;
Step 6: Add the first CTU unit in the dependency table to the to-be-decoded task queue;
Step 7: Determine whether the to-be-decoded queue is empty. If it is not empty, determine whether the thread pool is empty. If the thread pool is not empty, an idle slave thread accepts a task assigned by the main thread and executes a decoding task from the to-be-decoded task queue; if the thread pool is empty, wait until a slave thread returns to the idle state and then assign the task. If the to-be-decoded queue is empty, execute step 9;
Step 8: From the messages returned by the slave threads, the main thread learns which CTU units have finished decoding, updates the dependency table, and adds the CTU units whose element value in the table is 0 to the to-be-decoded task queue;
Step 9: Using pipeline parallelism, after a slave thread completes the pixel decoding and reconstruction task of a CTU unit, it takes a new decoding task from the to-be-decoded task queue and continues with pixel decoding and reconstruction; when the data dependencies are satisfied, another idle slave thread is scheduled from the thread pool to perform fused loop filtering on the CTU unit whose pixel decoding and reconstruction has just been completed. Repeat steps 8 and 9 until decoding of the current frame is complete, then execute step 10;
Step 10: Check whether the entire video bitstream has been decoded. If so, release all resources and destroy the thread pool; otherwise, read the next frame and execute step 5.
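The main-thread loop of steps 4 to 8 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `prerequisites`, `decode_frame`, and `decode_ctu` are hypothetical, the dependency rule (left neighbor plus upper-right neighbor) is the one described for the CTU dependency table, the handling of a one-column frame is an assumption, and binding threads to cores is omitted because it is platform specific.

```python
from collections import deque
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def prerequisites(y, x, cols):
    """CTUs that must finish before CTU (y, x) may start (assumed wavefront rule)."""
    deps = []
    if x > 0:
        deps.append((y, x - 1))        # left neighbor
    if y > 0 and x + 1 < cols:
        deps.append((y - 1, x + 1))    # upper-right neighbor
    elif y > 0 and cols == 1:
        deps.append((y - 1, x))        # assumption: one-column frame waits on the CTU above
    return deps


def decode_frame(rows, cols, decode_ctu):
    """Decode one frame; returns CTU positions in the order they completed."""
    prereq = {(y, x): prerequisites(y, x, cols)
              for y in range(rows) for x in range(cols)}
    remaining = {pos: len(deps) for pos, deps in prereq.items()}  # dependency table
    dependents = {pos: [] for pos in prereq}
    for pos, deps in prereq.items():
        for dep in deps:
            dependents[dep].append(pos)

    ready = deque([(0, 0)])            # FIFO to-be-decoded queue, seeded with the first CTU
    running = {}                       # future -> CTU position
    order = []
    with ThreadPoolExecutor(max_workers=rows) as pool:  # one slave thread per CTU row
        while ready or running:
            # Hand ready CTUs to idle slave threads.
            while ready and len(running) < rows:
                pos = ready.popleft()
                running[pool.submit(decode_ctu, pos)] = pos
            # Wait for at least one completion message, then update the table.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                pos = running.pop(future)
                future.result()        # re-raise any decoding error
                order.append(pos)
                for nxt in dependents[pos]:
                    remaining[nxt] -= 1
                    if remaining[nxt] == 0:
                        ready.append(nxt)
    return order
```

Calling `decode_frame(3, 4, lambda pos: pos)` with a stand-in decoder visits all twelve CTUs; the exact order can vary between runs, but a CTU never completes before the CTUs it depends on.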
Further, the HEVC decoder in step 1 includes coding units (CUs) with a recursive hierarchical structure based on a quadtree.
Further, in step 2, the parameter information includes the picture parameter set (PPS), the video parameter set (VPS), the sequence parameter set (SPS), supplemental enhancement information (SEI), and the slice header information of the image.
Further, in step 7, the slave thread accepting a task assigned by the main thread and executing a decoding task from the to-be-decoded task queue specifically comprises the following steps: obtaining from external memory, through cache communication, the information of the CTU unit to be decoded and the required information of its neighboring CTU units; decoding the CTU unit; writing the decoded CTU unit back to external memory through cache communication; and notifying the main thread that decoding has finished.
Further, if the CTU unit to be decoded is an inter-coded CTU unit, the pixel data of its reference CTU unit is obtained from external memory through cache communication.
Further, the fused loop filtering in step 9 includes deblocking filtering and SAO filtering, and specifically comprises the following steps: according to the data dependencies among the luma component, the chroma components, and sample adaptive offset during deblocking filtering, a new CTU-like decoding object is constructed as the processing object of loop filtering, the CTU-like decoding object containing samples from the current CTU unit and from its upper-left and upper CTU units; after deblocking filtering has been performed, the range of the current CTU-like decoding object is repartitioned for processing the pixel samples in SAO filtering; the range of the CTU-like decoding object for SAO filtering is shifted one column to the left and one row up.
Further, the to-be-decoded task queue in step 6 is a first-in-first-out queue that stores the CTU units to be decoded; after a slave thread completes a CTU unit decoding task, a new decoding task is taken from the head of the to-be-decoded task queue and assigned to that decoding thread.
Beneficial effects: By using a multi-threaded design, the present invention improves the real-time responsiveness of the program, improves the program's design structure, makes more effective use of the processor, reduces frequent scheduling and switching of system resources, and reduces the overhead of creating and destroying thread objects. When multiple threads access shared resources, lock and unlock operations together with condition variables coordinate correct concurrent operation, improving the overall decoding efficiency of the system. Compared with the prior art, the invention has the following advantages:
The present invention is a multi-level efficient parallel decoding algorithm on a multi-core processor platform. Within the constraints of the original parallel framework, the HEVC decoder is divided into a pixel decoding and reconstruction part and a loop filtering part. The pixel decoding and reconstruction module designs and implements a wavefront parallel algorithm based on CTU units; the loop filtering module designs and implements a fused loop filtering algorithm; and pipeline parallelism between the two modules exploits the high parallel computing performance of the multi-core processor. Experimental results show that the present invention performs well in aspects such as parallelism improvement and the multi-core parallel architecture level, and achieves real-time, efficient decoding of a single full-HD 1080P bitstream that was encoded without any parallel coding mode.
Brief description of the drawings
Fig. 1 is a block diagram of the HEVC decoding process;
Fig. 2 is a schematic diagram of the CTU dependency table;
Fig. 3 is a schematic diagram of the system design model for CTU-based wavefront parallel decoding;
Fig. 4 is a schematic diagram of cache interaction in the CTU-based wavefront parallel decoding algorithm;
Fig. 5 is a schematic diagram of the data dependencies among the luma component, the chroma components, and sample adaptive offset in deblocking filtering;
Fig. 6 is a schematic diagram of the data dependencies between deblocking filtering and sample adaptive offset;
Fig. 7 is a schematic diagram of fused loop filtering;
Fig. 8 is a schematic diagram of pipeline parallelism between the pixel decoding and reconstruction module and the loop filtering module;
Fig. 9 is a flowchart of the multi-level efficient parallel decoding algorithm;
Fig. 10 shows the experimental comparison between the multi-level efficient parallel decoding algorithm and the decoding algorithm that combines task-level and data-level parallelism.
Detailed description of the embodiments
The basic idea of the invention is to use the high parallel computing performance of a multi-core processor while fully taking into account the differences in computational complexity between local CTUs in each frame of the video sequence and the cost of communication between cores and caches for the deblocking filtering and sample adaptive offset modules. HEVC decoding is divided into two parts, pixel decoding and reconstruction and loop filtering, and multi-level efficient parallel decoding is applied.
Embodiment:
This embodiment exploits the very high parallel computing performance of a multi-core processor to achieve real-time parallel decoding of HEVC HD video.
Fig. 1 shows the block diagram of the HEVC decoder. The encoded binary bitstream is first entropy decoded to obtain the quantized coefficients and control information; the quantized coefficients are then inverse quantized and inverse transformed to obtain the residual information. Next, the decoder uses the control information to perform intra prediction and inter prediction, and the prediction is combined with the recovered residual. The result is then processed by the loop filters, deblocking filtering and sample adaptive offset, to produce the output image.
The basic structure of HEVC coding is almost the same as that of H.264/AVC, but HEVC improves performance through in-depth optimization of each module and a series of innovations in certain design elements. Among these, the new features most important for HD video codec performance are coding units (CUs) with a recursive quadtree-based hierarchical structure and, to cope with the huge data volume of HD video, several parallelization tools. This embodiment uses the CTU unit of an image frame as the parallel granularity, designs a CTU dependency table and a decoding method, decodes the CTU units in parallel, creates a task queue, and distributes tasks to the threads on the Tile cores for multi-core parallel processing.
Fig. 2 shows the CTU dependency table and its initial values. The size of the table is the total number of CTU units in a frame, and each element records the number of adjacent CTU units on which the corresponding CTU depends. Whenever a CTU unit that a given CTU unit depends on finishes decoding, the corresponding element value in the table is decremented by 1. When the element value drops to 0, the CTU unit has no remaining dependencies on adjacent CTUs and is ready to be decoded, so the main thread can add it to the decoding task queue. The table records the following. The CTU unit in the upper-left corner is the first in the frame and does not depend on any other CTU unit of the current frame. A CTU unit in the first row depends only on the CTU unit to its left. A CTU unit in the first column depends on the CTU units above it and to its upper right, but the upper-right CTU unit is always decoded later than the one above, so it can be recorded as depending only on the upper-right CTU unit; that is, it can start decoding once its upper-right CTU unit has been decoded. A CTU unit in the last column depends on the CTU units to its upper left, above it, and to its left, but the left CTU unit is always decoded later than the one above, so it can be recorded as depending only on the left CTU unit; that is, it can start decoding once its left CTU unit has been decoded. Every other CTU unit depends on the CTU units to its upper left, above, upper right, and left. The upper-right and left CTU units are always decoded later than the upper-left and upper ones, but there is no direct ordering between the two, so both are recorded as dependencies of the current CTU unit.
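The initial values described for Fig. 2 can be reproduced with a few lines of code. The sketch below is illustrative only: `build_dependency_table` is a hypothetical name, and the treatment of a frame that is only one CTU wide (each CTU waits on the one above it) is an assumption, since the text does not cover that case.

```python
def build_dependency_table(rows, cols):
    """Initial CTU dependency counts for a frame of rows x cols CTUs."""
    table = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            if y == 0 and x == 0:
                count = 0      # first CTU of the frame: no dependencies
            elif y == 0:
                count = 1      # first row: left neighbor only
            elif x == 0:
                count = 1      # first column: upper-right only (above, if one column wide)
            elif x == cols - 1:
                count = 1      # last column: left neighbor only
            else:
                count = 2      # interior: left and upper-right neighbors
            table[y][x] = count
    return table


for row in build_dependency_table(3, 4):
    print(row)
# [0, 1, 1, 1]
# [1, 2, 2, 1]
# [1, 2, 2, 1]
```

Only the upper-left CTU starts at 0, which is why it is the single entry placed in the to-be-decoded queue at the start of each frame.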
Fig. 3 shows the system design model for CTU-based wavefront parallel decoding. The upper-left corner is the CTU dependency table, which is built and continuously maintained; the lower-left corner is the execution task queue, implemented with a designed buffer structure; M and P1, P2, P3, ..., Pn are the threads, each bound to a core, where M is the main thread (core) and the P threads are slave threads (cores). The main thread (core) maintains the CTU dependency table to track the dependencies between CTU units. Once all the CTU units that a given CTU unit depends on have finished decoding, that CTU unit is ready and can be added to the task queue as a decoding task. As soon as a slave thread (core) is idle, a decoding task is taken from the task queue and assigned to it. The sole purpose of a slave thread (core) is to decode CTU units; when it finishes decoding a CTU unit, it enters the waiting state until the main thread (core) assigns a new decoding task. The CTU decoding task queue is a first-in-first-out queue: a CTU unit is added to the to-be-decoded task queue when it is ready to be decoded, and when a slave thread (core) completes a CTU decoding task, a new decoding task is taken from the head of the queue and assigned to that thread.
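A rough upper bound on the parallelism of this scheme can be worked out by hand. The sketch below assumes, purely for illustration, that every CTU takes one time step to decode, that a slave thread is always available, and that the frame is at least two CTUs wide. Under the left and upper-right dependency rule, CTU (row y, column x) can then start at step x + 2y, so a frame of R rows and C columns needs C + 2(R - 1) steps. The function name `wavefront_steps` is hypothetical.

```python
import math


def wavefront_steps(width, height, ctu_size=64):
    """Return (rows, cols, steps) for an idealized CTU wavefront over one frame."""
    cols = math.ceil(width / ctu_size)
    rows = math.ceil(height / ctu_size)
    steps = cols + 2 * (rows - 1)      # last CTU starts at step (cols - 1) + 2 * (rows - 1)
    return rows, cols, steps


rows, cols, steps = wavefront_steps(1920, 1080)
print(rows, cols, steps)                 # 17 30 62
print(round(rows * cols / steps, 2))     # 8.23
```

For a 1080P frame with 64 × 64 CTUs this gives 510 CTUs in 62 steps, an idealized speedup of about 8.2 over serial decoding. Real CTUs differ in decoding cost, so the figure is only a ceiling under the stated assumptions.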
Fig. 4 shows the cache interaction in the CTU-based wavefront parallel decoding algorithm. Because the CTU-based wavefront parallel algorithm uses a single CTU unit as the smallest decoding unit, data interaction and cache communication are more frequent than in earlier algorithms. Each slave thread proceeds as follows:
Step 1: Initialize and wait for the main thread (core) to start this slave thread from the thread pool with a new decoding task;
Step 2: Obtain the information of the CTU unit to be decoded from external memory through cache communication;
Step 3: Obtain from external memory, through cache communication, the information of the neighboring CTU units required by the current CTU unit;
Step 4: If the CTU unit is an inter-coded CTU unit, obtain the pixel data of its reference CTU unit from external memory through cache communication;
Step 5: Decode (reconstruct) the CTU unit;
Step 6: Write the decoded CTU unit back to external memory through cache communication;
Step 7: Notify the main thread (core) that decoding has finished.
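The seven slave-thread steps above map naturally onto a single worker function. The following sketch is hypothetical throughout: external memory is modeled as a plain dictionary, cache communication as dictionary access, and the names `decode_ctu_task`, `reconstruct`, and the dictionary keys are invented for illustration.

```python
import queue


def decode_ctu_task(external_memory, pos, done_queue, reconstruct):
    """Run steps 2 to 7 for the CTU at pos = (row, col)."""
    y, x = pos
    # Step 2: fetch the information of the CTU to be decoded.
    ctu_info = external_memory["bitstream"][pos]
    # Step 3: fetch the already reconstructed neighbors (None if absent).
    neighbor_positions = [(y, x - 1), (y - 1, x - 1), (y - 1, x), (y - 1, x + 1)]
    neighbors = {n: external_memory["recon"].get(n) for n in neighbor_positions}
    # Step 4: inter-coded CTUs also need the pixel data of their reference CTU.
    reference = None
    if ctu_info["inter"]:
        reference = external_memory["reference"][ctu_info["ref"]]
    # Step 5: decode (reconstruct) the CTU.
    recon = reconstruct(ctu_info, neighbors, reference)
    # Step 6: write the result back to external memory.
    external_memory["recon"][pos] = recon
    # Step 7: notify the main thread that decoding has finished.
    done_queue.put(pos)


memory = {
    "bitstream": {(0, 0): {"inter": False, "ref": None, "payload": 7}},
    "recon": {},
    "reference": {},
}
messages = queue.Queue()
decode_ctu_task(memory, (0, 0), messages, lambda info, nb, ref: info["payload"] * 2)
```

After the call, `memory["recon"][(0, 0)]` holds the stand-in reconstruction and `messages` contains the position `(0, 0)`, which is the message the main thread would use to update the dependency table.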
Fig. 5 shows the data dependencies among the luma component, the chroma components, and sample adaptive offset in deblocking filtering. Based on the data relationship between deblocking filtering and sample adaptive offset, the present invention proposes a repartitioning scheme that constructs a new CTU-like decoding object. The CTU-like decoding object contains all the data required by the deblocking filter, including the luma and both chroma components, and includes samples from the current CTU and from its upper-left and upper CTUs. As Fig. 5 shows, the size of the CTU-like decoding object remains 64 × 64, but its sample range is shifted four columns to the left and four rows up. The new 64 × 64 CTU-like decoding object is the actual processing object of the deblocking filter in this design.
Fig. 6 shows the data dependencies between deblocking filtering and sample adaptive offset. In the SAO filtering algorithm, the edge offset (EO) mode needs to reference adjacent samples, and the SAO source data is the output of the deblocking filter. Therefore, to implement the HEVC loop filter at the CTU level and couple the deblocking filter with the SAO filter, the current CTU-like decoding object must be repartitioned after deblocking filtering has been performed. This means that the sample range processed by the SAO filter is shifted one column to the left and one row up, as shown by the dashed box in Fig. 6.
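The shifted sample ranges can be made concrete with a small calculation. This is a sketch under stated assumptions: the SAO shift of one column and one row is taken to be relative to the deblocking window (five samples in total from the CTU origin), windows are clamped at the top and left frame borders only, and the names `window`, `DEBLOCK_SHIFT`, and `SAO_EXTRA_SHIFT` are hypothetical.

```python
CTU_SIZE = 64
DEBLOCK_SHIFT = 4       # four columns left, four rows up
SAO_EXTRA_SHIFT = 1     # assumed: one further column and row for SAO


def window(ctu_col, ctu_row, shift, size=CTU_SIZE):
    """Return (x0, y0, x1, y1) of the shifted window, with x1 and y1 exclusive."""
    left = ctu_col * size - shift
    top = ctu_row * size - shift
    return (max(left, 0), max(top, 0), left + size, top + size)


print(window(1, 1, DEBLOCK_SHIFT))                      # (60, 60, 124, 124)
print(window(1, 1, DEBLOCK_SHIFT + SAO_EXTRA_SHIFT))    # (59, 59, 123, 123)
print(window(0, 0, DEBLOCK_SHIFT))                      # (0, 0, 60, 60)
```

The SAO window trails the deblocking window by one sample in each direction, so every sample SAO reads, including the neighbors needed by the edge offset mode, has already been deblocked.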
Fig. 7 shows the processing flow of the fused loop filtering scheme. Step 1: perform the luma filtering decisions for the vertical edges of multiple 8x8 luma blocks simultaneously. Step 2: perform deblocking filtering of the 8x8 Cb block. Step 3: perform SAO on the 8x8 Cb block. Step 4: perform horizontal filtering of the vertical edges of multiple 8x8 luma blocks simultaneously. Step 5: perform the luma filtering decisions for the horizontal edges of multiple 8x8 luma blocks simultaneously. Step 6: perform deblocking filtering of the 8x8 Cr block. Step 7: perform SAO on the 8x8 Cr block. Step 8: perform vertical filtering of the horizontal edges of multiple 8x8 luma blocks simultaneously. Step 9: perform SAO on the 8x8 luma block. The above process is executed in a loop.
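One property of the nine-step order in Fig. 7 is that, for each color component, SAO runs only after all deblocking work on that component has finished, while chroma work is interleaved with the luma stages. The sketch below encodes the step order as data and checks that property; the stage labels and the function name `sao_follows_deblocking` are hypothetical.

```python
FUSED_STAGES = [
    ("luma", "vertical-edge decision"),     # step 1
    ("cb", "deblock"),                      # step 2
    ("cb", "sao"),                          # step 3
    ("luma", "vertical-edge filter"),       # step 4
    ("luma", "horizontal-edge decision"),   # step 5
    ("cr", "deblock"),                      # step 6
    ("cr", "sao"),                          # step 7
    ("luma", "horizontal-edge filter"),     # step 8
    ("luma", "sao"),                        # step 9
]


def sao_follows_deblocking(stages):
    """True if, for every component, SAO comes after all its deblocking stages."""
    for component in {comp for comp, _ in stages}:
        deblock = [i for i, (c, op) in enumerate(stages) if c == component and op != "sao"]
        sao = [i for i, (c, op) in enumerate(stages) if c == component and op == "sao"]
        if not deblock or not sao or min(sao) < max(deblock):
            return False
    return True


print(sao_follows_deblocking(FUSED_STAGES))   # True
```

Swapping steps 2 and 3, for example, would make the check fail, because Cb SAO would then read samples that have not yet been deblocked.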
Fig. 8 shows pipeline parallelism between the pixel decoding and reconstruction module and the loop filtering module. Applying pipeline parallelism between these two modules, as shown in Fig. 8, effectively reduces thread waiting time and improves decoding efficiency. With pipelining, pixel decoding and reconstruction and loop filtering can run at the same time whenever the data dependencies are satisfied; loop filtering threads no longer need to wait for the pixel decoding and reconstruction module to finish all of its decoding tasks, which further reduces thread waiting time and improves decoding efficiency.
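The benefit of pipelining the two modules can be illustrated with a simple timing model. The model is a deliberate simplification and not taken from the patent: one thread reconstructs CTUs in order, one thread filters them in order, and filtering of a CTU is assumed to depend only on that CTU's own reconstruction. The per-CTU costs are made-up numbers, and the function names are hypothetical.

```python
def sequential_time(recon_costs, filter_costs):
    """Filtering starts only after every CTU has been reconstructed."""
    return sum(recon_costs) + sum(filter_costs)


def pipelined_time(recon_costs, filter_costs):
    """Filtering of each CTU starts as soon as it is reconstructed and the filter is free."""
    recon_done = 0
    filter_done = 0
    for recon_cost, filter_cost in zip(recon_costs, filter_costs):
        recon_done += recon_cost
        filter_done = max(filter_done, recon_done) + filter_cost
    return filter_done


recon = [3, 2, 4, 3]     # illustrative reconstruction cost per CTU
filt = [1, 2, 1, 2]      # illustrative loop filtering cost per CTU
print(sequential_time(recon, filt))   # 18
print(pipelined_time(recon, filt))    # 14
```

In this toy case the filter thread hides most of its work behind reconstruction, and only the filtering of the last CTU extends the total time.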
Fig. 9 indicates multi-level efficient parallel decoding algorithm flow chart.Specifically according to the following steps:
Step 1: main thread completes the work of initialization, including initialization HEVC decoder, applies for register cell, initially Change caching, initializes decoding task queue, empty task queue;
Step 2: reading in the sequence code stream of HEVC coding, call network adaptation layer NAL (Network Abstract Layer) analytical function parses all kinds of parameter informations of encapsulation, PPS (Picture Parameter Set, figure including image As parameter set), SPS (Sequence Parameter Set, sequence parameter set), VPS (Video Parameter Set, video Parameter set), the parameter set informations such as SEI (Supplemental Enhancement Information, supplemental enhancement information) and The Slice head information of image, these information include profile (class) needed for understanding code, level (grade), picture frame class Then type, the width and height of image, the parameter information of loop filtering are saved in decoding image object structural body;
Step 3: according to all kinds of parameter informations that network adaptation layer parsing generates in step 2, carrying out entropy decoding.It first checks for Image frame types carry out the entropy decoding of the frame if detecting I frame or P frame;If detecting mutually independent B frame at the same level, adjust The parallel entropy decoding of frame level is carried out with the thread in thread pool, entropy decoding is that the binary sequence of input is decoded into for indicating to regard The syntactic element of frequency sequence, subsequent each module carry out pixel reconstruction, filtering etc. according to these syntactic elements.
Step 4, a series of syntactic elements obtained according to entropy decoding in step 3 obtain current frame image CTU line number, The Thread Count of quantity identical as current frame image CTU line number is created in thread pool, and is tied up per thread by multi-kernel function library Determine onto different core, enters decoding major cycle later;Execute step 5- step 8, the reconstructed frame that pixel decoding and reconstituting is obtained;
Step 5: Create and initialize the CTU dependency table according to the total number of CTUs in the current frame; the dependency table must be initialized at the start of every frame. The coordinates of each element in the table correspond to the coordinates of a CTU in the frame, and the value of each element is the number of adjacent CTUs on which the CTU at that position depends. Whenever one of the CTUs that a given CTU depends on finishes decoding, one dependency of that CTU is satisfied, so its element in the dependency table is decremented by 1. When the element reaches 0, the CTU has been released from all dependencies on its adjacent CTUs. The first CTU is added to the to-be-decoded queue;
Step 6: If the to-be-decoded queue is not empty, there are CTUs waiting to be decoded; if the thread pool is not empty, there are idle threads (cores). In that case the CTUs waiting to be decoded can be assigned to the idle threads (cores) for decoding;
Step 7: Check whether each thread has returned a message. A returned message means that the thread has finished decoding, so the thread returns to the idle state and is added back to the thread pool to wait for the next task. Obtain the decoded CTU from the returned message and update the CTU dependency table; if an updated entry becomes 0, the corresponding CTU can start decoding and is therefore added to the to-be-decoded queue;
Step 8: As soon as the pixel decoding and reconstruction of a CTU in the picture frame is finished, a slave thread is assigned to perform the fused loop-filter operation on it, that is, deblocking filtering and sample adaptive offset (SAO) are coupled and applied with the CTU-like decoding object as the unit; the processing operation is shown in Fig. 7. When processing is complete, the thread is returned to the task queue and waits until the pixel decoding and reconstruction of the next CTU is complete, and then applies fused loop filtering to it;
Step 9: For the CTUs in the next picture frame whose dependencies are satisfied, assign threads to perform pixel decoding and reconstruction, and repeat the above steps;
Step 10: After the bitstream of one frame has been decoded, check whether the whole video bitstream has been decoded; if so, release all resources and destroy the thread pool; if not, return to step 5.
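The dependency-table scheduling in steps 5 to 7 can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions, not the patented implementation: the neighbour set (left, upper-left, upper, upper-right), the names `build_dependency_table` and `schedule_frame`, and the round-based model of worker threads are all introduced here for illustration; the text above only says that each table entry counts the adjacent CTUs on which a CTU depends.

```python
from collections import deque

# Assumed neighbour offsets (dx, dy): left, upper-left, upper, upper-right.
# The description only says "adjacent CTUs"; this choice is illustrative.
NEIGHBOURS = [(-1, 0), (-1, -1), (0, -1), (1, -1)]


def build_dependency_table(cols, rows):
    """Return table[y][x] = number of in-frame neighbours CTU (x, y) waits for."""
    def in_frame(x, y):
        return 0 <= x < cols and 0 <= y < rows

    return [[sum(in_frame(x + dx, y + dy) for dx, dy in NEIGHBOURS)
             for x in range(cols)]
            for y in range(rows)]


def schedule_frame(cols, rows, n_workers):
    """Simulate the main-thread loop; return the CTUs decoded in each round."""
    table = build_dependency_table(cols, rows)
    ready = deque([(0, 0)])  # the first CTU has no dependencies
    rounds = []
    while ready:
        # Hand at most n_workers ready CTUs to idle workers.
        batch = [ready.popleft() for _ in range(min(n_workers, len(ready)))]
        rounds.append(batch)
        # Each completion message satisfies one dependency of every CTU
        # that counts the finished CTU among its neighbours.
        for x, y in batch:
            for dx, dy in NEIGHBOURS:
                nx, ny = x - dx, y - dy
                if 0 <= nx < cols and 0 <= ny < rows:
                    table[ny][nx] -= 1
                    if table[ny][nx] == 0:
                        ready.append((nx, ny))
    return rounds
```

For a 1920 × 1080 frame with 64 × 64 CTUs the grid would be 30 columns by 17 rows, so `schedule_frame(30, 17, n_workers)` lists, round by round, which CTUs an idle worker could pick up. A real decoder would replace the rounds with worker threads that report completion asynchronously.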
Figure 10 shows the decoding frame rates of different full-HD 1080p video sequences when the invention performs multi-core parallel decoding on a multi-core processor at QP = 32. Decoding performance is measured by frame rate (fps, frames decoded per second).
Specific embodiment: A Tilera GX36 multi-core processor, which consists of 36 Tile cores, is used as the experimental platform. The Tilera multi-core processor comes with a complete set of multi-core development tools, which makes it convenient to implement multi-core parallel programs. To verify the effect of the method of the invention, the following verification experiment was carried out: three video sequences with a resolution of 1920 × 1080 and a QP of 32, "BasketballDrive", "Cactus" and "Kimono", were decoded with the method of the invention. The most complex mode, RA (Random Access), was chosen as the video coding mode, and the CTU blocks were set to a size of 64 × 64. The decoding method of the invention achieves efficient multi-core parallel decoding on the Tilera multi-core processor. The experimental results are shown in Table 1. The method is also compared in practice with the HEVC parallel decoding algorithm combining task level and data level on a multi-core processor, developed in 2018 by Han Feng of the Image Processing and Image Communication Laboratory of Nanjing University of Posts and Telecommunications. As shown in Fig. 9, MLP denotes the parallel decoding algorithm combining task level and data level, and SMLP denotes the multi-level parallel decoding method designed and implemented in the invention.
Table 1: Experimental results
As Table 1 shows, with single-core decoding the decoding speed for high-definition video is limited and real-time decoding cannot be achieved. With 10 cores, the multi-level efficient parallel decoding algorithm of the invention meets the real-time decoding requirement. As the number of cores increases further, the decoding speed increases accordingly, and the maximum decoding speed exceeds 59 fps.
The experimental comparison in Figure 10 shows that the CTU-based multi-level parallel decoding algorithm of this design greatly improves multi-core decoding efficiency over the parallel decoding algorithm combining task level and data level; with 24 cores, the average decoding efficiency is about 10% higher than that of the parallel decoding algorithm combining task level and data level.
Combining the experimental results in Table 1 and Fig. 9, it can be seen that:
(1) The multi-level efficient parallel decoding algorithm proposed by the invention achieves real-time decoding of high-definition video on a multi-core processor.
(2) Compared with the parallel decoding algorithm combining task level and data level, the multi-level efficient parallel decoding algorithm proposed by the invention greatly improves decoding efficiency on multiple cores.
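The two quantities used in this comparison, the relative efficiency gain and the real-time criterion, can be written down as a short sketch. The function names and the idea of passing the source frame rate as a parameter are assumptions made here for illustration; the text does not state how the roughly 10% figure was computed, so the ratio below is only the usual definition.

```python
def relative_gain(fps_new, fps_baseline):
    """Relative frame-rate gain of a new decoder over a baseline decoder."""
    if fps_baseline <= 0:
        raise ValueError("baseline frame rate must be positive")
    return (fps_new - fps_baseline) / fps_baseline


def meets_real_time(decoded_fps, source_fps):
    """A decoder is real-time if it decodes at least as fast as playback."""
    return decoded_fps >= source_fps
```

With purely hypothetical inputs of 50 fps for the baseline and 55 fps for the new decoder, `relative_gain` returns 0.10, that is, a 10% improvement; these numbers are placeholders and are not taken from Table 1.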

Claims (7)

1. An efficient multi-level HEVC parallel decoding algorithm on a multi-core processor platform, characterized in that a main thread controls the decoding process, multiple independent slave threads decode CTUs in parallel, and each thread is bound to one available core of the multi-core processor, so as to achieve efficient parallel decoding on the multi-core processing platform, comprising the following steps:
Step 1: The main thread performs initialization, including initializing the HEVC decoder, requesting register units, initializing the cache, initializing the decoding task queue, and emptying the task queue;
Step 2: Read in the HEVC-coded sequence bitstream and call the network abstraction layer (NAL) parsing function to parse the encapsulated parameter information, obtaining the profile, level, picture frame type, picture size parameters and loop-filter parameter information needed for decoding;
Step 3: Perform entropy decoding according to the parameter information produced by the NAL parsing in step 2, obtaining the syntax elements that represent the video sequence;
Step 4: From the syntax elements obtained by entropy decoding in step 3, obtain the number of CTU rows in the current frame, create in the thread pool as many threads as there are CTU rows in the current frame, and bind each thread to a different core through the multi-core function library;
Step 5: Create and initialize the CTU dependency table according to the total number of CTUs in the current frame. The coordinates of each element in the table correspond to the coordinates of a CTU in the frame, and the value of each element is the number of adjacent CTUs on which the CTU at that position depends. Whenever one of the CTUs that a given CTU depends on finishes decoding, one dependency of that CTU is satisfied, so its element in the dependency table is decremented by 1; when the element reaches 0, the CTU has been released from all dependencies on its adjacent CTUs;
Step 6: Add the first CTU in the dependency table to the to-be-decoded task queue;
Step 7: Determine whether the to-be-decoded queue is empty. If it is not empty, determine whether the thread pool is empty: if the thread pool is not empty, an idle slave thread receives a task assigned by the main thread and executes a decoding task from the to-be-decoded task queue; if the thread pool is empty, wait for a slave thread to return to the idle state and then for task assignment. If the to-be-decoded queue is empty, execute step 9;
Step 8: From the messages returned by the slave threads, the main thread obtains the CTUs whose decoding is complete, updates the dependency table, and adds the CTUs whose element value in the dependency table is 0 to the to-be-decoded task queue;
Step 9: Using pipeline parallelism, after a slave thread completes the pixel decoding and reconstruction task of one CTU, that slave thread takes a new decoding task from the to-be-decoded task queue and continues with pixel decoding and reconstruction; when the data dependencies are satisfied, another idle slave thread is scheduled from the thread pool to apply fused loop filtering to the CTU whose pixel decoding and reconstruction has just been completed. Repeat steps 8 and 9 until decoding of the current frame is complete, then execute step 10;
Step 10: Check whether the whole video bitstream has been decoded; if so, release all resources and destroy the thread pool; otherwise, read the next frame and execute step 5.
2. The efficient multi-level HEVC parallel decoding algorithm on a multi-core processor platform according to claim 1, characterized in that the HEVC decoder in step 1 includes coding units (CUs) with a recursive hierarchical structure based on a quadtree.
3. The efficient multi-level HEVC parallel decoding algorithm on a multi-core processor platform according to claim 1, characterized in that the parameter information in step 2 includes the picture parameter set (PPS), video parameter set (VPS), sequence parameter set (SPS), supplemental enhancement information (SEI) and the slice header information of the picture.
4. The efficient multi-level HEVC parallel decoding algorithm on a multi-core processor platform according to claim 1, characterized in that the slave thread in step 7 receiving a task assigned by the main thread and executing a decoding task from the to-be-decoded task queue specifically comprises the following steps: fetching, through cache communication, the information of the CTU to be decoded and the required surrounding CTU information from external memory; decoding the CTU; writing the decoded CTU back to external memory through cache communication; and notifying the main thread that decoding has finished.
5. The efficient multi-level HEVC parallel decoding algorithm on a multi-core processor platform according to claim 4, characterized in that if the CTU to be decoded is an inter CTU, the pixel data of its reference CTU is fetched from external memory through cache communication.
6. The efficient multi-level HEVC parallel decoding algorithm on a multi-core processor platform according to claim 1, characterized in that the fused loop filtering in step 9 includes deblocking filtering and SAO filtering, and specifically comprises the following steps: according to the data dependencies between the luma component, the chroma components and sample adaptive offset during deblocking filtering, a new CTU-like decoding object is constructed as the processing object of loop filtering, the CTU-like decoding object taking samples from the current CTU and from its upper-left and upper CTUs; after deblocking filtering has been executed, the range of the current CTU-like decoding object is re-partitioned for processing the pixel samples in SAO filtering; in SAO filtering, the range of the CTU-like decoding object is moved left by one column and up by one row of pixel samples.
7. The efficient multi-level HEVC parallel decoding algorithm on a multi-core processor platform according to claim 1, characterized in that the to-be-decoded task queue in step 6 is a first-in first-out queue that stores the CTUs to be decoded; after a slave thread completes the decoding task of one CTU, a new decoding task is taken from the head of the to-be-decoded task queue and added to that decoding thread.
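The region bookkeeping behind the fused loop filter of claim 6 can be sketched as follows. This is a hedged illustration only: the claim states that the SAO range is moved one column left and one row up relative to the deblocking range, but it does not give the offset of the deblocking range itself, so `DEBLOCK_SHIFT = 4` is an assumed value, and the function names and the clipping at frame borders are choices made here.

```python
def shifted_regions(width, height, ctu=64, shift=0):
    """Return {(cx, cy): (x0, y0, x1, y1)} with half-open pixel rectangles.

    Each CTU rectangle is moved up and left by `shift` samples, clipped at
    the top and left frame edges and stretched to the bottom and right frame
    edges, so the rectangles still cover every pixel exactly once.
    """
    cols = -(-width // ctu)   # ceiling division
    rows = -(-height // ctu)
    regions = {}
    for cy in range(rows):
        for cx in range(cols):
            x0 = max(0, cx * ctu - shift)
            y0 = max(0, cy * ctu - shift)
            x1 = width if cx == cols - 1 else (cx + 1) * ctu - shift
            y1 = height if cy == rows - 1 else (cy + 1) * ctu - shift
            regions[(cx, cy)] = (x0, y0, x1, y1)
    return regions


DEBLOCK_SHIFT = 4  # assumed offset of the CTU-like object; not stated in the text


def deblock_regions(width, height, ctu=64):
    """Regions processed by deblocking for each CTU-like decoding object."""
    return shifted_regions(width, height, ctu, DEBLOCK_SHIFT)


def sao_regions(width, height, ctu=64):
    """SAO regions: one further column left and one further row up (claim 6)."""
    return shifted_regions(width, height, ctu, DEBLOCK_SHIFT + 1)
```

Shifting the region up and to the left means that the samples a filter touches belong to CTUs that have already been reconstructed, which is why a single thread can run deblocking and then SAO on the same object without waiting for the right and lower neighbours.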
CN201910752152.XA 2019-08-15 2019-08-15 HEVC (high efficiency video coding) multi-level parallel decoding method on multi-core processor platform Active CN110337002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910752152.XA CN110337002B (en) 2019-08-15 2019-08-15 HEVC (high efficiency video coding) multi-level parallel decoding method on multi-core processor platform

Publications (2)

Publication Number Publication Date
CN110337002A true CN110337002A (en) 2019-10-15
CN110337002B CN110337002B (en) 2022-03-29

Family

ID=68149626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910752152.XA Active CN110337002B (en) 2019-08-15 2019-08-15 HEVC (high efficiency video coding) multi-level parallel decoding method on multi-core processor platform

Country Status (1)

Country Link
CN (1) CN110337002B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103974081A (en) * 2014-05-08 2014-08-06 杭州同尊信息技术有限公司 HEVC coding method based on multi-core processor Tilera
US20160191922A1 (en) * 2014-04-22 2016-06-30 Mediatek Inc. Mixed-level multi-core parallel video decoding system
CN105992008A (en) * 2016-03-30 2016-10-05 南京邮电大学 Multilevel multitask parallel decoding algorithm on multicore processor platform
CN107454406A (en) * 2017-08-18 2017-12-08 深圳市佳创视讯技术股份有限公司 The live high-speed decoding method of VR panoramic videos and system based on AVS+

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
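SEUNGHYUN CHO et al.: "Efficient In-Loop Filtering Across Tile Boundaries for Multi-Core HEVC Hardware Decoders With 4 K/8 K-UHD Video Applications", 《IEEE TRANSACTIONS ON MULTIMEDIA》 *
Gu Tao: "Design and Implementation of HEVC Parallel Video Encoding Technology Based on the TILE-Gx36 Multi-core Processor", 《China Master's Theses Full-text Database》 *
Han Feng: "HEVC Parallel Decoding Technology and Implementation Combining Task Level and Data Level Based on a Multi-core Processor", 《China Master's Theses Full-text Database》 *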

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111562948A (en) * 2020-06-29 2020-08-21 深兰人工智能芯片研究院(江苏)有限公司 System and method for realizing parallelization of serial tasks in real-time image processing system
CN111562948B (en) * 2020-06-29 2020-11-10 深兰人工智能芯片研究院(江苏)有限公司 System and method for realizing parallelization of serial tasks in real-time image processing system
CN111986070A (en) * 2020-07-10 2020-11-24 中国人民解放军战略支援部队航天工程大学 VDIF format data heterogeneous parallel framing method based on GPU
CN112468821A (en) * 2020-10-27 2021-03-09 南京邮电大学 HEVC core module-based parallel decoding method, device and medium
CN112468821B (en) * 2020-10-27 2023-02-10 南京邮电大学 HEVC core module-based parallel decoding method, device and medium
CN113660496A (en) * 2021-07-12 2021-11-16 珠海全志科技股份有限公司 Multi-core parallel-based video stream decoding method and device
CN113660496B (en) * 2021-07-12 2024-06-07 珠海全志科技股份有限公司 Video stream decoding method and device based on multi-core parallelism
CN116841739A (en) * 2023-06-30 2023-10-03 沐曦集成电路(杭州)有限公司 Data packet reuse system for heterogeneous computing platforms
CN116841739B (en) * 2023-06-30 2024-04-19 沐曦集成电路(杭州)有限公司 Data packet reuse system for heterogeneous computing platforms

Also Published As

Publication number Publication date
CN110337002B (en) 2022-03-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant