CN110337002A

CN110337002A - A multi-level efficient parallel decoding algorithm for HEVC on a multi-core processor platform

Info

Publication number: CN110337002A
Application number: CN201910752152.XA
Authority: CN
Inventors: 胡栋; 张文祥; 李毅
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2019-08-15
Filing date: 2019-08-15
Publication date: 2019-10-15
Anticipated expiration: 2039-08-15
Also published as: CN110337002B

Abstract

The invention discloses a multi-level efficient parallel decoding algorithm based on a multi-core processor. To address the huge data volume of HD video and the very high processing complexity of HEVC decoding, the method makes full use of the data dependencies within HEVC and proposes a multi-level parallel decoding algorithm for a multi-core processor platform. First, in the pixel decoding and reconstruction module, a wavefront parallel algorithm based on CTUs is implemented using the data dependencies between CTUs. Second, in the loop filter module, a fused loop filtering algorithm is implemented by making full use of the data dependencies between deblocking filtering and sample adaptive offset (SAO). Finally, pipeline parallelism is introduced between the two modules, yielding a multi-level efficient parallel decoding algorithm for the decoder. During decoding, each task is executed by an independent thread bound to its own core, which takes full advantage of the parallel computing performance of the multi-core processor and improves decoding efficiency.

Description

A multi-level efficient parallel decoding algorithm for HEVC on a multi-core processor platform

Technical field

The present invention relates to the field of digital video signal encoding and decoding, and in particular to a multi-level, multi-task efficient parallel decoding algorithm on a multi-core processor platform.

Background

With the continuing development of the mobile Internet and Internet video applications, and to meet the growing demand for high-definition (HD) and similar video, MPEG and VCEG formed the joint video coding standards team JCT-VC in 2010 to develop the new international video coding standard HEVC (High Efficiency Video Coding), which formally became an international standard in January 2013. HEVC aims to improve video coding efficiency: at the same picture quality, its compression ratio is double that of the H.264/AVC High profile. Given the huge data volume and the encoding and decoding complexity of HD video, the HEVC standard introduces several parallel processing tools, such as Tile, which uses blocks as the parallel granularity, and WPP (wavefront parallel processing), which effectively improve encoding and decoding performance. Meanwhile, multi-core processors have developed considerably in recent years, and the effective combination of the two is an important factor in the successful application of HEVC.

Scholars at home and abroad have studied video coding standards on multi-core processors. Internationally, Hyunki Baik et al., in the paper "A Complexity-based Adaptive Tile Partitioning Algorithm For HEVC Decoder Parallelization" (2015), proposed an effective adaptive tile partitioning algorithm for HEVC decoder parallelization that divides a video frame into independent tiles processed simultaneously on multiple cores. However, the method requires the encoder to use independent tile partitioning, so the decoder is tightly coupled to the encoder, which is a significant limitation. HyunMi Kim et al., in the paper "An Efficient Architecture of In-Loop Filters for Multicore Scalable HEVC Hardware Decoders" (2018), proposed an efficient HEVC in-loop filter (ILF) architecture that provides effective multi-core utilization for ultra-HD video applications; its novel storage organization and management techniques solve the data dependency problems between multiple processing units.

Domestic scholars have also proposed decoding methods for multi-core platforms. Ma Aidi et al. (2014), of the School of Information and Communication Engineering at Dalian University of Technology, proposed an HEVC parallel decoder based on a hybrid CPU+GPU platform; it uses the CUDA hardware platform and exploits the hardware's strengths to optimize the system. Fang Di et al. (2016), based on an analysis of the data dependencies between CTUs during decoding, proposed a multi-level HEVC parallel decoding method based on the Tilera multi-core processor. Han Feng et al. (2018), based on the relationship between data and task levels during decoding, proposed an HEVC parallel decoding method that combines task-level and data-level parallelism on a multi-core processor.

Prior work on multi-core platforms includes the OWF overlapped wavefront parallel scheme, pipelined thread pool techniques, the 3D-WPP algorithm, parallel algorithms combining data-level and task-level parallelism, and fast deblocking filtering algorithms. Although these bring substantial improvements in some respects, they do not consider that differences in local texture complexity within each frame of a video sequence cause large differences in the computational complexity of individual CTUs; they do not consider the cost of communication between cores and caches for the deblocking filtering and sample adaptive offset (SAO) modules; and they do not make full use of the core resources of a multi-core platform.

Summary of the invention

The technical problem to be solved by the present invention is to further improve decoding efficiency for real-time decoding of a single HEVC HD bitstream while preserving decoded image quality.

The technical solution adopted by the present invention is a multi-level efficient parallel HEVC decoding algorithm on a multi-core processor platform. The main thread controls the decoding process, and multiple slave threads decode independent CTUs in parallel. Each thread is bound to an available core of the multi-core processor, achieving efficient parallel decoding on the multi-core platform. The algorithm comprises the following steps:

Step 1: The main thread performs initialization, including initializing the HEVC decoder, requesting register units, initializing the cache, and initializing the decoding task queue by emptying it;

Step 2: Read in the HEVC-coded bitstream and call the network abstraction layer (NAL) parsing function to parse the encapsulated parameter information, obtaining the profile, level, picture frame type, picture size parameters, and loop filtering parameters required for decoding;

Step 3: Using the parameter information produced by NAL parsing in Step 2, perform entropy decoding to obtain the syntax elements that represent the video sequence;

Step 4: From the syntax elements obtained by entropy decoding in Step 3, obtain the number of CTU rows in the current frame, create as many threads in the thread pool as there are CTU rows in the current frame, and bind each thread to a different core through the multi-core function library;

Step 5: Build and initialize a CTU dependency table according to the total number of CTUs in the current frame. The coordinates of each element in the table correspond to the coordinates of a CTU within the frame, and the value of each element is the number of neighboring CTUs on which the CTU at that position depends. Whenever a CTU on which a given CTU depends finishes decoding, one dependency of that CTU is satisfied, so its element in the dependency table is decremented by 1. When the element reaches 0, the CTU has no remaining dependencies on its neighboring CTUs;

Step 6: Add the first CTU in the dependency table to the decoding task queue;

Step 7: Check whether the decoding task queue is empty. If it is not empty, check whether the thread pool is empty. If the thread pool is not empty, an idle slave thread accepts a task assigned by the main thread and executes a decoding task from the queue; if the thread pool is empty, wait until a slave thread returns to the idle state and then assign the task. If the decoding task queue is empty, go to Step 9;

Step 8: From the messages returned by the slave threads, the main thread learns which CTUs have finished decoding, updates the dependency table, and adds every CTU whose element value has become 0 to the decoding task queue;

Step 9: Using pipeline parallelism, after a slave thread completes the pixel decoding and reconstruction task for a CTU, it takes a new decoding task from the decoding task queue and continues with pixel decoding and reconstruction. When the data dependencies are satisfied, another idle slave thread is scheduled from the thread pool to perform fused loop filtering on the CTU whose pixel decoding and reconstruction has just completed. Repeat Steps 8 and 9 until the current frame is fully decoded, then go to Step 10;

Step 10: Check whether the entire video bitstream has been decoded. If so, release all resources and destroy the thread pool; otherwise, read the next frame and go to Step 5.

Further, the HEVC decoder in Step 1 includes coding units (CUs) with a recursive hierarchical structure based on a quadtree.

Further, in Step 2, the parameter information includes the picture parameter set (PPS), video parameter set (VPS), sequence parameter set (SPS), supplemental enhancement information (SEI), and the slice header information of the picture.

Further, in Step 7, a slave thread that accepts a task from the main thread and executes a decoding task from the queue performs the following steps: obtain, from external memory through cache communication, the information of the CTU to be decoded and the required neighboring CTU information; decode the CTU; write the decoded CTU back to external memory through cache communication; notify the main thread that decoding has finished.

Further, if the CTU to be decoded is an inter-coded CTU, the pixel data of its reference CTUs is obtained from external memory through cache communication.

Further, the fused loop filtering in Step 9 comprises deblocking filtering and SAO filtering, and specifically includes the following steps: according to the data dependencies between the luma component, the chroma components, and sample adaptive offset during deblocking filtering, construct a new CTU-like decoding object as the processing unit for loop filtering, where the CTU-like decoding object contains samples from the current CTU and from its upper-left and upper CTUs; after deblocking filtering completes, repartition the range of the current CTU-like decoding object for processing the pixel samples in SAO filtering; for SAO filtering, the range of the CTU-like decoding object is shifted left by one column and up by one row of pixel samples.

Further, the decoding task queue in Step 6 is a first-in, first-out queue that stores the CTUs to be decoded. After a slave thread completes a CTU decoding task, a new decoding task can be taken from the top of the queue and given to that decoding thread.

Beneficial effects: By using a multi-threaded design, the present invention improves the real-time responsiveness of the program, improves the program structure, makes more effective use of the processor, reduces frequent scheduling and switching of system resources, and reduces the overhead of creating and destroying thread objects. When multiple threads access shared resources, lock and unlock operations together with condition variables coordinate correct concurrent operation, which improves the decoding efficiency of the whole system. Compared with the prior art, the invention has the following advantages:

The present invention is a multi-level efficient parallel decoding algorithm on a multi-core processor platform. Within the constraints of the original parallel framework, the HEVC decoder is divided into a pixel decoding and reconstruction part and a loop filtering part. The pixel decoding and reconstruction module implements a wavefront parallel algorithm based on CTUs, the loop filtering module implements a fused loop filtering algorithm, and the two modules run in a pipeline-parallel fashion to exploit the parallel computing performance of the multi-core processor. Experimental results show that the invention performs well in terms of parallelism and multi-core parallel architecture, and achieves efficient real-time decoding of a single full-HD 1080p bitstream that was encoded without any parallel coding mode.

Brief description of the drawings

Fig. 1 is a block diagram of the HEVC decoding process;

Fig. 2 is a schematic diagram of the CTU dependency table;

Fig. 3 is a schematic diagram of the design model of the CTU-based wavefront parallel decoding system;

Fig. 4 is a schematic diagram of cache interaction in the CTU-based wavefront parallel decoding algorithm;

Fig. 5 is a schematic diagram of the data dependencies between the luma component, the chroma components, and sample adaptive offset in deblocking filtering;

Fig. 6 is a schematic diagram of the data dependencies between deblocking filtering and sample adaptive offset;

Fig. 7 is a schematic diagram of fused loop filtering;

Fig. 8 is a schematic diagram of pipeline parallelism between the pixel decoding and reconstruction module and the loop filtering module;

Fig. 9 is a flow chart of the multi-level efficient parallel decoding algorithm;

Fig. 10 shows the experimental comparison between the multi-level efficient parallel decoding algorithm and the decoding algorithm combining task-level and data-level parallelism.

Detailed description of the embodiments

The basic idea of the present invention is to exploit the high parallel computing performance of a multi-core processor while taking full account of the differences in computational complexity between CTUs within each frame of the video sequence and of the cost of communication between cores and caches for the deblocking filtering and sample adaptive offset modules. HEVC decoding is divided into two parts, pixel decoding and reconstruction and loop filtering, and multi-level efficient parallel decoding is applied.

Embodiment:

This embodiment uses the very high parallel computing performance of a multi-core processor to achieve real-time parallel decoding of HEVC HD video.

Fig. 1 shows the block diagram of the HEVC decoder. The coded binary bitstream is first entropy decoded to obtain the quantized coefficients and control information; the quantized coefficients are then inverse quantized and inverse transformed to obtain the residual information. Next, the decoder uses the control information to perform intra prediction and inter prediction, and the prediction is combined with the restored residual. Finally, loop filtering, consisting of deblocking filtering and sample adaptive offset, produces the output image.

The basic structure of HEVC coding is almost the same as that of H.264/AVC, but HEVC's performance gains come from in-depth optimization of individual modules and a series of innovations in certain design elements. The new features most important for HD video coding performance are coding units (CUs) with a recursive quadtree-based hierarchical structure, and several parallelization tools that address the huge data volume of HD video. This embodiment uses the CTUs of a picture frame as the parallel granularity: it designs a CTU dependency table and a decoding method, decodes CTUs in parallel, creates a task queue, and assigns tasks to the threads on the individual Tile cores for multi-core parallel processing.

Fig. 2 shows the CTU dependency table and its initial values. The size of the table equals the total number of CTUs in a frame, and each element records the number of neighboring CTUs on which that CTU depends. Whenever a CTU on which a CTU depends finishes decoding, that CTU's element in the table is decremented by 1. When the element reaches 0, the CTU no longer depends on any neighboring CTU and is ready to be decoded, so the main thread can add it to the decoding task queue. The table records the following. The CTU in the upper left corner is the first CTU of the frame and does not depend on any other CTU of the current frame. A CTU in the first row depends only on the CTU to its left. A CTU in the first column depends on the CTUs to its upper left, above it, and to its upper right, but the upper-right CTU is always decoded later than the upper-left and upper CTUs, so it can be recorded as depending only on the upper-right CTU; that is, it can start decoding once its upper-right CTU has been decoded. A CTU in the last column depends on the CTUs to its upper left, above it, and to its left, but the left CTU is always decoded later than the upper CTU, so it can be recorded as depending only on the left CTU; that is, it can start decoding once its left CTU has been decoded. Every other CTU depends on the CTUs to its upper left, above it, to its upper right, and to its left. The upper-right and left CTUs are always decoded later than the upper-left and upper CTUs, but there is no direct ordering between the two of them, so both are recorded as dependencies of the current CTU.
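The initial values described for Fig. 2 can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the patented implementation: the function name `init_dependency_table` is hypothetical, and the treatment of a frame that is only one CTU wide is an assumption that the text does not cover.

```python
def init_dependency_table(rows, cols):
    """Build the initial CTU dependency counts described for Fig. 2.

    rows and cols give the frame size in CTUs. table[r][c] is the number of
    neighboring CTUs that CTU (r, c) still has to wait for.
    """
    table = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if r == 0 and c == 0:
                table[r][c] = 0  # first CTU of the frame: no dependencies
            elif r == 0:
                table[r][c] = 1  # first row: left neighbor only
            elif c == 0:
                # first column: upper-right neighbor only
                # (assumption: the CTU above when the frame is one CTU wide)
                table[r][c] = 1
            elif c == cols - 1:
                table[r][c] = 1  # last column: left neighbor only
            else:
                table[r][c] = 2  # elsewhere: left and upper-right neighbors
    return table
```

For a frame of 3 × 4 CTUs, `init_dependency_table(3, 4)` gives a first row of `[0, 1, 1, 1]` and two further rows of `[1, 2, 2, 1]`.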

Fig. 3 shows the design model of the CTU-based wavefront parallel decoding system. The upper left corner shows the CTU dependency table, which is built and continuously maintained; the lower left corner shows the task queue, which is implemented with a purpose-designed buffer structure. M and P1, P2, P3, ..., Pn are the threads, implemented by binding each thread to a core, where M is the main thread (core) and the P threads are slave threads (cores). The main thread (core) maintains a CTU dependency table to track the dependencies between CTUs. Once all dependencies of a CTU in the table have finished decoding, that CTU is ready and can be added to the task queue as a decoding task. As soon as a slave thread (core) is idle, a decoding task is taken from the task queue and assigned to it. The sole purpose of a slave thread (core) is to decode CTUs; when it finishes decoding a CTU, it enters the waiting state until the main thread (core) assigns a new decoding task. The CTU decoding task queue is a first-in, first-out queue: a CTU is added when it is ready to be decoded, and after a slave thread (core) completes a CTU decoding task, a new decoding task can be taken from the top of the queue and given to that decoding thread.
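To make the interplay of the dependency table and the FIFO task queue concrete, the following Python sketch simulates the main thread's bookkeeping. It is an illustration only: real slave threads are replaced by lock-step rounds in which up to `workers` ready CTUs are decoded, every CTU is assumed to take the same time, and the names `deps_of` and `simulate` are hypothetical.

```python
from collections import deque


def deps_of(r, c, cols):
    """CTUs that (r, c) waits for, after the pruning described for Fig. 2."""
    deps = []
    if c > 0:
        deps.append((r, c - 1))        # left neighbor
    if r > 0 and c + 1 < cols:
        deps.append((r - 1, c + 1))    # upper-right neighbor
    elif r > 0 and c == 0:
        deps.append((r - 1, c))        # assumption: one-CTU-wide frame
    return deps


def simulate(rows, cols, workers):
    """Return (decode order, number of rounds) for a rows x cols CTU frame."""
    ctus = [(r, c) for r in range(rows) for c in range(cols)]
    remaining = {ctu: len(deps_of(*ctu, cols)) for ctu in ctus}
    dependents = {}
    for ctu in ctus:
        for dep in deps_of(*ctu, cols):
            dependents.setdefault(dep, []).append(ctu)

    ready = deque([(0, 0)])            # FIFO task queue, seeded with first CTU
    order, rounds = [], 0
    while ready:
        # hand one task to each idle worker, oldest task first
        batch = [ready.popleft() for _ in range(min(workers, len(ready)))]
        rounds += 1
        for done in batch:             # "decoding finished" messages
            order.append(done)
            for ctu in dependents.get(done, []):
                remaining[ctu] -= 1    # one dependency satisfied
                if remaining[ctu] == 0:
                    ready.append(ctu)  # ready: enqueue for decoding
    return order, rounds
```

Under these assumptions a 3 × 4 frame needs 12 rounds with one worker and 8 rounds once enough workers are available, which is the usual wavefront bound of (cols - 1) + 2 × (rows - 1) + 1.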

Fig. 4 shows cache interaction in the CTU-based wavefront parallel decoding algorithm. Because the CTU-based wavefront parallel algorithm uses a single CTU as the minimum decoding unit, data exchange and cache communication are more frequent than in earlier algorithms. The steps are as follows:

Step 1: Initialize and wait for the main thread (core) to start a slave thread from the thread pool on a new decoding task;

Step 2: Obtain the information of the CTU to be decoded from external memory through cache communication;

Step 3: Obtain the information of the neighboring CTUs required by the current CTU from external memory through cache communication;

Step 4: If the CTU is an inter-coded CTU, obtain the pixel data of its reference CTUs from external memory through cache communication;

Step 5: Decode (reconstruct) the CTU;

Step 6: Write the decoded CTU back to external memory through cache communication;

Step 7: Notify the main thread (core) that decoding has finished.
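The seven steps above map naturally onto a worker loop. The Python sketch below is a hedged illustration: external memory is modeled as a plain dictionary guarded by a lock, `decode_ctu` is a placeholder rather than real pixel reconstruction, and all names (`slave_thread`, `decode_ctu`, the dictionary layout) are assumptions introduced for the example.

```python
import queue
import threading


def decode_ctu(info, neighbors, reference):
    """Placeholder for pixel decoding and reconstruction (hypothetical)."""
    return {"pos": info["pos"], "used_reference": reference is not None}


def slave_thread(tasks, done, memory, lock):
    """Run the per-CTU steps until the main thread sends a None sentinel."""
    while True:
        pos = tasks.get()                    # step 1: wait for a task
        if pos is None:                      # sentinel: no more work
            break
        with lock:
            info = memory["ctu"][pos]        # step 2: CTU information
            neighbors = [memory["recon"].get(n)
                         for n in info["neighbors"]]   # step 3: neighbor data
            reference = (memory["ref"].get(pos)
                         if info["inter"] else None)   # step 4: inter only
        recon = decode_ctu(info, neighbors, reference)  # step 5: decode
        with lock:
            memory["recon"][pos] = recon     # step 6: write back
        done.put(pos)                        # step 7: notify the main thread
```

A main thread would create one `queue.Queue` for tasks and one for completion messages, start several such threads, and push one `None` per thread when the stream ends.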

Fig. 5 shows the data dependencies between the luma component, the chroma components, and sample adaptive offset in deblocking filtering. Based on the data relationship between deblocking filtering and sample adaptive offset, the invention proposes a repartitioning scheme that constructs a new CTU-like decoding object. A CTU-like decoding object contains all the data required by its deblocking filter, including the luma and both chroma components, and comprises samples from the current CTU and from its upper-left and upper CTUs. As Fig. 5 shows, the size of the CTU-like decoding object remains 64 × 64, but its sample range is shifted left by four columns and up by four rows. The new 64 × 64 CTU-like decoding object is the actual processing unit of the deblocking filter designed here.

Fig. 6 shows the data dependencies between deblocking filtering and sample adaptive offset. In the SAO filtering algorithm, the edge offset (EO) mode needs to reference neighboring samples, and the SAO source data comes from the output of the deblocking filter. Therefore, to implement the HEVC loop filter at the CTU level and couple the deblocking filter with the SAO filter, the current CTU-like decoding object must be repartitioned after the deblocking filter has run. This means that the sample range processed by the SAO filter is shifted left by one column and up by one row of samples, as shown by the dashed box in Fig. 6.
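The two window shifts can be written out as coordinates. The sketch below is one reading of the text, not a confirmed detail of the invention: it assumes the SAO shift of one column and one row is applied on top of the deblocking window, and that windows are clipped at the frame border. The names `clip` and `filter_windows` are hypothetical.

```python
def clip(x0, y0, x1, y1, width, height):
    """Clip a half-open rectangle (x0, y0, x1, y1) to the frame."""
    return (max(x0, 0), max(y0, 0), min(x1, width), min(y1, height))


def filter_windows(ctu_col, ctu_row, width, height, ctu_size=64):
    """Return (deblocking window, SAO window) for one CTU-like object.

    Deblocking window: the CTU area shifted left 4 columns and up 4 rows.
    SAO window: assumed to be the deblocking window shifted a further
    one column left and one row up.
    """
    x0, y0 = ctu_col * ctu_size, ctu_row * ctu_size
    deblock = (x0 - 4, y0 - 4, x0 - 4 + ctu_size, y0 - 4 + ctu_size)
    sao = tuple(v - 1 for v in deblock)
    return clip(*deblock, width, height), clip(*sao, width, height)
```

Under these assumptions, the CTU at column 1, row 1 of a 1920 × 1080 frame is deblocked over samples 60 to 123 in both directions and SAO-filtered over samples 59 to 122.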

Fig. 7 shows the processing flow of the fused loop filtering scheme. Step 1: make the luma filtering decisions for the vertical edges of multiple 8×8 luma blocks simultaneously. Step 2: apply deblocking filtering to the 8×8 Cb block. Step 3: apply SAO to the 8×8 Cb block. Step 4: horizontally filter the vertical edges of multiple 8×8 luma blocks simultaneously. Step 5: make the luma filtering decisions for the horizontal edges of multiple 8×8 luma blocks simultaneously. Step 6: apply deblocking filtering to the 8×8 Cr block. Step 7: apply SAO to the 8×8 Cr block. Step 8: vertically filter the horizontal edges of multiple 8×8 luma blocks simultaneously. Step 9: apply SAO to the 8×8 luma blocks. This process is executed in a loop.

Fig. 8 shows pipeline parallelism between the pixel decoding and reconstruction module and the loop filtering module. Using the pipeline parallelism shown in Fig. 8 between the two modules effectively reduces thread waiting time and improves decoding efficiency. With pipelining, pixel decoding and reconstruction and loop filtering can run at the same time whenever the data dependencies are satisfied; threads no longer have to wait for the pixel decoding and reconstruction module to finish all of its decoding tasks, which further reduces thread waiting time and improves decoding efficiency.
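The pipelining idea can be shown with a two-stage producer and consumer. This Python sketch is deliberately simplified and rests on assumptions: each stage is a single thread, `reconstruct` and `fused_loop_filter` are placeholders, and the filter stage starts on a CTU as soon as that CTU is reconstructed, without modeling the neighbor-sample dependencies discussed for Fig. 5 and Fig. 6.

```python
import queue
import threading


def reconstruct(ctu):
    """Placeholder for pixel decoding and reconstruction (hypothetical)."""
    return {"ctu": ctu, "stage": "reconstructed"}


def fused_loop_filter(block):
    """Placeholder for fused deblocking + SAO filtering (hypothetical)."""
    return {"ctu": block["ctu"], "stage": "filtered"}


def run_pipeline(ctus):
    """Filter each CTU as soon as it is reconstructed, not after the frame."""
    handoff = queue.Queue()   # reconstructed CTUs waiting for loop filtering
    filtered = []

    def reconstruction_stage():
        for ctu in ctus:
            handoff.put(reconstruct(ctu))
        handoff.put(None)     # sentinel: no more CTUs in this frame

    def filter_stage():
        while True:
            block = handoff.get()
            if block is None:
                break
            filtered.append(fused_loop_filter(block))

    threads = [threading.Thread(target=reconstruction_stage),
               threading.Thread(target=filter_stage)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return filtered
```

The queue between the stages is what removes the frame-level barrier: the filter thread blocks only while no reconstructed CTU is available.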

Fig. 9 shows the flow chart of the multi-level efficient parallel decoding algorithm. The steps are as follows:

Step 2: Read in the HEVC-coded bitstream and call the network abstraction layer (NAL, Network Abstraction Layer) parsing function to parse the encapsulated parameter information, including parameter sets such as the PPS (Picture Parameter Set), SPS (Sequence Parameter Set), VPS (Video Parameter Set), and SEI (Supplemental Enhancement Information), as well as the slice header information of the picture. This information includes the profile and level required for decoding, the picture frame type, the picture width and height, and the loop filtering parameters, and is then saved in the decoded picture object structure;

Step 3: Using the parameter information produced by NAL parsing in Step 2, perform entropy decoding. First check the picture frame type: if an I frame or P frame is detected, entropy decode that frame; if mutually independent B frames at the same level are detected, call threads from the thread pool to perform frame-level parallel entropy decoding. Entropy decoding converts the input binary sequence into the syntax elements that represent the video sequence; subsequent modules perform pixel reconstruction, filtering, and so on according to these syntax elements.

Step 4: From the syntax elements obtained by entropy decoding in Step 3, obtain the number of CTU rows in the current frame, create as many threads in the thread pool as there are CTU rows in the current frame, and bind each thread to a different core through the multi-core function library. Then enter the main decoding loop and execute Steps 5 to 8 to obtain the reconstructed frame produced by pixel decoding and reconstruction;

Step 5: Create and initialize the CTU dependency table according to the total number of CTUs in the current frame; the table must be initialized at the start of every frame. The coordinates of each element in the table correspond to the coordinates of a CTU within the frame, and the value of each element is the number of neighboring CTUs on which the CTU at that position depends. Whenever a CTU on which a given CTU depends finishes decoding, one dependency of that CTU is satisfied, so its element in the dependency table is decremented by 1. When the element reaches 0, the CTU has no remaining dependencies on its neighboring CTUs. Add the first CTU to the decoding queue;

Step 6: If the decoding queue is not empty, there are CTUs waiting to be decoded; if the thread pool is not empty, there is an idle thread (core). In that case, a CTU waiting to be decoded can be assigned to the idle thread (core) for decoding;

Step 7: Check whether each thread has returned a message. A returned message means that decoding has finished, so the thread returns to the idle state and rejoins the thread pool to wait for the next task. The CTU that finished decoding is identified from the returned message, and the CTU dependency table is updated. If an updated entry becomes 0, the corresponding CTU can start decoding and is therefore added to the decoding queue;

Step 8: As soon as pixel decoding and reconstruction of a CTU in the picture frame finishes, a slave thread is scheduled to perform fused loop filtering on it, that is, deblocking filtering and sample adaptive offset (SAO) are coupled with the CTU-like decoding object as the unit; the processing is shown in Fig. 7. When processing completes, the thread returns to the task queue and waits until the next CTU completes pixel decoding and reconstruction, and then performs fused loop filtering on it;

Step 9: For CTUs in the next picture frame whose dependencies are satisfied, schedule threads to perform pixel decoding and reconstruction, and repeat the steps above;

Step 10: After one frame of the video bitstream has been decoded, check whether the entire video bitstream has been decoded. If so, release all resources and destroy the thread pool; if not, return to Step 5.
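Step 4 above sizes the thread pool by the number of CTU rows, and Step 5 sizes the dependency table by the total number of CTUs. A small Python sketch of that arithmetic follows; the helper name `ctu_grid` is hypothetical, and binding threads to cores is not shown because it depends on the platform's multi-core library.

```python
def ctu_grid(width, height, ctu_size=64):
    """Return (rows, cols) of the CTU grid covering a width x height picture."""
    cols = -(-width // ctu_size)    # ceiling division: partial CTUs count
    rows = -(-height // ctu_size)
    return rows, cols


rows, cols = ctu_grid(1920, 1080)
print(rows, cols, rows * cols)      # 17 30 510
```

For a 1920 × 1080 picture with 64 × 64 CTUs, the last CTU row is only partly covered by picture samples (1080 = 16 × 64 + 56), so the grid has 17 rows and 30 columns. Taken literally, Step 4 would then create 17 threads, and the dependency table of Step 5 would hold 510 entries.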

Fig. 10 shows the decoding frame rates of different full-HD 1080p video sequences at QP = 32 when the invention performs multi-core parallel decoding on a multi-core processor. Decoding performance is measured by frame rate (fps, frames decoded per second).

Specific example: The experiments use a Tilera GX36 multi-core processor, which consists of 36 Tile cores, as the platform. The Tilera multi-core processor comes with a complete set of multi-core development tools, which makes it convenient to implement multi-core parallel programs. To verify the effect of the method, the following experiment was carried out: three video sequences with a resolution of 1920 × 1080 and QP 32, "BasketballDrive", "Cactus", and "Kimono", were decoded with the method of the invention. The video coding configuration is RA (Random Access), the most complex access mode, and the CTU size is set to 64 × 64. The decoding method of the invention achieves efficient multi-core parallel decoding on the Tilera multi-core processor. The experimental results are shown in Table 1. For comparison, the method was evaluated against the HEVC parallel decoding algorithm combining task-level and data-level parallelism on a multi-core processor, published in 2018 by Han Feng of the Image Processing and Image Communication Laboratory at Nanjing University of Posts and Telecommunications. As shown in Fig. 10, MLP denotes the parallel decoding algorithm combining task-level and data-level parallelism, and SMLP denotes the multi-level parallel decoding method designed and implemented in the present invention.

Table 1: Experimental results

From table 1 it follows that high definition video decoding speed is limited and is unable to reach reality in the decoded situation of monokaryon When decoded effect.When nucleus number is in 10 core, multi-level parallel efficiently decoding algorithm of the invention can achieve real-time decoding and want It asks.When nucleus number continues growing, decoding speed increases therewith, and maximum decoding speed can reach 59fps or more.

From Figure 10 Experimental comparison results find, the multi-level concurrent decoding algorithm based on CTU unit of the design, compared with appoint The concurrent decoding algorithm that business grade is combined with data level, decoding efficiency are greatly improved in multicore, are in nucleus number When 24 core, average decoding efficiency promotes about 10% compared with the concurrent decoding algorithm that task level is combined with data level.

In conjunction with the experimental result from table 1 and Fig. 9 it can be seen that

(1) multi-level parallel efficiently decoding algorithm proposed by the invention can be realized HD video on multi-core processor Real-time decoding.

(2) With multiple cores, the multi-level parallel efficient decoding algorithm proposed by the present invention achieves considerably higher decoding efficiency than the parallel decoding algorithm combining task-level and data-level parallelism.

Claims

1. An efficient multi-level parallel HEVC decoding algorithm on a multi-core processor platform, characterized in that a main thread controls the decoding process and multiple independent slave threads decode CTU units in parallel, each thread being bound to one available core of the multi-core processor so as to achieve efficient parallel decoding on the multi-core processing platform, comprising the following steps:

Step 1: the main thread performs initialization, including initializing the HEVC decoder, allocating register units, initializing the cache, initializing the decoding task queue, and emptying the task queue;

Step 2: read in the HEVC-coded sequence bitstream, call the network abstraction layer (NAL) parsing function, and parse the encapsulated parameter information to obtain the profile, level, image frame type, image size parameters and loop filter parameter information required for decoding;

Step 3: perform entropy decoding according to the parameter information produced by the network abstraction layer parsing in step 2, obtaining the syntax elements that represent the video sequence;

Step 4: obtain the number of CTU rows of the current frame image from the syntax elements obtained by entropy decoding in step 3, create in the thread pool a number of threads equal to the number of CTU rows of the current frame image, and bind each thread to a different core through the multi-core function library;

Step 5: establish an initial CTU dependency table according to the total number of CTU units in the current frame image. The coordinates of each element in the dependency table correspond to the coordinates of a CTU unit within the frame, and the value of each element indicates the number of adjacent CTU units on which the CTU at the corresponding position depends. For a given CTU unit, each time one of the CTU units it depends on finishes decoding, one dependency of the current CTU is satisfied, so the corresponding element value in the CTU dependency table is decremented by 1. When the element value drops to 0, the CTU unit has been released from all dependencies on its adjacent CTUs;

Step 7: determine whether the to-be-decoded task queue is empty. If it is not empty, determine whether the thread pool is empty: if the thread pool is not empty, an idle slave thread accepts a task assigned by the main thread and executes the decoding task from the to-be-decoded task queue; if the thread pool is empty, wait until a slave thread returns to the idle state and then assign the task. If the to-be-decoded task queue is empty, execute step 9;

Step 8: from the messages returned by the slave threads, the main thread determines which CTU units have finished decoding, updates the dependency table, and adds the CTU units whose element value in the dependency table is 0 to the to-be-decoded task queue;

Step 9: using pipeline parallelism, after a slave thread completes the pixel decoding and reconstruction task of one CTU unit, that slave thread takes a new decoding task from the to-be-decoded task queue and continues to perform pixel decoding and reconstruction; when the data dependencies are satisfied, another idle slave thread is dispatched from the thread pool to perform fused loop filtering on the CTU unit that has just completed pixel decoding and reconstruction. Steps 8 and 9 are repeated until decoding of the current frame image is complete, after which step 10 is executed;

Step 10: check whether the entire video bitstream has been decoded; if so, release all resources and destroy the thread pool; otherwise, read the next frame image and execute step 5.
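The dependency-table bookkeeping of steps 5 and 8 can be illustrated with a short sketch. It is not the decoder of the invention: it runs sequentially in a single thread, and it assumes the usual wavefront pattern in which a CTU depends on its left neighbour and on its upper-right neighbour (or the upper neighbour in the last column). The claim text does not list which neighbours are counted, and the function names are illustrative.

```python
from collections import defaultdict, deque


def dependencies(r, c, cols):
    """Return the CTUs that CTU (r, c) waits for (assumed wavefront pattern)."""
    deps = []
    if c > 0:
        deps.append((r, c - 1))                     # left neighbour
    if r > 0:
        deps.append((r - 1, min(c + 1, cols - 1)))  # upper-right (upper in last column)
    return deps


def build_dependency_table(rows, cols):
    """Step 5: each element holds the number of unmet dependencies of that CTU."""
    return [[len(dependencies(r, c, cols)) for c in range(cols)]
            for r in range(rows)]


def wavefront_order(rows, cols):
    """Simulate steps 7 and 8 sequentially and return the order of decoding."""
    table = build_dependency_table(rows, cols)
    dependents = defaultdict(list)  # reverse map: CTU -> CTUs waiting on it
    for r in range(rows):
        for c in range(cols):
            for dep in dependencies(r, c, cols):
                dependents[dep].append((r, c))

    # FIFO queue of CTUs whose element value is already 0.
    ready = deque((r, c) for r in range(rows) for c in range(cols)
                  if table[r][c] == 0)
    order = []
    while ready:
        ctu = ready.popleft()          # stand-in for "a slave thread decodes ctu"
        order.append(ctu)
        for r, c in dependents[ctu]:   # step 8: update the dependency table
            table[r][c] -= 1
            if table[r][c] == 0:
                ready.append((r, c))   # all dependencies released
    return order
```

In a real decoder the `popleft` step would hand the CTU to a slave thread bound to its own core, and the table update would be driven by the completion messages that the slave threads return to the main thread.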

2. The efficient multi-level parallel HEVC decoding algorithm on a multi-core processor platform according to claim 1, characterized in that the HEVC decoder in step 1 includes coding units (CU) with a recursive hierarchical structure based on a quadtree.

3. The efficient multi-level parallel HEVC decoding algorithm on a multi-core processor platform according to claim 1, characterized in that the parameter information in step 2 includes the picture parameter set (PPS), the video parameter set (VPS), the sequence parameter set (SPS), supplemental enhancement information (SEI) and the slice header information of the image.
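The parameter sets named in this claim are carried in NAL units whose type is signalled in the two-byte HEVC NAL unit header. The following is a minimal sketch of reading that header, assuming the two header bytes have already been located after the start code. The function and dictionary names are illustrative and are not taken from the decoder of the invention; the type values are those defined in the HEVC specification.

```python
# NAL unit type values defined by the HEVC (H.265) specification.
NAL_TYPE_NAMES = {
    32: "VPS",
    33: "SPS",
    34: "PPS",
    39: "SEI (prefix)",
    40: "SEI (suffix)",
}


def parse_nal_header(header: bytes):
    """Split the two-byte NAL unit header into its fields.

    Bit layout: forbidden_zero_bit (1), nal_unit_type (6),
    nuh_layer_id (6), nuh_temporal_id_plus1 (3).
    """
    if len(header) < 2:
        raise ValueError("an HEVC NAL unit header is two bytes long")
    b0, b1 = header[0], header[1]
    nal_unit_type = (b0 >> 1) & 0x3F
    nuh_layer_id = ((b0 & 0x01) << 5) | (b1 >> 3)
    temporal_id = (b1 & 0x07) - 1
    return nal_unit_type, nuh_layer_id, temporal_id


def classify(header: bytes) -> str:
    """Name the parameter-set or SEI type, or report any other NAL unit type."""
    nal_unit_type, _, _ = parse_nal_header(header)
    return NAL_TYPE_NAMES.get(nal_unit_type, "other (for example, slice data)")
```

Slice header information is not a separate NAL unit type; it sits at the start of each coded slice segment NAL unit, which this sketch reports as "other".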

4. The efficient multi-level parallel HEVC decoding algorithm on a multi-core processor platform according to claim 1, characterized in that, in step 7, the slave thread accepting a task assigned by the main thread and executing the decoding task from the to-be-decoded task queue specifically comprises the following steps: obtaining, from external memory through cache communication, the information of the CTU unit to be decoded and the required neighbouring CTU unit information; decoding the CTU unit; writing the decoded CTU unit back to external memory through cache communication; and notifying the main thread that decoding has finished.
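The four sub-steps performed by a slave thread can be sketched as follows. This is only an illustration under stated assumptions: a dictionary guarded by a lock stands in for external memory and cache communication, `decode_ctu` is a hypothetical placeholder for pixel decoding and reconstruction, and fetching the left and upper neighbours is an arbitrary choice for the "required neighbouring CTU unit information".

```python
import queue
import threading


def decode_ctu(ctu_data, neighbours):
    """Hypothetical placeholder for pixel decoding and reconstruction."""
    return ("decoded", ctu_data)


def slave_thread(tasks, done, external_memory, lock):
    """Fetch, decode, write back and notify, until a None sentinel arrives."""
    while True:
        coord = tasks.get()
        if coord is None:                      # sentinel: no more work
            break
        r, c = coord
        with lock:                             # 1. fetch CTU and neighbour data
            ctu_data = external_memory[coord]
            neighbours = {k: external_memory.get(k)
                          for k in ((r, c - 1), (r - 1, c))}
        result = decode_ctu(ctu_data, neighbours)   # 2. decode the CTU
        with lock:                             # 3. write the result back
            external_memory[coord] = result
        done.put(coord)                        # 4. notify the main thread


def run_demo():
    """Main-thread side: dispatch three CTUs of one row to a single slave."""
    memory = {(0, c): f"bits{c}" for c in range(3)}
    tasks, done = queue.Queue(), queue.Queue()
    lock = threading.Lock()
    slave = threading.Thread(target=slave_thread,
                             args=(tasks, done, memory, lock))
    slave.start()
    for c in range(3):
        tasks.put((0, c))
    finished = [done.get() for _ in range(3)]  # wait for completion messages
    tasks.put(None)
    slave.join()
    return finished, memory
```

With several slave threads the completion messages may arrive in any order, which is why the main thread updates the dependency table from the messages rather than from the order of dispatch.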

5. The efficient multi-level parallel HEVC decoding algorithm on a multi-core processor platform according to claim 4, characterized in that, if the CTU unit to be decoded is an inter CTU unit, the pixel data of its reference CTU unit is obtained from external memory through cache communication.

6. The efficient multi-level parallel HEVC decoding algorithm on a multi-core processor platform according to claim 1, characterized in that the fused loop filtering in step 9 includes deblocking filtering and SAO filtering, and specifically comprises the following steps: according to the data dependencies among the luma component, the chroma components and sample adaptive offset (SAO) during deblocking filtering, constructing a new CTU-like decoding object as the processing object of loop filtering, the CTU-like decoding object being taken from the samples of the current CTU unit and its upper-left and upper CTU units; after deblocking filtering has been executed, repartitioning the range of the current CTU-like decoding object for processing the pixel samples in SAO filtering; for SAO filtering, the range of the CTU-like decoding object is shifted left by one column and up by one row of pixel samples.
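The repartitioning for SAO can be pictured as shifting a rectangle. SAO edge offset classifies a sample by comparing it with its neighbours, so samples on the right and bottom border of a freshly deblocked region cannot be processed until the adjacent region has been deblocked as well. The sketch below assumes the CTU-like decoding object is described by a rectangle in sample coordinates and that the shift is clamped at the picture boundary; the `Region` type and `shift_for_sao` are illustrative names, and the claim gives no further geometry.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Rectangle in sample coordinates (hypothetical representation)."""
    x0: int  # left edge, inclusive
    y0: int  # top edge, inclusive
    x1: int  # right edge, exclusive
    y1: int  # bottom edge, exclusive


def shift_for_sao(deblocked: Region, shift: int = 1) -> Region:
    """Move the region left by one column and up by one row of samples.

    Coordinates are clamped at 0 so the region never leaves the picture.
    """
    return Region(
        max(deblocked.x0 - shift, 0),
        max(deblocked.y0 - shift, 0),
        max(deblocked.x1 - shift, 0),
        max(deblocked.y1 - shift, 0),
    )
```

For a 64 × 64 region whose top-left corner is at (64, 64), `shift_for_sao` gives the rectangle from (63, 63) to (127, 127), so the SAO pass only touches samples whose neighbours have already been deblocked.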

7. The efficient multi-level parallel HEVC decoding algorithm on a multi-core processor platform according to claim 1, characterized in that the to-be-decoded task queue in step 6 is a first-in-first-out queue that stores the CTU units to be decoded; after a slave thread completes the decoding task of one CTU unit, a new decoding task is taken from the head of the to-be-decoded task queue and assigned to that decoding thread.