CN101860752A

CN101860752A - Pipelined parallelization method for video coding on embedded multi-core systems

Info

Publication number: CN101860752A
Application number: CN 201010166248
Authority: CN
Inventors: 徐志远; 刘鹏
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2010-05-07
Filing date: 2010-05-07
Publication date: 2010-10-13
Anticipated expiration: 2030-05-07
Also published as: CN101860752B

Abstract

The invention provides a pipelined parallelization method for video coding on an embedded multi-core system. Starting from a data flow graph representation of the original video coding program, built at the level of the basic data unit it processes, the method simulates the target program to extract the computation workload of each node in the data flow graph, and analyzes both the dependences among the nodes and the data dependences that arise between basic data units once processing is pipelined. From the measured node workloads and inter-node dependences, it selects a pipeline partitioning scheme that balances processor load and keeps inter-core communication low. Once the partitioning scheme is obtained, the task nodes are encapsulated as objects according to the interface standard of the indicator, the encapsulated objects are statically mapped to the corresponding processor cores, and the indicators on the cores coordinate the pipelined parallel execution of the original video coding program on the multi-core system.

Description

A pipelined parallelization method for video coding on embedded multi-core systems

Technical field

The present invention relates to parallel programming for media applications, and in particular proposes a pipelined parallelization method for video coding programs on embedded multi-core systems.

Background technology

The advantages of the multiprocessor system-on-chip (MPSoC: Multi-Processor System-on-Chip) in computing performance, power consumption, chip area, and real-time behavior have led to its increasingly wide use in the embedded field. However, developing parallel programs for MPSoCs easily and efficiently remains a challenge for both application developers and system designers.

Media and video applications are a widely used and computationally intensive class of programs in the embedded field. A media program is a typical data stream processing program: source data that arrive in sequence pass through several processing stages, and the result is then output. There are two strategies for parallelizing such applications. (1) Data-parallel partitioning: source data that have no dependences during processing are assigned to different processors, so that multiple cores execute in parallel. (2) Task-parallel partitioning: the processing of the basic unit, the macroblock, is divided into several stages, and each processor is responsible for only one stage. After finishing its assigned stage for the current macroblock, a processor passes the result to the next processor, which then begins its computation. Several processors thus cooperate to process macroblocks in a pipelined fashion, which provides the parallel speedup.

For video encoders, existing task-parallel partitioning methods have three main limitations: (1) they provide no theoretical basis for determining a task pipeline partitioning scheme; (2) when analyzing a task pipeline partitioning scheme, they do not explicitly consider how data dependences between different source data affect the task pipeline; (3) they lack a simple and efficient mechanism for task synchronization and scheduling. The task-parallel method proposed by the present invention avoids these limitations: it balances processor load by changing how the task pipeline stages are divided, and thereby improves resource utilization.

Summary of the invention

The invention provides a task-pipelined parallelization method for video encoders on an embedded multi-core platform. The method covers two aspects: (1) the parallel partitioning of the original program; (2) the mapping of the resulting tasks/threads onto processors and their parallel scheduling and execution.

The goals of parallel partitioning are to raise the speedup of the parallelized program, to expose the parallelism of the serial program, to schedule tasks so that processor waiting overhead is minimized, and to optimize the resource utilization of the parallel system.

The task-pipelined parallelization method for video encoders on an embedded multi-core platform proposed by the present invention comprises the following steps:

(1) For the target video coding original program, choose the primitive level at which pipelining is performed and obtain a coarse-grained data flow graph representation of the program;

The primitive level is chosen according to the processing flow of the original program. For a video coding program, the macroblock can be chosen as the basic data processing unit.

(2) Simulate the target video coding original program and extract the computation workload of each node in the data flow graph;

Before the target video coding original program is partitioned, a target video sequence is selected and a dynamic simulation is run. The computation workload of each node in the data flow graph obtained in step (1) is recorded and serves as the basis for determining the parallel partitioning scheme.

(3) Analyze the dependences among the nodes in the data flow graph and the data dependences between primitives that arise after pipelining;

The data dependences between nodes in the data flow graph are determined by dynamic simulation, and the volume of data that must be transferred between nodes because of these dependences is measured. This volume serves as the basis for estimating the extra data communication overhead of a pipeline scheme.

(4) Obtain a pipeline partitioning scheme according to the parallelism of the tasks across cores and the inter-core data communication overhead. If the scheme meets the requirements of the parallel system, proceed to step (5); otherwise split or merge nodes in the original data flow graph and return to step (2);

The node workload statistics from step (2) and the inter-node data traffic statistics from step (3) are annotated on the data flow graph obtained in step (1), which yields a synchronous data flow graph representation of the original program. A parallel partitioning scheme is selected according to this synchronous data flow graph and the number of processors on the target multi-core platform. The selection follows two guidelines: (1) the workloads of the pipeline stages should be as equal as possible, to reduce the loss of parallel efficiency caused by unbalanced processor load; (2) according to the data dependences of the nodes in the data flow graph, nodes that exchange large volumes of data should be assigned to the same processor, to reduce inter-core data communication overhead. Finally, the partitioning scheme with the highest overall parallel speedup and resource utilization is chosen.
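The two selection guidelines can be illustrated with a small sketch. It assumes, purely for illustration, that the nodes form a linear chain, that one pipeline stage runs on one core, and that the cost of a stage is its workload plus the traffic on the edges cut at its boundaries. The function name `best_partition`, the cost model, and the numbers are hypothetical and are not taken from the patent.

```python
from itertools import combinations

def best_partition(work, comm, stages):
    """Split a chain of nodes into `stages` contiguous pipeline stages.

    work[i]: computation workload of node i (hypothetical units)
    comm[i]: data traffic on the edge between node i and node i + 1
    Returns (cost, cuts); a cut at position c separates node c - 1 from node c.
    A stage costs its workload plus the traffic on the cut edges it touches,
    and the pipeline cost is the cost of its slowest stage.
    """
    n = len(work)
    best = None
    for cuts in combinations(range(1, n), stages - 1):
        bounds = (0, *cuts, n)
        costs = []
        for lo, hi in zip(bounds, bounds[1:]):
            cost = sum(work[lo:hi])
            if lo > 0:           # traffic received from the previous stage
                cost += comm[lo - 1]
            if hi < n:           # traffic sent to the next stage
                cost += comm[hi - 1]
            costs.append(cost)
        candidate = (max(costs), cuts)
        if best is None or candidate < best:
            best = candidate
    return best

# Hypothetical five-node chain mapped onto three cores.
work = [4, 3, 2, 6, 1]
comm = [1, 5, 1, 2]
cost, cuts = best_partition(work, comm, 3)
print(cost, cuts)
```

In this toy input the heavy edge between nodes 1 and 2 (traffic 5) ends up inside one stage, which reflects guideline (2), while the stage workloads stay close to each other, which reflects guideline (1). Exhaustive search is only practical for small graphs; it is used here for clarity, not as the patent's procedure.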

(5) According to the partitioning scheme obtained in step (4), encapsulate the task nodes as objects that follow the interface standard of the indicator, map the encapsulated objects statically onto the corresponding processor cores, and use the indicator on each processor core to achieve pipelined parallel execution of the video coding original program on the multi-core system.

Each encapsulated object (Object) comprises a behavior function and several input/output ports, and each port has a corresponding input or output buffer. Each port also has a flag bit that represents its state; together these flags form a semaphore entry (SemaEntry). The indicator manages objects by querying and updating the semaphore entries. When all input/output ports of an object are ready, the indicator schedules the object to execute its behavior function, which takes data from the input buffers, processes them, and delivers the results to the output buffers. Data communication between different processors is carried out by the indicators.
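The object and indicator mechanism described above can be sketched as follows. This is a minimal single-process model under stated assumptions: the class names `TaskObject` and `Indicator`, the methods `connect`, `push`, and `step`, and the convention that an input flag means "data available" while an output flag means "buffer free" are all hypothetical. The patent only specifies ports, buffers, per-port flags forming a SemaEntry, and an indicator that queries and updates them.

```python
class TaskObject:
    """An encapsulated task: a behavior function plus buffered ports."""

    def __init__(self, name, n_in, n_out, action):
        self.name = name
        self.action = action              # behavior function, returns a tuple
        self.in_buf = [None] * n_in
        self.out_buf = [None] * n_out
        # SemaEntry model: one flag per port.
        self.in_ready = [False] * n_in    # True: input data available
        self.out_ready = [True] * n_out   # True: output buffer free

    def ready(self):
        return all(self.in_ready) and all(self.out_ready)


class Indicator:
    """Queries and updates port flags, fires ready objects, moves data."""

    def __init__(self):
        self.objects = []
        self.links = {}   # (source object, out port) -> (target object, in port)

    def add(self, obj):
        self.objects.append(obj)

    def connect(self, src, out_port, dst, in_port):
        self.links[(src, out_port)] = (dst, in_port)

    def push(self, obj, port, value):
        obj.in_buf[port] = value
        obj.in_ready[port] = True

    def step(self):
        fired = []
        for obj in self.objects:
            if obj.ready():
                obj.out_buf = list(obj.action(*obj.in_buf))
                obj.in_ready = [False] * len(obj.in_ready)
                obj.out_ready = [False] * len(obj.out_ready)
                fired.append(obj.name)
        # Transfer results only when the consumer's input buffer is free.
        for (src, op), (dst, ip) in self.links.items():
            if not src.out_ready[op] and not dst.in_ready[ip]:
                dst.in_buf[ip] = src.out_buf[op]
                dst.in_ready[ip] = True
                src.out_ready[op] = True
        return fired


collected = []

def double(x):
    return (2 * x,)

def collect(x):
    collected.append(x)
    return ()

stage1 = TaskObject("double", 1, 1, double)
stage2 = TaskObject("collect", 1, 0, collect)
ind = Indicator()
ind.add(stage1)
ind.add(stage2)
ind.connect(stage1, 0, stage2, 0)

trace = []
for value in [1, 2, 3]:
    ind.push(stage1, 0, value)
    trace.append(ind.step())
trace.append(ind.step())   # drain the last item
```

From the second step onward both objects fire in the same step, each on a different data item, which is the pipelined behavior the text describes. On a real multi-core platform each core would run its own indicator and the transfer loop would become inter-core communication; that part is not modeled here.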

The present invention proposes a pipelined parallelization method for media applications. It uses a video encoder as the example for quantitative analysis of program parallelization on a multi-core system, and the implementation approach is generally applicable to programs of the data stream processing type.

Description of drawings

Fig. 1 is a flowchart of the embodiment of the invention;

Fig. 2 is a coarse-grained data flow graph of the embodiment of the invention;

Fig. 3 is a schematic diagram of the relationship between the current macroblock and its adjacent macroblocks in the embodiment of the invention;

Fig. 4 is a schematic diagram of the encoding flow of the MPEG-4 Simple Profile encoder of the embodiment of the invention;

Fig. 5 is the data flow graph for MPEG-4 encoding of a P-type macroblock in the embodiment of the invention;

Fig. 6 is a schematic diagram of the multi-core system platform structure of the embodiment of the invention;

Fig. 7 is a schematic diagram of the three-core parallel partitioning and mapping scheme of the embodiment of the invention;

Fig. 8 is a schematic diagram of the five-core parallel partitioning and mapping scheme of the embodiment of the invention;

Fig. 9 is a schematic diagram of the pipeline of the three-core parallel scheme of the embodiment of the invention;

Fig. 10 is a schematic diagram of the pipeline of the five-core parallel scheme of the embodiment of the invention.

Embodiment

With reference to the accompanying drawings, an embodiment of the multiprocessor pipelined parallelization method for media applications is described below, using the task-pipelined partitioning of an MPEG-4 Simple Profile encoder as the example.

The flow of the pipelined parallelization method for video coding programs on embedded multi-core systems proposed by the present invention is shown in Fig. 1.

Parallelism analysis of the serial program is based on a data flow graph, as shown in Fig. 2. A typical data flow graph consists of nodes and directed arcs that represent the connections between them. A node becomes ready once all of its inputs are available and can then be executed; the result it produces serves as input to its successor nodes. A node can be a program block or a function call. Whenever enough processors are available, all nodes in the ready state can be scheduled for execution at the same time, so the data flow model is inherently suited to exposing the parallelism of a program. The directed arcs in the data flow graph represent data dependences between nodes, and nodes linked by a data dependence cannot execute simultaneously. Data dependences fall into two classes: (1) dependences that arise when processing the same source data, which the present invention calls forward dependences (FD: Forward Dependence) and which are marked by the black dots on the arcs in Fig. 2; (2) dependences that arise when processing different source data, which the present invention calls backward dependences (BD: Backward Dependence) and which are marked by M in Fig. 2.
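The ready-state rule of the data flow model can be made concrete with a short sketch. The graph below is an assumed shape for illustration (F_A feeds F_B and F_C, which both feed F_D); the actual topology of Fig. 2 is not reproduced in the text, and the function name `firing_waves` is hypothetical. Only forward dependences are modeled.

```python
def firing_waves(deps):
    """Group nodes into waves that can execute simultaneously.

    deps maps each node to the set of nodes whose output it consumes
    (forward dependences only). A node is ready once all of its
    producers have executed.
    """
    done, waves = set(), []
    while len(done) < len(deps):
        ready = sorted(n for n, d in deps.items() if n not in done and d <= done)
        if not ready:
            raise ValueError("cyclic forward dependence")
        waves.append(ready)
        done.update(ready)
    return waves

# Assumed graph: F_B and F_C have no forward dependence on each other.
deps = {
    "F_A": set(),
    "F_B": {"F_A"},
    "F_C": {"F_A"},
    "F_D": {"F_B", "F_C"},
}
waves = firing_waves(deps)
print(waves)
```

With enough processors, every node in one wave can be scheduled at the same time, so F_B and F_C land in the same wave while F_A and F_B do not.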

Conventional parallelism analysis considers only the restrictions imposed by forward dependences. For example, nodes F_B and F_C in Fig. 2 have no forward dependence between them and can therefore execute in parallel, whereas nodes F_A and F_B have a producer-consumer data dependence dAB when processing the same source data and cannot execute in parallel. If F_A and F_B have no data dependence when processing different source data, they can be arranged as a pipeline: while F_A processes the current input data, F_B processes the input data that F_A handled in the previous time step. This removes the restriction imposed by the forward dependence within the same source data and achieves task-pipelined parallelism.

A task pipelining scheme should be chosen so as to minimize the synchronization waiting overhead caused by backward dependences and thereby improve system performance. As shown in Fig. 2, if nodes F_B and F_D are in different pipeline stages, the backward dependence M means that F_B cannot begin processing the next source data until F_D has finished processing the current source data. The pipeline stalls while it waits, which lowers system performance.
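The stall caused by a backward dependence can be quantified with a small timing model. The sketch below assumes one stage per core, fixed per-stage processing times, and at most one backward dependence; the function name `makespan`, the `backward` parameter, and the stage times are hypothetical and chosen only to illustrate the effect.

```python
def makespan(stage_times, n_items, backward=None):
    """Finish time of a linear pipeline processing n_items data units.

    stage_times[s]: processing time of stage s for one data unit.
    backward=(consumer, producer): stage `consumer` may not start item i
    until stage `producer` has finished item i - 1 (a backward dependence).
    """
    n_stages = len(stage_times)
    finish = [[0] * n_items for _ in range(n_stages)]
    for i in range(n_items):
        for s in range(n_stages):
            start = 0
            if s > 0:                       # forward dependence, same item
                start = max(start, finish[s - 1][i])
            if i > 0:                       # the stage handles one item at a time
                start = max(start, finish[s][i - 1])
            if backward and i > 0 and s == backward[0]:
                start = max(start, finish[backward[1]][i - 1])
            finish[s][i] = start + stage_times[s]
    return finish[-1][-1]

ideal = makespan([1, 1, 1], 4)                       # no backward dependence
stalled = makespan([1, 1, 1], 4, backward=(1, 2))    # stage 1 waits on stage 2
merged = makespan([1, 2], 4)                         # both nodes in one stage
print(ideal, stalled, merged)
```

In this toy model, placing the dependent nodes in different stages stalls the pipeline, and merging them into a single stage reaches the same finish time with one core fewer. This is consistent with the text's point that the partitioning should account for backward dependences, though the real trade-off depends on the measured workloads.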

To improve the resource utilization of the parallel system, it is necessary to measure the computation workload of each node in the data flow graph of the program, analyze the forward and backward dependences between nodes, and take factors such as data communication overhead into account, in order to find a program partitioning and task mapping scheme that balances the load across the processors and minimizes pipeline stall overhead.

When determining the parallelization scheme, the choice of task node granularity is critical. A finer task granularity makes it easier to balance processor load during task mapping and adapts more flexibly to different numbers of processor cores. However, communication overhead, partitioning complexity, and code size all increase accordingly, because every task must be encapsulated as an object and each object needs its own additional data communication management module.

The encoding flow of MPEG-4 Simple Profile is shown in Fig. 4. In the MPEG-4 encoder, the primitives of the original image, macroblocks (MB: MacroBlock), are input in scan order from top to bottom and left to right, and a compressed bitstream is produced after several processing tasks. To arrange the task pipeline sensibly during parallelization, one must consider not only the data dependences between tasks when encoding a single macroblock but also the data dependences between adjacent macroblocks, that is, the backward data dependences.

As shown in Fig. 4, the encoding flows of I frames and P frames differ: I-frame macroblocks do not require the motion search that P-frame macroblocks undergo. A partitioning scheme designed for P frames therefore leaves the pipeline stages noticeably unbalanced when I-frame data are processed. However, the usual coding pattern is IPPP...IPPP..., with between ten and several tens of P frames separating two I frames, and the more P frames lie between two I frames, the smaller the impact of this imbalance. The pipeline stages are therefore partitioned mainly with the balance of P-frame processing in mind.

Simulation on typical video sequences (foreman, news, mobile, bus) yields the share of the computation workload contributed by each main module when the MPEG-4 encoder encodes a P-frame macroblock, as shown in Table 1.

Table 1. Share of the total computation workload contributed by each main module when MPEG-4 encodes a P-frame macroblock

To obtain as high a compression ratio as possible, the MPEG-4 encoder uses a variety of prediction techniques, which strengthens the data dependences between macroblocks. Whether in an intra-coded frame (I frame) or an inter-coded frame (P frame), the current macroblock has data dependences on macroblock A, macroblock B, and macroblock C, as shown in Fig. 3. In an I frame, the predicted value of the current macroblock data is calculated from macroblock A, macroblock B, and macroblock C. In a P frame, the motion vector difference (MVD: Motion Vector Difference) that is finally encoded is obtained by subtracting the predicted vector from the current motion vector, and the predicted vector is calculated from macroblock A, macroblock C, and macroblock D. The loop filtering of both I frames and P frames requires the values of the macroblocks to the left and above. Entropy coding must follow the macroblock scan order strictly: the entropy coding of the next macroblock cannot begin before that of the current macroblock has finished.

Provided that all macroblocks in the row above have been processed by the time the current macroblock is handled, the backward data dependence exists only between the current macroblock and the macroblock to its left. For the nodes linked by this backward dependence, however, the pipeline stage of the producer precedes the pipeline stage of the consumer. In Fig. 4, the coefficient prediction module needs the result of encoding the previous macroblock when it processes the current macroblock, and the macroblock coding module precedes the coefficient prediction module. The encoding of the previous macroblock has therefore already finished when coefficient prediction runs on the current macroblock, so the dependence does not stall the task pipeline.

As shown in Table 1, the workload shares are unbalanced, and the motion estimation module accounts for most of the computation time. To balance processor load, nodes must be split or merged, and the partitioning strategy must be determined according to the resources of the specific hardware platform. The motion estimation module can be divided into integer-pixel motion estimation and half-pixel motion estimation, and the half-pixel motion estimation module, which has the larger workload, can be further divided into one 16×16 block search and four 8×8 block searches. The data flow graph for MPEG-4 encoding of a P-frame macroblock is shown in Fig. 5. The number beside each node in Fig. 5 is the number of instructions that the node executes, and the data traffic between nodes is listed in Table 2.

Table 2. Data traffic between nodes

The present invention uses a heterogeneous multi-core system-on-chip, the multi-core RED platform, which comprises one reduced instruction set computer (RISC: Reduced Instruction Set Computer) processor and eight digital signal processors (DSPs: Digital Signal Processors), to carry out experiments with three-core and five-core parallel schemes. The structure of the RED platform is shown in Fig. 6. The task partitioning and mapping schemes for three cores and five cores are shown in Fig. 7 and Fig. 8, respectively.

The mapping scheme in Fig. 7 is a three-stage pipeline. The two objects on digital signal processor #1 actually belong to different pipeline stages, but once the pipeline has been running for some time they can be regarded as operating in the same pipeline stage on different source data. The mapping scheme in Fig. 8 is a five-core, four-stage pipeline, in which the objects on digital signal processor #2 and digital signal processor #3 have no data dependence on each other, run in parallel, and belong to the same pipeline stage. The corresponding pipeline states are shown in Fig. 9 and Fig. 10, respectively.

The modules parallelized in the two schemes above account for about 96.3% of the total computation workload of the program. The theoretical speedups of these modules under the two parallel schemes are 2.92 and 4.28, respectively. According to Amdahl's law, the theoretical speedups of the whole program are therefore 2.72 and 3.82, respectively.
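The Amdahl's law step can be checked with a few lines of arithmetic. The sketch below uses the standard form of the law, overall speedup = 1 / ((1 - p) + p / s), with the parallel fraction p = 0.963 and the module speedups s = 2.92 and s = 4.28 quoted above; the function name `amdahl` is hypothetical.

```python
def amdahl(parallel_fraction, module_speedup):
    """Overall speedup when only part of the program is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / module_speedup)

three_core = amdahl(0.963, 2.92)
five_core = amdahl(0.963, 4.28)
print(f"three cores: {three_core:.4f}, five cores: {five_core:.4f}")
```

Because 96.3% is itself an approximate figure, the computed values agree with the stated 2.72 and 3.82 only to within about 0.01.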

Finally, it should be noted that the above are only specific embodiments of the present invention. The invention is clearly not limited to these embodiments, and many variations are possible. All variations that a person of ordinary skill in the art can derive directly from, or associate with, the disclosure of the present invention shall be considered within the scope of protection of the invention.

Claims

1. A pipelined parallelization method for video coding on an embedded multi-core system, characterized in that it comprises the following steps:

(5) according to the partitioning scheme obtained in step (4), encapsulating the task nodes as objects that follow the interface standard of the indicator, mapping the encapsulated objects statically onto the corresponding processor cores, and using the indicator on each processor core to achieve pipelined parallel execution of the video coding original program on the multi-core system.

2. The method according to claim 1, characterized in that: in step (2), a target video sequence is selected, and the share of the computation workload of each node in the data flow graph is recorded while the video coding original program dynamically processes the target sequence, as the basis for obtaining the task pipeline stage partitioning scheme.

3. The method according to claim 1, characterized in that: in step (3), the dependences of the nodes in the data flow graph comprise the data dependences between nodes that arise when processing the same primitive and the data dependences between nodes that arise when processing different primitives.

4. The method according to claim 1, characterized in that the pipeline scheme is obtained as follows: the number of stages of the task pipeline and the computation workload of each stage are determined from the number of processors on the target multi-core platform and the total computation workload that the target video coding original program requires to process one primitive; the nodes are then assigned to the corresponding pipeline stages according to the share of the workload of each node in the data flow graph and the data dependences between nodes.

5. The method according to claim 4, characterized in that the selection of the pipeline partitioning scheme follows two guidelines: (1) the workloads of the pipeline stages should be as equal as possible, to reduce the loss of parallel efficiency caused by unbalanced processor load; (2) according to the data dependences of the nodes in the data flow graph, nodes that exchange large volumes of data should be assigned to the same processor, to reduce inter-core data communication overhead; finally, the partitioning scheme with the highest overall parallel speedup and resource utilization is chosen.

6. The method according to claim 1, characterized in that: in step (5), the encapsulated object comprises input/output ports and a behavior function, each input/output port has a corresponding input or output buffer, and the behavior function takes data from the input buffers, processes them, and delivers the results to the output buffers.

7. The method according to claim 1, characterized in that: in step (5), the indicator records the state information of each port of each object, the state information forms semaphore entries, and the indicator performs object scheduling, management, and synchronization by querying and updating the values of the semaphore entries.