CN101673391B - Pipelined image processing engine - Google Patents

Pipelined image processing engine

Info

Publication number
CN101673391B
CN101673391B CN2009101691171A CN200910169117A
Authority
CN
China
Prior art keywords
effect
pixel
block
frame
main
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009101691171A
Other languages
Chinese (zh)
Other versions
CN101673391A (en)
Inventor
赫曼·K·加拉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Corp
Sony Electronics Inc
Original Assignee
Sony Corp
Sony Electronics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Corp, Sony Electronics Inc
Publication of CN101673391A
Application granted
Publication of CN101673391B
Expired - Fee Related
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 - General purpose image data processing
    • G06T1/20 - Processor architectures; Processor configuration, e.g. pipelining
    • G06T15/00 - 3D [Three Dimensional] image rendering
    • G06T15/005 - General purpose rendering architectures


Abstract

A pipelined image processing engine and a method of processing image frames through a pipeline of effects by breaking the image frames into multiple blocks of image data are disclosed. The example method includes generating a plurality of blocks from each frame, processing each block through a pipeline of effects in a predefined consecutive order, and aggregating the processed blocks to produce an output frame by combining the primary pixels from each processed block. The pipeline of effects may be distributed over a plurality of processing nodes, and each effect may process a block provided as input to its node. Each processing node may independently process a block using an effect.

Description

Pipelined image processing engine
Technical field
The present invention relates generally to a pipelined image processing engine, and more specifically to systems and methods for processing a plurality of image frames through an effect pipeline on a multi-core processing system.
Background art
As manufacturing technology continues to improve, the physical limits of semiconductor-based microelectronics are being pushed to satisfy the demand for more capable microprocessors. Limits on raw processing speed have led to an adaptation: balancing processor speed and performance by adopting multi-core or distributed architectures that parallelize and distribute the workload, thereby increasing processing power and reducing processing time.
Applications in the movie/video/imaging space commonly chain multiple image processing effects together in a pipelined fashion. These image processing pipelines can benefit from the parallelization offered by multi-core or distributed architectures. Applications that perform this kind of image processing share common characteristics, including the need to process and move large amounts of data at real-time rates.
Traditional implementations of such systems suffer from a combination of poor throughput, high latency, and increased complexity, which reduces extensibility and scalability. These limitations are compounded when multiple discrete effects that depend on sequential execution are chained together, as in a pipelined system.
When processing video, an 'effect pipeline' refers to the application of visual effects to images in a defined order. Correspondingly, an 'effect' refers to one stage of the effect pipeline.
Existing methods of distributing hardware resources among the effects of an effect pipeline are typically subject to a number of restrictions on parallelization.
One approach defines the minimum unit of work as a full image frame, which allows each effect to operate independently on a given frame or on different frames. Although this approach allows multiple effects to coexist in a shared system without forcing tight integration between them, it also results in high overall latency. That latency is proportional to the number of effects in the effect pipeline.
'Pipeline performance' is measured as a combination of latency and computation time. 'Latency' refers to the time the pipelined system takes to deliver a given unit of data. For an effect pipeline, 'latency' refers to the time each image frame spends from the moment it enters the first effect in the pipeline until it leaves the last effect in the pipeline.
'Computation time' refers to the time required to process a standard unit of data, for example an image frame. Computation time can also be expressed as a function of the video system's frame rate, or as a function of the real time needed to process a frame.
Figure 11 illustrates a six-stage, frame-based effect pipeline on a multi-core system. In Figure 11, in the best case, where the number of processors in the pipeline is equal to or greater than the number of effects, each effect is assigned to its own processor (or core). Each processor processes its effect on one complete image frame at a time. The image frames may originate from a video stream or another source of image frames. Each incoming frame is processed sequentially in turn. For simplicity, it is assumed that every effect takes the same amount of time to process a frame; this represents the best case for the time required to output each frame while maintaining a consistent frame rate.
At time t1, frame 1 enters the pipeline and processor 1 applies the first effect to frame 1. At time t2, frame 2 is loaded and processed by processor 1, while frame 1 is loaded and processed by processor 2. At time t3, frame 3 is loaded and processed by processor 1, frame 2 by processor 2, and frame 1 by processor 3. From time t4 to t6, frames 4 through 6 enter the pipeline as frames 1-3 advance along it. At the end of time t6, frame 1 exits the pipeline.
The pipeline latency can be measured as the time required to process one frame through all stages of the pipeline, or in 'frame times', i.e. the number of frames processed by the first effect in the pipeline before frame N leaves the effect pipeline. As a function of frame times, the pipeline latency is calculated as:
PL_FT = M - N (Formula 1)
where:
PL_FT is the pipeline latency in frame times,
M is the frame entering the pipeline at time T, and
N is the frame leaving the pipeline at time T, with N < M.
For a frame-based architecture, the pipeline latency can also be defined as the product of the processing time of the slowest effect in the chain and the number of effects (or stages) in the pipeline. This is calculated as:
PL_CT = N x T_FM (Formula 2)
where:
PL_CT is the pipeline latency expressed as computation time,
N is the number of stages in the pipeline, and
T_FM is the time required by the slowest effect M in the pipeline to process one frame.
Assume that a new frame is fed into the pipeline every 120 ms, that the pipeline comprises 6 stages (effects), and that the slowest effect in the pipeline also takes 120 ms to process one frame of data. The pipeline latency (PL) of the frame-based system of Figure 11 is then 6 x 120 = 720 ms.
Furthermore, because the pipeline latency is measured as a function of the time at which the first frame (frame 1) enters and leaves the pipeline, no benefit is obtained from running each effect continuously in parallel on its own processor. That is, even if each frame were available for processing immediately, rather than a new frame being fed into the pipeline every 120 ms, the pipeline would still have a latency of 6 frame times, or 720 ms of computation time, measured from the time the first frame enters the pipeline to the time it leaves.
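As a rough illustration (not part of the original disclosure), the following Python sketch evaluates the frame-based timing above under the stated assumptions of six stages and a 120 ms frame time; the function names and the calculation itself are illustrative only.

```python
FRAME_TIME_MS = 120.0
NUM_STAGES = 6

def frame_based_latency_ms(num_stages: int, slowest_stage_ms: float) -> float:
    # Formula 2: PL_CT = N x T_FM
    return num_stages * slowest_stage_ms

def exit_time_ms(frame_index: int, num_stages: int, frame_time_ms: float) -> float:
    # Frame i (0-based) enters at i * frame_time and leaves num_stages frame times later.
    return (frame_index + num_stages) * frame_time_ms

if __name__ == "__main__":
    print(frame_based_latency_ms(NUM_STAGES, FRAME_TIME_MS))   # 720.0 ms
    print(exit_time_ms(0, NUM_STAGES, FRAME_TIME_MS))          # frame 1 leaves at 720.0 ms
```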
The system described above is constrained by the available hardware resources: if the number of effects exceeds the number of available independent processing elements, or if the available cache memory in each processor cannot hold a whole frame, processing becomes considerably slower.
Summary of the invention
The present invention provides a pipelined image processing engine that implements a flexibly scalable pipeline which can adapt to limited hardware resources. Embodiments may also include static and dynamic load balancing. Example embodiments remove or reduce many of the restrictions described above, while also providing a more general framework for implementing effects.
An example embodiment of the invention may include a method for processing image frames through an effect pipeline. The method may include generating a plurality of blocks from each frame, processing each block through the effect pipeline in a predetermined consecutive order, and aggregating the processed blocks to produce an output frame by combining the pixels in each processed block. Each block may include a set of primary pixels and a set of total pixels, where the total pixels include any pixels required as input by the effects in the effect pipeline to produce their output for the primary pixels. The primary pixels in each block are the pixels used to generate the final frame output by the effect pipeline.
The effect pipeline may be distributed over a plurality of processing nodes, and each effect may process the block provided as input to its node to produce output for the block's primary pixels. Each processing node may process a block with an effect independently. The effects may be analyzed for scheduling purposes in order to reduce the latency between processing the first block of each frame and outputting the last block of each frame.
The effects may include spatial effects and temporal effects. When a spatial effect is processed, the total pixels of a block may include pixels from neighboring blocks. When a temporal effect is processed, the total pixels may include pixels from temporally adjacent image frames.
Block generation and processing may employ different strategies, for example the reduced-extent or fixed-extent strategies described below, but are not limited to these and may include combinations and sub-combinations of different strategies. For example, different parts of the effect pipeline may operate using different strategies.
In one embodiment, generating the total pixels may include identifying any pixels required to produce output for the primary pixels as the block is processed consecutively by the plurality of effects in the effect pipeline. This may include analyzing a first effect to determine the total pixels required to produce output for the primary pixels, and analyzing a second effect to determine the total pixels required to generate output for those total pixels that the first effect needs in order to produce its output for the primary pixels. The processing step may then reduce the number of total pixels in the block after the block has been processed by the second effect and before the first effect is processed.
In another embodiment, the processing step may include updating the total pixels in a block using the primary pixels of at least one neighboring block after a first effect has been processed and before a second effect is processed.
Another example embodiment may be an apparatus for processing a chain of image frames through an effect pipeline. The apparatus may include a main processing unit, a plurality of auxiliary processors, and a bus. The bus may interconnect the main processing unit, the plurality of auxiliary processors, and a memory interface.
The main processing unit may include a block generator, an effect distributor, and a block aggregator. The block generator may generate a plurality of blocks from an input image frame provided via the memory interface. The effect distributor may manage the distribution and processing order of the effects and blocks among the plurality of auxiliary processors. Finally, the block aggregator may combine the processed blocks.
Each of the plurality of auxiliary processors may include a minimal memory cache that stores the contents of a block received from the block generator. Each of the plurality of auxiliary processors may process one block at a time, in consecutive order, through the effect pipeline. The effect pipeline may be distributed over the plurality of auxiliary processors, with each auxiliary processor executing an effect independently to produce the output for the primary pixels of a given block.
Description of drawings
These and other more detailed and specific features of the present invention are more fully disclosed in the following description, with reference to the accompanying drawings, in which:
Fig. 1 is an example of a multi-core processor according to the present invention.
Fig. 2 is a second example embodiment, a computer cluster according to the present invention.
Fig. 3 illustrates a 12 x 12 pixel frame divided into 16 uniform blocks.
Fig. 4 illustrates block formation for a 12 x 12 pixel frame.
Fig. 5A illustrates an example embodiment of a reduced-extent strategy for processing three area effects in a pipelined scenario according to the present invention.
Fig. 5B illustrates the process of determining the size of the outer extent required to generate the desired primary-data output, given the three effects of Fig. 5A.
Fig. 6A illustrates a fixed-extent strategy for processing an effect pipeline according to the present invention.
Fig. 6B is a first example embodiment of a block traversal strategy.
Fig. 6C is a second example embodiment of a block traversal strategy.
Fig. 7 illustrates an example embodiment of a process for implementing an effect pipeline on a multi-core processor according to the present invention.
Fig. 8 illustrates a first modified effect pipeline according to the present invention, in which the minimum unit of work is reduced to a block the size of one field.
Fig. 9 uses the same effect pipeline and block-based framework as Fig. 8, but with the block size reduced to one quarter of a frame.
Fig. 10A illustrates an example of a distributed, unbalanced effect pipeline.
Fig. 10B illustrates an example of a distributed, load-balanced effect pipeline.
Figure 11 illustrates a conventional effect pipeline in a multi-core system.
Embodiments
In the following description, numerous details such as flow diagrams, system configurations, and processing architectures are set forth for illustrative purposes in order to provide an understanding of one or more embodiments of the present invention. It will be apparent to those skilled in the art, however, that these details are not necessary to practice the invention.
The present invention provides a pipelined image processing engine that implements a flexibly scalable pipeline able to adapt to limited hardware resources. Embodiments reduce the minimum unit of work processed by each 'effect' from a frame to a 'block'. Each block represents a subset of a whole frame. Block-based processing splits the data equivalent of one frame into N overlapping or non-overlapping blocks, and allows each effect at any stage of the pipeline to process a block independently of other blocks. This allows multiple effects to work simultaneously without waiting for an entire frame to become available, and eliminates the need for complicated synchronization mechanisms. Each block is self-contained, carrying enough frame data for every effect to work on it independently. Because the blocks are self-contained and the effects working on them are loosely coupled, the framework can easily assign more physical processors to slow effects in the pipeline and fewer physical processors to light, simple effects. Furthermore, the processor assignment can change from frame to frame without any user intervention.
The present invention can be embodied in various forms, including business processes, computer-implemented methods, computer program products, computer systems and networks, user interfaces, APIs, and so on.
Fig. 1 illustrates an example embodiment of a multi-core processor 100 according to the present invention.
The multi-core processor 100 may include a pool of separate single processor units (SPUs) 104 connected by a fast internal bus 110. The multi-core processor 100 may also include a parallel processing unit (PPU) 102, a memory interface 106, and an input/output interface 108. The PPU 102 may coordinate the distribution of processing tasks among the SPUs 104. The PPU 102 may also include a local memory cache 103 for storing information and instructions related to the processes or effects it runs. Each SPU 104 may include a memory cache 105 for storing information related to the process or effect assigned to that SPU 104.
During operation, input comprising an effect pipeline and a plurality of image frames can be provided to the multi-core processor 100 via the input/output interface 108. The PPU 102 can parse the effect pipeline into a plurality of separate effects. Each image frame can be analyzed in conjunction with the effects and divided into pixel blocks. The PPU 102 can then distribute the effects and blocks among the SPUs 104 for processing. After the SPUs 104 have processed the blocks with their effects, the PPU 102 can aggregate the processed blocks into a complete output image frame.
Although Fig. 1 illustrates an architecture with distinct PPU 102 and SPUs 104, those skilled in the art will understand that the invention is not limited to this architecture and can be implemented on homogeneous multi-core processors or other types of specialized system architectures.
Fig. 2 illustrates a second example embodiment of the present invention in the form of a computer cluster 200, which can distribute the workload associated with processing an effect pipeline. The computer cluster 200 may include a pool of discrete processing units (DPUs) 204 connected by a network 210. The computer cluster 200 may also include a cluster management unit (CMU) 202 and a control terminal 208. The CMU 202 may coordinate the distribution of processing tasks among the DPUs 204. The CMU 202 may also include a cluster monitor for tracking the progress, load, and effects distributed to each DPU 204.
During operation, an effect pipeline and a plurality of image frames can be provided to the CMU 202 from the terminal 208 or from another source with access to the network 210. The CMU 202 can parse the effect pipeline into discrete effects. Each image frame can be analyzed in conjunction with the effects and divided into pixel blocks. The CMU 202 can then distribute the effects and pixel blocks among the DPUs 204 for processing.
Types of effects
For a block of pixel data, an 'effect' is a process that changes or distorts the pixel block. Each block contains a number of pixels, for example a rectangular or square grid of pixels. If M x N processing elements are simultaneously available, a block of M x N pixels can be processed in parallel at once. This is only possible when each input pixel provides all of the data a given effect needs to produce the corresponding output pixel. Many effects, however, depend on the input pixel and its neighboring pixels in order to change or distort the input pixel appropriately for the output pixel.
Effects can be divided into three categories: point effects, area effects, and range effects. A 'point effect' produces an output pixel based only on a single input pixel; color-correction effects fall into this category. An 'area effect' requires a certain amount of neighboring pixels, in addition to the given input pixel, to produce an output pixel; convolution filters such as blur effects are area effects. Area effects pose particular challenges to parallelization related to memory footprint and performance. A 'range effect' requires the complete frame as input to produce a single output pixel. Histogram effects are generally range effects, because they require an analysis of the entire frame before any output pixel can be generated. Range effects pose the greatest challenge to parallelization.
Although the effects above have been described with respect to the spatial domain (that is, they operate on the data of a single frame), these effects can also operate in the temporal domain. Temporal effects may require data from temporally adjacent frames. For example, a temporal area effect may require input pixel data from adjacent frames to produce the output pixel data of a single frame, and a temporal range effect may require all of the frames of a clip or sequence to produce the pixel data of a single output frame. For example, Box Blur is a spatial area effect, while Motion Blur is a temporal area effect. Furthermore, an effect can work in both domains. For example, interlaced-to-progressive conversion is a spatio-temporal area effect, because it needs neighboring data in both the spatial domain (within a single frame) and the temporal domain (adjacent frames).
Block division
To reduce the unit of work associated with a frame, each frame can be divided into smaller sub-frames, or data 'blocks'.
Fig. 3 illustrates a 12 x 12 pixel frame 302 divided into 16 uniform blocks 304 (shown as blocks 1-16). The frame 302 contains 12 x 12 pixels and each block 304 (304-1 to 304-16) contains 3 x 3 pixels, so 16 blocks are needed to map the frame 302 completely onto blocks 304. It will be appreciated that the 12 x 12 pixel frame 302 is provided only for purposes of example and in no way limits the invention to a particular frame size, resolution, or aspect ratio.
Example embodiments may keep all blocks in a given frame the same size. This can provide an even distribution of work among the SPUs 104. However, block sizes may also be distributed according to different schemes. For example, where an effect creates a strong dependency between certain pixels, those pixels can be grouped into a single block to facilitate parallelization. A particular frame can also be divided based on various criteria such as color relationships, object identity, and so on. Finally, the block-sizing criteria may aim to produce blocks with similar shares of work, so that the processing time of a given effect remains roughly uniform across blocks.
Block division (and aggregation) is not itself an inherently parallel task, and can therefore be delegated to a single processor that can reach all of the SPUs 104 and has access to more memory resources, for example the PPU 102. Because the PPU 102 can quickly access a complete frame, while the SPUs 104 preferably operate on discrete blocks, the division and aggregation of blocks can be delegated to the PPU 102. Furthermore, when blocks 304 are offloaded to the SPUs 104, interaction between the PPU 102 and the SPUs 104 is preferably minimized. This maximizes computational throughput by leveraging the multiple high-speed SPUs 104 while avoiding the delays associated with cross-processor communication.
Splitting frames into non-overlapping, unique blocks performs well when processing point effects. For area effects, however, the same non-overlapping blocks still contribute to the output, but additional pixel data is needed around their edges. For example, a 5 x 5 pixel block may be required to modify or output a 3 x 3 pixel block 304.
For a block 304, the portion of the block 304 that can be modified by an effect is called the 'primary data' 402. The additional data adjoining the primary data 402 is called the 'margin data' 406 and may be required to support the processing of area effects. The combination of the primary data 402 and the margin data 406 is called the 'total data' of the block 304. Correspondingly, the region covered by the primary data 402 is called the 'inner extent' 404, and the region covered by the combination of the primary data 402 and the margin data 406 is called the 'outer extent' 408.
The amount of margin data 406 required depends on the area effect. For example, a convolution effect may require margin data 406 equivalent to a 1-pixel-wide perimeter around each block 304. Fig. 4 illustrates the block division of the 12 x 12 pixel frame 302. In Fig. 4, the pixel frame 302 is divided into 16 blocks 304 of 3 x 3 pixels (304-1 to 304-16). For example, although block 304-6 has 3 x 3 pixels of primary data 402 making up its inner extent 404, its total data comprises a 5 x 5 pixel region with outer extent 408, consisting of the primary data 402 and a margin 406 one pixel wide.
In this example, the margin data 406 of each block overlaps the primary data 402 of at least one adjacent block 304. This margin data 406 can be included redundantly so that each block 304 is as self-contained as possible for effect processing, in order to achieve maximum performance. Blocks 304 located at the edges of the frame 302 can include fabricated margin data 406 to ensure that they can be processed uniformly. For example, this margin data 406 can consist of copies of the boundary pixels of the frame 302, or be set to a fixed value.
In the example embodiment of Fig. 4, all of the margin data 406 of block 304-6 lies within the inner extents 404 of adjacent blocks 304; block 304-6 is therefore treated as an 'interior block'. By contrast, two edges of block 304-16 coincide with the edges of the frame 302, so part of the margin data 406 of block 304-16 must be fabricated as described above. Fabricating the margin data 406 of edge blocks 304 when the blocks are created avoids the need for special boundary-condition checks in every effect, and avoids requiring effects to distinguish between interior and edge blocks 304.
The size of a block 304 excluding any margin data 406 is also called the 'primary block size', because this is the only part of the block 304 that contributes to the final output image frame. The size of the block including the margin data is called the 'total block size'.
In Fig. 4, the primary block size of each block 304 is 9 pixels and the total block size is 25 pixels. Because the blocks are square, each side can be computed simply by adding the margin pixels to the primary block size.
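Purely as an illustration of the partitioning described above (and not code from the patent), the sketch below splits a frame into fixed-size blocks with a margin of a chosen width, replicating frame-boundary pixels to fabricate the margin data of edge blocks, and then reassembles the frame from the primary pixels. The use of NumPy, the function names, and the assumption that the frame divides evenly into blocks are all choices made for the sketch.

```python
import numpy as np

def split_into_blocks(frame: np.ndarray, block: int, margin: int):
    """Split a 2-D frame into blocks of `block` x `block` primary pixels,
    each padded with `margin` extra pixels on every side (the outer extent).
    Edge blocks get fabricated margin data by replicating boundary pixels.
    Assumes the frame divides evenly into blocks."""
    h, w = frame.shape
    padded = np.pad(frame, margin, mode="edge")   # fabricate frame-edge margins
    blocks = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            total = padded[y:y + block + 2 * margin,
                           x:x + block + 2 * margin]
            blocks.append(((y, x), total.copy()))  # (primary origin, total data)
    return blocks

def aggregate_blocks(blocks, shape, block: int, margin: int) -> np.ndarray:
    """Recombine processed blocks, keeping only each block's primary pixels."""
    out = np.zeros(shape, dtype=blocks[0][1].dtype)
    for (y, x), total in blocks:
        out[y:y + block, x:x + block] = total[margin:margin + block,
                                              margin:margin + block]
    return out

if __name__ == "__main__":
    frame = np.arange(144, dtype=np.float32).reshape(12, 12)   # 12 x 12 frame (Fig. 3)
    blocks = split_into_blocks(frame, block=3, margin=1)       # 16 blocks, total size 5 x 5
    print(len(blocks), blocks[0][1].shape)                     # 16 (5, 5)
    assert np.array_equal(aggregate_blocks(blocks, frame.shape, 3, 1), frame)
```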
Although block division has been described with respect to spatial effects, it will be apparent to those skilled in the art that block division also works in the temporal domain. When operating in the temporal domain, a block 304 mainly contains pixels from temporally adjacent frames, and the total block size can be two- or three-dimensional rather than two-dimensional as in the spatial image domain.
Although Figs. 5A and 5B illustrate a primary block size of 9 pixels and a total block size of 25 pixels for each block 304, those skilled in the art will appreciate that this embodiment can be extended to non-square blocks by considering the margin requirements of each effect in the pipeline independently for each dimension. The embodiments can also be extended to multi-dimensional domains (where a block is not two-dimensional but n-dimensional).
Block margins in a pipelined system
In the context of an effect pipeline comprising multiple effects, each successive area effect increases the margin requirement of the total block size. The following example embodiments are described with respect to a pipeline of three area effects, each requiring a 1-pixel-wide margin. It will be understood, however, that a given effect may require various margin widths, which may or may not be distributed evenly around the inner extent 404.
In this example, the block 304 has primary data 402 with a primary block size of 9 pixels and a total block size of 25 pixels. Adding the margin data 406 produces a block of 25 pixels, which carries enough information for the first effect in the pipeline to produce output for the primary data 402. However, once the first effect has processed the block 304, the next area effect in the pipeline, working with the remaining 3 x 3 pixels of data, may not have enough data to produce the primary data 402, or may have to access unmodified margin data 406 that is inconsistent with the modified primary data 402. This is because the margin data 406 of the block 304 is still identical to what the first effect received, while the primary data 402 reflects the output of the first effect. To cope with the pipelined case, a larger total block size may be provided, since each successive area effect can produce output only for an ever smaller region.
Fig. 5A illustrates an example embodiment of a reduced-extent strategy for processing three area effects in a pipelined scenario. This example embodiment overcomes the shrinkage caused by consecutive area effects by adding extra margin data 406 for each successive area effect in the pipeline, so that each preceding area effect can process, as part of its output, the margin data 406 required by the following effect.
In the illustrated reduced-extent strategy, three area effects process the pixel block consecutively. With this strategy, the outer extent of the block 304 shrinks continuously as the block is processed by each effect. In the context of the example above, the first effect processes 81 pixels and outputs (modifies) 49 pixels, the second effect processes those 49 modified pixels and outputs 25 pixels, and the third effect processes those 25 modified pixels and outputs the 9 pixels of the inner extent 404.
In the first step, the primary block size is 9 pixels and the total block size is 81 pixels. The first area effect is given the block 304 with outer extent 408-3 and produces output whose total data has outer extent 408-2. The second area effect is given the total data with outer extent 408-2 and produces output whose total data has outer extent 408-1. The third area effect is then given the block 304 with outer extent 408-1 and produces the output for the primary data 402.
Fig. 5B illustrates the process of determining, under the reduced-extent strategy, the size of the outer extent 408 required to generate the desired primary data 402 output, given the three effects of Fig. 5A. The outer extent 408 of a block is determined as follows: take the required inner extent 404 (3 x 3 pixels in this example), add the margin 406 of the last area effect in the pipeline (giving the 5 x 5 pixel outer extent 408-1), work back through the middle area effect (giving the 7 x 7 pixel outer extent 408-2), and then back through the first effect (giving the 9 x 9 pixel outer extent 408-3). This process yields a block size with a 9 x 9 pixel outer extent 408-3.
An alternative way to compute this is simply to add the margin size required by each effect to the size of the primary data 402. If the original primary data 402 block is 3 x 3 pixels and each effect requires a 1-pixel margin (adding 2 pixels in each dimension), the total size is 9 x 9.
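The following minimal sketch (an illustration, not the patent's code) accumulates the per-effect margin widths onto the primary block size, reproducing the 3, 5, 7, 9 pixel extents of Figs. 5A and 5B.

```python
def reduced_extent_total_size(primary: int, margins: list[int]) -> list[int]:
    """Return the outer-extent side lengths from the inner extent outward.
    `margins` lists each effect's required margin width in pipeline order."""
    sizes = [primary]
    for m in reversed(margins):            # walk back from the last effect
        sizes.append(sizes[-1] + 2 * m)    # each margin adds m pixels per side
    return sizes                           # [inner, 408-1, 408-2, 408-3]

if __name__ == "__main__":
    # Three area effects, each needing a 1-pixel margin (Fig. 5A/5B).
    print(reduced_extent_total_size(3, [1, 1, 1]))   # [3, 5, 7, 9]
```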
The reduced-extent strategy benefits from the high degree of independence given to each block 304 as it traverses the pipeline. That is, once a block 304 enters the pipeline, it does not need to synchronize with any neighboring block 304 before it leaves the pipeline, nor does it need any control mechanism to guarantee data integrity. This also provides the greatest freedom to reduce the pipeline latency.
Although the reduced-extent strategy incurs extra processing time for every effect except the last stage, and extra storage for the total data, the cost and latency can be reduced by distinguishing between point effects and area effects and placing any point effects toward the end of the pipeline (after the last area effect).
Fig. 6A illustrates the fixed-extent strategy for processing an effect pipeline. In the fixed-extent strategy, only one effect is processed at each stage of the pipeline before each block 304 is synchronized with its neighboring blocks 304. Once a given block 304 has been synchronized with its neighbors, it can be processed by the next effect in the pipeline. This strategy may add some slack to the pipeline latency, but it reduces computational cost.
Fig. 6A illustrates an example embodiment of the fixed-extent strategy in which the inner extent 404 and the outer extent 408 of the effect output are identical. In Fig. 6A, each effect uses the total data as input but processes only the primary data 402. After each area effect has been processed, and before the next effect is processed, synchronization with the neighboring blocks 304 ensures that all of the data in the block 304 is updated using the primary data 402 of the neighboring blocks 304.
The efficiency of the fixed-extent strategy may be affected by the order in which the blocks 304 are processed. Because the fixed-extent strategy requires post-effect synchronization of data between neighboring blocks 304, a given second (or current) effect cannot begin processing a given block until the first (or preceding) effect has processed all of the block's neighboring blocks 304 and the block's margin data 406 has been synchronized with its neighbors. As an example, after being processed by the first effect, block 304-1 must be synchronized with its surrounding blocks, namely blocks 304-2, 304-5, and 304-6, before it can be processed by the second effect. Similarly, block 304-6 must be synchronized with blocks 304-1, 304-2, 304-3, 304-5, 304-7, 304-9, 304-10, and 304-11.
Fig. 6B illustrates an example embodiment of a block traversal strategy that processes the blocks in consecutive (row-major) order. Although this is the simplest processing strategy, it may not be the most efficient. Given the parallel nature of the multi-core processor 100, the PPU 102 can begin synchronizing the margin data 406 of a given block as soon as its neighboring blocks 304 have been processed. As an example, after block 304-1 has been processed by the first effect, five more blocks, namely blocks 304-2 through 304-6, must be processed before it can be synchronized. Similarly, after block 304-6 has been processed by the first effect, five more blocks, namely blocks 304-7 through 304-11, must be processed before it can be synchronized. An alternative block traversal strategy can therefore improve the concurrency of the fixed-extent strategy.
Fig. 6C illustrates an example embodiment of an alternative block traversal strategy in which the blocks are processed in an order that reduces the number of blocks 304 that must be processed before a given block can be synchronized. As an example, after block 304-1 has been processed by the first effect, only three more blocks, namely blocks 304-2, 304-5, and 304-6, must be processed before it can be synchronized. For block 304-1, both strategies require exchanging data with three neighboring blocks 304. However, because the number of blocks 304 actually waited for under the alternative strategy is fixed (at most four) and independent of the total number of blocks 304 in the frame, the alternative block traversal strategy is more beneficial. By contrast, under the original traversal strategy of Fig. 6B, even though block 304-1 depends on the same number of neighboring blocks, it must wait for all of the other blocks 304 in its row (blocks 304-2 through 304-4) plus the two relevant blocks in the next row of the frame (304-5 and 304-6) to be processed. That wait includes three relevant blocks (304-2, 304-5, 304-6) and two irrelevant blocks (304-3, 304-4); the irrelevant blocks are included because the traversal strategy sweeps each row from left to right and all rows from top to bottom. For frames divided into larger numbers of blocks, the difference may be significant (for example, for a 100 x 100 block frame, under the strategy of Fig. 6B block 0 may have to wait for 101 blocks, i.e. 3 relevant blocks and 98 irrelevant blocks, whereas under the strategy of Fig. 6C it still only has to wait for 4 blocks 304, i.e. 3 relevant blocks and 1 irrelevant block).
Although Fig. 6C shows one alternative strategy for processing the blocks 304, this strategy is merely exemplary and is intended to illustrate the potential benefit of changing the order in which the blocks 304 are traversed. Other traversal strategies are possible within the spirit of the invention. For example, an alternative strategy could improve performance by optimizing for a different block partitioning strategy, for the effects in the pipeline, for memory capacity, for processor speed, and so on.
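As an assumption-laden illustration of why the traversal order matters under the fixed-extent strategy, the sketch below counts how many later blocks must be processed, under a given order, before a block has seen all of its neighbors and can be synchronized; the 4 x 4 and 100 x 100 cases reproduce the wait counts discussed above for the row-major order of Fig. 6B.

```python
def wait_count(order: list[tuple[int, int]], block: tuple[int, int],
               rows: int, cols: int) -> int:
    """Number of blocks processed after `block` until all of its 8-neighbors
    (within the frame) have also been processed, under the given order."""
    pos = {b: i for i, b in enumerate(order)}
    r, c = block
    neighbors = [(r + dr, c + dc)
                 for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if (dr, dc) != (0, 0)
                 and 0 <= r + dr < rows and 0 <= c + dc < cols]
    return max(pos[n] for n in neighbors) - pos[block]

if __name__ == "__main__":
    rows = cols = 4
    row_major = [(r, c) for r in range(rows) for c in range(cols)]   # Fig. 6B order
    print(wait_count(row_major, (0, 0), rows, cols))   # 5: block 304-1 waits for 5 blocks
    print(wait_count(row_major, (1, 1), rows, cols))   # 5: block 304-6 waits for 5 blocks
    big = [(r, c) for r in range(100) for c in range(100)]
    print(wait_count(big, (0, 0), 100, 100))           # 101 under row-major order
```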
Pipeline processing
Fig. 7 illustrates an example embodiment of a process 700 for implementing an effect pipeline on the multi-core processor 100 according to the present invention. Although process 700 can be implemented in a variety of systems, it is described with respect to the multi-core processor 100 for clarity.
In step 702, the multi-core processor 100 can receive input identifying the effect pipeline. The effect pipeline can identify a plurality of effects that can be executed by the SPUs 104 to process pixel blocks 304. The PPU 102 can analyze the plurality of effects to identify the spatial/temporal margin requirements of each effect.
In step 704, the effects can be assigned and distributed among the SPUs 104 based on the analysis of step 702. The PPU 102 can assign one effect to each SPU 104. Alternatively, the PPU 102 can assign several fast, simple effects to a single SPU 104 and assign complex, slow effects to multiple SPUs 104. In another alternative embodiment, the system can rely on dynamic load balancing to manage the distribution of effects to the SPUs 104.
In step 706, the PPU 102 can receive an image frame 302 to be processed according to the effect pipeline. Alternatively, the PPU 102 can read the image frame 302 from memory or from any other accessible data source via the memory interface 106.
In step 708, the PPU 102 can generate data blocks 304 from the image frame 302. The size and structure of the data blocks 304 can be predefined, or can depend on the processing requirements of each effect and on the processing strategy adopted.
In step 710, the PPU 102 can provide each data block 304 to the SPU 104 assigned to the first effect in the effect pipeline. Each data block 304 can then traverse the effects of the effect pipeline in the order defined by the pipeline.
In step 712, once a data block 304 has been processed by the last effect defined by the effect pipeline, the data block 304 can be aggregated by the PPU 102 in the memory cache 103 or in other memory. The PPU 102 can then combine the aggregated data blocks 304 into the output image frame 302.
In step 714, if any further image frames 302 need to be processed through the effect pipeline, the process returns to step 706; otherwise, the process ends.
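A compact outline of process 700 is sketched below for illustration only. The callables split_into_blocks, aggregate_blocks, and the effects list are assumed, a thread pool stands in for the SPUs, and, unlike the patent's arrangement in which effects are distributed across SPUs, the sketch runs every effect for a block on a single worker.

```python
from concurrent.futures import ThreadPoolExecutor

def run_effect_pipeline(frames, effects, split_into_blocks, aggregate_blocks,
                        num_workers=4):
    """Outline of steps 702-714: split each frame into blocks, push every block
    through the ordered effects, then aggregate the processed blocks."""
    def process_block(block):
        for effect in effects:                      # step 710: traverse effects in order
            block = effect(block)
        return block

    outputs = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:   # stands in for the SPUs
        for frame in frames:                        # step 706: next frame to process
            blocks = split_into_blocks(frame)       # step 708: generate data blocks
            processed = list(pool.map(process_block, blocks))   # step 710
            outputs.append(aggregate_blocks(processed, frame))  # step 712: aggregate
    return outputs                                  # step 714: all frames processed
```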
As Figure 11 illustrates, the expected latency of the prior-art six-stage effect pipeline is six frames, because by definition each frame takes one frame time to process at each stage.
Fig. 8 illustrates a modified effect pipeline in which the minimum unit of work is reduced to a block 304. In this embodiment each block 304 is equal to half of a frame 302 (one field) and is therefore processed by each effect in 1/2 frame time. As shown in Fig. 8, this reduces the latency from 6 frames to 3 1/2 frames.
In Fig. 8, the first field F1B1 is processed in 3 frame times (1/2 frame x 6 effects). The second field F1B2 is also processed in 3 frame times, but enters the pipeline only 1/2 frame after the first field F1B1. As a result, the second field F1B2 leaves the pipeline after 3 1/2 frames (1/2 frame x 6 effects + 1/2 frame delay), giving a total latency of 3 1/2 frame times for the whole first frame 302.
The example embodiment shown in Fig. 9 uses the same effect pipeline and block-based framework as Fig. 8, but reduces the block size to one quarter of a frame. By shrinking the block size again, this embodiment improves the pipeline latency from 6 frames (prior art) to 2 1/4 frames.
In Fig. 9, the first quarter-frame F1B1 is processed in 1 1/2 frames (1/4 frame x 6 effects). The second quarter-frame F1B2 is also processed in 1 1/2 frames, but enters the pipeline only 1/4 frame after the first quarter-frame F1B1. As a result, the second quarter-frame F1B2 leaves the pipeline after 1 3/4 frames (1/4 frame x 6 effects + 1/4 frame delay). The third and fourth quarter-frames F1B3 and F1B4 likewise leave the pipeline 1/2 frame and 3/4 frame after the first quarter-frame F1B1. As a result, the first frame is fully processed in 2 1/4 frames.
The illustrations above clearly show the advantage of block-based effect processing over frame-based effect processing. As the block size continues to shrink, further gains in pipeline latency become possible; however, the gains become smaller and smaller. Eventually, the latency introduced by the other computational aspects of transferring and handling the data blocks 304 outweighs the gains from further reducing the block size.
As with the traditional frame-based architecture, the pipeline latency of a block-based system can also be defined in terms of the processing time of the slowest effect in the chain and the number of effects (or stages) in the pipeline. Switching to a block-based architecture, however, reduces this to the time the slowest effect M takes to process the data of one block 304 multiplied by the number of effects (or stages) minus one, plus the time the slowest effect takes to process the data of one frame 302. The block-based pipeline latency expressed as computation time can be calculated as:
PL_CT = (S x T_BM) + ((N - 1) x T_BM) (Formula 3)
where:
PL_CT is the pipeline latency expressed as computation time,
N is the number of effects/stages in the pipeline,
S is the number of blocks 304 generated from each frame 302, and
T_BM is the time required by the slowest effect M in the pipeline to process one block.
Assuming that the overhead associated with handling blocks is negligible, the per-frame processing time is:
T_FM = S x T_BM (Formula 4)
For example, the PL_CT of the block-based system shown in Fig. 8 becomes (2 x 60) + (5 x 60) = 420 ms when S = 2. Because any overhead associated with switching to block processing has been ignored, T_BM is simply calculated as T_FM / S.
Expressing the latency in frame times yields the formula:
PL_FT = PL_CT / FR (Formula 5)
where:
PL_FT is the pipeline latency in frame times,
PL_CT is the pipeline latency expressed as computation time, and
FR is the effective frame period, i.e. the time per frame at the system frame rate.
Using the numbers above, the PL_FT of the example in Fig. 8 is 420 / 120 = 3.5 frames, which equals the number calculated earlier. The example shown in Fig. 9 yields a PL_CT of 270 ms and a PL_FT of 2.25 frames.
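The short snippet below (illustrative only, not part of the disclosure) evaluates Formulas 3-5 for one, two, and four blocks per frame and reproduces the 720 ms, 420 ms, and 270 ms latencies quoted for the frame-based case, Fig. 8, and Fig. 9 respectively.

```python
def block_pipeline_latency_ms(num_stages: int, blocks_per_frame: int,
                              slowest_frame_ms: float) -> float:
    # Formula 4: T_BM = T_FM / S (block-handling overhead assumed negligible)
    t_bm = slowest_frame_ms / blocks_per_frame
    # Formula 3: PL_CT = S*T_BM + (N-1)*T_BM
    return blocks_per_frame * t_bm + (num_stages - 1) * t_bm

if __name__ == "__main__":
    frame_ms = 120.0
    for s in (1, 2, 4):   # frame-based, Fig. 8 (half frames), Fig. 9 (quarter frames)
        pl_ct = block_pipeline_latency_ms(6, s, frame_ms)
        # Formula 5: PL_FT = PL_CT / FR
        print(s, pl_ct, pl_ct / frame_ms)   # (1, 720.0, 6.0), (2, 420.0, 3.5), (4, 270.0, 2.25)
```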
As noted earlier, these examples assume that the overhead of block-based processing is negligible. Real-world scenarios may involve more computation time for the effects earlier in the pipeline (reduced-extent strategy), or larger synchronization overhead (fixed-extent strategy), which reduces the pipeline latency improvement.
Memory requirements
Block-based systems offer a major advantage in the memory footprint of the pipeline. A traditional frame-based N-stage pipeline needs at least N frame buffers to keep all of its stages busy. Because frames are also input and output at the boundaries of the effect pipeline, such a system needs several additional frame buffers for that purpose (a minimum of 2, one for output and one for input, or 4 if double-buffered I/O is used). The minimum number of frame buffers required to run the pipeline is therefore effectively N + 2.
If P denotes the number of processing elements available in the pipelined system, the required buffers reduce to min(P, N) buffers, because if the number of processing elements is less than the number of stages in the pipeline, the system can only process P frames 302 simultaneously. Conversely, if the number of pipeline stages is less than the number of processing elements, the system can only use N buffers simultaneously, while the remaining processing elements stay idle.
The formula for the minimum number of frame buffers required by a traditional frame-based pipeline can thus be derived as:
Y_F = min(P, N) + 2 (Formula 6)
where:
Y_F is the minimum number of frame buffers required to run the pipeline,
P is the number of available processing elements, and
N is the number of stages/effects in the pipeline.
The actual memory requirement for the buffers then becomes:
M_FBA = Y_F x Z_F (Formula 7)
where:
M_FBA is the minimum buffer memory required by the frame-based pipeline, in bytes,
Y_F is the minimum number of frame buffers required to run the pipeline, and
Z_F is the size of each frame buffer, in bytes.
Because blocks are smaller than frames, the memory savings of a block-based system can be substantial even for pipelines of modest length. Following the logic above, the minimum number of buffers required in a block-based system is likewise min(P, N) block buffers, plus at least 2 frame buffers for the I/O operations. As above:
Y_B = min(P, N) (Formula 8)
where:
Y_B is the minimum number of block buffers required to run the pipeline,
P is the number of available processing elements (SPUs 104), and
N is the number of stages in the pipeline.
The actual memory requirement for all of the buffers, however, is:
M_BBA = Y_B x Z_B + 2 x Z_F (Formula 9)
where:
M_BBA is the minimum buffer memory required by the block-based pipeline, in bytes,
Y_B is the minimum number of block buffers required to run the pipeline,
Z_B is the size of each block 304 buffer, in bytes, and
Z_F is the size of each frame 302 buffer, in bytes.
As an example, consider a 6-stage pipeline on a 4-SPU system in which the memory needed to hold one frame 302 is 8192 KB (the size needed to hold a 2048 x 1024 HD frame at 4 bytes per pixel), and each frame 302 is divided into blocks 304 of 256 x 256 pixels, so that holding each block 304 requires 256 KB and each frame 302 has 32 blocks 304 (margin size = 0 for simplicity).
Using the formula for the traditional frame-based method, the minimum buffer memory required by the frame-based pipelined system is:
M_FBA = [min(P, N) + 2] x Z_F = [min(4, 6) + 2] x 8192 = 6 x 8192 = 49152 KB
By comparison, the minimum memory required to run the same pipeline with the block-based method is:
M_BBA = [min(P, N)] x Z_B + 2 x Z_F = [min(4, 6)] x 256 + 2 x 8192 = 4 x 256 + 2 x 8192 = 17408 KB.
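A quick check of Formulas 6-9 against the numbers in this example (illustrative code, not from the patent):

```python
def frame_based_buffer_kb(p: int, n: int, frame_kb: int) -> int:
    return (min(p, n) + 2) * frame_kb            # Formulas 6-7

def block_based_buffer_kb(p: int, n: int, block_kb: int, frame_kb: int) -> int:
    return min(p, n) * block_kb + 2 * frame_kb   # Formulas 8-9

if __name__ == "__main__":
    P, N = 4, 6
    FRAME_KB = 2048 * 1024 * 4 // 1024           # 8192 KB per 2048 x 1024 x 4-byte frame
    BLOCK_KB = 256 * 256 * 4 // 1024             # 256 KB per 256 x 256 block
    print(frame_based_buffer_kb(P, N, FRAME_KB))             # 49152 KB
    print(block_based_buffer_kb(P, N, BLOCK_KB, FRAME_KB))   # 17408 KB
```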
Note, furthermore, that the two mandatory frame buffers account for 16 MB of the roughly 17 MB used by the block-based system. Excluding those mandatory buffers, the comparable memory use is about 32 MB for the frame-based system versus 1.5 MB for the block-based system. Even where the block-based method needs a larger amount of memory for consecutive area effects, its memory consumption does not come close to that of the frame-based system.
Load balancing
The example embodiments illustrated so far assume that all of the constituent effects take the same amount of time to process the same amount of data (whether a frame 302 or a block 304). In practice, however, effects can take varying amounts of time depending on the complexity of their algorithms and implementations. Varying processing times across effects can erode the pipeline latency gains, with the slowest effect creating a bottleneck. A bottleneck can subsequently starve the downstream SPUs 104 of work. If the number of buffers is limited, it can also starve the upstream SPUs 104 of work as memory is consumed.
Fig. 10A illustrates an example of a pipeline in which the computation times of the individual effects vary greatly. In Fig. 10A, effects FX1, FX2, and FX4 require 20 ms of computation time, effect FX3 requires 160 ms, and effects FX5 and FX6 require 10 ms. In this embodiment, because each block enters the pipeline at effect FX1 and passes consecutively through effects FX2-FX6, a bottleneck forms at effect FX3.
The processing pipeline of Fig. 10A is similar to the embodiments discussed with respect to Figs. 8 and 9, in which each SPU 104 is assigned a single effect. However, whereas the block latency in Figs. 8 and 9 is 6 blocks, in Fig. 10A the total pipeline latency is much higher because of the bottleneck created by effect FX3. This leaves the downstream effects (and therefore the other SPUs 104) wasting time in an idle state. Over time, the bottleneck may also cause a huge number of buffers to be consumed merely to keep the upstream SPUs 104 occupied, with no benefit at the output.
Static load balancing provides a first solution to this problem. Static load balancing performs a static analysis of the effects that takes into account the computation time of each effect involved (FX1-FX6). This allows an up-front distribution of the effects, which may include combining short-running effects onto a single resource (that is, a single SPU, or fewer SPUs overall) while increasing the number of resources dedicated to slow effects (for example, devoting multiple SPUs to them).
Fig. 10B illustrates an example of a load-balanced effect pipeline. The effect processing times in Fig. 10B are the same as in Fig. 10A; however, the effects are load-balanced across the SPUs 104 (104-1 to 104-6). In Fig. 10B, SPUs 104-2 to 104-5 are dedicated to effect FX3. SPU 104-1 is assigned effects FX1 and FX2, and SPU 104-6 is assigned effects FX4, FX5, and FX6. This distribution uses all of the SPUs 104 while also limiting buffer use and improving overall performance. Most notably, the excessive processing time consumed by effect FX3 is spread across multiple SPUs 104 (104-2 to 104-5). This ensures that whenever a block 304 leaves effect FX2, an SPU 104 is available to process effect FX3. Likewise, whenever a block 304 leaves effect FX3, the SPU 104 assigned to effects FX4 through FX6 is immediately available.
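The following sketch is only an illustration of the static load balancing idea under the timings assumed in Fig. 10A; it is not the patent's algorithm. It groups consecutive effects onto SPUs so that each SPU carries roughly the same load, gives a slow effect several SPUs, and reproduces the grouping of Fig. 10B.

```python
def static_balance(effect_ms: list[float], num_spus: int):
    """Group consecutive effects onto SPUs so each SPU carries roughly the
    same load; an effect slower than the target gets several SPUs."""
    target = sum(effect_ms) / num_spus
    plan, group = [], []
    for i, t in enumerate(effect_ms):
        if t > target:                              # slow effect: dedicate multiple SPUs
            if group:
                plan.append(("1 SPU", group))
                group = []
            plan.append((f"{round(t / target)} SPUs", [i]))
        elif group and sum(effect_ms[j] for j in group) + t > target:
            plan.append(("1 SPU", group))           # current SPU is full; start a new one
            group = [i]
        else:
            group.append(i)
    if group:
        plan.append(("1 SPU", group))
    return plan

if __name__ == "__main__":
    # FX1..FX6 timings from Fig. 10A: 20, 20, 160, 20, 10, 10 ms
    for spus, fx in static_balance([20, 20, 160, 20, 10, 10], num_spus=6):
        print(spus, ["FX%d" % (i + 1) for i in fx])
    # 1 SPU ['FX1', 'FX2'] / 4 SPUs ['FX3'] / 1 SPU ['FX4', 'FX5', 'FX6']
```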
Static load balancing can work well for fixed-function pipelines, but it does not scale well when general effect pipelines are needed, or even for fixed-function pipelines when the effect processing time varies with the complexity of the input data or with external parameters.
Dynamic load balancing provides an alternative to static load balancing in situations where the processing load varies frame by frame or block by block, whether based on external input (for example, user-supplied parameters that change an effect's processing time) or on the workload itself (for example, a decoder whose load varies with the compression of the coded data).
Switching from frame-based to block-based processing allows more effective dynamic load balancing in the scheduling system, because there are more synchronization points that the scheduler can use to stagger events, and because localized differences in work can be taken into account (for example, within the same frame, a block containing low-contrast data will have different performance characteristics for different effects than a block containing high-contrast data, and switching to block-based dynamic load balancing allows the system to adapt to this at a finer granularity).
About processing described herein, system, method, trial method etc.; Be understood that; Though these processed steps etc. are described to take place according to certain generic sequence, yet, also can utilize the step of carrying out by the order except order described herein of describing to implement these processing.Be to be further appreciated that, can carry out some step simultaneously, the step that can add other perhaps can be omitted some step described herein.Handle the computer executable instructions (for example, one or more scripts) that also can be implemented as in client, server and/or the database, the process of storage, executable program etc.In other words, the description of handling is provided for the purpose that illustrates some embodiment here, and never should be interpreted as the invention of requirement for restriction protection.
Therefore, will understand top description hopes it is illustrative and nonrestrictive.To know many embodiment and the application except the example that is provided after the description of those skilled in the art on read.The four corner of the equivalent that should be with reference to top description should not enjoy with reference to appended claims and these claims is confirmed scope of the present invention.Expection also hopes that development in the future will take place in the technology of being discussed here, and the system and method for being mentioned will be included among the embodiment in these future.In a word, should be understood that the present invention can make amendment and change and only limited following claim.
Any situation that disclosed method for parallel processing can be applied to audio frequency effect streamline, 3D effect streamline or can data is split as explant and be processed through effect pipeline.
Computing equipment such as those computing equipments of being discussed here (for example, processor, client, server, terminal etc.) generally can comprise executable instruction.In addition, processor can comprise any equipment of the processing components that has comprised the arbitrary number such as SPU, PPU, GPU etc. itself.Can be from computer program compiling or analytical Calculation machine executable instruction, computer program is to utilize multiple programming language such as the Java that well known to a person skilled in the art the form alone or in combination of including but not limited to, C, C++, compilation and/or technology to create.Usually, processor (for example, microprocessor) receives instruction (for example, from storer, computer-readable medium etc.), and carries out these instructions, carries out the one or more processing that comprise one or more processing described herein thus.Can utilize multiple known computer-readable medium to store and send these instructions and other data.
A computer-readable medium includes any medium that participates in providing data (for example, instructions) that may be read by a computer. Such a medium may take many forms, including, but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks and other persistent memory. Volatile media include dynamic random access memory (DRAM), which typically constitutes a main memory.
Communications between computing devices, and within a computing device, may employ transmission media, including coaxial cables, copper wire, and fiber optics, including the wires that make up a system bus coupled to a processor. Transmission media may include or convey acoustic waves, light waves, and electromagnetic emissions, such as those generated during radio-frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, a DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EEPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.
Accordingly, embodiments of the invention produce and provide a pipelined image processing engine. Although the invention has been described in considerable detail with reference to certain embodiments thereof, the invention may be embodied in various other forms without departing from its spirit or scope. Therefore, the following claims should not be limited in any way to the description of the embodiments contained herein.
This application is related to U.S. Provisional Application No. 61/191,557, filed on September 9, 2008, the entire contents of which are incorporated herein by reference.

Claims (26)

1. A method for processing image frames through an effect pipeline, wherein the effect pipeline refers to applying visual effects to an image in a defined order, the method comprising:
generating a plurality of blocks, each block comprising a set of aggregate pixels having a set of primary pixels and a set of edge pixels, the aggregate pixels comprising any pixels needed as input by an effect in the effect pipeline to produce an output for the primary pixels, the primary pixels being the pixels to be used in a final frame generated by the effect pipeline;
processing each block through the effect pipeline in a predefined consecutive order, the effect pipeline being distributed over a plurality of processing nodes; and
aggregating the processed blocks to produce an output frame by combining the primary pixels from each processed block,
wherein each effect in the effect pipeline processes a block, provided as input to a node, to produce an output for the primary pixels.
2. The method of claim 1, wherein the aggregate pixels comprise the primary pixels and any other pixels needed as input by at least one effect to produce an output for the primary pixels.
3. The method of claim 1, wherein generating the plurality of blocks further comprises analyzing an effect to determine the aggregate pixels needed to generate an output for the primary pixels.
4. The method of claim 3, wherein generating the plurality of blocks comprises:
analyzing a first effect to determine the aggregate pixels needed to produce an output for the primary pixels; and
analyzing a second effect to determine the aggregate pixels needed to generate an output for the aggregate pixels that the first effect requires to produce its output for the primary pixels.
5. The method of claim 4, wherein the processing step reduces the number of aggregate pixels in a block after the block has been processed by the second effect and before the first effect is processed.
6. The method of claim 1, wherein the processing step comprises: after a first effect has been processed and before a second effect is processed, updating the aggregate pixels in a block using primary pixels from at least one neighboring block.
7. The method of claim 1, wherein the aggregate pixels comprise any pixels needed as input by a plurality of effects in the effect pipeline that successively process a block to produce an output for the primary pixels.
8. The method of claim 1, wherein generating the plurality of blocks comprises analyzing each effect to determine whether the effect is a pixel effect, an area effect, or a range effect, wherein a pixel effect is an effect that produces an output pixel based on only one input pixel, an area effect is an effect that requires some number of neighboring pixels in addition to a given input pixel to produce an output pixel, and a range effect is an effect that requires a complete frame as input to produce a single output pixel.
9. The method of claim 1, wherein generating the plurality of blocks comprises analyzing the resources available to a node and the effects to determine the number of aggregate pixels and primary pixels in each of the plurality of blocks.
10. The method of claim 1, wherein the aggregate pixels comprise pixels from a plurality of temporally adjacent image frames.
11. The method of claim 1, wherein the aggregate pixels comprise a plurality of spatially adjacent pixels in an input image frame.
12. The method of claim 1, wherein, when each block is processed through the effect pipeline, each processing node independently processes a block using an effect.
13. The method of claim 1, wherein the processing step comprises scheduling the effects to reduce the delay between the start of processing of the first block in each frame and the output of the last block in each frame.
14. An apparatus for processing image frames through an effect pipeline, wherein the effect pipeline refers to applying visual effects to an image in a defined order, the apparatus comprising:
a main processing unit comprising a block generator, an effect distributor, and a block aggregator;
a plurality of auxiliary processors, each auxiliary processor comprising a minimal memory cache that stores the contents of a block; and
a bus interconnecting the main processing unit, the plurality of auxiliary processors, and a memory interface;
wherein
the block generator generates a plurality of blocks from an input image frame provided via the memory interface, each block comprising a set of aggregate pixels having a set of primary pixels from the input image frame and a set of edge pixels, the aggregate pixels comprising any pixels needed as input by an effect in the effect pipeline to produce an output for the primary pixels, the primary pixels being the pixels to be used in a final frame generated by the effect pipeline,
the effect distributor manages the distribution and processing order of the effects and the plurality of blocks among the plurality of auxiliary processors,
the block aggregator combines the processed blocks,
the plurality of auxiliary processors process each block through the effect pipeline in consecutive order, the effect pipeline being distributed over the plurality of auxiliary processors, and
each auxiliary processor independently executes an effect that processes each block to produce an output for the primary pixels.
15. The apparatus of claim 14, wherein the aggregate pixels in each block comprise the primary pixels and any other pixels needed as input by at least one effect to produce an output for the primary pixels.
16. The apparatus of claim 14, wherein the main processing unit analyzes an effect to determine the aggregate pixels needed to generate the primary pixels.
17. The apparatus of claim 14, wherein the block generator analyzes a first effect to determine the aggregate pixels needed to generate the primary pixels, and analyzes a second effect to determine the aggregate pixels needed to generate the aggregate pixels that the first effect requires to generate the primary pixels.
18. The apparatus of claim 17, wherein an auxiliary processor reduces the number of aggregate pixels in a block after the block has been processed by the second effect and before the first effect is processed.
19. The apparatus of claim 14, wherein, after a block has been processed by a first effect and before a second effect is processed, the auxiliary processor updates the aggregate pixels of the block using primary pixels from at least one neighboring block.
20. The apparatus of claim 14, wherein the aggregate pixels comprise any pixels needed as input by at least one auxiliary processor that executes at least one effect in the effect pipeline and successively processes a block to produce an output for the primary pixels.
21. The apparatus of claim 14, wherein the block generator analyzes each effect to determine whether at least one effect is a pixel effect, an area effect, or a range effect, wherein a pixel effect is an effect that produces an output pixel based on only one input pixel, an area effect is an effect that requires some number of neighboring pixels in addition to a given input pixel to produce an output pixel, and a range effect is an effect that requires a complete frame as input to produce a single output pixel.
22. The apparatus of claim 14, wherein generating the plurality of blocks comprises analyzing the resources available to an auxiliary processor and the corresponding effects to determine the number of aggregate pixels and primary pixels.
23. The apparatus of claim 14, wherein the aggregate pixels comprise pixels from a plurality of temporally adjacent image frames.
24. The apparatus of claim 14, wherein the aggregate pixels comprise pixels from among the primary pixels of a plurality of spatially adjacent blocks.
25. The apparatus of claim 14, wherein, when each block is processed through the effect pipeline, each processing node independently processes a block using an effect.
26. The apparatus of claim 14, wherein the main processing unit comprises a scheduler to reduce the delay between the start of processing of the first block in each frame and the output of the last block in each frame.
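The block-sizing analysis recited in claims 3 to 5, 8, 17, and 21 can be illustrated with a short numerical sketch. This is an illustration only, under assumed effect classifications, and not the patented implementation: walking the effect chain backwards and accumulating each area effect's neighborhood radius gives the halo of edge pixels a block must carry so that every effect can still produce all of its primary pixels, while a range effect, which needs the complete frame, cannot be bounded this way. The EffectInfo type, the radius field, and the example pipeline are hypothetical.

#include <cstdio>
#include <vector>

// Hypothetical classification of effects, following the wording of claim 8:
// a pixel effect needs only the input pixel itself, an area effect needs a
// neighborhood of the given radius around it, and a range effect needs the
// complete frame.
enum class Kind { Pixel, Area, Range };

struct EffectInfo {
    Kind kind;
    int radius;  // neighborhood radius in pixels; 0 for pixel effects
};

// Walk the pipeline backwards and accumulate the halo of edge pixels a block
// needs so that every effect can still produce all of the block's primary
// pixels (each effect's inputs are outputs of the effect before it).
// Returns -1 if a range effect makes a bounded per-block halo impossible.
int required_halo(const std::vector<EffectInfo>& pipeline) {
    int halo = 0;
    for (auto it = pipeline.rbegin(); it != pipeline.rend(); ++it) {
        if (it->kind == Kind::Range) return -1;          // needs the whole frame
        if (it->kind == Kind::Area) halo += it->radius;  // grow the halo
    }
    return halo;
}

int main() {
    // Example: a 3x3 blur (radius 1), then a per-pixel color curve, then a
    // 5x5 sharpen (radius 2) -> the block needs a halo of 3 edge pixels.
    std::vector<EffectInfo> pipeline = {
        {Kind::Area, 1}, {Kind::Pixel, 0}, {Kind::Area, 2}};
    std::printf("halo = %d\n", required_halo(pipeline));  // prints: halo = 3
    return 0;
}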
CN2009101691171A 2008-09-09 2009-09-09 Pipelined image processing engine Expired - Fee Related CN101673391B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US19155708P 2008-09-09 2008-09-09
US61/191,557 2008-09-09
US12/457,858 US8754895B2 (en) 2008-09-09 2009-06-24 Pipelined image processing engine
US12/457,858 2009-06-24

Publications (2)

Publication Number Publication Date
CN101673391A CN101673391A (en) 2010-03-17
CN101673391B true CN101673391B (en) 2012-08-29

Family

ID=41397499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009101691171A Expired - Fee Related CN101673391B (en) 2008-09-09 2009-09-09 Pipelined image processing engine

Country Status (6)

Country Link
US (1) US8754895B2 (en)
EP (1) EP2161685B1 (en)
JP (1) JP2010067276A (en)
CN (1) CN101673391B (en)
AU (1) AU2009213013B2 (en)
CA (1) CA2678240C (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8238624B2 (en) 2007-01-30 2012-08-07 International Business Machines Corporation Hybrid medical image processing
US8462369B2 (en) 2007-04-23 2013-06-11 International Business Machines Corporation Hybrid image processing system for a single field of view having a plurality of inspection threads
US8331737B2 (en) * 2007-04-23 2012-12-11 International Business Machines Corporation Heterogeneous image processing system
US8326092B2 (en) * 2007-04-23 2012-12-04 International Business Machines Corporation Heterogeneous image processing system
US8675219B2 (en) 2007-10-24 2014-03-18 International Business Machines Corporation High bandwidth image processing with run time library function offload via task distribution to special purpose engines
US9135073B2 (en) 2007-11-15 2015-09-15 International Business Machines Corporation Server-processor hybrid system for processing data
US9332074B2 (en) 2007-12-06 2016-05-03 International Business Machines Corporation Memory to memory communication and storage for hybrid systems
US8229251B2 (en) * 2008-02-08 2012-07-24 International Business Machines Corporation Pre-processing optimization of an image processing system
US8379963B2 (en) 2008-03-28 2013-02-19 International Business Machines Corporation Visual inspection system
FR2958064B1 (en) * 2010-03-26 2012-04-20 Commissariat Energie Atomique ARCHITECTURE FOR PROCESSING A DATA STREAM ENABLING THE EXTENSION OF A NEIGHBORHOOD MASK
JP2011223303A (en) * 2010-04-09 2011-11-04 Sony Corp Image encoding device and image encoding method, and image decoding device and image decoding method
US20120001925A1 (en) * 2010-06-30 2012-01-05 Ati Technologies, Ulc Dynamic Feedback Load Balancing
CN103125119B (en) * 2010-10-04 2016-10-26 Panasonic Intellectual Property Management Co., Ltd. Image processing apparatus, method for encoding images and image processing method
GB2494903B (en) 2011-09-22 2017-12-27 Advanced Risc Mach Ltd Graphics processing systems
WO2013101469A1 (en) * 2011-12-29 2013-07-04 Intel Corporation Audio pipeline for audio distribution on system on a chip platforms
US8646690B2 (en) 2012-02-06 2014-02-11 Cognex Corporation System and method for expansion of field of view in a vision system
US11966810B2 (en) 2012-02-06 2024-04-23 Cognex Corporation System and method for expansion of field of view in a vision system
US9027838B2 (en) 2012-02-06 2015-05-12 Cognex Corporation System and method for expansion of field of view in a vision system
US9892298B2 (en) 2012-02-06 2018-02-13 Cognex Corporation System and method for expansion of field of view in a vision system
CN102780616B (en) * 2012-07-19 2015-06-17 Beijing Star-Net Ruijie Network Technology Co., Ltd. Network equipment and method and device for message processing based on multi-core processor
US8794521B2 (en) 2012-10-04 2014-08-05 Cognex Corporation Systems and methods for operating symbology reader with multi-core processor
US10154177B2 (en) 2012-10-04 2018-12-11 Cognex Corporation Symbology reader with multi-core processor
US9270999B2 (en) 2013-09-25 2016-02-23 Apple Inc. Delayed chroma processing in block processing pipelines
US9215472B2 (en) 2013-09-27 2015-12-15 Apple Inc. Parallel hardware and software block processing pipelines
CN104361553B (en) * 2014-11-02 2017-04-12 Institute of Optics and Electronics, Chinese Academy of Sciences Synchronization method for improving processing efficiency of graphics processor
KR102156439B1 (en) * 2018-11-06 2020-09-16 Korea Electronics Technology Institute Cloud-edge system and method for processing data thereof
WO2021012257A1 (en) * 2019-07-25 2021-01-28 Qualcomm Incorporated Methods and apparatus to facilitate a unified framework of post-processing for gaming
JP7322576B2 (en) * 2019-07-31 2023-08-08 株式会社リコー Information processing device, imaging device, and moving object
US11663051B2 (en) * 2020-01-07 2023-05-30 International Business Machines Corporation Workflow pipeline optimization based on machine learning operation for determining wait time between successive executions of the workflow
US12112196B2 (en) * 2021-04-26 2024-10-08 GM Global Technology Operations LLC Real-time scheduling for a heterogeneous multi-core system
CN118501164B (en) * 2024-07-15 2024-11-08 Shandong University OCT-based large-view-field assembly line detection control method and system

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61141264A (en) * 1984-12-14 1986-06-28 Canon Inc Image processing device
KR930002316B1 (en) * 1989-05-10 1993-03-29 Mitsubishi Denki Kabushiki Kaisha Multiprocessor type time varying image encoding system and image processor
JPH07210545A (en) * 1994-01-24 1995-08-11 Matsushita Electric Ind Co Ltd Parallel processing processors
JP3593439B2 (en) * 1997-06-09 2004-11-24 Hitachi, Ltd. Image processing device
JPH11112753A (en) * 1997-10-02 1999-04-23 Ricoh Co Ltd Picture processor
JP4348760B2 (en) 1998-12-16 2009-10-21 Konica Minolta Business Technologies, Inc. Data processing system
JP2000251065A (en) 1999-03-02 2000-09-14 Fuji Xerox Co Ltd Image processor
US6753878B1 (en) * 1999-03-08 2004-06-22 Hewlett-Packard Development Company, L.P. Parallel pipelined merge engines
GB2363017B8 (en) * 2000-03-30 2005-03-07 Autodesk Canada Inc Processing image data
JP2002049603A (en) 2000-08-03 2002-02-15 Toshiba Corp Method and apparatus for dynamic load distribution
US6823087B1 (en) * 2001-05-15 2004-11-23 Advanced Micro Devices, Inc. Parallel edge filters in video codec
JP4202033B2 (en) 2001-09-05 2008-12-24 Mitsubishi Electric Corporation Parallel image processing apparatus and parallel image processing method
JP2003216943A (en) * 2002-01-22 2003-07-31 Toshiba Corp Image processing device, compiler used therein and image processing method
US20080094402A1 (en) * 2003-11-19 2008-04-24 Reuven Bakalash Computing system having a parallel graphics rendering system employing multiple graphics processing pipelines (GPPLS) dynamically controlled according to time, image and object division modes of parallel operation during the run-time of graphics-based applications running on the computing system
JP4760541B2 (en) 2006-05-31 2011-08-31 Fuji Xerox Co., Ltd. Buffer control module, image processing apparatus, and program
US7834873B2 (en) * 2006-08-25 2010-11-16 Intel Corporation Display processing line buffers incorporating pipeline overlap
US8467626B2 (en) * 2006-09-29 2013-06-18 Thomson Licensing Automatic parameter estimation for adaptive pixel-based filtering
KR100803220B1 (en) * 2006-11-20 2008-02-14 Samsung Electronics Co., Ltd. Method and apparatus for rendering of 3d graphics of multi-pipeline
US8107758B2 (en) * 2008-04-16 2012-01-31 Microsoft Corporation Block based image processing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5319794A (en) * 1989-11-30 1994-06-07 Sony Corporation Device for high-speed processing of information frames
CN1183154A (en) * 1996-02-06 1998-05-27 Sony Computer Entertainment Inc. Apparatus and method for drawing
US5886712A (en) * 1997-05-23 1999-03-23 Sun Microsystems, Inc. Data channel extraction in a microprocessor

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JP 2004-206387 A 2004.07.22
JP 2008-98911 A 2008.04.24

Also Published As

Publication number Publication date
EP2161685B1 (en) 2018-11-28
JP2010067276A (en) 2010-03-25
CA2678240C (en) 2015-11-24
CA2678240A1 (en) 2010-03-09
CN101673391A (en) 2010-03-17
US20100060651A1 (en) 2010-03-11
AU2009213013A1 (en) 2010-03-25
AU2009213013B2 (en) 2015-09-17
US8754895B2 (en) 2014-06-17
EP2161685A3 (en) 2016-11-23
EP2161685A2 (en) 2010-03-10

Similar Documents

Publication Publication Date Title
CN101673391B (en) Pipelined image processing engine
KR102258414B1 (en) Processing apparatus and processing method
US11379555B2 (en) Dilated convolution using systolic array
DE102020124932A1 (en) Apparatus and method for real-time graphics processing using local and cloud-based graphics processing resources
US20110057937A1 (en) Method and system for blocking data on a gpu
DE102021118444A1 (en) Device and method for compressing raytracing acceleration structure design data
EP2698768B1 (en) Method and apparatus for graphic processing using parallel pipeline
US11093225B2 (en) High parallelism computing system and instruction scheduling method thereof
US9256466B2 (en) Data processing systems
DE102020132377A1 (en) Apparatus and method for throttling a ray tracing pipeline
DE102019103310A1 (en) ESTIMATE FOR AN OPTIMAL OPERATING POINT FOR HARDWARE WORKING WITH A RESTRICTION ON THE SHARED PERFORMANCE / HEAT
KR102346119B1 (en) Identification of primitives in the input index stream
Martín et al. Algorithmic strategies for optimizing the parallel reduction primitive in CUDA
US20120001905A1 (en) Seamless Integration of Multi-GPU Rendering
CN113568599A (en) Method, electronic device and computer program product for processing a computing job
US8675002B1 (en) Efficient approach for a unified command buffer
CN110308982A (en) A kind of shared drive multiplexing method and device
CN111860807B (en) Fractal calculation device, fractal calculation method, integrated circuit and board card
DE102019129899A1 (en) SCALABLE LIGHTWEIGHT PROTOCOLS FOR PACKAGE ORIENTATION AT LINE SPEED
KR20190142732A (en) Data processing systems
CN115437637A (en) Compiling method and related device
DE102023105554A1 (en) Fast data synchronization in processors and memory
DE102022128966A1 (en) PARALLEL PROCESSING FOR COMBINATORY OPTIMIZATION
DE102022111609A1 (en) ACCELERATED PROCESSING VIA BODY-BASED RENDERING ENGINE
US11609785B2 (en) Matrix data broadcast architecture

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120829