CN106921862A - Multi-core decoder system and video encoding/decoding method - Google Patents


Info

Publication number
CN106921862A
CN106921862A
Authority
CN
China
Prior art keywords
core
decoder
reference data
shared
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201610093836.XA
Other languages
Chinese (zh)
Inventor
赵屏
郑佳韵
王智鸣
张永昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MediaTek Inc
Original Assignee
MediaTek Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US14/979,578 external-priority patent/US20160191935A1/en
Application filed by MediaTek Inc filed Critical MediaTek Inc
Publication of CN106921862A publication Critical patent/CN106921862A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/44Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/423Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation characterised by memory arrangements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/436Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation using parallelised computational arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A multi-core decoder system and a video decoding method are disclosed. The system includes multiple decoder cores; a shared reference data buffer, coupled to the decoder cores and to an external memory, where the shared reference data buffer stores reference data received from the external memory and provides the reference data to the decoder cores for decoding video data; and one or more decoding progress synchronizers, coupled to one or more of the decoder cores, to detect decoding progress information associated with the decoder cores or status information of the shared reference data buffer, and to control the decoding progress of one or more of the decoder cores. Through the above scheme, the present invention can effectively improve bandwidth efficiency.

Description

Multi-core decoder system and video encoding/decoding method
【Technical Field】
The present invention relates to inter-frame-level parallel video decoding systems, and in particular to the reuse of reference data across decoder cores to reduce bandwidth consumption.
【Background Art】
In recent years, compressed video has been widely used in various applications, for example, video broadcasting, video streaming, and video storage. The compression techniques adopted by newer video standards have become more sophisticated and demand more processing power. Meanwhile, the resolution of source video has grown to match high-definition display devices and to meet higher quality requirements. For example, high-definition (HD) compressed video is now widely used for TV broadcasting and video streaming, and even ultra-high-definition (UHD) video has become a reality, with UHD-based products visible in the consumer market. The processing power required for UHD content grows rapidly with spatial resolution, and processing higher-resolution video is a challenging problem for both hardware-based and software-based implementations. For example, a UHD frame may have a resolution of 3840x2160, corresponding to 8,294,400 pixels per picture frame. If video is captured at 60 frames per second, UHD generates almost 500 million pixels per second. For a color video source in YUV444 format, about 1.5 billion samples must be processed per second. The amount of data associated with UHD video is enormous and poses a great challenge to real-time video decoders.
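The throughput figures quoted above are easy to sanity-check. The following sketch (our own illustration, not part of the patent) reproduces the arithmetic for 3840x2160 video at 60 frames per second in YUV444 format:

```python
# Hypothetical back-of-the-envelope check of the UHD figures quoted above.
WIDTH, HEIGHT, FPS = 3840, 2160, 60
SAMPLES_PER_PIXEL_YUV444 = 3  # one Y, one U, one V sample per pixel

pixels_per_frame = WIDTH * HEIGHT                      # 8,294,400
pixels_per_second = pixels_per_frame * FPS             # ~497.7 million
samples_per_second = pixels_per_second * SAMPLES_PER_PIXEL_YUV444

print(f"{pixels_per_frame:,} pixels/frame")
print(f"{pixels_per_second:,} pixels/s")
print(f"{samples_per_second / 1e9:.2f} billion samples/s")
```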
To meet the computational requirements of HD or UHD resolution and/or more complex coding standards, high-speed processors and/or multiple processors have been used to perform real-time video decoding. For example, in personal computer (PC) and consumer-electronics environments, a multi-core central processing unit (CPU) can be used to decode the video bitstream. The multi-core system may take the form of an embedded system for cost savings and convenience. In an existing multi-core decoder system, a control unit typically configures multiple cores (i.e., multiple video decoder kernels) to perform frame-level parallel video decoding. To coordinate the memory accesses made by the multiple video decoder kernels, a memory access control unit can be used between the cores and the memory shared among them.
Fig. 1A illustrates the block diagram of a generic dual-core video decoder system for frame-level parallel video decoding. The dual-core video decoder system 100A includes a control unit 110A, decoder core 0 (120A-0), decoder core 1 (120A-1), and a memory access control unit 130A. The control unit 110A can assign decoder core 0 (120A-0) to decode one frame and decoder core 1 (120A-1) to decode another frame in parallel. Because each decoder core needs to access reference data stored in a storage device (e.g., a memory), the memory access control unit 130A is connected to the memory and manages the memory accessed by the two decoder cores. The decoder cores can decode bitstreams corresponding to one or more selected video coding formats, for example, MPEG-2, H.264/AVC, or the newer High Efficiency Video Coding (HEVC) standard.
Fig. 1B illustrates the block diagram of a generic quad-core video decoder system for frame-level parallel video decoding. The quad-core video decoder system 100B includes a control unit 110B, decoder core 0 (120B-0) through decoder core 3 (120B-3), and a memory access control unit 130B. The control unit 110B assigns decoder core 0 (120B-0) through decoder core 3 (120B-3) to decode different frames in parallel. The memory access control unit 130B is connected to the memory and manages the memory accessed by the four decoder cores.
Although any compressed video format can be used for HD or UHD content, newer compression standards such as H.264/AVC or HEVC are more likely to be used because of their higher compression efficiency. Fig. 2A illustrates an exemplary system block diagram of a video decoder 200A supporting the HEVC video standard. HEVC is a new international video coding standard developed by the Joint Collaborative Team on Video Coding (JCT-VC). HEVC is based on a hybrid block-based motion-compensated, DCT-like transform coding scheme. The basic unit of compression, called a coding unit (CU), is a 2Nx2N square. Coding starts from the largest CU (LCU), which is also called a coding tree unit (CTU) in HEVC, and each CU can be recursively split into four smaller CUs until a predefined minimum size is reached. Once the CU hierarchical tree splitting is completed, each CU is further divided into one or more prediction units (PUs) according to the prediction type and PU partition. The residual of each CU is divided into a tree of transform units (TUs), to which a two-dimensional (2D) transform is applied.
In Fig. 2A, the input video bitstream is first processed by a variable-length decoder (VLD) 210A to perform variable-length decoding and syntax parsing. The parsed syntax may correspond to the inter/intra residual signal (the upper output path of VLD 210A) or to motion information (the lower output path of VLD 210A). The residual signal is usually transform-coded; therefore, the coded residual signal is processed by an inverse scan (IS) block 212, an inverse quantization (IQ) block 214, and an inverse transform (IT) block 216. The output of the inverse transform (IT) block 216 corresponds to the reconstructed residual signal. For an intra-coded block, the reconstructed residual signal is added, using adder block 218, to the intra prediction from intra prediction block 224; for an inter-coded block, the reconstructed residual signal is added, using adder block 218, to the inter prediction from motion compensation block 222. The inter/intra selection block 226 selects intra prediction or inter prediction depending on whether the video signal to be reconstructed is inter- or intra-coded. For motion compensation, the system accesses one or more reference blocks stored in the decoded picture buffer 230 according to the motion vector information determined by the motion vector (MV) calculation block 220. To improve visual quality, in-loop filters 228 process the reconstructed video before it is stored in the decoded picture buffer 230. In HEVC, the in-loop filters include a deblocking filter (DF) and sample adaptive offset (SAO); other coding standards may use different filters. In Fig. 2A, all functional blocks except the decoded picture buffer 230 can be implemented in the decoder core. In a typical implementation, an external memory, for example DRAM, is used for the decoded picture buffer 230.
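The residual path described above (inverse scan, inverse quantization, inverse transform, then adding the prediction) can be traced with a toy example. The sketch below is our own illustration, not the patent's implementation: it uses 2x2 blocks and an identity stand-in for the inverse transform, so it is schematic only; the comments name the corresponding blocks of Fig. 2A:

```python
# A schematic (not bit-exact) trace of the residual path of Fig. 2A:
# coded residual -> inverse scan (212) -> inverse quantization (214) ->
# inverse transform (216) -> add prediction (adder 218), using toy 2x2 blocks.
import numpy as np

def inverse_scan(coeff_list):
    # place coefficients back into a 2x2 block in zig-zag order
    order = [(0, 0), (0, 1), (1, 0), (1, 1)]
    block = np.zeros((2, 2))
    for c, (r, col) in zip(coeff_list, order):
        block[r, col] = c
    return block

def inverse_quantize(block, qstep):
    return block * qstep

def inverse_transform(block):
    return block  # identity stand-in for the 2D inverse transform

coded = [4, 1, 0, 0]                     # parsed residual coefficients
prediction = np.full((2, 2), 128.0)      # from intra or motion compensation
residual = inverse_transform(inverse_quantize(inverse_scan(coded), qstep=2))
reconstructed = prediction + residual    # adder block 218 of Fig. 2A
print(reconstructed)
```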
The variable-length decoder (VLD), due to its characteristics, may be implemented separately outside the video decoder core. In that case, a memory can be used to buffer the VLD output. Fig. 2B illustrates an example of a video decoder kernel 200B that does not include the VLD; memory 240 is used to buffer the output of VLD 210B.
For frame-level parallel decoding, due to data dependences, the mapping between the frames to be decoded and the multiple decoder kernels needs to be done carefully to maximize performance. Fig. 3 illustrates an example of six pictures in decoding order (i.e., I, P, P, B, B, and B). In display order, the six pictures correspond to I(1), P(2), B(3), B(4), B(5), and P(6), where the number in parentheses denotes the display order of the picture. Picture I(1) is intra-coded by itself, without any data dependence on any other picture. Picture P(2) is predicted using the reconstructed I(1) picture as a reference picture. When I(1) and P(2) are assigned to decoder kernel 0 and decoder kernel 1, respectively, for parallel decoding (310), a data dependence problem exists. Similarly, when P(6) and B(3) are assigned to decoder kernel 0 and decoder kernel 1 for second-stage (320) parallel decoding, the data dependence problem occurs again. The last pictures to be decoded, B(4) and B(5), are assigned to decoder kernel 0 and decoder kernel 1 for third-stage (330) parallel decoding. Because P(2) and P(6) are available by then, decoding B(4) and B(5) in parallel has no data dependence problem.
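One conservative way to avoid the dependence problem entirely is to pair only pictures whose references are already fully decoded. The sketch below is our own illustration (the patent instead tolerates some dependence and shares reference data between cores); it schedules the Fig. 3 pictures into two-core stages under that stricter rule:

```python
# A minimal sketch of staging frames for two-core parallel decoding so that
# every picture's references were decoded in an earlier stage.
def schedule_stages(pictures, num_cores=2):
    """pictures: list of (name, [reference names]) in decoding order."""
    decoded, stages, pending = set(), [], list(pictures)
    while pending:
        stage = []
        for name, refs in pending:
            if all(r in decoded for r in refs) and len(stage) < num_cores:
                stage.append(name)
        if not stage:  # nothing schedulable: a dependence cycle
            raise ValueError("unschedulable dependence")
        decoded.update(stage)
        pending = [(n, r) for n, r in pending if n not in stage]
        stages.append(stage)
    return stages

# The six-picture example of Fig. 3, display order in parentheses:
gop = [("I(1)", []), ("P(2)", ["I(1)"]), ("P(6)", ["P(2)"]),
       ("B(3)", ["P(2)", "P(6)"]), ("B(4)", ["P(2)", "P(6)"]),
       ("B(5)", ["P(2)", "P(6)"])]
print(schedule_stages(gop))
# -> [['I(1)'], ['P(2)'], ['P(6)'], ['B(3)', 'B(4)'], ['B(5)']]
```

Note that this strict schedule needs five stages instead of the three in Fig. 3; that cost is exactly why decoding dependent pictures in parallel, with shared reference data, is attractive.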
Because high computational power is required to support real-time decoding of HD or UHD video, multi-core decoders have been used to improve decoding speed. Owing to common reference data, one potential advantage of inter-frame parallel decoding is bandwidth efficiency. However, data dependences degrade this bandwidth efficiency. Therefore, it is desirable to develop methods and systems that can resolve the data dependence problem and improve bandwidth efficiency.
【Summary of the Invention】
To solve the above technical problems, the present invention provides a video processing apparatus that stores partially reconstructed pixel data in a storage device for prediction, and an associated video processing method.
According to a first aspect of the invention, a multi-core decoder system is disclosed, comprising: multiple decoder cores; a shared reference data buffer, coupled to the decoder cores and to an external memory, where the shared reference data buffer stores reference data received from the external memory and provides the reference data to the decoder cores for decoding video data; and one or more decoding progress synchronizers, coupled to one or more of the decoder cores, to detect decoding progress information associated with the decoder cores or status information of the shared reference data buffer, and to control the decoding progress of one or more of the decoder cores.
According to a second aspect of the invention, a multi-core decoder system is disclosed, comprising: multiple decoder cores; a shared reference data buffer, coupled to the decoder cores and to an external memory, where the shared reference data buffer stores reference data received from the external memory and provides the reference data to the decoder cores for decoding video data; and a delay first-in-first-out (FIFO) block, coupled to the decoder cores, the shared reference data buffer, and the external memory, where the delay FIFO block stores current reference data used by one decoder core for later use by another decoder core.
According to a third aspect of the invention, a multi-core decoder system is disclosed, comprising: multiple decoder cores; and a shared output buffer, coupled to the decoder cores and to an external memory, where the shared output buffer stores reconstructed data from a first decoder core and, before the reconstructed data is stored to the external memory, provides the reconstructed data to a second decoder core as reference data for decoding video data.
According to a fourth aspect of the invention, a method of video decoding using multiple decoder cores in a decoder system is disclosed, comprising: using the decoder cores to decode two or more frames from a video bitstream using inter-frame-level parallel decoding; providing reference data stored in a shared reference data buffer to the decoder cores for decoding the two or more frames; and controlling the decoding progress of one or more of the decoder cores, according to decoding progress information about the decoder cores or status information of the shared reference data buffer, to reduce the memory access bandwidth associated with the shared reference data buffer.
Through the above schemes, the present invention can effectively improve bandwidth efficiency.
【Brief Description of the Drawings】
Fig. 1A illustrates an exemplary decoder system with two decoder cores for parallel decoding.
Fig. 1B illustrates an exemplary decoder system with four decoder cores for parallel decoding.
Fig. 2A illustrates an exemplary decoder system block diagram based on the HEVC standard.
Fig. 2B illustrates another exemplary decoder system block diagram based on the HEVC standard, where the VLD is excluded from the decoder kernel.
Fig. 3 illustrates an example of assigning two frames at a time to two decoder cores.
Fig. 4 illustrates an example of a decoder system architecture using on-chip memory for the reference data buffer.
Fig. 5 illustrates an example of a multi-decoder-core system with a shared reference buffer and a delay FIFO according to an embodiment of the present invention.
Fig. 6A illustrates the decoding progress of a leading decoder core A.
Fig. 6B illustrates the decoding progress of a lagging decoder core B.
Fig. 7 illustrates an example of incorporating a decoding progress synchronizer into a parallel decoder system with a shared reference buffer.
Fig. 8 illustrates an example of a parallel decoder system with two decoder cores, a decoding progress synchronizer, and a delay FIFO according to an embodiment of the present invention.
Fig. 9A illustrates an example of incorporating a decoding progress synchronizer into decoder core A in a master-slave arrangement.
Fig. 9B illustrates another example of incorporating decoding progress synchronizers into decoder core A and decoder core B in a peer-to-peer arrangement.
Fig. 10 illustrates an example of a system architecture according to an embodiment of the present invention that enables a decoder core to access reconstructed data from another decoder core, where an on-chip memory is used between the two decoder cores.
Fig. 11 illustrates another aspect of the on-chip memory arrangement, with a shared output buffer storing reconstructed data from one decoder core for use by another decoder core.
Fig. 12 illustrates another example of the on-chip memory arrangement with a shared output buffer storing reconstructed data from one decoder core for use by another decoder core, where the reconstructed data is placed into a fixed-size window.
Fig. 13 illustrates an example of reducing bandwidth consumption using a parallel decoder system with a shared output buffer and a window detector.
Fig. 14 illustrates another example of a multi-core parallel decoder system using a shared output buffer with a window detector, where a multiplexer and a demultiplexer are used to allow multiple decoder cores to share the output buffer and the window detector.
【Detailed Description】
The following description presents the best mode contemplated for carrying out the invention. It is made to illustrate the general principles of the invention and should not be construed as limiting. The scope of the invention is determined by reference to the appended claims.
The present invention discloses multi-core decoder systems that can reduce memory bandwidth consumption. According to one aspect of the invention, video frames are selected and assigned as candidates for inter-frame-level parallel decoding when the reference regions of two frames overlap, or when a frame assigned to one kernel references another frame being decoded on another kernel, so as to reduce memory bandwidth consumption. In such cases, there is an opportunity to share the reference frame data accessed between kernels and reduce external memory bandwidth consumption. According to another aspect of the invention, a Decoding Progress Synchronization (DPS) method and architecture are disclosed for multi-decoder-core systems, which reduce memory bandwidth consumption by maximizing data reuse. Furthermore, a reconstructed-data reuse method and architecture are disclosed to reduce bandwidth consumption.
For motion-compensated coded video, the decoder needs to access reference data to generate inter prediction data for motion-compensated reconstruction. Because the previously reconstructed pictures are stored in the decoded picture buffer, which may be implemented using external memory, access to the decoded picture buffer is relatively slow. Moreover, it consumes system bandwidth, which is an important system resource. Therefore, a reference data buffer based on on-chip memory is typically used to improve bandwidth efficiency. Fig. 4 illustrates an example of a decoder system architecture 400 using on-chip memory for the reference data buffer. The decoder system in Fig. 4 includes two decoder cores 410 and 420 and a shared reference buffer 430. Reference data is first read into the shared reference buffer 430, which can then be reused repeatedly by the two decoder cores to reduce bandwidth consumption. For example, the decoder cores may be assigned to decode two consecutive B pictures between two P pictures for inter-frame-level parallel decoding. The decoding processes of the two B pictures are likely to need access to the same reference data associated with the two P pictures. If the required reference data is in the shared reference buffer (i.e., a hit), it can be reused without accessing the external memory. If the required reference data is not in the shared reference buffer (i.e., a miss), it has to be read from the external memory. The shared reference buffer is much smaller than the external memory; however, it is usually implemented with an architecture different from that of the external memory for high-speed data access. For example, the shared reference buffer may be implemented with a type-1 cache (L1 cache), a type-2 cache (L2 cache), or a cache-like architecture, while the external memory is typically based on DRAM (dynamic random access memory) technology.
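The hit/miss behavior of such a shared reference buffer can be modeled in a few lines. The following toy model is our own (with an assumed LRU-style eviction policy, not the patent's implementation); it counts hits and misses as two cores fetch reference blocks:

```python
# A toy model of the shared reference buffer: a small on-chip store checked
# before every external-memory fetch, counting hits and misses.
from collections import OrderedDict

class SharedReferenceBuffer:
    def __init__(self, capacity_blocks):
        self.buf = OrderedDict()          # block address -> reference data
        self.capacity = capacity_blocks
        self.hits = self.misses = 0

    def fetch(self, addr, external_memory):
        if addr in self.buf:              # hit: reuse on-chip copy
            self.hits += 1
            self.buf.move_to_end(addr)    # LRU-style refresh (an assumption)
            return self.buf[addr]
        self.misses += 1                  # miss: read from external memory
        data = external_memory[addr]
        self.buf[addr] = data
        if len(self.buf) > self.capacity:
            self.buf.popitem(last=False)  # evict the oldest entry
        return data

ext = {a: f"ref{a}" for a in range(16)}
srb = SharedReferenceBuffer(capacity_blocks=4)
for addr_a, addr_b in [(0, 0), (1, 1), (2, 2)]:  # two cores in lockstep
    srb.fetch(addr_a, ext)
    srb.fetch(addr_b, ext)
print(srb.hits, srb.misses)  # 3 3 -- the second core always hits
```

When the two cores access the same addresses close together in time, every second fetch is served on-chip, which is exactly the bandwidth saving the shared buffer is meant to deliver.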
Although the shared reference buffer helps improve bandwidth efficiency, the full benefit cannot be realized in practice for a variety of reasons. For example, the decoding processes on the two decoder cores may progress through the two frames being decoded in parallel at different rates. When one decoder core runs well ahead of the other, the two decoder cores may access significantly different reference data regions. Therefore, the lagging decoder core may need to reload data from the external memory (i.e., a miss). The shared reference memory is usually implemented with high-speed memory to improve performance; because of the high cost of high-speed memory, its size is limited. To further improve the bandwidth efficiency of a system with a shared reference buffer, embodiments of the present invention introduce a delay first-in-first-out (FIFO) block coupled to the external memory, the shared reference buffer, and the decoder cores. The delay FIFO may be implemented with a data structure/architecture different from the shared reference buffer to achieve higher capacity at lower cost, for example, dedicated on-chip SRAM (static random access memory) or L1/L2 cache.
Fig. 5 illustrates an example of a multi-decoder-core system 500 with a shared reference buffer and a delay FIFO block according to an embodiment of the present invention.
In the example of Fig. 5, two decoder cores 510 and 520 are used. In addition to the shared reference buffer 530, a delay FIFO 540 is also used. The external memory is outside the multi-decoder-core system 500. For the leading decoder core, whose decoding progress is ahead of the lagging decoder core, reference frame data is read from the external memory directly into the core without being stored into the shared reference buffer. However, the reference frame data read from the external memory is also stored in the delay FIFO, together with the address or position information of the reference frame data. In this case, multiplexer 512 or 522 is used to select reference data from the external memory directly into the decoder core. The system monitors the position information of the oldest entry of the delay FIFO. When the decoding progress of the lagging decoder core catches up with the position of the oldest entry, the oldest entry is dequeued and its reference data is transferred to the shared reference buffer, so that the lagging decoder core can access the reference data from the shared reference buffer without accessing the external memory. In this case, multiplexer 532 is set to select the input from the delay FIFO 540. The system in Fig. 5 can also be used to support an existing shared reference buffer without the delay FIFO; in that case, multiplexers 512 and 522 are set to select inputs from the shared reference buffer, and multiplexer 532 is set to select the input from the external memory 550.
The use of the delay FIFO should help improve bandwidth efficiency. For example, suppose decoder core A is the leading core, processing a macroblock (MB) at block position (x, y) = (10, 2), and decoder core B is the lagging core, processing a macroblock at (x, y) = (3, 2). In this case, the reference data of decoder core A is placed in the delay FIFO, because decoder core B is processing a region far away from block position (10, 2), which decoder core A is processing, and is unlikely to need the same reference data as decoder core A. However, when decoder core B advances to or close to block position (10, 2), the reference data in the delay FIFO can be put into the shared reference buffer. In this case, the probability that decoder core B can use the data in the shared reference buffer is greatly increased. Therefore, bandwidth efficiency is improved through the use of the delay FIFO.
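The delay FIFO mechanism can be sketched as follows. In this toy model (assumptions ours, not the patent's hardware), the leading core's external-memory reads are enqueued with their block positions, and entries are drained into the shared reference buffer only once the lagging core's progress reaches them:

```python
# A simplified model of the delay FIFO of Fig. 5.
from collections import deque

class DelayFIFO:
    def __init__(self):
        self.q = deque()                 # entries of (position, data)

    def push(self, position, data):
        # the leading core's external-memory read is also captured here
        self.q.append((position, data))

    def drain_to(self, shared_buffer, lagging_position):
        """Move entries the lagging core has caught up with."""
        while self.q and self.q[0][0] <= lagging_position:
            pos, data = self.q.popleft()
            shared_buffer[pos] = data    # now servable without external memory

fifo, shared = DelayFIFO(), {}
for pos in [8, 9, 10]:                   # core A (leading) reads these blocks
    fifo.push(pos, f"ref{pos}")
fifo.drain_to(shared, lagging_position=3)   # core B still at block 3: nothing
assert shared == {}
fifo.drain_to(shared, lagging_position=9)   # core B caught up to block 9
print(sorted(shared))                       # [8, 9]; block 10 stays queued
```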
According to another aspect of the present invention, the system uses a Decoding Progress Synchronization (DPS) method to improve bandwidth efficiency. As mentioned earlier, the shared reference buffer is relatively small. If the decoder cores are processing image regions far apart, their decoding processes are unlikely to share the same reference data. Therefore, another embodiment of the present invention synchronizes the decoding progress among multiple decoder cores. For example, suppose two decoder cores (core A and core B) are used for inter-frame-level parallel decoding of two frames, and the shared reference buffer can store the reference data of five blocks. If decoder core A is processing block X shown in Fig. 6A and decoder core B is processing block Y shown in Fig. 6B, decoder core B is unlikely to reuse the reference data in the shared reference buffer, because only the reference data of blocks (X-1) through (X-5) will be stored in the shared reference buffer, and blocks X and Y are too far apart. On the other hand, if decoder core A is processing block X shown in Fig. 6A and decoder core B is processing block Z shown in Fig. 6B, decoder core B is likely to reuse the reference data in the shared reference buffer, because blocks X and Z are closer. Therefore, the system checks the progress of the two cores and ensures that the difference between the two currently processed blocks is within a limit. For example, if decoder core A is processing the block with block index block_index_A and decoder core B is processing the block with block index block_index_B, the system will enforce |block_index_A - block_index_B| < Th, where Th is a threshold between 1 and the total number of blocks in a frame.
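The progress check described above can be sketched as a small decision function. This is our own illustration; the names `block_index_a`, `block_index_b`, and the run/stall actions are assumptions, not the patent's signal names:

```python
# A sketch of the DPS rule: enforce |block_index_A - block_index_B| < Th
# by deciding which core, if either, to stall.
def dps_action(block_index_a, block_index_b, th):
    """Return per-core actions ('run' or 'stall') keeping progress within Th."""
    diff = block_index_a - block_index_b
    if abs(diff) < th:
        return {"core_A": "run", "core_B": "run"}
    # Stall the leading core; the lagging core could equally be accelerated,
    # as the text notes.
    if diff > 0:
        return {"core_A": "stall", "core_B": "run"}
    return {"core_A": "run", "core_B": "stall"}

print(dps_action(10, 3, th=5))   # core A far ahead -> stall core A
print(dps_action(10, 8, th=5))   # within threshold -> both run
```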
To keep the difference between the two currently processed blocks within the limit, the system controls the progress of each decoder core. If the decoding progress of the decoder kernel running on one core is far ahead of the other core, the efficiency of the shared reference buffer will be reduced; in that case, more external memory accesses are likely. Therefore, the system slows down or stops the leading core, or speeds up the lagging core, to keep the difference within the limit and improve the shared reference buffer efficiency (e.g., a higher hit rate). Fig. 7 illustrates an example of adding a decoding progress synchronizer to a parallel decoder system with a shared reference buffer. As shown in Fig. 7, the decoding progress synchronizer 710 is coupled to decoder core A 410 and decoder core B 420 to monitor the decoding progress and control it accordingly. As shown in Fig. 7, the decoding progress synchronizer 710 can also detect information about the shared reference buffer 430 and apply control to the decoder cores accordingly.
To monitor the decoding progress of the decoder kernels, the system can use the decoding progress synchronizer 710 to detect their progress according to the (x, y) position or index of the currently decoded MB, coding unit (CU), largest CU (LCU), or superblock (SB). Partial position or index information can also be used; for example, the system may simply use the x-position or y-position to determine progress. The system can also detect progress according to the addresses of memory accesses. According to the detected progress, the system can use the decoding progress synchronizer 710 to stall or speed up/slow down each decoder kernel accordingly. The control can be realized by controlling the kernel/sub-module state machines (e.g., stall), the clocks of the kernels (e.g., stall, speed up/slow down), the memory access priority of each kernel, other factors that influence decoding progress or decoding speed, or any combination thereof, or by stalling memory accesses.
For example, the system can use the decoding progress synchronizer 710 to detect the decoding progress of each kernel and calculate the difference in decoding progress. The decoding progress may correspond to the index of the currently processed picture unit (index_A or index_B), where a picture unit may correspond to an MB or an LCU. If |index_A - index_B| >= Th, where Th denotes a threshold, the difference in decoding progress needs to be reduced. To reduce the difference, the system can slow down or stall the decoding progress of the leading core with the larger index until the difference in decoding progress is within the threshold. Alternatively, the system can speed up the decoding progress of the lagging decoder core with the smaller index until the difference in decoding progress is within the threshold.
In another example, the system can check the status of the shared reference buffer, as shown in Fig. 7. If the status indicates that the reference data accessed by any kernel has been evicted, or that the data reuse rate has dropped or fallen below a threshold, the system will control the decoding progress of each kernel.
In another embodiment, the decoding progress synchronizer can be used together with the delay FIFO disclosed above. Fig. 8 illustrates an example of a parallel decoder system with two decoder cores 810 and 820, a decoding progress synchronizer 830, and a delay FIFO 840. The system can check whether the delay FIFO is full or nearly full and then control the decoding progress of each core. When the delay FIFO is full or nearly full, it indicates that the decoding progress of the decoder core running on one core is far ahead of the other core, since the delay FIFO can check positions to decide whether to consume a data entry and output it to the shared reference buffer. In other words, the position detection can be performed in the decoding progress synchronizer (DPS) itself, in the delay FIFO, or in another module. In this case, control should be applied to resolve the detected difference in decoding progress between the two decoder cores. The decoding progress synchronizer can be a standalone module separate from the decoder cores. Alternatively, the decoding progress synchronizer can be incorporated into the delay FIFO as an integrated delay FIFO/decoding progress synchronizer.
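A minimal sketch of using delay-FIFO occupancy as the progress signal between a leading and a lagging core follows. The class name and the "almost full" ratio of 0.9 are assumptions for illustration, not values from the patent.

```python
from collections import deque

class DelayFifoMonitor:
    """Use delay-FIFO occupancy as a proxy for the progress gap between
    the leading and the lagging decoder core (sketch)."""

    def __init__(self, capacity, almost_full_ratio=0.9):
        self.fifo = deque()
        self.capacity = capacity
        # "almost full" threshold is an assumed 90% of capacity
        self.almost_full = int(capacity * almost_full_ratio)

    def push(self, entry):
        # the leading core pushes reference data it has just used
        if len(self.fifo) >= self.capacity:
            raise OverflowError("delay FIFO full: stall the leading core")
        self.fifo.append(entry)

    def should_throttle_leader(self):
        # full or almost full => the leader is far ahead of the laggard
        return len(self.fifo) >= self.almost_full

    def consume(self):
        # the lagging core consumes an entry toward the shared reference buffer
        return self.fifo.popleft()
```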
In another embodiment of the present invention, the decoding progress synchronizer is incorporated into one or more decoder cores as an integral part of the decoder core. For example, Fig. 9A illustrates an example in which a decoding progress synchronizer 910 is incorporated into decoder core A 920 in a master-slave arrangement. The two decoder cores are coupled to each other so that the decoding progress synchronizer 910 in decoder core A 920 can detect the decoding progress of decoder core B 930 and provide progress control to decoder core B 930. In this case, decoder core A acts as the master core for decoding progress synchronization. Fig. 9B illustrates an example in which decoding progress synchronizers 940A and 940B are incorporated into decoder core A 950 and decoder core B 960, respectively, in a peer-to-peer arrangement. The two decoder cores are coupled to each other so that the two decoder cores can obtain progress information from each other. Based on the progress information, the decoding progress synchronizers 940A and 940B control the progress of their respective decoder cores.
In another embodiment of the present invention, the system enables motion compensation in one decoder core to access reconstructed data from another decoder core, for inter-frame-level parallel decoding of two frames that have a data dependency. Fig. 10 illustrates an example system architecture that enables one decoder core to access reconstructed data from another decoder core, where an on-chip memory 1030 is used between the two decoder cores 1010 and 1020. The reconstructed data from one decoder core is stored in the on-chip memory for reuse by the other core, so as to reduce bandwidth consumption. In this example, a P-picture is assigned to decoder core 0 and a B-picture is assigned to decoder core 1, where the B-picture uses the P-picture as a reference picture. Therefore, the P-picture, or a part of it, needs to be reconstructed before decoding of the B-picture starts. In a conventional approach, the reconstructed P-picture data is written to the decoded reference picture buffer, and decoder core 1 fetches the reference data from the decoded reference picture buffer. As mentioned above, the decoded reference picture buffer is usually implemented using external memory. Therefore, the on-chip memory 1030 is used to buffer the reconstructed parts of the P-picture that decoder core 1 uses to process the B-picture. Details of reusing the reconstructed data via the on-chip memory are described below.
The on-chip memory 1030 in Fig. 10 can serve as a shared output buffer for decoder core 0 1010, which is responsible for decoding the P-picture. Fig. 11 illustrates another aspect of the on-chip memory arrangement. The reconstructed data from decoder core 1110 that is expected to be referenced by the other decoder core is stored in the shared output buffer 1120 (i.e., the on-chip memory in Fig. 10) rather than directly in the external memory 1130. After the reconstructed data has been reused by the other decoder core, or when the reconstructed data needs to be evicted, the reconstructed data stored in the shared output buffer 1120 is written to the decoded picture buffer in the external memory 1130. The reconstructed data stored in the shared output buffer can be arranged in sizes smaller than a whole frame. When new reconstructed data is received, some previously received data may be evicted. Furthermore, the extent of the data stored in the shared output buffer can be represented by one or more windows, where a window lies between two external memory addresses or between two points (x, y) and (x', y'). For example, block 1140 represents a reconstructed frame, where data position 1142 corresponds to the most recently received data and data position 1144 corresponds to the removed data. The data range between data position 1144 and data position 1142 is stored in the shared output buffer, shown as block 1150. As also shown in Fig. 11, the data range between data position 1144 and data position 1142 contains three windows (i.e., window A, window B, and window C). Data position 1144 and data position 1142 can be specified by external memory addresses (x, y) and (x', y'). When data is received or removed, the extent of each window can be updated. Each window can be stored in the on-chip memory in a contiguous space or in separate spaces.
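The write-back behavior described above (keep recently reconstructed windows on-chip, flush the oldest window to the external decoded picture buffer when room is needed) can be sketched as follows. All names are illustrative, and the dictionary standing in for external memory is purely a simulation aid.

```python
from collections import OrderedDict

class SharedOutputBuffer:
    """Keep recently reconstructed windows on-chip; when capacity is
    reached, flush the oldest window to the (simulated) external
    decoded picture buffer (sketch; all names are illustrative)."""

    def __init__(self, max_windows):
        self.windows = OrderedDict()   # window start address -> data
        self.max_windows = max_windows
        self.external_memory = {}      # stands in for the DPB in DRAM

    def store(self, start_addr, data):
        if len(self.windows) >= self.max_windows:
            # evict the oldest window: write it back to external memory
            old_addr, old_data = self.windows.popitem(last=False)
            self.external_memory[old_addr] = old_data
        self.windows[start_addr] = data
```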
Although the windows in Fig. 11 have different sizes, the windows can also have a common size. The common size may correspond to a single data word, a macroblock (MB), a sub-block, or a larger block. Fig. 12 illustrates an example of using windows with a common size. Block 1210 represents a reconstructed frame, where data position 1212 corresponds to the most recently received data and data position 1214 corresponds to the removed data. Block 1220 shows an example of storing fixed-size windows in non-contiguous space in the shared output buffer. Each window can be mapped to a start address aligned to a power of 2, with the lower part of the address serving as the offset within the window. Therefore, window matching can be realized efficiently by address comparison.
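Under the power-of-2 alignment described above, window matching reduces to masking off the lower address bits. The window size of 64 below is an arbitrary example, not a value from the patent.

```python
WINDOW_SIZE = 64  # assumed power-of-2 window size for illustration

def window_base(addr, window_size=WINDOW_SIZE):
    """Map an address to the start address of its fixed-size window by
    masking off the lower address bits (valid only for power-of-2 sizes)."""
    assert window_size & (window_size - 1) == 0, "size must be a power of 2"
    return addr & ~(window_size - 1)

def same_window(addr_a, addr_b, window_size=WINDOW_SIZE):
    # two addresses hit the same window iff their upper address bits agree
    return window_base(addr_a, window_size) == window_base(addr_b, window_size)
```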
According to another aspect of the present invention, a window detector is used to determine whether the required reference data is in the shared output buffer, and to fetch the required reference data from the shared output buffer or the external memory accordingly. When a core accesses reference frame data, the window detector provides a window-matching method by comparing the address or (x, y) position of the required reference frame data with each window in the shared output buffer. If the address/position of the required reference data lies between the start and end address/position of a window, it indicates that the reference data is in this window. If the window matching succeeds (i.e., the required reference frame data is in a window), the system computes the offset of the required data in the on-chip memory and starts reading the reference data from the offset position. When the window matching fails, if the required address or position has already been flushed, the system reads the reference data from the external memory; the system can learn this through address or position comparison. If the required reference data is ready in neither the shared output buffer nor the external memory, the system can stall the access.
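The window detector's three-way decision (on-chip hit, external-memory read, or stall) can be sketched as follows. The function signature and the `flushed_before` bookkeeping are illustrative assumptions.

```python
def fetch_reference(addr, windows, on_chip, external, flushed_before):
    """Three-way window-detector decision (sketch; names are illustrative).

    windows: list of (start, end) address ranges currently held on-chip.
    on_chip / external: address -> data mappings for the two memories.
    flushed_before: addresses below this value were already written back
    to external memory.
    """
    for start, end in windows:
        if start <= addr <= end:          # window match succeeds
            offset = addr - start         # offset within the on-chip window
            return ("on_chip", on_chip[addr], offset)
    if addr < flushed_before:             # data already flushed to DRAM
        return ("external", external[addr], None)
    return ("stall", None, None)          # data ready nowhere yet: stall
```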
Fig. 13 illustrates an example of a parallel decoder system using a shared output buffer with a window detector 1340. The system includes two decoder cores 1310 and 1320, a shared output buffer 1330, and a window detector 1340. The shared output buffer 1330 and the window detector 1340 are coupled to an external memory 1350. Block 1360 illustrates the positions of windows A, B, and C in the reconstructed frame, where the reconstructed data associated with windows A, B, and C is stored in the shared output buffer for possible reuse by decoder core B 1320. For the reconstructed frame, the data reconstructed before window A is stored in the decoded reference picture buffer in the external memory 1350; for example, the reference data 1364 before window A is stored in region 1354 of the external memory. Block 1370 illustrates an example state of the shared output buffer, where the reconstructed data associated with windows A, B, and C is stored in the shared output buffer. If decoder core B 1320 requires the reconstructed data 1362 in window B, the window detector 1340 detects this case and reads the corresponding reconstructed data 1372 from the shared output buffer. In the case where decoder core B 1320 requests access to the reconstructed data 1364 that has already been stored in the external memory 1350, the window detector 1340 detects this case and reads the corresponding reconstructed data 1354 from the external memory 1350.
Although Fig. 13 illustrates an example of a parallel decoder system using a shared output buffer with a window detector, the embodiment can be extended to a parallel decoder system with more than two decoder cores. Fig. 14 illustrates an example of a multi-core parallel decoder system using a shared output buffer with a window detector. The system includes multiple decoder cores 1410-0, 1410-1, 1410-2, etc., a shared output buffer 1420, and a window detector 1430. The shared output buffer 1420 and the window detector 1430 are coupled to an external memory 1440. A multiplexer 1450 is used to select one decoder core as the input to the shared output buffer 1420, and a demultiplexer 1460 is used to route data from the window detector 1430 to one decoder core. Each window in the shared output buffer can include a picture ID, so that the shared output buffer can be shared among multiple cores. Furthermore, the window detector 1430 can serve multiple cores, so that two or more decoder cores can reference the reconstructed data in the shared output buffer.
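The per-window picture ID mentioned above can be sketched as part of the lookup key. The `(picture_id, start, end)` tuple layout is an assumption for illustration.

```python
def find_window(windows, picture_id, addr):
    """Look up a window by picture ID and address so the shared output
    buffer can be shared among several cores (sketch; the
    (picture_id, start, end) tuple layout is an assumption)."""
    for pid, start, end in windows:
        if pid == picture_id and start <= addr <= end:
            return (pid, start, end)
    return None                           # no window holds this data
```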
The above description is provided to enable a person skilled in the art to practice the present invention as required by a particular application. Various modifications to the described embodiments will be apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments. Therefore, the present invention is not intended to be limited to the particular embodiments shown and described, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. In the foregoing detailed description, various specific details are set forth in order to provide a thorough understanding of the present invention. Nevertheless, it will be understood by those skilled in the art that the present invention may be practiced.
The parallel decoder system can also be implemented using program code stored in a readable medium. The software code can be configured using software formats such as Java, C++, XML, and other languages for defining functions that implement the device operations of the functional operations of the present invention. The code can be written in different forms and styles known to those skilled in the art. Different code formats, code configurations, styles and forms of software programs, and other means of configuring code to define the operations of a microprocessor in accordance with the present invention will not depart from the spirit and scope of the present invention. The software code can be executed on different types of devices, for example, a laptop or desktop computer, a handheld device with a processor or processing logic, and a computer server or other devices utilizing the present invention. The described examples are to be considered in all respects only as illustrative and not restrictive. Therefore, the scope of the present invention is indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (33)

1. A multi-core decoder system, comprising:
a plurality of decoder cores;
a shared reference data buffer, coupled to the plurality of decoder cores and an external memory, wherein the shared reference data buffer stores reference data received from the external memory and provides the reference data to the plurality of decoder cores for decoding video data; and
one or more decoding progress synchronizers, coupled to one or more of the plurality of decoder cores, to detect one or more pieces of decoding progress information associated with the plurality of decoder cores or status information of the shared reference data buffer, and to control the decoding progress of the one or more of the plurality of decoder cores.
2. The multi-core decoder system of claim 1, wherein the one or more decoding progress synchronizers are embedded in one or more of the plurality of decoder cores as a part of the one or more of the plurality of decoder cores.
3. The multi-core decoder system of claim 2, wherein the multi-core decoder system uses only one decoding progress synchronizer, and the decoding progress synchronizer is embedded in one decoder core acting as a master core, to detect the one or more pieces of decoding progress information associated with the plurality of decoder cores and to control the decoding progress of the one or more of the plurality of decoder cores.
4. The multi-core decoder system of claim 2, wherein each decoder core includes an embedded decoding progress synchronizer to control the decoding progress of the corresponding decoder core, and the embedded decoding progress synchronizers associated with the plurality of decoder cores operate in a peer-to-peer fashion.
5. The multi-core decoder system of claim 1, further comprising one or more delay first-in-first-out blocks, coupled to the plurality of decoder cores, the shared reference data buffer, and the external memory, wherein the delay first-in-first-out block stores current reference data used by one decoder core for later use by another decoder core.
6. The multi-core decoder system of claim 5, wherein the one or more decoding progress synchronizers are embedded in one or more of the plurality of decoder cores as a part of the one or more of the plurality of decoder cores, or the multi-core decoder system uses only one decoding progress synchronizer embedded in the delay first-in-first-out block.
7. The multi-core decoder system of claim 1, wherein the shared reference data buffer is implemented based on a level-1 cache, a level-2 cache, or another cache of similar architecture.
8. A multi-core decoder system, comprising:
a plurality of decoder cores;
a shared reference data buffer, coupled to the plurality of decoder cores and an external memory, wherein the shared reference data buffer stores reference data received from the external memory and provides the reference data to the plurality of decoder cores for decoding video data; and
a delay first-in-first-out block, coupled to the plurality of decoder cores, the shared reference data buffer, and the external memory, wherein the delay first-in-first-out block stores current reference data used by one decoder core for later use by another decoder core.
9. The multi-core decoder system of claim 8, wherein the delay first-in-first-out block is implemented based on a level-1 cache, a level-2 cache, or a dedicated on-chip static random access memory.
10. The multi-core decoder system of claim 8, wherein the shared reference data buffer is implemented based on a level-1 cache, a level-2 cache, or another cache of similar architecture.
11. The multi-core decoder system of claim 10, wherein the plurality of decoder cores, the shared reference data buffer, and the delay first-in-first-out block are integrated on the same substrate of an integrated circuit.
12. The multi-core decoder system of claim 8, wherein a leading decoder core receives first reference data from the external memory rather than from the shared reference data buffer, and the first reference data is also stored in the delay first-in-first-out block.
13. The multi-core decoder system of claim 12, wherein address or position information associated with the first reference data is also stored in the delay first-in-first-out block.
14. The multi-core decoder system of claim 12, wherein when a lagging decoder core requires the first reference data and the first reference data is still stored in the delay first-in-first-out block, the first reference data is read into the shared reference data buffer, and the lagging decoder core reads the first reference data from the shared reference data buffer.
15. The multi-core decoder system of claim 8, further comprising one or more multiplexers, wherein the one or more multiplexers select the input to the shared reference data buffer from the delay first-in-first-out block or the external memory, or select the reference data input for each decoder core from the shared reference data buffer or the delay first-in-first-out block.
16. A multi-core decoder system, comprising:
a plurality of decoder cores; and
a shared output buffer, coupled to the plurality of decoder cores and an external memory, wherein the shared output buffer stores reconstructed data from a first decoder core and, before storing the reconstructed data to the external memory, provides the reconstructed data to a second decoder core as reference data for decoding video data.
17. The multi-core decoder system of claim 16, wherein the reconstructed data is arranged into one or more windows and stored in the shared output buffer, and wherein each window is smaller than a whole frame.
18. The multi-core decoder system of claim 17, wherein the one or more windows have a common size, and the common size corresponds to a single data word, a macroblock, a sub-block, a coding unit, or a largest coding unit.
19. The multi-core decoder system of claim 17, wherein when the shared output buffer is full, the oldest window of the reconstructed data is evicted.
20. The multi-core decoder system of claim 16, further comprising a window detector, coupled to the plurality of decoder cores, the shared output buffer, and the external memory, wherein the window detector determines whether the reconstructed data required by the second decoder core is in the shared output buffer.
21. The multi-core decoder system of claim 20, wherein if the reference data required by the second decoder core is in the shared output buffer, the window detector causes the reference data required by the second decoder core to be provided to the second decoder core from the shared output buffer.
22. The multi-core decoder system of claim 20, wherein if the reference data required by the second decoder core is not in the shared output buffer, the window detector causes the reference data required by the second decoder core to be provided to the second decoder core from the external memory.
23. The multi-core decoder system of claim 20, wherein the reconstructed data stored in the shared output buffer is arranged into one or more windows each having a window address, and wherein, based on the window address of each window and a reference data address, the window detector determines whether the reconstructed data required as reference data by the second decoder core is in the shared output buffer.
24. The multi-core decoder system of claim 23, wherein the window detector determines that the reconstructed data required as reference data by the second decoder core is in the shared output buffer if the reference data address is greater than or equal to a start window address of a window and less than or equal to an end window address of the window.
25. The multi-core decoder system of claim 16, further comprising a multiplexer, coupled between the plurality of decoder cores and the shared output buffer, to select the reconstructed data from one of the plurality of decoder cores to be stored in the shared output buffer.
26. The multi-core decoder system of claim 20, further comprising a demultiplexer, coupled between the plurality of decoder cores and the window detector, to provide the reference data from the shared output buffer or the external memory to one of the plurality of decoder cores.
27. The multi-core decoder system of claim 16, wherein the shared output buffer is implemented based on a level-1 cache, a level-2 cache, or another cache of similar architecture.
28. A video decoding method for performing video decoding using a plurality of decoder cores in a decoder system, comprising:
arranging the plurality of decoder cores to decode two or more frames from a video bitstream using inter-frame-level parallel decoding;
providing reference data stored in a shared reference data buffer to the plurality of decoder cores for decoding the two or more frames; and
controlling the decoding progress of one or more of the plurality of decoder cores according to one or more pieces of decoding progress information about the plurality of decoder cores or status information of the shared reference data buffer, to reduce the memory access bandwidth associated with the shared reference data buffer.
29. The method of claim 28, wherein, according to the decoding progress information or the status information of the shared reference data buffer, controlling the decoding progress of the one or more of the plurality of decoder cores causes the decoding progress of the one or more of the plurality of decoder cores to stall, accelerate, or decelerate.
30. The method of claim 28, wherein controlling the decoding progress of the one or more of the plurality of decoder cores comprises: causing the decoding progress of the one or more of the plurality of decoder cores to stall, accelerate, or decelerate by stopping one or more sub-module state machines of the plurality of decoder cores, stopping or changing the clocks of one or more of the plurality of decoder cores, changing the memory-access priority of one or more of the plurality of decoder cores, stalling memory accesses, or a combination of the above.
31. The method of claim 28, wherein the decoding progress information associated with the one or more of the plurality of decoder cores is detected based on the position or index of a currently decoded macroblock, coding unit, largest coding unit, or superblock associated with the plurality of decoder cores.
32. The method of claim 31, wherein if the difference between the positions or indices of the currently decoded macroblocks or coding units associated with two decoder cores exceeds a threshold, one or more decoding progress synchronizers cause a leading decoder core of the two decoder cores to stall or slow down, or cause a lagging decoder core of the two decoder cores to speed up.
33. The method of claim 28, wherein the status information of the shared reference data buffer is detected based on whether any reference data accessed by one decoder core will be evicted, or whether the reference-data reuse rate of one decoder core drops or is below a threshold.
CN201610093836.XA 2014-04-22 2016-02-19 Multi-core decoder system and video encoding/decoding method Withdrawn CN106921862A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201414259144P 2014-04-22 2014-04-22
US201462096922P 2014-12-26 2014-12-26
US14/979,578 2015-12-28
US14/979,578 US20160191935A1 (en) 2014-04-22 2015-12-28 Method and system with data reuse in inter-frame level parallel decoding

Publications (1)

Publication Number Publication Date
CN106921862A true CN106921862A (en) 2017-07-04

Family

ID=59462902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610093836.XA Withdrawn CN106921862A (en) 2014-04-22 2016-02-19 Multi-core decoder system and video encoding/decoding method

Country Status (1)

Country Link
CN (1) CN106921862A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109739681A * 2017-08-29 2019-05-10 SK hynix Inc. Memory system with shared buffer architecture and operating method thereof
CN111316643A * 2019-03-29 2020-06-19 SZ DJI Technology Co., Ltd. Video coding method, device and movable platform
CN114697675A * 2020-12-25 2022-07-01 ALi Corporation Decoding display system and memory access method thereof
WO2022188753A1 * 2021-03-08 2022-09-15 Spreadtrum Communications (Shanghai) Co., Ltd. Video frame caching method, and device
CN116366864A * 2023-03-23 2023-06-30 Glenfly Intelligent Technology Co., Ltd. Parallel encoding and decoding method, device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120066566A1 (en) * 2010-09-09 2012-03-15 Qualcomm Incorporated Accessing memory during parallel turbo decoding
US20130259137A1 (en) * 2012-03-30 2013-10-03 Google Inc. System and Method for Multi-Core Hardware Video Encoding And Decoding
US20150208076A1 (en) * 2014-01-21 2015-07-23 Lsi Corporation Multi-core architecture for low latency video decoder
CN104854870A * 2012-12-19 2015-08-19 Qualcomm Incorporated Low-delay buffering model in video coding



Similar Documents

Publication Publication Date Title
CN1934866B (en) A video decoding device
CN106921862A (en) Multi-core decoder system and video encoding/decoding method
US20160191935A1 (en) Method and system with data reuse in inter-frame level parallel decoding
US9172969B2 (en) Local macroblock information buffer
US9253496B2 (en) Intelligent decoded picture buffering
US10951914B2 (en) Reliable large group of pictures (GOP) file streaming to wireless displays
CN1989769B (en) Method, processor and system for direct memory access
US20170006303A1 (en) Method and system of adaptive reference frame caching for video coding
US20150091927A1 (en) Wavefront order to scan order synchronization
KR102598748B1 (en) Bandwidth saving architecture for scalable video coding spatial mode
US9509992B2 (en) Video image compression/decompression device
CN102761739B (en) For dynamically adjusting the apparatus and method of video coding complexity
US20240040152A1 (en) Encoding device and encoding method
US20180262760A1 (en) Screen content detection for adaptive encoding
EP0863676A2 (en) Memory manager for MPEG decoder
de Souza et al. HEVC in-loop filters GPU parallelization in embedded systems
CN101883276B (en) Multi-format high-definition video decoder structure for software and hardware combined decoding
Wang et al. Motion compensation architecture for 8K UHDTV HEVC decoder
Kopperundevi et al. A high throughput hardware architecture for deblocking filter in HEVC
US10244247B2 (en) Enhanced data processing apparatus employing multiple-block based pipeline and operation method thereof
Montero et al. Parallel zigzag scanning and Huffman coding for a GPU-based MPEG-2 encoder
US11683509B1 (en) Prediction unit skip detection in encoder
JP5867050B2 (en) Image processing device
US12022088B2 (en) Method and apparatus for constructing motion information list in video encoding and decoding and device
Zhou et al. A frame-parallel 2 gpixel/s video decoder chip for uhdtv and 3-dtv/ftv applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20170704)