CN101083763B - Programmable video processing unit and video data processing method - Google Patents

Programmable video processing unit and video data processing method

Info

Publication number
CN101083763B
CN101083763B (application CN2007101119554A)
Authority
CN
China
Prior art keywords
data
video
sent
texture
multiplexer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2007101119554A
Other languages
Chinese (zh)
Other versions
CN101083763A (en)
Inventor
扎伊尔德·荷圣
徐建明
约翰·柏拉勒斯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Via Technologies Inc
Original Assignee
Via Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Via Technologies Inc
Publication of CN101083763A
Application granted
Publication of CN101083763B

Abstract

The invention provides a programmable video processing unit and a video data processing method. In particular, it relates to a programmable video processing unit for video data of at least two formats, including filter logic for filtering the video data based on its format, converter logic for converting the video data based on its format, and output logic for passing the video data on for further processing. The filter logic and the converter logic can operate in parallel. The data formats can be any of MPEG-2, VC-1 and H.264. The programmable video processing unit and video data processing method of the invention can improve the speed at which video data is processed.

Description

Programmable video processing unit and video data processing method
Technical field
The invention relates to processing video and graphics data, and more particularly to providing a video processing unit with a programmable core.
Background technology
With the continuing development of computer technology, the demands placed on computing equipment have risen accordingly. In particular, many computer applications and/or data streams require the processing of video data, and as video data becomes more complex, the processing requirements for that data increase as well.
At present, many computing architectures provide a central processing unit (CPU) for processing data that includes video and graphics data. Although a CPU can provide adequate processing capability for some video and graphics workloads, the CPU must also handle other data. Consequently, the demands placed on the CPU when processing complex video and graphics may adversely affect the performance of the whole system.
In addition, many computing architectures include one or more execution units (EUs) for processing data. More particularly, an EU can be used to process a plurality of different data types in at least one architecture. As with the CPU, the demands placed on the EUs by processing complex video and graphics data may adversely affect the performance of the overall computing system. Processing complex video and graphics data on the EUs may also increase power consumption beyond an acceptable threshold. Furthermore, differing data protocols or specifications can limit the ability of an EU to process video and graphics data. Many current computing architectures also provide 32-bit instructions, which may reduce efficiency and thus affect processing speed. There is also a need to perform multiple operations within a single component.
Therefore, a heretofore unaddressed need exists in the industry to address the aforementioned deficiencies and inadequacies.
Summary of the invention
The present invention includes embodiments for processing video data. One embodiment of the invention comprises a programmable video processing unit for processing video data of at least two formats, comprising: a command stream processor for issuing instructions; a plurality of execution units for receiving the instructions issued by the command stream processor, determining whether an instruction is a video instruction and, if so, sending the video instruction to a texture address generator; the texture address generator, for sending the video instruction to a texture cache controller; the texture cache controller, for caching data into a texture filter unit; the texture filter unit, for receiving the video instruction sent by the texture cache controller and filtering the video data according to the video instruction; and a video processor, for processing the video data filtered by the texture filter unit according to the video instruction received from the texture cache controller and sending the processed video data to a post packer, where the post packer collects tiles from the texture filter unit and, if a tile is partially complete, packs the tile with specific identifiers for use by the pipeline and returns the tile to the execution units. The execution units output decoded video data by sending the video instructions to the texture filter unit and the video processor, and the command stream processor can control the execution units to execute in parallel in order to accomplish video encoding.
The present invention also includes embodiments of a method for processing video data. The video data processing method comprises: an execution unit receiving an instruction from a command stream processor, determining whether the received instruction is a video instruction and, if so, sending the video instruction to a texture address generator; the texture address generator sending the video instruction to a texture cache controller; the texture cache controller caching video data, of one of at least two formats, into a texture filter unit; the texture filter unit receiving the video instruction sent by the texture cache controller and filtering the video data according to the video instruction; and a video processing unit processing the video data filtered by the texture filter unit according to the video instruction received from the texture cache controller and sending the processed video data to a post packer, where the post packer collects tiles from the texture filter unit and, if a tile is partially complete, packs the tile with specific identifiers for use by the pipeline and returns the tile to the execution unit. The video instruction includes an identification field indicating the format of the video data; the execution unit outputs decoded video data by sending the video instruction to the texture filter unit and the video processor; and the command stream processor can control a plurality of the execution units to execute in parallel in order to accomplish video encoding.
Other systems, methods, features and advantages of the present disclosure will be, or will become, apparent to one skilled in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description and fall within the scope of the present disclosure.
The programmable video processing unit and video data processing method provided by the present invention can improve the processing speed of video data.
Description of drawings
Fig. 1 illustrates an embodiment of a computing architecture for processing video data.
Fig. 2 illustrates an embodiment of a computing architecture similar to that of Fig. 1 with a video processing unit (VPU) introduced.
Fig. 3 is a flowchart of an embodiment of a process for processing video and graphics data in the computing architecture of Fig. 2.
Fig. 4A is a functional flow diagram of an embodiment of the data flow in a computing device, such as a computing device with the computing architecture of Fig. 2.
Fig. 4B is a continuation of the functional flow diagram of Fig. 4A.
Fig. 4C is a continuation of the functional flow diagram of Figs. 4A and 4B.
Fig. 5A is a functional block diagram of an embodiment of components that can be used, for example in the computing architecture of Fig. 2, to provide motion compensation (MC) and/or discrete cosine transform (DCT) operations.
Fig. 5B is a continuation of the diagram of Fig. 5A.
Fig. 5C is a continuation of the diagrams of Figs. 5A and 5B.
Fig. 5D is a continuation of the diagrams of Figs. 5A-5C.
Fig. 5E is a continuation of the diagrams of Figs. 5A-5D.
Fig. 5F is an embodiment of an overall diagram of the components of Figs. 5A-5E.
Fig. 6 is a functional block diagram of a pixel processing engine that can be used in a computing architecture such as that of Fig. 2.
Fig. 7A is a functional block diagram illustrating components that can be used for VC-1 in-loop filtering, for example in the computing architecture of Fig. 2.
Fig. 7B is a continuation of the diagram of Fig. 7A.
Fig. 7C is a continuation of the diagrams of Figs. 7A and 7B.
Fig. 7D is a continuation of the diagrams of Figs. 7A-7C.
Fig. 8 is a block diagram of components that can be used to perform a sum of absolute differences calculation in a computing architecture such as that of Fig. 2.
Fig. 9 is a flowchart of an embodiment of a sum of absolute differences process that can be performed with the components of Fig. 8.
Fig. 10A is a block diagram illustrating a plurality of components that can be used in a deblocking operation, such as one performed in the computing architecture of Fig. 2.
Fig. 10B is a continuation of the diagram of Fig. 10A.
Fig. 10C is a continuation of the diagrams of Figs. 10A and 10B.
Fig. 10D is a continuation of the diagrams of Figs. 10A-10C.
Fig. 10E is a continuation of the diagrams of Figs. 10A-10D.
Fig. 11 is a flowchart of an embodiment of a process for processing data in a computing architecture such as that of Fig. 2.
Detailed description of the embodiments
Fig. 1 illustrates an embodiment of a computing architecture for processing video data. As shown in Fig. 1, a computing device can include an execution unit (EU) pool 146. The execution unit pool 146 can include one or more execution units for executing on data in the computing architecture of Fig. 1. The execution unit pool 146 (referred to herein as "EUP 146") can be coupled to, and receive data from, a stream cache 116. The EUP 146 can also be coupled to an input port 142 and an output port 144. The input port 142 can receive data from an EUP controller 118 with a cache subsystem. The input port 142 can also receive data from an L2 cache 114 and a post packer 160. The EUP 146 can process the received data and output the processed data to the output port 144.
In addition, the EUP controller 118 with cache subsystem can send data to a memory access unit (hereinafter MXU) A 164a and to a triangle and attribute setup unit 134. The L2 cache 114 can also send data to, and receive data from, MXU A 164a. A vertex cache 112 and a stream cache 110 can also communicate with MXU A 164a, and a memory access port 108 also communicates with MXU A 164a. The memory access port 108 can communicate data with a bus interface unit (BIU) 90 and with memory interface units (MIUs) A 106a, B 106b, C 106c and D 106d, and the memory access port 108 can also be coupled to MXU B 164b.
MXU A 164a is also coupled to a command stream processor (hereinafter CSP) front end 120 and a CSP back end 128. The CSP front end 120 is coupled to a 3D and state component 122, which is coupled to the EUP controller 118 with cache subsystem. The CSP front end 120 is also coupled to a 2D pre component 124, which is coupled to a 2D first-in first-out (FIFO) component 126. The CSP front end 120 also communicates data with a clear and type texture processor 130 and an advanced encryption system (AES) encryption/decryption component 132. The CSP back end 128 is coupled to a span-tile generator 136.
The triangle and attribute setup unit 134 is coupled to the 3D and state component 122, the EUP controller 118 with cache subsystem, and the span-tile generator 136. The span-tile generator 136 can send data to a ZL1 cache 123; the span-tile generator 136 can also be coupled to ZL1 138, and ZL1 138 can send data to the ZL1 cache 123. ZL2 140 can be coupled to a Z (e.g., depth buffer) and stencil (ST) cache 148. The Z and ST cache 148 can send and receive data through a write-back unit 162 and can be coupled to a bandwidth (BW) compressor 146. The BW compressor 146 can also be coupled to MXU B 164b, which can be coupled to a texture cache and controller 166. The texture cache and controller 166 can be coupled to a texture filter unit (hereinafter TFU) 168, and the TFU 168 can send data to the post packer 160. The post packer 160 can be coupled to an interpolator 158. A pre packer 156 can be coupled to the interpolator 158 and to the texture address generator 150. The write-back unit 162 can be coupled to a 2D pro component 154, a D cache 152, the Z and ST cache 148, the input port 142 and the CSP back end 128.
The embodiment of Fig. 1 processes video data by utilizing the EUP 146. More particularly, in at least one embodiment, one or more of the execution units can be used to process video data. Although this architecture is suitable for some uses, it may consume excess power; it may also have considerable difficulty processing H.264 data.
Fig. 2 illustrates an embodiment of a computing architecture that is similar to that of Fig. 1 but introduces a video processing unit (hereinafter VPU). More particularly, in the embodiment of Fig. 2, a VPU 199 with a programmable core can be provided in the computing architecture of Fig. 1. The VPU 199 can be coupled to the CSP front end 120 and the TFU 168. The VPU 199 can serve as a dedicated processor for video data. In addition, the VPU 199 can be used to process video data coded with the MPEG (Moving Picture Experts Group), VC-1 and H.264 protocols.
More particularly, in at least one embodiment, shader code can be executed on one or more of the execution units (EUs) 146. Instructions can be fetched from a buffer and decoded, and the major and minor opcodes can be used to determine the EU to which the operands are delivered and the function to be performed on those operands. If the operation is of the SAMPLE type (for example, all VPU instructions are of the SAMPLE type), the instruction can be dispatched from the EUP 146. Although the VPU 199 can reuse the TFU filtering hardware to reduce hardware, the VPU 199 can also reside alongside the TFU 168.
For a SAMPLE operation the EUP 146 builds a 580-bit data structure (see Table 1). The EUP 146 fetches the source register indicated by the SAMPLE instruction and places that data in the least significant 512 bits of the EUP-TAG interface structure. The other related fields that the EUP 146 fills into this structure are:
REG_TYPE: this should be 0
ThreadID: used to return the result to the correct shader program
ShaderResID
ShaderType = PS
CRFIndex: destination register
SAMPLE_MODE: the VPU filtering operation to be performed
ExeMode = vertical
This data structure can then be sent to the texture address generator (hereinafter TAG) 150. The TAG 150 can inspect the SAMPLE_MODE bits to decide whether the Data field contains texture sample information or actual data. If it contains actual data, the TAG 150 forwards the data directly to the VPU 199; otherwise the TAG 150 starts a texture fetch.
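For illustration only, the fields the EUP fills in can be pictured as the following C-style sketch; the names and widths are those of Table 1 below, while the struct form itself is an assumption (the real interface is a packed 580-bit hardware bus, not a C structure).

#include <stdint.h>

/* Illustrative sketch of the EUP-to-TAG SAMPLE request; field names and
 * widths follow Table 1, but the actual interface is a packed 580-bit bus. */
typedef struct {
    uint32_t data[16];      /* 512-bit Data field: source data or 4x4x32 texels      */
    uint8_t  reg_type;      /* should be 0                                           */
    uint8_t  req_type;      /* 1 bit: 0 = sample, 1 = resinfo                        */
    uint8_t  write_mask;    /* 4 bits: texel component write mask                    */
    uint8_t  thread_id;     /* 6 bits: ThreadID, used to return the result           */
    uint8_t  shader_res_id; /* 2 bits                                                */
    uint8_t  shader_type;   /* 00 VS, 01 GS, 10 PS, 11 PS_PF (PS for video)          */
    uint8_t  crf_index;     /* 8 bits: destination register for the returned result  */
    uint8_t  sample_mode;   /* 5 bits: which VPU filtering operation to perform      */
    uint8_t  exe_mode;      /* 1 = horizontal, 0 = vertical                          */
} eup_tag_request_t;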
Table 1: EUP-TAG interface for video processing (data input XOUT_TAG_DATA, 580 bits)

  Data          - 512 bits, [511:0]:   4×4×32 texel/source data
  Req Type      - 1 bit,    [525]:     request type: 0 = sample, 1 = resinfo
  (reserved)    - 7 bits,   [533:527]: not used (reserved)
  (reserved)    - 4 bits,   [537:534]: not used (reserved)
  Write_Mask    - 4 bits,   [541:538]: texel component write mask
  Thread Id     - 6 bits,   [547:542]: EU thread
  Shader Res ID - 2 bits,   [551:550]: shader resource ID
  Shader Type   - 3 bits,   [553:552]: 00: VS, 01: GS, 10: PS, 11: PS_PF
  CRF Index     - 8 bits,   [565:558]: EU return address, 6 + 2 sub id
  Sample Mode   - 5 bits,   [570:566]: 01000: SAMPLE_MCF_BLR, 01001: SAMPLE_MCF_VC1,
                                       01010: SAMPLE_MCF_H264, 01111: SAMPLE_SAD,
                                       01011: SAMPLE_IDF_VC1, 01100: SAMPLE_IDF_H264_0,
                                       01101: SAMPLE_IDF_H264_1, 01110: SAMPLE_IDF_H264_2,
                                       10000: SAMPLE_TCF_I4×4, 10001: SAMPLE_TCF_M4×4,
                                       10010: SAMPLE_TCF_MPEG2, 10011: SAMPLE_MADD,
                                       10100: SAMPLE_SMMUL
  Exe_mode      - 1 bit,    [571]:     execution mode: 1 = horizontal, 0 = vertical
  Bx2           - 1 bit,    [572]:     _bx2 modifier; for sample_Id this flag indicates whether a
                                       sampler is used: 0 = no s#, 1 = s# present (for video use)
  <R>           - 9 bits,   [579:573]: reserved
If SAMPLE_MODE is one of MCF, SAD, IDF_VC-1, IDF_H264_0 or IDF_H264_1, texture data needs to be fetched; otherwise the data is already present in the Data field.
The information the TAG 150 needs to generate the address and pass to the texture cache controller (hereinafter TCC) 166 can be found in the least significant 128 bits of the Data field:
Bits [31:0]: the U, V coordinates, which form the address of a texture block (4×4×8)
Bits [102:96]: T#
Bits [106:103]: S#
T#, S#, U and V are the complete information required to fetch a texture from a particular surface. U, V, T# and S# can be extracted from the SRC1 field of the instruction during decode and used to fill the fields above; they can therefore be modified dynamically during execution.
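A minimal sketch of how these address fields could be pulled out of the least significant 128 bits of the Data field is shown below; the bit positions are exactly those listed above, and the helper type and function names are hypothetical.

#include <stdint.h>

/* Illustrative extraction of the address fields from the low 128 bits of Data.
 * data128[0] holds bits [31:0], data128[3] holds bits [127:96].              */
typedef struct { uint16_t u, v; uint8_t t_num; uint8_t s_num; } tex_addr_t;

static tex_addr_t extract_tex_addr(const uint32_t data128[4])
{
    tex_addr_t a;
    a.u     = (uint16_t)(data128[0] & 0xFFFF);      /* bits [15:0]   : U            */
    a.v     = (uint16_t)(data128[0] >> 16);         /* bits [31:16]  : V            */
    a.t_num = (uint8_t)(data128[3] & 0x7F);         /* bits [102:96] : T#, 7 bits   */
    a.s_num = (uint8_t)((data128[3] >> 7) & 0x0F);  /* bits [106:103]: S#, 4 bits   */
    return a;
}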
The SAMPLE_MODE, together with the least significant 128 bits containing this information, can then be placed in the command first-in first-out buffer (hereinafter COMMAND FIFO) of the VPU 199, while the corresponding data FIFO (DATA FIFO) can be filled either with the data forwarded from the texture cache (bits [383:128]) or with 256 bits (maximum). This data is then operated on in the VPU 199, where the operation is determined by the information in the COMMAND FIFO, and the result (256 bits maximum) can be sent back to the EUP 146 and the EU registers using the ThreadID and CRFIndex as the return address.
In addition, the present invention includes an instruction set provided by the EUP 146 for use by the VPU 199; its instructions can be formatted as 64 bits, although this is not essential. More particularly, in at least one embodiment, the VPU instruction set can include one or more motion compensation filter (hereinafter MCF) instructions. One or more of the following MCF instructions may be present in this embodiment:
SAMPLE_MCF_BLR DST, S#, T#, SRC2, SRC1
SAMPLE_MCF_VC1 DST, S#, T#, SRC2, SRC1
SAMPLE_MCF_H264 DST, S#, T#, SRC2, SRC1
The first 32 bits of SRC1 contain the U, V coordinates, with U in the least significant 16 bits. Since SRC2 may be unused or ignored, SRC2 can take any value; alternatively it is a 32-bit value containing a 4-element filter kernel, each element being a signed 8-bit value as disclosed below.
Table 2: MCF filter kernel
(table reproduced as an image in the original publication)
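Since Table 2 is only available as an image, the packing described in the text (a 4-element MCF filter kernel in one 32-bit SRC2 word, each element a signed 8-bit value) can be sketched as follows; the assumption that element 0 sits in the least significant byte is ours, not the document's.

#include <stdint.h>

/* Illustrative unpacking of the 4-element MCF filter kernel carried in SRC2.
 * Each element is a signed 8-bit value; element 0 is assumed to occupy the
 * least significant byte.                                                    */
static void unpack_mcf_kernel(uint32_t src2, int8_t w[4])
{
    for (int i = 0; i < 4; ++i)
        w[i] = (int8_t)((src2 >> (8 * i)) & 0xFF);  /* sign is preserved by the cast */
}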
In addition, the instruction set of the VPU 199 also includes instructions for in-loop deblocking filtering (hereinafter IDF), such as one or more of the following:
SAMPLE_IDF_VC1 DST, S#, T#, SRC2, SRC1
SAMPLE_IDF_H264_0 DST, S#, T#, SRC2, SRC1
SAMPLE_IDF_H264_1 DST, S#, T#, SRC2, SRC1
SAMPLE_IDF_H264_2 DST, S#, T#, SRC2, SRC1
For VC-1 IDF operations, the TFU 168 can provide 8×4×8 (or 4×8×8) data to the filter buffer. For H.264, however, the amount of data delivered by the TFU 168 depends on the type of H.264 IDF operation.
For the SAMPLE_IDF_H264_0 instruction, the TFU supplies an 8×4×8 (or 4×8×8) data block. For the SAMPLE_IDF_H264_1 instruction, the TFU 168 supplies one 4×4×8 data block and the other 4×4×8 data block is supplied by the shader (EU) 146 (Fig. 2). With SAMPLE_IDF_H264_2, both 4×4×8 data blocks are supplied by the shader (in the EU 146) rather than by the TFU 168; a sketch of how the edge block can be assembled under these variants is given below.
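A hypothetical sketch of the block assembly implied by these variants is given below: variant 0 takes the whole 8×4×8 block from the texture cache, variant 1 combines one TFU-supplied and one shader-supplied 4×4×8 block, and variant 2 combines two shader-supplied blocks. The row ordering across the edge is an assumption.

#include <stdint.h>
#include <string.h>

/* Illustrative assembly of the 8x4 block filtered by the H.264 IDF
 * instructions.  Variant 0 (whole block from the TFU) needs no assembly;
 * this helper covers variants 1 and 2, where two 4x4x8 halves are combined. */
typedef struct { uint8_t px[4][4]; } block4x4_t;

static void assemble_idf_block(uint8_t out[8][4],
                               const block4x4_t *from_tfu,  /* NULL for variant 2 */
                               const block4x4_t *from_eu0,
                               const block4x4_t *from_eu1)  /* used for variant 2 */
{
    const block4x4_t *top    = from_tfu ? from_tfu : from_eu0;
    const block4x4_t *bottom = from_tfu ? from_eu0 : from_eu1;
    for (int r = 0; r < 4; ++r) {
        memcpy(out[r],     top->px[r],    4);  /* rows 0..3: one side of the edge   */
        memcpy(out[r + 4], bottom->px[r], 4);  /* rows 4..7: the other side         */
    }
}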
In addition, the instruction set of the VPU 199 also includes a motion estimation (hereinafter ME) instruction, which can include the following:
SAMPLE_SAD DST, S#, T#, SRC2, SRC1
The above instructions can be mapped to the following major and minor opcodes and take the form described above. Details of the SRC and DST formats are discussed below in the sections on the respective instructions.
Table 3: Motion estimation and corresponding opcodes
(table reproduced as an image in the original publication)
Here LCK indicates LOCK, which when set locks the EU data path and does not allow another thread to enter the pipeline. NEG indicates inversion of the predicate register. The S# and T# fields are ignored by the VPU SAMPLE instructions; the T# and S# fields encoded with SRC1 are used instead.
Table 4: Motion compensation filtering and corresponding opcodes
Table 5: Transform coefficient filtering (hereinafter TCF) and corresponding opcodes
(tables reproduced as images in the original publication)
The SAMPLE instructions follow the execution path shown in Fig. 3. The EUP-TAG interface is given in Table 6 below; the other interfaces are described in more detail later.
Table 6: EUP-TAG interface for video processing (data input XOUT_TAG_DATA, 580 bits)

  Data          - 512 bits, [511:0]:   4×4×32 texel/source data
  Req Type      - 1 bit,    [525]:     request type: 0 = sample, 1 = resinfo
  T#            - 7 bits,   [533:527]: texture index 0-127
  S#            - 4 bits,   [537:534]: sampler index 0-15
  Write Mask    - 4 bits,   [541:538]: texel component write mask
  Thread Id     - 6 bits,   [547:542]: EU thread
  Shader Res ID - 2 bits,   [551:550]: shader resource ID
  Shader Type   - 3 bits,   [553:552]: 00: VS, 01: GS, 10: PS, 11: PS_PF
  CRF Index     - 8 bits,   [565:558]: EU return address, 6 + 2 sub id
  Sample Mode   - 5 bits,   [570:566]: 01000: SAMPLE_MCF_BLR, 01001: SAMPLE_MCF_VC1,
                                       01010: SAMPLE_MCF_H264, 01111: SAMPLE_SAD,
                                       01011: SAMPLE_IDF_VC1, 01100: SAMPLE_IDF_H264_0,
                                       01101: SAMPLE_IDF_H264_1, 01110: SAMPLE_IDF_H264_2,
                                       10000: SAMPLE_TCF_I4×4, 10001: SAMPLE_TCF_M4×4,
                                       10010: SAMPLE_TCF_MPEG2, 10011: SAMPLE_MADD,
                                       10100: SAMPLE_SMMUL
  Exe mode      - 1 bit,    [571]:     execution mode: 1 = horizontal, 0 = vertical
  Bx2           - 1 bit,    [572]:     _bx2 modifier; for sample_Id this flag indicates whether a
                                       sampler is used: 0 = no s#, 1 = s# present (for video use)
  <R>           - 9 bits,   [579:573]: reserved
It should be noted that texture sample filtering operations can also be mapped onto the Sample Mode field, in which case the value is 00XXX. The value 11XXX is currently reserved for future use. In addition, in at least one embodiment disclosed herein, some video capabilities, such as ME (motion estimation), MC (motion compensation), TC (transform coding) and ID (in-loop deblocking), can be inserted into the texture pipeline so as to reuse the L2 cache logic and some of the L2 load multiplexers for filtering the data.
The following table summarizes the data load criteria from the TCC 166 and/or TFU 168 for the different SAMPLE instructions. Note that, depending on the particular architecture, SAMPLE_MC_H264 may be used only for the Y plane and is not required for the CrCb plane.
Table 7: Data loads for video (Y plane / CrCb plane)

  SAMPLE_MC_BLR      - 8×8×8 block from the texture cache (Y: yes, CrCb: yes)
  SAMPLE_MC_VC1      - 12×12×8 block from the texture cache (Y: yes, CrCb: yes)
  SAMPLE_MC_H264     - 12×12×8 block from the texture cache (Y: yes, CrCb: no)
  SAMPLE_SAD         - 8×4×8 block from the texture cache, V can be any alignment (Y: yes, CrCb: yes)
  SAMPLE_IDF_VC1     - 8×4×8 (or 4×8×8) from the texture cache, 32-bit aligned (Y: yes, CrCb: yes)
  SAMPLE_IDF_H264_0  - 8×4×8 (or 4×8×8) from the texture cache, 32-bit aligned (Y: yes, CrCb: yes)
  SAMPLE_IDF_H264_1  - 4×4×8 from the texture cache, 32-bit aligned (Y: yes, CrCb: yes)
  SAMPLE_IDF_H264_2  - no data from the texture cache
  SAMPLE_TCF_I4×4    - no data from the texture cache
  SAMPLE_TCF_M4×4    - no data from the texture cache
  SAMPLE_TCF_MPEG2   - no data from the texture cache
  SAMPLE_MADD        - no data from the texture cache
  SAMPLE_SMMUL       - no data from the texture cache
In at least one embodiment disclosed herein, the Y plane can use the HSF_Y0Y1Y2Y3_32BPE_VIDEO2 tiled format. The CrCb plane contains interleaved Cr and Cb channels and is treated as the HSF_CrCb_16BPE_VIDEO tiled format. If an interleaved CbCr plane is not required, the same format as the Y plane can be used for Cb or Cr.
In addition, the following instructions are added to the shader instruction set architecture (ISA):
SAMPLE_MCF_BLR DST, S#, T#, SRC2, SRC1
SAMPLE_MCF_VC1 DST, S#, T#, SRC2, SRC1
SAMPLE_MCF_H264 DST, S#, T#, SRC2, SRC1
SAMPLE_IDF_VC1 DST, S#, T#, SRC2, SRC1
SAMPLE_IDF_H264_0 DST, S#, T#, SRC2, SRC1
SAMPLE_IDF_H264_1 DST, S#, T#, SRC2, SRC1
SAMPLE_SAD DST, S#, T#, SRC2, SRC1
SAMPLE_TCF_MPEG2 DST, #ctrl, SRC2, SRC1
SAMPLE_TCF_I4×4 DST, #ctrl, SRC2, SRC1
SAMPLE_TCF_M4×4 DST, #ctrl, SRC2, SRC1
SAMPLE_MADD DST, #ctrl, SRC2, SRC1
SAMPLE_IDF_H264_2 DST, #ctrl, SRC2, SRC1
The #ctrl used for SAMPLE_IDF_H264_2 should be zero.
SRC1, SRC2 and #ctrl (when available) can be used to form the 512-bit data field of the EU/TAG/TCC interface shown in Table 8 below.
(Table 8 is reproduced as an image in the original publication.)
Referring to Table 8: TR = transpose; FD = filter direction (vertical = 1); BS = boundary strength; BR = BR control; YC bit (YC = 1 for the CbCr plane, YC = 0 for the Y plane); and CEF = chroma edge flag. In addition, when only 32 bits (or fewer) of SRC1 or SRC2 are used (the remainder being undefined), a lane select can be specified together with the use of the lower register.
Although the instruction formats are described above, an overview of the instruction operations is given in Table 10 below.
Table 10: Instruction overview

  SAMPLE_MCF_BLR (SAMPLE_MCF_BLR DST, SRC2, SRC1): performs MC filtering.
  SAMPLE_MCF_VC1 (SAMPLE_MCF_VC1 DST, SRC2, SRC1): performs MC filtering for VC-1.
  SAMPLE_MCF_H264 (SAMPLE_MCF_H264 DST, SRC2, SRC1): performs MC filtering for H.264.
  SAMPLE_IDF_VC1 (SAMPLE_IDF_VC1 DST, SRC2, SRC1): VC-1 deblocking operation.
  SAMPLE_IDF_H264_0 (SAMPLE_IDF_H264_0 DST, SRC2, SRC1): H.264 deblocking operation; a 4×4×8 (vertical filter) or 8×4×8 block is provided from the texture cache 166.
  SAMPLE_IDF_H264_1 (SAMPLE_IDF_H264_1 DST, SRC2, SRC1): H.264 deblocking operation; one 4×4×8 block is provided from the shader and another 4×4×8 block from the texture cache 166, allowing an 8×4 (or 4×8) block to be constructed.
  SAMPLE_IDF_H264_2 (SAMPLE_IDF_H264_2 DST, #ctrl, SRC2, SRC1): H.264 deblocking operation; two 4×4 blocks are provided by the shader to construct an 8×4 block.
  SAMPLE_SAD (SAMPLE_SAD DST, S#, T#, SRC2, SRC1): performs a sum of absolute differences (SAD) operation on reference (SRC2) and prediction data.
  SAMPLE_TCF_I4×4 (SAMPLE_TCF_I4×4 DST, #ctrl, SRC2, SRC1): performs transform coding.
  SAMPLE_TCF_M4×4 (SAMPLE_TCF_M4×4 DST, #ctrl, SRC2, SRC1): performs transform coding.
  SAMPLE_TCF_MPEG2 (SAMPLE_TCF_MPEG2 DST, #ctrl, SRC2, SRC1): performs transform coding.
  SAMPLE_MADD (SAMPLE_MADD DST, #ctrl, SRC2, SRC1): see below.
  SAMPLE_SMMUL (SAMPLE_SMMUL DST, #ctrl, SRC2, SRC1): performs a scalar-matrix multiplication; #ctrl is an 11-bit immediate value and may be 0 (i.e., the #ctrl signal is ignored). See also below.
In addition, for SAMPLE_MADD, #ctrl can be an 11-bit immediate value, and the instruction performs the addition of two 4×4 matrices (SRC1 and SRC2). The elements of either matrix can be signed 16-bit integers, and the result (DST) is a 4×4 matrix of 16-bit values. The matrices can be placed in the source/destination registers as shown in Table 11 below, which can be handled on individual elements within the VPU. In addition, the SRC1 and #ctrl data can be accessed in one cycle and SRC2 in the following cycle, so one operation can be issued every two cycles.
#ctrl[0] indicates whether a saturation (SAT) operation is performed.
#ctrl[1] indicates whether a rounding (R) operation is performed.
#ctrl[2] indicates whether a 1-bit right shift (S) operation is performed.
#ctrl[10:3] are ignored.
Table 11: Source and destination register layout for the matrices

  Bits  [255:240] [239:224] [223:208] [207:192] ... [63:48] [47:32] [31:16] [15:0]
  hold   M33       M32       M31       M30      ...  M03     M02     M01     M00
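Reading Table 11 from the least significant end, element Mij occupies bits [16*(4*i+j)+15 : 16*(4*i+j)], so M00 sits in bits [15:0] and M33 in bits [255:240]. A small sketch of that packing, with the 256-bit register modeled as sixteen 16-bit lanes, is shown below; the lane model is illustrative rather than a description of the actual register file.

#include <stdint.h>

/* Illustrative packing of a 4x4 matrix of signed 16-bit elements into the
 * 256-bit register layout of Table 11, modeled as sixteen 16-bit lanes.
 * Lane k corresponds to bits [16k+15 : 16k] of the register.               */
static void pack_matrix(const int16_t m[4][4], uint16_t reg_lanes[16])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            reg_lanes[4 * i + j] = (uint16_t)m[i][j];  /* M00 -> lane 0, M33 -> lane 15 */
}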
In addition, the logic criteria associated with this data can include the following:
#Lanes := 16; #Lanewidth := 16;
IF (#ctrl[1]) R = 1; ELSE R = 0;
IF (#ctrl[2]) S = 1; ELSE S = 0;
IF (#ctrl[0]) SAT = 1; ELSE SAT = 0;
For (I := 0; I < #Lanes; I += 1) {
    Base := I * #Lanewidth;
    Top := Base + #Lanewidth - 1;
    Source1[I] := SRC1[Top..Base];
    Source2[I] := SRC2[Top..Base];
    Destination[I] := (Source1[I] + Source2[I] + R) >> S;
    IF (SAT) Destination[I] = MIN(MAX(Destination[I], 0), 255);
    DST[Top..Base] = Destination[I];
}
Referring to Table 9, SAMPLE_SMMUL performs a scalar-matrix multiplication. #ctrl is an 11-bit immediate value, and this value may be 0 (that is, the #ctrl signal is ignored). This instruction is in the same group as SAMPLE_TCF and SAMPLE_IDF_H264_2. The logic criteria associated with the instruction can include the following:
#Lanes := 16; #Lanewidth := 16;
MMODE = Control_4[17:16];
SM = Control_4[7:0];
SP = Control_4[15:8];  // only the least significant 5 bits are used
For (I := 0; I < #Lanes; I += 1) {
    Base := I * #Lanewidth;
    Top := Base + #Lanewidth - 1;
    Source2[I] := SRC2[Top..Base];
    Destination[I] := (SM * Source2[I]) >> SP;
    DST[Top..Base] = Destination[I];
}
This is implemented using the FIR_FILTER_BLOCK unit in the VPU that is otherwise used to perform MCF/TCF. SM is the weight applied to all lanes (i.e., W[0] = W[1] = W[2] = W[3] = SM), and Pshift is SP. When this operation is performed, the summation adder in FIR_FILTER_BLOCK is bypassed; the four results obtained from the 16×8 multiplications can be shifted, and the least significant 16 bits of each result are collected together into sixteen 16-bit results that are passed back to the EU as the return value.
Fig. 3 is a flowchart of an embodiment of a process for processing video data in a computing architecture such as that of Fig. 2. More particularly, as illustrated in the embodiment of Fig. 3, the command stream processor can send data and instructions to the EUP 146. The EUP 146 can read the instructions and process the received data accordingly. The EUP 146 can then send the instructions, the processed data and the data from the EUP-texture address generator (TAG) interface 242 to the TAG 150. The TAG 150 can generate the address of the processed data and then send the data and instructions to the texture cache controller (TCC) 166. The TCC 166 can cache the data into the texture filter unit (TFU) 168. The TFU 168 can filter the received data according to the received instruction and send the filtered data to the video programmable unit (VPU) 199. The VPU 199 can process the received data according to the received instruction and send the processed data to the post packer (PSP) 160. The PSP 160 can collect tiles from components such as the TFU 168. If a tile is partially complete, the PSP 160 can pack the tile with specific identifiers for use by the pipeline and return the tile to the EUP 146.
Fig. 4A illustrates an embodiment of a functional flow diagram of the data flow in a computing device, such as a computing device with the computing architecture of Fig. 2. As illustrated in the embodiment of Fig. 4A, an encrypted data stream can be sent to the decryption component 236 on the CSP 120, 128. In at least one embodiment, the encrypted bitstream can be decrypted and written back to VRAM; the decrypted video can then be decoded using variable length decoder (VLD) hardware. The decryption component 236 can decrypt the received bitstream to form a coded bitstream 238. The coded bitstream 238 can be sent to a VLD, Huffman decoder, context-adaptive variable length decoder (CAVLC) and/or context-based binary arithmetic coder (CABAC) 240 (referred to herein as the "decoder"). The decoder 240 decodes the received bitstream and sends the decoded bitstream to a DirectX Video Acceleration (DXVA) data structure 242. In addition, the data received at the DXVA data structure 242 may come from external MPEG-2 VLD, inverse scan, inverse quantization and inverse DC prediction, and from external VC-1 VLD, inverse scan, inverse quantization and inverse DC/AC prediction. The data can then be gathered into the DXVA data structure 242 as a picture header 244 and macroblocks MB0 246a, MB1 246b, MB2 246c, ..., MBN 246n. The data then enters connector blocks 250, 252 and 254, continuing in Figs. 4B and 4C.
Fig. 4B is a continuation of the functional flow diagram of Fig. 4A. As shown, data from the connector blocks 250, 252 and 254 of Fig. 4A is received at the inverse scan/inverse Q component 264 and the inverse DC/AC prediction component 262. This data is processed and sent to switch 265. Switch 265 determines whether the data is sent via the Intra/Inter input and sends the selected data to connector block 270. In addition, data from connector block 260 is sent to the coding-mode block reconstruction component 266.
Fig. 4C is a continuation of the functional flow diagram of Figs. 4A and 4B. As shown, data from connector blocks 272 and 274 (Fig. 4A) is received at the filter component 280. This data is filtered by the MC filter 282 according to one of a plurality of protocols. More particularly, if the data is received in MPEG-2 format, it is constructed with a 1/2-pixel offset; if the data is received in VC-1 format, a 4-tap filter is used; and if the data is received in H.264 format, a 6-tap filter can be used. The filtered data is then sent to the reconstructed reference component 284, and data associated with the filter component 280 is sent to the switch component 288. The switch component 288 also receives a zero input, and it decides which data is sent to the adder 298 based on the received Intra/Inter data.
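As a concrete illustration of the 6-tap case, the standard H.264 half-sample luma interpolation filter (taps 1, -5, 20, 20, -5, 1) is sketched below. The document only states that a 6-tap filter is used for H.264 data, so the exact coefficients and rounding here are taken from the H.264 specification, not from this description.

#include <stdint.h>

/* H.264 half-sample luma interpolation: 6-tap FIR with taps 1,-5,20,20,-5,1,
 * rounded and clipped to 8 bits.  Shown only to illustrate the kind of
 * filtering selected for H.264-format data.                                 */
static uint8_t h264_halfpel(const uint8_t p[6])
{
    int acc = p[0] - 5 * p[1] + 20 * p[2] + 20 * p[3] - 5 * p[4] + p[5];
    acc = (acc + 16) >> 5;                 /* rounding and normalization */
    if (acc < 0)   acc = 0;                /* clip to the 8-bit range    */
    if (acc > 255) acc = 255;
    return (uint8_t)acc;
}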
In addition, the inverse transform component 296 receives data from the coding-mode block reconstruction component 286 and, via connector block 276, from switch 265 (Fig. 4B). The inverse transform component 296 performs an 8×8 inverse discrete cosine transform (IDCT) for MPEG-2 data, an 8×8, 8×4, 4×8 and/or 4×4 integer transform for VC-1 data, and a 4×4 integer transform for H.264 data; depending on the transform performed, this data is sent to the adder 298.
The adder 298 sums the data from the inverse transform component 296 and the switch 288 and sends the summed data to the in-loop filter 297. The in-loop filter 297 filters the received data and sends the filtered data to the reconstructed frame component 290. The reconstructed frame component 290 sends data to the reconstructed reference component 284. The reconstructed frame component 290 can also send data to the deblocking and de-ringing filter 292, and the filter 292 can send the filtered data to the de-interlacing component 294 for de-interlacing, after which the data is available for display.
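The adder-and-clamp step of this reconstruction loop can be summarized by the generic sketch below; it models residual-plus-prediction reconstruction in general and is not a bit-exact description of adder 298 or in-loop filter 297.

#include <stdint.h>

/* Generic reconstruction step: add the inverse-transformed residual to the
 * predicted (motion-compensated or intra) sample and clamp to 8 bits before
 * the result is passed on to the in-loop filter.                            */
static void reconstruct_block(uint8_t *recon, const uint8_t *pred,
                              const int16_t *residual, int n)
{
    for (int i = 0; i < n; ++i) {
        int v = pred[i] + residual[i];
        recon[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}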
Fig. 5A is a functional block diagram of an embodiment of components that can be used in the VPU, for example in the computing architecture of Fig. 2, to provide motion compensation (MC) and/or discrete cosine transform (DCT) operations. More particularly, as illustrated in the embodiment of Fig. 5A, bus A can send 16-bit data to input port B of PE3 314d; bus A also sends the data to Z⁻¹ delay element 300, which sends 16-bit data to the second input of PE2 314c. Bus A also sends the data to Z⁻¹ delay element 302, which sends 16-bit data to PE1 314b, and the data is also sent to Z⁻¹ delay element 304, after which it enters PE0 314a and Z⁻¹ delay element 306. After passing through Z⁻¹ delay element 306, the low 8 bits of the bus A data are sent to PE0 314a; this data is delayed by Z⁻¹ 308 and sent to PE1 314b and Z⁻¹ delay element 310. After Z⁻¹ delay element 310, the low 8 bits of the data are sent to PE2 314c and Z⁻¹ delay element 312; after Z⁻¹ delay element 312, the low 8 bits of the data are sent to PE3 314d. In addition, bus B sends 64-bit data to each of PE3 314d, PE2 314c, PE1 314b and PE0 314a.
Processing element 0 (PE0) 314a can help filter the received data. More particularly, each PE can be one element of a FIR filter: when PE0 314a, PE1 314b, PE2 314c and PE3 314d are combined with the adder 330, they can form a 4-tap/8-tap FIR filter. Part of the data is first sent to Z⁻³ delay element 316. Multiplexer 318 selects data according to the FIR (finite impulse response) select signal applied to its select port, and the selected data is sent from multiplexer 318 to the adder 330.
Likewise, data from PE1 314b is sent to multiplexer 322, some of it first being received at Z⁻² delay element 320. Multiplexer 322 selects from the received data according to the received FIR select input and sends the selected data to the adder 330. Data from PE2 314c is sent to multiplexer 326, some of it first passing through Z⁻¹ delay element 324. The FIR select input chooses the data to be sent to the adder 330, and the data from PE3 314d is sent to the adder 330.
The adder 330 also receives a feedback loop from the N-shifter 332. This data is received at multiplexer 328 via Z⁻¹ delay element 326; rounding data is also received at multiplexer 328. Multiplexer 328 selects among the received data via its select port and sends the selected data to the adder 330. The adder 330 adds the received data and sends the sum to the N-shifter 332, and the shifted 16-bit data is sent to the output.
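Functionally, one FIR filter block as described above amounts to a weighted sum of four delayed samples plus a rounding term, followed by an N-bit shift. A minimal model under that reading is sketched below; the parameterization by a 4-element kernel and a shift corresponds to the kernel of Table 2 and the N-shifter 332, but the function itself is only illustrative.

#include <stdint.h>

/* Minimal model of one FIR filter block: four processing elements multiply
 * delayed input samples by per-tap weights, the adder sums the products plus
 * a rounding term, and the N-shifter normalizes the result.  The final cast
 * models the 16-bit output width of the block.                              */
static int16_t fir_filter_block(const uint8_t x[4], const int8_t w[4],
                                int rounding, unsigned n_shift)
{
    int acc = rounding;
    for (int i = 0; i < 4; ++i)
        acc += (int)w[i] * (int)x[i];      /* PE0..PE3 contributions */
    return (int16_t)(acc >> n_shift);      /* N-shifter output       */
}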
Fig. 5B is a continuation of the diagram of Fig. 5A. More particularly, as illustrated in the embodiment of Fig. 5B, data from memory buffers 340a, 340b, 340c and 340d is sent to multiplexer 342a, which sends 16-bit data to connector blocks 344a and 346a. Likewise, multiplexer 342b receives data from memory buffers 340b, 340c, 340d and 340e and sends data to connector blocks 344b and 346b; multiplexer 342c receives data from 340c, 340d, 340e and 340f and sends data to 344c and 346c; multiplexer 342d receives data from 340d, 340e, 340f and 340g and sends data to connector blocks 344d and 346d; multiplexer 342e receives data from 340e, 340f, 340g and 340h and sends data to 344e and 346e; multiplexer 342f receives data from 340f, 340g, 340h and 340i and sends data to 344f and 346f; multiplexer 342g receives data from 340g, 340h, 340i and 340j and sends data to connector blocks 344g and 346g; multiplexer 342h receives data from 340h, 340i, 340j and 340k and sends data to 344h and 346h; and multiplexer 342i receives data from 340i, 340j, 340k and 340l and sends data to connector blocks 344i and 346i.
Fig. 5C is a continuation of the diagrams of Figs. 5A and 5B. More particularly, data from multiplexer 342a (via connector block 348a) is sent to memory buffer B, slot 350a; data from multiplexer 342b (via connector block 348b) is sent to memory B, slot 350b; data from multiplexer 342c (via connector block 348c) is sent to memory B, slot 350c; data from multiplexer 342d (via connector block 348d) is sent to memory B, slot 350d; data from multiplexer 342e (via connector block 348e) is sent to memory B, slot 350e; data from multiplexer 342f (via connector block 348f) is sent to memory B, slot 350f; data from multiplexer 342g (via connector block 348g) is sent to memory B, slot 350g; data from multiplexer 342h (via connector block 348h) is sent to memory B, slot 350h; and data from multiplexer 342i (via connector block 348i) is sent to memory B, slot 350i.
Likewise, data from connector blocks 362j-362r (from Fig. 5D, discussed below) is sent to the transpose network 360. The transpose network 360 transposes the received data and sends it to memory buffer B, which sends the data to connector blocks 366j-366r.
Fig. 5D is a continuation of the diagrams of Figs. 5A-5C. More particularly, data from connector block 368a (Fig. 5B, via multiplexer 342a) and connector block 368j (Fig. 5C, via memory buffer B) is received at multiplexer 369a; this data is selected by the vertical select signal and sent via bus A (see Fig. 5A) to FIR filter block 0 370a. Likewise, multiplexers 369b-369i receive data from connector blocks 368b-368i and 368k-368r, and this data is sent to FIR filter blocks 370b-370i and processed as described for Fig. 5A. Data output from FIR filter block 0 370a is sent to connector blocks 372b and 372j; FIR filter block 370b outputs to connector blocks 372c and 372k; FIR filter block 370c outputs to connector blocks 372d and 372l; FIR filter block 370d outputs to connector blocks 372e and 372m; FIR filter block 370e outputs to connector blocks 372f and 372n; FIR filter block 370f outputs to connector blocks 372g and 372o; FIR filter block 370g outputs to connector blocks 372h and 372p; FIR filter block 370h outputs to connector blocks 372i and 372q; and FIR filter block 370i outputs to connector blocks 372j and 372r. As discussed above, the data from connector blocks 372j-372r is received by the transpose network 360 of Fig. 5C. Connector blocks 372b-372j continue in Fig. 5E.
Fig. 5E is a continuation of the diagrams of Figs. 5A-5D. More particularly, as illustrated in the embodiment of Fig. 5E, data from connector block 376b (via FIR filter block 370a of Fig. 5D) is sent to memory buffer C, slot 380b. Likewise, data from connector block 376c (via FIR filter block 370b of Fig. 5D) is sent to memory buffer C, slot 380c; data from connector block 376d (via FIR filter block 370c) is sent to memory buffer C, slot 380d; data from connector block 376e (via FIR filter block 370d) is sent to memory buffer C, slot 380e; data from connector block 376f (via FIR filter block 370e) is sent to memory buffer C, slot 380f; data from connector block 376g (via FIR filter block 370f) is sent to memory buffer C, slot 380g; data from connector block 376h (via FIR filter block 370g) is sent to memory buffer C, slot 380h; data from connector block 376i (via FIR filter block 370h) is sent to memory buffer C, slot 380i; and data from connector block 376j (via FIR filter block 370i) is sent to memory buffer C, slot 380j.
Multiplexer 382a receives data from memory buffer C, slots 380b, 380c and 380d; multiplexer 382b receives data from slots 380d, 380e and 380f; multiplexer 382c receives data from slots 380f, 380g and 380h; and multiplexer 382d receives data from slots 380h, 380i and 380j. Upon receiving the data, multiplexers 382a-382d send it to arithmetic logic units 384a-384d. Adders 384a-384d receive this data along with the value "1", process the received data and send the processed data to shifters 386a-386d, respectively; shifters 386a-386d shift the received data and send the shifted data to multiplexers 390a-390d, while data from Z blocks 388a-388d is also sent to multiplexers 390a-390d, respectively.
In addition, Z block 388a receives data from connector block 376b and sends it to multiplexer 390a; Z block 388b receives data from connector block 376c and sends it to multiplexer 390b; Z block 388c receives data from connector block 376d and sends it to multiplexer 390c; and Z block 388d receives data from connector block 376e and sends it to multiplexer 390d. Multiplexers 390a-390d also receive a select input and send the selected data to the output.
Fig. 5F is an embodiment of an overall diagram of the components of Figs. 5A-5E. More particularly, as illustrated in the embodiment of Fig. 5F, data is received at memory buffer A 340. This data is multiplexed with other data in memory buffer A 340 at multiplexer 342. Multiplexer 342 selects data and sends the selected data to memory buffer B 350. Memory buffer B 350 also receives data from the transpose network 360. Memory buffer B 350 sends data to multiplexer 369, which also receives data from multiplexer 342. Multiplexer 369 selects data and sends the selected data to the FIR filter 370. The FIR filter filters the received data and sends the filtered data to memory buffer C 380, the Z component 388 and the transpose network 360. Memory buffer C 380 sends data to multiplexer 382, which selects from the data received from memory buffer C 380. The selected data is sent to the ALU 384, which computes a result from the received data and sends the computed data to shifter 386. The shifted data is then sent to multiplexer 390, which also receives data from the Z component 388; multiplexer 390 selects a result and sends it to the output.
The components shown in Figs. 5A-5F can be used to provide motion compensation (MC) and/or discrete cosine transform (DCT) operations. More particularly, depending on the specific embodiment and/or data format, data can pass through the components of Figs. 5A-5F several times in iterative operations to reach the desired result. In addition, depending on the particular operation and data format, the data can be received from the EU 146 and/or the TFU 168.
As a non-limiting example, in practical operation the components of Figs. 5A-5F can receive an indication of the operation to be performed (e.g., motion compensation, discrete cosine transform, etc.). An indication of the data format (e.g., H.264, VC-1, MPEG-2, etc.) can also be received. As one example, for the H.264 format, motion compensation (MC) data can pass through the FIR filter 370 over a plurality of cycles and then enter memory buffer C 380 to be converted into 1/4-pixel format. Other operations under the H.264 format, or other data, can make identical or different use of the components of Figs. 5A-5F, as discussed in more detail below. In addition, the multiplier array can be used to perform 16 16-bit multiplications and/or act as a vector or matrix multiplier; an example of this is the SMMUL instruction.
Fig. 6 is a functional block diagram of a pixel processing engine that can be used in a computing architecture such as that of Fig. 2. More particularly, as illustrated in the embodiment of Fig. 6, bus A (before the shift register) and bus B (see Fig. 5A) send 16-bit data to multiplexer 400. The select port of multiplexer 400 receives a select signal from the FIR filter 370, selects one 16-bit value and sends this data to multiplexer 406. In addition, multiplexer 402 can receive bus A data (after the shift register) and zero data. Multiplexer 402 can use the 6-tap signal at its select port to select the desired result, and this 16-bit result can be sent to the 16-bit unsigned adder 404. The 16-bit unsigned adder 404 can also receive data from bus A (before the shift register).
The 16-bit unsigned adder 404 can sum the received data and send the result to multiplexer 406. Multiplexer 406 can select from the received data according to the lane-inversion and 6-tap signals at its select port, and the selected data can be sent to the 16×8 multiplier 410, which can also receive mode data. The 24-bit result can then be sent to shifter 412 to provide a 32-bit result.
Fig. 7A is a functional block diagram of components that can be used in the VC-1 in-loop filter, for example in the computing architecture of Fig. 2. As illustrated in the embodiment of Fig. 7A, multiplexer 420 can receive a "1" value and a "0" value at its input ports, selected by whether the absolute value of A0 is less than PQUANT. Likewise, multiplexer 422 can receive a "1" value and a "0" value, selected by whether A3 is less than the absolute value of A0 490c. Multiplexer 424 can receive a "1" value and a "0" value as inputs, selected by whether the clip value (from shifter 468 of Fig. 7C) is not equal to 0. In addition, the data output from multiplexer 420 can be sent to logic gate 426, which can send data to multiplexer 428. Multiplexer 428 can also receive the filter_other_3 data as an input. More particularly, as shown in Fig. 7A, a filter_other_3 signal can be generated; if this signal is nonzero, it indicates that the other three rows of pixels need to be filtered; otherwise, the 4×4 block is not filtered (modified). Multiplexer 428 selects its output data according to the received select input (the processed pixel data 3).
Fig. 7B is a continuation of the diagram of Fig. 7A. More particularly, as illustrated, absolute value component 430 receives a 9-bit input A1 490a (from Fig. 7D) and absolute value component 432 receives a 9-bit input A2 490b (from Fig. 7D). From the absolute values of the received data, the minimum component 434 determines the minimum of the received values and sends this data, as output A3, to the two's complement component 436. The two's complement component 436 computes the two's complement of the received data and sends it to the subtraction component 438. The subtraction component 438 subtracts this data from the input data A0 490c (from Fig. 7D); the result is then sent to shifter 440, which shifts it left by two, and on to adder 442. In addition, the output of the subtraction component 438 also feeds directly into adder 442, allowing the circuit to perform a multiply-by-5 operation without using a multiplier.
Adder 442 sums the received data and sends the result to shifter 444. Shifter 444 shifts the received data right by three and sends the data to the clamp component 446. The clamp component 446 also receives the clip value (from shifter 468, Fig. 7C) and sends the result to the output. It should be noted that the filter result can be negative or greater than 255, so the clamp component 446 can clamp the result to an unsigned 8-bit value: if the input d is negative, d is set to 0; if d is greater than the clip value, d is set to the clip value.
Fig. 7C is a continuation of the diagrams of Figs. 7A and 7B. In the embodiment of Fig. 7C, P1 data 450a, P5 data 450e and P3 data 450c are sent to multiplexer 452. Multiplexer 452 receives a select input and sends the selected data to subtraction component 460. Multiplexer 452 also sends output data to the select input of multiplexer 454.
Multiplexer 454 also receives input data from P4 450d, P8 450h and P6 450f and sends its output to subtraction component 460. Subtraction component 460 subtracts the received data and sends the result to shifter 466. Shifter 466 shifts the received data left by one and sends the result to connector block 474.
Likewise, multiplexer 456 receives inputs P2 450b, P3 450c and P4 450d. Multiplexer 456 receives its select input from multiplexer 454 and sends the selected data to subtraction component 464. Multiplexer 458 receives its select input from multiplexer 456 and receives input data from P3 450c, P7 450g and P5 450e. Multiplexer 458 sends its output to subtraction component 464, which subtracts the received data and sends it to shifter 470 and adder 472. Shifter 470 shifts the received data left by two and sends the shifted data to adder 472; adder 472 adds the received data and sends the result to connector block 480.
In addition, subtraction component 462 receives data from P4 450d and P5 450e, subtracts the received data and sends the result to shifter 468. Shifter 468 shifts the received data right by one and outputs this data as the clip value, which is fed to the clamp component 446 and multiplexer 424. In addition, P4 450d is sent to connector block 476 and P5 450e data is sent to connector block 478.
Fig. 7D is a continuation of the diagrams of Figs. 7A-7C. More particularly, in the embodiment of Fig. 7D, subtraction component 486 receives data from connector block 482 and connector block 484. Subtraction component 486 subtracts the received data and sends the result to shifter 488. Shifter 488 shifts the received data right by three, and the result is sent to A1 490a, A2 490b and A0 490c.
In addition, multiplexer 496 receives the input data "0" and "d". This operation can comprise:
if (do_filter) {
    P4[i] = P4[i] - D[i];
    P5[i] = P5[i] + D[i];
}
Multiplexer 496 selects the desired result via the do_filter select input and sends the result to subtraction component 500. Subtraction component 500 also receives data from skipped block 492 (via skipped block 476, Fig. 7C), subtracts the received data and sends the result to P4 450d.
Multiplexer 498 likewise receives "0" and "d" as inputs and do_filter as its select input. Multiplexer 498 multiplexes this data and sends the result to adder 502. Adder 502 also receives data from skipped block 494 (via skipped block 478, Fig. 7C), adds the received inputs and sends the result to P5 450e.
Fig. 8 is a block diagram of blocks that can be used in a computing architecture (such as the computing architecture of Fig. 2) to perform a sum of absolute differences (SAD) calculation. More particularly, in the embodiment of Fig. 8, component 504 receives a portion of the 32-bit data A[31:0] and a portion of the 32-bit data B. Component 504 computes {C, S} ← A − B and, if the carry C is set, S = NOT(S) + 1, and provides the output to adder 512. Likewise, component 506 receives A data and B data and, based on a determination similar to that of component 504, sends its output to adder 512, except that the portion of the A and B data received by component 506 is bit positions [23:16], whereas the portion received by component 504 is bit positions [31:24]. Likewise, component 508 receives the data of bit positions [15:8], performs a calculation similar to components 504 and 506, and sends the result to adder 512. Component 510 receives the data of bit positions [7:0], performs a calculation similar to components 504, 506 and 508, and sends the result to adder 512.
In addition, components 514, 516, 518 and 520 receive the 32-bit portion of data A corresponding to bits [63:32] (as opposed to the [31:0] bit positions received at components 504-510). More particularly, component 514 receives the data of bit positions [31:24] of data A and data B, performs the similar calculation discussed above, and sends the 8-bit result to adder 522. Likewise, component 516 receives the data of bit positions [23:16], performs the similar calculation, and sends the resulting data to adder 522. Component 518 receives the data of bit positions [15:8] of data A and data B, processes the received data as described above, and sends the result to adder 522. Component 520 receives the data of bit positions [7:0] of data A and data B, processes the received data as discussed above, and sends the result to adder 522.
Components 524-530 receive the 32 bits of bit positions [95:64] of the A data and the B data. More particularly, component 524 receives bits [31:24], component 526 receives bits [23:16], component 528 receives bits [15:8], and component 530 receives the data of bits [7:0]. Upon receiving this data, components 524-530 process the received data as described above, and the processed data is then sent to adder 532. Likewise, components 534-540 receive the 32-bit data of bit positions [127:96] of the A data and the B data. More particularly, component 534 receives the data of bit positions [31:24] of the A data and the B data, component 536 receives the data of bit positions [23:16], component 538 receives the data of bit positions [15:8], and component 540 receives the data of bit positions [7:0]. The received data is processed as discussed above and sent to adder 542. In addition, adders 512, 522, 532 and 542 add the received data and send 10-bit results to adder 544. Adder 544 adds the received data and sends the 12-bit result to the output.
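A C sketch of one lane and the adder tree of Fig. 8 follows; the hardware evaluates all sixteen byte pairs in parallel, so the loop below only stands in for the sixteen parallel components 504-540:

#include <stdint.h>

/* One lane: {C,S} <- A - B; if the borrow C is set, take NOT(S)+1,
 * i.e. the two's complement, so the lane outputs |A - B|. */
static unsigned abs_diff_u8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a - (unsigned)b;
    if (s > 0xFF)                 /* borrow occurred */
        s = (~s + 1) & 0xFF;
    return s;
}

/* Sixteen lanes feed adders 512/522/532/542, whose 10-bit sums are
 * combined by adder 544 into a 12-bit SAD. */
static unsigned sad16(const uint8_t *a, const uint8_t *b)
{
    unsigned sum = 0;
    for (int i = 0; i < 16; i++)
        sum += abs_diff_u8(a[i], b[i]);
    return sum;
}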
Fig. 9 is a flow chart of another embodiment, similar to Fig. 8, that can be used to perform a sum of absolute differences (SAD) calculation process. More particularly, in the embodiment of Fig. 9, "i" is set to the block size BlkSize and suma is initialized to "0" (block 550). It is first determined whether i is greater than "0" (block 552). If i is greater than "0", then vecx[i] = Tablex[i], vecy[i] = Tabley[i], vectx = mv_x + vecx[i] and vecty = mv_y + vecy[i] (block 554). An address can then be calculated using vectx and vecty, and 4x4 memory data (byte aligned) can be fetched from PredImage (block 556). The 128-bit prediction data can be sent to the SAD44 block (see Fig. 8), as illustrated in block 558. In addition, block 560 can receive the block data and the calculated address. At block 560, 4x4 memory data can also be fetched from RefImage and byte aligned, and the 128-bit Ref[i] data can then be sent to the SAD44 block (block 558). The SAD44 value can be sent to block 562, where the running total suma is increased and i is decreased by "1". It can then be determined whether the total suma is greater than a threshold (block 564). If so, the process can stop; on the other hand, if the total suma is not greater than this threshold, the process can return to block 552 to determine whether i is greater than 0. If i is not greater than 0, the process can end.
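Read as software, the Fig. 9 flow is roughly the sketch below; fetch_pred4x4, fetch_ref4x4 and sad4x4 are hypothetical helpers standing in for blocks 556, 560 and 558, and the accumulation step is read as adding each 4x4 SAD into suma, which is what the threshold test implies:

#include <stdint.h>

unsigned sad4x4(const uint8_t *pred, const uint8_t *ref);   /* block 558 (SAD44)      */
void     fetch_pred4x4(uint8_t *dst, int x, int y);         /* block 556 (PredImage)  */
void     fetch_ref4x4(uint8_t *dst, int x, int y);          /* block 560 (RefImage)   */

unsigned sad_with_early_exit(int BlkSize, int mv_x, int mv_y,
                             const int *Tablex, const int *Tabley,
                             unsigned threshold)
{
    unsigned suma = 0;                       /* block 550 */
    for (int i = BlkSize; i > 0; i--) {      /* block 552 */
        int vectx = mv_x + Tablex[i];        /* block 554 */
        int vecty = mv_y + Tabley[i];
        uint8_t pred[16], ref[16];
        fetch_pred4x4(pred, vectx, vecty);
        fetch_ref4x4(ref, vectx, vecty);
        suma += sad4x4(pred, ref);
        if (suma > threshold)                /* block 564: early termination */
            break;
    }
    return suma;
}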
Fig. 10A is a block diagram of a plurality of components that can be used in a deblocking operation (such as can be performed in the computer architecture of Fig. 2). In the embodiment of Fig. 10A, ALU 580 receives the input data p2 and p0 and sends data to absolute value component 586. Absolute value component 586 computes the absolute value of the received data and outputs a_p; decision component 590 determines whether a_p is less than β and sends the data to skipped block 596. ALU 580 also sends data to skipped block 594. Likewise, ALU 582 receives data from q0 and q2. After computing the result, ALU 582 sends the data to absolute value component 588, which determines the absolute value of the received data and sends a_q to decision component 592. Decision component 592 determines whether a_q is less than β and sends the data to skipped block 598.
ALU 600 receives data from q0 and p0, computes the result and sends it to absolute value component 606. Absolute value component 606 determines the absolute value of the received data and sends it to decision component 612. Decision component 612 determines whether the received value is less than α and sends the result to AND gate 620. ALU 602 receives data from p0 and p1, computes the result and sends it to absolute value component 608. Absolute value component 608 determines the absolute value of the received data and sends the value to decision component 614. Decision component 614 determines whether the received data is less than β and sends the result to AND gate 620. ALU 604 receives data from q0 and q1, computes the result and sends it to absolute value component 610. Absolute value component 610 determines the absolute value of the received data and sends the result to decision component 616. Decision component 616 determines whether the received data is less than β and sends the result to AND gate 620. In addition, AND gate 620 receives data from decision component 618, which receives the bS data and determines whether this data is not equal to zero.
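Figs. 10A-10E follow the structure of H.264-style in-loop deblocking. Under that assumption, the Fig. 10A comparisons amount to the following C sketch (names are illustrative):

#include <stdlib.h>   /* abs() */

/* p0..p2 / q0..q2 are the samples on either side of the edge; alpha and
 * beta come from the tables of Fig. 10D, bS is the boundary strength. */
static void edge_decisions(int p2, int p1, int p0, int q0, int q1, int q2,
                           int alpha, int beta, int bS,
                           int *ap, int *aq, int *filter_edge)
{
    *ap = abs(p2 - p0) < beta;                   /* ALU 580, abs 586, decision 590 */
    *aq = abs(q2 - q0) < beta;                   /* ALU 582, abs 588, decision 592 */
    *filter_edge = (bS != 0)                     /* decision 618                   */
                && abs(p0 - q0) < alpha          /* ALU 600, abs 606, decision 612 */
                && abs(p1 - p0) < beta           /* ALU 602, abs 608, decision 614 */
                && abs(q1 - q0) < beta;          /* ALU 604, abs 610, 616; AND 620 */
}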
Fig. 10B is a continuation of Fig. 10A. More particularly, ALU 622 receives data from p1 and q1, computes the result and sends the data to ALU 624. ALU 624 also receives data from skipped block 646 (via ALU 580 of Fig. 10A) and receives 4-bit data at its carry input. ALU 624 then computes the result and sends it to shifter 626, which shifts the received data right by three. Shifter 626 then sends the data to Clip3 component 628, which also receives data from skipped block 630 (via ALU 744 of Fig. 10D, described in more detail below). Clip3 component 628 sends the data to multiplexer 634 and to NOT gate 632. NOT gate 632 inverts the received data and sends the inverted data to multiplexer 634. Multiplexer 634 also receives tC0 data at its select input and sends the selected data to ALU 636. ALU 636 also receives data from multiplexer 640. Multiplexer 640 receives data from q0 and p0 and receives !left_top as its select input. The carry input of ALU 636 receives data from multiplexer 642, which receives "1", "0" and the !left_top data. ALU 636 sends the result to SAT(0,255) component 638, which sends the data to skipped block 644 (continued at multiplexer 790 of Fig. 10E).
In addition, ALU 648 receives data from q0 and p0 and receives one bit of data at its select input; ALU 648 computes the result and sends the data to shifter 650. Shifter 650 shifts the received data right by one and sends the shifted data to ALU 652. Likewise, multiplexer 656 receives data from p1 and q1 with !left_top as its select input, determines the result and sends it to shifter 658. Shifter 658 shifts the received data left by one and sends the shifted data to ALU 652; ALU 652 computes the result and sends the data to ALU 662. ALU 662 also receives data from multiplexer 660, which receives q2, p2 and data from skipped block 680 (via NOT component 802 of Fig. 10E).
ALU 662 computes the result and sends the data to shifter 664, which shifts the received data right by one and sends the shifted data to Clip3 component 668. Clip3 component 668 also receives tC0 and sends the data to ALU 670. ALU 670 also receives data from multiplexer 656 and, after computing the result, sends the data to multiplexer 672. Multiplexer 672 also receives data from multiplexer 656 and from skipped block 678 (via multiplexer 754 of Fig. 10E), and sends the data to skipped block 674.
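The Fig. 10B datapath corresponds to the normal (bS < 4) H.264 luma filter; the sketch below is written under that assumption, with the !left_top select understood as letting the same hardware serve the p side and the q side:

static int clip3(int lo, int hi, int x) { return x < lo ? lo : x > hi ? hi : x; }
static int sat255(int x)                { return clip3(0, 255, x); }   /* SAT(0,255) 638 */

/* Normal-mode filtering of one edge, as Fig. 10B appears to implement it;
 * ap/aq are the Fig. 10A decisions, tc0/tc come from Fig. 10D. */
static void normal_filter(int p2, int p1, int p0, int q0, int q1, int q2,
                          int tc0, int tc, int ap, int aq,
                          int *p0_n, int *p1_n, int *q0_n, int *q1_n)
{
    /* ALUs 622/624, shifter 626, Clip3 628; NOT gate 632 and muxes
     * 634/640/642 apply +delta to p0 and -delta to q0. */
    int delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3);
    *p0_n = sat255(p0 + delta);
    *q0_n = sat255(q0 - delta);

    /* p1/q1 correction path: ALUs 648-670 and Clip3 668. */
    *p1_n = ap ? p1 + clip3(-tc0, tc0, (p2 + ((p0 + q0 + 1) >> 1) - (p1 << 1)) >> 1) : p1;
    *q1_n = aq ? q1 + clip3(-tc0, tc0, (q2 + ((p0 + q0 + 1) >> 1) - (q1 << 1)) >> 1) : q1;
}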
Fig. 10C is a continuation of Figs. 10A and 10B. In the embodiment of Fig. 10C, multiplexer 682 receives data from p2, p1 and !left_top and sends the selected data to adder 706. Multiplexer 684 receives p1, p0 and !left_top and sends the result to shifter 700. Shifter 700 shifts the received data left by one and sends it to adder 706. Multiplexer 686 receives data from p0, q1 and !left_top and sends the data to shifter 702, which shifts the received data left by one and sends the shifted data to adder 706. Multiplexer 688 receives data from q0, q1 and !left_top and sends the selected data to shifter 704, which shifts the received data left by one and sends it to adder 706. Multiplexer 690 receives data from q1, q2 and !left_top and sends the data to adder 706. Adder 706 also receives a value of 4 at its carry input and sends its output to skipped block 708.
Likewise, multiplexer 691 receives q2, p0 and !left_top and sends a selected result to adder 698. Multiplexer 692 receives p1, p0 and !left_top and sends the selected result to adder 698. Multiplexer 694 receives data from q0, q1 and !left_top and sends a selected result to adder 698. Multiplexer 696 receives q0, q2 and !left_top and sends the desired result to adder 698. Adder 698 also receives a value of 2 at its carry input and sends its output to skipped block 710.
Multiplexer 712 receives p3, q3 and !left_top and sends the result to shifter 722. Shifter 722 shifts the received data left by one and sends it to adder 726. Multiplexer 714 receives p2, q2 and !left_top and sends the selected result to shifter 724 and to adder 726. Shifter 724 shifts the received data left by one and sends the shifted result to adder 726. Multiplexer 716 receives p1, q1 and !left_top and sends the selected result to adder 726. Multiplexer 718 receives p0, q0 and !left_top and sends the selected result to adder 726. Multiplexer 720 receives p0, q0 and !left_top and sends the selected result to adder 726. Adder 726 receives a value of 4 at its carry input, adds the received data, and sends the summed data to skipped block 730.
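The three adder trees of Fig. 10C correspond to the strong (bS == 4) H.264 luma filter. A sketch for the p side follows (the !left_top selects mirror it for the q side, and the carry inputs supply the rounding constants):

/* Strong-mode p-side filtering matching the Fig. 10C adder trees. */
static void strong_filter_p(int p3, int p2, int p1, int p0, int q0, int q1,
                            int *p0_s, int *p1_s, int *p2_s)
{
    *p0_s = (p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3;   /* adder 706, shifter 764 */
    *p1_s = (p2 + p1 + p0 + q0 + 2) >> 2;              /* adder 698, shifter 766 */
    *p2_s = (2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3;     /* adder 726, shifter 768 */
}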
Fig. 10D is a continuation of Figs. 10A-10C. More particularly, in the embodiment of Fig. 10D, α table 750 receives IndexA and outputs α. β table 748 receives IndexB and outputs data to zero-extend component 752, which outputs β.
Likewise, multiplexer 736 receives "1" and "0" and data from skipped block 732 (via decision component 590 of Fig. 10A), and sends the selected result to ALU 740. Multiplexer 738 also receives "1" and "0" and data from skipped block 734 (via decision component 592 of Fig. 10A), and sends the selected result to ALU 740. ALU 740 computes the result and sends the data to multiplexer 742. Multiplexer 742 also receives "1" and the chroma edge flag data, selects a result and sends it to ALU 744. ALU 744 also receives tC0, computes tC, and sends the result to skipped block 746.
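In C terms, Fig. 10D derives the thresholds roughly as below; alpha_table and beta_table are stand-ins for the standard-defined lookup tables behind components 750 and 748:

/* Threshold derivation of Fig. 10D; ap/aq are the 0/1 decisions of Fig. 10A. */
static int derive_tc(const int *alpha_table, const int *beta_table,
                     int IndexA, int IndexB, int tc0,
                     int chroma_edge_flag, int ap, int aq,
                     int *alpha, int *beta)
{
    *alpha = alpha_table[IndexA];                    /* alpha table 750                */
    *beta  = beta_table[IndexB];                     /* beta table 748, zero-extend 752 */
    return tc0 + (chroma_edge_flag ? 1 : ap + aq);   /* muxes 736/738/742, ALUs 740/744 */
}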
Fig. 10E is a continuation of Figs. 10A-10D. More particularly, in the embodiment of Fig. 10E, multiplexer 754 receives data related to the relation "(ChromaEdgeFlag == 0) && (a_p < β)" and data related to the relation "(ChromaEdgeFlag == 0) && (a_q < β)", receives data from NOT component 802, and sends the selected data to skipped block 756 (to multiplexer 672 of Fig. 10B).
In addition, multiplexer 780 receives data related to the relation "(ChromaEdgeFlag == 0) && (a_p < β) && (abs(p0 − q0) < ((α >> 2) + 2))" and data related to the relation "(ChromaEdgeFlag == 0) && (a_q < β) && (abs(p0 − q0) < ((α >> 2) + 2))". Multiplexer 780 also receives a select input from NOT component 802, selects the desired result accordingly, and sends it to multiplexers 782, 784 and 786.
Multiplexer 757 receives data from p1, q1 and NOT component 802 and sends the selected data to shifter 763, which shifts the received data left by one and sends it to adder 774. Multiplexer 759 receives p0, q0 and data from NOT component 802 and sends the selected data to adder 774. Multiplexer 761 receives data from q1, p1 and NOT component 802 and sends the data to adder 774. Adder 774 also receives a value of 2 at its carry input and sends its output to multiplexer 782.
Shifter 764 receives data from skipped block 758 (via adder 706 of Fig. 10C), shifts the received data right by three, and sends the shifted data to multiplexer 782. Shifter 766 receives data from skipped block 760 (via adder 698 of Fig. 10C), shifts the received data right by two, and sends the shifted data to multiplexer 784. Shifter 768 receives data from skipped block 762 (from adder 726 of Fig. 10C), shifts the received data right by three, and sends the shifted data to multiplexer 786.
As discussed above, multiplexer 782 receives data from shifter 764, adder 774 and multiplexer 780, selects a result from this data and sends it to multiplexer 790. Likewise, multiplexer 784 receives data from shifter 766, multiplexer 780 and multiplexer 776. Multiplexer 776 receives p1, q1 and data from NOT component 802, and the selected result of multiplexer 784 is sent to multiplexer 798. Multiplexer 786 receives data from shifter 768, multiplexer 780 and multiplexer 778. Multiplexer 778 receives p2, q2 and data from NOT component 802. Multiplexer 786 sends the selected data to multiplexer 800.
As discussed above, multiplexer 790 receives data from multiplexer 782. In addition, multiplexer 790 receives data from skipped block 772 (via SAT component 638 of Fig. 10B) and from multiplexer 794. Multiplexer 794 receives the data of p0, q0 and NOT component 802. Multiplexer 790 also receives the bS && filterSampleFlag data as its select input and sends the selected data to buffers 808 and 810. Likewise, multiplexer 798 receives data from multiplexer 784, skipped block 755 (via multiplexer 672 of Fig. 10B) and multiplexer 792, and receives the bS && filterSampleFlag data at its select input. Multiplexer 792 receives the data of p1, q1 and NOT component 802. Multiplexer 798 sends the data to buffers 806 and 812. Likewise, multiplexer 800 receives data from multiplexer 786 and receives the bS && filterSampleFlag data as its select input. In addition, multiplexer 800 receives data from multiplexer 788. Multiplexer 788 receives the data of p2, q2 and NOT component 802. Multiplexer 800 selects the desired data and sends it to buffers 804 and 814. Buffers 804-814 also receive data from NOT component 802, and the data are sent to p2, p1, p0, q0, q1 and q2, respectively.
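Putting Figs. 10A-10E together, the p0 write-back behaves roughly as the sketch below, which assumes the standard H.264 selection rules and uses the component numbers only as approximate anchors (the symmetric q0, p1/q1 and p2/q2 selections are analogous):

#include <stdlib.h>

/* Final p0 selection sketched from Fig. 10E; p0_s is the strong result
 * (Fig. 10C) and p0_n the normal result (Fig. 10B). */
static int select_p0(int p0, int p1, int q0, int q1,
                     int p0_s, int p0_n, int alpha,
                     int ap, int chroma_edge_flag,
                     int bS, int filter_edge)
{
    int strong_p = (chroma_edge_flag == 0) && ap
                && (abs(p0 - q0) < ((alpha >> 2) + 2));      /* inputs to mux 780 */
    int p0_bs4   = strong_p ? p0_s
                            : ((2*p1 + p0 + q1 + 2) >> 2);   /* adder 774 path    */
    if (!filter_edge)                                        /* bypass: mux 794   */
        return p0;
    return bS == 4 ? p0_bs4 : p0_n;   /* mux 782 / SAT 638, written via mux 790 to buffer 808 */
}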
Fig. 11 is a flow chart illustrating an embodiment of a process for processing data in a computing architecture (such as the computing architecture of Fig. 2). In the embodiment of Fig. 11, the odd block 880 and the even block 882 of the texture address generator TAG (also see 150, Fig. 2) receive data from output port 144 (Fig. 2). Addresses are then generated for the received data, and the process proceeds to the texture cache and controllers (TCC) 884, 886 (also see 166, Fig. 2).
The data can then be sent to cache 890 and to the texture cache FIFO components (TFF) 888, 892, which can serve as delay queues/buffers. The data is then sent to the texture filter units (TFU) 894, 896 (also see 168, Fig. 2). Once the data has been filtered, TFUs 894, 896 send the data to VPUs 898, 900 (also see 199, Fig. 2). Depending on whether the instruction requires motion compensation filtering, texture cache filtering, in-loop deblocking filtering and/or a sum of absolute differences, the data can be sent to different VPUs and/or to different parts of the same VPU. After processing the received data, VPUs 898, 900 can send the data to input ports 902, 904 (also see 142, Fig. 2).
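In software terms, the routing decision at the end of the Fig. 11 flow can be pictured as a dispatch on the instruction class; the enum, handler names and signatures below are illustrative only, not the hardware interface:

/* Hypothetical handlers for the four VPU instruction classes. */
void run_mc_filter(const void *d);
void run_texture_filter(const void *d);
void run_deblock(const void *d);
void run_sad(const void *d);

enum vpu_op { VPU_MC_FILTER, VPU_TEXTURE_FILTER, VPU_DEBLOCK, VPU_SAD };

/* Steer data arriving from the TFU to the appropriate part of the VPU. */
void vpu_route(enum vpu_op op, const void *filtered)
{
    switch (op) {
    case VPU_MC_FILTER:      run_mc_filter(filtered);      break;
    case VPU_TEXTURE_FILTER: run_texture_filter(filtered); break;
    case VPU_DEBLOCK:        run_deblock(filtered);        break;
    case VPU_SAD:            run_sad(filtered);            break;
    }
}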
The embodiments disclosed herein can be implemented in hardware, software, firmware, or a combination thereof. At least one embodiment disclosed herein is implemented in software and/or firmware that is stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in an alternative embodiment, the embodiments disclosed herein can be implemented with any or a combination of the following technologies: discrete logic circuits having logic gates for implementing logic functions upon data signals, an application specific integrated circuit (ASIC) having appropriate combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), etc.
It should be noted that the flow charts included herein show the architecture, functionality and operation of possible embodiments of software and/or hardware. In this regard, each block can be interpreted as representing a module, segment, or portion of code that comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative embodiments the functions noted in the blocks can occur out of order and/or not at all. For example, two blocks shown in succession can in fact be executed substantially concurrently, or the blocks can sometimes be executed in the reverse order, depending upon the functionality involved.
It should be noted that any of the programs listed herein (which can comprise an ordered listing of executable instructions for implementing logical functions) can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. In the context of this document, a "computer-readable medium" can be any means that can contain, store, communicate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer-readable medium can be, for example (but not limited to), an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium can include an electrical connection having one or more wires (electronic), a portable computer diskette (magnetic), a random access memory (RAM) (electronic), a read-only memory (ROM) (electronic), an erasable programmable read-only memory (EPROM or Flash memory) (electronic), an optical fiber (optical), and a portable compact disc read-only memory (CDROM) (optical). In addition, the scope of certain embodiments of this disclosure can include embodying the functionality described in logic embodied in hardware- or software-configured media.
It should also be noted that conditional language, such as "can", "could", "might", or "may", unless specifically stated otherwise or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is generally not intended to imply that features, elements and/or steps are in any way required by one or more particular embodiments, or that one or more particular embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Those skilled in the art may make further improvements and variations on this basis without departing from the spirit and scope of the present invention; accordingly, the protection scope of the present invention shall be defined by the scope of the appended claims.
The reference symbols in the accompanying drawings are briefly described as follows:
88, 102: internal logic analyzer
90, 104: bus interface unit BIU
106a, 106b, 106c, 106d: memory interface unit MIU
108: memory access port
110, 116: data stream cache
112: vertex cache
114: L2 cache
118: EU pool controller with cache subsystem
120: command stream processor (CSP) front end
122: 3D and state component
124: 2D prebox
126: 2D first-in-first-out (FIFO) component
128: CSP back end / ZL1 cache
130: definition and model texture processor
132: Advanced Encryption Standard (AES) encryption/decryption component
134: triangle and attribute setup unit
136: span and tile generator
138:ZL1
140:ZL2
142,902,904: input port
144: output port
146: execution unit pool (EUP) / BW compressor
148: Z and ST cache
150: texture address generator TAG
152: D cache
154: 2D processing component
156: front wrapper
158: interpolator
160: back wrapper
162: write-back unit
164a, 164b: memory access unit MXU
166, 884, 886: texture cache and controller TCC
168,894,896: texture filtering unit TFU
199,898,900: video processing unit VPU
234: encrypted bitstream
236: decryption component
238: coded bitstream
240: VLD, Huffman decoder, CAVLC, CABAC
242: EUP TAG interface
244: image header
246a, 246b, 246c, 246n: storage buffer MB
250,252,254,256,258,260,270,272,274,276,344a~i, 346a~i, 348a~i, 362j~r, 366j~r, 368a~r, 372b~r, 376b~j, 474,476,478,480,482,484,492,494,594,596,598,630,644,646,674,678,680,708,710,730,732,734,746,755,756,758,760,762,770,772: skipped blocks
262: inverse DC/AC prediction component
264: inverse scan / inverse Q component
265: switch
266: coded block reconstruction component
280: filter component
282: MC filter
284: reconstructed reference component
286: coded block reconstruction
288: switch component
290: reconstructed frame component
292: deblocking and deringing filter
294: de-interlacing component
296: inverse transform component / in-loop filter
298,330,442,472,502,512,522,532,542,544,698,706,726,774: adder
300, 302, 304, 306, 308, 310, 312, 324: Z⁻¹ delay element
314a, 314b, 314c, 314d: PE
316: Z⁻³ delay element
320: Z⁻² delay element
318, 322, 326, 328, 342, 342a~i, 369, 369a~i, 382, 382a~d, 390, 390a~d, 400, 402, 404, 406, 408, 420, 422, 424, 428, 452, 454, 456, 458, 496, 498, 634, 640, 642, 656, 660, 672, 682, 684, 686, 690, 691, 692, 694, 696, 712, 714, 716, 718, 720, 736, 738, 742, 754, 757, 759, 761, 776, 778, 780, 782, 784, 786, 788, 790, 792, 794, 796, 798, 800: multiplexer
332: N shifter
340, 340a~l: storage buffer
350, 350a~i: memory B slots
360: transpose network
370, 370a~i: FIR filter block
380, 380b~j: storage buffer C slots
384, 384a~d, 580, 582, 600, 602, 604, 622, 624, 636, 648, 652, 662, 670, 740, 744: ALU
386, 386a~d, 412, 440, 444, 466, 468, 470, 488, 626, 650, 658, 664, 700, 702, 704, 722, 724, 763, 764, 766, 768: shifter
388, 388a~d: Z block
410: multiplier
426: OR gate
430, 432, 586, 606, 608, 610: absolute value component
434: minimum value component
436: two's complement component
438, 460, 462, 464, 486, 500: subtraction component
446: clamp component
450a~h: P1~P8 data
490a:A1
490b:A2
490c:A0
504, 506, 508, 510, 514, 516, 518, 520, 524, 526, 528, 530, 534, 536, 538, 540: component
590, 592, 612, 614, 616, 618: decision component
620: AND gate
628, 668: Clip3 component
632: NOT gate
638: SAT component
748: β table
750: α table
752: zero-extend component
802: NOT component
804, 806, 808, 810, 812, 814: buffer
880, 882: texture address generator TAG blocks
888, 892: texture cache FIFO component TFF
890: cache

Claims (9)

1. A programmable video processing unit for processing video data of at least two formats, characterized in that it comprises:
a command stream processor for issuing instructions;
a plurality of execution units for receiving the instructions issued by the command stream processor, determining whether an instruction is a video instruction and, if so, sending the video instruction to a texture address generator;
the texture address generator, for sending the video instruction to a texture cache controller;
the texture cache controller, for caching data for a texture filter unit;
the texture filter unit, for receiving the video instruction sent by the texture cache controller and filtering the video data according to the video instruction; and
a video processor, for processing the video data filtered by the texture filter unit according to the video instruction received from the texture cache controller, and sending the processed video data to a back wrapper, the back wrapper collecting tiles from the texture filter unit and, if a tile is partially complete, packing the tile with a plurality of specific identifiers for use by the pipeline and returning the tile to the execution units;
wherein the execution units output decoded video data by sending the video instruction to the texture filter unit and the video processor, and the command stream processor can control the execution units to execute in parallel so as to complete video coding.
2. The programmable video processing unit according to claim 1, characterized in that the format of the video data is selected from at least one of the following: the H.264 format, the VC-1 format and the MPEG-2 format.
3. The programmable video processing unit according to claim 1, characterized in that the video instruction comprises: a motion compensation filter instruction, a video transform coding instruction, an in-loop deblocking instruction and a motion estimation instruction.
4. The programmable video processing unit according to claim 3, characterized in that the video processor uses the video transform coding instruction to perform an integer transform when the format of the video data is the H.264 format or the VC-1 format, and uses the video transform coding instruction to perform an inverse discrete cosine transform when the format of the video data is the MPEG-2 format.
5. A video data processing method, characterized in that the video data processing method comprises:
an execution unit receiving an instruction from a command stream processor, determining whether the received instruction is a video instruction and, if so, sending the video instruction to a texture address generator;
the texture address generator sending the video instruction to a texture cache controller;
the texture cache controller caching, for a texture filter unit, video data in one of at least two formats;
the texture filter unit receiving the video instruction sent by the texture cache controller and filtering the video data according to the video instruction; and
a video processing unit processing the video data filtered by the texture filter unit according to the video instruction received from the texture cache controller, and sending the processed video data to a back wrapper, the back wrapper collecting tiles from the texture filter unit and, if a tile is partially complete, packing the tile with a plurality of specific identifiers for use by the pipeline and returning the tile to the execution unit;
wherein the video instruction comprises an identification field for indicating the format of the video data, the execution unit outputs decoded video data by sending the video instruction to the texture filter unit and the video processing unit, and the command stream processor can control a plurality of the execution units to execute in parallel so as to complete video coding.
6. The video data processing method according to claim 5, characterized in that the video instruction comprises: a motion compensation filter instruction, a video transform coding instruction, an in-loop deblocking instruction and a motion estimation instruction.
7. The video data processing method according to claim 6, characterized in that, when the identification field indicates the MPEG-2 format, the step of processing the video data comprises motion compensation filtering and an inverse discrete cosine transform; and when the identification field indicates one of the VC-1 format and the H.264 format, the step of processing the video data comprises motion compensation filtering, an integer transform and in-loop filtering.
8. The video data processing method according to claim 5, characterized in that it further comprises any combination of the following:
performing a sum of absolute differences calculation;
performing texture cache filtering; and
performing in-loop deblocking filtering.
9. The video data processing method according to claim 5, characterized in that the format of the video data is selected from at least one of the following: the H.264 format, the VC-1 format and the MPEG-2 format.