CN100442847C - H.264 integer transformation accelerator - Google Patents
- Publication number
- CN100442847C CNB2005100617040A CN200510061704A
- Authority
- CN
- China
- Prior art keywords
- acc
- vacc
- data
- clock cycle
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Complex Calculations (AREA)
Abstract
This invention relates to an H.264 integer transform acceleration device comprising: a data memory connected to a data bus; a vector working register group that receives the original data as vector data, each register $R_i$ consisting of four working registers $R_{i0}$, $R_{i1}$, $R_{i2}$ and $R_{i3}$; an 8-way vector data path that performs the data operations; an accumulator register group that stores the intermediate data of the accelerated transform of $R_i$; and a controller that assigns opcodes to the data path and specifies its selection signals.
Description
(1) Technical field
The present invention relates to a device for accelerating the H.264 integer transform.
(2) Background art
Earlier video coding standards, such as MPEG-2 and MPEG-4, generally use an 8 × 8 discrete cosine transform (DCT) for transform coding. The newest video coding standard, H.264, uses 4 × 4 integer transforms, comprising the integer cosine transform, the inverse integer cosine transform and the integer Hadamard transform. Taken individually, an H.264 4 × 4 integer transform requires far fewer operations than an 8 × 8 DCT. However, H.264 applies the integer transform to a very large number of blocks, so for video of the same frame size the cumulative cost of the integer transforms is far higher than that of the 8 × 8 DCT. Real-time encoding and decoding under H.264 therefore requires the integer transform to be accelerated.
Performing the H.264 integer transform on a general-purpose processor allows hardware to be shared, but it is slow: the horizontal transform and the vertical transform each take 64 clock cycles. An application-specific integrated circuit (ASIC) accelerates the H.264 integer transform well, but its circuit structure is usually dedicated and the device is expensive; it offers neither programmability nor hardware extensibility and suits only one coding standard. A single-instruction multiple-data (SIMD) processor can use vector operations to accelerate the H.264 integer transform to some degree. Although it is slower than an ASIC, its hardware can be shared and no expensive equipment is needed, and, being software-programmable, it performs the integer transform much faster than a general-purpose processor.
A general SIMD processor must transpose the input matrix between the horizontal transform and the vertical transform, so its acceleration is not ideal. The present invention proposes an extension of the SIMD architecture that uses special instructions to accelerate the H.264 integer transform efficiently, while sharing hardware and remaining flexible in software.
(3) Summary of the invention
To overcome the shortcoming of the prior art, in which an H.264 integer transform device cannot offer both hardware extensibility and fast transforms, the invention provides an H.264 integer transform accelerator whose hardware can be shared with other software and which performs the integer transform quickly.
The technical solution of the present invention is as follows:
An H.264 integer transform accelerator comprises a data memory connected to a data bus, and further comprises:
A vector working register group, which receives the original data from the data memory, this original data being vector data (a 4 × 4 input matrix), as well as the intermediate data from the accumulator register group;
Each vector working register $R_i$ consists of four scalar working registers $R_{i0}$, $R_{i1}$, $R_{i2}$ and $R_{i3}$. Vector working register $R_i$ stores row i of the 4 × 4 matrix, or the new row i output by the accumulator register group.
An 8-way vector data path, which operates on the row data of the 4 × 4 matrix according to the opcode;
The 8-way vector data path comprises six stages:
The first stage consists of eight 2-to-1 selectors, which select whether input comes from the accumulator register group or from the vector working register group, supplying the 8 variables of 2 rows of the 4 × 4 matrix at once;
The second stage consists of eight 8-to-1 selectors, which select the pairs of variables to be operated on;
In horizontal transform mode, the selection made according to the opcode performs scalar operations among the data of row i of the 4 × 4 matrix, with 2 rows of the 4 × 4 matrix processed simultaneously;
In vertical transform mode, the selection made according to the opcode performs vector operations between rows of the 4 × 4 matrix;
The third stage consists of sixteen 2-to-1 selectors, which select whether each adder operand is multiplied by 2;
The fourth stage consists of sixteen 2-to-1 selectors, which select whether each adder operand is multiplied by 1/2;
The fifth stage consists of sixteen 2-to-1 selectors, which select whether each adder operand is negated;
The sixth stage consists of 8 adders, which perform the additions and output the results to the accumulator register group.
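The six stages above can be modelled for a single adder as a short Python sketch. This is only an illustration of the selector chain, not the patented circuit: the `LaneControl` class, its field names and the `lane` function are hypothetical, and the halving is assumed to be an arithmetic right shift.

```python
from dataclasses import dataclass

@dataclass
class LaneControl:
    """Hypothetical control fields for one adder; names are illustrative only."""
    from_work_regs: bool  # stage 1: True = vector working registers, False = accumulators
    partner: int          # stage 2: index (0-7) of the second variable
    double: tuple         # stage 3: (first, second) operand multiplied by 2?
    halve: tuple          # stage 4: (first, second) operand multiplied by 1/2?
    negate: tuple         # stage 5: (first, second) operand negated?

def lane(index, work, acc, ctl):
    """Return the sum produced by adder `index` for one clock cycle."""
    source = work if ctl.from_work_regs else acc      # stage 1: choose the input bank
    operands = [source[index], source[ctl.partner]]   # stage 2: choose the partner variable
    for k in range(2):
        if ctl.double[k]:
            operands[k] <<= 1                         # stage 3: multiply by 2
        if ctl.halve[k]:
            operands[k] >>= 1                         # stage 4: multiply by 1/2 (assumed shift)
        if ctl.negate[k]:
            operands[k] = -operands[k]                # stage 5: negate
    return operands[0] + operands[1]                  # stage 6: add
```

For example, with `partner=3` and every other option switched off, `lane(0, work, acc, ctl)` returns `work[0] + work[3]`, which has the same shape as a sum of the first and last elements of a row.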
The accumulator register group comprises eight accumulator registers, ACC[0] to ACC[7], which store the intermediate data of the accelerated transform of $R_i$; four of the accumulator registers form vector register VACC[0], and the other four form vector register VACC[1].
A controller, which assigns the opcode to the 8-way vector data path, specifies the selection signals of the 8-way vector data path, and controls reads and writes of the data memory;
In horizontal transform mode, the controller generates the opcode information and controls the read of the data memory. Two scalar operations are performed on two data items of each row of the 4 × 4 matrix while, at the same time, two scalar operations are performed on the other two data items, and the results are held temporarily in the vector registers;
The controller then generates opcode information again. Two scalar operations are performed on two data items in the four accumulator registers of vector registers VACC[0] and VACC[1] while, at the same time, two scalar operations are performed on the other two data items, and the results are saved to the vector working register group;
Each 4 × 4 matrix undergoes this sequence of operations twice in succession.
In vertical transform mode, the controller generates the opcode information. A vector operation between 2 rows of the 4 × 4 matrix is performed and its result held temporarily in vector register VACC[0]; a vector operation between the other 2 rows is performed and its result held temporarily in vector register VACC[1];
The controller then generates opcode information again, a vector operation between VACC[0] and VACC[1] is performed, and the result is saved from the accumulator register group to the vector working register group;
Each 4 × 4 matrix undergoes this sequence of operations twice in succession.
The H.264 transform acceleration preferably uses the following scheme, which accelerates the integer cosine transform:
In horizontal transform mode, the controller causes the data path to read the 4 × 4 matrix data from the data memory and performs the following operations on the data of row i:
(1) In one clock cycle, compute:
where the subscript n denotes a given clock cycle;
(2) In one clock cycle, compute:
where the subscript n+1 denotes the clock cycle immediately after that of the preceding computation;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] carry out the computation for another row;
(3) In two clock cycles, read out the results held in the two groups of accumulator registers in turn and save them in the vector registers;
Steps (1) to (3) are then repeated for the other two rows. After eight clock cycles, the four rows of horizontally transformed data are stored, as row vectors, in vector working registers $R_1$, $R_2$, $R_3$ and $R_4$ in order.
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]. The four rows of new data produced by the vertical transform are stored, as row vectors, in $R_5$, $R_6$, $R_7$ and $R_8$ in order. The controller performs the following operations on the data output by the horizontal transform:
Two rows of data are processed:
(1) In two clock cycles, compute respectively:
$\mathrm{VACC}[0]_{n} = R_1 + R_4$
$\mathrm{VACC}[1]_{n} = R_2 + R_3$
(2) In one clock cycle, compute:
$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_{n} + \mathrm{VACC}[1]_{n}$
$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_{n} + \mathrm{VACC}[0]_{n}$
(3) In two clock cycles, save the data of VACC[0] and VACC[1] into $R_5$ and $R_7$ respectively;
The other two rows of data are processed:
(4) In two clock cycles, compute respectively:
$\mathrm{VACC}[0]_{n} = R_1 - R_4$
$\mathrm{VACC}[1]_{n} = R_2 - R_3$
(5) In one clock cycle, compute:
$\mathrm{VACC}[0]_{n+1} = 2\,\mathrm{VACC}[0]_{n} + \mathrm{VACC}[1]_{n}$
$\mathrm{VACC}[1]_{n+1} = -2\,\mathrm{VACC}[1]_{n} + \mathrm{VACC}[0]_{n}$
(6) In two clock cycles, save the data of VACC[0] and VACC[1] into $R_6$ and $R_8$ respectively;
After ten clock cycles, the four rows of vertically transformed data are stored, as row vectors, in vector working registers $R_5$, $R_6$, $R_7$ and $R_8$ in order.
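The vertical steps (1) to (6) for the integer cosine transform can be checked numerically. The sketch below follows the sums and differences given in the text and compares the result with multiplication by the standard H.264 forward core matrix; the function names and the `CF` constant are introduced here for illustration and do not appear in the patent.

```python
# Standard H.264 forward core transform matrix, used only as a cross-check.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

def vsub(a, b):
    return [x - y for x, y in zip(a, b)]

def vdbl(a):
    return [2 * x for x in a]

def forward_vertical(r1, r2, r3, r4):
    """Apply steps (1)-(6) to four row vectors and return [R5, R6, R7, R8]."""
    vacc0, vacc1 = vadd(r1, r4), vadd(r2, r3)                    # step (1)
    r5, r7 = vadd(vacc0, vacc1), vsub(vacc0, vacc1)              # steps (2)-(3)
    vacc0, vacc1 = vsub(r1, r4), vsub(r2, r3)                    # step (4)
    r6, r8 = vadd(vdbl(vacc0), vacc1), vsub(vacc0, vdbl(vacc1))  # steps (5)-(6)
    return [r5, r6, r7, r8]

def matmul(a, b):
    """Plain 4 x 4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]
```

Calling `forward_vertical(*X)` on the four rows of a matrix `X` gives the same rows as `matmul(CF, X)`, which suggests that the register assignment R5, R7 for the sums and R6, R8 for the differences matches the usual row order of the transform.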
The H.264 transform acceleration preferably uses the following scheme, which accelerates the inverse integer cosine transform:
In horizontal transform mode, the controller causes the data path to read the 4 × 4 matrix data from the data memory and performs the following operations on the data of row i:
(1) In one clock cycle, compute:
where the subscript n denotes a given clock cycle;
(2) In one clock cycle, compute:
where the subscript n+1 denotes the clock cycle immediately after that of the preceding computation;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] carry out the computation for another row;
(3) In two clock cycles, read out the results held in the two groups of accumulator registers in turn and save them in the vector registers;
Steps (1) to (3) are then repeated for the other two rows. After eight clock cycles, the four rows of horizontally transformed data are stored, as row vectors, in vector working registers $R_1$, $R_2$, $R_3$ and $R_4$ in order.
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]. The four rows of new data produced by the vertical transform are stored, as row vectors, in $R_5$, $R_6$, $R_7$ and $R_8$ in order. The controller performs the following operations on the data output by the horizontal transform:
Two rows of data are processed:
(1) In two clock cycles, compute:
$\mathrm{VACC}[0]_{n} = R_1 + R_3$
(2) In one clock cycle, compute:
$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_{n} + \mathrm{VACC}[1]_{n}$
$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_{n} + \mathrm{VACC}[0]_{n}$
(3) In two clock cycles, save the data of VACC[0] and VACC[1] into $R_5$ and $R_8$ respectively;
The other two rows of data are processed:
(4) In two clock cycles, compute respectively:
$\mathrm{VACC}[0]_{n} = R_1 - R_3$
(5) In one clock cycle, compute:
$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_{n} + \mathrm{VACC}[1]_{n}$
$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_{n} + \mathrm{VACC}[0]_{n}$
(6) In two clock cycles, save the data of VACC[0] and VACC[1] into $R_6$ and $R_7$ respectively;
After ten clock cycles, the four rows of vertically transformed data are stored, as row vectors, in vector working registers $R_5$, $R_6$, $R_7$ and $R_8$ in order.
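The text of the inverse vertical steps gives only the VACC[0] half of steps (1) and (4). The sketch below fills the VACC[1] half with the butterfly of the standard H.264 inverse core transform, which is an assumption and is not stated in this text: VACC[1] = R2 + R4/2 in step (1) and VACC[1] = R2/2 - R4 in step (4), with the halving done as an arithmetic right shift. The function name is hypothetical.

```python
def inverse_vertical(r1, r2, r3, r4):
    """Sketch of steps (1)-(6); returns [R5, R6, R7, R8] as lists."""
    out = [[], [], [], []]
    for a, b, c, d in zip(r1, r2, r3, r4):
        vacc0 = a + c             # step (1), from the text: VACC[0] = R1 + R3
        vacc1 = b + (d >> 1)      # step (1), ASSUMED: VACC[1] = R2 + R4/2
        r5, r8 = vacc0 + vacc1, vacc0 - vacc1   # steps (2)-(3)
        vacc0 = a - c             # step (4), from the text: VACC[0] = R1 - R3
        vacc1 = (b >> 1) - d      # step (4), ASSUMED: VACC[1] = R2/2 - R4
        r6, r7 = vacc0 + vacc1, vacc0 - vacc1   # steps (5)-(6)
        for row, value in zip(out, (r5, r6, r7, r8)):
            row.append(value)
    return out
```

Under these assumptions the destination registers in steps (3) and (6), namely R5 and R8 for the first pair and R6 and R7 for the second, line up with the output order of the standard inverse butterfly.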
The H.264 transform acceleration preferably uses the following scheme, which accelerates the integer Hadamard transform:
In horizontal transform mode, the controller causes the data path to read the 4 × 4 matrix data from the data memory and performs the following operations on the data of row i:
(1) In one clock cycle, compute:
where the subscript n denotes a given clock cycle;
(2) In one clock cycle, compute:
where the subscript n+1 denotes the clock cycle immediately after that of the preceding computation;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] carry out the computation for another row;
(3) In two clock cycles, read out the results held in the two groups of accumulator registers in turn and save them in the vector registers;
Steps (1) to (3) are then repeated for the other two rows. After eight clock cycles, the four rows of horizontally transformed data are stored, as row vectors, in vector working registers $R_1$, $R_2$, $R_3$ and $R_4$ in order.
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]. The four rows of new data produced by the vertical transform are stored, as row vectors, in $R_5$, $R_6$, $R_7$ and $R_8$ in order. The controller performs the following operations on the data output by the horizontal transform:
Two rows of data are processed:
(1) In two clock cycles, compute:
$\mathrm{VACC}[0]_{n} = R_1 + R_3$
$\mathrm{VACC}[1]_{n} = R_2 + R_4$
(2) In one clock cycle, compute:
$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_{n} + \mathrm{VACC}[1]_{n}$
$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_{n} + \mathrm{VACC}[0]_{n}$
(3) In two clock cycles, save the data of VACC[0] and VACC[1] into $R_5$ and $R_8$ respectively;
The other two rows of data are processed:
(4) In two clock cycles, compute:
$\mathrm{VACC}[0]_{n} = R_1 - R_3$
$\mathrm{VACC}[1]_{n} = R_2 - R_4$
(5) In one clock cycle, compute:
$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_{n} + \mathrm{VACC}[1]_{n}$
$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_{n} + \mathrm{VACC}[0]_{n}$
(6) In two clock cycles, save the data of VACC[0] and VACC[1] into $R_6$ and $R_7$ respectively;
After ten clock cycles, the four rows of vertically transformed data are stored, as row vectors, in vector working registers $R_5$, $R_6$, $R_7$ and $R_8$ in order.
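The Hadamard vertical steps can be checked in the same spirit. The sketch below applies the sums and differences from steps (1) to (6) column by column and compares the result with multiplication by the 4 × 4 Hadamard matrix used in H.264; `H4` and `hadamard_vertical` are names chosen here for illustration.

```python
# 4 x 4 Hadamard matrix as used by H.264, for cross-checking only.
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]

def hadamard_vertical(r1, r2, r3, r4):
    """Apply steps (1)-(6) and return [R5, R6, R7, R8] as lists."""
    out = [[], [], [], []]
    for a, b, c, d in zip(r1, r2, r3, r4):
        s0, s1 = a + c, b + d          # step (1): VACC[0], VACC[1]
        r5, r8 = s0 + s1, s0 - s1      # steps (2)-(3)
        d0, d1 = a - c, b - d          # step (4): VACC[0], VACC[1]
        r6, r7 = d0 + d1, d0 - d1      # steps (5)-(6)
        for row, value in zip(out, (r5, r6, r7, r8)):
            row.append(value)
    return out
```

The result equals the matrix product of `H4` with the input rows, so saving the first pair to R5 and R8 and the second pair to R6 and R7 reproduces the usual row order of the Hadamard transform.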
Operating principle of the invention: in horizontal transform mode, the controller generates the opcode information, assigns the opcode to the 8-way vector data path, specifies the selection signals of the 8-way vector data path, and controls the read of the data memory so that the 4 × 4 matrix to be transformed is stored into the vector working register group. According to the opcode, 4 ways of the 8-way vector data path perform two scalar operations on two data items of one row of the 4 × 4 matrix while two scalar operations are performed on the other two data items of that row; the other 4 ways simultaneously compute another row of the 4 × 4 matrix. The results are held temporarily in the vector registers;
The controller then generates opcode information again. According to the opcode, the 8-way vector data path performs two scalar operations on two data items in the four accumulator registers of vector registers VACC[0] and VACC[1] while two scalar operations are performed on the other two data items, and the results are saved to the vector working register group, completing the horizontal transform of two rows of the 4 × 4 matrix;
The same operation is performed once more on the 4 × 4 matrix data, completing the horizontal transform of the 4 × 4 matrix.
In vertical transform mode, the controller generates the opcode information, assigns the opcode to the 8-way vector data path and specifies the selection signals of the 8-way vector data path. The 8-way vector data path performs a vector operation between 2 rows of the 4 × 4 matrix and the result is held temporarily in vector register VACC[0]; it then performs a vector operation between the other 2 rows of the 4 × 4 matrix and the result is held temporarily in vector register VACC[1];
The controller then generates opcode information again. According to the opcode, the 8-way vector data path performs a vector operation between VACC[0] and VACC[1], and the result is saved from the accumulator register group to the vector working register group, completing the vertical transform of two rows of the 4 × 4 matrix;
The same operation is performed once more on the 4 × 4 matrix data, completing the vertical transform of the 4 × 4 matrix.
The beneficial effect of the invention is chiefly that it extends and enhances the SIMD architecture with powerful instructions that speed up the integer transform.
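The cycle counts quoted in this description can be tallied as follows. This is a rough arithmetic sketch only: it uses the per-step cycle counts from the text and the 64-cycle figures quoted for a general-purpose processor in the background section, and it ignores instruction fetch and any other overhead. All variable names are illustrative.

```python
# Cycles per step, as listed in the text.
HORIZONTAL_STEPS = (1, 1, 2)   # steps (1), (2), (3) of the horizontal transform
VERTICAL_STEPS = (2, 1, 2)     # steps (1), (2), (3) of the vertical transform
PASSES = 2                     # each sequence is run twice per 4 x 4 matrix

horizontal_cycles = PASSES * sum(HORIZONTAL_STEPS)       # 8 clock cycles
vertical_cycles = PASSES * sum(VERTICAL_STEPS)           # 10 clock cycles
accelerator_cycles = horizontal_cycles + vertical_cycles

# Figure quoted in the background section for a general-purpose processor.
general_purpose_cycles = 64 + 64

ratio = general_purpose_cycles / accelerator_cycles
print(f"{accelerator_cycles} cycles versus {general_purpose_cycles}, ratio {ratio:.1f}")
```

On these figures a full 4 × 4 transform takes 18 clock cycles on the accelerator against 128 on the general-purpose processor, a ratio of roughly 7.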
(4) Description of drawings
Fig. 1 is an overall block diagram of the H.264 integer transform accelerator of the present invention;
Fig. 2 is a detailed block diagram of the H.264 integer transform accelerator.
(5) Embodiments
The invention is further described below with reference to the accompanying drawings.
Embodiment one
Referring to the drawings: an H.264 integer transform accelerator comprises a data memory connected to a data bus, and further comprises:
A vector working register group, which receives the original data from the data memory, this original data being vector data (a 4 × 4 input matrix), as well as the intermediate data from the accumulator register group;
Each vector working register $R_i$ consists of four scalar working registers $R_{i0}$, $R_{i1}$, $R_{i2}$ and $R_{i3}$. Vector working register $R_i$ stores row i of the 4 × 4 matrix, or the new row i output by the accumulator register group.
An 8-way vector data path, which operates on the row data of the 4 × 4 matrix according to the opcode;
The 8-way vector data path comprises six stages:
The first stage consists of eight 2-to-1 selectors, which select whether input comes from the accumulator register group or from the vector working register group, supplying the 8 variables of 2 rows of the 4 × 4 matrix at once;
The second stage consists of eight 8-to-1 selectors, which select the pairs of variables to be operated on;
In horizontal transform mode, the selection made according to the opcode performs scalar operations among the data of row i of the 4 × 4 matrix, with 2 rows of the 4 × 4 matrix processed simultaneously;
In vertical transform mode, the selection made according to the opcode performs vector operations between rows of the 4 × 4 matrix;
The third stage consists of sixteen 2-to-1 selectors, which select whether each adder operand is multiplied by 2;
The fourth stage consists of sixteen 2-to-1 selectors, which select whether each adder operand is multiplied by 1/2;
The fifth stage consists of sixteen 2-to-1 selectors, which select whether each adder operand is negated;
The sixth stage consists of 8 adders, which perform the additions and output the results to the accumulator register group.
The accumulator register group comprises eight accumulator registers, ACC[0] to ACC[7], which store the intermediate data of the accelerated transform of $R_i$; four of the accumulator registers form vector register VACC[0], and the other four form vector register VACC[1].
A controller, which assigns the opcode to the 8-way vector data path, specifies the selection signals of the 8-way vector data path, and controls reads and writes of the data memory;
In horizontal transform mode, the controller generates the opcode information and controls the read of the data memory. Two scalar operations are performed on two data items of each row of the 4 × 4 matrix while, at the same time, two scalar operations are performed on the other two data items, and the results are held temporarily in the vector registers;
The controller then generates opcode information again. Two scalar operations are performed on two data items in the four accumulator registers of vector registers VACC[0] and VACC[1] while, at the same time, two scalar operations are performed on the other two data items, and the results are saved to the vector working register group;
Each 4 × 4 matrix undergoes this sequence of operations twice in succession.
In vertical transform mode, the controller generates the opcode information. A vector operation between 2 rows of the 4 × 4 matrix is performed and its result held temporarily in vector register VACC[0]; a vector operation between the other 2 rows is performed and its result held temporarily in vector register VACC[1];
The controller then generates opcode information again, a vector operation between VACC[0] and VACC[1] is performed, and the result is saved from the accumulator register group to the vector working register group;
Each 4 × 4 matrix undergoes this sequence of operations twice in succession.
Fig. 1 is an overall block diagram of the H.264 integer transform accelerator of the present invention. The vector working register group 1, the data memory 4 and the accumulator register group 3 are connected by a bus over which data is transferred. Each vector working register can be regarded as four parallel scalar working registers. The 8-way vector data path 2 is connected to the vector working register group 1 through a channel twice the bus width, so that the contents of two vector working registers can be accessed at once. The path selection of the 8-way vector data path is controlled by the controller 6, which reads instructions from the program memory 5, converts them into control signals and outputs these to the 8-way vector data path.
Fig. 2 is a detailed block diagram of the H.264 integer transform accelerator. The whole data path can be divided into six stages:
First stage: this stage selects the input variables, choosing whether input comes from the accumulator register group or from the vector working register group, and supplies the 8 variables of 2 rows of the 4 × 4 matrix at once. It has eight 2-to-1 selectors and needs 8 control bits in total, denoted A0 to A7 from left to right in Fig. 2; the same convention applies below.
Second stage: this stage selects the pairs of variables to be operated on. It has eight 8-to-1 selectors and needs 3 × 8 = 24 control bits in total, denoted B0 to B23 from left to right in Fig. 2.
Third stage: this stage selects whether each adder operand is multiplied by 2. It has sixteen 2-to-1 selectors and needs 16 control bits in total, denoted C0 to C15 from left to right in Fig. 2.
Fourth stage: this stage selects whether each adder operand is multiplied by 1/2. It has sixteen 2-to-1 selectors and needs 16 control bits in total, denoted D0 to D15 from left to right in Fig. 2.
Fifth stage: this stage selects whether each adder operand is negated. It has sixteen 2-to-1 selectors and needs 16 control bits in total, denoted E0 to E15 from left to right in Fig. 2.
Sixth stage: this stage performs the additions and outputs the results to the accumulator register group.
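The control-bit counts listed for the five selector stages add up to 8 + 24 + 16 + 16 + 16 = 80 bits. The sketch below tallies that total and packs one set of selector settings into a single integer. The packing order (stage A in the lowest bits) and the `pack` function are assumptions made purely for illustration; the text does not say how the control word is laid out.

```python
# (stage letter, number of selectors, control bits per selector), from the text.
FIELDS = [("A", 8, 1), ("B", 8, 3), ("C", 16, 1), ("D", 16, 1), ("E", 16, 1)]

TOTAL_BITS = sum(count * width for _, count, width in FIELDS)

def pack(settings):
    """Pack per-selector settings into one integer (hypothetical bit layout)."""
    word, shift = 0, 0
    for name, count, width in FIELDS:
        values = settings[name]
        if len(values) != count:
            raise ValueError(f"stage {name} needs {count} settings")
        for value in values:
            if not 0 <= value < (1 << width):
                raise ValueError(f"stage {name} setting {value} out of range")
            word |= value << shift   # place this selector's bits
            shift += width
    return word
```

For instance, setting all eight stage-A selectors to 1 and everything else to 0 yields `0xFF` under this layout, and `TOTAL_BITS` evaluates to 80.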
Embodiment two
The H.264 transform acceleration preferably uses the following scheme, which accelerates the H.264 integer cosine transform:
In horizontal transform mode, the controller causes the data path to read the 4 × 4 matrix data from the data memory and performs the following operations on the data of row i:
(1) The 8-way vector data path is configured according to Table 1:
Table 1
First stage: A0 to A7 are set to "1", selecting input from the vector working register group;
Second stage: for a given variable, selects the other variable to be combined with it; for example, B0=6 means that for $R_{i0}$ on path 0, $R_{i3}$ on path 6 is selected to be combined with $R_{i0}$;
Third stage: selects whether each adder operand is multiplied by 2; for example, C0=0 and C1=0 mean that neither $R_{i0}$ nor $R_{i3}$ is multiplied by 2;
Fourth stage: selects whether each adder operand is multiplied by 1/2; for example, D0=0 and D1=0 mean that neither $R_{i0}$ nor $R_{i3}$ is multiplied by 1/2;
Fifth stage: selects whether each adder operand is negated; for example, E0=0 and E1=0 mean that neither $R_{i0}$ nor $R_{i3}$ is negated;
Sixth stage: performs the addition, for example $\mathrm{ACC}[0] = R_{i0} + R_{i3}$, and outputs the result to the accumulator register group;
With the 8-way vector data path configured according to Table 1, the following is computed in one clock cycle:
where the subscript n denotes a given clock cycle;
(2) The 8-way vector data path is configured according to Table 2:
Table 2
First stage: A0 to A7 are set to "0", selecting input from the accumulator register group;
With the 8-way vector data path configured according to Table 2, the following is computed in one clock cycle:
where the subscript n+1 denotes the clock cycle immediately after that of the preceding computation;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] carry out the computation for another row;
(3) In two clock cycles, read out the results held in the two groups of accumulator registers in turn and save them in the vector registers;
Steps (1) to (3) are then repeated for the other two rows. After eight clock cycles, the four rows of horizontally transformed data are stored, as row vectors, in vector working registers $R_1$, $R_2$, $R_3$ and $R_4$ in order;
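The horizontal steps (1) to (3) for the integer cosine transform might look as follows in software. Only the example ACC[0] = R_i0 + R_i3 comes from the text; the assignment of the remaining sums and differences to accumulators, the order of the outputs, and the function name are assumptions chosen so that the result matches the standard H.264 forward row transform.

```python
def horizontal_pair(row_a, row_b):
    """Transform two rows at once: even accumulators hold row_a, odd ones row_b."""
    acc = [0] * 8
    # Step (1): first clock cycle, inputs come from the vector working registers.
    for lane, (x0, x1, x2, x3) in enumerate((row_a, row_b)):
        acc[0 + lane] = x0 + x3      # given in the text for ACC[0]
        acc[2 + lane] = x1 + x2      # assumed
        acc[4 + lane] = x0 - x3      # assumed
        acc[6 + lane] = x1 - x2      # assumed
    # Step (2): second clock cycle, inputs come from the accumulators.
    old = acc[:]                     # every adder reads the values from step (1)
    for lane in (0, 1):
        s0, s1, d0, d1 = (old[k + lane] for k in (0, 2, 4, 6))
        acc[0 + lane] = s0 + s1
        acc[2 + lane] = 2 * d0 + d1
        acc[4 + lane] = s0 - s1
        acc[6 + lane] = d0 - 2 * d1
    # Step (3): read out the two accumulator groups as two transformed rows.
    return acc[0::2], acc[1::2]
```

Running the function twice, once per pair of rows, corresponds to the two passes described in the text for a full 4 × 4 matrix.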
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]. The four rows of new data produced by the vertical transform are stored, as row vectors, in $R_5$, $R_6$, $R_7$ and $R_8$ in order. The controller performs the following operations on the data output by the horizontal transform:
Two rows of data are processed:
(1) The 8-way vector data path is configured according to Table 3:
Table 3
In the first clock cycle, data is read from the vector working register group, a vector operation between 2 rows of the 4 × 4 matrix is performed, and the result is held temporarily in vector register VACC[0]. In the second clock cycle, the other two rows are read from the vector working register group, a vector operation between these other 2 rows of the 4 × 4 matrix is performed, and the result is held temporarily in vector register VACC[1];
With the 8-way vector data path configured according to Table 3, the following are computed in two clock cycles respectively:
$\mathrm{VACC}[0]_{n} = R_1 + R_4$
$\mathrm{VACC}[1]_{n} = R_2 + R_3$
(2) The 8-way vector data path is configured according to Table 4:
Table 4
With the 8-way vector data path configured according to Table 4, the following is computed in one clock cycle:
$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_{n} + \mathrm{VACC}[1]_{n}$
$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_{n} + \mathrm{VACC}[0]_{n}$
(3) In two clock cycles, save the data of VACC[0] and VACC[1] into $R_5$ and $R_7$ respectively;
The other two rows of data are processed:
(4) The 8-way vector data path is configured according to Table 5:
Table 5
With the 8-way vector data path configured according to Table 5, the following are computed in two clock cycles respectively:
$\mathrm{VACC}[0]_{n} = R_1 - R_4$
$\mathrm{VACC}[1]_{n} = R_2 - R_3$
(5) The 8-way vector data path is configured according to Table 6:
Table 6
With the 8-way vector data path configured according to Table 6, the following is computed in one clock cycle:
$\mathrm{VACC}[0]_{n+1} = 2\,\mathrm{VACC}[0]_{n} + \mathrm{VACC}[1]_{n}$
$\mathrm{VACC}[1]_{n+1} = -2\,\mathrm{VACC}[1]_{n} + \mathrm{VACC}[0]_{n}$
(6) In two clock cycles, save the data of VACC[0] and VACC[1] into $R_6$ and $R_8$ respectively;
After ten clock cycles, the four rows of vertically transformed data are stored, as row vectors, in vector working registers $R_5$, $R_6$, $R_7$ and $R_8$ in order.
The remaining structure and implementation of this embodiment are the same as in Embodiment one.
Embodiment three
The H.264 transform acceleration preferably uses the following scheme, which accelerates the H.264 inverse integer cosine transform:
In horizontal transform mode, the controller causes the data path to read the 4 × 4 matrix data from the data memory and performs the following operations on the data of row i:
(1) The 8-way vector data path is configured according to Table 7:
Table 7
With the 8-way vector data path configured according to Table 7, the following is computed in one clock cycle:
where the subscript n denotes a given clock cycle;
(2) The 8-way vector data path is configured according to Table 8:
Table 8
With the 8-way vector data path configured according to Table 8, the following is computed in one clock cycle:
where the subscript n+1 denotes the clock cycle immediately after that of the preceding computation;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] carry out the computation for another row;
(3) In two clock cycles, read out the results held in the two groups of accumulator registers in turn and save them in the vector registers;
Steps (1) to (3) are then repeated for the other two rows. After eight clock cycles, the four rows of horizontally transformed data are stored, as row vectors, in vector working registers $R_1$, $R_2$, $R_3$ and $R_4$ in order;
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]. The four rows of new data produced by the vertical transform are stored, as row vectors, in $R_5$, $R_6$, $R_7$ and $R_8$ in order. The controller performs the following operations on the data output by the horizontal transform:
Two rows of data are processed:
(1) The 8-way vector data path is configured according to Table 9:
Table 9
With the 8-way vector data path configured according to Table 9, the following is computed in two clock cycles:
$\mathrm{VACC}[0]_{n} = R_1 + R_3$
(2) With the 8-way vector data path configured according to Table 4, the following is computed in one clock cycle:
$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_{n} + \mathrm{VACC}[1]_{n}$
$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_{n} + \mathrm{VACC}[0]_{n}$
(3) In two clock cycles, save the data of VACC[0] and VACC[1] into $R_5$ and $R_8$ respectively;
The other two rows of data are processed:
(4) The 8-way vector data path is configured according to Table 10:
Table 10
With the 8-way vector data path configured according to Table 10, the following are computed in two clock cycles respectively:
$\mathrm{VACC}[0]_{n} = R_1 - R_3$
(5) With the 8-way vector data path configured according to Table 4, the following is computed in one clock cycle:
$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_{n} + \mathrm{VACC}[1]_{n}$
$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_{n} + \mathrm{VACC}[0]_{n}$
(6) In two clock cycles, save the data of VACC[0] and VACC[1] into $R_6$ and $R_7$ respectively;
After ten clock cycles, the four rows of vertically transformed data are stored, as row vectors, in vector working registers $R_5$, $R_6$, $R_7$ and $R_8$ in order.
The remaining structure and implementation of this embodiment are the same as in Embodiment one.
Embodiment four
H.264 conversion is quickened to be preferably as follows scheme, adopts H.264 integer Ha Deman conversion to quicken:
In the horizontal transform mode, the controller reads the 4×4 matrix data from the data memory through the data path and performs the following operations on the data of row i:
(1) With the 8-way vector data path configured according to Table 1, the following is computed in one clock cycle:
where the subscript n denotes a given clock cycle;
(2) The 8-way vector data path is configured according to Table 11:
Table 11
With the 8-way vector data path configured according to Table 11, the following is computed in one clock cycle:
where the subscript n+1 denotes the clock cycle that follows cycle n of the preceding operation;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] perform the same operations on another row;
(3) In two clock cycles, the output results of the two groups of accumulator registers are read out in turn and saved to the vector registers;
Steps (1) to (3) are then applied to the other two rows of data. After eight clock cycles, the four rows of data produced by the horizontal transform are stored, as row vectors, in the vector operation working registers $R_1$, $R_2$, $R_3$ and $R_4$ in turn;
In the vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]. The four rows of new data produced by the vertical transform are stored, as row vectors, in $R_5$, $R_6$, $R_7$ and $R_8$ in turn. The controller performs the following operations on the data output by the horizontal transform:
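The eight-cycle figure for the horizontal pass follows from the step timings given above, and a small tally makes that explicit. This is only a sketch: the names `STEP_CYCLES`, `EVEN_LANES`, `ODD_LANES` and `horizontal_cycles` are hypothetical, and the code models the schedule only, not the arithmetic of steps (1) and (2).

```python
# Cycle cost of each horizontal step, as stated in the text.
STEP_CYCLES = {"step1": 1, "step2": 1, "step3_store": 2}

ROWS = 4           # rows in the 4x4 block
ROWS_PER_PASS = 2  # two rows are processed in the same clock cycles

# Accumulator lanes: the text names ACC[1], ACC[3], ACC[5], ACC[7] as the
# group handling the second row; the even lanes taking the first row is
# inferred from that.
EVEN_LANES = [k for k in range(8) if k % 2 == 0]
ODD_LANES = [k for k in range(8) if k % 2 == 1]

def horizontal_cycles():
    """Total clock cycles for the horizontal transform of one 4x4 block."""
    passes = ROWS // ROWS_PER_PASS
    return passes * sum(STEP_CYCLES.values())

if __name__ == "__main__":
    print("even lanes:", EVEN_LANES, "odd lanes:", ODD_LANES)
    print("horizontal cycles:", horizontal_cycles())
```

Two passes of four cycles each give the eight clock cycles stated for the horizontal transform.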
Two rows of data are processed as follows:
(1) With the 8-way vector data path configured according to Table 3, the following is computed in two clock cycles:
$VACC[0]_n = R_1 + R_3$
$VACC[1]_n = R_2 + R_4$
(2) With the 8-way vector data path configured according to Table 4, the following is computed in one clock cycle:
$VACC[0]_{n+1} = VACC[0]_n + VACC[1]_n$
$VACC[1]_{n+1} = -VACC[1]_n + VACC[0]_n$
(3) In two clock cycles, the data in VACC[0] and VACC[1] are saved to $R_5$ and $R_8$ respectively.
The other two rows of data are processed as follows:
(4) With the 8-way vector data path configured according to Table 5, the following is computed in two clock cycles:
$VACC[0]_n = R_1 - R_3$
$VACC[1]_n = R_2 - R_4$
(5) With the 8-way vector data path configured according to Table 4, the following is computed in one clock cycle:
$VACC[0]_{n+1} = VACC[0]_n + VACC[1]_n$
$VACC[1]_{n+1} = -VACC[1]_n + VACC[0]_n$
(6) In two clock cycles, the data in VACC[0] and VACC[1] are saved to $R_6$ and $R_7$ respectively.
After ten clock cycles, the four rows of data produced by the vertical transform are stored, as row vectors, in the vector operation working registers $R_5$, $R_6$, $R_7$ and $R_8$ in turn.
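Steps (1) to (6) of the vertical pass can be checked against a direct matrix product. The sketch below assumes rows are plain lists of integers; the helper names (`vadd`, `vsub`, `matmul`, `vertical_hadamard`) are hypothetical, and the matrix `H` is the usual 4×4 Hadamard matrix written from general knowledge of H.264 rather than taken from this text.

```python
# 4x4 Hadamard matrix (assumption: standard H.264 form, not quoted from the patent).
H = [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1, -1,  1, -1],
]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

def vsub(a, b):
    return [x - y for x, y in zip(a, b)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def vertical_hadamard(r1, r2, r3, r4):
    """Follow steps (1)-(6): returns [R5, R6, R7, R8]."""
    # Steps (1)-(3): sums of row pairs, then the one-cycle combination.
    v0, v1 = vadd(r1, r3), vadd(r2, r4)
    v0, v1 = vadd(v0, v1), vsub(v0, v1)   # -VACC[1] + VACC[0]
    r5, r8 = v0, v1
    # Steps (4)-(6): differences of row pairs, then the same combination.
    v0, v1 = vsub(r1, r3), vsub(r2, r4)
    v0, v1 = vadd(v0, v1), vsub(v0, v1)
    r6, r7 = v0, v1
    return [r5, r6, r7, r8]

if __name__ == "__main__":
    X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(vertical_hadamard(*X) == matmul(H, X))
```

Under these assumptions the register sequence and the product of `H` with the row matrix agree, which is why the results land in $R_5$, $R_8$ and then $R_6$, $R_7$ rather than in consecutive registers.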
The remaining structure and implementation of this embodiment are the same as in Embodiment One.
Claims (4)
1. An H.264 integer transform acceleration device, comprising a data memory connected to a data bus, characterized in that it further comprises:
a vector operation working register group, for receiving the original data from the data memory, the original data being vector data in the form of a 4×4 input matrix, and for receiving the intermediate data from the accumulator register group;
each vector operation working register $R_i$ consists of four scalar operation working registers $R_{i0}$, $R_{i1}$, $R_{i2}$ and $R_{i3}$, and is used to store the data of row i of the 4×4 matrix or the new data of row i output by the accumulator register group;
an 8-way vector data path, for performing operations on the row data of the 4×4 matrix according to operands;
the 8-way vector data path comprises six stages of operation:
the first stage is 8 two-to-one selectors, for selecting whether the input comes from the accumulator register group or from the vector operation working register group, so that 8 variables from 2 rows of the 4×4 matrix are processed simultaneously;
the second stage is 8 eight-to-one selectors, for selecting the 2 variables between which an operation is performed;
in the horizontal transform mode, scalar operations between the data within row i of the 4×4 matrix are selected according to the operands, and 2 rows of the 4×4 matrix are processed simultaneously;
in the vertical transform mode, vector operations between the rows of the 4×4 matrix are selected according to the operands;
the third stage is 16 two-to-one selectors, for selecting whether each adder operand is multiplied by 2;
the fourth stage is 16 two-to-one selectors, for selecting whether each adder operand is multiplied by 1/2;
the fifth stage is 16 two-to-one selectors, for selecting whether each adder operand is negated;
the sixth stage is 8 adders, for performing addition, the results being output to the accumulator register group;
an accumulator register group, comprising 8 accumulator registers ACC[0] to ACC[7], for storing the intermediate data of the accelerated transform of $R_i$; four of the accumulator registers form vector register VACC[0], and the other four form vector register VACC[1];
a controller, for assigning operands to the 8-way vector data path, specifying the selection signals of the 8-way vector data path, and controlling the read and write operations of the data memory;
in the horizontal transform mode, the controller generates operand information and controls the read operation of the data memory; two scalar operations are performed on two data items of each row of the 4×4 matrix while two scalar operations are performed simultaneously on the other two data items, and the results are held temporarily in the vector registers;
the controller generates operand information again; two scalar operations are performed on two data items held in the four accumulator registers of vector registers VACC[0] and VACC[1] while two scalar operations are performed simultaneously on the other two data items, and the results are saved to the vector operation working register group;
these operations are performed twice in succession for each 4×4 matrix;
in the vertical transform mode, the controller generates operand information; a vector operation between 2 rows of the 4×4 matrix is performed and its result is held temporarily in vector register VACC[0], and a vector operation between the other 2 rows of the 4×4 matrix is performed and its result is held temporarily in vector register VACC[1];
the controller generates operand information again; a vector operation between VACC[0] and VACC[1] is performed, and the result is saved from the accumulator register group to the vector operation working register group;
these operations are performed twice in succession for each 4×4 matrix.
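Stages three to six of the data path (optional multiply by 2, optional multiply by 1/2, optional negation, then addition) can be modelled for a single adder lane as below. This is a sketch under stated assumptions: the names `condition` and `lane` are hypothetical, the selector controls are modelled as boolean flags, and the multiply by 1/2 is modelled as an arithmetic right shift, which the claim does not specify.

```python
def condition(x, double=False, halve=False, negate=False):
    """Apply stages 3-5 to one adder operand, in the order given in the claim."""
    if double:          # stage 3: multiply by 2 or pass through
        x = x * 2
    if halve:           # stage 4: multiply by 1/2 or pass through
        x = x >> 1      # assumption: arithmetic right shift, not stated in the claim
    if negate:          # stage 5: negate or pass through
        x = -x
    return x

def lane(a, b, ctl_a=None, ctl_b=None):
    """Stage 6: one of the 8 adders, fed by two conditioned operands."""
    return condition(a, **(ctl_a or {})) + condition(b, **(ctl_b or {}))

if __name__ == "__main__":
    # 2*a - b, the kind of term a forward transform step needs
    print(lane(3, 5, {"double": True}, {"negate": True}))
    # a + b/2, the kind of term an inverse transform step needs
    print(lane(8, 4, {}, {"halve": True}))
```

With two operands per adder and eight adders, this accounts for the 16 selectors per conditioning stage named in the claim.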
2. The H.264 integer transform acceleration device as claimed in claim 1, characterized in that: in the horizontal transform mode, the controller reads the 4×4 matrix data from the data memory through the data path and performs the following operations on the data of row i:
(1) In one clock cycle, the following is computed:
where the subscript n denotes a given clock cycle;
(2) In one clock cycle, the following is computed:
where the subscript n+1 denotes the clock cycle that follows cycle n;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] perform the same operations on another row;
(3) In two clock cycles, the output results of the two groups of accumulator registers are read out in turn and saved to the vector registers;
Steps (1) to (3) are then applied to the other two rows of data. After eight clock cycles, the four rows of data produced by the horizontal transform are stored, as row vectors, in the vector operation working registers $R_1$, $R_2$, $R_3$ and $R_4$ in turn;
In the vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]. The four rows of new data produced by the vertical transform are stored, as row vectors, in $R_5$, $R_6$, $R_7$ and $R_8$ in turn. The controller performs the following operations on the data output by the horizontal transform:
Two rows of data are processed as follows:
(1) In two clock cycles, the following are computed respectively:
$VACC[0]_n = R_1 + R_4$
$VACC[1]_n = R_2 + R_3$
(2) In one clock cycle, the following is computed:
$VACC[0]_{n+1} = VACC[0]_n + VACC[1]_n$
$VACC[1]_{n+1} = -VACC[1]_n + VACC[0]_n$
(3) In two clock cycles, the data in VACC[0] and VACC[1] are saved to $R_5$ and $R_7$ respectively;
The other two rows of data are processed as follows:
(4) In two clock cycles, the following are computed respectively:
$VACC[0]_n = R_1 - R_4$
$VACC[1]_n = R_2 - R_3$
(5) In one clock cycle, the following is computed:
$VACC[0]_{n+1} = 2\,VACC[0]_n + VACC[1]_n$
$VACC[1]_{n+1} = -2\,VACC[1]_n + VACC[0]_n$
(6) In two clock cycles, the data in VACC[0] and VACC[1] are saved to $R_6$ and $R_8$ respectively;
After ten clock cycles, the four rows of data produced by the vertical transform are stored, as row vectors, in the vector operation working registers $R_5$, $R_6$, $R_7$ and $R_8$ in turn.
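The vertical steps of this claim can be compared with a direct product by the 4×4 forward integer transform matrix. The sketch below is illustrative only: rows are modelled as lists of integers, the helper names (`axpy`, `matmul`, `vertical_forward`) are hypothetical, and the matrix `CF` is the commonly cited H.264 forward core transform matrix, written from general knowledge rather than from this text.

```python
# Forward 4x4 integer transform matrix (assumption: standard H.264 form).
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

def axpy(ca, a, cb, b):
    """Element-wise ca*a + cb*b on two row vectors."""
    return [ca * x + cb * y for x, y in zip(a, b)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def vertical_forward(r1, r2, r3, r4):
    """Follow steps (1)-(6) of the claim: returns [R5, R6, R7, R8]."""
    # Steps (1)-(3): outer and inner row sums, combined without scaling.
    v0, v1 = axpy(1, r1, 1, r4), axpy(1, r2, 1, r3)
    r5, r7 = axpy(1, v0, 1, v1), axpy(1, v0, -1, v1)
    # Steps (4)-(6): outer and inner row differences, combined with a factor of 2.
    v0, v1 = axpy(1, r1, -1, r4), axpy(1, r2, -1, r3)
    r6, r8 = axpy(2, v0, 1, v1), axpy(1, v0, -2, v1)
    return [r5, r6, r7, r8]

if __name__ == "__main__":
    X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(vertical_forward(*X) == matmul(CF, X))
```

The factor of 2 in step (5) corresponds to the multiply-by-2 selector stage of the data path, and the save order $R_5$, $R_7$, then $R_6$, $R_8$ matches the even and odd rows of `CF` under these assumptions.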
3. The H.264 integer transform acceleration device as claimed in claim 1, characterized in that: in the horizontal transform mode, the controller reads the 4×4 matrix data from the data memory through the data path and performs the following operations on the data of row i:
(1) In one clock cycle, the following is computed:
where the subscript n denotes a given clock cycle;
(2) In one clock cycle, the following is computed:
where the subscript n+1 denotes the clock cycle that follows cycle n;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] perform the same operations on another row;
(3) In two clock cycles, the output results of the two groups of accumulator registers are read out in turn and saved to the vector registers;
Steps (1) to (3) are then applied to the other two rows of data. After eight clock cycles, the four rows of data produced by the horizontal transform are stored, as row vectors, in the vector operation working registers $R_1$, $R_2$, $R_3$ and $R_4$ in turn;
In the vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]. The four rows of new data produced by the vertical transform are stored, as row vectors, in $R_5$, $R_6$, $R_7$ and $R_8$ in turn. The controller performs the following operations on the data output by the horizontal transform:
Two rows of data are processed as follows:
(1) In two clock cycles, the following is computed:
$VACC[0]_n = R_1 + R_3$
(2) In one clock cycle, the following is computed:
$VACC[0]_{n+1} = VACC[0]_n + VACC[1]_n$
$VACC[1]_{n+1} = -VACC[1]_n + VACC[0]_n$
(3) In two clock cycles, the data in VACC[0] and VACC[1] are saved to $R_5$ and $R_8$ respectively;
The other two rows of data are processed as follows:
(4) In two clock cycles, the following is computed:
$VACC[0]_n = R_1 - R_3$
(5) In one clock cycle, the following is computed:
$VACC[0]_{n+1} = VACC[0]_n + VACC[1]_n$
$VACC[1]_{n+1} = -VACC[1]_n + VACC[0]_n$
(6) In two clock cycles, the data in VACC[0] and VACC[1] are saved to $R_6$ and $R_7$ respectively;
After ten clock cycles, the four rows of data produced by the vertical transform are stored, as row vectors, in the vector operation working registers $R_5$, $R_6$, $R_7$ and $R_8$ in turn.
4. The H.264 integer transform acceleration device as claimed in claim 1, characterized in that: in the horizontal transform mode, the controller reads the 4×4 matrix data from the data memory through the data path and performs the following operations on the data of row i:
(1) In one clock cycle, the following is computed:
where the subscript n denotes a given clock cycle;
(2) In one clock cycle, the following is computed:
where the subscript n+1 denotes the clock cycle that follows cycle n;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] perform the same operations on another row;
(3) In two clock cycles, the output results of the two groups of accumulator registers are read out in turn and saved to the vector registers;
Steps (1) to (3) are then applied to the other two rows of data. After eight clock cycles, the four rows of data produced by the horizontal transform are stored, as row vectors, in the vector operation working registers $R_1$, $R_2$, $R_3$ and $R_4$ in turn;
In the vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]. The four rows of new data produced by the vertical transform are stored, as row vectors, in $R_5$, $R_6$, $R_7$ and $R_8$ in turn. The controller performs the following operations on the data output by the horizontal transform:
Two rows of data are processed as follows:
(1) In two clock cycles, the following is computed:
$VACC[0]_n = R_1 + R_3$
$VACC[1]_n = R_2 + R_4$
(2) In one clock cycle, the following is computed:
$VACC[0]_{n+1} = VACC[0]_n + VACC[1]_n$
$VACC[1]_{n+1} = -VACC[1]_n + VACC[0]_n$
(3) In two clock cycles, the data in VACC[0] and VACC[1] are saved to $R_5$ and $R_8$ respectively;
The other two rows of data are processed as follows:
(4) In two clock cycles, the following is computed:
$VACC[0]_n = R_1 - R_3$
$VACC[1]_n = R_2 - R_4$
(5) In one clock cycle, the following is computed:
$VACC[0]_{n+1} = VACC[0]_n + VACC[1]_n$
$VACC[1]_{n+1} = -VACC[1]_n + VACC[0]_n$
(6) In two clock cycles, the data in VACC[0] and VACC[1] are saved to $R_6$ and $R_7$ respectively;
After ten clock cycles, the four rows of data produced by the vertical transform are stored, as row vectors, in the vector operation working registers $R_5$, $R_6$, $R_7$ and $R_8$ in turn.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2005100617040A CN100442847C (en) | 2005-11-25 | 2005-11-25 | H.264 integer transformation accelerator |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1929603A (en) | 2007-03-14 |
CN100442847C (en) | 2008-12-10 |
Family
ID=37859354
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB2005100617040A Expired - Fee Related CN100442847C (en) | 2005-11-25 | 2005-11-25 | H.264 integer transformation accelerator |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100442847C (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014523673A (en) * | 2011-06-18 | 2014-09-11 | サムスン エレクトロニクス カンパニー リミテッド | Video conversion method and apparatus, inverse conversion method and apparatus |
CN103914426B (en) * | 2013-01-06 | 2016-12-28 | 中兴通讯股份有限公司 | A kind of method and device of multiple threads baseband signal |
US11334358B2 (en) | 2019-12-09 | 2022-05-17 | Amazon Technologies, Inc. | Hardware accelerator having reconfigurable instruction set and reconfigurable decoder |
US11841792B1 (en) * | 2019-12-09 | 2023-12-12 | Amazon Technologies, Inc. | Instructions with multiple memory access modes |
CN112383782B (en) * | 2020-10-10 | 2022-07-26 | 河南工程学院 | One-dimensional DCT/IDCT converter for bit vector conversion accumulation shift |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1414793A (en) * | 2001-10-23 | 2003-04-30 | 三星电子株式会社 | Compression video decoder with contraction image function and its method |
US20030093452A1 (en) * | 2001-08-23 | 2003-05-15 | Minhua Zhou | Video block transform |
CN1589025A (en) * | 2004-07-30 | 2005-03-02 | 联合信源数字音视频技术(北京)有限公司 | Vido decoder based on software and hardware cooperative control |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C17 | Cessation of patent right | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20081210 Termination date: 20101125 |