CN100442847C - H.264 integer transformation accelerator - Google Patents

Publication number
CN100442847C
Authority
CN
China
Prior art keywords
acc
vacc
data
clock cycle
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2005100617040A
Other languages
Chinese (zh)
Other versions
CN1929603A (en)
Inventor
严晓浪
秦兴
刘大可
葛海通
罗晓华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CNB2005100617040A priority Critical patent/CN100442847C/en
Publication of CN1929603A publication Critical patent/CN1929603A/en
Application granted granted Critical
Publication of CN100442847C publication Critical patent/CN100442847C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention relates to a device for accelerating the H.264 integer transform. It comprises: a data memory connected to a data bus; a vector operation working register group that receives the raw data from the data memory as vector data, each vector register R_i consisting of four scalar working registers R_i0, R_i1, R_i2 and R_i3; an 8-way vector data path that carries out the data operations; an accumulator register group that stores the intermediate data of the accelerated transform of R_i; and a controller that assigns operands to the data path and specifies its selection signals.

Description

Device for accelerating the H.264 integer transform
(1) Technical field
The present invention relates to a device for accelerating the H.264 integer transform.
(2) Background
Earlier video coding standards such as MPEG-2 and MPEG-4 generally use an 8×8 discrete cosine transform (DCT) for transform coding. The newer H.264 standard uses 4×4 integer transforms: the integer cosine transform, the inverse integer cosine transform and the integer Hadamard transform. Taken block by block, a 4×4 integer transform needs far fewer operations than an 8×8 DCT. However, H.264 applies the integer transform to a very large number of blocks, so for a video frame of the same size the accumulated cost of the integer transforms is much higher than that of the 8×8 DCT. Real-time H.264 encoding and decoding therefore requires the integer transform to be accelerated.
A general-purpose processor can perform the H.264 integer transform on shared hardware, but it is slow: the horizontal transform and the vertical transform each take 64 clock cycles. An application-specific integrated circuit (ASIC) accelerates the transform well, but its circuit structure is dedicated and expensive, it offers neither programmability nor hardware extensibility, and it suits only one coding standard. A single-instruction multiple-data (SIMD) processor can use vector operations to accelerate the transform to some degree. It is slower than an ASIC, but its hardware can be shared, it needs no costly dedicated equipment, and under software control it transforms much faster than a general-purpose processor.
A general SIMD processor must transpose the input matrix between the horizontal and the vertical transform, so its speed-up is less than ideal. The present invention proposes an extension of the SIMD architecture that uses dedicated instructions to accelerate the H.264 integer transform efficiently, while keeping the hardware shareable and the software flexible.
(3) Summary of the invention
To overcome the inability of prior-art H.264 integer transform devices to combine hardware extensibility with transform speed, the invention provides a fast H.264 integer transform acceleration device whose hardware can be shared with other software.
The technical solution of the invention is as follows:
A device for accelerating the H.264 integer transform comprises a data memory connected to a data bus, and further comprises:
A vector operation working register group, which receives the raw data of the data memory (vector data in the form of a 4×4 input matrix) and the intermediate data of the accumulator register group;
Each vector working register $R_i$ consists of four scalar working registers $R_{i0}$, $R_{i1}$, $R_{i2}$, $R_{i3}$ and stores either row i of the 4×4 matrix or the new row i output by the accumulator register group.
An 8-way vector data path, which operates on the row data of the 4×4 matrix according to the operands;
The 8-way vector data path comprises six stages:
The first stage consists of eight 2-to-1 selectors that choose whether the input comes from the accumulator register group or from the vector operation working register group; eight variables, that is, two rows of the 4×4 matrix, are processed at the same time;
The second stage consists of eight 8-to-1 selectors that choose the two variables to be combined;
In horizontal transform mode, the selection made according to the operands performs scalar operations among the elements of row i of the 4×4 matrix, with two rows processed simultaneously;
In vertical transform mode, the selection made according to the operands performs vector operations between rows of the 4×4 matrix;
The third stage consists of sixteen 2-to-1 selectors that choose whether each adder operand is multiplied by 2;
The fourth stage consists of sixteen 2-to-1 selectors that choose whether each adder operand is multiplied by 1/2;
The fifth stage consists of sixteen 2-to-1 selectors that choose whether each adder operand is negated;
The sixth stage consists of eight adders that perform the additions and output the results to the accumulator register group.
An accumulator register group comprising eight accumulator registers ACC[0] to ACC[7], which store the intermediate data of the accelerated transform of $R_i$; four of the accumulator registers form the vector register VACC[0] and the other four form the vector register VACC[1].
A controller, which assigns operands to the 8-way vector data path, specifies its selection signals, and controls reads and writes of the data memory;
In horizontal transform mode, the controller generates the operand information and controls the read of the data memory. For each row of the 4×4 matrix, two scalar operations are performed on one pair of elements while two scalar operations are performed on the other pair, and the results are held in the vector registers;
The controller then generates operand information again. Within the four accumulator registers of each of VACC[0] and VACC[1], two scalar operations are performed on one pair of values while two scalar operations are performed on the other pair, and the results are saved to the vector operation working register group;
These two successive operations are carried out for each 4×4 matrix.
In vertical transform mode, the controller generates the operand information; a vector operation between two rows of the 4×4 matrix is performed and held in VACC[0], and a vector operation between the other two rows is performed and held in VACC[1];
The controller then generates operand information again; a vector operation between VACC[0] and VACC[1] is performed, and the result is saved from the accumulator register group to the vector operation working register group;
These two successive operations are carried out for each 4×4 matrix.
In a preferred scheme, the accelerated H.264 transform is the integer cosine transform:
In horizontal transform mode, the controller has the data path read the 4×4 matrix from the data memory and performs the following operations on row i:
(1) In one clock cycle, compute:
$$
\begin{aligned}
\mathrm{ACC}[0]_n &= R_{i0} + R_{i3} \\
\mathrm{ACC}[2]_n &= R_{i1} - R_{i2} \\
\mathrm{ACC}[4]_n &= R_{i2} + R_{i1} \\
\mathrm{ACC}[6]_n &= -R_{i3} + R_{i0}
\end{aligned}
$$
where the subscript n denotes a given clock cycle;
(2) In one clock cycle, compute:
$$
\begin{aligned}
\mathrm{ACC}[0]_{n+1} &= \mathrm{ACC}[0]_n + \mathrm{ACC}[4]_n \\
\mathrm{ACC}[2]_{n+1} &= 2\,\mathrm{ACC}[2]_n + \mathrm{ACC}[6]_n \\
\mathrm{ACC}[4]_{n+1} &= -\mathrm{ACC}[4]_n + \mathrm{ACC}[0]_n \\
\mathrm{ACC}[6]_{n+1} &= -2\,\mathrm{ACC}[6]_n + \mathrm{ACC}[2]_n
\end{aligned}
$$
where the subscript n+1 denotes the clock cycle following cycle n;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] carry out the same operations on another row;
(3) In two clock cycles, the results held in the two groups of accumulator registers are read out in turn and saved to the vector registers;
Steps (1) to (3) are then applied to the other two rows. After eight clock cycles, the four horizontally transformed rows are stored as row vectors in the vector operation working registers $R_1$, $R_2$, $R_3$ and $R_4$.
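The two accumulator steps above can be traced in software. The following is a minimal Python sketch that transcribes steps (1) and (2) exactly as they are printed here, including the placement of the factor 2; it does not check that placement against the H.264 specification. The function names and the list-based row representation are illustrative assumptions, not part of the patent.

```python
def forward_row_as_printed(r):
    """Apply steps (1) and (2) to one row r = [R_i0, R_i1, R_i2, R_i3].

    Returns the values of ACC[0], ACC[2], ACC[4], ACC[6] after step (2).
    """
    # Step (1): four additions in one clock cycle.
    acc0 = r[0] + r[3]
    acc2 = r[1] - r[2]
    acc4 = r[2] + r[1]
    acc6 = -r[3] + r[0]
    # Step (2): every update reads the step (1) values, as in the
    # equations with subscripts n and n+1.
    return [acc0 + acc4, 2 * acc2 + acc6, -acc4 + acc0, -2 * acc6 + acc2]


def horizontal_pass(block):
    """Process a 4x4 block two rows at a time (hypothetical helper).

    In the device, one row of each pair uses the even accumulators and the
    other uses the odd accumulators in the same clock cycles.
    """
    out = []
    for first, second in ((0, 1), (2, 3)):
        out.append(forward_row_as_printed(block[first]))
        out.append(forward_row_as_printed(block[second]))
    return out
```

For a constant row such as `[1, 1, 1, 1]` only the first output is non-zero, which is the behaviour expected of the sum term ACC[0].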
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4], ACC[6] and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5], ACC[7]. The four new rows produced by the vertical transform are stored as row vectors in $R_5$, $R_6$, $R_7$ and $R_8$. The controller performs the following operations on the output of the horizontal transform:
Operation on two rows of data:
(1) In two clock cycles, compute respectively:
$$
\begin{aligned}
\mathrm{VACC}[0]_n &= R_1 + R_4 \\
\mathrm{VACC}[1]_n &= R_2 + R_3
\end{aligned}
$$
(2) In one clock cycle, compute:
$$
\begin{aligned}
\mathrm{VACC}[0]_{n+1} &= \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n \\
\mathrm{VACC}[1]_{n+1} &= -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n
\end{aligned}
$$
(3) In two clock cycles, save the data of VACC[0] and VACC[1] to $R_5$ and $R_7$ respectively;
Operation on the other two rows of data:
(4) In two clock cycles, compute respectively:
$$
\begin{aligned}
\mathrm{VACC}[0]_n &= R_1 - R_4 \\
\mathrm{VACC}[1]_n &= R_2 - R_3
\end{aligned}
$$
(5) In one clock cycle, compute:
$$
\begin{aligned}
\mathrm{VACC}[0]_{n+1} &= 2\,\mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n \\
\mathrm{VACC}[1]_{n+1} &= -2\,\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n
\end{aligned}
$$
(6) In two clock cycles, save the data of VACC[0] and VACC[1] to $R_6$ and $R_8$ respectively;
After ten clock cycles, the four vertically transformed rows are stored as row vectors in the vector operation working registers $R_5$, $R_6$, $R_7$ and $R_8$.
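The vertical steps operate on whole rows, so they can be sketched with element-wise list arithmetic. In the Python sketch below, the helper names `vadd` and `vscale` and the function name `forward_vertical` are assumptions made for illustration; the arithmetic follows steps (1) to (6), with the second update of step (2) reading VACC[0].

```python
def vadd(a, b):
    """Element-wise sum of two 4-element rows."""
    return [x + y for x, y in zip(a, b)]


def vscale(k, a):
    """Multiply every element of a row by the integer k."""
    return [k * x for x in a]


def forward_vertical(r1, r2, r3, r4):
    """Return [R5, R6, R7, R8] from the horizontally transformed rows R1..R4."""
    # Steps (1) to (3): sums of the outer and inner row pairs.
    vacc0 = vadd(r1, r4)
    vacc1 = vadd(r2, r3)
    r5 = vadd(vacc0, vacc1)
    r7 = vadd(vscale(-1, vacc1), vacc0)
    # Steps (4) to (6): differences of the outer and inner row pairs.
    vacc0 = vadd(r1, vscale(-1, r4))
    vacc1 = vadd(r2, vscale(-1, r3))
    r6 = vadd(vscale(2, vacc0), vacc1)
    r8 = vadd(vscale(-2, vacc1), vacc0)
    return [r5, r6, r7, r8]
```

Expanding the expressions shows that R5 to R8 are the products of the rows (1, 1, 1, 1), (2, 1, -1, -2), (1, -1, -1, 1) and (1, -2, 2, -1) with the input rows, which is the usual form of the H.264 forward core matrix.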
In a preferred scheme, the accelerated H.264 transform is the inverse integer cosine transform:
In horizontal transform mode, the controller has the data path read the 4×4 matrix from the data memory and performs the following operations on row i:
(1) In one clock cycle, compute:
$$
\begin{aligned}
\mathrm{ACC}[0]_n &= R_{i0} + R_{i2} \\
\mathrm{ACC}[2]_n &= \tfrac{1}{2} R_{i1} - R_{i3} \\
\mathrm{ACC}[4]_n &= -R_{i2} + R_{i0} \\
\mathrm{ACC}[6]_n &= \tfrac{1}{2} R_{i3} + R_{i1}
\end{aligned}
$$
where the subscript n denotes a given clock cycle;
(2) In one clock cycle, compute:
$$
\begin{aligned}
\mathrm{ACC}[0]_{n+1} &= \mathrm{ACC}[0]_n + \mathrm{ACC}[6]_n \\
\mathrm{ACC}[2]_{n+1} &= \mathrm{ACC}[2]_n + \mathrm{ACC}[4]_n \\
\mathrm{ACC}[4]_{n+1} &= \mathrm{ACC}[4]_n - \mathrm{ACC}[2]_n \\
\mathrm{ACC}[6]_{n+1} &= -\mathrm{ACC}[6]_n + \mathrm{ACC}[0]_n
\end{aligned}
$$
where the subscript n+1 denotes the clock cycle following cycle n;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] carry out the same operations on another row;
(3) In two clock cycles, the results held in the two groups of accumulator registers are read out in turn and saved to the vector registers;
Steps (1) to (3) are then applied to the other two rows. After eight clock cycles, the four horizontally transformed rows are stored as row vectors in the vector operation working registers $R_1$, $R_2$, $R_3$ and $R_4$.
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4], ACC[6] and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5], ACC[7]. The four new rows produced by the vertical transform are stored as row vectors in $R_5$, $R_6$, $R_7$ and $R_8$. The controller performs the following operations on the output of the horizontal transform:
Operation on two rows of data:
(1) In two clock cycles, compute:
$$
\begin{aligned}
\mathrm{VACC}[0]_n &= R_1 + R_3 \\
\mathrm{VACC}[1]_n &= R_2 + \tfrac{1}{2} R_4
\end{aligned}
$$
(2) In one clock cycle, compute:
$$
\begin{aligned}
\mathrm{VACC}[0]_{n+1} &= \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n \\
\mathrm{VACC}[1]_{n+1} &= -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n
\end{aligned}
$$
(3) In two clock cycles, save the data of VACC[0] and VACC[1] to $R_5$ and $R_8$ respectively;
Operation on the other two rows of data:
(4) In two clock cycles, compute respectively:
$$
\begin{aligned}
\mathrm{VACC}[0]_n &= R_1 - R_3 \\
\mathrm{VACC}[1]_n &= \tfrac{1}{2} R_2 - R_4
\end{aligned}
$$
(5) In one clock cycle, compute:
$$
\begin{aligned}
\mathrm{VACC}[0]_{n+1} &= \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n \\
\mathrm{VACC}[1]_{n+1} &= -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n
\end{aligned}
$$
(6) In two clock cycles, save the data of VACC[0] and VACC[1] to $R_6$ and $R_7$ respectively;
After ten clock cycles, the four vertically transformed rows are stored as row vectors in the vector operation working registers $R_5$, $R_6$, $R_7$ and $R_8$.
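The inverse scheme can be followed the same way. The Python sketch below covers both passes and rests on one explicit assumption: the multiplication by 1/2 is modelled as an arithmetic right shift by one bit (`half`). The text only says "multiply by 1/2", so the rounding behaviour of the real device is not known from this description. All function names are illustrative.

```python
def half(x: int) -> int:
    # Assumption: "multiply by 1/2" is modelled as an arithmetic right shift.
    return x >> 1


def inverse_row(r):
    """Horizontal steps (1) and (2) for one row r = [R_i0, R_i1, R_i2, R_i3]."""
    acc0 = r[0] + r[2]
    acc2 = half(r[1]) - r[3]
    acc4 = -r[2] + r[0]
    acc6 = half(r[3]) + r[1]
    return [acc0 + acc6, acc2 + acc4, acc4 - acc2, -acc6 + acc0]


def inverse_vertical(r1, r2, r3, r4):
    """Vertical steps (1) to (6); returns [R5, R6, R7, R8]."""
    vacc0 = [a + c for a, c in zip(r1, r3)]
    vacc1 = [b + half(d) for b, d in zip(r2, r4)]
    r5 = [x + y for x, y in zip(vacc0, vacc1)]
    r8 = [-y + x for x, y in zip(vacc0, vacc1)]
    vacc0 = [a - c for a, c in zip(r1, r3)]
    vacc1 = [half(b) - d for b, d in zip(r2, r4)]
    r6 = [x + y for x, y in zip(vacc0, vacc1)]
    r7 = [-y + x for x, y in zip(vacc0, vacc1)]
    return [r5, r6, r7, r8]


def inverse_2d(block):
    """Horizontal pass on every row, then the vertical pass."""
    return inverse_vertical(*[inverse_row(row) for row in block])
```

A block that contains only a DC coefficient spreads that value over all sixteen outputs, which gives a quick sanity check of the data flow.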
In a preferred scheme, the accelerated H.264 transform is the integer Hadamard transform:
In horizontal transform mode, the controller has the data path read the 4×4 matrix from the data memory and performs the following operations on row i:
(1) In one clock cycle, compute:
$$
\begin{aligned}
\mathrm{ACC}[0]_n &= R_{i0} + R_{i3} \\
\mathrm{ACC}[2]_n &= R_{i1} - R_{i2} \\
\mathrm{ACC}[4]_n &= R_{i2} + R_{i1} \\
\mathrm{ACC}[6]_n &= -R_{i3} + R_{i0}
\end{aligned}
$$
where the subscript n denotes a given clock cycle;
(2) In one clock cycle, compute:
$$
\begin{aligned}
\mathrm{ACC}[0]_{n+1} &= \mathrm{ACC}[0]_n + \mathrm{ACC}[4]_n \\
\mathrm{ACC}[2]_{n+1} &= \mathrm{ACC}[2]_n + \mathrm{ACC}[6]_n \\
\mathrm{ACC}[4]_{n+1} &= -\mathrm{ACC}[4]_n + \mathrm{ACC}[0]_n \\
\mathrm{ACC}[6]_{n+1} &= \mathrm{ACC}[6]_n - \mathrm{ACC}[2]_n
\end{aligned}
$$
where the subscript n+1 denotes the clock cycle following cycle n;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] carry out the same operations on another row;
(3) In two clock cycles, the results held in the two groups of accumulator registers are read out in turn and saved to the vector registers;
Steps (1) to (3) are then applied to the other two rows. After eight clock cycles, the four horizontally transformed rows are stored as row vectors in the vector operation working registers $R_1$, $R_2$, $R_3$ and $R_4$.
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4], ACC[6] and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5], ACC[7]. The four new rows produced by the vertical transform are stored as row vectors in $R_5$, $R_6$, $R_7$ and $R_8$. The controller performs the following operations on the output of the horizontal transform:
Operation on two rows of data:
(1) In two clock cycles, compute:
$$
\begin{aligned}
\mathrm{VACC}[0]_n &= R_1 + R_3 \\
\mathrm{VACC}[1]_n &= R_2 + R_4
\end{aligned}
$$
(2) In one clock cycle, compute:
$$
\begin{aligned}
\mathrm{VACC}[0]_{n+1} &= \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n \\
\mathrm{VACC}[1]_{n+1} &= -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n
\end{aligned}
$$
(3) In two clock cycles, save the data of VACC[0] and VACC[1] to $R_5$ and $R_8$ respectively;
Operation on the other two rows of data:
(4) In two clock cycles, compute:
$$
\begin{aligned}
\mathrm{VACC}[0]_n &= R_1 - R_3 \\
\mathrm{VACC}[1]_n &= R_2 - R_4
\end{aligned}
$$
(5) In one clock cycle, compute:
$$
\begin{aligned}
\mathrm{VACC}[0]_{n+1} &= \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n \\
\mathrm{VACC}[1]_{n+1} &= -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n
\end{aligned}
$$
(6) In two clock cycles, save the data of VACC[0] and VACC[1] to $R_6$ and $R_7$ respectively;
After ten clock cycles, the four vertically transformed rows are stored as row vectors in the vector operation working registers $R_5$, $R_6$, $R_7$ and $R_8$.
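Because the Hadamard scheme uses only additions and subtractions, its result can be compared directly with a matrix product. The Python sketch below runs the horizontal and vertical steps as described and leaves the comparison with H·X·Hᵀ to the caller. The matrix `H` (rows ordered as this scheme emits them) and all function names are assumptions made for the example.

```python
# Row order matches the outputs ACC[0], ACC[2], ACC[4], ACC[6] and R5..R8.
H = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def hadamard_row(r):
    """Horizontal steps (1) and (2) for one row r = [R_i0, R_i1, R_i2, R_i3]."""
    acc0 = r[0] + r[3]
    acc2 = r[1] - r[2]
    acc4 = r[2] + r[1]
    acc6 = -r[3] + r[0]
    return [acc0 + acc4, acc2 + acc6, -acc4 + acc0, acc6 - acc2]


def hadamard_vertical(r1, r2, r3, r4):
    """Vertical steps (1) to (6); returns [R5, R6, R7, R8]."""
    vacc0 = [a + c for a, c in zip(r1, r3)]
    vacc1 = [b + d for b, d in zip(r2, r4)]
    r5 = [x + y for x, y in zip(vacc0, vacc1)]
    r8 = [-y + x for x, y in zip(vacc0, vacc1)]
    vacc0 = [a - c for a, c in zip(r1, r3)]
    vacc1 = [b - d for b, d in zip(r2, r4)]
    r6 = [x + y for x, y in zip(vacc0, vacc1)]
    r7 = [-y + x for x, y in zip(vacc0, vacc1)]
    return [r5, r6, r7, r8]


def hadamard_2d(block):
    """Horizontal pass on every row, then the vertical pass."""
    return hadamard_vertical(*[hadamard_row(row) for row in block])
```

Note that the horizontal pass pairs elements (0, 3) and (1, 2) while the vertical pass pairs rows (1, 3) and (2, 4); both pairings lead to the same set of Hadamard rows.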
Operating principle of the invention: in horizontal transform mode, the controller generates the operand information, assigns operands to the 8-way vector data path, specifies its selection signals and controls the read of the data memory, so that the 4×4 matrix to be transformed is loaded into the vector operation working register group. According to the operands, four of the eight lanes perform two scalar operations on one pair of elements of a row while performing two scalar operations on the other pair of elements of that row; the other four lanes simultaneously process another row of the 4×4 matrix. The results are held in the vector registers;
The controller then generates operand information again. According to the operands, the 8-way vector data path performs two scalar operations on one pair of values in the four accumulator registers of each of VACC[0] and VACC[1] while performing two scalar operations on the other pair, and saves the results to the vector operation working register group. This completes the horizontal transform of two rows of the 4×4 matrix;
The same operations are applied once more to the 4×4 matrix, completing its horizontal transform.
In vertical transform mode, the controller generates the operand information, assigns operands to the 8-way vector data path and specifies its selection signals. The 8-way vector data path performs a vector operation between two rows of the 4×4 matrix and holds the result in VACC[0], then performs a vector operation between the other two rows and holds the result in VACC[1];
The controller then generates operand information again. According to the operands, the 8-way vector data path performs a vector operation between VACC[0] and VACC[1], and the result is saved from the accumulator register group to the vector operation working register group. This completes two rows of the vertical transform of the 4×4 matrix;
The same operations are applied once more to the 4×4 matrix, completing its vertical transform.
Beneficial effect of the invention: by extending the SIMD architecture with powerful dedicated instructions, the invention speeds up the integer transform.
(4) Description of the drawings
Fig. 1 is the overall block diagram of the H.264 integer transform accelerator of the invention;
Fig. 2 is the detailed block diagram of the H.264 integer transform accelerator.
(5) Embodiments
The invention is further described below with reference to the drawings.
Embodiment one
Referring to the drawings, a device for accelerating the H.264 integer transform comprises a data memory connected to a data bus, and further comprises:
A vector operation working register group, which receives the raw data of the data memory (vector data in the form of a 4×4 input matrix) and the intermediate data of the accumulator register group;
Each vector working register $R_i$ consists of four scalar working registers $R_{i0}$, $R_{i1}$, $R_{i2}$, $R_{i3}$ and stores either row i of the 4×4 matrix or the new row i output by the accumulator register group.
An 8-way vector data path, which operates on the row data of the 4×4 matrix according to the operands;
The 8-way vector data path comprises six stages:
The first stage consists of eight 2-to-1 selectors that choose whether the input comes from the accumulator register group or from the vector operation working register group; eight variables, that is, two rows of the 4×4 matrix, are processed at the same time;
The second stage consists of eight 8-to-1 selectors that choose the two variables to be combined;
In horizontal transform mode, the selection made according to the operands performs scalar operations among the elements of row i of the 4×4 matrix, with two rows processed simultaneously;
In vertical transform mode, the selection made according to the operands performs vector operations between rows of the 4×4 matrix;
The third stage consists of sixteen 2-to-1 selectors that choose whether each adder operand is multiplied by 2;
The fourth stage consists of sixteen 2-to-1 selectors that choose whether each adder operand is multiplied by 1/2;
The fifth stage consists of sixteen 2-to-1 selectors that choose whether each adder operand is negated;
The sixth stage consists of eight adders that perform the additions and output the results to the accumulator register group.
An accumulator register group comprising eight accumulator registers ACC[0] to ACC[7], which store the intermediate data of the accelerated transform of $R_i$; four of the accumulator registers form the vector register VACC[0] and the other four form the vector register VACC[1].
A controller, which assigns operands to the 8-way vector data path, specifies its selection signals, and controls reads and writes of the data memory;
In horizontal transform mode, the controller generates the operand information and controls the read of the data memory. For each row of the 4×4 matrix, two scalar operations are performed on one pair of elements while two scalar operations are performed on the other pair, and the results are held in the vector registers;
The controller then generates operand information again. Within the four accumulator registers of each of VACC[0] and VACC[1], two scalar operations are performed on one pair of values while two scalar operations are performed on the other pair, and the results are saved to the vector operation working register group;
These two successive operations are carried out for each 4×4 matrix.
In vertical transform mode, the controller generates the operand information; a vector operation between two rows of the 4×4 matrix is performed and held in VACC[0], and a vector operation between the other two rows is performed and held in VACC[1];
The controller then generates operand information again; a vector operation between VACC[0] and VACC[1] is performed, and the result is saved from the accumulator register group to the vector operation working register group;
These two successive operations are carried out for each 4×4 matrix.
Fig. 1 shows the overall block diagram of the H.264 integer transform accelerator of the invention. The vector operation working register group 1, the data memory 4 and the accumulator register group 3 are connected by a bus for data transfer. Each vector working register can be regarded as four parallel scalar working registers. The 8-way vector data path 2 is connected to the vector operation working register group 1 through a channel of twice the bus width, so the contents of two vector working registers can be accessed at once. Path selection in the 8-way vector data path is controlled by the controller 6, which reads instructions from the program memory 5, converts them into control signals and outputs them to the 8-way vector data path.
Fig. 2 shows the detailed block diagram of the H.264 integer transform accelerator. The data path can be divided into six stages:
Stage 1 selects the input variables, either from the accumulator register group or from the vector operation working register group; eight variables (two rows of the 4×4 matrix) are handled at the same time. This stage has eight 2-to-1 selectors and needs 8 control bits, denoted A0 to A7 from left to right in Fig. 2; the same convention applies below.
Stage 2 selects the two variables to be combined. This stage has eight 8-to-1 selectors and needs 3 × 8 = 24 control bits, denoted B0 to B23 from left to right in Fig. 2.
Stage 3 selects whether each adder operand is multiplied by 2. This stage has sixteen 2-to-1 selectors and needs 16 control bits, denoted C0 to C15 from left to right in Fig. 2.
Stage 4 selects whether each adder operand is multiplied by 1/2. This stage has sixteen 2-to-1 selectors and needs 16 control bits, denoted D0 to D15 from left to right in Fig. 2.
Stage 5 selects whether each adder operand is negated. This stage has sixteen 2-to-1 selectors and needs 16 control bits, denoted E0 to E15 from left to right in Fig. 2.
Stage 6 performs the additions and outputs the results to the accumulator register group.
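The six stages can be pictured as one configurable adder lane. The Python sketch below is a behavioural model only, under several assumptions: the eight inputs are already chosen by stage 1 and passed in as a list, the two rows are interleaved so that row i occupies the even paths, the control bits are grouped per operand in a hypothetical `OperandCtl` record, and the multiplication by 1/2 is modelled as a right shift. None of these names or encodings come from Fig. 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperandCtl:
    """Per-operand control bits for stages 3 to 5 (hypothetical grouping)."""
    double: bool = False   # stage 3: multiply by 2
    halve: bool = False    # stage 4: multiply by 1/2, modelled as >> 1
    negate: bool = False   # stage 5: negate


def condition(value: int, ctl: OperandCtl) -> int:
    """Apply stages 3 to 5 to one adder operand."""
    if ctl.double:
        value <<= 1
    if ctl.halve:
        value >>= 1
    if ctl.negate:
        value = -value
    return value


def lane(inputs, first, second, ctl_first=OperandCtl(), ctl_second=OperandCtl()):
    """One adder lane.

    Stage 2 picks two of the eight inputs by path index, stages 3 to 5
    condition each operand, and stage 6 adds them.
    """
    return condition(inputs[first], ctl_first) + condition(inputs[second], ctl_second)


# Example: row i = (5, 11, 8, 10) on the even paths 0, 2, 4, 6 (assumed layout).
paths = [5, 0, 11, 0, 8, 0, 10, 0]
acc6 = lane(paths, 6, 0, ctl_first=OperandCtl(negate=True))  # -R_i3 + R_i0
```

With eight such lanes and two operands per lane, stages 3 to 5 each need sixteen selectors, which agrees with the control-bit counts C0 to C15, D0 to D15 and E0 to E15 given above.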
Embodiment two
In a preferred scheme, the accelerated H.264 transform is the H.264 integer cosine transform:
In horizontal transform mode, the controller has the data path read the 4×4 matrix from the data memory and performs the following operations on row i:
(1) The 8-way vector data path is configured according to Table 1:
Table 1 (configuration of the 8-way vector data path; image C20051006170400171 in the original)
Stage 1: A0 to A7 are set to "1", selecting input from the vector operation working register group;
Stage 2: once one variable is fixed, this stage selects the other variable to be combined with it. For example, B0=6 means that, with $R_{i0}$ on path 0 fixed, $R_{i3}$ on path 6 is selected to be combined with $R_{i0}$;
Stage 3 selects whether each adder operand is multiplied by 2. For example, C0=0, C1=0 means that neither $R_{i0}$ nor $R_{i3}$ is multiplied by 2;
Stage 4 selects whether each adder operand is multiplied by 1/2. For example, D0=0, D1=0 means that neither $R_{i0}$ nor $R_{i3}$ is multiplied by 1/2;
Stage 5 selects whether each adder operand is negated. For example, E0=0, E1=0 means that neither $R_{i0}$ nor $R_{i3}$ is negated;
Stage 6 performs the addition, for example $\mathrm{ACC}[0] = R_{i0} + R_{i3}$, and outputs the result to the accumulator register group;
With the 8-way vector data path configured according to Table 1, compute in one clock cycle:
$$
\begin{aligned}
\mathrm{ACC}[0]_n &= R_{i0} + R_{i3} \\
\mathrm{ACC}[2]_n &= R_{i1} - R_{i2} \\
\mathrm{ACC}[4]_n &= R_{i2} + R_{i1} \\
\mathrm{ACC}[6]_n &= -R_{i3} + R_{i0}
\end{aligned}
$$
where the subscript n denotes a given clock cycle;
(2) The 8-way vector data path is configured according to Table 2:
Table 2 (configuration of the 8-way vector data path; image C20051006170400182 in the original)
Stage 1: A0 to A7 are set to "0", selecting input from the accumulator register group;
With the 8-way vector data path configured according to Table 2, compute in one clock cycle:
$$
\begin{aligned}
\mathrm{ACC}[0]_{n+1} &= \mathrm{ACC}[0]_n + \mathrm{ACC}[4]_n \\
\mathrm{ACC}[2]_{n+1} &= 2\,\mathrm{ACC}[2]_n + \mathrm{ACC}[6]_n \\
\mathrm{ACC}[4]_{n+1} &= -\mathrm{ACC}[4]_n + \mathrm{ACC}[0]_n \\
\mathrm{ACC}[6]_{n+1} &= -2\,\mathrm{ACC}[6]_n + \mathrm{ACC}[2]_n
\end{aligned}
$$
where the subscript n+1 denotes the clock cycle following cycle n;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] carry out the same operations on another row;
(3) In two clock cycles, the results held in the two groups of accumulator registers are read out in turn and saved to the vector registers;
Steps (1) to (3) are then applied to the other two rows. After eight clock cycles, the four horizontally transformed rows are stored as row vectors in the vector operation working registers $R_1$, $R_2$, $R_3$ and $R_4$;
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4], ACC[6] and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5], ACC[7]. The four new rows produced by the vertical transform are stored as row vectors in $R_5$, $R_6$, $R_7$ and $R_8$. The controller performs the following operations on the output of the horizontal transform:
Operation on two rows of data:
(1) The 8-way vector data path is configured according to Table 3:
Table 3 (configuration of the 8-way vector data path; image C20051006170400191 in the original)
In the first clock cycle, data are read from the vector operation working register group, a vector operation between two rows of the 4×4 matrix is performed, and the result is held in VACC[0]. In the second clock cycle, the other two rows are read from the vector operation working register group, a vector operation between them is performed, and the result is held in VACC[1];
With the 8-way vector data path configured according to Table 3, compute respectively in two clock cycles:
$$
\begin{aligned}
\mathrm{VACC}[0]_n &= R_1 + R_4 \\
\mathrm{VACC}[1]_n &= R_2 + R_3
\end{aligned}
$$
(2) The 8-way vector data path is configured according to Table 4:
Table 4
With the 8-way vector data path configured according to Table 4, compute in one clock cycle:
$$
\begin{aligned}
\mathrm{VACC}[0]_{n+1} &= \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n \\
\mathrm{VACC}[1]_{n+1} &= -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n
\end{aligned}
$$
(3) In two clock cycles, save the data of VACC[0] and VACC[1] to $R_5$ and $R_7$ respectively;
Operation on the other two rows of data:
(4) The 8-way vector data path is configured according to Table 5:
Table 5 (configuration of the 8-way vector data path; image C20051006170400201 in the original)
With the 8-way vector data path configured according to Table 5, compute respectively in two clock cycles:
$$
\begin{aligned}
\mathrm{VACC}[0]_n &= R_1 - R_4 \\
\mathrm{VACC}[1]_n &= R_2 - R_3
\end{aligned}
$$
(5) The 8-way vector data path is configured according to Table 6:
Table 6 (configuration of the 8-way vector data path; image C20051006170400202 in the original)
With the 8-way vector data path configured according to Table 6, compute in one clock cycle:
$$
\begin{aligned}
\mathrm{VACC}[0]_{n+1} &= 2\,\mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n \\
\mathrm{VACC}[1]_{n+1} &= -2\,\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n
\end{aligned}
$$
(6) In two clock cycles, save the data of VACC[0] and VACC[1] to $R_6$ and $R_8$ respectively;
After ten clock cycles, the four vertically transformed rows are stored as row vectors in the vector operation working registers $R_5$, $R_6$, $R_7$ and $R_8$.
The remaining structure and implementation of this embodiment are the same as in embodiment one.
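The cycle counts quoted in this embodiment can be tallied step by step. The short Python sketch below adds them up and compares the total with the general-purpose processor figure from the background section, reading that figure as 64 cycles for each of the two passes. That reading, and the variable names, are assumptions; the resulting ratio is a rough indication only.

```python
# Horizontal pass: step (1) and step (2) take one cycle each and the
# write-back in step (3) takes two; the sequence runs once per pair of rows.
HORIZONTAL_CYCLES = 2 * (1 + 1 + 2)

# Vertical pass: two cycles to form VACC[0] and VACC[1], one cycle to combine
# them, two cycles to write back; the sequence runs twice.
VERTICAL_CYCLES = 2 * (2 + 1 + 2)

ACCELERATED_TOTAL = HORIZONTAL_CYCLES + VERTICAL_CYCLES

# Assumed reading of the background section: 64 cycles per pass.
GENERAL_PURPOSE_TOTAL = 64 + 64

speedup = GENERAL_PURPOSE_TOTAL / ACCELERATED_TOTAL
print(f"{ACCELERATED_TOTAL} cycles vs {GENERAL_PURPOSE_TOTAL}: about {speedup:.1f}x")
```

The tally reproduces the eight and ten cycles stated for the horizontal and vertical passes.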
Embodiment three
H.264 conversion is quickened to be preferably as follows scheme, and adopting H.264, the integer anti-cosine transform quickens:
Described controller under the horizontal transformation pattern, 4 * 4 matrix datas in the data path reading of data memory, and the i line data carried out following computing:
(1) 8 road vector data paths are configured according to table 7:
Figure C20051006170400211
Table 7
According to the configuration of 8 road vector data contimuity meters 7, in a clock cycle, calculate:
ACC [ 0 ] n = R i 0 + R i 2 ACC [ 2 ] n = 1 2 R i 1 - R i 3 ACC [ 4 ] n = - R i 2 + R i 0 ACC [ 6 ] n = 1 2 R i 3 + R i 1
Wherein subscript n is represented some clock cycle;
(2) 8 road vector data paths are configured according to table 8:
Table 8
According to the configuration of 8 road vector data contimuity meters 8, in a clock cycle, calculate:
ACC [ 0 ] n + 1 = ACC [ 0 ] n + ACC [ 6 ] n ACC [ 2 ] n + 1 = ACC [ 2 ] n + ACC [ 4 ] n ACC [ 4 ] n + 1 = ACC [ 4 ] n - ACC [ 2 ] n ACC [ 6 ] n + 1 = - ACC [ 6 ] n + ACC [ 0 ] n
Wherein subscript n+1 is illustrated in the next clock cycle of the clock cycle of back computing indication;
With (1), (2) step with clock cycle, ACC[1], ACC[3], ACC[5], ACC[7] carry out another the row computing;
(3) two clock cycle, read the content among the output result of two groups of accumulator registers respectively, send in the vector register and preserve;
Other two line data are carried out (1)~(3) operation, and after eight clock cycle, four groups of line data behind the horizontal transformation leave vector calculus work register R in by row vector successively 1, R 2, R 3And R 4In;
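Steps (1) and (2) for a single row can be sketched as follows. This is a minimal software model under stated assumptions: the function name `inverse_row_butterfly` is hypothetical, and multiplication by 1/2 is modelled as an arithmetic right shift by one bit, which the text above does not specify.

```python
def inverse_row_butterfly(row):
    """Apply steps (1) and (2) to one row [R_i0, R_i1, R_i2, R_i3]."""
    r0, r1, r2, r3 = row

    # Step (1): first clock cycle, four accumulators loaded in parallel.
    acc0 = r0 + r2
    acc2 = (r1 >> 1) - r3      # 1/2 * R_i1 - R_i3 (halving assumed to be >> 1)
    acc4 = -r2 + r0
    acc6 = (r3 >> 1) + r1      # 1/2 * R_i3 + R_i1

    # Step (2): second clock cycle, all updates read the step (1) values.
    return [acc0 + acc6, acc2 + acc4, acc4 - acc2, -acc6 + acc0]


print(inverse_row_butterfly([4, 2, 0, 6]))  # [9, -1, 9, -1]
```

For the input row [4, 2, 0, 6], step (1) gives 4, −5, 4 and 5, and step (2) combines them into [9, −1, 9, −1].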
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]. The four new rows of data produced by the vertical transform are stored, as row vectors, in R5, R6, R7 and R8 in order. The controller performs the following operations on the data output by the horizontal transform:
Two rows of data are processed:
(1) The 8-way vector data path is configured according to Table 9:
Table 9
With the 8-way vector data path configured according to Table 9, the following are calculated in two clock cycles:
$$\mathrm{VACC}[0]_n = R_1 + R_3$$
$$\mathrm{VACC}[1]_n = R_2 + \tfrac{1}{2}R_4$$
(2) With the 8-way vector data path configured according to Table 4, the following are calculated in one clock cycle:
$$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n$$
$$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n$$
(3) In two clock cycles, the data of VACC[0] and VACC[1] are saved to R5 and R8, respectively;
The other two rows of data are processed:
(4) The 8-way vector data path is configured according to Table 10:
Table 10
With the 8-way vector data path configured according to Table 10, the following are calculated in two clock cycles, respectively:
$$\mathrm{VACC}[0]_n = R_1 - R_3$$
$$\mathrm{VACC}[1]_n = \tfrac{1}{2}R_2 - R_4$$
(5) With the 8-way vector data path configured according to Table 4, the following are calculated in one clock cycle:
$$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n$$
$$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n$$
(6) In two clock cycles, the data of VACC[0] and VACC[1] are saved to R6 and R7, respectively;
After ten clock cycles, the four groups of row data produced by the vertical transform are stored, as row vectors, in the vector operation working registers R5, R6, R7 and R8 in order.
The remaining structure and implementation of this embodiment are the same as in Embodiment Three's predecessor, Embodiment One.
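The vertical stage of this embodiment, steps (1) to (6) operating on whole row vectors, can be sketched as below. The helper names (`vadd`, `vsub`, `vhalf`, `inverse_vertical`) are hypothetical, and halving is again assumed to be an arithmetic right shift by one bit.

```python
def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

def vsub(u, v):
    return [a - b for a, b in zip(u, v)]

def vhalf(u):
    # Multiplication by 1/2, modelled here as an arithmetic right shift.
    return [a >> 1 for a in u]

def inverse_vertical(r1, r2, r3, r4):
    """Return (R5, R6, R7, R8) from the horizontally transformed rows R1..R4."""
    # Steps (1)-(3): results go to R5 and R8.
    vacc0 = vadd(r1, r3)                # VACC[0] = R1 + R3
    vacc1 = vadd(r2, vhalf(r4))         # VACC[1] = R2 + 1/2 * R4
    r5, r8 = vadd(vacc0, vacc1), vsub(vacc0, vacc1)

    # Steps (4)-(6): results go to R6 and R7.
    vacc0 = vsub(r1, r3)                # VACC[0] = R1 - R3
    vacc1 = vsub(vhalf(r2), r4)         # VACC[1] = 1/2 * R2 - R4
    r6, r7 = vadd(vacc0, vacc1), vsub(vacc0, vacc1)
    return r5, r6, r7, r8


r5, r6, r7, r8 = inverse_vertical([4, 8, 0, 0], [2, 4, 0, 0],
                                  [0, 2, 0, 0], [6, 2, 0, 0])
print(r5, r6, r7, r8)
```

Each group of three steps takes 2 + 1 + 2 clock cycles in the text above, which accounts for the ten cycles of the vertical stage.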
Embodiment four
The H.264 transform acceleration preferably uses the following scheme, which accelerates the H.264 integer Hadamard transform:
In horizontal transform mode, the controller causes the data path to read the 4×4 matrix data from the data memory and performs the following operations on the data of row i:
(1) With the 8-way vector data path configured according to Table 1, the following are calculated in one clock cycle:
$$\begin{aligned} \mathrm{ACC}[0]_n &= R_{i0} + R_{i3} \\ \mathrm{ACC}[2]_n &= R_{i1} - R_{i2} \\ \mathrm{ACC}[4]_n &= R_{i2} + R_{i1} \\ \mathrm{ACC}[6]_n &= -R_{i3} + R_{i0} \end{aligned}$$
where the subscript n denotes a given clock cycle;
(2) The 8-way vector data path is configured according to Table 11:
Figure C20051006170400233
Table 11
With the 8-way vector data path configured according to Table 11, the following are calculated in one clock cycle:
$$\begin{aligned} \mathrm{ACC}[0]_{n+1} &= \mathrm{ACC}[0]_n + \mathrm{ACC}[4]_n \\ \mathrm{ACC}[2]_{n+1} &= \mathrm{ACC}[2]_n + \mathrm{ACC}[6]_n \\ \mathrm{ACC}[4]_{n+1} &= -\mathrm{ACC}[4]_n + \mathrm{ACC}[0]_n \\ \mathrm{ACC}[6]_{n+1} &= \mathrm{ACC}[6]_n - \mathrm{ACC}[2]_n \end{aligned}$$
where the subscript n+1 denotes the clock cycle immediately following clock cycle n;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] perform the same operations on another row;
(3) In two clock cycles, the output results of the two groups of accumulator registers are read in turn and stored in the vector registers;
Steps (1) to (3) are then carried out on the other two rows of data; after eight clock cycles, the four groups of row data produced by the horizontal transform are stored, as row vectors, in the vector operation working registers R1, R2, R3 and R4 in order;
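The way two rows share the eight accumulators, and how the eight clock cycles add up, can be sketched as follows. This is a rough software model: the function names are hypothetical, and pairing rows 1 and 2, then rows 3 and 4, is an assumption, since the text only says that the odd-numbered accumulators handle "another row".

```python
def hadamard_row_pair(row_a, row_b):
    """Model steps (1)-(3) for two rows; return (out_a, out_b, cycles)."""
    acc = [0] * 8          # ACC[0]..ACC[7]
    cycles = 0

    # Step (1), one cycle: even accumulators serve row_a, odd ones row_b.
    for lane, r in ((0, row_a), (1, row_b)):
        acc[0 + lane] = r[0] + r[3]
        acc[2 + lane] = r[1] - r[2]
        acc[4 + lane] = r[2] + r[1]
        acc[6 + lane] = -r[3] + r[0]
    cycles += 1

    # Step (2), one cycle: every update reads the values latched in step (1).
    old = acc[:]
    for lane in (0, 1):
        acc[0 + lane] = old[0 + lane] + old[4 + lane]
        acc[2 + lane] = old[2 + lane] + old[6 + lane]
        acc[4 + lane] = -old[4 + lane] + old[0 + lane]
        acc[6 + lane] = old[6 + lane] - old[2 + lane]
    cycles += 1

    # Step (3), two cycles: one write-back per vector register.
    out_a, out_b = acc[0::2], acc[1::2]   # VACC[0], VACC[1]
    cycles += 2
    return out_a, out_b, cycles


def hadamard_horizontal(block):
    """Process a 4x4 block as two row pairs; return (rows, total_cycles)."""
    a, b, c1 = hadamard_row_pair(block[0], block[1])
    c, d, c2 = hadamard_row_pair(block[2], block[3])
    return [a, b, c, d], c1 + c2


rows, total = hadamard_horizontal(
    [[1, 2, 3, 4], [5, 6, 7, 8], [1, 1, 1, 1], [0, 0, 0, 0]])
print(rows, total)
```

Each row pair costs 1 + 1 + 2 = 4 cycles, so the two pairs of a 4×4 block take the eight cycles stated above.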
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]. The four new rows of data produced by the vertical transform are stored, as row vectors, in R5, R6, R7 and R8 in order. The controller performs the following operations on the data output by the horizontal transform:
Two rows of data are processed:
(1) With the 8-way vector data path configured according to Table 3, the following are calculated in two clock cycles:
$$\mathrm{VACC}[0]_n = R_1 + R_3$$
$$\mathrm{VACC}[1]_n = R_2 + R_4$$
(2) With the 8-way vector data path configured according to Table 4, the following are calculated in one clock cycle:
$$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n$$
$$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n$$
(3) In two clock cycles, the data of VACC[0] and VACC[1] are saved to R5 and R8, respectively;
The other two rows of data are processed:
(4) With the 8-way vector data path configured according to Table 5, the following are calculated in two clock cycles:
$$\mathrm{VACC}[0]_n = R_1 - R_3$$
$$\mathrm{VACC}[1]_n = R_2 - R_4$$
(5) With the 8-way vector data path configured according to Table 4, the following are calculated in one clock cycle:
$$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n$$
$$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n$$
(6) In two clock cycles, the data of VACC[0] and VACC[1] are saved to R6 and R7, respectively;
After ten clock cycles, the four groups of row data produced by the vertical transform are stored, as row vectors, in the vector operation working registers R5, R6, R7 and R8 in order.
The remaining structure and implementation of this embodiment are the same as in Embodiment One.
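As a consistency check on this embodiment, the horizontal and vertical steps can be chained in software and compared against the matrix product H·X·Hᵀ. This is an illustrative sketch only: the matrix `H` below is the one obtained by expanding the accumulator equations of this embodiment, and all function names are hypothetical.

```python
# 4x4 Hadamard matrix implied by expanding steps (1)-(2) above.
H = [[1, 1, 1, 1],
     [1, 1, -1, -1],
     [1, -1, -1, 1],
     [1, -1, 1, -1]]

def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

def vsub(u, v):
    return [a - b for a, b in zip(u, v)]

def hadamard_row(r):
    # Horizontal steps (1) and (2) for one row.
    a0, a2, a4, a6 = r[0] + r[3], r[1] - r[2], r[2] + r[1], -r[3] + r[0]
    return [a0 + a4, a2 + a6, -a4 + a0, a6 - a2]

def hadamard_4x4(block):
    # Horizontal stage: R1..R4.
    r1, r2, r3, r4 = (hadamard_row(r) for r in block)
    # Vertical steps (1)-(3): R5 and R8.
    v0, v1 = vadd(r1, r3), vadd(r2, r4)
    r5, r8 = vadd(v0, v1), vsub(v0, v1)
    # Vertical steps (4)-(6): R6 and R7.
    v0, v1 = vsub(r1, r3), vsub(r2, r4)
    r6, r7 = vadd(v0, v1), vsub(v0, v1)
    return [r5, r6, r7, r8]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def reference(block):
    # Direct evaluation of H * X * H^T for comparison.
    h_t = [list(col) for col in zip(*H)]
    return matmul(matmul(H, block), h_t)


block = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
result = hadamard_4x4(block)
print(result == reference(block))
```

The butterfly form needs only additions and subtractions, which is why the same adder stage can serve all transforms once the ×2, ×1/2 and negation selectors are set appropriately.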

Claims (4)

1. A device for H.264 integer transform acceleration, comprising a data memory connected to a data bus, characterized in that it further comprises:
A vector operation working register group, used to receive the original data from the data memory, the original data being vector data in the form of a 4×4 input matrix, and to receive the intermediate data from the accumulator register group;
Each vector operation working register R_i consists of four scalar operation working registers R_i0, R_i1, R_i2 and R_i3, and is used to store the data of row i of the 4×4 matrix or the new data of row i output by the accumulator register group;
An 8-way vector data path, used to perform operations on the row data of the 4×4 matrix according to the operand;
The flow of the 8-way vector data path comprises six stages of operation:
The first stage is eight 2-to-1 selectors, used to select whether the input comes from the accumulator register group or from the vector operation working register group, handling the 8 variables of 2 rows of the 4×4 matrix simultaneously;
The second stage is eight 8-to-1 selectors, used to select the two variables between which an operation is performed;
In horizontal transform mode, scalar operations between the data of row i of the 4×4 matrix are selected according to the operand, and the operations on 2 rows of the 4×4 matrix are performed simultaneously;
In vertical transform mode, vector operations between the row data of the 4×4 matrix are selected according to the operand;
The third stage is sixteen 2-to-1 selectors, used to select whether each adder operand is multiplied by 2;
The fourth stage is sixteen 2-to-1 selectors, used to select whether each adder operand is multiplied by 1/2;
The fifth stage is sixteen 2-to-1 selectors, used to select whether each adder operand is negated;
The sixth stage is eight adders, used to perform the additions, with the results output to the accumulator register group;
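One adder lane of the six-stage flow can be modelled in software as below. This is a simplified sketch under assumptions: the names `OperandCtl`, `select_operand`, `condition` and `adder_lane` are hypothetical, stages one and two are collapsed into a bank choice plus an index, and multiplication by 1/2 is modelled as an arithmetic right shift.

```python
from dataclasses import dataclass

@dataclass
class OperandCtl:
    """Per-operand select signals for stages three to five (assumed encoding)."""
    double: bool = False   # stage 3: multiply by 2
    halve: bool = False    # stage 4: multiply by 1/2
    negate: bool = False   # stage 5: negate

def select_operand(acc, regs, from_acc, index):
    bank = acc if from_acc else regs   # stage 1: accumulators or working registers
    return bank[index]                 # stage 2: pick one of the eight variables

def condition(value, ctl):
    if ctl.double:
        value = value * 2
    if ctl.halve:
        value = value >> 1             # halving assumed to be an arithmetic shift
    if ctl.negate:
        value = -value
    return value

def adder_lane(a, b, ctl_a, ctl_b):
    # Stage 6: one of the eight adders.
    return condition(a, ctl_a) + condition(b, ctl_b)


acc = [10, 11, 12, 13, 14, 15, 16, 17]
regs = [1, 2, 3, 4, 5, 6, 7, 8]
a = select_operand(acc, regs, from_acc=False, index=0)
b = select_operand(acc, regs, from_acc=False, index=3)
print(adder_lane(a, b, OperandCtl(), OperandCtl(negate=True)))  # 1 - 4 = -3
```

With the flags set differently the same lane yields expressions such as 2·a + b or −2·a + b, so one adder stage covers every butterfly term listed in the dependent claims.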
An accumulator register group comprising eight accumulator registers ACC[0] to ACC[7], used to store the intermediate data of the transform acceleration of R_i; four of the accumulator registers form vector register VACC[0], and the other four form vector register VACC[1];
A controller, used to assign operands to the 8-way vector data path, to specify the selection signals of the 8-way vector data path, and to control the read and write operations of the data memory;
In horizontal transform mode, the controller generates operand information and controls the read operation of the data memory; two scalar operations are performed on two data items of each row of the 4×4 matrix while two scalar operations are simultaneously performed on the other two data items, and the results are temporarily stored in the vector registers;
The controller generates operand information again; two scalar operations are performed on two data items in the four accumulator registers of vector registers VACC[0] and VACC[1] while two scalar operations are simultaneously performed on the other two data items, and the results are saved to the vector operation working register group;
These two operations are performed in sequence on each 4×4 matrix;
In vertical transform mode, the controller generates operand information; a vector operation between 2 rows of the 4×4 matrix is performed and the result is temporarily stored in vector register VACC[0], and a vector operation between the other 2 rows of the 4×4 matrix is performed and the result is temporarily stored in vector register VACC[1];
The controller generates operand information again, a vector operation between VACC[0] and VACC[1] is performed, and the result is saved from the accumulator register group to the vector operation working register group;
These two operations are performed in sequence on each 4×4 matrix.
2. The device for H.264 integer transform acceleration according to claim 1, characterized in that: in horizontal transform mode, the controller causes the data path to read the 4×4 matrix data from the data memory and performs the following operations on the data of row i:
(1) In one clock cycle, calculate:
$$\begin{aligned} \mathrm{ACC}[0]_n &= R_{i0} + R_{i3} \\ \mathrm{ACC}[2]_n &= R_{i1} - R_{i2} \\ \mathrm{ACC}[4]_n &= R_{i2} + R_{i1} \\ \mathrm{ACC}[6]_n &= -R_{i3} + R_{i0} \end{aligned}$$
where the subscript n denotes a given clock cycle;
(2) In one clock cycle, calculate:
$$\begin{aligned} \mathrm{ACC}[0]_{n+1} &= \mathrm{ACC}[0]_n + \mathrm{ACC}[4]_n \\ \mathrm{ACC}[2]_{n+1} &= 2\,\mathrm{ACC}[2]_n + \mathrm{ACC}[6]_n \\ \mathrm{ACC}[4]_{n+1} &= -\mathrm{ACC}[4]_n + \mathrm{ACC}[0]_n \\ \mathrm{ACC}[6]_{n+1} &= -2\,\mathrm{ACC}[6]_n + \mathrm{ACC}[2]_n \end{aligned}$$
where the subscript n+1 denotes the clock cycle immediately following clock cycle n;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] perform the same operations on another row;
(3) In two clock cycles, the output results of the two groups of accumulator registers are read in turn and stored in the vector registers;
Steps (1) to (3) are then carried out on the other two rows of data; after eight clock cycles, the four groups of row data produced by the horizontal transform are stored, as row vectors, in the vector operation working registers R1, R2, R3 and R4 in order;
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]; the four new rows of data produced by the vertical transform are stored, as row vectors, in R5, R6, R7 and R8 in order; the controller performs the following operations on the data output by the horizontal transform:
Two rows of data are processed:
(1) In two clock cycles, calculate respectively:
$$\mathrm{VACC}[0]_n = R_1 + R_4$$
$$\mathrm{VACC}[1]_n = R_2 + R_3$$
(2) In one clock cycle, calculate:
$$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n$$
$$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n$$
(3) In two clock cycles, the data of VACC[0] and VACC[1] are saved to R5 and R7, respectively;
The other two rows of data are processed:
(4) In two clock cycles, calculate respectively:
$$\mathrm{VACC}[0]_n = R_1 - R_4$$
$$\mathrm{VACC}[1]_n = R_2 - R_3$$
(5) In one clock cycle, calculate:
$$\mathrm{VACC}[0]_{n+1} = 2\,\mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n$$
$$\mathrm{VACC}[1]_{n+1} = -2\,\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n$$
(6) In two clock cycles, the data of VACC[0] and VACC[1] are saved to R6 and R8, respectively;
After ten clock cycles, the four groups of row data produced by the vertical transform are stored, as row vectors, in the vector operation working registers R5, R6, R7 and R8 in order.
3. The device for H.264 integer transform acceleration according to claim 1, characterized in that: in horizontal transform mode, the controller causes the data path to read the 4×4 matrix data from the data memory and performs the following operations on the data of row i:
(1) In one clock cycle, calculate:
$$\begin{aligned} \mathrm{ACC}[0]_n &= R_{i0} + R_{i2} \\ \mathrm{ACC}[2]_n &= \tfrac{1}{2}R_{i1} - R_{i3} \\ \mathrm{ACC}[4]_n &= -R_{i2} + R_{i0} \\ \mathrm{ACC}[6]_n &= \tfrac{1}{2}R_{i3} + R_{i1} \end{aligned}$$
where the subscript n denotes a given clock cycle;
(2) In one clock cycle, calculate:
$$\begin{aligned} \mathrm{ACC}[0]_{n+1} &= \mathrm{ACC}[0]_n + \mathrm{ACC}[6]_n \\ \mathrm{ACC}[2]_{n+1} &= \mathrm{ACC}[2]_n + \mathrm{ACC}[4]_n \\ \mathrm{ACC}[4]_{n+1} &= \mathrm{ACC}[4]_n - \mathrm{ACC}[2]_n \\ \mathrm{ACC}[6]_{n+1} &= -\mathrm{ACC}[6]_n + \mathrm{ACC}[0]_n \end{aligned}$$
where the subscript n+1 denotes the clock cycle immediately following clock cycle n;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] perform the same operations on another row;
(3) In two clock cycles, the output results of the two groups of accumulator registers are read in turn and stored in the vector registers;
Steps (1) to (3) are then carried out on the other two rows of data; after eight clock cycles, the four groups of row data produced by the horizontal transform are stored, as row vectors, in the vector operation working registers R1, R2, R3 and R4 in order;
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]; the four new rows of data produced by the vertical transform are stored, as row vectors, in R5, R6, R7 and R8 in order; the controller performs the following operations on the data output by the horizontal transform:
Two rows of data are processed:
(1) In two clock cycles, calculate:
$$\mathrm{VACC}[0]_n = R_1 + R_3$$
$$\mathrm{VACC}[1]_n = R_2 + \tfrac{1}{2}R_4$$
(2) In one clock cycle, calculate:
$$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n$$
$$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n$$
(3) In two clock cycles, the data of VACC[0] and VACC[1] are saved to R5 and R8, respectively;
The other two rows of data are processed:
(4) In two clock cycles, calculate respectively:
$$\mathrm{VACC}[0]_n = R_1 - R_3$$
$$\mathrm{VACC}[1]_n = \tfrac{1}{2}R_2 - R_4$$
(5) In one clock cycle, calculate:
$$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n$$
$$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n$$
(6) In two clock cycles, the data of VACC[0] and VACC[1] are saved to R6 and R7, respectively;
After ten clock cycles, the four groups of row data produced by the vertical transform are stored, as row vectors, in the vector operation working registers R5, R6, R7 and R8 in order.
4. The device for H.264 integer transform acceleration according to claim 1, characterized in that: in horizontal transform mode, the controller causes the data path to read the 4×4 matrix data from the data memory and performs the following operations on the data of row i:
(1) In one clock cycle, calculate:
$$\begin{aligned} \mathrm{ACC}[0]_n &= R_{i0} + R_{i3} \\ \mathrm{ACC}[2]_n &= R_{i1} - R_{i2} \\ \mathrm{ACC}[4]_n &= R_{i2} + R_{i1} \\ \mathrm{ACC}[6]_n &= -R_{i3} + R_{i0} \end{aligned}$$
where the subscript n denotes a given clock cycle;
(2) In one clock cycle, calculate:
$$\begin{aligned} \mathrm{ACC}[0]_{n+1} &= \mathrm{ACC}[0]_n + \mathrm{ACC}[4]_n \\ \mathrm{ACC}[2]_{n+1} &= \mathrm{ACC}[2]_n + \mathrm{ACC}[6]_n \\ \mathrm{ACC}[4]_{n+1} &= -\mathrm{ACC}[4]_n + \mathrm{ACC}[0]_n \\ \mathrm{ACC}[6]_{n+1} &= \mathrm{ACC}[6]_n - \mathrm{ACC}[2]_n \end{aligned}$$
where the subscript n+1 denotes the clock cycle immediately following clock cycle n;
In the same clock cycles as steps (1) and (2), ACC[1], ACC[3], ACC[5] and ACC[7] perform the same operations on another row;
(3) In two clock cycles, the output results of the two groups of accumulator registers are read in turn and stored in the vector registers;
Steps (1) to (3) are then carried out on the other two rows of data; after eight clock cycles, the four groups of row data produced by the horizontal transform are stored, as row vectors, in the vector operation working registers R1, R2, R3 and R4 in order;
In vertical transform mode, vector register VACC[0] comprises ACC[0], ACC[2], ACC[4] and ACC[6], and vector register VACC[1] comprises ACC[1], ACC[3], ACC[5] and ACC[7]; the four new rows of data produced by the vertical transform are stored, as row vectors, in R5, R6, R7 and R8 in order; the controller performs the following operations on the data output by the horizontal transform:
Two rows of data are processed:
(1) In two clock cycles, calculate:
$$\mathrm{VACC}[0]_n = R_1 + R_3$$
$$\mathrm{VACC}[1]_n = R_2 + R_4$$
(2) In one clock cycle, calculate:
$$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n$$
$$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n$$
(3) In two clock cycles, the data of VACC[0] and VACC[1] are saved to R5 and R8, respectively;
The other two rows of data are processed:
(4) In two clock cycles, calculate:
$$\mathrm{VACC}[0]_n = R_1 - R_3$$
$$\mathrm{VACC}[1]_n = R_2 - R_4$$
(5) In one clock cycle, calculate:
$$\mathrm{VACC}[0]_{n+1} = \mathrm{VACC}[0]_n + \mathrm{VACC}[1]_n$$
$$\mathrm{VACC}[1]_{n+1} = -\mathrm{VACC}[1]_n + \mathrm{VACC}[0]_n$$
(6) In two clock cycles, the data of VACC[0] and VACC[1] are saved to R6 and R7, respectively;
After ten clock cycles, the four groups of row data produced by the vertical transform are stored, as row vectors, in the vector operation working registers R5, R6, R7 and R8 in order.
CNB2005100617040A 2005-11-25 2005-11-25 H.264 integer transformation accelerator Expired - Fee Related CN100442847C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100617040A CN100442847C (en) 2005-11-25 2005-11-25 H.264 integer transformation accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2005100617040A CN100442847C (en) 2005-11-25 2005-11-25 H.264 integer transformation accelerator

Publications (2)

Publication Number Publication Date
CN1929603A CN1929603A (en) 2007-03-14
CN100442847C true CN100442847C (en) 2008-12-10

Family

ID=37859354

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100617040A Expired - Fee Related CN100442847C (en) 2005-11-25 2005-11-25 H.264 integer transformation accelerator

Country Status (1)

Country Link
CN (1) CN100442847C (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014523673A (en) * 2011-06-18 2014-09-11 サムスン エレクトロニクス カンパニー リミテッド Video conversion method and apparatus, inverse conversion method and apparatus
CN103914426B (en) * 2013-01-06 2016-12-28 中兴通讯股份有限公司 A kind of method and device of multiple threads baseband signal
US11334358B2 (en) 2019-12-09 2022-05-17 Amazon Technologies, Inc. Hardware accelerator having reconfigurable instruction set and reconfigurable decoder
US11841792B1 (en) * 2019-12-09 2023-12-12 Amazon Technologies, Inc. Instructions with multiple memory access modes
CN112383782B (en) * 2020-10-10 2022-07-26 河南工程学院 One-dimensional DCT/IDCT converter for bit vector conversion accumulation shift

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1414793A (en) * 2001-10-23 2003-04-30 三星电子株式会社 Compression video decoder with contraction image function and its method
US20030093452A1 (en) * 2001-08-23 2003-05-15 Minhua Zhou Video block transform
CN1589025A (en) * 2004-07-30 2005-03-02 联合信源数字音视频技术(北京)有限公司 Vido decoder based on software and hardware cooperative control

Also Published As

Publication number Publication date
CN1929603A (en) 2007-03-14

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20081210

Termination date: 20101125