Background
The Discrete Cosine Transform (DCT) is an important module in video encoders. Mainstream video compression coding standards usually require the DCT to support several transform sizes; for example, the DCT in HEVC has four sizes: 4 × 4, 8 × 8, 16 × 16 and 32 × 32. As video resolution moves toward 4K/8K, the maximum transform size supported by the DCT in future video compression coding standards will also increase.
The two-dimensional DCT (2D-DCT) in video compression coding standards can be written in matrix-multiplication form as $Z = CXC^{T}$, where X is the residual matrix generated by the prediction encoding module, C is the integer coefficient matrix specified by the standard, and Z is the transformed data. The 2D-DCT is usually implemented in two steps: a 1D-DCT is first applied to the input residual matrix row by row, i.e. $Y = XC^{T}$, and the intermediate result Y is then subjected to a 1D-DCT column by column, i.e. $Z = CY$, so that a 2D-DCT requires only two 1D-DCTs. For a single-size DCT, existing DCT/IDCT hardware designs typically save resources by implementing constant multiplication with shifts and additions instead of multipliers. However, as the transform size grows, the register area and power consumed by the shift operations increase; moreover, the shift-and-add approach can only realize constant multiplication and is not flexible enough for multiplexing the different DCT sizes in a video compression coding standard. Some researchers have proposed multiplexing architectures for 4-, 8-, 16- and 32-point DCTs, but these architectures suffer from low utilization of some modules, excessive hardware complexity and resource consumption, or inflexible throughput configuration.
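The two-step evaluation described above can be illustrated with a minimal software sketch. It assumes the 4 × 4 HEVC core transform matrix as C and leaves out the scaling and rounding shifts that a real encoder applies after each pass; the function names are illustrative only and do not come from the invention.

```python
# HEVC 4x4 core transform matrix (integer coefficients).
# Scaling and rounding shifts are intentionally omitted in this sketch.
C4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain matrix product, used only as a reference."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def dct_1d(c, x):
    """1D transform of one vector: out[k] = sum_i c[k][i] * x[i]."""
    return [sum(row[i] * x[i] for i in range(len(x))) for row in c]


def dct_2d_two_pass(c, x):
    """2D transform as two 1D passes: Y = X * C^T, then Z = C * Y."""
    # Pass 1: 1D-DCT on every row of X gives Y = X * C^T.
    y = [dct_1d(c, row) for row in x]
    # Pass 2: 1D-DCT on every column of Y gives Z = C * Y.
    z_columns = [dct_1d(c, col) for col in transpose(y)]
    return transpose(z_columns)
```

Comparing `dct_2d_two_pass(C4, X)` with `matmul(matmul(C4, X), transpose(C4))` for any 4 × 4 integer matrix X gives identical results, which is why hardware only needs a 1D-DCT datapath used twice.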
Disclosure of Invention
The invention aims to provide a configurable multi-size DCT hardware multiplexing architecture that realizes multi-size 1D-DCT with high resource utilization, can flexibly select core matrix coefficients of different sizes for multiplication so as to realize DCTs of different sizes, and can achieve different throughputs under different configuration parameters.
In order to achieve the purpose, the technical scheme of the invention is as follows: a configurable multi-size DCT transform hardware multiplexing architecture, comprising:
a judgment and data rearrangement module, which judges, according to the size of the DCT, whether the data input into the multiplexing architecture needs to be rearranged; for the DCT of the maximum size, the data input into the module is not rearranged and is output directly; for a DCT smaller than the maximum size, the data input into the module is rearranged so that the arranged data satisfies the rule of the subsequent butterfly operations, which makes it possible to process multiple rows of input data in parallel and thereby fully use the interface resources of the multiplexing architecture;
a K-layer butterfly data processing module, which performs K layers of butterfly data processing on the data processed by the judgment and data rearrangement module; each layer first performs a butterfly operation on the data input into its butterfly unit; the even-position data output by the butterfly unit serves as the input of the next layer, while the odd-position data output by the butterfly unit serves as the input of the multiplication unit of the current layer and is multiplied by the corresponding core matrix coefficients, and the products are summed by the addition unit of the current layer and then output;
and a final-stage vector inner product module, which multiplies the even-position data vector output by the last layer of the butterfly data processing module by the corresponding core matrix coefficients, sums the products, and outputs the result.
In an embodiment of the present invention, the K-layer butterfly data processing module includes K layers of butterfly data processing modules, where each layer takes as the input of its butterfly unit the even-position data output by the butterfly unit of the previous layer; each layer of the K-layer butterfly data processing module includes:
a k-th layer butterfly unit, which performs a butterfly operation on the even-position data output by the butterfly unit of the previous layer or, if k = 1, on the data output by the judgment and data rearrangement module; the even-position data output after the operation serves as the input of the next layer, and the odd-position data output after the operation serves as the input of the multiplication unit of the current layer;
a k-th layer multiplication unit, which multiplies the odd-position data output by the k-th layer butterfly unit by the corresponding core matrix coefficients; the number of multipliers contained in the multiplication unit is configurable so as to realize different data throughputs;
a k-th layer addition unit, which adds the data output by the k-th layer multiplication unit pairwise, stage by stage, the result serving as an output of the multiplexing architecture;
wherein $1 \le k \le K$.
In an embodiment of the present invention, the final-stage vector inner product module performs a vector inner product operation and includes:
a final-stage multiplication unit, which multiplies the even-position data output by the K-th layer butterfly unit by the corresponding core matrix coefficients;
and a final-stage addition unit, which adds the data output by the final-stage multiplication unit pairwise, the result serving as an output of the multiplexing architecture.
In an embodiment of the present invention, the configurable multi-size DCT transform hardware multiplexing architecture is implemented in two ways: as an FPGA-based digital logic hardware circuit and as an ASIC-based digital logic hardware circuit.
Compared with the prior art, the invention has the following beneficial effects: the configurable multi-size DCT transform hardware multiplexing architecture improves on the implementation architecture of the traditional DCT, effectively raises the resource utilization of the internal modules of the whole multiplexing architecture, is compatible with DCTs of various sizes, and allows the throughput of the architecture to be configured flexibly. In addition, the architecture is implemented both as an FPGA-based digital logic hardware circuit and as an ASIC-based digital logic hardware circuit; it is simple, effective and reconfigurable, and can be widely applied to the multi-size DCTs in various video compression coding standards.
Detailed Description
The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.
The invention discloses a configurable multi-size DCT (discrete cosine transform) hardware multiplexing architecture which comprises a judgment and data rearrangement module, a K-layer butterfly data processing module and a final-stage vector inner product module. The judgment and data rearrangement module judges, according to the size of the DCT, whether the data input into the multiplexing architecture needs to be rearranged. For the DCT of the maximum size, the data input into the module is not rearranged and is output directly; for a DCT smaller than the maximum size, the data input into the module is rearranged so that the arranged data satisfies the rule of the subsequent butterfly operations, which makes it possible to process multiple rows of input data in parallel and thereby fully use the interface resources of the multiplexing architecture. The K-layer butterfly data processing module performs K layers of butterfly data processing on the data processed by the judgment and data rearrangement module: each layer performs a butterfly operation on the data input into its butterfly unit, the even-position data output by the butterfly unit serves as the input of the next layer, and the odd-position data output by the butterfly unit serves as the input of the multiplication unit of the current layer, where it is multiplied by the corresponding core matrix coefficients; the products are summed by the addition unit and then output. The final-stage vector inner product module multiplies the even-position data vector output by the butterfly unit of the last layer by the corresponding core matrix coefficients; the products are added pairwise, stage by stage, by the addition unit and then output.
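The data flow through the K butterfly layers and the final-stage vector inner product can be sketched in software as follows. This is a behavioural model under stated assumptions, not the hardware design itself: it covers only the maximum-size case (no rearrangement), it assumes the 8 × 8 HEVC core matrix as the coefficient source, and the names `C8` and `butterfly_dct_1d` are illustrative.

```python
# HEVC 8x8 core transform matrix (integer coefficients), assumed here as C.
C8 = [
    [64, 64, 64, 64, 64, 64, 64, 64],
    [89, 75, 50, 18, -18, -50, -75, -89],
    [83, 36, -36, -83, -83, -36, 36, 83],
    [75, -18, -89, -50, 50, 89, 18, -75],
    [64, -64, -64, 64, 64, -64, -64, 64],
    [50, -89, 18, 75, -75, -18, 89, -50],
    [36, -83, 83, -36, -36, 83, -83, 36],
    [18, -50, 75, -89, 89, -75, 50, -18],
]


def butterfly_dct_1d(c, x, num_layers):
    """1D transform of x via num_layers butterfly layers plus a final inner product."""
    out = [0] * len(x)
    v = list(x)
    stride = 1  # spacing between the output rows produced by the current layer
    for _ in range(num_layers):
        half = len(v) // 2
        # Butterfly unit: sums form the even part, differences the odd part.
        even = [v[i] + v[-1 - i] for i in range(half)]
        odd = [v[i] - v[-1 - i] for i in range(half)]
        # Multiplication and addition units: the odd part yields the output
        # rows stride, 3*stride, 5*stride, ... of the coefficient matrix.
        for j in range(half):
            row = (2 * j + 1) * stride
            out[row] = sum(c[row][i] * odd[i] for i in range(half))
        v = even  # the even part feeds the next layer
        stride *= 2
    # Final-stage vector inner product on the remaining even part.
    for j in range(len(v)):
        row = j * stride
        out[row] = sum(c[row][i] * v[i] for i in range(len(v)))
    return out
```

For an 8-point input, `butterfly_dct_1d(C8, x, 2)` returns the same values as the direct product of `C8` with `x`, while each layer only multiplies half as many data as the layer before it.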
Fig. 1 is a block diagram of the structure of an embodiment of the present invention. In this embodiment, the maximum size of the multi-size DCT supported by the multiplexing architecture is denoted $S_{\max}$ and the minimum size $S_{\min}$. When an $S$-point DCT ($S \le S_{\max}$) is performed, $r(S)$ rows (or columns) of data are taken from the $S \times S$ image block each time and concatenated in sequence into one row of $N_{in}$ ($N_{in} = S_{\max}$) data that is input into the multiplexing architecture, so that $S \cdot r(S) = S_{\max}$; the bit width of each datum is $w_{in}$ bits. The multiplexing architecture contains K layers of butterfly data processing modules, where $K = \min(\log_2 S_{\min}, \log_2 S_{\max}) = \log_2 S_{\min}$. The whole multiplexing architecture is implemented on FPGA and ASIC hardware and comprises a judgment and data rearrangement module 11, a K-layer butterfly data processing module 12 and a final-stage vector inner product module 13.
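As a small worked example of these parameters, the sketch below computes $N_{in}$, $r(S)$ and $K$ for a given transform size. The defaults $S_{\min} = 4$ and $S_{\max} = 32$ are assumed from the HEVC sizes mentioned in the Background, $r(S)$ is taken as $S_{\max}/S$ (which follows from $r(S)$ rows of length $S$ filling $N_{in} = S_{\max}$ inputs), and the function names are hypothetical.

```python
def log2_int(n):
    """Exact base-2 logarithm of a power of two."""
    if n <= 0 or n & (n - 1) != 0:
        raise ValueError("size must be a power of two")
    return n.bit_length() - 1


def multiplex_parameters(s, s_min=4, s_max=32):
    """Return N_in, r(S) and K for an S-point transform (illustrative helper)."""
    if not s_min <= s <= s_max:
        raise ValueError("S must lie between S_min and S_max")
    log2_int(s)  # validates that S is a power of two
    return {
        "N_in": s_max,                                # inputs accepted per pass
        "r": s_max // s,                              # rows processed in parallel
        "K": min(log2_int(s_min), log2_int(s_max)),   # butterfly layers
    }
```

With these defaults, a 32-point transform processes one row per pass and an 8-point transform processes four, and both use K = 2 butterfly layers.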
The judgment and data rearrangement module 11 judges, according to the size of the DCT, whether the data input into the multiplexing architecture needs to be rearranged. For the DCT of the maximum size, the data input into the module is not rearranged and is output directly; for a DCT smaller than the maximum size, the data input into the module is rearranged so that the arranged data satisfies the rule of the subsequent butterfly operations. The rule followed during rearrangement is shown in Fig. 2, where the order of the arrows is the order of the rearranged data vector [symbol missing]. The square array denotes the input matrix $X'_{(r \times S)}$ composed of the $r(S)$ input vectors $x = [x_0, x_1, \ldots, x_{S-1}]$ of length $S$ before rearrangement, where the element of matrix $X'_{(r \times S)}$ is $x'_{a,b}$, which corresponds to [symbol missing]. To support parallel processing of the $r(S)$ row vectors $x$ in all K butterfly layers, the rearranged vector is divided into $2^{K}$ parts, and [symbol missing] belongs to part [index missing], of which it is element [index missing]. Within each part, all rows are traversed in raster order and, to ensure head-to-tail symmetry, the scanning orders of two adjacent parts are opposite: when c is even, [formula missing]; when c is odd, [formula missing].
The K-layer butterfly data processing module 12 performs K layers of butterfly data processing on the rearranged data, where the even-position data output by the butterfly unit of one layer serves as the input of the next layer; the horizontal ellipsis in Fig. 1 indicates the intermediate layers. By configuring the number of multipliers contained in the multiplication units of the K-layer butterfly data processing module, the throughput $T(S)$ of the multiplexing architecture can be set flexibly within the range $S_{\min} \cdot r(S) \le T(S) \le S \cdot r(S) = S_{\max}$. Each layer of the butterfly data processing module comprises 3 sub-modules; the first, second, (K-1)-th and K-th layers are taken as examples.
The first layer butterfly data processing module comprises 3 sub-modules which are respectively:
(1) First-layer butterfly unit 1201: this module performs a butterfly operation on the rearranged data and outputs two parts, the even-position data and the odd-position data; each part contains half as many data as the input of the butterfly unit, and the bit width of each datum increases by 1 bit.
(2) First-layer multiplication unit 1202: the module multiplies the odd position data output by the first-layer butterfly unit 1201 by the corresponding core matrix coefficient.
(3) First-layer addition unit 1203: this module adds the vector data output by the first-layer multiplication unit 1202 pairwise, stage by stage, in a total of $(\log_2 S_{\max} - 1)$ stages of addition, as shown in Fig. 3. For an $S_{\max}$-point transform, one datum is output by the adder of the last stage of the addition unit; for an $S_{\max}/2$-point transform, two data are output by the adders of the second-to-last stage; smaller transforms follow by analogy. Similarly, the k-th layer ($1 \le k \le K$) addition unit has a total of $(\log_2 S_{\max} - k)$ stages of addition.
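The staged pairwise addition, and the idea of taking the result from an earlier stage for smaller transforms, can be sketched as below. The model assumes that the products belonging to the same input row sit next to each other at the adder inputs, which the text does not state explicitly; `adder_tree` and `tapped_outputs` are hypothetical names.

```python
def adder_tree(products):
    """Return the data present at every stage of a pairwise adder tree.

    Entry 0 holds the raw products; entry n holds the sums after n stages.
    The number of products is assumed to be a power of two.
    """
    stages = [list(products)]
    while len(stages[-1]) > 1:
        prev = stages[-1]
        stages.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return stages


def tapped_outputs(products, rows):
    """Take the outputs from the stage that holds exactly `rows` partial sums."""
    for level in adder_tree(products):
        if len(level) == rows:
            return level
    raise ValueError("rows must be a power of two no larger than len(products)")
```

With $S_{\max} = 32$ the first layer feeds 16 products into the tree, which gives $\log_2 32 - 1 = 4$ stages; one row is read from the last stage, two rows from the second-to-last stage, and so on.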
The second-layer butterfly data processing module comprises 3 sub-modules, which are respectively:
(1) Second-layer butterfly unit 1204: this module performs a butterfly operation on the even-position data output by the first-layer butterfly unit 1201 and outputs two parts, the even-position data and the odd-position data.
(2) Second-layer multiplication unit 1205: this module multiplies the odd-position data output by the second-layer butterfly unit 1204 by the corresponding core matrix coefficients.
(3) Second-layer addition unit 1206: this module adds the vector data output by the second-layer multiplication unit 1205 pairwise, stage by stage, in a total of $(\log_2 S_{\max} - 2)$ stages of addition.
The (K-1)-th layer butterfly data processing module comprises 3 sub-modules, which are respectively:
(1) (K-1)-th layer butterfly unit 1207: this module performs a butterfly operation on the even-position data output by the (K-2)-th layer butterfly unit and outputs two parts, the even-position data and the odd-position data.
(2) (K-1)-th layer multiplication unit 1208: this module multiplies the odd-position data output by the (K-1)-th layer butterfly unit 1207 by the corresponding core matrix coefficients.
(3) (K-1)-th layer addition unit 1209: this module adds the vector data output by the (K-1)-th layer multiplication unit 1208 pairwise, stage by stage, in a total of $(\log_2 S_{\max} - (K-1))$ stages of addition.
The K-th layer butterfly data processing module comprises 3 sub-modules, which are respectively:
(1) K-th layer butterfly unit 1210: this module performs a butterfly operation on the even-position data output by the (K-1)-th layer butterfly unit 1207 and outputs two parts, the even-position data and the odd-position data.
(2) K-th layer multiplication unit 1211: this module multiplies the odd-position data output by the K-th layer butterfly unit 1210 by the corresponding core matrix coefficients.
(3) K-th layer addition unit 1212: this module adds the vector data output by the K-th layer multiplication unit 1211 pairwise, stage by stage, in a total of $(\log_2 S_{\max} - K)$ stages of addition.
The final-stage vector inner product module 13 multiplies the data vector output by the last (i.e., K-th) layer of the butterfly data processing module by the corresponding core matrix coefficients, adds the products in the addition unit, and outputs the result. The module comprises 2 sub-modules, which are respectively:
(1) Final-stage multiplication unit 131: this module multiplies the even-position data output by the K-th layer butterfly unit 1210 by the corresponding core matrix coefficients.
(2) Final-stage addition unit 132: this module adds the vector data output by the final-stage multiplication unit 131 pairwise, stage by stage, in a total of $(\log_2 S_{\max} - K)$ stages of addition.
The above are preferred embodiments of the present invention. All changes made according to the technical scheme of the present invention whose functional effects do not exceed the scope of that technical scheme fall within the protection scope of the present invention.