CN117440168A - Hardware architecture for realizing parallel spiral search algorithm

Info

Publication number: CN117440168A
Application number: CN202311752857.4A
Authority: CN
Inventors: 陈志峰; 施隆照; 王诗鑫; 杨小玲
Original assignee: Fuzhou Shixin Technology Co ltd
Current assignee: Fuzhou Shixin Technology Co ltd
Priority date: 2023-12-19
Filing date: 2023-12-19
Publication date: 2024-01-23
Anticipated expiration: 2043-12-19
Also published as: CN117440168B

Abstract

The invention relates to a hardware architecture for realizing a parallel spiral search algorithm. The architecture comprises a sub-pixel interpolation storage module, a UMVP control module, a motion estimation control module, a Merge control module, a cost calculation comparison module and a brightness prediction pixel reconstruction module. Whole pixel motion estimation, sub-pixel motion estimation and Merge share the same cost calculation comparison module to calculate and compare costs, so the utilization rate of hardware resources is high. In the invention, whole pixel motion estimation searches with a spiral search algorithm as its core, sub-pixel motion estimation adopts a motion vector grouping strategy, and Merge prunes the length of the Merge candidate list while combining column raster scanning with interleaved scanning of CU blocks, thereby effectively reducing the computational complexity and the number of clock cycles required by the hardware implementation.

Description

Hardware architecture for realizing parallel spiral search algorithm

Technical Field

The invention belongs to the technical field of video encoding and decoding, and particularly relates to a hardware architecture for realizing a parallel spiral search algorithm.

Background

Building on H.264/AVC, the HEVC video coding standard adds a dedicated set of picture partitioning modes that divide a picture into coding units, prediction units and transform units. Compared with H.264, HEVC can save 25% to 50% of the bitstream at the same PSNR.

The excellent coding efficiency of HEVC comes from its advanced coding structures and a variety of advanced techniques, but these also make HEVC far more complex than the H.264 coding format. Inter prediction accounts for up to 80% of the complexity of the whole encoding process, and motion estimation takes about 70% of the inter prediction time, so reducing the motion estimation time effectively reduces the complexity of the whole encoding process. The TZsearch algorithm adopted in the HM16.7 test model reduces the complexity by more than 93% with a performance loss of only 0.28%; however, the positions of its search points change widely, so the data is difficult to read quickly, the search is time-consuming, and the algorithm is ill-suited to hardware implementation. The full search algorithm has a fixed search order, but its complexity is far from meeting the requirements of real-time applications. In HEVC, common inter-frame motion estimation search algorithms must iterate over the CU blocks repeatedly to obtain the optimal MVs and costs of all PU blocks, which is unsuitable for video coding with ever-larger CTUs; the pixel residual sums are also recalculated during the iterations, so many pixels are computed repeatedly.

After a PU block has determined the optimal point of whole pixel motion estimation, the HEVC coding standard first searches the 8 sub-pixel points of 1/2 precision around that point and, once the optimal 1/2 sub-pixel point is determined, searches the 8 sub-pixel points of 1/4 precision around it; that is, 16 sub-pixel motion estimations are needed for each PU block to obtain the final result. Although sub-pixel motion estimation is an order of magnitude less complex than whole pixel motion estimation at the algorithm level, it still requires a significant number of clock cycles and logic resources in a hardware implementation. The conventional Merge implementation in HEVC contains a large feedback loop: the dependence of a block on the motion information of its neighbouring blocks stalls the hardware pipeline, and adjacent PU blocks must wait many clock cycles between Merge prediction calculations.
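The two-stage sub-pixel search described above can be illustrated with a minimal Python sketch. The function names, the quarter-pixel MV units and the `cost` callback are assumptions made for illustration only; they are not taken from the HM software or from the architecture of the invention.

```python
def subpel_candidates(center, step):
    """Return the 8 neighbours of `center` at distance `step` (quarter-pixel units)."""
    cx, cy = center
    return [(cx + dx * step, cy + dy * step)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dx, dy) != (0, 0)]


def two_stage_subpel_search(int_mv, cost):
    """Refine an integer MV with a 1/2-pixel stage followed by a 1/4-pixel stage.

    `int_mv` is in whole pixels; `cost` is a hypothetical callable that maps a
    quarter-pixel MV to its matching cost.
    """
    best = (int_mv[0] * 4, int_mv[1] * 4)   # convert to quarter-pixel units
    evaluated = 0
    for step in (2, 1):                      # 2 units = 1/2 pixel, 1 unit = 1/4 pixel
        cands = subpel_candidates(best, step)
        evaluated += len(cands)
        best = min([best] + cands, key=cost)
    return best, evaluated
```

With a toy cost such as `lambda mv: abs(mv[0] - 7) + abs(mv[1] + 3)` and a start point of `(2, -1)`, the sketch evaluates 16 sub-pixel points in total, matching the count given in the text.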

To solve the above problems, the invention provides a hardware architecture for realizing a parallel spiral search algorithm. Whole pixel motion estimation in the architecture searches with a parallel spiral search algorithm as its core; the algorithm has a fixed search order and a high data reuse rate, effectively eliminates redundant calculation, reduces the computational complexity, and completes the calculation of one search point every four clock cycles. Sub-pixel motion estimation adopts a motion vector grouping strategy, performing sub-pixel motion estimation on prediction blocks with the same motion vector at the same time, which effectively reduces the computational complexity. Merge prunes the length of the Merge candidate list and combines column raster scanning with interleaved scanning of CU blocks, achieving fully pipelined calculation and reducing the number of clock cycles required by the hardware implementation. Whole pixel motion estimation, sub-pixel motion estimation and Merge share the same cost calculation comparison module to calculate and compare costs, so the utilization rate of hardware resources is high.

Disclosure of Invention

In view of this, an object of the present invention is to provide a hardware architecture for realizing a parallel spiral search algorithm. Whole pixel motion estimation in the architecture searches with a parallel spiral search algorithm as its core; the algorithm has a fixed search order and a high data reuse rate, effectively eliminates redundant calculation, reduces the computational complexity, and completes the calculation of one search point every four clock cycles. Sub-pixel motion estimation adopts a motion vector grouping strategy, performing sub-pixel motion estimation on prediction blocks with the same motion vector at the same time, which effectively reduces the computational complexity. Merge prunes the length of the Merge candidate list and combines column raster scanning with interleaved scanning of CU blocks, achieving fully pipelined calculation and reducing the number of clock cycles required by the hardware implementation. Whole pixel motion estimation, sub-pixel motion estimation and Merge share the same cost calculation comparison module to calculate and compare costs, so the utilization rate of hardware resources is high.

In order to achieve the above purpose, the invention adopts the following technical scheme:

a hardware architecture for implementing a parallel spiral search algorithm, comprising the following features:

the architecture comprises a sub-pixel interpolation storage module, a UMVP control module, a motion estimation control module, a Merge control module, a cost calculation comparison module and a brightness prediction pixel reconstruction module; the motion estimation control module and the Merge control module share the cost calculation comparison module to calculate and compare rate-distortion costs;

the sub-pixel interpolation storage module is used for cutting reference pixels transmitted from the encoder top layer module into search frames, storing the search frames into the storage module, interpolating and filtering in advance, and storing all 1/2 and 1/4 sub-pixel values in the search frames for calculating cost of motion estimation and Merge;

the UMVP control module calculates the MVP value of the maximum CU of the current CTU as a starting point of whole pixel motion estimation;

the motion estimation control module controls the whole pixel motion estimation process and the two sub-pixel motion estimation processes, and outputs the minimum rate-distortion cost, the partition mode and the motion vector of all CU blocks at the four depths;

the Merge control module establishes a candidate list of each PU block, sequentially calculates rate-distortion cost, compares the rate-distortion cost with a motion estimation result and updates an optimal result;

the cost calculation and comparison module is used for calculating the rate distortion cost of the current matching block and comparing the rate distortion cost of each PU block in the process of motion estimation and Merge;

the brightness prediction pixel reconstruction module extracts the prediction pixel values of all the inter-frame blocks from the storage module according to the coding information of the current CTU for the subsequent reconstruction module to use.

Further, the sub-pixel interpolation storage module specifically includes:

the sub-pixel interpolation storage module comprises a sub-pixel interpolation storage control module, a reference pixel storage matrix module, a whole pixel buffer module, a sub-pixel interpolation filtering module, a read address decoding module and a reference pixel screening module;

the sub-pixel interpolation storage control module is used for controlling the reading of reference pixels and the writing of sub-pixels, and is also used for starting the sub-pixel interpolation filtering module;

the whole pixel buffer module is used for buffering the reference pixels input by the reference pixel storage matrix module;

the sub-pixel interpolation filtering module is used for performing interpolation filtering on the reference whole pixels to obtain the sub-pixels required in sub-pixel motion estimation;

the read address decoding module is used for translating the received MV signals into read addresses of the storage matrix;

the reference pixel screening module screens the output reference pixels according to the position information to obtain reference pixels of size 32×32 for the cost calculation of the inter-frame prediction module.
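The cooperation of the read address decoding module and the reference pixel screening module can be pictured with a short Python sketch. It assumes, purely for illustration, that the search frame is stored as a two-dimensional list padded by `search_range` pixels on every side and that the read address is the top-left corner of the window; the actual memory layout of the storage matrix is not given in the text, and both function names are hypothetical.

```python
BLOCK = 32  # window size named in the text (32×32)


def decode_read_address(mv, origin, search_range):
    """Translate an MV (mvx, mvy) into a top-left (row, col) read address.

    `origin` is the (x, y) position of the block inside the unpadded frame.
    Assumed layout: the stored frame is padded by `search_range` on each side.
    """
    mvx, mvy = mv
    ox, oy = origin
    if abs(mvx) > search_range or abs(mvy) > search_range:
        raise ValueError("motion vector outside the search range")
    return (search_range + oy + mvy, search_range + ox + mvx)


def screen_window(frame, top_left, size=BLOCK):
    """Select the size×size block of reference pixels starting at `top_left`."""
    r, c = top_left
    return [row[c:c + size] for row in frame[r:r + size]]
```

For a 40×40 stored frame (a 32×32 block with a search range of 4), `decode_read_address((1, -2), (0, 0), 4)` gives the address `(2, 5)`, and `screen_window` then returns the 32×32 block that starts there.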

Further, the motion estimation control module specifically includes:

the motion estimation control module comprises a whole pixel motion estimation control module, a sub-pixel motion estimation control module, a partition mode selection module, and an optimal cost, partition mode and MV storage module;

the whole pixel motion estimation control module is used for calculating the motion vector required by whole pixel motion estimation, and the sub-pixel interpolation storage module outputs the reference pixels required by whole pixel motion estimation according to that motion vector; whole pixel motion estimation takes a spiral search algorithm as its core and searches outward from the starting point in a spirally extending order; the algorithm has a fixed search order and a high data reuse rate, and whole pixel motion estimation calculates the costs of all PU blocks in a CTU in parallel;

the sub-pixel motion estimation control module is used for calculating the motion vector required by sub-pixel motion estimation, and the sub-pixel interpolation storage module outputs the reference pixels required by sub-pixel motion estimation according to that motion vector; sub-pixel motion estimation adopts a motion vector grouping strategy, combining prediction blocks with the same motion vector so that their sub-pixel motion estimation is performed together;

the partition mode selection module is used for determining the optimal partition mode of the current CU;

the optimal cost, partition mode and MV storage module is used for storing the minimum cost, the optimal partition mode and the corresponding motion vector of each CU block.
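The motion vector grouping strategy can be sketched in Python as follows. The sketch assumes that each PU block is identified by a label and that the saving comes from fetching the sub-pixel candidates once per distinct motion vector; the identifiers and the figure of 8 points per stage are illustrative, not a description of the actual circuit.

```python
from collections import defaultdict


def group_by_mv(best_int_mv):
    """Group PU blocks that share the same whole pixel motion vector.

    `best_int_mv` maps a PU identifier to its best (mvx, mvy).
    Returns a dict mapping each distinct MV to the list of PUs that use it.
    """
    groups = defaultdict(list)
    for pu, mv in best_int_mv.items():
        groups[mv].append(pu)
    return dict(groups)


def subpel_point_count(best_int_mv, points_per_stage=8):
    """Compare search points per stage with and without grouping."""
    grouped = len(group_by_mv(best_int_mv)) * points_per_stage
    ungrouped = len(best_int_mv) * points_per_stage
    return grouped, ungrouped
```

With four PU blocks of which three share one motion vector, the grouped search visits 16 points per stage instead of 32.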

Further, the Merge control module specifically includes:

the Merge control module comprises an MV selection control module, a time domain MV expansion module, a time domain and space domain reference MV memory module, a CU division mode table module, an MV lookup table module and a FIFO for caching data;

the MV selection control module directly extracts candidate motion vectors from the MV memory module and the MV lookup table module according to the coordinates and partition mode of each PU block, and the sub-pixel interpolation storage module outputs the reference pixels required by Merge calculation according to those motion vectors; Merge prunes the length of the Merge candidate list: PU blocks at depth 0 and depth 1 construct a complete candidate list, PU blocks in the first row at depth 2 and depth 3 do not use the A0 and B2 blocks, and PU blocks in the other rows use only A1, the time domain block and the zero vector; Merge also adopts column raster scanning together with interleaved scanning of CU blocks at different depths, in the following specific scanning order: depth 0; depth 1, first CU; depth 2, first column of CUs; depth 3, first and second columns of CUs; depth 1, second CU; depth 2, second column of CUs; depth 3, third and fourth columns of CUs; depth 1, third CU; depth 2, third column of CUs; depth 3, fifth and sixth columns of CUs; depth 1, fourth CU; depth 2, fourth column of CUs; depth 3, seventh and eighth columns of CUs;

the time domain MV expansion module is used for carrying out expansion transformation on the time domain MVs;

the time domain and space domain reference MV memory module is used for storing time domain and space domain reference MVs required by Merge calculation;

the CU partition mode table module is used for storing CU partition modes required by Merge calculation;

the MV lookup table module is used for storing the spatial reference MVs of the current PU block required by Merge calculation; these spatial reference MVs are supplied by motion estimation.
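The pruning rule for the Merge candidate list can be written down as a small Python sketch. It assumes that the complete list consists of the standard HEVC spatial positions A1, B1, B0, A0 and B2 followed by the time domain candidate (`"T"`) and the zero vector (`"ZERO"`), and that the first row is row index 0; these labels and the function name are hypothetical.

```python
FULL_LIST = ["A1", "B1", "B0", "A0", "B2", "T", "ZERO"]  # assumed complete list


def merge_candidate_positions(depth, row):
    """Return the candidate positions a PU block may use.

    depth: CU depth 0..3; row: row index of the PU block (0 = first row).
    """
    if depth not in (0, 1, 2, 3):
        raise ValueError("depth must be 0, 1, 2 or 3")
    if depth in (0, 1):
        return list(FULL_LIST)                        # complete candidate list
    if row == 0:
        return [p for p in FULL_LIST if p not in ("A0", "B2")]
    return ["A1", "T", "ZERO"]                        # other rows at depth 2 and 3
```

Under these assumptions, a depth-3 PU block outside the first row has only three candidates, which is what shortens the dependency chain between neighbouring blocks.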

Further, the cost calculation and comparison module specifically includes:

the cost calculation and comparison module comprises an original pixel buffer module, a reference pixel buffer module, an SAD/SATD calculation module, an MVD bit number calculation module and a cost comparison module, wherein the whole pixel motion estimation, the sub-pixel motion estimation and the Merge share the cost calculation and comparison module to carry out cost calculation and cost comparison;

the original pixel buffer module is used for buffering the original pixel data required by motion estimation and Merge;

the reference pixel buffer module is used for buffering the reference pixel data required by motion estimation and Merge;

the SAD/SATD calculation module calculates either the SAD or the SATD of the current matching block, as required by motion estimation and Merge;

the MVD bit number calculation module is used for calculating the number of header bits of the current matching block; the distortion of the current matching block is added to the number of header bits to obtain the rate-distortion cost;

the cost comparison module is used for comparing the rate-distortion costs of each PU block and selecting, for each PU block, the motion information with the minimum rate-distortion cost.
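The rate-distortion cost described above, distortion plus header bits, can be illustrated with a minimal Python sketch. Only SAD is shown. The bit estimate uses the length of a signed Exp-Golomb code for each MVD component, and the weight `lam` defaults to 1; both are assumptions for illustration, since the text does not specify how the header bits are counted or weighted.

```python
def sad(orig, ref):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(o - r)
               for row_o, row_r in zip(orig, ref)
               for o, r in zip(row_o, row_r))


def se_golomb_bits(v):
    """Length in bits of the signed Exp-Golomb code of v (assumed bit model)."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)
    return 2 * (code_num + 1).bit_length() - 1


def rd_cost(orig, ref, mv, mvp, lam=1):
    """Distortion plus weighted header bits of the motion vector difference."""
    bits = se_golomb_bits(mv[0] - mvp[0]) + se_golomb_bits(mv[1] - mvp[1])
    return sad(orig, ref) + lam * bits
```

For a 2×2 block with a SAD of 3 and an MVD of (2, 0), the assumed bit model gives 5 + 1 = 6 bits and a cost of 9.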

Compared with the prior art, the invention has the following beneficial effects:

Whole pixel motion estimation searches with a parallel spiral search algorithm as its core; the algorithm has a fixed search order and a high data reuse rate, effectively eliminates redundant calculation, reduces the computational complexity, and completes the calculation of one search point every four clock cycles. Sub-pixel motion estimation adopts a motion vector grouping strategy, performing sub-pixel motion estimation on prediction blocks with the same motion vector at the same time, which effectively reduces the computational complexity. Merge prunes the length of the Merge candidate list and combines column raster scanning with interleaved scanning of CU blocks, achieving fully pipelined calculation and reducing the number of clock cycles required by the hardware implementation. Whole pixel motion estimation, sub-pixel motion estimation and Merge share the same cost calculation comparison module to calculate and compare costs, so the utilization rate of hardware resources is high.

Drawings

FIG. 1 is a block diagram of a hardware architecture of the method of the present invention;

FIG. 2 is a schematic diagram of a sub-pixel interpolation storage module architecture according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a motion estimation control module architecture according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a parallel spiral search sequence in an embodiment of the present invention;

FIG. 5 is a schematic diagram of a Merge control module architecture according to an embodiment of the present invention;

FIG. 6 is a column raster scan schematic of Merge in an embodiment of the invention;

FIG. 7 is a schematic diagram of the CU block interleaved scan of Merge in an embodiment of the invention;

FIG. 8 is a schematic diagram of a cost calculation comparison module architecture according to an embodiment of the present invention;

FIG. 9 is a schematic illustration of a pipeline in an embodiment of the invention;

FIG. 10 is a schematic diagram of a sequential state machine in an embodiment of the invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings and examples.

Referring to fig. 1, the present invention provides a hardware architecture for implementing a parallel spiral search algorithm, where the architecture includes a sub-pixel interpolation storage module, a UMVP control module, a motion estimation control module, a Merge control module, a cost calculation comparison module, and a brightness prediction pixel reconstruction module;

Referring to fig. 2, the sub-pixel interpolation storage module includes a sub-pixel interpolation storage control module, a reference pixel storage matrix module, a whole pixel buffer module, a sub-pixel interpolation filtering module, a read address decoding module and a reference pixel screening module;

the sub-pixel interpolation storage control module is used for controlling reading of reference pixels and writing of sub-pixels and also used for starting the sub-pixel interpolation filtering module;

the whole pixel buffer module is used for buffering the reference pixels input by the reference pixel storage matrix module;

the sub-pixel interpolation filtering module is used for carrying out interpolation filtering on the reference whole pixel so as to obtain a sub-pixel required in sub-pixel motion estimation;
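As an illustration of the interpolation filtering step, the following Python sketch computes the horizontal 1/2 sub-pixel values of one row of 8-bit pixels. It assumes the standard HEVC 8-tap luma half-sample filter (-1, 4, -11, 40, 40, -11, 4, -1) with a normalization of 64, and it replicates edge pixels at the row boundaries; the text does not name the filter used, so this is a sketch under that assumption and not the circuit of the embodiment.

```python
HALF_TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)  # assumed HEVC half-sample luma taps


def half_pel_row(row):
    """Return the half-sample values between row[i] and row[i + 1].

    Edge pixels are replicated where the 8-tap window leaves the row.
    Results are rounded, normalized by 64 and clipped to the 8-bit range.
    """
    n = len(row)
    out = []
    for i in range(n - 1):
        acc = 0
        for k, tap in enumerate(HALF_TAPS):
            j = min(max(i - 3 + k, 0), n - 1)   # clamp index to the row
            acc += tap * row[j]
        out.append(min(max((acc + 32) >> 6, 0), 255))
    return out
```

A flat row stays flat under this filter, and a step edge from 0 to 64 produces the midpoint value 32 at the edge, which is a quick sanity check on the coefficients.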

Referring to fig. 3, the motion estimation control module includes a whole pixel motion estimation control module, a sub-pixel motion estimation control module, a partition mode selection module, and an optimal cost, partition mode and MV storage module;

the whole pixel motion estimation control module is used for calculating the motion vector required by whole pixel motion estimation, and the sub-pixel interpolation storage module outputs the reference pixels required by whole pixel motion estimation according to that motion vector; whole pixel motion estimation takes a spiral search algorithm as its core and searches outward from the starting point in a spirally extending order, as shown in fig. 4; the algorithm has a fixed search order and a high data reuse rate, and whole pixel motion estimation calculates the costs of all PU blocks in a CTU in parallel;
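A fixed spiral search order can be generated as in the Python sketch below. The sketch assumes a square spiral that starts at the search origin and first moves to the right; the exact order of fig. 4 is not reproduced in the text, so the direction and shape here are illustrative choices, and `spiral_offsets` is a hypothetical name.

```python
def spiral_offsets(search_range):
    """Return all (dx, dy) offsets in [-R, R] x [-R, R] in square-spiral order."""
    r = search_range
    total = (2 * r + 1) ** 2
    x = y = 0
    dx, dy = 1, 0                      # assumed first direction: to the right
    out = [(0, 0)]
    step = 1
    while len(out) < total:
        for _ in range(2):             # two legs share each step length
            for _ in range(step):
                x, y = x + dx, y + dy
                if abs(x) <= r and abs(y) <= r:
                    out.append((x, y))
            dx, dy = -dy, dx           # turn by 90 degrees
        step += 1
    return out
```

Consecutive search points in this order differ by a single pixel, so the reference window of one search point overlaps almost entirely with the next, which is the property behind the high data reuse rate mentioned above.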

Referring to fig. 5, the Merge control module includes an MV selection control module, a time domain MV expansion module, a time domain and space domain reference MV memory module, a CU partition mode table module, an MV lookup table module, and a FIFO for caching data;

the MV selection control module directly extracts candidate motion vectors from the MV memory module and the MV lookup table module according to the coordinates and partition mode of each PU block, and the sub-pixel interpolation storage module outputs the reference pixels required by Merge calculation according to those motion vectors; Merge prunes the length of the Merge candidate list: PU blocks at depth 0 and depth 1 construct a complete candidate list, PU blocks in the first row at depth 2 and depth 3 do not use the A0 and B2 blocks, and PU blocks in the other rows use only A1, the time domain block and the zero vector; Merge also adopts column raster scanning together with interleaved scanning of CU blocks at different depths, in the following specific scanning order: depth 0; depth 1, first CU; depth 2, first column of CUs; depth 3, first and second columns of CUs; depth 1, second CU; depth 2, second column of CUs; depth 3, third and fourth columns of CUs; depth 1, third CU; depth 2, third column of CUs; depth 3, fifth and sixth columns of CUs; depth 1, fourth CU; depth 2, fourth column of CUs; depth 3, seventh and eighth columns of CUs, see fig. 6 and 7;
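The interleaved scanning order can be generated programmatically, as in the Python sketch below. It assumes the regular pattern implied by the listed sequence, namely that the k-th depth-1 CU is followed by the k-th depth-2 column and then by depth-3 columns 2k-1 and 2k; the labels and function names are hypothetical.

```python
def column_raster(cols, rows):
    """Visit blocks column by column, top to bottom within each column."""
    return [(c, r) for c in range(cols) for r in range(rows)]


def merge_scan_order():
    """Return the interleaved scan as a list of (depth, label) pairs."""
    order = [(0, "CTU")]
    for k in range(1, 5):
        order.append((1, f"CU {k}"))
        order.append((2, f"column {k}"))
        order.append((3, f"columns {2 * k - 1}-{2 * k}"))
    return order
```

Under this assumed pattern the sequence has 13 entries, and blocks of different depths alternate so that a block is not scheduled immediately after the neighbour whose motion information it depends on.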

Referring to fig. 8, the cost calculation and comparison module includes an original pixel buffer module, a reference pixel buffer module, a SAD/SATD calculation module, an MVD bit number calculation module and a cost comparison module;

the cost comparison module is used for comparing the rate distortion cost of each PU block and selecting the motion information with the minimum rate distortion cost of each PU block.
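The comparison step amounts to keeping a running minimum per PU block. A minimal Python sketch follows, assuming that the costs of all PU blocks at one search point arrive together as a dictionary; the data layout and the name `update_best` are illustrative only.

```python
def update_best(best, pu_costs, mv):
    """Keep, for every PU block, the lowest cost seen so far and its MV.

    best: dict mapping PU id to (cost, mv), updated in place and returned.
    pu_costs: dict mapping PU id to the cost at the current search point.
    """
    for pu, cost in pu_costs.items():
        if pu not in best or cost < best[pu][0]:
            best[pu] = (cost, mv)
    return best
```

Because the same update serves whole pixel motion estimation, sub-pixel motion estimation and Merge, one comparison unit can be shared by all three, as the text describes.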

Referring to figs. 1 to 10, the implementation of the present embodiment includes the following steps:

Step S1: the encoder top-layer module inputs the original pixels, reference pixels, coordinates, and time domain and space domain reference MV lists of the current CTU;

Step S2: the sub-pixel interpolation storage module cuts the reference pixels transmitted from the top layer into search frames and stores them in the reference pixel storage matrix module; after the required reference pixels have been stored, the sub-pixel interpolation storage control module starts the sub-pixel interpolation filtering module, which reads the whole pixel points from the reference pixel storage matrix module and interpolates and filters all 1/2 and 1/4 sub-pixel values of the search frames for the cost calculations of motion estimation and Merge;

Step S3: the UMVP control module is started to calculate the MVP value of the current CTU as the starting point of whole pixel motion estimation;

Step S4: the motion estimation control module performs whole pixel motion estimation and then the two sub-pixel motion estimations in sequence. The motion vectors calculated by the whole pixel motion estimation control module and the sub-pixel motion estimation control module are input to the read address decoding module in the sub-pixel interpolation storage module, which translates the received MV signals into read addresses of the reference pixel storage matrix module. The reference pixel storage matrix module outputs reference pixels to the reference pixel screening module according to the read addresses, and the reference pixel screening module screens them and outputs reference pixels of size 32×32 to the cost calculation comparison module for cost calculation. The cost calculation comparison module calls the SAD/SATD calculation module and the MVD bit number calculation module to calculate the rate-distortion cost at the current search point, compares the rate-distortion cost of each PU block at the current search point with the stored minimum rate-distortion cost of that PU block, and retains the optimal cost and the corresponding MV (for the current CU). These are output to the partition mode selection module in the motion estimation control module, which determines the optimal partition mode of each CU block; the optimal cost, optimal partition mode and corresponding motion vector of each CU block are stored in the optimal cost, partition mode and MV storage module;

Step S5: the Merge control module is started. The MV selection control module traverses each PU block in sequence and directly takes the candidate MVs from the time domain and space domain reference MV memory module and the MV lookup table module according to the coordinates and partition mode of the PU block. According to the candidate MVs, reference pixels are read from the reference pixel storage matrix module into the cost calculation comparison module, which calculates the rate-distortion costs in sequence, compares them with the motion estimation result, and updates the optimal result in the optimal cost, partition mode and MV storage module. After Merge is finished, the optimal cost, partition mode and MV storage module outputs the optimal cost, optimal partition mode and corresponding motion vectors of all CU blocks at the four depths;

Step S6: after the Merge mode is finished, the brightness prediction pixel reconstruction module is started; it extracts the prediction pixel values of all the inter-frame blocks from the storage module according to the coding information of the current CTU for use by the subsequent reconstruction module.

The hardware circuit of this embodiment is described at the RTL level in the Verilog HDL language, and is synthesized, placed and routed on the Xilinx Vivado 2017 platform for the FPGA device of the VCU118 model; the hardware resource consumption of the system is given in Table 1.

Whole pixel motion estimation, sub-pixel motion estimation and Merge share the same cost calculation comparison module to calculate and compare costs, so the utilization rate of hardware resources is high. Whole pixel motion estimation searches with a spiral search algorithm as its core, sub-pixel motion estimation adopts a motion vector grouping strategy, and Merge prunes the length of the Merge candidate list while combining column raster scanning with interleaved scanning of CU blocks, so the computational complexity is effectively reduced and fewer clock cycles are required by the hardware implementation.

[Table 1: hardware resource consumption of the system]

The foregoing description is only of the preferred embodiments of the invention, and all changes and modifications that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

1. A hardware architecture for realizing a parallel spiral search algorithm, characterized in that: the architecture comprises a sub-pixel interpolation storage module, a UMVP control module, a motion estimation control module, a Merge control module, a cost calculation comparison module and a brightness prediction pixel reconstruction module; the motion estimation control module and the Merge control module share the cost calculation comparison module to calculate and compare rate-distortion costs;

2. The hardware architecture for implementing a parallel spiral search algorithm according to claim 1, wherein the sub-pixel interpolation storage module comprises a sub-pixel interpolation storage control module, a reference pixel storage matrix module, a whole pixel buffer module, a sub-pixel interpolation filtering module, a read address decoding module and a reference pixel screening module;

3. The hardware architecture for implementing a parallel spiral search algorithm according to claim 1, wherein the motion estimation control module comprises a whole pixel motion estimation control module, a sub-pixel motion estimation control module, a partition mode selection module, and an optimal cost, partition mode and MV storage module;

4. The hardware architecture for implementing a parallel spiral search algorithm according to claim 1, wherein the Merge control module includes a MV selection control module, a time domain MV expansion module, a time domain and space domain reference MV memory module, a CU partition pattern table module, a MV lookup table module, and a FIFO for caching data;

5. The hardware architecture for implementing a parallel spiral search algorithm according to claim 1, wherein the cost calculation comparison module includes an original pixel buffer module, a reference pixel buffer module, a SAD/SATD calculation module, an MVD bit number calculation module, and a cost comparison module, where the whole pixel motion estimation, the sub-pixel motion estimation, and the Merge share the cost calculation comparison module to perform cost calculation and cost comparison;