CN109089124B - Inter-frame data reuse method and device for motion estimation - Google Patents


Info

Publication number
CN109089124B
CN201811018540.7A (application) · CN109089124B (grant)
Authority
CN
China
Prior art keywords
frames
frame
current
chip
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811018540.7A
Other languages
Chinese (zh)
Other versions
CN109089124A (en)
Inventor
徐卫志
郭元元
于惠
陆佃杰
张宇昂
刘方爱
Current Assignee
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN201811018540.7A priority Critical patent/CN109089124B/en
Publication of CN109089124A publication Critical patent/CN109089124A/en
Application granted granted Critical
Publication of CN109089124B publication Critical patent/CN109089124B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/43Hardware specially adapted for motion estimation or compensation
    • H04N19/433Hardware specially adapted for motion estimation or compensation characterised by techniques for memory access

Abstract

The invention discloses an inter-frame data reuse method and device for motion estimation. The method comprises the following steps: processing at least two current frames in turn within the same time period, using the same starting point and scanning order; when any two adjacent frames are processed, reading the processing result data of the current frame into an on-chip cache in time so that it can be read directly when the adjacent frame is processed, thereby reusing the current frame's data across frames. Thus, when m current frames are processed within the same time period, only 1/m of the frames are read twice, and the remaining frames need to be read only once, where m ≥ 2 and m is a positive integer.

Description

Inter-frame data reuse method and device for motion estimation
Technical Field
The invention belongs to the field of data processing, and particularly relates to an inter-frame data reuse method and device for motion estimation.
Background
Frame rate up-conversion is used to improve the display quality of moving images on liquid-crystal televisions. By raising the refresh frequency of the video image, it mitigates problems such as the marked loss of motion resolution and motion-blur trailing caused by the response delay of a liquid-crystal panel. Frame rate up-conversion works by inserting one or more frames between two consecutive frames, and frame interpolation based on motion estimation is one of the most effective approaches.
However, the accuracy of motion estimation directly affects the quality of the interpolated frames. Motion estimation is usually the most computation- and memory-intensive part of frame rate up-conversion, and typically accounts for most of its running time.
Motion estimation based on block matching is the mainstream method because it is simple and efficient. Block matching finds the reference block that best matches the current macroblock; the sum of absolute differences (SAD) is the criterion for determining the best match, and the displacement between the current macroblock and the best-matching reference macroblock is the motion vector (MV).
Full-search motion estimation (FSIME) uses a brute-force search to find the best-matching macroblock in the search window with optimal accuracy. Because of its regularity, FSIME is well suited to hardware implementation, but it requires a large amount of computation and memory access.
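Since full search is simply an exhaustive SAD minimization over the search window, its core can be sketched in a few lines of NumPy. This is a simplified software illustration, not the patent's hardware design; the block size and search range are arbitrary example values:

```python
import numpy as np

def sad(a, b):
    # Sum of Absolute Differences: the L1 distance between two equal-sized blocks.
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def full_search(current, reference, cx, cy, n=8, r=8):
    """Brute-force block matching: compare the n-by-n current block at (cx, cy)
    against every candidate position within +/- r pixels in the reference
    frame, and return the motion vector of the minimum-SAD match."""
    cur_block = current[cy:cy + n, cx:cx + n]
    h, w = reference.shape
    best_cost, best_mv = None, (0, 0)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            ry, rx = cy + dy, cx + dx
            if ry < 0 or rx < 0 or ry + n > h or rx + n > w:
                continue  # candidate block would fall outside the frame
            cost = sad(cur_block, reference[ry:ry + n, rx:rx + n])
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost
```

Because every candidate in the window is evaluated, the result is exact, which is why the accuracy is optimal while the computation and memory traffic are high.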
In addition, fast search algorithms such as three-step search, new three-step search, diamond search, and four-step search have been proposed to reduce the time overhead, but they usually incur some loss of accuracy: a fast search may fail to find the best-matching macroblock, and some fast search algorithms are not suitable for hardware implementation.
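The accuracy loss of fast search can be seen with the classic three-step search (a sketch of the well-known algorithm, not part of the patent): evaluate the centre and its eight neighbours at a step of 4 pixels, jump to the best point, halve the step (4 → 2 → 1), and repeat. It needs far fewer SAD evaluations than full search, but can settle on a local minimum:

```python
import numpy as np

def three_step_search(current, reference, cx, cy, n=8):
    """Classic three-step search over roughly a +/-7 pixel range: at each of
    three rounds, test the 3x3 grid of points spaced `step` pixels around the
    current best position, then halve the step.  May miss the global best."""
    cur = current[cy:cy + n, cx:cx + n].astype(np.int32)
    h, w = reference.shape
    best_dx = best_dy = 0
    best_cost = None
    step = 4
    while step >= 1:
        centre = (best_dx, best_dy)          # search around this round's centre
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                ty, tx = cy + centre[1] + dy, cx + centre[0] + dx
                if ty < 0 or tx < 0 or ty + n > h or tx + n > w:
                    continue  # candidate outside the reference frame
                cost = int(np.abs(cur - reference[ty:ty + n, tx:tx + n].astype(np.int32)).sum())
                if best_cost is None or cost < best_cost:
                    best_cost = cost
                    best_dx, best_dy = centre[0] + dx, centre[1] + dy
        step //= 2
    return (best_dx, best_dy), best_cost
```

On smooth image content the early large jumps can lock the search onto the wrong direction, so the returned vector need not be the true motion, whereas full search would find it.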
In recent years, the gap between processor computing speed and memory access speed has kept widening. Reducing off-chip memory access has become an important means of improving the performance of real-time video applications, and on-chip data reuse is an effective way to reduce off-chip access. Several data reuse methods have been proposed for FSIME, but these efforts focus on improving data reuse within reference frames while ignoring data reuse between frames. For frame rate up-conversion, a conventional memory access design reads each frame twice from off-chip memory, first as the current frame and then as the previous frame, which slows down video processing.
In summary, the prior art offers no effective solution to the problem that, in a conventional memory access design, each frame must be read twice from off-chip memory.
Disclosure of Invention
To solve the problem that each frame must be read twice from off-chip memory in a conventional memory access design, the first object of the invention is to provide an inter-frame data reuse method for motion estimation that reduces the number of memory accesses and increases video processing speed through data reuse between adjacent frames.
The inter-frame data reuse method for motion estimation of the invention comprises the following steps:
processing at least two current frames in turn within the same time period, using the same starting point and scanning order;
when any two adjacent frames are processed, reading the processing result data of the current frame into an on-chip cache in time so that it can be read directly when the adjacent frame is processed, thereby reusing the current frame's data across frames. Thus, when m current frames are processed within the same time period, only 1/m of the frames are read twice and the remaining frames need to be read only once, where m ≥ 2 and m is a positive integer.
Further, processing the current frame comprises: computing the sum of absolute differences and the motion vector of each macroblock according to the pre-divided macroblocks.
The sum of absolute differences, the criterion for determining the reference block that best matches the current macroblock, measures the similarity between image blocks: the absolute value of the difference between each pixel of the original block and the corresponding pixel of the block under comparison is taken, and these differences are summed. The result is a simple similarity measure, the L1 norm of the difference image, i.e., the Manhattan distance between the two blocks.
The motion vector is the displacement between the current macroblock and the reference macroblock.
Further, the inter-frame data reuse efficiency increases with the number of current frames processed within the same time period.
A second object of the invention is to provide an inter-frame data reuse device for motion estimation that reduces the number of memory accesses and increases video processing speed through data reuse between adjacent frames.
The inter-frame data reuse device for motion estimation of the invention comprises:
an off-chip memory for storing frame-sequence image data;
an on-chip cache for storing reusable data;
a processor configured to perform the following steps:
processing at least two current frames in turn within the same time period, using the same starting point and scanning order;
when any two adjacent frames are processed, reading the processing result data of the current frame into an on-chip cache in time so that it can be read directly when the adjacent frame is processed, thereby reusing the current frame's data across frames. Thus, when m current frames are processed within the same time period, only 1/m of the frames are read twice and the remaining frames need to be read only once, where m ≥ 2 and m is a positive integer.
Further, the processor is further configured to:
compute the sum of absolute differences and the motion vector of each macroblock according to the pre-divided macroblocks.
Further, in one embodiment the device is a unidirectional frame rate up-conversion motion estimation architecture: when any two adjacent frames are processed, a current macroblock cache for the current frame and two search window caches, one for the image data of each of the two adjacent frames, are allocated in the on-chip cache.
Further, in another embodiment the device is a bidirectional frame rate up-conversion motion estimation architecture: when any two adjacent frames are processed, three search window caches are allocated in the on-chip cache, storing the image data of the two adjacent frames and of the current frame respectively.
Further, the processor comprises multiple processing element arrays working in parallel.
Further, the processor is a GPU.
Further, the on-chip cache is the shared memory of the GPU.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention processes at least two current frames in turn within the same time period, using the same starting point and scanning order; when any two adjacent frames are processed, the processing result data of the current frame is read into the on-chip cache in time for direct reading when the adjacent frame is processed, realizing inter-frame reuse of the current frame's data. This reduces the number of memory accesses, lowers the memory access time and bandwidth requirements, and increases the running speed of motion estimation and related video applications.
(2) The invention does not need to store a whole frame on-chip; only several search window caches need to be kept on-chip, which reduces the on-chip memory required by the data reuse technique and improves the chip layout of the hardware.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a diagram of the Inter-C architecture on FRUC-UME according to the invention;
FIG. 2 is a sequence diagram of the PEA processing current CBs in the Inter-C architecture on FRUC-UME according to the invention;
FIG. 3 is a diagram of the Inter-C architecture on FRUC-BME according to the invention;
FIG. 4 is a graph of the number of reads of each frame in a frame sequence under the Inter-C architecture according to the invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Interpretation of terms:
FRUC-ME: motion estimation in Frame Rate Up Conversion, a Motion estimation algorithm for Frame Rate Up.
FRUC-UME: unidirectional FRUC-ME, one-way frame rate up motion estimation.
FRUC-BME: bidirectional FRUC-ME, Bidirectional frame rate boost motion estimation.
PEA: processing Element Array, Processing Element Array.
Inter-C: and C-level inter-frame data reuse.
SR: search Range, Search window.
CB: current Block, Current macroblock.
The principle of the inter-frame data reuse method for motion estimation of the invention is as follows:
process at least two current frames in turn within the same time period, using the same starting point and scanning order;
when any two adjacent frames are processed, read the processing result data of the current frame into an on-chip cache in time so that it can be read directly when the adjacent frame is processed, thereby reusing the current frame's data across frames. Thus, when m current frames are processed within the same time period, only 1/m of the frames are read twice and the remaining frames need to be read only once, where m ≥ 2 and m is a positive integer.
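The claimed access pattern can be checked with a short simulation. This is an illustrative model, not the patent's implementation: it treats whole frames as the unit of on-chip reuse, whereas the actual architecture caches search-window-sized pieces block by block in the same scanning order.

```python
def count_offchip_reads(num_frames, m):
    """Model the Inter-C schedule: every frame is needed twice (as the current
    frame matched against frame f-1, and as the reference for frame f+1), and
    m consecutive current frames are processed per time period.  Data loaded
    on-chip during a period is reused only within that same period."""
    reads = [0] * num_frames
    for start in range(1, num_frames, m):
        on_chip = set()
        for f in range(start, min(start + m, num_frames)):
            for needed in (f - 1, f):          # reference frame, then current frame
                if needed not in on_chip:
                    reads[needed] += 1         # an off-chip read
                    on_chip.add(needed)
    return reads
```

With m = 2 a seven-frame sequence gives reads of [1, 1, 2, 1, 2, 1, 1]: every second frame is read twice. With m = 3 only every third frame is read twice, matching the 1/m claim; m = 1 degenerates to intra-frame reuse only, where every interior frame must be read twice.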
The implementation is illustrated with two typical FRUC-ME variants.
With the proposed Inter-C method, the whole frame need not be stored on-chip; only multiple SRs need to be kept on-chip at the same time, with one SR cache storing one SR.
FIG. 1 shows the Inter-C architecture on FRUC-UME according to the invention.
As shown in FIG. 1, the FRUC-UME Inter-C architecture keeps two SR caches and one CB cache on-chip. The two SR caches belong to the two frames Frame i and Frame i-1; that is, each SR cache stores one SR of a reference frame.
The CB cache belongs to Frame i+1. Both the search window width (SRH) and the search window height (SRV) equal twice the macroblock size (2N).
The PEA computes the SAD value and motion vector of the current block, and in this architecture processes the CBs of the two current frames (Frame i and Frame i+1) in turn. Frame i is the current frame with respect to Frame i-1 and also the reference frame of Frame i+1.
The object of this embodiment is to reduce the number of reads of Frame i from two to one. The CBs of Frame i are contained in the SR cache of Frame i; therefore, the PEA processes the CBs of Frame i and Frame i+1 in turn, using the same starting point and scanning order.
where 0 ≤ i ≤ m-1, m ≥ 2, and m is a positive integer.
FIG. 2 shows the sequence in which the PEA processes current CBs in the Inter-C architecture on FRUC-UME.
As shown in FIG. 2, after processing CB0 of Frame i+1 in Step 0, the PEA processes CB0 of Frame i in Step 1. CB0 of Frame i is already in the SR cache at this point (it was read on-chip in Step 0), so it can be reused in Step 1.
The PEA then processes CB1 of Frame i+1 in Step 2 and CB1 of Frame i in Step 3. CB1 of Frame i is likewise already in the SR cache (read on-chip in Step 2), so it is reused in Step 3.
In this way, every CB of Frame i is read into the on-chip cache just in time and never needs to be read from off-chip again, so "inter-frame" data reuse of Frame i is realized within the SR cache.
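The alternating order of Steps 0-3 can be written down directly. A toy sketch (the frame labels and CB numbering follow FIG. 2):

```python
def inter_c_schedule(num_cbs):
    """Order in which the PEA processes current blocks on the FRUC-UME Inter-C
    architecture: each CB of Frame i+1 is processed first (which loads the
    corresponding data of Frame i into the SR cache), then the same-numbered
    CB of Frame i is processed, reusing that cached data."""
    schedule = []
    for b in range(num_cbs):
        schedule.append((f"Step {2 * b}", "Frame i+1", f"CB{b}"))    # reads CB b of Frame i on-chip
        schedule.append((f"Step {2 * b + 1}", "Frame i", f"CB{b}"))  # reuses CB b from the SR cache
    return schedule
```

Every odd-numbered step hits data that the preceding even-numbered step already brought on-chip, which is exactly why Frame i is read from off-chip only once.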
However, Frame i-1 and Frame i+1 still need to be read on-chip twice in this scheme.
By extension, one out of every m frames must be read twice from off-chip while the other frames are read only once, so the inter-frame data reuse efficiency Ra of FRUC-UME Inter-C can be calculated as:
[Equation image in the original, expressing the reuse efficiency Ra of FRUC-UME Inter-C in terms of SRV, W, H, N, and m.]
where SRV is the height of the search window, W and H are the width and height of a frame, N is the macroblock size, m is the number of current frames processed simultaneously, m ≥ 2, and m is a positive integer.
FIG. 3 shows the Inter-C architecture on FRUC-BME according to the invention.
As shown in FIG. 3, the difference from FRUC-UME is that FRUC-BME replaces the CB cache of Frame i+1 with an SR cache, because each interpolated macroblock requires search windows in both adjacent frames. The CB processing order of FRUC-BME is the same as that of FRUC-UME.
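Under the stated sizing (SRH = SRV = 2N), the on-chip footprints of the two variants are easy to compare. The sketch below assumes 8-bit pixels and treats an SR cache as a 2N-by-2N buffer and a CB cache as N-by-N, which is a simplification of the actual design:

```python
def onchip_bytes(n, arch, bytes_per_pixel=1):
    """Approximate on-chip buffer bytes for the Inter-C architectures.
    FRUC-UME keeps two SR caches plus one CB cache; FRUC-BME keeps three
    SR caches (the CB cache of Frame i+1 is replaced by an SR cache)."""
    sr = (2 * n) * (2 * n) * bytes_per_pixel   # one search-window cache
    cb = n * n * bytes_per_pixel               # one current-block cache
    return 2 * sr + cb if arch == "UME" else 3 * sr

# For 16x16 macroblocks both totals are a few KiB, versus roughly 2 MB for a
# whole 1920x1080 frame, which is why whole frames need not be stored on-chip.
```

The comparison illustrates the second beneficial effect: the data reuse technique needs only search-window-sized buffers, not frame-sized ones.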
The inter-frame data reuse efficiency Ra of FRUC-BME Inter-C is calculated similarly to that of FRUC-UME Inter-C:
[Equation image in the original, expressing the reuse efficiency Ra of FRUC-BME Inter-C in terms of SRV, W, H, N, and m.]
where SRV is the vertical search range, W and H are the width and height of a frame, N is the macroblock size, m is the number of current frames processed simultaneously, m ≥ 2, and m is a positive integer.
FIG. 4 shows the number of reads of each frame in a frame sequence under the Inter-C architecture.
As shown in FIG. 4, the number of current frames processed within the same time period can be increased by adding SR caches, which raises the data reuse efficiency.
If only one current frame is processed per time period, only intra-frame data reuse is possible and every frame must be read on-chip twice.
If two current frames are processed simultaneously, inter-frame data reuse becomes possible and half of the frames need to be read only once.
If three current frames are processed simultaneously, only one third of the frames need to be read twice.
The data reuse efficiency thus grows with the number of current frames processed within the same time period: when m current frames are processed simultaneously, only 1/m of the frames are read twice and the rest are read only once.
The invention also provides an inter-frame data reuse device for motion estimation, comprising:
an off-chip memory for storing frame-sequence image data;
an on-chip cache for storing reusable data;
a processor configured to perform the following steps:
processing at least two current frames in turn within the same time period, using the same starting point and scanning order;
when any two adjacent frames are processed, reading the processing result data of the current frame into an on-chip cache in time so that it can be read directly when the adjacent frame is processed, thereby reusing the current frame's data across frames. Thus, when m current frames are processed within the same time period, only 1/m of the frames are read twice and the remaining frames need to be read only once, where m ≥ 2 and m is a positive integer.
In a specific implementation, the processor is further configured to:
compute the sum of absolute differences and the motion vector of each macroblock according to the pre-divided macroblocks.
In one embodiment, the device is a unidirectional frame rate up-conversion motion estimation architecture: when any two adjacent frames are processed, a current macroblock cache for the current frame and two search window caches, one for the image data of each of the two adjacent frames, are allocated in the on-chip cache.
In another embodiment, the device is a bidirectional frame rate up-conversion motion estimation architecture: when any two adjacent frames are processed, three search window caches are allocated in the on-chip cache for the image data of the two adjacent frames and of the current frame respectively.
In a specific implementation, the processor comprises multiple processing element arrays working in parallel; parallelism can be increased by adding PEAs.
Besides a PEA-based implementation, inter-frame data reuse can also be realized with the shared memory of a GPU (graphics processing unit).
In that case the processor is a GPU and the on-chip cache is the GPU's shared memory.
In addition, Inter-C can simultaneously exploit data reuse between adjacent SRs within the previous frame; that is, Inter-C is compatible with Intra-C.
The invention processes at least two current frames in turn within the same time period, using the same starting point and scanning order; when any two adjacent frames are processed, the processing result data of the current frame is read into the on-chip cache in time for direct reading when the adjacent frame is processed, realizing inter-frame reuse of the current frame's data. This reduces the number of memory accesses, lowers the memory access time and bandwidth requirements, and increases the running speed of motion estimation and related video applications.
The invention does not need to store a whole frame on-chip; only several search window caches need to be kept on-chip, which reduces the on-chip memory required by the data reuse technique and improves the chip layout of the hardware.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they do not limit the scope of the invention; it should be understood by those skilled in the art that various modifications and variations can be made on the basis of the technical solution of the invention without inventive effort.

Claims (7)

1. An inter-frame data reuse method for motion estimation, characterized by comprising the following steps:
processing at least two current frames in turn within the same time period, using the same starting point and scanning order;
when any two adjacent frames are processed, reading the processing result data of the current frame into an on-chip cache in time so that it can be read directly when the adjacent frame is processed, thereby reusing the current frame's data across frames, so that when m current frames are processed within the same time period, only 1/m of the frames are read twice and the remaining frames need to be read only once, where m ≥ 2 and m is a positive integer;
the inter-frame data reuse method for motion estimation being executed on the Inter-C architecture on FRUC-UME;
when any two adjacent frames are processed, a current macroblock cache for the current frame and two search window caches, one for the image data of each of the two adjacent frames, being allocated in the on-chip cache;
a whole frame not needing to be stored on-chip, only several search window caches needing to be kept on-chip, which reduces the on-chip memory required by the data reuse technique;
processing the current frame comprising: computing the sum of absolute differences and the motion vector of each macroblock according to the pre-divided macroblocks.
2. The method of claim 1, wherein the inter-frame data reuse efficiency increases with the number of current frames processed within the same time period.
3. An inter-frame data reuse device for motion estimation, characterized by comprising:
an off-chip memory for storing frame-sequence image data;
an on-chip cache for storing reusable data;
a processor configured to perform the following steps:
processing at least two current frames in turn within the same time period, using the same starting point and scanning order;
when any two adjacent frames are processed, reading the processing result data of the current frame into an on-chip cache in time so that it can be read directly when the adjacent frame is processed, thereby reusing the current frame's data across frames, so that when m current frames are processed within the same time period, only 1/m of the frames are read twice and the remaining frames need to be read only once, where m ≥ 2 and m is a positive integer;
the processor being further configured to:
compute the sum of absolute differences and the motion vector of each macroblock according to the pre-divided macroblocks;
the device being a unidirectional frame rate up-conversion motion estimation architecture in which, when any two adjacent frames are processed, a current macroblock cache for the current frame and two search window caches, one for the image data of each of the two adjacent frames, are allocated in the on-chip cache;
a whole frame not needing to be stored on-chip, only several search window caches needing to be kept on-chip, which reduces the on-chip memory required by the data reuse technique;
processing the current frame comprising: computing the sum of absolute differences and the motion vector of each macroblock according to the pre-divided macroblocks.
4. The device of claim 3, wherein the device is a bidirectional frame rate up-conversion motion estimation architecture, and when any two adjacent frames are processed, three search window caches are allocated in the on-chip cache for the image data of the two adjacent frames and of the current frame.
5. The device of claim 3, wherein the processor comprises multiple processing element arrays working in parallel.
6. The device of claim 3, wherein the processor is a GPU.
7. The device of claim 6, wherein the on-chip cache is the shared memory of the GPU.
CN201811018540.7A 2018-09-03 2018-09-03 Inter-frame data reuse method and device for motion estimation Active CN109089124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811018540.7A CN109089124B (en) 2018-09-03 2018-09-03 Inter-frame data reuse method and device for motion estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811018540.7A CN109089124B (en) 2018-09-03 2018-09-03 Inter-frame data reuse method and device for motion estimation

Publications (2)

Publication Number Publication Date
CN109089124A CN109089124A (en) 2018-12-25
CN109089124B true CN109089124B (en) 2021-10-19

Family

ID=64840555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811018540.7A Active CN109089124B (en) 2018-09-03 2018-09-03 Inter-frame data reuse method and device for motion estimation

Country Status (1)

Country Link
CN (1) CN109089124B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107615763A (en) * 2015-05-28 2018-01-19 寰发股份有限公司 The method and device of reference picture is used as using present image
CN107925769A (en) * 2015-09-08 2018-04-17 联发科技股份有限公司 Method and system for the device of decoded picture buffer of intra block replication mode

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018148486A (en) * 2017-03-08 2018-09-20 キヤノン株式会社 Image encoding apparatus, image encoding method and program


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Weizhi Xu, "A Novel Data Reuse Method to Reduce Demand on Memory Bandwidth and Power Consumption for True Motion Estimation", IEEE Access, vol. 6, 2018-02-19, pp. 10151-10159 *

Also Published As

Publication number Publication date
CN109089124A (en) 2018-12-25

Similar Documents

Publication Publication Date Title
US9471843B2 (en) Apparatus and method for ultra-high resolution video processing
US8218635B2 (en) Systolic-array based systems and methods for performing block matching in motion compensation
US8265160B2 (en) Parallel three-dimensional recursive search (3DRS) meandering algorithm
US9262839B2 (en) Image processing device and image processing method
US20160080768A1 (en) Encoding system using motion estimation and encoding method using motion estimation
US8345764B2 (en) Motion estimation device having motion estimation processing elements with adder tree arrays
EP3051816B1 (en) Cache fill in an image processing device
US8761239B2 (en) Image coding apparatus, method, integrated circuit, and program
US10924753B2 (en) Modular motion estimation and mode decision engine
US6335950B1 (en) Motion estimation engine
CN103929648A (en) Motion estimation method and device in frame rate up conversion
US6501799B1 (en) Dual-prime motion estimation engine
CN109089124B (en) Inter-frame data reuse method and device for motion estimation
Moshnyaga A new computationally adaptive formulation of block-matching motion estimation
KR20090004574A (en) Operation unit and image filtering device
WO2011001364A1 (en) Parallel three-dimensional recursive search (3drs) meandering algorithm
CN103327340A (en) Method and device for searching integer
US20070153909A1 (en) Apparatus for image encoding and method thereof
CN109427071B (en) Full search block matching method and device
US7840080B1 (en) Motion estimator architecture for low bit rate image communication
US20100239018A1 (en) Video processing method and video processor
KR100571907B1 (en) Method for determining the number of processing element in moving picture estimating algorithm
Stankowski et al. Massively Parallel CPU-based Virtual View Synthesis with Atomic Z-test
CN101727655B (en) Image zooming method and device thereof
CN101459761A (en) Image processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant