WO2023123873A1 - Dense optical flow calculation method employing attention mechanism - Google Patents


Info

Publication number
WO2023123873A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
optical flow
generate
network
sequence
Prior art date
Application number
PCT/CN2022/097531
Other languages
French (fr)
Chinese (zh)
Inventor
张继东
吕超
曹靖城
涂娟娟
Original Assignee
天翼数字生活科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 天翼数字生活科技有限公司 filed Critical 天翼数字生活科技有限公司
Publication of WO2023123873A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/207Analysis of motion for motion estimation over a hierarchy of resolutions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4038Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images

Definitions

  • the invention relates to the field of video applications, and mainly relates to dense optical flow calculation in video applications.
  • optical flow is the instantaneous velocity of the pixel motion of a spatially moving object on the observation imaging plane.
  • the optical flow method uses the changes of pixels in the image sequence in the time domain and the correlation between adjacent frames to find the corresponding relationship between the previous frame and the current frame, thereby calculating the motion of objects between adjacent frames.
  • the traditional methods of calculating optical flow mainly include gradient-based, frequency-based, phase-based and matching-based methods.
  • Dense optical flow is an image registration method for point-by-point matching of an image or a specified area. It calculates the offset of all points on the image to form a dense optical flow field. Through this dense optical flow field, pixel-level image registration can be performed.
  • the Horn-Schunck algorithm and most optical flow methods based on region matching belong to the category of dense optical flow.
  • FlowNet is the most widely used in practical applications.
  • the patent "Robust interpolation optical flow calculation method for pyramid occlusion detection block matching" (CN112509014A) discloses a method in which pyramid occlusion-detection block matching is first performed to obtain a sparse robust motion field: two consecutive frames are downsampled by a factor into a k-level image pyramid, and block matching on each pyramid layer yields matching results with initial occlusions; occlusion information is then obtained by an occlusion-detection algorithm based on deformation error.
  • the accurate sparse matching results are turned into a dense optical flow by a robust interpolation algorithm; the dense optical flow is then refined by global energy-functional variational optimization to obtain the final optical flow.
  • the patent "An image sequence optical flow estimation method based on a learnable occlusion mask and secondary deformation optimization" (CN112465872A) discloses a method in which any two consecutive frames of the input image sequence are downsampled into a feature pyramid to obtain multi-resolution two-frame features; the correlation between the first-frame and second-frame features is computed at each pyramid level and used to build an occlusion-mask module; the resulting occlusion mask removes edge artifacts from the warped features to improve the optical flow at blurred motion edges; the occlusion-constrained flow drives a secondary deformation optimization module that further refines motion-edge flow estimation at the sub-pixel level; the same occlusion masking and secondary deformation are applied to the warped features at every pyramid level to obtain residual flows that refine the optical flow, and the final optimized estimate is output at the bottom of the pyramid.
  • compared with existing dense optical flow methods, the present invention introduces a multi-head self-attention mechanism into the optical flow prediction task and exploits the Transformer's global self-attention advantage in sequence-to-sequence prediction to improve the optical flow calculation task.
  • the present invention can improve the accuracy of the dense optical flow map at key positions, and at the same time improve the timeliness of dense optical flow calculation by reducing the network depth of Unet's up-sampling and down-sampling.
  • a method for calculating dense optical flow, including: stitching adjacent frames on their channels to generate a stitched vector map; inputting the stitched vector map into a downsampling network for feature extraction to generate a feature vector; mapping the generated feature vector into the high-dimensional embedding space of the latent layer to generate a high-dimensional embedding representation sequence; inputting the sequence into a feature processing network composed of I Transformer layers to generate a hidden feature sequence; reorganizing the hidden feature sequence to generate a reorganized feature vector; and inputting the reorganized feature vector into an upsampling network for processing to generate a dense optical flow map.
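The claimed pipeline can be followed as a shape-only walkthrough. All concrete sizes below are illustrative assumptions (the patent does not fix them), and the factor of 32 merely reflects the five stride-2 convolutions described later in the text:

```python
# Shape-only sketch of the claimed six-step pipeline. Every dimension here
# is a hypothetical example; the patent leaves the exact sizes unspecified.
h, w, d = 256, 320, 512          # frame size and embedding width (assumed)
factor = 2 ** 5                  # 5 of the 7 convolution blocks use stride 2

stitched = (h, w, 6)                     # two RGB frames stitched on channels
feat = (h // factor, w // factor, d)     # after the downsampling network
seq = (feat[0] * feat[1], d)             # flattened embedding representation sequence
hidden = seq                             # I Transformer layers preserve the shape
reorg = feat                             # hidden sequence reorganized into a map
flow = (h, w, 3)                         # upsampled dense optical flow map

for name, shape in [("stitched", stitched), ("sequence", seq), ("flow", flow)]:
    print(name, shape)
```

The point of the sketch is that only the middle of the pipeline is a token sequence; the ends remain image-shaped tensors.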
  • a system for computing dense optical flow including a downsampling module, a feature processing module and an upsampling module.
  • the down-sampling module is configured to: splice adjacent frames on the channel to generate a spliced vector map; input the spliced vector map into the down-sampling network for feature extraction to generate a feature vector.
  • the feature processing module is configured to: map the feature vector generated by the downsampling module into the high-dimensional embedding space of the latent layer to generate a high-dimensional embedding representation sequence; and input the sequence into a feature processing network composed of I Transformer layers to generate a hidden feature sequence.
  • the upsampling module is configured to: reorganize the hidden feature sequence generated by the feature processing module to generate a reorganized feature vector; and input the reorganized feature vector into the upsampling network for processing to generate a dense optical flow map.
  • a computing device for dense optical flow calculation, including: a processor; and a memory storing instructions which, when executed by the processor, perform the method described above.
  • FIG. 1 shows a block diagram of a system 100 for dense optical flow calculation according to an embodiment of the present invention
  • FIG. 2 shows a detailed diagram 200 of each module 101-103 in FIG. 1 according to an embodiment of the present invention
  • FIG. 3 shows a flowchart of a method 300 for calculating dense optical flow according to an embodiment of the present invention.
  • FIG. 4 shows a block diagram 400 of an exemplary computing device according to one embodiment of the invention.
  • Unet is a segmentation model; specifically, it is a fully convolutional network comprising 4 downsampling layers, 4 upsampling layers, and skip connections. The structure is symmetrical, and the feature map at the downsampling end can skip the deeper layers and be spliced directly to the corresponding upsampling end.
  • Transformer is a natural language processing (NLP) model that uses an attention mechanism for machine translation tasks.
  • optical flow plays an important role and has very important applications in target object segmentation, recognition, tracking, robot navigation, and shape information recovery.
  • Optical flow computing can be widely used in various scenarios, for example, motion detection of video codec in cloud storage video compression tasks, high-altitude parabolic, fall detection and other motion recognition and video understanding tasks.
  • dense optical flow calculation is a key module in video coding and decoding technology.
  • the traditional dense optical flow calculation method has a large amount of calculation and poor timeliness.
  • the existing optical flow calculation methods based on deep learning methods have improved timeliness, but the accuracy of dense optical flow maps is low, which will have a negative impact on the quality of video encoding and decoding.
  • the present invention proposes a dense optical flow calculation method based on Unet and Transformer.
  • the method introduces the Transformer module into the Unet structure, and utilizes Transformer's global self-attention advantage in sequence-to-sequence prediction to improve the accuracy of dense optical flow at key positions. At the same time, it can also reduce the network depth of Unet's upsampling and downsampling, and improve the timeliness of dense optical flow calculations.
  • FIG. 1 shows a block diagram of a system 100 for calculating dense optical flow according to an embodiment of the present invention.
  • the system 100 is divided into modules, and communication and data exchange are performed between modules in a manner known in the art.
  • each module can be implemented by software or hardware or a combination thereof.
  • the system 100 may include a downsampling module 101 , a feature processing module 102 and an upsampling module 103 .
  • the downsampling module 101 is configured to stitch two adjacent frames on channels (for example, color channels) to form an input picture, which is input to a convolutional network for downsampling, thereby obtaining a feature map.
  • the feature processing module 102 is configured to encode the feature maps output by the down-sampling module 101 into an input sequence and perform global context feature processing on it.
  • the upsampling module 103 is configured as a cascaded upsampler, which upsamples the feature map after feature processing to reconstruct an optical flow map with the same size as the input picture.
  • FIG. 2 shows a detailed diagram 200 of each of the modules 101-103 in FIG. 1 according to one embodiment of the present invention.
  • the downsampling module 101 receives two adjacent frames 201 and first splices them to obtain a vector map of size h × w × 6, which is then input into a downsampling network composed of 7 convolutional blocks; each convolutional block consists of a convolutional layer and a ReLU activation function, and 5 of the convolutional layers have a stride of 2.
  • the down-sampling module 101 outputs a feature map for the feature processing module 102 to process.
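With 5 of the 7 convolution blocks using stride 2, the spatial resolution shrinks by a factor of 2^5 = 32. A rough size calculation, under the assumption of "same" padding (the patent elides the exact feature-map dimensions):

```python
def downsampled_size(h, w, stride2_layers=5):
    """Spatial size after the stride-2 convolution blocks (padding assumed 'same')."""
    factor = 2 ** stride2_layers
    return h // factor, w // factor

# Hypothetical 256 x 320 input frames:
print(downsampled_size(256, 320))  # (8, 10)
```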
  • the feature processing module 102 includes using a trainable linear map E to map the feature map sequence output by the downsampling module 101 into the high-dimensional embedding space of the latent layer, and the calculation method is shown in formula (1) :
  • the high-dimensional embedding representation sequence is then fed into a feature processing network consisting of I Transformer layers.
  • the specific structure of the Transformer layer is shown in Figure 3.
  • the Transformer layer is composed of a Multihead Self-Attention (MSA) and a Multi-Layer Perceptron (MLP).
  • the output of the i-th layer is shown in formulas (2) and (3):
  • the feature processing module 102 finally outputs the hidden feature sequence z I .
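Formulas (2) and (3) are not reproduced in this text, but a Transformer layer of the kind described (MSA followed by an MLP, each with layer normalization and a residual connection, in the standard pre-norm arrangement) can be sketched in NumPy. This is an illustrative reconstruction with randomly initialized weights, not the patent's exact formulation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multihead_self_attention(x, heads, rng):
    """Multihead Self-Attention (MSA) over a sequence x of shape (n, d)."""
    n, d = x.shape
    dh = d // heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.zeros_like(x)
    for h in range(heads):                      # attend within each head's slice
        s = slice(h * dh, (h + 1) * dh)
        att = softmax(q[:, s] @ k[:, s].T / np.sqrt(dh))
        out[:, s] = att @ v[:, s]
    return out @ Wo

def mlp(x, rng):
    """Two-layer Multi-Layer Perceptron (MLP) with ReLU."""
    n, d = x.shape
    W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
    W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)
    return np.maximum(x @ W1, 0.0) @ W2

def transformer_layer(z, heads=4, seed=0):
    rng = np.random.default_rng(seed)
    z = z + multihead_self_attention(layer_norm(z), heads, rng)  # cf. formula (2)
    z = z + mlp(layer_norm(z), rng)                              # cf. formula (3)
    return z

z = np.random.default_rng(1).standard_normal((16, 32))  # 16 tokens, width 32
print(transformer_layer(z).shape)  # (16, 32)
```

Stacking I such layers leaves the sequence shape unchanged, which is why the module can simply hand the hidden feature sequence z_I back for spatial reorganization.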
  • the upsampling module 103 is a cascaded upsampling network, which includes multiple upsampling steps to decode and output the final optical flow picture 202 .
  • the up-sampling module 103 reorganizes the hidden feature sequence z_I output by the feature processing module 102 into a feature vector, which is then input into an upsampling network consisting of 7 deconvolution blocks; each deconvolution block consists of a deconvolution layer and a ReLU activation function, and 5 of the deconvolution layers have a stride of 2.
  • an optical flow map output of size h ⁇ w ⁇ 3 is obtained.
  • the present invention adds three skip connections between the downsampled feature vectors to achieve feature aggregation at different resolution levels (203, 204, 205), thereby refining the details of the optical flow.
  • FIG. 3 shows a flowchart of a method 300 for dense optical flow calculation according to an embodiment of the present invention.
  • in step 301, adjacent frames are spliced on channels to generate a spliced vector map.
  • the channel is a color channel, such as an RGB channel.
  • the size of the vector map is h ⁇ w ⁇ 6.
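The channel-wise splicing of step 301 amounts to concatenating the two frames along the channel axis. A minimal NumPy illustration with a hypothetical frame size:

```python
import numpy as np

h, w = 128, 192                       # hypothetical frame size
frame1 = np.zeros((h, w, 3))          # previous frame, RGB
frame2 = np.ones((h, w, 3))           # current frame, RGB

# Stitch on the channel axis: two h x w x 3 frames become one h x w x 6 map.
stitched = np.concatenate([frame1, frame2], axis=-1)
print(stitched.shape)  # (128, 192, 6)
```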
  • in step 302, the spliced vector map is input into the downsampling network for feature extraction to generate feature vectors.
  • the downsampling network is composed of 7 convolutional blocks, and each convolutional block is composed of a convolutional layer and a ReLU activation function, and the stride of 5 convolutional layers is 2.
  • the size of the feature vector is
  • in step 303, the feature vector generated in step 302 is mapped to the high-dimensional embedding space of the latent layer to generate a high-dimensional embedding representation sequence.
  • a trainable linear map E can be used to map the feature vector obtained in step 302 into the high-dimensional embedding space of the latent layer.
  • in step 304, the high-dimensional embedding representation sequence is input into a feature processing network composed of I Transformer layers to generate a hidden feature sequence.
  • the Transformer layer consists of MSA and MLP for global contextual feature processing.
  • in step 305, the hidden feature sequence generated in step 304 is reorganized to generate a reorganized feature vector.
  • the hidden feature sequence z_I is reorganized into a feature vector.
  • the reorganized feature vector is input into the upsampling network for processing to generate a dense optical flow map.
  • the dense optical flow map may reflect the optical flow of object motion in two adjacent frames obtained in step 301 .
  • the upsampling network is composed of 7 deconvolution blocks, and each deconvolution block is composed of a deconvolution layer and a ReLU activation function, wherein the step size of the 5 deconvolution layers is 2.
  • the size of the dense optical flow map is h ⁇ w ⁇ 3.
  • the upsampling network is a cascaded upsampling network, which realizes feature aggregation at different resolution levels, thereby optimizing the details of dense optical flow.
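The output size of each deconvolution (transposed convolution) layer follows the standard relation out = (in − 1)·stride − 2·pad + kernel. Under assumed kernel and padding values (the patent does not give them), five stride-2 layers restore the input resolution:

```python
def deconv_out(size, kernel=4, stride=2, pad=1):
    """Output length of a transposed convolution along one spatial axis.
    kernel=4, pad=1 are illustrative assumptions that make each stride-2
    layer exactly double the spatial size."""
    return (size - 1) * stride - 2 * pad + kernel

size = 8                      # assumed downsampled height (e.g. 256 / 32)
for _ in range(5):            # the 5 stride-2 deconvolution layers
    size = deconv_out(size)
print(size)  # 256
```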
  • the main advantages of the present invention are: (1) by introducing a multi-head self-attention mechanism into the optical flow prediction task and exploiting the Transformer's global self-attention advantage in sequence-to-sequence prediction, the present invention improves the accuracy of dense optical flow at key positions; (2) thanks to the strong predictive performance of multi-head self-attention at the feature layer, the network depth of Unet's up-sampling and down-sampling can be reduced, so the present invention improves the timeliness of dense optical flow calculation.
  • FIG. 4 illustrates a block diagram 400 of an exemplary computing device, which is an example of a hardware device applicable to aspects of the invention, according to one embodiment of the invention.
  • Computing device 400 can be any machine that can be configured to perform processing and/or computing, and can be, but is not limited to, a workstation, server, desktop, laptop, tablet, personal digital assistant, smartphone, on-board computer, or any combination thereof.
  • Computing device 400 may include components that may be connected or communicate via one or more interfaces and bus 402 .
  • computing device 400 may include a bus 402 , one or more processors 404 , one or more input devices 406 , and one or more output devices 408 .
  • the one or more processors 404 may be any type of processor and may include, but are not limited to, one or more general purpose processors and/or one or more special purpose processors (eg, specialized processing chips).
  • Input device 406 may be any type of device capable of entering information into a computing device and may include, but is not limited to, a mouse, keyboard, touch screen, microphone, and/or remote control.
  • Output devices 408 may be any type of device capable of presenting information and may include, but are not limited to, displays, speakers, video/audio output terminals, vibrators, and/or printers.
  • the computing device 400 may also include, or be connected to, a non-transitory storage device 410 .
  • the non-transitory storage device may be any storage device that is non-transitory and capable of storing data.
  • the non-transitory storage device may include, but is not limited to, a magnetic disk drive, optical storage device, solid-state memory, floppy disk, hard disk, magnetic tape or any other magnetic medium, optical disc or any other optical medium, ROM (read-only memory), RAM (random access memory), cache memory, and/or any other memory chip or cartridge, and/or any other medium from which a computer can read data, instructions and/or code.
  • the non-transitory storage device 410 is detachable from the interface.
  • the non-transitory storage device 410 may have data/instructions/codes for implementing the above methods and steps.
  • Computing device 400 may also include a communication device 412 .
  • Communication device 412 may be any type of device or system capable of communicating with internal devices and/or with a network and may include, but is not limited to, a modem, network card, infrared communication device, wireless communication device, and/or chipset, such as a Bluetooth device, IEEE 802.11 device, WiFi device, WiMax device, cellular communication device, and/or similar devices.
  • Bus 402 may include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
  • Computing device 400 may also include working memory 414, which may be any type of working memory capable of storing instructions and/or data that facilitate the operation of processor 404 and may include, but is not limited to, random access memory and/or read-only memory.
  • Software components may be located in working memory 414 including, but not limited to, an operating system 416, one or more application programs 418, drivers, and/or other data and code.
  • the instructions for implementing the above methods and steps of the present invention may be included in the one or more application programs 418, and may be read and executed by the processor 404 to implement the method 300 described above.
  • custom hardware could also be used, and/or particular components could be implemented in hardware, software, firmware, middleware, microcode, hardware description language, or any combination thereof.
  • connections to other computing devices such as network input/output devices and the like, may be employed.
  • programmable hardware (e.g., programmable logic circuits including field programmable gate arrays (FPGAs) and/or programmable logic arrays (PLAs)) may also be used.
  • such hardware may be programmed using assembly language or hardware description languages (e.g., VERILOG, VHDL, C++).

Abstract

The present invention relates to a dense optical flow calculation method employing an attention mechanism. Provided in the present invention is a dense optical flow calculation method employing a Unet and a Transformer. In the method, a Transformer module is introduced into the Unet architecture to process the feature sequence, effectively using the global self-attention advantage of the Transformer's multi-head self-attention in sequence-to-sequence prediction. In the present invention, two adjacent frames are first joined on their channels by a down-sampling module and input into a convolutional network for down-sampling; a feature processing module then encodes the feature map output by the down-sampling network into an input sequence and carries out global context feature processing; and finally, an up-sampling module up-samples the feature map that has undergone feature processing to reconstruct an optical flow image of the same size as the input image.

Description

A Dense Optical Flow Calculation Method Based on an Attention Mechanism

Technical Field

The invention relates to the field of video applications, and mainly relates to dense optical flow calculation in video applications.

Background

When the human eye observes a moving object, the scene of the object forms a series of continuously changing images on the retina. This continuously changing information constantly "flows" across the retina (i.e., the image plane), like a "flow" of light, hence the name optical flow. Specifically, optical flow is the instantaneous velocity of the pixel motion of a spatially moving object on the observation imaging plane. The optical flow method uses the temporal changes of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, thereby calculating the motion of objects between adjacent frames. Traditional methods of calculating optical flow are mainly gradient-based, frequency-based, phase-based, and matching-based.
Dense optical flow is an image registration method that performs point-by-point matching over an image or a specified region; it calculates the offset of every point on the image to form a dense optical flow field, through which pixel-level image registration can be performed. The Horn-Schunck algorithm and most region-matching-based optical flow methods belong to the category of dense optical flow. Among optical flow calculation methods using deep learning, FlowNet is the most widely used in practical applications.

The patent "Robust interpolation optical flow calculation method for pyramid occlusion detection block matching" (CN112509014A) discloses a method in which pyramid occlusion-detection block matching is first performed to obtain a sparse robust motion field: two consecutive frames are downsampled by a factor into a k-level image pyramid, and block matching on each pyramid layer yields matching results with initial occlusions; occlusion information is then obtained by an occlusion-detection algorithm based on deformation error. The accurate sparse matching results are turned into a dense optical flow by a robust interpolation algorithm, after which the dense flow is refined by global energy-functional variational optimization to obtain the final optical flow.

The patent "An image sequence optical flow estimation method based on a learnable occlusion mask and secondary deformation optimization" (CN112465872A) discloses a method in which any two consecutive frames of the input image sequence are downsampled into a feature pyramid to obtain multi-resolution two-frame features; the correlation between the first-frame and second-frame features is computed at each pyramid level and used to build an occlusion-mask module; the resulting occlusion mask removes edge artifacts from the warped features to improve the optical flow at blurred motion edges; the occlusion-constrained flow drives a secondary deformation optimization module that further refines motion-edge flow estimation at the sub-pixel level; the same occlusion masking and secondary deformation are applied to the warped features at every pyramid level to obtain residual flows that refine the optical flow, and the final optimized estimate is output at the bottom of the pyramid.

Both of the above patents effectively improve the computational accuracy of optical flow estimation, but the accuracy of dense optical flow still cannot meet the requirements that tasks such as video coding and HDR synthesis place on optical flow. An improved technique is therefore needed to raise the accuracy of dense optical flow calculation.
发明内容Contents of the invention
提供本发明内容以便以简化形式介绍将在以下具体实施方式中进一步的描述一些概念。本发明内容并非旨在标识所要求保护的主题的关键特征或必要特征,也不旨在用于帮助确定所要求保护的主题的范围。This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
相比现有的稠密光流方法,本发明在光流预测计算任务中引入多头自注意力机,利用Transformer在序列到序列预测方面的全局自注意力优势,提高光流计算任务的效果。此外,本发明能够提高关键位置稠密光流图的准确度,同时通过减少Unet的上采样和下采样的网络深度,提高了稠密光流计算的时效性。Compared with the existing dense optical flow method, the present invention introduces a multi-head self-attention machine in the optical flow prediction calculation task, and utilizes Transformer's global self-attention advantage in sequence-to-sequence prediction to improve the effect of the optical flow calculation task. In addition, the present invention can improve the accuracy of the dense optical flow map at key positions, and at the same time improve the timeliness of dense optical flow calculation by reducing the network depth of Unet's up-sampling and down-sampling.
根据本发明的一个实施例,公开了一种用于稠密光流计算的方法,包括:将相邻帧在通道上进行拼接,以生成拼接后的向量图;将拼接后的向量图输入下采样网络进行特征提取,以生成特征向量;将生成的特征向量映射到潜层的高维嵌入空间,以生成一个高维嵌入表示序列;将高维嵌入表示序列输入由I个Transformer层组成的特征处理网络,以生成隐藏特征序列;将生成的隐藏特征序列进行重组,以生成重组后的特征向量;以及将重组后的特征向量输入上采样网络进行处理,以生成稠密光流图。According to an embodiment of the present invention, a method for calculating dense optical flow is disclosed, including: stitching adjacent frames on channels to generate a stitched vector map; inputting the stitched vector map for downsampling The network performs feature extraction to generate feature vectors; the generated feature vectors are mapped to the high-dimensional embedding space of the latent layer to generate a high-dimensional embedding representation sequence; the high-dimensional embedding representation sequence is input into the feature processing composed of I Transformer layers Network to generate hidden feature sequences; reorganize the generated hidden feature sequences to generate reorganized feature vectors; and input the reorganized feature vectors into the upsampling network for processing to generate dense optical flow maps.
根据本发明的另一个实施例,公开了一种用于稠密光流计算的系统,包括下采样模块,特征处理模块和上采样模块。下采样模块被配置为:将相邻帧在 通道上进行拼接,以生成拼接后的向量图;将拼接后的向量图输入下采样网络进行特征提取,以生成特征向量。特征处理模块被配置为:将所述下采样模块生成的特征向量映射到潜层的高维嵌入空间,以生成一个高维嵌入表示序列;将高维嵌入表示序列输入由I个Transformer层组成的特征处理网络,以生成隐藏特征序列。上采样模块被配置为:将所述特征处理模块生成的隐藏特征序列进行重组,以生成重组后的特征向量;以及将重组后的特征向量输入上采样网络进行处理,以生成稠密光流图。According to another embodiment of the present invention, a system for computing dense optical flow is disclosed, including a downsampling module, a feature processing module and an upsampling module. The down-sampling module is configured to: splice adjacent frames on the channel to generate a spliced vector map; input the spliced vector map into the down-sampling network for feature extraction to generate a feature vector. The feature processing module is configured to: map the feature vector generated by the downsampling module to the high-dimensional embedding space of the latent layer to generate a high-dimensional embedding representation sequence; input the high-dimensional embedding representation sequence into the A feature processing network to generate a sequence of hidden features. The upsampling module is configured to: reorganize the hidden feature sequence generated by the feature processing module to generate a reorganized feature vector; and input the reorganized feature vector into the upsampling network for processing to generate a dense optical flow map.
根据本发明的另一个实施例,公开了一种用于稠密光流计算的计算设备,包括:处理器;存储器,所述存储器存储有指令,所述指令在被所述处理器执行时能执行如上所述的方法。According to another embodiment of the present invention, a computing device for dense optical flow calculation is disclosed, including: a processor; a memory, the memory stores instructions, and the instructions can be executed when executed by the processor method as above.
These and other features and advantages will become apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory only and are not restrictive of the aspects as claimed.
Description of the Drawings
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of what has been briefly summarized above may be had by reference to various embodiments, some aspects of which are illustrated in the accompanying drawings. It is to be noted, however, that the drawings illustrate only certain typical aspects of the invention and are therefore not to be considered limiting of its scope, for the description may admit to other equally effective aspects.
FIG. 1 shows a block diagram of a system 100 for dense optical flow calculation according to an embodiment of the present invention;
FIG. 2 shows a detailed diagram 200 of the modules 101-103 of FIG. 1 according to an embodiment of the present invention;
FIG. 3 shows a flowchart of a method 300 for dense optical flow calculation according to an embodiment of the present invention; and
FIG. 4 shows a block diagram 400 of an exemplary computing device according to an embodiment of the present invention.
Detailed Description
The present invention is described in detail below in conjunction with the accompanying drawings; its features will become further apparent from the following detailed description.
The following explains terms used in the present invention, including the general meanings that are well known to those skilled in the art:
Unet: a segmentation model; specifically, a fully convolutional network comprising 4 downsampling layers, 4 upsampling layers, and skip-connection-like structures. Its characteristic is that the convolutional layers are fully symmetric between the downsampling and upsampling parts, and feature maps from the downsampling side can skip the deeper layers and be concatenated to the corresponding upsampling side.
Transformer: a natural language processing (NLP) model that employs an attention mechanism, originally to perform machine translation tasks.
Optical flow plays an important role in computer vision, with significant applications in target object segmentation, recognition, tracking, robot navigation, and shape information recovery. Optical flow calculation can be widely applied in a variety of scenarios, for example, motion detection for video encoding/decoding in cloud-storage video compression tasks, and motion recognition and video understanding tasks such as high-altitude object-throwing detection and fall detection. To obtain more accurate motion estimation, dense optical flow calculation is a key module in video codec technology. Traditional dense optical flow methods are computationally expensive and poorly timed for real-time use. Existing deep-learning-based optical flow methods improve timeliness, but the accuracy of their dense optical flow maps is low, which negatively affects the quality of video encoding and decoding.
The present invention proposes a dense optical flow calculation method based on Unet and Transformer. The method introduces a Transformer module into the Unet structure and exploits the Transformer's global self-attention advantage in sequence-to-sequence prediction to improve the accuracy of dense optical flow at key positions, while also reducing the depth of Unet's upsampling and downsampling networks and thereby improving the timeliness of dense optical flow calculation.
FIG. 1 shows a block diagram of a system 100 for dense optical flow calculation according to an embodiment of the present invention. As shown in FIG. 1, the system 100 is divided into modules that communicate and exchange data with one another in manners known in the art. In the present invention, each module may be implemented in software, hardware, or a combination thereof. As shown in FIG. 1, the system 100 may include a downsampling module 101, a feature processing module 102, and an upsampling module 103.
According to an embodiment of the present invention, the downsampling module 101 is configured to splice two adjacent frames along their channels (for example, color channels) to form an input picture, which is fed into a convolutional network for downsampling to obtain a feature map. The feature processing module 102 is configured to encode the feature map output by the downsampling module 101 as an input sequence and perform global context feature processing on it. The upsampling module 103 is configured as a cascaded upsampler that upsamples the processed feature map to reconstruct an optical flow map of the same size as the input picture.
FIG. 2 shows a detailed diagram 200 of the modules 101-103 of FIG. 1 according to an embodiment of the present invention.
As shown in FIG. 2, the downsampling module 101 receives two adjacent frames 201 and first splices them to obtain an h×w×6 vector map, which is then fed into a downsampling network composed of 7 convolution blocks. Each convolution block consists of a convolutional layer and a ReLU activation function, and 5 of the convolutional layers have a stride of 2.
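The text above fixes only the block count (7) and the number of stride-2 layers (5); with those constraints, the overall spatial reduction is 2^5 = 32 regardless of where the stride-2 layers sit. The shape arithmetic can be checked with a short sketch (kernel size 3, padding 1, and the ordering of the strides are assumptions, since the filing does not state them):

```python
def conv_out(size: int, stride: int, kernel: int = 3, pad: int = 1) -> int:
    # Standard convolution output-size formula; kernel=3, pad=1 are assumed.
    return (size + 2 * pad - kernel) // stride + 1

def downsample_shape(h: int, w: int) -> tuple[int, int]:
    # 7 convolution blocks, 5 of which use stride 2 (per the description above);
    # the ordering of the stride-2 blocks is an assumption.
    strides = [2, 1, 2, 1, 2, 2, 2]
    for s in strides:
        h, w = conv_out(h, s), conv_out(w, s)
    return h, w

print(downsample_shape(256, 512))  # → (8, 16), i.e. (256/32, 512/32)
```

Any permutation of the same strides gives the same final size, since only the product of the strides matters.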
Finally, the downsampling module 101 outputs a feature map of size (h/32)×(w/32)×c, where c is the channel dimension of the last convolution block, for the feature processing module 102 to process. (The five stride-2 convolutional layers reduce each spatial dimension by a factor of 2^5 = 32; the exact size expression appears as image PCTCN2022097531-appb-000001 in the original filing.)
As shown in FIG. 2, the feature processing module 102 uses a trainable linear map E to project the feature map sequence output by the downsampling module 101 into the high-dimensional latent embedding space, computed as shown in equation (1) (the equation is rendered as image PCTCN2022097531-appb-000002 in the original filing).
The high-dimensional embedding representation sequence is then fed into a feature processing network composed of I Transformer layers. The specific structure of the Transformer layer is shown in FIG. 3. Specifically, a Transformer layer consists of a multi-head self-attention (MSA) block and a multi-layer perceptron (MLP), and the output of the i-th layer is given by equations (2) and (3):
z′_i = MSA(LN(z_{i-1})) + z_{i-1},        (2)
z_i = MLP(LN(z′_i)) + z′_i,          (3)
where LN(·) denotes the layer normalization operation. The feature processing module 102 finally outputs the hidden feature sequence z_I.
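Equations (2) and (3) describe a standard pre-norm Transformer layer. The following NumPy sketch implements one such layer; single-head attention stands in for MSA, and the weight shapes, MLP width, and initialization are illustrative assumptions not taken from the filing:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN(.) in eqs. (2)/(3): normalize each token vector.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def transformer_layer(z, Wq, Wk, Wv, W1, W2):
    # z'_i = MSA(LN(z_{i-1})) + z_{i-1}; single-head attention stands in for MSA.
    h = layer_norm(z)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    z_prime = attn + z
    # z_i = MLP(LN(z'_i)) + z'_i; two-layer MLP with ReLU (hidden width assumed).
    m = layer_norm(z_prime)
    return np.maximum(m @ W1, 0) @ W2 + z_prime

rng = np.random.default_rng(0)
n, d, d_ff = 16, 32, 64  # sequence length / embed dim / MLP width (assumed)
z = rng.standard_normal((n, d))
Ws = [rng.standard_normal(s) * 0.1
      for s in [(d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)]]
out = transformer_layer(z, *Ws)
print(out.shape)  # → (16, 32): residual connections preserve the sequence shape
```

Stacking I such layers maps z_0 to the hidden feature sequence z_I without changing its shape, which is what allows the later reorganization back into a spatial map.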
As shown in FIG. 2, the upsampling module 103 is a cascaded upsampling network comprising multiple upsampling steps that decode the features into the final optical flow picture 202. First, the upsampling module 103 reorganizes the hidden feature sequence z_I output by the feature processing module 102 into a feature vector of size (h/32)×(w/32)×c (the size expression appears as image PCTCN2022097531-appb-000003 in the original filing), which is then fed into an upsampling network composed of 7 deconvolution blocks. Each deconvolution block consists of a deconvolution layer and a ReLU activation function, and 5 of the deconvolution layers have a stride of 2. The final output is an optical flow map of size h×w×3. In addition, the present invention adds three skip connections to the downsampling feature vectors to aggregate features at different resolution levels (203, 204, 205), thereby refining the details of the optical flow.
FIG. 3 shows a flowchart of a method 300 for dense optical flow calculation according to an embodiment of the present invention.
In step 301, adjacent frames are spliced along the channel dimension to generate a spliced vector map. According to an embodiment of the present invention, the channels are color channels, for example RGB channels. According to an embodiment of the present invention, the vector map has size h×w×6.
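The channel splicing of step 301 is a plain concatenation along the channel axis: two h×w×3 frames become one h×w×6 input. A minimal NumPy sketch with toy sizes:

```python
import numpy as np

h, w = 4, 5                    # toy frame size, for illustration only
frame1 = np.zeros((h, w, 3))   # two adjacent RGB frames
frame2 = np.ones((h, w, 3))

# Stack the two frames along the channel (last) axis: h x w x 3 -> h x w x 6.
spliced = np.concatenate([frame1, frame2], axis=-1)
print(spliced.shape)  # → (4, 5, 6)
```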
In step 302, the spliced vector map is input into a downsampling network for feature extraction to generate a feature vector. According to an embodiment of the present invention, the downsampling network is composed of 7 convolution blocks, each consisting of a convolutional layer and a ReLU activation function, with 5 of the convolutional layers having a stride of 2. According to an embodiment of the present invention, the feature vector has size (h/32)×(w/32)×c (the size expression appears as image PCTCN2022097531-appb-000004 in the original filing).
In step 303, the feature vector generated in step 302 is mapped into a high-dimensional latent embedding space to generate a high-dimensional embedding representation sequence. According to an embodiment of the present invention, a trainable linear map E may be used to map the feature vector obtained in step 302 into the high-dimensional latent embedding space.
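Steps 303 and 305 are, in effect, inverse reshaping operations around the Transformer stack: the downsampled feature map is flattened into a token sequence and linearly embedded, and the hidden sequence is later reorganized back into a spatial map. A sketch of both directions (the sizes H, W, c and the embedding dimension d are illustrative assumptions, and E is random here rather than trained):

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, c = 8, 16, 64   # downsampled feature-map shape, i.e. h/32 x w/32 x c (values assumed)
d = 128               # embedding dimension of the latent space (assumed)

feat = rng.standard_normal((H, W, c))

# Step 303: flatten the map into an N x c token sequence, then apply a linear map E.
tokens = feat.reshape(H * W, c)
E = rng.standard_normal((c, d)) * 0.1
z0 = tokens @ E       # high-dimensional embedding representation sequence

# ... the I Transformer layers would transform z0 into z_I here ...
z_I = z0

# Step 305: reorganize the hidden sequence back into a spatial feature map.
feat_out = z_I.reshape(H, W, d)
print(z0.shape, feat_out.shape)  # → (128, 128) (8, 16, 128)
```

Because the Transformer layers preserve the sequence shape, the reorganization in step 305 is a lossless reshape back onto the H×W grid.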
In step 304, the high-dimensional embedding representation sequence is input into a feature processing network composed of I Transformer layers to generate a hidden feature sequence. According to an embodiment of the present invention, each Transformer layer consists of an MSA block and an MLP for global context feature processing.
In step 305, the hidden feature sequence generated in step 304 is reorganized to generate a reorganized feature vector. According to an embodiment of the present invention, the hidden feature sequence z_I is reorganized into a feature vector of size (h/32)×(w/32)×c (the size expression appears as image PCTCN2022097531-appb-000005 in the original filing).
In step 306, the reorganized feature vector is input into an upsampling network for processing to generate a dense optical flow map. The dense optical flow map reflects the optical flow of object motion between the two adjacent frames obtained in step 301. According to an embodiment of the present invention, the upsampling network is composed of 7 deconvolution blocks, each consisting of a deconvolution layer and a ReLU activation function, with 5 of the deconvolution layers having a stride of 2. According to an embodiment of the present invention, the dense optical flow map has size h×w×3. According to an embodiment of the present invention, the upsampling network is a cascaded upsampling network that aggregates features at different resolution levels, thereby refining the details of the dense optical flow.
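The transposed-convolution path can be checked with the same kind of shape arithmetic as the downsampling path: five stride-2 deconvolutions undo the 1/32 spatial reduction. A sketch under stated assumptions (kernel sizes, padding, and the ordering of the stride-2 layers are not given in the filing; only the block count and the number of stride-2 layers are):

```python
def deconv_out(size: int, stride: int) -> int:
    # Transposed-convolution output size; kernel/padding are chosen (assumed)
    # so that stride 2 exactly doubles the size and stride 1 preserves it.
    kernel, pad = (4, 1) if stride == 2 else (3, 1)
    return (size - 1) * stride - 2 * pad + kernel

def upsample_shape(h: int, w: int) -> tuple[int, int]:
    # 7 deconvolution blocks, 5 with stride 2, mirroring the downsampling path;
    # the ordering of the stride-2 blocks is an assumption.
    strides = [2, 2, 2, 1, 2, 1, 2]
    for s in strides:
        h, w = deconv_out(h, s), deconv_out(w, s)
    return h, w

print(upsample_shape(8, 16))  # → (256, 512): recovers the h×w input resolution
```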
In summary, compared with the prior art, the main advantages of the present invention are: (1) by introducing multi-head self-attention into the optical flow prediction task and exploiting the Transformer's global self-attention advantage in sequence-to-sequence prediction, the present invention improves the accuracy of dense optical flow at key positions; and (2) thanks to the excellent performance of multi-head self-attention for predictive computation at the feature level, the depth of Unet's upsampling and downsampling networks can be reduced, so the present invention improves the timeliness of dense optical flow calculation.
FIG. 4 shows a block diagram 400 of an exemplary computing device according to an embodiment of the present invention; the computing device is one example of a hardware device to which aspects of the present invention may be applied. The computing device 400 may be any machine configurable to perform processing and/or computation, and may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant, a smartphone, an in-vehicle computer, or any combination thereof. The computing device 400 may include components connected to, or communicating via, one or more interfaces and a bus 402. For example, the computing device 400 may include the bus 402, one or more processors 404, one or more input devices 406, and one or more output devices 408. The one or more processors 404 may be any type of processor and may include, but are not limited to, one or more general-purpose processors and/or one or more special-purpose processors (for example, specialized processing chips). The input device 406 may be any type of device capable of inputting information into the computing device and may include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote controller. The output device 408 may be any type of device capable of presenting information and may include, but is not limited to, a display, a speaker, a video/audio output terminal, a vibrator, and/or a printer. The computing device 400 may also include, or be connected to, a non-transitory storage device 410, which may be any storage device that is non-transitory and capable of storing data, and which may include, but is not limited to, a disk drive, an optical storage device, solid-state memory, a floppy disk, a flexible disk, a hard disk, magnetic tape or any other magnetic medium, an optical disc or any other optical medium, ROM (read-only memory), RAM (random access memory), cache memory, and/or any memory chip or cartridge, and/or any other medium from which a computer can read data, instructions, and/or code. The non-transitory storage device 410 may be detachable from an interface. The non-transitory storage device 410 may have data/instructions/code for implementing the methods and steps described above. The computing device 400 may also include a communication device 412. The communication device 412 may be any type of device or system capable of communicating with internal apparatuses and/or with a network, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset, such as a Bluetooth device, an IEEE 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
The bus 402 may include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
The computing device 400 may also include a working memory 414, which may be any type of working memory capable of storing instructions and/or data useful to the work of the processor 404 and may include, but is not limited to, random access memory and/or a read-only storage device.
Software components may be located in the working memory 414, including, but not limited to, an operating system 416, one or more application programs 418, drivers, and/or other data and code. Instructions for implementing the above-described methods and steps of the present invention may be contained in the one or more application programs 418, and the above-described method 300 of the present invention may be implemented by the processor 404 reading and executing the instructions of the one or more application programs 418.
It should also be appreciated that variations may be made according to specific requirements. For example, custom hardware may also be used, and/or particular components may be implemented in hardware, software, firmware, middleware, microcode, a hardware description language, or any combination thereof. In addition, connections to other computing devices, such as network input/output devices, may be employed. For example, some or all of the disclosed methods and devices may be implemented by programming hardware (for example, programmable logic circuits including field-programmable gate arrays (FPGAs) and/or programmable logic arrays (PLAs)) in an assembly language or a hardware programming language (for example, VERILOG, VHDL, C++) using logic and algorithms according to the present invention.
Although aspects of the present invention have thus far been described with reference to the accompanying drawings, the methods and devices described above are merely examples, and the scope of the present invention is not limited to these aspects but is defined only by the appended claims and their equivalents. Various components may be omitted or replaced by equivalent components. In addition, the steps may be implemented in an order different from that described in the present invention. Furthermore, various components may be combined in various ways. It is also important that, as technology develops, many of the described components may be replaced by equivalent components that appear later.

Claims (10)

  1. A method for dense optical flow calculation, comprising:
    splicing adjacent frames along the channel dimension to generate a spliced vector map;
    inputting the spliced vector map into a downsampling network for feature extraction to generate a feature vector;
    mapping the generated feature vector into a high-dimensional latent embedding space to generate a high-dimensional embedding representation sequence;
    inputting the high-dimensional embedding representation sequence into a feature processing network composed of I Transformer layers to generate a hidden feature sequence;
    reorganizing the generated hidden feature sequence to generate a reorganized feature vector; and
    inputting the reorganized feature vector into an upsampling network for processing to generate a dense optical flow map.
  2. The method of claim 1, wherein the downsampling network is composed of 7 convolution blocks, each convolution block consisting of a convolutional layer and a ReLU activation function, 5 of the convolutional layers having a stride of 2.
  3. The method of claim 1, wherein the Transformer layer consists of a multi-head self-attention block and a multi-layer perceptron.
  4. The method of claim 1, wherein the upsampling network is a cascaded upsampling network composed of 7 deconvolution blocks, each deconvolution block consisting of a deconvolution layer and a ReLU activation function, 5 of the deconvolution layers having a stride of 2.
  5. The method of claim 1, wherein mapping the generated feature vector into the high-dimensional latent embedding space to generate a high-dimensional embedding representation sequence further comprises: using a trainable linear map E to map the feature vector into the high-dimensional latent embedding space.
  6. A system for dense optical flow calculation, comprising:
    a downsampling module, the downsampling module being configured to:
    splice adjacent frames along the channel dimension to generate a spliced vector map; and
    input the spliced vector map into a downsampling network for feature extraction to generate a feature vector;
    a feature processing module, the feature processing module being configured to:
    map the feature vector generated by the downsampling module into a high-dimensional latent embedding space to generate a high-dimensional embedding representation sequence; and
    input the high-dimensional embedding representation sequence into a feature processing network composed of I Transformer layers to generate a hidden feature sequence; and
    an upsampling module, the upsampling module being configured to:
    reorganize the hidden feature sequence generated by the feature processing module to generate a reorganized feature vector; and
    input the reorganized feature vector into an upsampling network for processing to generate a dense optical flow map.
  7. The system of claim 6, wherein the downsampling network is composed of 7 convolution blocks, each convolution block consisting of a convolutional layer and a ReLU activation function, 5 of the convolutional layers having a stride of 2; and
    wherein the upsampling network is a cascaded upsampling network composed of 7 deconvolution blocks, each deconvolution block consisting of a deconvolution layer and a ReLU activation function, 5 of the deconvolution layers having a stride of 2.
  8. The system of claim 6, wherein the Transformer layer consists of a multi-head self-attention block and a multi-layer perceptron.
  9. The system of claim 6, wherein mapping the generated feature vector into the high-dimensional latent embedding space to generate a high-dimensional embedding representation sequence further comprises: using a trainable linear map E to map the feature vector into the high-dimensional latent embedding space.
  10. A computing device for dense optical flow calculation, comprising:
    a processor; and
    a memory storing instructions which, when executed by the processor, perform the method of any one of claims 1-5.
PCT/CN2022/097531 2021-12-28 2022-06-08 Dense optical flow calculation method employing attention mechanism WO2023123873A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111623934.7 2021-12-28
CN202111623934.7A CN114913196A (en) 2021-12-28 2021-12-28 Attention-based dense optical flow calculation method

Publications (1)

Publication Number Publication Date
WO2023123873A1 true WO2023123873A1 (en) 2023-07-06

Family

ID=82763430

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/097531 WO2023123873A1 (en) 2021-12-28 2022-06-08 Dense optical flow calculation method employing attention mechanism

Country Status (2)

Country Link
CN (1) CN114913196A (en)
WO (1) WO2023123873A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486107B (en) * 2023-06-21 2023-09-05 南昌航空大学 Optical flow calculation method, system, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140153784A1 (en) * 2012-10-18 2014-06-05 Thomson Licensing Spatio-temporal confidence maps
CN111724360A (en) * 2020-06-12 2020-09-29 深圳技术大学 Lung lobe segmentation method and device and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140153784A1 (en) * 2012-10-18 2014-06-05 Thomson Licensing Spatio-temporal confidence maps
CN111724360A (en) * 2020-06-12 2020-09-29 深圳技术大学 Lung lobe segmentation method and device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN JIENENG, LU YONGYI, YU QIHANG, LUO XIANGDE, ADELI EHSAN, WANG YAN, LU LE, YUILLE ALAN L, ZHOU YUYIN: "Transunet: Transformers make strong encoders for medical image segmentation", 8 February 2021 (2021-02-08), XP093010142, Retrieved from the Internet <URL:https://arxiv.org/pdf/2102.04306.pdf> [retrieved on 20221221], DOI: 10.48550/arXiv.2102.04306 *
YAO-QIAN LI, LI CAI-ZI, LIU RUI-QIANG, SI WEI-XIN, JIN YUE-MING: "Semi-supervised Spatiotemporal Transformer Networks for Semantic Segmentation of Surgical Instrument", JOURNAL OF SOFTWARE, vol. 33, no. 4, 15 April 2022 (2022-04-15), pages 1501-1515, XP093074494, ISSN: 1000-9825, DOI: 10.13328/j.cnki.jos.006469 *

Also Published As

Publication number Publication date
CN114913196A (en) 2022-08-16

Similar Documents

Publication Publication Date Title
Liu et al. Video super-resolution based on deep learning: a comprehensive survey
Chen et al. Learning spatial attention for face super-resolution
Xie et al. Edge-guided single depth image super resolution
US20220222776A1 (en) Multi-Stage Multi-Reference Bootstrapping for Video Super-Resolution
WO2023060746A1 (en) Small image multi-object detection method based on super-resolution
WO2021105765A1 (en) Systems and methods for performing direct conversion of image sensor data to image analytics
CN113066017B (en) Image enhancement method, model training method and equipment
CN106663314A (en) Real time skin smoothing image enhancement filter
US20220156943A1 (en) Consistency measure for image segmentation processes
KR102289239B1 (en) Disparity estimation system and method, electronic device, and computer-readable storage medium
WO2022062344A1 (en) Method, system, and device for detecting salient target in compressed video, and storage medium
EP3874404A1 (en) Video recognition using multiple modalities
US20220101539A1 (en) Sparse optical flow estimation
WO2023036157A1 (en) Self-supervised spatiotemporal representation learning by exploring video continuity
WO2021051606A1 (en) Lip shape sample generating method and apparatus based on bidirectional lstm, and storage medium
WO2023123873A1 (en) Dense optical flow calculation method employing attention mechanism
Yu et al. Event-based high frame-rate video reconstruction with a novel cycle-event network
Li et al. Self-supervised monocular depth estimation with frequency-based recurrent refinement
WO2024032331A9 (en) Image processing method and apparatus, electronic device, and storage medium
WO2024041235A1 (en) Image processing method and apparatus, device, storage medium and program product
CN111726621B (en) Video conversion method and device
US20230093827A1 (en) Image processing framework for performing object depth estimation
CN115272906A (en) Video background portrait segmentation model and algorithm based on point rendering
Hussein et al. Enhanced Semantic Segmentation of Aerial images with Spatial Smoothness Using CRF Model
US20230177722A1 (en) Apparatus and method with object posture estimating

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22913162

Country of ref document: EP

Kind code of ref document: A1