CN115861647A - Optical flow estimation method based on multi-scale global cross matching - Google Patents

Optical flow estimation method based on multi-scale global cross matching

Info

Publication number
CN115861647A
Authority
CN
China
Prior art keywords
feature
image
optical flow
module
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211474506.7A
Other languages
Chinese (zh)
Inventor
项学智
陈一鸣
乔玉龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202211474506.7A priority Critical patent/CN115861647A/en
Publication of CN115861647A publication Critical patent/CN115861647A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides an optical flow estimation method based on multi-scale global cross matching, comprising the following steps: 1. constructing an image feature enhancement network based on a multi-scale cross attention module (MCA); 2. constructing an optical flow estimation module; 3. constructing a pixel processing module for occluded regions; 4. inputting two consecutive frames of images at the network input and performing supervised training; 5. inputting two consecutive frames of images into the trained model for testing and outputting the corresponding estimated optical flow. The invention proposes a multi-scale cross attention module (MCA) that supplements the correlated information among different image blocks within the same feature map, so that the network can learn image information at multiple resolutions. Meanwhile, the pixel occlusion problem is handled by modeling the self-similarity of the image, yielding the final predicted optical flow.

Description

Optical flow estimation method based on multi-scale global cross matching
Technical Field
The invention provides an optical flow estimation method based on multi-scale global cross matching, and belongs to the field of computer vision.
Background
Optical flow estimation over consecutive video frames is a long-standing, fundamental and challenging problem in computer vision, and it is crucial for building higher-level cognitive abilities in scene understanding, such as object recognition, object tracking, motion recognition and scene segmentation. With the development of deep learning, solving the optical flow estimation problem with deep learning has gradually become the mainstream approach. Recently, as the Transformer has gradually risen in vision tasks, self-attention and the Transformer architecture have succeeded in many vision subtasks and have already seen a small number of applications in optical flow estimation. However, because the Transformer involves a huge amount of computation, image partitioning is usually adopted to reduce the cost, which destroys the Transformer's advantage of establishing long-range dependencies and loses the information correlation among the partitioned image blocks.
Occlusion is a major challenge for the optical flow estimation problem. The invention models the self-similarity of the image, on the assumption that the network can find points with similar motion by searching for points with similar appearance in the reference frame, and designs an occlusion processing module composed of self attention by exploiting the property that the structure of the optical flow image is similar to that of the input image.
Disclosure of Invention
The invention provides an optical flow estimation method based on multi-scale global cross matching and proposes a multi-scale cross attention module (MCA). A feature enhancement network composed of multi-scale cross attention modules is constructed; the MCA module supplements the correlated information among different image blocks within the same feature map, so that the network can learn multi-scale image information. The self-similarity of the image is then modeled, exploiting the property that the structure of the optical flow image is similar to that of the input image, in order to handle the pixel occlusion problem and improve the accuracy of optical flow estimation.
The purpose of the invention is realized as follows: (1) Constructing an image feature enhancement network based on the multi-scale cross attention module. The model comprises a feature extraction convolution network and a multi-scale feature matching network composed of MCA modules. A convolution operation is first performed on the input images to extract image features; the feature images are then position-encoded and input into the multi-scale feature matching network composed of MCA modules. The whole multi-scale feature matching network contains N MCA modules, and each MCA module performs correlation calculations at three different resolutions of the input. The image pairs at the three resolutions are obtained by down-sampling, restored to a uniform resolution by up-sampling after passing through the attention layers and fused, and then added to the input to form the output;
(2) Constructing an optical flow estimation module. The two frames of image features output by the last layer of the multi-scale feature matching network are input into the optical flow estimation module for prediction; the module consists of a global matching module and a softmax layer. A dot-product operation is performed on the input features to obtain the global correlation, whose last two dimensions are normalized by softmax to obtain the matching probability; the weighted average of the 2D coordinates of the pixel grid points under the matching probability gives the correspondence matrix; finally, the optical flow is obtained by calculating the coordinate difference between corresponding points;
(3) Constructing a pixel processing module for occluded regions. The module consists of a self-attention layer: the 2D optical flow output by the optical flow estimation module and the target-image features output by the multi-scale feature matching network are input into the self-attention layer, whose output is then added to the 2D optical flow to obtain the final optical flow;
(4) Inputting two consecutive frames of images at the input end of the network and performing supervised training;
(5) Inputting two consecutive frames of images into the trained model for testing and outputting the corresponding estimated optical flow.
The invention also includes such structural features:
1. In feature (1), the two consecutive frames of images I_t and I_{t+1} are first input into a feature extraction network formed by convolutions to obtain a pair of feature maps, where H and W denote the height and width of the input image and C the number of channels. The feature images are then down-sampled to H_l × W_l × C, where H_l and W_l are the height and width after down-sampling to the lowest resolution, a manually specified constant value that changes with the resolution of the input image, and, after position encoding, are input to a cross attention layer whose attention operation is defined as:

Q = F_s^i W_q,  K = F_t^i W_k,  V = F_t^i W_v
M = CAtt(Q, K, V) = softmax(QK^T/√D)V

where CAtt denotes the cross attention calculation; Q, K and V are linear projections of the input features: Q is derived from the source feature F_s^i and K, V from the target feature F_t^i, with F_s denoting the source feature, F_t the target feature, and i the feature index at different time instants; T denotes the matrix transpose, softmax(·) the normalization operation, and D the dimension of Q and K; W_q, W_k and W_v are three different parameter matrices. The global cross attention score M is used to update the source feature through the feed-forward neural network (FFN) layer, whose update operation is defined as:

F̂_s^i = FFN(Cat(F_s^i, M))

where Cat denotes the Concat operation. The updated output features F̂_s^i and F̂_t^i are restored to the input size by upsampling and then input into the multi-scale image feature matching network consisting of MCA modules.
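By way of illustration, the cross attention update described above can be sketched roughly as follows (a minimal single-head PyTorch sketch under the definitions given; the class name CrossAttentionLayer and the FFN width are illustrative assumptions, not taken from the patent):

```python
import torch
import torch.nn as nn


class CrossAttentionLayer(nn.Module):
    """Minimal sketch: M = softmax(Q K^T / sqrt(D)) V, then an FFN over Cat(F_s, M)."""

    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_v
        # FFN applied to the concatenation of the source feature and the attention score M
        self.ffn = nn.Sequential(
            nn.Linear(2 * dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, f_s, f_t):
        # f_s, f_t: (B, N, D) flattened source / target feature maps
        q = self.w_q(f_s)                             # Q from the source feature
        k = self.w_k(f_t)                             # K from the target feature
        v = self.w_v(f_t)                             # V from the target feature
        d = q.shape[-1]
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        m = attn @ v                                  # global cross attention score M
        return self.ffn(torch.cat([f_s, m], dim=-1))  # updated source feature


# usage sketch: the same layer is applied symmetrically so both frames are updated
layer = CrossAttentionLayer(dim=128)
f1, f2 = torch.randn(1, 1024, 128), torch.randn(1, 1024, 128)
f1_new, f2_new = layer(f1, f2), layer(f2, f1)
```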
2. The MCA module in feature (1) obtains image feature pairs at three different resolutions by down-sampling, the lowest being H_l × W_l × C and the other two progressively higher; the attention scores output at the three feature resolutions are M_c, M_m and M_f respectively. For the K-th MCA block, pixel-by-pixel cross attention is computed on the feature image pair at resolution (H_l, W_l), while the features at the other two resolutions are first divided: the original feature F_s^i is divided into image feature blocks (patches) of size S × S, which are then input into the cross attention module, where M_c, M_m and M_f are calculated in the same way as M. The feature pairs output by each attention layer are up-sampled back to the input size, merged along the channel dimension and recompressed to obtain F_cat^K, and then output through a single-layer convolution network, whose operation is as follows:

F_cat^K = Cat(F̂_c^K, F̂_m^K, F̂_f^K)
F_out^K = F_in^K + ConV(LN(F_cat^K))

where LN denotes the normalization layer, ConV the convolution layer and Cat the Concat operation; F̂_c^K, F̂_m^K and F̂_f^K denote the output features of the three feature images of different resolutions after the cross attention module; F_in^K denotes the input of the K-th MCA module; F_out^K denotes the output of the K-th MCA module; F_cat^K denotes the image feature pair obtained by up-sampling the outputs of each attention layer in the K-th MCA module back to the input resolution, merging them along the channels and recompressing.
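A rough sketch of one MCA module along the lines of the description above is given below (PyTorch; the pooling factors (4, 2, 1), the patch size, the use of GroupNorm in place of the LN layer, and bilinear up-sampling are illustrative assumptions, and the W_q/W_k/W_v projections are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def cross_attention(q_feat, kv_feat):
    # global cross attention over token sequences: softmax(Q K^T / sqrt(D)) V
    # q_feat, kv_feat: (B, N, C); linear projections omitted for brevity
    d = q_feat.shape[-1]
    attn = torch.softmax(q_feat @ kv_feat.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ kv_feat


def tokens(x):
    # (B, C, H, W) -> (B, H*W, C)
    return x.flatten(2).transpose(1, 2)


def windows(x, s):
    # partition (B, C, H, W) into non-overlapping S x S patches -> (B*nW, S*S, C)
    b, c, h, w = x.shape
    x = x.view(b, c, h // s, s, w // s, s).permute(0, 2, 4, 3, 5, 1)
    return x.reshape(-1, s * s, c)


def unwindow(x, b, c, h, w, s):
    # inverse of windows()
    x = x.view(b, h // s, w // s, s, s, c).permute(0, 5, 1, 3, 2, 4)
    return x.reshape(b, c, h, w)


class MCAModule(nn.Module):
    """Sketch of one MCA block: cross attention at three resolutions, up-sampling to a
    common size, channel concatenation, LN + single conv, residual add with the input."""

    def __init__(self, dim, patch_size=4, factors=(4, 2, 1)):
        super().__init__()
        self.s, self.factors = patch_size, factors     # pooling factors are an assumption
        self.norm = nn.GroupNorm(1, 3 * dim)           # stands in for the LN layer
        self.fuse = nn.Conv2d(3 * dim, dim, 1)         # single conv recompressing channels

    def forward(self, f_in_s, f_in_t):
        # f_in_s, f_in_t: (B, C, H, W); H, W assumed divisible by the factors and patch size
        b, c, h, w = f_in_s.shape
        outs = []
        for idx, k in enumerate(self.factors):
            fs = F.avg_pool2d(f_in_s, k) if k > 1 else f_in_s   # stride average pooling
            ft = F.avg_pool2d(f_in_t, k) if k > 1 else f_in_t
            if idx == 0:
                # lowest resolution: pixel-by-pixel (global) cross attention
                o = cross_attention(tokens(fs), tokens(ft))
                o = o.transpose(1, 2).reshape(b, c, h // k, w // k)
            else:
                # higher resolutions: attention inside S x S patches
                o = cross_attention(windows(fs, self.s), windows(ft, self.s))
                o = unwindow(o, b, c, fs.shape[-2], fs.shape[-1], self.s)
            outs.append(F.interpolate(o, size=(h, w), mode="bilinear", align_corners=False))
        f_cat = torch.cat(outs, dim=1)                  # merge along the channel dimension
        return f_in_s + self.fuse(self.norm(f_cat))     # F_out = F_in + ConV(LN(F_cat))
```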
3. For the optical flow prediction of consecutive frames in feature (2), the two consecutive frames of images I_t and I_{t+1} pass through the image feature enhancement network based on the multi-scale cross attention module (MCA) to finally obtain the image feature pair F_1 and F_2. The correlation between the two image features is then calculated to compare the feature similarity of each pixel of F_1 with respect to all pixels of F_2; the operation is defined as follows:

CM = F_1 F_2^T

where CM is the correlation matrix in which each element represents the correlation between a coordinate in F_1 and a coordinate in F_2. The last two dimensions of CM are then normalized with a softmax operation to obtain the matching probability distribution M, defined as follows:

M = softmax(CM)

Then the weighted average of the 2D coordinates of the pixel grid G under the matching probability distribution M gives the correspondence matrix Ĝ, where the pixel grid G has the same size as F_1 and F_2; the operation is defined as follows:

Ĝ = MG

Finally, the optical flow OF is obtained by calculating the coordinate difference of corresponding pixels, as follows:

OF = Ĝ − G
4. For the pixel processing of occluded regions in feature (3): the high-quality optical flow estimation result of the matched region is propagated to the unmatched region through the self-similarity of the features, implemented with a self-attention layer whose attention is computed within a sliding window, defined as follows:

ÕF = OF + softmax(QK^T/√D)·OF

where Q and K are linear projections of the image features output by the multi-scale feature matching network, the value is the 2D optical flow OF, and ÕF denotes the output optical flow before upsampling; finally, upsampling is used to obtain the final optical flow.
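A simplified sketch of this propagation step follows (PyTorch; for readability the attention is computed globally rather than within a sliding window, and the query/key projections of the image features are assumptions):

```python
import torch
import torch.nn as nn


class FlowPropagation(nn.Module):
    """Sketch of the occlusion handling layer: appearance self-similarity of the image
    features is used to propagate flow from matched to unmatched (occluded) pixels."""

    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)

    def forward(self, feat, flow):
        # feat: (B, H*W, C) image features; flow: (B, H*W, 2) flow from the matching module
        q, k = self.w_q(feat), self.w_k(feat)
        d = q.shape[-1]
        # self-similarity in appearance space (the patent restricts this to a sliding window)
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        # value = the 2D optical flow; residual add gives the refined flow before upsampling
        return flow + attn @ flow
```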
Compared with the prior art, the invention has the following beneficial effects: the invention proposes a multi-scale cross attention module (MCA) that supplements the correlated information among different image blocks within the same feature map, so that the network can learn multi-scale image information. Meanwhile, the pixel occlusion problem is handled by modeling the self-similarity of the image, yielding the final predicted optical flow.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a multi-scale spatial cross attention Module (MCA) based optical flow network architecture;
FIG. 3 is an initialization structure diagram;
FIG. 4 is a block diagram of a multi-scale spatial cross attention Module (MCA);
FIG. 5 is a diagram of the Concat structure in the MCA module;
FIG. 6 is a cross-attention layer structure diagram;
FIG. 7 is a block diagram of optical flow estimation;
FIG. 8 is a block diagram of the occlusion processing module.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
With reference to FIGS. 1-8, the present invention is implemented by the following steps:
Step one: constructing an image feature enhancement network based on the multi-scale spatial cross attention module (MCA).
As shown in FIG. 2, two consecutive frames of images I_t and I_{t+1} are input into the feature extraction convolution network. The feature extraction module is composed of a convolution with stride 2 and size 7 × 7, a 1 × 1 convolution and three residual modules, each residual module being composed of convolutions with stride 1 and size 3 × 3. A pair of feature maps is obtained after the feature extraction module, and the image feature pair is then down-sampled to H_l × W_l × C, where H_l and W_l are the height and width after down-sampling to the lowest resolution, a manually specified constant value that changes with the resolution of the input image. After position encoding, the pair is input to a cross attention layer whose attention operation, shown in FIG. 6, is defined as:

Q = F_s^i W_q,  K = F_t^i W_k,  V = F_t^i W_v
M = CAtt(Q, K, V) = softmax(QK^T/√D)V

where CAtt denotes the cross attention calculation; Q, K and V are linear projections of the input features: Q is derived from the source feature F_s^i and K, V from the target feature F_t^i, with F_s denoting the source feature, F_t the target feature, and i the feature index at different time instants; T denotes the matrix transpose, softmax(·) the normalization operation, and D the dimension of Q and K; W_q, W_k and W_v are three different parameter matrices whose global cross attention score M is used to update the source feature through the FFN layer. The update operation is defined as:
F̂_s^i = FFN(Cat(F_s^i, M))

where Cat denotes the Concat operation. Afterwards F̂_s^i and F̂_t^i are restored to the input resolution by upsampling, and the feature images are input into the multi-scale feature matching network composed of MCA modules. The entire multi-scale feature matching network contains N MCA modules. In each MCA module, three image feature pairs of different resolutions pass through cross attention layers, and their corresponding output attention scores are M_c, M_m and M_f. The three image feature pairs of different resolutions are obtained by the same stride average-pooling down-sampling; after the attention layers they are restored to a uniform resolution by the same up-sampling, combined along the channel dimension, passed through a convolution layer, and finally added to the input image feature pair to form the output. For the K-th MCA block, pixel-by-pixel cross attention is computed on the image pair at resolution (H_l, W_l), while the image features at the other two resolutions are first divided: the original feature F_s^i is divided into image feature blocks (patches) of size S × S, which are then input into the cross attention module, where M_c, M_m and M_f are calculated in the same way as M described above. The feature pairs output by the attention layers are up-sampled back to the input resolution and merged along the channel dimension with the channels recompressed to obtain F_cat^K, which is then output through a single-layer convolution network; the operation is as follows:

F_cat^K = Cat(F̂_c^K, F̂_m^K, F̂_f^K)
F_out^K = F_in^K + ConV(LN(F_cat^K))

where LN denotes the normalization layer, ConV the convolution layer and Cat the Concat operation; F̂_c^K, F̂_m^K and F̂_f^K denote the output features of the three feature images of different resolutions after the cross attention module; F_in^K denotes the input of the K-th MCA module; F_out^K denotes the output of the K-th MCA module; F_cat^K denotes the image feature pair obtained by up-sampling the outputs of each attention layer in the K-th MCA module back to the input resolution, merging them along the channels and recompressing.
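Before moving on to step two, the convolutional feature extraction front end of this step can be sketched as follows (a PyTorch sketch; the layer widths 64, 64, 96, 128, 128 follow the embodiment described later, while the normalization type and the placement of down-sampling are assumptions):

```python
import torch.nn as nn


class ResidualBlock(nn.Module):
    # residual module built from stride-1, 3 x 3 convolutions
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
            nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
            nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.skip(x)


class FeatureExtractor(nn.Module):
    """Sketch of the feature extraction front end: a 7 x 7 stride-2 convolution, three
    residual blocks of 3 x 3 convolutions, and a final 1 x 1 convolution."""

    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),
            nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
        )
        self.res = nn.Sequential(
            ResidualBlock(64, 64), ResidualBlock(64, 96), ResidualBlock(96, 128)
        )
        self.out = nn.Conv2d(128, 128, 1)  # last layer: no normalization / activation

    def forward(self, img):
        # img: (B, 3, H, W) -> (B, 128, H/2, W/2); any further down-sampling inside the
        # residual blocks is not specified by the patent and is omitted here
        return self.out(self.res(self.stem(img)))
```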
Step two: constructing the optical flow estimation network.
As shown in FIG. 7, two consecutive frames of video images I_t and I_{t+1} are input into the whole image feature enhancement network, and preliminary image features are first obtained through the convolutional neural network. Because the two sets of features carry no notion of spatial position, a fixed two-dimensional sine-and-cosine position code is first added to the features. Adding location information allows the matching process to take into account not only feature similarity but also spatial distance, which helps resolve ambiguities and improves performance. After the position information is added and the features pass through the N MCA modules, the image feature pair F_1 and F_2 is finally output. Considering that corresponding pixels in the two frames should have high similarity, the correlation between the two image features is calculated to compare the feature similarity of each pixel of F_1 with respect to all pixels of F_2. This step can be accomplished by a simple dot product, defined as follows:

CM = F_1 F_2^T

where CM is the correlation matrix in which each element represents the correlation between a coordinate in F_1 and a coordinate in F_2. To determine the correspondence, a simple method is to directly take the position with the largest correlation; however, this operation is not differentiable, which would hamper end-to-end training. To solve this problem a differentiable matching layer is used: the last two dimensions of CM are normalized with the softmax operation to obtain the matching probability distribution M, defined as follows:

M = softmax(CM)

Then the weighted average of the 2D coordinates of the pixel grid G under the matching probability distribution M gives the correspondence matrix Ĝ, where the pixel grid G has the same size as F_1 and F_2; the operation is defined as follows:

Ĝ = MG

Finally, the optical flow OF is obtained by calculating the coordinate difference of corresponding pixels, as follows:

OF = Ĝ − G
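The fixed two-dimensional sine-and-cosine position code mentioned above can be formed, for example, as follows (a sketch; the equal split of channels between the y and x coordinates and the frequency base of 10000 follow the common Transformer convention and are assumptions, not taken from the patent):

```python
import torch


def position_encoding_2d(h, w, dim):
    """Sketch of a fixed 2D sine-cosine position code of shape (H, W, dim).
    Half of the channels encode the y coordinate, half the x coordinate (assumed split)."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    d = dim // 4
    freqs = 1.0 / (10000 ** (torch.arange(d, dtype=torch.float32) / d))
    ys = torch.arange(h, dtype=torch.float32)[:, None] * freqs          # (H, d)
    xs = torch.arange(w, dtype=torch.float32)[:, None] * freqs          # (W, d)
    pe_y = torch.cat([torch.sin(ys), torch.cos(ys)], dim=-1)            # (H, 2d)
    pe_x = torch.cat([torch.sin(xs), torch.cos(xs)], dim=-1)            # (W, 2d)
    pe = torch.cat([pe_y[:, None, :].expand(h, w, 2 * d),
                    pe_x[None, :, :].expand(h, w, 2 * d)], dim=-1)      # (H, W, dim)
    return pe


# usage: add the same code to both feature maps before they enter the matching network
pe = position_encoding_2d(32, 48, 128)
print(pe.shape)  # torch.Size([32, 48, 128])
```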
step three: and processing pixels of the occlusion area.
The principle of predicting the optical flow is that position information is added before the characteristics enter a related layer, firstly, a remarkable characteristic point is found through autocorrelation, then, corresponding positions of the characteristic points of the two images are found through cross attention, obviously, the position information difference of the characteristic points of the two images is the optical flow, but the process considers that pixels of an occlusion area do not exist, once the pixels of the occlusion area exist, the characteristic coordinates of a target image cannot correspond to the positions of the occluded pixels, and the predicted optical flow is inaccurate at this time. But because the structure of the target image and the optical flow image has a certain similarity. To resolve ambiguities caused by occlusion, the idea is to allow the network to reason about at a higher level, that is, to globally aggregate the motion characteristics of similar pixels after implicitly reasoning which pixels are similar in the appearance feature space. It is assumed that by finding points with similar appearance in the frame of reference, the network will be able to find points with similar motion. This is because it is observed that the movement of points on a single object is generally uniform. This statistical bias is used to propagate motion information from high confidence non-occluded pixels to low confidence occluded pixels. Here, the confidence may be interpreted as whether there is a significant match, i.e. a high correlation value at the correct displacement. Transformers networks are known for their ability to model long-term dependencies. In the self-attention mechanism of transformations, query, key, and value are from the same feature vector, unlike the self-attention mechanism in transformations, where a generalized attention variant is used. The query and key features are output by the image feature network to model the appearance self-similarity in frame 1. The value feature is the optical flow projection through the softmax layer, and the output optical flow is coded by 4D related volume. The attention matrix computed from the query and key features is used to aggregate the value features as a motion hidden representation.
As shown in fig. 8, the high-quality optical flow estimation result of the matching area is propagated to the unmatched area by the self-similarity of the features, and the method is implemented by a self-attention layer, and the calculation is performed in a manner of calculating attention by a sliding window, which is defined as follows:
Figure BDA0003957348130000071
wherein
Figure BDA0003957348130000072
Representing the output optical flow before upsampling. Finally, up-sampling is used to obtain the final optical flow.
Step four: two continuous frames of images are input at the input end of the network, and the network is supervised and trained by using the whole network loss function. All traffic predictions are monitored using the group route.
Step five: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
Two consecutive frames of images I_t and I_{t+1} are first input into the feature extraction module: a convolution layer with stride 2 and size 7 × 7, followed by three consecutive residual blocks, each formed by convolutions with stride 1 and size 3 × 3, and finally a 1 × 1 convolution; the numbers of convolution kernels in the layers are 64, 64, 96, 128 and 128. Except for the last convolution layer, each convolution layer is followed by a normalization layer and a ReLU activation layer. A pair of feature maps is obtained after the feature extraction module, and the image feature pair is then down-sampled by stride average pooling to H_l × W_l × C, where H_l and W_l are the height and width after down-sampling to the lowest resolution, a manually specified constant value that varies with the resolution of the input image. After position encoding, the pair is input to a global cross attention layer, which is defined as follows:

Q = F_s^i W_q,  K = F_t^i W_k,  V = F_t^i W_v
M = CAtt(Q, K, V) = softmax(QK^T/√D)V

where CAtt denotes the cross attention calculation; Q, K and V are linear projections of the input features: Q is derived from the source feature F_s^i and K, V from the target feature F_t^i, with F_s denoting the source feature, F_t the target feature, and i the feature index at different time instants; T denotes the matrix transpose, softmax(·) the normalization operation, and D the dimension of Q and K; W_q, W_k and W_v are three different parameter matrices whose global cross attention score M is used to update the source feature through the FFN layer. The update operation is defined as:
F̂_s^i = FFN(Cat(F_s^i, M))

where Cat denotes the Concat operation. Afterwards F̂_s^i and F̂_t^i are restored to the input resolution through bilinear upsampling, and the feature images are then input into the multi-scale feature matching network composed of MCA modules. The entire multi-scale feature matching network contains N MCA modules. In each MCA module, three image feature pairs of different resolutions pass through cross attention layers, and their corresponding output attention scores are M_c, M_m and M_f. The three image feature pairs of different resolutions are obtained by the same stride average-pooling down-sampling; after the attention layers they are restored bilinearly to a uniform resolution, combined along the channel dimension, passed through a convolution layer, and finally added to the input image feature pair to form the output. For the K-th MCA block, pixel-by-pixel cross attention is computed on the image pair at resolution (H_l, W_l), while the image features at the other two resolutions are first divided: the original feature F_s^i is divided into image feature blocks (patches) of size S × S, which are then input into the cross attention module, where M_c, M_m and M_f are calculated in the same way as M described above. The feature pairs output by the attention layers are up-sampled back to the input resolution and merged along the channel dimension with the channels recompressed to obtain F_cat^K, which is then output through a single-layer convolution network; the operation is as follows:

F_cat^K = Cat(F̂_c^K, F̂_m^K, F̂_f^K)
F_out^K = F_in^K + ConV(LN(F_cat^K))
where LN denotes the normalization layer, ConV the convolution layer and Cat the Concat operation; F̂_c^K, F̂_m^K and F̂_f^K denote the output features of the three feature images of different resolutions after the cross attention module; F_in^K denotes the input of the K-th MCA module; F_out^K denotes the output of the K-th MCA module; F_cat^K denotes the image feature pair obtained by up-sampling the outputs of each attention layer in the K-th MCA module back to the input resolution, merging them along the channels and recompressing. After the N MCA modules, the image feature pair F_1 and F_2 is finally output. The correlation between the two image features is then calculated to compare the feature similarity of each pixel of F_1 with respect to all pixels of F_2. This step can be accomplished by a simple dot product, whose operation is defined as follows:
CM = F_1 F_2^T

where CM is the correlation matrix in which each element represents the correlation between a coordinate in F_1 and a coordinate in F_2. Normalizing the last two dimensions of CM with the softmax operation yields the matching probability distribution M, defined as follows:

M = softmax(CM)

Then the weighted average of the 2D coordinates of the pixel grid G under the matching probability distribution M gives the correspondence matrix Ĝ, where the pixel grid G has the same size as F_1 and F_2; the operation is defined as follows:

Ĝ = MG

Finally, the optical flow OF is obtained by calculating the coordinate difference of corresponding pixels, as follows:

OF = Ĝ − G
Then the high-quality optical flow estimation result of the matched region is propagated to the unmatched region through the self-similarity of the features. This is implemented with a self-attention layer whose attention is computed within a sliding window, defined as follows:

ÕF = OF + softmax(QK^T/√D)·OF

where ÕF denotes the output optical flow before upsampling, Q and K are linear projections of the image features, and the value is the 2D optical flow OF. Finally, upsampling is used to obtain the final optical flow.
In summary, the invention provides an optical flow estimation method based on multi-scale global cross matching, comprising the following steps: 1. constructing an image feature enhancement network based on the multi-scale cross attention module (MCA); 2. constructing an optical flow estimation module; 3. constructing a pixel processing module for occluded regions; 4. inputting two consecutive frames of images at the network input and performing supervised training; 5. inputting two consecutive frames of images into the trained model for testing and outputting the corresponding estimated optical flow. The invention proposes a multi-scale cross attention module (MCA) that supplements the correlated information among different image blocks within the same feature map, so that the network can learn image information at multiple resolutions. Meanwhile, the pixel occlusion problem is handled by modeling the self-similarity of the image, yielding the final predicted optical flow.

Claims (5)

1. An optical flow estimation method based on multi-scale global cross matching is characterized in that:
(1) Constructing an image feature enhancement network based on the multi-scale cross attention module. The model comprises a feature extraction convolution network and a multi-scale feature matching network composed of MCA modules. A convolution operation is first performed on the input images to extract image features; the feature images are then position-encoded and input into the multi-scale feature matching network composed of MCA modules. The whole multi-scale feature matching network contains N MCA modules, and each MCA module performs correlation calculations at three different resolutions of the input. The image pairs at the three resolutions are obtained by down-sampling, restored to a uniform resolution by up-sampling after passing through the attention layers and fused, and then added to the input to form the output;
(2) Constructing an optical flow estimation module. The two frames of image features output by the last layer of the multi-scale feature matching network are input into the optical flow estimation module for prediction; the module consists of a global matching module and a softmax layer. A dot-product operation is performed on the input features to obtain the global correlation, whose last two dimensions are normalized by softmax to obtain the matching probability; the weighted average of the 2D coordinates of the pixel grid points under the matching probability gives the correspondence matrix; finally, the optical flow is obtained by calculating the coordinate difference between corresponding points;
(3) Constructing a pixel processing module for occluded regions. The module consists of a self-attention layer: the 2D optical flow output by the optical flow estimation module and the target-image features output by the multi-scale feature matching network are input into the self-attention layer, whose output is then added to the 2D optical flow to obtain the final optical flow;
(4) Inputting two consecutive frames of images at the input end of the network and performing supervised training;
(5) Inputting two consecutive frames of images into the trained model for testing and outputting the corresponding estimated optical flow.
2. The optical flow estimation method based on multi-scale global cross matching according to claim 1, wherein:
in feature (1), the two consecutive frames of images I_t and I_{t+1} are first input into a feature extraction network formed by convolutions to obtain a pair of feature maps, where H and W denote the height and width of the input image and C the number of channels; the feature images are then down-sampled to H_l × W_l × C, where H_l and W_l are the height and width after down-sampling to the lowest resolution, a manually specified constant value that changes with the resolution of the input image, and, after position encoding, are input to a cross attention layer whose attention operation is defined as:

Q = F_s^i W_q,  K = F_t^i W_k,  V = F_t^i W_v
M = CAtt(Q, K, V) = softmax(QK^T/√D)V

where CAtt denotes the cross attention calculation; Q, K and V are linear projections of the input features: Q is derived from the source feature F_s^i and K, V from the target feature F_t^i, with F_s denoting the source feature, F_t the target feature, and i the feature index at different time instants; T denotes the matrix transpose, softmax(·) the normalization operation, and D the dimension of Q and K; W_q, W_k and W_v are three different parameter matrices; the global cross attention score M is used to update the source feature through the feed-forward neural network (FFN) layer, whose update operation is defined as:

F̂_s^i = FFN(Cat(F_s^i, M))

where Cat denotes the Concat operation; the updated output features F̂_s^i and F̂_t^i are restored to the input size by upsampling and then input into the multi-scale image feature matching network consisting of MCA modules.
3. The optical flow estimation method based on multi-scale global cross matching according to claim 2, wherein: the MCA module in feature (1) obtains image feature pairs at three different resolutions by down-sampling, the lowest being H_l × W_l × C and the other two progressively higher; the attention scores output at the three feature resolutions are M_c, M_m and M_f respectively; for the K-th MCA block, pixel-by-pixel cross attention is calculated on the feature image pair at resolution (H_l, W_l), while the features at the other two resolutions are first divided, the original feature F_s^i being divided into image feature blocks (patches) of size S × S, which are then input into the cross attention module, where M_c, M_m and M_f are calculated in the same way as M; the feature pairs output by each attention layer are up-sampled back to the input size, merged along the channel dimension and recompressed to obtain F_cat^K, and then output through a single-layer convolution network, whose operation is as follows:

F_cat^K = Cat(F̂_c^K, F̂_m^K, F̂_f^K)
F_out^K = F_in^K + ConV(LN(F_cat^K))

where LN denotes the normalization layer, ConV the convolution layer and Cat the Concat operation; F̂_c^K, F̂_m^K and F̂_f^K denote the output features of the three feature images of different resolutions after the cross attention module; F_in^K denotes the input of the K-th MCA module; F_out^K denotes the output of the K-th MCA module; F_cat^K denotes the image feature pair obtained by up-sampling the outputs of each attention layer in the K-th MCA module back to the input resolution, merging them along the channels and recompressing.
4. The optical flow estimation method based on multi-scale global cross matching according to claim 1, wherein: for the optical flow prediction of consecutive frames in feature (2), the two consecutive frames of images I_t and I_{t+1} pass through the image feature enhancement network based on the multi-scale cross attention module (MCA) to finally obtain the image feature pair F_1 and F_2; the correlation between the two image features is then calculated to compare the feature similarity of each pixel of F_1 with respect to all pixels of F_2, the operation being defined as follows:

CM = F_1 F_2^T

where CM is the correlation matrix in which each element represents the correlation between a coordinate in F_1 and a coordinate in F_2; the last two dimensions of CM are then normalized with a softmax operation to obtain the matching probability distribution M, defined as follows:

M = softmax(CM)

then the weighted average of the 2D coordinates of the pixel grid G under the matching probability distribution M gives the correspondence matrix Ĝ, where the pixel grid G has the same size as F_1 and F_2, the operation being defined as follows:

Ĝ = MG

finally, the optical flow OF is obtained by calculating the coordinate difference of corresponding pixels, as follows:

OF = Ĝ − G
5. The optical flow estimation method based on multi-scale global cross matching according to claim 1, wherein: for the pixel processing of occluded regions in feature (3), the high-quality optical flow estimation result of the matched region is propagated to the unmatched region through the self-similarity of the features, implemented with a self-attention layer whose attention is computed within a sliding window, defined as follows:

ÕF = OF + softmax(QK^T/√D)·OF

where Q and K are linear projections of the image features output by the multi-scale feature matching network, the value is the 2D optical flow OF, and ÕF denotes the output optical flow before upsampling; finally, upsampling is used to obtain the final optical flow.
CN202211474506.7A 2022-11-22 2022-11-22 Optical flow estimation method based on multi-scale global cross matching Pending CN115861647A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211474506.7A CN115861647A (en) 2022-11-22 2022-11-22 Optical flow estimation method based on multi-scale global cross matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211474506.7A CN115861647A (en) 2022-11-22 2022-11-22 Optical flow estimation method based on multi-scale global cross matching

Publications (1)

Publication Number Publication Date
CN115861647A true CN115861647A (en) 2023-03-28

Family

ID=85665351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211474506.7A Pending CN115861647A (en) 2022-11-22 2022-11-22 Optical flow estimation method based on multi-scale global cross matching

Country Status (1)

Country Link
CN (1) CN115861647A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116563147A (en) * 2023-05-04 2023-08-08 北京联合大学 Underwater image enhancement system and method
CN116563147B (en) * 2023-05-04 2024-03-26 北京联合大学 Underwater image enhancement system and method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination