CN115861647A - Optical flow estimation method based on multi-scale global cross matching - Google Patents

Optical flow estimation method based on multi-scale global cross matching

Info

Publication number
CN115861647A
Authority
CN
China
Prior art keywords
feature
image
optical flow
module
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211474506.7A
Other languages
Chinese (zh)
Inventor
项学智
陈一鸣
乔玉龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202211474506.7A priority Critical patent/CN115861647A/en
Publication of CN115861647A publication Critical patent/CN115861647A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides an optical flow estimation method based on multi-scale global cross matching, comprising the following steps: 1. constructing an image feature enhancement network based on a multi-scale cross attention module (MCA); 2. constructing an optical flow estimation module; 3. constructing a pixel processing module for occluded regions; 4. inputting two consecutive frames of images at the network input and performing supervised training; 5. inputting two consecutive frames of images into the trained model for testing and outputting the corresponding estimated optical flow. The invention proposes a multi-scale cross attention module (MCA) that supplements the correlated information among different image blocks within the same feature map, so that the network can learn image information at multiple resolutions. Meanwhile, the pixel occlusion problem is handled by modeling the self-similarity of the image, yielding the final predicted optical flow.

Description

Optical flow estimation method based on multi-scale global cross matching
Technical Field
The invention provides an optical flow estimation method based on multi-scale global cross matching, and belongs to the field of computer vision.
Background
Optical flow estimation over consecutive video frames is a long-standing, fundamental and challenging problem in computer vision, and it is crucial for building higher-level cognitive abilities in scene understanding, such as object recognition, object tracking, motion recognition and scene segmentation. With the development of deep learning, solving the optical flow estimation problem with deep learning has gradually become the mainstream approach. Recently, as the Transformer has gradually risen in vision tasks, self-attention and the Transformer architecture have succeeded in many vision subtasks and have already seen a small number of applications in optical flow estimation. However, because the Transformer involves a huge amount of computation, image partitioning is usually adopted to reduce the cost, which destroys the Transformer's advantage of establishing long-range dependencies and loses the information correlation among the partitioned image blocks.
Occlusion is a major challenge for the optical flow estimation problem. The invention models the self-similarity of the image, on the assumption that the network can find points with similar motion by searching for points with similar appearance in the reference frame, and designs an occlusion processing module composed of self attention by exploiting the property that the structure of the optical flow image is similar to that of the input image.
Disclosure of Invention
The invention provides an optical flow estimation method based on multi-scale global cross matching and proposes a multi-scale cross attention module (MCA). A feature enhancement network composed of multi-scale cross attention modules is constructed; the MCA module supplements the correlated information among different image blocks within the same feature map, so that the network can learn multi-scale image information. The self-similarity of the image is then modeled, exploiting the property that the structure of the optical flow image is similar to that of the input image, in order to handle the pixel occlusion problem and improve the accuracy of optical flow estimation.
The purpose of the invention is realized as follows: (1) Constructing an image feature enhancement network based on the multi-scale cross attention module. The model comprises a feature extraction convolution network and a multi-scale feature matching network composed of MCA modules. A convolution operation is first performed on the input images to extract image features; the feature images are then position-encoded and input into the multi-scale feature matching network composed of MCA modules. The whole multi-scale feature matching network contains N MCA modules, and each MCA module performs correlation calculations at three different resolutions of the input. The image pairs at the three resolutions are obtained by down-sampling, restored to a uniform resolution by up-sampling after passing through the attention layers and fused, and then added to the input to form the output;
(2) Constructing an optical flow estimation module. The two frames of image features output by the last layer of the multi-scale feature matching network are input into the optical flow estimation module for prediction; the module consists of a global matching module and a softmax layer. A dot-product operation is performed on the input features to obtain the global correlation, whose last two dimensions are normalized by softmax to obtain the matching probability; the weighted average of the 2D coordinates of the pixel grid points under the matching probability gives the correspondence matrix; finally, the optical flow is obtained by calculating the coordinate difference between corresponding points;
(3) Constructing a pixel processing module for occluded regions. The module consists of a self-attention layer: the 2D optical flow output by the optical flow estimation module and the target-image features output by the multi-scale feature matching network are input into the self-attention layer, whose output is then added to the 2D optical flow to obtain the final optical flow;
(4) Inputting two consecutive frames of images at the input end of the network and performing supervised training;
(5) Inputting two consecutive frames of images into the trained model for testing and outputting the corresponding estimated optical flow.
The invention also includes such structural features:
1. In feature (1), the two consecutive frames of images I_t and I_{t+1} are first input into a feature extraction network formed by convolutions to obtain a pair of feature maps, where H and W denote the height and width of the input image and C the number of channels. The feature images are then down-sampled to H_l × W_l × C, where H_l and W_l are the height and width after down-sampling to the lowest resolution, a manually specified constant value that changes with the resolution of the input image, and, after position encoding, are input to a cross attention layer whose attention operation is defined as:

Q = F_s^i W_q,  K = F_t^i W_k,  V = F_t^i W_v
M = CAtt(Q, K, V) = softmax(QK^T/√D)V

where CAtt denotes the cross attention calculation; Q, K and V are linear projections of the input features: Q is derived from the source feature F_s^i and K, V from the target feature F_t^i, with F_s denoting the source feature, F_t the target feature, and i the feature index at different time instants; T denotes the matrix transpose, softmax(·) the normalization operation, and D the dimension of Q and K; W_q, W_k and W_v are three different parameter matrices. The global cross attention score M is used to update the source feature through the feed-forward neural network (FFN) layer, whose update operation is defined as:

F̂_s^i = FFN(Cat(F_s^i, M))

where Cat denotes the Concat operation. The updated output features F̂_s^i and F̂_t^i are restored to the input size by upsampling and then input into the multi-scale image feature matching network consisting of MCA modules.
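By way of illustration, the cross attention update described above can be sketched roughly as follows (a minimal single-head PyTorch sketch under the definitions given; the class name CrossAttentionLayer and the FFN width are illustrative assumptions, not taken from the patent):

```python
import torch
import torch.nn as nn


class CrossAttentionLayer(nn.Module):
    """Minimal sketch: M = softmax(Q K^T / sqrt(D)) V, then an FFN over Cat(F_s, M)."""

    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_v
        # FFN applied to the concatenation of the source feature and the attention score M
        self.ffn = nn.Sequential(
            nn.Linear(2 * dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, f_s, f_t):
        # f_s, f_t: (B, N, D) flattened source / target feature maps
        q = self.w_q(f_s)                             # Q from the source feature
        k = self.w_k(f_t)                             # K from the target feature
        v = self.w_v(f_t)                             # V from the target feature
        d = q.shape[-1]
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        m = attn @ v                                  # global cross attention score M
        return self.ffn(torch.cat([f_s, m], dim=-1))  # updated source feature


# usage sketch: the same layer is applied symmetrically so both frames are updated
layer = CrossAttentionLayer(dim=128)
f1, f2 = torch.randn(1, 1024, 128), torch.randn(1, 1024, 128)
f1_new, f2_new = layer(f1, f2), layer(f2, f1)
```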
2. The MCA module in feature (1) obtains image feature pairs at three different resolutions by down-sampling, the lowest being H_l × W_l × C and the other two progressively higher; the attention scores output at the three feature resolutions are M_c, M_m and M_f respectively. For the K-th MCA block, pixel-by-pixel cross attention is computed on the feature image pair at resolution (H_l, W_l), while the features at the other two resolutions are first divided: the original feature F_s^i is divided into image feature blocks (patches) of size S × S, which are then input into the cross attention module, where M_c, M_m and M_f are calculated in the same way as M. The feature pairs output by each attention layer are up-sampled back to the input size, merged along the channel dimension and recompressed to obtain F_cat^K, and then output through a single-layer convolution network, whose operation is as follows:

F_cat^K = Cat(F̂_c^K, F̂_m^K, F̂_f^K)
F_out^K = F_in^K + ConV(LN(F_cat^K))

where LN denotes the normalization layer, ConV the convolution layer and Cat the Concat operation; F̂_c^K, F̂_m^K and F̂_f^K denote the output features of the three feature images of different resolutions after the cross attention module; F_in^K denotes the input of the K-th MCA module; F_out^K denotes the output of the K-th MCA module; F_cat^K denotes the image feature pair obtained by up-sampling the outputs of each attention layer in the K-th MCA module back to the input resolution, merging them along the channels and recompressing.
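A rough sketch of one MCA module along the lines of the description above is given below (PyTorch; the pooling factors (4, 2, 1), the patch size, the use of GroupNorm in place of the LN layer, and bilinear up-sampling are illustrative assumptions, and the W_q/W_k/W_v projections are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def cross_attention(q_feat, kv_feat):
    # global cross attention over token sequences: softmax(Q K^T / sqrt(D)) V
    # q_feat, kv_feat: (B, N, C); linear projections omitted for brevity
    d = q_feat.shape[-1]
    attn = torch.softmax(q_feat @ kv_feat.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ kv_feat


def tokens(x):
    # (B, C, H, W) -> (B, H*W, C)
    return x.flatten(2).transpose(1, 2)


def windows(x, s):
    # partition (B, C, H, W) into non-overlapping S x S patches -> (B*nW, S*S, C)
    b, c, h, w = x.shape
    x = x.view(b, c, h // s, s, w // s, s).permute(0, 2, 4, 3, 5, 1)
    return x.reshape(-1, s * s, c)


def unwindow(x, b, c, h, w, s):
    # inverse of windows()
    x = x.view(b, h // s, w // s, s, s, c).permute(0, 5, 1, 3, 2, 4)
    return x.reshape(b, c, h, w)


class MCAModule(nn.Module):
    """Sketch of one MCA block: cross attention at three resolutions, up-sampling to a
    common size, channel concatenation, LN + single conv, residual add with the input."""

    def __init__(self, dim, patch_size=4, factors=(4, 2, 1)):
        super().__init__()
        self.s, self.factors = patch_size, factors     # pooling factors are an assumption
        self.norm = nn.GroupNorm(1, 3 * dim)           # stands in for the LN layer
        self.fuse = nn.Conv2d(3 * dim, dim, 1)         # single conv recompressing channels

    def forward(self, f_in_s, f_in_t):
        # f_in_s, f_in_t: (B, C, H, W); H, W assumed divisible by the factors and patch size
        b, c, h, w = f_in_s.shape
        outs = []
        for idx, k in enumerate(self.factors):
            fs = F.avg_pool2d(f_in_s, k) if k > 1 else f_in_s   # stride average pooling
            ft = F.avg_pool2d(f_in_t, k) if k > 1 else f_in_t
            if idx == 0:
                # lowest resolution: pixel-by-pixel (global) cross attention
                o = cross_attention(tokens(fs), tokens(ft))
                o = o.transpose(1, 2).reshape(b, c, h // k, w // k)
            else:
                # higher resolutions: attention inside S x S patches
                o = cross_attention(windows(fs, self.s), windows(ft, self.s))
                o = unwindow(o, b, c, fs.shape[-2], fs.shape[-1], self.s)
            outs.append(F.interpolate(o, size=(h, w), mode="bilinear", align_corners=False))
        f_cat = torch.cat(outs, dim=1)                  # merge along the channel dimension
        return f_in_s + self.fuse(self.norm(f_cat))     # F_out = F_in + ConV(LN(F_cat))
```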
3. For the optical flow prediction of consecutive frames in feature (2), the two consecutive frames of images I_t and I_{t+1} pass through the image feature enhancement network based on the multi-scale cross attention module (MCA) to finally obtain the image feature pair F_1 and F_2. The correlation between the two image features is then calculated to compare the feature similarity of each pixel of F_1 with respect to all pixels of F_2; the operation is defined as follows:

CM = F_1 F_2^T

where CM is the correlation matrix in which each element represents the correlation between a coordinate in F_1 and a coordinate in F_2. The last two dimensions of CM are then normalized with a softmax operation to obtain the matching probability distribution M, defined as follows:

M = softmax(CM)

Then the weighted average of the 2D coordinates of the pixel grid G under the matching probability distribution M gives the correspondence matrix Ĝ, where the pixel grid G has the same size as F_1 and F_2; the operation is defined as follows:

Ĝ = MG

Finally, the optical flow OF is obtained by calculating the coordinate difference of corresponding pixels, as follows:

OF = Ĝ − G
4. For the pixel processing of occluded regions in feature (3): the high-quality optical flow estimation result of the matched region is propagated to the unmatched region through the self-similarity of the features, implemented with a self-attention layer whose attention is computed within a sliding window, defined as follows:

ÕF = OF + softmax(QK^T/√D)·OF

where Q and K are linear projections of the image features output by the multi-scale feature matching network, the value is the 2D optical flow OF, and ÕF denotes the output optical flow before upsampling; finally, upsampling is used to obtain the final optical flow.
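A simplified sketch of this propagation step follows (PyTorch; for readability the attention is computed globally rather than within a sliding window, and the query/key projections of the image features are assumptions):

```python
import torch
import torch.nn as nn


class FlowPropagation(nn.Module):
    """Sketch of the occlusion handling layer: appearance self-similarity of the image
    features is used to propagate flow from matched to unmatched (occluded) pixels."""

    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)

    def forward(self, feat, flow):
        # feat: (B, H*W, C) image features; flow: (B, H*W, 2) flow from the matching module
        q, k = self.w_q(feat), self.w_k(feat)
        d = q.shape[-1]
        # self-similarity in appearance space (the patent restricts this to a sliding window)
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        # value = the 2D optical flow; residual add gives the refined flow before upsampling
        return flow + attn @ flow
```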
Compared with the prior art, the invention has the following beneficial effects: the invention proposes a multi-scale cross attention module (MCA) that supplements the correlated information among different image blocks within the same feature map, so that the network can learn multi-scale image information. Meanwhile, the pixel occlusion problem is handled by modeling the self-similarity of the image, yielding the final predicted optical flow.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a multi-scale spatial cross attention Module (MCA) based optical flow network architecture;
FIG. 3 is an initialization structure diagram;
FIG. 4 is a block diagram of a multi-scale spatial cross attention Module (MCA);
FIG. 5 is a diagram of the Concat structure in the MCA module;
FIG. 6 is a cross-attention layer structure diagram;
FIG. 7 is a block diagram of optical flow estimation;
FIG. 8 is a block diagram of the occlusion processing module.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
With reference to FIGS. 1-8, the present invention is implemented by the following steps:
Step one: constructing an image feature enhancement network based on the multi-scale spatial cross attention module (MCA).
As shown in FIG. 2, two consecutive frames of images I_t and I_{t+1} are input into the feature extraction convolution network. The feature extraction module is composed of a convolution with stride 2 and size 7 × 7, a 1 × 1 convolution and three residual modules, each residual module being composed of convolutions with stride 1 and size 3 × 3. A pair of feature maps is obtained after the feature extraction module, and the image feature pair is then down-sampled to H_l × W_l × C, where H_l and W_l are the height and width after down-sampling to the lowest resolution, a manually specified constant value that changes with the resolution of the input image. After position encoding, the pair is input to a cross attention layer whose attention operation, shown in FIG. 6, is defined as:

Q = F_s^i W_q,  K = F_t^i W_k,  V = F_t^i W_v
M = CAtt(Q, K, V) = softmax(QK^T/√D)V

where CAtt denotes the cross attention calculation; Q, K and V are linear projections of the input features: Q is derived from the source feature F_s^i and K, V from the target feature F_t^i, with F_s denoting the source feature, F_t the target feature, and i the feature index at different time instants; T denotes the matrix transpose, softmax(·) the normalization operation, and D the dimension of Q and K; W_q, W_k and W_v are three different parameter matrices whose global cross attention score M is used to update the source feature through the FFN layer. The update operation is defined as:
F̂_s^i = FFN(Cat(F_s^i, M))

where Cat denotes the Concat operation. Afterwards F̂_s^i and F̂_t^i are restored to the input resolution by upsampling, and the feature images are input into the multi-scale feature matching network composed of MCA modules. The entire multi-scale feature matching network contains N MCA modules. In each MCA module, three image feature pairs of different resolutions pass through cross attention layers, and their corresponding output attention scores are M_c, M_m and M_f. The three image feature pairs of different resolutions are obtained by the same stride average-pooling down-sampling; after the attention layers they are restored to a uniform resolution by the same up-sampling, combined along the channel dimension, passed through a convolution layer, and finally added to the input image feature pair to form the output. For the K-th MCA block, pixel-by-pixel cross attention is computed on the image pair at resolution (H_l, W_l), while the image features at the other two resolutions are first divided: the original feature F_s^i is divided into image feature blocks (patches) of size S × S, which are then input into the cross attention module, where M_c, M_m and M_f are calculated in the same way as M described above. The feature pairs output by the attention layers are up-sampled back to the input resolution and merged along the channel dimension with the channels recompressed to obtain F_cat^K, which is then output through a single-layer convolution network; the operation is as follows:

F_cat^K = Cat(F̂_c^K, F̂_m^K, F̂_f^K)
F_out^K = F_in^K + ConV(LN(F_cat^K))

where LN denotes the normalization layer, ConV the convolution layer and Cat the Concat operation; F̂_c^K, F̂_m^K and F̂_f^K denote the output features of the three feature images of different resolutions after the cross attention module; F_in^K denotes the input of the K-th MCA module; F_out^K denotes the output of the K-th MCA module; F_cat^K denotes the image feature pair obtained by up-sampling the outputs of each attention layer in the K-th MCA module back to the input resolution, merging them along the channels and recompressing.
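Before moving on to step two, the convolutional feature extraction front end of this step can be sketched as follows (a PyTorch sketch; the layer widths 64, 64, 96, 128, 128 follow the embodiment described later, while the normalization type and the placement of down-sampling are assumptions):

```python
import torch.nn as nn


class ResidualBlock(nn.Module):
    # residual module built from stride-1, 3 x 3 convolutions
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
            nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
            nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.skip(x)


class FeatureExtractor(nn.Module):
    """Sketch of the feature extraction front end: a 7 x 7 stride-2 convolution, three
    residual blocks of 3 x 3 convolutions, and a final 1 x 1 convolution."""

    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),
            nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
        )
        self.res = nn.Sequential(
            ResidualBlock(64, 64), ResidualBlock(64, 96), ResidualBlock(96, 128)
        )
        self.out = nn.Conv2d(128, 128, 1)  # last layer: no normalization / activation

    def forward(self, img):
        # img: (B, 3, H, W) -> (B, 128, H/2, W/2); any further down-sampling inside the
        # residual blocks is not specified by the patent and is omitted here
        return self.out(self.res(self.stem(img)))
```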
Step two: constructing the optical flow estimation network.
As shown in FIG. 7, two consecutive frames of video images I_t and I_{t+1} are input into the whole image feature enhancement network, and preliminary image features are first obtained through the convolutional neural network. Because the two sets of features carry no notion of spatial position, a fixed two-dimensional sine-and-cosine position code is first added to the features. Adding location information allows the matching process to take into account not only feature similarity but also spatial distance, which helps resolve ambiguities and improves performance. After the position information is added and the features pass through the N MCA modules, the image feature pair F_1 and F_2 is finally output. Considering that corresponding pixels in the two frames should have high similarity, the correlation between the two image features is calculated to compare the feature similarity of each pixel of F_1 with respect to all pixels of F_2. This step can be accomplished by a simple dot product, defined as follows:

CM = F_1 F_2^T

where CM is the correlation matrix in which each element represents the correlation between a coordinate in F_1 and a coordinate in F_2. To determine the correspondence, a simple method is to directly take the position with the largest correlation; however, this operation is not differentiable, which would hamper end-to-end training. To solve this problem a differentiable matching layer is used: the last two dimensions of CM are normalized with the softmax operation to obtain the matching probability distribution M, defined as follows:

M = softmax(CM)

Then the weighted average of the 2D coordinates of the pixel grid G under the matching probability distribution M gives the correspondence matrix Ĝ, where the pixel grid G has the same size as F_1 and F_2; the operation is defined as follows:

Ĝ = MG

Finally, the optical flow OF is obtained by calculating the coordinate difference of corresponding pixels, as follows:

OF = Ĝ − G
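The fixed two-dimensional sine-and-cosine position code mentioned above can be formed, for example, as follows (a sketch; the equal split of channels between the y and x coordinates and the frequency base of 10000 follow the common Transformer convention and are assumptions, not taken from the patent):

```python
import torch


def position_encoding_2d(h, w, dim):
    """Sketch of a fixed 2D sine-cosine position code of shape (H, W, dim).
    Half of the channels encode the y coordinate, half the x coordinate (assumed split)."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    d = dim // 4
    freqs = 1.0 / (10000 ** (torch.arange(d, dtype=torch.float32) / d))
    ys = torch.arange(h, dtype=torch.float32)[:, None] * freqs          # (H, d)
    xs = torch.arange(w, dtype=torch.float32)[:, None] * freqs          # (W, d)
    pe_y = torch.cat([torch.sin(ys), torch.cos(ys)], dim=-1)            # (H, 2d)
    pe_x = torch.cat([torch.sin(xs), torch.cos(xs)], dim=-1)            # (W, 2d)
    pe = torch.cat([pe_y[:, None, :].expand(h, w, 2 * d),
                    pe_x[None, :, :].expand(h, w, 2 * d)], dim=-1)      # (H, W, dim)
    return pe


# usage: add the same code to both feature maps before they enter the matching network
pe = position_encoding_2d(32, 48, 128)
print(pe.shape)  # torch.Size([32, 48, 128])
```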
step three: and processing pixels of the occlusion area.
The principle of predicting the optical flow is that position information is added before the characteristics enter a related layer, firstly, a remarkable characteristic point is found through autocorrelation, then, corresponding positions of the characteristic points of the two images are found through cross attention, obviously, the position information difference of the characteristic points of the two images is the optical flow, but the process considers that pixels of an occlusion area do not exist, once the pixels of the occlusion area exist, the characteristic coordinates of a target image cannot correspond to the positions of the occluded pixels, and the predicted optical flow is inaccurate at this time. But because the structure of the target image and the optical flow image has a certain similarity. To resolve ambiguities caused by occlusion, the idea is to allow the network to reason about at a higher level, that is, to globally aggregate the motion characteristics of similar pixels after implicitly reasoning which pixels are similar in the appearance feature space. It is assumed that by finding points with similar appearance in the frame of reference, the network will be able to find points with similar motion. This is because it is observed that the movement of points on a single object is generally uniform. This statistical bias is used to propagate motion information from high confidence non-occluded pixels to low confidence occluded pixels. Here, the confidence may be interpreted as whether there is a significant match, i.e. a high correlation value at the correct displacement. Transformers networks are known for their ability to model long-term dependencies. In the self-attention mechanism of transformations, query, key, and value are from the same feature vector, unlike the self-attention mechanism in transformations, where a generalized attention variant is used. The query and key features are output by the image feature network to model the appearance self-similarity in frame 1. The value feature is the optical flow projection through the softmax layer, and the output optical flow is coded by 4D related volume. The attention matrix computed from the query and key features is used to aggregate the value features as a motion hidden representation.
As shown in fig. 8, the high-quality optical flow estimation result of the matching area is propagated to the unmatched area by the self-similarity of the features, and the method is implemented by a self-attention layer, and the calculation is performed in a manner of calculating attention by a sliding window, which is defined as follows:
Figure BDA0003957348130000071
wherein
Figure BDA0003957348130000072
Representing the output optical flow before upsampling. Finally, up-sampling is used to obtain the final optical flow.
Step four: two continuous frames of images are input at the input end of the network, and the network is supervised and trained by using the whole network loss function. All traffic predictions are monitored using the group route.
Step five: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
Two consecutive frames of images I_t and I_{t+1} are first input into the feature extraction module: a convolution layer with stride 2 and size 7 × 7, followed by three consecutive residual blocks, each formed by convolutions with stride 1 and size 3 × 3, and finally a 1 × 1 convolution; the numbers of convolution kernels in the layers are 64, 64, 96, 128 and 128. Except for the last convolution layer, each convolution layer is followed by a normalization layer and a ReLU activation layer. A pair of feature maps is obtained after the feature extraction module, and the image feature pair is then down-sampled by stride average pooling to H_l × W_l × C, where H_l and W_l are the height and width after down-sampling to the lowest resolution, a manually specified constant value that varies with the resolution of the input image. After position encoding, the pair is input to a global cross attention layer, which is defined as follows:

Q = F_s^i W_q,  K = F_t^i W_k,  V = F_t^i W_v
M = CAtt(Q, K, V) = softmax(QK^T/√D)V

where CAtt denotes the cross attention calculation; Q, K and V are linear projections of the input features: Q is derived from the source feature F_s^i and K, V from the target feature F_t^i, with F_s denoting the source feature, F_t the target feature, and i the feature index at different time instants; T denotes the matrix transpose, softmax(·) the normalization operation, and D the dimension of Q and K; W_q, W_k and W_v are three different parameter matrices whose global cross attention score M is used to update the source feature through the FFN layer. The update operation is defined as:
F̂_s^i = FFN(Cat(F_s^i, M))

where Cat denotes the Concat operation. Afterwards F̂_s^i and F̂_t^i are restored to the input resolution through bilinear upsampling, and the feature images are then input into the multi-scale feature matching network composed of MCA modules. The entire multi-scale feature matching network contains N MCA modules. In each MCA module, three image feature pairs of different resolutions pass through cross attention layers, and their corresponding output attention scores are M_c, M_m and M_f. The three image feature pairs of different resolutions are obtained by the same stride average-pooling down-sampling; after the attention layers they are restored bilinearly to a uniform resolution, combined along the channel dimension, passed through a convolution layer, and finally added to the input image feature pair to form the output. For the K-th MCA block, pixel-by-pixel cross attention is computed on the image pair at resolution (H_l, W_l), while the image features at the other two resolutions are first divided: the original feature F_s^i is divided into image feature blocks (patches) of size S × S, which are then input into the cross attention module, where M_c, M_m and M_f are calculated in the same way as M described above. The feature pairs output by the attention layers are up-sampled back to the input resolution and merged along the channel dimension with the channels recompressed to obtain F_cat^K, which is then output through a single-layer convolution network; the operation is as follows:

F_cat^K = Cat(F̂_c^K, F̂_m^K, F̂_f^K)
F_out^K = F_in^K + ConV(LN(F_cat^K))
where LN denotes the normalization layer, ConV the convolution layer and Cat the Concat operation; F̂_c^K, F̂_m^K and F̂_f^K denote the output features of the three feature images of different resolutions after the cross attention module; F_in^K denotes the input of the K-th MCA module; F_out^K denotes the output of the K-th MCA module; F_cat^K denotes the image feature pair obtained by up-sampling the outputs of each attention layer in the K-th MCA module back to the input resolution, merging them along the channels and recompressing. After the N MCA modules, the image feature pair F_1 and F_2 is finally output. The correlation between the two image features is then calculated to compare the feature similarity of each pixel of F_1 with respect to all pixels of F_2. This step can be accomplished by a simple dot product, whose operation is defined as follows:
CM = F_1 F_2^T

where CM is the correlation matrix in which each element represents the correlation between a coordinate in F_1 and a coordinate in F_2. Normalizing the last two dimensions of CM with the softmax operation yields the matching probability distribution M, defined as follows:

M = softmax(CM)

Then the weighted average of the 2D coordinates of the pixel grid G under the matching probability distribution M gives the correspondence matrix Ĝ, where the pixel grid G has the same size as F_1 and F_2; the operation is defined as follows:

Ĝ = MG

Finally, the optical flow OF is obtained by calculating the coordinate difference of corresponding pixels, as follows:

OF = Ĝ − G
Then the high-quality optical flow estimation result of the matched region is propagated to the unmatched region through the self-similarity of the features. This is implemented with a self-attention layer whose attention is computed within a sliding window, defined as follows:

ÕF = OF + softmax(QK^T/√D)·OF

where ÕF denotes the output optical flow before upsampling, Q and K are linear projections of the image features, and the value is the 2D optical flow OF. Finally, upsampling is used to obtain the final optical flow.
In summary, the invention provides an optical flow estimation method based on multi-scale global cross matching, comprising the following steps: 1. constructing an image feature enhancement network based on the multi-scale cross attention module (MCA); 2. constructing an optical flow estimation module; 3. constructing a pixel processing module for occluded regions; 4. inputting two consecutive frames of images at the network input and performing supervised training; 5. inputting two consecutive frames of images into the trained model for testing and outputting the corresponding estimated optical flow. The invention proposes a multi-scale cross attention module (MCA) that supplements the correlated information among different image blocks within the same feature map, so that the network can learn image information at multiple resolutions. Meanwhile, the pixel occlusion problem is handled by modeling the self-similarity of the image, yielding the final predicted optical flow.

Claims (5)

1. An optical flow estimation method based on multi-scale global cross matching is characterized in that:
(1) Constructing an image feature enhancement network based on the multi-scale cross attention module. The model comprises a feature extraction convolution network and a multi-scale feature matching network composed of MCA modules. A convolution operation is first performed on the input images to extract image features; the feature images are then position-encoded and input into the multi-scale feature matching network composed of MCA modules. The whole multi-scale feature matching network contains N MCA modules, and each MCA module performs correlation calculations at three different resolutions of the input. The image pairs at the three resolutions are obtained by down-sampling, restored to a uniform resolution by up-sampling after passing through the attention layers and fused, and then added to the input to form the output;
(2) Constructing an optical flow estimation module. The two frames of image features output by the last layer of the multi-scale feature matching network are input into the optical flow estimation module for prediction; the module consists of a global matching module and a softmax layer. A dot-product operation is performed on the input features to obtain the global correlation, whose last two dimensions are normalized by softmax to obtain the matching probability; the weighted average of the 2D coordinates of the pixel grid points under the matching probability gives the correspondence matrix; finally, the optical flow is obtained by calculating the coordinate difference between corresponding points;
(3) Constructing a pixel processing module for occluded regions. The module consists of a self-attention layer: the 2D optical flow output by the optical flow estimation module and the target-image features output by the multi-scale feature matching network are input into the self-attention layer, whose output is then added to the 2D optical flow to obtain the final optical flow;
(4) Inputting two consecutive frames of images at the input end of the network and performing supervised training;
(5) Inputting two consecutive frames of images into the trained model for testing and outputting the corresponding estimated optical flow.
2. The optical flow estimation method based on multi-scale global cross matching according to claim 1, wherein:
in feature (1), the two consecutive frames of images I_t and I_{t+1} are first input into a feature extraction network formed by convolutions to obtain a pair of feature maps, where H and W denote the height and width of the input image and C the number of channels; the feature images are then down-sampled to H_l × W_l × C, where H_l and W_l are the height and width after down-sampling to the lowest resolution, a manually specified constant value that changes with the resolution of the input image, and, after position encoding, are input to a cross attention layer whose attention operation is defined as:

Q = F_s^i W_q,  K = F_t^i W_k,  V = F_t^i W_v
M = CAtt(Q, K, V) = softmax(QK^T/√D)V

where CAtt denotes the cross attention calculation; Q, K and V are linear projections of the input features: Q is derived from the source feature F_s^i and K, V from the target feature F_t^i, with F_s denoting the source feature, F_t the target feature, and i the feature index at different time instants; T denotes the matrix transpose, softmax(·) the normalization operation, and D the dimension of Q and K; W_q, W_k and W_v are three different parameter matrices; the global cross attention score M is used to update the source feature through the feed-forward neural network (FFN) layer, whose update operation is defined as:

F̂_s^i = FFN(Cat(F_s^i, M))

where Cat denotes the Concat operation; the updated output features F̂_s^i and F̂_t^i are restored to the input size by upsampling and then input into the multi-scale image feature matching network consisting of MCA modules.
3. The optical flow estimation method based on multi-scale global cross matching according to claim 2, wherein: the MCA module in feature (1) obtains image feature pairs at three different resolutions by down-sampling, the lowest being H_l × W_l × C and the other two progressively higher; the attention scores output at the three feature resolutions are M_c, M_m and M_f respectively; for the K-th MCA block, pixel-by-pixel cross attention is calculated on the feature image pair at resolution (H_l, W_l), while the features at the other two resolutions are first divided, the original feature F_s^i being divided into image feature blocks (patches) of size S × S, which are then input into the cross attention module, where M_c, M_m and M_f are calculated in the same way as M; the feature pairs output by each attention layer are up-sampled back to the input size, merged along the channel dimension and recompressed to obtain F_cat^K, and then output through a single-layer convolution network, whose operation is as follows:

F_cat^K = Cat(F̂_c^K, F̂_m^K, F̂_f^K)
F_out^K = F_in^K + ConV(LN(F_cat^K))

where LN denotes the normalization layer, ConV the convolution layer and Cat the Concat operation; F̂_c^K, F̂_m^K and F̂_f^K denote the output features of the three feature images of different resolutions after the cross attention module; F_in^K denotes the input of the K-th MCA module; F_out^K denotes the output of the K-th MCA module; F_cat^K denotes the image feature pair obtained by up-sampling the outputs of each attention layer in the K-th MCA module back to the input resolution, merging them along the channels and recompressing.
4. The optical flow estimation method based on multi-scale global cross matching according to claim 1, wherein: for the optical flow prediction of consecutive frames in feature (2), the two consecutive frames of images I_t and I_{t+1} pass through the image feature enhancement network based on the multi-scale cross attention module (MCA) to finally obtain the image feature pair F_1 and F_2; the correlation between the two image features is then calculated to compare the feature similarity of each pixel of F_1 with respect to all pixels of F_2, the operation being defined as follows:

CM = F_1 F_2^T

where CM is the correlation matrix in which each element represents the correlation between a coordinate in F_1 and a coordinate in F_2; the last two dimensions of CM are then normalized with a softmax operation to obtain the matching probability distribution M, defined as follows:

M = softmax(CM)

then the weighted average of the 2D coordinates of the pixel grid G under the matching probability distribution M gives the correspondence matrix Ĝ, where the pixel grid G has the same size as F_1 and F_2, the operation being defined as follows:

Ĝ = MG

finally, the optical flow OF is obtained by calculating the coordinate difference of corresponding pixels, as follows:

OF = Ĝ − G
5. The optical flow estimation method based on multi-scale global cross matching according to claim 1, wherein: for the pixel processing of occluded regions in feature (3), the high-quality optical flow estimation result of the matched region is propagated to the unmatched region through the self-similarity of the features, implemented with a self-attention layer whose attention is computed within a sliding window, defined as follows:

ÕF = OF + softmax(QK^T/√D)·OF

where Q and K are linear projections of the image features output by the multi-scale feature matching network, the value is the 2D optical flow OF, and ÕF denotes the output optical flow before upsampling; finally, upsampling is used to obtain the final optical flow.
CN202211474506.7A 2022-11-22 2022-11-22 Optical flow estimation method based on multi-scale global cross matching Pending CN115861647A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211474506.7A CN115861647A (en) 2022-11-22 2022-11-22 Optical flow estimation method based on multi-scale global cross matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211474506.7A CN115861647A (en) 2022-11-22 2022-11-22 Optical flow estimation method based on multi-scale global cross matching

Publications (1)

Publication Number Publication Date
CN115861647A true CN115861647A (en) 2023-03-28

Family

ID=85665351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211474506.7A Pending CN115861647A (en) 2022-11-22 2022-11-22 Optical flow estimation method based on multi-scale global cross matching

Country Status (1)

Country Link
CN (1) CN115861647A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116563147A (en) * 2023-05-04 2023-08-08 北京联合大学 Underwater image enhancement system and method
CN116563147B (en) * 2023-05-04 2024-03-26 北京联合大学 Underwater image enhancement system and method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination