CN114187331A - Unsupervised optical flow estimation method based on Transformer feature pyramid network - Google Patents
- Publication number: CN114187331A (application CN202111506127.7A)
- Authority
- CN
- China
- Prior art keywords: image, layer, network, feature, optical flow
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/269 — Analysis of motion using gradient-based methods
- G06F18/22 — Matching criteria, e.g. proximity measures
- G06N3/045 — Combinations of networks
- G06N3/088 — Non-supervised learning, e.g. competitive learning
- G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T2207/10016 — Video; Image sequence
- G06T2207/20016 — Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
Abstract
The invention belongs to the technical field of computer vision and relates to an unsupervised optical flow estimation method based on a Transformer feature pyramid network. The method constructs a Transformer feature pyramid network whose re-attention mechanism strengthens the pyramid's ability to extract image features; builds an optical flow estimation network that performs optical flow prediction; applies occlusion compensation to pixels in occluded regions; and designs a loss function for unsupervised training of the whole network, yielding a fast and accurate unsupervised optical flow estimation model. The method strengthens the feature extraction capability of each pyramid layer and compensates occluded pixels in the image, thereby improving the accuracy of optical flow estimation.
Description
Technical Field
The invention belongs to the technical field of computer vision, and relates in particular to an unsupervised optical flow estimation method based on a Transformer feature pyramid network.
Background
Optical flow estimation is an important research direction in computer vision, with broad application prospects in autonomous driving, intelligent robotics, motion and expression recognition, target tracking, and related fields. With the development of deep learning, many researchers have adopted deep learning techniques for the optical flow estimation problem; these methods run fast, achieve high accuracy, and lead on several common datasets. However, approaches that rely solely on convolutional neural networks still face open problems, such as occlusion handling, small-target detection, and the inability of convolutions to capture global feature information. Fusing a Transformer model with a feature pyramid network can effectively address these problems.
The Transformer is a self-attention model whose computational efficiency and scalability have, in recent years, led to wide adoption in computer vision. Its main fields of application currently include image classification, image recognition, target detection, semantic segmentation, and image generation.
Disclosure of Invention
The invention aims to provide an unsupervised optical flow estimation method based on a Transformer feature pyramid network that strengthens the feature extraction capability of each pyramid layer and applies occlusion compensation to occluded pixels in the image, thereby improving the accuracy of optical flow estimation.
An unsupervised optical flow estimation method based on a Transformer feature pyramid network comprises the following steps:
step 1: constructing a characteristic pyramid network based on a Transformer;
the pyramid network based on the Transformer characteristics has two identical network branches, and has 12 layers in total, so that 6 image characteristics can be extracted, and the network of each branch shares the weight; fusing a Transformer model in a second layer, a fourth layer, a sixth layer, an eighth layer, a tenth layer and a last layer based on a Transformer characteristic pyramid network to enhance the characteristic extraction capability of the network, wherein the model consists of an image segmentation module and a convolution mapping re-attention mechanism module; the image segmentation module firstly carries out deformable convolution operation on an input image to extract local features such as edges and the like, then segments the image into sequences and inputs the sequences into the convolution mapping and attention re-paying mechanism module; a convolution mapping re-attention mechanism module performs re-attention mechanism operation on the image sequence to extract global features;
step 2: constructing an optical flow estimation network; inputting image features extracted from each layer in the Transformer feature pyramid network into an optical flow estimation network for prediction;
the optical flow estimation network is composed of 5 layers of convolutional neural networks, firstly, feature deformation is carried out on an input second frame image, then, a feature matching relation between the second frame image after the feature deformation and a first frame image is calculated, namely, a feature matching cost volume is calculated, then, the calculated feature matching cost volume is input into the optical flow estimation network to predict optical flow, and the output optical flow result is subjected to post-processing by utilizing a context network to obtain accurate predicted optical flow;
Step 3: performing occlusion compensation on pixels in occluded regions;
The occlusion compensation in the optical flow estimation network proceeds as follows: first, a difference comparison is performed between the reconstructed first-frame image and the original first-frame image; then pixels in the occluded region are extracted and compensated. The difference comparison warps the second-frame image with the predicted forward optical flow to reconstruct the first-frame image, and the differences between the reconstruction and the original first frame are added to an occlusion map. The occluded-pixel extraction fills the reconstructed image with the corresponding occluded-region pixels of the original first frame, and occlusion compensation is performed by computing the difference between this filled reconstruction and the original first frame;
Step 4: designing the loss function for training the whole network. Similarity information processing is combined into the optical flow estimation network to construct the overall training loss: a function describing the feature similarity between the different image blocks input to the Transformer model is defined to promote diversity among the blocks, and a contrast loss function and an error loss function are defined so that the features attended to by deep network layers come from their corresponding shallow-layer inputs and are unrelated to the remaining shallow inputs. The loss terms of the Transformer model at each pyramid layer are weighted, summed, and combined with the occlusion compensation loss as the overall training loss that constrains network training;
Step 5: inputting two consecutive frames at the network input and training the network unsupervised with the overall network loss function;
step 6: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
Further, in step 1, the Transformer feature pyramid network has 12 convolutional layers comprising 6 stages of Transformer models; the Transformer model in each stage has the same architecture, and 6 feature maps of different sizes are extracted stage by stage. The channel counts from the first to the sixth stage are 16, 32, 64, 96, 128, and 196, respectively;
The first convolutional layer takes a 3 × 512 × 512 feature map and, with a 3 × 3 kernel and stride 1, outputs a 16 × 512 × 512 feature map; the second layer takes the 16 × 512 × 512 map and, with a 3 × 3 kernel and stride 2, outputs a 16 × 256 × 256 map that is fed to the first-stage Transformer model;
The third layer takes a 16 × 256 × 256 map and, with a 3 × 3 kernel and stride 1, outputs a 16 × 256 × 256 map; the fourth layer takes the 16 × 256 × 256 map and, with a 3 × 3 kernel and stride 2, outputs a 32 × 128 × 128 map into the second-stage Transformer model;
The fifth layer takes a 32 × 128 × 128 map and, with a 3 × 3 kernel and stride 1, outputs a 32 × 128 × 128 map; the sixth layer takes the 32 × 128 × 128 map and, with a 3 × 3 kernel and stride 2, outputs a 64 × 64 × 64 map into the third-stage Transformer model;
The seventh layer takes a 64 × 64 × 64 map and, with a 3 × 3 kernel and stride 1, outputs a 64 × 64 × 64 map; the eighth layer takes the 64 × 64 × 64 map and, with a 3 × 3 kernel and stride 2, outputs a 96 × 32 × 32 map into the fourth-stage Transformer model;
The ninth layer takes a 96 × 32 × 32 map and, with a 3 × 3 kernel and stride 1, outputs a 96 × 32 × 32 map; the tenth layer takes the 96 × 32 × 32 map and, with a 3 × 3 kernel and stride 2, outputs a 128 × 16 × 16 map into the fifth-stage Transformer model;
The eleventh layer takes a 128 × 16 × 16 map and, with a 3 × 3 kernel and stride 1, outputs a 128 × 16 × 16 map; the twelfth layer takes the 128 × 16 × 16 map and, with a 3 × 3 kernel and stride 2, outputs a 196 × 8 × 8 map into the sixth-stage Transformer model;
The feature map output by each convolutional layer passes through its Transformer model and is then fed to the next convolutional layer of the feature pyramid, strengthening the network's feature extraction from the image.
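To make the 12-layer schedule above concrete, the following sketch (our illustration, not part of the patent) propagates feature-map shapes through the pyramid: odd layers keep resolution at stride 1, even layers halve it at stride 2 and widen the channels to the next stage's count.

```python
def pyramid_shapes(h=512, w=512):
    """Return the output (C, H, W) shape of each of the 12 conv layers."""
    stage_channels = [16, 32, 64, 96, 128, 196]
    shapes, c = [], 3
    for layer in range(1, 13):
        if layer % 2 == 1:
            # stride-1 layer: resolution unchanged (layer 1 lifts 3 -> 16)
            c = stage_channels[0] if layer == 1 else c
        else:
            # stride-2 layer: halve H and W, widen C to this stage's count
            c = stage_channels[layer // 2 - 1]
            h, w = h // 2, w // 2
        shapes.append((c, h, w))
    return shapes
```

Running it reproduces the shapes listed in the text, from 16 × 512 × 512 at the first layer down to 196 × 8 × 8 at the twelfth.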
Further, in step 1, the image segmentation module of the Transformer feature pyramid network applies a deformable convolution to the input image so that, as the convolutional layers deepen, the number of image sequences fed to the convolution-mapping re-attention module gradually decreases while the width of each input sequence expands; that is, the feature resolution decreases and the feature dimension increases. The specific steps are as follows:
Step 1.1.1: the i-th pyramid layer output image x_i is fed to the image segmentation module; edge features are extracted by a deformable convolution to give the output image x'_i, and the result is batch-normalized and max-pooled to give the output image x''_i:

x''_i = MaxPool(BN(Deconv(x_i)))

where x_i is the i-th layer output image of the Transformer feature pyramid network, with feature size C_i × H_i × W_i; H_i × W_i is the resolution and C_i the channel count of the i-th layer image; Deconv(·) is the deformable convolution; BN(·) is batch normalization; MaxPool(·) is max pooling; and x''_i is the output image after max pooling;
Step 1.1.2: the max-pooled image x''_i is sliced into 2D image blocks x^p_i, each of size P × P × C_i, where N = H_i W_i / P² is the number of image sequences after slicing (also the effective input length of the convolution-mapping re-attention module) and (P, P) is the resolution of each block after slicing;

Step 1.1.3: the sliced image blocks are flattened into N image sequences; each sequence forms a two-dimensional matrix (N, D), with D = P²C;
Step 1.1.4: the converted image sequences are marked with a binary mask: 1 marks sequences at positions carrying important information, and 0 marks sequences whose information is similar to others'; the marking is chosen adaptively according to the information each sequence contains. The marked sequences are fed to the convolution-mapping re-attention module, where the sequences marked 1 participate in the interactive computation and each sequence marked 0 is computed only against itself.
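Steps 1.1.2 and 1.1.3 amount to a patchify-and-flatten operation. The sketch below is our simplification (the deformable-convolution and max-pool front end of step 1.1.1 and the binary-mask marking of step 1.1.4 are omitted); it slices a (C, H, W) map into N flattened sequences of length D = P²C:

```python
import numpy as np

def patchify(x, P):
    """Split a (C, H, W) feature map into N = (H/P)*(W/P) flattened patches.

    Returns an (N, D) matrix with D = P*P*C, mirroring steps 1.1.2-1.1.3.
    """
    C, H, W = x.shape
    assert H % P == 0 and W % P == 0
    # (C, H/P, P, W/P, P) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    blocks = x.reshape(C, H // P, P, W // P, P).transpose(1, 3, 2, 4, 0)
    return blocks.reshape(-1, P * P * C)
```

For an 8 × 8 map with 2 channels and P = 4, this yields 4 sequences of length 32, and no values are lost or duplicated in the rearrangement.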
Further, the convolution-mapping re-attention module of the Transformer feature pyramid network in step 1 comprises two parts, a multi-head re-attention operation and a multi-head linear processing, with the following steps:
Step 1.2.1: the input i-th layer image sequence z_i is linearly mapped and spatially recombined into a 2D image; three depth-separable convolutions produce three projected features, which are flattened to obtain three different two-dimensional vectors q_i, k_i, v_i of the i-th layer, each defined as follows:

q_i / k_i / v_i = Flatten(Conv2d(Reshape2d(z_i), s))

where Reshape2d(·) spatially recombines the image sequence; s is the convolution kernel size; Conv2d(·) is the depth-separable convolution; Flatten(·) flattens the mapped result into a two-dimensional vector; z_i is the i-th layer flattened image sequence; and q_i, k_i, v_i are the two-dimensional vectors obtained by flattening the i-th layer after the depth-separable convolutions;
Step 1.2.2: the resulting two-dimensional vectors of all input image sequences can be written in matrix form in the multi-head re-attention mechanism as q = [q_1, q_2, ..., q_N], k = [k_1, k_2, ..., k_N], v = [v_1, v_2, ..., v_N]. As the model deepens, the attention maps of the same image sequence at different layers grow increasingly similar, while the similarity among heads within one layer remains small; a transfer matrix is therefore used to combine information across heads, and attention is computed again with this information. The transfer matrices trained at different layers differ, which avoids attention collapse. The multi-head re-attention operation is defined as:

MSA(z_i) = (θ^T · Softmax(q·k^T / √d)) · v

where q, k, v are the matrices obtained by convolution mapping of each input image sequence; T is matrix transposition; q·k^T measures the correlation of two positions; Softmax(·) normalizes q·k^T; d is the dimension of q and k; θ is the transfer matrix that combines attention across the multiple heads; z_i is the image slice sequence flattened by convolution mapping after spatial recombination; and MSA(·) denotes the multi-head re-attention operation on the image slice sequence;
Step 1.2.3: the input and output of the multi-head re-attention operation are added and then batch-normalized so that the activations of each layer are normalized:

z'_i = BN(MSA(z_i) + z_i)

where MSA(·) is the multi-head re-attention operation; BN(·) is batch normalization; z_i is the image slice sequence flattened by convolution mapping after spatial recombination; and z'_i is the image sequence output after batch normalization;
Step 1.2.4: the multi-head linear processing comprises a feed-forward network and batch normalization. The feed-forward network expands the dimension of each image sequence; the activation function of the batch-normalized processing is the GELU nonlinearity; and a residual connection follows each operation to prevent network degradation:

z''_i = BN(FFN(z'_i) + z'_i)

where FFN(·) is the feed-forward network; BN(·) is batch normalization; GELU(·) is the Gaussian error linear unit activation function; z'_i is the image sequence output after the batch-normalized multi-head re-attention operation; and z''_i is the output of the multi-head linear processing applied to that image sequence;
Step 1.2.5: the image sequence output by the multi-head linear processing is linearly mapped to reduce its dimension, spatially recombined into a 2D image, and output to the next feature pyramid layer.
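The core of step 1.2.2 can be sketched in NumPy as follows. `re_attention` and `theta` are our names; the transfer matrix mixes the softmax attention maps of the heads before they weight v, as described above (the batch normalization and residuals of steps 1.2.3 and 1.2.4 are omitted):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(q, k, v, theta):
    """Multi-head re-attention, NumPy sketch.

    q, k, v: (heads, N, d) projections of the image sequence.
    theta:   (heads, heads) transfer matrix mixing attention across heads.
    """
    h, n, d = q.shape
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (heads, N, N)
    mixed = np.einsum('gh,hij->gij', theta, attn)           # cross-head mix
    return mixed @ v                                        # (heads, N, d)
```

With theta set to the identity, this reduces to ordinary multi-head self-attention, which is a quick sanity check on the head-mixing step.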
Further, in step 2, the input of the optical flow estimation network is two consecutive frames of images, and the output is a corresponding estimated optical flow, specifically comprising the steps of:
Step 2.1: for the two input consecutive frames I_1 and I_2, the second image I_2 undergoes feature warping: the optical flow estimated at layer i-1 of the Transformer feature pyramid network is upsampled by a factor of 2, and the features of the second image are warped toward the first image by bilinear interpolation. The feature warping is defined as:

F_w^i(x) = F_2^i(x + up_2(V^{i-1})(x))

where x is a pixel position; up_2(V^{i-1}) denotes 2× upsampling of the flow result obtained by the (i-1)-th layer flow estimation network; F_2^i is the i-th pyramid layer feature of the second image; and F_w^i is the image feature after feature warping;
Step 2.2: the correlation between the warped second-image features and the first-image features, i.e., the feature-matching cost volume of each layer, is computed as:

CV^i(x_1, x_2) = (1/M) · (F_1^i(x_1))^T · F_w^i(x_2)

where x_1, x_2 are pixel positions in the first and second images, respectively; F_1^i is the feature map of the first image at the i-th layer of the Transformer feature pyramid network; F_w^i is the warped second-image feature at the i-th layer; M is the length of the feature vector; T is matrix transposition; and CV^i(x_1, x_2) is the i-th layer feature-matching cost volume result;
Step 2.3: the feature-matching cost volume result, the features of the first image, and the upsampled optical flow are fed to the i-th flow estimation layer of the optical flow estimation network to obtain that layer's flow estimate.
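Step 2.2 boils down to a normalized dot product between first-image features and warped second-image features. A minimal NumPy sketch of the cost volume (our illustration: it uses a local search window with zero padding, and assumes the bilinear warping of step 2.1 has already been applied to `f2w`):

```python
import numpy as np

def cost_volume(f1, f2w, max_disp=2):
    """Feature-matching cost volume, NumPy sketch.

    f1, f2w: (C, H, W) first-image and warped second-image features at one
    pyramid level. For each pixel, correlate with the warped features at
    offsets in [-max_disp, max_disp]^2, normalised by the feature length C.
    """
    C, H, W = f1.shape
    d = 2 * max_disp + 1
    cv = np.zeros((d * d, H, W))
    pad = np.pad(f2w, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    for i in range(d):
        for j in range(d):
            # dot product over channels at this displacement, scaled by 1/C
            cv[i * d + j] = (f1 * pad[:, i:i + H, j:j + W]).sum(0) / C
    return cv
```

For identical all-ones features the zero-displacement channel is exactly 1 everywhere, which matches the 1/M normalization in the definition above.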
Further, in step 2.3, the flow estimation layers of the optical flow estimation network use five convolutional networks with 128, 96, 64, and 32 channels, and the output flow results are post-processed by a context network. The context network is a feed-forward convolutional neural network based on dilated convolutions, consisting of 7 convolutional layers with 3 × 3 kernels and different dilation coefficients; a dilation coefficient k means that a filter's input units in that layer are k units apart, vertically and horizontally, from its other input units. From top to bottom the dilation coefficients are 1, 2, 4, 8, 16, 1, and 1; the layers with large dilation coefficients effectively enlarge the receptive field of each output unit on the pyramid, producing a refined optical flow.
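With stride-1 3 × 3 layers, each dilation k adds (3 − 1)·k pixels to the receptive field, so the stated schedule 1, 2, 4, 8, 16, 1, 1 yields a 67-pixel field. A tiny sketch of that arithmetic (our illustration):

```python
def receptive_field(kernel=3, dilations=(1, 2, 4, 8, 16, 1, 1)):
    """Receptive field of a stack of stride-1 dilated conv layers.

    Each layer with dilation k widens the field by (kernel - 1) * k,
    which is how the large-dilation layers enlarge each output unit's view.
    """
    rf = 1
    for k in dilations:
        rf += (kernel - 1) * k
    return rf
```

A single plain 3 × 3 layer gives the familiar field of 3; the full 7-layer context network reaches 67 pixels in each direction.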
Further, in step 3, the occlusion compensation of pixels in the occluded region proceeds as follows:
A difference comparison is performed between the flow-based reconstruction and the original image, and the occluded-region pixels are then extracted for occlusion compensation. The difference comparison warps the second-frame image I_2 with the predicted forward optical flow to synthesize a reconstructed first frame I'_1, and the differences between the reconstruction I'_1 and the original I_1 are added to the occlusion map:

L_df^i = Σ_x O_fw(x) · σ(I_1(x), I'_1(x))

where x is a pixel position; O_fw is the occlusion map; I_1(x) and I'_1(x) are the original first frame and the reconstructed first frame, respectively; σ(·) is a pixel-level similarity measure used to compare the original image I_1(x) and the reconstruction I'_1(x); and L_df^i is the difference contrast loss in the i-th layer flow estimation network;
For the occluded-pixel extraction, the corresponding occluded-region pixels of the original image I_1 are used to fill the reconstruction I'_1, giving the reconstructed image I''_1; the loss in the occluded region is computed from the difference between I''_1 and I_1:

L_oc^i = Σ_x σ(I_1(x), I''_1(x))

where x is a pixel position; I''_1(x) is the reconstructed pixel obtained by adding the corresponding occluded pixels to the reconstructed first image I'_1(x); σ(·) is the pixel-level similarity measure; and L_oc^i is the occluded-region loss in the i-th layer flow estimation network;
The occlusion compensation loss L_occ^i of the i-th layer flow estimation network is obtained by summing the two losses:

L_occ^i = L_df^i + L_oc^i

where L_df^i is the difference contrast loss in the i-th layer flow estimation network; L_oc^i is the occluded-region loss in the i-th layer flow estimation network; and L_occ^i is the occlusion compensation loss in the i-th layer flow estimation network.
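A minimal NumPy sketch of the two occlusion losses (our illustration: σ is taken as an absolute pixel difference, since the patent only specifies "a pixel-level similarity measure", and the flow warping that produces the reconstruction is assumed done):

```python
import numpy as np

def occlusion_losses(i1, i1_warp, occ):
    """Occlusion-compensation loss L_occ = L_df + L_oc, NumPy sketch.

    i1:      original first frame, (H, W)
    i1_warp: first frame reconstructed by warping frame 2 with the flow
    occ:     binary occlusion map (1 = occluded pixel)
    """
    sigma = np.abs(i1 - i1_warp)               # pixel-level dissimilarity
    l_df = (occ * sigma).sum()                 # difference contrast loss
    i1_fill = np.where(occ == 1, i1, i1_warp)  # fill occluded pixels from I1
    l_oc = np.abs(i1 - i1_fill).sum()          # loss on the filled image
    return l_df + l_oc
```

Filling occluded pixels from the original frame makes I''_1 exact in the occluded region, so the second term penalizes the remaining reconstruction error.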
Further, the loss function for overall network training in step 4 is constructed as follows. Let the features of the N image sequences of the input image x at the i-th layer be x_1, ..., x_N, and use a feature similarity function as a penalty term in the loss to promote diversity among different image sequences:

L_sim^i = (1/(N(N-1))) Σ_{n≠m} (x_n^T x_m) / (||x_n||_2 ||x_m||_2)

where N is the total number of image segments; ||·||_2 is the L2 norm; T is vector transposition; x_n, x_m are the features of different image slice sequences of the same layer; and L_sim^i is the feature similarity loss measuring similarity among the different image sequences of the i-th layer;
An i-th layer contrast loss L_ct^i is defined: features learned in shallow layers of the Transformer model are more diverse than those learned in deep layers, so the contrast loss uses shallow features to regularize deep features, reducing the similarity among different image sequences and increasing their feature diversity in the deep network:

L_ct^i = -(1/N) Σ_{n=1}^{N} log( exp((h_n^1)^T h_n^i) / Σ_{m=1}^{N} exp((h_n^1)^T h_m^i) )

where h_n^1 is the n-th image sequence feature of the first layer; h_n^i is the corresponding n-th image sequence feature of the i-th layer; T denotes transposition of the feature vector; N is the number of image sequences; and L_ct^i is the i-th layer image sequence contrast loss;
An i-th layer error loss L_er^i is defined: each deep image sequence in the Transformer model should attend only to its corresponding shallow input sequence and ignore the remaining unrelated image sequence features:

L_er^i = -(1/N) Σ_{n=1}^{N} log( exp((h_n^1)^T h_n^i) / Σ_{m=1}^{N} exp((h_m^1)^T h_n^i) )

where h_n^1 is the n-th image sequence feature of the first layer; h_n^i is the corresponding n-th image sequence feature of the i-th layer; N is the number of image sequences; and L_er^i is the i-th layer image sequence error loss;
The overall training loss is the weighted sum of the six-stage Transformer model losses combined with the occlusion compensation loss:

L_final = Σ_{i=1}^{6} ( λ_1 L_sim^i + λ_2 L_ct^i + λ_3 L_er^i + L_occ^i )

where λ_1, λ_2, λ_3 are balance factors weighting the proportions of the Transformer model's loss terms at the different pyramid scales: the higher the resolution, the larger the role of the corresponding loss in network training and the larger its weight coefficient; L_sim^i, L_ct^i, L_er^i are the i-th layer Transformer model's feature similarity, contrast, and error losses, respectively; L_occ^i is the occlusion compensation loss of the i-th layer flow estimation network; and the resulting L_final is the loss function for final overall network training.
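The final weighted sum can be sketched as follows (our illustration; the λ values are placeholders, since the patent only says that higher-resolution stages receive larger weights):

```python
def final_loss(layer_losses, lambdas=(1.0, 0.5, 0.25)):
    """Overall training loss, pure-Python sketch.

    layer_losses: per pyramid stage i, a tuple
        (L_sim_i, L_ct_i, L_er_i, L_occ_i)
    lambdas: balance factors lambda1..lambda3 (illustrative values).
    """
    l1, l2, l3 = lambdas
    return sum(l1 * sim + l2 * ct + l3 * er + occ
               for sim, ct, er, occ in layer_losses)
```

In training, the six pyramid stages each contribute one tuple, and the summed scalar is what backpropagation minimizes.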
The invention has the beneficial effects that:
the invention introduces a Transformer model into a characteristic pyramid network to enhance the characteristic extraction capability of the network, trains the whole network by designing a loss function of shielding compensation, and further obtains the finally predicted optical flow. The method can enhance the feature extraction capability of the feature pyramid layer on the image, and carry out occlusion compensation processing on occlusion pixels in the image so as to improve the precision of optical flow estimation.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a schematic diagram of the Transformer-based feature pyramid in the present invention.
FIG. 3 is a schematic diagram of an image segmentation module according to the present invention.
FIG. 4 is a block diagram of a convolutional mapping re-attention mechanism of the present invention.
Fig. 5 is an overall architecture diagram of the present invention.
FIG. 6 is a schematic diagram of the occlusion compensation of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention aims to provide an unsupervised optical flow estimation method based on a Transformer feature pyramid network, belonging to the field of computer vision. First, a Transformer-based feature pyramid network is constructed; by means of the re-attention mechanism operation of the Transformer model, the feature extraction capability of the feature pyramid network on the image is enhanced. Second, an optical flow estimation network is constructed so that the network can perform optical flow prediction. Finally, occlusion compensation processing is performed on pixels in occluded areas, and a loss function for whole-network training is designed to train the network in an unsupervised manner, obtaining a faster and more accurate unsupervised optical flow estimation model.
The invention provides an unsupervised optical flow estimation method based on a Transformer feature pyramid network, and aims to enhance the feature extraction capability of a feature pyramid layer on an image and perform occlusion compensation processing on occluded pixels in the image so as to improve the accuracy of optical flow estimation.
The purpose of the invention is realized as follows:
the method comprises the following steps: Step one: construct a Transformer-based feature pyramid network. The feature pyramid network has two identical network branches with 12 layers in total, can extract 6 image features, and the network of each branch shares its weights. A Transformer model is fused at the second, fourth, sixth, eighth, tenth and last layers of the feature pyramid to enhance the feature extraction capability of the network; the model consists of an image segmentation module and a convolution mapping re-attention mechanism module. The image segmentation module first performs a deformable convolution operation on the input image to extract local features such as edges, then segments the image into sequences that are input into the convolution mapping re-attention mechanism module. The convolution mapping re-attention mechanism module performs a re-attention mechanism operation on the image sequences to extract global features.
Step two: and constructing an optical flow estimation network. And inputting the image features extracted on the basis of each layer in the Transformer feature pyramid network into an optical flow estimation network for prediction, wherein the network consists of 5 layers of convolutional neural networks. Firstly, feature deformation is carried out on an input second frame image, then a feature matching relation between the second frame image after the feature deformation and the first frame image is calculated, namely a feature matching cost volume, then the calculated feature matching cost volume is input into the optical flow estimation network to predict the optical flow, and the output optical flow result is subjected to post-processing by utilizing a context network to obtain the accurate predicted optical flow.
Step three: occlusion compensation processing is performed on the pixels in the occluded area. The specific operation of the occlusion compensation processing in the optical flow estimation network is to first perform a difference comparison between the reconstructed image synthesized from the second frame and the first frame image, and then extract the pixels of the occluded area to perform occlusion compensation processing on them. The difference contrast processing reconstructs the first frame image by warping the second frame image with the predicted forward optical flow, and adds the difference between the reconstructed first frame image and the original first frame image to the occlusion map. The extraction of occluded pixels fills the reconstructed image with the occluded-area pixels of the original first frame image, and occlusion compensation is performed by calculating the difference between the reconstructed image and the original first frame image.
Step four: design the loss function of the whole network training. The loss function with similarity information processing is integrated into the optical flow estimation network, and the loss function for whole-network training is constructed. A function describing the feature similarity between the different image blocks input to the Transformer model is defined to promote diversity between different image blocks; a contrast loss function and an error loss function are also defined, so that the features attended to by the deep network layers come from the corresponding shallow-layer inputs and are independent of the remaining shallow-layer inputs. The loss terms corresponding to the Transformer model of each pyramid layer are weighted and summed, and combined with the occlusion compensation loss function as the overall loss function of network training to constrain the training process of the network.
Step five: two continuous frames of images are input at the input end of the network, and the network is subjected to unsupervised training by utilizing the whole network loss function.
Step six: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
The invention first introduces a Transformer model into the feature pyramid network to enhance the feature extraction capability of the network, and simultaneously trains the whole network with a designed occlusion-compensation loss function to obtain the finally predicted optical flow.
Example 1:
the method comprises the following steps: and constructing a characteristic pyramid network based on the Transformer.
As shown in fig. 2, an image I is input into the feature pyramid network, which has 12 convolution layers and comprises Transformer models in 6 stages; the Transformer model in each stage has the same architecture, so 6 feature maps of different sizes can be extracted step by step.
The numbers of channel features from the first stage to the sixth stage are 16, 32, 64, 96, 128 and 196 respectively. The first convolution layer inputs a feature map of 3 × 512 × 512, the size of the convolution kernel is 3 × 3, the step size is 1, and a feature map of 16 × 512 × 512 is output. The second convolution layer inputs the 16 × 512 × 512 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the output 16 × 256 × 256 feature map is input into the Transformer model in the first stage.
The third convolution layer inputs a 16 × 256 × 256 feature map, the convolution kernel size is 3 × 3, the step size is 1, and a 16 × 256 × 256 feature map is output. The fourth convolution layer inputs a 16 × 256 × 256 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the 32 × 128 × 128 feature map output is input to the Transformer model in the second stage.
The fifth convolution layer inputs a feature map of 32 × 128 × 128, the convolution kernel size is 3 × 3, the step size is 1, and a feature map of 32 × 128 × 128 is output. The sixth convolution layer inputs the 32 × 128 × 128 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the output 64 × 64 × 64 feature map is input to the Transformer model of the third stage.
The seventh convolution layer inputs a feature map of 64 × 64 × 64, the convolution kernel size is 3 × 3, the step size is 1, and a feature map of 64 × 64 × 64 is output. The eighth convolution layer inputs a feature map of 64 × 64 × 64, the convolution kernel size is 3 × 3, the step size is 2, and the output 96 × 32 × 32 feature map is input to the Transformer model in the fourth stage.
The ninth convolution layer inputs a 96 × 32 × 32 feature map, the convolution kernel size is 3 × 3, the step size is 1, and a 96 × 32 × 32 feature map is output. The tenth convolution layer inputs a 96 × 32 × 32 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the 128 × 16 × 16 feature map output is input to the Transformer model in the fifth stage.
The eleventh convolution layer inputs a 128 × 16 × 16 feature map, the convolution kernel size is 3 × 3, the step size is 1, and a 128 × 16 × 16 feature map is output. The twelfth convolution layer inputs a 128 × 16 × 16 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the 196 × 8 × 8 feature map output is input to the Transformer model of the sixth stage.
And the feature graph output by the convolution layer enters a Transformer model and then is output to the next convolution layer of the feature pyramid so as to enhance the feature extraction capability of the network on the image.
The Transformer model comprises two parts: an image segmentation module and a convolution mapping re-attention mechanism module. The standard Transformer module was designed for machine language translation and takes a 1D language sequence as input; to process a 2D image, the image segmentation module first segments the input image, as shown in fig. 3. The module performs a deformable convolution operation on the input image, so that the number of image sequences input to the convolution mapping re-attention mechanism module is gradually reduced while the width of each input sequence is expanded; in this way, as the number of convolution layers deepens, the feature resolution gradually decreases and the feature size increases. The deformable convolution has four layers in total, each composed of a standard convolution layer and an offset layer. The convolution kernel size of the first standard convolution layer is 7 × 7 with step size 2, the convolution kernel sizes of the second and third layers are 3 × 3 with step size 1, and the convolution kernel size of the fourth layer is 7 × 7 with step size 2. The offset layer of each layer learns from the feature map output by the preceding standard convolution layer to obtain the deformation offsets of the deformable convolution.
Let the output image of the ith pyramid layer of each branch of the Transformer-based feature pyramid network be xi, with feature size Ci × Hi × Wi, wherein Hi × Wi represents the resolution of the ith-layer image and Ci represents the number of channels of the ith-layer image. The output image xi of this pyramid layer is input into the image segmentation module, and edge features are extracted through the deformable convolution operation to obtain an output image x′i; the image after the deformable convolution operation is then subjected to batch normalization and maximum pooling to obtain an output image x″i, defined as follows:
x″i=MaxPool(BN(Deconv(xi)))
where Deconv(·) represents the deformable convolution operation, BN(·) represents batch normalization, MaxPool(·) represents maximum pooling, xi represents the output image of the ith pyramid layer, and x″i represents the output image after the maximum pooling operation. After maximum pooling, the image x″i is sliced into 2D image blocks; N denotes the number of image sequences after segmentation of the output image x″i, which is also the effective sequence length input to the convolution mapping re-attention mechanism module, and (P, P) represents the resolution of each image block after segmentation.
The sliced image blocks are flattened into N image sequences, each forming a row of a two-dimensional matrix (N, D) with D = P²C. The converted image sequences are marked with a binary mask: a mark of 1 denotes an image sequence at an important information position, and a mark of 0 denotes an image sequence containing similar information; the marks are assigned adaptively according to the differences in information contained in the image sequences. The marked image sequences are input into the convolution mapping re-attention mechanism module, where the sequences marked 1 participate in interactive calculation with each other, and a sequence marked 0 is calculated only with itself.
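As a concrete sketch of the segmentation-and-flattening step above (the block layout is an assumption consistent with the stated dimensions, not the patent's exact implementation), a Ci × Hi × Wi feature map can be cut into non-overlapping P × P blocks and flattened into an (N, D) matrix with N = HiWi/P² and D = P²Ci:

```python
import numpy as np

def split_into_sequences(feat: np.ndarray, P: int) -> np.ndarray:
    """Cut a (C, H, W) feature map into non-overlapping P x P blocks and
    flatten each block into one row of an (N, D) matrix,
    N = H*W / P**2, D = P*P*C."""
    C, H, W = feat.shape
    assert H % P == 0 and W % P == 0, "feature map must tile evenly"
    # (C, H//P, P, W//P, P) -> (H//P, W//P, C, P, P) -> (N, D)
    blocks = feat.reshape(C, H // P, P, W // P, P)
    blocks = blocks.transpose(1, 3, 0, 2, 4)
    return blocks.reshape((H // P) * (W // P), C * P * P)

feat = np.arange(16 * 8 * 8, dtype=np.float32).reshape(16, 8, 8)
seq = split_into_sequences(feat, P=4)
print(seq.shape)  # (4, 256): N = 8*8/4**2 = 4 sequences, D = 4*4*16 = 256
```

Each row of `seq` is one image sequence ready for the convolution mapping re-attention mechanism module.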
The convolution mapping re-attention mechanism module comprises a multi-head re-attention mechanism operation part and a multi-head linear processing part. First, a multi-head re-attention mechanism operation is performed on the input image sequence, as shown in fig. 4. The module first linearly maps the ith-layer image sequence and spatially recomposes it into a 2D image, obtains three projected features through three depth-separable convolution operations, and flattens the three projected features to obtain three different two-dimensional vectors q, k, v of the ith layer, each defined as follows:
where Reshape2d(·) indicates spatial recomposition of the image sequence, s represents the size of the convolution kernel, Conv2d(·) indicates the depth-separable convolution operation, and Flatten(·) indicates flattening the mapping of the ith-layer image sequence after the depth-separable convolution operation into a two-dimensional vector. In the multi-head re-attention mechanism the two-dimensional vectors mapped from each input image sequence can be expressed in matrix form as q = [q1, q2, ..., qN], k = [k1, k2, ..., kN], v = [v1, v2, ..., vN]. As the model deepens, the attention maps of the same image sequence in different layers become increasingly similar, while the similarity between the heads of the same layer remains small; a transformation matrix is therefore adopted to combine information between the different heads, and the attention operation is performed again using this matrix. The transformation matrices trained at different layers differ, which avoids attention collapse. The multi-head re-attention mechanism operation is defined as follows:
wherein q, k, v respectively represent the matrices obtained by convolution mapping of each input image sequence; T represents matrix transposition; q·kT denotes the correlation of two positions; softmax(·) denotes a normalization operation on q·kT; d represents the dimension of q and k; Θ represents the learnable transformation matrix applied across the attention heads; the input is the image slice sequence flattened by convolution mapping after spatial recomposition, and the output is the multi-head re-attention operation on that sequence. The layer following the multi-head re-attention operation is a batch normalization layer: the input and output of the multi-head re-attention mechanism operation are added and then batch-normalized, so that the activation values of each layer are normalized, specifically defined as follows:
wherein MSA(·) represents the multi-head re-attention mechanism operation and BN(·) represents batch normalization; the input is the image slice sequence flattened by convolution mapping after spatial recomposition, and the output is the image sequence after batch normalization. The multi-head linear processing operation comprises a feed-forward network and batch normalization: the feed-forward network expands the dimension of each image sequence, the activation function used with batch normalization is the GELU nonlinear activation, and a residual connection follows each operation to prevent network degradation, specifically defined as follows:
wherein FFN(·) represents the feed-forward network, BN(·) represents batch normalization, and GELU(·) represents the Gaussian error linear unit activation function; the input is the image sequence output after batch normalization of the multi-head re-attention operation, and the output is the image after multi-head linear processing. The image sequence output by this layer is linearly mapped, reduced in dimension, and spatially recomposed into a 2D image, which is output to the next feature pyramid layer; this processing is repeated N times.
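The multi-head re-attention operation described above can be sketched as follows. The placement of the transformation matrix Θ (here mixing the per-head attention maps softmax(q·kᵀ/√d) across the head dimension before they are applied to v) follows the description, but the exact details and the function name `re_attention` are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(q, q_k, v, theta):
    """Multi-head re-attention sketch.
    q, q_k, v: (heads, N, d) projected sequences; theta: (heads, heads)
    learnable transformation matrix mixing the per-head attention maps,
    intended to avoid attention collapse in deep layers."""
    h, N, d = q.shape
    attn = softmax(q @ q_k.transpose(0, 2, 1) / np.sqrt(d))  # (h, N, N)
    attn = np.einsum('gh,hnm->gnm', theta, attn)             # mix heads
    return attn @ v                                          # (h, N, d)

rng = np.random.default_rng(0)
h, N, d = 4, 6, 8
out = re_attention(rng.standard_normal((h, N, d)),
                   rng.standard_normal((h, N, d)),
                   rng.standard_normal((h, N, d)),
                   np.eye(h))  # identity theta reduces to plain attention
print(out.shape)  # (4, 6, 8)
```

With an identity Θ the operation degenerates to ordinary multi-head attention; a learned non-diagonal Θ regenerates distinct attention maps per head.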
Step two: and constructing an optical flow estimation network.
As shown in FIG. 5, given 2 input images I1 and I2, the feature pyramid has two identical weight-sharing branches. Feature warping is performed on the second image I2: the estimated optical flow output from the (i-1)th pyramid layer is upsampled by a factor of 2, and the features of the second image are warped toward the first image by bilinear interpolation. The feature warping operation is defined as follows:
where x represents the pixel position, the first term is the optical flow result obtained from the (i-1)th-layer optical flow estimation network upsampled by a factor of 2, the second is the ith-layer image feature of the second image of the pyramid, and the output is the image feature after feature warping. The correlation between the warped second-image feature and the first-image feature is then computed, i.e. the feature matching cost volume of each layer, whose calculation is defined as follows:
wherein x1, x2 represent pixel positions in the first and second images respectively; the first feature map is that of the first image on the ith pyramid layer; the second is the image feature of the warped second image on the ith pyramid layer; M represents the length of the feature vector; T represents matrix transposition; and the finally calculated CVi(x1, x2) is the feature matching cost volume result of the ith layer of the feature pyramid.
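A minimal sketch of the cost-volume computation, assuming a PWC-Net-style local correlation: for each position of the first feature map, dot products with the (already warped) second-image features at displacements within a small search range are taken and normalized by the feature length M (here M equals the channel count; the limited-search-range form is an assumption):

```python
import numpy as np

def cost_volume(f1, f2w, max_disp=1):
    """Local correlation cost volume.
    f1:  (C, H, W) features of the first image.
    f2w: (C, H, W) warped features of the second image.
    Returns ((2*max_disp+1)**2, H, W) of normalized dot products."""
    C, H, W = f1.shape
    M = C
    D = 2 * max_disp + 1
    cv = np.zeros((D * D, H, W), dtype=f1.dtype)
    pad = np.pad(f2w, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    idx = 0
    for dy in range(D):
        for dx in range(D):
            shifted = pad[:, dy:dy + H, dx:dx + W]
            cv[idx] = (f1 * shifted).sum(axis=0) / M  # per-pixel correlation
            idx += 1
    return cv

f1 = np.ones((4, 5, 5), dtype=np.float32)
cv = cost_volume(f1, f1.copy(), max_disp=1)
print(cv.shape)     # (9, 5, 5)
print(cv[4, 2, 2])  # zero-displacement entry: 4 * 1 / 4 = 1.0
```

The resulting volume, together with the first-image features and the upsampled flow, would then feed the ith-layer optical flow estimation network.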
The obtained feature matching cost volume result, the features of the first image and the upsampled higher-resolution optical flow are input into the ith-layer optical flow estimation network to obtain the ith-layer optical flow estimation result. The optical flow estimation layer uses five convolution layers with 128, 96, 64 and 32 channels respectively. The output optical flow result is post-processed by a context network (in place of traditional post-processing such as median or bilateral filtering). The context network is a feed-forward convolutional neural network based on a dilated convolution design, composed of 7 convolution layers; the convolution kernel of each layer is 3 × 3, and the layers have different dilation coefficients. A dilation coefficient k means that a filter input unit in that layer is separated from the other input units of the filter by k units in the vertical and horizontal directions. From top to bottom, the dilation coefficients of the convolution layers are 1, 2, 4, 8, 16, 1 and 1 in sequence; convolution layers with large dilation coefficients effectively enlarge the receptive field of each output unit on the pyramid so as to output a refined optical flow.
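The effect of the dilation coefficients can be checked with a small receptive-field computation: for a stack of stride-1 layers with kernel size k, each layer with dilation d adds (k−1)·d pixels to the receptive field, so the 7-layer stack above with 3 × 3 kernels and dilations 1, 2, 4, 8, 16, 1, 1 reaches a 67 × 67 receptive field:

```python
def receptive_field(dilations, k=3):
    """Receptive field of a stack of stride-1 conv layers with kernel
    size k and the given per-layer dilation coefficients: each layer
    adds (k - 1) * d pixels."""
    rf = 1
    for d in dilations:
        rf += (k - 1) * d
    return rf

print(receptive_field([1, 2, 4, 8, 16, 1, 1]))  # 67
```

This illustrates why the dilated design enlarges the receptive field far faster than seven plain 3 × 3 layers (which would reach only 15 × 15).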
Step three: and carrying out shielding compensation processing on the pixels in the shielding area.
As shown in fig. 6, the specific operation of the occlusion compensation processing in the optical flow estimation network is to first perform a difference comparison between the predicted optical flow image and the original image, and then extract the pixels of the occluded area to perform occlusion compensation processing on them. The contrast processing warps the second frame image I2 with the predicted forward optical flow, thereby synthesizing a first-frame reconstructed image I′1; the difference between the reconstructed image I′1 and the original image I1 is added to the occlusion map, defined as follows:
where x denotes the pixel position, Ofw denotes the occlusion map, I1(x) and I′1(x) respectively represent the original first frame image and the image reconstructed from the second frame, σ(·) represents a pixel-level similarity measure used to calculate the similarity between the original image I1(x) and the reconstructed image I′1(x), and the finally calculated quantity is the difference contrast loss in the ith-layer optical flow estimation network. The extraction of the pixels of the occluded region fills the image I′1 with the corresponding occluded-region pixels of the original image I1, obtaining a reconstructed image I″1; the loss in the occluded region is calculated from the difference between I″1 and I1, defined as follows:
wherein x represents the pixel position, I″1(x) represents the reconstructed image obtained by adding to the reconstructed first image I′1(x) the pixels corresponding to the occluded pixels, σ(·) represents the pixel-level similarity measure, and the finally calculated quantity is the loss in the occluded region in the ith-layer optical flow estimation network. The occlusion compensation loss function in the ith-layer optical flow estimation network is obtained by summing the two loss functions, defined as follows:
wherein the first term represents the difference contrast loss in the ith-layer optical flow estimation network, the second term represents the loss in occluded regions in the ith-layer optical flow estimation network, and the sum represents the occlusion compensation loss function in the ith-layer optical flow estimation network.
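A minimal numeric sketch of the two occlusion-compensation terms, assuming σ is a per-pixel absolute difference (the text only specifies a pixel-level similarity measure) and assuming a precomputed boolean occlusion map:

```python
import numpy as np

def occlusion_compensation_loss(I1, I1_rec, occ):
    """Sum of the two terms described above.
    I1:     (H, W) original first frame.
    I1_rec: (H, W) first frame reconstructed by warping the second frame.
    occ:    (H, W) boolean occlusion map, True = occluded pixel."""
    diff = np.abs(I1 - I1_rec)
    # difference-contrast term, evaluated on visible (non-occluded) pixels
    photo_loss = diff[~occ].mean() if (~occ).any() else 0.0
    # fill occluded pixels of the reconstruction from the original frame
    I1_fill = np.where(occ, I1, I1_rec)
    occ_loss = np.abs(I1 - I1_fill).mean()
    return photo_loss + occ_loss

I1 = np.array([[1., 2.], [3., 4.]])
I1_rec = np.array([[1., 2.], [0., 4.]])  # invalid warped value where occluded
occ = np.array([[False, False], [True, False]])
print(occlusion_compensation_loss(I1, I1_rec, occ))  # 0.0
```

Filling the occluded pixels from the original frame prevents invalid warped values from dominating the training signal inside occluded regions.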
Step four: designing loss functions for whole network training
The Transformer-model loss function with similarity information processing is incorporated into the optical flow estimation network. As the similarity between different input image sequences increases as the Transformer model deepens, a loss function describing the feature similarity between the different image sequences of the ith layer is defined. Let the features of the input image x and its corresponding N image sequences of the ith layer be given; the feature similarity function serves as a penalty term in the loss function to promote diversity between different image sequences, and is defined as follows:
where N represents the total number of image segments, ‖·‖2 represents the L2 norm, T denotes the transpose of a vector, xn, xm respectively represent the features of different image slice sequences of the same layer, and the result is the feature similarity loss function describing the similarity between different image sequences of the ith layer.
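Consistent with the L2-norm and transpose terms above, the feature-similarity penalty can be sketched as the average pairwise cosine similarity between the N image-sequence features of one layer; the exact formula is an image in the original, so this form is an assumption:

```python
import numpy as np

def feature_similarity_loss(x):
    """Average pairwise cosine similarity between the rows of x.
    x: (N, D) image-sequence features of one layer. Minimizing this
    penalty pushes different sequences apart (more diverse features)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # L2-normalize rows
    sim = x @ x.T                                     # cosine similarities
    N = x.shape[0]
    return sim[~np.eye(N, dtype=bool)].mean()         # off-diagonal mean

print(feature_similarity_loss(np.array([[1., 0.], [0., 1.]])))  # 0.0, diverse
print(feature_similarity_loss(np.array([[1., 0.], [2., 0.]])))  # 1.0, collapsed
```

Orthogonal sequence features give the minimum penalty; identical directions give the maximum.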
Defining an ith layer contrast loss function: the features learned in the shallow layers of the Transformer model are more diverse than those learned in the deep layers, so the contrast loss function uses the features learned in the shallow layers to regularize the deep features, reducing the similarity between different image sequences and increasing the feature diversity of the image sequences in the deep network; it is defined as follows:
wherein t1(n) represents the nth image sequence feature of the first layer, ti(n) represents the corresponding nth image sequence feature of the ith layer, T indicates transposition of the feature vector, N indicates the number of image sequences, and the result represents the image sequence contrast loss function of the ith layer.
Defining an ith layer error loss function: each deep-layer image sequence in the Transformer model should attend only to the corresponding image sequence input from the shallow layer, while ignoring the remaining irrelevant image sequence features; it is defined as follows:
wherein t1(n) represents the nth image sequence feature of the first layer, ti(n) represents the corresponding nth image sequence feature of the ith layer, N indicates the number of image sequences, and the result represents the error loss function of the image sequence of the ith layer.
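A hedged sketch of the contrast and error losses: the original formulas are images, so only their stated intent survives (a deep sequence should match its own shallow counterpart and ignore the others). An InfoNCE-style contrast loss and a squared-error alignment loss are assumed here as one plausible realization:

```python
import numpy as np

def contrast_loss(t1, ti):
    """InfoNCE-style contrast: each deep feature ti[n] should be most
    similar to its own shallow counterpart t1[n] among all N sequences.
    t1, ti: (N, D) shallow- and deep-layer sequence features."""
    logits = t1 @ ti.T                                  # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                  # pull matching pairs

def error_loss(t1, ti):
    """Alignment term: deep features should track their shallow inputs."""
    return np.mean((ti - t1) ** 2)

rng = np.random.default_rng(1)
t1 = rng.standard_normal((5, 8))
print(error_loss(t1, t1))            # 0.0 when deep == shallow
print(contrast_loss(t1, t1) >= 0.0)  # True: negative log-softmax is >= 0
```

Both terms vanish or reach their minimum when each deep sequence aligns with its own shallow input, matching the stated goal.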
The overall network training loss function is defined as the weighted sum of the loss functions of the six-stage Transformer model and the photometric loss function, with the formula as follows:
wherein λ1, λ2, λ3 are constraint balance factors representing the proportions of the Transformer-model loss functions at different pyramid scales; the higher the resolution, the larger the role of the corresponding loss function in network training and the higher its weight coefficient; the three loss terms of the ith layer are respectively the feature similarity loss function, the contrast loss function and the error loss function of the ith-layer Transformer model; the remaining term represents the occlusion compensation loss function in the ith-layer optical flow estimation network; and the finally calculated Lfinal serves as the loss function of the final network training.
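A minimal sketch of how the final objective could combine the per-stage Transformer losses with the balance factors and the occlusion-compensation losses; the per-stage weights shown (higher at higher resolution, as the text requires) are illustrative values, not values from the patent:

```python
def final_loss(stage_losses, lambdas, occ_losses):
    """Weighted sum of per-stage Transformer losses plus the
    occlusion-compensation (photometric) losses.
    stage_losses: list of (sim, con, err) tuples, one per pyramid stage.
    lambdas:      per-stage balance factors.
    occ_losses:   per-stage occlusion compensation losses."""
    total = 0.0
    for (sim, con, err), lam in zip(stage_losses, lambdas):
        total += lam * (sim + con + err)
    return total + sum(occ_losses)

stages = [(0.1, 0.2, 0.3)] * 6
lambdas = [1.0, 0.8, 0.6, 0.4, 0.2, 0.1]  # larger weight, higher resolution
print(final_loss(stages, lambdas, [0.05] * 6))  # approx. 2.16
```

The returned scalar would be Lfinal, the quantity minimized during unsupervised training.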
Step five: two continuous frames of images are input at the input end of the network, and the network is subjected to unsupervised training by utilizing the whole network loss function.
Step six: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (8)
1. An unsupervised optical flow estimation method based on a Transformer feature pyramid network is characterized by comprising the following steps of:
step 1: constructing a characteristic pyramid network based on a Transformer;
the Transformer-based feature pyramid network has two identical network branches with 12 layers in total, can extract 6 image features, and the network of each branch shares its weights; a Transformer model is fused at the second, fourth, sixth, eighth, tenth and last layers of the Transformer-based feature pyramid network to enhance the feature extraction capability of the network, the model consisting of an image segmentation module and a convolution mapping re-attention mechanism module; the image segmentation module first performs a deformable convolution operation on the input image to extract local features such as edges, then segments the image into sequences that are input into the convolution mapping re-attention mechanism module; the convolution mapping re-attention mechanism module performs a re-attention mechanism operation on the image sequences to extract global features;
step 2: constructing an optical flow estimation network; inputting image features extracted from each layer in the Transformer feature pyramid network into an optical flow estimation network for prediction;
the optical flow estimation network is composed of 5 layers of convolutional neural networks, firstly, feature deformation is carried out on an input second frame image, then, a feature matching relation between the second frame image after the feature deformation and a first frame image is calculated, namely, a feature matching cost volume is calculated, then, the calculated feature matching cost volume is input into the optical flow estimation network to predict optical flow, and the output optical flow result is subjected to post-processing by utilizing a context network to obtain accurate predicted optical flow;
and step 3: carrying out shielding compensation processing on pixels in a shielding area;
the specific operation of the occlusion compensation processing in the optical flow estimation network is that firstly, difference comparison is carried out on a second frame reconstructed image and a first frame image, and then pixels of an occlusion area are extracted to carry out occlusion compensation processing on the pixels;
the difference contrast processing is to utilize the predicted forward optical flow to distort the second frame image so as to reconstruct the first frame image, and the difference of the reconstructed first frame image and the original first frame image is added into the occlusion image; extracting pixels of the occlusion region is to fill a reconstructed image by extracting pixels of the occlusion region in the original first frame image, and perform occlusion compensation by calculating the difference between the reconstructed image and the original first frame image;
and 4, step 4: designing a loss function of the whole network training; combining the loss function with similarity information processing into an optical flow estimation network, and constructing a loss function of whole network training; defining a function for describing the feature similarity between different image blocks of an input transform model to promote the diversity between different image blocks, and defining a contrast loss function and an error loss function to enable the concerned features of the deep network layer to come from the corresponding input of the shallow layer and be unrelated to the rest shallow layer input; weighting and summing loss items corresponding to the transform model of each pyramid layer, and combining an occlusion compensation loss function as an overall loss function of network training to constrain the training process of the network;
and 5: inputting two continuous frames of images at the input end of the network, and carrying out unsupervised training on the network by using an integral network loss function;
step 6: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
2. The method of claim 1, wherein: the Transformer-based feature pyramid network in step 1 has 12 convolution layers and comprises Transformer models in 6 stages; the Transformer model in each stage has the same architecture, and 6 feature maps of different sizes can be extracted step by step; the numbers of channel features from the first stage to the sixth stage are 16, 32, 64, 96, 128 and 196, respectively;
the first convolutional layer takes a 3 × 512 × 512 feature map as input, with a 3 × 3 convolution kernel and a stride of 1, and outputs a 16 × 512 × 512 feature map; the second convolutional layer takes the 16 × 512 × 512 feature map, with a 3 × 3 kernel and a stride of 2, and outputs a 16 × 256 × 256 feature map that is input to the first-stage Transformer model;
the third convolutional layer takes the 16 × 256 × 256 feature map, with a 3 × 3 kernel and a stride of 1, and outputs a 16 × 256 × 256 feature map; the fourth convolutional layer, with a 3 × 3 kernel and a stride of 2, outputs a 32 × 128 × 128 feature map that is input to the second-stage Transformer model;
the fifth convolutional layer takes the 32 × 128 × 128 feature map, with a 3 × 3 kernel and a stride of 1, and outputs a 32 × 128 × 128 feature map; the sixth convolutional layer, with a 3 × 3 kernel and a stride of 2, outputs a 64 × 64 × 64 feature map that is input to the third-stage Transformer model;
the seventh convolutional layer takes the 64 × 64 × 64 feature map, with a 3 × 3 kernel and a stride of 1, and outputs a 64 × 64 × 64 feature map; the eighth convolutional layer, with a 3 × 3 kernel and a stride of 2, outputs a 96 × 32 × 32 feature map that is input to the fourth-stage Transformer model;
the ninth convolutional layer takes the 96 × 32 × 32 feature map, with a 3 × 3 kernel and a stride of 1, and outputs a 96 × 32 × 32 feature map; the tenth convolutional layer, with a 3 × 3 kernel and a stride of 2, outputs a 128 × 16 × 16 feature map that is input to the fifth-stage Transformer model;
the eleventh convolutional layer takes the 128 × 16 × 16 feature map, with a 3 × 3 kernel and a stride of 1, and outputs a 128 × 16 × 16 feature map; the twelfth convolutional layer, with a 3 × 3 kernel and a stride of 2, outputs a 196 × 8 × 8 feature map that is input to the sixth-stage Transformer model;
the feature map output by each convolutional layer enters the corresponding Transformer model and is then passed to the next convolutional layer of the feature pyramid, so as to enhance the network's ability to extract image features.
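The layer-by-layer shape progression described above can be traced with a short, shape-only sketch (the helper name `pyramid_shapes` is illustrative and not part of the claimed method; only the channel counts and strides come from the claim):

```python
def pyramid_shapes(c=3, h=512, w=512):
    """Trace (channels, height, width) through the 12 convolutional layers.

    Even-positioned layers (1st, 3rd, ...) use stride 1; odd-positioned
    layers (2nd, 4th, ...) use stride 2 and halve the resolution.
    """
    out_channels = [16, 16, 16, 32, 32, 64, 64, 96, 96, 128, 128, 196]
    shapes = []
    for i, oc in enumerate(out_channels):
        if i % 2 == 1:           # stride-2 layer: downsample by 2
            h, w = h // 2, w // 2
        shapes.append((oc, h, w))
    return shapes

# The six stage inputs are the outputs of the stride-2 layers:
# (16,256,256), (32,128,128), (64,64,64), (96,32,32), (128,16,16), (196,8,8)
```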
3. The method of claim 1, wherein: in step 1, the image segmentation module of the Transformer feature pyramid network applies a deformable convolution operation to the input image, so that the number of image sequences input to the convolution-mapping re-attention mechanism module is gradually reduced while the width (feature dimension) of each input sequence is expanded; in this way, as the convolutional layers deepen, the feature resolution gradually decreases and the feature dimension increases; the specific steps are as follows:
step 1.1.1: the image x_i output by the i-th pyramid layer is input into the image segmentation module, edge features are extracted through a deformable convolution operation to obtain an output image x'_i, and batch normalization and max pooling are applied to the deformably convolved image to obtain an output image x''_i:

x''_i = MaxPool(BN(Deconv(x_i)))

wherein x_i is the output image of the i-th layer of the Transformer feature pyramid network, with feature size C_i × H_i × W_i, where H_i × W_i denotes the resolution of the i-th layer image and C_i denotes the number of channels of the i-th layer image; Deconv(·) denotes the deformable convolution operation; BN(·) denotes batch normalization; MaxPool(·) denotes max pooling; x''_i denotes the output image after the max pooling operation;
step 1.1.2: the image x''_i obtained after max pooling is sliced into 2D image blocks of size P × P × C_i, where N denotes the number of image sequences obtained by slicing the output image x''_i, which is also the effective input length of the convolution-mapping re-attention mechanism module, and (P, P) denotes the resolution of each image block after segmentation;
step 1.1.3: the sliced image blocks are flattened into N image sequences, forming a two-dimensional matrix (N, D) with D = P²C;
Step 1.1.4: the converted image sequence is marked through a binary mask, wherein the position 1 represents the image sequence with the important information position, the position 0 represents the image sequence with the similar information, the image sequence is selected to be marked in a self-adaptive mode according to the difference of information contained in the image sequence, the marked image sequence is input into a convolution mapping re-attention mechanism module, the image sequences marked with 1 are subjected to interactive calculation, and the image sequence marked with 0 is only calculated with the image sequence.
4. The method of claim 3, wherein: the convolution-mapping re-attention mechanism module of the Transformer feature pyramid network in step 1 comprises two parts, a multi-head re-attention mechanism operation and a multi-head linear processing operation; the specific steps are as follows:
step 1.2.1: the input i-th layer image sequence z_i is linearly mapped and spatially recombined into a 2D image; three projected features are obtained through three depthwise separable convolution operations and flattened to obtain three different two-dimensional vectors q_i, k_i, v_i of the i-th layer, each defined as follows:

q_i / k_i / v_i = Flatten(Conv2d(Reshape2d(z_i), s))

wherein Reshape2d(·) denotes spatially recombining the image sequence; s denotes the size of the convolution kernel; Conv2d(·) denotes a depthwise separable convolution operation; Flatten(·) denotes flattening the mapped features into a two-dimensional vector; z_i denotes the flattened image sequence of the i-th layer; q_i, k_i, v_i denote the two-dimensional vectors obtained by flattening the i-th layer features after the depthwise separable convolution operation;
step 1.2.2: the two-dimensional vectors mapped from each input image sequence can be expressed in matrix form in the multi-head re-attention mechanism as q = [q_1, q_2, ..., q_N], k = [k_1, k_2, ..., k_N], v = [v_1, v_2, ..., v_N]; as the model depth increases, the attention similarity of the same image sequence across different layers grows larger, while the similarity between the multiple heads of the same layer remains small; a transfer matrix Θ is therefore adopted to combine the information among different heads, and the attention operation is performed again with this combined information; the transfer matrices trained at different layers differ, which avoids attention collapse; the multi-head re-attention mechanism operation is defined as follows:

MSA(z_i) = BN(Θ^T · Softmax(q k^T / √d)) v

wherein q, k, v respectively denote the matrices obtained by convolution mapping of each input image sequence; T denotes matrix transposition; q k^T represents the correlation of two positions; Softmax(·) denotes the normalization of q k^T; d denotes the dimension of q and k; Θ denotes the transfer matrix defined over the multiple attention heads; z_i denotes the image slice sequence flattened by convolution mapping after spatial recombination; MSA(z_i) denotes the multi-head re-attention mechanism operation on the image slice sequence;
step 1.2.3: the input and output of the multi-head re-attention operation are added, and batch normalization is then applied to normalize the activation values of each layer, specifically defined as follows:

z'_i = BN(z_i + MSA(z_i))

wherein MSA(·) denotes the multi-head re-attention mechanism operation; BN(·) denotes batch normalization; z_i denotes the image slice sequence flattened by convolution mapping after spatial recombination; z'_i denotes the image sequence output after batch normalization;
step 1.2.4: the multi-head linear processing operation comprises a feed-forward network and batch normalization; the feed-forward network expands the dimension of each image sequence, its activation function is the GELU nonlinearity, and a residual connection follows each operation to prevent network degradation, specifically defined as follows:

z''_i = BN(z'_i + FFN(z'_i))

wherein FFN(·) denotes the feed-forward network; BN(·) denotes batch normalization; GELU(·) denotes the Gaussian error linear unit activation function used within the feed-forward network; z'_i denotes the image sequence output after batch normalization of the multi-head re-attention operation; z''_i denotes the output of the multi-head linear processing applied to that image sequence;
step 1.2.5: the image sequence output by the multi-head linear processing operation is reduced in dimension by linear mapping, spatially reconstructed into a 2D image, and output to the next feature pyramid layer.
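Steps 1.2.2 and 1.2.3 can be illustrated with a minimal NumPy sketch, assuming the DeepViT-style re-attention in which a learned head-transfer matrix mixes the per-head attention maps before they are applied to v (the function names, and placing normalization outside this sketch, are illustrative assumptions):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def re_attention(q, k, v, theta):
    """q, k, v: (heads, N, d); theta: (heads, heads) transfer matrix.

    theta mixes the per-head attention maps so that different heads
    exchange information, counteracting attention collapse in deep layers.
    """
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    attn = np.einsum('gh,hnm->gnm', theta, attn)  # re-attend across heads
    return attn @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 4, 8)) for _ in range(3))
out = re_attention(q, k, v, np.eye(2))  # identity theta -> plain attention
```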
5. The method of claim 1, wherein: in step 2, the input of the optical flow estimation network is two continuous frames of images and the output is the corresponding estimated optical flow; the specific steps are as follows:
step 2.1: for the input continuous frames I_1 and I_2, feature warping is applied to the second image I_2; the optical flow estimated at the (i-1)-th layer of the Transformer feature pyramid network is upsampled by a factor of 2, and the features of the second image are warped toward the first image by bilinear interpolation; the feature warping operation is defined as follows:

F^i_w(x) = F^i_2(x + up_2(f^{i-1})(x))

wherein x denotes a pixel position; up_2(f^{i-1}) denotes the 2× upsampling of the optical flow result obtained by the (i-1)-th layer optical flow estimation network; F^i_2 denotes the i-th layer image features of the second image in the pyramid; F^i_w denotes the image features after feature warping;
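The bilinear feature warping of step 2.1 can be sketched in NumPy as follows (a simplified, border-clamped version; the function name is illustrative):

```python
import numpy as np

def warp_bilinear(feat, flow):
    """Warp feat (C, H, W) by flow (2, H, W): sample feat at x + flow(x)
    with bilinear interpolation, clamping samples to the image border."""
    C, H, W = feat.shape
    gy, gx = np.mgrid[0:H, 0:W]
    x = np.clip(gx + flow[0], 0, W - 1)
    y = np.clip(gy + flow[1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return (feat[:, y0, x0] * (1 - wx) * (1 - wy)
            + feat[:, y0, x1] * wx * (1 - wy)
            + feat[:, y1, x0] * (1 - wx) * wy
            + feat[:, y1, x1] * wx * wy)

feat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
same = warp_bilinear(feat, np.zeros((2, 3, 4)))  # zero flow is the identity
```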
step 2.2: the correlation between the warped second image features and the first image features is computed, i.e. the feature matching cost volume of each layer, whose calculation is defined as follows:

CV^i(x_1, x_2) = (1/M) (F^i_1(x_1))^T F^i_w(x_2)

wherein x_1, x_2 denote pixel positions in the first and second images, respectively; F^i_1 denotes the feature map of the first image at the i-th layer of the Transformer feature pyramid network; F^i_w denotes the warped features of the second image at the i-th layer of the Transformer feature pyramid network; M denotes the length of the feature map; T denotes matrix transposition; CV^i(x_1, x_2) denotes the feature matching cost volume result at the i-th layer of the Transformer feature pyramid network;
step 2.3: the feature matching cost volume result, the features of the first image and the upsampled high-resolution optical flow are input into the i-th optical flow estimation layer of the optical flow estimation network to obtain the optical flow estimation result of this layer.
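The cost volume of step 2.2 is a channel-normalized inner product between the first image's features and the warped second image's features; a NumPy sketch over a small displacement window (the search radius r is an assumption, since the claim defines only the per-pair correlation):

```python
import numpy as np

def cost_volume(f1, f2w, r=1):
    """f1, f2w: (C, H, W).  Returns ((2r+1)^2, H, W): for each displacement
    d in a (2r+1)x(2r+1) window, CV_d(x) = f1(x)^T f2w(x+d) / C."""
    C, H, W = f1.shape
    pad = np.pad(f2w, ((0, 0), (r, r), (r, r)))  # zero-pad the warped map
    vols = [(f1 * pad[:, dy:dy + H, dx:dx + W]).sum(0) / C
            for dy in range(2 * r + 1) for dx in range(2 * r + 1)]
    return np.stack(vols)

cv = cost_volume(np.ones((4, 3, 3)), np.ones((4, 3, 3)))
```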
6. The method of claim 5, wherein: in step 2.3, the optical flow estimation layer of the optical flow estimation network uses five convolutional layers with channel numbers of 128, 96, 64 and 32, and the output optical flow result is post-processed by a context network; the context network is a feed-forward convolutional neural network based on dilated convolutions, consisting of 7 convolutional layers with 3 × 3 kernels; the layers have different dilation coefficients, where a dilation coefficient k means that an input unit of a filter in that layer is k units away, both vertically and horizontally, from the other input units of the same filter; from top to bottom, the dilation coefficients of the convolutional layers are 1, 2, 4, 8, 16, 1 and 1 in order; convolutional layers with large dilation coefficients effectively enlarge the receptive field of each output unit on the pyramid and output a refined optical flow.
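The receptive-field growth produced by the dilation coefficients 1, 2, 4, 8, 16, 1, 1 can be checked with a short calculation (each stride-1 dilated 3 × 3 convolution enlarges the receptive field by 2·d pixels; the helper name is illustrative):

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of a stack of stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d   # each layer adds (k - 1) * dilation
    return rf

# dilation coefficients of the 7-layer context network, top to bottom
rf = receptive_field([1, 2, 4, 8, 16, 1, 1])   # 67 x 67 pixels
```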
7. The method of claim 1, wherein the occlusion compensation processing of the pixels in the occlusion region in step 3 specifically comprises:
performing difference comparison between the predicted optical flow image and the original image, and then extracting the pixels of the occlusion region for occlusion compensation; the difference contrast processing warps the second frame image I_2 with the predicted forward optical flow to synthesize the reconstructed first frame image I'_1, and the difference between the reconstructed image I'_1 and the original image I_1 is added to the occlusion map, defined as follows:

L^i_diff = Σ_x (1 − O_fw(x)) · σ(I_1(x), I'_1(x))

wherein x denotes a pixel position; O_fw denotes the occlusion map; I_1(x) and I'_1(x) respectively denote the original first frame image and the reconstructed first frame image; σ(·) denotes a pixel-level similarity measure used to compute the similarity between the original image I_1(x) and the reconstructed image I'_1(x); L^i_diff denotes the difference contrast loss in the i-th layer optical flow estimation network;
the occlusion-region pixel extraction fills the image I'_1 with the corresponding occlusion-region pixels of the original image I_1 to obtain the reconstructed image I''_1, and the loss in the occlusion region is calculated from the difference between I''_1 and I_1, defined as follows:

L^i_occ = Σ_x σ(I_1(x), I''_1(x))

wherein x denotes a pixel position; I''_1(x) denotes the reconstructed pixels obtained by adding the corresponding occluded pixels to the reconstructed first image I'_1(x); σ(·) denotes the pixel-level similarity measure; L^i_occ denotes the loss in the occlusion region of the i-th layer optical flow estimation network;
the occlusion compensation loss function L^i_o of the final i-th layer optical flow estimation network is obtained by summing the two loss functions, defined as follows:

L^i_o = L^i_diff + L^i_occ

wherein L^i_diff denotes the difference contrast loss in the i-th layer optical flow estimation network; L^i_occ denotes the loss in the occlusion region of the i-th layer optical flow estimation network; L^i_o denotes the occlusion compensation loss function in the i-th layer optical flow estimation network.
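A hedged NumPy sketch of the two occlusion losses of claim 7, taking the mean absolute difference as the pixel-level measure σ and a boolean occlusion map; the exact measure and the weighting by the occlusion map are assumptions, since the claim gives the formulas only symbolically:

```python
import numpy as np

def occlusion_compensation_loss(I1, I1_rec, occ):
    """I1, I1_rec: (H, W) original / reconstructed first frame.
    occ: boolean (H, W) occlusion map O_fw (True where occluded)."""
    # difference contrast: photometric difference outside the occlusion,
    # using |.| as an assumed pixel-level measure sigma
    l_diff = np.abs((I1 - I1_rec) * ~occ).mean()
    # fill occluded pixels of the reconstruction from the original image
    I1_fill = np.where(occ, I1, I1_rec)
    # occlusion-region loss: difference between filled image and original
    l_occ = np.abs(I1 - I1_fill).mean()
    return l_diff + l_occ   # L_o = L_diff + L_occ
```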
8. The method of claim 1, wherein the loss function of the overall network training in step 4 is as follows: let the features of the input image x and its corresponding N image sequences at the i-th layer be x^i_1, ..., x^i_N, and take the feature similarity function L^i_sim as a penalty term in the loss function to promote diversity between different image sequences, defined as follows:

L^i_sim = (1 / (N(N−1))) Σ_{n≠m} (x_n^T x_m) / (‖x_n‖_2 ‖x_m‖_2)

wherein N denotes the total number of image segmentations; ‖·‖_2 denotes the L2 norm operation; T denotes vector transposition; x_n, x_m respectively denote the features of different image slice sequences of the same layer; L^i_sim denotes the feature similarity loss function measuring the similarity between different image sequences of the i-th layer;
the i-th layer contrast loss function L^i_con is defined; the features learned in the shallow layers of the Transformer model are more diverse than those learned in the deep layers, and the contrast loss function regularizes the deep features with the features learned in the shallow layers so as to reduce the similarity between different image sequences and increase the feature diversity of the image sequences in the deep network, defined as follows:

L^i_con = (1 / (N(N−1))) Σ_n Σ_{m≠n} (x^1_m)^T x^i_n

wherein x^1_n denotes the n-th image sequence feature of the first layer; x^i_n denotes the i-th layer feature corresponding to the n-th image sequence; T indicates that the feature vector is transposed; N denotes the number of image sequences; L^i_con denotes the image sequence contrast loss function of the i-th layer;
the i-th layer error loss function L^i_err is defined; each deep image sequence in the Transformer model should focus only on the input of the corresponding image sequence from the shallow layer, while ignoring the remaining irrelevant image sequence features, defined as follows:

L^i_err = (1/N) Σ_n (1 − (x^1_n)^T x^i_n)

wherein x^1_n denotes the n-th image sequence feature of the first layer; x^i_n denotes the i-th layer feature corresponding to the n-th image sequence; N denotes the number of image sequences; L^i_err denotes the error loss function of the i-th layer image sequences;
the loss function of the overall network training is obtained as the weighted sum of the loss functions of the six-stage Transformer models together with the photometric loss function, with the formula:

L_final = Σ_{i=1}^{6} (λ_1 L^i_sim + λ_2 L^i_con + λ_3 L^i_err + L^i_o)

wherein λ_1, λ_2, λ_3 denote the constraint balance factors weighting the loss functions of the Transformer model at different pyramid scales; the higher the resolution, the larger the role of the corresponding loss function in network training and the higher the weight coefficient; L^i_sim, L^i_con, L^i_err respectively denote the feature similarity loss function, the contrast loss function and the error loss function of the i-th layer Transformer model; L^i_o denotes the occlusion compensation loss function in the i-th layer optical flow estimation network; the finally calculated L_final serves as the loss function of the final overall network training.
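The three diversity losses of claim 8 can be illustrated with cosine-similarity-based forms (an assumption: the claim fixes only the use of the L2 norm and vector transposition, so the exact normalization chosen here is illustrative):

```python
import numpy as np

def _cos(a, b):
    """Cosine similarity along the last axis (broadcastable)."""
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1)
                              * np.linalg.norm(b, axis=-1))

def similarity_loss(Xi):
    """Mean pairwise similarity between the N sequence features of one layer."""
    N = Xi.shape[0]
    S = _cos(Xi[:, None, :], Xi[None, :, :])
    return (np.abs(S).sum() - N) / (N * (N - 1))    # off-diagonal mean

def contrast_loss(X1, Xi):
    """Deep feature n should be dissimilar to the other shallow features m != n."""
    N = X1.shape[0]
    S = _cos(Xi[:, None, :], X1[None, :, :])
    return (np.abs(S).sum() - np.abs(np.diag(S)).sum()) / (N * (N - 1))

def error_loss(X1, Xi):
    """Deep feature n should stay aligned with its own shallow input n."""
    return float(np.mean(1.0 - _cos(X1, Xi)))
```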
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111506127.7A CN114187331A (en) | 2021-12-10 | 2021-12-10 | Unsupervised optical flow estimation method based on Transformer feature pyramid network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114187331A true CN114187331A (en) | 2022-03-15 |
Family
ID=80543042
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111506127.7A Pending CN114187331A (en) | 2021-12-10 | 2021-12-10 | Unsupervised optical flow estimation method based on Transformer feature pyramid network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114187331A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582483A (en) * | 2020-05-14 | 2020-08-25 | 哈尔滨工程大学 | Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism |
AU2020103901A4 (en) * | 2020-12-04 | 2021-02-11 | Chongqing Normal University | Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field |
CN112465872A (en) * | 2020-12-10 | 2021-03-09 | 南昌航空大学 | Image sequence optical flow estimation method based on learnable occlusion mask and secondary deformation optimization |
CN113538527A (en) * | 2021-07-08 | 2021-10-22 | 上海工程技术大学 | Efficient lightweight optical flow estimation method |
CN113706526A (en) * | 2021-10-26 | 2021-11-26 | 北京字节跳动网络技术有限公司 | Training method and device for endoscope image feature learning model and classification model |
Non-Patent Citations (2)
Title |
---|
YANG, B (YANG, BO) [1] ; XIE, H (XIE, HUAN) [1] ; LI, HB (LI, HONGBIN) [2] ; LI, NH (LI, NUOHAN) [3] ; LIU, AC (LIU, ANCHANG) [2] : "Unsupervised Optical Flow Estimation Based on Improved Feature Pyramid", 《 NEURAL PROCESSING LETTERS》, no. 52, 14 August 2020 (2020-08-14), pages 1601 - 1612, XP037257628, DOI: 10.1007/s11063-020-10328-2 * |
刘香凝;赵洋;王荣刚;: "基于自注意力机制的多阶段无监督单目深度估计网络", 信号处理, no. 09 * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114677412A (en) * | 2022-03-18 | 2022-06-28 | 苏州大学 | Method, device and equipment for estimating optical flow |
CN115018888A (en) * | 2022-07-04 | 2022-09-06 | 东南大学 | Optical flow unsupervised estimation method based on Transformer |
CN115018888B (en) * | 2022-07-04 | 2024-08-06 | 东南大学 | Optical flow unsupervised estimation method based on transducer |
CN115761594A (en) * | 2022-11-28 | 2023-03-07 | 南昌航空大学 | Optical flow calculation method based on global and local coupling |
CN115761594B (en) * | 2022-11-28 | 2023-07-21 | 南昌航空大学 | Optical flow calculation method based on global and local coupling |
CN115719368B (en) * | 2022-11-29 | 2024-05-17 | 上海船舶运输科学研究所有限公司 | Multi-target ship tracking method and system |
CN115719368A (en) * | 2022-11-29 | 2023-02-28 | 上海船舶运输科学研究所有限公司 | Multi-target ship tracking method and system |
WO2024174804A1 (en) * | 2023-02-21 | 2024-08-29 | 浙江阿里巴巴机器人有限公司 | Service providing method, device, and storage medium |
CN115880567A (en) * | 2023-03-03 | 2023-03-31 | 深圳精智达技术股份有限公司 | Self-attention calculation method and device, electronic equipment and storage medium |
CN116740414A (en) * | 2023-05-15 | 2023-09-12 | 中国科学院自动化研究所 | Image recognition method, device, electronic equipment and storage medium |
CN116740414B (en) * | 2023-05-15 | 2024-03-01 | 中国科学院自动化研究所 | Image recognition method, device, electronic equipment and storage medium |
CN116739919B (en) * | 2023-05-22 | 2024-08-02 | 武汉大学 | Method and system for detecting and repairing solar flicker in optical ocean image of unmanned plane |
CN116739919A (en) * | 2023-05-22 | 2023-09-12 | 武汉大学 | Method and system for detecting and repairing solar flicker in optical ocean image of unmanned plane |
CN116630324B (en) * | 2023-07-25 | 2023-10-13 | 吉林大学 | Method for automatically evaluating adenoid hypertrophy by MRI (magnetic resonance imaging) image based on deep learning |
CN116630324A (en) * | 2023-07-25 | 2023-08-22 | 吉林大学 | Method for automatically evaluating adenoid hypertrophy by MRI (magnetic resonance imaging) image based on deep learning |
CN117877099A (en) * | 2024-03-11 | 2024-04-12 | 南京信息工程大学 | Non-supervision contrast remote physiological measurement method based on space-time characteristic enhancement |
CN117877099B (en) * | 2024-03-11 | 2024-05-14 | 南京信息工程大学 | Non-supervision contrast remote physiological measurement method based on space-time characteristic enhancement |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114187331A (en) | Unsupervised optical flow estimation method based on Transformer feature pyramid network | |
Tu et al. | Maxim: Multi-axis mlp for image processing | |
CN113361560B (en) | Semantic-based multi-pose virtual fitting method | |
CN113743269B (en) | Method for recognizing human body gesture of video in lightweight manner | |
CN112131959B (en) | 2D human body posture estimation method based on multi-scale feature reinforcement | |
CN109756690A (en) | Lightweight view interpolation method based on feature rank light stream | |
CN112837224A (en) | Super-resolution image reconstruction method based on convolutional neural network | |
CN112232134B (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
CN113554032B (en) | Remote sensing image segmentation method based on multi-path parallel network of high perception | |
CN113870335A (en) | Monocular depth estimation method based on multi-scale feature fusion | |
CN115018888B (en) | Optical flow unsupervised estimation method based on transducer | |
CN113792641A (en) | High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism | |
CN111696038A (en) | Image super-resolution method, device, equipment and computer-readable storage medium | |
CN114724155A (en) | Scene text detection method, system and equipment based on deep convolutional neural network | |
Liu et al. | An efficient residual learning neural network for hyperspectral image superresolution | |
CN116612288B (en) | Multi-scale lightweight real-time semantic segmentation method and system | |
Li et al. | Model-informed Multi-stage Unsupervised Network for Hyperspectral Image Super-resolution | |
CN118134952B (en) | Medical image segmentation method based on feature interaction | |
CN115641285A (en) | Binocular vision stereo matching method based on dense multi-scale information fusion | |
CN116071748A (en) | Unsupervised video target segmentation method based on frequency domain global filtering | |
Chen et al. | PDWN: Pyramid deformable warping network for video interpolation | |
CN109934283A (en) | A kind of adaptive motion object detection method merging CNN and SIFT light stream | |
CN117710429A (en) | Improved lightweight monocular depth estimation method integrating CNN and transducer | |
Luo et al. | Dual-Window Multiscale Transformer for Hyperspectral Snapshot Compressive Imaging | |
CN115731280A (en) | Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||