CN114187331A - Unsupervised optical flow estimation method based on Transformer feature pyramid network - Google Patents
- Publication number: CN114187331A (application CN202111506127.7A)
- Authority
- CN
- China
- Prior art keywords: image, layer, network, feature, optical flow
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/269 — Analysis of motion using gradient-based methods
- G06F18/22 — Matching criteria, e.g. proximity measures
- G06N3/045 — Combinations of networks
- G06N3/088 — Non-supervised learning, e.g. competitive learning
- G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T2207/10016 — Video; Image sequence
- G06T2207/20016 — Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
Abstract
The invention belongs to the technical field of computer vision and relates to an unsupervised optical flow estimation method based on a Transformer feature pyramid network. The method constructs a Transformer feature pyramid network whose re-attention mechanism strengthens the pyramid's ability to extract image features; builds an optical flow estimation network that performs optical flow prediction; applies occlusion compensation to pixels in occluded regions; and designs a loss function for unsupervised training of the whole network, yielding a fast and accurate unsupervised optical flow estimation model. The method strengthens the feature extraction capability of each pyramid layer and compensates occluded pixels in the image, thereby improving the accuracy of optical flow estimation.
Description
Technical Field
The invention belongs to the technical field of computer vision, and relates in particular to an unsupervised optical flow estimation method based on a Transformer feature pyramid network.
Background
Optical flow estimation is an important research direction in computer vision, with broad application prospects in autonomous driving, intelligent robotics, motion and expression recognition, target tracking, and related fields. With the development of deep learning, many researchers have adopted deep learning techniques for the optical flow estimation problem; these methods run fast, achieve high accuracy, and lead on several common datasets. However, approaches that rely solely on convolutional neural networks still face open problems, such as occlusion handling, small-target detection, and the inability of convolutions to capture global feature information. Fusing a Transformer model with a feature pyramid network can effectively address these problems.
The Transformer is a self-attention model whose computational efficiency and scalability have, in recent years, led to wide adoption in computer vision. Its main fields of application currently include image classification, image recognition, target detection, semantic segmentation, and image generation.
Disclosure of Invention
The invention aims to provide an unsupervised optical flow estimation method based on a Transformer feature pyramid network that strengthens the feature extraction capability of each pyramid layer and applies occlusion compensation to occluded pixels in the image, thereby improving the accuracy of optical flow estimation.
An unsupervised optical flow estimation method based on a Transformer feature pyramid network comprises the following steps:
step 1: constructing a characteristic pyramid network based on a Transformer;
the pyramid network based on the Transformer characteristics has two identical network branches, and has 12 layers in total, so that 6 image characteristics can be extracted, and the network of each branch shares the weight; fusing a Transformer model in a second layer, a fourth layer, a sixth layer, an eighth layer, a tenth layer and a last layer based on a Transformer characteristic pyramid network to enhance the characteristic extraction capability of the network, wherein the model consists of an image segmentation module and a convolution mapping re-attention mechanism module; the image segmentation module firstly carries out deformable convolution operation on an input image to extract local features such as edges and the like, then segments the image into sequences and inputs the sequences into the convolution mapping and attention re-paying mechanism module; a convolution mapping re-attention mechanism module performs re-attention mechanism operation on the image sequence to extract global features;
step 2: constructing an optical flow estimation network; inputting image features extracted from each layer in the Transformer feature pyramid network into an optical flow estimation network for prediction;
the optical flow estimation network is composed of 5 layers of convolutional neural networks, firstly, feature deformation is carried out on an input second frame image, then, a feature matching relation between the second frame image after the feature deformation and a first frame image is calculated, namely, a feature matching cost volume is calculated, then, the calculated feature matching cost volume is input into the optical flow estimation network to predict optical flow, and the output optical flow result is subjected to post-processing by utilizing a context network to obtain accurate predicted optical flow;
Step 3: performing occlusion compensation on pixels in occluded regions;
The occlusion compensation in the optical flow estimation network proceeds as follows: first, a difference comparison is performed between the reconstructed first-frame image and the original first-frame image; then pixels in the occluded region are extracted and compensated. The difference comparison warps the second-frame image with the predicted forward optical flow to reconstruct the first-frame image, and the differences between the reconstruction and the original first frame are added to an occlusion map. The occluded-pixel extraction fills the reconstructed image with the corresponding occluded-region pixels of the original first frame, and occlusion compensation is performed by computing the difference between this filled reconstruction and the original first frame;
Step 4: designing the loss function for training the whole network. Similarity information processing is combined into the optical flow estimation network to construct the overall training loss: a function describing the feature similarity between the different image blocks input to the Transformer model is defined to promote diversity among the blocks, and a contrast loss function and an error loss function are defined so that the features attended to by deep network layers come from their corresponding shallow-layer inputs and are unrelated to the remaining shallow inputs. The loss terms of the Transformer model at each pyramid layer are weighted, summed, and combined with the occlusion compensation loss as the overall training loss that constrains network training;
Step 5: inputting two consecutive frames at the network input and training the network unsupervised with the overall network loss function;
step 6: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
Further, in step 1, the Transformer feature pyramid network has 12 convolutional layers comprising 6 stages of Transformer models; the Transformer model in each stage has the same architecture, and 6 feature maps of different sizes are extracted stage by stage. The channel counts from the first to the sixth stage are 16, 32, 64, 96, 128, and 196, respectively;
The first convolutional layer takes a 3 × 512 × 512 feature map and, with a 3 × 3 kernel and stride 1, outputs a 16 × 512 × 512 feature map; the second layer takes the 16 × 512 × 512 map and, with a 3 × 3 kernel and stride 2, outputs a 16 × 256 × 256 map that is fed to the first-stage Transformer model;
The third layer takes a 16 × 256 × 256 map and, with a 3 × 3 kernel and stride 1, outputs a 16 × 256 × 256 map; the fourth layer takes the 16 × 256 × 256 map and, with a 3 × 3 kernel and stride 2, outputs a 32 × 128 × 128 map into the second-stage Transformer model;
The fifth layer takes a 32 × 128 × 128 map and, with a 3 × 3 kernel and stride 1, outputs a 32 × 128 × 128 map; the sixth layer takes the 32 × 128 × 128 map and, with a 3 × 3 kernel and stride 2, outputs a 64 × 64 × 64 map into the third-stage Transformer model;
The seventh layer takes a 64 × 64 × 64 map and, with a 3 × 3 kernel and stride 1, outputs a 64 × 64 × 64 map; the eighth layer takes the 64 × 64 × 64 map and, with a 3 × 3 kernel and stride 2, outputs a 96 × 32 × 32 map into the fourth-stage Transformer model;
The ninth layer takes a 96 × 32 × 32 map and, with a 3 × 3 kernel and stride 1, outputs a 96 × 32 × 32 map; the tenth layer takes the 96 × 32 × 32 map and, with a 3 × 3 kernel and stride 2, outputs a 128 × 16 × 16 map into the fifth-stage Transformer model;
The eleventh layer takes a 128 × 16 × 16 map and, with a 3 × 3 kernel and stride 1, outputs a 128 × 16 × 16 map; the twelfth layer takes the 128 × 16 × 16 map and, with a 3 × 3 kernel and stride 2, outputs a 196 × 8 × 8 map into the sixth-stage Transformer model;
The feature map output by each convolutional layer passes through its Transformer model and is then fed to the next convolutional layer of the feature pyramid, strengthening the network's feature extraction from the image.
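To make the 12-layer schedule above concrete, the following sketch (our illustration, not part of the patent) propagates feature-map shapes through the pyramid: odd layers keep resolution at stride 1, even layers halve it at stride 2 and widen the channels to the next stage's count.

```python
def pyramid_shapes(h=512, w=512):
    """Return the output (C, H, W) shape of each of the 12 conv layers."""
    stage_channels = [16, 32, 64, 96, 128, 196]
    shapes, c = [], 3
    for layer in range(1, 13):
        if layer % 2 == 1:
            # stride-1 layer: resolution unchanged (layer 1 lifts 3 -> 16)
            c = stage_channels[0] if layer == 1 else c
        else:
            # stride-2 layer: halve H and W, widen C to this stage's count
            c = stage_channels[layer // 2 - 1]
            h, w = h // 2, w // 2
        shapes.append((c, h, w))
    return shapes
```

Running it reproduces the shapes listed in the text, from 16 × 512 × 512 at the first layer down to 196 × 8 × 8 at the twelfth.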
Further, in step 1, the image segmentation module of the Transformer feature pyramid network applies a deformable convolution to the input image so that, as the convolutional layers deepen, the number of image sequences fed to the convolution-mapping re-attention module gradually decreases while the width of each input sequence expands; that is, the feature resolution decreases and the feature dimension increases. The specific steps are as follows:
Step 1.1.1: the i-th pyramid layer output image x_i is fed to the image segmentation module; edge features are extracted by a deformable convolution to give the output image x'_i, and the result is batch-normalized and max-pooled to give the output image x''_i:

x''_i = MaxPool(BN(Deconv(x_i)))

where x_i is the i-th layer output image of the Transformer feature pyramid network, with feature size C_i × H_i × W_i; H_i × W_i is the resolution and C_i the channel count of the i-th layer image; Deconv(·) is the deformable convolution; BN(·) is batch normalization; MaxPool(·) is max pooling; and x''_i is the output image after max pooling;
Step 1.1.2: the max-pooled image x''_i is sliced into 2D image blocks x^p_i, each of size P × P × C_i, where N = H_i W_i / P² is the number of image sequences after slicing (also the effective input length of the convolution-mapping re-attention module) and (P, P) is the resolution of each block after slicing;

Step 1.1.3: the sliced image blocks are flattened into N image sequences; each sequence forms a two-dimensional matrix (N, D), with D = P²C;
Step 1.1.4: the converted image sequences are marked with a binary mask: 1 marks sequences at positions carrying important information, and 0 marks sequences whose information is similar to others'; the marking is chosen adaptively according to the information each sequence contains. The marked sequences are fed to the convolution-mapping re-attention module, where the sequences marked 1 participate in the interactive computation and each sequence marked 0 is computed only against itself.
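Steps 1.1.2 and 1.1.3 amount to a patchify-and-flatten operation. The sketch below is our simplification (the deformable-convolution and max-pool front end of step 1.1.1 and the binary-mask marking of step 1.1.4 are omitted); it slices a (C, H, W) map into N flattened sequences of length D = P²C:

```python
import numpy as np

def patchify(x, P):
    """Split a (C, H, W) feature map into N = (H/P)*(W/P) flattened patches.

    Returns an (N, D) matrix with D = P*P*C, mirroring steps 1.1.2-1.1.3.
    """
    C, H, W = x.shape
    assert H % P == 0 and W % P == 0
    # (C, H/P, P, W/P, P) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    blocks = x.reshape(C, H // P, P, W // P, P).transpose(1, 3, 2, 4, 0)
    return blocks.reshape(-1, P * P * C)
```

For an 8 × 8 map with 2 channels and P = 4, this yields 4 sequences of length 32, and no values are lost or duplicated in the rearrangement.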
Further, the convolution-mapping re-attention module of the Transformer feature pyramid network in step 1 comprises two parts, a multi-head re-attention operation and a multi-head linear processing, with the following steps:
Step 1.2.1: the input i-th layer image sequence z_i is linearly mapped and spatially recombined into a 2D image; three depth-separable convolutions produce three projected features, which are flattened to obtain three different two-dimensional vectors q_i, k_i, v_i of the i-th layer, each defined as follows:

q_i / k_i / v_i = Flatten(Conv2d(Reshape2d(z_i), s))

where Reshape2d(·) spatially recombines the image sequence; s is the convolution kernel size; Conv2d(·) is the depth-separable convolution; Flatten(·) flattens the mapped result into a two-dimensional vector; z_i is the i-th layer flattened image sequence; and q_i, k_i, v_i are the two-dimensional vectors obtained by flattening the i-th layer after the depth-separable convolutions;
Step 1.2.2: the resulting two-dimensional vectors of all input image sequences can be written in matrix form in the multi-head re-attention mechanism as q = [q_1, q_2, ..., q_N], k = [k_1, k_2, ..., k_N], v = [v_1, v_2, ..., v_N]. As the model deepens, the attention maps of the same image sequence at different layers grow increasingly similar, while the similarity among heads within one layer remains small; a transfer matrix is therefore used to combine information across heads, and attention is computed again with this information. The transfer matrices trained at different layers differ, which avoids attention collapse. The multi-head re-attention operation is defined as:

MSA(z_i) = (θ^T · Softmax(q·k^T / √d)) · v

where q, k, v are the matrices obtained by convolution mapping of each input image sequence; T is matrix transposition; q·k^T measures the correlation of two positions; Softmax(·) normalizes q·k^T; d is the dimension of q and k; θ is the transfer matrix that combines attention across the multiple heads; z_i is the image slice sequence flattened by convolution mapping after spatial recombination; and MSA(·) denotes the multi-head re-attention operation on the image slice sequence;
Step 1.2.3: the input and output of the multi-head re-attention operation are added and then batch-normalized so that the activations of each layer are normalized:

z'_i = BN(MSA(z_i) + z_i)

where MSA(·) is the multi-head re-attention operation; BN(·) is batch normalization; z_i is the image slice sequence flattened by convolution mapping after spatial recombination; and z'_i is the image sequence output after batch normalization;
Step 1.2.4: the multi-head linear processing comprises a feed-forward network and batch normalization. The feed-forward network expands the dimension of each image sequence; the activation function of the batch-normalized processing is the GELU nonlinearity; and a residual connection follows each operation to prevent network degradation:

z''_i = BN(FFN(z'_i) + z'_i)

where FFN(·) is the feed-forward network; BN(·) is batch normalization; GELU(·) is the Gaussian error linear unit activation function; z'_i is the image sequence output after the batch-normalized multi-head re-attention operation; and z''_i is the output of the multi-head linear processing applied to that image sequence;
Step 1.2.5: the image sequence output by the multi-head linear processing is linearly mapped to reduce its dimension, spatially recombined into a 2D image, and output to the next feature pyramid layer.
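The core of step 1.2.2 can be sketched in NumPy as follows. `re_attention` and `theta` are our names; the transfer matrix mixes the softmax attention maps of the heads before they weight v, as described above (the batch normalization and residuals of steps 1.2.3 and 1.2.4 are omitted):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(q, k, v, theta):
    """Multi-head re-attention, NumPy sketch.

    q, k, v: (heads, N, d) projections of the image sequence.
    theta:   (heads, heads) transfer matrix mixing attention across heads.
    """
    h, n, d = q.shape
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (heads, N, N)
    mixed = np.einsum('gh,hij->gij', theta, attn)           # cross-head mix
    return mixed @ v                                        # (heads, N, d)
```

With theta set to the identity, this reduces to ordinary multi-head self-attention, which is a quick sanity check on the head-mixing step.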
Further, in step 2, the input of the optical flow estimation network is two consecutive frames of images, and the output is a corresponding estimated optical flow, specifically comprising the steps of:
Step 2.1: for the two input consecutive frames I_1 and I_2, the second image I_2 undergoes feature warping: the optical flow estimated at layer i-1 of the Transformer feature pyramid network is upsampled by a factor of 2, and the features of the second image are warped toward the first image by bilinear interpolation. The feature warping is defined as:

F_w^i(x) = F_2^i(x + up_2(V^{i-1})(x))

where x is a pixel position; up_2(V^{i-1}) denotes 2× upsampling of the flow result obtained by the (i-1)-th layer flow estimation network; F_2^i is the i-th pyramid layer feature of the second image; and F_w^i is the image feature after feature warping;
Step 2.2: the correlation between the warped second-image features and the first-image features, i.e., the feature-matching cost volume of each layer, is computed as:

CV^i(x_1, x_2) = (1/M) · (F_1^i(x_1))^T · F_w^i(x_2)

where x_1, x_2 are pixel positions in the first and second images, respectively; F_1^i is the feature map of the first image at the i-th layer of the Transformer feature pyramid network; F_w^i is the warped second-image feature at the i-th layer; M is the length of the feature vector; T is matrix transposition; and CV^i(x_1, x_2) is the i-th layer feature-matching cost volume result;
Step 2.3: the feature-matching cost volume result, the features of the first image, and the upsampled optical flow are fed to the i-th flow estimation layer of the optical flow estimation network to obtain that layer's flow estimate.
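Step 2.2 boils down to a normalized dot product between first-image features and warped second-image features. A minimal NumPy sketch of the cost volume (our illustration: it uses a local search window with zero padding, and assumes the bilinear warping of step 2.1 has already been applied to `f2w`):

```python
import numpy as np

def cost_volume(f1, f2w, max_disp=2):
    """Feature-matching cost volume, NumPy sketch.

    f1, f2w: (C, H, W) first-image and warped second-image features at one
    pyramid level. For each pixel, correlate with the warped features at
    offsets in [-max_disp, max_disp]^2, normalised by the feature length C.
    """
    C, H, W = f1.shape
    d = 2 * max_disp + 1
    cv = np.zeros((d * d, H, W))
    pad = np.pad(f2w, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    for i in range(d):
        for j in range(d):
            # dot product over channels at this displacement, scaled by 1/C
            cv[i * d + j] = (f1 * pad[:, i:i + H, j:j + W]).sum(0) / C
    return cv
```

For identical all-ones features the zero-displacement channel is exactly 1 everywhere, which matches the 1/M normalization in the definition above.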
Further, in step 2.3, the flow estimation layers of the optical flow estimation network use five convolutional networks with 128, 96, 64, and 32 channels, and the output flow results are post-processed by a context network. The context network is a feed-forward convolutional neural network based on dilated convolutions, consisting of 7 convolutional layers with 3 × 3 kernels and different dilation coefficients; a dilation coefficient k means that a filter's input units in that layer are k units apart, vertically and horizontally, from its other input units. From top to bottom the dilation coefficients are 1, 2, 4, 8, 16, 1, and 1; the layers with large dilation coefficients effectively enlarge the receptive field of each output unit on the pyramid, producing a refined optical flow.
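With stride-1 3 × 3 layers, each dilation k adds (3 − 1)·k pixels to the receptive field, so the stated schedule 1, 2, 4, 8, 16, 1, 1 yields a 67-pixel field. A tiny sketch of that arithmetic (our illustration):

```python
def receptive_field(kernel=3, dilations=(1, 2, 4, 8, 16, 1, 1)):
    """Receptive field of a stack of stride-1 dilated conv layers.

    Each layer with dilation k widens the field by (kernel - 1) * k,
    which is how the large-dilation layers enlarge each output unit's view.
    """
    rf = 1
    for k in dilations:
        rf += (kernel - 1) * k
    return rf
```

A single plain 3 × 3 layer gives the familiar field of 3; the full 7-layer context network reaches 67 pixels in each direction.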
Further, in step 3, the occlusion compensation of pixels in the occluded region proceeds as follows:
A difference comparison is performed between the flow-based reconstruction and the original image, and the occluded-region pixels are then extracted for occlusion compensation. The difference comparison warps the second-frame image I_2 with the predicted forward optical flow to synthesize a reconstructed first frame I'_1, and the differences between the reconstruction I'_1 and the original I_1 are added to the occlusion map:

L_df^i = Σ_x O_fw(x) · σ(I_1(x), I'_1(x))

where x is a pixel position; O_fw is the occlusion map; I_1(x) and I'_1(x) are the original first frame and the reconstructed first frame, respectively; σ(·) is a pixel-level similarity measure used to compare the original image I_1(x) and the reconstruction I'_1(x); and L_df^i is the difference contrast loss in the i-th layer flow estimation network;
For the occluded-pixel extraction, the corresponding occluded-region pixels of the original image I_1 are used to fill the reconstruction I'_1, giving the reconstructed image I''_1; the loss in the occluded region is computed from the difference between I''_1 and I_1:

L_oc^i = Σ_x σ(I_1(x), I''_1(x))

where x is a pixel position; I''_1(x) is the reconstructed pixel obtained by adding the corresponding occluded pixels to the reconstructed first image I'_1(x); σ(·) is the pixel-level similarity measure; and L_oc^i is the occluded-region loss in the i-th layer flow estimation network;
The occlusion compensation loss L_occ^i of the i-th layer flow estimation network is obtained by summing the two losses:

L_occ^i = L_df^i + L_oc^i

where L_df^i is the difference contrast loss in the i-th layer flow estimation network; L_oc^i is the occluded-region loss in the i-th layer flow estimation network; and L_occ^i is the occlusion compensation loss in the i-th layer flow estimation network.
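A minimal NumPy sketch of the two occlusion losses (our illustration: σ is taken as an absolute pixel difference, since the patent only specifies "a pixel-level similarity measure", and the flow warping that produces the reconstruction is assumed done):

```python
import numpy as np

def occlusion_losses(i1, i1_warp, occ):
    """Occlusion-compensation loss L_occ = L_df + L_oc, NumPy sketch.

    i1:      original first frame, (H, W)
    i1_warp: first frame reconstructed by warping frame 2 with the flow
    occ:     binary occlusion map (1 = occluded pixel)
    """
    sigma = np.abs(i1 - i1_warp)               # pixel-level dissimilarity
    l_df = (occ * sigma).sum()                 # difference contrast loss
    i1_fill = np.where(occ == 1, i1, i1_warp)  # fill occluded pixels from I1
    l_oc = np.abs(i1 - i1_fill).sum()          # loss on the filled image
    return l_df + l_oc
```

Filling occluded pixels from the original frame makes I''_1 exact in the occluded region, so the second term penalizes the remaining reconstruction error.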
Further, the loss function for overall network training in step 4 is constructed as follows. Let the features of the N image sequences of the input image x at the i-th layer be x_1, ..., x_N, and use a feature similarity function as a penalty term in the loss to promote diversity among different image sequences:

L_sim^i = (1/(N(N-1))) Σ_{n≠m} (x_n^T x_m) / (||x_n||_2 ||x_m||_2)

where N is the total number of image segments; ||·||_2 is the L2 norm; T is vector transposition; x_n, x_m are the features of different image slice sequences of the same layer; and L_sim^i is the feature similarity loss measuring similarity among the different image sequences of the i-th layer;
An i-th layer contrast loss L_ct^i is defined: features learned in shallow layers of the Transformer model are more diverse than those learned in deep layers, so the contrast loss uses shallow features to regularize deep features, reducing the similarity among different image sequences and increasing their feature diversity in the deep network:

L_ct^i = -(1/N) Σ_{n=1}^{N} log( exp((h_n^1)^T h_n^i) / Σ_{m=1}^{N} exp((h_n^1)^T h_m^i) )

where h_n^1 is the n-th image sequence feature of the first layer; h_n^i is the corresponding n-th image sequence feature of the i-th layer; T denotes transposition of the feature vector; N is the number of image sequences; and L_ct^i is the i-th layer image sequence contrast loss;
An i-th layer error loss L_er^i is defined: each deep image sequence in the Transformer model should attend only to its corresponding shallow input sequence and ignore the remaining unrelated image sequence features:

L_er^i = -(1/N) Σ_{n=1}^{N} log( exp((h_n^1)^T h_n^i) / Σ_{m=1}^{N} exp((h_m^1)^T h_n^i) )

where h_n^1 is the n-th image sequence feature of the first layer; h_n^i is the corresponding n-th image sequence feature of the i-th layer; N is the number of image sequences; and L_er^i is the i-th layer image sequence error loss;
The overall training loss is the weighted sum of the six-stage Transformer model losses combined with the occlusion compensation loss:

L_final = Σ_{i=1}^{6} ( λ_1 L_sim^i + λ_2 L_ct^i + λ_3 L_er^i + L_occ^i )

where λ_1, λ_2, λ_3 are balance factors weighting the proportions of the Transformer model's loss terms at the different pyramid scales: the higher the resolution, the larger the role of the corresponding loss in network training and the larger its weight coefficient; L_sim^i, L_ct^i, L_er^i are the i-th layer Transformer model's feature similarity, contrast, and error losses, respectively; L_occ^i is the occlusion compensation loss of the i-th layer flow estimation network; and the resulting L_final is the loss function for final overall network training.
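The final weighted sum can be sketched as follows (our illustration; the λ values are placeholders, since the patent only says that higher-resolution stages receive larger weights):

```python
def final_loss(layer_losses, lambdas=(1.0, 0.5, 0.25)):
    """Overall training loss, pure-Python sketch.

    layer_losses: per pyramid stage i, a tuple
        (L_sim_i, L_ct_i, L_er_i, L_occ_i)
    lambdas: balance factors lambda1..lambda3 (illustrative values).
    """
    l1, l2, l3 = lambdas
    return sum(l1 * sim + l2 * ct + l3 * er + occ
               for sim, ct, er, occ in layer_losses)
```

In training, the six pyramid stages each contribute one tuple, and the summed scalar is what backpropagation minimizes.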
The invention has the beneficial effects that:
the invention introduces a Transformer model into a characteristic pyramid network to enhance the characteristic extraction capability of the network, trains the whole network by designing a loss function of shielding compensation, and further obtains the finally predicted optical flow. The method can enhance the feature extraction capability of the feature pyramid layer on the image, and carry out occlusion compensation processing on occlusion pixels in the image so as to improve the precision of optical flow estimation.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a schematic diagram of the Transformer-based feature pyramid in the present invention.
FIG. 3 is a schematic diagram of an image segmentation module according to the present invention.
FIG. 4 is a block diagram of a convolutional mapping re-attention mechanism of the present invention.
Fig. 5 is an overall architecture diagram of the present invention.
FIG. 6 is a schematic diagram of the occlusion compensation of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention aims to provide an unsupervised optical flow estimation method based on a Transformer feature pyramid network, belonging to the field of computer vision. First, a Transformer-based feature pyramid network is constructed; by means of the re-attention mechanism operation of the Transformer model, the feature extraction capability of the feature pyramid network on the image is enhanced. Second, an optical flow estimation network is constructed so that the network can perform optical flow prediction. Finally, occlusion compensation processing is performed on pixels in occluded areas, and a loss function for whole-network training is designed to train the network in an unsupervised manner, obtaining a faster and more accurate unsupervised optical flow estimation model.
The invention provides an unsupervised optical flow estimation method based on a Transformer feature pyramid network, and aims to enhance the feature extraction capability of a feature pyramid layer on an image and perform occlusion compensation processing on occluded pixels in the image so as to improve the accuracy of optical flow estimation.
The purpose of the invention is realized as follows:
the method comprises the following steps: Step one: construct a Transformer-based feature pyramid network. The feature pyramid network has two identical network branches with 12 layers in total, can extract 6 image features, and the network of each branch shares its weights. A Transformer model is fused at the second, fourth, sixth, eighth, tenth and last layers of the feature pyramid to enhance the feature extraction capability of the network; the model consists of an image segmentation module and a convolution mapping re-attention mechanism module. The image segmentation module first performs a deformable convolution operation on the input image to extract local features such as edges, then segments the image into sequences that are input into the convolution mapping re-attention mechanism module. The convolution mapping re-attention mechanism module performs a re-attention mechanism operation on the image sequences to extract global features.
Step two: and constructing an optical flow estimation network. And inputting the image features extracted on the basis of each layer in the Transformer feature pyramid network into an optical flow estimation network for prediction, wherein the network consists of 5 layers of convolutional neural networks. Firstly, feature deformation is carried out on an input second frame image, then a feature matching relation between the second frame image after the feature deformation and the first frame image is calculated, namely a feature matching cost volume, then the calculated feature matching cost volume is input into the optical flow estimation network to predict the optical flow, and the output optical flow result is subjected to post-processing by utilizing a context network to obtain the accurate predicted optical flow.
Step three: occlusion compensation processing is performed on the pixels in the occluded area. The specific operation of the occlusion compensation processing in the optical flow estimation network is to first perform a difference comparison between the reconstructed image synthesized from the second frame and the first frame image, and then extract the pixels of the occluded area to perform occlusion compensation processing on them. The difference contrast processing reconstructs the first frame image by warping the second frame image with the predicted forward optical flow, and adds the difference between the reconstructed first frame image and the original first frame image to the occlusion map. The extraction of occluded pixels fills the reconstructed image with the occluded-area pixels of the original first frame image, and occlusion compensation is performed by calculating the difference between the reconstructed image and the original first frame image.
Step four: design the loss function of the whole network training. The loss function with similarity information processing is integrated into the optical flow estimation network, and the loss function for whole-network training is constructed. A function describing the feature similarity between the different image blocks input to the Transformer model is defined to promote diversity between different image blocks; a contrast loss function and an error loss function are also defined, so that the features attended to by the deep network layers come from the corresponding shallow-layer inputs and are independent of the remaining shallow-layer inputs. The loss terms corresponding to the Transformer model of each pyramid layer are weighted and summed, and combined with the occlusion compensation loss function as the overall loss function of network training to constrain the training process of the network.
Step five: two continuous frames of images are input at the input end of the network, and the network is subjected to unsupervised training by utilizing the whole network loss function.
Step six: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
The invention first introduces a Transformer model into the feature pyramid network to enhance the feature extraction capability of the network, and simultaneously trains the whole network with a designed occlusion-compensation loss function to obtain the finally predicted optical flow.
Example 1:
the method comprises the following steps: and constructing a characteristic pyramid network based on the Transformer.
As shown in fig. 2, an image I is input into the feature pyramid network, which has 12 convolution layers and comprises Transformer models in 6 stages; the Transformer model in each stage has the same architecture, so 6 feature maps of different sizes can be extracted step by step.
The numbers of channel features from the first stage to the sixth stage are 16, 32, 64, 96, 128 and 196 respectively. The first convolution layer inputs a feature map of 3 × 512 × 512, the size of the convolution kernel is 3 × 3, the step size is 1, and a feature map of 16 × 512 × 512 is output. The second convolution layer inputs the 16 × 512 × 512 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the output 16 × 256 × 256 feature map is input into the Transformer model in the first stage.
The third convolution layer inputs a 16 × 256 × 256 feature map, the convolution kernel size is 3 × 3, the step size is 1, and a 16 × 256 × 256 feature map is output. The fourth convolution layer inputs a 16 × 256 × 256 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the 32 × 128 × 128 feature map output is input to the Transformer model in the second stage.
The fifth convolution layer inputs a feature map of 32 × 128 × 128, the convolution kernel size is 3 × 3, the step size is 1, and a feature map of 32 × 128 × 128 is output. The sixth convolution layer inputs the 32 × 128 × 128 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the output 64 × 64 × 64 feature map is input to the Transformer model of the third stage.
The seventh convolution layer inputs a feature map of 64 × 64 × 64, the convolution kernel size is 3 × 3, the step size is 1, and a feature map of 64 × 64 × 64 is output. The eighth convolution layer inputs a feature map of 64 × 64 × 64, the convolution kernel size is 3 × 3, the step size is 2, and the output 96 × 32 × 32 feature map is input to the Transformer model in the fourth stage.
The ninth convolution layer inputs a 96 × 32 × 32 feature map, the convolution kernel size is 3 × 3, the step size is 1, and a 96 × 32 × 32 feature map is output. The tenth convolution layer inputs a 96 × 32 × 32 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the 128 × 16 × 16 feature map output is input to the Transformer model in the fifth stage.
The eleventh convolution layer inputs a 128 × 16 × 16 feature map, the convolution kernel size is 3 × 3, the step size is 1, and a 128 × 16 × 16 feature map is output. The twelfth convolution layer inputs a 128 × 16 × 16 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the 196 × 8 × 8 feature map output is input to the Transformer model of the sixth stage.
And the feature graph output by the convolution layer enters a Transformer model and then is output to the next convolution layer of the feature pyramid so as to enhance the feature extraction capability of the network on the image.
The Transformer model comprises two parts: an image segmentation module and a convolution mapping re-attention mechanism module. The standard Transformer module was designed for machine language translation and takes a 1D language sequence as input; to process a 2D image, the image segmentation module first segments the input image, as shown in fig. 3. The module performs a deformable convolution operation on the input image, so that the number of image sequences input to the convolution mapping re-attention mechanism module is gradually reduced while the width of each input sequence is expanded; in this way, as the number of convolution layers deepens, the feature resolution gradually decreases and the feature size increases. The deformable convolution has four layers in total, each composed of a standard convolution layer and an offset layer. The convolution kernel size of the first standard convolution layer is 7 × 7 with step size 2, the convolution kernel sizes of the second and third layers are 3 × 3 with step size 1, and the convolution kernel size of the fourth layer is 7 × 7 with step size 2. The offset layer of each layer learns from the feature map output by the preceding standard convolution layer to obtain the deformation offsets of the deformable convolution.
Let the output image of the ith pyramid layer of each branch of the Transformer-based feature pyramid network be xi, with feature size Ci × Hi × Wi, wherein Hi × Wi represents the resolution of the ith-layer image and Ci represents the number of channels of the ith-layer image. The output image xi of this pyramid layer is input into the image segmentation module, and edge features are extracted through the deformable convolution operation to obtain an output image x′i; the image after the deformable convolution operation is then subjected to batch normalization and maximum pooling to obtain an output image x″i, defined as follows:
x″i=MaxPool(BN(Deconv(xi)))
where Deconv(·) represents the deformable convolution operation, BN(·) represents batch normalization, MaxPool(·) represents maximum pooling, xi represents the output image of the ith pyramid layer, and x″i represents the output image after the maximum pooling operation. After maximum pooling, the image x″i is sliced into 2D image blocks; N denotes the number of image sequences after segmentation of the output image x″i, which is also the effective sequence length input to the convolution mapping re-attention mechanism module, and (P, P) represents the resolution of each image block after segmentation.
The sliced image blocks are flattened into N image sequences, each forming a row of a two-dimensional matrix (N, D) with D = P²C. The converted image sequences are marked with a binary mask: a mark of 1 denotes an image sequence at an important information position, and a mark of 0 denotes an image sequence containing similar information; the marks are assigned adaptively according to the differences in information contained in the image sequences. The marked image sequences are input into the convolution mapping re-attention mechanism module, where the sequences marked 1 participate in interactive calculation with each other, and a sequence marked 0 is calculated only with itself.
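As a concrete sketch of the segmentation-and-flattening step above (the block layout is an assumption consistent with the stated dimensions, not the patent's exact implementation), a Ci × Hi × Wi feature map can be cut into non-overlapping P × P blocks and flattened into an (N, D) matrix with N = HiWi/P² and D = P²Ci:

```python
import numpy as np

def split_into_sequences(feat: np.ndarray, P: int) -> np.ndarray:
    """Cut a (C, H, W) feature map into non-overlapping P x P blocks and
    flatten each block into one row of an (N, D) matrix,
    N = H*W / P**2, D = P*P*C."""
    C, H, W = feat.shape
    assert H % P == 0 and W % P == 0, "feature map must tile evenly"
    # (C, H//P, P, W//P, P) -> (H//P, W//P, C, P, P) -> (N, D)
    blocks = feat.reshape(C, H // P, P, W // P, P)
    blocks = blocks.transpose(1, 3, 0, 2, 4)
    return blocks.reshape((H // P) * (W // P), C * P * P)

feat = np.arange(16 * 8 * 8, dtype=np.float32).reshape(16, 8, 8)
seq = split_into_sequences(feat, P=4)
print(seq.shape)  # (4, 256): N = 8*8/4**2 = 4 sequences, D = 4*4*16 = 256
```

Each row of `seq` is one image sequence ready for the convolution mapping re-attention mechanism module.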
The convolution mapping re-attention mechanism module comprises a multi-head re-attention mechanism operation part and a multi-head linear processing part. First, a multi-head re-attention mechanism operation is performed on the input image sequence, as shown in fig. 4. The module first linearly maps the ith-layer image sequence and spatially recomposes it into a 2D image, obtains three projected features through three depth-separable convolution operations, and flattens the three projected features to obtain three different two-dimensional vectors q, k, v of the ith layer, each defined as follows:
where Reshape2d(·) indicates spatial recomposition of the image sequence, s represents the size of the convolution kernel, Conv2d(·) indicates the depth-separable convolution operation, and Flatten(·) indicates flattening the mapping of the ith-layer image sequence after the depth-separable convolution operation into a two-dimensional vector. In the multi-head re-attention mechanism the two-dimensional vectors mapped from each input image sequence can be expressed in matrix form as q = [q1, q2, ..., qN], k = [k1, k2, ..., kN], v = [v1, v2, ..., vN]. As the model deepens, the attention maps of the same image sequence in different layers become increasingly similar, while the similarity between the heads of the same layer remains small; a transformation matrix is therefore adopted to combine information between the different heads, and the attention operation is performed again using this matrix. The transformation matrices trained at different layers differ, which avoids attention collapse. The multi-head re-attention mechanism operation is defined as follows:
wherein q, k, v respectively represent the matrices obtained by convolution mapping of each input image sequence; T represents matrix transposition; q·kT denotes the correlation of two positions; softmax(·) denotes a normalization operation on q·kT; d represents the dimension of q and k; Θ represents the learnable transformation matrix applied across the attention heads; the input is the image slice sequence flattened by convolution mapping after spatial recomposition, and the output is the multi-head re-attention operation on that sequence. The layer following the multi-head re-attention operation is a batch normalization layer: the input and output of the multi-head re-attention mechanism operation are added and then batch-normalized, so that the activation values of each layer are normalized, specifically defined as follows:
wherein MSA(·) represents the multi-head re-attention mechanism operation and BN(·) represents batch normalization; the input is the image slice sequence flattened by convolution mapping after spatial recomposition, and the output is the image sequence after batch normalization. The multi-head linear processing operation comprises a feed-forward network and batch normalization: the feed-forward network expands the dimension of each image sequence, the activation function used with batch normalization is the GELU nonlinear activation, and a residual connection follows each operation to prevent network degradation, specifically defined as follows:
wherein FFN(·) represents the feed-forward network, BN(·) represents batch normalization, and GELU(·) represents the Gaussian error linear unit activation function; the input is the image sequence output after batch normalization of the multi-head re-attention operation, and the output is the image after multi-head linear processing. The image sequence output by this layer is linearly mapped, reduced in dimension, and spatially recomposed into a 2D image, which is output to the next feature pyramid layer; this processing is repeated N times.
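The multi-head re-attention operation described above can be sketched as follows. The placement of the transformation matrix Θ (here mixing the per-head attention maps softmax(q·kᵀ/√d) across the head dimension before they are applied to v) follows the description, but the exact details and the function name `re_attention` are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(q, q_k, v, theta):
    """Multi-head re-attention sketch.
    q, q_k, v: (heads, N, d) projected sequences; theta: (heads, heads)
    learnable transformation matrix mixing the per-head attention maps,
    intended to avoid attention collapse in deep layers."""
    h, N, d = q.shape
    attn = softmax(q @ q_k.transpose(0, 2, 1) / np.sqrt(d))  # (h, N, N)
    attn = np.einsum('gh,hnm->gnm', theta, attn)             # mix heads
    return attn @ v                                          # (h, N, d)

rng = np.random.default_rng(0)
h, N, d = 4, 6, 8
out = re_attention(rng.standard_normal((h, N, d)),
                   rng.standard_normal((h, N, d)),
                   rng.standard_normal((h, N, d)),
                   np.eye(h))  # identity theta reduces to plain attention
print(out.shape)  # (4, 6, 8)
```

With an identity Θ the operation degenerates to ordinary multi-head attention; a learned non-diagonal Θ regenerates distinct attention maps per head.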
Step two: and constructing an optical flow estimation network.
As shown in FIG. 5, given 2 input images I1 and I2, the feature pyramid has two identical weight-sharing branches. Feature warping is performed on the second image I2: the estimated optical flow output from the (i-1)th pyramid layer is upsampled by a factor of 2, and the features of the second image are warped toward the first image by bilinear interpolation. The feature warping operation is defined as follows:
where x represents the pixel position, the first term is the optical flow result obtained from the (i-1)th-layer optical flow estimation network upsampled by a factor of 2, the second is the ith-layer image feature of the second image of the pyramid, and the output is the image feature after feature warping. The correlation between the warped second-image feature and the first-image feature is then computed, i.e. the feature matching cost volume of each layer, whose calculation is defined as follows:
wherein x1, x2 represent pixel positions in the first and second images respectively; the first feature map is that of the first image on the ith pyramid layer; the second is the image feature of the warped second image on the ith pyramid layer; M represents the length of the feature vector; T represents matrix transposition; and the finally calculated CVi(x1, x2) is the feature matching cost volume result of the ith layer of the feature pyramid.
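A minimal sketch of the cost-volume computation, assuming a PWC-Net-style local correlation: for each position of the first feature map, dot products with the (already warped) second-image features at displacements within a small search range are taken and normalized by the feature length M (here M equals the channel count; the limited-search-range form is an assumption):

```python
import numpy as np

def cost_volume(f1, f2w, max_disp=1):
    """Local correlation cost volume.
    f1:  (C, H, W) features of the first image.
    f2w: (C, H, W) warped features of the second image.
    Returns ((2*max_disp+1)**2, H, W) of normalized dot products."""
    C, H, W = f1.shape
    M = C
    D = 2 * max_disp + 1
    cv = np.zeros((D * D, H, W), dtype=f1.dtype)
    pad = np.pad(f2w, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    idx = 0
    for dy in range(D):
        for dx in range(D):
            shifted = pad[:, dy:dy + H, dx:dx + W]
            cv[idx] = (f1 * shifted).sum(axis=0) / M  # per-pixel correlation
            idx += 1
    return cv

f1 = np.ones((4, 5, 5), dtype=np.float32)
cv = cost_volume(f1, f1.copy(), max_disp=1)
print(cv.shape)     # (9, 5, 5)
print(cv[4, 2, 2])  # zero-displacement entry: 4 * 1 / 4 = 1.0
```

The resulting volume, together with the first-image features and the upsampled flow, would then feed the ith-layer optical flow estimation network.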
The obtained feature matching cost volume result, the features of the first image and the upsampled higher-resolution optical flow are input into the ith-layer optical flow estimation network to obtain the ith-layer optical flow estimation result. The optical flow estimation layer uses five convolution layers with 128, 96, 64 and 32 channels respectively. The output optical flow result is post-processed by a context network (in place of traditional post-processing such as median or bilateral filtering). The context network is a feed-forward convolutional neural network based on a dilated convolution design, composed of 7 convolution layers; the convolution kernel of each layer is 3 × 3, and the layers have different dilation coefficients. A dilation coefficient k means that a filter input unit in that layer is separated from the other input units of the filter by k units in the vertical and horizontal directions. From top to bottom, the dilation coefficients of the convolution layers are 1, 2, 4, 8, 16, 1 and 1 in sequence; convolution layers with large dilation coefficients effectively enlarge the receptive field of each output unit on the pyramid so as to output a refined optical flow.
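The effect of the dilation coefficients can be checked with a small receptive-field computation: for a stack of stride-1 layers with kernel size k, each layer with dilation d adds (k−1)·d pixels to the receptive field, so the 7-layer stack above with 3 × 3 kernels and dilations 1, 2, 4, 8, 16, 1, 1 reaches a 67 × 67 receptive field:

```python
def receptive_field(dilations, k=3):
    """Receptive field of a stack of stride-1 conv layers with kernel
    size k and the given per-layer dilation coefficients: each layer
    adds (k - 1) * d pixels."""
    rf = 1
    for d in dilations:
        rf += (k - 1) * d
    return rf

print(receptive_field([1, 2, 4, 8, 16, 1, 1]))  # 67
```

This illustrates why the dilated design enlarges the receptive field far faster than seven plain 3 × 3 layers (which would reach only 15 × 15).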
Step three: and carrying out shielding compensation processing on the pixels in the shielding area.
As shown in fig. 6, the specific operation of the occlusion compensation processing in the optical flow estimation network is to first perform a difference comparison between the predicted optical flow image and the original image, and then extract the pixels of the occluded area to perform occlusion compensation processing on them. The contrast processing warps the second frame image I2 with the predicted forward optical flow, thereby synthesizing a first-frame reconstructed image I′1; the difference between the reconstructed image I′1 and the original image I1 is added to the occlusion map, defined as follows:
where x denotes the pixel position, Ofw denotes the occlusion map, I1(x) and I′1(x) respectively represent the original first frame image and the image reconstructed from the second frame, σ(·) represents a pixel-level similarity measure used to calculate the similarity between the original image I1(x) and the reconstructed image I′1(x), and the finally calculated quantity is the difference contrast loss in the ith-layer optical flow estimation network. The extraction of the pixels of the occluded region fills the image I′1 with the corresponding occluded-region pixels of the original image I1, obtaining a reconstructed image I″1; the loss in the occluded region is calculated from the difference between I″1 and I1, defined as follows:
wherein x represents the pixel position, I″1(x) represents the reconstructed image obtained by adding to the reconstructed first image I′1(x) the pixels corresponding to the occluded pixels, σ(·) represents the pixel-level similarity measure, and the finally calculated quantity is the loss in the occluded region in the ith-layer optical flow estimation network. The occlusion compensation loss function in the ith-layer optical flow estimation network is obtained by summing the two loss functions, defined as follows:
wherein the first term represents the difference contrast loss in the ith-layer optical flow estimation network, the second term represents the loss in occluded regions in the ith-layer optical flow estimation network, and the sum represents the occlusion compensation loss function in the ith-layer optical flow estimation network.
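A minimal numeric sketch of the two occlusion-compensation terms, assuming σ is a per-pixel absolute difference (the text only specifies a pixel-level similarity measure) and assuming a precomputed boolean occlusion map:

```python
import numpy as np

def occlusion_compensation_loss(I1, I1_rec, occ):
    """Sum of the two terms described above.
    I1:     (H, W) original first frame.
    I1_rec: (H, W) first frame reconstructed by warping the second frame.
    occ:    (H, W) boolean occlusion map, True = occluded pixel."""
    diff = np.abs(I1 - I1_rec)
    # difference-contrast term, evaluated on visible (non-occluded) pixels
    photo_loss = diff[~occ].mean() if (~occ).any() else 0.0
    # fill occluded pixels of the reconstruction from the original frame
    I1_fill = np.where(occ, I1, I1_rec)
    occ_loss = np.abs(I1 - I1_fill).mean()
    return photo_loss + occ_loss

I1 = np.array([[1., 2.], [3., 4.]])
I1_rec = np.array([[1., 2.], [0., 4.]])  # invalid warped value where occluded
occ = np.array([[False, False], [True, False]])
print(occlusion_compensation_loss(I1, I1_rec, occ))  # 0.0
```

Filling the occluded pixels from the original frame prevents invalid warped values from dominating the training signal inside occluded regions.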
Step four: designing loss functions for whole network training
The Transformer-model loss function with similarity information processing is incorporated into the optical flow estimation network. As the similarity between different input image sequences increases as the Transformer model deepens, a loss function describing the feature similarity between the different image sequences of the ith layer is defined. Let the features of the input image x and its corresponding N image sequences of the ith layer be given; the feature similarity function serves as a penalty term in the loss function to promote diversity between different image sequences, and is defined as follows:
where N represents the total number of image segments, ‖·‖2 represents the L2 norm, T denotes the transpose of a vector, xn, xm respectively represent the features of different image slice sequences of the same layer, and the result is the feature similarity loss function describing the similarity between different image sequences of the ith layer.
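Consistent with the L2-norm and transpose terms above, the feature-similarity penalty can be sketched as the average pairwise cosine similarity between the N image-sequence features of one layer; the exact formula is an image in the original, so this form is an assumption:

```python
import numpy as np

def feature_similarity_loss(x):
    """Average pairwise cosine similarity between the rows of x.
    x: (N, D) image-sequence features of one layer. Minimizing this
    penalty pushes different sequences apart (more diverse features)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # L2-normalize rows
    sim = x @ x.T                                     # cosine similarities
    N = x.shape[0]
    return sim[~np.eye(N, dtype=bool)].mean()         # off-diagonal mean

print(feature_similarity_loss(np.array([[1., 0.], [0., 1.]])))  # 0.0, diverse
print(feature_similarity_loss(np.array([[1., 0.], [2., 0.]])))  # 1.0, collapsed
```

Orthogonal sequence features give the minimum penalty; identical directions give the maximum.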
Defining an ith layer contrast loss function: the features learned in the shallow layers of the Transformer model are more diverse than those learned in the deep layers, so the contrast loss function uses the features learned in the shallow layers to regularize the deep features, reducing the similarity between different image sequences and increasing the feature diversity of the image sequences in the deep network; it is defined as follows:
wherein t1(n) represents the nth image sequence feature of the first layer, ti(n) represents the corresponding nth image sequence feature of the ith layer, T indicates transposition of the feature vector, N indicates the number of image sequences, and the result represents the image sequence contrast loss function of the ith layer.
Defining an ith layer error loss function: each deep-layer image sequence in the Transformer model should attend only to the corresponding image sequence input from the shallow layer, while ignoring the remaining irrelevant image sequence features; it is defined as follows:
wherein t1(n) represents the nth image sequence feature of the first layer, ti(n) represents the corresponding nth image sequence feature of the ith layer, N indicates the number of image sequences, and the result represents the error loss function of the image sequence of the ith layer.
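A hedged sketch of the contrast and error losses: the original formulas are images, so only their stated intent survives (a deep sequence should match its own shallow counterpart and ignore the others). An InfoNCE-style contrast loss and a squared-error alignment loss are assumed here as one plausible realization:

```python
import numpy as np

def contrast_loss(t1, ti):
    """InfoNCE-style contrast: each deep feature ti[n] should be most
    similar to its own shallow counterpart t1[n] among all N sequences.
    t1, ti: (N, D) shallow- and deep-layer sequence features."""
    logits = t1 @ ti.T                                  # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                  # pull matching pairs

def error_loss(t1, ti):
    """Alignment term: deep features should track their shallow inputs."""
    return np.mean((ti - t1) ** 2)

rng = np.random.default_rng(1)
t1 = rng.standard_normal((5, 8))
print(error_loss(t1, t1))            # 0.0 when deep == shallow
print(contrast_loss(t1, t1) >= 0.0)  # True: negative log-softmax is >= 0
```

Both terms vanish or reach their minimum when each deep sequence aligns with its own shallow input, matching the stated goal.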
The overall network training loss function is defined as the weighted sum of the loss functions of the six-stage Transformer model and the photometric loss function, with the formula as follows:
wherein λ1, λ2, λ3 are constraint balance factors representing the proportions of the Transformer-model loss functions at different pyramid scales; the higher the resolution, the larger the role of the corresponding loss function in network training and the higher its weight coefficient; the three loss terms of the ith layer are respectively the feature similarity loss function, the contrast loss function and the error loss function of the ith-layer Transformer model; the remaining term represents the occlusion compensation loss function in the ith-layer optical flow estimation network; and the finally calculated Lfinal serves as the loss function of the final network training.
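A minimal sketch of how the final objective could combine the per-stage Transformer losses with the balance factors and the occlusion-compensation losses; the per-stage weights shown (higher at higher resolution, as the text requires) are illustrative values, not values from the patent:

```python
def final_loss(stage_losses, lambdas, occ_losses):
    """Weighted sum of per-stage Transformer losses plus the
    occlusion-compensation (photometric) losses.
    stage_losses: list of (sim, con, err) tuples, one per pyramid stage.
    lambdas:      per-stage balance factors.
    occ_losses:   per-stage occlusion compensation losses."""
    total = 0.0
    for (sim, con, err), lam in zip(stage_losses, lambdas):
        total += lam * (sim + con + err)
    return total + sum(occ_losses)

stages = [(0.1, 0.2, 0.3)] * 6
lambdas = [1.0, 0.8, 0.6, 0.4, 0.2, 0.1]  # larger weight, higher resolution
print(final_loss(stages, lambdas, [0.05] * 6))  # approx. 2.16
```

The returned scalar would be Lfinal, the quantity minimized during unsupervised training.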
Step five: two continuous frames of images are input at the input end of the network, and the network is subjected to unsupervised training by utilizing the whole network loss function.
Step six: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (8)
1. An unsupervised optical flow estimation method based on a Transformer feature pyramid network is characterized by comprising the following steps of:
step 1: constructing a characteristic pyramid network based on a Transformer;
the Transformer-based feature pyramid network has two identical network branches with 12 layers in total, can extract 6 image features, and the network of each branch shares its weights; a Transformer model is fused at the second, fourth, sixth, eighth, tenth and last layers of the Transformer-based feature pyramid network to enhance the feature extraction capability of the network, the model consisting of an image segmentation module and a convolution mapping re-attention mechanism module; the image segmentation module first performs a deformable convolution operation on the input image to extract local features such as edges, then segments the image into sequences that are input into the convolution mapping re-attention mechanism module; the convolution mapping re-attention mechanism module performs a re-attention mechanism operation on the image sequences to extract global features;
step 2: constructing an optical flow estimation network; inputting image features extracted from each layer in the Transformer feature pyramid network into an optical flow estimation network for prediction;
the optical flow estimation network is composed of 5 layers of convolutional neural networks, firstly, feature deformation is carried out on an input second frame image, then, a feature matching relation between the second frame image after the feature deformation and a first frame image is calculated, namely, a feature matching cost volume is calculated, then, the calculated feature matching cost volume is input into the optical flow estimation network to predict optical flow, and the output optical flow result is subjected to post-processing by utilizing a context network to obtain accurate predicted optical flow;
and step 3: carrying out shielding compensation processing on pixels in a shielding area;
the specific operation of the occlusion compensation processing in the optical flow estimation network is that firstly, difference comparison is carried out on a second frame reconstructed image and a first frame image, and then pixels of an occlusion area are extracted to carry out occlusion compensation processing on the pixels;
the difference contrast processing is to utilize the predicted forward optical flow to distort the second frame image so as to reconstruct the first frame image, and the difference of the reconstructed first frame image and the original first frame image is added into the occlusion image; extracting pixels of the occlusion region is to fill a reconstructed image by extracting pixels of the occlusion region in the original first frame image, and perform occlusion compensation by calculating the difference between the reconstructed image and the original first frame image;
and 4, step 4: designing a loss function of the whole network training; combining the loss function with similarity information processing into an optical flow estimation network, and constructing a loss function of whole network training; defining a function for describing the feature similarity between different image blocks of an input transform model to promote the diversity between different image blocks, and defining a contrast loss function and an error loss function to enable the concerned features of the deep network layer to come from the corresponding input of the shallow layer and be unrelated to the rest shallow layer input; weighting and summing loss items corresponding to the transform model of each pyramid layer, and combining an occlusion compensation loss function as an overall loss function of network training to constrain the training process of the network;
and 5: inputting two continuous frames of images at the input end of the network, and carrying out unsupervised training on the network by using an integral network loss function;
step 6: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
2. The method of claim 1, wherein: the Transformer-based feature pyramid network in step 1 has 12 convolution layers and comprises Transformer models in 6 stages; the Transformer model in each stage has the same architecture, and 6 feature maps of different sizes can be extracted step by step; the numbers of channel features from the first stage to the sixth stage are 16, 32, 64, 96, 128 and 196, respectively;
the first convolutional layer takes a 3 × 512 × 512 feature map as input, with a 3 × 3 convolution kernel and a stride of 1, and outputs a 16 × 512 × 512 feature map; the second convolutional layer takes the 16 × 512 × 512 feature map, with a 3 × 3 kernel and a stride of 2, and outputs a 16 × 256 × 256 feature map that is input to the first-stage Transformer model;
the third convolutional layer takes the 16 × 256 × 256 feature map, with a 3 × 3 kernel and a stride of 1, and outputs a 16 × 256 × 256 feature map; the fourth convolutional layer, with a 3 × 3 kernel and a stride of 2, outputs a 32 × 128 × 128 feature map that is input to the second-stage Transformer model;
the fifth convolutional layer takes the 32 × 128 × 128 feature map, with a 3 × 3 kernel and a stride of 1, and outputs a 32 × 128 × 128 feature map; the sixth convolutional layer, with a 3 × 3 kernel and a stride of 2, outputs a 64 × 64 × 64 feature map that is input to the third-stage Transformer model;
the seventh convolutional layer takes the 64 × 64 × 64 feature map, with a 3 × 3 kernel and a stride of 1, and outputs a 64 × 64 × 64 feature map; the eighth convolutional layer, with a 3 × 3 kernel and a stride of 2, outputs a 96 × 32 × 32 feature map that is input to the fourth-stage Transformer model;
the ninth convolutional layer takes the 96 × 32 × 32 feature map, with a 3 × 3 kernel and a stride of 1, and outputs a 96 × 32 × 32 feature map; the tenth convolutional layer, with a 3 × 3 kernel and a stride of 2, outputs a 128 × 16 × 16 feature map that is input to the fifth-stage Transformer model;
the eleventh convolutional layer takes the 128 × 16 × 16 feature map, with a 3 × 3 kernel and a stride of 1, and outputs a 128 × 16 × 16 feature map; the twelfth convolutional layer, with a 3 × 3 kernel and a stride of 2, outputs a 196 × 8 × 8 feature map that is input to the sixth-stage Transformer model;
the feature map output by each convolutional layer enters the corresponding Transformer model and is then passed to the next convolutional layer of the feature pyramid, so as to enhance the network's ability to extract image features.
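The layer-by-layer shape progression described above can be traced with a short, shape-only sketch (the helper name `pyramid_shapes` is illustrative and not part of the claimed method; only the channel counts and strides come from the claim):

```python
def pyramid_shapes(c=3, h=512, w=512):
    """Trace (channels, height, width) through the 12 convolutional layers.

    Even-positioned layers (1st, 3rd, ...) use stride 1; odd-positioned
    layers (2nd, 4th, ...) use stride 2 and halve the resolution.
    """
    out_channels = [16, 16, 16, 32, 32, 64, 64, 96, 96, 128, 128, 196]
    shapes = []
    for i, oc in enumerate(out_channels):
        if i % 2 == 1:           # stride-2 layer: downsample by 2
            h, w = h // 2, w // 2
        shapes.append((oc, h, w))
    return shapes

# The six stage inputs are the outputs of the stride-2 layers:
# (16,256,256), (32,128,128), (64,64,64), (96,32,32), (128,16,16), (196,8,8)
```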
3. The method of claim 1, wherein: in step 1, the image segmentation module of the Transformer feature pyramid network applies a deformable convolution operation to the input image, so that the number of image sequences input to the convolution-mapping re-attention mechanism module is gradually reduced while the width (feature dimension) of each input sequence is expanded; in this way, as the convolutional layers deepen, the feature resolution gradually decreases and the feature dimension increases; the specific steps are as follows:
step 1.1.1: the image x_i output by the i-th pyramid layer is input into the image segmentation module, edge features are extracted through a deformable convolution operation to obtain an output image x'_i, and batch normalization and max pooling are applied to the deformably convolved image to obtain an output image x''_i:

x''_i = MaxPool(BN(Deconv(x_i)))

wherein x_i is the output image of the i-th layer of the Transformer feature pyramid network, with feature size C_i × H_i × W_i, where H_i × W_i denotes the resolution of the i-th layer image and C_i denotes the number of channels of the i-th layer image; Deconv(·) denotes the deformable convolution operation; BN(·) denotes batch normalization; MaxPool(·) denotes max pooling; x''_i denotes the output image after the max pooling operation;
step 1.1.2: the image x''_i obtained after max pooling is sliced into 2D image blocks of size P × P × C_i, where N denotes the number of image sequences obtained by slicing the output image x''_i, which is also the effective input length of the convolution-mapping re-attention mechanism module, and (P, P) denotes the resolution of each image block after segmentation;
step 1.1.3: the sliced image blocks are flattened into N image sequences, forming a two-dimensional matrix (N, D) with D = P²C;
Step 1.1.4: the converted image sequence is marked through a binary mask, wherein the position 1 represents the image sequence with the important information position, the position 0 represents the image sequence with the similar information, the image sequence is selected to be marked in a self-adaptive mode according to the difference of information contained in the image sequence, the marked image sequence is input into a convolution mapping re-attention mechanism module, the image sequences marked with 1 are subjected to interactive calculation, and the image sequence marked with 0 is only calculated with the image sequence.
4. The method of claim 3, wherein: the convolution-mapping re-attention mechanism module of the Transformer feature pyramid network in step 1 comprises two parts, a multi-head re-attention mechanism operation and a multi-head linear processing operation; the specific steps are as follows:
step 1.2.1: the input i-th layer image sequence z_i is linearly mapped and spatially recombined into a 2D image; three projected features are obtained through three depthwise separable convolution operations and flattened to obtain three different two-dimensional vectors q_i, k_i, v_i of the i-th layer, each defined as follows:

q_i / k_i / v_i = Flatten(Conv2d(Reshape2d(z_i), s))

wherein Reshape2d(·) denotes spatially recombining the image sequence; s denotes the size of the convolution kernel; Conv2d(·) denotes a depthwise separable convolution operation; Flatten(·) denotes flattening the mapped features into a two-dimensional vector; z_i denotes the flattened image sequence of the i-th layer; q_i, k_i, v_i denote the two-dimensional vectors obtained by flattening the i-th layer features after the depthwise separable convolution operation;
step 1.2.2: the two-dimensional vectors mapped from each input image sequence can be expressed in matrix form in the multi-head re-attention mechanism as q = [q_1, q_2, ..., q_N], k = [k_1, k_2, ..., k_N], v = [v_1, v_2, ..., v_N]; as the model depth increases, the attention similarity of the same image sequence across different layers grows larger, while the similarity between the multiple heads of the same layer remains small; a transfer matrix Θ is therefore adopted to combine the information among different heads, and the attention operation is performed again with this combined information; the transfer matrices trained at different layers differ, which avoids attention collapse; the multi-head re-attention mechanism operation is defined as follows:

MSA(z_i) = BN(Θ^T · Softmax(q k^T / √d)) v

wherein q, k, v respectively denote the matrices obtained by convolution mapping of each input image sequence; T denotes matrix transposition; q k^T represents the correlation of two positions; Softmax(·) denotes the normalization of q k^T; d denotes the dimension of q and k; Θ denotes the transfer matrix defined over the multiple attention heads; z_i denotes the image slice sequence flattened by convolution mapping after spatial recombination; MSA(z_i) denotes the multi-head re-attention mechanism operation on the image slice sequence;
step 1.2.3: the input and output of the multi-head re-attention operation are added, and batch normalization is then applied to normalize the activation values of each layer, specifically defined as follows:

z'_i = BN(z_i + MSA(z_i))

wherein MSA(·) denotes the multi-head re-attention mechanism operation; BN(·) denotes batch normalization; z_i denotes the image slice sequence flattened by convolution mapping after spatial recombination; z'_i denotes the image sequence output after batch normalization;
step 1.2.4: the multi-head linear processing operation comprises a feed-forward network and batch normalization; the feed-forward network expands the dimension of each image sequence, its activation function is the GELU nonlinearity, and a residual connection follows each operation to prevent network degradation, specifically defined as follows:

z''_i = BN(z'_i + FFN(z'_i))

wherein FFN(·) denotes the feed-forward network; BN(·) denotes batch normalization; GELU(·) denotes the Gaussian error linear unit activation function used within the feed-forward network; z'_i denotes the image sequence output after batch normalization of the multi-head re-attention operation; z''_i denotes the output of the multi-head linear processing applied to that image sequence;
step 1.2.5: the image sequence output by the multi-head linear processing operation is reduced in dimension by linear mapping, spatially reconstructed into a 2D image, and output to the next feature pyramid layer.
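Steps 1.2.2 and 1.2.3 can be illustrated with a minimal NumPy sketch, assuming the DeepViT-style re-attention in which a learned head-transfer matrix mixes the per-head attention maps before they are applied to v (the function names, and placing normalization outside this sketch, are illustrative assumptions):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def re_attention(q, k, v, theta):
    """q, k, v: (heads, N, d); theta: (heads, heads) transfer matrix.

    theta mixes the per-head attention maps so that different heads
    exchange information, counteracting attention collapse in deep layers.
    """
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    attn = np.einsum('gh,hnm->gnm', theta, attn)  # re-attend across heads
    return attn @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 4, 8)) for _ in range(3))
out = re_attention(q, k, v, np.eye(2))  # identity theta -> plain attention
```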
5. The method of claim 1, wherein: in step 2, the input of the optical flow estimation network is two continuous frames of images and the output is the corresponding estimated optical flow; the specific steps are as follows:
step 2.1: for the input continuous frames I_1 and I_2, feature warping is applied to the second image I_2; the optical flow estimated at the (i-1)-th layer of the Transformer feature pyramid network is upsampled by a factor of 2, and the features of the second image are warped toward the first image by bilinear interpolation; the feature warping operation is defined as follows:

F^i_w(x) = F^i_2(x + up_2(f^{i-1})(x))

wherein x denotes a pixel position; up_2(f^{i-1}) denotes the 2× upsampling of the optical flow result obtained by the (i-1)-th layer optical flow estimation network; F^i_2 denotes the i-th layer image features of the second image in the pyramid; F^i_w denotes the image features after feature warping;
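The bilinear feature warping of step 2.1 can be sketched in NumPy as follows (a simplified, border-clamped version; the function name is illustrative):

```python
import numpy as np

def warp_bilinear(feat, flow):
    """Warp feat (C, H, W) by flow (2, H, W): sample feat at x + flow(x)
    with bilinear interpolation, clamping samples to the image border."""
    C, H, W = feat.shape
    gy, gx = np.mgrid[0:H, 0:W]
    x = np.clip(gx + flow[0], 0, W - 1)
    y = np.clip(gy + flow[1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return (feat[:, y0, x0] * (1 - wx) * (1 - wy)
            + feat[:, y0, x1] * wx * (1 - wy)
            + feat[:, y1, x0] * (1 - wx) * wy
            + feat[:, y1, x1] * wx * wy)

feat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
same = warp_bilinear(feat, np.zeros((2, 3, 4)))  # zero flow is the identity
```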
step 2.2: the correlation between the warped second image features and the first image features is computed, i.e. the feature matching cost volume of each layer, whose calculation is defined as follows:

CV^i(x_1, x_2) = (1/M) (F^i_1(x_1))^T F^i_w(x_2)

wherein x_1, x_2 denote pixel positions in the first and second images, respectively; F^i_1 denotes the feature map of the first image at the i-th layer of the Transformer feature pyramid network; F^i_w denotes the warped features of the second image at the i-th layer of the Transformer feature pyramid network; M denotes the length of the feature map; T denotes matrix transposition; CV^i(x_1, x_2) denotes the feature matching cost volume result at the i-th layer of the Transformer feature pyramid network;
step 2.3: the feature matching cost volume result, the features of the first image and the upsampled high-resolution optical flow are input into the i-th optical flow estimation layer of the optical flow estimation network to obtain the optical flow estimation result of this layer.
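The cost volume of step 2.2 is a channel-normalized inner product between the first image's features and the warped second image's features; a NumPy sketch over a small displacement window (the search radius r is an assumption, since the claim defines only the per-pair correlation):

```python
import numpy as np

def cost_volume(f1, f2w, r=1):
    """f1, f2w: (C, H, W).  Returns ((2r+1)^2, H, W): for each displacement
    d in a (2r+1)x(2r+1) window, CV_d(x) = f1(x)^T f2w(x+d) / C."""
    C, H, W = f1.shape
    pad = np.pad(f2w, ((0, 0), (r, r), (r, r)))  # zero-pad the warped map
    vols = [(f1 * pad[:, dy:dy + H, dx:dx + W]).sum(0) / C
            for dy in range(2 * r + 1) for dx in range(2 * r + 1)]
    return np.stack(vols)

cv = cost_volume(np.ones((4, 3, 3)), np.ones((4, 3, 3)))
```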
6. The method of claim 5, wherein: in step 2.3, the optical flow estimation layer of the optical flow estimation network uses five convolutional layers with channel numbers of 128, 96, 64 and 32, and the output optical flow result is post-processed by a context network; the context network is a feed-forward convolutional neural network based on dilated convolutions, consisting of 7 convolutional layers with 3 × 3 kernels; the layers have different dilation coefficients, where a dilation coefficient k means that an input unit of a filter in that layer is k units away, both vertically and horizontally, from the other input units of the same filter; from top to bottom, the dilation coefficients of the convolutional layers are 1, 2, 4, 8, 16, 1 and 1 in order; convolutional layers with large dilation coefficients effectively enlarge the receptive field of each output unit on the pyramid and output a refined optical flow.
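The receptive-field growth produced by the dilation coefficients 1, 2, 4, 8, 16, 1, 1 can be checked with a short calculation (each stride-1 dilated 3 × 3 convolution enlarges the receptive field by 2·d pixels; the helper name is illustrative):

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of a stack of stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d   # each layer adds (k - 1) * dilation
    return rf

# dilation coefficients of the 7-layer context network, top to bottom
rf = receptive_field([1, 2, 4, 8, 16, 1, 1])   # 67 x 67 pixels
```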
7. The method of claim 1, wherein the occlusion compensation processing of the pixels in the occlusion region in step 3 specifically comprises:
performing difference comparison between the predicted optical flow image and the original image, and then extracting the pixels of the occlusion region for occlusion compensation; the difference contrast processing warps the second frame image I_2 with the predicted forward optical flow to synthesize the reconstructed first frame image I'_1, and the difference between the reconstructed image I'_1 and the original image I_1 is added to the occlusion map, defined as follows:

L^i_diff = Σ_x (1 − O_fw(x)) · σ(I_1(x), I'_1(x))

wherein x denotes a pixel position; O_fw denotes the occlusion map; I_1(x) and I'_1(x) respectively denote the original first frame image and the reconstructed first frame image; σ(·) denotes a pixel-level similarity measure used to compute the similarity between the original image I_1(x) and the reconstructed image I'_1(x); L^i_diff denotes the difference contrast loss in the i-th layer optical flow estimation network;
the occlusion-region pixel extraction fills the image I'_1 with the corresponding occlusion-region pixels of the original image I_1 to obtain the reconstructed image I''_1, and the loss in the occlusion region is calculated from the difference between I''_1 and I_1, defined as follows:

L^i_occ = Σ_x σ(I_1(x), I''_1(x))

wherein x denotes a pixel position; I''_1(x) denotes the reconstructed pixels obtained by adding the corresponding occluded pixels to the reconstructed first image I'_1(x); σ(·) denotes the pixel-level similarity measure; L^i_occ denotes the loss in the occlusion region of the i-th layer optical flow estimation network;
the occlusion compensation loss function L^i_o of the final i-th layer optical flow estimation network is obtained by summing the two loss functions, defined as follows:

L^i_o = L^i_diff + L^i_occ

wherein L^i_diff denotes the difference contrast loss in the i-th layer optical flow estimation network; L^i_occ denotes the loss in the occlusion region of the i-th layer optical flow estimation network; L^i_o denotes the occlusion compensation loss function in the i-th layer optical flow estimation network.
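A hedged NumPy sketch of the two occlusion losses of claim 7, taking the mean absolute difference as the pixel-level measure σ and a boolean occlusion map; the exact measure and the weighting by the occlusion map are assumptions, since the claim gives the formulas only symbolically:

```python
import numpy as np

def occlusion_compensation_loss(I1, I1_rec, occ):
    """I1, I1_rec: (H, W) original / reconstructed first frame.
    occ: boolean (H, W) occlusion map O_fw (True where occluded)."""
    # difference contrast: photometric difference outside the occlusion,
    # using |.| as an assumed pixel-level measure sigma
    l_diff = np.abs((I1 - I1_rec) * ~occ).mean()
    # fill occluded pixels of the reconstruction from the original image
    I1_fill = np.where(occ, I1, I1_rec)
    # occlusion-region loss: difference between filled image and original
    l_occ = np.abs(I1 - I1_fill).mean()
    return l_diff + l_occ   # L_o = L_diff + L_occ
```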
8. The method of claim 1, wherein the loss function of the overall network training in step 4 is as follows: let the features of the input image x and its corresponding N image sequences at the i-th layer be x^i_1, ..., x^i_N, and take the feature similarity function L^i_sim as a penalty term in the loss function to promote diversity between different image sequences, defined as follows:

L^i_sim = (1 / (N(N−1))) Σ_{n≠m} (x_n^T x_m) / (‖x_n‖_2 ‖x_m‖_2)

wherein N denotes the total number of image segmentations; ‖·‖_2 denotes the L2 norm operation; T denotes vector transposition; x_n, x_m respectively denote the features of different image slice sequences of the same layer; L^i_sim denotes the feature similarity loss function measuring the similarity between different image sequences of the i-th layer;
the i-th layer contrast loss function L^i_con is defined; the features learned in the shallow layers of the Transformer model are more diverse than those learned in the deep layers, and the contrast loss function regularizes the deep features with the features learned in the shallow layers so as to reduce the similarity between different image sequences and increase the feature diversity of the image sequences in the deep network, defined as follows:

L^i_con = (1 / (N(N−1))) Σ_n Σ_{m≠n} (x^1_m)^T x^i_n

wherein x^1_n denotes the n-th image sequence feature of the first layer; x^i_n denotes the i-th layer feature corresponding to the n-th image sequence; T indicates that the feature vector is transposed; N denotes the number of image sequences; L^i_con denotes the image sequence contrast loss function of the i-th layer;
the i-th layer error loss function L^i_err is defined; each deep image sequence in the Transformer model should focus only on the input of the corresponding image sequence from the shallow layer, while ignoring the remaining irrelevant image sequence features, defined as follows:

L^i_err = (1/N) Σ_n (1 − (x^1_n)^T x^i_n)

wherein x^1_n denotes the n-th image sequence feature of the first layer; x^i_n denotes the i-th layer feature corresponding to the n-th image sequence; N denotes the number of image sequences; L^i_err denotes the error loss function of the i-th layer image sequences;
the loss function of the overall network training is obtained as the weighted sum of the loss functions of the six-stage Transformer models together with the photometric loss function, with the formula:

L_final = Σ_{i=1}^{6} (λ_1 L^i_sim + λ_2 L^i_con + λ_3 L^i_err + L^i_o)

wherein λ_1, λ_2, λ_3 denote the constraint balance factors weighting the loss functions of the Transformer model at different pyramid scales; the higher the resolution, the larger the role of the corresponding loss function in network training and the higher the weight coefficient; L^i_sim, L^i_con, L^i_err respectively denote the feature similarity loss function, the contrast loss function and the error loss function of the i-th layer Transformer model; L^i_o denotes the occlusion compensation loss function in the i-th layer optical flow estimation network; the finally calculated L_final serves as the loss function of the final overall network training.
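The three diversity losses of claim 8 can be illustrated with cosine-similarity-based forms (an assumption: the claim fixes only the use of the L2 norm and vector transposition, so the exact normalization chosen here is illustrative):

```python
import numpy as np

def _cos(a, b):
    """Cosine similarity along the last axis (broadcastable)."""
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1)
                              * np.linalg.norm(b, axis=-1))

def similarity_loss(Xi):
    """Mean pairwise similarity between the N sequence features of one layer."""
    N = Xi.shape[0]
    S = _cos(Xi[:, None, :], Xi[None, :, :])
    return (np.abs(S).sum() - N) / (N * (N - 1))    # off-diagonal mean

def contrast_loss(X1, Xi):
    """Deep feature n should be dissimilar to the other shallow features m != n."""
    N = X1.shape[0]
    S = _cos(Xi[:, None, :], X1[None, :, :])
    return (np.abs(S).sum() - np.abs(np.diag(S)).sum()) / (N * (N - 1))

def error_loss(X1, Xi):
    """Deep feature n should stay aligned with its own shallow input n."""
    return float(np.mean(1.0 - _cos(X1, Xi)))
```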
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111506127.7A CN114187331A (en) | 2021-12-10 | 2021-12-10 | Unsupervised optical flow estimation method based on Transformer feature pyramid network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114187331A true CN114187331A (en) | 2022-03-15 |
Family
ID=80543042
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111506127.7A Pending CN114187331A (en) | 2021-12-10 | 2021-12-10 | Unsupervised optical flow estimation method based on Transformer feature pyramid network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114187331A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582483A (en) * | 2020-05-14 | 2020-08-25 | 哈尔滨工程大学 | Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism |
AU2020103901A4 (en) * | 2020-12-04 | 2021-02-11 | Chongqing Normal University | Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field |
CN112465872A (en) * | 2020-12-10 | 2021-03-09 | 南昌航空大学 | Image sequence optical flow estimation method based on learnable occlusion mask and secondary deformation optimization |
CN113538527A (en) * | 2021-07-08 | 2021-10-22 | 上海工程技术大学 | Efficient lightweight optical flow estimation method |
CN113706526A (en) * | 2021-10-26 | 2021-11-26 | 北京字节跳动网络技术有限公司 | Training method and device for endoscope image feature learning model and classification model |
Non-Patent Citations (2)
Title |
---|
YANG, B (YANG, BO) [1] ; XIE, H (XIE, HUAN) [1] ; LI, HB (LI, HONGBIN) [2] ; LI, NH (LI, NUOHAN) [3] ; LIU, AC (LIU, ANCHANG) [2] : "Unsupervised Optical Flow Estimation Based on Improved Feature Pyramid", 《 NEURAL PROCESSING LETTERS》, no. 52, 14 August 2020 (2020-08-14), pages 1601 - 1612, XP037257628, DOI: 10.1007/s11063-020-10328-2 * |
刘香凝;赵洋;王荣刚;: "基于自注意力机制的多阶段无监督单目深度估计网络", 信号处理, no. 09 * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114677412A (en) * | 2022-03-18 | 2022-06-28 | 苏州大学 | Method, device and equipment for estimating optical flow |
CN115018888A (en) * | 2022-07-04 | 2022-09-06 | 东南大学 | Optical flow unsupervised estimation method based on Transformer |
CN115018888B (en) * | 2022-07-04 | 2024-08-06 | 东南大学 | Optical flow unsupervised estimation method based on transducer |
CN115761594A (en) * | 2022-11-28 | 2023-03-07 | 南昌航空大学 | Optical flow calculation method based on global and local coupling |
CN115761594B (en) * | 2022-11-28 | 2023-07-21 | 南昌航空大学 | Optical flow calculation method based on global and local coupling |
CN115719368B (en) * | 2022-11-29 | 2024-05-17 | 上海船舶运输科学研究所有限公司 | Multi-target ship tracking method and system |
CN115719368A (en) * | 2022-11-29 | 2023-02-28 | 上海船舶运输科学研究所有限公司 | Multi-target ship tracking method and system |
WO2024174804A1 (en) * | 2023-02-21 | 2024-08-29 | 浙江阿里巴巴机器人有限公司 | Service providing method, device, and storage medium |
CN115880567A (en) * | 2023-03-03 | 2023-03-31 | 深圳精智达技术股份有限公司 | Self-attention calculation method and device, electronic equipment and storage medium |
CN116740414A (en) * | 2023-05-15 | 2023-09-12 | 中国科学院自动化研究所 | Image recognition method, device, electronic equipment and storage medium |
CN116740414B (en) * | 2023-05-15 | 2024-03-01 | 中国科学院自动化研究所 | Image recognition method, device, electronic equipment and storage medium |
CN116739919B (en) * | 2023-05-22 | 2024-08-02 | 武汉大学 | Method and system for detecting and repairing solar flicker in optical ocean image of unmanned plane |
CN116739919A (en) * | 2023-05-22 | 2023-09-12 | 武汉大学 | Method and system for detecting and repairing solar flicker in optical ocean image of unmanned plane |
CN116630324B (en) * | 2023-07-25 | 2023-10-13 | 吉林大学 | Method for automatically evaluating adenoid hypertrophy by MRI (magnetic resonance imaging) image based on deep learning |
CN116630324A (en) * | 2023-07-25 | 2023-08-22 | 吉林大学 | Method for automatically evaluating adenoid hypertrophy by MRI (magnetic resonance imaging) image based on deep learning |
CN117877099A (en) * | 2024-03-11 | 2024-04-12 | 南京信息工程大学 | Non-supervision contrast remote physiological measurement method based on space-time characteristic enhancement |
CN117877099B (en) * | 2024-03-11 | 2024-05-14 | 南京信息工程大学 | Non-supervision contrast remote physiological measurement method based on space-time characteristic enhancement |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114187331A (en) | Unsupervised optical flow estimation method based on Transformer feature pyramid network | |
Tu et al. | Maxim: Multi-axis mlp for image processing | |
CN113361560B (en) | Semantic-based multi-pose virtual fitting method | |
CN113743269B (en) | Method for recognizing human body gesture of video in lightweight manner | |
CN112131959B (en) | 2D human body posture estimation method based on multi-scale feature reinforcement | |
CN109756690A (en) | Lightweight view interpolation method based on feature rank light stream | |
CN112837224A (en) | Super-resolution image reconstruction method based on convolutional neural network | |
CN112232134B (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
CN113554032B (en) | Remote sensing image segmentation method based on multi-path parallel network of high perception | |
CN113870335A (en) | Monocular depth estimation method based on multi-scale feature fusion | |
CN115018888B (en) | Optical flow unsupervised estimation method based on transducer | |
CN113792641A (en) | High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism | |
CN111696038A (en) | Image super-resolution method, device, equipment and computer-readable storage medium | |
CN114724155A (en) | Scene text detection method, system and equipment based on deep convolutional neural network | |
Liu et al. | An efficient residual learning neural network for hyperspectral image superresolution | |
CN116612288B (en) | Multi-scale lightweight real-time semantic segmentation method and system | |
Li et al. | Model-informed Multi-stage Unsupervised Network for Hyperspectral Image Super-resolution | |
CN118134952B (en) | Medical image segmentation method based on feature interaction | |
CN115641285A (en) | Binocular vision stereo matching method based on dense multi-scale information fusion | |
CN116071748A (en) | Unsupervised video target segmentation method based on frequency domain global filtering | |
Chen et al. | PDWN: Pyramid deformable warping network for video interpolation | |
CN109934283A (en) | A kind of adaptive motion object detection method merging CNN and SIFT light stream | |
CN117710429A (en) | Improved lightweight monocular depth estimation method integrating CNN and transducer | |
Luo et al. | Dual-Window Multiscale Transformer for Hyperspectral Snapshot Compressive Imaging | |
CN115731280A (en) | Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||