CN114187331A - Unsupervised optical flow estimation method based on Transformer feature pyramid network - Google Patents

Unsupervised optical flow estimation method based on Transformer feature pyramid network Download PDF

Info

Publication number
CN114187331A
CN114187331A CN202111506127.7A CN202111506127A CN114187331A CN 114187331 A CN114187331 A CN 114187331A CN 202111506127 A CN202111506127 A CN 202111506127A CN 114187331 A CN114187331 A CN 114187331A
Authority
CN
China
Prior art keywords
image
layer
network
feature
optical flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111506127.7A
Other languages
Chinese (zh)
Inventor
项学智
杨洁
乔玉龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202111506127.7A priority Critical patent/CN114187331A/en
Publication of CN114187331A publication Critical patent/CN114187331A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision, and particularly relates to an unsupervised optical flow estimation method based on a Transformer characteristic pyramid network, which comprises the steps of constructing the Transformer characteristic pyramid network; enhancing the feature extraction capability of the feature pyramid network on the image by means of a Transformer model through a re-attention mechanism operation; constructing an optical flow estimation network to enable the network to perform optical flow prediction; and carrying out shielding compensation processing on pixels in a shielding area, and designing a loss function of integral network training to carry out unsupervised training on the network to obtain an unsupervised optical flow estimation model with higher speed and higher precision. The method can enhance the feature extraction capability of the feature pyramid layer on the image, and carry out occlusion compensation processing on occlusion pixels in the image so as to improve the precision of optical flow estimation.

Description

Unsupervised optical flow estimation method based on Transformer feature pyramid network
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an unsupervised optical flow estimation method based on a transform feature pyramid network.
Background
Optical flow estimation is an important research direction of computer vision application, and has a very wide application prospect in the fields of automatic driving, intelligent robots, motion and expression recognition, target tracking and the like. With the development of deep learning, many researchers choose to use deep learning techniques to deal with the optical flow estimation problem, and the method has the advantages of high running speed, high precision and the like, and achieves leading results on a plurality of common data sets. However, the method of processing the optical flow estimation problem by using only the convolutional neural network has some problems to be solved so far, such as: the occlusion problem, the small target detection, the convolution neural network cannot capture global feature information and the like, and the problem can be effectively solved by fusing the Transformer model and the feature pyramid network.
The Transformer is a special self-attention mechanism model, and in recent years, the computational efficiency and the expandability of the Transformer model make the Transformer model widely applied to computer vision. The computer vision field to which the Transformer model is expanded at present mainly comprises image classification, image recognition, target detection, semantic segmentation, image generation and the like.
Disclosure of Invention
The invention aims to provide an unsupervised optical flow estimation method based on a Transformer feature pyramid network, which enhances the feature extraction capability of a feature pyramid layer on an image, and performs occlusion compensation processing on occluded pixels in the image so as to improve the accuracy of optical flow estimation.
An unsupervised optical flow estimation method based on a Transformer feature pyramid network comprises the following steps:
step 1: constructing a characteristic pyramid network based on a Transformer;
the pyramid network based on the Transformer characteristics has two identical network branches, and has 12 layers in total, so that 6 image characteristics can be extracted, and the network of each branch shares the weight; fusing a Transformer model in a second layer, a fourth layer, a sixth layer, an eighth layer, a tenth layer and a last layer based on a Transformer characteristic pyramid network to enhance the characteristic extraction capability of the network, wherein the model consists of an image segmentation module and a convolution mapping re-attention mechanism module; the image segmentation module firstly carries out deformable convolution operation on an input image to extract local features such as edges and the like, then segments the image into sequences and inputs the sequences into the convolution mapping and attention re-paying mechanism module; a convolution mapping re-attention mechanism module performs re-attention mechanism operation on the image sequence to extract global features;
step 2: constructing an optical flow estimation network; inputting image features extracted from each layer in the Transformer feature pyramid network into an optical flow estimation network for prediction;
the optical flow estimation network is composed of 5 layers of convolutional neural networks, firstly, feature deformation is carried out on an input second frame image, then, a feature matching relation between the second frame image after the feature deformation and a first frame image is calculated, namely, a feature matching cost volume is calculated, then, the calculated feature matching cost volume is input into the optical flow estimation network to predict optical flow, and the output optical flow result is subjected to post-processing by utilizing a context network to obtain accurate predicted optical flow;
and step 3: carrying out shielding compensation processing on pixels in a shielding area;
the specific operation of the occlusion compensation processing in the optical flow estimation network is that firstly, difference comparison is carried out on a second frame reconstructed image and a first frame image, and then pixels of an occlusion area are extracted to carry out occlusion compensation processing on the pixels;
the difference contrast processing is to utilize the predicted forward optical flow to distort the second frame image so as to reconstruct the first frame image, and the difference of the reconstructed first frame image and the original first frame image is added into the occlusion image; extracting pixels of the occlusion region is to fill a reconstructed image by extracting pixels of the occlusion region in the original first frame image, and perform occlusion compensation by calculating the difference between the reconstructed image and the original first frame image;
and 4, step 4: designing a loss function of the whole network training; combining the loss function with similarity information processing into an optical flow estimation network, and constructing a loss function of whole network training; defining a function for describing the feature similarity between different image blocks of an input transform model to promote the diversity between different image blocks, and defining a contrast loss function and an error loss function to enable the concerned features of the deep network layer to come from the corresponding input of the shallow layer and be unrelated to the rest shallow layer input; weighting and summing loss items corresponding to the transform model of each pyramid layer, and combining an occlusion compensation loss function as an overall loss function of network training to constrain the training process of the network;
and 5: inputting two continuous frames of images at the input end of the network, and carrying out unsupervised training on the network by using an integral network loss function;
step 6: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
Further, in the step 1, there are 12 convolutional layers based on the Transformer feature pyramid network, including 6 stages of Transformer models, where the Transformer models in each stage have the same architecture, and can gradually extract 6 feature maps with different sizes; the number of channel features from the first stage to the sixth stage is 16, 32, 64, 96, 128, 196, respectively;
inputting a 3 × 512 × 512 feature map into the first layer of convolutional layer, outputting a 16 × 512 × 512 feature map with a convolutional kernel size of 3 × 3 and a step size of 1; inputting a 16 × 512 × 512 feature map into the second layer of convolution layer, outputting a 16 × 256 × 256 feature map with a convolution kernel size of 3 × 3 and a step size of 2, and inputting the feature map into the transform model in the first stage;
inputting a 16 × 256 × 256 feature map into the third layer of convolutional layer, outputting a 16 × 256 × 256 feature map with a convolutional kernel size of 3 × 3 and a step size of 1; inputting a 16 × 256 × 256 feature map into the fourth layer of convolution layer, outputting a 32 × 128 × 128 feature map into the transform model at the second stage, wherein the convolution kernel size is 3 × 3, the step size is 2;
inputting a 32 × 128 × 128 feature map into the fifth convolutional layer, wherein the size of a convolutional kernel is 3 × 3, the step size is 1, and outputting the 32 × 128 × 128 feature map; inputting a 32 × 128 × 128 feature map into the sixth layer of convolution layer, the convolution kernel size is 3 × 3, the step size is 2, outputting a 64 × 64 × 64 feature map and inputting the feature map into the transform model of the third stage;
inputting a feature map of 64 multiplied by 64 to the seventh convolutional layer, wherein the size of a convolution kernel is 3 multiplied by 3, the step size is 1, and outputting a feature map of 64 multiplied by 64; the eighth layer convolution inputs a feature map of 64 multiplied by 64, the convolution kernel size is 3 multiplied by 3, the step size is 2, and the feature map of 96 multiplied by 32 is output and input to a transform model in the fourth stage;
inputting a 96 × 32 × 32 feature map into the ninth convolutional layer, outputting a 96 × 32 × 32 feature map with a convolutional kernel size of 3 × 3 and a step size of 1; inputting a 96 × 32 × 32 feature map into the tenth convolutional layer, the convolutional kernel size is 3 × 3, the step size is 2, outputting a 128 × 16 × 16 feature map, and inputting the feature map into the transform model in the fifth stage;
inputting a 128 × 16 × 16 feature map into the eleventh convolutional layer, outputting a 128 × 16 × 16 feature map with a convolutional kernel size of 3 × 3 and a step size of 1; inputting a 128 × 16 × 16 feature map into the twelfth layer of convolution layer, the convolution kernel size is 3 × 3, the step size is 2, and outputting a 196 × 8 × 8 feature map to be input into the transform model in the sixth stage;
and the feature graph output by the convolution layer enters a Transformer model and then is output to the next convolution layer of the feature pyramid so as to enhance the feature extraction capability of the network on the image.
Further, in step 1, the image segmentation module based on the transform feature pyramid network performs a deformable convolution operation on the input image, so that the number of image sequences in the input convolution mapping and attention mechanism module is gradually reduced, the feature resolution is reduced, the width of the input sequence is expanded, the feature size is increased, and the purpose of gradually reducing the feature resolution and increasing the feature size along with the deepening of the convolution layer number is achieved, and the specific steps are as follows:
step 1.1.1: outputting the ith layer pyramid as an image xiInputting the image into an image segmentation module, extracting edge features through a deformable convolution operation to obtain an output image x'iCarrying out batch normalization processing and maximum pooling on the image after the deformable convolution operation to obtain an output image x ″)i
x″i=MaxPool(BN(Deconv(xi)))
Wherein x isiOutputting an image for the ith pyramid based on the Transformer feature pyramid network, wherein the feature size of the output image is
Figure BDA0003404461250000031
Hi×WiResolution, C, representing the image of the i-th layeriRepresenting the number of channels of the ith layer image; deconv (·) represents a deformable convolution operation; BN (-) represents batch normalization processing; MaxPool (. cndot.) denotes maximum pooling; x ″)iRepresenting the output image after the maximum pooling operation;
step 1.1.2: after maximum pooling an image x ″, is obtainediImage Block x 'sliced into 2D'iEach image block x'iHas the dimension of
Figure BDA0003404461250000032
N denotes the output image x ″iThe number of the image sequences after segmentation is also the effective length of the input convolution mapping re-attention mechanism module, and
Figure BDA0003404461250000033
(P, P) represents the resolution of each image block after segmentation;
step 1.1.3: flattening the sliced image blocks into N image sequences
Figure BDA0003404461250000034
Each image sequence is a two-dimensional matrix (N, D), D ═ P2C;
Step 1.1.4: the converted image sequence is marked through a binary mask, wherein the position 1 represents the image sequence with the important information position, the position 0 represents the image sequence with the similar information, the image sequence is selected to be marked in a self-adaptive mode according to the difference of information contained in the image sequence, the marked image sequence is input into a convolution mapping re-attention mechanism module, the image sequences marked with 1 are subjected to interactive calculation, and the image sequence marked with 0 is only calculated with the image sequence.
Further, the convolutional mapping re-attention mechanism module based on the transform feature pyramid network in step 1 includes two parts, namely a multi-head re-attention mechanism operation and a multi-head linear processing, and specifically includes the following steps:
step 1.2.1: the ith layer image sequence to be input
Figure BDA0003404461250000041
Linearly mapping the three-dimensional space to form a 2D image, obtaining three projected features through three depth separable convolution operations, flattening the three projected features to obtain three different two-dimensional vectors of the ith layer
Figure BDA0003404461250000042
Each vector is defined as follows:
Figure BDA0003404461250000043
wherein Reshape2d (·) represents spatially recombining the image sequence; s represents the size of the convolution kernel; conv2d (·) represents a depth separable convolution operation; flatting after the Flatten (·) represents mapping to obtain a two-dimensional vector;
Figure BDA0003404461250000044
representing an i-th layer flattened image sequence;
Figure BDA0003404461250000045
representing a two-dimensional vector obtained by flattening the ith layer after the depth separable convolution operation;
step 1.2.2: the obtained two-dimensional vector
Figure BDA0003404461250000046
The two-dimensional vector mapped by each input image sequence can be expressed as q ═ q in a matrix in a multi-head re-attention mechanism1,q2,...,qN],k=[k1,k2,...,kN],v=[v1,v2,...,vN](ii) a With the continuous deepening of the model depth, the attention similarity of the same image sequence in different layers is larger, but the similarity between multiple heads from the same layer is smaller, a transfer matrix is adopted to combine information between different heads, the information is utilized to carry out attention operation again, the transfer matrices trained by different layers are different, and therefore attention collapse is avoided; the multi-head re-attention mechanism operation is defined as follows:
Figure BDA0003404461250000047
wherein q, k, v respectively represent a matrix obtained by convolution mapping of each input image sequence; t represents a transpose operation of the matrix; q.kTRepresenting the correlation of two locations; softmax (·) denotes the pair q.kTCarrying out normalization operation; d represents the dimensions of q and k; θ represents the dimension of attention on multiple heads;
Figure BDA0003404461250000048
representing the image slice sequence which is subjected to convolution mapping flattening after spatial recombination;
Figure BDA0003404461250000049
representing a multi-headed re-attention mechanism operation on the sequence of image slices;
step 1.2.3: adding the input and output of the multi-head re-attention mechanism operation, and then performing batch normalization processing to realize the normalization of the activation value of each layer, wherein the specific definition is as follows:
Figure BDA0003404461250000051
wherein MSA (-) represents a multi-headed re-attention mechanism operation; BN (-) represents batch normalization processing;
Figure BDA0003404461250000052
representing the image slice sequence which is subjected to convolution mapping flattening after spatial recombination;
Figure BDA0003404461250000053
representing the image sequence output after batch normalization processing;
step 1.2.4: the multi-head linear processing operation comprises a feedforward network and batch normalization processing, the feedforward network carries out dimension expansion on each image sequence, an activation function of the batch normalization processing is GELU nonlinear activation, residual connection is carried out after each operation to prevent network degradation, and the operation is specifically defined as follows:
Figure BDA0003404461250000054
wherein FFN (-) represents a feed-forward network; BN (-) represents batch normalization processing; GELU (. circle.) represents a Gaussian error linear unit activation function;
Figure BDA0003404461250000055
representing an image sequence output after multi-head re-attention operation batch normalization processing;
Figure BDA0003404461250000056
representing that multi-head linear processing is carried out on the output image after multi-head re-attention operation;
step 1.2.5: and performing linear mapping dimensionality reduction on the image sequence output by the multi-head linear processing operation, performing spatial reconstruction to form a 2D image, and outputting the 2D image to a next characteristic pyramid layer.
Further, in step 2, the input of the optical flow estimation network is two consecutive frames of images, and the output is a corresponding estimated optical flow, specifically comprising the steps of:
step 2.1: for input two continuous frames of images I1And I2For the second image I2Performing feature deformation, performing 2-time upsampling on the estimated optical flow from the i-1 layer based on the Transformer feature pyramid network, and performing bilinear interpolation deformation on the features of the second graph to the first graph, wherein the operation of feature deformation is defined as follows:
Figure BDA0003404461250000057
wherein x represents a pixel value;
Figure BDA0003404461250000058
representing that 2 times of upsampling is carried out on an optical flow result obtained by an i-1 layer optical flow estimation network;
Figure BDA0003404461250000059
representing the ith layer image characteristic of the second image of the pyramid;
Figure BDA00034044612500000510
representing the image features after feature deformation;
step 2.2: calculating the correlation between the second image feature and the first image feature after the feature deformation, namely calculating the feature matching cost volume of each layer, wherein the calculation of the feature matching cost volume is defined as follows:
Figure BDA00034044612500000511
wherein x is1,x2Pixel values representing the first and second images, respectively;
Figure BDA00034044612500000512
a feature map representing a first image on an i-th layer based on a transform feature pyramid network;
Figure BDA00034044612500000513
representing the image characteristics after characteristic deformation is carried out on the second image on the ith layer based on the Transformer characteristic pyramid network; m represents the length of the feature map; t represents a transpose operation of the matrix; CV ofi(x1,x2) Representing a feature map matching cost volume result of the ith layer based on the Transformer feature pyramid network;
step 2.3: and inputting the feature matching cost volume result, the features of the first graph and the up-sampled high-resolution optical flow into an i-th optical flow estimation layer of the optical flow estimation network to obtain an optical flow estimation result of the layer.
Further, in the step 2.3, five convolutional networks are used for optical flow estimation layers of the optical flow estimation network, the number of channels of each convolutional layer is 128, 96, 64 and 32, and the output optical flow results are post-processed through a context network; the context network is a feedforward convolutional neural network, is based on an expansion convolution design and consists of 7 layers of convolutional networks, the convolution kernel of each convolutional layer is 3 multiplied by 3, the layers have different expansion coefficients, and the convolution expansion coefficient k represents that the input unit of the filter in the layer is k units away from other input units of the filter in the layer in the vertical and horizontal directions; from top to bottom, the expansion coefficients of the convolutional layers are 1,2,4,8,16,1 and 1 in order, and the convolutional layers with large expansion coefficients can effectively enlarge the size of the receptive field of each output unit on the pyramid to output a refined optical flow.
Further, in the step 3: the method for carrying out shielding compensation processing on the pixels in the shielding area specifically comprises the following steps:
carrying out difference comparison on the predicted optical flow image and the original image, and then extracting pixels of an occlusion area to carry out occlusion compensation processing on the pixels; the contrast processing is to warp the second frame image I by using the predicted forward optical flow2Thereby synthesizing a first frame reconstructed image I1', reconstructing the image I1' with original image I1The differences are added to the occlusion map, defined as follows:
Figure BDA0003404461250000061
Wherein x represents a pixel value; ofwRepresenting an occlusion map; i is1(x) And l'1(x) Respectively representing a first frame original image and a second frame reconstructed image; sigma (-) represents a pixel-level similarity measure for computing the original image I1(x) And reconstructed image I'1(x) The similarity between them;
Figure BDA0003404461250000062
estimating a difference contrast loss in the network for the ith layer optical flow;
the extraction of the pixels of the occlusion region is performed by using the original image I1Corresponding pixel of the occlusion region in (1) fills in the image I'1Obtaining a reconstructed image I1By reconstructing the image I ″)1And I1The difference between to calculate the loss in occlusion region is defined as follows:
Figure BDA0003404461250000063
wherein x represents a pixel value; i ″)1(x) Is represented in the reconstructed first picture I'1(x) Adding a reconstruction pixel obtained by corresponding shielding pixel; σ (-) represents a pixel-level similarity metric;
Figure BDA0003404461250000064
estimating losses in occlusion regions in the network for the ith layer of optical flow;
occlusion compensation loss function in final i-th layer optical flow estimation network
Figure BDA0003404461250000065
Is obtained by summing two loss functions, and is defined as follows:
Figure BDA0003404461250000071
wherein,
Figure BDA0003404461250000072
representing the difference contrast loss in the ith layer optical flow estimation network;
Figure BDA0003404461250000073
representing the loss in the occlusion region in the ith layer optical flow estimation network;
Figure BDA0003404461250000074
and representing an occlusion compensation loss function in the ith layer optical flow estimation network.
Further, the loss function of the overall network training in step 4 is: let the characteristics of the input image x and its corresponding N image sequences of the i-th layer be
Figure BDA0003404461250000075
And the characteristics are similar to functions
Figure BDA0003404461250000076
As a penalty term in the loss function to promote diversity between different image sequences, is defined as follows:
Figure BDA0003404461250000077
wherein N represents the total number of image segmentations; i | · | purple wind2Represents L2Performing norm operation; t represents a transposition operation of the vector; x is the number ofn,xmRespectively representing the characteristics of different image slice sequences of the same layer;
Figure BDA0003404461250000078
a characteristic similarity loss function representing the similarity between different image sequences of the ith layer;
defining an ith layer contrast loss function
Figure BDA0003404461250000079
The features learned in shallow layers of the Transformer model are more diversified than those learned in deep layers, and the contrast loss function can use the features learned in shallow layers to regularize the deep features so as to reduce the similarity between different image sequences and increase the feature diversity of the image sequences in a deep network, and is defined as follows:
Figure BDA00034044612500000710
wherein,
Figure BDA00034044612500000711
an nth image sequence feature representing the first layer;
Figure BDA00034044612500000712
representing the ith layer corresponding to the nth image sequence characteristic;
Figure BDA00034044612500000713
indicating that the feature vector is transposed; n represents the number of image sequences;
Figure BDA00034044612500000714
representing an image sequence contrast loss function of the ith layer;
defining an i-th layer error loss function
Figure BDA00034044612500000715
Each deep image sequence in the Transformer model should focus only on the corresponding image sequence input from the shallow layer, and ignore the remaining irrelevant image sequence features, which is defined as follows:
Figure BDA00034044612500000716
wherein,
Figure BDA00034044612500000717
representing the nth image sequence of the first layerPerforming sign;
Figure BDA00034044612500000718
representing the ith layer corresponding to the nth image sequence characteristic; n represents the number of image sequences;
Figure BDA00034044612500000719
an error loss function representing the image sequence of the i-th layer;
the loss function for defining the whole network training is obtained by weighted sum of loss functions of the Transformer model in six stages and luminosity loss function, and the formula is as follows:
Figure BDA0003404461250000081
wherein λ is123Constraint balance factors respectively representing the proportion of loss functions of the transform model under different pyramid scales, wherein the higher the resolution is, the larger the function of the statistical loss function in the network training is, and the higher the weight coefficient is;
Figure BDA0003404461250000082
respectively representing a characteristic similarity loss function, a contrast loss function and an error loss function of the transform model of the ith layer;
Figure BDA0003404461250000083
representing the occlusion compensation loss function in the ith layer of optical flow estimation network, and finally calculating the obtained LfinalAs a loss function of the final overall network training.
The invention has the beneficial effects that:
the invention introduces a Transformer model into a characteristic pyramid network to enhance the characteristic extraction capability of the network, trains the whole network by designing a loss function of shielding compensation, and further obtains the finally predicted optical flow. The method can enhance the feature extraction capability of the feature pyramid layer on the image, and carry out occlusion compensation processing on occlusion pixels in the image so as to improve the precision of optical flow estimation.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a schematic diagram of a transform-based feature pyramid in the present invention.
FIG. 3 is a schematic diagram of an image segmentation module according to the present invention.
FIG. 4 is a block diagram of a convolutional mapping re-attention mechanism of the present invention.
Fig. 5 is an overall architecture diagram of the present invention.
FIG. 6 is a schematic diagram of the occlusion compensation of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention aims to provide an unsupervised optical flow estimation method based on a Transformer feature pyramid network, and belongs to the field of computer vision. Firstly, a Transformer-based feature pyramid network is constructed, and the feature extraction capability of the feature pyramid network on the image is enhanced by means of a Transformer model through a re-attention mechanism operation. And secondly, constructing an optical flow estimation network so that the network can perform optical flow prediction. And finally, carrying out shielding compensation processing on pixels in a shielding area, and designing a loss function of the whole network training to carry out unsupervised training on the network to obtain an unsupervised optical flow estimation model with higher speed and higher precision.
The invention provides an unsupervised optical flow estimation method based on a Transformer feature pyramid network, and aims to enhance the feature extraction capability of a feature pyramid layer on an image and perform occlusion compensation processing on occluded pixels in the image so as to improve the accuracy of optical flow estimation.
The purpose of the invention is realized as follows:
the method comprises the following steps: and constructing a characteristic pyramid network based on the Transformer. The feature pyramid network has two identical network branches, which have 12 layers in total, and can extract 6 image features, and the network of each branch shares the weight. And a Transformer model is fused on the second layer, the fourth layer, the sixth layer, the eighth layer, the tenth layer and the last layer of the characteristic pyramid to enhance the characteristic extraction capability of the network, and the model consists of an image segmentation module and a convolution mapping and re-attention mechanism module. The image segmentation module firstly carries out deformable convolution operation on an input image to extract local features such as edges and the like, and then segments the image into a sequence and inputs the sequence into the convolution mapping and attention re-paying mechanism module. And the convolution mapping re-attention mechanism module performs a re-attention mechanism operation on the image sequence to extract global features.
Step two: and constructing an optical flow estimation network. And inputting the image features extracted on the basis of each layer in the Transformer feature pyramid network into an optical flow estimation network for prediction, wherein the network consists of 5 layers of convolutional neural networks. Firstly, feature deformation is carried out on an input second frame image, then a feature matching relation between the second frame image after the feature deformation and the first frame image is calculated, namely a feature matching cost volume, then the calculated feature matching cost volume is input into the optical flow estimation network to predict the optical flow, and the output optical flow result is subjected to post-processing by utilizing a context network to obtain the accurate predicted optical flow.
Step three: and carrying out shielding compensation processing on the pixels in the shielding area. The specific operation of the occlusion compensation processing in the optical flow estimation network is to firstly perform difference comparison on the second frame reconstructed image and the first frame image, and then extract pixels of an occlusion area to perform occlusion compensation processing on the pixels. The difference contrast processing is to reconstruct the first frame image by distorting the second frame image with the predicted forward optical flow, and to add the difference between the reconstructed first frame image and the original first frame image to the occlusion map. The step of extracting the pixels of the occlusion region is to fill the reconstructed image by extracting the pixels of the occlusion region in the original first frame image, and to perform occlusion compensation by calculating the difference between the reconstructed image and the original first frame image.
Step four: and designing a loss function of the whole network training. And integrating the loss function with similarity information processing into the optical flow estimation network, and constructing a loss function trained by the whole network. The definition uses a function describing the feature similarity between different image blocks of the input transform model to promote the diversity between different image blocks, and simultaneously defines a contrast loss function and an error loss function, so that the concerned features of the deep network layer can come from the corresponding input of the shallow layer and are independent of the rest shallow layer inputs. And weighting and summing the loss items corresponding to the transform model of each pyramid layer, and combining the occlusion compensation loss function as the whole loss function of the network training to constrain the training process of the network.
Step five: two continuous frames of images are input at the input end of the network, and the network is subjected to unsupervised training by utilizing the whole network loss function.
Step six: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
The invention firstly introduces a Transformer model into a characteristic pyramid network to enhance the characteristic extraction capability of the network, and simultaneously trains the whole network by designing a loss function of shielding compensation to further obtain the finally predicted optical flow.
Example 1:
the method comprises the following steps: and constructing a characteristic pyramid network based on the Transformer.
As shown in fig. 2, an image I is input into a feature pyramid network, which has 12 convolution layers and includes 6 stages of transform models, and the transform models in each stage have the same architecture, so that 6 feature maps with different sizes can be extracted step by step.
The numbers of channel features from the first stage to the sixth stage are 16, 32, 64, 96, 128 and 196 respectively, the convolution layer of the first layer inputs a feature map of 3 x 512, the size of a convolution kernel is 3 x 3, the step size is 1, and a feature map of 16 x 512 is output. The second layer convolution layer inputs the 16 × 512 × 512 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the output 16 × 256 × 256 feature map is input into the transform model in the first stage.
The third layer of convolution layer inputs a 16 × 256 × 256 feature map, the convolution kernel size is 3 × 3, the step size is 1, and a 16 × 256 × 256 feature map is output. The fourth layer convolution layer inputs a 16 × 256 × 256 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the feature map of 32 × 128 × 128 is output and input to the transform model at the second stage.
The fifth layer convolution layer inputs a feature map of 32 × 128 × 128, the convolution kernel size is 3 × 3, the step size is 1, and a feature map of 32 × 128 × 128 is output. The sixth layer convolution layer inputs the 32 × 128 × 128 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the output 64 × 64 × 64 feature map is input to the transform model of the third stage.
The seventh convolutional layer inputs a feature map of 64 × 64 × 64, the convolutional kernel size is 3 × 3, the step size is 1, and a feature map of 64 × 64 × 64 is output. The eighth layer convolution inputs a feature map of 64 × 64 × 64, the convolution kernel size is 3 × 3, the step size is 2, and the output feature map of 96 × 32 × 32 is input to the transform model in the fourth stage.
The ninth convolutional layer inputs a 96 × 32 × 32 feature map, the convolutional kernel size is 3 × 3, the step size is 1, and a 96 × 32 × 32 feature map is output. The tenth layer convolution layer inputs a 96 × 32 × 32 feature map, the convolution kernel size is 3 × 3, the step size is 2, and a 128 × 16 × 16 feature map is output and input to the transform model in the fifth stage.
The eleventh convolutional layer inputs a 128 × 16 × 16 feature map, the convolutional kernel size is 3 × 3, the step size is 1, and a 128 × 16 × 16 feature map is output. The twelfth layer convolution layer inputs a 128 × 16 × 16 feature map, the convolution kernel size is 3 × 3, the step size is 2, and the feature map of 196 × 8 × 8 is output and input to the transform model of the sixth stage.
And the feature graph output by the convolution layer enters a Transformer model and then is output to the next convolution layer of the feature pyramid so as to enhance the feature extraction capability of the network on the image.
The Transformer model comprises two parts, namely an image segmentation module and a convolution mapping re-attention mechanism module. First, the standard Transformer module is used for processing machine language translation, a 1D language sequence is input, and in order to process a 2D image, the image segmentation module segments the input image, as shown in fig. 3. The module carries out deformable convolution operation on the input image, so that the number of image sequences in the input convolution mapping and attention mechanism module is gradually reduced, the characteristic resolution is reduced, the width of the input sequence is expanded, the characteristic size is increased, and the aims of gradually reducing the characteristic resolution and increasing the characteristic size along with the deepening of the number of layers of convolution are fulfilled. The deformable convolution has a total of four layers, each layer is composed of a standard convolution layer and a deviation layer, the convolution kernel size of the first layer of the standard convolution layer is 7 multiplied by 7, the step size is 2, the convolution kernel sizes of the second layer and the third layer are 3 multiplied by 3, the step size is 1, the convolution kernel size of the fourth layer is 7 multiplied by 7, and the step size is 2. The deviation layer of each layer is used for learning the characteristic diagram output by the previous standard convolution layer to obtain the deformation offset of the deformable convolution.
Let each ith layer pyramid output image based on Transformer feature pyramid network branch be xiThe characteristic size of the output image is
Figure BDA0003404461250000111
Wherein Hi×WiResolution, C, representing the image of the i-th layeriIndicating the number of channels of the i-th layer image. Outputting the layer of pyramid as an image xiInputting the image into an image segmentation module, extracting edge features through a deformable convolution operation to obtain an output image x'iCarrying out batch normalization processing and maximum pooling on the image after the deformable convolution operation to obtain an output image x ″)iThe definition is as follows:
x″i=MaxPool(BN(Deconv(xi)))
where Deconv (. cndot.) represents a deformable convolution operation, BN (. cndot.) represents batch normalization, MaxPool (. cndot.) represents maximum pooling, xiOutput image, x ″, representing the ith layer pyramidiRepresenting the output image after the maximum pooling operation. After maximum pooling an image x ″, is obtainediSliced 2D image blocks x'iEach image block x'iHas the dimension of
Figure BDA0003404461250000112
N denotes the output image x ″iThe number of the image sequences after segmentation is also the effective length of the input convolution mapping re-attention mechanism module, and
Figure BDA0003404461250000113
(P,P) The resolution of each image block after the segmentation is represented.
Flattening the sliced image blocks into N image sequences
Figure BDA0003404461250000114
Each image sequence is a two-dimensional matrix (N, D), D ═ P2C. The converted image sequence is marked through a binary mask, wherein the position 1 represents the image sequence with the important information position, the position 0 represents the image sequence with the similar information, the image sequence is selected to be marked in a self-adaptive mode according to the difference of information contained in the image sequence, the marked image sequence is input into a convolution mapping re-attention mechanism module, the image sequences marked with 1 are subjected to interactive calculation, and the image sequence marked with 0 is only calculated with the image sequence.
The convolution mapping re-attention mechanism module comprises a multi-head re-attention mechanism operation part and a multi-head linear processing part. First, for an input image sequence
Figure BDA0003404461250000115
A multi-headed re-attention mechanism operation was performed as shown in fig. 4. The module firstly inputs the ith layer image sequence
Figure BDA0003404461250000116
Linearly mapping the three-dimensional space to form a 2D image, obtaining three projected features through three depth separable convolution operations, flattening the three projected features to obtain three different two-dimensional vectors of the ith layer
Figure BDA0003404461250000117
Each vector is defined as follows:
Figure BDA0003404461250000118
where Reshape2d (-) indicates spatial recomposition of the image sequence, and s represents the size of the convolution kernel conv2d (-) indicates the depth separable convolution operationFlatten (·) represents that the mapping is flattened to obtain a two-dimensional vector,
Figure BDA0003404461250000121
representing the i-th layer flattened image sequence,
Figure BDA0003404461250000122
representing a two-dimensional vector flattened by the i-th layer after a depth separable convolution operation. The obtained two-dimensional vector
Figure BDA0003404461250000123
The two-dimensional vector mapped by each input image sequence can be expressed as q ═ q in a matrix in a multi-head re-attention mechanism1,q2,...,qN],k=[k1,k2,...,kN],v=[v1,v2,...,vN]. With the continuous deepening of the model, the attention similarity of the same image sequence in different layers is larger, but the similarity between multiple heads from the same layer is smaller, information between the different heads is combined by adopting a transfer matrix, attention operation is carried out again by utilizing the transfer matrix, and the transfer matrices trained by different layers are different, so that the attention collapse is avoided. The multi-head re-attention mechanism operation is defined as follows:
Figure BDA0003404461250000124
wherein q, k, v respectively represent a matrix obtained by convolution mapping of each input image sequence, T represents a transposition operation of the matrix, q · kTDenotes the correlation of two positions, softmax (·) denotes the pair q · kTPerforming a normalization operation, d representing the dimensions of q and k, theta representing the dimension of attention on multiple heads,
Figure BDA0003404461250000125
representing a sequence of image slices flattened by convolution mapping after spatial reconstruction,
Figure BDA0003404461250000126
showing a multi-headed re-attention mechanism operating on the sequence of image slices. The next layer of the multi-head re-attention operation processing is a batch normalization processing layer, the input and the output of the multi-head re-attention mechanism operation are added, and then batch normalization processing is carried out, so that the activation value normalization of each layer is realized, and the specific definition is as follows:
Figure BDA0003404461250000127
wherein MSA (-) represents a multi-head re-attention mechanism operation, BN (-) represents batch normalization processing,
Figure BDA0003404461250000128
representing a sequence of image slices flattened by convolution mapping after spatial reconstruction,
Figure BDA0003404461250000129
representing the image sequence output after batch normalization processing. The multi-head linear processing operation comprises a feedforward network and batch normalization processing, the feedforward network carries out dimension expansion on each image sequence, an activation function of the batch normalization processing is GELU nonlinear activation, residual connection is carried out after each operation to prevent network degradation, and the operation is specifically defined as follows:
Figure BDA00034044612500001210
wherein FFN (-) represents a feed-forward network, BN (-) represents batch normalization processing, GELU (-) represents a Gaussian error linear unit activation function,
Figure BDA00034044612500001211
representing the image sequence output after multi-head re-attention operation batch normalization processing,
Figure BDA00034044612500001212
and performing multi-head linear processing on the output image after multi-head re-attention operation, performing linear mapping and dimensionality reduction on the image sequence output by the layer, performing spatial reconstruction to form a 2D image, outputting the 2D image to the next layer of the characteristic pyramid layer, and repeating the model training for N times.
Step two: and constructing an optical flow estimation network.
As shown in FIG. 5, given 2 input images I1And I2The feature pyramid has two identical shared weight branches. For the second image I2Performing feature transformation, specifically performing 2-fold upsampling on the output estimated optical flow from the ith-1 layer of the pyramid, and performing bilinear interpolation transformation on the features of the second graph to the first graph, wherein the feature transformation operation is defined as follows:
Figure BDA0003404461250000131
where x represents the pixel value and,
Figure BDA0003404461250000132
the method is characterized in that the optical flow result obtained by an i-1 layer optical flow estimation network is up-sampled by 2 times,
Figure BDA0003404461250000133
the ith layer image features representing the second image of the pyramid,
Figure BDA0003404461250000134
representing the image features after feature deformation. Calculating the correlation between the second image feature and the first image feature after the feature deformation, namely calculating the feature matching cost volume of each layer, wherein the calculation of the feature matching cost volume is defined as follows:
Figure BDA0003404461250000135
wherein x is1,x2Pixels representing the first and second images respectivelyThe value of the one or more of the one,
Figure BDA0003404461250000136
a feature map representing the first image on the ith layer of the pyramid,
Figure BDA0003404461250000137
representing the image characteristics of the second image subjected to characteristic deformation on the ith layer of the pyramid, wherein M represents the length of a characteristic diagram, T represents the transposition operation of a matrix, and finally calculating the obtained CVi(x1,x2) And matching the cost volume result with the feature graph representing the ith layer of the feature pyramid.
And inputting the obtained feature matching cost volume result, the features of the first graph and the up-sampled high-resolution optical flow into an optical flow estimation network of the ith layer to obtain an optical flow estimation result of the ith layer. The optical flow estimation layer uses five layers of convolution networks, and the number of channels of each convolution layer is 128, 96, 64 and 32 respectively. The output optical flow result is post-processed by a context network, such as median filtering or bilateral filtering, and the context network is a feedforward convolutional neural network, and is based on an expansion convolution design and composed of 7 layers of convolutional networks, the convolution kernel of each convolutional layer is 3 × 3, the layers have different expansion coefficients, the convolutional expansion coefficient k represents that the input unit of the filter in the layer is separated from other input units of the filter in the layer by k units in the vertical and horizontal directions, the expansion coefficients of the convolutional layers are 1,2,4,8,16,1 and 1 in sequence from top to bottom, and the convolutional layer with a large expansion coefficient can effectively enlarge the size of the receptive field of each output unit on a pyramid so as to output a refined optical flow.
Step three: and carrying out shielding compensation processing on the pixels in the shielding area.
As shown in fig. 6, the specific operation of the occlusion compensation process in the optical flow estimation network is to perform difference comparison between the predicted optical flow image and the original image, and then extract the pixels of the occlusion area to perform the occlusion compensation process. The contrast processing is to warp the second frame image I by using the predicted forward optical flow2Thereby synthesizing a first frame reconstructed image I'1The reconstructed image I'1With the original image I1The differences are added to the occlusion map, defined as follows:
Figure BDA0003404461250000141
where x denotes the pixel value, ofwShowing an occlusion map, I1(x) And I1(x) Respectively representing an original image of a first frame and a reconstructed image of a second frame, sigma (-) representing a pixel-level similarity measure for calculating an original image I1(x) And reconstructed image I'1(x) Similarity between them, and finally calculating the obtained
Figure BDA0003404461250000142
The differential contrast loss in the network is estimated for the ith layer optical flow. The extraction of the pixels of the occlusion region is performed by using the original image I1Corresponding pixel of the occlusion region in (1) fills in the image I'1Obtaining a reconstructed image I1By reconstructing the image I ″)1And I1The difference between to calculate the loss in occlusion region is defined as follows:
Figure BDA0003404461250000143
wherein x represents a pixel value, I ″1(x) Is represented in the reconstructed first picture I'1(x) Adding reconstructed pixels obtained by corresponding shielded pixels, wherein sigma (DEG) represents pixel-level similarity measurement, and finally calculating the obtained reconstructed pixels
Figure BDA0003404461250000144
Losses in occlusion regions in the network are estimated for the ith layer of optical flow. Occlusion compensation loss function in final i-th layer optical flow estimation network
Figure BDA0003404461250000145
Is obtained by summing two loss functions, and is defined as follows:
Figure BDA0003404461250000146
wherein,
Figure BDA0003404461250000147
representing the difference versus loss in the i-th layer optical flow estimation network,
Figure BDA0003404461250000148
representing the loss in occlusion regions in the ith layer optical flow estimation network,
Figure BDA0003404461250000149
and representing an occlusion compensation loss function in the ith layer optical flow estimation network.
Step four: designing loss functions for whole network training
And (4) incorporating a Transformer model loss function with similarity information processing into the optical flow estimation network. As the similarity between different input image sequences is increased along with the continuous deepening of the Transformer model, a loss function for describing the characteristic similarity between different image sequences of the ith layer is defined
Figure BDA00034044612500001410
Let the characteristics of the input image x and its corresponding N image sequences of the i-th layer be
Figure BDA00034044612500001411
And the characteristics are similar to functions
Figure BDA00034044612500001412
As a penalty term in the loss function to promote diversity between different image sequences, is defined as follows:
Figure BDA00034044612500001413
where N represents the total number of image segmentations, ·2Represents L2Norm operation, T denotes the direction ofTransposition of quantities, xn,xmRespectively representing features of different image slice sequences of the same layer,
Figure BDA00034044612500001414
a characteristic similarity loss function representing the similarity between different image sequences of the ith layer.
Defining an ith layer contrast loss function
Figure BDA0003404461250000151
The features learned in shallow layers of the Transformer model are more diversified than those learned in deep layers, and the contrast loss function can use the features learned in shallow layers to regularize the deep features so as to reduce the similarity between different image sequences and increase the feature diversity of the image sequences in a deep network, and is defined as follows:
Figure BDA0003404461250000152
wherein,
Figure BDA0003404461250000153
an nth image sequence feature representing the first layer,
Figure BDA0003404461250000154
indicating that the ith layer corresponds to the nth image sequence feature,
Figure BDA0003404461250000155
indicating that the feature vector is transposed, N indicates the number of image sequences,
Figure BDA0003404461250000156
representing the image sequence contrast loss function of the ith layer.
Defining an i-th layer error loss function
Figure BDA0003404461250000157
The image sequence of each deep layer in the Transformer model should be focused only on the image sequence from the deep layerThe shallow corresponding image sequence input, while ignoring the remaining irrelevant image sequence features, is defined as follows:
Figure BDA0003404461250000158
wherein,
Figure BDA0003404461250000159
an nth image sequence feature representing the first layer,
Figure BDA00034044612500001510
indicating that the ith layer corresponds to the nth image sequence feature, N indicates the number of image sequences,
Figure BDA00034044612500001511
representing the error loss function of the image sequence of the i-th layer.
Defining a total network training loss function is obtained by weighted sum of loss functions of the six-stage Transformer model and a luminosity loss function, and the formula is as follows:
Figure BDA00034044612500001512
wherein λ is123Is a constraint balance factor which respectively represents the proportion of the loss function of the transform model under different pyramid scales, the higher the resolution is, the larger the function of the statistical loss function in the network training is, the higher the weight coefficient is,
Figure BDA00034044612500001513
respectively representing the characteristic similarity loss function, the contrast loss function and the error loss function of the transform model of the ith layer,
Figure BDA00034044612500001514
representing the occlusion compensation loss function in the ith layer of optical flow estimation network, and finally calculating the obtained LfinalTo be the finalLoss function of network training.
Step five: two continuous frames of images are input at the input end of the network, and the network is subjected to unsupervised training by utilizing the whole network loss function.
Step six: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. An unsupervised optical flow estimation method based on a Transformer feature pyramid network is characterized by comprising the following steps of:
step 1: constructing a characteristic pyramid network based on a Transformer;
the pyramid network based on the Transformer characteristics has two identical network branches, and has 12 layers in total, so that 6 image characteristics can be extracted, and the network of each branch shares the weight; fusing a Transformer model in a second layer, a fourth layer, a sixth layer, an eighth layer, a tenth layer and a last layer based on a Transformer characteristic pyramid network to enhance the characteristic extraction capability of the network, wherein the model consists of an image segmentation module and a convolution mapping re-attention mechanism module; the image segmentation module firstly carries out deformable convolution operation on an input image to extract local features such as edges and the like, then segments the image into sequences and inputs the sequences into the convolution mapping and attention re-paying mechanism module; a convolution mapping re-attention mechanism module performs re-attention mechanism operation on the image sequence to extract global features;
step 2: constructing an optical flow estimation network; inputting image features extracted from each layer in the Transformer feature pyramid network into an optical flow estimation network for prediction;
the optical flow estimation network is composed of 5 layers of convolutional neural networks, firstly, feature deformation is carried out on an input second frame image, then, a feature matching relation between the second frame image after the feature deformation and a first frame image is calculated, namely, a feature matching cost volume is calculated, then, the calculated feature matching cost volume is input into the optical flow estimation network to predict optical flow, and the output optical flow result is subjected to post-processing by utilizing a context network to obtain accurate predicted optical flow;
and step 3: carrying out shielding compensation processing on pixels in a shielding area;
the specific operation of the occlusion compensation processing in the optical flow estimation network is that firstly, difference comparison is carried out on a second frame reconstructed image and a first frame image, and then pixels of an occlusion area are extracted to carry out occlusion compensation processing on the pixels;
the difference contrast processing is to utilize the predicted forward optical flow to distort the second frame image so as to reconstruct the first frame image, and the difference of the reconstructed first frame image and the original first frame image is added into the occlusion image; extracting pixels of the occlusion region is to fill a reconstructed image by extracting pixels of the occlusion region in the original first frame image, and perform occlusion compensation by calculating the difference between the reconstructed image and the original first frame image;
and 4, step 4: designing a loss function of the whole network training; combining the loss function with similarity information processing into an optical flow estimation network, and constructing a loss function of whole network training; defining a function for describing the feature similarity between different image blocks of an input transform model to promote the diversity between different image blocks, and defining a contrast loss function and an error loss function to enable the concerned features of the deep network layer to come from the corresponding input of the shallow layer and be unrelated to the rest shallow layer input; weighting and summing loss items corresponding to the transform model of each pyramid layer, and combining an occlusion compensation loss function as an overall loss function of network training to constrain the training process of the network;
and 5: inputting two continuous frames of images at the input end of the network, and carrying out unsupervised training on the network by using an integral network loss function;
step 6: and inputting two continuous frames of images in the trained model for testing, and outputting a corresponding estimated optical flow.
2. The method of claim 1, wherein the method comprises: the 12 layers of convolution layers are arranged in the step 1 based on the Transformer characteristic pyramid network, and comprise 6 stages of Transformer models, wherein the Transformer models in each stage have the same architecture, and 6 characteristic graphs with different sizes can be extracted step by step; the number of channel features from the first stage to the sixth stage is 16, 32, 64, 96, 128, 196, respectively;
inputting a 3 × 512 × 512 feature map into the first layer of convolutional layer, outputting a 16 × 512 × 512 feature map with a convolutional kernel size of 3 × 3 and a step size of 1; inputting a 16 × 512 × 512 feature map into the second layer of convolution layer, outputting a 16 × 256 × 256 feature map with a convolution kernel size of 3 × 3 and a step size of 2, and inputting the feature map into the transform model in the first stage;
inputting a 16 × 256 × 256 feature map into the third layer of convolutional layer, outputting a 16 × 256 × 256 feature map with a convolutional kernel size of 3 × 3 and a step size of 1; inputting a 16 × 256 × 256 feature map into the fourth layer of convolution layer, outputting a 32 × 128 × 128 feature map into the transform model at the second stage, wherein the convolution kernel size is 3 × 3, the step size is 2;
inputting a 32 × 128 × 128 feature map into the fifth convolutional layer, wherein the size of a convolutional kernel is 3 × 3, the step size is 1, and outputting the 32 × 128 × 128 feature map; inputting a 32 × 128 × 128 feature map into the sixth layer of convolution layer, the convolution kernel size is 3 × 3, the step size is 2, outputting a 64 × 64 × 64 feature map and inputting the feature map into the transform model of the third stage;
inputting a feature map of 64 multiplied by 64 to the seventh convolutional layer, wherein the size of a convolution kernel is 3 multiplied by 3, the step size is 1, and outputting a feature map of 64 multiplied by 64; the eighth layer convolution inputs a feature map of 64 multiplied by 64, the convolution kernel size is 3 multiplied by 3, the step size is 2, and the feature map of 96 multiplied by 32 is output and input to a transform model in the fourth stage;
inputting a 96 × 32 × 32 feature map into the ninth convolutional layer, outputting a 96 × 32 × 32 feature map with a convolutional kernel size of 3 × 3 and a step size of 1; inputting a 96 × 32 × 32 feature map into the tenth convolutional layer, the convolutional kernel size is 3 × 3, the step size is 2, outputting a 128 × 16 × 16 feature map, and inputting the feature map into the transform model in the fifth stage;
inputting a 128 × 16 × 16 feature map into the eleventh convolutional layer, outputting a 128 × 16 × 16 feature map with a convolutional kernel size of 3 × 3 and a step size of 1; inputting a 128 × 16 × 16 feature map into the twelfth layer of convolution layer, the convolution kernel size is 3 × 3, the step size is 2, and outputting a 196 × 8 × 8 feature map to be input into the transform model in the sixth stage;
and the feature graph output by the convolution layer enters a Transformer model and then is output to the next convolution layer of the feature pyramid so as to enhance the feature extraction capability of the network on the image.
3. The method of claim 1, wherein the method comprises: in the step 1, the image segmentation module based on the transform feature pyramid network performs the deformable convolution operation on the input image, so that the number of image sequences in the input convolution mapping and attention mechanism module is gradually reduced, the feature resolution is reduced, the width of the input sequence is expanded, the feature size is increased, and the purpose of gradually reducing the feature resolution and increasing the feature size along with the deepening of the number of convolution layers is achieved, and the specific steps are as follows:
step 1.1.1: outputting the ith layer pyramid as an image xiInputting the image into an image segmentation module, extracting edge features through a deformable convolution operation to obtain an output image x'iCarrying out batch normalization processing and maximum pooling on the image after the deformable convolution operation to obtain an output image x ″)i
x″i=MaxPool(BN(Deconv(xi)))
Wherein x isiOutputting an image for the ith pyramid based on the Transformer feature pyramid network, wherein the feature size of the output image is
Figure FDA0003404461240000031
Hi×WiResolution, C, representing the image of the i-th layeriRepresenting the number of channels of the ith layer image; deconv (·) denotes a deformable convolution operationMaking; BN (-) represents batch normalization processing; MaxPool (. cndot.) denotes maximum pooling; x ″)iRepresenting the output image after the maximum pooling operation;
step 1.1.2: after maximum pooling an image x ″, is obtainediImage Block x 'sliced into 2D'iEach image block x'iHas the dimension of
Figure FDA0003404461240000032
N denotes the output image x ″iThe number of the image sequences after segmentation is also the effective length of the input convolution mapping re-attention mechanism module, and
Figure FDA0003404461240000033
(P, P) represents the resolution of each image block after segmentation;
step 1.1.3: flattening the sliced image blocks into N image sequences
Figure FDA0003404461240000034
Each image sequence is a two-dimensional matrix (N, D), D ═ P2C;
Step 1.1.4: the converted image sequence is marked through a binary mask, wherein the position 1 represents the image sequence with the important information position, the position 0 represents the image sequence with the similar information, the image sequence is selected to be marked in a self-adaptive mode according to the difference of information contained in the image sequence, the marked image sequence is input into a convolution mapping re-attention mechanism module, the image sequences marked with 1 are subjected to interactive calculation, and the image sequence marked with 0 is only calculated with the image sequence.
4. The method of claim 3, wherein the method comprises: the convolutional mapping re-attention mechanism module based on the Transformer feature pyramid network in the step 1 comprises two parts of multi-head re-attention mechanism operation and multi-head linear processing, and the specific steps are as follows:
step 1.2.1: will be provided withInput ith layer image sequence
Figure FDA0003404461240000035
Linearly mapping the three-dimensional space to form a 2D image, obtaining three projected features through three depth separable convolution operations, flattening the three projected features to obtain three different two-dimensional vectors of the ith layer
Figure FDA0003404461240000036
Each vector is defined as follows:
Figure FDA0003404461240000037
wherein Reshape2d (·) represents spatially recombining the image sequence; s represents the size of the convolution kernel; conv2d (·) represents a depth separable convolution operation; flatting after the Flatten (·) represents mapping to obtain a two-dimensional vector;
Figure FDA0003404461240000038
representing an i-th layer flattened image sequence;
Figure FDA0003404461240000039
representing a two-dimensional vector obtained by flattening the ith layer after the depth separable convolution operation;
step 1.2.2: the obtained two-dimensional vector
Figure FDA0003404461240000041
The two-dimensional vector mapped by each input image sequence can be expressed as q ═ q in a matrix in a multi-head re-attention mechanism1,q2,...,qN],k=[k1,k2,...,kN],v=[v1,v2,...,vN](ii) a With the continuous deepening of the model depth, the attention similarity of the same image sequence at different layers is larger, but the similarity between multiple heads from the same layer is smallerCombining information among different heads by adopting a transfer matrix, and performing attention operation again by using the information, wherein the transfer matrices trained by different layers are different, so as to avoid attention collapse; the multi-head re-attention mechanism operation is defined as follows:
Figure FDA0003404461240000042
wherein q, k, v respectively represent a matrix obtained by convolution mapping of each input image sequence; t represents a transpose operation of the matrix; q.kTRepresenting the correlation of two locations; softmax (·) denotes the pair q.kTCarrying out normalization operation; d represents the dimensions of q and k; θ represents the dimension of attention on multiple heads;
Figure FDA0003404461240000043
representing the image slice sequence which is subjected to convolution mapping flattening after spatial recombination;
Figure FDA0003404461240000044
representing a multi-headed re-attention mechanism operation on the sequence of image slices;
step 1.2.3: adding the input and output of the multi-head re-attention mechanism operation, and then performing batch normalization processing to realize the normalization of the activation value of each layer, wherein the specific definition is as follows:
Figure FDA0003404461240000045
wherein MSA (-) represents a multi-headed re-attention mechanism operation; BN (-) represents batch normalization processing;
Figure FDA0003404461240000046
representing the image slice sequence which is subjected to convolution mapping flattening after spatial recombination;
Figure FDA0003404461240000047
representing the image sequence output after batch normalization processing;
step 1.2.4: the multi-head linear processing operation comprises a feedforward network and batch normalization processing, the feedforward network carries out dimension expansion on each image sequence, an activation function of the batch normalization processing is GELU nonlinear activation, residual connection is carried out after each operation to prevent network degradation, and the operation is specifically defined as follows:
Figure FDA0003404461240000048
wherein FFN (-) represents a feed-forward network; BN (-) represents batch normalization processing; GELU (. circle.) represents a Gaussian error linear unit activation function;
Figure FDA0003404461240000049
representing an image sequence output after multi-head re-attention operation batch normalization processing;
Figure FDA00034044612400000410
representing that multi-head linear processing is carried out on the output image after multi-head re-attention operation;
step 1.2.5: and performing linear mapping dimensionality reduction on the image sequence output by the multi-head linear processing operation, performing spatial reconstruction to form a 2D image, and outputting the 2D image to a next characteristic pyramid layer.
5. The method of claim 1, wherein the method comprises: in the step 2, the input of the optical flow estimation network is two continuous frames of images, and the output is a corresponding estimated optical flow, and the specific steps are as follows:
step 2.1: for input two continuous frames of images I1And I2For the second image I2Performing feature deformation, performing 2 times of upsampling on the second frame by using output estimation optical flow from the i-1 th layer based on the Transformer feature pyramid networkThe feature of the graph is transformed into the first graph by bilinear interpolation, and the operation of the feature transformation is defined as follows:
Figure FDA0003404461240000051
wherein x represents a pixel value;
Figure FDA0003404461240000052
representing that 2 times of upsampling is carried out on an optical flow result obtained by an i-1 layer optical flow estimation network;
Figure FDA0003404461240000053
representing the ith layer image characteristic of the second image of the pyramid;
Figure FDA0003404461240000054
representing the image features after feature deformation;
step 2.2: calculating the correlation between the second image feature and the first image feature after the feature deformation, namely calculating the feature matching cost volume of each layer, wherein the calculation of the feature matching cost volume is defined as follows:
Figure FDA0003404461240000055
wherein x is1,x2Pixel values representing the first and second images, respectively;
Figure FDA0003404461240000056
a feature map representing a first image on an i-th layer based on a transform feature pyramid network;
Figure FDA0003404461240000057
representing the image characteristics after characteristic deformation is carried out on the second image on the ith layer based on the Transformer characteristic pyramid network; m represents a featureThe length of the graph; t represents a transpose operation of the matrix; CV ofi(x1,x2) Representing a feature map matching cost volume result of the ith layer based on the Transformer feature pyramid network;
step 2.3: and inputting the feature matching cost volume result, the features of the first graph and the up-sampled high-resolution optical flow into an i-th optical flow estimation layer of the optical flow estimation network to obtain an optical flow estimation result of the layer.
6. The method of claim 5, wherein the method comprises: in the step 2.3, the optical flow estimation layer of the optical flow estimation network uses five layers of convolution networks, the number of channels of each convolution layer is 128, 96, 64 and 32, and the output optical flow result is post-processed through a context network; the context network is a feedforward convolutional neural network, is based on an expansion convolution design and consists of 7 layers of convolutional networks, the convolution kernel of each convolutional layer is 3 multiplied by 3, the layers have different expansion coefficients, and the convolution expansion coefficient k represents that the input unit of the filter in the layer is k units away from other input units of the filter in the layer in the vertical and horizontal directions; from top to bottom, the expansion coefficients of the convolutional layers are 1,2,4,8,16,1 and 1 in order, and the convolutional layers with large expansion coefficients can effectively enlarge the size of the receptive field of each output unit on the pyramid to output a refined optical flow.
7. The method of claim 1, wherein the method comprises: in the step 3: the method for carrying out shielding compensation processing on the pixels in the shielding area specifically comprises the following steps:
carrying out difference comparison on the predicted optical flow image and the original image, and then extracting pixels of an occlusion area to carry out occlusion compensation processing on the pixels; the contrast processing is to warp the second frame image I by using the predicted forward optical flow2Thereby synthesizing a first frame reconstructed image I'1The reconstructed image I'1With the original imageI1The differences are added to the occlusion map, defined as follows:
Figure FDA0003404461240000061
wherein x represents a pixel value; ofwRepresenting an occlusion map; i is1(x) And l'1(x) Respectively representing a first frame original image and a second frame reconstructed image; sigma (-) represents a pixel-level similarity measure for computing the original image I1(x) And reconstructed image I'1(x) The similarity between them;
Figure FDA0003404461240000062
estimating a difference contrast loss in the network for the ith layer optical flow;
the extraction of the pixels of the occlusion region is performed by using the original image I1Corresponding pixel of the occlusion region in (1) fills in the image I'1Obtaining a reconstructed image I1By reconstructing the image I ″)1And I1The difference between to calculate the loss in occlusion region is defined as follows:
Figure FDA0003404461240000063
wherein x represents a pixel value; i ″)1(x) Is represented in the reconstructed first picture I'1(x) Adding a reconstruction pixel obtained by corresponding shielding pixel; σ (-) represents a pixel-level similarity metric;
Figure FDA0003404461240000064
estimating losses in occlusion regions in the network for the ith layer of optical flow;
occlusion compensation loss function in final i-th layer optical flow estimation network
Figure FDA0003404461240000065
Is obtained by summing two loss functions, and is defined as follows:
Figure FDA0003404461240000066
wherein,
Figure FDA0003404461240000067
representing the difference contrast loss in the ith layer optical flow estimation network;
Figure FDA0003404461240000068
representing the loss in the occlusion region in the ith layer optical flow estimation network;
Figure FDA0003404461240000069
and representing an occlusion compensation loss function in the ith layer optical flow estimation network.
8. The method of claim 1, wherein the method comprises: the loss function of the overall network training in the step 4 is as follows: let the characteristics of the input image x and its corresponding N image sequences of the i-th layer be
Figure FDA00034044612400000610
And the characteristics are similar to functions
Figure FDA00034044612400000611
As a penalty term in the loss function to promote diversity between different image sequences, is defined as follows:
Figure FDA00034044612400000612
wherein N represents the total number of image segmentations; i | · | purple wind2Represents L2Performing norm operation; t represents a transposition operation of the vector; x is the number ofn,xmRespectively representing the characteristics of different image slice sequences of the same layer;
Figure FDA00034044612400000613
a characteristic similarity loss function representing the similarity between different image sequences of the ith layer;
defining an ith layer contrast loss function
Figure FDA0003404461240000071
The features learned in shallow layers of the Transformer model are more diversified than those learned in deep layers, and the contrast loss function can use the features learned in shallow layers to regularize the deep features so as to reduce the similarity between different image sequences and increase the feature diversity of the image sequences in a deep network, and is defined as follows:
Figure FDA0003404461240000072
wherein,
Figure FDA0003404461240000073
an nth image sequence feature representing the first layer;
Figure FDA0003404461240000074
representing the ith layer corresponding to the nth image sequence characteristic;
Figure FDA0003404461240000075
indicating that the feature vector is transposed; n represents the number of image sequences;
Figure FDA0003404461240000076
representing an image sequence contrast loss function of the ith layer;
defining an i-th layer error loss function
Figure FDA0003404461240000077
Each deep image sequence in the Transformer model should focus only on the corresponding image sequence from the shallow layerThe input, while ignoring the remaining irrelevant image sequence features, is defined as follows:
Figure FDA0003404461240000078
wherein,
Figure FDA0003404461240000079
an nth image sequence feature representing the first layer;
Figure FDA00034044612400000710
representing the ith layer corresponding to the nth image sequence characteristic; n represents the number of image sequences;
Figure FDA00034044612400000711
an error loss function representing the image sequence of the i-th layer;
the loss function for defining the whole network training is obtained by weighted sum of loss functions of the Transformer model in six stages and luminosity loss function, and the formula is as follows:
Figure FDA00034044612400000712
wherein λ is123Constraint balance factors respectively representing the proportion of loss functions of the transform model under different pyramid scales, wherein the higher the resolution is, the larger the function of the statistical loss function in the network training is, and the higher the weight coefficient is;
Figure FDA00034044612400000713
respectively representing a characteristic similarity loss function, a contrast loss function and an error loss function of the transform model of the ith layer;
Figure FDA00034044612400000714
representing occlusions in a layer i optical flow estimation networkCompensating the loss function, and finally calculating the obtained LfinalAs a loss function of the final overall network training.
CN202111506127.7A 2021-12-10 2021-12-10 Unsupervised optical flow estimation method based on Transformer feature pyramid network Pending CN114187331A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111506127.7A CN114187331A (en) 2021-12-10 2021-12-10 Unsupervised optical flow estimation method based on Transformer feature pyramid network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111506127.7A CN114187331A (en) 2021-12-10 2021-12-10 Unsupervised optical flow estimation method based on Transformer feature pyramid network

Publications (1)

Publication Number Publication Date
CN114187331A true CN114187331A (en) 2022-03-15

Family

ID=80543042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111506127.7A Pending CN114187331A (en) 2021-12-10 2021-12-10 Unsupervised optical flow estimation method based on Transformer feature pyramid network

Country Status (1)

Country Link
CN (1) CN114187331A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677412A (en) * 2022-03-18 2022-06-28 苏州大学 Method, device and equipment for estimating optical flow
CN115018888A (en) * 2022-07-04 2022-09-06 东南大学 Optical flow unsupervised estimation method based on Transformer
CN115719368A (en) * 2022-11-29 2023-02-28 上海船舶运输科学研究所有限公司 Multi-target ship tracking method and system
CN115761594A (en) * 2022-11-28 2023-03-07 南昌航空大学 Optical flow calculation method based on global and local coupling
CN115880567A (en) * 2023-03-03 2023-03-31 深圳精智达技术股份有限公司 Self-attention calculation method and device, electronic equipment and storage medium
CN116630324A (en) * 2023-07-25 2023-08-22 吉林大学 Method for automatically evaluating adenoid hypertrophy by MRI (magnetic resonance imaging) image based on deep learning
CN116739919A (en) * 2023-05-22 2023-09-12 武汉大学 Method and system for detecting and repairing solar flicker in optical ocean image of unmanned plane
CN116740414A (en) * 2023-05-15 2023-09-12 中国科学院自动化研究所 Image recognition method, device, electronic equipment and storage medium
CN117877099A (en) * 2024-03-11 2024-04-12 南京信息工程大学 Non-supervision contrast remote physiological measurement method based on space-time characteristic enhancement
WO2024174804A1 (en) * 2023-02-21 2024-08-29 浙江阿里巴巴机器人有限公司 Service providing method, device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582483A (en) * 2020-05-14 2020-08-25 哈尔滨工程大学 Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN112465872A (en) * 2020-12-10 2021-03-09 南昌航空大学 Image sequence optical flow estimation method based on learnable occlusion mask and secondary deformation optimization
CN113538527A (en) * 2021-07-08 2021-10-22 上海工程技术大学 Efficient lightweight optical flow estimation method
CN113706526A (en) * 2021-10-26 2021-11-26 北京字节跳动网络技术有限公司 Training method and device for endoscope image feature learning model and classification model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582483A (en) * 2020-05-14 2020-08-25 哈尔滨工程大学 Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN112465872A (en) * 2020-12-10 2021-03-09 南昌航空大学 Image sequence optical flow estimation method based on learnable occlusion mask and secondary deformation optimization
CN113538527A (en) * 2021-07-08 2021-10-22 上海工程技术大学 Efficient lightweight optical flow estimation method
CN113706526A (en) * 2021-10-26 2021-11-26 北京字节跳动网络技术有限公司 Training method and device for endoscope image feature learning model and classification model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YANG, B (YANG, BO) [1] ; XIE, H (XIE, HUAN) [1] ; LI, HB (LI, HONGBIN) [2] ; LI, NH (LI, NUOHAN) [3] ; LIU, AC (LIU, ANCHANG) [2] : "Unsupervised Optical Flow Estimation Based on Improved Feature Pyramid", 《 NEURAL PROCESSING LETTERS》, no. 52, 14 August 2020 (2020-08-14), pages 1601 - 1612, XP037257628, DOI: 10.1007/s11063-020-10328-2 *
刘香凝;赵洋;王荣刚;: "基于自注意力机制的多阶段无监督单目深度估计网络", 信号处理, no. 09 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677412A (en) * 2022-03-18 2022-06-28 苏州大学 Method, device and equipment for estimating optical flow
CN115018888A (en) * 2022-07-04 2022-09-06 东南大学 Optical flow unsupervised estimation method based on Transformer
CN115018888B (en) * 2022-07-04 2024-08-06 东南大学 Optical flow unsupervised estimation method based on transducer
CN115761594A (en) * 2022-11-28 2023-03-07 南昌航空大学 Optical flow calculation method based on global and local coupling
CN115761594B (en) * 2022-11-28 2023-07-21 南昌航空大学 Optical flow calculation method based on global and local coupling
CN115719368B (en) * 2022-11-29 2024-05-17 上海船舶运输科学研究所有限公司 Multi-target ship tracking method and system
CN115719368A (en) * 2022-11-29 2023-02-28 上海船舶运输科学研究所有限公司 Multi-target ship tracking method and system
WO2024174804A1 (en) * 2023-02-21 2024-08-29 浙江阿里巴巴机器人有限公司 Service providing method, device, and storage medium
CN115880567A (en) * 2023-03-03 2023-03-31 深圳精智达技术股份有限公司 Self-attention calculation method and device, electronic equipment and storage medium
CN116740414A (en) * 2023-05-15 2023-09-12 中国科学院自动化研究所 Image recognition method, device, electronic equipment and storage medium
CN116740414B (en) * 2023-05-15 2024-03-01 中国科学院自动化研究所 Image recognition method, device, electronic equipment and storage medium
CN116739919B (en) * 2023-05-22 2024-08-02 武汉大学 Method and system for detecting and repairing solar flicker in optical ocean image of unmanned plane
CN116739919A (en) * 2023-05-22 2023-09-12 武汉大学 Method and system for detecting and repairing solar flicker in optical ocean image of unmanned plane
CN116630324B (en) * 2023-07-25 2023-10-13 吉林大学 Method for automatically evaluating adenoid hypertrophy by MRI (magnetic resonance imaging) image based on deep learning
CN116630324A (en) * 2023-07-25 2023-08-22 吉林大学 Method for automatically evaluating adenoid hypertrophy by MRI (magnetic resonance imaging) image based on deep learning
CN117877099A (en) * 2024-03-11 2024-04-12 南京信息工程大学 Non-supervision contrast remote physiological measurement method based on space-time characteristic enhancement
CN117877099B (en) * 2024-03-11 2024-05-14 南京信息工程大学 Non-supervision contrast remote physiological measurement method based on space-time characteristic enhancement

Similar Documents

Publication Publication Date Title
CN114187331A (en) Unsupervised optical flow estimation method based on Transformer feature pyramid network
Tu et al. Maxim: Multi-axis mlp for image processing
CN113361560B (en) Semantic-based multi-pose virtual fitting method
CN113743269B (en) Method for recognizing human body gesture of video in lightweight manner
CN112131959B (en) 2D human body posture estimation method based on multi-scale feature reinforcement
CN109756690A (en) Lightweight view interpolation method based on feature rank light stream
CN112837224A (en) Super-resolution image reconstruction method based on convolutional neural network
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN115018888B (en) Optical flow unsupervised estimation method based on transducer
CN113792641A (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
CN111696038A (en) Image super-resolution method, device, equipment and computer-readable storage medium
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
Liu et al. An efficient residual learning neural network for hyperspectral image superresolution
CN116612288B (en) Multi-scale lightweight real-time semantic segmentation method and system
Li et al. Model-informed Multi-stage Unsupervised Network for Hyperspectral Image Super-resolution
CN118134952B (en) Medical image segmentation method based on feature interaction
CN115641285A (en) Binocular vision stereo matching method based on dense multi-scale information fusion
CN116071748A (en) Unsupervised video target segmentation method based on frequency domain global filtering
Chen et al. PDWN: Pyramid deformable warping network for video interpolation
CN109934283A (en) A kind of adaptive motion object detection method merging CNN and SIFT light stream
CN117710429A (en) Improved lightweight monocular depth estimation method integrating CNN and transducer
Luo et al. Dual-Window Multiscale Transformer for Hyperspectral Snapshot Compressive Imaging
CN115731280A (en) Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination