CN115690170A - Method and system for adaptive optical flow estimation for targets of different scales

Method and system for adaptive optical flow estimation for targets of different scales

Info

Publication number
CN115690170A
Authority
CN
China
Prior art keywords
scale
features
images
frames
optical flow
Prior art date
Legal status
Granted
Application number
CN202211221511.7A
Other languages
Chinese (zh)
Other versions
CN115690170B (en)
Inventor
钟宝江
李牧
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202211221511.7A priority Critical patent/CN115690170B/en
Priority claimed from CN202211221511.7A external-priority patent/CN115690170B/en
Publication of CN115690170A publication Critical patent/CN115690170A/en
Application granted granted Critical
Publication of CN115690170B publication Critical patent/CN115690170B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a method and a system for adaptive optical flow estimation for targets of different scales. The method comprises: inputting two adjacent frames of images into a convolutional neural network and extracting features of the two frames to obtain shallow features of the two frames; processing the shallow features of the two frames to obtain multi-scale features of the two frames; obtaining a multi-scale cost volume through information interaction among the coarse-scale, medium-scale and fine-scale features of the two frames; performing context encoding on the first frame of the two frames and computing an optical flow estimation result by combining it with the multi-scale cost volume; and fitting the optical flow estimation result using the endpoint error of the optical flow as the loss function. The method solves the problem of poor estimation performance caused by a single cost volume losing the fine details of objects at different scales, and improves the accuracy of optical flow estimation.

Description

Method and system for adaptive optical flow estimation for targets of different scales
Technical Field
The invention relates to the technical field of computer vision, and in particular to a method and a system for adaptive optical flow estimation for targets of different scales.
Background
Optical flow estimation, the task of estimating per-pixel motion between video frames, is a fundamental technique for a wide range of computer vision applications such as motion segmentation, action recognition, and autonomous driving. Optical flow estimation has traditionally been treated as a knowledge-driven technique: conventional methods formulate it as the optimization of an energy function whose constraints encode prior knowledge (e.g., corner points). However, optimizing such constrained objectives is typically too slow to be applied in real-time systems, and it is difficult to design such hand-crafted cues and turn them into a robust optimization objective.
In recent years, optical flow estimation has advanced significantly with the development of convolutional neural networks, which provide a powerful ability to learn from large amounts of data; compared with knowledge-driven methods, these techniques follow a data-driven strategy. To learn optical flow, many methods use encoder-decoder or spatial pyramid structures. A pioneering work is FlowNet, proposed by Dosovitskiy et al. in 2015, which introduced two models, FlowNetS and FlowNetC. SpyNet introduced a feature pyramid module that uses a spatial pyramid network to warp images at each level and decompose large displacements into small ones, so that only a small displacement needs to be computed at each pyramid level, greatly reducing the amount of computation. Teed and Deng proposed RAFT, in which a lightweight recurrent module built around a GRU serves as the update operator.
In the above networks, the receptive fields of the artificial neurons in each layer are generally designed to have the same size during feature extraction, and because a single network structure is used, the cost volume is generated in a single way. However, the cost volume represents the similarity between two adjacent frames, and an accurate cost volume is the key to obtaining an accurate optical flow estimate; unfortunately, a single-scale cost volume may lose the fine details of objects at different scales, resulting in poor estimation performance.
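For reference, such a cost volume is commonly built as the dot-product correlation between the feature vectors of every pixel pair of the two frames; the following is a minimal sketch of the standard all-pairs form, given as an illustration of the general concept rather than the construction claimed by this invention.

```python
import torch

def all_pairs_cost_volume(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """Dot-product similarity between every pixel of frame 1 and frame 2.

    f1, f2: feature maps (B, C, H, W) from a shared encoder.
    Returns a cost volume of shape (B, H, W, H, W).
    """
    b, c, h, w = f1.shape
    corr = torch.einsum('bcn,bcm->bnm', f1.view(b, c, -1), f2.view(b, c, -1))
    return (corr / c ** 0.5).view(b, h, w, h, w)   # scale by feature dimension
```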
Disclosure of Invention
The embodiment of the invention provides a method and a system for adaptive optical flow estimation for targets of different scales, which are used to solve the problem in the prior art that a single cost volume loses the fine details of objects at different scales, resulting in poor estimation performance.
The embodiment of the invention provides a method for adaptive optical flow estimation for targets of different scales, which comprises the following steps:
S1: inputting two adjacent frames of images into a convolutional neural network and extracting features of the two frames to obtain shallow features of the two frames;
S2: processing the shallow features of the two frames to obtain multi-scale features of the two frames, wherein the multi-scale features comprise coarse-scale, medium-scale and fine-scale features;
S3: obtaining a multi-scale cost volume through information interaction among the coarse-scale, medium-scale and fine-scale features of the two frames;
S4: performing context encoding on the first frame of the two frames and computing an optical flow estimation result by combining it with the multi-scale cost volume;
S5: fitting the optical flow estimation result using the endpoint error of the optical flow as the loss function.
Preferably, the convolutional neural network employs a downsampling structure.
Preferably, the feature extraction used to obtain the shallow features of the two frames comprises convolution, pooling and normalization.
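For illustration, a minimal sketch of such a shallow, downsampling feature encoder is given below. The layer count, channel widths and the choice of instance normalization are assumptions made for the example and are not specified by the patent; the shared encoder is simply applied to both frames.

```python
import torch
import torch.nn as nn

class ShallowEncoder(nn.Module):
    """Minimal downsampling encoder built from convolution, pooling and
    normalization layers (illustrative sketch, not the patented network)."""
    def __init__(self, in_ch: int = 3, out_ch: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=7, stride=2, padding=3),  # 1/2 resolution
            nn.InstanceNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),                     # pooling: 1/4 resolution
            nn.Conv2d(64, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Shallow features of two adjacent frames, extracted with shared weights.
encoder = ShallowEncoder()
frame1, frame2 = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
f1, f2 = encoder(frame1), encoder(frame2)   # each (1, 128, 64, 64)
```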
Preferably, the processing of the shallow features of the two frames comprises a segmentation operation, a fusion operation and a selection operation.
Preferably, the segmentation operation specifically comprises:
given an intermediate feature map $M \in \mathbb{R}^{C \times H \times W}$ as input, two convolutional layers with kernel sizes of 3 and 5, respectively, split the intermediate feature map $M$ into image features at two different scales, $\widetilde{M}_{fine}$ and $\widetilde{M}_{coarse}$; here the 5 × 5 convolution kernel is replaced by a kernel of size 3 × 3 configured as a dilated convolution with a dilation coefficient of 2.
Preferably, the fusion operation specifically comprises:
first, the multi-scale information of the two different branches is fused through an element-wise summation operation to obtain the fused two-scale feature:
$$M_{fuse} = \widetilde{M}_{fine} + \widetilde{M}_{coarse}$$
then, for $M_{fuse}$, global information in the spatial dimension is captured using global average pooling:
$$s = F_{gap}(M_{fuse}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} M_{fuse}(i, j)$$
where $F_{gap}(\cdot)$ denotes the global average pooling operation, and H and W are the height and width of the feature map, respectively;
finally, the features are aggregated with a fully connected layer, and a batch normalization layer and an activation function are added after the fully connected layer:
$$t = \delta\left(\mathcal{B}\left(F_{fc}(s)\right)\right)$$
where $F_{fc}(\cdot)$ denotes the fully connected layer, $\delta$ denotes the ReLU activation function, and $\mathcal{B}(\cdot)$ denotes the batch normalization layer.
Preferably, the selection operation specifically comprises:
the feature matrix t is used to guide soft attention across channels to adaptively select different spatial scales of information; the dimension of t is first expanded, and a softmax operator is then applied along the channel dimension to obtain the attention weights:
$$a_c = \frac{e^{A_c t}}{e^{A_c t} + e^{B_c t}}, \qquad b_c = \frac{e^{B_c t}}{e^{A_c t} + e^{B_c t}}$$
where the above is the softmax formula and A and B denote the learned matrices that expand the dimension of t for the fine and coarse branches;
the final feature maps $M_{fine}$ and $M_{coarse}$ are generated using the obtained attention weights, that is, the corresponding weighting coefficients are applied to the segmented features:
$$M_{fine} = a_c \cdot \widetilde{M}_{fine}, \qquad M_{coarse} = b_c \cdot \widetilde{M}_{coarse}$$
where $\widetilde{M}_{fine}$ and $\widetilde{M}_{coarse}$ denote the features at the two scales obtained after segmenting the input feature M, and adding them gives the fused feature $M_{fuse}$.
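Taken together, the segmentation, fusion and selection operations form a selective, multi-receptive-field block. Below is a minimal sketch of one plausible realization; the channel widths, the reduction ratio r of the fully connected layer and the module and variable names are assumptions for illustration rather than the patent's exact design, and the module returns the weighted fine- and coarse-scale maps.

```python
import torch
import torch.nn as nn

class FeatureSelectModule(nn.Module):
    """Split -> fuse -> select over two receptive-field branches (sketch).

    The 5x5 branch is realized as a 3x3 convolution with dilation 2, as the
    description states; the reduction ratio r is an illustrative assumption.
    """
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        # Segmentation: two branches with different receptive fields.
        self.branch_fine = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch_coarse = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        # Fusion: global average pooling + fully connected + BN + ReLU.
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.BatchNorm1d(channels // r),
            nn.ReLU(inplace=True),
        )
        # Selection: expand t to per-branch channel logits for the softmax.
        self.expand = nn.Linear(channels // r, 2 * channels)

    def forward(self, m: torch.Tensor):
        b, c, _, _ = m.shape
        m_fine, m_coarse = self.branch_fine(m), self.branch_coarse(m)   # split
        m_fuse = m_fine + m_coarse                                      # element-wise sum
        s = self.gap(m_fuse).flatten(1)                                 # global descriptor (B, C)
        t = self.fc(s)                                                  # compact feature t
        logits = self.expand(t).view(b, 2, c)                           # expanded dimension of t
        weights = torch.softmax(logits, dim=1)                          # softmax across the two branches
        a = weights[:, 0].view(b, c, 1, 1)
        bb = weights[:, 1].view(b, c, 1, 1)
        return a * m_fine, bb * m_coarse                                # weighted M_fine, M_coarse
```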
Preferably, the method for obtaining the multi-scale cost volume through information interaction among the coarse-scale, medium-scale and fine-scale features of the two frames comprises the following steps:
generating an initial cost volume from the fine-scale features, and then reinforcing the fine-scale features with the medium-scale features to obtain cost volume 1;
pooling cost volume 1 to obtain cost volume 2, where cost volume 2 fuses the medium-scale and coarse-scale features;
pooling cost volume 2 to obtain cost volume 3.
The invention also provides a system for adaptive optical flow estimation for targets of different scales, the system comprising:
a shallow feature extraction module, configured to input two adjacent frames of images into a convolutional neural network and extract features of the two frames to obtain shallow features of the two frames;
a multi-scale feature extraction module, configured to process the shallow features of the two frames to obtain multi-scale features of the two frames, wherein the multi-scale features comprise coarse-scale, medium-scale and fine-scale features;
a multi-scale cost volume generation module, configured to obtain a multi-scale cost volume through information interaction among the coarse-scale, medium-scale and fine-scale features of the two frames;
an optical flow estimation calculation module, configured to perform context encoding on the first frame of the two frames and compute an optical flow estimation result by combining it with the multi-scale cost volume;
and an optical flow estimation fitting module, configured to fit the optical flow estimation result using the endpoint error of the optical flow as the loss function.
An embodiment of the present invention provides a network device, including a processor, a memory, and a bus system, where the processor and the memory are connected via the bus system, the memory is used to store instructions, and the processor is used to execute the instructions stored in the memory, so as to implement any one of the above methods.
According to the technical scheme, the invention has the following advantages:
the invention provides a method and a system for adaptively estimating optical flow for different scales of objects, firstly, a characteristic selectable module is introduced into the field of optical flow estimation and integrated into a network, which is beneficial to the network to generate multi-scale characteristic information, so that more accurate optical flow estimation results are learned for objects of different scales; secondly, a multi-scale cost generation module is introduced, and the multi-scale cost enhances the similarity characterization capability; finally, the optical flow estimation method utilizes the characteristic selectable module to enhance the generation of the multi-scale cost quantity, and jointly learns the multi-scale cost quantity and the context codes, thereby solving the problem of poor estimation performance caused by losing fine details of objects with different scales due to single cost quantity and improving the accuracy of optical flow estimation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments are briefly described below. The features and advantages of the present invention will be more clearly understood by reference to these drawings, which are schematic and should not be construed as limiting the present invention in any way; for those skilled in the art, other drawings can be obtained from these drawings without creative effort. In the drawings:
FIG. 1 is a schematic diagram of an adaptive optical flow estimation method for different scale targets according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an adaptive optical-flow estimation system for different scale targets according to an embodiment of the invention;
FIG. 3 is a schematic block diagram of a network device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. The described embodiments are only a part of the embodiments of the present invention, rather than all of them; all other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.
Embodiment One
As shown in FIG. 1, an embodiment of the present invention provides a method for adaptive optical flow estimation for targets of different scales, where the method comprises:
S101: inputting two adjacent frames of images into a convolutional neural network and extracting features of the two frames to obtain shallow features of the two frames;
S102: processing the shallow features of the two frames to obtain multi-scale features of the two frames, wherein the multi-scale features comprise coarse-scale, medium-scale and fine-scale features;
S103: obtaining a multi-scale cost volume through information interaction among the coarse-scale, medium-scale and fine-scale features of the two frames;
S104: performing context encoding on the first frame of the two frames and computing an optical flow estimation result by combining it with the multi-scale cost volume;
S105: fitting the optical flow estimation result using the endpoint error of the optical flow as the loss function.
The invention introduces the feature-selectable module into the field of optical flow estimation and integrates it into the network, which helps the network generate multi-scale feature information, so that more accurate optical flow estimation results are learned for objects of different scales; second, a multi-scale cost volume generation module is introduced, and the multi-scale cost volume enhances the ability to characterize similarity; finally, the optical flow estimation method of the invention uses the feature-selectable module to enhance the generation of the multi-scale cost volume and jointly learns the multi-scale cost volume and the context encoding, thereby improving the accuracy of optical flow estimation.
Further, step S101 comprises:
inputting two adjacent frames of images into a convolutional neural network and extracting features of the two frames to obtain shallow features of the two frames; the convolutional neural network adopts a downsampling structure, and the feature extraction comprises convolution, pooling and normalization.
Further, step S102 comprises:
processing the shallow features of the two frames to obtain multi-scale features of the two frames, so that the convolutional neural network can selectively use the generated multi-scale features; the multi-scale features comprise coarse-scale, medium-scale and fine-scale features, which allows the convolutional neural network to capture objects of different sizes in the images;
the processing of the shallow features of the two frames comprises a segmentation operation, a fusion operation and a selection operation;
the segmentation operation specifically comprises: given an intermediate feature map $M \in \mathbb{R}^{C \times H \times W}$ as input, two convolutional layers with kernel sizes of 3 and 5, respectively, split the intermediate feature map $M$ into image features at two different scales, $\widetilde{M}_{fine}$ and $\widetilde{M}_{coarse}$; here the 5 × 5 convolution kernel is replaced by a kernel of size 3 × 3 configured as a dilated convolution with a dilation coefficient of 2;
the fusion operation specifically comprises the following steps:
first, the multi-scale information of the two different branches is fused through an element-wise summation operation to obtain the fused two-scale feature:
$$M_{fuse} = \widetilde{M}_{fine} + \widetilde{M}_{coarse}$$
then, for $M_{fuse}$, global information in the spatial dimension is captured using global average pooling:
$$s = F_{gap}(M_{fuse}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} M_{fuse}(i, j)$$
where $F_{gap}(\cdot)$ denotes the global average pooling operation, and H and W are the height and width of the feature map, respectively;
finally, the features are aggregated with a fully connected layer, and a batch normalization layer and an activation function are added after the fully connected layer:
$$t = \delta\left(\mathcal{B}\left(F_{fc}(s)\right)\right)$$
where $F_{fc}(\cdot)$ denotes the fully connected layer, $\delta$ denotes the ReLU activation function, and $\mathcal{B}(\cdot)$ denotes the batch normalization layer;
the selection operation specifically comprises:
the feature matrix t is used to guide soft attention across channels to adaptively select different spatial scales of information; the dimension of t is first expanded, and a softmax operator is then applied along the channel dimension to obtain the attention weights:
$$a_c = \frac{e^{A_c t}}{e^{A_c t} + e^{B_c t}}, \qquad b_c = \frac{e^{B_c t}}{e^{A_c t} + e^{B_c t}}$$
where the above is the softmax formula and A and B denote the learned matrices that expand the dimension of t for the fine and coarse branches;
the final feature maps $M_{fine}$ and $M_{coarse}$ are generated using the obtained attention weights, that is, the corresponding weighting coefficients are applied to the segmented features:
$$M_{fine} = a_c \cdot \widetilde{M}_{fine}, \qquad M_{coarse} = b_c \cdot \widetilde{M}_{coarse}$$
where $\widetilde{M}_{fine}$ and $\widetilde{M}_{coarse}$ denote the features at the two scales obtained after segmenting the input feature M, and adding them gives the fused feature $M_{fuse}$.
Further, step S103 comprises:
obtaining a multi-scale cost volume through information interaction among the coarse-scale, medium-scale and fine-scale features of the two frames, which specifically comprises: generating an initial cost volume from the fine-scale features, and then reinforcing the fine-scale features with the medium-scale features to obtain cost volume 1; pooling cost volume 1 to obtain cost volume 2, where cost volume 2 fuses the medium-scale and coarse-scale features; and pooling cost volume 2 to obtain cost volume 3.
In previous approaches, the cost volume was generated simply by a set of global average pooling operations, which leads to the loss of fine details; the invention provides a new cost volume generation process that combines the extracted multi-scale image features to strengthen the interaction between information at different scales, thereby taking the receptive field of each scale into account.
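A rough sketch of this multi-scale cost volume generation is given below. It is only one plausible realization: it assumes the medium-scale maps share the channel width of the fine-scale maps and have half their resolution, realizes the "reinforcement" of the fine-scale features by adding the upsampled medium-scale features before correlation, and produces cost volumes 2 and 3 by average pooling over the target dimensions; the patent does not fix these operators.

```python
import torch
import torch.nn.functional as F

def correlation(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """All-pairs dot-product correlation of two (B, C, H, W) feature maps,
    giving a cost volume of shape (B, H, W, H, W)."""
    b, c, h, w = f1.shape
    return torch.einsum('bchw,bcuv->bhwuv', f1, f2) / c ** 0.5

def multiscale_cost_pyramid(fine1, fine2, mid1, mid2):
    """Cost volumes 1, 2 and 3 (illustrative sketch).

    fine*, mid*: fine- and medium-scale feature maps of the two frames;
    H and W of the fine scale are assumed divisible by 4.
    """
    b, _, h, w = fine1.shape
    up = lambda x: F.interpolate(x, size=(h, w), mode='bilinear', align_corners=False)
    f1 = fine1 + up(mid1)   # fine-scale features reinforced by the medium scale
    f2 = fine2 + up(mid2)

    cost1 = correlation(f1, f2)                                   # cost volume 1
    cost2 = F.avg_pool2d(cost1.reshape(b * h * w, 1, h, w), 2)    # pool the target dimensions
    cost2 = cost2.reshape(b, h, w, h // 2, w // 2)                # cost volume 2
    cost3 = F.avg_pool2d(cost2.reshape(b * h * w, 1, h // 2, w // 2), 2)
    cost3 = cost3.reshape(b, h, w, h // 4, w // 4)                # cost volume 3
    return cost1, cost2, cost3
```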
Further, step S104 comprises:
the optical flow records the positional offset of each pixel between the two frames, so a context network is used to obtain the context encoding of the first frame, assisting the optical flow network in learning positional information between the two frames; the correspondence between the two frames is then looked up on the cost volume using the context information of the first frame, so that an accurate optical flow estimation result is computed.
Further, step S105 comprises:
supervising the learning of the optical flow using the endpoint error of the optical flow as the loss function, and fitting the estimated optical flow result. The method of the invention yields good optical flow estimation accuracy and can be applied to fields such as autonomous driving and robotics.
Embodiment Two
As shown in FIG. 2, the present invention also provides a system for adaptive optical flow estimation for targets of different scales, the system comprising:
a shallow feature extraction module 201, configured to input two adjacent frames of images into a convolutional neural network and extract features of the two frames to obtain shallow features of the two frames;
a multi-scale feature extraction module 202, configured to process the shallow features of the two frames to obtain multi-scale features of the two frames, where the multi-scale features comprise coarse-scale, medium-scale and fine-scale features;
a multi-scale cost volume generation module 203, configured to obtain a multi-scale cost volume through information interaction among the coarse-scale, medium-scale and fine-scale features of the two frames;
an optical flow estimation calculation module 204, configured to perform context encoding on the first frame of the two frames and compute an optical flow estimation result by combining it with the multi-scale cost volume;
and an optical flow estimation fitting module 205, configured to fit the optical flow estimation result using the endpoint error of the optical flow as the loss function.
The system is configured to implement the method for adaptive optical flow estimation for targets of different scales according to Embodiment One, and details are not repeated here in order to avoid redundancy.
Embodiment Three
As shown in FIG. 3, an embodiment of the present invention further provides a network device, where the device includes a processor 301, a memory 302 and a bus system 303, the processor 301 and the memory 302 are connected via the bus system 303, the memory 302 is configured to store instructions, and the processor 301 is configured to execute the instructions stored in the memory 302;
wherein the processor 301 is configured to: input two adjacent frames of images into a convolutional neural network and extract features of the two frames to obtain shallow features of the two frames; process the shallow features of the two frames to obtain multi-scale features of the two frames, wherein the multi-scale features comprise coarse-scale, medium-scale and fine-scale features; obtain a multi-scale cost volume through information interaction among the coarse-scale, medium-scale and fine-scale features of the two frames; perform context encoding on the first frame of the two frames and compute an optical flow estimation result by combining it with the multi-scale cost volume; and fit the optical flow estimation result using the endpoint error of the optical flow as the loss function.
The network device introduces the feature-selectable module into the field of optical flow estimation and integrates it into the network, which helps the network generate multi-scale feature information, so that more accurate optical flow estimation results are learned for objects of different scales; second, a multi-scale cost volume generation module is introduced, and the multi-scale cost volume enhances the ability to characterize similarity; finally, the feature-selectable module is used to enhance the generation of the multi-scale cost volume, and the multi-scale cost volume and the context encoding are learned jointly, thereby improving the accuracy of optical flow estimation.
Optionally, as an embodiment, an unmanned vehicle comprises the above network device comprising a processor 301, a memory 302 and a bus system 303, which are not described in detail herein to avoid redundancy.
Optionally, as an embodiment, a robot comprises the above network device, where the network device comprises a processor 301, a memory 302 and a bus system 303; to avoid repetition, details are not described here again.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Various other modifications and alterations will occur to those skilled in the art upon reading the foregoing description; it is neither necessary nor possible to exhaustively list all embodiments here, and obvious variations or modifications of the invention may be made without departing from its spirit or scope.

Claims (10)

1. A method for adaptive optical flow estimation for targets of different scales, comprising:
S1: inputting two adjacent frames of images into a convolutional neural network and extracting features of the two frames to obtain shallow features of the two frames;
S2: processing the shallow features of the two frames to obtain multi-scale features of the two frames, wherein the multi-scale features comprise coarse-scale, medium-scale and fine-scale features;
S3: obtaining a multi-scale cost volume through information interaction among the coarse-scale, medium-scale and fine-scale features of the two frames;
S4: performing context encoding on the first frame of the two frames and computing an optical flow estimation result by combining it with the multi-scale cost volume;
S5: fitting the optical flow estimation result using the endpoint error of the optical flow as the loss function.
2. The method for adaptive optical flow estimation for targets of different scales according to claim 1, wherein the convolutional neural network employs a downsampling structure.
3. The method for adaptive optical flow estimation for targets of different scales according to claim 1, wherein the feature extraction used to obtain the shallow features of the two frames comprises convolution, pooling and normalization.
4. The method for adaptive optical flow estimation for targets of different scales according to claim 1, wherein the processing of the shallow features of the two frames comprises a segmentation operation, a fusion operation and a selection operation.
5. The method for adaptive optical flow estimation for targets of different scales according to claim 4, wherein the segmentation operation specifically comprises:
given an intermediate feature map $M \in \mathbb{R}^{C \times H \times W}$ as input, two convolutional layers with kernel sizes of 3 and 5, respectively, split the intermediate feature map $M$ into image features at two different scales, $\widetilde{M}_{fine}$ and $\widetilde{M}_{coarse}$; here the 5 × 5 convolution kernel is replaced by a kernel of size 3 × 3 configured as a dilated convolution with a dilation coefficient of 2.
6. The method for adaptive optical flow estimation for targets of different scales according to claim 4, wherein the fusion operation specifically comprises:
first, the multi-scale information of the two different branches is fused through an element-wise summation operation to obtain the fused two-scale feature:
$$M_{fuse} = \widetilde{M}_{fine} + \widetilde{M}_{coarse}$$
then, for $M_{fuse}$, global information in the spatial dimension is captured using global average pooling:
$$s = F_{gap}(M_{fuse}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} M_{fuse}(i, j)$$
where $F_{gap}(\cdot)$ denotes the global average pooling operation, and H and W are the height and width of the feature map, respectively;
finally, the features are aggregated with a fully connected layer, and a batch normalization layer and an activation function are added after the fully connected layer:
$$t = \delta\left(\mathcal{B}\left(F_{fc}(s)\right)\right)$$
where $F_{fc}(\cdot)$ denotes the fully connected layer, $\delta$ denotes the ReLU activation function, and $\mathcal{B}(\cdot)$ denotes the batch normalization layer.
7. The method for adaptive optical flow estimation for targets of different scales according to claim 4, wherein the selection operation specifically comprises:
the feature matrix t is used to guide soft attention across channels to adaptively select different spatial scales of information; the dimension of t is first expanded, and a softmax operator is then applied along the channel dimension to obtain the attention weights:
$$a_c = \frac{e^{A_c t}}{e^{A_c t} + e^{B_c t}}, \qquad b_c = \frac{e^{B_c t}}{e^{A_c t} + e^{B_c t}}$$
where the above is the softmax formula and A and B denote the learned matrices that expand the dimension of t for the fine and coarse branches;
the final feature maps $M_{fine}$ and $M_{coarse}$ are generated using the obtained attention weights, that is, the corresponding weighting coefficients are applied to the segmented features:
$$M_{fine} = a_c \cdot \widetilde{M}_{fine}, \qquad M_{coarse} = b_c \cdot \widetilde{M}_{coarse}$$
where $\widetilde{M}_{fine}$ and $\widetilde{M}_{coarse}$ denote the features at the two scales obtained after segmenting the input feature M, and adding them gives the fused feature $M_{fuse}$.
8. The method for adaptive optical flow estimation for targets of different scales according to claim 1, wherein obtaining the multi-scale cost volume through information interaction among the coarse-scale, medium-scale and fine-scale features of the two frames comprises:
generating an initial cost volume from the fine-scale features, and then reinforcing the fine-scale features with the medium-scale features to obtain cost volume 1;
pooling cost volume 1 to obtain cost volume 2, where cost volume 2 fuses the medium-scale and coarse-scale features;
pooling cost volume 2 to obtain cost volume 3.
9. A system for adaptive optical flow estimation for targets of different scales, comprising:
a shallow feature extraction module, configured to input two adjacent frames of images into a convolutional neural network and extract features of the two frames to obtain shallow features of the two frames;
a multi-scale feature extraction module, configured to process the shallow features of the two frames to obtain multi-scale features of the two frames, wherein the multi-scale features comprise coarse-scale, medium-scale and fine-scale features;
a multi-scale cost volume generation module, configured to obtain a multi-scale cost volume through information interaction among the coarse-scale, medium-scale and fine-scale features of the two frames;
an optical flow estimation calculation module, configured to perform context encoding on the first frame of the two frames and compute an optical flow estimation result by combining it with the multi-scale cost volume;
and an optical flow estimation fitting module, configured to fit the optical flow estimation result using the endpoint error of the optical flow as the loss function.
10. A network device comprising a processor, a memory and a bus system, the processor and the memory being connected via the bus system, the memory being adapted to store instructions, and the processor being adapted to execute the instructions stored by the memory to implement the method of any one of claims 1 to 8.
CN202211221511.7A 2022-10-08 Method and system for adaptive optical flow estimation for targets of different scales Active CN115690170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211221511.7A CN115690170B (en) 2022-10-08 Method and system for adaptive optical flow estimation for targets of different scales

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211221511.7A CN115690170B (en) 2022-10-08 Method and system for adaptive optical flow estimation for targets of different scales

Publications (2)

Publication Number Publication Date
CN115690170A true CN115690170A (en) 2023-02-03
CN115690170B CN115690170B (en) 2024-10-15


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486107A (en) * 2023-06-21 2023-07-25 南昌航空大学 Optical flow calculation method, system, equipment and medium
CN118397038A (en) * 2024-06-24 2024-07-26 中南大学 Moving object segmentation method, system, equipment and medium based on deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111366A (en) * 2019-05-06 2019-08-09 北京理工大学 A kind of end-to-end light stream estimation method based on multistage loss amount
CN111291647A (en) * 2020-01-21 2020-06-16 陕西师范大学 Single-stage action positioning method based on multi-scale convolution kernel and superevent module
CN111340844A (en) * 2020-02-24 2020-06-26 南昌航空大学 Multi-scale feature optical flow learning calculation method based on self-attention mechanism
CN111582483A (en) * 2020-05-14 2020-08-25 哈尔滨工程大学 Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism
CN114677412A (en) * 2022-03-18 2022-06-28 苏州大学 Method, device and equipment for estimating optical flow
CN114943747A (en) * 2022-04-08 2022-08-26 浙江商汤科技开发有限公司 Image analysis method and device, video editing method and device, and medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111366A (en) * 2019-05-06 2019-08-09 北京理工大学 A kind of end-to-end light stream estimation method based on multistage loss amount
CN111291647A (en) * 2020-01-21 2020-06-16 陕西师范大学 Single-stage action positioning method based on multi-scale convolution kernel and superevent module
CN111340844A (en) * 2020-02-24 2020-06-26 南昌航空大学 Multi-scale feature optical flow learning calculation method based on self-attention mechanism
CN111582483A (en) * 2020-05-14 2020-08-25 哈尔滨工程大学 Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism
CN114677412A (en) * 2022-03-18 2022-06-28 苏州大学 Method, device and equipment for estimating optical flow
CN114943747A (en) * 2022-04-08 2022-08-26 浙江商汤科技开发有限公司 Image analysis method and device, video editing method and device, and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zachary Teed and Jia Deng: "RAFT: Recurrent All-Pairs Field Transforms for Optical Flow", arXiv:2003.12039v3, 25 August 2020 (2020-08-25), pages 1-21 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486107A (en) * 2023-06-21 2023-07-25 南昌航空大学 Optical flow calculation method, system, equipment and medium
CN116486107B (en) * 2023-06-21 2023-09-05 南昌航空大学 Optical flow calculation method, system, equipment and medium
CN118397038A (en) * 2024-06-24 2024-07-26 中南大学 Moving object segmentation method, system, equipment and medium based on deep learning

Similar Documents

Publication Publication Date Title
Guen et al. Disentangling physical dynamics from unknown factors for unsupervised video prediction
Guizilini et al. 3d packing for self-supervised monocular depth estimation
Guizilini et al. Robust semi-supervised monocular depth estimation with reprojected distances
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
CN113657560B (en) Weak supervision image semantic segmentation method and system based on node classification
CN111667535B (en) Six-degree-of-freedom pose estimation method for occlusion scene
CN102722697B (en) Unmanned aerial vehicle autonomous navigation landing visual target tracking method
CN108550162B (en) Object detection method based on deep reinforcement learning
CN113158862A (en) Lightweight real-time face detection method based on multiple tasks
CN115063445A (en) Target tracking method and system based on multi-scale hierarchical feature representation
CN109685830B (en) Target tracking method, device and equipment and computer storage medium
CN111696110A (en) Scene segmentation method and system
CN111191739B (en) Wall surface defect detection method based on attention mechanism
CN113160278A (en) Scene flow estimation and training method and device of scene flow estimation model
US20230020713A1 (en) Image processing system and method
CN114677412A (en) Method, device and equipment for estimating optical flow
CN112507943A (en) Visual positioning navigation method, system and medium based on multitask neural network
CN111260660A (en) 3D point cloud semantic segmentation migration method based on meta-learning
CN113420590A (en) Robot positioning method, device, equipment and medium in weak texture environment
CN109493370B (en) Target tracking method based on space offset learning
CN107798329A (en) Adaptive particle filter method for tracking target based on CNN
CN117710841A (en) Small target detection method and device for aerial image of unmanned aerial vehicle
Shukla et al. UBOL: User-Behavior-aware one-shot learning for safe autonomous driving
CN111738092A (en) Method for recovering shielded human body posture sequence based on deep learning
CN112232126A (en) Dimension reduction expression method for improving variable scene positioning robustness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant