CN116823908A - Monocular image depth estimation method based on multi-scale feature correlation enhancement - Google Patents

Monocular image depth estimation method based on multi-scale feature correlation enhancement

Info

Publication number
CN116823908A
CN116823908A
Authority
CN
China
Prior art keywords
depth
module
feature
map
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310758435.1A
Other languages
Chinese (zh)
Inventor
明悦
韦秋吉
洪开
吕柏阳
赵盼孜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202310758435.1A
Publication of CN116823908A
Legal status: Pending


Abstract

The invention provides a monocular image depth estimation method based on multi-scale feature correlation enhancement. The method comprises the following steps: performing a data-enhancement preprocessing operation on the input RGB image with a multi-mode RGB-Depth fusion module; extracting a multi-scale feature map from the enhanced data with a multi-scale depth coding module; in the decoding stage, using an RFF module to obtain fine-grained feature maps, using an MFCE module to enhance the correlation between features of different scales, and combining the RFF module and the MFCE module to fuse and optimize the feature maps so as to obtain a pixel-by-pixel depth map; and optimizing the training of the whole monocular depth estimation network model with a depth characterization objective function to ensure generalization capability. The method enhances the correlation between global and local features, learns effective appearance-structure information, alleviates the false estimation of appearance structure caused by texture bias, and reconstructs a clear and dense monocular depth map.

Description

Monocular image depth estimation method based on multi-scale feature correlation enhancement
Technical Field
The invention relates to the technical field of image processing, in particular to a monocular image depth estimation method based on multi-scale feature correlation enhancement.
Background
Depth estimation aims to recover depth-of-field information from images. It is an important research direction in computer vision and has been widely applied to three-dimensional reconstruction, robot navigation, automatic driving and other fields. With the progress of deep learning, depth estimation methods based on convolutional neural networks (Convolutional Neural Network, CNN) have gradually become an important research focus in this field. Depth estimation can be broadly divided into monocular depth estimation (Monocular Depth Estimation), binocular depth estimation (Stereo Depth Estimation) and multi-view depth estimation (Multi-view Depth Estimation). Compared with binocular and multi-view depth estimation, monocular depth estimation can complete the initial image acquisition with only one camera, which reduces acquisition cost and equipment complexity and better meets the requirements of practical applications. However, recovering three-dimensional scene depth from a single two-dimensional image is under-constrained and admits multiple interpretations, making monocular depth estimation an ill-posed problem with an inherent scale ambiguity that makes depth recovery challenging. In recent years, more and more researchers have focused on depth estimation from monocular images, and this task has gradually become a research hotspot and a research difficulty in the field of image depth estimation.
Monocular depth estimation has great application value in real scenes. In automatic driving systems, it helps a vehicle perceive the surrounding environment, including detecting the distance of obstacles ahead and estimating the depth of the road, so as to ensure safe driving. In augmented reality applications, it allows virtual objects to interact accurately with the real world: by estimating the depth of objects in the scene, virtual objects can be positioned and occluded correctly, providing a more realistic augmented reality experience. In human-computer interaction interfaces, such as gesture recognition and pose estimation, analysing the depth of the human body in space allows the system to recognize gestures or body poses, enabling natural and intuitive user interface operation. In video surveillance systems, it provides more accurate scene analysis and object tracking: estimating the depth of objects helps to understand the spatial relationships in the scene and supports scene recognition, anomaly detection and safety monitoring. Monocular depth estimation is also very useful for robot navigation and environment perception: by estimating the depth of objects and obstacles, a robot can plan paths, avoid obstacles and navigate to achieve accurate and safe movement.
Texture bias causes false estimation of the appearance structure. Because object textures in real scenes are complex and unevenly distributed, local regions with rich texture are more easily captured by a network model. When performing monocular depth estimation, most existing CNN (Convolutional Neural Network)-based methods tend to pay more attention to local texture features while ignoring global structure information, which easily leads to texture bias in the predicted depth map. In practical applications, this affects the judgement of actual distances to objects made by devices such as robots.
In recent years, the powerful image processing capability of deep neural networks has improved the performance of depth estimation and provided an end-to-end solution for monocular depth estimation. According to the algorithmic pipeline of monocular depth estimation, existing methods can be divided into data preprocessing methods, depth feature encoding methods and depth feature decoding methods.
Data preprocessing method: data preprocessing for monocular depth estimation optimizes and adjusts the input image so that subsequent depth estimation tasks can be performed better. These operations include scaling, normalization, data enhancement and so on, which help reduce noise and improve the generalization ability and robustness of the model while ensuring that the input requirements of the deep learning model are met. In recent years, much preprocessing work has focused on data enhancement, super resolution and the like to improve the quality and diversity of input images.
Depth feature encoding method: depth feature encoding refers to the process of extracting depth-related feature representations from the input image; these features are fed into a subsequent depth estimation module, such as a decoder or regression module, to predict the depth map. In conventional approaches, depth feature encoding relies mainly on manually designed algorithms. Common methods include SIFT (Scale-Invariant Feature Transform), SURF (Speeded-Up Robust Features) and ORB (Oriented FAST and Rotated BRIEF). These algorithms detect key points in the image and compute the corresponding feature descriptors, then find corresponding points by feature matching and use these matches to calculate the depth of objects in three-dimensional space. However, because of their limited characterization ability, these conventional methods do not provide sufficient discrimination when coping with complex scenes and illumination changes. Deep learning methods automatically extract image features in a hierarchical manner and have stronger characterization ability and higher accuracy. The depth feature encoding process is mostly performed automatically by CNNs and Transformers, which learn abstract and hierarchical feature representations from the input image. Depth feature encoding methods can be broadly divided into two categories:
(1) A convolutional neural network-based encoding method;
(2) A Transformer-based encoding method. CNN-based depth feature encoding extracts features from the input image through convolution layers, activation functions and pooling layers, and then extracts high-level semantic information by gradually adjusting the convolution kernel size and the number of channels. Transformer-based encoding divides the input image into several non-overlapping image patches, linearly embeds each patch into a vector, processes these vectors with a self-attention mechanism and positional encoding, and finally performs feature extraction and depth estimation through multiple Transformer layers.
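As a brief illustration of the patch-embedding step in such Transformer-based encoders, the following PyTorch-style sketch shows one common way to split an image into non-overlapping patches and linearly embed them; the patch size and embedding dimension are illustrative assumptions, not values prescribed by any particular prior-art method.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each patch,
    producing the token sequence consumed by a Transformer encoder."""
    def __init__(self, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 768):
        super().__init__()
        # A convolution with kernel = stride = patch_size cuts and embeds patches in one step.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, embed_dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim) token sequence
```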
Depth feature decoding method: depth feature decoding maps the high-dimensional features extracted by the encoder to the depth space to generate a depth prediction map. The decoding process typically involves up-sampling, fusion and reconstruction operations. First, an up-sampling operation enlarges the feature map to the same size as, or close to, the input image. Then, the up-sampled feature maps are fused to capture multi-scale information.
One monocular image depth estimation method in the prior art is a data preprocessing method. For the monocular depth estimation task, much data preprocessing work in recent years has focused on data enhancement, super resolution and the like to improve the quality and diversity of input images. Some researchers enhance the data using the original image together with its horizontally flipped copy. Some researchers encourage the model to adapt to super-resolved image regions, pasting a low-resolution image onto the same region of a high-resolution image or pasting partial regions of a high-resolution image onto the same locations of a low-resolution image, so as to reduce image distortion. Others introduce the CutMix enhancement strategy, in which local image patches are obtained in a "cut-and-paste" manner, ground-truth depth labels are mixed in proportion within the patch to increase diversity, and the regularization effect of the preserved training pixel regions is exploited. Still others propose a data enhancement method for instance segmentation, in which copied instance objects are randomly pasted to arbitrary positions of an image in a Copy-Paste augmentation manner, improving robustness without increasing training cost.
Although the preprocessing methods described above increase the diversity of images through data enhancement, they tend to introduce problems such as over-sharpening of the image or destruction of the image geometry. The adaptive super-resolution method increases the number of image samples but changes the appearance of the image very little, and it also increases the risk of over-sharpening, which raises the error of depth estimation. The "cut-and-paste" data enhancement methods greatly change the appearance of the image, but at the same time destroy the geometric structure in the image and reduce the stability of model training.
A drawback of this prior-art monocular image depth estimation method is that, although preprocessing such as exposure correction, feature point matching or image rotation and shearing can improve the quality of input samples, these methods cannot overcome the structural limitations of RGB images and cannot reduce the irrelevant detail interference caused by dense regions in the image, so the depth feature encoding remains insufficient.
Another monocular image depth estimation method in the prior art is a depth feature decoding method combined with a convolutional neural network. Some researchers propose a decoding network based on fast up-sampling, but its convolution kernels are small, the receptive field is limited, and during feature decoding only simple bilinear interpolation is used to increase the resolution of the depth map, so many depth features are lost. To reduce this loss, researchers add skip connections between decoding layers and the corresponding encoding layers, fusing the coarse depth maps in the decoding network with the fine spatial feature maps in the encoding network, thereby strengthening the mapping and expression of depth features during decoding and improving the accuracy of depth estimation. Besides skip connections, other researchers use two different modules in a multi-scale feature fusion network architecture: the first module convolves with filters of different sizes and merges all the individual feature maps; the second module uses dilated convolutions instead of fully connected layers, reducing computation and enlarging the receptive field. However, these feature fusion methods do not sufficiently eliminate features with low correlation, so the utilization of low-level features in the predicted depth map is not always improved sufficiently.
The drawback of this other prior-art method is that, although depth feature decoding based on convolutional neural networks greatly improves pixel-level precision in monocular depth estimation, CNNs rely mainly on a local perception mechanism, so the correlation between global and local features is insufficient and global appearance-structure information is still lost during feature learning. Furthermore, the down-sampling operations in the encoder-decoder architecture cause loss of detail information, making the integration of global and local features difficult. As the number of network layers increases, irrelevant detail features are continually passed along during feature fusion, which aggravates the texture bias.
Disclosure of Invention
The embodiment of the invention provides a monocular image depth estimation method based on multi-scale feature correlation enhancement, which is used for effectively extracting depth information of a monocular image.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A monocular image depth estimation method based on multi-scale feature correlation enhancement, comprising:
performing data enhancement preprocessing operation on the input RGB image by utilizing a multi-mode RGB-Depth fusion module;
extracting a multi-scale feature map after data enhancement by using a multi-scale depth coding module;
and in the decoding stage, an RFF module is used for acquiring a fine-grained feature map according to the multi-scale feature map, an MFCE module is used for enhancing the correlation of features among different scales in the multi-scale features, and the pixel-by-pixel depth map of the input RGB image is acquired by combining the RFF module and the MFCE module to fuse and optimize the feature map.
Preferably, the preprocessing operation for data enhancement on the input RGB image by using the multi-mode RGB-Depth fusion module includes:
the multi-mode RGB-Depth fusion module fuses the ground-truth depth map into the RGB image in a slicing manner, randomly selecting strips of the depth map in the horizontal and vertical directions and pasting them onto the same positions of the color image to form an RGB-D image carrying depth information. Let x_s ∈ R^(W×H×C_s) denote the RGB image and x_t ∈ R^(W×H×C_t) denote the ground-truth depth map, where W and H are the width and height of the image and C_s and C_t are the numbers of channels of the RGB image and the ground-truth depth map, respectively. The data-enhanced image x′_s is expressed as:

x′_s = M × x_s + (1 − M) × x_t   (1)

If C_s and C_t differ, the RGB image and the ground-truth depth map are first combined along the channel direction so that their channel numbers agree. The binary matrix M (M ∈ {0, 1}) marks the region of x_s replaced by x_t. The width and height (w, h) of the replaced region and its location are expressed as:

(w, h) = (min((W − a×W) × c × p, 1), min((H − a×H) × c × p, 1))   (2)
image[x : x+w, :, i] = depth[x : x+w, :]   (3)
image[:, y : y+h, i] = depth[:, y : y+h]   (4)

where x = a×w, y = a×h, i indexes the three channels of the RGB image, a and c are coefficients in (0, 1), and p is a hyper-parameter with p ∈ (0, 1).
Preferably, the decoding stage uses the RFF module to obtain a fine-grained feature map, uses the MFCE module to enhance correlation of features between different scales in the multi-scale features, fuses and optimizes the feature map by combining the RFF module and the MFCE module, and obtains a pixel-by-pixel depth map, including:
assuming that the multi-scale features include a low-resolution feature map F_1 and a higher-resolution feature map F_2 of different resolutions, the RFF module upsamples the low-resolution feature map F_1 by bilinear interpolation so that its resolution matches that of the higher-resolution feature map F_2, and concatenates the low-resolution feature map F_1 and the higher-resolution feature map F_2 along the same dimension to obtain a feature map F_3. Features of different receptive fields are then obtained from F_3 through two convolution branches: the upper branch extracts features with a two-dimensional convolution whose kernel size is 3, normalizes the input data with a BatchNorm neural network layer, and finally increases the non-linear relations between network layers with a ReLU activation function; the lower branch extracts features with a 5×5 two-dimensional convolution and normalizes them with BatchNorm. The features obtained by the upper and lower branches are fused to obtain the fused feature map F_RFF:

F_3 = Cat(Up(F_1, F_2))   (5)
F_RFF = Cov_5,5(Cov_3,3(F_3)) + Cov_5,5(F_3)   (6)

where Up(·) denotes the bilinear-interpolation up-sampling operation, and Cov_3,3(·) and Cov_5,5(·) denote a 3×3 convolution and a 5×5 convolution, respectively;

let the multi-scale feature map input to the MFCE module be F ∈ R^(W×H×C), where W and H are the width and height of the feature map and C is the number of channels of the feature map. The low-resolution feature map F_1 and the higher-resolution feature map F_2 in F are fused by a first RFF module to generate an enhanced feature map F_E; features F_E1, F_E2 and F_E3 are extracted from F_E through adaptive average pooling layers; F_E1, F_E2 and F_E3 are concatenated along the channel dimension and processed by a 1×1 convolution to form the global feature F_G; F_E is processed in parallel by asymmetric convolutions and a standard convolution to form the feature F_L; the feature F_G and the feature F_L are concatenated along the channel dimension and processed by a 1×1 convolution kernel to obtain the optimized feature map F_MFCE. The MFCE module is computed as follows:

F_E = RFF(F_1, F_2)   (7)
F_Ei = RFF(F_1, AAP_i(F_E)), i = 1, 2, 3   (8)
F_G = Cov_1,1(Cat(F_E1, F_E2, F_E3))   (9)
F_L = Cov_9,1(Cov_1,9(F_E)) + Cov_3,3(F_E)   (10)
F_MFCE = Cov_1,1(Cat(F_G, F_L))   (11)

where Cov_n,m(·) denotes a two-dimensional convolution with kernel size n×m, Cat(·) denotes concatenation of feature maps along the channel dimension, and RFF denotes the relevant feature fusion module;
and outputting the pixel-by-pixel depth map of the input RGB image through an RFF module and an MFCE module.
Preferably, the method further comprises:
parameters and training processes of the multi-modal RGB-Depth fusion module, the multi-scale Depth coding module, the RFF module, and the MFCE module are optimized by a Depth characterization objective function.
According to the technical scheme provided by the embodiments of the invention, the multi-scale feature-correlation-enhanced monocular image depth estimation algorithm not only enhances the features of the input image and provides more geometric and semantic information for the depth estimation model, but also enhances the correlation between global and local features, learns effective appearance-structure information, alleviates the false estimation of appearance structure caused by texture bias, and reconstructs a clear and dense monocular depth map.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a process flow diagram of a monocular image depth estimation method based on multi-scale feature correlation enhancement provided by an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a process of a multi-mode RGB-Depth fusion module according to an embodiment of the present invention;
FIG. 3 is a network structure diagram of a multi-scale depth decoder according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a process of an RFF module according to an embodiment of the present invention;
fig. 5 is a process flow diagram of an MFCE module according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as will be understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
To facilitate understanding of the embodiments of the invention, several specific embodiments are further explained below with reference to the accompanying drawings, and the drawings should in no way be taken to limit the embodiments of the invention.
Monocular depth estimation (Monocular Depth Estimation) refers to the process of recovering depth of field information from a single two-dimensional image. Multiscale feature fusion (Multi-scale Feature Fusion) refers to the process of fusing feature maps of different scale sizes in some way.
In order to enrich geometric information and semantic information in monocular images and enhance correlation between global features and local features and solve the problem of false estimation of appearance structures caused by texture deviation, the embodiment of the invention provides a monocular image depth estimation method based on multi-scale feature correlation enhancement, the processing flow is shown in fig. 1, and the flow comprises four processing steps:
and S1, performing data enhancement preprocessing operation on the input RGB image by utilizing a multi-mode RGB-Depth fusion module so as to enhance the input characteristics of the image and realize image correction.
And S2, extracting the preprocessed multi-scale feature map by using a multi-scale depth coding module.
And S3, acquiring a fine-grained feature map according to the Multi-scale feature map by using an RFF (Relevant Feature Fusion, related feature fusion) module in a decoding stage, enhancing the correlation of different scale features in the Multi-scale features by using an MFCE (Multi-scale Feature Correlation Enhancement ) module, fusing and optimizing the feature map by combining the RFF module and the MFCE module, and acquiring a pixel-by-pixel depth map.
And S4, optimizing the training of the whole monocular depth estimation network model through the depth representation objective function, and ensuring generalization capability.
Specifically, step S1 includes the following. To improve the global feature extraction ability of the monocular depth estimation algorithm and alleviate the false estimation of appearance structure caused by texture bias, the method first designs a multi-mode RGB-Depth fusion module in the image preprocessing stage, which introduces depth as an additional modality into the original RGB image, relieving the uncertainty of acquiring information directly from the RGB image and reducing the noise of the input image. In the depth feature decoding stage, a multi-scale feature fusion module and a multi-scale feature correlation enhancement module are then designed: the multi-scale feature fusion module fuses receptive fields of different sizes and enhances the correlation between features; the multi-scale feature correlation enhancement module learns the correlation between global and local features through a combination of multi-level average pooling layers and multi-level convolution layers, enlarging the receptive field and optimizing the global information.
Fig. 2 is a process flow diagram of the multi-mode RGB-Depth fusion module according to an embodiment of the invention. The multi-mode RGB-Depth fusion module adopts a depth-map-fusion data enhancement method: the ground-truth depth map is fused into the RGB image to form an RGB-D image carrying depth information, which is used as the input of the network model, thereby increasing the diversity of visual information and reducing the noise of the input image. The module adopts a slicing idea, as shown in Fig. 2: strips of the depth map are randomly selected in the horizontal and vertical directions and pasted onto the same positions of the color image to form the input image. Let x_s ∈ R^(W×H×C_s) denote the RGB image and x_t ∈ R^(W×H×C_t) denote the ground-truth depth map, where W and H are the width and height of the image and C_s and C_t are the numbers of channels of the input image and the depth map, respectively. The data-enhanced image x′_s can then be expressed as:

x′_s = M × x_s + (1 − M) × x_t   (1)

If C_s and C_t differ, the two are first combined along the channel direction so that their channel numbers agree. The binary matrix M (M ∈ {0, 1}) marks the region of x_s replaced by x_t. The width and height (w, h) of the replaced region and its location can be expressed as:

(w, h) = (min((W − a×W) × c × p, 1), min((H − a×H) × c × p, 1))   (2)
image[x : x+w, :, i] = depth[x : x+w, :]   (3)
image[:, y : y+h, i] = depth[:, y : y+h]   (4)

where x = a×w, y = a×h, i indexes the three channels of the RGB image, a and c are coefficients in (0, 1), and p is a hyper-parameter with p ∈ (0, 1).
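For illustration, the following NumPy sketch implements the depth-slice paste described above under stated assumptions: illustrative values for the coefficients a, c and the hyper-parameter p, strips clipped to at least one pixel, and an assumed axis convention for equations (3) and (4) in which the width strip runs along the image columns and the height strip along the rows. It is a sketch of the idea, not the exact patented implementation.

```python
import numpy as np

def depth_slice_paste(image: np.ndarray, depth: np.ndarray,
                      a: float = 0.3, c: float = 0.5, p: float = 0.5) -> np.ndarray:
    """Paste horizontal and vertical slices of the ground-truth depth map into the RGB image.

    image: (H, W, 3) RGB array; depth: (H, W) ground-truth depth map, assumed to be
    scaled to the same value range as the image. a, c and p follow equations (1)-(4).
    """
    H, W, _ = image.shape
    out = image.astype(np.float32).copy()
    d = depth.astype(np.float32)

    # Strip width/height following equation (2); clipped here to at least one pixel.
    w = max(int((W - a * W) * c * p), 1)
    h = max(int((H - a * H) * c * p), 1)
    x, y = int(a * W), int(a * H)           # top-left anchors of the strips (assumed)

    # Equations (3)-(4): copy the depth slice into every channel i of the RGB image,
    # so that part of the colour content is replaced by depth information.
    for i in range(3):
        out[:, x:x + w, i] = d[:, x:x + w]  # vertical strip of width w
        out[y:y + h, :, i] = d[y:y + h, :]  # horizontal strip of height h
    return out
```

In practice a and c would be drawn at random for each sample, which corresponds to the random strip selection described in the text.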
The network structure of the multi-scale depth decoder according to an embodiment of the present invention is shown in Fig. 3. For the four feature maps of different scales output by the multi-scale depth coding module, an RFF (Relevant Feature Fusion) module fuses the high-resolution feature map 1 and feature map 2 to obtain fine-grained local features, and an MFCE module fuses the low-resolution feature map 3 and feature map 4, learning the correlation between adjacent features and optimizing the global feature characterization.
The features output by the RFF module and the MFCE module are then fed into a further RFF module to fuse global and local information; the global and local features are concatenated along the channel dimension by a feature splicing operation, the feature map is restored to the same pixel size as the input image by an up-sampling operation and refined by two 3×3 convolution layers (Conv modules), and finally the feature map is mapped into a depth map by a Sigmoid function.
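A rough sketch of this decoding head follows, assuming feature tensors already produced by the RFF and MFCE branches together with an externally constructed fusion module and a two-layer 3×3 convolution head; the exact channel-splicing order of FIG. 3 is not reproduced, so this is an assumed arrangement rather than the network of the figure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def decode_depth(f_local: torch.Tensor, f_global: torch.Tensor,
                 rff_fuse: nn.Module, head: nn.Module,
                 out_size: tuple) -> torch.Tensor:
    """Final decoding step: fuse the local (RFF) and global (MFCE) features with a
    further RFF-style module, up-sample to the input resolution, refine with the
    two-layer 3x3 convolution head and map to per-pixel depth with a Sigmoid."""
    fused = rff_fuse(f_local, f_global)                 # further global/local fusion
    x = F.interpolate(fused, size=out_size,
                      mode="bilinear", align_corners=False)
    return torch.sigmoid(head(x))                       # depth values in (0, 1)

# An example head with an assumed channel width of 64 (C = channel count of `fused`):
# head = nn.Sequential(nn.Conv2d(C, 64, 3, padding=1), nn.ReLU(inplace=True),
#                      nn.Conv2d(64, 1, 3, padding=1))
```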
Fig. 4 is a flowchart illustrating the processing procedure of the RFF module according to an embodiment of the present invention. The RFF module takes feature maps as the input of the network and is used to fuse the low-resolution feature representation between two features; its network structure is shown in Fig. 4. First, the low-resolution feature map F_1 is upsampled by bilinear interpolation so that its resolution matches that of F_2, and the two maps are concatenated along the same dimension to obtain the feature map F_3. Features of different receptive fields are then obtained through two convolution branches: the upper branch extracts features with a two-dimensional convolution whose kernel size is 3 and normalizes the input data with a BatchNorm neural network layer, which helps stabilize network training, and finally increases the non-linear relations between network layers with a ReLU activation function; the lower branch extracts features with a 5×5 two-dimensional convolution and normalizes them with BatchNorm. The receptive fields of the features obtained by the upper and lower branches differ, and fusing the two yields richer fine-grained information. The computation of the multi-scale feature fusion (RFF) module can be expressed as:

F_3 = Cat(Up(F_1, F_2))   (5)
F_RFF = Cov_5,5(Cov_3,3(F_3)) + Cov_5,5(F_3)   (6)

where Up(·) denotes the bilinear-interpolation up-sampling operation, and Cov_3,3(·) and Cov_5,5(·) denote a 3×3 convolution and a 5×5 convolution, respectively.
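For concreteness, the following PyTorch-style sketch implements the RFF computation of equations (5) and (6); the channel widths, the padding, and the rule of always up-sampling the spatially smaller input are assumptions made so that the module can also be reused for equation (8) below, and are not claimed to be the exact patented layer configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RFF(nn.Module):
    """Relevant Feature Fusion (eqs. 5-6): align two feature maps by bilinear
    up-sampling, concatenate them, and fuse two branches with different receptive fields."""

    def __init__(self, c1: int, c2: int, c_out: int):
        super().__init__()
        c_cat = c1 + c2
        # Upper branch: 3x3 conv + BatchNorm + ReLU, then a 5x5 conv (first term of eq. 6).
        self.branch_upper = nn.Sequential(
            nn.Conv2d(c_cat, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, kernel_size=5, padding=2),
        )
        # Lower branch: 5x5 conv + BatchNorm (second term of eq. 6).
        self.branch_lower = nn.Sequential(
            nn.Conv2d(c_cat, c_out, kernel_size=5, padding=2),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # Eq. (5): up-sample the lower-resolution map to the other map's size, then concatenate.
        if f1.shape[-2:] != f2.shape[-2:]:
            if f1.shape[-1] * f1.shape[-2] < f2.shape[-1] * f2.shape[-2]:
                f1 = F.interpolate(f1, size=f2.shape[-2:], mode="bilinear", align_corners=False)
            else:
                f2 = F.interpolate(f2, size=f1.shape[-2:], mode="bilinear", align_corners=False)
        f3 = torch.cat([f1, f2], dim=1)
        # Eq. (6): sum of the two receptive-field branches.
        return self.branch_upper(f3) + self.branch_lower(f3)
```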
Fig. 5 is a flowchart of the MFCE (Multi-scale Feature Correlation Enhancement) module according to an embodiment of the present invention. To enhance the description of shape information, the MFCE module provided by the present invention strengthens the expression of local detail information and global shape information by fusing the context information of neighboring features.
Let the input feature map of this sub-network be F ∈ R^(W×H×C), where W and H denote the width and height of the feature map and C denotes its number of channels. The input low-resolution feature map F_1 and higher-resolution feature map F_2 are fused through a first RFF (Relevant Feature Fusion) module to enhance the correlation between features of different resolutions and to generate the enhanced feature map F_E, whose size is the same as that of F_2. Next, as shown in Fig. 5, F_E passes through AAP (Adaptive Average Pooling) layers, which extract important features more effectively and reduce the tensor size, transforming the image into a low-dimensional space and facilitating the capture of features over a larger range. AAP layers with different kernel sizes adaptively handle different image sizes so as to obtain more global shape information; their outputs, together with F_1, are fed into RFF modules to form the features F_E1, F_E2 and F_E3. F_E1, F_E2 and F_E3 are then concatenated along the channel dimension and processed by a 1×1 convolution to form the refined global feature F_G, where the kernel sizes of the adaptive average pooling (AAP) layers are 2×2, 4×4 and 6×6, respectively. Meanwhile, to reduce the information redundancy caused by symmetric convolutions and to reduce the number of parameters and the amount of computation, F_E is processed by asymmetric convolutions and a standard convolution in parallel: the invention employs a 1×9 asymmetric convolution kernel and a 9×1 asymmetric convolution to enhance local key features in different directions, and this parallel processing of F_E forms the feature F_L, increasing the diversity of local features and enhancing their expressive power. Finally, F_G and F_L are concatenated along the channel dimension to strengthen the contextual correlation between image regions, and a 1×1 convolution kernel removes artifacts introduced by the network, so that shape information is better recovered. The computation of the multi-scale feature correlation enhancement module is as follows:

F_E = RFF(F_1, F_2)   (7)
F_Ei = RFF(F_1, AAP_i(F_E)), i = 1, 2, 3   (8)
F_G = Cov_1,1(Cat(F_E1, F_E2, F_E3))   (9)
F_L = Cov_9,1(Cov_1,9(F_E)) + Cov_3,3(F_E)   (10)
F_MFCE = Cov_1,1(Cat(F_G, F_L))   (11)

where Cov_n,m(·) denotes a two-dimensional convolution with kernel size n×m, Cat(·) denotes concatenation of feature maps along the channel dimension, and RFF denotes the relevant feature fusion module.
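A matching sketch of the MFCE computation in equations (7) to (11) is given below, continuing the RFF sketch above (same imports, RFF class in scope). The adaptive-pooling output sizes (the 2×2, 4×4 and 6×6 "kernel sizes" of the text are treated here as output sizes), the channel widths, the padding, and the bilinear alignment of F_G to F_L before the final concatenation are illustrative assumptions needed to keep the spatial sizes consistent; the text leaves these details implicit.

```python
# Continues the RFF sketch above (same imports; the RFF class is assumed to be in scope).

class MFCE(nn.Module):
    """Multi-scale Feature Correlation Enhancement (eqs. 7-11): combine pooled global
    context with asymmetric-convolution local features on top of an RFF-fused map."""

    def __init__(self, c1: int, c2: int, c: int):
        super().__init__()
        self.rff_in = RFF(c1, c2, c)                                              # eq. (7)
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(s) for s in (2, 4, 6)])  # AAP_1..3
        self.rff_pooled = nn.ModuleList([RFF(c1, c, c) for _ in range(3)])        # eq. (8)
        self.conv_global = nn.Conv2d(3 * c, c, kernel_size=1)                     # eq. (9)
        # Asymmetric 1x9 and 9x1 convolutions plus a standard 3x3 branch, eq. (10).
        self.conv_asym = nn.Sequential(
            nn.Conv2d(c, c, kernel_size=(1, 9), padding=(0, 4)),
            nn.Conv2d(c, c, kernel_size=(9, 1), padding=(4, 0)),
        )
        self.conv_std = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.conv_out = nn.Conv2d(2 * c, c, kernel_size=1)                        # eq. (11)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        f_e = self.rff_in(f1, f2)                                                 # eq. (7)
        f_ei = [rff(f1, pool(f_e))                                                # eq. (8)
                for rff, pool in zip(self.rff_pooled, self.pools)]
        f_g = self.conv_global(torch.cat(f_ei, dim=1))                            # eq. (9)
        f_l = self.conv_asym(f_e) + self.conv_std(f_e)                            # eq. (10)
        # Align F_G to F_L's spatial size before the final concatenation (assumption).
        f_g = F.interpolate(f_g, size=f_l.shape[-2:], mode="bilinear", align_corners=False)
        return self.conv_out(torch.cat([f_g, f_l], dim=1))                        # eq. (11)
```

With these sketches, the decoder of Fig. 3 would apply RFF to feature maps 1 and 2 and MFCE to feature maps 3 and 4, as described above.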
The term "pixel-by-pixel depth map" means a depth map that assigns a depth value to each pixel of the image; the pixel-by-pixel depth map of the input RGB image is output through the RFF module and the MFCE module.
The parameters and training process of the monocular depth estimation network model formed by the multi-mode RGB-Depth fusion module, the multi-scale depth coding module, the RFF module and the MFCE module are optimized through the depth characterization objective function.
In summary, the monocular image depth estimation algorithm with enhanced multi-scale feature correlation provided by the embodiment of the invention not only enhances the features of the input image and provides more geometric information and semantic information for the depth estimation model, but also enhances the correlation between global features and local features, learns effective appearance structure information, solves the problem of false estimation of the appearance structure caused by texture deviation, and reconstructs a clear and dense monocular depth map.
The embodiment of the invention provides a monocular image depth estimation algorithm based on multi-scale feature correlation enhancement. The algorithm adopts a multi-mode RGB-Depth fusion module to enhance the characteristics of an input image; adopting a related characteristic fusion module to fuse information of different receptive fields; the multi-scale characteristic correlation enhancement module is adopted to enhance the correlation among the characteristics, so that the expression of the appearance structure information is promoted, and the depth information of the monocular image can be effectively extracted.
Those of ordinary skill in the art will appreciate that: the drawing is a schematic diagram of one embodiment and the modules or flows in the drawing are not necessarily required to practice the invention.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus or system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, with reference to the description of method embodiments in part. The apparatus and system embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (4)

1. A monocular image depth estimation method based on multi-scale feature correlation enhancement, comprising:
performing data enhancement preprocessing operation on the input RGB image by utilizing a multi-mode RGB-Depth fusion module;
extracting a multi-scale feature map after data enhancement by using a multi-scale depth coding module;
and in the decoding stage, an RFF module is used for acquiring a fine-grained feature map according to the multi-scale feature map, an MFCE module is used for enhancing the correlation of features among different scales in the multi-scale features, and the pixel-by-pixel depth map of the input RGB image is acquired by combining the RFF module and the MFCE module to fuse and optimize the feature map.
2. The method of claim 1, wherein the preprocessing operation for data enhancement of the input RGB image using the multi-mode RGB-Depth fusion module comprises:
the multi-mode RGB-Depth fusion module fuses the ground-truth depth map into the RGB image in a slicing manner, randomly selecting strips of the depth map in the horizontal and vertical directions and pasting them onto the same positions of the color image to form an RGB-D image carrying depth information. Let x_s ∈ R^(W×H×C_s) denote the RGB image and x_t ∈ R^(W×H×C_t) denote the ground-truth depth map, where W and H are the width and height of the image and C_s and C_t are the numbers of channels of the RGB image and the ground-truth depth map, respectively. The data-enhanced image x′_s is expressed as:

x′_s = M × x_s + (1 − M) × x_t   (1)

If C_s and C_t differ, the RGB image and the ground-truth depth map are first combined along the channel direction so that their channel numbers agree. The binary matrix M (M ∈ {0, 1}) marks the region of x_s replaced by x_t. The width and height (w, h) of the replaced region and its location are expressed as:

(w, h) = (min((W − a×W) × c × p, 1), min((H − a×H) × c × p, 1))   (2)
image[x : x+w, :, i] = depth[x : x+w, :]   (3)
image[:, y : y+h, i] = depth[:, y : y+h]   (4)

where x = a×w, y = a×h, i indexes the three channels of the RGB image, a and c are coefficients in (0, 1), and p is a hyper-parameter with p ∈ (0, 1).
3. The method of claim 2, wherein the decoding stage uses the RFF module to obtain a fine-grained feature map, uses the MFCE module to enhance correlation of features between different scales in the multi-scale features, fuses and optimizes the feature map by combining the RFF module and the MFCE module, and obtains a pixel-by-pixel depth map, comprising:
assuming that the multi-scale features include a low-resolution feature map F_1 and a higher-resolution feature map F_2 of different resolutions, the RFF module upsamples the low-resolution feature map F_1 by bilinear interpolation so that its resolution matches that of the higher-resolution feature map F_2, and concatenates the low-resolution feature map F_1 and the higher-resolution feature map F_2 along the same dimension to obtain a feature map F_3. Features of different receptive fields are then obtained from F_3 through two convolution branches: the upper branch extracts features with a two-dimensional convolution whose kernel size is 3, normalizes the input data with a BatchNorm neural network layer, and finally increases the non-linear relations between network layers with a ReLU activation function; the lower branch extracts features with a 5×5 two-dimensional convolution and normalizes them with BatchNorm. The features obtained by the upper and lower branches are fused to obtain the fused feature map F_RFF:

F_3 = Cat(Up(F_1, F_2))   (5)
F_RFF = Cov_5,5(Cov_3,3(F_3)) + Cov_5,5(F_3)   (6)

where Up(·) denotes the bilinear-interpolation up-sampling operation, and Cov_3,3(·) and Cov_5,5(·) denote a 3×3 convolution and a 5×5 convolution, respectively;

let the multi-scale feature map input to the MFCE module be F ∈ R^(W×H×C), where W and H are the width and height of the feature map and C is the number of channels of the feature map. The low-resolution feature map F_1 and the higher-resolution feature map F_2 in F are fused by a first RFF module to generate an enhanced feature map F_E; features F_E1, F_E2 and F_E3 are extracted from F_E through adaptive average pooling layers; F_E1, F_E2 and F_E3 are concatenated along the channel dimension and processed by a 1×1 convolution to form the global feature F_G; F_E is processed in parallel by asymmetric convolutions and a standard convolution to form the feature F_L; the feature F_G and the feature F_L are concatenated along the channel dimension and processed by a 1×1 convolution kernel to obtain the optimized feature map F_MFCE. The MFCE module is computed as follows:

F_E = RFF(F_1, F_2)   (7)
F_Ei = RFF(F_1, AAP_i(F_E)), i = 1, 2, 3   (8)
F_G = Cov_1,1(Cat(F_E1, F_E2, F_E3))   (9)
F_L = Cov_9,1(Cov_1,9(F_E)) + Cov_3,3(F_E)   (10)
F_MFCE = Cov_1,1(Cat(F_G, F_L))   (11)

where Cov_n,m(·) denotes a two-dimensional convolution with kernel size n×m, Cat(·) denotes concatenation of feature maps along the channel dimension, and RFF denotes the relevant feature fusion module;
and outputting the pixel-by-pixel depth map of the input RGB image through an RFF module and an MFCE module.
4. The method of claim 1, wherein the method further comprises:
parameters and training processes of the multi-modal RGB-Depth fusion module, the multi-scale Depth coding module, the RFF module, and the MFCE module are optimized by a Depth characterization objective function.
CN202310758435.1A 2023-06-26 2023-06-26 Monocular image depth estimation method based on multi-scale feature correlation enhancement Pending CN116823908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310758435.1A CN116823908A (en) 2023-06-26 2023-06-26 Monocular image depth estimation method based on multi-scale feature correlation enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310758435.1A CN116823908A (en) 2023-06-26 2023-06-26 Monocular image depth estimation method based on multi-scale feature correlation enhancement

Publications (1)

Publication Number Publication Date
CN116823908A true CN116823908A (en) 2023-09-29

Family

ID=88126939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310758435.1A Pending CN116823908A (en) 2023-06-26 2023-06-26 Monocular image depth estimation method based on multi-scale feature correlation enhancement

Country Status (1)

Country Link
CN (1) CN116823908A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726666A (en) * 2024-02-08 2024-03-19 北京邮电大学 Cross-camera monocular picture measurement depth estimation method, device, equipment and medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination