CN113192149A - Image depth information monocular estimation method, device and readable storage medium - Google Patents

Image depth information monocular estimation method, device and readable storage medium

Info

Publication number
CN113192149A
CN113192149A
Authority
CN
China
Prior art keywords: conv, semantic feature, feature map, image, depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110554113.6A
Other languages
Chinese (zh)
Other versions
CN113192149B (en)
Inventor
王飞
许强
郭宇
张秋光
张雪涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202110554113.6A priority Critical patent/CN113192149B/en
Publication of CN113192149A publication Critical patent/CN113192149A/en
Application granted granted Critical
Publication of CN113192149B publication Critical patent/CN113192149B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
        • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
            • G06T 9/00 Image coding
                • G06T 9/002 Image coding using neural networks
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
            • G06N 3/00 Computing arrangements based on biological models
                • G06N 3/02 Neural networks
                    • G06N 3/04 Architecture, e.g. interconnection topology
                        • G06N 3/045 Combinations of networks
                        • G06N 3/048 Activation functions
                    • G06N 3/08 Learning methods
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
            • G06V 10/00 Arrangements for image or video recognition or understanding
                • G06V 10/40 Extraction of image or video features
                    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular estimation method, device and storage medium for image depth information. The monocular estimation method takes an image to be estimated as the input of a pre-trained self-supervised channel mixing network. An encoder module encodes the image to be estimated to obtain a plurality of semantic feature maps of different levels, which differ in both semantic level and resolution; a channel mixing module mixes and redistributes the semantic feature maps of different levels in the channel direction to obtain fused features of different resolutions; a decoder module decodes the fused features of each resolution to obtain a depth estimate at the corresponding resolution, i.e., the depth image of the image to be estimated. Because the decoder module decodes fused features, the resulting depth image carries more reliable local and global information; compared with an existing reference method without the channel mixing module, the depth estimation performance of the invention is greatly improved.

Description

Image depth information monocular estimation method, device and readable storage medium
Technical Field
The invention belongs to the technical field of 3D computer vision, and particularly relates to a monocular estimation method and device for image depth information and a readable storage medium.
Background
Depth estimation is an important problem in computer vision, with wide applications in fields such as autonomous driving and virtual reality. To solve it, methods based on various sensor configurations, such as monocular cameras, multi-view cameras and radar depth sensors, have been proposed; among them, depth estimation based on a monocular camera is the simplest to configure, but also the most difficult, because the monocular setting suffers from scale ambiguity. The best-performing depth estimation methods at the present stage are supervised training methods based on deep learning, which rely on large amounts of data with ground-truth depth labels; however, accurate depth ground truth is costly to acquire, and a model trained on data from a specific scene is difficult to adapt to different scenes, so such methods are hard to apply widely. Self-supervised monocular depth estimation based on image pairs or videos has recently improved greatly: it requires no labeled data for training, and all supervision comes from image texture information and geometric constraints, so large unlabeled datasets can be used for training and the method adapts well to different scenes.
Specifically, a self-supervised monocular depth estimation method needs only a single picture at test time, and the training data falls into two types: monocular video sequences and stereo image pairs. The core idea of both is to establish pixel correspondences across viewpoints through the estimated depth map. Training on monocular video sequences requires estimating the depth map and the camera motion simultaneously. In methods based on stereo image pairs, the relative pose between the binocular cameras is calibrated in advance, so only the depth map needs to be estimated; such methods therefore outperform video-sequence-based methods.
With the rapid development of deep learning, the performance of neural-network-based self-supervised monocular depth estimation has improved greatly over traditional methods. Considering training on stereo image pairs: Poggi et al. propose learning from image pairs captured by a trinocular camera rig, where the depth estimate of the middle image is constrained geometrically by the left and right views respectively; Tosi et al. propose using the results of traditional methods such as SGM, filtered by left-right consistency constraints, to assist in supervising network training; Zhu et al. propose guiding the optimization of depth map contours with semantic segmentation results; Gonzalez et al. adopt a mirrored occlusion module to estimate occluded regions, effectively suppressing the interference of occlusion in network training. Most of these methods use a network such as ResNet as an encoder to extract multi-scale, hierarchical features of the image, and then use a decoder to obtain depth estimates from these features. When fusing features of different levels, they only perform simple addition or channel-wise concatenation, without fully exploiting the strengths of, and complementarity between, the different levels, so the expressive power of the network is limited.
To address the difficulty of exchanging information between feature groups in group convolution, the channel shuffle operation was proposed, which recombines grouped features along the channel direction. Su et al., to address keypoint detection in difficult scenes such as occlusion in human pose estimation, fused different features with channel shuffle, strengthening the exchange between features at each level and improving detection accuracy.
At present, in the field of depth estimation, no method has explored how to fuse features of different levels more effectively within a depth estimation network and enhance their expressive power.
Disclosure of Invention
In view of the above technical problems in the prior art, the invention provides a monocular estimation method, device and readable storage medium for image depth information, aiming to solve the problems that existing depth estimation methods cannot fully fuse features of different levels or exploit their complementary advantages, which strongly affects the performance of depth estimation networks and lowers the precision of depth estimation results.
In order to achieve the purpose, the invention adopts the technical scheme that:
the invention provides a monocular estimation method of image depth information, which comprises the following steps:
taking an image to be estimated as the input of a pre-trained self-supervised channel mixing network; the self-supervised channel mixing network comprises an encoder module, a channel mixing module CSM and a decoder module;
encoding the image to be estimated with the encoder module to obtain a plurality of semantic feature maps of different levels, which differ in both semantic level and resolution;
mixing and redistributing the semantic feature maps of different levels in the channel direction with the channel mixing module CSM to obtain fused features of different resolutions;
and decoding the fused features of each resolution with the decoder module to obtain depth estimates at the corresponding resolutions, i.e., the depth image of the image to be estimated.
Further, the encoder module adopts a multi-scale feature encoder G_Enc, where G_Enc is an encoder based on the ResNet18 network comprising a convolutional layer conv1 and encoder layers layer1, layer2, layer3 and layer4.
The input of G_Enc is an RGB image, and its outputs are the semantic feature maps R-Conv-1, R-Conv-2, R-Conv-3, R-Conv-4 and R-Conv-5, whose resolutions decrease in sequence.
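As an illustration of the encoder structure described above, the following is a minimal sketch, assuming PyTorch and torchvision (the patent does not name an implementation framework), of how the five outputs R-Conv-1 to R-Conv-5 could be taken from a ResNet18 backbone; the 1/2 to 1/32 resolutions match the embodiment described later.

```python
import torch
import torchvision

class MultiScaleEncoder(torch.nn.Module):
    """Sketch of the ResNet18-based encoder G_Enc.

    Layer names follow torchvision's ResNet; the five outputs stand in for
    R-Conv-1 ... R-Conv-5 at 1/2, 1/4, 1/8, 1/16 and 1/32 resolution.
    """
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet18(weights=None)
        self.conv1 = torch.nn.Sequential(net.conv1, net.bn1, net.relu)
        self.pool = net.maxpool
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4

    def forward(self, x):                      # x: RGB image, (B, 3, H, W)
        r1 = self.conv1(x)                     # R-Conv-1: (B, 64, H/2, W/2)
        r2 = self.layer1(self.pool(r1))        # R-Conv-2: (B, 64, H/4, W/4)
        r3 = self.layer2(r2)                   # R-Conv-3: (B, 128, H/8, W/8)
        r4 = self.layer3(r3)                   # R-Conv-4: (B, 256, H/16, W/16)
        r5 = self.layer4(r4)                   # R-Conv-5: (B, 512, H/32, W/32)
        return r1, r2, r3, r4, r5
```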
Further, the process of using the channel mixing module CSM to mix and redistribute the semantic feature maps of different levels in the channel direction to obtain fused features of different resolutions is specifically as follows:
performing a convolution operation on each of the semantic feature maps R-Conv-1 to R-Conv-5 to obtain the semantic feature maps Conv-1, Conv-2, Conv-3, Conv-4 and Conv-5; the channel counts of the semantic feature maps Conv-1 to Conv-5 are the same;
performing an upsampling operation on each of the semantic feature maps Conv-2, Conv-3, Conv-4 and Conv-5 so that the resolution of the upsampled maps Conv-2 to Conv-5 is the same as that of Conv-1;
merging the semantic feature map Conv-1 with the upsampled maps Conv-2 to Conv-5 and then performing a channel mixing operation to obtain mixed semantic features;
uniformly splitting the mixed semantic features in the channel direction to obtain five levels of semantic features with equal channel counts;
keeping the resolution of the first-level semantic features unchanged and denoting them as the semantic feature map C-Conv-1;
performing downsampling operations on the second- to fifth-level semantic features so that their resolutions match those of Conv-2, Conv-3, Conv-4 and Conv-5 respectively, obtaining the semantic feature maps C-Conv-2, C-Conv-3, C-Conv-4 and C-Conv-5;
performing a convolution operation on each of C-Conv-1 to C-Conv-5 to obtain the semantic feature maps S-Conv-1, S-Conv-2, S-Conv-3, S-Conv-4 and S-Conv-5;
and merging the semantic feature maps Conv-1 to Conv-5 with the semantic feature maps S-Conv-1 to S-Conv-5 in one-to-one correspondence to obtain the fused features of different resolutions.
Further, the merging operation adopts the concat function; the splitting operation adopts the Split function; the upsampling operation uses nearest-neighbor upsampling, and the downsampling operation uses nearest-neighbor downsampling.
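Putting the CSM steps above together, here is a hedged sketch in PyTorch: the concat/split/nearest choices follow this section, the 256-channel width follows the embodiment below, the input channel counts assume a ResNet18 encoder, and the `channel_shuffle` helper implements the usual reshape-transpose-reshape mixing (named explicitly later in the embodiment). All module shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def channel_shuffle(x, groups):
    """reshape -> transpose -> reshape, as in ShuffleNet."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class CSM(torch.nn.Module):
    """Sketch of the channel mixing module (CSM).

    `in_chs` are the channel counts of R-Conv-1..5 from a ResNet18 encoder;
    every branch is first projected to `c` channels (256 in the embodiment)
    by a 1x1 convolution.
    """
    def __init__(self, in_chs=(64, 64, 128, 256, 512), c=256):
        super().__init__()
        self.proj = torch.nn.ModuleList(
            torch.nn.Conv2d(ch, c, kernel_size=1) for ch in in_chs)
        self.post = torch.nn.ModuleList(
            torch.nn.Conv2d(c, c, kernel_size=1) for _ in in_chs)

    def forward(self, feats):                          # R-Conv-1 ... R-Conv-5
        convs = [p(f) for p, f in zip(self.proj, feats)]  # Conv-1 ... Conv-5
        top = convs[0].shape[-2:]
        up = [convs[0]] + [F.interpolate(f, size=top, mode='nearest')
                           for f in convs[1:]]
        mixed = channel_shuffle(torch.cat(up, dim=1), groups=len(convs))
        parts = torch.chunk(mixed, len(convs), dim=1)     # uniform split
        out = []
        for i, (part, conv) in enumerate(zip(parts, convs)):
            if i > 0:  # downsample back to the i-th branch's resolution
                part = F.interpolate(part, size=conv.shape[-2:],
                                     mode='nearest')
            s = self.post[i](part)                        # S-Conv-i
            out.append(torch.cat([conv, s], dim=1))       # fused feature
        return out
```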
Furthermore, the decoder module adopts a deep neural network decoder; the deep neural network decoder comprises a convolution block, an upsampling layer, a merging layer, a convolutional layer and an output layer;
the input of the convolution block is the fused features of different resolutions; the output of the convolution block is connected to the input of the upsampling layer, the output of the upsampling layer is connected to the input of the merging layer, the output of the merging layer is connected to the input of the convolutional layer, and the output of the convolutional layer is connected to the input of the output layer.
Further, the output layer adopts a sigmoid function; a nonlinear transformation is applied to the output value of the sigmoid function to obtain the depth estimate at the corresponding resolution;
wherein the nonlinear transformation formula is as follows:
d=1/(a*y+b)
wherein d is the depth estimate for the corresponding resolution; a is the linear transform coefficient calculated from the maximum depth; y is the output value of the sigmoid function; b is the linear transform coefficient calculated from the minimum depth.
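The patent does not spell out how a and b are derived, only that they come from the maximum and minimum depth. One parameterization consistent with d = 1/(a*y + b), mapping the sigmoid output y in (0, 1) onto (min_depth, max_depth), is sketched below; note that in this particular derivation b uses the maximum depth and a uses both bounds, so the exact assignment is an assumption.

```python
def disp_to_depth(y, min_depth=0.1, max_depth=100.0):
    """Map sigmoid output y in (0, 1) to a depth d = 1/(a*y + b).

    min_depth and max_depth are illustrative values, not from the patent.
    """
    b = 1.0 / max_depth          # y -> 0 gives d -> max_depth
    a = 1.0 / min_depth - b      # y -> 1 gives d -> min_depth
    return 1.0 / (a * y + b)
```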
Further, the training process of the pre-trained self-supervised channel mixing network specifically comprises the following steps:
constructing an image training set; the training set comprises stereo image pairs, where the image to be estimated in a stereo pair is denoted as view T and the other image as view S; applying the same random downsampling and cropping to view T and view S respectively to obtain the cropped view T and cropped view S;
performing rough depth estimation on the cropped view T to obtain a rough depth map corresponding to view T; filtering the rough depth map using consistency constraints between view T and view S to obtain a filtered depth map, which serves as the pseudo-label for training;
taking view T as the input of the self-supervised channel mixing network and outputting the depth map of view T;
converting the depth map of view T into a point cloud of view T, and obtaining the pixel position in view S corresponding to each pixel in the point cloud of view T;
sampling the color of the corresponding pixel in view S back to view T with bilinear interpolation to obtain the generated image T' of view T;
and performing error calculations between view T and the generated image T', and between the estimated depth and the training pseudo-label, to obtain the trained self-supervised channel mixing network.
Further, the error function used between view T and the generated image T' is the L1 function; the error function used between the estimated depth and the training pseudo-label is the SSIM function.
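A minimal sketch of these two error terms, assuming PyTorch; the SSIM implementation (3x3 average-pooling windows) and the weighting between the two terms are assumptions, since the patent only names the L1 and SSIM functions.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM map over 3x3 windows (a common sketch)."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return num / den

def training_loss(view_t, generated_t, est_depth, pseudo_label, w=1.0):
    """L1 photometric error between T and T', plus an SSIM-based error
    between the estimated depth and the pseudo-label, as the patent states;
    the weight w is an assumption."""
    photometric = (view_t - generated_t).abs().mean()
    depth_term = (1 - ssim(est_depth, pseudo_label)).mean() / 2
    return photometric + w * depth_term
```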
The invention also provides an image depth information monocular estimation device, comprising a memory, a processor, and executable instructions stored in the memory and runnable on the processor; when executing the executable instructions, the processor implements the image depth information monocular estimation method.
The invention also provides a computer-readable storage medium on which computer-executable instructions are stored, which when executed by a processor implement the image depth information monocular estimation method.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a monocular estimation method, equipment and a readable storage medium for image depth information.A coder module is used for acquiring semantic feature maps of different layers of an image to be estimated, and a channel mixing module CSM is used for mixing the semantic features of different layers in the channel direction, so that the respective beneficial information of the feature maps of different layers is shared; decoding the fusion characteristics through a decoder module to obtain a depth image of the image to be estimated, wherein the depth image has more reliable local and global information; compared with the existing reference method without the CSM, the depth estimation effect of the invention is greatly improved.
Further, the multi-scale feature encoder G_Enc is an encoder based on the ResNet18 network, which has few network parameters, high training speed and good feature extraction capability.
Furthermore, the channel mixing module mixes and redistributes the features of different levels in the channel direction, so that the new features fuse information from all the original levels. The fused features contain the deep semantic content of the low-resolution features, whose large receptive field helps the decoder reason about weak-texture and occluded regions and understand the relations between objects; at the same time, they retain the local detail of the high-resolution features, which effectively improves the decoder's accuracy on pixel depth in ordinary cases.
Further, a deep neural network decoder is adopted to decode the fused features of different resolutions, so that a depth map is decoded at each resolution of the input features. The decoding of the depth map at each resolution uses both the encoder output features at the current resolution and the upsampled features from the previous resolution: the former ensures that the feature information extracted by the encoder at the original resolution is not corrupted, and the latter lets information from other levels flow into the current level. The depth map decoded at each resolution is supervised during training, which helps the network respond effectively to scale changes in the scene.
Further, the self-supervised channel mixing network is trained on stereo image pairs, with the pose between the left and right cameras calibrated in advance. The network therefore only needs to estimate the depth map and can use the known camera parameters directly to compute pixel correspondences between the left and right images, which greatly reduces the learning difficulty and improves the accuracy of the estimated depth map. Training on stereo image pairs also makes the estimated depth scale well-defined and consistent with the ground-truth depth.
Drawings
Fig. 1 is a network structure diagram of the self-supervised channel mixing network in the embodiment;
FIG. 2 is a structural diagram of the channel mixing module in the embodiment;
FIG. 3 is a structural diagram of the channel mixing layer in the embodiment;
FIG. 4 is the original image whose depth is to be estimated in the embodiment;
FIG. 5 is the depth image estimated by a prior-art reference method;
FIG. 6 is the depth image estimated by the method of the embodiment.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects of the present invention more apparent, the following embodiments further describe the present invention in detail. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a monocular estimation method of image depth information, which takes the image to be estimated as the input of a pre-trained self-supervised channel mixing network; the self-supervised channel mixing network comprises an encoder module, a channel mixing module CSM and a decoder module;
the specific process is as follows:
step 1, encoding an image to be estimated by using an encoder module to obtain a plurality of semantic feature maps of different levels; the semantic levels of the semantic feature maps are different, and the resolution is different.
The encoder module adopts a multi-scale feature encoder G_Enc, where G_Enc is an encoder based on the ResNet18 network comprising a convolutional layer conv1 and encoder layers layer1, layer2, layer3 and layer4.
The input of G_Enc is an RGB image, and its outputs are the semantic feature maps R-Conv-1, R-Conv-2, R-Conv-3, R-Conv-4 and R-Conv-5, whose resolutions decrease in sequence.
Step 2, mixing and redistributing the semantic feature maps of different levels in the channel direction using the channel mixing module CSM to obtain fused features of different resolutions.
The specific process is as follows:
Step 21, performing a convolution operation on each of the semantic feature maps R-Conv-1 to R-Conv-5 to obtain the semantic feature maps Conv-1, Conv-2, Conv-3, Conv-4 and Conv-5; the channel counts of Conv-1 to Conv-5 are the same.
Step 22, performing an upsampling operation on each of Conv-2 to Conv-5 so that the resolution of the upsampled maps is the same as that of Conv-1.
Step 23, merging Conv-1 with the upsampled Conv-2 to Conv-5 and then performing a channel mixing operation to obtain mixed semantic features.
Step 24, uniformly splitting the mixed semantic features in the channel direction to obtain five levels of semantic features with equal channel counts.
Step 25, keeping the resolution of the first-level semantic features unchanged and denoting them as C-Conv-1; performing downsampling operations on the second- to fifth-level features so that their resolutions match those of Conv-2 to Conv-5 respectively, obtaining C-Conv-2, C-Conv-3, C-Conv-4 and C-Conv-5.
Step 26, performing a convolution operation on each of C-Conv-1 to C-Conv-5 to obtain the semantic feature maps S-Conv-1 to S-Conv-5.
Step 27, merging Conv-1 to Conv-5 with S-Conv-1 to S-Conv-5 in one-to-one correspondence to obtain the fused features of different resolutions.
Step 3, decoding the fused features of each resolution using the decoder module to obtain depth estimates at the corresponding resolutions, i.e., the depth image of the image to be estimated.
The decoder module adopts a deep neural network decoder, which comprises a convolution block, an upsampling layer, a merging layer, a convolutional layer and an output layer; the input of the convolution block is the fused features of different resolutions; the output of the convolution block is connected to the input of the upsampling layer, the output of the upsampling layer to the input of the merging layer, the output of the merging layer to the input of the convolutional layer, and the output of the convolutional layer to the input of the output layer.
In the invention, the output layer adopts a sigmoid function, and a nonlinear transformation is applied to its output value to obtain the depth estimate at the corresponding resolution;
wherein the nonlinear transformation formula is as follows:
d=1/(a*y+b)
wherein d is the depth estimate for the corresponding resolution; a is the linear transform coefficient calculated from the maximum depth; y is the output value of the sigmoid function; b is the linear transform coefficient calculated from the minimum depth.
In the invention, the training process of the pre-trained self-supervised channel mixing network specifically comprises the following steps:
constructing an image training set; the training set comprises stereo image pairs, where the image to be estimated in a stereo pair is denoted as view T and the other image as view S; applying the same random downsampling and cropping to view T and view S respectively to obtain the cropped view T and cropped view S;
performing rough depth estimation on the cropped view T to obtain a rough depth map corresponding to view T; filtering the rough depth map using consistency constraints between view T and view S to obtain a filtered depth map, which serves as the pseudo-label for training;
taking view T as the input of the self-supervised channel mixing network and outputting the depth map of view T;
converting the depth map of view T into a point cloud of view T, and obtaining the pixel position in view S corresponding to each pixel in the point cloud of view T;
sampling the color of the corresponding pixel in view S back to view T with bilinear interpolation to obtain the generated image T' of view T;
performing error calculations between view T and the generated image T', and between the estimated depth and the training pseudo-label, to obtain the trained self-supervised channel mixing network; the error function between view T and the generated image T' is the L1 function, and the error function between the estimated depth and the training pseudo-label is the SSIM function.
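The depth-to-point-cloud, projection and bilinear sampling steps above can be sketched as follows, assuming PyTorch; the tensor layout, the pose representation T_ts and all names are illustrative, with the stereo pose assumed known from calibration.

```python
import torch
import torch.nn.functional as F

def warp_from_source(depth_t, img_s, K, K_inv, T_ts):
    """Sketch: depth map of T -> point cloud -> pixel positions in S ->
    bilinear sampling of S back onto T, giving the generated image T'.

    depth_t: (B,1,H,W); img_s: (B,3,H,W); K, K_inv: (B,3,3) intrinsics;
    T_ts: (B,3,4) relative pose from T's camera to S's camera.
    """
    b, _, h, w = depth_t.shape
    device = depth_t.device
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32, device=device),
        torch.arange(w, dtype=torch.float32, device=device),
        indexing='ij')
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0)   # homogeneous pixels
    pix = pix.view(1, 3, -1).expand(b, 3, h * w)
    cam = (K_inv @ pix) * depth_t.view(b, 1, -1)          # back-project to 3D
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w, device=device)], 1)
    proj = K @ (T_ts @ cam_h)                             # project into view S
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    u = 2 * uv[:, 0] / (w - 1) - 1                        # normalize to [-1,1]
    v = 2 * uv[:, 1] / (h - 1) - 1
    grid = torch.stack([u, v], dim=2).view(b, h, w, 2)
    return F.grid_sample(img_s, grid, mode='bilinear',
                         padding_mode='border', align_corners=True)
```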
With the image depth information monocular estimation method, a single test image is input into the trained network model and the depth corresponding to each pixel of the image is obtained automatically; comparing the obtained depth map with the ground truth shows that very high estimation accuracy is achieved.
The invention also provides an image depth information monocular estimation system, comprising an encoder module, a channel mixing module CSM and a decoder module:
the encoder module is used for encoding the image to be estimated to obtain a plurality of semantic feature maps of different levels, which differ in both semantic level and resolution;
the channel mixing module CSM is used for mixing and redistributing the semantic feature maps of different levels in the channel direction to obtain fused features of different resolutions;
and the decoder module is used for decoding the fused features of each resolution to obtain depth estimates at the corresponding resolutions, i.e., the depth image of the image to be estimated.
The present invention also provides an image depth information monocular estimation device, including: a processor, a memory, and a computer program, such as an image depth information monocular estimation program, stored in the memory and executable on the processor.
When the processor executes the computer program, the steps of the image depth information monocular estimation method are implemented: taking the image to be estimated as the input of a pre-trained self-supervised channel mixing network, which comprises an encoder module, a channel mixing module CSM and a decoder module; encoding the image to be estimated with the encoder module to obtain a plurality of semantic feature maps of different levels, which differ in both semantic level and resolution; mixing and redistributing the semantic feature maps of different levels in the channel direction with the channel mixing module CSM to obtain fused features of different resolutions; and decoding the fused features of each resolution with the decoder module to obtain depth estimates at the corresponding resolutions, i.e., the depth image of the image to be estimated.
Alternatively, when executing the computer program, the processor implements the functions of the modules/units in the image depth information monocular estimation system: the encoder module encodes the image to be estimated to obtain a plurality of semantic feature maps of different levels, which differ in both semantic level and resolution; the channel mixing module CSM mixes and redistributes the semantic feature maps of different levels in the channel direction to obtain fused features of different resolutions; and the decoder module decodes the fused features of each resolution to obtain depth estimates at the corresponding resolutions, i.e., the depth image of the image to be estimated.
Illustratively, the computer program may be divided into one or more units, which are stored in the memory and executed by the processor to implement the invention. The one or more units may be a series of computer program instruction segments capable of performing specific functions, used to describe the execution of the computer program in the image depth information monocular estimation device. For example, the computer program may be divided into an encoder module, a channel mixing module CSM and a decoder module, with the specific functions of each module as described above.
The image depth information monocular estimation device may be a desktop computer, a notebook, a palmtop computer, a cloud server or another computing device, and may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the above is merely an example of the image depth information monocular estimation device and does not constitute a limitation of it; the device may include more or fewer components, combine some components, or use different components; for example, it may further include input-output devices, network access devices, buses, etc.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor; the processor is the control center of the image depth information monocular estimation device, with various interfaces and lines connecting the parts of the whole device.
The memory may be used to store the computer programs and/or modules, and the processor may implement the various functions of the image depth information monocular estimation device by running or executing the computer programs and/or modules stored in the memory, and calling data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like.
In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (FlashCard), at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
If the units integrated in the image depth information monocular estimation device are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes of the above methods may be implemented by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, implements the steps of the above methods.
The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), electrical carrier signals, telecommunications signals, software distribution media, etc.
It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
Examples
As shown in fig. 1 to 3, the present embodiment provides a monocular estimation method of image depth information, including the following steps:
step 1, constructing an image training set, wherein the image training set comprises a stereo image pair; the stereo image pair is acquired by a stereo camera, and the position relation between the stereo cameras is known.
Recording an image to be estimated in the stereo image pair as a view T, and recording the other image as a view S; respectively and randomly carrying out the same downsampling and clipping on the view T and the view S to obtain a clipped view T and a clipped view S; meanwhile, correspondingly adjusting the internal parameters of the stereo camera; preferably, the internal parameters of the stereo camera comprise a camera focal length and principal point coordinates; the resolution of the stereo camera internal reference and the stereo image pair are related, the downsampling and cropping of the image change the resolution of the image, the camera internal reference must be modified, otherwise, errors can occur when the corresponding relation of the pixels between the two views is obtained by means of the camera internal reference in step 5, and data augmentation is performed.
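A small sketch of the intrinsics update this step implies, assuming downsampling by a uniform scale followed by a crop whose top-left corner is (crop_x0, crop_y0); the names and the order of operations are assumptions.

```python
def adjust_intrinsics(fx, fy, cx, cy, scale, crop_x0, crop_y0):
    # Downsampling by `scale` multiplies focal lengths and principal point;
    # cropping shifts the principal point by the crop offset.
    fx, fy = fx * scale, fy * scale
    cx, cy = cx * scale - crop_x0, cy * scale - crop_y0
    return fx, fy, cx, cy
```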
Step 2, taking the cropped view T as input to the SGM algorithm and performing rough depth estimation on it to obtain a rough depth map corresponding to view T; filtering out pixels with large errors using consistency constraints between the rough depth maps of view T and view S to obtain a filtered depth map, which serves as the pseudo-label for training.
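A hedged sketch of this pseudo-label step using OpenCV's SGBM matcher; the patent names SGM, while the matcher settings, the horizontal-flip trick for the second disparity map, and the assumption that T is the left view are all illustrative choices, not from the patent.

```python
import cv2
import numpy as np

def sgm_pseudo_label(img_t, img_s, fx, baseline, max_err=1.0):
    # Rough disparity for view T (assumed to be the left view).
    sgm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128,
                                blockSize=5)
    disp_t = sgm.compute(img_t, img_s).astype(np.float32) / 16.0
    # Disparity for view S via the usual horizontal-flip trick.
    disp_s = sgm.compute(np.ascontiguousarray(img_s[:, ::-1]),
                         np.ascontiguousarray(img_t[:, ::-1]))
    disp_s = disp_s[:, ::-1].astype(np.float32) / 16.0
    # Left-right consistency: d_T(x) should match d_S(x - d_T(x)).
    h, w = disp_t.shape
    xs = np.tile(np.arange(w), (h, 1))
    x_in_s = np.clip((xs - disp_t).astype(np.int64), 0, w - 1)
    err = np.abs(disp_t - np.take_along_axis(disp_s, x_in_s, axis=1))
    valid = (disp_t > 0) & (err <= max_err)
    # Convert surviving disparities to depth; zeros mark filtered pixels.
    return np.where(valid, fx * baseline / np.maximum(disp_t, 1e-6), 0.0)
```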
Take view T as the input of the self-supervised channel mixing network and output the depth map of view T; convert the depth map of view T into a point cloud of view T, and obtain the pixel position in view S corresponding to each pixel in the point cloud of view T; sample the color of the corresponding pixel in view S back to view T with bilinear interpolation to obtain the generated image T' of view T; and perform error calculations between view T and the generated image T', and between the estimated depth and the training pseudo-label, to obtain the trained self-supervised channel mixing network.
In this embodiment, the error function between view T and the generated image T' is the L1 function; the error function between the estimated depth and the training pseudo-label is the SSIM function.
Step 3, taking the image to be estimated as the input of the pre-trained self-supervised channel mixing network, which comprises an encoder module, a channel mixing module CSM and a decoder module; the specific process is as follows:
step 31, encoding the image to be estimated by using an encoder module to obtain a plurality of semantic feature maps of different levels; the semantic levels of the semantic feature maps are different, and the resolution is different.
In this embodiment, the encoder module adopts a multi-scale feature encoder G_Enc, an encoder based on the ResNet18 network comprising a convolutional layer conv1 and encoder layers layer1, layer2, layer3 and layer4.
The input of G_Enc is an RGB image; the outputs are the semantic feature maps R-Conv-1, R-Conv-2, R-Conv-3, R-Conv-4 and R-Conv-5, whose resolutions decrease in sequence; in this embodiment, their resolutions are 1/2, 1/4, 1/8, 1/16 and 1/32 of the resolution of the image to be estimated.
G_Enc encodes the input image into feature maps of different semantic levels and resolutions: the ResNet18-based encoder extracts 5 semantic feature maps from different depths of the network, and each of R-Conv-1 to R-Conv-5 contains different semantic information and plays an important role in a different aspect of depth estimation.
Step 32, mixing and redistributing the semantic feature maps of different levels in the channel direction using the channel mixing module CSM to obtain fused features of different resolutions; in this embodiment, the channel mixing module CSM comprises a first convolutional layer, an upsampling layer, a merging layer, a channel mixing layer, a splitting layer, a downsampling layer and a second convolutional layer.
The first convolutional layer comprises five convolutions with kernel sizes (256, C, 1, 1), where C matches the channel count of the corresponding level's features; Conv-1 to Conv-5 are obtained after this convolution operation. The upsampling layer comprises four nearest-neighbor upsampling layers, used to upsample Conv-2 to Conv-5 to the same resolution as Conv-1. The merging layer merges all the upsampled features in the channel direction. The internal operations of the channel mixing layer are, in sequence, reshape-transpose-reshape. The splitting layer divides the mixed features equally over the channels to obtain five features with equal channel counts. The downsampling layer comprises four nearest-neighbor downsampling layers, used to downsample the second to the last features output by the splitting layer so that their resolutions match Conv-2 to Conv-5 respectively, giving the features C-Conv-1 to C-Conv-5. The second convolutional layer comprises five convolutions with kernel sizes (C, C, 1, 1), where C = 256.
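With the embodiment's numbers (five levels, C = 256, all upsampled to 1/2 resolution), the reshape-transpose-reshape sequence interleaves channels across levels; a tiny check using the `channel_shuffle` helper from the CSM sketch above illustrates this.

```python
import torch

# Channels 0..1279 of the concatenated feature, one block of 256 per level.
x = torch.arange(5 * 256).float().view(1, 5 * 256, 1, 1)
y = channel_shuffle(x, groups=5)  # reshape -> transpose -> reshape
# After mixing, the first five channels come from levels 1..5 in turn.
assert y[0, :5, 0, 0].tolist() == [0.0, 256.0, 512.0, 768.0, 1024.0]
```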
The specific process is as follows:
Step 321, perform a 1x1 convolution on each of the semantic feature maps R-Conv-1 to R-Conv-5 to obtain the semantic feature maps Conv-1 to Conv-5; the channel counts of Conv-1 to Conv-5 are the same; in this embodiment, the channel count of Conv-1 to Conv-5 is 256.
Step 322, perform an upsampling operation on each of Conv-2 to Conv-5 so that the resolution of the upsampled maps is the same as that of Conv-1; the upsampling operation uses nearest-neighbor sampling.
Step 323, merge Conv-1 with the upsampled Conv-2 to Conv-5 in the channel direction, then perform a channel mixing operation to obtain mixed semantic features.
Step 324, uniformly split the mixed semantic features in the channel direction to obtain five levels of semantic features with equal channel counts.
Step 325, keep the resolution of the first-level semantic features unchanged and denote them as C-Conv-1; perform downsampling operations on the second- to fifth-level features so that their resolutions match those of Conv-2 to Conv-5 respectively, obtaining C-Conv-2, C-Conv-3, C-Conv-4 and C-Conv-5; the downsampling operation uses nearest-neighbor sampling.
Step 326, perform a 1x1 convolution on each of C-Conv-1 to C-Conv-5 to obtain the semantic feature maps S-Conv-1 to S-Conv-5.
Step 327, in the channel direction, merge S-Conv-1 to S-Conv-5 with Conv-1 to Conv-5 in one-to-one correspondence to obtain the fused features of different resolutions; these fused features form the enhanced feature pyramid and serve as the input of the depth decoder.
In this embodiment, semantic feature maps of different levels are mixed and redistributed in the channel direction by the channel mixing module CSM to obtain fused features of different resolutions. To exploit the advantages of the different levels more effectively, this embodiment provides the channel mixing module CSM, which performs the channel mixing operation between features of different levels to obtain mixed, multi-level enhanced pyramid features, so that features of different levels communicate fully over the channels and complement each other's strengths.
Step 33, take the fused features of different resolutions as the input of the decoder module; the decoder module adopts a deep neural network decoder, which decodes the fused features of each resolution in turn to obtain depth estimates at the corresponding resolutions, and thus the depth image of the image to be estimated.
In this embodiment, the deep neural network decoder comprises a convolution block ConvBlock, an upsampling layer, a merging layer, a convolutional layer and an output layer; the input of the convolution block is the fused features of different resolutions; the output of the convolution block is connected to the input of the upsampling layer, the output of the upsampling layer to the input of the merging layer, the output of the merging layer to the input of the convolutional layer, and the output of the convolutional layer to the input of the output layer.
Specifically, the input of the convolution block ConvBlock is the feature of the previous, adjacent resolution, and its output is a feature at the same resolution; the upsampling layer is a 2x upsampling layer that upsamples this feature by a factor of 2 to the current resolution; the input of the merging layer contains the 2x-upsampled feature at the current resolution and the encoder-output feature at the current resolution, merged in the channel direction.
The convolutional layer takes the channel-merged features as input and outputs a feature with channel count 1; the output layer takes this single-channel feature as input, and the value of each output pixel represents the corresponding depth distribution probability, between 0 and 1.
In this embodiment, the convolution block ConvBlock comprises a convolutional layer and an activation function; the previous low-resolution feature, after upsampling, is added to the current high-resolution feature as a skip-layer feature to further enhance the depth estimation at the current resolution.
The output layer adopts a sigmoid function; a nonlinear transformation is applied to the output value of the sigmoid function to obtain the depth estimate at the corresponding resolution; the nonlinear transformation formula is as follows:
d=1/(a*y+b)
wherein d is the depth estimate for the corresponding resolution; a is the linear transform coefficient calculated from the maximum depth; y is the output value of the sigmoid function; b is the linear transform coefficient calculated from the minimum depth.
In this embodiment, the deep neural network decoder estimates a corresponding depth map from the features at each resolution through combinations of convolutions and activation functions, fully exploiting multi-scale supervised learning; at the same time, low-resolution features are passed through skip layers and combined with high-resolution features to jointly estimate the high-resolution depth map.
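One decoder stage of this design can be sketched as follows, assuming PyTorch; the ELU activation and the channel sizes are assumptions, and the sigmoid head produces the value y fed into d = 1/(a*y + b).

```python
import torch
import torch.nn.functional as F

class ConvBlock(torch.nn.Module):
    """Convolution + activation; the ELU choice is an assumption."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = torch.nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, x):
        return F.elu(self.conv(x))

class DecoderStage(torch.nn.Module):
    """Sketch of one decoder stage: the previous (lower-resolution) feature
    is processed, upsampled 2x, merged in the channel direction with the
    fused feature at the current resolution, and convolved; a sigmoid head
    outputs the per-pixel depth distribution probability y."""
    def __init__(self, prev_ch, skip_ch, out_ch):
        super().__init__()
        self.block = ConvBlock(prev_ch, out_ch)
        self.merge = ConvBlock(out_ch + skip_ch, out_ch)
        self.head = torch.nn.Conv2d(out_ch, 1, 3, padding=1)

    def forward(self, prev_feat, skip_feat):
        x = self.block(prev_feat)
        x = F.interpolate(x, scale_factor=2, mode='nearest')  # 2x upsample
        x = self.merge(torch.cat([x, skip_feat], dim=1))      # channel merge
        y = torch.sigmoid(self.head(x))                       # y in (0, 1)
        return x, y
```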
As shown in figs. 4-6, fig. 4 shows the original image whose depth is to be estimated in the embodiment, fig. 5 shows the depth image estimated from it by a conventional reference method, and fig. 6 shows the depth image estimated by the method described in this embodiment. As can be seen from figs. 4-6, the contours of objects are clearer in the depth map estimated by the method of this embodiment; in particular, the depth estimation for long, thin objects is markedly better.
In the image depth information monocular estimation method of this embodiment, the multi-scale feature encoder G_Enc encodes the input image into feature maps of different semantic levels and resolutions; the channel mixing module CSM mixes and redistributes all the features in the channel direction to obtain fused features; the decoder decodes a corresponding depth map from the fused features at each resolution in turn. Applied to the field of self-supervised monocular depth estimation, the method achieves a better estimation effect.
For a description of a relevant part in the image depth information monocular estimation device and the computer-readable storage medium provided in this embodiment, reference may be made to a detailed description of a corresponding part in the image depth information monocular estimation method described in this embodiment, and details are not described herein again.
The image depth information monocular estimation method provided by the invention makes full use of the features of different semantic levels obtained by the encoder, helping the decoder achieve better estimation accuracy. The shallow encoder layers extract high-resolution features with small receptive fields, which help improve the accuracy of pixel depth estimation in simple cases; the deep encoder layers extract low-resolution features with large receptive fields, which help infer the depth of pixels in difficult cases such as weak texture. The channel mixing module mixes the encoder features in the channel direction so that features of different semantic levels are exchanged there, then splits them uniformly into new fused features. The fused features are merged with the original encoder features as the input features of the next-stage decoder. The decoder estimates a corresponding depth map from the features at every resolution, and the lower-resolution features of the previous level are passed through skip layers and combined with the higher-resolution features of the next level to jointly estimate the higher-resolution depth map.
The above-described embodiment is only one implementation of the technical solution of the present invention; the scope of the present invention is not limited to it, but includes any variations, substitutions and other embodiments that can readily be conceived by those skilled in the art within the technical scope disclosed by the present invention.

Claims (10)

1. A monocular estimation method of image depth information is characterized by comprising the following steps:
taking an image to be estimated as the input of a pre-trained self-supervised channel mixing network; the self-supervised channel mixing network comprises an encoder module, a channel mixing module CSM and a decoder module;
encoding the image to be estimated with the encoder module to obtain a plurality of semantic feature maps of different levels, which differ in both semantic level and resolution;
mixing and redistributing the semantic feature maps of different levels in the channel direction with the channel mixing module CSM to obtain fused features of different resolutions;
and decoding the fused features of each resolution with the decoder module to obtain depth estimates at the corresponding resolutions, i.e., the depth image of the image to be estimated.
2. The method of claim 1, wherein the encoder module adopts a multi-scale feature encoder G_Enc, where G_Enc is an encoder based on the ResNet18 network comprising a convolutional layer conv1 and encoder layers layer1, layer2, layer3 and layer4;
the input of G_Enc is an RGB image, and its outputs are the semantic feature maps R-Conv-1, R-Conv-2, R-Conv-3, R-Conv-4 and R-Conv-5;
the resolutions of the semantic feature maps R-Conv-1, R-Conv-2, R-Conv-3, R-Conv-4 and R-Conv-5 decrease in sequence.
3. The image depth information monocular estimation method of claim 2, wherein the process of mixing and redistributing the plurality of semantic feature maps of different levels along the channel direction with the channel mixing module CSM to obtain fusion features of different resolutions is specifically as follows:
performing a convolution operation on each of the semantic feature maps R-Conv-1 through R-Conv-5 to obtain the semantic feature maps Conv-1, Conv-2, Conv-3, Conv-4 and Conv-5, wherein the semantic feature maps Conv-1 through Conv-5 have the same number of channels;
performing an up-sampling operation on each of the semantic feature maps Conv-2 through Conv-5 so that the resolution of each up-sampled semantic feature map equals that of the semantic feature map Conv-1;
merging the semantic feature map Conv-1 with the up-sampled semantic feature maps Conv-2 through Conv-5 and then performing a channel mixing operation to obtain mixed semantic features;
dividing the mixed semantic features evenly along the channel direction to obtain five layers of semantic features with equal channel numbers;
keeping the resolution of the first-layer semantic features unchanged and denoting them as the semantic feature map C-Conv-1;
performing a down-sampling operation on each of the second- through fifth-layer semantic features so that their resolutions match those of the semantic feature maps Conv-2, Conv-3, Conv-4 and Conv-5, respectively, to obtain the semantic feature maps C-Conv-2, C-Conv-3, C-Conv-4 and C-Conv-5;
performing a convolution operation on each of the semantic feature maps C-Conv-1 through C-Conv-5 to obtain the semantic feature maps S-Conv-1, S-Conv-2, S-Conv-3, S-Conv-4 and S-Conv-5;
and combining the semantic feature maps Conv-1 through Conv-5 with the corresponding semantic feature maps S-Conv-1 through S-Conv-5 to obtain the fusion features of different resolutions.
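A hedged PyTorch sketch of these steps follows; the working channel width mid_ch, the ShuffleNet-style grouped shuffle, and concatenation as the final combining step are assumptions, since the claim does not pin them down:

```python
import torch
import torch.nn.functional as F

def channel_shuffle(x, groups):
    # Interleave channels across the level groups (ShuffleNet-style mixing).
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    return x.transpose(1, 2).reshape(b, c, h, w)

class CSM(torch.nn.Module):
    def __init__(self, in_chs=(64, 64, 128, 256, 512), mid_ch=64):
        super().__init__()
        self.pre = torch.nn.ModuleList(
            torch.nn.Conv2d(c, mid_ch, 1) for c in in_chs)      # -> Conv-1..5
        self.post = torch.nn.ModuleList(
            torch.nn.Conv2d(mid_ch, mid_ch, 3, padding=1) for _ in in_chs)
        self.mid_ch = mid_ch

    def forward(self, feats):                       # feats: R-Conv-1..5
        convs = [p(f) for p, f in zip(self.pre, feats)]
        size = convs[0].shape[-2:]                  # resolution of Conv-1
        ups = [convs[0]] + [F.interpolate(c, size=size, mode="nearest")
                            for c in convs[1:]]     # up-sample Conv-2..5
        mixed = channel_shuffle(torch.cat(ups, dim=1), groups=len(convs))
        parts = torch.split(mixed, self.mid_ch, dim=1)          # C-Conv-1..5
        fused = []
        for conv, part, post in zip(convs, parts, self.post):
            part = F.interpolate(part, size=conv.shape[-2:], mode="nearest")
            fused.append(torch.cat([conv, post(part)], dim=1))  # Conv-i with S-Conv-i
        return fused
```

The shuffle-then-split pair is what exchanges channels across semantic levels before each level is restored to its own resolution.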
4. The method according to claim 3, wherein the merging operation uses the concat function; the dividing operation uses the Split function; the up-sampling operation uses nearest-neighbor up-sampling; and the down-sampling operation uses nearest-neighbor down-sampling.
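These choices map directly onto standard tensor primitives; a small illustrative snippet with arbitrarily chosen shapes:

```python
import torch
import torch.nn.functional as F

hi = torch.randn(1, 64, 96, 320)    # higher-resolution feature map
lo = torch.randn(1, 64, 48, 160)    # lower-resolution feature map

lo_up = F.interpolate(lo, size=hi.shape[-2:], mode="nearest")         # nearest up-sampling
merged = torch.cat([hi, lo_up], dim=1)                                # concat merge
part_a, part_b = torch.split(merged, 64, dim=1)                       # even Split
part_b_dn = F.interpolate(part_b, scale_factor=0.5, mode="nearest")   # nearest down-sampling
```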
5. The image depth information monocular estimation method of claim 1, wherein the decoder module employs a deep neural network decoder; the deep neural network decoder comprises a convolution block, an up-sampling layer, a merging layer, a convolution layer and an output layer;
the inputs of the convolution block are the fusion features of different resolutions; the output of the convolution block is connected to the input of the up-sampling layer, the output of the up-sampling layer is connected to the input of the merging layer, the output of the merging layer is connected to the input of the convolution layer, and the output of the convolution layer is connected to the input of the output layer.
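One decoder stage of this shape could look as follows; the channel sizes, ELU activations and skip-connection wiring are assumptions in the spirit of U-Net-style decoders:

```python
import torch
import torch.nn.functional as F

class DecoderStage(torch.nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.block = torch.nn.Sequential(                      # convolution block
            torch.nn.Conv2d(in_ch, out_ch, 3, padding=1), torch.nn.ELU())
        self.conv = torch.nn.Sequential(                       # convolution layer
            torch.nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1),
            torch.nn.ELU())
        self.out = torch.nn.Conv2d(out_ch, 1, 3, padding=1)    # output layer (pre-sigmoid)

    def forward(self, x, skip):
        x = self.block(x)                                      # convolution block
        x = F.interpolate(x, scale_factor=2, mode="nearest")   # up-sampling layer
        x = torch.cat([x, skip], dim=1)                        # merging layer
        x = self.conv(x)                                       # convolution layer
        return x, torch.sigmoid(self.out(x))                   # sigmoid output (cf. claim 6)
```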
6. The image depth information monocular estimation method of claim 5, wherein the output layer uses a sigmoid function, and a nonlinear transformation is applied to the output value of the sigmoid function to obtain the depth estimate at the corresponding resolution;
wherein the nonlinear transformation is as follows:
d = 1/(a*y + b)
wherein d is the depth estimate at the corresponding resolution; a is a linear transformation coefficient calculated from the maximum depth; y is the output value of the sigmoid function; and b is a linear transformation coefficient calculated from the minimum depth.
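A worked instance of this transform, choosing a and b so that y = 0 maps to the minimum depth and y = 1 to the maximum depth; the claim fixes only the functional form, so this boundary assignment and the example depth range are assumptions:

```python
def sigmoid_to_depth(y, min_depth=0.1, max_depth=100.0):
    """Map a sigmoid output y in [0, 1] to depth d = 1/(a*y + b)."""
    b = 1.0 / min_depth          # coefficient obtained from the minimum depth
    a = 1.0 / max_depth - b      # coefficient obtained from the maximum depth
    return 1.0 / (a * y + b)     # y = 0 -> min_depth, y = 1 -> max_depth

# e.g. sigmoid_to_depth(0.0) == 0.1 and sigmoid_to_depth(1.0) == 100.0
```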
7. The image depth information monocular estimation method of claim 1, wherein the training process of the pre-trained self-supervised channel mixing network is specifically as follows:
constructing an image training set, wherein the image training set comprises stereo image pairs, the image to be estimated in a stereo image pair is denoted as view T, and the other image is denoted as view S; applying the same random down-sampling and cropping to view T and view S to obtain a cropped view T and a cropped view S;
performing coarse depth estimation on the cropped view T to obtain a coarse depth map corresponding to view T; filtering the coarse depth map using the consistency constraint between view T and view S to obtain a filtered depth map, wherein the filtered depth map serves as the training pseudo label;
taking view T as the input of the self-supervised channel mixing network and outputting the depth map of view T;
converting the depth map of view T into a point cloud of view T, and obtaining, for each pixel in the point cloud of view T, the corresponding pixel position in view S;
warping the color of the corresponding pixel in view S back to view T by bilinear interpolation to obtain a generated image T' of view T;
and computing the error between view T and the generated image T', and the error between the estimated depth and the training pseudo label, to obtain the trained self-supervised channel mixing network.
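The back-projection and resampling steps admit a compact sketch; here the intrinsics K, their inverse K_inv, and the relative pose T_ts from view T's frame to view S's frame are assumed to be known batched tensors of shapes (B, 3, 3) and (B, 4, 4), details the claim leaves open:

```python
import torch
import torch.nn.functional as F

def synthesize_t(view_s, depth_t, K, K_inv, T_ts):
    """Generate image T' by sampling view S where the point cloud of
    view T projects, using bilinear interpolation (grid_sample)."""
    b, _, h, w = depth_t.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).view(1, 3, -1).to(depth_t)
    cam = depth_t.view(b, 1, -1) * (K_inv @ pix)      # point cloud of view T
    cam = T_ts[:, :3, :3] @ cam + T_ts[:, :3, 3:]     # points in view S's frame
    uv = K @ cam
    uv = uv[:, :2] / uv[:, 2:].clamp(min=1e-6)        # pixel positions in view S
    grid = torch.stack([uv[:, 0] / (w - 1),           # normalize to [-1, 1]
                        uv[:, 1] / (h - 1)], dim=-1) * 2 - 1
    return F.grid_sample(view_s, grid.view(b, h, w, 2),
                         mode="bilinear", align_corners=True)
```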
8. The image depth information monocular estimation method of claim 7, wherein the error function for the error calculation between view T and the generated image T' is the L1 function, and the error function for the error calculation between the estimated depth and the training pseudo label is the SSIM function.
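A sketch of the combined objective; ssim() is a hypothetical helper (many public implementations exist), "1 - SSIM" is the usual way to turn the similarity into a loss, and the weight w is an assumption, since the claim names only the two error functions:

```python
import torch.nn.functional as F

def training_loss(view_t, view_t_gen, depth, pseudo_label, w=0.5):
    photometric = F.l1_loss(view_t_gen, view_t)    # L1 between T and T'
    depth_error = 1.0 - ssim(depth, pseudo_label)  # SSIM between depth and pseudo label
    return photometric + w * depth_error           # weighting w is an assumption
```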
9. An image depth information monocular estimation device, comprising a memory, a processor, and executable instructions stored in the memory and executable by the processor, wherein the processor, when executing the executable instructions, implements the method of any one of claims 1-8.
10. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the method of any one of claims 1-8.
CN202110554113.6A 2021-05-20 2021-05-20 Image depth information monocular estimation method, apparatus and readable storage medium Active CN113192149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110554113.6A CN113192149B (en) 2021-05-20 2021-05-20 Image depth information monocular estimation method, apparatus and readable storage medium

Publications (2)

Publication Number Publication Date
CN113192149A true CN113192149A (en) 2021-07-30
CN113192149B CN113192149B (en) 2024-05-10

Family

ID=76982747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110554113.6A Active CN113192149B (en) 2021-05-20 2021-05-20 Image depth information monocular estimation method, apparatus and readable storage medium

Country Status (1)

Country Link
CN (1) CN113192149B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN111445476A (en) * 2020-02-27 2020-07-24 上海交通大学 Monocular depth estimation method based on multi-mode unsupervised image content decoupling
CN112561979A (en) * 2020-12-25 2021-03-26 天津大学 Self-supervision monocular depth estimation method based on deep learning
CN112819853A (en) * 2021-02-01 2021-05-18 太原理工大学 Semantic prior-based visual odometer method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN YUNHAN; SHI JINLONG; SUN ZHENGXING: "Estimating Single-Image Depth Information Using a Self-Supervised Convolutional Network", Journal of Computer-Aided Design & Computer Graphics, no. 04 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503461A (en) * 2023-06-28 2023-07-28 中国科学院空天信息创新研究院 Monocular image depth estimation method and device, electronic equipment and storage medium
CN116503461B (en) * 2023-06-28 2023-10-31 中国科学院空天信息创新研究院 Monocular image depth estimation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113192149B (en) 2024-05-10

Similar Documents

Publication Publication Date Title
US10614574B2 (en) Generating image segmentation data using a multi-branch neural network
Alhashim et al. High quality monocular depth estimation via transfer learning
CN112465828B (en) Image semantic segmentation method and device, electronic equipment and storage medium
CN110689599B (en) 3D visual saliency prediction method based on non-local enhancement generation countermeasure network
CN108491848B (en) Image saliency detection method and device based on depth information
CN111932577B (en) Text detection method, electronic device and computer readable medium
CN114972378A (en) Brain tumor MRI image segmentation method based on mask attention mechanism
CN115908789A (en) Cross-modal feature fusion and asymptotic decoding saliency target detection method and device
CN116569218A (en) Image processing method and image processing apparatus
CN107578053B (en) Contour extraction method and device, computer device and readable storage medium
CN113192149B (en) Image depth information monocular estimation method, apparatus and readable storage medium
CN114972016A (en) Image processing method, image processing apparatus, computer device, storage medium, and program product
CN111340935A (en) Point cloud data processing method, intelligent driving method, related device and electronic equipment
CN112489103B (en) High-resolution depth map acquisition method and system
CN114462486A (en) Training method of image processing model, image processing method and related device
Sharma et al. A novel 3d-unet deep learning framework based on high-dimensional bilateral grid for edge consistent single image depth estimation
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium
Nie et al. Binocular image dehazing via a plain network without disparity estimation
CN113689435A (en) Image segmentation method and device, electronic equipment and storage medium
Tran et al. Encoder–decoder network with guided transmission map: Robustness and applicability
CN116433674B (en) Semiconductor silicon wafer detection method, device, computer equipment and medium
Song et al. Two-stage framework with improved U-Net based on self-supervised contrastive learning for pavement crack segmentation
CN116883960B (en) Target detection method, device, driving device, and medium
CN117726664A (en) Model training method, depth estimation method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant