CN115578704A - Depth estimation model training method, depth estimation device, depth estimation equipment and medium - Google Patents

Depth estimation model training method, depth estimation device, depth estimation equipment and medium

Info

Publication number
CN115578704A
Authority
CN
China
Prior art keywords
image
depth estimation
loss
target image
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211228909.3A
Other languages
Chinese (zh)
Inventor
贾炎
范潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202211228909.3A priority Critical patent/CN115578704A/en
Publication of CN115578704A publication Critical patent/CN115578704A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention discloses a depth estimation model training method, a depth estimation method, a depth estimation device, depth estimation equipment and a medium, and relates to the technical field of automatic driving. The method comprises the following steps: acquiring a target image and an adjacent frame image; constructing a depth estimation model and a pose transformation model, wherein the depth estimation model comprises a coding network and a decoding network; the coding network incorporates dense connections and a channel attention mechanism to obtain feature maps at multiple scales; the decoding network fuses the feature maps to obtain depth estimation information; the pose transformation model acquires the pose transformation relation between the target image and the adjacent frame image; generating a reconstructed image based on the depth estimation information and the pose transformation relation; and constructing a loss function based on the reconstructed image and the target image, and completing the training of the depth estimation model by using the loss function. Because dense connections and a channel attention mechanism are added to the coding network, the extracted features are more effective, comprehensive and accurate, and both accuracy and inference speed are improved with only a small increase in the number of model layers.

Description

Depth estimation model training method, depth estimation device, depth estimation equipment and medium
Technical Field
The invention relates to the technical field of automatic driving, in particular to a depth estimation model training method, a depth estimation method, a depth estimation device, depth estimation equipment and a medium.
Background
Depth estimation is mainly used by an automatic driving system to obtain information about the environment around the vehicle; it is an important component of automatic driving technology and a research hotspot in the current automatic driving field. Mainstream methods for obtaining depth information are mostly based on cameras or laser radars. A laser radar can accurately and directly measure the distance to target objects within a certain space, but its limited scanning frequency, the mirror black hole problem and its high price make it difficult to apply widely in the automatic driving field. Compared with a laser radar scheme, a camera-based vision scheme has the advantages of high scanning frequency, long service life and low price. Based on the type of camera, vision schemes can be divided into monocular-camera-based and binocular-camera-based schemes. For the binocular camera scheme, the conventional method searches for corresponding points in the left and right cameras based on a feature point matching algorithm, which places high requirements on the matching accuracy of the feature points and on the synchronism of the left and right cameras. A monocular-camera-based scheme is not limited by binocular camera synchronism and costs less, but its accuracy is inferior to that of the binocular camera scheme.
Based on the training mode, monocular depth estimation can be divided into supervised-learning monocular depth estimation and unsupervised-learning monocular depth estimation. Methods based on supervised learning have the advantages of a relatively simple model, accurate results and an easier training process, but they require a large amount of labeled depth information as a supervision signal during model training, and the acquisition cost of such depth information is very high.
Disclosure of Invention
To solve the above technical problems or at least partially solve the above technical problems, embodiments of the present invention provide a depth estimation model training method, a depth estimation device, a depth estimation apparatus, and a medium.
In a first aspect, an embodiment of the present invention provides a depth estimation model training method, including:
acquiring a target image and an adjacent frame image adjacent to the target image, wherein the target image and the adjacent frame image are respectively images shot by a vehicle-mounted front-view monocular camera;
constructing a depth estimation model and a pose transformation model; the depth estimation model comprises a depth estimation coding network and a depth estimation decoding network, wherein the depth estimation coding network adds a dense connection mechanism and a channel attention mechanism into a residual error structure of the depth estimation coding network so as to fuse shallow features and deep features of the target image and obtain a feature map of the target image in multiple scales; the depth estimation decoding network is used for fusing feature maps of the target image in multiple scales to obtain depth estimation information of the target image; the pose transformation model is used for acquiring a pose transformation relation between the target image and the adjacent frame image;
generating a reconstructed image based on the depth estimation information and the pose transformation relation;
and constructing a loss function based on the reconstructed image and the target image, and completing the training of the depth estimation model by using the loss function.
In a second aspect, an embodiment of the present invention provides a depth estimation method, including: acquiring an image to be estimated; processing the image to be estimated by using a pre-trained depth estimation model to obtain a depth map of the image to be estimated, and taking the depth map as a depth estimation result of the image to be estimated, wherein the depth estimation model is obtained by training according to the depth estimation model training method provided by the embodiment of the invention.
In a third aspect, an embodiment of the present invention provides a depth estimation model training apparatus, including:
a first acquisition module, configured to acquire a target image and an adjacent frame image adjacent to the target image, wherein the target image and the adjacent frame image are images shot by a vehicle-mounted front-view monocular camera;
the construction module is used for constructing a depth estimation model and a pose transformation model; the depth estimation model comprises a depth estimation coding network and a depth estimation decoding network, wherein the depth estimation coding network adds a dense connection mechanism and a channel attention mechanism into a residual error structure of the depth estimation coding network so as to fuse shallow features and deep features of the target image and obtain a feature map of the target image in multiple scales; the depth estimation decoding network is used for fusing feature maps of the target image in multiple scales to obtain depth estimation information of the target image; the pose transformation model is used for acquiring a pose transformation relation between the target image and the adjacent frame image;
a reconstruction module for generating a reconstructed image based on the depth estimation information and the pose transformation relationship;
and the parameter training module is used for constructing a loss function based on the reconstructed image and the target image and completing the training of the depth estimation model by using the loss function.
In a fourth aspect, an embodiment of the present invention provides a depth estimation apparatus, including: the second acquisition module is used for acquiring an image to be estimated; and the depth determining module is used for processing the image to be estimated by utilizing a pre-trained depth estimation model to obtain a depth map of the image to be estimated, and taking the depth map as a depth estimation result of the image to be estimated.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including: one or more processors; a storage device, configured to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the depth estimation model training method or the depth estimation method of the embodiments of the present invention.
In a sixth aspect, the present invention provides a computer readable medium, on which a computer program is stored, where the computer program is executed by a processor to implement the depth estimation model training method or the depth estimation method according to the present invention.
One embodiment of the above invention has the following advantages or benefits:
the depth estimation model training method of the embodiment of the invention comprises the steps of obtaining depth estimation information of a target image through a depth estimation model, obtaining a pose transformation relation between the target image and an adjacent frame image through the pose transformation model, generating a reconstructed image based on the depth estimation information and the pose transformation relation, constructing a loss function based on the reconstructed image and the target image, optimizing parameters of the depth estimation model by using the loss function, and finishing training of the depth estimation model, wherein the depth estimation model comprises a depth estimation coding network and a depth estimation decoding network, the depth estimation coding network adds a dense connection mechanism and a channel attention mechanism into a residual error structure of the depth estimation coding network to fuse shallow layer characteristics and deep layer characteristics of the target image and obtain characteristic maps of multiple scales of the target image, so that the extracted characteristics are more effective, comprehensive and accurate, and the accuracy and the reasoning speed are improved under the condition that the number of layers of the model is increased a little; the depth estimation decoding network is used for fusing feature maps of multiple scales of a target image to obtain depth estimation information of the target image, so that shallow features and deep features of the target image are fully fused, namely, positioning information of shallow feature representations and semantic information of deep feature representations of the target image are fully fused, the accuracy of an obtained depth estimation result is high, the accuracy, reasoning speed and robustness of a depth estimation model are improved, the depth estimation decoding network can adapt to most driving scenes, and accurate depth information is provided for an automatic driving system.
According to the depth estimation method of the embodiment of the invention, the depth estimation model analyzes the two-dimensional images acquired by the vehicle's front-view monocular camera to determine the distance between objects in front and the vehicle, so that after the front-view monocular camera acquires a video frame, the corresponding depth information can be output quickly and accurately. This realizes detection and estimation of the distance between the vehicle and objects in front, such as vehicles, people and buildings, under different weather, road conditions and other environments, and forms a depth estimation system with high accuracy, strong robustness and real-time detection, which is convenient for subsequent three-dimensional target recognition and provides environment information for obstacle avoidance and path planning of an automatically driven vehicle.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 shows a flow diagram of a depth estimation model training method of an embodiment of the invention;
FIG. 2 is a block diagram of a depth estimation coding network of the depth estimation model training method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a second coding layer of a depth estimation coding network according to an embodiment of the present invention;
FIG. 4 is a block diagram of a depth estimation decoding network of the depth estimation model training method according to an embodiment of the present invention;
FIG. 5 is a sub-flow diagram illustrating a depth estimation model training method according to an embodiment of the invention;
FIG. 6 shows a flow diagram of a depth estimation method of an embodiment of the invention;
FIG. 7 shows a flow diagram of a depth estimation method of another embodiment of the invention;
FIG. 8 is a schematic diagram of a depth estimation model training apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram showing a depth estimation device according to an embodiment of the present invention;
fig. 10 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the deep learning field, generally, the more layers a model contains, the more complex the model is, the more complex the nonlinear mapping relationships it can construct, and the more accurate its output. However, as the number of layers and the complexity increase, the inference speed of the model becomes slower and more hardware resources are occupied, and a depth estimation task with high real-time requirements cannot afford a longer inference time. Aiming at this contradiction between model accuracy and model complexity, the embodiment of the invention provides a depth estimation model training method. The depth estimation model of the method uses a new network structure: accuracy can be improved with only a small increase in the number of layers compared with the original model, the inference speed is high and the robustness is strong, so the method can adapt to most driving scenes and provide accurate depth information for an automatic driving system.
Fig. 1 shows a schematic flow chart of a depth estimation model training method according to an embodiment of the present invention, and as shown in fig. 1, the method includes:
step S101: acquiring a target image and an adjacent frame image adjacent to the target image, wherein the target image and the adjacent frame image are respectively images shot by a vehicle-mounted front-view monocular camera.
The embodiment can acquire the video data in the vehicle traveling process from the vehicle-mounted front-view monocular camera and decode the video data into continuous multi-frame images. The target image is any one of the continuous multi-frame images. The adjacent frame images adjacent to the target image comprise an image of a frame before the target image and an image of a frame after the target image.
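A minimal sketch of this data preparation step is given below, assuming OpenCV is used to decode the on-board video; the file path and frame index are placeholders, not values from the patent.

```python
# Decode the front-view camera video into consecutive RGB frames and pick a target frame
# together with its previous and next neighbours.
import cv2

frames = []
cap = cv2.VideoCapture("drive_video.mp4")   # hypothetical path to the recorded drive
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
cap.release()

t = 10                                       # any index with both neighbours available
target_img, prev_img, next_img = frames[t], frames[t - 1], frames[t + 1]
```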
Step S102: and constructing a depth estimation model and a pose transformation model.
The depth estimation model is used for acquiring depth estimation information of the target image, namely the distance between an object on the target image and the vehicle body. The depth estimation model is divided into an encoding part and a decoding part, namely the depth estimation model comprises a depth estimation encoding network and a depth estimation decoding network. The depth estimation coding network is used for acquiring multiple features of the target image, and the depth estimation decoding network is used for acquiring the depth estimation of the target image, namely the distance between an object on the target image and the vehicle body based on the multiple features of the target image.
In order to improve accuracy and inference speed with only a small increase in the number of model layers compared with the original model, the embodiment of the invention adds a dense connection mechanism and a channel attention mechanism into the residual structure of the depth estimation coding network so as to fuse the shallow features (also called low-level features) and deep features (also called high-level features) of the target image and obtain feature maps of the target image at multiple scales.
The depth estimation coding network includes a first coding layer and a plurality of second coding layers. The first coding layer includes a convolutional layer, a batch normalization layer (BN) and an activation layer (ReLU). The second coding layer is a residual structure and comprises a plurality of third coding layers and a channel attention layer, and the third coding layers are connected in a dense connection mode. The third coding layer includes a convolutional layer, a batch normalization layer and an activation layer. In this embodiment, a maximum pooling layer (max pool) is further included between the first coding layer and the second coding layers.
As an example, fig. 2 shows a schematic structural diagram of a depth estimation coding network according to an embodiment of the present invention, and fig. 3 shows a schematic structural diagram of a second coding layer in the depth estimation coding network according to an embodiment of the present invention. The depth estimation coding network of this example may incorporate a dense connection and channel attention mechanism in the residual section based on ResNet18. The network structure of the depth estimation coding network is shown in fig. 2: it comprises 1 first coding layer and 3 second coding layers. As shown in fig. 3, the second coding layer is a residual structure, and its residual branch consists of three third coding layers (each composed of a convolutional layer, a batch normalization layer and an activation layer) and one channel attention layer; the input of each third coding layer is composed of the outputs of all preceding third coding layers and the input of the entire second coding layer. For example, the input of the third of the third coding layers is composed of the original input, the output of the first third coding layer and the output of the second third coding layer, spliced in the channel dimension. In the channel attention layer, the output of the third coding layer first undergoes global average pooling per channel to obtain a tensor of 1x1xC (C is the number of output channels of the third coding layer); information fusion among the channels is then carried out with a 1x1 convolution kernel and an activation layer, after which the weight of each channel of the input features is obtained through another 1x1 convolutional layer and an activation layer. These weights are the output of the channel attention layer. Finally, the output of the third coding layer is multiplied by the per-channel weights, and the result, serving as the residual, is added to the input of the entire second coding layer to form the output of the second coding layer.
In an optional embodiment, for the second coding layer, when the number of output channels of the second coding layer is not consistent with its number of input channels, the input information and the output information of the second coding layer cannot be added directly. To solve this problem, a convolution operation and a normalization operation are performed on the input information of the second coding layer in this case, so that the number of input channels becomes consistent with the number of output channels of the second coding layer.
Compared with a network without dense connections, the depth estimation coding network of the embodiment of the invention strengthens information transfer among feature maps, utilizes the existing features more effectively, and limits the increase in the number of parameters by halving the number of output channels of the first third coding layer. The channel self-attention mechanism deepens the fusion between channels and strengthens the channels that are more effective for depth estimation. The 1x1 convolutions used also have fewer parameters than a fully connected layer, which further prevents a large increase in computation and the occurrence of overfitting.
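The sketch below illustrates one possible PyTorch layout of such a second coding layer. It is not the patented implementation: the module names, the channel-reduction ratio in the attention layer and the decision to halve the channels of the first two sub-layers are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ThirdCodingLayer(nn.Module):
    """Convolution + batch normalization + activation, as described for the third coding layer."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class ChannelAttention(nn.Module):
    """Global average pooling followed by two 1x1 convolutions producing per-channel weights."""
    def __init__(self, channels, reduction=4):          # reduction ratio is an assumption
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # (B, C, 1, 1), i.e. a 1x1xC descriptor
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                 # per-channel weights
        )

    def forward(self, x):
        return self.fc(self.pool(x))

class SecondCodingLayer(nn.Module):
    """Residual block whose branch is three densely connected sub-layers plus channel attention."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        mid = out_ch // 2                                 # halved channels limit parameter growth
        self.l1 = ThirdCodingLayer(in_ch, mid)
        self.l2 = ThirdCodingLayer(in_ch + mid, mid)
        self.l3 = ThirdCodingLayer(in_ch + 2 * mid, out_ch)
        self.attn = ChannelAttention(out_ch)
        # 1x1 projection when the input and output channel counts differ
        self.proj = (nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
                                   nn.BatchNorm2d(out_ch))
                     if in_ch != out_ch else nn.Identity())

    def forward(self, x):
        f1 = self.l1(x)
        f2 = self.l2(torch.cat([x, f1], dim=1))           # dense connection: block input + earlier outputs
        f3 = self.l3(torch.cat([x, f1, f2], dim=1))
        out = f3 * self.attn(f3)                          # re-weight channels
        return out + self.proj(x)                         # residual addition
```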
In order to fully fuse the shallow features and deep features of the target image acquired by the depth estimation coding network, the embodiment of the invention provides a depth estimation decoding network that propagates deep features to shallow features and densely connects features of the same level, so that the semantic information of the deep features and the positioning information of the shallow features are fully fused and accurate depth estimation information is obtained.
Further, the depth estimation decoding network includes a first decoding layer and a plurality of second decoding layers. The input of the first decoding layer is the feature graph output by the first coding layer and the feature information output by the adjacent second decoding layer. The input of the second decoding layer is the feature graph output by the corresponding second coding layer and the feature information output by the second decoding layer of the previous level.
Further, the first decoding layer and the second decoding layer include an extraction node, a fusion node, and an output node. And the fusion nodes and the output nodes are connected in a dense connection mode. The extraction node is used for acquiring a feature map output by the first coding layer or the second coding layer; the fusion node is used for fusing the feature information output by the extraction node at the same layer and the feature information output by the extraction node and the fusion node at the previous level; the output node is used for obtaining the depth maps of the target image with different scales based on the feature information output by the fusion node.
As an example, the network structure of a depth estimation decoding network is shown in fig. 4, which corresponds to the depth estimation coding network shown in fig. 2. Each row in fig. 4 is a decoding layer: the first row is the first decoding layer, and the second to fifth rows are second decoding layers. The nodes numbered 00, 10, 20, 30 and 40 in the figure are the outputs of the designated target layers of the depth estimation coding network, namely layer0 to layer4 in fig. 2; that is, the nodes numbered 00, 10, 20, 30 and 40 are extraction nodes. The node numbered 40 is the output of the deepest layer of the coding network, corresponding to layer4 in fig. 2. The decoding network passes from the highest-level features (represented by node 40) to the lower-level features, i.e. from the deep features to the shallow features. The difference from FPN (Feature Pyramid Network) mainly lies in two points. First, the decoding layers where nodes 00, 10 and 20 are located not only contain features from the coding network, but also merge the high-level features of all neighboring decoding layers (in fig. 4, the lower rows are the relatively higher-level features). For example, the decoding layer in which node 10 is located includes not only the output features from the coding network represented by node 10, but also the features generated by upsampling and fusing the high-level features 20, 21 and 22 of the adjacent decoding layer, which are represented by nodes 11, 12 and 13, respectively (nodes 11, 12 and 21 are fusion nodes). Each high-level feature is transmitted to the adjacent lower level, which enhances the information fusion between the semantics of all levels. Second, dense connection is adopted within the rows of nodes 00, 10 and 20; that is, each node receives the preceding same-row features as input, splices these features and the higher-level features along the channels, and fuses the different features with a convolutional network. The dense connection design enhances the fusion between features of the same hierarchy level and better extracts and utilizes the features of each level provided by the coding network. Nodes 22, 13 and 04 in the figure are output nodes; after a convolutional layer and an activation layer, they output depth estimation values between 0 and 1, i.e. depth maps at different scales (Disp1, Disp2 and Disp3 in fig. 4).
In an optional embodiment, the method further comprises performing an upsampling operation, a convolution operation and an activation operation on the feature information output by the output node of the first decoding layer to obtain the depth map of the target image. Continuing with fig. 4 as an example, the depth map output by the output node numbered 04 is first upsampled and then convolved and activated; the output is a depth map whose resolution is consistent with that of the original target image (Disp0 in fig. 4).
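A minimal sketch of one fusion node and one output node of such a decoding network is given below. The module names, the choice of ELU as the activation and nearest-neighbour upsampling are assumptions; the patent only specifies convolution, activation and a (0, 1) depth output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionNode(nn.Module):
    """Upsample the higher-level feature, concatenate it with all same-row predecessors
    (dense connection within the decoding layer), and fuse with a convolution."""
    def __init__(self, same_level_ch, higher_level_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(same_level_ch + higher_level_ch, out_ch, kernel_size=3, padding=1),
            nn.ELU(inplace=True),
        )

    def forward(self, same_level_feats, higher_level_feat):
        up = F.interpolate(higher_level_feat, scale_factor=2, mode="nearest")
        return self.fuse(torch.cat(same_level_feats + [up], dim=1))

class OutputNode(nn.Module):
    """Map fused features to a depth estimation value between 0 and 1 (e.g. Disp1/Disp2/Disp3)."""
    def __init__(self, in_ch):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(in_ch, 1, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.head(x)
```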
The input of the pose transformation model is two adjacent images, and the model acquires the pose transformation relation between them, i.e. the 6-degree-of-freedom perspective transformation information (rotation angles about 3 axes and translation amounts along 3 axes) from one image to the other. The structure of the pose transformation model is divided as a whole into an encoding part and a decoding part, i.e. the pose transformation model comprises a pose transformation coding network and a pose transformation decoding network. The structure of the pose transformation coding network can be basically the same as ResNet18, but the number of input channels of the first convolutional layer is set to 6, because two color images are input and each has 3 channels; the output is the high-level features extracted by the model. The pose transformation decoding network decodes the features obtained by the pose transformation coding network: by applying convolution and activation operations to the coding network's features several times, it obtains the 6-degree-of-freedom perspective transformation information (rotation angles about 3 axes and translation amounts along 3 axes) from the target image to the adjacent frame image.
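A sketch of such a pose model is shown below, using torchvision's ResNet18 as the encoder with its first convolution widened to 6 channels. The decoder layout and channel sizes are assumptions; the patent only states that convolutions and activations regress the 6 degrees of freedom.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class PoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        encoder = resnet18(weights=None)
        # two stacked RGB frames -> 6 input channels
        encoder.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.encoder = nn.Sequential(*list(encoder.children())[:-2])  # drop avgpool and fc
        self.decoder = nn.Sequential(
            nn.Conv2d(512, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 6, 1),                       # 3 rotation angles + 3 translations
        )

    def forward(self, img_t, img_neighbor):
        feat = self.encoder(torch.cat([img_t, img_neighbor], dim=1))
        pose = self.decoder(feat).mean(dim=[2, 3])      # average over spatial positions
        return pose                                      # (B, 6) rotation and translation parameters
```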
Step S103: and generating a reconstructed image based on the depth estimation information and the pose transformation relation. The reconstructed image is a reconstructed target image.
Specifically, based on the camera imaging principle, the process of reconstructing the target image as shown in fig. 5 may include:
step S501: and determining a projection transformation matrix based on the pose transformation relation. And generating a projection transformation rotation matrix according to the rotation angles of the corresponding three axes from the target image to the adjacent frame image output by the pose transformation model, and generating a projection transformation translation matrix along the translation amounts of the three axes.
Step S502: determining a conversion relation between a pixel coordinate system of the target image and a camera coordinate system based on the internal reference matrix of the monocular camera and the depth estimation information, and determining a first coordinate of a pixel point of the target image under the camera coordinate system based on the conversion relation;
step S503: converting the first coordinate into a second coordinate under a pixel coordinate system of the adjacent frame image based on the internal reference matrix and the projective transformation matrix;
step S504: and sampling the adjacent frame images based on the second coordinates to generate a reconstructed image.
Further, in the embodiment of the present invention, the adjacent frame images of the target image include a preceding adjacent frame image and a following adjacent frame image, and a reconstructed image can be generated from each of them. Therefore, acquiring the pose transformation relationship between the target image and the adjacent frame images may include: acquiring a first pose transformation relation between the target image and the preceding adjacent frame image; and acquiring a second pose transformation relation between the target image and the following adjacent frame image. Then, a first reconstructed image is generated based on the first pose transformation relation, and a second reconstructed image is generated based on the second pose transformation relation.
With the depth estimation model described in fig. 2 and fig. 4, the depth estimation model outputs 4 depth maps at different scales, and the adjacent frame images of the target image include the preceding adjacent frame image and the following adjacent frame image; therefore, one reconstructed image is generated for each pair of depth map and adjacent frame image, and eight reconstructed images are obtained in total.
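A minimal warping sketch for steps S501 to S504 is given below. It assumes the rotation output by the pose model has already been converted into a rotation matrix R, and the helper name and numerical constants are illustrative rather than taken from the patent.

```python
import torch
import torch.nn.functional as F

def reconstruct_target(adjacent_img, depth, K, R, t):
    """adjacent_img: (B,3,H,W); depth: (B,1,H,W); K: (B,3,3); R: (B,3,3); t: (B,3,1)."""
    B, _, H, W = adjacent_img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1).expand(B, -1, -1)

    cam_points = torch.inverse(K) @ pix * depth.view(B, 1, -1)   # first coordinates (camera frame)
    cam_points_adj = R @ cam_points + t                          # apply the pose transformation
    proj = K @ cam_points_adj                                    # project into the adjacent frame
    uv = proj[:, :2] / (proj[:, 2:3] + 1e-7)                     # second coordinates (pixels)

    # normalise pixel coordinates to [-1, 1] for grid_sample
    u = uv[:, 0].view(B, H, W) / (W - 1) * 2 - 1
    v = uv[:, 1].view(B, H, W) / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1)
    return F.grid_sample(adjacent_img, grid, padding_mode="border", align_corners=True)
```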
Step S104: and constructing a loss function based on the reconstructed image and the target image, and completing the training of the depth estimation model by using the loss function.
The design of a loss function has great influence on the performance of the self-supervision deep learning algorithm, and the embodiment of the invention utilizes adjacent frame images to reconstruct a target image based on a depth estimation model and a pose transformation model, and compares the reconstructed image with an original target image to construct loss so as to train the two models.
The depth estimation model training method of the embodiment of the invention obtains depth estimation information of a target image through a depth estimation model, obtains a pose transformation relation between the target image and an adjacent frame image through a pose transformation model, generates a reconstructed image based on the depth estimation information and the pose transformation relation, constructs a loss function based on the reconstructed image and the target image, and optimizes the parameters of the depth estimation model with the loss function to complete its training. The depth estimation model comprises a depth estimation coding network and a depth estimation decoding network. The depth estimation coding network adds a dense connection mechanism and a channel attention mechanism into its residual structure so as to fuse the shallow features and deep features of the target image and obtain feature maps of the target image at multiple scales; the extracted features are therefore more effective, comprehensive and accurate, and accuracy and inference speed are improved with only a small increase in the number of model layers. The depth estimation decoding network fuses the feature maps of the target image at multiple scales to obtain the depth estimation information of the target image, so that the shallow features and deep features of the target image are fully fused, that is, the positioning information represented by the shallow features and the semantic information represented by the deep features are fully fused. The obtained depth estimation result therefore has high accuracy, and the accuracy, inference speed and robustness of the depth estimation model are improved, so that it can adapt to most driving scenes and provide accurate depth information for an automatic driving system.
In an alternative embodiment, the process of constructing a loss function based on the reconstructed image and the target image, and completing the training of the depth estimation model by using the loss function may include:
determining a loss function between the reconstructed image and the target image based on the structural similarity and the Manhattan distance;
calculating a target loss of each pixel point between the reconstructed image and the target image based on the loss function;
completing training of the depth estimation model based on the target loss.
The embodiment of the invention calculates the accuracy of reconstruction by using the Structural Similarity (SSIM) and the Manhattan distance between the reconstructed image and the original target image. The SSIM measures the similarity degree from three aspects of brightness, contrast and structure of an image, and the Manhattan distance directly calculates the difference value of image pixel values.
All previous schemes perform calculations on the RGB color space of the image. If the exposure intensity of the reconstructed image is inconsistent with that of the target image, even if the depth model and the pose transformation model give accurate estimation, a certain reconstruction error also exists between the reconstructed image obtained through calculation and the original image. Similarly, when an object in an image is subjected to different shadows formed by other shielding objects, the brightness of corresponding pixels is different due to the difference of the shadows, and finally, the reconstruction error cannot correctly reflect the accuracy of the designed model. The present invention uses the LAB color space instead of the RGB model. The LAB color space still has 3 channels, but unlike RGB, the LAB color space decouples the picture brightness and color, channel L is only responsible for brightness information, and channels a and B gather color information.
Before calculating the SSIM and the Manhattan distance, the embodiment of the invention also comprises the steps of converting the reconstructed image and the target image into an LAB mode, respectively carrying out normalization processing on the values of the reconstructed image and the target image in an A channel and a B channel, and scaling the value of an L channel to be within a preset range. For example, the reconstructed image and the target image are converted from an RGB mode to an LAB mode, and the values of the AB channel are normalized to [ -1,1], while the range of the channel L is scaled to [ - α, α ]. Wherein α is a positive number less than 1. This is equivalent to reducing the effect of brightness on SSIM and manhattan distance, making the error to the target image insensitive to changes in brightness.
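The snippet below sketches this conversion, assuming kornia's rgb_to_lab (which returns L roughly in [0, 100] and A/B roughly in [-128, 127]); the normalization constants and the default alpha value are illustrative assumptions.

```python
import torch
from kornia.color import rgb_to_lab

def to_weighted_lab(img_rgb, alpha=0.2):
    """img_rgb: (B,3,H,W) RGB in [0,1]; alpha is a positive number smaller than 1."""
    lab = rgb_to_lab(img_rgb)
    L = (lab[:, 0:1] / 50.0 - 1.0) * alpha     # [0,100] -> [-1,1] -> [-alpha, alpha]
    A = lab[:, 1:2] / 128.0                    # approximately [-1, 1]
    B = lab[:, 2:3] / 128.0
    return torch.cat([L, A, B], dim=1)         # brightness now contributes less to the loss
```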
In an alternative embodiment, the process of determining a loss function between the reconstructed image in the LAB mode and the target image based on the structural similarity and the manhattan distance includes:
determining a reconstruction loss and a translation loss between the reconstructed image in the LAB mode and the target image based on the structural similarity and the manhattan distance;
determining a loss function between the reconstructed image in the LAB mode and the target image based on the reconstruction loss and the translation loss.
Wherein a reconstruction loss between the reconstructed image in the LAB mode and the target image is determined according to:
$$L^{recon}_{k,m}[i,j]=\alpha\,L^{SSIM}_{k,m}[i,j]+\beta\,L^{dis}_{k,m}[i,j]$$
$$L^{dis}_{k,m}[i,j]=L^{dis,L}_{k,m}[i,j]+L^{dis,A}_{k,m}[i,j]+L^{dis,B}_{k,m}[i,j]$$
where $\alpha$ and $\beta$ represent weights and are positive numbers between 0 and 1, $[i,j]$ denotes the pixel in the ith row and jth column of the target image, $L^{recon}_{k,m}$ denotes the reconstruction loss between the target image and the reconstructed image generated from the kth depth map and the mth adjacent frame image, $L^{SSIM}_{k,m}$ denotes the corresponding structural similarity loss, $L^{dis}_{k,m}$ denotes the corresponding Manhattan distance loss, and $L^{dis,L}_{k,m}$, $L^{dis,A}_{k,m}$ and $L^{dis,B}_{k,m}$ denote the Manhattan distance losses of that reconstructed image with respect to the target image in the L, A and B channels, respectively. m is 1 or 2. In connection with the examples shown in fig. 2 and 4, k has a value of 1, 2, 3 or 4.
Wherein, the calculation formula of SSIM is as follows:
$$\mathrm{SSIM}=\frac{(2\mu_{src}\mu_{dst}+C_1)(2\sigma_{src,dst}+C_2)}{(\mu_{src}^2+\mu_{dst}^2+C_1)(\sigma_{src}^2+\sigma_{dst}^2+C_2)}$$
where $\mu_{src}$ and $\mu_{dst}$ are the means of the pixel values of the target image and of the reconstructed image, $\sigma_{src}^2$ and $\sigma_{dst}^2$ are the corresponding variances, $\sigma_{src,dst}$ is the covariance of the pixel values of the target image and the reconstructed image, and $C_1$ and $C_2$ are very small constants. The SSIM loss is then:
$$L_{SSIM}=(1-\mathrm{SSIM})/2$$
The value range of the SSIM loss is [0, 1].
The loss corresponding to the Manhattan distance is defined as:
$$L_{dis}=\lVert I_{src}-I_{dst}\rVert_1$$
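The sketch below shows one way to compute these per-pixel terms; the 3x3 averaging window, the constants C1 and C2, and the default weights are typical choices assumed for illustration, not values given in the patent.

```python
import torch
import torch.nn.functional as F

def ssim_loss(src, dst, C1=0.01 ** 2, C2=0.03 ** 2):
    mu_s = F.avg_pool2d(src, 3, 1, padding=1)
    mu_d = F.avg_pool2d(dst, 3, 1, padding=1)
    sigma_s = F.avg_pool2d(src * src, 3, 1, padding=1) - mu_s ** 2
    sigma_d = F.avg_pool2d(dst * dst, 3, 1, padding=1) - mu_d ** 2
    sigma_sd = F.avg_pool2d(src * dst, 3, 1, padding=1) - mu_s * mu_d
    ssim = ((2 * mu_s * mu_d + C1) * (2 * sigma_sd + C2)) / \
           ((mu_s ** 2 + mu_d ** 2 + C1) * (sigma_s + sigma_d + C2))
    return torch.clamp((1 - ssim) / 2, 0, 1)            # L_SSIM in [0, 1]

def reconstruction_loss(recon_lab, target_lab, a=0.85, b=0.15):
    l_ssim = ssim_loss(recon_lab, target_lab).mean(1, keepdim=True)
    l_dis = (recon_lab - target_lab).abs().mean(1, keepdim=True)   # Manhattan distance per pixel
    return a * l_ssim + b * l_dis                                  # per-pixel map, shape (B,1,H,W)
```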
determining a translation loss between the reconstructed image in the LAB mode and the target image according to:
$$L^{neib,m}[i,j]=\gamma\,L_{SSIM}(I_{neib,m},I_{dst})[i,j]+(1-\gamma)\,L_{dis}(I_{neib,m},I_{dst})[i,j]$$
where $\gamma$ represents a weight and is a positive number between 0 and 1, $L^{neib,m}$ denotes the translation loss between the reconstructed image generated using the mth adjacent frame image and the target image, $L_{SSIM}(I_{neib,m},I_{dst})$ denotes the structural similarity loss between the mth adjacent frame image and the target image, and $L_{dis}(I_{neib,m},I_{dst})$ denotes the Manhattan distance loss between the mth adjacent frame image and the target image.
In the embodiment of the present invention, the above formulas calculate the reconstruction loss and the translation loss (also referred to as the reconstruction error and the translation error) at the pixel level. The reconstruction error measures the similarity between the reconstructed picture and the target picture. The translation error measures the similarity between the adjacent frame picture and the target picture, and can also be regarded as the similarity between the target picture and a reconstructed image generated from the adjacent frame picture.
The translation loss needs to be calculated in the embodiment of the present invention because the unsupervised algorithm used in the present invention treats the depth of an object that remains stationary across different frame images as infinity (only when the depth of the object is infinity can it remain stationary in a reconstructed image obtained through the pose transformation), while in reality objects that remain relatively stationary with respect to the camera do appear. A similar misjudgment may occur in low-texture regions of the target image. The invention exploits the fact that the translation loss is small in these situations (for the same position of a stationary object in the adjacent frames and the target image, the translation loss at that position is small), and compares the reconstruction loss at each pixel with the translation loss; if the translation loss is smaller than the reconstruction loss, the position is considered to correspond to a stationary object or a low-texture region, and the loss generated at that position is not counted in the loss function. Therefore, the trained model is not interfered with by stationary objects and provides more accurate depth estimation.
Because of occlusion between the preceding and following adjacent images, the embodiment of the invention selects, at the pixel level, the smaller value of the reconstruction error and the smaller value of the translation error as the reconstruction error and the translation error under a given depth map. That is, as shown in the following equations, for the two reconstructed images generated from the same depth map and the two different adjacent frame images, the reconstruction loss and the translation loss of each pixel are compared, and the smaller reconstruction loss and the smaller translation loss are respectively taken as the reconstruction loss and the translation loss between the reconstructed image generated from that depth map and the target image. With the depth estimation model described in fig. 2 and fig. 4, the model outputs 4 depth maps at different scales, and the adjacent frame images of the target image include the preceding and following adjacent frame images; therefore, for each depth map, a reconstructed image can be generated from the preceding adjacent frame image and from the following adjacent frame image, so that each depth map corresponds to two reconstructed images, and the smaller reconstruction error and the smaller translation error are selected from the two, pixel by pixel, as the reconstruction error and translation error for that depth map.
$$L^{recon}_{k}=\min\left(L^{recon}_{k,1},\,L^{recon}_{k,2}\right)$$
$$L^{neib}=\min\left(L^{neib,1},\,L^{neib,2}\right)$$
After the smaller reconstruction loss and the smaller translation loss under each depth map are determined, the smaller reconstruction loss is taken as the target reconstruction loss and the smaller translation loss as the target translation loss; pixels whose target reconstruction loss is greater than or equal to the target translation loss are ignored, the target reconstruction loss of the pixels whose target reconstruction loss is smaller than the target translation loss is taken as the target loss, and the loss function between the reconstructed image in the LAB mode and the target image is determined based on the target loss. That is, as shown in the following formulas, the final error of the embodiment of the present invention consists only of the reconstruction errors of those pixels whose reconstruction error is smaller than the translation error.
$$M_k[i,j]=\begin{cases}1, & L^{recon}_{k}[i,j]<L^{neib}[i,j]\\ 0, & \text{otherwise}\end{cases}$$
$$Loss_{k}=\operatorname{mean}\left(M_k\odot L^{recon}_{k}\right)$$
where $Loss_{k}$ denotes the target loss between the target image and the reconstructed image generated using the kth depth map, $L^{recon}_{k}$ denotes the target reconstruction loss, $L^{neib}$ denotes the target translation loss, and $M_k$ denotes a mask matrix: if the reconstruction error of a pixel is smaller than the translation error, the value at the corresponding position in $M_k$ is 1, and otherwise it is 0.
In digital image processing, the mask is a two-dimensional matrix. In digital image processing, image masks are mainly used for: (1) And extracting a Region Of Interest (ROI), multiplying the image to be processed by a pre-made ROI mask to obtain an ROI image, wherein the image value in the ROI is kept unchanged, and the image value outside the ROI is 0. (2) Masking, where a mask is used to mask certain areas of the image from processing or from processing parameter calculations, or to process or count only the masked areas. (3) And (4) extracting structural features, namely detecting and extracting the structural features similar to the mask in the image by using a similarity variable or an image matching method.
And finally, the depth map under each scale in the depth estimation model participates in the calculation of the training error, so that different layers in the depth estimation decoding model learn the mapping relation with the depth map through the back propagation of the error as much as possible. The training error of the model is taken as the average value of the training errors corresponding to the multiple depth maps, that is:
$$L=\frac{1}{n}\sum_{k=1}^{n} Loss_{k}$$
where $L$ represents the average loss and $n$ represents the number of depth maps.
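The sketch below combines the per-pixel losses into a training loss following the formulas above; taking the masked mean as sum divided by the number of kept pixels is one plausible reading, and the tensor layout is assumed for illustration.

```python
import torch

def total_loss(recon_losses, neib_losses):
    """
    recon_losses: list over the n depth maps; each entry is a list of two per-pixel
                  reconstruction-loss maps (previous / next adjacent frame), shape (B,1,H,W).
    neib_losses:  list of two per-pixel translation-loss maps, shape (B,1,H,W).
    """
    neib = torch.minimum(neib_losses[0], neib_losses[1])             # L_neib
    losses = []
    for per_frame in recon_losses:
        recon_k = torch.minimum(per_frame[0], per_frame[1])          # L_recon_k
        mask = (recon_k < neib).float()                              # M_k
        losses.append((mask * recon_k).sum() / mask.sum().clamp(min=1))
    return torch.stack(losses).mean()                                # average over the depth maps
```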
For the situation in which the exposure intensities of adjacent video frames are inconsistent, the embodiment of the invention calculates the loss function in the LAB color space, with different weights designed for the three channels, instead of the RGB color space, and uses this new calculation mode when computing the L1 loss between images. That is, the embodiment of the invention decouples brightness and color in the LAB color space and matches them with different weight coefficients, so as to reduce the influence on depth estimation of inconsistent light intensity caused by sunlight occlusion, camera hardware and the like in real scenes. According to the projection principle, an object that is stationary with respect to the monocular camera can be regarded as having infinite depth, so its depth estimate would be erroneous in that case; low-texture regions lead to a similar situation. Therefore, the embodiment of the present invention designs a new mask $M_k[i,j]$ that identifies the parts that are stationary with respect to the camera and the low-texture regions, and excludes the loss generated by these parts from the training of the model, thereby avoiding erroneous depth estimation results.
According to the depth estimation model training method of the embodiment of the invention, depth information is estimated with a self-supervised deep learning method; the requirements on the supporting embedded system are not high, the time of multiple cameras does not need to be synchronized, and data samples do not need to be labeled, which reduces the difficulty of model training, reduces the steps of information extraction and improves the efficiency of algorithm deployment. The coding network and decoding network of the depth estimation model of the embodiment of the invention adopt residual connections, dense connections and a channel attention structure, fully mine and utilize the feature information extracted from images, achieve good depth estimation accuracy with a low parameter count, and strike a good balance between model complexity and model accuracy. The loss function calculation method of the embodiment of the invention takes into account the influence on the training error of inconsistent exposure of the same object and of inconsistent shadows formed by the light source between different frames, uses the mask matrix to shield the training error generated by objects that are stationary relative to the camera and by low-texture regions, and enhances the ability to judge the depth information of the corresponding regions.
Fig. 6 shows a flow chart of a depth estimation method according to an embodiment of the present invention, and as shown in fig. 6, the method includes:
step S601: acquiring an image to be estimated;
step S602: processing the image to be estimated by using a pre-trained depth estimation model to obtain a depth map of the image to be estimated, and taking the depth map as a depth estimation result of the image to be estimated, wherein the depth estimation model is obtained by training according to the depth estimation model training method of any embodiment.
According to the depth estimation method of the embodiment of the invention, the image to be estimated is analyzed by the depth estimation model, so that after the front-view monocular camera obtains a video frame, the depth information corresponding to the video frame can be output quickly and accurately. This realizes detection and estimation of the distance between the vehicle and objects in front, such as vehicles, people and buildings, under different weather, road conditions and other environments, and forms a depth estimation method with high accuracy, strong robustness and real-time detection.
In an optional embodiment, in combination with the depth estimation model shown in fig. 2 to 4, when the depth map of the image to be estimated is obtained, a depth map output by a first decoding layer of the depth estimation model is obtained, and the depth map output by the first decoding layer is used as a depth estimation result of the image to be estimated. In other alternative embodiments, a depth map output by other decoding layers of the depth estimation model may also be obtained as a depth estimation result of the image to be estimated.
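A minimal inference sketch is shown below; the checkpoint path, the input size and the assumption that the model returns its depth maps as a list with the full-resolution Disp0 first are all hypothetical.

```python
import torch

model = torch.load("depth_model.pt", map_location="cpu")   # assumed to contain the whole trained model
model.eval()

image = torch.rand(1, 3, 192, 640)                         # placeholder for a preprocessed camera frame
with torch.no_grad():
    disparities = model(image)                              # assumed: [Disp0, Disp1, Disp2, Disp3]
depth_map = disparities[0]                                   # full-resolution depth estimation result
```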
Fig. 7 shows a flow chart of a depth estimation method according to another embodiment of the present invention, as shown in fig. 7, the method includes:
step S701: acquiring an image to be estimated;
step S702: mirror-flipping the image to be estimated to obtain a mirror image corresponding to the image to be estimated;
step S703: processing the image to be estimated by using a pre-trained depth estimation model to obtain a first depth map of the image to be estimated;
step S704: processing the mirror image by using the depth estimation model to obtain a second depth map of the mirror image;
step S705: determining a depth estimation result of the image to be estimated based on the first depth map and the second depth map.
In this case, there are effectively two depth estimation maps of the image to be estimated. For the automatic driving scene, for the safety of vehicle driving, the shorter-distance estimate of the two depth estimation maps is taken pixel by pixel as the final depth estimation result of the image to be estimated. Thus, step S705 may include:
mirror-flipping the second depth map to obtain a third depth map;
and for each pixel point on the image to be estimated, determining a smaller depth in the first depth map and the third depth map, and taking the smaller depth as a depth estimation result of the pixel point.
The depth estimation method provided by the embodiment of the invention mirror-flips the image to be estimated to obtain a mirror image, and uses the depth estimation model to obtain the depth map of the image to be estimated and the depth map of the mirror image, which is equivalent to having two depth estimation maps of the image to be estimated; the shorter-distance estimate of the two depth maps is then taken pixel by pixel as the final depth estimation result of the image to be estimated, which can improve the driving safety of the vehicle.
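A sketch of this mirror-flip estimation (steps S702 to S705) is given below; the function name is illustrative and the model is assumed to return a single depth map of shape (B, 1, H, W).

```python
import torch

def estimate_with_mirror(model, image):
    """image: (B,3,H,W); returns the pixel-wise smaller (closer) depth of the two estimates."""
    with torch.no_grad():
        d1 = model(image)                                   # first depth map
        d2 = model(torch.flip(image, dims=[3]))             # second depth map, from the mirrored image
    d3 = torch.flip(d2, dims=[3])                           # third depth map: flip back to original orientation
    return torch.minimum(d1, d3)                            # keep the smaller depth per pixel for safety
```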
Fig. 8 is a schematic structural diagram of a depth estimation model training apparatus 800 according to an embodiment of the present invention, and as shown in fig. 8, the apparatus 800 includes:
a first obtaining module 801, configured to obtain a target image and an adjacent frame image adjacent to the target image, where the target image and the adjacent frame image are images captured by a vehicle-mounted front-view monocular camera respectively;
a building module 802, configured to build a depth estimation model and a pose transformation model; the depth estimation model comprises a depth estimation coding network and a depth estimation decoding network, wherein the depth estimation coding network adds a dense connection mechanism and a channel attention mechanism into a residual error structure of the depth estimation coding network so as to fuse shallow features and deep features of the target image and obtain a feature map of the target image in multiple scales; the depth estimation decoding network is used for fusing feature maps of multiple scales of the target image to obtain depth estimation information of the target image; the pose transformation model is used for acquiring a pose transformation relation between the target image and the adjacent frame image;
a reconstruction module 803, configured to generate a reconstructed image based on the depth estimation information and the pose transformation relationship;
a parameter training module 804, configured to construct a loss function based on the reconstructed image and the target image, and complete training of the depth estimation model by using the loss function.
Optionally, the depth estimation coding network part comprises a first coding layer and a plurality of second coding layers; the first coding layer comprises a convolution layer, a batch normalization layer and an activation layer; the second coding layer is of a residual error structure and comprises a plurality of third coding layers and a channel attention layer, and the third coding layers are connected in a dense connection mode.
Optionally, the third encoding layer comprises a convolutional layer, a bulk normalization layer, and an active layer.
Optionally, the method further comprises: for the second coding layer, under the condition that the number of output channels of the second coding layer is inconsistent with the number of input channels of the second coding layer, performing convolution operation and normalization operation on the input information of the second coding layer, so that the number of input channels of the second coding layer is consistent with the number of output channels of the second coding layer.
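For illustration, the following PyTorch sketch shows one possible form of such a second coding layer: densely connected conv-BN-activation sub-layers (the third coding layers), a squeeze-and-excitation style channel attention layer, and a 1x1 convolution with normalization on the shortcut when the channel counts differ. The layer widths, growth rate and reduction ratio are assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        weights = self.fc(x.mean(dim=(2, 3)))       # global average pool -> per-channel weights
        return x * weights[:, :, None, None]

class SecondCodingLayer(nn.Module):
    """Residual block whose inner conv-BN-ReLU sub-layers are densely connected."""
    def __init__(self, in_ch, growth=32, num_sublayers=3):
        super().__init__()
        self.sublayers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_sublayers):              # each "third coding layer": conv + BN + activation
            self.sublayers.append(nn.Sequential(
                nn.Conv2d(ch, growth, 3, padding=1, bias=False),
                nn.BatchNorm2d(growth), nn.ReLU(inplace=True)))
            ch += growth                            # dense connection: all previous outputs are concatenated
        self.attention = ChannelAttention(ch)
        # 1x1 convolution + normalization on the shortcut when input/output channel counts differ
        self.shortcut = (nn.Identity() if ch == in_ch else
                         nn.Sequential(nn.Conv2d(in_ch, ch, 1, bias=False), nn.BatchNorm2d(ch)))

    def forward(self, x):
        features = [x]
        for layer in self.sublayers:
            features.append(layer(torch.cat(features, dim=1)))
        out = self.attention(torch.cat(features, dim=1))
        return out + self.shortcut(x)               # residual connection
```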
Optionally, the depth estimation decoding network comprises a first decoding layer and a plurality of second decoding layers; the input of the first decoding layer is a feature graph output by the first coding layer and feature information output by an adjacent second decoding layer; the input of the second decoding layer is the feature graph output by the corresponding second coding layer and the feature information output by the second decoding layer of the previous level.
Optionally, the first decoding layer and the second decoding layer include an extraction node, a fusion node, and an output node; the extraction node is used for acquiring a feature map output by the first coding layer or the second coding layer; the fusion node is used for fusing the characteristic information output by the extraction node at the same layer and the characteristic information output by the extraction node and the fusion node at the previous level; the output node is used for obtaining the depth maps of the target image in different scales based on the feature information output by the fusion node.
Optionally, the building module is further configured to: and performing up-sampling operation, convolution operation and activation operation on the feature information output by the output node of the first decoding layer to obtain a depth map of the target image.
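A minimal sketch of such an output head is shown below, assuming a sigmoid activation and nearest-neighbor upsampling; the patent does not fix these choices, so they are illustrative only.

```python
import torch.nn as nn

class DepthOutputHead(nn.Module):
    """Illustrative head for the first decoding layer: upsample, convolve, activate.

    The sigmoid output is a normalized depth map; the exact activation and
    scaling are assumptions of this sketch.
    """
    def __init__(self, in_ch):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)
        self.act = nn.Sigmoid()

    def forward(self, fused_features):
        return self.act(self.conv(self.upsample(fused_features)))
```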
Optionally, the pose transformation model comprises a pose transformation encoding network and a pose transformation decoding network; the pose transformation coding network is used for acquiring pose transformation characteristics between the target image and the adjacent frame image; the pose transformation decoding network is used for acquiring a rotation angle and a translation amount between the target image and the adjacent frame image based on the pose transformation characteristics, and taking the rotation angle and the translation amount as a pose transformation relation between the target image and the adjacent frame image.
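The sketch below illustrates one plausible pose transformation decoding head that regresses three rotation angles (axis-angle) and three translation components from the pose transformation features; the channel widths and the small output scaling are assumptions commonly used in self-supervised depth pipelines, not details specified by the patent.

```python
import torch.nn as nn

class PoseDecoder(nn.Module):
    """Illustrative pose decoding head: maps pose transformation features to a
    rotation angle (axis-angle, 3 values) and a translation (3 values)."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 6, 1))

    def forward(self, pose_features):
        out = self.net(pose_features).mean(dim=(2, 3))   # global average over spatial dims
        rotation, translation = out[:, :3], out[:, 3:]
        return 0.01 * rotation, 0.01 * translation       # small initial scale aids training stability
```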
Optionally, the reconstruction module is further configured to: determining a projection transformation matrix based on the pose transformation relation; determining a conversion relation between a pixel coordinate system of the target image and a camera coordinate system based on the internal reference matrix of the monocular camera and the depth estimation information, and determining a first coordinate of a pixel point of the target image under the camera coordinate system based on the conversion relation; converting the first coordinate into a second coordinate under a pixel coordinate system of the adjacent frame image based on the internal reference matrix and the projective transformation matrix; and sampling the adjacent frame images based on the second coordinates to generate a reconstructed image.
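By way of illustration, the following sketch performs this reconstruction with PyTorch: pixels of the target image are back-projected into the camera coordinate system using the intrinsic matrix and the predicted depth (the first coordinates), transformed by the projection transformation matrix, re-projected into the adjacent frame's pixel coordinate system (the second coordinates), and the adjacent frame is then sampled at those coordinates. The tensor shapes and the `grid_sample` call are assumptions of this example, not the patent's exact formulation.

```python
import torch
import torch.nn.functional as F

def reconstruct_from_neighbor(depth, K, K_inv, T, neighbor):
    """Illustrative inverse-warp: rebuild the target view from an adjacent frame.

    depth: (1, 1, H, W) predicted depth of the target image
    K, K_inv: (3, 3) camera intrinsic matrix and its inverse (assumed known)
    T: (4, 4) projection transformation matrix from target to neighbor
    neighbor: (1, 3, H, W) adjacent frame image
    """
    _, _, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1).float()

    cam_points = K_inv @ pix * depth.reshape(1, -1)          # first coordinates (camera frame)
    cam_points = torch.cat([cam_points, torch.ones(1, H * W)], dim=0)
    proj = K @ (T @ cam_points)[:3]                          # project into the neighbor's camera
    uv = proj[:2] / (proj[2:3] + 1e-7)                       # second coordinates (neighbor pixels)

    # Normalize to [-1, 1] and sample the adjacent frame to generate the reconstructed image.
    grid = torch.stack([uv[0] / (W - 1), uv[1] / (H - 1)], dim=-1).reshape(1, H, W, 2) * 2 - 1
    return F.grid_sample(neighbor, grid, padding_mode="border", align_corners=True)
```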
Optionally, the adjacent frame images include a preceding adjacent frame image and a following adjacent frame image;
the building module is further configured to: acquiring a first pose transformation relation between the target image and the preceding adjacent frame image; and acquiring a second pose transformation relation between the target image and the following adjacent frame image.
Optionally, the parameter training module is further configured to: determining a loss function between the reconstructed image and the target image based on the structural similarity and the Manhattan distance; calculating a target loss of each pixel point between the reconstructed image and the target image based on the loss function; completing training of the depth estimation model based on the target loss.
Optionally, the parameter training module is further configured to: converting the reconstructed image and the target image into an LAB mode, respectively carrying out normalization processing on values of the reconstructed image and the target image in an A channel and a B channel, and scaling a value of an L channel to a preset range; determining a loss function between the reconstructed image and the target image in the LAB mode based on the structural similarity and the Manhattan distance.
Optionally, the parameter training module is further configured to: determining a reconstruction loss and a translation loss between the reconstructed image in the LAB mode and the target image based on the structural similarity and the manhattan distance; determining a loss function between the reconstructed image in the LAB mode and the target image based on the reconstruction loss and the translation loss.
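A sketch of such a loss in LAB mode follows; it relies on the third-party kornia library for the RGB-to-LAB conversion and the per-pixel SSIM map, and the weights and channel scaling used here are illustrative assumptions rather than values from the patent.

```python
import torch
import kornia  # third-party library, used here only for illustration

def photometric_loss_lab(reconstructed, target, alpha=0.85, beta=0.15):
    """Per-pixel loss combining structural similarity and Manhattan (L1) distance in LAB space."""
    def to_scaled_lab(img_rgb):
        lab = kornia.color.rgb_to_lab(img_rgb)          # L in [0, 100], A/B roughly [-128, 127]
        l = lab[:, :1] / 100.0                          # scale L channel to a preset range
        ab = (lab[:, 1:] + 128.0) / 255.0               # normalize A and B channels
        return torch.cat([l, ab], dim=1)

    rec, dst = to_scaled_lab(reconstructed), to_scaled_lab(target)
    ssim_map = kornia.metrics.ssim(rec, dst, window_size=3)     # per-pixel SSIM
    ssim_loss = (1.0 - ssim_map).mean(dim=1, keepdim=True) / 2.0
    manhattan = (rec - dst).abs().sum(dim=1, keepdim=True)      # sum of L, A, B channel distances
    return alpha * ssim_loss + beta * manhattan
```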
Optionally, the parameter training module is further configured to determine a reconstruction loss between the reconstructed image in the LAB mode and the target image according to:
$$L_{rec}^{k,m}[i,j] = \alpha \cdot L_{SSIM}^{k,m}[i,j] + \beta \cdot L_{Man}^{k,m}[i,j]$$

$$L_{Man}^{k,m}[i,j] = L_{Man,L}^{k,m}[i,j] + L_{Man,A}^{k,m}[i,j] + L_{Man,B}^{k,m}[i,j]$$

wherein α and β respectively represent weights, [i,j] represents the pixel point of the i-th row and the j-th column on the target image, $L_{rec}^{k,m}$ represents the reconstruction loss between the target image and the reconstructed image generated by using the k-th depth map and the m-th adjacent frame image, $L_{SSIM}^{k,m}$ represents the structural similarity loss between that reconstructed image and the target image, $L_{Man}^{k,m}$ represents the Manhattan distance loss between that reconstructed image and the target image, and $L_{Man,L}^{k,m}$, $L_{Man,A}^{k,m}$ and $L_{Man,B}^{k,m}$ respectively represent the Manhattan distance losses of that reconstructed image and the target image in the L channel, the A channel and the B channel.
Optionally, the parameter training module is further configured to determine a translation loss between the reconstructed image in the LAB mode and the target image according to:
$$L_{neib,m}[i,j] = \gamma \cdot L_{SSIM}(I_{neib,m}, I_{dst})[i,j] + L_{Man}(I_{neib,m}, I_{dst})[i,j]$$

wherein γ represents a weight, $L_{neib,m}$ represents the translation loss between the target image and the reconstructed image generated by using the m-th adjacent frame image, $L_{SSIM}(I_{neib,m}, I_{dst})$ represents the structural similarity loss between the m-th adjacent frame image $I_{neib,m}$ and the target image $I_{dst}$, and $L_{Man}(I_{neib,m}, I_{dst})$ represents the Manhattan distance loss between the m-th adjacent frame image and the target image.
Optionally, the parameter training module is further configured to: for two reconstructed images generated by using the same depth map and different adjacent frame images, comparing the reconstruction loss and the translation loss of each pixel point in the two reconstructed images, respectively using the smaller reconstruction loss and the smaller translation loss as the reconstruction loss and the translation loss between the reconstructed image generated by using the depth map and a target image, and determining a loss function between the reconstructed image and the target image in the LAB mode based on the smaller reconstruction loss and the smaller translation loss.
Optionally, the parameter training module is further configured to: taking the smaller reconstruction loss as a target reconstruction loss, taking the smaller translation loss as a target translation loss, ignoring pixel points of which the target reconstruction loss is greater than or equal to the target translation loss, taking the target reconstruction loss of the pixel points of which the target reconstruction loss is less than the target translation loss as the target loss, and determining a loss function between the reconstructed image in the LAB mode and the target image based on the target loss.
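A compact sketch of this pixel-wise selection and masking step is given below; the way the masked losses are finally averaged is an assumption of the example.

```python
import torch

def masked_target_loss(rec_losses, neib_losses):
    """Illustrative per-depth-map target loss.

    rec_losses / neib_losses: lists of per-pixel loss maps (one per adjacent frame)
    computed for reconstructions generated from the same depth map.
    """
    # Pixel-wise minimum over the two adjacent frames.
    target_rec = torch.minimum(rec_losses[0], rec_losses[1])
    target_neib = torch.minimum(neib_losses[0], neib_losses[1])
    # Mask out pixels whose reconstruction loss is not smaller than the translation
    # loss (e.g. regions static relative to the camera or with low texture).
    mask = (target_rec < target_neib).float()
    return (mask * target_rec).sum() / mask.sum().clamp(min=1.0)
```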
Optionally, the parameter training module is further configured to determine the target loss according to the following formula:
$$Loss_{k}[i,j] = M_{k}[i,j] \cdot L_{rec}^{k}[i,j]$$

$$M_{k}[i,j] = \begin{cases} 1, & L_{rec}^{k}[i,j] < L_{neib}[i,j] \\ 0, & L_{rec}^{k}[i,j] \geq L_{neib}[i,j] \end{cases}$$

wherein $Loss_{k}$ represents the target loss between the target image and the reconstructed image generated by using the k-th depth map, $L_{rec}^{k}$ represents the target reconstruction loss, $L_{neib}$ represents the target translation loss, and $M_{k}$ represents the mask matrix.
Optionally, the loss function is determined according to:
$$L = \frac{1}{n} \sum_{k=1}^{n} Loss_{k}$$
where L represents the average loss and n represents the number of depth maps.
The depth estimation model training device of the embodiment of the invention estimates depth information by using a self-supervised deep learning method, which places low requirements on the supporting embedded system, does not require time synchronization of multiple cameras, and requires no annotation of data samples, thereby reducing the difficulty of model training, reducing the steps of information extraction and improving the efficiency of algorithm deployment. The coding network and the decoding network of the depth estimation model of the embodiment of the invention adopt residual connection, dense connection and channel attention structures, fully mine and utilize the feature information extracted from the image, achieve good depth estimation accuracy with a low parameter count, and obtain a good balance between model complexity and model precision. The method for calculating the loss function of the embodiment of the invention takes into account the influence, on the training error, of inconsistent exposure of the same object and inconsistent shadows formed by the light source between different frames, uses the mask matrix to shield the training error generated by objects that are static relative to the camera and by low-texture areas, and enhances the ability to judge the depth information of the corresponding areas.
Fig. 9 shows a schematic structural diagram of a depth estimation device 900 according to an embodiment of the present invention, and as shown in fig. 9, the depth estimation device 900 includes:
a second obtaining module 901, configured to obtain an image to be estimated;
a depth determining module 902, configured to process the image to be estimated by using a pre-trained depth estimation model, to obtain a depth map of the image to be estimated, and use the depth map as a depth estimation result of the image to be estimated.
Optionally, the depth determination module is further configured to: processing the image to be estimated by using a pre-trained depth estimation model to obtain a depth map output by a first decoding layer of the depth estimation model, and taking the depth map output by the first decoding layer as a depth estimation result of the image to be estimated.
Optionally, the depth determination module is further configured to: carrying out mirror image overturning on the image to be estimated to obtain a mirror image corresponding to the image to be estimated; processing the image to be estimated by using a pre-trained depth estimation model to obtain a first depth map of the image to be estimated; processing the mirror image by using the depth estimation model to obtain a second depth map of the mirror image; determining a depth estimation result of the image to be estimated based on the first depth map and the second depth map.
Optionally, the depth determination module is further configured to: carrying out mirror image overturning on the second depth map to obtain a third depth map; and for each pixel point on the image to be estimated, determining a smaller depth in the first depth map and the third depth map, and taking the smaller depth as a depth estimation result of the pixel point.
The depth estimation device provided by the embodiment of the invention analyzes the image to be estimated through the depth estimation model, so that after the front-view monocular camera captures a video frame, the depth information corresponding to the video frame can be obtained quickly and accurately, realizing detection and estimation of the distance between the vehicle and objects ahead such as vehicles, pedestrians and buildings under different weather, road conditions and other environments, and forming a depth estimation method with high accuracy, strong robustness and real-time detection capability.
The device can execute the method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
The embodiment of the present invention further provides an electronic device, as shown in fig. 10, which includes a processor 1001, a communication interface 1002, a memory 1003 and a communication bus 1004, wherein the processor 1001, the communication interface 1002 and the memory 1003 complete mutual communication through the communication bus 1004,
a memory 1003 for storing a computer program;
the processor 1001 is configured to implement the following steps when executing the program stored in the memory 1003: acquiring a target image and an adjacent frame image adjacent to the target image, wherein the target image and the adjacent frame image are respectively images shot by a vehicle-mounted front-view monocular camera; constructing a depth estimation model and a pose transformation model; the depth estimation model comprises a depth estimation coding network and a depth estimation decoding network, wherein the depth estimation coding network adds a dense connection mechanism and a channel attention mechanism into a residual error structure of the depth estimation coding network so as to fuse shallow features and deep features of the target image and obtain a feature map of the target image in multiple scales; the depth estimation decoding network is used for fusing feature maps of multiple scales of the target image to obtain depth estimation information of the target image; the pose transformation model is used for acquiring a pose transformation relation between the target image and the adjacent frame image; generating a reconstructed image based on the depth estimation information and the pose transformation relation; constructing a loss function based on the reconstructed image and the target image, and completing the training of the depth estimation model by using the loss function; or, the following steps are implemented: acquiring an image to be estimated; and processing the image to be estimated by utilizing a pre-trained depth estimation model to obtain a depth map of the image to be estimated, and taking the depth map as a depth estimation result of the image to be estimated.
The communication bus 1004 mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 1004 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this is not intended to represent only one bus or type of bus.
The communication interface 1002 is used for communication between the above-described terminal and other devices.
The Memory 1003 may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Alternatively, the memory may be at least one memory device located remotely from the processor 1001.
The Processor 1001 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment provided by the present invention, a computer-readable medium is further provided, in which instructions are stored, and when the instructions are executed on a computer, the instructions cause the computer to execute the depth estimation model training method or the depth estimation method described in any one of the above embodiments.
In yet another embodiment, a computer program product containing instructions is provided, which when run on a computer, causes the computer to perform the depth estimation model training method or the depth estimation method described in any of the above embodiments.
In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (27)

1. A method for training a depth estimation model, comprising:
acquiring a target image and an adjacent frame image adjacent to the target image, wherein the target image and the adjacent frame image are respectively images shot by a vehicle-mounted front-view monocular camera;
constructing a depth estimation model and a pose transformation model; the depth estimation model comprises a depth estimation coding network and a depth estimation decoding network, wherein the depth estimation coding network adds a dense connection mechanism and a channel attention mechanism into a residual error structure of the depth estimation coding network so as to fuse shallow features and deep features of the target image and obtain a feature map of the target image in multiple scales; the depth estimation decoding network is used for fusing feature maps of the target image in multiple scales to obtain depth estimation information of the target image; the pose transformation model is used for acquiring a pose transformation relation between the target image and the adjacent frame image;
generating a reconstructed image based on the depth estimation information and the pose transformation relation;
and constructing a loss function based on the reconstructed image and the target image, and completing the training of the depth estimation model by using the loss function.
2. The method of claim 1, wherein the depth estimation coding network portion comprises a first coding layer and a plurality of second coding layers;
the first coding layer comprises a convolution layer, a batch normalization layer and an activation layer;
the second coding layer is of a residual error structure and comprises a plurality of third coding layers and a channel attention layer, and the third coding layers are connected in a dense connection mode.
3. The method of claim 2, wherein the third coding layer comprises a convolutional layer, a batch normalization layer, and an active layer.
4. The method of claim 2, further comprising:
for the second coding layer, under the condition that the number of output channels of the second coding layer is inconsistent with the number of input channels of the second coding layer, performing convolution operation and normalization operation on the input information of the second coding layer, so that the number of input channels of the second coding layer is consistent with the number of output channels of the second coding layer.
5. The method of claim 2, wherein the depth estimation decoding network comprises a first decoding layer and a plurality of second decoding layers;
the input of the first decoding layer is a feature graph output by the first coding layer and feature information output by an adjacent second decoding layer;
the input of the second decoding layer is the feature graph output by the corresponding second encoding layer and the feature information output by the second decoding layer of the previous level.
6. The method of claim 5, wherein the first decoding layer and the second decoding layer comprise extraction nodes, fusion nodes and output nodes, and wherein the fusion nodes are connected with each other and the fusion nodes and the output nodes are connected with each other in a dense connection manner;
the extraction node is used for acquiring a feature map output by the first coding layer or the second coding layer;
the fusion node is used for fusing the feature information output by the extraction node at the same layer and the feature information output by the extraction node and the fusion node at the previous level;
the output node is used for obtaining the depth maps of the target image in different scales based on the feature information output by the fusion node.
7. The method of claim 6, further comprising:
and performing up-sampling operation, convolution operation and activation operation on the feature information output by the output node of the first decoding layer to obtain a depth map of the target image.
8. The method of claim 1, wherein the pose transformation model comprises a pose transformation encoding network and a pose transformation decoding network;
the pose transformation coding network is used for acquiring pose transformation characteristics between the target image and the adjacent frame image;
the pose transformation decoding network is used for acquiring a rotation angle and a translation amount between the target image and the adjacent frame image based on the pose transformation characteristics, and taking the rotation angle and the translation amount as a pose transformation relation between the target image and the adjacent frame image.
9. The method according to any one of claims 1-8, wherein generating a reconstructed image based on the depth estimation information and the pose transformation relationship comprises:
determining a projection transformation matrix based on the pose transformation relation;
determining a conversion relation between a pixel coordinate system of the target image and a camera coordinate system based on the internal reference matrix of the monocular camera and the depth estimation information, and determining a first coordinate of a pixel point of the target image under the camera coordinate system based on the conversion relation;
converting the first coordinate into a second coordinate under a pixel coordinate system of the adjacent frame image based on the internal reference matrix and the projective transformation matrix;
and sampling the adjacent frame images based on the second coordinates to generate a reconstructed image.
10. The method of claim 9, wherein the adjacent frame images comprise a preceding adjacent frame image and a following adjacent frame image;
acquiring a pose transformation relation between the target image and the adjacent frame image, wherein the pose transformation relation comprises the following steps:
acquiring a first pose transformation relation between the target image and the preceding adjacent frame image; and acquiring a second pose transformation relation between the target image and the following adjacent frame image.
11. The method of claim 9, wherein constructing a loss function based on the reconstructed image and the target image and using the loss function to complete the training of the depth estimation model comprises:
determining a loss function between the reconstructed image and the target image based on the structural similarity and the Manhattan distance;
calculating a target loss of each pixel point between the reconstructed image and the target image based on the loss function;
completing training of the depth estimation model based on the target loss.
12. The method of claim 11, wherein determining a loss function between the reconstructed image and the target image based on structural similarity and manhattan distance comprises:
converting the reconstructed image and the target image into an LAB mode, respectively carrying out normalization processing on values of the reconstructed image and the target image in an A channel and a B channel, and scaling a value of an L channel to a preset range;
determining a loss function between the reconstructed image and the target image in the LAB mode based on the structural similarity and the Manhattan distance.
13. The method of claim 12, wherein determining a loss function between the reconstructed image and the target image in the LAB mode based on the structural similarity and the manhattan distance comprises:
determining a reconstruction loss and a translation loss between the reconstructed image in the LAB mode and the target image based on the structural similarity and the manhattan distance;
determining a loss function between the reconstructed image in the LAB mode and the target image based on the reconstruction loss and the translation loss.
14. The method of claim 13, wherein a reconstruction loss between the reconstructed image in the LAB mode and the target image is determined according to:
$$L_{rec}^{k,m}[i,j] = \alpha \cdot L_{SSIM}^{k,m}[i,j] + \beta \cdot L_{Man}^{k,m}[i,j]$$

$$L_{Man}^{k,m}[i,j] = L_{Man,L}^{k,m}[i,j] + L_{Man,A}^{k,m}[i,j] + L_{Man,B}^{k,m}[i,j]$$

wherein α and β respectively represent weights, [i,j] represents the pixel point of the i-th row and the j-th column on the target image, $L_{rec}^{k,m}$ represents the reconstruction loss between the target image and the reconstructed image generated by using the k-th depth map and the m-th adjacent frame image, $L_{SSIM}^{k,m}$ represents the structural similarity loss between that reconstructed image and the target image, $L_{Man}^{k,m}$ represents the Manhattan distance loss between that reconstructed image and the target image, and $L_{Man,L}^{k,m}$, $L_{Man,A}^{k,m}$ and $L_{Man,B}^{k,m}$ respectively represent the Manhattan distance losses of that reconstructed image and the target image in the L channel, the A channel and the B channel.
15. The method of claim 14, wherein a translation loss between the reconstructed image in the LAB mode and the target image is determined according to:
$$L_{neib,m}[i,j] = \gamma \cdot L_{SSIM}(I_{neib,m}, I_{dst})[i,j] + L_{Man}(I_{neib,m}, I_{dst})[i,j]$$

wherein γ represents a weight, $L_{neib,m}$ represents the translation loss between the target image and the reconstructed image generated by using the m-th adjacent frame image, $L_{SSIM}(I_{neib,m}, I_{dst})$ represents the structural similarity loss between the m-th adjacent frame image $I_{neib,m}$ and the target image $I_{dst}$, and $L_{Man}(I_{neib,m}, I_{dst})$ represents the Manhattan distance loss between the m-th adjacent frame image and the target image.
16. The method of claim 15, wherein determining a loss function between the reconstructed image in the LAB mode and the target image based on the reconstruction loss and the translation loss comprises:
for two reconstructed images generated by using the same depth map and different adjacent frame images, comparing the reconstruction loss and the translation loss of each pixel point in the two reconstructed images, and respectively taking the smaller reconstruction loss and the smaller translation loss as the reconstruction loss and the translation loss between the reconstructed image generated by using the depth map and a target image;
determining a loss function between the reconstructed image in the LAB mode and the target image based on the smaller reconstruction loss and the smaller translation loss.
17. The method of claim 16, wherein determining a loss function between the reconstructed image in the LAB mode and the target image based on the smaller reconstruction loss and smaller translation loss comprises:
and taking the smaller reconstruction loss as a target reconstruction loss, taking the smaller translation loss as a target translation loss, ignoring pixel points of which the target reconstruction loss is greater than or equal to the target translation loss, taking the target reconstruction loss of the pixel points of which the target reconstruction loss is less than the target translation loss as a target loss, and determining a loss function between the reconstructed image in the LAB mode and the target image based on the target loss.
18. The method of claim 17, wherein the target loss is determined according to the following equation:
$$Loss_{k}[i,j] = M_{k}[i,j] \cdot L_{rec}^{k}[i,j]$$

$$M_{k}[i,j] = \begin{cases} 1, & L_{rec}^{k}[i,j] < L_{neib}[i,j] \\ 0, & L_{rec}^{k}[i,j] \geq L_{neib}[i,j] \end{cases}$$

wherein $Loss_{k}$ represents the target loss between the target image and the reconstructed image generated by using the k-th depth map, $L_{rec}^{k}$ represents the target reconstruction loss, $L_{neib}$ represents the target translation loss, and $M_{k}$ represents the mask matrix.
19. The method of claim 17, wherein the loss function is determined according to the following equation:
$$L = \frac{1}{n} \sum_{k=1}^{n} Loss_{k}$$
where L represents the average loss and n represents the number of depth maps.
20. A method of depth estimation, comprising:
acquiring an image to be estimated;
processing the image to be estimated by using a pre-trained depth estimation model to obtain a depth map of the image to be estimated, and using the depth map as a depth estimation result of the image to be estimated, wherein the depth estimation model is trained according to the method of any one of claims 1-19.
21. The method of claim 20, wherein processing the image to be estimated by using a pre-trained depth estimation model to obtain a depth map of the image to be estimated, and taking the depth map as a depth estimation result of the image to be estimated comprises:
processing the image to be estimated by using a pre-trained depth estimation model to obtain a depth map output by a first decoding layer of the depth estimation model, and taking the depth map output by the first decoding layer as a depth estimation result of the image to be estimated.
22. The method according to claim 20 or 21, wherein the processing the image to be estimated by using a pre-trained depth estimation model to obtain a depth map of the image to be estimated, and using the depth map as a depth estimation result of the image to be estimated comprises:
carrying out mirror image overturning on the image to be estimated to obtain a mirror image corresponding to the image to be estimated;
processing the image to be estimated by using a pre-trained depth estimation model to obtain a first depth map of the image to be estimated;
processing the mirror image by using the depth estimation model to obtain a second depth map of the mirror image;
determining a depth estimation result of the image to be estimated based on the first depth map and the second depth map.
23. The method of claim 22, wherein determining a depth estimation result for the image to be estimated based on the first depth map and the second depth map comprises:
carrying out mirror image turning on the second depth map to obtain a third depth map;
and for each pixel point on the image to be estimated, determining a smaller depth in the first depth map and the third depth map, and taking the smaller depth as a depth estimation result of the pixel point.
24. A depth estimation model training apparatus, comprising:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a target image and an adjacent frame image adjacent to the target image, and the target image and the adjacent frame image are respectively images shot by a vehicle-mounted front-view monocular camera;
the construction module is used for constructing a depth estimation model and a pose transformation model; the depth estimation model comprises a depth estimation coding network and a depth estimation decoding network, wherein the depth estimation coding network adds a dense connection mechanism and a channel attention mechanism into a residual error structure of the depth estimation coding network so as to fuse shallow features and deep features of the target image and obtain a feature map of the target image in multiple scales; the depth estimation decoding network is used for fusing feature maps of multiple scales of the target image to obtain depth estimation information of the target image; the pose transformation model is used for acquiring a pose transformation relation between the target image and the adjacent frame image;
a reconstruction module for generating a reconstructed image based on the depth estimation information and the pose transformation relationship;
and the parameter training module is used for constructing a loss function based on the reconstructed image and the target image and completing the training of the depth estimation model by using the loss function.
25. A depth estimation device, comprising:
the second acquisition module is used for acquiring an image to be estimated;
and the depth determining module is used for processing the image to be estimated by utilizing a pre-trained depth estimation model to obtain a depth map of the image to be estimated, and taking the depth map as a depth estimation result of the image to be estimated.
26. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-23.
27. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-23.
CN202211228909.3A 2022-10-08 2022-10-08 Depth estimation model training method, depth estimation device, depth estimation equipment and medium Pending CN115578704A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211228909.3A CN115578704A (en) 2022-10-08 2022-10-08 Depth estimation model training method, depth estimation device, depth estimation equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211228909.3A CN115578704A (en) 2022-10-08 2022-10-08 Depth estimation model training method, depth estimation device, depth estimation equipment and medium

Publications (1)

Publication Number Publication Date
CN115578704A true CN115578704A (en) 2023-01-06

Family

ID=84585880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211228909.3A Pending CN115578704A (en) 2022-10-08 2022-10-08 Depth estimation model training method, depth estimation device, depth estimation equipment and medium

Country Status (1)

Country Link
CN (1) CN115578704A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115841151A (en) * 2023-02-22 2023-03-24 禾多科技(北京)有限公司 Model training method and device, electronic equipment and computer readable medium
CN117115786A (en) * 2023-10-23 2023-11-24 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method
CN117115786B (en) * 2023-10-23 2024-01-26 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method

Similar Documents

Publication Publication Date Title
Kuznietsov et al. Semi-supervised deep learning for monocular depth map prediction
US11100401B2 (en) Predicting depth from image data using a statistical model
CN110675418B (en) Target track optimization method based on DS evidence theory
CN115578704A (en) Depth estimation model training method, depth estimation device, depth estimation equipment and medium
US11651581B2 (en) System and method for correspondence map determination
CN112668573B (en) Target detection position reliability determination method and device, electronic equipment and storage medium
CN113643366B (en) Multi-view three-dimensional object attitude estimation method and device
CN109934183B (en) Image processing method and device, detection equipment and storage medium
CN111354030B (en) Method for generating unsupervised monocular image depth map embedded into SENet unit
CN112634369A (en) Space and or graph model generation method and device, electronic equipment and storage medium
KR20190124113A (en) Deep Learning-based road area estimation apparatus and method using self-supervised learning
CN112465704B (en) Global-local self-adaptive optimized panoramic light field splicing method
CN113711276A (en) Scale-aware monocular positioning and mapping
CN116486288A (en) Aerial target counting and detecting method based on lightweight density estimation network
CN114372523A (en) Binocular matching uncertainty estimation method based on evidence deep learning
CN116309781A (en) Cross-modal fusion-based underwater visual target ranging method and device
CN111179365A (en) Mobile radioactive source radiation image self-adaptive superposition optimization method based on recurrent neural network
Koo et al. A bayesian based deep unrolling algorithm for single-photon lidar systems
CN117542122B (en) Human body pose estimation and three-dimensional reconstruction method, network training method and device
CN112288813B (en) Pose estimation method based on multi-view vision measurement and laser point cloud map matching
CN111696147B (en) Depth estimation method based on improved YOLOv3 model
CN116630216A (en) Target fusion method, device, equipment and storage medium based on radar and image
CN116630528A (en) Static scene reconstruction method based on neural network
Feng et al. Improved deep fully convolutional network with superpixel-based conditional random fields for building extraction
Brockers et al. Stereo vision using cost-relaxation with 3D support regions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination