CN113313732A - Forward-looking scene depth estimation method based on self-supervision learning - Google Patents

Forward-looking scene depth estimation method based on self-supervision learning

Info

Publication number
CN113313732A
CN113313732A (application CN202110708650.1A / CN202110708650A)
Authority
CN
China
Prior art keywords
depth
image
camera
training
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110708650.1A
Other languages
Chinese (zh)
Inventor
丁萌
尹利董
徐一鸣
李旭
宫淑丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202110708650.1A
Publication of CN113313732A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/207Analysis of motion for motion estimation over a hierarchy of resolutions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a forward-looking scene depth estimation method based on self-supervised learning, which comprises the following steps: calculating a self-supervised learning reprojection formula; constructing a depth estimation and pose estimation combined training network, designing a loss function, and pre-training on KITTI visible light data to obtain a visible light pre-training model; and migrating the visible light pre-training model to FLIR infrared data for training, thereby achieving dense depth estimation of infrared images. The method solves the problem that existing vision-based forward-looking scene three-dimensional depth estimation methods are only applicable under visible light conditions and cannot be used at night or in low visibility. It achieves three-dimensional depth estimation from infrared monocular images at night or in low visibility without supervision from real depth data, thereby remedying the shortcoming of vision-aided driving systems in night-time infrared image depth estimation.

Description

Forward-looking scene depth estimation method based on self-supervision learning
Technical Field
The invention relates to the technical field of image processing, in particular to a forward-looking scene depth estimation method based on self-supervision learning.
Background
In the field of automatic driving, vision-aided driving systems are receiving more and more attention. With the continuous growth in the computing power of hardware devices, the ability of a computer to acquire scene information from a single image keeps improving. The core of a vision-aided driving system is to acquire the depth information of the scene ahead of the vehicle so as to support downstream tasks such as obstacle avoidance and distance measurement. At present, however, vision-based estimation of the three-dimensional depth of a forward-looking scene can only be carried out under visible light conditions and cannot be carried out at night or in low visibility. For example, in visual depth estimation methods based on binocular or multi-view cameras, the depth estimation range is limited by the installation baseline of the cameras, so these methods are not suitable for observing long distances. The most representative method among geometric-vision depth estimation methods is structure from motion, but it is an offline algorithm and cannot meet the real-time requirements of the automatic driving field. Supervised deep learning methods require a large amount of data with real depth labels to be calibrated in advance, so acquiring training data consumes a large amount of manpower and material resources; moreover, during training only the difference between the machine's fitted depth estimate and the label value is considered, while the geometric constraints of vision are not taken into account.
Disclosure of Invention
The invention discloses a forward-looking scene depth estimation method based on self-supervised learning. It solves the problem that existing vision-based forward-looking scene three-dimensional depth estimation methods are only applicable under visible light conditions and cannot be used at night or in low visibility; it can realize three-dimensional depth estimation of infrared monocular images at night or in low visibility without supervision from real depth data, thereby remedying the shortcoming of vision-aided driving systems in night-time infrared image depth estimation.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the invention discloses a forward-looking scene depth estimation method based on self-supervision learning, which comprises the following steps of:
calculating a self-supervision learning reprojection formula;
constructing a depth estimation and pose estimation combined training network, designing a loss function, and pre-training on KITTI visible light data to obtain a visible light pre-training model;
and transferring the visible light pre-training model to an FLIR infrared data set for training, and realizing the dense depth estimation of the infrared image.
Further, the specific steps of calculating the self-supervised learning reprojection formula include:
calculating an internal parameter matrix k of the camera according to the equipment parameters;
$$k=\begin{bmatrix} f/d_x & 0 & u_0 \\ 0 & f/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}=\begin{bmatrix} \frac{H}{2\tan(fov_h/2)} & 0 & \frac{H}{2} \\ 0 & \frac{W}{2\tan(fov_w/2)} & \frac{W}{2} \\ 0 & 0 & 1 \end{bmatrix} \quad (1)$$
wherein f is the focal length of the camera, d_x and d_y are the pixel sizes of the camera imaging sensor, u_0 and v_0 are the coordinates of the image center point, H is the horizontal resolution of the image, W is the vertical resolution of the image, fov_h is the horizontal field of view of the camera, and fov_w is the vertical field of view of the camera;
projecting the three-dimensional point to a two-dimensional plane, and calculating the coordinate transformation of a camera coordinate system and a world coordinate system;
$$D\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}=k\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \quad (2)$$
$$\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}=\begin{bmatrix} R & t \\ \mathbf{0} & 1 \end{bmatrix}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}=T\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \quad (3)$$
wherein D is the horizontal depth of a point in three-dimensional space from the camera, u and v are the coordinates of the point in the camera imaging plane, k is the camera intrinsic matrix, x_w, y_w and z_w are the coordinates of the point in the world coordinate system, x_c, y_c and z_c are its coordinates in the camera coordinate system, t is the displacement vector between the camera coordinate system and the world coordinate system, R is the rotation matrix between the camera coordinate system and the world coordinate system, and T is the pose transformation matrix;
obtaining a self-supervision learning core formula;
$$p_{t-1}\sim k\,T_{t\to t-1}\,D_t(p_t)\,k^{-1}\,p_t \quad (4)$$
wherein p_t is the coordinate of a pixel at time t, k is the camera intrinsic matrix, D is the pixel depth, and T_{t→t-1} is the pose transformation matrix of the camera between time t and time t-1.
Further, the specific steps of constructing a depth estimation and pose estimation joint training network, designing a loss function, and pre-training on KITTI visible light data to obtain a visible light pre-training model comprise:
two networks requiring joint training are constructed according to the learning task: a ResNet-18 structure is adopted as the encoder of the depth estimation network, and the encoded features are restored to a depth map through up-sampling; a ResNet-18 structure is also adopted as the encoder of the pose estimation network, with small convolution kernels used for dimensionality reduction, to estimate the six-degree-of-freedom motion of the camera between two frames;
in the depth estimation network, the encoder successively extracts features of different dimensions at each layer, reduces the scale of the high-dimensional features through max pooling, and performs the next stage of feature extraction, yielding feature maps smaller than the original input image; in the pose estimation network, two stacked images are input to the same encoder, and the features are extracted to 2014 dimensions;
the depth maps of different scales extracted from the depth network are uniformly up-sampled: the maps obtained by the decoder that are smaller than the original input image are up-sampled back to the original input size; in the pose estimation network, dimensionality reduction is performed with a combination of convolutions to obtain the six-degree-of-freedom pose transformation;
designing a loss function;
reprojection loss:
$$L_{rc}=\frac{1}{N}\sum_{p}\left[\frac{\alpha}{2}\bigl(1-\mathrm{SSIM}(I_n,\hat{I}_n)\bigr)+(1-\alpha)\,\lvert p_n-\hat{p}_n\rvert\right],\quad n\in\{-1,+1\} \quad (5)$$
loss of edge smoothing:
$$L_{s}=\frac{1}{N}\sum_{p}\Bigl(\lvert\partial_x d_p\rvert\,e^{-\lvert\partial_x I_p\rvert}+\lvert\partial_y d_p\rvert\,e^{-\lvert\partial_y I_p\rvert}\Bigr) \quad (6)$$
total loss function:
$$L=\mu L_{rc}+\lambda L_{s} \quad (7)$$
wherein I_n represents the original image and \hat{I}_n the reprojected image; \hat{p}_{-1} and \hat{p}_{+1} are the reconstructed pixel values of the previous and subsequent frames, and p_{-1} and p_{+1} are the actual pixel values of the previous and subsequent frames; N is the total number of image pixels; SSIM is the structural similarity evaluation between the original image and the reprojected image; d_p denotes the estimated depth and I_p the image value at pixel p; the depth smoothing term |∂d| is used to suppress local singular noise in the depth map; the edge perception term e^{-|∂I|} is used to encourage the model to learn edge information where the depth gradient changes strongly; α is the weighting coefficient between the SSIM and L1-norm losses in the reprojection loss, and μ and λ are the weighting coefficients of the two losses in the total loss;
selecting the minimum of the losses of the previous and subsequent frames according to the minimum reprojection error, and pre-training on the KITTI visible light data set to obtain a pre-training model;
further, the specific steps for realizing the infrared image dense depth estimation comprise:
reading the pre-training model, and initializing the convolutional neurons of the infrared training network by transferring its weights;
and performing hyper-parameter adjustment on the pre-training model, and selecting the optimal model as the final result.
Further, the method of pre-training on the KITTI visible light data includes multi-scale up-sampling and the re-projection error.
The beneficial technical effects are as follows:
1. The invention discloses a forward-looking scene depth estimation method based on self-supervised learning, comprising the following steps: calculating a self-supervised learning reprojection formula; constructing a depth estimation and pose estimation combined training network, designing a loss function, and pre-training on KITTI visible light data to obtain a visible light pre-training model; and transferring the visible light pre-training model to the FLIR infrared data set for training to achieve dense depth estimation of infrared images. The method solves the problem that existing vision-based forward-looking scene three-dimensional depth estimation methods are only applicable under visible light conditions and cannot be used at night or in low visibility; it achieves three-dimensional depth estimation from infrared monocular images at night or in low visibility without supervision from real depth data, thereby remedying the shortcoming of vision-aided driving systems in night-time infrared image depth estimation;
2. The invention uses the geometric constraints between frames of a monocular video sequence and takes the difference between the reprojected image and the real image as the supervision signal, thereby realizing self-supervised learning and reducing the acquisition cost of training data;
3. The invention adopts the minimum reprojection error: a frame is reprojected from the two frames before and after it, and when computing the loss at a given point, the smaller of the two losses is taken as the loss value back-propagated through the neural network, which effectively reduces the influence of occasional noise in the infrared image;
4. The invention adopts a multi-scale up-sampling loss calculation method that up-samples the small-scale depth maps output by the depth estimation network decoder back to the resolution of the original image and computes the loss on that basis, thereby alleviating the hole phenomenon in the depth map caused by low-resolution regions;
5. Transfer learning is introduced, so that the neural network constructed by the method can fully learn infrared image characteristics by using prior knowledge of visible light road scenes.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings used in the description of the embodiments will be briefly described below.
FIG. 1 is a flow chart of a forward-looking scene depth estimation method based on self-supervised learning according to the present invention;
FIG. 2 is a technical route diagram of a forward-looking scene depth estimation method based on self-supervised learning according to the present invention;
FIG. 3 is a diagram of a depth estimation network in a forward-looking scene depth estimation method based on self-supervised learning according to the present invention;
FIG. 4 is a diagram of a pose estimation network in a forward-looking scene depth estimation method based on self-supervised learning according to the present invention;
fig. 5 is a multi-scale up-sampling process diagram in the forward-looking scene depth estimation method based on the self-supervised learning according to the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The invention discloses a forward-looking scene depth estimation method based on self-supervision learning, which trains and tests a neural network on a desktop workstation, and the realization of a software and hardware platform is shown in table 1.
Table 1: software and hardware platform configuration (presented as an image in the original publication).
the implementation of the forward-looking scene depth estimation method based on the self-supervised learning specifically comprises the following steps, as shown in fig. 1-2:
s1: calculating a self-supervision learning reprojection formula;
specifically, an internal parameter matrix k of the camera is calculated according to the equipment parameters;
$$k=\begin{bmatrix} f/d_x & 0 & u_0 \\ 0 & f/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}=\begin{bmatrix} \frac{H}{2\tan(fov_h/2)} & 0 & \frac{H}{2} \\ 0 & \frac{W}{2\tan(fov_w/2)} & \frac{W}{2} \\ 0 & 0 & 1 \end{bmatrix} \quad (1)$$
wherein f is the focal length of the camera, d_x and d_y are the pixel sizes of the camera imaging sensor, u_0 and v_0 are the coordinates of the image center point, H is the horizontal resolution of the image, W is the vertical resolution of the image, fov_h is the horizontal field of view of the camera, and fov_w is the vertical field of view of the camera;
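As a worked example of equation (1), the following sketch (an illustration, not taken from the patent; the function name and the example resolution and field-of-view values are assumptions) builds the intrinsic matrix from the image resolution and the fields of view, keeping the notation above in which H is the horizontal resolution and W the vertical resolution:

```python
import math
import numpy as np

def intrinsic_from_fov(H, W, fov_h_deg, fov_w_deg):
    """Pinhole intrinsic matrix k of Eq. (1), assuming the principal point lies
    at the image center; H = horizontal resolution, W = vertical resolution."""
    fx = H / (2.0 * math.tan(math.radians(fov_h_deg) / 2.0))  # f / d_x
    fy = W / (2.0 * math.tan(math.radians(fov_w_deg) / 2.0))  # f / d_y
    return np.array([[fx, 0.0, H / 2.0],
                     [0.0, fy, W / 2.0],
                     [0.0, 0.0, 1.0]])

# Example with placeholder values: a 640 x 512 camera with a 45 x 37 degree field of view.
K = intrinsic_from_fov(640, 512, 45.0, 37.0)
```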
projecting the three-dimensional point to a two-dimensional plane, and calculating the coordinate transformation of a camera coordinate system and a world coordinate system; according to the camera projection model, observing a certain point in the three-dimensional space from the camera angle:
$$D\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}=k\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \quad (2)$$
the coordinate transformation relationship between the camera coordinate system and the world coordinate system is described by the rotation matrix and the displacement vector as:
$$\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}=\begin{bmatrix} R & t \\ \mathbf{0} & 1 \end{bmatrix}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}=T\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \quad (3)$$
wherein D is the horizontal depth of the point in three-dimensional space from the camera, u and v are the coordinates of the point in the camera imaging plane, k is the camera intrinsic matrix, x_w, y_w and z_w are the coordinates of the point in the world coordinate system, t is the displacement vector between the camera coordinate system and the world coordinate system (composed of the three translational degrees of freedom Δx, Δy and Δz in a Cartesian coordinate system), R is the rotation matrix between the camera coordinate system and the world coordinate system, and T is the pose transformation matrix;
obtaining a self-supervision learning core formula;
$$p_{t-1}\sim k\,T_{t\to t-1}\,D_t(p_t)\,k^{-1}\,p_t \quad (4)$$
wherein p_t is the coordinate of a pixel at time t, k is the camera intrinsic matrix, D is the pixel depth, and T_{t→t-1} is the pose transformation matrix of the camera between time t and time t-1. It should be noted that the symbol "~" indicates that the depth at time t-1 has been omitted from the projection formula, which expresses the constraint between the two adjacent frames. Specifically, the original derivation is D_{t-1}·p_{t-1} = D_t k T_{t→t-1} k^{-1} p_t; omitting the depth D_{t-1} at time t-1 yields equation (4), so the "~" can also be read as "approximately equal", relating the two sides of the reprojection equation.
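Equation (4) can be sketched in PyTorch roughly as follows; this is a minimal illustration assuming batched depth, intrinsics and pose tensors, with illustrative function and variable names rather than the patent's reference implementation.

```python
import torch
import torch.nn.functional as F

def reproject(img_prev, depth_t, K, T_t_to_prev):
    """Warp frame t-1 into frame t using Eq. (4): p_{t-1} ~ k T_{t->t-1} D_t(p_t) k^{-1} p_t.

    img_prev    : (B, C, H, W) image at time t-1 (source being sampled)
    depth_t     : (B, 1, H, W) predicted depth for frame t
    K           : (B, 3, 3)    camera intrinsic matrix
    T_t_to_prev : (B, 4, 4)    pose transformation from frame t to frame t-1
    returns     : (B, C, H, W) reconstruction of frame t built from frame t-1
    """
    B, _, H, W = depth_t.shape
    device = depth_t.device

    # Homogeneous pixel grid p_t, shape (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(3, -1)
    pix = pix.unsqueeze(0).repeat(B, 1, 1)

    # Back-project: D_t(p_t) * k^{-1} p_t -> 3-D points in the frame-t camera system
    cam_points = depth_t.view(B, 1, -1) * torch.bmm(torch.inverse(K), pix)
    cam_points = torch.cat([cam_points, torch.ones(B, 1, H * W, device=device)], dim=1)

    # Rigid transform to the frame t-1 camera system, then project with k
    cam_prev = torch.bmm(T_t_to_prev, cam_points)[:, :3, :]
    proj = torch.bmm(K, cam_prev)
    uv = proj[:, :2, :] / (proj[:, 2:3, :] + 1e-7)

    # Normalise to [-1, 1] and bilinearly sample frame t-1 at the projected locations
    u = 2.0 * uv[:, 0, :] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1, :] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(img_prev, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

In training, the same warp would also be applied to the frame at time t+1, and the two reconstructions feed the minimum-reprojection loss described below.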
S2: constructing a depth estimation and pose estimation combined training network, designing a loss function, and pre-training on KITTI visible light data to obtain a visible light pre-training model;
specifically, two networks requiring joint training are constructed according to the learning task; the depth network is shown in fig. 3 and the pose estimation network in fig. 4. A ResNet-18 structure is adopted as the encoder of the depth estimation network, and the encoded features are restored to a depth map through up-sampling; a ResNet-18 structure is also adopted as the encoder of the pose estimation network, with small convolution kernels used for dimensionality reduction, to estimate the six-degree-of-freedom motion of the camera between two frames;
in the depth estimation network, the encoder successively extracts features of 64, 128, 256 and 512 dimensions; each layer applies a 3 x 3 convolution, batch normalization and a ReLU activation in turn and reduces the scale of the high-dimensional features through max pooling before the next stage of feature extraction, yielding feature maps at 1/2, 1/4 and 1/8 of the original input size; in the pose estimation network, two stacked images are input to the same encoder, and the features are extracted to 2014 dimensions;
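As a rough illustration of the two encoders just described, the following sketch uses the torchvision ResNet-18 backbone; the class names and the six-channel first convolution for the stacked frame pair are assumptions for illustration, not details taken from the patent.

```python
import torch.nn as nn
from torchvision.models import resnet18

class DepthEncoder(nn.Module):
    """ResNet-18 backbone used as the depth-estimation encoder; the feature maps
    of every stage (64, 128, 256, 512 channels) are kept so that the decoder can
    upsample them back to a depth map."""
    def __init__(self):
        super().__init__()
        net = resnet18(weights=None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu)  # conv + BN + ReLU stem
        self.pool = net.maxpool                                   # max pooling reduces scale
        self.stages = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        feats = [self.stem(x)]
        x = self.pool(feats[-1])
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # successively 64, 128, 256, 512 channels
        return feats

class PoseEncoder(nn.Module):
    """Same ResNet-18 structure, but the first convolution takes the two stacked
    input frames (6 channels) so inter-frame motion can be encoded."""
    def __init__(self):
        super().__init__()
        net = resnet18(weights=None)
        net.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.backbone = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                                      net.layer1, net.layer2, net.layer3, net.layer4)

    def forward(self, image_pair):  # image_pair: (B, 6, H, W)
        return self.backbone(image_pair)
```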
the depth maps of different scales extracted from the depth network are uniformly up-sampled; the 1/2, 1/4 and 1/8 size maps obtained by the decoder are processed as shown in fig. 5: after the decoder obtains the depth maps at four scales, all four are restored to the original resolution, and performing the reprojection on these four full-resolution depth maps alleviates the problem of holes appearing in low-texture regions of the infrared image; for the pose estimation network, a small convolutional decoder is designed, and dimensionality reduction is performed with a combination of 3 x 3 and 1 x 1 convolutions, finally yielding the six-degree-of-freedom pose transformation;
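The multi-scale up-sampling of fig. 5 and the small convolutional pose decoder might be sketched as follows; the channel widths, the averaging over spatial locations and the 0.01 output scaling are assumptions, not values from the patent.

```python
import torch.nn as nn
import torch.nn.functional as F

def upsample_to_input_size(depth_maps, input_hw):
    """Restore the depth maps predicted at 1/1, 1/2, 1/4 and 1/8 scale to the
    original input resolution before the reprojection loss is computed."""
    H, W = input_hw
    return [F.interpolate(d, size=(H, W), mode="bilinear", align_corners=False)
            for d in depth_maps]

class PoseDecoder(nn.Module):
    """Small decoder combining 3x3 and 1x1 convolutions to reduce the encoder
    features to a six-degree-of-freedom motion (3 rotation + 3 translation)."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, 256, kernel_size=1)    # 1x1 conv
        self.conv1 = nn.Conv2d(256, 256, kernel_size=3, padding=1)  # 3x3 conv
        self.conv2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)  # 3x3 conv
        self.pose = nn.Conv2d(256, 6, kernel_size=1)                # 1x1 conv -> 6 DoF
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feat):
        x = self.relu(self.reduce(feat))
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        pose = self.pose(x).mean(dim=[2, 3])  # global average over spatial dims
        return 0.01 * pose                    # small-motion scaling (assumed)
```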
designing a loss function;
reprojection loss:
$$L_{rc}=\frac{1}{N}\sum_{p}\left[\frac{\alpha}{2}\bigl(1-\mathrm{SSIM}(I_n,\hat{I}_n)\bigr)+(1-\alpha)\,\lvert p_n-\hat{p}_n\rvert\right],\quad n\in\{-1,+1\} \quad (5)$$
loss of edge smoothing:
$$L_{s}=\frac{1}{N}\sum_{p}\Bigl(\lvert\partial_x d_p\rvert\,e^{-\lvert\partial_x I_p\rvert}+\lvert\partial_y d_p\rvert\,e^{-\lvert\partial_y I_p\rvert}\Bigr) \quad (6)$$
total loss function:
$$L=\mu L_{rc}+\lambda L_{s} \quad (7)$$
wherein I_n represents the original image and \hat{I}_n the reprojected image; \hat{p}_{-1} and \hat{p}_{+1} are the reconstructed pixel values of the previous and subsequent frames, and p_{-1} and p_{+1} are the actual pixel values of the previous and subsequent frames; N is the total number of image pixels; SSIM is the structural similarity evaluation between the original image and the reprojected image; d_p denotes the estimated depth and I_p the image value at pixel p; the depth smoothing term |∂d| is used to suppress local singular noise in the depth map; the edge perception term e^{-|∂I|} is used to encourage the model to learn edge information where the depth gradient changes strongly, so that the contours of the depth map remain sharp; α is the weighting coefficient between the SSIM and L1-norm losses in the reprojection loss, and μ and λ are the weighting coefficients of the two losses in the total loss. It should be noted that the reprojection loss L_rc comprises two parts: an SSIM structural loss and an L1-norm loss describing the difference between the projection result and the true value. The L1-norm loss, also called least absolute deviation or least absolute error, here represents the absolute error between the pixel values of the predicted image and those of the real image.
The minimum of the losses computed from the previous and subsequent frames is selected according to the minimum reprojection error, and pre-training is carried out on the KITTI visible light data set so that the structural characteristics of ground road scenes are learned in advance.
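Under the definitions of equations (5)-(7), a minimal sketch of the loss computation, including the per-pixel minimum over the reconstructions from the previous and next frames, could look as follows; the 3 x 3 SSIM window and the default weights alpha, mu and lambda are assumptions rather than values taken from the patent.

```python
import torch
import torch.nn.functional as F

def ssim_cost(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Per-pixel (1 - SSIM) / 2 between two images, over 3x3 averaging windows."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def reprojection_loss(target, reconstructions, alpha=0.85):
    """Eq. (5) with the per-pixel minimum over the reconstructions from the
    previous and next frames (minimum reprojection error)."""
    per_frame = []
    for recon in reconstructions:  # [I_hat from t-1, I_hat from t+1]
        l1 = torch.abs(target - recon).mean(1, keepdim=True)
        structural = ssim_cost(target, recon).mean(1, keepdim=True)
        per_frame.append(alpha * structural + (1 - alpha) * l1)
    return torch.min(torch.cat(per_frame, dim=1), dim=1)[0].mean()

def edge_aware_smoothness(depth, img):
    """Eq. (6): depth-gradient term weighted by the edge-aware term e^{-|dI|}."""
    dx_d = torch.abs(depth[:, :, :, :-1] - depth[:, :, :, 1:])
    dy_d = torch.abs(depth[:, :, :-1, :] - depth[:, :, 1:, :])
    dx_i = torch.abs(img[:, :, :, :-1] - img[:, :, :, 1:]).mean(1, keepdim=True)
    dy_i = torch.abs(img[:, :, :-1, :] - img[:, :, 1:, :]).mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def total_loss(target, reconstructions, depth, mu=1.0, lam=1e-3):
    """Eq. (7): L = mu * L_rc + lambda * L_s (weights are illustrative)."""
    return mu * reprojection_loss(target, reconstructions) + \
        lam * edge_aware_smoothness(depth, target)
```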
S3: migrating the visible light pre-training model to an FLIR infrared data set for training to realize dense depth estimation of the infrared image;
specifically, the pre-training model is read, and the convolutional neurons of the infrared training network are initialized by transferring its weights;
hyper-parameters are then adjusted, including the Adam optimizer parameters, the learning rate and the number of epochs, and the optimal model is selected as the final result.
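A sketch of this transfer step under stated assumptions: the checkpoint layout, the hypothetical helper compute_self_supervised_loss (standing in for the loss of equations (5)-(7)), and the learning-rate, epoch and scheduler settings are placeholders, not details from the patent.

```python
import torch

def finetune_on_flir(depth_net, pose_net, flir_loader, pretrained_path,
                     lr=1e-4, epochs=20, device="cuda"):
    """Initialise both networks from the KITTI visible-light pre-training and
    fine-tune them on FLIR infrared sequences with the Adam optimizer."""
    ckpt = torch.load(pretrained_path, map_location=device)
    depth_net.load_state_dict(ckpt["depth"], strict=False)  # migrate convolutional weights
    pose_net.load_state_dict(ckpt["pose"], strict=False)
    depth_net.to(device).train()
    pose_net.to(device).train()

    params = list(depth_net.parameters()) + list(pose_net.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

    for _ in range(epochs):
        for batch in flir_loader:
            optimizer.zero_grad()
            # hypothetical helper combining Eqs. (5)-(7) on an infrared frame triplet
            loss = compute_self_supervised_loss(depth_net, pose_net, batch)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return depth_net, pose_net
```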
The forward-looking scene depth estimation method based on self-supervised learning disclosed by the invention is trained on infrared images captured at night or in low-visibility environments, remedying the shortcoming of vision-aided driving systems in estimating depth from infrared images at night.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above examples are only for describing the preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, and various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims (5)

1. A forward-looking scene depth estimation method based on self-supervision learning is characterized by comprising the following steps:
calculating a self-supervision learning reprojection formula;
constructing a depth estimation and pose estimation combined training network, designing a loss function, and pre-training KITTI visible light data to obtain a visible light pre-training model;
and transferring the visible light pre-training model to an FLIR infrared data set for training, and realizing the dense depth estimation of the infrared image.
2. The method for estimating the depth of the forward-looking scene based on the self-supervised learning as recited in claim 1, wherein the step of calculating the self-supervised learning reprojection formula comprises:
calculating an internal parameter matrix k of the camera according to the equipment parameters;
$$k=\begin{bmatrix} f/d_x & 0 & u_0 \\ 0 & f/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}=\begin{bmatrix} \frac{H}{2\tan(fov_h/2)} & 0 & \frac{H}{2} \\ 0 & \frac{W}{2\tan(fov_w/2)} & \frac{W}{2} \\ 0 & 0 & 1 \end{bmatrix} \quad (1)$$
wherein f is the focal length of the camera, d_x and d_y are the pixel sizes of the camera imaging sensor, u_0 and v_0 are the coordinates of the image center point, H is the horizontal resolution of the image, W is the vertical resolution of the image, fov_h is the horizontal field of view of the camera, and fov_w is the vertical field of view of the camera;
projecting the three-dimensional point to a two-dimensional plane, and calculating the coordinate transformation of a camera coordinate system and a world coordinate system;
$$D\,P_{uv}=D\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}=k\,P_c=k\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \quad (2)$$
$$\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}=\begin{bmatrix} R & t \\ \mathbf{0} & 1 \end{bmatrix}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}=T\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \quad (3)$$
wherein D is the horizontal depth, i.e. the pixel depth, of a point in three-dimensional space as viewed from the camera, u and v are the coordinates of the point in the camera imaging plane, P_uv and P_c respectively denote the two-dimensional pixel coordinates and the three-dimensional camera-coordinate values, k is the camera intrinsic matrix, x_w, y_w and z_w are the coordinates of the point in the world coordinate system, x_c, y_c and z_c are the coordinates of the point in the camera coordinate system, t is the displacement vector between the camera coordinate system and the world coordinate system, R is the rotation matrix between the camera coordinate system and the world coordinate system, and T is the pose transformation matrix;
obtaining a self-supervision learning core formula;
$$p_{t-1}\sim k\,T_{t\to t-1}\,D_t(p_t)\,k^{-1}\,p_t \quad (4)$$
wherein p_t is the coordinate of a pixel at time t, k is the camera intrinsic matrix, D is the pixel depth, and T_{t→t-1} is the pose transformation matrix of the camera between time t and time t-1.
3. The method for estimating the depth of the forward-looking scene based on the self-supervised learning as recited in claim 1, wherein the specific step of obtaining the visible light pre-training model comprises:
two networks needing joint training are constructed according to a learning task, a ResNet-18 network structure is adopted as a depth estimation network encoder, encoding characteristics are restored into a depth map through upsampling, the ResNet-18 network structure is adopted as a pose estimation network encoder, dimension reduction is carried out by using a small convolution kernel, and six-degree-of-freedom motion of a camera between two frames is estimated;
in the depth estimation network, an encoder sequentially samples the extracted feature quantity of each layer to different dimensions, reduces the scale of the high-dimensional features through maximum pooling, and performs next-step feature extraction to obtain an image with the size smaller than that of the original input image; in the pose estimation network, two superposed images are input, the same encoder is used, and the characteristic dimension is extracted to 2014 dimension;
uniformly upsampling different depth maps extracted from a depth network, and uniformly upsampling the image with the size smaller than that of the original input image obtained by a decoder to the size of the original input image; performing dimensionality reduction in a pose estimation network by using a convolution combination mode to obtain a pose transformation relation with six degrees of freedom;
designing a loss function;
reprojection loss:
$$L_{rc}=\frac{1}{N}\sum_{p}\left[\frac{\alpha}{2}\bigl(1-\mathrm{SSIM}(I_n,\hat{I}_n)\bigr)+(1-\alpha)\,\lvert p_n-\hat{p}_n\rvert\right],\quad n\in\{-1,+1\} \quad (5)$$
loss of edge smoothing:
$$L_{s}=\frac{1}{N}\sum_{p}\Bigl(\lvert\partial_x d_p\rvert\,e^{-\lvert\partial_x I_p\rvert}+\lvert\partial_y d_p\rvert\,e^{-\lvert\partial_y I_p\rvert}\Bigr) \quad (6)$$
total loss function:
$$L=\mu L_{rc}+\lambda L_{s} \quad (7)$$
wherein I_n represents the original image, \hat{I}_n represents the reprojected image, \hat{p}_{-1} and \hat{p}_{+1} are the reconstructed pixel values of the previous and subsequent frames, p_{-1} and p_{+1} are the actual pixel values of the previous and subsequent frames, N is the total number of image pixels, SSIM is the structural similarity evaluation between the original image and the reprojected image, d_p denotes the estimated depth and I_p the image value at pixel p, the depth smoothing term |∂d| is used to suppress local singular noise in the depth map, the edge perception term e^{-|∂I|} is used to encourage the model to learn edge information where the depth gradient changes strongly, α is the weighting coefficient between the SSIM and L1-norm losses in the reprojection loss, and μ and λ are the weighting coefficients of the two losses in the total loss;
and selecting the minimum value of the frame loss before and after the frame loss according to the minimum reprojection error, and pre-training in the KITTI visible light data set to obtain a pre-training model.
4. The forward-looking scene depth estimation method based on the self-supervised learning as recited in claim 1, wherein the specific steps of realizing the infrared image dense depth estimation comprise:
reading a pre-training model, and carrying out migration initialization on convolutional neurons of the infrared training network on the basis of the pre-training model;
and carrying out hyper-parameter adjustment on the pre-training model, and selecting the optimal model as a final result.
5. The method of claim 1, wherein the method of pre-training visible light data for KITTI comprises multi-scale up-sampling and re-projection errors.
CN202110708650.1A 2021-06-25 2021-06-25 Forward-looking scene depth estimation method based on self-supervision learning Pending CN113313732A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110708650.1A CN113313732A (en) 2021-06-25 2021-06-25 Forward-looking scene depth estimation method based on self-supervision learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110708650.1A CN113313732A (en) 2021-06-25 2021-06-25 Forward-looking scene depth estimation method based on self-supervision learning

Publications (1)

Publication Number Publication Date
CN113313732A (en) 2021-08-27

Family

ID=77380207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110708650.1A Pending CN113313732A (en) 2021-06-25 2021-06-25 Forward-looking scene depth estimation method based on self-supervision learning

Country Status (1)

Country Link
CN (1) CN113313732A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763474A (en) * 2021-09-16 2021-12-07 上海交通大学 Scene geometric constraint-based indoor monocular depth estimation method
CN114549612A (en) * 2022-02-25 2022-05-27 北京百度网讯科技有限公司 Model training and image processing method, device, equipment and storage medium
CN114972517A (en) * 2022-06-10 2022-08-30 上海人工智能创新中心 RAFT-based self-supervision depth estimation method
CN114998411A (en) * 2022-04-29 2022-09-02 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combined with space-time enhanced luminosity loss
CN116168070A (en) * 2023-01-16 2023-05-26 南京航空航天大学 Monocular depth estimation method and system based on infrared image
CN116245927A (en) * 2023-02-09 2023-06-09 湖北工业大学 ConvDepth-based self-supervision monocular depth estimation method and system
CN116563458A (en) * 2023-04-07 2023-08-08 郑州大学 Three-dimensional reconstruction method for internal diseases of drainage pipeline based on image depth estimation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325782A (en) * 2020-02-18 2020-06-23 南京航空航天大学 Unsupervised monocular view depth estimation method based on multi-scale unification
CN111783582A (en) * 2020-06-22 2020-10-16 东南大学 Unsupervised monocular depth estimation algorithm based on deep learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325782A (en) * 2020-02-18 2020-06-23 南京航空航天大学 Unsupervised monocular view depth estimation method based on multi-scale unification
CN111783582A (en) * 2020-06-22 2020-10-16 东南大学 Unsupervised monocular depth estimation algorithm based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李旭 et al.: "Depth estimation method based on monocular infrared images in VDAS", Systems Engineering and Electronics (系统工程与电子技术) *
高宏伟 et al.: "Fundamentals of Electronic Packaging Processes and Equipment Technology" (电子封装工艺与装备技术基础教程), 30 June 2017, Xidian University Press *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763474A (en) * 2021-09-16 2021-12-07 上海交通大学 Scene geometric constraint-based indoor monocular depth estimation method
CN113763474B (en) * 2021-09-16 2024-04-09 上海交通大学 Indoor monocular depth estimation method based on scene geometric constraint
CN114549612A (en) * 2022-02-25 2022-05-27 北京百度网讯科技有限公司 Model training and image processing method, device, equipment and storage medium
CN114998411A (en) * 2022-04-29 2022-09-02 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combined with space-time enhanced luminosity loss
CN114998411B (en) * 2022-04-29 2024-01-09 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combining space-time enhancement luminosity loss
CN114972517A (en) * 2022-06-10 2022-08-30 上海人工智能创新中心 RAFT-based self-supervision depth estimation method
CN114972517B (en) * 2022-06-10 2024-05-31 上海人工智能创新中心 Self-supervision depth estimation method based on RAFT
CN116168070A (en) * 2023-01-16 2023-05-26 南京航空航天大学 Monocular depth estimation method and system based on infrared image
CN116168070B (en) * 2023-01-16 2023-10-13 南京航空航天大学 Monocular depth estimation method and system based on infrared image
CN116245927A (en) * 2023-02-09 2023-06-09 湖北工业大学 ConvDepth-based self-supervision monocular depth estimation method and system
CN116245927B (en) * 2023-02-09 2024-01-16 湖北工业大学 ConvDepth-based self-supervision monocular depth estimation method and system
CN116563458A (en) * 2023-04-07 2023-08-08 郑州大学 Three-dimensional reconstruction method for internal diseases of drainage pipeline based on image depth estimation

Similar Documents

Publication Publication Date Title
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN108665496B (en) End-to-end semantic instant positioning and mapping method based on deep learning
CN110738697B (en) Monocular depth estimation method based on deep learning
CN111798475B (en) Indoor environment 3D semantic map construction method based on point cloud deep learning
CN111325797B (en) Pose estimation method based on self-supervision learning
CN108986136B (en) Binocular scene flow determination method and system based on semantic segmentation
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN110503680B (en) Unsupervised convolutional neural network-based monocular scene depth estimation method
CN113160375B (en) Three-dimensional reconstruction and camera pose estimation method based on multi-task learning algorithm
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN113283525B (en) Image matching method based on deep learning
CN112767467B (en) Double-image depth estimation method based on self-supervision deep learning
CN113657388A (en) Image semantic segmentation method fusing image super-resolution reconstruction
CN110910327B (en) Unsupervised deep completion method based on mask enhanced network model
WO2024051184A1 (en) Optical flow mask-based unsupervised monocular depth estimation method
CN114119889B (en) Cross-modal fusion-based 360-degree environmental depth completion and map reconstruction method
CN113077505A (en) Optimization method of monocular depth estimation network based on contrast learning
CN111860651A (en) Monocular vision-based semi-dense map construction method for mobile robot
CN115035171A (en) Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN115546273A (en) Scene structure depth estimation method for indoor fisheye image
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN116152442A (en) Three-dimensional point cloud model generation method and device
CN112802202A (en) Image processing method, image processing device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210827