CN117670969A - Depth estimation method, device, terminal equipment and storage medium

Publication number: CN117670969A
Application number: CN202311747857.5A
Authority: CN (China)
Legal status: Pending
Original language: Chinese (zh)
Prior art keywords: parallax, binocular, depth, depth estimation, image
Inventors: 徐敏, 郭鑫岚, 谢锴, 张新, 谷靖
Assignee (current and original): Guangdong Huitian Aerospace Technology Co Ltd

Abstract

The invention discloses a depth estimation method, a device, a terminal device and a storage medium, wherein the method comprises the following steps: acquiring binocular image data of a binocular camera and camera calibration parameters; correcting the binocular image data to obtain corrected binocular image data; estimating a parallax map with a pre-trained depth estimation model based on the corrected binocular image data, wherein the depth estimation model is obtained through supervision training based on a parallax derivative loss of an edge contour region; and based on the parallax map, calculating a depth map according to the camera calibration parameters and a preset parallax depth conversion rule. By this method, the accuracy of depth estimation for binocular images is improved.

Description

Depth estimation method, device, terminal equipment and storage medium
Technical Field
The present invention relates to the field of image processing, and in particular, to a depth estimation method, a device, a terminal device, and a storage medium.
Background
Image depth estimation is the acquisition of depth information of an image scene from a two-dimensional image, i.e. the perpendicular distance from each pixel point in the scene to the camera imaging plane. Image depth estimation techniques are widely used in the field of computer vision, for example in autonomous driving, robotics, augmented reality and medical image analysis.
Image depth estimation methods are generally based on the principle of stereoscopic vision, estimating depth by analyzing cues such as correspondence and occlusion between binocular images. The existing binocular depth estimation technology traverses the pixel points of the left-eye image and uses a binocular stereo matching algorithm to find the corresponding feature points on the right-eye image, calculates the parallax of the feature points between the left-eye and right-eye images, and then obtains the depth of the pixel points on the image from the parallax-depth relation.
The accuracy of parallax estimation is directly determined by the quality of the binocular stereo matching algorithm. Currently, binocular stereo matching algorithms based on deep learning are generally adopted, measuring the similarity of feature points by calculating a Cost Volume to obtain matched feature points. Considering computing-resource and real-time requirements, such algorithms need to downsample the left and right images, calculate the Cost Volume at the downsampled layer, and then upsample back to the input resolution to perform depth estimation. However, downsampling means that the parallaxes of several feature points can only be represented by a single feature point on the downsampled feature map, and the semantics of small obstacles are easily lost from the feature map, which affects the accuracy of depth estimation. In addition, because of the hardware limitations of the binocular vision systems that apply these algorithms, a slight change in long-range parallax causes a large change in depth, making long-range parallax harder to learn for image depth estimation techniques based on the stereoscopic vision principle.
In summary, the existing binocular depth estimation technology based on deep learning has the technical problem that the accuracy of depth estimation is limited.
Disclosure of Invention
The invention mainly aims to provide a depth estimation method, a device, terminal equipment and a storage medium, and aims to solve the technical problem that the accuracy of depth estimation is limited in the existing binocular depth estimation technology based on deep learning.
In order to achieve the above object, the present invention provides a depth estimation method, including:
acquiring binocular image data of a binocular camera and camera calibration parameters;
correcting the binocular image data and obtaining corrected binocular image data;
estimating by a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region;
based on the parallax map, calculating to obtain a depth map according to the camera calibration parameters and a preset parallax depth conversion rule.
Optionally, the correcting the binocular image data includes:
correcting the binocular image data according to camera calibration parameters, taking the alignment of the feature points of the binocular images in the horizontal direction of the image as the target.
Optionally, the estimating, based on the corrected binocular image data, by a depth estimation model obtained by pre-training, further includes, before obtaining the disparity map:
acquiring a binocular training image for training and a corresponding real parallax image;
performing parallax estimation on the binocular training image by using a binocular stereo matching algorithm model based on deep learning to obtain an estimated parallax image;
calculating the loss between the estimated disparity map and the real disparity map according to a preset loss function, wherein the preset loss function is composed of a disparity loss function of the disparity map and a disparity derivative loss function based on an edge contour area;
and based on the loss, adjusting parameters of the binocular stereo matching algorithm model to obtain a trained depth estimation model.
Optionally, the acquiring the binocular training image for training and the corresponding real parallax map includes:
acquiring binocular image data of a known binocular camera, camera calibration parameters and corresponding depth truth values;
correcting binocular image data of a known binocular camera to obtain a binocular training image for training;
processing the depth truth value according to the camera calibration parameters and a preset parallax depth conversion rule to obtain a first parallax truth value diagram;
and downsampling the first parallax truth diagram to obtain a corresponding real parallax diagram.
Optionally, the preset loss function includes an edge loss function and a parallax loss function, and calculating the loss between the estimated parallax map and the real parallax map according to the preset loss function includes:
based on an edge detection algorithm, carrying out edge extraction on the binocular training image, and forming an edge contour map by pixel points meeting preset conditions of the edge detection algorithm;
performing neighborhood expansion on the edge profile map to obtain a mask;
calculating the absolute value of parallax derivative loss of the corresponding pixel point of the mask according to the edge loss function to obtain edge loss;
according to the parallax loss function, calculating an absolute value of a relative error of the parallax loss of the pixel points corresponding to the estimated parallax image and the real parallax image, and carrying out logarithmic processing on the absolute value to obtain parallax loss;
and obtaining the loss between the estimated disparity map and the real disparity map according to the edge loss and the disparity loss.
Optionally, after the processing the depth truth value according to the camera calibration parameter and a preset parallax depth conversion rule to obtain the first parallax truth value diagram, the method further includes:
resampling the first parallax truth value graph with the parallax smaller than a preset value to obtain a second parallax truth value graph;
and downsampling the second parallax truth diagram to obtain a corresponding real parallax diagram.
Optionally, the edge detection algorithm includes any one of an edge detection algorithm based on Sobel, prewitt or Laplacian operator and an edge detection model based on deep learning.
In addition, to achieve the above object, the present invention also provides a depth estimation apparatus, including:
the acquisition data module is used for acquiring binocular image data of the binocular camera and camera calibration parameters;
the data correction module is used for correcting the binocular image data and obtaining corrected binocular image data;
the data processing module is used for estimating a parallax map through a pre-trained depth estimation model based on the corrected binocular image data, wherein the depth estimation model is obtained through supervision training based on the parallax derivative loss of an edge contour region;
And the depth calculation module is used for calculating a depth map according to the camera calibration parameters and a preset parallax depth conversion rule based on the parallax map.
The embodiment of the application also provides a terminal device, which comprises: a memory, a processor, and a depth estimation program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the depth estimation method as described above.
The embodiment of the application also proposes a storage medium having a depth estimation program stored thereon, which when executed by a processor implements the steps of the depth estimation method as described above.
In the depth estimation of binocular images, the loss of the edge contour region is added to the loss function of the binocular depth estimation model, and the parallax loss of the binocular images is converted into logarithmic space for normalization.
Drawings
FIG. 1 is a schematic diagram of functional modules of a terminal device to which a depth estimation device of the present application belongs;
FIG. 2 is a schematic flow chart of a first embodiment of a depth estimation method according to the present application;
FIG. 3 is a simplified schematic diagram of a binocular vision system in an embodiment of a depth estimation method of the present application;
fig. 4 is a schematic diagram of a refinement flow of step S220 in an embodiment of the depth estimation method of the present application;
FIG. 5 is a schematic flow chart of a second embodiment of a depth estimation method according to the present application;
fig. 6 is a schematic diagram of a refinement flow of step S510 in an embodiment of the depth estimation method of the present application;
fig. 7 is a schematic diagram of a refinement flow of step S530 in an embodiment of the depth estimation method of the present application;
fig. 8 is a flowchart of a third embodiment of a depth estimation method according to the present application.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The main solutions of the embodiments of the present invention are: acquiring binocular image data of a binocular camera and camera calibration parameters; correcting the binocular image data and obtaining corrected binocular image data; estimating by a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region; based on the parallax map, calculating to obtain a depth map according to the camera calibration parameters and a preset parallax depth conversion rule.
In the prior art, detail features in the images are easily lost in the process of downsampling the left-eye and right-eye images, leading to the technical problem of limited depth estimation accuracy.
Compared with the prior art, the binocular depth estimation model constructed by the invention can effectively reduce semantic loss of long-distance and tiny obstacles in the downsampling process and improve the accuracy of depth estimation.
Referring to fig. 1, fig. 1 is a schematic structural diagram of the hardware running environment of a depth estimation device according to an embodiment of the present invention.
As shown in fig. 1, the depth estimation apparatus may include: a processor 101, such as a central processing unit (Central Processing Unit, CPU), a communication bus 102, a user interface 103, a network interface 104, and a memory 105. The communication bus 102 is used to enable connection and communication between these components. The user interface 103 may comprise a display (Display) and an input unit such as a keyboard (Keyboard), and may optionally further comprise a standard wired interface and a wireless interface. The network interface 104 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 105 may be a high-speed random access memory (Random Access Memory, RAM) or a stable non-volatile memory (Non-Volatile Memory, NVM), such as a disk memory. The memory 105 may alternatively be a storage device separate from the aforementioned processor 101.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 does not constitute a limitation of the depth estimation device, and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
In the depth estimation device shown in fig. 1, the network interface 104 is mainly used for data communication with other devices, and the user interface 103 is mainly used for data interaction with a user. The depth estimation device invokes, through the processor 101, the depth estimation program stored in the memory 105 and performs the depth estimation method provided by the embodiments of the present invention.
Specifically, the depth estimation program in the memory 105, when executed by the processor, implements the steps of:
acquiring binocular image data of a binocular camera and camera calibration parameters;
correcting the binocular image data and obtaining corrected binocular image data;
estimating by a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region;
based on the parallax map, calculating a depth map according to the camera calibration parameters and a preset parallax depth conversion rule.
Further, the depth estimation program in the memory 105, when executed by the processor, further performs the steps of:
and correcting the binocular image data by taking the alignment of the characteristic points of the binocular image in the horizontal direction of the image as a target according to camera calibration parameters.
Further, the depth estimation program in the memory 105, when executed by the processor, further performs the steps of:
acquiring a binocular training image for training and a corresponding real parallax image;
performing parallax estimation on the binocular training image by using a stereo matching algorithm model based on deep learning to obtain an estimated parallax image;
calculating the loss between the estimated disparity map and the real disparity map according to a preset loss function, wherein the preset loss function is composed of a disparity loss function of the disparity map and a disparity derivative loss function based on an edge contour area;
and based on the loss, adjusting parameters of the stereo matching algorithm model to obtain a trained depth estimation model.
Further, the depth estimation program in the memory 105, when executed by the processor, further performs the steps of:
acquiring binocular image data of a known binocular camera, camera calibration parameters and corresponding depth truth values;
correcting binocular image data of a known binocular camera to obtain a binocular training image for training;
processing the depth truth value according to the camera calibration parameters and a preset parallax depth conversion rule to obtain a first parallax truth value diagram;
and downsampling the first parallax truth diagram to obtain a corresponding real parallax diagram.
Further, the depth estimation program in the memory 105, when executed by the processor, further performs the steps of:
based on an edge detection algorithm, carrying out edge extraction on the binocular training image, and forming an edge contour map by pixel points meeting preset conditions of the edge detection algorithm;
performing neighborhood expansion on the edge profile map to obtain a mask;
calculating the absolute value of parallax derivative loss of the corresponding pixel point of the mask according to the edge loss function to obtain edge loss;
according to the parallax loss function, calculating an absolute value of a relative error of the parallax loss of the pixel points corresponding to the estimated parallax image and the real parallax image, and carrying out logarithmic processing on the absolute value to obtain parallax loss;
and obtaining the loss between the estimated disparity map and the real disparity map according to the edge loss and the disparity loss.
Further, the depth estimation program in the memory 105, when executed by the processor, further performs the steps of:
resampling the first parallax truth value graph with the parallax smaller than a preset value to obtain a second parallax truth value graph;
and downsampling the second parallax truth diagram to obtain a corresponding real parallax diagram.
Further, the depth estimation program in the memory 105, when executed by the processor, further performs the steps of:
the edge detection algorithm comprises any one of an edge detection algorithm based on Sobel, prewitt or Laplacian operators and an edge detection model based on deep learning.
According to the scheme, binocular image data of a binocular camera and camera calibration parameters are obtained; correcting the binocular image data and obtaining corrected binocular image data; estimating by a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region; based on the parallax map, calculating to obtain a depth map according to the camera calibration parameters and a preset parallax depth conversion rule. The depth estimation method is used for predicting and estimating the scene depth of the binocular image, improves the accuracy of image depth estimation, and has the advantage of more accurate depth estimation results.
Based on the above terminal device architecture, but not limited to the above architecture, the method embodiments of the present application are presented.
Referring to fig. 2, fig. 2 is a flowchart of a first embodiment of a depth estimation method according to the present application. The depth estimation method comprises the following steps:
step S210: acquiring binocular image data of a binocular camera and camera calibration parameters;
step S220: correcting the binocular image data and obtaining corrected binocular image data;
step S230: estimating by a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region;
step S240: based on the parallax map, calculating to obtain a depth map according to the camera calibration parameters and a preset parallax depth conversion rule.
The execution subject of the method of this embodiment may be a depth estimation device, or a depth estimation terminal device or server. This embodiment takes the depth estimation device as an example; the device may be integrated on a terminal device with data processing capability, such as a smartphone or a tablet computer.
The scheme of the embodiment mainly realizes the prediction estimation of the scene depth information in the binocular image and improves the accuracy of binocular depth estimation.
The steps of this embodiment are explained in detail as follows:
Specifically, computer vision research enables computers to acquire information from images or videos and thereby recognize and understand objects, scenes and behaviors. In the field of computer vision, image depth estimation is a key technical task: it acquires depth information of an image scene from a two-dimensional image, facilitates subsequent tasks such as three-dimensional reconstruction, target detection, recognition and tracking, and is widely applied in autonomous driving, robotics, augmented reality, medical image analysis and other fields.
Image depth estimation is typically based on the principle of stereoscopic vision, i.e. reconstructing the depth perception capability in a machine vision system by simulating the binocular vision mechanism of the human eye, thereby enabling an accurate measurement of the depth of an object in an image. With the wide development of deep learning, the technical performance of binocular depth estimation based on deep learning has exceeded that of the traditional method.
However, the existing binocular depth estimation technology based on deep learning needs to downsample left and right images, and calculate the Cost Volume at the downsampling layer. The detail features of the image are easily lost in the downsampling process, so that the depth estimation result of the image is not accurate enough, and the problem that the accuracy of the depth estimation is limited exists.
Therefore, aiming at the problems of existing methods, the invention designs an edge-enhanced long-range binocular depth estimation method. The depth estimation method specifically comprises the following steps:
step S210: acquiring binocular image data of a binocular camera and camera calibration parameters;
first, binocular image data to be processed needs to be acquired by a binocular camera in the depth estimation apparatus. The binocular image data refers to images shot by a binocular camera, namely a left camera and a right camera, and comprises a left-eye image and a right-eye image. Meanwhile, the depth estimation device also needs to acquire the calibration parameters of the binocular camera, wherein the camera calibration parameters refer to the horizontal distance between the left camera and the right camera in the binocular camera and the focal length of the camera.
Step S220: correcting the binocular image data and obtaining corrected binocular image data;
secondly, because in practical application, the positions and the postures of the binocular cameras are different, the acquired binocular images are also different in alignment, namely the same pixel point on the binocular images is not on the same horizontal line on the same plane. In order to facilitate the subsequent calculation of parallax using the binocular image, the depth estimation device needs to correct the binocular image data, so as to ensure that the same pixel point on the binocular image is on the same horizontal line on the same plane.
The parallax refers to the difference between the positions of the same object in the image, which is observed at different positions or viewing angles, that is, the difference between the horizontal coordinates of the same pixel point on the left and right eye images in the corrected image data.
Step S230: estimating by a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region;
and then, inputting corrected binocular image data into a depth estimation model obtained through pre-training to perform parallax estimation, and obtaining a corresponding parallax map, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region. In order to ensure that the depth estimation model has stability to the scale change of the input data, the input data needs to be normalized, and the size of the input image is ensured to be consistent, so that the corrected binocular image data is compressed, and the binocular image data with the preset size is obtained. Based on binocular image data with preset size, a depth estimation model obtained through pre-training is utilized, prediction analysis is conducted through a deep learning network, an estimated parallax image is obtained, and the specific process of parallax calculation can refer to the process of training the depth estimation model.
The preset size is determined by practitioners according to the requirements of the depth estimation model, and the disparity map is a two-dimensional image that stores the disparity value of every pixel of a single view after stereo rectification.
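As an illustrative sketch of this inference step, the following PyTorch-style code resizes a rectified pair to a preset size and queries a trained model; the model interface, tensor layout and size values are assumptions for illustration, not the patent's reference implementation:

    import torch
    import torch.nn.functional as F

    def estimate_disparity(model, left, right, size=(384, 1280)):
        """Resize a rectified stereo pair to the model's preset input size and
        predict a disparity map. `model` and `size` are assumed placeholders."""
        # left/right: float tensors of shape (1, 3, H, W), values in [0, 1]
        l = F.interpolate(left, size=size, mode="bilinear", align_corners=False)
        r = F.interpolate(right, size=size, mode="bilinear", align_corners=False)
        with torch.no_grad():
            disparity = model(l, r)  # assumed to return (1, 1, h, w) disparity in pixels
        # If the output is later resized back to the original width, the disparity
        # values must be rescaled by the same horizontal factor.
        return disparity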
Step S240: based on the parallax map, calculating to obtain a depth map according to the camera calibration parameters and a preset parallax depth conversion rule.
And finally, according to the obtained parallax map, combining the camera calibration parameters with a preset parallax depth conversion rule (set based on the image parallax-depth conversion relation of the human-eye vision principle), the depth value of each corresponding pixel point is calculated to obtain the corresponding depth map. The disparity map obtained in step S230 of the present invention is a disparity map of the left view, so the depth map is also a depth map of the left view; the image parallax-depth conversion relationship based on the principle of human eye vision is described with reference to fig. 3 below.
The binocular depth estimation algorithm traverses the pixel points of the left-eye image, finds the corresponding feature points on the right-eye image with a matching algorithm, calculates the parallax of the feature points between the left-eye and right-eye images, and then obtains the depth of the pixel points on the left-eye image from the parallax-depth relation. For example, FIG. 3 is a simplified representation of a binocular vision system: P_L is a pixel point on the left-eye image and P_R is the corresponding pixel point on the right-eye image, and the parallax-depth relation of corresponding pixel points on the left and right images satisfies the following formula:

Z = f·b / d

where f is the focal length of the camera, b is the baseline of the left and right cameras, d = X_L − X_R is the parallax of corresponding pixel points on the left and right images, X_L is the horizontal position of the pixel point on the left-eye image, X_R is the horizontal position of the pixel point on the right-eye image, Z is the depth of the real-space point, P is the point in real space, and O_L and O_R are the centers of the left and right cameras respectively.
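The conversion from parallax map to depth map then follows directly from this relation; a minimal NumPy sketch (the guard value eps is an assumption to avoid division by zero):

    import numpy as np

    def disparity_to_depth(disparity, f, b, eps=1e-6):
        """Convert a disparity map (in pixels) to a depth map via Z = f * b / d.
        f: focal length in pixels, b: baseline; depth has the same unit as b."""
        return (f * b) / np.maximum(disparity, eps)  # guard against d = 0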
According to the scheme, binocular image data of a binocular camera and camera calibration parameters are obtained; correcting the binocular image data and obtaining corrected binocular image data; estimating by a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region; based on the parallax map, calculating to obtain a depth map according to the camera calibration parameters and a preset parallax depth conversion rule. In the depth estimation of the binocular image, the loss of the edge contour area is increased in the loss function of the binocular depth estimation model, and the parallax loss of the binocular image is converted into logarithmic space for normalization processing.
Referring to fig. 4, fig. 4 is a schematic flow chart of correcting binocular image data in an embodiment of a depth estimation method of the present application. Based on the embodiment shown in fig. 2, in this embodiment, the step S220: correcting the binocular image data, and obtaining corrected binocular image data includes:
s2201: and correcting the binocular image data by taking the alignment of the characteristic points of the binocular image in the horizontal direction of the image as a target according to camera calibration parameters.
Specifically, the binocular image is corrected according to the camera calibration parameters, i.e., the horizontal distance between the left and right cameras of the binocular camera and the focal length of the camera, with reference to the schematic diagram of the stereoscopic system in fig. 3. Based on a feature point P_L on the left-eye image, the depth estimation device finds the corresponding feature point P_R on the right-eye image, and aligns corresponding feature points on the left-eye and right-eye images in the horizontal direction of the image to complete the image correction operation; to ensure operational efficiency, generally only a few feature points are selected for image correction. The feature points may be relatively distinct and easily identified pixels in the image, for example contour points, bright points in darker areas, or dark points in lighter areas.
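For illustration, a calibrated-rectification sketch using OpenCV is given below. Note it assumes full intrinsics and extrinsics (K, distortion, R, T), which goes beyond the focal length and baseline named above; the patent itself only requires that corresponding feature points end up on the same image row, not this particular API:

    import cv2

    def rectify_pair(left, right, K1, D1, K2, D2, R, T):
        """Row-align a stereo pair so that corresponding points share the same
        image row. K1/K2: intrinsics, D1/D2: distortion, R/T: extrinsics."""
        size = (left.shape[1], left.shape[0])  # (width, height)
        R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, size, R, T)
        m1 = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_32FC1)
        m2 = cv2.initUndistortRectifyMap(K2, D2, R2, P2, size, cv2.CV_32FC1)
        left_r = cv2.remap(left, m1[0], m1[1], cv2.INTER_LINEAR)
        right_r = cv2.remap(right, m2[0], m2[1], cv2.INTER_LINEAR)
        return left_r, right_r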
According to the scheme, binocular image data of a binocular camera and camera calibration parameters are obtained; according to camera calibration parameters, aiming at the alignment of the characteristic points of the binocular image in the horizontal direction of the image, correcting the binocular image data, and obtaining corrected binocular image data; estimating by a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region; based on the parallax map, calculating to obtain a depth map according to the camera calibration parameters and a preset parallax depth conversion rule.
The binocular image data are required to be corrected through the depth estimation device, corresponding feature points on the binocular image are guaranteed to be on the same horizontal line on the same plane, parallax is conveniently calculated according to a parallax depth conversion rule, and the operation efficiency of the depth estimation model is improved.
Referring to fig. 5, fig. 5 is a schematic flow chart of training a depth estimation model in an embodiment of the depth estimation method of the present application. Based on the embodiment shown in fig. 2, in step S230: based on the corrected binocular image data, estimating by a depth estimation model obtained through pre-training, and before obtaining a parallax image, further comprising:
Step S510: acquiring a binocular training image for training and a corresponding real parallax image;
Specifically, in order to perform supervised training of the depth estimation model, known data related to the input and output of the model need to be collected as the training data set. Binocular image data sets with known image depth and corresponding depth truth values can be obtained from image processing open-source communities, paid image data websites or public network platforms; alternatively, historical binocular image data of the depth estimation device can be collected, with the corresponding depth truth values computed by means such as laser point clouds or SLAM mapping. The camera calibration parameters refer to the horizontal distance b between the left and right cameras of the binocular camera and the focal length f of the camera; for convenience of description, the horizontal distance b is hereinafter referred to as the baseline b of the binocular camera.
And converting the acquired depth true value into a corresponding real parallax image according to camera calibration parameters and a preset parallax depth conversion rule. Meanwhile, in order to ensure that the depth estimation model has stability to the scale change of the input data, the input data needs to be normalized, and the dimension of the input image is ensured to be consistent, so that the parallax image obtained through conversion is downsampled, and a real parallax image with the preset dimension is obtained. Wherein the preset size is determined by a person according to the requirements of the depth estimation model.
Step S520: performing parallax estimation on the binocular training image by using a stereo matching algorithm model based on deep learning to obtain an estimated parallax image;
specifically, after a binocular training image for training a model is obtained, the binocular training image is subjected to parallax estimation by utilizing an existing stereo matching algorithm model based on deep learning, and an estimated parallax image is obtained. The stereo matching algorithm model based on the deep learning can be HITNet (Hierarchical Iterative Tile Refinement Network), CREStereo (Cascaded Recurrent Network Stereo), IGEV-Stereo (Iterative Geometry Encoding Volume for Stereo Matching) and other algorithm models.
For example, a HITNet algorithm model may be selected as the stereo matching algorithm model to perform parallax estimation on the binocular training image, and the HITNet algorithm does not require the size of the input binocular image, so that the model input size may be set according to the experience of the person.
The HITNet algorithm model relies on a rapid multi-resolution initialization step, a tiny 2D geometric propagation and a warping operation to infer parallax, the calculated amount is smaller than that of other deep learning algorithm models, the algorithm running speed is faster, and the accuracy can be guaranteed to be higher.
The HITNet algorithm model comprises the steps of feature extraction, multi-resolution initialization and 2D geometric propagation. The tile hypothesis is an important concept in the HITNet algorithm: the image is divided into a series of fixed-size, non-overlapping blocks (tiles), each defined as a planar patch with learnable features, and parallax is estimated for each block. The block size can be adjusted according to the specific application scenario and image resolution. Specifically, a tile hypothesis consists of a geometric part describing a slanted plane with a parallax d and parallax gradients (dx, dy) in the x and y directions, and a learnable part P. The feature descriptor P is a learnable representation of the tile that allows the network to attach additional information to it.
When the HITNet algorithm model performs parallax estimation, the input binocular image data is first downsampled through a deep learning network (U-Net) to extract image features at multiple scales, and the features extracted at the different scales are then upsampled to obtain a group of feature maps at different resolutions. Parallax matching is then computed on this group of feature maps, and the initial parallax data d and the feature vector P of each block are extracted at each resolution.
In the 2D geometric propagation stage, the tile hypotheses are taken as input, and more refined tile hypotheses are output based on spatial propagation and fusion of information, updating the estimates of the tile hypotheses and their additional features. Features from the feature extraction stage are internally warped from the right image (secondary image) to the left image (reference image) to predict high-precision offsets for the input tiles. The estimated parallax map is finally obtained through these updates of the tile hypotheses.
Step S530: calculating the loss between the estimated disparity map and the real disparity map according to a preset loss function, wherein the preset loss function is composed of a disparity loss function of the disparity map and a disparity derivative loss function based on an edge contour area;
Specifically, the depth estimation device first extracts edge points of the binocular training image based on an edge detection algorithm and forms an edge contour map from the pixel points meeting the preset conditions of the edge detection algorithm; it then performs neighborhood expansion on the edge contour map to obtain a mask P, i.e. the edge contour region used to supervise the training of the depth estimation model.
Since the preset loss function is composed of a parallax loss function of the parallax map and a parallax derivative loss function based on the edge contour region, the depth estimation device calculates the parallax loss of the parallax map and the parallax derivative loss based on the edge contour region respectively.
Specifically, according to a parallax derivative loss function based on an edge contour region, calculating the absolute value of the parallax derivative loss of the pixel point corresponding to the mask P to obtain the edge loss; then, according to a parallax loss function of the parallax map, calculating an absolute value of a relative error of parallax loss of pixel points corresponding to the estimated parallax map and the real parallax map, and carrying out logarithmic processing on the absolute value to obtain parallax loss; and finally, obtaining the loss between the estimated parallax map and the real parallax map according to the edge loss and the parallax loss.
Step S540: and based on the loss, adjusting parameters of the stereo matching algorithm model to obtain a trained depth estimation model.
Specifically, the parallax estimation accuracy of the stereo matching algorithm model is judged according to the loss calculated in step S530. When the loss exceeds a preset minimum loss value, parameters of the stereo matching algorithm model need to be adjusted until the loss between the training model and the disparity map is not larger than the preset minimum loss value, and the trained depth estimation model is stored. The preset minimum loss value is set by related personnel according to the experience value so as to ensure the accuracy of parallax estimation of the model.
According to the technical scheme, the binocular training image for training and the corresponding real parallax image are obtained; performing parallax estimation on the binocular training image by using a three-dimensional matching algorithm model based on deep learning to obtain an estimated parallax image; calculating the loss between the estimated disparity map and the real disparity map according to a preset loss function, wherein the preset loss function is composed of a disparity loss function of the disparity map and a disparity derivative loss function based on an edge contour area; based on the loss, adjusting parameters of the stereo matching algorithm model to obtain a trained depth estimation model; acquiring binocular image data of a binocular camera and camera calibration parameters; correcting the binocular image data and obtaining corrected binocular image data; estimating by a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region; based on the parallax map, calculating to obtain a depth map according to the camera calibration parameters and a preset parallax depth conversion rule.
The existing deep learning-based stereo matching algorithm model is utilized, the loss function is set to be composed of a parallax loss function of a parallax image and a parallax derivative loss function based on an edge contour area, the deep learning-based stereo matching algorithm model is subjected to supervised training, so that the stereo matching algorithm model continuously learns image characteristics and performs parallax estimation, the subsequent parallax estimation on binocular images by using a trained depth estimation model is facilitated, and the accuracy of image depth estimation is improved.
Referring to fig. 6, fig. 6 is a flowchart illustrating a process of obtaining training data of a depth estimation model according to an embodiment of the depth estimation method of the present application. Based on the embodiment shown in fig. 5, in this embodiment, the step S510: the method for acquiring the binocular training image for training and the corresponding real parallax image comprises the following steps of:
step S5101: acquiring binocular image data of a known binocular camera, camera calibration parameters and corresponding depth truth values;
specifically, binocular image data of a known binocular camera, camera calibration parameters, and corresponding depth truth values are acquired by a depth estimation device. The depth estimation device can acquire binocular image data sets with known image depth and corresponding depth truth values from various image processing open source communities, payment image data websites or network public platforms; the binocular image data of the historical shooting can be collected, and the corresponding depth truth value can be obtained through calculation through laser point cloud, SLAM mapping and other modes. The camera calibration parameter refers to a camera focal length f and a base line b of the binocular camera.
Step S5102: correcting binocular image data of a known binocular camera to obtain a binocular training image for training;
specifically, because in practical application, the positions and the attitudes of the binocular cameras are different, the acquired binocular images also have alignment differences, that is, the same pixel point on the binocular images is not on the same horizontal line on the same plane. In order to facilitate the subsequent calculation of parallax using the binocular image, the depth estimation device needs to correct the binocular image data, so as to ensure that the same pixel point on the binocular image is on the same horizontal line on the same plane.
Step S5103: processing the depth truth value according to the camera calibration parameters and a preset parallax depth conversion rule to obtain a first parallax truth value diagram;
Specifically, the parallax and depth relationship in the stereoscopic vision system mentioned in the foregoing embodiment satisfies the following formula:

Z = f·b / d

where f is the focal length of the camera, b is the baseline of the left and right cameras, d is the parallax of the corresponding pixel points on the left and right images, and Z is the depth of the real-space point. Substituting the camera focal length f obtained in step S5101, the baseline b of the binocular camera, and the depth truth values into the rearranged formula d = f·b / Z yields the first parallax truth map.
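A one-line sketch of this depth-to-parallax conversion (eps is an assumed guard against division by zero):

    import numpy as np

    def depth_to_disparity(depth, f, b, eps=1e-6):
        """Turn a depth truth map into the first parallax truth map via
        d = f * b / Z, with f and b from the camera calibration parameters."""
        return (f * b) / np.maximum(depth, eps)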
Step S5104: and downsampling the first parallax truth diagram to obtain a corresponding real parallax diagram.
Specifically, in order to ensure that the depth estimation model obtained through training has stability to the scale change of the input data, normalization processing is needed to be performed on the input data, the size consistency of the input image is ensured, the depth estimation device is used for downsampling the first parallax truth diagram, and the size of the first parallax truth diagram is reduced to the preset size of the model input. Wherein the preset size is determined by a person according to the requirements of the depth estimation model.
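The downsampling step might look like the following sketch; note that rescaling the parallax values by the horizontal resize factor is an assumption consistent with stereo geometry (parallax is measured in pixels), not something the patent states explicitly:

    import torch.nn.functional as F

    def downsample_disparity(disp, size):
        """Downsample a parallax truth map (tensor of shape (B, 1, H, W)) to
        the model's preset input size = (h, w), keeping edges sharp."""
        w_in = disp.shape[-1]
        out = F.interpolate(disp, size=size, mode="nearest")
        return out * (size[1] / w_in)  # assumed rescaling of pixel disparities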
According to the scheme, the binocular image data of the known binocular camera, the camera calibration parameters and the corresponding depth truth values are obtained; correcting binocular image data of a known binocular camera to obtain a binocular training image for training; processing the depth truth value according to the camera calibration parameters and a preset parallax depth conversion rule to obtain a first parallax truth value diagram; downsampling the first parallax truth diagram to obtain a corresponding real parallax diagram; performing parallax estimation on the binocular training image by using a three-dimensional matching algorithm model based on deep learning to obtain an estimated parallax image; calculating the loss between the estimated disparity map and the real disparity map according to a preset loss function, wherein the preset loss function is composed of a disparity loss function of the disparity map and a disparity derivative loss function based on an edge contour area; based on the loss, adjusting parameters of the stereo matching algorithm model to obtain a trained depth estimation model; acquiring binocular image data of a binocular camera and camera calibration parameters; correcting the binocular image data and obtaining corrected binocular image data; estimating by a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region; based on the parallax map, calculating to obtain a depth map according to the camera calibration parameters and a preset parallax depth conversion rule.
The binocular image data of the known binocular camera, camera calibration parameters and corresponding depth truth values are processed, the binocular image data with fixed size and a real parallax map are extracted, subsequent training of a depth estimation model is facilitated, and stability of the depth estimation model is improved.
Referring to fig. 7, fig. 7 is a flow chart illustrating a disparity estimation loss of a depth estimation model in an embodiment of a depth estimation method according to the present application. Based on the embodiment shown in fig. 5, in this embodiment, the step S530: according to a preset loss function, calculating the loss between the estimated disparity map and the real disparity map comprises:
step S5301: based on an edge detection algorithm, carrying out edge extraction on the binocular training image, and forming an edge contour map by pixel points meeting preset conditions of the edge detection algorithm;
specifically, the depth estimation device performs edge extraction on a known binocular training data set by using an existing edge detection algorithm, and forms an edge contour map from pixel points meeting preset conditions of the edge detection algorithm.
Further, the edge detection algorithm includes any one of an edge detection algorithm based on Sobel, prewitt or Laplacian operator and an edge detection model based on deep learning.
Meeting the preset conditions of the edge detection algorithm means that, under each edge detection algorithm, preset conditions must be set so that the extracted pixel points are indeed image edge points. For example, for an edge detection algorithm based on the Sobel, Prewitt or Laplacian operator, the preset condition is that the gradient value of a pixel point must meet a certain threshold, where the threshold is set according to manual experience or with an adaptive threshold method; the preset condition for an edge detection model based on deep learning is that the predicted edge probability of a pixel point meets a certain threshold, again set according to manual experience or with an adaptive threshold method.
For example, in this embodiment, edge extraction is selected for a known binocular training data set, and a gradient threshold is set by using an adaptive threshold method.
Specifically, the edge detection algorithm based on the Sobel operator is based on convolution operation to realize detection of edges in the horizontal direction and the vertical direction, and is hereinafter referred to as Sobel operator for convenience of description. Before edge detection is performed by using a Sobel operator, gray processing is performed on the binocular training image, and a gray image is obtained through conversion. Because of the discontinuity in gray values, the abrupt change in gray of the image portion can be detected with a gradient, which refers to the rate of change of the gray value of the image. The Sobel operator detects edges by calculating the gray gradient of pixel points.
The principle of the Sobel operator is to calculate the gradient by performing two convolution operations on the image: one convolution computes the gradient in the horizontal direction, the other the gradient in the vertical direction. For a pixel point A in the image, its gray gradient can be calculated by the formula G = abs(Gx) + abs(Gy), where G is the gray gradient of pixel A, Gx is the gradient of A in the horizontal direction, and Gy is the gradient of A in the vertical direction.
And when convolution operation is carried out, multiplying the Sobel operator convolution matrix with pixel points in the image respectively, and adding the results to obtain gradient values of the pixel points. And (3) circulating in this way, convolving the pixel points of the whole image, and finally obtaining the gray gradient image of the whole image.
And finally, marking the pixel points with the gradient larger than the threshold value as edge points according to the gradient threshold value, and forming the edges of the image by the marked edge points so as to obtain an edge contour map of the image.
The gradient threshold can be set with a median-filtering-style procedure: all pixel points of the image are traversed, the average pixel value of a neighborhood around each pixel point is obtained, and that average is set as the gradient threshold, where the neighborhood may take 3 to 8 pixel points adjacent to the pixel point.
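A compact OpenCV sketch of this edge extraction follows; the 3x3 Sobel kernels and the 5x5 local-mean threshold are illustrative choices, since the patent leaves the exact adaptive threshold to the practitioner:

    import cv2
    import numpy as np

    def sobel_edge_map(img, thresh=None):
        """Mark pixels whose gray gradient G = abs(Gx) + abs(Gy) exceeds a
        threshold as edge points, forming the edge contour map."""
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)  # horizontal gradient
        gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)  # vertical gradient
        g = np.abs(gx) + np.abs(gy)
        if thresh is None:
            thresh = cv2.blur(g, (5, 5))  # assumed neighborhood-mean threshold
        return (g > thresh).astype(np.uint8) * 255  # white edges on black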
Step S5302: performing neighborhood expansion on the edge profile map to obtain a mask;
specifically, in order to monitor the subsequent changes of the edge contour and its neighboring regions to train the depth estimation model, it is necessary to perform neighborhood expansion on the edge contour map to obtain the mask P.
Wherein neighborhood dilation is a morphological image processing technique, also known as morphological dilation, that can be used to fill holes in an image, smooth edges, or enhance the contrast of an image.
In this embodiment, the edge contour map is neighborhood-expanded to enhance the edges of the image, so that the edges are more prominent and more visible. Specifically, an edge profile is first read, wherein the edge profile is represented by white and the other portions are represented by black. A structural element is then defined. The structural element is a small matrix representing the shape of the neighborhood, typically a rectangular structure, an elliptical structure or a crisscross structure. The edge profile is then expanded using the structural elements. Specifically, the center point of the structural element is moved to each pixel location on the image, and the neighborhood is expanded according to the shape of the structural element. If the pixel value in the neighborhood is greater than the center point pixel value of the structural element, the pixel value is updated to the center point pixel value of the structural element, and in this embodiment, the neighborhood expansion generally selects 3 or 5 pixels. Finally, a new expansion image, i.e. mask P, is obtained.
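In OpenCV terms, this neighborhood expansion is a single morphological dilation; the 5x5 rectangular structural element below matches the 3-to-5-pixel expansion mentioned above and is otherwise an assumption:

    import cv2

    def expand_edges(edge_map, ksize=5):
        """Dilate the edge contour map (white edges on black) to obtain the
        mask P covering the edge contours and their neighborhoods."""
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (ksize, ksize))
        return cv2.dilate(edge_map, kernel)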
Step S5303: calculating the absolute value of parallax derivative loss of the corresponding pixel point of the mask according to the edge loss function to obtain edge loss;
Specifically, the edge loss is calculated by comparing the absolute values of the parallax-derivative differences of corresponding pixel points in the edge contour region between the estimated parallax map and the real parallax map. The edge contour region is thereby supervised, and back-propagation reduces the semantic loss of edge contours during downsampling, improving the parallax estimation of edge regions.
For example, in the present invention, the edge loss function is expressed with reference to the following formula:

L_edge = (1 / (w·h)) · Σ_{(x,y)∈P} | G_{x,y}(D'_gt) − G_{x,y}(D_pred) |

where P refers to the edge contour region; D'_gt represents the downsampled real parallax map; D_pred is the estimated parallax map obtained from the binocular training image by the stereo matching algorithm model; G_{x,y} denotes the n-th order parallax derivative at pixel (x, y), and because of computing resources the first derivative is taken in the present invention, i.e. n = 1; w is the width and h the height of the parallax maps (both the estimated and the real parallax map).
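A PyTorch sketch of this edge loss, using finite differences as the first-order parallax derivative; the masking and normalization details follow the formula above and are otherwise assumptions:

    import torch

    def edge_loss(pred, gt, mask):
        """First-order parallax-derivative loss over the masked edge region P.
        pred/gt: (B, 1, h, w) parallax maps; mask: binary edge mask of P."""
        def grads(d):  # forward differences along x and y
            return d[..., :, 1:] - d[..., :, :-1], d[..., 1:, :] - d[..., :-1, :]
        pgx, pgy = grads(pred)
        ggx, ggy = grads(gt)
        h, w = pred.shape[-2], pred.shape[-1]
        loss = ((pgx - ggx).abs() * mask[..., :, 1:]).sum() \
             + ((pgy - ggy).abs() * mask[..., 1:, :]).sum()
        return loss / (w * h)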
Step S5304: according to the parallax loss function, calculating an absolute value of a relative error of the parallax loss of the pixel points corresponding to the estimated parallax image and the real parallax image, and carrying out logarithmic processing on the absolute value to obtain parallax loss;
Specifically, the parallax loss is calculated by taking the absolute value of the relative error of the parallax difference of corresponding pixel points between the estimated parallax map and the real parallax map and applying a logarithm; computing the normalized parallax loss in logarithmic space improves the result of long-range parallax estimation.
For example, in the present invention, the parallax loss function is expressed with reference to the following formula:

L_disp = (1 / (w·h)) · Σ_{(x,y)} log( 1 + | D_pred − D'_gt | / (D'_gt + ε) )

where D_gt represents the real parallax map and D'_gt the downsampled real parallax map; D_pred is the estimated parallax map obtained from the binocular training image by the stereo matching algorithm model; ε is a small artificially set value that prevents division by zero in the parallax loss function; w is the width and h the height of the parallax maps (both the estimated and the real parallax map).
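A PyTorch sketch of this log-space parallax loss, under the same reconstruction assumptions as the formula above:

    import torch

    def parallax_loss(pred, gt, eps=1e-6):
        """Mean log-space relative parallax error over all pixels:
        log(1 + |pred - gt| / (gt + eps)). pred/gt: (B, 1, h, w) tensors."""
        rel = (pred - gt).abs() / (gt + eps)
        return torch.log1p(rel).mean()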
Step S5305: and obtaining the loss between the estimated disparity map and the real disparity map according to the edge loss and the disparity loss.
Specifically, the edge loss obtained in step S5303 and the parallax loss obtained in step S5304 are added to obtain a parallax estimation loss of the stereo matching algorithm model.
For example, in the present invention, the preset loss function is divided into two parts, a parallax loss function L_disp and an edge loss function L_edge, wherein the values of λ1 and λ2 are set by the relevant personnel according to empirical values. The parallax estimation loss L is calculated as follows:

L = λ1·L_disp + λ2·L_edge

wherein λ1 + λ2 = 1.
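Combining the two sketches above, the total parallax estimation loss can be written as follows; the weight values 0.8 and 0.2 are illustrative only, since the invention leaves λ1 and λ2 to be set empirically:

    def total_loss(d_pred, d_gt_down, mask, lambda1: float = 0.8, lambda2: float = 0.2):
        """Parallax estimation loss L = λ1·L_disp + λ2·L_edge, with λ1 + λ2 = 1."""
        return lambda1 * disparity_loss(d_pred, d_gt_down) \
             + lambda2 * edge_loss(d_pred, d_gt_down, mask)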
According to the technical scheme, the binocular training image for training and the corresponding real parallax image are obtained; performing parallax estimation on the binocular training image by using a three-dimensional matching algorithm model based on deep learning to obtain an estimated parallax image; based on an edge detection algorithm, carrying out edge extraction on the binocular training image, and forming an edge contour map by pixel points meeting preset conditions of the edge detection algorithm; performing neighborhood expansion on the edge profile map to obtain a mask; calculating the absolute value of parallax derivative loss of the corresponding pixel point of the mask according to the edge loss function to obtain edge loss; according to the parallax loss function, calculating an absolute value of a relative error of the parallax loss of the pixel points corresponding to the estimated parallax image and the real parallax image, and carrying out logarithmic processing on the absolute value to obtain parallax loss; obtaining the loss between the estimated disparity map and the real disparity map according to the edge loss and the disparity loss; based on the loss, adjusting parameters of the stereo matching algorithm model to obtain a trained depth estimation model; acquiring binocular image data of a binocular camera and camera calibration parameters; correcting the binocular image data and obtaining corrected binocular image data; estimating by a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region; based on the parallax map, calculating to obtain a depth map according to the camera calibration parameters and a preset parallax depth conversion rule.
An expanded edge contour map is obtained by applying an existing edge extraction algorithm followed by neighborhood expansion. The depth estimation device supervises this expanded edge contour region and sets the loss function to be composed of a parallax loss function on the parallax map and a parallax derivative loss function based on the edge contour region, so that the depth estimation model continuously learns image features while performing parallax estimation. This facilitates the subsequent parallax estimation of binocular images with the trained depth estimation model and improves the accuracy of image depth estimation.
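Tying these steps together, one supervised training step might look like the following sketch, assuming a PyTorch stereo matching model, an optimizer, and the loss sketches above; the model interface is an assumption:

    def train_step(model, optimizer, left, right, d_gt_down, mask) -> float:
        """One supervised training step of the stereo matching algorithm model."""
        optimizer.zero_grad()
        d_pred = model(left, right)              # estimated parallax map
        loss = total_loss(d_pred, d_gt_down, mask)
        loss.backward()                          # back-propagate edge + parallax loss
        optimizer.step()
        return loss.item()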
Referring to fig. 8, fig. 8 is a schematic diagram of a training data flow for obtaining a depth estimation model in an embodiment of a depth estimation method of the present application. Based on the embodiment shown in fig. 6, in step S5103: processing the depth truth value according to the camera calibration parameters and a preset parallax depth conversion rule, and after obtaining a first parallax truth value diagram, further comprising:
step S810: resampling the first parallax truth value graph with the parallax smaller than a preset value to obtain a second parallax truth value graph;
specifically, in practical image depth estimation applications, the parallax-depth conversion relationship of the binocular vision system mentioned in the foregoing embodiment shows that the greater the depth, that is, the farther an object in the image is from the binocular camera, the smaller its parallax. At long range, a small change in parallax produces a large change in depth, which makes such samples harder for the depth estimation model to learn. To improve the accuracy of the model's long-range depth estimation, training data with small parallax therefore needs to be added to the model training data.
In this embodiment, resampling is performed on the first parallax truth value map with the parallax smaller than the preset value to obtain a second parallax truth value map, so as to increase training data with small parallax.
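One plausible realization of this resampling is to duplicate samples whose valid disparities fall below the preset value; the threshold and repetition factor below are assumed for illustration and are not specified by the patent:

    import numpy as np

    def resample_small_disparity(truth_maps, disp_threshold: float = 16.0, repeat: int = 2):
        """Oversample first parallax truth maps with small (long-range) disparity."""
        resampled = []
        for d in truth_maps:
            resampled.append(d)
            valid = d[d > 0]                      # ignore invalid zero-disparity pixels
            if valid.size and float(valid.mean()) < disp_threshold:
                resampled.extend([d] * (repeat - 1))
        return resampled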
Step S820: and downsampling the second parallax truth diagram to obtain a corresponding real parallax diagram.
Specifically, the downsampling operation is performed on the added small-parallax training data, namely the second parallax truth map, reducing it to the preset size of the model input. The preset size is determined according to the requirements of the depth estimation model.
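A sketch of this downsampling step follows; note that scaling the disparity values by the width ratio is an assumption that follows from disparity being measured in pixels, not a detail stated in the text:

    import cv2

    def downsample_disparity(d, target_w: int, target_h: int):
        """Reduce a parallax truth map to the preset model input size."""
        h, w = d.shape
        # Nearest-neighbor avoids blending disparities across depth discontinuities.
        d_small = cv2.resize(d, (target_w, target_h), interpolation=cv2.INTER_NEAREST)
        # Disparity is in pixels, so values scale with the image width.
        return d_small * (target_w / w)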
According to the scheme, the binocular image data of the known binocular camera, the camera calibration parameters and the corresponding depth truth values are obtained; correcting binocular image data of a known binocular camera to obtain a binocular training image for training; processing the depth truth value according to the camera calibration parameters and a preset parallax depth conversion rule to obtain a first parallax truth value diagram; downsampling the first parallax truth diagram to obtain a corresponding real parallax diagram; resampling the first parallax truth value graph with the parallax smaller than a preset value to obtain a second parallax truth value graph; downsampling the second parallax truth diagram to obtain a corresponding real parallax diagram; performing parallax estimation on the binocular training image by using a three-dimensional matching algorithm model based on deep learning to obtain an estimated parallax image; calculating the loss between the estimated disparity map and the real disparity map according to a preset loss function, wherein the preset loss function is composed of a disparity loss function of the disparity map and a disparity derivative loss function based on an edge contour area; based on the loss, adjusting parameters of the stereo matching algorithm model to obtain a trained depth estimation model; acquiring binocular image data of a binocular camera and camera calibration parameters; correcting the binocular image data and obtaining corrected binocular image data; estimating by a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region; based on the parallax map, calculating to obtain a depth map according to the camera calibration parameters and a preset parallax depth conversion rule.
Resampling the parallax truth map thus further expands the long-range, small-parallax training data, which facilitates the subsequent training of the depth estimation model and improves its long-range depth estimation accuracy.
In addition, to achieve the above object, the present invention also provides a depth estimation apparatus, including:
the acquisition data module is used for acquiring binocular image data of the binocular camera and camera calibration parameters;
the data correction module is used for correcting the binocular image data and obtaining corrected binocular image data;
the data processing module is used for estimating through a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervised training based on parallax derivative loss of an edge contour region;
and the depth calculation module is used for calculating a depth map according to the camera calibration parameters and a preset parallax depth conversion rule based on the parallax map.
For the principle and implementation process of depth estimation in this embodiment, refer to the above embodiments; details are not repeated here.
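As an illustration of the depth calculation module, the usual binocular conversion Z = f·B / d (focal length f in pixels, baseline B in meters) might be sketched as follows; the function name and the invalid-pixel handling are assumptions:

    import numpy as np

    def disparity_to_depth(disparity, focal_px: float, baseline_m: float, eps: float = 1e-6):
        """Convert a disparity map (pixels) to a depth map (meters) via Z = f*B/d."""
        depth = focal_px * baseline_m / np.maximum(disparity, eps)
        depth[disparity <= 0] = 0.0              # mark pixels with no valid disparity
        return depth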
The embodiment of the application also provides a terminal device, which comprises a memory, a processor and a depth estimation program stored on the memory and capable of running on the processor, wherein the depth estimation program realizes the steps of the depth estimation method when being executed by the processor.
Because the depth estimation program is executed by the processor and adopts all the technical schemes of all the embodiments, the depth estimation program at least has all the beneficial effects brought by all the technical schemes of all the embodiments and is not described in detail herein.
The embodiments of the present application also propose a computer-readable storage medium, on which a depth estimation program is stored, which when executed by a processor implements the steps of the depth estimation method as described above.
Because the depth estimation program is executed by the processor and adopts all the technical schemes of all the embodiments, the depth estimation program at least has all the beneficial effects brought by all the technical schemes of all the embodiments and is not described in detail herein.
It is noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are for description only and do not represent the relative merits of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general hardware platform, or alternatively by hardware, though in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a storage medium as above (e.g. ROM/RAM, magnetic disk, optical disk), including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a controlled terminal, or a network device, etc.) to perform the method of each embodiment of the present application.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. A depth estimation method, the depth estimation method comprising:
acquiring binocular image data of a binocular camera and camera calibration parameters;
correcting the binocular image data and obtaining corrected binocular image data;
estimating by a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervision training based on parallax derivative loss of an edge contour region;
based on the parallax map, calculating to obtain a depth map according to the camera calibration parameters and a preset parallax depth conversion rule.
2. The depth estimation method of claim 1, wherein correcting the binocular image data comprises:
and correcting the binocular image data by taking the alignment of the characteristic points of the binocular image in the horizontal direction of the image as a target according to camera calibration parameters.
3. The depth estimation method according to claim 1, wherein the estimating by a depth estimation model trained in advance based on the corrected binocular image data further comprises, before obtaining a disparity map:
Acquiring a binocular training image for training and a corresponding real parallax image;
performing parallax estimation on the binocular training image by using a binocular stereo matching algorithm model based on deep learning to obtain an estimated parallax image;
calculating the loss between the estimated disparity map and the real disparity map according to a preset loss function, wherein the preset loss function is composed of a disparity loss function of the disparity map and a disparity derivative loss function based on an edge contour area;
and based on the loss, adjusting parameters of the binocular stereo matching algorithm model to obtain a trained depth estimation model.
4. A depth estimation method according to claim 3, wherein the acquiring the training binocular training image and the corresponding true disparity map comprises:
acquiring binocular image data of a known binocular camera, camera calibration parameters and corresponding depth truth values;
correcting binocular image data of a known binocular camera to obtain a binocular training image for training;
processing the depth truth value according to the camera calibration parameters and a preset parallax depth conversion rule to obtain a first parallax truth value diagram;
And downsampling the first parallax truth diagram to obtain a corresponding real parallax diagram.
5. A depth estimation method according to claim 3, wherein the predetermined loss function comprises an edge loss function and a disparity loss function, and wherein calculating the loss between the estimated disparity map and the true disparity map based on the predetermined loss function comprises:
based on an edge detection algorithm, carrying out edge extraction on the binocular training image, and forming an edge contour map by pixel points meeting preset conditions of the edge detection algorithm;
performing neighborhood expansion on the edge profile map to obtain a mask;
calculating the absolute value of parallax derivative loss of the corresponding pixel point of the mask according to the edge loss function to obtain edge loss;
according to the parallax loss function, calculating an absolute value of a relative error of the parallax loss of the pixel points corresponding to the estimated parallax image and the real parallax image, and carrying out logarithmic processing on the absolute value to obtain parallax loss;
and obtaining the loss between the estimated disparity map and the real disparity map according to the edge loss and the disparity loss.
6. The method of depth estimation according to claim 4, wherein the processing the depth truth value according to the camera calibration parameter and a preset parallax depth conversion rule to obtain a first parallax truth value diagram further comprises:
Resampling the first parallax truth value graph with the parallax smaller than a preset value to obtain a second parallax truth value graph;
and downsampling the second parallax truth diagram to obtain a corresponding real parallax diagram.
7. The depth estimation method of claim 5, wherein the edge detection algorithm comprises any one of an edge detection algorithm based on Sobel, prewitt or Laplacian operator, and an edge detection model based on deep learning.
8. A depth estimation device, the device comprising:
the acquisition data module is used for acquiring binocular image data of the binocular camera and camera calibration parameters;
the data correction module is used for correcting the binocular image data and obtaining corrected binocular image data;
the data processing module is used for estimating through a depth estimation model obtained through pre-training based on the corrected binocular image data to obtain a parallax image, wherein the depth estimation model is obtained through supervised training based on parallax derivative loss of an edge contour region;
and the depth calculation module is used for calculating a depth map according to the camera calibration parameters and a preset parallax depth conversion rule based on the parallax map.
9. A depth estimation apparatus, the apparatus comprising: memory, a processor and a depth estimation program stored on the memory and executable on the processor, which depth estimation program, when executed by the processor, implements the steps of the depth estimation method according to any one of claims 1 to 7.
10. A storage medium having a depth estimation program stored thereon, which when executed by a processor, implements the steps of the depth estimation method according to any one of claims 1 to 7.