CN111179331A - Depth estimation method, depth estimation device, electronic equipment and computer-readable storage medium - Google Patents


Info

Publication number
CN111179331A
Authority
CN
China
Prior art keywords
depth
map
target data
network
original image
Prior art date
Legal status
Granted
Application number
CN201911406449.7A
Other languages
Chinese (zh)
Other versions
CN111179331B (en)
Inventor
黄浴
Current Assignee
Zhiche Youxing Technology Shanghai Co ltd
Original Assignee
Zhiche Youxing Technology Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Zhiche Youxing Technology Shanghai Co ltd
Priority to CN201911406449.7A
Publication of CN111179331A
Application granted
Publication of CN111179331B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The embodiment of the disclosure discloses a depth estimation method, a depth estimation device, an electronic device and a computer-readable storage medium. The method comprises the following steps: acquiring an original point cloud acquired by a laser radar and acquiring an original image acquired by a camera; determining a first quality evaluation index corresponding to the original point cloud and determining a second quality evaluation index corresponding to the original image; determining first target data which passes the quality evaluation in the original point cloud and the original image according to the first quality evaluation index and the second quality evaluation index; and obtaining a depth estimation result according to the first target data and a corresponding depth estimation strategy. Compared with the prior art, the embodiment of the disclosure can effectively ensure the reliability of the depth estimation result.

Description

Depth estimation method, depth estimation device, electronic equipment and computer-readable storage medium
Technical Field
The present disclosure relates to the field of depth estimation technologies, and in particular, to a depth estimation method, apparatus, electronic device, and computer-readable storage medium.
Background
Depth estimation is a very important step for an automatic driving system. At present, depth estimation is generally performed only on images acquired by a camera, so once the quality of the acquired image is poor, the reliability of the depth estimation result is also very poor.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. Embodiments of the present disclosure provide a depth estimation method, apparatus, electronic device, and computer-readable storage medium.
According to an aspect of the embodiments of the present disclosure, there is provided a depth estimation method, including:
acquiring an original point cloud acquired by a laser radar and acquiring an original image acquired by a camera;
determining a first quality evaluation index corresponding to the original point cloud, and determining a second quality evaluation index corresponding to the original image;
determining first target data which passes quality evaluation in the original point cloud and the original image according to the first quality evaluation index and the second quality evaluation index;
and obtaining a depth estimation result according to the first target data and a corresponding depth estimation strategy.
According to another aspect of the embodiments of the present disclosure, there is provided a depth estimation apparatus including:
the first acquisition module is used for acquiring original point cloud acquired by the laser radar and acquiring an original image acquired by the camera;
the first determining module is used for determining a first quality evaluation index corresponding to the original point cloud and determining a second quality evaluation index corresponding to the original image;
the second determination module is used for determining first target data which passes quality evaluation in the original point cloud and the original image according to the first quality evaluation index and the second quality evaluation index;
and the second acquisition module is used for acquiring a depth estimation result according to the first target data and a corresponding depth estimation strategy.
According to still another aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
the processor is used for reading the executable instruction from the memory and executing the instruction to realize the depth estimation method.
According to yet another aspect of an embodiment of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above-described depth estimation method.
In the embodiments of the present disclosure, after a first quality evaluation index is determined for the original point cloud acquired by the laser radar and a second quality evaluation index is determined for the original image acquired by the camera, the first target data that passes the quality evaluation, among the original point cloud and the original image, can be determined according to the first quality evaluation index and the second quality evaluation index; then, a depth estimation result can be obtained according to the first target data and the corresponding depth estimation strategy. It can be seen that the embodiments of the present disclosure use a multi-sensor system that may include both a laser radar and a camera, and depth estimation is performed on the first target data that passes the quality evaluation, selected from the original point cloud collected by the laser radar and the original image collected by the camera. In other words, as long as the data collected by at least one of the laser radar and the camera meets the quality requirements, a reliable depth estimation result can be obtained. Therefore, compared with the prior art, which performs depth estimation directly on the image collected by the camera, the embodiments of the present disclosure can effectively ensure the reliability of the depth estimation result.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
fig. 1 is a schematic flow chart of a depth estimation method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a block diagram of a depth estimation system in an exemplary embodiment of the present disclosure;
FIG. 3 is another block diagram of a depth estimation system in an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the operation of a deep network in an exemplary embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the operation of a deep network in another exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the operation of a deep network in yet another exemplary embodiment of the present disclosure;
fig. 7 is a block diagram of a depth estimation apparatus according to an exemplary embodiment of the disclosure;
fig. 8 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
The example embodiments described below are only some embodiments of the present disclosure, not all of them; it should be understood that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those within the art that the terms "first", "second", etc. in the embodiments of the present disclosure are used merely to distinguish one step, device or module from another, and do not denote any particular technical meaning or necessary logical order; "plurality" may mean two or more, and "at least one" may mean one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure is only one kind of association relationship describing an associated object, and means that three kinds of relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In the present disclosure, the character "/" indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity. It should be understood that the dimensions of the various features shown in the drawings are not drawn to scale for ease of illustration.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network pcs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, and the like may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Exemplary method
Fig. 1 is a flowchart illustrating a depth estimation method according to an exemplary embodiment of the disclosure. The method shown in fig. 1 may include step 101, step 102, step 103 and step 104, which are described separately below.
Step 101, acquiring an original point cloud acquired by a laser radar and acquiring an original image acquired by a camera.
Here, the raw point cloud collected by the lidar and the raw image collected by the camera may be time synchronized. If the laser radar continuously collects the original point clouds to obtain a point cloud sequence consisting of a plurality of original point clouds and the camera continuously collects the original images to obtain an image sequence consisting of a plurality of original images, the frame rates of the point cloud sequence and the image sequence may be the same.
Here, the number of cameras may be one; alternatively, the number of cameras may be at least two. Specifically, as can be seen from the depth estimation systems shown in fig. 2 and 3, the number of the cameras may be two, the two cameras are the camera 1 and the camera 2 respectively, and the camera 1 and the camera 2 may constitute a binocular camera, in which case, the original image involved in step 101 may include the original image collected by the camera 1 and the original image collected by the camera 2 at the same time.
Step 102, determining a first quality evaluation index corresponding to the original point cloud, and determining a second quality evaluation index corresponding to the original image.
Here, the first quality evaluation index may be used to evaluate the quality of the original point cloud, and the first quality evaluation index is various in types, and for clarity of layout, the following description is given by way of example.
Here, the second quality evaluation index may be used to evaluate the quality of the original image, and the second quality evaluation index may be a measure commonly used in image processing and video frame acquisition, such as the peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and the like.
Step 103, determining first target data which passes the quality evaluation in the original point cloud and the original image according to the first quality evaluation index and the second quality evaluation index.
The first quality evaluation index is used for evaluating the quality of the original point cloud, the second quality evaluation index is used for evaluating the quality of the original image, and whether the quality evaluation of the original point cloud and the quality evaluation of the original image pass or not can be respectively determined according to the first quality evaluation index and the second quality evaluation index, so that corresponding first target data can be obtained according to the determination result. Several possible scenarios of the data composition of the first target data are exemplified below with reference to fig. 2 and 3.
In the first case, the quality evaluation of the original point cloud collected by the laser radar is passed, the quality evaluation of the original image collected by the camera 1 and the quality evaluation of the original image collected by the camera 2 are both passed, and at this time, the first target data may include the original point cloud, the original image collected by the camera 1, and the original image collected by the camera 2 at the same time.
In the second case, the quality evaluation of the original point cloud collected by the laser radar is passed, and the quality evaluation of the original image collected by the camera 1 and the quality evaluation of the original image collected by the camera 2 are not passed, at this time, the first target data may only include the original point cloud.
In the third case, the quality evaluation of the original point cloud acquired by the lidar does not pass, and the quality evaluation of at least one of the original image acquired by the camera 1 and the original image acquired by the camera 2 passes; at this time, the first target data may include only the original image(s), among the original images acquired by the camera 1 and the camera 2, whose quality evaluation passes.
In the fourth case, the quality evaluation of the original point cloud acquired by the lidar passes, and the quality evaluation of one of the original image acquired by the camera 1 and the original image acquired by the camera 2 passes; at this time, the first target data may include only the original image, among the original images acquired by the camera 1 and the camera 2, whose quality evaluation passes, together with the original point cloud.
And 104, obtaining a depth estimation result according to the first target data and a corresponding depth estimation strategy.
Here, in the case where only the original point cloud is included in the first target data, a depth estimation result may be obtained from the first target data with a point-cloud-based depth estimation strategy; in the case where only at least one of the original image collected by the camera 1 and the original image collected by the camera 2 is included in the first target data, a depth estimation result may be obtained from the first target data with an image-based depth estimation strategy; and in the case where both the original point cloud and at least one of the original images acquired by the camera 1 and the camera 2 are included in the first target data, a depth estimation result may be obtained from the first target data with a depth estimation strategy based on the fusion of the point cloud and the image. Alternatively, the depth estimation result may be a dense depth map (e.g., the first dense depth map hereinafter).
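For illustration only, the following sketch (in Python) shows one way the above strategy selection could be expressed in code; the function name, flag names and returned labels are illustrative assumptions and do not appear in the embodiments described above.

```python
def select_strategy(cloud_ok: bool, img1_ok: bool, img2_ok: bool) -> str:
    """Pick a depth estimation strategy from the per-sensor quality checks.

    The three flags correspond to the quality evaluation of the lidar point
    cloud and of the images from camera 1 and camera 2, respectively.
    """
    any_image = img1_ok or img2_ok
    if cloud_ok and any_image:
        return "fusion"        # point cloud + image(s) -> fused depth network
    if cloud_ok:
        return "point_cloud"   # point cloud only -> sparse/normalized-invariant CNN
    if any_image:
        return "image"         # image(s) only -> image-based depth network
    return "none"              # no sensor passed quality evaluation
```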
It should be noted that, as shown in fig. 2 and fig. 3, a control switch a may be provided corresponding to the laser radar, a control switch B may be provided corresponding to the camera 1, and a control switch C may be provided corresponding to the camera 2 in the depth estimation system.
In the case where an original point cloud exists in the first target data, the control switch a may be placed in a closed state so that the original point cloud may be used for subsequent depth estimation; otherwise, the control switch a may be placed in an open state.
Similarly, in the case that the original image acquired by the camera 1 exists in the first target data, the control switch B may be placed in a closed state, so that the original image acquired by the camera 1 may be used for subsequent depth estimation; otherwise, the control switch B may be placed in an open state.
Similarly, in the case that the original image acquired by the camera 2 exists in the first target data, the control switch C may be placed in a closed state, so that the original image acquired by the camera 2 may be used for subsequent depth estimation; otherwise, the control switch C may be placed in an open state.
In the embodiments of the present disclosure, after a first quality evaluation index is determined for the original point cloud acquired by the laser radar and a second quality evaluation index is determined for the original image acquired by the camera, the first target data that passes the quality evaluation, among the original point cloud and the original image, can be determined according to the first quality evaluation index and the second quality evaluation index; then, a depth estimation result can be obtained according to the first target data and the corresponding depth estimation strategy. It can be seen that the embodiments of the present disclosure use a multi-sensor system that may include both a laser radar and a camera, and depth estimation is performed on the first target data that passes the quality evaluation, selected from the original point cloud collected by the laser radar and the original image collected by the camera. In other words, as long as the data collected by at least one of the laser radar and the camera meets the quality requirements, a reliable depth estimation result can be obtained. Therefore, compared with the prior art, which performs depth estimation directly on the image collected by the camera, the embodiments of the present disclosure can effectively ensure the reliability of the depth estimation result.
In an optional example, obtaining a depth estimation result with a corresponding depth estimation strategy according to the first target data includes:
under the condition that the first target data comprises an original image, extracting features according to the original image to obtain first image features;
generating a first target feature map according to the first image feature; wherein the first target feature map comprises a first attention map, a first normal map and a first edge map;
and obtaining a first dense depth map output by the depth network according to the original image, the first target feature map and the depth network, and taking the first dense depth map as a depth estimation result.
In general, the depth network may also be referred to as DepthNet.
In the embodiment of the present disclosure, in the case where the original image is included in the first target data, feature extraction may be performed according to the original image to obtain the first image feature.
In the case where the original image captured by the camera 1 and the original image captured by the camera 2 are included in the first target data, the original image may be selected from the original image captured by the camera 1 and the original image captured by the camera 2, and the selected original image may be input to the encoder shown in fig. 2 and 3 to extract features, so as to obtain the first image features.
In the case where one of the original image captured by the camera 1 and the original image captured by the camera 2 is included in the first target data, for example, in the case where only the original image captured by the camera 1 is included, the original image captured by the camera 1 may be directly input to the encoder shown in fig. 2 and 3 to extract features, so as to obtain the first image features.
In both cases, the encoder can extract features using either a residual network (ResNet) or a dense network (DenseNet).
After the first image feature is obtained, a first target feature map may be generated based on the first image feature. Here, the first image feature may be input to the segmentation network, the normal network, and the edge network shown in fig. 2 and 3, respectively, to obtain a first attention map output by the segmentation network, a first normal map output by the normal network, and a first edge map output by the edge network; the first attention map, the first normal map (which carries surface normal information), and the first edge map (which carries contour information) may constitute the first target feature map. Specifically, the segmentation network may be a U-Net-based network, where U-Net is an image segmentation network; the normal network may also be referred to as NormalNet; and the edge network may also be referred to as EdgeNet.
After the first target feature map is obtained, a first dense depth map output by the depth network may be obtained according to the original image in the first target data, the first target feature map, and the depth network. In particular, the depth network may be a monocular depth network.
In the embodiment of the disclosure, when the original image is included in the first target data, a first target feature map including the first attention map, the first normal map, and the first edge map may be obtained according to a first image feature obtained by feature extraction on the original image, and then, depth estimation may be performed according to the original image, the first target feature map, and the depth network in the first target data, so that a first dense depth map as a depth estimation result may be obtained. Therefore, in the embodiment of the disclosure, information such as motion, contour, surface normal and the like can be used for depth estimation, so that the reliability of the depth estimation result can be better ensured.
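For illustration only, the following sketch shows a toy version of the image branch described above: a shared encoder followed by segmentation, normal and edge heads whose outputs are concatenated with the image as input to the depth network. The module structure and channel sizes are assumptions made for the example and are not the architecture disclosed in fig. 2 and 3.

```python
import torch
import torch.nn as nn

class FeatureHeads(nn.Module):
    """Toy encoder with segmentation / normal / edge heads (a sketch, not the
    patent's exact architecture; channel sizes are illustrative)."""
    def __init__(self, in_ch=3, feat_ch=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        self.attention_head = nn.Conv2d(feat_ch, 1, 1)   # first attention map
        self.normal_head = nn.Conv2d(feat_ch, 3, 1)      # first normal map
        self.edge_head = nn.Conv2d(feat_ch, 1, 1)        # first edge map

    def forward(self, image):
        f = self.encoder(image)
        return (torch.sigmoid(self.attention_head(f)),
                self.normal_head(f),
                torch.sigmoid(self.edge_head(f)))

# The depth network then consumes the image together with the three maps,
# e.g. concatenated along the channel dimension:
heads = FeatureHeads()
img = torch.randn(1, 3, 64, 64)
att, nrm, edge = heads(img)
depth_net_input = torch.cat([img, att, nrm, edge], dim=1)   # 1 x 8 x 64 x 64
```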
In an alternative example, obtaining a first dense depth map output by a depth network according to an original image, a first target feature map and the depth network comprises:
determining second target data according to the original image;
and obtaining a first dense depth map output by the depth network according to the second target data, the first target feature map and the depth network.
Here, the second target data may be determined according to the original image, and a specific embodiment of determining the second target data will be described below.
In a specific embodiment, the number of the cameras is one, and the determining the second target data according to the original image includes:
and taking the original image in the first target data as second target data.
In this embodiment, the original image in the first target data is directly used as the second target data, and therefore, the embodiment can determine the second target data very conveniently.
In another specific embodiment, the number of the cameras is two, and the determining the second target data according to the original image includes:
and under the condition that the first target data comprises the original images respectively collected by the two cameras, carrying out image fusion on the original images respectively collected by the two cameras in the first target data to obtain a fusion result, and determining second target data comprising the fusion result.
Here, in the case where the original images respectively captured by the two cameras are included in the first target data, for example, in the case where the original image captured by the camera 1 and the original image captured by the camera 2 in fig. 2 and 3 are included in the first target data, the original image captured by the camera 1 and the original image captured by the camera 2 may be subjected to image fusion to obtain a new image as a fusion result, and obtain the second target data including the fusion result.
Specifically, when image fusion is performed, feature merging or correlation may be directly performed, for example, as shown in fig. 4, feature correlation may be performed by using a disparity network generalized by an optical flow network, and the correlation in fig. 4 may represent a layer in which binocular image features are subjected to correlation calculation. Specifically, the optical flow network may also be referred to as FlowNet, and the parallax network may also be referred to as DispNet.
Of course, when image fusion is performed, a four-dimensional cost volume may also be computed by a conventional stereo vision method and then fed into a three-dimensional convolutional neural network (3D CNN); typical reference networks include the pyramid stereo matching deep learning network, the stereo matching regression network, and the like. Specifically, the four-dimensional cost volume may also be referred to as a 4-D cost volume, the pyramid stereo matching deep learning network may also be referred to as PSM-Net, and the stereo matching regression network may also be referred to as GCNet.
In this embodiment, when the first target data includes the original images respectively acquired by the two cameras, the second target data can be determined very conveniently by image fusion.
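For illustration only, the sketch below shows a simplified correlation layer over horizontal disparities, in the spirit of the FlowNet/DispNet-style feature correlation mentioned above; the disparity range and normalization are assumptions, as the embodiments above only state that binocular features are merged or correlated.

```python
import torch

def correlation_volume(left_feat, right_feat, max_disp=8):
    """Correlate left/right feature maps over horizontal disparities
    (a simplified DispNet/FlowNet-style correlation layer; the patent only
    states that binocular features are correlated, not this exact form)."""
    b, c, h, w = left_feat.shape
    vols = []
    for d in range(max_disp):
        if d == 0:
            shifted = right_feat
        else:
            shifted = torch.zeros_like(right_feat)
            shifted[:, :, :, d:] = right_feat[:, :, :, :-d]
        vols.append((left_feat * shifted).mean(dim=1, keepdim=True))  # per-disparity score
    return torch.cat(vols, dim=1)   # B x max_disp x H x W

left = torch.randn(2, 16, 32, 64)
right = torch.randn(2, 16, 32, 64)
corr = correlation_volume(left, right)   # 2 x 8 x 32 x 64
```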
No matter which manner is adopted to determine the second target data, after the second target data is determined, a first dense depth map output by the depth network can be obtained according to the second target data, the first target feature map and the depth network. A specific implementation of obtaining the first dense depth map is described below by way of example.
In a specific embodiment, obtaining a first dense depth map output by a depth network according to the second target data, the first target feature map and the depth network includes:
and under the condition that the original point cloud is not included in the first target data, inputting the second target data and the first target feature map into the depth network to obtain a first dense depth map output by the depth network.
Here, while outputting the first dense depth map, the depth network may further output a confidence map corresponding to the first dense depth map, where the confidence map may be used to characterize the confidence of each depth estimation value in the first dense depth map, and the confidence may be characterized by a coefficient between 0 and 1.
In this embodiment, when the first target data does not include the original point cloud, the second target data and the first target feature map may be directly input to the depth network, and the depth network may directly output the first dense depth map as the depth estimation result.
In another specific embodiment, obtaining a first dense depth map output by a depth network according to the second target data, the first target feature map and the depth network includes:
under the condition that the first target data comprises original point clouds, obtaining a sparse depth map and a sparse mask according to the original point clouds;
and inputting the second target data, the first target feature map, the obtained sparse depth map and the sparse mask into the depth network to obtain a first dense depth map output by the depth network.
Here, in the case that the original point cloud is included in the first target data, the original point cloud may be subjected to perspective projection by a perspective projection module to obtain a sparse depth map and a sparse mask (e.g., a binary sparse mask), where the sparse mask may indicate, with a value of 0 or 1, which positions in the sparse depth map have depth values provided by the lidar and which positions do not.
Then, the second target data, the first target feature map, and the obtained sparse depth map and sparse mask may be input to the depth network together, so as to implement fusion of the second target data, the first target feature map, and the obtained sparse depth map and sparse mask through the depth network, thereby obtaining a first dense depth map output by the depth network as a depth estimation result.
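For illustration only, the following sketch shows a typical way a perspective projection module could turn a point cloud into a sparse depth map and a binary sparse mask; the intrinsic matrix and the assumption that the points are already expressed in the camera frame are illustrative.

```python
import numpy as np

def project_to_sparse_depth(points, K, h, w):
    """Project lidar points (N x 3, in the camera frame) through intrinsics K
    to a sparse depth map and a binary mask (a sketch of the perspective
    projection step; extrinsics are assumed to be applied already)."""
    depth = np.zeros((h, w), dtype=np.float32)
    mask = np.zeros((h, w), dtype=np.float32)
    z = points[:, 2]
    valid = z > 0
    uv = (K @ points[valid].T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    keep = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth[v[keep], u[keep]] = z[valid][keep]
    mask[v[keep], u[keep]] = 1.0   # 1 where the lidar provides a depth value
    return depth, mask

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pts = np.random.rand(1000, 3) * [10, 10, 50]
sparse_depth, sparse_mask = project_to_sparse_depth(pts, K, 480, 640)
```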
Here, the depth network may adopt either a pre-fusion scheme or a post-fusion scheme. In general, pre-fusion means that the data undergo a great deal of preprocessing, such as feature extraction, before being fused; post-fusion means that information is merged or combined only in the feature space or in the final task space (e.g., the coordinate system used for positioning).
Specifically, when the depth network adopts the pre-fusion method, as shown in fig. 5, the sparse depth map, the sparse mask, the image in the second target data (which may be a monocular image), and the attention map in the first target feature map (which is specifically the first attention map in the foregoing), the normal map (which is specifically the first normal map in the foregoing), and the edge map (which is specifically the first edge map in the foregoing) may be input into an encoder of the depth network for merging, and then the merging result is provided to a decoder of the depth network, and the decoder may output the first dense depth map accordingly.
When the depth network adopts the post-fusion mode, as shown in fig. 6, the sparse depth map and the sparse mask may enter one encoder in the depth network, the image (which may be a monocular image) in the second target data, and the attention map (which is specifically the first attention map in the foregoing), the normal map (which is specifically the first normal map in the foregoing), and the edge map (which is specifically the first edge map in the foregoing) in the first target feature map may enter another encoder in the depth network, the output results of the two encoders are combined at the input end of the decoder of the depth network, and the decoder may output the first dense depth map accordingly.
It should be noted that the merging in fig. 5 and 6 refers to superimposing the features or images of the same spatial size in the channel dimension.
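For illustration only, the sketch below contrasts the pre-fusion and post-fusion channel concatenation described for fig. 5 and fig. 6; the encoders are single convolution layers standing in for the real encoders, and all channel counts are assumptions.

```python
import torch
import torch.nn as nn

enc_a = nn.Conv2d(2, 16, 3, padding=1)     # encoder for sparse depth + mask
enc_b = nn.Conv2d(8, 16, 3, padding=1)     # encoder for image + attention/normal/edge maps
enc_pre = nn.Conv2d(10, 16, 3, padding=1)  # single encoder for the pre-fusion case

sparse = torch.randn(1, 2, 64, 64)         # sparse depth map + binary mask
image_branch = torch.randn(1, 8, 64, 64)   # image (3) + attention (1) + normal (3) + edge (1)

# Pre-fusion: everything is concatenated on the channel dimension and fed to
# a single encoder before decoding (as in Fig. 5).
pre_fused = enc_pre(torch.cat([sparse, image_branch], dim=1))

# Post-fusion: each modality has its own encoder and the two feature maps are
# concatenated only at the decoder input (as in Fig. 6).
post_fused = torch.cat([enc_a(sparse), enc_b(image_branch)], dim=1)
```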
In this embodiment, when the first target data includes both the original point cloud and the original image, the sparse depth map and the sparse mask obtained from the original point cloud, together with the second target data and the first target feature map, may be input to the depth network, and the depth network may output the first dense depth map as the depth estimation result. Moreover, this implementation can make full use of the color image information from the camera to interpolate the sparse depth data of the lidar point cloud and thus achieve depth completion; meanwhile, the reliable depth values that the lidar point cloud provides at discrete points on the image plane can enhance or regularize the image-based depth inference, improving the precision and robustness of the whole depth estimation system.
Therefore, in the embodiment of the disclosure, the depth estimation result can be conveniently and reliably obtained by using the second target data determined according to the original image, the first target feature map and the depth network.
In an optional example, before obtaining a first dense depth map output by a depth network according to an original image, a first target feature map and the depth network, the method further includes:
generating a depth network;
performing feature extraction according to a training image collected by the camera to obtain a second image feature;
according to the second image feature, obtaining self-motion parameters of the camera and a second target feature map, and warping the self-motion parameters to obtain a warping result; wherein the second target feature map comprises a second attention map, a second normal map and a second edge map;
inputting the second target feature map and the training image into a depth network to obtain a second dense depth map and a confidence map output by the depth network;
inputting the second dense depth map and the confidence map into a residual optical flow network to obtain a residual optical flow output by the residual optical flow network;
and adding the residual optical flow and the warping result to obtain an optical flow field, and correcting the depth network according to the optical flow field.
In order to perform depth estimation using the depth network, the depth network needs to be trained first. Taking the case that the depth estimation system includes only one camera as an example, the camera may first be used to acquire training images, and an initial depth network may be generated.
Next, the training image may be input to the encoder shown in fig. 2 and 3 for feature extraction to obtain a second image feature, and the second image feature may be input to the segmentation network, the normal network, the edge network, and the pose network shown in fig. 2 and 3, respectively, to obtain a second attention map output by the segmentation network, a second normal map output by the normal network, a second edge map output by the edge network, and the self-motion parameters of the camera output by the pose network; the second attention map, the second normal map and the second edge map may constitute the second target feature map. Specifically, the pose network may also be referred to as PoseNet; the self-motion parameters can be obtained by the pose network through regression estimation over two consecutive frames of original images; and the self-motion parameters may include a rotation matrix (which may be denoted by R) and a translation parameter (which may be denoted by t).
The self-motion parameters may then be used in a warping operation, such as warping from the left-eye image plane to the right-eye image plane, to obtain a warping result. In addition, the second target feature map and the training image can be input into the current depth network to obtain a second dense depth map and a confidence map output by the depth network. Then, the obtained second dense depth map and confidence map may be input into a residual optical flow network to obtain the residual optical flow, with the self-motion removed, output by the residual optical flow network. Specifically, the residual optical flow network may also be referred to as ReflowNet.
Then, the residual optical flow and the warping result may be added to obtain the entire optical flow field, and the depth network may be corrected according to the optical flow field, so as to obtain the depth network finally used for depth estimation. It should be noted that the optical flow estimation result and the depth estimation result may constrain each other during training; by correcting the depth network according to the optical flow field, motion information can be used to constrain and thereby optimize the depth network, which ensures the reliability of the estimation result when the finally trained depth network performs depth estimation.
It should be noted that, in training the depth network, the loss function may have a plurality of loss terms, such as a depth reconstruction term, a normal term, an edge term, an attention map term, a motion continuity term, a stereo geometry term, and the like. Even in the absence of stereoscopic binocular input, using the depth map estimated during training (i.e., the second dense depth map above) and the camera parameters (i.e., the self-motion parameters above), the monocular image may be warped to another image plane for calculating a binocular-vision loss; the motion continuity term comes from the errors of the residual optical flow network and the pose network; and the normal term comes from the geometric transformation between the depth map and the normal map.
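For illustration only, the following sketch computes the rigid (ego-motion) flow from a depth map, the camera intrinsics and the self-motion parameters (R, t) by back-projecting, transforming and re-projecting each pixel; the full optical flow field is then the sum of this warping-induced flow and the residual optical flow, as described above. The intrinsics and motion values used here are illustrative assumptions.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Optical flow induced by camera ego-motion (R, t) and a dense depth map,
    obtained by back-projecting, transforming and re-projecting every pixel
    (a sketch of the warping term; the residual flow network then only has
    to explain the remaining, non-rigid motion)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)   # back-project to 3-D
    cam2 = R @ cam + t.reshape(3, 1)                      # apply ego-motion
    pix2 = K @ cam2                                       # re-project
    pix2 = pix2[:2] / np.clip(pix2[2:], 1e-6, None)
    return (pix2 - pix[:2]).reshape(2, h, w)

K = np.array([[500.0, 0, 32], [0, 500.0, 32], [0, 0, 1]])
depth = np.full((64, 64), 10.0)
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])
flow = rigid_flow(depth, K, R, t)
# full optical flow field = rigid (ego-motion) flow + residual optical flow
```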
In an optional example, obtaining a depth estimation result with a corresponding depth estimation strategy according to the first target data includes:
under the condition that the first target data does not comprise the original image and the first target data comprises the original point cloud, obtaining a sparse depth map and a sparse mask according to the original point cloud;
the resulting sparse depth map and sparse mask are input to a convolutional neural network to obtain a first dense depth map output by the convolutional neural network.
Here, the convolutional neural network may be a sparse invariant convolutional neural network shown in fig. 2, or a normalized invariant convolutional neural network shown in fig. 3. In particular, the sparse invariant convolutional neural network may also be referred to as a sparse invariant CNN, and the normalized invariant convolutional neural network may also be referred to as a normalized invariant CNN.
In the embodiment of the present disclosure, in the case that only the original point cloud is included in the first target data, the original point cloud may be subjected to perspective projection by a perspective projection module to obtain a sparse depth map and a sparse mask (for example, a binary sparse mask), where the sparse mask may indicate, with a value of 0 or 1, which positions in the sparse depth map have depth values provided by the lidar and which positions do not.
After the sparse depth map and the binary sparse mask are obtained, they may be input into the sparse invariant CNN shown in fig. 2, and the sparse invariant CNN may output the first dense depth map accordingly. Alternatively, the binary sparse mask may be input, as the confidence map, into the normalized invariant CNN shown in fig. 3 together with the sparse depth map, and the normalized invariant CNN may output the first dense depth map accordingly.
It can be seen that, in the embodiments of the present disclosure, in the absence of an image as a guide, a depth estimation result based on a point cloud can be conveniently obtained by using a convolutional neural network.
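For illustration only, the sketch below shows a single sparsity-invariant convolution step in the spirit of the sparse invariant CNN mentioned above: the sparse depth is convolved together with its binary mask and the result is normalized by the convolved mask. The layer definition is an assumption made for the example and is not the patent's exact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseInvariantConv(nn.Module):
    """One sparsity-invariant convolution step: convolve depth*mask, then
    normalize by the convolved mask so that missing pixels do not dilute the
    result (a sketch in the spirit of sparsity-invariant CNNs)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)
        self.register_buffer("ones", torch.ones(out_ch, in_ch, k, k))

    def forward(self, depth, mask):
        num = self.conv(depth * mask)
        den = F.conv2d(mask, self.ones, padding=self.conv.padding[0])
        out = num / torch.clamp(den, min=1e-6)
        new_mask = self.pool(mask)          # propagate validity to the next layer
        return out, new_mask

layer = SparseInvariantConv(1, 1)
d = torch.rand(1, 1, 64, 64)
m = (torch.rand(1, 1, 64, 64) > 0.9).float()   # ~10% of pixels carry lidar depth
dense_step, new_m = layer(d, m)
```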
In one optional example, determining a first quality assessment indicator corresponding to the original point cloud comprises:
projecting the original point cloud to an image plane of a camera to obtain a projected image;
calculating the correlation between the gradient information of the projected image and the image edge information of the original image, and taking the correlation as the first quality evaluation index corresponding to the original point cloud; or determining the Rényi quadratic entropy of the projected image, and taking the Rényi quadratic entropy as the first quality evaluation index corresponding to the original point cloud.
Taking the case that the depth estimation system only includes one camera as an example, after the original point cloud acquired by the laser radar and the original image acquired by the camera are acquired, the original point cloud can be projected to an image plane of the camera to obtain a projected image, and at this time, the coordinate systems of the laser radar and the camera can be considered to be calibrated. Next, a first quality evaluation index corresponding to the original point cloud is determined according to the projection image, and a specific embodiment of determining the first quality evaluation index is described below by way of example.
In one embodiment, gradient information of the projection image and image edge information of the original image may be calculated, and a correlation of the gradient information and the image edge information may be calculated. Specifically, the formula used for calculating the correlation may be:
$$ J_c = \sum_{f \in w} \sum_{p \in X_f} X_f^{p} \cdot D_f^{(i,j)} $$
where Jc is the correlation, w is the video window (the set of frames considered), f is an image frame, (i, j) is the pixel location in the image to which a point projects, p is a 3-D point of the point cloud, X is the point cloud data acquired by the lidar (with X_f^p denoting the contribution of point p in frame f), and D is the image gradient map (which characterizes the gradient information).
After the correlation degree is calculated, the correlation degree can be used as a first quality evaluation index corresponding to the original point cloud. Here, a correlation threshold may be preset, and if the determined correlation is greater than the correlation threshold, the quality evaluation of the original point cloud may be considered to pass; otherwise, the quality assessment of the original point cloud may be deemed to fail.
It is easy to see that, the implementation method can determine the first quality evaluation index very conveniently, and can realize the quality evaluation of the original point cloud very conveniently.
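For illustration only, the sketch below computes a correlation-style score between the gradients of the projected depth image and the gradients of the original image, which captures the idea behind this first quality evaluation index; since the patent's exact formula is only given as a figure, the normalization and threshold here are assumptions.

```python
import numpy as np

def projection_image_correlation(proj_depth, image_gray):
    """Score how well depth discontinuities of the projected point cloud line
    up with image edges (a sketch of the idea behind Jc, not the patent's
    exact formula)."""
    gy, gx = np.gradient(proj_depth)                  # gradients of projected depth
    depth_grad = np.hypot(gx, gy)
    iy, ix = np.gradient(image_gray.astype(np.float64))
    img_grad = np.hypot(ix, iy)                       # gradients of the camera image
    valid = proj_depth > 0                            # only where lidar depth exists
    if valid.sum() < 2:
        return 0.0
    a, b = depth_grad[valid], img_grad[valid]
    a = (a - a.mean()) / (a.std() + 1e-6)
    b = (b - b.mean()) / (b.std() + 1e-6)
    return float((a * b).mean())

score = projection_image_correlation(np.random.rand(480, 640), np.random.rand(480, 640))
passes = score > 0.1    # threshold is illustrative
```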
In another embodiment, the Rényi quadratic entropy of the projected image may be calculated. Specifically, the formula adopted for calculating the Rényi quadratic entropy may be:
$$ H_{RQE}(X) = -\log\!\left( \frac{1}{N^{2}} \sum_{m=1}^{N} \sum_{n=1}^{N} G\big(x_m - x_n,\; 2\sigma^{2}\big) \right) $$
where H_{RQE} is the Rényi quadratic entropy, x_m and x_n are points of the projected point cloud X, N is the number of points, σ is the kernel width, and G(a, b) is a Gaussian distribution function with mean a and variance b. RQE is a crispness measure of the point cloud distribution modeled in the form of a Gaussian mixture model (GMM), and it can be used as the quality measure here.
After the Rényi quadratic entropy is calculated, it may be used as the first quality evaluation index corresponding to the original point cloud. Specifically, a threshold of the Rényi quadratic entropy may be preset; if the calculated Rényi quadratic entropy is greater than the threshold, the quality evaluation of the original point cloud may be considered to pass; otherwise, the quality evaluation of the original point cloud may be considered to fail.
It is easy to see that, the implementation method can also determine the first quality evaluation index very conveniently, and can realize the quality evaluation of the original point cloud very conveniently.
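For illustration only, the following sketch computes a Rényi-quadratic-entropy-style score for a point cloud modeled as an isotropic Gaussian mixture; the kernel width, the omitted normalization constant and the threshold are assumptions, since the patent's exact formula is given only as a figure.

```python
import numpy as np

def renyi_quadratic_entropy(points, sigma=0.05):
    """Rényi quadratic entropy of a point cloud modeled as an isotropic
    Gaussian mixture (a sketch; kernel width and normalization are
    assumptions)."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    sq = (diff ** 2).sum(axis=-1)
    g = np.exp(-sq / (4.0 * sigma ** 2))   # pairwise Gaussian G(x_m - x_n, 2*sigma^2)
    return -np.log(g.sum() / (n * n))

cloud = np.random.rand(200, 3)
rqe = renyi_quadratic_entropy(cloud)
passes = rqe > 1.0    # threshold is illustrative
```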
In the embodiment of the disclosure, the correlation or the Rényi quadratic entropy can be calculated very conveniently from the projected image obtained by projecting the original point cloud onto the image plane of the camera, so that the first quality evaluation index can be determined accordingly and the quality evaluation of the original point cloud can be realized.
It should be noted that the quality evaluation of the original image may be performed in a manner similar to that of the original point cloud. Specifically, a PSNR threshold may be preset, and the PSNR may be calculated for the original image, and when the calculated PSNR is greater than the PSNR threshold, it may be considered that the quality evaluation of the original image passes; otherwise, the quality assessment of the original image may be deemed to fail.
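For illustration only, the sketch below computes the PSNR of an image against a reference frame and compares it with a preset threshold; the choice of reference frame and the threshold value are assumptions, since the embodiments above only name PSNR as a possible second quality evaluation index.

```python
import numpy as np

def psnr(img, ref, max_val=255.0):
    """Peak signal-to-noise ratio between an image and a reference frame
    (e.g. the previous frame); the reference choice is an assumption."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
prev = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
image_passes = psnr(frame, prev) > 20.0    # threshold is illustrative
```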
The operation principle of the depth estimation system will be described with reference to fig. 2 and 3.
As shown in fig. 2 and fig. 3, the depth estimation system may include a laser radar, a camera 1 and a camera 2. In addition to the control switch A corresponding to the laser radar, the control switch B corresponding to the camera 1 and the control switch C corresponding to the camera 2, the depth estimation system may further include a control switch D. The control switch D may have two working positions, a first working position and a second working position: when the control switch D is in the first working position, it is connected to the output terminal of the convolutional neural network; when it is in the second working position, it is connected to the output terminal of the depth network.
In order to realize depth estimation, the depth network and the convolutional neural network need to be trained first.
After the laser radar collects the original point cloud, and the camera 1 and the camera 2 respectively collect original images, point cloud data quality evaluation can be performed on the original point cloud, and image quality evaluation can be performed on the original images collected by the camera 1 and the original images collected by the camera 2 respectively. The following four cases are possible:
in the first case, only the quality evaluation of the original point cloud passes, at which time the control switch a can be put into the closed state, both the control switches B and C can be put into the open state, and the control switch D can be switched to the first working position. Then, the sparse depth map and the sparse mask obtained by perspective projection of the point cloud data can be input into the convolutional neural network together to obtain a dense depth map (equivalent to the first dense depth map in the above) output by the convolutional neural network.
In the second case, the quality evaluation of the original point cloud does not pass and the quality evaluation of the original image captured by one of the camera 1 and the camera 2 passes, for example, only the quality evaluation of the original image captured by the camera 1 passes; at this point, the control switch B can be placed in the closed state, the control switches A and C can both be placed in the open state, and the control switch D can be switched to the second working position. Next, the original image collected by the camera 1 may be input into the encoder to extract features, and the extracted image features (equivalent to the first image features above) may be input into the segmentation network, the normal network, and the edge network, respectively, to obtain a first attention map output by the segmentation network, a first normal map output by the normal network, and a first edge map output by the edge network. Then, the first attention map, the first normal map, the first edge map, and the original image acquired by the camera 1 may be input into the depth network together to obtain a dense depth map (equivalent to the first dense depth map above) and a confidence map output by the depth network.
In the third case, the quality evaluation of the original point cloud passes, and the quality evaluation of the original image captured by one of the camera 1 and the camera 2 passes, for example, the quality evaluation of the original image captured by the camera 1 passes; at this time, the control switch A and the control switch B may be placed in the closed state, the control switch C may be placed in the open state, and the control switch D may be switched to the second working position. Next, in a manner similar to the second case, the first attention map, the first normal map, and the first edge map may be acquired; then, the first attention map, the first normal map, the first edge map, the original image acquired by the camera 1, and the sparse depth map and sparse mask obtained by perspective projection of the point cloud data may be input to the depth network together to obtain a dense depth map (equivalent to the first dense depth map above) and a confidence map output by the depth network. It should be noted that the third case may correspond to the dense depth map acquisition manner shown in fig. 5 or fig. 6.
In the fourth case, the quality evaluation of the original point cloud passes, and the quality evaluations of the original image acquired by the camera 1 and the original image acquired by the camera 2 both pass; at this time, the control switch A, the control switch B and the control switch C can all be placed in the closed state, and the control switch D can be switched to the second working position. Next, in a manner similar to the second case, the first attention map, the first normal map, and the first edge map may be obtained; the original image acquired by the camera 1 and the original image acquired by the camera 2 may further be subjected to image fusion to obtain the second target data; and the first attention map, the first normal map, the first edge map, the second target data, and the sparse depth map and sparse mask obtained by perspective projection of the point cloud data may be input to the depth network together to obtain a dense depth map (equivalent to the first dense depth map above) and a confidence map output by the depth network. It should be noted that the fourth case may correspond to the dense depth map acquisition manner shown in fig. 4.
It should be noted that lidar ranging is very accurate and stable, but suffers from sparsity, limited detection distance (e.g., about 100 meters) and a low sampling frame rate (e.g., 10 FPS), while camera-image-based depth acquisition offers high resolution but poor stereo matching stability. Therefore, the embodiments of the present disclosure construct a depth estimation system that includes both a laser radar and a camera, and adaptively give priority to the reliable data when obtaining the depth estimation result, so as to ensure the flexibility and robustness of the system. In this way, the low-cost camera and the high-cost laser radar can complement each other; in particular, even some low-beam lidars can still yield high-resolution depth perception data, and data from different sensors can cooperate to compensate for each other's weaknesses. Thus, the advantages of multiple sensors can be fully utilized, and the reliability of the depth estimation result can be effectively ensured. Furthermore, the method and the device provided by the embodiments of the present disclosure can also be applied to fields such as reality simulation, augmented reality, robot navigation and security monitoring.
Exemplary devices
Fig. 7 is a block diagram of a depth estimation device according to an exemplary embodiment of the present disclosure. The apparatus shown in fig. 7 includes a first obtaining module 701, a first determining module 702, a second determining module 703 and a second obtaining module 704.
The first acquisition module 701 is used for acquiring an original point cloud acquired by a laser radar and acquiring an original image acquired by a camera;
a first determining module 702, configured to determine a first quality assessment indicator corresponding to the original point cloud, and determine a second quality assessment indicator corresponding to the original image;
a second determining module 703, configured to determine, according to the first quality assessment indicator and the second quality assessment indicator, first target data that passes quality assessment in the original point cloud and the original image;
and a second obtaining module 704, configured to obtain a depth estimation result according to the first target data and a corresponding depth estimation strategy.
In an optional example, the second obtaining module 704 includes:
the first obtaining submodule is used for extracting features according to the original image under the condition that the first target data comprises the original image so as to obtain first image features;
the generation submodule is used for generating a first target feature map according to the first image feature; wherein the first target feature map comprises a first attention map, a first normal map and a first edge map;
and the second obtaining submodule is used for obtaining a first dense depth map output by the depth network according to the original image, the first target feature map and the depth network, and taking the first dense depth map as a depth estimation result.
In an optional example, the second obtaining sub-module includes:
a determining unit configured to determine second target data from the original image;
and the acquisition unit is used for acquiring a first dense depth map output by the depth network according to the second target data, the first target feature map and the depth network.
In an optional example, the obtaining unit is specifically configured to:
under the condition that the first target data does not comprise the original point cloud, inputting the second target data and the first target feature map into a depth network to obtain a first dense depth map output by the depth network;
an acquisition unit comprising:
the first acquiring subunit is used for acquiring a sparse depth map and a sparse mask according to the original point cloud under the condition that the first target data comprises the original point cloud;
and the second acquisition subunit is used for inputting the second target data, the first target feature map, the obtained sparse depth map and the sparse mask into the depth network so as to obtain a first dense depth map output by the depth network.
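A minimal numpy sketch of how the first acquiring subunit could obtain the sparse depth map and sparse mask by perspective projection is shown below. The pinhole intrinsics K and the lidar-to-camera extrinsic matrix T are assumed to be known calibration parameters, and the rounding and occlusion handling are deliberately simplified.

import numpy as np

def cloud_to_sparse_depth(cloud_xyz, K, T, height, width):
    """cloud_xyz: (N, 3) lidar points; K: (3, 3) intrinsics; T: (4, 4) lidar-to-camera."""
    ones = np.ones((cloud_xyz.shape[0], 1), dtype=np.float64)
    cam = (T @ np.hstack([cloud_xyz, ones]).T)[:3]       # points in the camera frame
    cam = cam[:, cam[2] > 0]                             # keep points in front of the camera
    uvw = K @ cam
    u = np.round(uvw[0] / uvw[2]).astype(int)
    v = np.round(uvw[1] / uvw[2]).astype(int)
    z = cam[2]
    keep = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    sparse_depth = np.zeros((height, width), dtype=np.float32)
    sparse_mask = np.zeros((height, width), dtype=bool)
    sparse_depth[v[keep], u[keep]] = z[keep]             # later points overwrite at the same pixel
    sparse_mask[v[keep], u[keep]] = True
    return sparse_depth, sparse_mask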
In an optional example, the number of the cameras is one, and the determining unit is specifically configured to:
taking an original image in the first target data as second target data;
or,
the number of cameras is two, and the determining unit is specifically configured to:
and under the condition that the first target data comprises the original images respectively collected by the two cameras, carrying out image fusion on the original images respectively collected by the two cameras in the first target data to obtain a fusion result, and determining second target data comprising the fusion result.
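For illustration, the determining unit could behave as in the following sketch; pixel-wise averaging is used only as a stand-in for the image fusion described above, whose concrete fusion rule is not restricted here.

import numpy as np

def determine_second_target_data(images):
    """images: list of original images (numpy arrays) contained in the first target data."""
    if len(images) == 1:
        return images[0]                          # one camera: use the original image directly
    if len(images) == 2:
        a = images[0].astype(np.float32)
        b = images[1].astype(np.float32)
        return (a + b) / 2.0                      # two cameras: illustrative pixel-wise fusion
    raise ValueError("first target data contains no original image")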
In one optional example, the apparatus further comprises:
the generating module is used for generating a depth network before a first dense depth map output by the depth network is obtained according to the original image, the first target feature map and the depth network;
the first processing module is used for extracting features according to the training image acquired by the camera to obtain second image features;
the second processing module is used for obtaining the self-motion parameters of the camera and a second target feature map according to the second image features, and performing warping according to the self-motion parameters to obtain a warping result; wherein the second target feature map comprises a second attention map, a second normal map and a second edge map;
the third acquisition module is used for inputting the second target feature map and the training image into the depth network so as to obtain a second dense depth map and a confidence map output by the depth network;
the fourth acquisition module is used for inputting the second dense depth map and the confidence map into the residual optical flow network so as to obtain a residual optical flow output by the residual optical flow network;
and the third processing module is used for adding the residual optical flow and the warping result to obtain an optical flow field, and correcting the depth network according to the optical flow field.
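The correction of the depth network can be illustrated, under assumptions, by the following PyTorch sketch: the warping result (the rigid flow induced by the self-motion parameters and the second dense depth map, computed elsewhere) and the residual optical flow are added to form the optical flow field, which is then used to warp one training frame towards another. The L1 photometric criterion below is an assumed example of "correcting the depth network according to the optical flow field", not the exact training loss of the disclosure.

import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Backward-warp image (B, C, H, W) with a dense optical flow field (B, 2, H, W)."""
    _, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(image.device).unsqueeze(0)
    coords = base + flow
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0      # normalise to [-1, 1] for grid_sample
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)
    return F.grid_sample(image, grid, align_corners=True)

def correction_loss(frame_t, frame_t1, warping_result, residual_flow):
    """Optical flow field = residual optical flow + warping result; the photometric error
    of the warped frame can then be backpropagated to correct the depth network."""
    flow_field = warping_result + residual_flow
    warped = warp_with_flow(frame_t1, flow_field)
    return (warped - frame_t).abs().mean()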
In an optional example, the second obtaining module 704 includes:
the third obtaining submodule is used for obtaining a sparse depth map and a sparse mask according to the original point cloud under the condition that the first target data does not comprise the original image and the first target data comprises the original point cloud;
and the fourth acquisition submodule is used for inputting the obtained sparse depth map and the sparse mask into the convolutional neural network so as to obtain a first dense depth map output by the convolutional neural network.
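A hedged PyTorch sketch of this point-cloud-only branch is shown below; the convolutional architecture is purely illustrative, since the embodiment does not fix it here.

import torch
import torch.nn as nn

class SparseToDenseCNN(nn.Module):
    """Takes the sparse depth map and sparse mask and predicts a dense depth map."""
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1), nn.ReLU(),   # keeps predicted depth non-negative
        )

    def forward(self, sparse_depth, sparse_mask):
        x = torch.cat([sparse_depth, sparse_mask.float()], dim=1)   # (B, 2, H, W)
        return self.net(x)                                          # first dense depth map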
In an alternative example, the first determining module 702 includes:
the fifth acquisition submodule is used for projecting the original point cloud onto an image plane of a target camera so as to obtain a projected image; wherein the target camera is any one of the N cameras;
the processing submodule is used for calculating the correlation degree between the gradient information of the projected image and the image edge information of the original image, and taking the correlation degree as the first quality evaluation index corresponding to the original point cloud; or determining the Rényi quadratic entropy of the projected image, and taking the Rényi quadratic entropy as the first quality evaluation index corresponding to the original point cloud.
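Minimal numpy sketches of the two quality indicators named above are given for illustration: (a) the correlation between the gradient magnitude of the projected image and a pre-computed edge map of the original image, and (b) the Rényi quadratic entropy of the valid projected depths. The histogram binning and the use of an external edge detector are assumptions, not requirements of the disclosure.

import numpy as np

def gradient_edge_correlation(projected_image, original_edge_map):
    """Correlation between the gradient magnitude of the projected image and an
    edge map of the original image (e.g. produced by any edge detector)."""
    gy, gx = np.gradient(projected_image.astype(np.float32))
    grad = np.sqrt(gx ** 2 + gy ** 2).ravel()
    edge = original_edge_map.astype(np.float32).ravel()
    if grad.std() == 0 or edge.std() == 0:
        return 0.0
    return float(np.corrcoef(grad, edge)[0, 1])

def renyi_quadratic_entropy(projected_image, bins=64):
    """Rényi quadratic entropy H2 = -log(sum_i p_i^2) of the valid projected depths."""
    values = projected_image[projected_image > 0].ravel()
    hist, _ = np.histogram(values, bins=bins)
    p = hist / max(hist.sum(), 1)
    return float(-np.log((p ** 2).sum() + 1e-12))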
Exemplary electronic device
Next, an electronic device 80 according to an embodiment of the present disclosure is described with reference to fig. 8. The electronic device 80 may be either or both of the first device and the second device, or a stand-alone device separate from them, which may communicate with the first device and the second device to receive the collected input signals therefrom.
As shown in fig. 8, the electronic device 80 includes one or more processors 81 and memory 82.
Processor 81 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities that controls other components in electronic device 80 to perform desired functions.
The memory 82 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, a flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 81 may execute the program instructions to implement the depth estimation methods of the various embodiments of the present disclosure described above and/or other desired functionality.
In one example, the electronic device 80 may further include: an input device 83 and an output device 84, which are interconnected by a bus system and/or other form of connection mechanism (not shown). The input device 83 may include a keyboard, a mouse, and the like. Output device 84 may include a display, speakers, a remote output device, and so forth.
Of course, for simplicity, only some of the components of the electronic device 80 relevant to the present disclosure are shown in fig. 8, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 80 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the depth estimation method according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.
The computer program product may write program code for performing the operations of the embodiments of the present disclosure in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the depth estimation method according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
A computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments. It should be noted that the advantages and effects mentioned in the present disclosure are merely examples rather than limitations, and should not be considered essential to the various embodiments of the present disclosure. The specific details disclosed above are provided only for the purpose of illustration and ease of understanding, and are not intended to limit the disclosure to those specific details.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another. Since the apparatus embodiments substantially correspond to the method embodiments, their description is relatively brief, and relevant details can be found in the description of the method embodiments.
The block diagrams of devices, apparatuses and systems referred to in this disclosure are given only as illustrative examples, and are not intended to require or imply that the connections, arrangements and configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses and systems may be connected, arranged or configured in any manner. Words such as "including", "comprising", "having" and the like are open-ended words that mean "including, but not limited to" and may be used interchangeably therewith.
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
In the apparatus, devices and methods of the present disclosure, components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure. The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A method of depth estimation, comprising:
acquiring an original point cloud acquired by a laser radar and acquiring an original image acquired by a camera;
determining a first quality evaluation index corresponding to the original point cloud, and determining a second quality evaluation index corresponding to the original image;
determining first target data which passes quality evaluation in the original point cloud and the original image according to the first quality evaluation index and the second quality evaluation index;
and obtaining a depth estimation result according to the first target data and a corresponding depth estimation strategy.
2. The method of claim 1, wherein obtaining depth estimation results according to the first target data and a corresponding depth estimation strategy comprises:
under the condition that the first target data comprises the original image, extracting features according to the original image to obtain first image features;
generating a first target feature map according to the first image feature; wherein the first target feature map comprises a first attention map, a first normal map, and a first edge map;
and obtaining a first dense depth map output by a depth network according to the original image, the first target feature map and the depth network, and taking the first dense depth map as a depth estimation result.
3. The method of claim 2, wherein obtaining a first dense depth map output by a depth network from the original image, the first target feature map, and the depth network comprises:
determining second target data according to the original image;
and obtaining a first dense depth map output by the depth network according to the second target data, the first target feature map and the depth network.
4. The method of claim 3,
the obtaining a first dense depth map output by the depth network according to the second target data, the first target feature map and the depth network includes:
under the condition that the original point cloud is not included in the first target data, inputting the second target data and the first target feature map into a depth network to obtain a first dense depth map output by the depth network;
or,
the obtaining a first dense depth map output by the depth network according to the second target data, the first target feature map and the depth network includes:
under the condition that the original point cloud is included in the first target data, obtaining a sparse depth map and a sparse mask according to the original point cloud;
and inputting the second target data, the first target feature map, the obtained sparse depth map and the sparse mask into a depth network to obtain a first dense depth map output by the depth network.
5. The method of claim 3,
the number of the cameras is one, and the determining of the second target data according to the original image includes:
taking the original image in the first target data as second target data;
or,
the number of the cameras is two, and the determining of the second target data according to the original image includes:
and under the condition that the first target data comprises the original images respectively acquired by the two cameras, carrying out image fusion on the original images respectively acquired by the two cameras in the first target data to obtain a fusion result, and determining second target data comprising the fusion result.
6. The method of claim 2, wherein before obtaining the first dense depth map output by the depth network from the original image, the first target feature map, and the depth network, the method further comprises:
generating a deep network;
extracting features according to the training image acquired by the camera to obtain second image features;
according to the second image feature, obtaining a self-motion parameter of the camera and a second target feature map, and performing warping according to the self-motion parameter to obtain a warping result; wherein the second target feature map comprises a second attention map, a second normal map, and a second edge map;
inputting the second target feature map and the training image into the depth network to obtain a second dense depth map and a confidence map output by the depth network;
inputting the second dense depth map and the confidence map into a residual optical flow network to obtain a residual optical flow output by the residual optical flow network;
and adding the residual optical flow and the warping result to obtain an optical flow field, and correcting the depth network according to the optical flow field.
7. The method of claim 1, wherein obtaining depth estimation results according to the first target data and a corresponding depth estimation strategy comprises:
under the condition that the original image is not included in the first target data and the original point cloud is included in the first target data, obtaining a sparse depth map and a sparse mask according to the original point cloud;
inputting the obtained sparse depth map and the sparse mask into a convolutional neural network to obtain a first dense depth map output by the convolutional neural network.
8. A depth estimation device, comprising:
the first acquisition module is used for acquiring original point cloud acquired by the laser radar and acquiring an original image acquired by the camera;
the first determining module is used for determining a first quality evaluation index corresponding to the original point cloud and determining a second quality evaluation index corresponding to the original image;
the second determination module is used for determining first target data which passes quality evaluation in the original point cloud and the original image according to the first quality evaluation index and the second quality evaluation index;
and the second acquisition module is used for acquiring a depth estimation result according to the first target data and a corresponding depth estimation strategy.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the depth estimation method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored, the computer program being configured to perform the depth estimation method of any of the preceding claims 1 to 7.
CN201911406449.7A 2019-12-31 2019-12-31 Depth estimation method, depth estimation device, electronic equipment and computer readable storage medium Active CN111179331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911406449.7A CN111179331B (en) 2019-12-31 2019-12-31 Depth estimation method, depth estimation device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111179331A (en) 2020-05-19
CN111179331B CN111179331B (en) 2023-09-08

Family

ID=70650623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911406449.7A Active CN111179331B (en) 2019-12-31 2019-12-31 Depth estimation method, depth estimation device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111179331B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108765481A (en) * 2018-05-25 2018-11-06 亮风台(上海)信息科技有限公司 A kind of depth estimation method of monocular video, device, terminal and storage medium
CN109100741A (en) * 2018-06-11 2018-12-28 长安大学 A kind of object detection method based on 3D laser radar and image data
CN109461178A (en) * 2018-09-10 2019-03-12 中国科学院自动化研究所 A kind of monocular image depth estimation method and device merging sparse known label
CN109685842A (en) * 2018-12-14 2019-04-26 电子科技大学 A kind of thick densification method of sparse depth based on multiple dimensioned network
CN109741388A (en) * 2019-01-29 2019-05-10 北京字节跳动网络技术有限公司 Method and apparatus for generating binocular depth estimation model
CN109815833A (en) * 2018-12-29 2019-05-28 江苏集萃智能制造技术研究所有限公司 A kind of tea point recognition methods based on CCD Yu the feature extraction of three-dimensional laser sensor information fusion
CN110363820A (en) * 2019-06-28 2019-10-22 东南大学 It is a kind of based on the object detection method merged before laser radar, image

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111861962A (en) * 2020-07-28 2020-10-30 湖北亿咖通科技有限公司 Data fusion method and electronic equipment
CN111861962B (en) * 2020-07-28 2021-07-30 湖北亿咖通科技有限公司 Data fusion method and electronic equipment
CN113052890A (en) * 2021-03-31 2021-06-29 奥比中光科技集团股份有限公司 Depth truth value acquisition method, device and system and depth camera

Also Published As

Publication number Publication date
CN111179331B (en) 2023-09-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant