CN115457105A - Depth information acquisition method and device, electronic equipment and storage medium - Google Patents

Depth information acquisition method and device, electronic equipment and storage medium

Info

Publication number
CN115457105A
Authority
CN
China
Prior art keywords
depth information
frame image
pixel
image
monocular
Prior art date
Legal status
Pending
Application number
CN202210989737.5A
Other languages
Chinese (zh)
Inventor
王啸峰
叶云
黄冠
Current Assignee
Beijing Jianzhi Technology Co ltd
Original Assignee
Beijing Jianzhi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jianzhi Technology Co ltd filed Critical Beijing Jianzhi Technology Co ltd
Priority to CN202210989737.5A priority Critical patent/CN115457105A/en
Publication of CN115457105A publication Critical patent/CN115457105A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/579Depth or shape recovery from multiple images from motion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/001Model-based coding, e.g. wire frame
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a depth information acquisition method and device, electronic equipment and a storage medium. The method comprises the following steps: inputting a first frame image and a second frame image of a target vehicle into a depth information fusion model; calling a monocular depth information acquisition layer to process the image features of the first frame image to obtain monocular depth information of each pixel in the first frame image; calling a multi-view depth information acquisition layer to process the image features of the first frame image and the second frame image according to the monocular depth information, the predicted vehicle speed of the target vehicle and the camera external parameters of the target camera, so as to obtain multi-view depth information of each pixel in the first frame image and the certainty factor of the multi-view depth information; and calling a depth information fusion layer to fuse the monocular depth information and the multi-view depth information of each pixel in the first frame image according to the certainty factor to obtain the image depth information of the first frame image. The method and the device can improve the accuracy of image depth prediction.

Description

Depth information acquisition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image depth prediction technologies, and in particular, to a depth information obtaining method and apparatus, an electronic device, and a storage medium.
Background
In the field of automatic driving, how to obtain depth information of objects such as vehicles and pedestrians is an important technical point in much current research, such as 3D reconstruction, obstacle detection, and SLAM (Simultaneous Localization and Mapping).
Image depth algorithms are divided into supervised and unsupervised algorithms. A supervised algorithm predicts depth values more accurately, but it requires point clouds as supervision, which limits it in practical applications. Unsupervised algorithms are therefore more attractive for future research and applications. Unsupervised means that the algorithm can be trained, and image depth information obtained, from image data alone, without point cloud signals or manual annotation.
Currently, common unsupervised depth algorithms fall into two types: 1. monocular unsupervised depth estimation based on a video sequence; 2. multi-frame unsupervised depth estimation based on a video sequence. However, both algorithms require the input images to depict a static scene; in a dynamic scene, the accuracy of the predicted image depth information is low.
Disclosure of Invention
The embodiment of the application provides a depth information acquisition method and device, electronic equipment and a storage medium, and aims to solve the problem in the related art that unsupervised depth algorithms predict image depth information with low accuracy.
In order to solve the above technical problem, the embodiments of the present application are implemented as follows:
in a first aspect, an embodiment of the present application provides a depth information obtaining method, including:
inputting a first frame image and a second frame image of a target vehicle into a depth information fusion model; the first frame image and the second frame image are two continuous frames, the generation time of the first frame image is later than that of the second frame image, and the depth information fusion model comprises: a monocular depth information acquisition layer, a multi-view depth information acquisition layer and a depth information fusion layer;
calling the monocular depth information acquisition layer to process the image characteristics of the first frame image to obtain the monocular depth information of each pixel in the first frame image;
calling the multi-view depth information acquisition layer to process the image characteristics of the first frame image and the second frame image according to the monocular depth information, the predicted vehicle speed of the target vehicle and the camera external parameter of the target camera, so as to obtain the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information;
and calling the depth information fusion layer to fuse the monocular depth information and the multi-view depth information of each pixel in the first frame image according to the certainty factor to obtain the image depth information of the first frame image.
Optionally, the depth information fusion model further includes: a camera external reference acquisition layer,
after the inputting the first frame image and the second frame image of the target vehicle into the depth information fusion model, the method further comprises:
calling the camera external parameter acquisition layer, and processing the image characteristics of the first frame image and the second frame image to obtain the camera external parameters of the target camera;
and determining the vehicle speed of the target vehicle according to the camera translation vector in the camera external parameters and the camera shooting frame rate of the target camera.
Optionally, the multi-view depth information obtaining layer includes: an encoding module, a cost body construction module and a decoding module,
the calling the multi-view depth information acquisition layer to process the image characteristics of the first frame image and the second frame image according to the monocular depth information, the predicted vehicle speed of the target vehicle and the camera external parameter of the target camera to obtain the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information comprises the following steps:
calling the coding module to code the first frame image and the second frame image to obtain a first image feature of the first frame image and a second image feature of the second frame image;
calling the cost body construction module to perform homography transformation on the first image characteristic and the second image characteristic to obtain a transformation characteristic, and processing the camera external parameter and the transformation characteristic to obtain a cost body characteristic of each pixel in the first frame image;
and calling the decoding module to process the cost body characteristics according to the vehicle speed and the monocular depth information to obtain the multi-view depth search range information of each pixel in the first frame image, and determining the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information according to the multi-view depth search range information.
Optionally, the invoking the cost body construction module to perform homography transformation on the first image feature and the second image feature to obtain a transformation feature, and processing the camera external parameter and the transformation feature to obtain the cost body feature of each pixel in the first frame image includes:
calling the cost body construction module to perform homography transformation on the first image characteristic and the second image characteristic to obtain a first transformation characteristic of the first image characteristic and a second transformation characteristic of the second image characteristic;
obtaining an initial cost volume characteristic of each pixel in the second frame image according to the first transformation characteristic, the second transformation characteristic and the camera external parameter;
and projecting the initial cost body characteristics to the first frame image to obtain the cost body characteristics of each pixel in the first frame image.
Optionally, the invoking the depth information fusion layer to fuse monocular depth information and multi-view depth information of each pixel in the first frame image according to the certainty factor to obtain image depth information of the first frame image includes:
calling the depth information fusion layer to obtain a first pixel with the certainty factor larger than a first certainty factor threshold value in the first frame image, and taking the multi-view depth information of the first pixel as the depth information of the first pixel;
calling the depth information fusion layer to obtain a second pixel with the certainty factor smaller than a second certainty factor threshold value in the first frame image, and taking monocular depth information of the second pixel as depth information of the second pixel; the first confidence threshold is greater than the second confidence threshold;
calling the depth information fusion layer to obtain a third pixel except the first pixel and the second pixel in the first frame image, fusing monocular depth information and the multi-view depth information of the third pixel according to the certainty factor of the third pixel to obtain fusion depth information, and taking the fusion depth information as the depth information of the third pixel;
and obtaining the image depth information of the first frame image according to the depth information of the first pixel, the depth information of the second pixel and the depth information of the third pixel.
In a second aspect, an embodiment of the present application provides a depth information obtaining apparatus, including:
the image input module is used for inputting the first frame image and the second frame image of the target vehicle into the depth information fusion model; the first frame image and the second frame image are two continuous frames, the generation time of the first frame image is later than that of the second frame image, and the depth information fusion model comprises: a monocular depth information acquisition layer, a multi-view depth information acquisition layer and a depth information fusion layer;
the monocular depth acquiring module is used for calling the monocular depth information acquiring layer to process the image characteristics of the first frame image to obtain the monocular depth information of each pixel in the first frame image;
the multi-view depth acquisition module is used for calling the multi-view depth information acquisition layer to process the image characteristics of the first frame image and the second frame image according to the monocular depth information, the predicted vehicle speed of the target vehicle and the camera external parameter of the target camera to obtain the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information;
and the image depth acquisition module is used for calling the depth information fusion layer to fuse the monocular depth information and the multi-view depth information of each pixel in the first frame image according to the certainty factor to obtain the image depth information of the first frame image.
Optionally, the depth information fusion model further includes: a camera external reference acquisition layer,
the device further comprises:
the camera parameter acquisition module is used for calling the camera external parameter acquisition layer and processing the image characteristics of the first frame image and the second frame image to obtain the camera external parameters of the target camera;
and the vehicle speed determining module is used for determining the vehicle speed of the target vehicle according to the camera translation vector in the camera external parameter and the camera shooting frame rate of the target camera.
Optionally, the multi-view depth information obtaining layer includes: an encoding module, a cost body constructing module and a decoding module,
the multi-view depth acquisition module comprises:
the image feature acquisition unit is used for calling the coding module to code the first frame image and the second frame image to obtain a first image feature of the first frame image and a second image feature of the second frame image;
the cost body feature obtaining unit is used for calling the cost body construction module to perform homography transformation on the first image feature and the second image feature to obtain transformation features, and processing the camera external parameters and the transformation features to obtain cost body features of each pixel in the first frame image;
and the multi-view depth determining unit is used for calling the decoding module to process the cost body characteristics according to the vehicle speed and the monocular depth information to obtain multi-view depth searching range information of each pixel in the first frame image, and determining the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information according to the multi-view depth searching range information.
Optionally, the cost body feature obtaining unit includes:
a transformation feature obtaining subunit, configured to invoke the cost body construction module to perform homography transformation on the first image feature and the second image feature, so as to obtain a first transformation feature of the first image feature and a second transformation feature of the second image feature;
an initial feature obtaining subunit, configured to obtain an initial cost volume feature of each pixel in the second frame image according to the first transformation feature, the second transformation feature, and the camera external parameter;
and the cost body characteristic obtaining subunit is configured to project the initial cost body characteristic onto the first frame image, so as to obtain a cost body characteristic of each pixel in the first frame image.
Optionally, the image depth acquiring module includes:
a first depth information obtaining unit, configured to invoke the depth information fusion layer to obtain a first pixel in the first frame image, where a certainty factor of the first pixel is greater than a first certainty factor threshold, and use multi-view depth information of the first pixel as depth information of the first pixel;
a second depth information obtaining unit, configured to invoke the depth information fusion layer to obtain a second pixel in the first frame of image, where a certainty factor of the second pixel is smaller than a second certainty factor threshold, and use monocular depth information of the second pixel as depth information of the second pixel; the first confidence threshold is greater than the second confidence threshold;
a third depth information obtaining unit, configured to invoke the depth information fusion layer to obtain a third pixel in the first frame image, except for the first pixel and the second pixel, and fuse the monocular depth information and the multi-view depth information of the third pixel according to the certainty factor of the third pixel to obtain fusion depth information, where the fusion depth information is used as depth information of the third pixel;
and the image depth information acquisition unit is used for obtaining the image depth information of the first frame image according to the depth information of the first pixel, the depth information of the second pixel and the depth information of the third pixel.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the depth information acquisition method of any one of the above.
In a fourth aspect, an embodiment of the present application provides a readable storage medium, where when executed by a processor of an electronic device, an instruction in the storage medium enables the electronic device to perform any one of the depth information obtaining methods described above.
In the embodiment of the application, the first frame image and the second frame image of the target vehicle are input into the depth information fusion model. The first frame image and the second frame image are two continuous frames of images, the generation time of the first frame image is later than that of the second frame image, and the depth information fusion model comprises: a monocular depth information acquisition layer, a multi-view depth information acquisition layer and a depth information fusion layer. The monocular depth information acquisition layer is called to process the image features of the first frame image to obtain the monocular depth information of each pixel in the first frame image. The multi-view depth information acquisition layer is called to process the image features of the first frame image and the second frame image according to the monocular depth information, the predicted vehicle speed of the target vehicle and the camera external parameters of the target camera, so as to obtain the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information. The depth information fusion layer is called to fuse the monocular depth information and the multi-view depth information of each pixel in the first frame image according to the certainty factor to obtain the image depth information of the first frame image. By adaptively adjusting the multi-view depth search range in combination with the vehicle speed and the monocular depth, and fusing the multi-view depth and the monocular depth to obtain the image depth, a more accurate image depth can be obtained and the accuracy of image depth prediction is improved.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart illustrating steps of a depth information obtaining method according to an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating steps of a method for acquiring camera parameters according to an embodiment of the present disclosure;
fig. 3 is a flowchart illustrating steps of a multi-view depth obtaining method according to an embodiment of the present disclosure;
fig. 4 is a flowchart illustrating steps of a cost characteristic obtaining method according to an embodiment of the present application;
fig. 5 is a flowchart illustrating steps of an image depth information obtaining method according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of a depth information fusion model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a depth information acquiring apparatus according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, a flowchart illustrating steps of a depth information obtaining method provided in an embodiment of the present application is shown, and as shown in fig. 1, the depth information obtaining method may include: step 101, step 102, step 103 and step 104.
Step 101: and inputting the first frame image and the second frame image of the target vehicle into the depth information fusion model.
The method and the device for fusing the monocular depth information and the monocular depth information of the image can be applied to the scene for fusing the monocular depth information and the monocular depth information of the image so as to accurately obtain the depth information of the image.
In this embodiment, the target vehicle may be an autonomous vehicle, such as an unmanned delivery vehicle, an unmanned patrol vehicle, or the like.
The first frame image and the second frame image are two continuous frames of images, and the generation time of the first frame image is later than that of the second frame image. Specifically, the first frame image and the second frame image are two continuous frames of a video recorded by a target camera on the target vehicle. For example, if the video shot by the target camera comprises 10 frames, which are, in chronological order, image 1, image 2, ..., image 10, then when the first frame image is image 2, the second frame image is image 1; when the first frame image is image 6, the second frame image is image 5; and so on.
It should be understood that the above examples are only examples for better understanding of the technical solutions of the embodiments of the present application, and are not to be taken as the only limitation to the embodiments.
The depth information fusion model is a model for fusing multi-view depth information and monocular depth information to obtain an image depth, and in this example, the depth information fusion model may include: a monocular depth information acquisition layer, a multi-view depth information acquisition layer and a depth information fusion layer. As shown in fig. 6, the Pose branch is the camera extrinsic parameter acquisition layer, the Monocular branch is the monocular depth information acquisition layer, and the MVS branch is the multi-view depth information acquisition layer.
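For readers who prefer code, the wiring of the branches in fig. 6 can be sketched as follows. This is a minimal illustrative skeleton under stated assumptions, not the patented implementation: the module names follow the labels in fig. 6 (DepthNet, PoseNet, the MVS branch), but their internal architectures, tensor shapes and interfaces are assumptions, and the blended fusion in the last step omits the per-pixel thresholding described later in steps 501 to 503.

    import torch
    import torch.nn as nn

    class DepthFusionModel(nn.Module):
        """Illustrative skeleton of the depth information fusion model of fig. 6."""

        def __init__(self, depth_net: nn.Module, pose_net: nn.Module,
                     mvs_branch: nn.Module, frame_rate: float):
            super().__init__()
            self.depth_net = depth_net    # Monocular branch: per-pixel monocular depth
            self.pose_net = pose_net      # Pose branch: rotation R and translation T
            self.mvs_branch = mvs_branch  # MVS branch: multi-view depth + certainty
            self.frame_rate = frame_rate  # camera shooting frame rate (alpha in formula (1))

        def forward(self, frame_t: torch.Tensor, frame_t_minus_1: torch.Tensor):
            # Monocular branch (step 102)
            mono_depth = self.depth_net(frame_t)
            # Pose branch (steps 201-202): camera extrinsics and predicted vehicle speed
            rotation, translation = self.pose_net(frame_t, frame_t_minus_1)
            speed = self.frame_rate * translation.norm(dim=-1)   # formula (1)
            # MVS branch (step 103): multi-view depth and its certainty factor
            mvs_depth, certainty = self.mvs_branch(
                frame_t, frame_t_minus_1, mono_depth, speed, rotation, translation)
            # Fusion layer (step 104), blended case only, weighting as in the description
            fused = certainty * mono_depth + (1.0 - certainty) * mvs_depth
            return fused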
In a specific implementation, when a target camera on a target vehicle is used to record a video and the depth information of a current frame (i.e., a first frame) image is analyzed, the first frame image and a second frame image at a previous time may be input to a depth information fusion model.
After the first frame image and the second frame image of the target vehicle are input to the depth information fusion model, step 102 is performed.
Step 102: and calling the monocular depth information acquisition layer to process the image characteristics of the first frame image to obtain the monocular depth information of each pixel in the first frame image.
After the first frame image and the second frame image of the target vehicle are input to the depth information fusion model, the monocular depth information acquisition layer may be invoked to process the image features of the first frame image to obtain the monocular depth information of each pixel in the first frame image. As shown in fig. 6, in the Monocular branch, the first frame image (i.e., the Frame T shown in the figure) may be processed by DepthNet to obtain the monocular depth information (Monocular Depth) of the first frame image.
After the monocular depth information of each pixel in the first frame image is obtained by processing the image feature of the first frame image by calling the monocular depth information acquiring layer, step 103 is executed.
Step 103: and calling the multi-view depth information acquisition layer to process the image characteristics of the first frame image and the second frame image according to the monocular depth information, the predicted vehicle speed of the target vehicle and the predicted camera parameter of the target camera to obtain the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information.
The vehicle speed refers to the predicted traveling speed of the target vehicle.
The camera parameters refer to parameters of the target camera that captures the first frame image and the second frame image, and in this example, the camera parameters may include: a rotation matrix of the target camera and a camera translation vector.
The implementation of the predicted vehicle speed and camera parameters may be described in detail below in conjunction with FIG. 2.
Referring to fig. 2, a flowchart illustrating steps of a camera parameter obtaining method provided in an embodiment of the present application is shown, and as shown in fig. 2, the camera parameter obtaining method may include: step 201 and step 202.
Step 201: and calling the camera external parameter acquisition layer, and processing the image characteristics of the first frame image and the second frame image to obtain the camera external parameters of the target camera.
In this embodiment, the depth information fusion model may further include a camera extrinsic parameter acquisition layer. As shown in fig. 6, the Pose branch is the camera extrinsic parameter acquisition layer.
After the first frame image and the second frame image of the target vehicle are input to the depth information fusion model, the camera extrinsic parameter acquisition layer may be invoked to process image features of the first frame image and the second frame image to obtain camera extrinsic parameters of the target camera. As shown in fig. 6, after the Frame T (i.e., the first frame image) and the Frame T-1 (i.e., the second frame image) are input to the depth information fusion model, the image features of the Frame T and the Frame T-1 may be processed by the PoseNet in the Pose branch to obtain the camera extrinsic parameters of the target camera, i.e., the rotation matrix and the camera translation vector.
After the camera external parameter obtaining layer is called to process the image features of the first frame image and the second frame image to obtain the camera external parameters of the target camera, step 202 is executed.
Step 202: and determining the vehicle speed of the target vehicle according to the camera translation vector in the camera external parameters and the camera shooting frame rate of the target camera.
After the camera external parameter of the target camera is obtained by calling the camera external parameter acquisition layer to process the image features of the first frame image and the second frame image, the vehicle speed of the target vehicle can be determined according to the camera translation vector in the camera external parameter and the camera shooting frame rate of the target camera. Specifically, the vehicle speed can be predicted at the camera external reference acquisition layer according to the following formula (1):
v = α · ||T||₂    (1)
in the above equation (1), v is the vehicle speed, α is the camera frame rate, and T is the camera translation vector.
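As a quick worked example of formula (1) (the frame rate and translation values below are assumed for illustration and do not come from the application):

    import numpy as np

    alpha = 10.0                      # assumed camera shooting frame rate (frames per second)
    T = np.array([0.1, 0.0, 1.497])   # assumed camera translation vector between the two frames (m)

    v = alpha * np.linalg.norm(T)     # formula (1): v = alpha * ||T||_2
    print(v)                          # approximately 15.0 (m/s)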
After obtaining the monocular depth information of the first frame image, the vehicle speed of the target vehicle, and the camera extrinsic parameters of the target camera, these may be used as inputs of the multi-view depth information acquisition layer. The multi-view depth information acquisition layer is called to process the image features of the first frame image and the second frame image according to the monocular depth information of the first frame image, the vehicle speed of the target vehicle and the camera extrinsic parameters of the target camera, so as to obtain the multi-view depth information of each pixel in the first frame image and the certainty factor of the multi-view depth information of each pixel. The certainty factor may be used to indicate how much a pixel tends toward the multi-view depth information: a greater certainty factor indicates that the pixel tends more toward the predicted multi-view depth information, and so on. As shown in fig. 6, the Frame T (i.e., the first frame image) and the Frame T-1 (i.e., the second frame image), as well as the outputs of the Pose branch and the Monocular branch, may be used as inputs to the MVS branch. The inputs are processed by an encoder, a Homo-Warp module and a decoder to obtain the multi-view depth information (MVS Depth) of each pixel and the corresponding certainty factor (Uncertainty).
In a specific implementation, the multi-view depth information obtaining layer may be called to obtain a cost body feature of each pixel in the first frame image, and dynamically adjust the multi-view depth search range information of each pixel in the first frame image, and determine the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information according to the multi-view depth search range information. The implementation can be described in detail as follows in conjunction with fig. 3.
Referring to fig. 3, a flowchart illustrating steps of a multi-view depth acquisition method provided in an embodiment of the present application is shown, and as shown in fig. 3, the multi-view depth acquisition method may include: step 301, step 302 and step 303.
Step 301: and calling the coding module to code the first frame image and the second frame image to obtain a first image characteristic of the first frame image and a second image characteristic of the second frame image.
In this embodiment, the multi-view depth information acquisition layer may include an encoding module, a cost body construction module and a decoding module. As shown in fig. 6, the MVS branch includes an encoder (the encoding module), a Homo-Warp module (the cost body construction module), and a decoder (the decoding module).
After the first frame image and the second frame image are input to the depth information fusion model, a coding module in the multi-view depth information acquisition layer may be called to perform coding processing on the first frame image and the second frame image to obtain a first image feature of the first frame image and a second image feature of the second frame image.
After the encoding module is called to perform encoding processing on the first frame image and the second frame image to obtain a first image feature of the first frame image and a second image feature of the second frame image, step 302 is performed.
Step 302: and calling the cost body construction module to perform homography transformation on the first image characteristic and the second image characteristic to obtain a transformation characteristic, and processing the camera external parameter and the transformation characteristic to obtain the cost body characteristic of each pixel in the first frame image.
After the coding module is called to code the first frame image and the second frame image to obtain a first image feature of the first frame image and a second image feature of the second frame image, the cost body construction module can be adopted to perform homography transformation on the first image feature and the second image feature to obtain transformation features corresponding to the first image feature and the second image feature respectively. And then, processing the camera extrinsic parameters and the transformation characteristics to obtain cost volume characteristics of each pixel in the first frame image.
For this implementation, the following detailed description may be made in conjunction with fig. 4.
Referring to fig. 4, a flowchart illustrating steps of a cost body feature obtaining method provided in an embodiment of the present application is shown, and as shown in fig. 4, the cost body feature obtaining method may include: step 401, step 402 and step 403.
Step 401: and calling the cost body construction module to perform homography transformation on the first image characteristic and the second image characteristic to obtain a first transformation characteristic of the first image characteristic and a second transformation characteristic of the second image characteristic.
In this embodiment, after the first image feature and the second image feature are obtained, the cost body construction module may be called to perform homography transformation on the first image feature and the second image feature to obtain a first transformation feature of the first image feature and a second transformation feature of the second image feature.
After the first transformation characteristic and the second transformation characteristic are obtained, step 402 is performed.
Step 402: and obtaining the initial cost volume characteristic of each pixel in the second frame image according to the first transformation characteristic, the second transformation characteristic and the camera external parameter.
After the first transformation feature and the second transformation feature are obtained, the initial cost volume feature of each pixel in the second frame image can be obtained according to the first transformation feature, the second transformation feature and the camera external parameter.
In this embodiment, the cost volume feature of each pixel may be constructed by using a planar scanning method, as shown in the following formula (2):
P_{t-1,j} = K · (R · (K⁻¹ · P_t · d_j) + T)    (2)
In the above formula (2), P_t is the pixel feature of the transformation feature corresponding to the first frame image, P_{t-1,j} is the pixel feature of the transformation feature corresponding to the second frame image, K is the camera intrinsic parameter matrix of the target camera, R is the rotation matrix of the target camera, T is the camera translation vector of the target camera, and d_j is the j-th hypothesized depth value.
In a specific implementation, the above formula (2) may be applied to each pixel once, so as to obtain a cost volume feature of each pixel in the second frame image, that is, an initial cost volume feature.
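A minimal sketch of this plane-sweep projection in Python/NumPy is given below. The function name and the way depth hypotheses are sampled are assumptions made for illustration; only the per-pixel relation of formula (2) comes from the description.

    import numpy as np

    def plane_sweep_project(pixel_t, depth_hypotheses, K, R, T):
        """Project a pixel of the first frame image into the second frame image for
        each hypothesized depth d_j, following formula (2):
        P_{t-1,j} = K (R (K^{-1} P_t d_j) + T).

        pixel_t: homogeneous pixel coordinate [u, v, 1] in the first frame image.
        depth_hypotheses: array of depth values d_j (the sweep planes).
        K: 3x3 camera intrinsic matrix; R: 3x3 rotation; T: 3-vector translation.
        """
        K_inv = np.linalg.inv(K)
        projections = []
        for d_j in depth_hypotheses:
            point_cam = K_inv @ pixel_t * d_j   # back-project to 3D at depth d_j
            point_prev = R @ point_cam + T      # move into the previous camera frame
            p = K @ point_prev                  # project into the second frame image
            projections.append(p[:2] / p[2])    # normalize homogeneous coordinates
        return np.stack(projections)            # one (u, v) per depth hypothesis

    # Usage example (K, R, T and the sweep range are assumed values):
    # uvs = plane_sweep_project(np.array([320.0, 240.0, 1.0]),
    #                           np.linspace(1.0, 50.0, 32), K, R, T)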
Step 403: and projecting the initial cost body characteristics to the first frame image to obtain the cost body characteristics of each pixel in the first frame image.
After obtaining the initial cost volume feature of each pixel in the second frame image, the initial cost volume feature may be projected onto the first frame image to obtain the cost volume feature of each pixel in the first frame image.
After the cost volume feature of each pixel in the first frame image is obtained, step 303 is performed.
Step 303: and calling the decoding module to process the cost body characteristics according to the vehicle speed and the monocular depth information to obtain the monocular depth search range information of each pixel in the first frame image, and determining the monocular depth information of each pixel in the first frame image and the reliability of the monocular depth information according to the monocular depth search range information.
After the cost volume characteristics of each pixel in the first frame image are obtained, a decoding module can be called to process the corresponding cost volume characteristics according to the vehicle speed and the monocular depth information of each pixel in the first frame image, so that the multi-view depth search range information of each pixel in the first frame image is obtained. Then, according to the multi-view depth search range information, the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information of each pixel are determined.
In this embodiment, the depth search range of the multi-view depth information may be dynamically adjusted based on the predicted vehicle speed and the monocular depth information. When the speed is higher, the precondition of multi-view depth estimation (that the field of view changes significantly between the two images) is satisfied, and the depth search range may be expanded to estimate the depth more accurately. When the vehicle speed is low, this precondition is not satisfied, so the depth search range can be narrowed so that the depth stays as close as possible to the monocular depth, making the obtained multi-view depth more robust. The multi-view depth search range may be as shown in the following equation (3):
[Equation (3) appears as an image in the original publication; it gives the bounds d_min and d_max of the multi-view depth search range in terms of D_Mono, β, T and v.]
In the above formula (3), d_min is the minimum value of the multi-view depth search range, d_max is the maximum value of the multi-view depth search range, D_Mono is the monocular depth information, β is a hyper-parameter (with a value between 1 and 2), T is the camera translation vector, v is the vehicle speed, and the value of the term βT(v) ranges between 0 and 1.
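As a rough illustration of the behavior described above (not the patented formula, whose exact form appears only as an image in the original), the search range can be widened or narrowed around the monocular depth by a speed-dependent weight:

    import numpy as np

    def depth_search_range(d_mono: np.ndarray, range_weight: float):
        # d_mono: per-pixel monocular depth D_Mono.
        # range_weight: the speed-dependent term of formula (3) (written there in terms
        #               of beta, T and v), taken here to lie between 0 and 1.
        # Assumed widening scheme, for illustration only: low speed collapses the range
        # onto the monocular depth, high speed widens it around the monocular depth.
        w = float(np.clip(range_weight, 0.0, 1.0))
        d_min = d_mono * (1.0 - 0.5 * w)
        d_max = d_mono * (1.0 + 0.5 * w)
        return d_min, d_max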
After obtaining the multi-view depth search range information of each pixel, the accurate multi-view depth information corresponding to each pixel can be found from the multi-view depth search range information.
For the certainty factor, a decoder (which may be composed of a 2D convolutional network plus a sigmoid activation function) may be used to learn an entropy function of the cost body feature, and the uncertainty of the cost body feature is obtained as the certainty factor of the multi-view depth information of each pixel.
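A minimal sketch of such a decoder head is shown below; the channel sizes and number of layers are assumptions, and only the convolution-plus-sigmoid structure follows the description above.

    import torch
    import torch.nn as nn

    class CertaintyHead(nn.Module):
        """2D convolutional head with a sigmoid that maps per-pixel cost-volume
        features to a value in (0, 1), used as the per-pixel certainty factor.
        Channel sizes and layer count are illustrative assumptions."""

        def __init__(self, cost_channels: int, hidden: int = 32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(cost_channels, hidden, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
                nn.Sigmoid(),                      # per-pixel value in (0, 1)
            )

        def forward(self, cost_volume_features: torch.Tensor) -> torch.Tensor:
            # cost_volume_features: (batch, cost_channels, H, W)
            return self.net(cost_volume_features)  # (batch, 1, H, W) certainty map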
After obtaining the monocular depth information and the multi-view depth information for each pixel in the first frame image, and the certainty factor of the multi-view depth information for each pixel, step 104 is performed.
Step 104: and calling the depth information fusion layer to fuse the monocular depth information and the monocular depth information of each pixel in the first frame image according to the certainty factor to obtain the image depth information of the first frame image.
After obtaining the monocular depth information and the multi-view depth information of each pixel in the first frame image, and the certainty factor of the multi-view depth information of each pixel, the depth information fusion layer can be called to fuse the monocular depth information and the multi-view depth information of each pixel in the first frame image according to the certainty factor, so as to obtain the image depth information of the first frame image. This implementation can be described in detail below in conjunction with fig. 5.
Referring to fig. 5, a flowchart illustrating steps of an image depth information obtaining method provided in an embodiment of the present application is shown, and as shown in fig. 5, the image depth information obtaining method may include: step 501, step 502, step 503 and step 504.
Step 501: calling the depth information fusion layer to obtain a first pixel of which the certainty factor is greater than a first certainty factor threshold value in the first frame image, and taking the multi-view depth information of the first pixel as the depth information of the first pixel.
In this embodiment, the first certainty threshold refers to a preset certainty threshold used for determining which image pixels are close to the multi-view depth information, and a specific value of the first certainty threshold may be determined according to business requirements, which is not limited in this embodiment.
After obtaining the multi-view depth information and the corresponding certainty factor of each pixel in the first frame image, the depth information fusion layer may be invoked to obtain the first pixel in the first frame image whose certainty factor is greater than the first certainty factor threshold.
After acquiring a first pixel of the first frame image whose certainty factor is greater than the first certainty factor threshold, the multi-purpose depth information of the first pixel may be used as the depth information of the first pixel.
Step 502: calling the depth information fusion layer to obtain a second pixel with certainty factor smaller than a second certainty factor threshold value in the first frame image, and taking monocular depth information of the second pixel as depth information of the second pixel; the first confidence threshold is greater than the second confidence threshold.
The second certainty threshold refers to a preset certainty threshold for determining pixels approaching monocular depth information in the image pixels, and a specific value of the second certainty threshold may be determined according to business requirements, which is not limited in this embodiment.
In this example, the first confidence threshold is greater than the second confidence threshold.
After obtaining the multi-view depth information and the corresponding certainty factor of each pixel in the first frame image, the depth information fusion layer may be called to obtain a second pixel in the first frame image whose certainty factor is smaller than a second certainty factor threshold.
After acquiring the second pixel with the certainty factor smaller than the second certainty factor threshold in the first frame image, the monocular depth information of the second pixel may be used as the depth information of the second pixel, that is, the monocular depth information of the second pixel is used in place of the multi-view depth information of the second pixel.
Step 503: calling the depth information fusion layer to obtain a third pixel except the first pixel and the second pixel in the first frame image, fusing monocular depth information and the multi-view depth information of the third pixel according to the certainty factor of the third pixel to obtain fusion depth information, and taking the fusion depth information as the depth information of the third pixel.
After obtaining the multi-view depth information and the corresponding certainty factor of each pixel in the first frame image, the depth information fusion layer may be invoked to obtain a third pixel in the first frame image except for the first pixel and the second pixel.
After the third pixel in the first frame image is obtained, the monocular depth information and the multi-view depth information of the third pixel may be fused according to the certainty factor of the third pixel to obtain the fusion depth information, and the fusion depth information may be further used as the depth information of the third pixel.
Step 504: and obtaining the image depth information of the first frame image according to the depth information of the first pixel, the depth information of the second pixel and the depth information of the third pixel.
After the depth information of the first pixel, the depth information of the second pixel, and the depth information of the third pixel are obtained, the image depth information of the first frame image may be obtained according to the depth information corresponding to the first pixel, the second pixel, and the third pixel, respectively.
In the present embodiment, the implementation described above can be expressed by the following equation (4):
D_Fuse = U · D_Mono + (1 − U) · D_MVS    (4)
In the above formula (4), D_Fuse is the fused image depth information, U is the certainty factor, D_Mono is the monocular depth information, and D_MVS is the multi-view depth information.
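A small per-pixel fusion sketch covering steps 501 to 504 is given below. The threshold values are illustrative assumptions (the application leaves them to business requirements), and the blended case uses the weighting of equation (4) as printed above.

    import numpy as np

    def fuse_depth(mono_depth, mvs_depth, certainty, high_thr=0.7, low_thr=0.3):
        fused = np.empty_like(mono_depth)

        first = certainty > high_thr      # step 501: take the multi-view depth
        second = certainty < low_thr      # step 502: fall back to the monocular depth
        third = ~(first | second)         # step 503: blend the two

        fused[first] = mvs_depth[first]
        fused[second] = mono_depth[second]
        fused[third] = (certainty[third] * mono_depth[third]
                        + (1.0 - certainty[third]) * mvs_depth[third])
        return fused                      # step 504: image depth information of the frame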
According to the depth information acquisition method provided by the embodiment of the application, the first frame image and the second frame image of the target vehicle are input into the depth information fusion model. The first frame image and the second frame image are two continuous frames of images, the generation time of the first frame image is later than that of the second frame image, and the depth information fusion model comprises: a monocular depth information acquisition layer, a multi-view depth information acquisition layer and a depth information fusion layer. The monocular depth information acquisition layer is called to process the image features of the first frame image to obtain the monocular depth information of each pixel in the first frame image. The multi-view depth information acquisition layer is called to process the image features of the first frame image and the second frame image according to the monocular depth information, the predicted vehicle speed of the target vehicle and the camera external parameters of the target camera to obtain the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information. The depth information fusion layer is called to fuse the monocular depth information and the multi-view depth information of each pixel in the first frame image according to the certainty factor to obtain the image depth information of the first frame image. By adaptively adjusting the multi-view depth search range in combination with the vehicle speed and the monocular depth, and fusing the multi-view depth and the monocular depth to obtain the image depth, a more accurate image depth can be obtained and the accuracy of image depth prediction is improved.
Referring to fig. 7, a schematic structural diagram of a depth information acquiring apparatus provided in an embodiment of the present application is shown, and as shown in fig. 7, the depth information acquiring apparatus 700 may include: an image input module 710, a monocular depth obtaining module 720, a multi-view depth obtaining module 730, and an image depth obtaining module 740.
An image input module 710 for inputting the first frame image and the second frame image of the target vehicle into the depth information fusion model; the first frame image and the second frame image are two continuous frames, the generation time of the first frame image is later than that of the second frame image, and the depth information fusion model comprises: a monocular depth information acquisition layer, a multi-view depth information acquisition layer and a depth information fusion layer;
a monocular depth acquiring module 720, configured to invoke the monocular depth information acquiring layer to process the image feature of the first frame of image, so as to obtain monocular depth information of each pixel in the first frame of image;
a multi-view depth obtaining module 730, configured to invoke the multi-view depth information obtaining layer to process image features of the first frame image and the second frame image according to the monocular depth information, the predicted vehicle speed of the target vehicle, and camera external parameters of the target camera, so as to obtain multi-view depth information of each pixel in the first frame image and reliability of the multi-view depth information;
and the image depth obtaining module 740 is configured to invoke the depth information fusion layer to fuse the monocular depth information and the multi-view depth information of each pixel in the first frame of image according to the certainty factor, so as to obtain the image depth information of the first frame of image.
Optionally, the depth information fusion model further includes: a camera external reference acquisition layer for acquiring the external reference of the camera,
the device further comprises:
the camera parameter acquisition module is used for calling the camera external parameter acquisition layer and processing the image characteristics of the first frame image and the second frame image to obtain the camera external parameters of the target camera;
and the vehicle speed determining module is used for determining the vehicle speed of the target vehicle according to the camera translation vector in the camera external parameters and the camera shooting frame rate of the target camera.
Optionally, the multi-view depth information obtaining layer includes: an encoding module, a cost body construction module and a decoding module,
the multi-view depth obtaining module 730 includes:
the image feature acquisition unit is used for calling the coding module to code the first frame image and the second frame image to obtain a first image feature of the first frame image and a second image feature of the second frame image;
the cost body feature obtaining unit is used for calling the cost body construction module to perform homography transformation on the first image feature and the second image feature to obtain transformation features, and processing the camera external parameters and the transformation features to obtain cost body features of each pixel in the first frame image;
and the multi-view depth determining unit is used for calling the decoding module to process the cost body characteristics according to the vehicle speed and the monocular depth information to obtain multi-view depth searching range information of each pixel in the first frame image, and determining the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information according to the multi-view depth searching range information.
Optionally, the cost characteristic obtaining unit includes:
a transformation feature obtaining subunit, configured to invoke the cost body construction module to perform homography transformation on the first image feature and the second image feature, so as to obtain a first transformation feature of the first image feature and a second transformation feature of the second image feature;
an initial feature obtaining subunit, configured to obtain an initial cost volume feature of each pixel in the second frame image according to the first transformation feature, the second transformation feature, and the camera external parameter;
and the cost body characteristic obtaining subunit is configured to project the initial cost body characteristic onto the first frame image, so as to obtain a cost body characteristic of each pixel in the first frame image.
Optionally, the image depth obtaining module 740 includes:
a first depth information acquiring unit, configured to call the depth information fusion layer to acquire a first pixel in the first frame image, where a certainty factor of the first pixel is greater than a first certainty factor threshold, and use multi-view depth information of the first pixel as depth information of the first pixel;
a second depth information obtaining unit, configured to call the depth information fusion layer to obtain a second pixel in the first frame of image, where a certainty factor of the second pixel is smaller than a second certainty factor threshold, and use monocular depth information of the second pixel as depth information of the second pixel; the first confidence threshold is greater than the second confidence threshold;
a third depth information obtaining unit, configured to invoke the depth information fusion layer to obtain a third pixel in the first frame image, except for the first pixel and the second pixel, and fuse the monocular depth information and the multi-view depth information of the third pixel according to the certainty factor of the third pixel to obtain fusion depth information, where the fusion depth information is used as depth information of the third pixel;
and the image depth information acquisition unit is used for obtaining the image depth information of the first frame image according to the depth information of the first pixel, the depth information of the second pixel and the depth information of the third pixel.
According to the depth information acquisition device provided by the embodiment of the application, the first frame image and the second frame image of the target vehicle are input into the depth information fusion model. The first frame image and the second frame image are two continuous frames of images, the generation time of the first frame image is later than that of the second frame image, and the depth information fusion model comprises: a monocular depth information acquisition layer, a multi-view depth information acquisition layer and a depth information fusion layer. The monocular depth information acquisition layer is called to process the image features of the first frame image to obtain the monocular depth information of each pixel in the first frame image. The multi-view depth information acquisition layer is called to process the image features of the first frame image and the second frame image according to the monocular depth information, the predicted vehicle speed of the target vehicle and the camera external parameters of the target camera, so as to obtain the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information. The depth information fusion layer is called to fuse the monocular depth information and the multi-view depth information of each pixel in the first frame image according to the certainty factor to obtain the image depth information of the first frame image. In this way, the multi-view depth search range is adaptively adjusted by combining the vehicle speed with the monocular depth, and the multi-view depth and the monocular depth are fused to obtain the image depth, so that a more accurate image depth can be obtained and the accuracy of image depth prediction is improved.
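The predicted vehicle speed used above can, under one plausible reading of the embodiment, be recovered from the camera external parameters: the norm of the camera translation vector between the two consecutive frames multiplied by the frame rate. The sketch below illustrates this reading; treating the full translation norm as the vehicle displacement is an assumption.

```python
import numpy as np

def predicted_vehicle_speed(translation, frame_rate_hz):
    """Vehicle speed from the camera translation between two consecutive
    frames and the camera frame rate.

    translation   : (3,) camera translation vector (metres) taken from the
                    camera external parameters.
    frame_rate_hz : frames per second; frames are 1 / frame_rate apart.
    """
    return float(np.linalg.norm(translation)) * frame_rate_hz

# Example: a 1.5 m translation between frames captured at 20 fps
# corresponds to a predicted speed of 30 m/s.
speed = predicted_vehicle_speed(np.array([0.0, 0.0, 1.5]), 20.0)
```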
An embodiment of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the above depth information acquisition method.
Fig. 8 shows a schematic structural diagram of an electronic device 800 according to an embodiment of the present application. As shown in Fig. 8, the electronic device 800 includes a Central Processing Unit (CPU) 801 that may perform various appropriate actions and processes in accordance with computer program instructions stored in a Read-Only Memory (ROM) 802 or computer program instructions loaded from a storage unit 808 into a Random Access Memory (RAM) 803. Various programs and data necessary for the operation of the electronic device 800 may also be stored in the RAM 803. The CPU 801, the ROM 802, and the RAM 803 are connected to one another via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, a microphone, and the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The various processes and methods described above may be performed by the processing unit 801. For example, the method of any of the embodiments described above may be implemented as a computer software program tangibly embodied on a computer-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the CPU 801, one or more steps of the methods described above may be performed.
The embodiment of the present application provides a computer-readable storage medium. A computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements each process of the above depth information acquisition method embodiment and can achieve the same technical effect; to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of another identical element in the process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only one type of logical functional division, and other divisions may be used in practice; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in an electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application, or the portions thereof that contribute to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or other media capable of storing program code.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A depth information acquisition method, comprising:
inputting a first frame image and a second frame image of a target vehicle into a depth information fusion model; the first frame image and the second frame image are two continuous frames, the generation time of the first frame image is later than that of the second frame image, and the depth information fusion model comprises: a monocular depth information acquisition layer, a multi-view depth information acquisition layer and a depth information fusion layer;
calling the monocular depth information acquisition layer to process the image characteristics of the first frame image to obtain the monocular depth information of each pixel in the first frame image;
calling the multi-view depth information acquisition layer to process the image characteristics of the first frame image and the second frame image according to the monocular depth information, the predicted vehicle speed of the target vehicle and the camera external parameter of the target camera to obtain the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information;
and calling the depth information fusion layer to fuse the monocular depth information and the multi-view depth information of each pixel in the first frame image according to the certainty factor to obtain the image depth information of the first frame image.
2. The method of claim 1, wherein the depth information fusion model further comprises: a camera external parameter acquisition layer for acquiring the camera external parameters,
after the inputting the first frame image and the second frame image of the target vehicle into the depth information fusion model, the method further comprises:
calling the camera external parameter acquisition layer, and processing the image characteristics of the first frame image and the second frame image to obtain the camera external parameters of the target camera;
and determining the vehicle speed of the target vehicle according to the camera translation vector in the camera external parameter and the camera shooting frame rate of the target camera.
3. The method of claim 1, wherein the multi-view depth information acquisition layer comprises: an encoding module, a cost body construction module and a decoding module,
and the calling the multi-view depth information acquisition layer to process the image features of the first frame image and the second frame image according to the monocular depth information, the predicted vehicle speed of the target vehicle and the camera external parameter of the target camera to obtain the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information comprises the following steps:
calling the encoding module to encode the first frame image and the second frame image to obtain a first image feature of the first frame image and a second image feature of the second frame image;
calling the cost body construction module to perform homography transformation on the first image characteristic and the second image characteristic to obtain a transformation characteristic, and processing the camera external parameter and the transformation characteristic to obtain a cost body characteristic of each pixel in the first frame image;
and calling the decoding module to process the cost body characteristics according to the vehicle speed and the monocular depth information to obtain the multi-view depth searching range information of each pixel in the first frame image, and determining the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information according to the multi-view depth searching range information.
4. The method according to claim 3, wherein the invoking the cost body construction module performs homography transformation on the first image feature and the second image feature to obtain a transformation feature, and processes the camera external parameter and the transformation feature to obtain a cost body feature of each pixel in the first frame image, including:
calling the cost body construction module to perform homography transformation on the first image characteristic and the second image characteristic to obtain a first transformation characteristic of the first image characteristic and a second transformation characteristic of the second image characteristic;
obtaining an initial cost body characteristic of each pixel in the second frame image according to the first transformation characteristic, the second transformation characteristic and the camera external parameter;
and projecting the initial cost body characteristic to the first frame image to obtain the cost body characteristic of each pixel in the first frame image.
5. The method according to claim 1, wherein said invoking the depth information fusion layer to fuse the monocular depth information and the multi-view depth information of each pixel in the first frame image according to the certainty factor to obtain the image depth information of the first frame image comprises:
calling the depth information fusion layer to obtain a first pixel with the certainty factor larger than a first certainty factor threshold value in the first frame image, and taking the multi-view depth information of the first pixel as the depth information of the first pixel;
calling the depth information fusion layer to obtain a second pixel with the certainty factor smaller than a second certainty factor threshold value in the first frame image, and taking the monocular depth information of the second pixel as the depth information of the second pixel; the first certainty factor threshold is greater than the second certainty factor threshold;
calling the depth information fusion layer to obtain a third pixel except the first pixel and the second pixel in the first frame image, fusing monocular depth information and the multi-view depth information of the third pixel according to the certainty factor of the third pixel to obtain fusion depth information, and taking the fusion depth information as the depth information of the third pixel;
and obtaining the image depth information of the first frame image according to the depth information of the first pixel, the depth information of the second pixel and the depth information of the third pixel.
6. A depth information acquisition apparatus characterized by comprising:
the image input module is used for inputting the first frame image and the second frame image of the target vehicle into the depth information fusion model; the first frame image and the second frame image are two continuous frames, the generation time of the first frame image is later than that of the second frame image, and the depth information fusion model comprises: a monocular depth information acquisition layer, a multi-view depth information acquisition layer and a depth information fusion layer;
the monocular depth acquisition module is used for calling the monocular depth information acquisition layer to process the image characteristics of the first frame image to obtain the monocular depth information of each pixel in the first frame image;
the multi-view depth acquisition module is used for calling the multi-view depth information acquisition layer to process the image characteristics of the first frame image and the second frame image according to the monocular depth information, the predicted vehicle speed of the target vehicle and the camera external parameter of the target camera to obtain the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information;
and the image depth acquisition module is used for calling the depth information fusion layer to fuse the monocular depth information and the multi-view depth information of each pixel in the first frame image according to the certainty factor so as to obtain the image depth information of the first frame image.
7. The apparatus of claim 6, wherein the depth information fusion model further comprises: a camera external parameter acquisition layer,
the device further comprises:
the camera parameter acquisition module is used for calling the camera external parameter acquisition layer and processing the image characteristics of the first frame image and the second frame image to obtain the camera external parameters of the target camera;
and the vehicle speed determining module is used for determining the vehicle speed of the target vehicle according to the camera translation vector in the camera external parameters and the camera shooting frame rate of the target camera.
8. The apparatus of claim 6, wherein the multi-view depth information acquisition layer comprises: an encoding module, a cost body construction module and a decoding module,
the multi-view depth acquisition module comprises:
the image feature acquisition unit is used for calling the encoding module to encode the first frame image and the second frame image to obtain a first image feature of the first frame image and a second image feature of the second frame image;
the cost body feature obtaining unit is used for calling the cost body construction module to perform homography transformation on the first image feature and the second image feature to obtain transformation features, and processing the camera external parameters and the transformation features to obtain cost body features of each pixel in the first frame image;
and the multi-view depth determining unit is used for calling the decoding module to process the cost body characteristics according to the vehicle speed and the monocular depth information to obtain multi-view depth searching range information of each pixel in the first frame image, and determining the multi-view depth information of each pixel in the first frame image and the reliability of the multi-view depth information according to the multi-view depth searching range information.
9. The apparatus according to claim 8, wherein the cost body feature obtaining unit comprises:
a transformation feature obtaining subunit, configured to invoke the cost body construction module to perform homography transformation on the first image feature and the second image feature, so as to obtain a first transformation feature of the first image feature and a second transformation feature of the second image feature;
an initial feature obtaining subunit, configured to obtain an initial cost body feature of each pixel in the second frame image according to the first transformation feature, the second transformation feature, and the camera external parameter;
and the cost body characteristic obtaining subunit is configured to project the initial cost body characteristic onto the first frame image, so as to obtain a cost body characteristic of each pixel in the first frame image.
10. The apparatus of claim 6, wherein the image depth acquisition module comprises:
a first depth information acquiring unit, configured to call the depth information fusion layer to acquire a first pixel in the first frame image, where a certainty factor of the first pixel is greater than a first certainty factor threshold, and use multi-view depth information of the first pixel as depth information of the first pixel;
a second depth information obtaining unit, configured to invoke the depth information fusion layer to obtain a second pixel in the first frame image whose certainty factor is smaller than a second certainty factor threshold, and use the monocular depth information of the second pixel as the depth information of the second pixel, where the first certainty factor threshold is greater than the second certainty factor threshold;
a third depth information obtaining unit, configured to invoke the depth information fusion layer to obtain a third pixel in the first frame image other than the first pixel and the second pixel, fuse the monocular depth information and the multi-view depth information of the third pixel according to the certainty factor of the third pixel to obtain fusion depth information, and use the fusion depth information as the depth information of the third pixel;
and the image depth information acquisition unit is used for obtaining the image depth information of the first frame image according to the depth information of the first pixel, the depth information of the second pixel and the depth information of the third pixel.
11. An electronic device, comprising:
memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the depth information acquisition method of any one of claims 1 to 5.
12. A readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the depth information acquisition method of any one of claims 1 to 5.
CN202210989737.5A 2022-08-17 2022-08-17 Depth information acquisition method and device, electronic equipment and storage medium Pending CN115457105A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210989737.5A CN115457105A (en) 2022-08-17 2022-08-17 Depth information acquisition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210989737.5A CN115457105A (en) 2022-08-17 2022-08-17 Depth information acquisition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115457105A true CN115457105A (en) 2022-12-09

Family

ID=84297910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210989737.5A Pending CN115457105A (en) 2022-08-17 2022-08-17 Depth information acquisition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115457105A (en)

Similar Documents

Publication Publication Date Title
CN107798669B (en) Image defogging method and device and computer readable storage medium
CN109977847B (en) Image generation method and device, electronic equipment and storage medium
CN113066017B (en) Image enhancement method, model training method and equipment
US20220327385A1 (en) Network training method, electronic device and storage medium
CN112927271B (en) Image processing method, image processing device, storage medium and electronic apparatus
CN112270710A (en) Pose determination method, pose determination device, storage medium, and electronic apparatus
CN112288816B (en) Pose optimization method, pose optimization device, storage medium and electronic equipment
CN111723707A (en) Method and device for estimating fixation point based on visual saliency
CN114511041B (en) Model training method, image processing method, device, equipment and storage medium
CN113160277A (en) Image processing method and device, electronic equipment and storage medium
CN110827341A (en) Picture depth estimation method and device and storage medium
CN112991381A (en) Image processing method and device, electronic equipment and storage medium
CN113592709B (en) Image super processing method, device, equipment and storage medium
CN113283319A (en) Method and device for evaluating face ambiguity, medium and electronic equipment
CN112648994A (en) Camera pose estimation method and device based on depth vision odometer and IMU
CN115457105A (en) Depth information acquisition method and device, electronic equipment and storage medium
CN113808157B (en) Image processing method and device and computer equipment
CN112330721B (en) Three-dimensional coordinate recovery method and device, electronic equipment and storage medium
CN114399648A (en) Behavior recognition method and apparatus, storage medium, and electronic device
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium
CN113409331A (en) Image processing method, image processing apparatus, terminal, and readable storage medium
CN113362260A (en) Image optimization method and device, storage medium and electronic equipment
US20210227192A1 (en) Method and device for processing video
CN117291947A (en) Method for generating new visual angle image, related method and related product
Li et al. [Retracted] Machine‐Type Video Communication Using Pretrained Network for Internet of Things

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination