CN118261955A - Image depth estimation method, device, electronic equipment and storage medium

Image depth estimation method, device, electronic equipment and storage medium

Info

Publication number
CN118261955A
CN118261955A (application CN202211676655.1A)
Authority
CN
China
Prior art keywords
image
images
point cloud
cloud data
scene
Prior art date
Legal status
Pending
Application number
CN202211676655.1A
Other languages
Chinese (zh)
Inventor
黄媛
吴垚垚
王烁
焦少慧
武芳芳
Current Assignee
Douyin Vision Co Ltd
Original Assignee
Douyin Vision Co Ltd
Priority date
Filing date
Publication date
Application filed by Douyin Vision Co Ltd filed Critical Douyin Vision Co Ltd
Priority to CN202211676655.1A
Publication of CN118261955A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/50: Depth or shape recovery
    • G06T 7/55: Depth or shape recovery from multiple images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10028: Range image; Depth image; 3D point clouds
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present specification provide an image depth estimation method and device, an electronic device, and a storage medium. The method includes: acquiring a plurality of images to be processed of a first object, where the images to be processed correspond one-to-one to first shooting viewpoints of the first object; and performing depth estimation on each of the images to be processed through a depth estimation model to obtain a depth image corresponding to each image to be processed. The training data of the depth estimation model includes a plurality of scene images of a second object and a depth image corresponding to each scene image; the scene images correspond one-to-one to second shooting viewpoints of the second object, and the depth image corresponding to each scene image is obtained based on point cloud data of the second object. This embodiment improves the consistency of the depth scale of the same object across the depth images corresponding to images shot from different shooting viewpoints, and thereby improves the accuracy of three-dimensional modeling.

Description

Image depth estimation method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image depth estimation, and in particular, to an image depth estimation method, an image depth estimation device, an electronic device, and a storage medium.
Background
In a three-dimensional modeling scenario, images of a scene are captured from different shooting viewpoints. Each captured image is converted into a depth map by a neural-network-based depth estimation model, point cloud data corresponding to each shooting viewpoint is generated from the corresponding depth map, and a three-dimensional model of the whole scene is then constructed from the point cloud data of all shooting viewpoints.
A neural-network-based depth estimation model only considers whether the depth of a single image is accurate; it does not consider the scale consistency of depth across multiple images. Consequently, when the same object appears in two or more captured images, its depth values in the corresponding depth maps may differ. For example, for a certain point on an object such as a table, the depth may be 5 in the depth map corresponding to the image captured from shooting viewpoint A but 6 in the depth map corresponding to the image captured from shooting viewpoint B.
As a result, when three-dimensional modeling is performed through the above process, the point cloud data of the same object obtained from different shooting viewpoints may have different values, causing fusion errors when these point clouds are merged and reducing the accuracy of the three-dimensional modeling.
Disclosure of Invention
Embodiments of the present specification provide an image depth estimation method and device, an electronic device, and a storage medium, which improve the consistency of the depth scale of the same object across the depth images corresponding to images shot from different shooting viewpoints and thereby improve the accuracy of three-dimensional modeling.
In a first aspect, embodiments of the present disclosure provide an image depth estimation method, including:
Acquiring a plurality of images to be processed of a first object, where the images to be processed correspond one-to-one to first shooting viewpoints of the first object;
Performing depth estimation on each of the plurality of images to be processed through a depth estimation model to obtain a depth image corresponding to each image to be processed;
Where the training data of the depth estimation model includes a plurality of scene images of a second object and a depth image corresponding to each scene image; the scene images correspond one-to-one to second shooting viewpoints of the second object; and the depth image corresponding to each scene image is obtained based on point cloud data of the second object.
In a second aspect, embodiments of the present disclosure provide an image depth estimation apparatus, including:
an image acquisition unit configured to acquire a plurality of images to be processed of a first object, where the images to be processed correspond one-to-one to first shooting viewpoints of the first object;
a depth estimation unit configured to perform depth estimation on each of the plurality of images to be processed through a depth estimation model to obtain a depth image corresponding to each image to be processed;
where the training data of the depth estimation model includes a plurality of scene images of a second object and a depth image corresponding to each scene image; the scene images correspond one-to-one to second shooting viewpoints of the second object; and the depth image corresponding to each scene image is obtained based on point cloud data of the second object.
In a third aspect, embodiments of the present disclosure provide an electronic device, including:
A processor; and
A memory configured to store computer-executable instructions that, when executed, cause the processor to implement the steps of the method of the first aspect described above.
In a fourth aspect, embodiments of the present description provide a computer-readable storage medium for storing computer-executable instructions which, when executed by a processor, implement the steps of the method of the first aspect described above.
In this embodiment, a plurality of images to be processed of a first object are first acquired, each corresponding one-to-one to a first shooting viewpoint of the first object, and depth estimation is then performed on each of these images through a depth estimation model to obtain a depth image corresponding to each image to be processed. The training data of the depth estimation model includes a plurality of scene images of a second object and a depth image corresponding to each scene image; the scene images correspond one-to-one to second shooting viewpoints of the second object, and the depth image corresponding to each scene image is obtained based on the point cloud data of the second object. Because the point cloud data of the second object represents the whole scene of the second object, it has a consistent depth scale, and the depth images derived from it therefore share that consistent scale. The training data obtained in this way is thus scale-consistent in depth, and when the depth estimation model is trained on this data and used for depth estimation, the consistency of the depth scale of the same object across depth images corresponding to images shot from different shooting viewpoints is improved, which in turn improves the accuracy of three-dimensional modeling.
Drawings
For a clearer description of one or more embodiments of the present specification or of prior-art solutions, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are only some of the embodiments of the present specification; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a flowchart of an image depth estimation method according to an embodiment of the present disclosure;
Fig. 2 is a schematic structural diagram of an image depth estimation device according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To enable a person skilled in the art to better understand the technical solutions in one or more embodiments of the present specification, these solutions are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present specification. All other embodiments obtained by a person skilled in the art from one or more embodiments of the present specification without inventive effort fall within the scope of the present disclosure.
Embodiments of the present specification provide an image depth estimation method that improves the consistency of the depth scale of the same object across the depth images corresponding to images shot from different shooting viewpoints and thereby improves the accuracy of three-dimensional modeling. The method can be applied to, and executed by, a server. To facilitate understanding of the embodiments, a depth-image-based three-dimensional modeling process is briefly described here, taking indoor modeling as an example. In the traditional approach, a plurality of panoramic images are shot from different indoor shooting viewpoints, a depth image corresponding to each panoramic image is generated by a depth estimation model, the depth images are then converted into point cloud data for the respective viewpoints, and the point cloud data are stitched together to build a three-dimensional model of the indoor scene.
Fig. 1 is a flowchart of an image depth estimation method according to an embodiment of the present disclosure. As shown in Fig. 1, the method includes:
Step S102: acquire a plurality of images to be processed of a first object, where the images to be processed correspond one-to-one to first shooting viewpoints of the first object;
Step S104: perform depth estimation on each of the plurality of images to be processed through a depth estimation model to obtain a depth image corresponding to each image to be processed.
The training data of the depth estimation model includes a plurality of scene images of a second object and a depth image corresponding to each scene image; the scene images correspond one-to-one to second shooting viewpoints of the second object; and the depth image corresponding to each scene image is obtained based on point cloud data of the second object.
In this embodiment, a plurality of images to be processed of a first object are first acquired, each corresponding one-to-one to a first shooting viewpoint of the first object, and depth estimation is then performed on each of these images through a depth estimation model to obtain a depth image corresponding to each image to be processed. The training data of the depth estimation model includes a plurality of scene images of a second object and a depth image corresponding to each scene image; the scene images correspond one-to-one to second shooting viewpoints of the second object, and the depth image corresponding to each scene image is obtained based on the point cloud data of the second object. Because the point cloud data of the second object represents the whole scene of the second object, it has a consistent depth scale, and the depth images derived from it therefore share that consistent scale. The training data obtained in this way is thus scale-consistent in depth, and when the depth estimation model is trained on this data and used for depth estimation, the consistency of the depth scale of the same object across depth images corresponding to images shot from different shooting viewpoints is improved, which in turn improves the accuracy of three-dimensional modeling.
In step S102, the first object is the object to be modeled in three dimensions, such as a room or a street. A plurality of first shooting viewpoints, i.e. preset shooting positions, are arranged within the first object. A camera can be placed at each first shooting viewpoint, and an RGB image is shot from each viewpoint as an image to be processed whose depth is to be estimated. The images to be processed therefore correspond one-to-one to the first shooting viewpoints.
In step S104, each image to be processed is input into a pre-trained depth estimation model, which processes it to obtain the corresponding depth image.
In one embodiment, after the depth image corresponding to each image to be processed is obtained, point cloud data corresponding to each first shooting viewpoint can be generated from the corresponding depth image, and a three-dimensional model of the first object can be constructed from the point cloud data of all first shooting viewpoints. Specifically, the point cloud data of each first shooting viewpoint is generated from the depth image corresponding to that viewpoint's image to be processed and the pose of the viewpoint, by converting the depth image into point cloud data, and the point cloud data of all first shooting viewpoints are stitched together to obtain the three-dimensional model of the first object, as sketched below.
In this embodiment, because the depth images corresponding to the images to be processed have high scale consistency, the point cloud data of the first shooting viewpoints are also consistent in scale, point cloud fusion errors are smaller, and the constructed three-dimensional model of the first object is more accurate.
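The following is a minimal sketch of the depth-image-to-point-cloud conversion and stitching step, assuming a pinhole camera model with a known intrinsic matrix K and a camera-to-world pose for each first shooting viewpoint; for panoramic images an equirectangular back-projection would replace the pinhole model. The function name and the concatenation step are illustrative, not part of the patent.

```python
import numpy as np

def depth_to_point_cloud(depth, K, pose):
    """Back-project a depth map into world-space points.

    depth: (H, W) depth values; pixels with depth 0 are skipped.
    K:     (3, 3) camera intrinsic matrix (pinhole model, assumed known).
    pose:  (4, 4) camera-to-world transform of the shooting viewpoint.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0
    z = depth[valid]
    # Pixel coordinates -> camera coordinates.
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)   # (4, N)
    # Camera coordinates -> world coordinates using the viewpoint pose.
    return (pose @ pts_cam)[:3].T                            # (N, 3)

# Stitching: because the estimated depths share a consistent scale, the
# per-viewpoint clouds can simply be concatenated in world space, e.g.
# scene_cloud = np.concatenate([depth_to_point_cloud(d, K, T)
#                               for d, T in zip(depth_maps, poses)], axis=0)
```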
As mentioned above, the training data of the pre-trained depth estimation model includes a plurality of scene images of the second object and a depth image corresponding to each scene image. The scene images correspond one-to-one to second shooting viewpoints of the second object, and the depth image corresponding to each scene image is obtained based on the point cloud data of the second object.
Specifically, the second object may be any indoor scene or any outdoor scene. Since the depth estimation model is trained using the scene images of the second object and is then used to estimate depth for the first object, the first object and the second object are of the same type: both are indoor scenes or both are outdoor scenes.
A plurality of second shooting viewpoints, i.e. preset shooting positions, are arranged within the second object. A camera can be placed at each second shooting viewpoint, and a panoramic RGB image of the second object is shot from each viewpoint as a scene image of the second object. A plurality of scene images of the second object are thus obtained, corresponding one-to-one to the second shooting viewpoints.
In one embodiment, the depth image corresponding to each scene image of the second object is determined by:
generating the pose of each second shooting viewpoint from the plurality of scene images of the second object;
generating the point cloud data of the second object from the plurality of scene images of the second object; and
generating the depth image corresponding to each scene image from the pose of each second shooting viewpoint and the point cloud data of the second object.
Specifically, the pose of each second shooting viewpoint and the point cloud data of the second object are generated from the plurality of scene images of the second object. In one embodiment, the scene images can be processed by a structure-from-motion (SfM) algorithm to obtain the poses of the second shooting viewpoints and the point cloud data of the second object. Because the point cloud data of the second object represents the whole scene of the second object, it has a consistent depth scale; the depth images derived from it for the individual scene images therefore also share a consistent depth scale, i.e. the same object has the same depth value in different depth images.
In one embodiment, the pose of each second shooting viewpoint is generated from the plurality of scene images as follows:
selecting, from the plurality of scene images, a plurality of first images that meet a first image matching requirement, the first images forming an image set;
determining the pose of the second shooting viewpoint corresponding to each first image;
determining the poses of the second shooting viewpoints corresponding to the remaining scene images based on the image feature matching relationship between the first images and the remaining scene images, where the remaining scene images are the scene images other than the first images; and
determining the pose of every second shooting viewpoint from the poses of the second shooting viewpoints corresponding to the first images and to the remaining scene images.
First, the plurality of first images meeting the first image matching requirement are selected from the scene images and form an image set. Image feature points, including pixel coordinates, orientations, and feature descriptors, are extracted from each scene image. The image feature points of every pair of scene images are then matched: the similarity between each pair of feature points from two scene images is computed, and two feature points whose similarity exceeds a threshold are regarded as a successful match. Finally, the two scene images with the largest number of successfully matched feature points are selected as the two first images meeting the first image matching requirement, and they form the image set. A matching sketch is given below.
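A minimal sketch of the feature extraction, pairwise matching, and initial-pair selection described above, using OpenCV SIFT features and Lowe's ratio test in place of the unspecified similarity threshold; these concrete choices are assumptions, not part of the patent.

```python
import itertools
import cv2

def match_pair(img_a, img_b, ratio=0.75):
    """Collect successfully matched feature points between two scene images."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des_a, des_b, k=2)
    # Ratio test stands in for the similarity threshold in the text.
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    return kp_a, kp_b, good

def select_initial_pair(images):
    """Pick the two scene images with the most successful matches."""
    return max(itertools.combinations(range(len(images)), 2),
               key=lambda ij: len(match_pair(images[ij[0]], images[ij[1]])[2]))
```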
Next, the pose of the second shooting viewpoint corresponding to each first image is determined. In one embodiment, this includes:
determining the relative pose between the second shooting viewpoints corresponding to the first images from the image coordinates of the mutually matched first image feature points of the first images; and
setting the pose of the second shooting viewpoint corresponding to any one of the first images, and determining the poses of the second shooting viewpoints corresponding to the other first images from the set pose and the relative pose.
As described above, the two scene images with the largest number of successfully matched image feature points are selected as the first images. The feature points matched between the two first images serve as the first image feature points, and the relative pose between the two second shooting viewpoints corresponding to the two first images, namely the relative rotation matrix and relative translation between them, is computed from the image coordinates of the first image feature points in each first image.
The pose of the second shooting viewpoint corresponding to either first image is then set as the origin of the world coordinate system, and the pose of the second shooting viewpoint corresponding to the other first image is determined from this origin and the relative pose computed above, as sketched below.
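A minimal sketch of recovering the relative pose of the two initial viewpoints, assuming perspective views with a known intrinsic matrix K; for panoramic scene images the matched points would first be converted to bearing vectors. The translation is recovered only up to scale, which is consistent with fixing one viewpoint at the world origin.

```python
import cv2

def relative_pose(pts_a, pts_b, K):
    """Relative rotation and translation between the two initial viewpoints.

    pts_a, pts_b: (N, 2) pixel coordinates of the mutually matched first
    image feature points; K: shared (3, 3) intrinsic matrix (assumed known).
    """
    E, inliers = cv2.findEssentialMat(pts_a, pts_b, K,
                                      method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=inliers)
    return R, t   # t is defined only up to scale at this stage

# The first viewpoint is fixed at the world origin (identity pose); the
# second viewpoint's pose is then given by [R | t] relative to it.
```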
The number of first images is not limited to two; it may be three, four, or more. When there are three or more first images, the first image feature points are the image feature points matched across all of the first images, and the poses of the corresponding second shooting viewpoints are solved in the same way as described above, which is not repeated here.
In this embodiment, by fixing the pose of the second shooting viewpoint corresponding to one first image and solving for the relative pose, the poses of the second shooting viewpoints corresponding to the first images can be determined accurately and quickly.
Next, the scene images other than the first images are taken as the remaining scene images, and the poses of the second shooting viewpoints corresponding to the remaining scene images are determined based on the image feature matching relationship between the first images and the remaining scene images.
In one embodiment, this is done as follows:
selecting, from the remaining scene images, a second image that meets a second image matching requirement according to the image matching relationship between the remaining scene images and the first images;
determining the pose of the second shooting viewpoint corresponding to the second image from the second image feature points matched between the second image and each first image;
adding the processed second image to the image set as a new first image; and
repeating the selection of a second image and the determination of the pose of its corresponding second shooting viewpoint on the expanded image set until all remaining scene images have been processed.
As in the earlier operation, image feature points have been extracted from every scene image. Any one first image is selected, and its feature points are matched against those of every other scene image: similarities between feature-point pairs are computed, and pairs whose similarity exceeds a threshold count as successful matches. Among all the remaining scene images, the image with the largest number of feature points successfully matched to the selected first image is chosen as the second image whose image matching relationship meets the second image matching requirement.
The image feature points matched between the second image and each first image are then taken as second image feature points, and the pose of the second shooting viewpoint corresponding to the second image is determined from them. At this point, the pose of the second shooting viewpoint corresponding to the selected second image has been determined.
The second image is then added to the image set formed by the first images as a new first image, updating the set. Based on the updated image set, the actions of selecting a second image and determining the pose of its corresponding second shooting viewpoint are repeated until every remaining scene image has been selected, processed, and added to the image set, at which point the image set contains all scene images; the overall loop is sketched after this passage.
When determining the pose of the second shooting viewpoint corresponding to a second image from the second image feature points matched between the second image and each first image, every first image in the image set must be considered. Since each processed second image is added to the set as a new first image, when the initially selected number of first images is two, the first selected second image is matched against two first images, the second selected second image is matched against three first images, and so on.
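A structural sketch of the incremental loop described above. The registration details (PnP and triangulation) are covered further below; register_and_triangulate is a hypothetical helper standing in for those steps, and the data structures are illustrative assumptions.

```python
def num_matches_to_set(i, image_set, matches):
    """Number of feature points of scene image i successfully matched to
    images already in the image set. matches maps an ordered image pair
    (a, b) to its list of matched feature-point pairs."""
    return sum(len(matches.get((i, j), [])) + len(matches.get((j, i), []))
               for j in image_set)

def incremental_registration(image_set, remaining, features, matches, poses, cloud):
    """Repeatedly select the remaining scene image best matched to the
    current image set, register it, and add it to the set as a new first
    image, until every remaining scene image has been processed."""
    while remaining:
        second = max(remaining,
                     key=lambda i: num_matches_to_set(i, image_set, matches))
        pose, new_points = register_and_triangulate(second, image_set,
                                                    features, cloud)
        poses[second] = pose
        cloud.update(new_points)   # point cloud data of newly triangulated points
        image_set.add(second)      # the processed second image becomes
        remaining.remove(second)   # a new first image
    return poses, cloud
```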
In one embodiment, determining the pose of the second shooting viewpoint corresponding to the second image from the second image feature points matched between the second image and each first image includes:
splitting the second image feature points matched between the second image and each first image into first sub-feature points, whose corresponding point cloud data is already contained in the point cloud data corresponding to the first image, and second sub-feature points, whose corresponding point cloud data is not; and
determining the pose of the second shooting viewpoint corresponding to the second image from the point cloud data and image coordinates corresponding to the first sub-feature points matched between the second image and each first image.
Here, the procedure for determining the pose of the second shooting viewpoint corresponding to the second image is explained taking two first images as an example. When there are three or more first images, the procedure is the same and is not repeated.
First, the image feature points matched between the second image and each first image are taken as the second image feature points. Then, for each first image, the second image feature points matched between the second image and that first image are split into first sub-feature points, whose point cloud data lies within the point cloud data corresponding to the first image, and second sub-feature points, whose point cloud data does not.
When the image set consists only of the two initially selected first images, the point cloud data corresponding to a first image is the point cloud data of its first image feature points. When the set also contains second images that were added later, the point cloud data corresponding to a first image may be either the point cloud data of the first image feature points or the point cloud data already determined for the image feature points of an added second image. In either case, the second image feature points matched between the second image and each first image can be split into first and second sub-feature points according to the point cloud data corresponding to that first image: the point cloud data of the first sub-feature points has already been determined, while that of the second sub-feature points has not.
The pose of the second shooting viewpoint corresponding to the second image is then determined from the point cloud data corresponding to the first sub-feature points matched between the second image and each first image and from the image coordinates of those first sub-feature points in the second image and each first image. A PnP algorithm can be adopted, taking the point cloud data of the first sub-feature points and their image coordinates as its inputs and producing the pose of the second shooting viewpoint corresponding to the second image, as in the sketch below.
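A minimal PnP sketch using OpenCV, assuming a known intrinsic matrix and using the first sub-feature points' point cloud data together with their pixel coordinates in the second image; the RANSAC variant is an assumption.

```python
import cv2
import numpy as np

def pose_from_pnp(object_points, image_points, K):
    """Pose of the second image's viewpoint from its first sub-feature points.

    object_points: (N, 3) point cloud data already reconstructed for the
                   mutually matched first sub-feature points.
    image_points:  (N, 2) pixel coordinates of those points in the second image.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_points.astype(np.float32),
        image_points.astype(np.float32),
        K, None)
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix
    return R, tvec               # world-to-camera pose of the viewpoint
```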
For the second image, the point cloud data of the first sub-feature points matched with each first image and the point cloud data of the second sub-feature points matched with each first image together form the point cloud data corresponding to the image feature points of the second image.
In one specific example, each scene image has 100 image feature points, numbered 1-100. In first image 1, the feature points numbered 1-50 already have point cloud data; in first image 2, the feature points numbered 20-70 have point cloud data. For the second image, the second image feature points matched with first image 1 are numbered 30-60, of which points 30-50 are first sub-feature points and points 50-60 are second sub-feature points; the second image feature points matched with first image 2 are numbered 40-80, of which points 40-70 are first sub-feature points and points 70-80 are second sub-feature points.
Following the above procedure, the point cloud data of the feature points numbered 30-50 in first image 1, their image coordinates in the second image and first image 1, the point cloud data of the feature points numbered 40-70 in first image 2, and their image coordinates in the second image and first image 2 are provided as the inputs of the PnP algorithm, which yields the pose of the second shooting viewpoint corresponding to the second image.
For the second image, the point cloud data of the first sub-feature points numbered 30-50 matched with first image 1, the second sub-feature points numbered 50-60 matched with first image 1, the first sub-feature points numbered 40-70 matched with first image 2, and the second sub-feature points numbered 70-80 matched with first image 2 together form the point cloud data corresponding to the image feature points of the second image, i.e. the point cloud data of its feature points numbered 30-80.
After every remaining scene image has been selected as a second image and the pose of its corresponding second shooting viewpoint has been obtained through the above procedure, the pose of every second shooting viewpoint is determined from the poses of the second shooting viewpoints corresponding to the first images and to the remaining scene images. In one embodiment, this specifically includes:
taking the poses of the second shooting viewpoints corresponding to the first images and to the remaining scene images as the optimization objects, and optimizing them by minimizing the reprojection error; and
taking the optimized poses of the second shooting viewpoints corresponding to the first images and to the remaining scene images as the poses of the second shooting viewpoints.
Specifically, with the poses of the second shooting viewpoints corresponding to the first images and to the remaining scene images as the optimization objects, an energy function formed by the reprojection errors is constructed and minimized with the Gauss-Newton method or the LM (Levenberg-Marquardt) algorithm, yielding the optimized optimization objects. The reprojection error is the error between a pixel's observed coordinates and the coordinates obtained by reprojecting the corresponding point cloud data back into the image according to that point cloud data and the corresponding pose. A sketch of this optimization is given below.
The optimized poses of the second shooting viewpoints corresponding to the first images and to the remaining scene images are then taken as the poses of the second shooting viewpoints.
In this embodiment, the pose of each second shooting viewpoint is therefore optimized, improving the accuracy of the pose data. Moreover, through the above procedure the pose of each second shooting viewpoint can be generated accurately from the scene images of the second object, preparing the ground for obtaining depth images with a consistent depth scale.
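A minimal sketch of minimizing the reprojection-error energy function with Levenberg-Marquardt, assuming a pinhole camera with known intrinsics; the parameterization (rotation vector plus translation per viewpoint, three coordinates per point) and the use of SciPy are assumptions.

```python
import cv2
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(params, n_views, n_points, K, observations):
    """Residuals of the energy function formed by the reprojection errors.

    params stacks the optimization objects: 6 pose parameters per second
    shooting viewpoint (rotation vector + translation) followed by 3
    coordinates per point cloud point. observations is a list of tuples
    (view_idx, point_idx, u, v) giving the observed pixel coordinates.
    """
    poses = params[:n_views * 6].reshape(n_views, 6)
    points = params[n_views * 6:].reshape(n_points, 3)
    residuals = []
    for view_idx, point_idx, u, v in observations:
        rvec, tvec = poses[view_idx, :3], poses[view_idx, 3:]
        proj, _ = cv2.projectPoints(points[point_idx].reshape(1, 3),
                                    rvec, tvec, K, None)
        residuals.extend(proj.ravel() - np.array([u, v]))
    return np.array(residuals)

# Levenberg-Marquardt minimization, with x0 built from the initial poses
# and point cloud data:
# result = least_squares(reprojection_residuals, x0, method="lm",
#                        args=(n_views, n_points, K, observations))
```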
In one embodiment, the point cloud data of the second object is generated from the plurality of scene images as follows:
selecting, from the plurality of scene images, a plurality of first images that meet the first image matching requirement, the first images forming an image set;
generating the point cloud data corresponding to the mutually matched first image feature points of the first images;
determining the point cloud data corresponding to the image feature points in the remaining scene images based on the image feature matching relationship between the first images and the remaining scene images, where the remaining scene images are the scene images other than the first images; and
constructing the point cloud data of the second object from the point cloud data corresponding to the first image feature points and the point cloud data corresponding to the image feature points in the remaining scene images.
First, the plurality of first images meeting the first image matching requirement are selected from the scene images and form an image set, as described above: image feature points including pixel coordinates, orientations, and feature descriptors are extracted from each scene image, the feature points of every pair of scene images are matched, pairs whose similarity exceeds a threshold are regarded as successful matches, and the two scene images with the largest number of successfully matched feature points are selected as the two first images, which form the image set.
Next, the point cloud data corresponding to the mutually matched first image feature points of the first images is generated. In one embodiment, this is done by point cloud triangulation: the first image feature points are projected into three-dimensional space according to their image coordinates, giving the corresponding point cloud data.
Specifically, the image feature points matched between the two first images are the first image feature points. Based on point cloud triangulation, they are projected into three-dimensional space according to their image coordinates in each first image, which yields the point cloud data corresponding to the first image feature points matched between the two first images, as in the sketch below.
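A minimal triangulation sketch using OpenCV, assuming perspective views with a known intrinsic matrix and the viewpoint poses obtained above; the function name is illustrative.

```python
import cv2
import numpy as np

def triangulate(pts_a, pts_b, K, pose_a, pose_b):
    """Project mutually matched first image feature points into 3D space.

    pts_a, pts_b: (N, 2) pixel coordinates of the same feature points in the
    two first images; pose_a, pose_b: (3, 4) world-to-camera [R | t] matrices.
    """
    P_a = K @ pose_a                        # projection matrices
    P_b = K @ pose_b
    pts4d = cv2.triangulatePoints(P_a, P_b,
                                  pts_a.T.astype(np.float32),
                                  pts_b.T.astype(np.float32))
    return (pts4d[:3] / pts4d[3]).T         # homogeneous -> Euclidean, (N, 3)
```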
The number of first images is again not limited to two; it may be three, four, or more. When there are three or more first images, the first image feature points are the image feature points matched across all of the first images, and their point cloud data is solved in the same way as described above, which is not repeated here.
Point cloud triangulation therefore allows the point cloud data corresponding to the first image feature points to be determined quickly and accurately in this embodiment.
Next, the scene images other than the first images are taken as the remaining scene images, and the point cloud data corresponding to the image feature points in the remaining scene images is determined based on the image feature matching relationship between the first images and the remaining scene images.
In one embodiment, this is done as follows:
selecting, from the remaining scene images, a second image that meets the second image matching requirement according to the image matching relationship between the remaining scene images and the first images;
determining the point cloud data corresponding to the image feature points in the second image from the second image feature points matched between the second image and each first image;
adding the processed second image to the image set as a new first image; and
repeating the selection of a second image and the determination of the point cloud data corresponding to its image feature points on the expanded image set until all remaining scene images have been processed.
As before, image feature points have been extracted from every scene image. Any one first image is selected and its feature points are matched against those of every other scene image; pairs whose similarity exceeds a threshold count as successful matches, and the remaining scene image with the largest number of feature points successfully matched to the selected first image is chosen as the second image whose image matching relationship meets the second image matching requirement.
The image feature points matched between the second image and each first image are then taken as second image feature points, and the point cloud data corresponding to the image feature points in the second image is determined from them. At this point, the point cloud data corresponding to the image feature points in the selected second image has been determined.
The second image is then added to the image set as a new first image, updating the set. Based on the updated image set, the actions of selecting a second image and determining the point cloud data corresponding to its image feature points are repeated until every remaining scene image has been selected, processed, and added to the image set, at which point the set contains all scene images.
When determining the point cloud data corresponding to the image feature points in a second image from the second image feature points matched between the second image and each first image, every first image in the image set must be considered: when the initially selected number of first images is two, the first selected second image is matched against two first images, the second selected second image is matched against three first images, and so on.
In one embodiment, determining the point cloud data corresponding to the image feature points in the second image from the second image feature points matched between the second image and each first image includes:
splitting the second image feature points matched between the second image and each first image into first sub-feature points, whose corresponding point cloud data is already contained in the point cloud data corresponding to the first image, and second sub-feature points, whose corresponding point cloud data is not; and
projecting the second sub-feature points into three-dimensional space according to their image coordinates, based on point cloud triangulation, to obtain the point cloud data corresponding to the second sub-feature points.
Here, the procedure for determining the pose of the shooting viewpoint corresponding to the second image and the point cloud data corresponding to its image feature points is explained taking two first images as an example; with three or more first images the procedure is the same and is not repeated.
First, the image feature points matched between the second image and each first image are taken as the second image feature points. Then, for each first image, these points are split into first sub-feature points, whose point cloud data lies within the point cloud data corresponding to the first image, and second sub-feature points, whose point cloud data does not.
When the image set consists only of the two initially selected first images, the point cloud data corresponding to a first image is the point cloud data of its first image feature points; when the set also contains second images added later, it may be either the point cloud data of the first image feature points or the point cloud data already determined for the image feature points of an added second image. In either case, the second image feature points matched between the second image and each first image can be split into first and second sub-feature points according to the point cloud data corresponding to that first image: the point cloud data of the first sub-feature points has already been determined, while that of the second sub-feature points has not.
The second sub-feature points matched between the second image and each first image are then projected into three-dimensional space, based on point cloud triangulation, according to their image coordinates, which yields their corresponding point cloud data.
For the second image, the point cloud data of the first sub-feature points matched with each first image and the point cloud data of the second sub-feature points matched with each first image together form the point cloud data corresponding to the image feature points of the second image.
In one specific example, each scene image has 100 image feature points, numbered 1-100. In first image 1, the feature points numbered 1-50 have point cloud data; in first image 2, the feature points numbered 20-70 have point cloud data. For the second image, the second image feature points matched with first image 1 are numbered 30-60, of which points 30-50 are first sub-feature points and points 50-60 are second sub-feature points; the second image feature points matched with first image 2 are numbered 40-80, of which points 40-70 are first sub-feature points and points 70-80 are second sub-feature points.
Following the above procedure, the second image feature points numbered 50-60 and 70-80 are projected into three-dimensional space by point cloud triangulation, according to their image coordinates in the second image and in first image 1 and first image 2 respectively, giving their corresponding point cloud data.
For the second image, the point cloud data of the first sub-feature points numbered 30-50 matched with first image 1, the second sub-feature points numbered 50-60 matched with first image 1, the first sub-feature points numbered 40-70 matched with first image 2, and the second sub-feature points numbered 70-80 matched with first image 2 together form the point cloud data corresponding to the image feature points of the second image, i.e. the point cloud data of its feature points numbered 30-80.
After every remaining scene image has been selected as a second image and the pose of its shooting viewpoint and its point cloud data have been obtained through the above procedure, the point cloud data of the second object is constructed from the point cloud data corresponding to the first image feature points and the point cloud data corresponding to the image feature points in the remaining scene images. In one embodiment, this specifically includes:
taking the point cloud data corresponding to the first image feature points and the point cloud data corresponding to the image feature points in the remaining scene images as the optimization objects, and optimizing them by minimizing the reprojection error; and
combining the optimized point cloud data corresponding to the first image feature points and the optimized point cloud data corresponding to the image feature points in the remaining scene images into the point cloud data of the second object.
Specifically, with the point cloud data corresponding to the first image feature points and the point cloud data corresponding to the image feature points in the remaining scene images as the optimization objects, an energy function formed by the reprojection errors is constructed and minimized with the Gauss-Newton method or the LM algorithm, yielding the optimized optimization objects. The reprojection error is, as before, the error between a pixel's observed coordinates and the coordinates obtained by reprojecting the corresponding point cloud data back into the image according to that point cloud data and the corresponding pose.
The optimized point cloud data corresponding to the first image feature points and to the image feature points in the remaining scene images is then combined into the point cloud data of the second object.
In this embodiment, the point cloud data corresponding to the first image feature points and to the image feature points in the remaining scene images can thus be optimized, improving the accuracy of the point cloud data. Moreover, through the above procedure the point cloud data of the second object can be generated accurately from the scene images, preparing the ground for obtaining depth images with a consistent depth scale.
In one embodiment, the depth image corresponding to each scene image is generated from the pose of each second shooting viewpoint and the point cloud data of the second object as follows: the point cloud data of the second object is projected onto each second shooting viewpoint according to that viewpoint's pose, giving the point cloud data corresponding to each second shooting viewpoint, and the depth image corresponding to each scene image is then generated from that point cloud data and the viewpoint's pose.
Specifically, the point cloud data of the second object is projected onto each second shooting viewpoint through an existing algorithm according to the viewpoint's pose, and the resulting per-viewpoint point cloud data is converted into the depth image corresponding to the scene image of that viewpoint. If the point cloud data is sparse when it is converted into a depth image, the depth value of any pixel without point cloud data can be set to 0. A projection sketch is given below.
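A minimal sketch of projecting the scene point cloud into one viewpoint's depth image, assuming a pinhole camera with known intrinsics; for panoramic scene images an equirectangular projection would replace K. Pixels that receive no point keep depth 0, matching the sparse-cloud convention above.

```python
import numpy as np

def render_depth(points_world, pose_w2c, K, H, W):
    """Project the second object's point cloud into one viewpoint's depth map.

    points_world: (N, 3) scene point cloud; pose_w2c: (4, 4) world-to-camera
    transform of the second shooting viewpoint; K: (3, 3) intrinsics.
    """
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (pose_w2c @ pts_h.T)[:3]           # (3, N) camera coordinates
    z = pts_cam[2]
    in_front = z > 0
    uv = K @ pts_cam[:, in_front]                # perspective projection
    u = np.round(uv[0] / uv[2]).astype(int)
    v = np.round(uv[1] / uv[2]).astype(int)
    z = z[in_front]
    depth = np.zeros((H, W), dtype=np.float32)   # 0 = no point cloud data
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, zi in zip(u[inside], v[inside], z[inside]):
        # Keep the nearest point when several project to the same pixel.
        if depth[vi, ui] == 0 or zi < depth[vi, ui]:
            depth[vi, ui] = zi
    return depth
```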
The above flow details how the depth image corresponding to each scene image of the second object is generated. As it shows, this embodiment yields depth images whose depth values have a consistent scale, which improves the accuracy of the depth estimation results of the depth estimation model and, in turn, the accuracy of three-dimensional modeling.
In one embodiment, after the scene images of the second object and their corresponding depth images are obtained, the depth estimation model is trained by:
acquiring a depth estimation model whose initial training has been completed;
taking the scene images and their corresponding depth images as optimization training data for the initially trained depth estimation model; and
optimizing the initially trained depth estimation model with the optimization training data.
Because the training data obtained in this way is scale-consistent in depth, optimizing the depth estimation model on it and then using the model for depth estimation improves the consistency of the depth scale of the same object across depth images corresponding to images shot from different shooting viewpoints, and thereby improves the accuracy of three-dimensional modeling.
In this embodiment, the initially trained depth estimation model may have been trained on a large number of sample images and corresponding depth images, where the depth images can be collected by a structured-light device. Because each such depth image is acquired independently, the scale consistency of the depth of the same object across depth images is not taken into account, so the depths the model estimates for the same object in different panoramic images may be inconsistent, i.e. the model lacks scale consistency. The training data obtained above is therefore used as optimization training data for the initially trained model, to improve its scale consistency when estimating the depth of the same object in different panoramic images.
In this embodiment, the depth estimation model is optimized through the optimized training data. This specifically includes: inputting the optimized training data into the depth estimation model, and performing a preset number of iterations of learning on the optimized training data through the depth estimation model; and taking the depth estimation model after the iterative learning as the optimized depth estimation model.
The optimized training data include the scene image of each second shooting viewpoint and the depth image corresponding to that scene image. In an optional mode, during model optimization, the scene image of the first one of the second shooting viewpoints and its corresponding depth image are input into the depth estimation model for a preset number of iterations of learning, such as 10 iterations; then the scene image of the second one of the second shooting viewpoints and its corresponding depth image are input into the depth estimation model for the same preset number of iterations, such as 10 iterations; and so on, until the scene image of every second shooting viewpoint has undergone iterative learning. Alternatively, during model optimization, the scene images of all second shooting viewpoints and their corresponding depth images are input into the depth estimation model together, and iterative learning is performed for a preset number of iterations, such as 10 iterations. The depth estimation model after the iterative learning is taken as the optimized depth estimation model. The optimized depth estimation model not only improves scale consistency when estimating the depth of the same object in different panoramic images, but also inherits the original depth estimation capability and generalization ability.
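As an illustration of the second, joint strategy, the following is a minimal PyTorch-style sketch under the assumption that the initially trained model is an ordinary image-to-depth network and that the optimized training data are available as (scene image, depth image) pairs; the masked L1 loss, batch size, learning rate, and the iteration count of 10 are illustrative assumptions rather than values prescribed by this embodiment.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def optimize_depth_model(model, scene_dataset, num_iterations=10, lr=1e-5):
    """Fine-tune the initially trained depth estimation model on the optimized training data.

    scene_dataset yields (scene_image, depth_image) pairs, one pair per second
    shooting viewpoint; depth pixels equal to 0 carry no point cloud data and
    are therefore masked out of the loss (an assumption of this sketch).
    """
    loader = DataLoader(scene_dataset, batch_size=4, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()

    for _ in range(num_iterations):            # preset number of iterative-learning passes
        for images, depths in loader:
            pred = model(images)               # assumed to output depth at the same resolution
            mask = depths > 0                  # ignore pixels without point cloud support
            loss = F.l1_loss(pred[mask], depths[mask])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model                               # the optimized depth estimation model
```

The per-viewpoint mode described above would simply run the inner loop on a single-viewpoint dataset for each second shooting viewpoint in turn.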
Based on the above image depth estimation method, an embodiment of the present disclosure further provides an image depth estimation device for implementing the above image depth estimation method. Fig. 2 is a schematic structural diagram of an image depth estimation device according to an embodiment of the present disclosure, as shown in fig. 2, the device includes:
an image acquisition unit 21 for acquiring a plurality of images to be processed of a first object; each image to be processed in the plurality of images to be processed corresponds to each first shooting viewpoint in the first object one by one;
A depth estimation unit 22, configured to perform depth estimation on each of the plurality of images to be processed through a depth estimation model, so as to obtain a depth image corresponding to each of the images to be processed;
The training data of the depth estimation model comprises a plurality of scene images of a second object and depth images corresponding to each scene image; each scene image corresponds to each second shooting viewpoint in the second object one by one; and obtaining a depth image corresponding to each scene image based on the point cloud data of the second object.
Optionally, the device further comprises a modeling unit configured to:
After carrying out depth estimation on each to-be-processed image in the plurality of to-be-processed images to obtain a depth image corresponding to each to-be-processed image, generating point cloud data corresponding to each first shooting viewpoint according to the depth image corresponding to each to-be-processed image; and constructing a three-dimensional model of the first object according to the point cloud data corresponding to each first shooting viewpoint.
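For illustration, a minimal sketch of the back-projection from a depth image to a point cloud is given below, again assuming a pinhole model with known intrinsics and viewpoint pose; the names are illustrative, and how the per-viewpoint point clouds are fused into the three-dimensional model (for example TSDF fusion or Poisson surface reconstruction) is left open here.

```python
import numpy as np

def depth_image_to_point_cloud(depth, K, R, t):
    """Back-project the depth image of one first shooting viewpoint into a world-frame point cloud.

    depth: (H, W) depth image predicted by the depth estimation model (0 marks invalid pixels).
    K: (3, 3) intrinsics; R, t: world-to-camera pose of the first shooting viewpoint.
    Returns an (N, 3) array of point cloud data for this viewpoint.
    """
    v, u = np.nonzero(depth > 0)                       # pixel coordinates with valid depth
    z = depth[v, u].astype(np.float64)
    pixels = np.stack([u, v, np.ones_like(u)], axis=1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T                 # normalized camera rays
    points_cam = rays * z[:, None]                     # scale each ray by its depth
    points_world = (points_cam - t) @ R                # invert p_cam = R @ p_world + t
    return points_world
```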
Optionally, the device further comprises a depth map generating unit configured to:
generating a pose of each second shooting viewpoint according to the plurality of scene images;
Generating point cloud data of the second object according to the plurality of scene images;
And generating a depth image corresponding to each scene image according to the pose of each second shooting viewpoint and the point cloud data of the second object.
Optionally, the depth map generating unit is specifically configured to:
Selecting a plurality of first images meeting first image matching requirements from the plurality of scene images, wherein the plurality of first images form an image set;
Determining the pose of the second shooting view point corresponding to each first image;
Determining the pose of the second shooting view point corresponding to the rest scene images based on the image feature matching relation between each first image and the rest scene images; the rest scene images are scene images except for each first image in the plurality of scene images;
And determining the pose of each second shooting viewpoint according to the pose of the second shooting viewpoint corresponding to each first image and the poses of the second shooting viewpoints corresponding to the rest scene images.
Optionally, the depth map generating unit is further specifically configured to:
determining the relative pose between the second shooting viewpoints corresponding to the first images according to the image coordinates of the first image feature points matched with each other of the first images;
And setting the pose of the second shooting view point corresponding to any one of the first images, and determining the pose of the second shooting view point corresponding to the rest of the first images according to the set pose and the relative pose.
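As a hedged illustration of recovering the relative pose between two first images from their mutually matched first image feature points, the following sketch uses OpenCV's essential-matrix estimation; it assumes pinhole images with shared intrinsics K, and the recovered translation is defined only up to scale, which is why the pose of one viewpoint is fixed as described above.

```python
import cv2
import numpy as np

def relative_pose_from_matches(pts1, pts2, K):
    """Relative pose between the viewpoints of two first images from matched feature coordinates.

    pts1, pts2: (N, 2) image coordinates of the mutually matched first image feature points.
    K: shared (3, 3) pinhole intrinsics (assumption).
    """
    pts1 = np.asarray(pts1, dtype=np.float64)
    pts2 = np.asarray(pts2, dtype=np.float64)
    E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)
    return R, t   # pose of the second viewpoint relative to the first, translation up to scale
```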
Optionally, the depth map generating unit is further specifically configured to:
Selecting a second image which meets the second image matching requirement in the image matching relation between the rest scene images and any one of the first images;
Determining the pose of the second shooting viewpoint corresponding to the second image according to the second image characteristic points matched with each other between the second image and each first image;
adding the processed second image as a new first image to the image set;
and circularly executing the actions of selecting a second image and determining the pose of the second shooting view point corresponding to the second image based on the increased image set until all the rest scene images are selected.
Optionally, the depth map generating unit is further specifically configured to:
Splitting the second image and the second image characteristic points matched with each other between the first images into first sub-characteristic points and second sub-characteristic points respectively; the point cloud data corresponding to the first sub-feature point are located in the point cloud data corresponding to the first image; the point cloud data corresponding to the second sub-feature points are not located in the point cloud data corresponding to the first image;
And determining the pose of the second shooting viewpoint corresponding to the second image according to the point cloud data and the image coordinates corresponding to the first sub-feature points matched with each other between the second image and each first image.
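As an illustration of determining the pose of the second image's viewpoint from the first sub-feature points (a perspective-n-point problem on 3D-to-2D correspondences), the following sketch uses OpenCV's RANSAC PnP solver; the pinhole assumption and the variable names are illustrative.

```python
import cv2
import numpy as np

def pose_from_first_sub_feature_points(points_3d, image_points, K):
    """Pose of the second image's viewpoint from its first sub-feature points (a PnP problem).

    points_3d: (N, 3) point cloud data already reconstructed for the matched
               first sub-feature points.
    image_points: (N, 2) coordinates of those sub-feature points in the second image.
    K: (3, 3) pinhole intrinsics (assumption).
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, dtype=np.float64),
        np.asarray(image_points, dtype=np.float64),
        K, distCoeffs=None)
    R, _ = cv2.Rodrigues(rvec)          # rotation vector -> rotation matrix
    return R, tvec.ravel()              # world-to-camera pose of the second image's viewpoint
```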
Optionally, the depth map generating unit is further specifically configured to:
Taking the pose of the second shooting view point corresponding to each first image and the poses of the second shooting view points corresponding to the rest scene images as optimization objects, and optimizing the optimization objects by solving the mode that the re-projection error is minimum;
And taking the optimized pose of the second shooting view point corresponding to each first image and the poses of the second shooting view points corresponding to the rest scene images as the pose of each second shooting view point.
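The optimization by minimizing the re-projection error is essentially a bundle-adjustment step. The following is a minimal sketch of the re-projection residual and of refining the viewpoint poses with SciPy's least-squares solver, assuming rotation vectors plus translations as the pose parameterization; the point cloud data could be appended to the parameter vector in the same way to jointly optimize points, as in the point cloud optimization described further below.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(pose_params, points_3d, observations, K):
    """Residuals between observed feature coordinates and reprojected point cloud data.

    pose_params: flattened per-viewpoint poses, 6 values each
                 (rotation vector + translation), i.e. the optimization object.
    observations: list of (viewpoint_index, point_index, u, v) tuples recording
                  where each matched feature point was observed.
    """
    poses = pose_params.reshape(-1, 6)
    residuals = []
    for cam_idx, pt_idx, u, v in observations:
        R = Rotation.from_rotvec(poses[cam_idx, :3]).as_matrix()
        p_cam = R @ points_3d[pt_idx] + poses[cam_idx, 3:]
        proj = K @ p_cam
        residuals.append(proj[0] / proj[2] - u)
        residuals.append(proj[1] / proj[2] - v)
    return np.asarray(residuals)

# Refine all second-shooting-viewpoint poses at once (illustrative usage):
# result = least_squares(reprojection_residuals, initial_poses.ravel(),
#                        args=(points_3d, observations, K))
# optimized_poses = result.x.reshape(-1, 6)
```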
Optionally, the depth map generating unit is specifically configured to:
Selecting a plurality of first images meeting first image matching requirements from the plurality of scene images, wherein the plurality of first images form an image set;
generating point cloud data corresponding to the first image feature points matched with each other of each first image;
Determining point cloud data corresponding to image feature points in the rest scene images based on an image feature matching relation between each first image and the rest scene images; the rest scene images are scene images except for each first image in the plurality of scene images;
And constructing the point cloud data of the second object based on the point cloud data corresponding to the first image feature points and the point cloud data corresponding to the image feature points in the rest scene images.
Optionally, the depth map generating unit is further specifically configured to:
based on a point cloud triangulation technology, according to the image coordinates of the first image feature points, the first image feature points are projected to a three-dimensional space, and point cloud data corresponding to the first image feature points are obtained.
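As an illustration of point cloud triangulation, the following is a minimal direct-linear-transform (DLT) sketch for one pair of mutually matched first image feature points, assuming known projection matrices for the two viewpoints; production systems typically follow this linear solution with a non-linear refinement.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one pair of mutually matched first image feature points.

    P1, P2: (3, 4) projection matrices K @ [R | t] of the two viewpoints.
    x1, x2: (u, v) image coordinates of the matched feature point in each image.
    Returns the 3D point whose projections best explain the match.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous 3D point is the right singular vector for the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```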
Optionally, the depth map generating unit is further specifically configured to:
Selecting a second image which meets the second image matching requirement in the image matching relation between the rest scene images and any one of the first images;
According to the second image and the second image feature points matched with each other between the first images, determining point cloud data corresponding to the image feature points in the second images;
adding the processed second image as a new first image to the image set;
and circularly executing the actions of selecting a second image and determining point cloud data corresponding to image feature points in the second image based on the increased image set until all the rest scene images are selected.
Optionally, the depth map generating unit is further specifically configured to:
Splitting the second image and the second image characteristic points matched with each other between the first images into first sub-characteristic points and second sub-characteristic points respectively; the point cloud data corresponding to the first sub-feature point are located in the point cloud data corresponding to the first image; the point cloud data corresponding to the second sub-feature points are not located in the point cloud data corresponding to the first image;
And based on a point cloud triangulation technology, projecting the second sub-feature points into a three-dimensional space according to the image coordinates of the second sub-feature points to obtain point cloud data corresponding to the second sub-feature points.
Optionally, the depth map generating unit is further specifically configured to:
Taking point cloud data corresponding to the first image feature points and point cloud data corresponding to the image feature points in the rest scene images as optimization objects, and optimizing the optimization objects by solving the mode that the re-projection error is minimum;
and combining the optimized point cloud data corresponding to the first image feature points and the point cloud data corresponding to the image feature points in the rest scene images into the point cloud data of the second object.
Optionally, the depth map generating unit is specifically configured to:
according to the pose of each second shooting viewpoint, the point cloud data of the second object are projected to each second shooting viewpoint, and the point cloud data corresponding to each second shooting viewpoint are obtained;
and generating a depth image corresponding to each scene image according to the point cloud data corresponding to each second shooting viewpoint and the pose of each second shooting viewpoint.
Optionally, the device further comprises a training unit configured to:
Acquiring a depth estimation model after initial training is completed; and
Taking each scene image and a depth image corresponding to each scene image as optimized training data of a depth estimation model after the initial training is completed;
and optimizing the depth estimation model after the initial training through the optimized training data.
Optionally, the training unit is specifically configured to:
Inputting the optimized training data into the depth estimation model after initial training, and performing iterative learning for preset times on the optimized data through the depth estimation model after initial training;
And taking the depth estimation model after iterative learning as the depth estimation model after optimization.
It should be noted that, the image depth estimation device in this embodiment can implement the processes of the foregoing image depth estimation method embodiment, and achieve the same effects and functions, which are not repeated here.
An embodiment of the present disclosure further provides an electronic device. Fig. 3 is a schematic structural diagram of the electronic device provided in the embodiment of the present disclosure. As shown in Fig. 3, the electronic device may vary considerably in configuration and performance, and may include one or more processors 301 and a memory 302, where the memory 302 may store one or more application programs or data. The memory 302 may be transient storage or persistent storage. The application programs stored in the memory 302 may include one or more modules (not shown), each of which may include a series of computer-executable instructions for the electronic device. Still further, the processor 301 may be configured to communicate with the memory 302 and execute the series of computer-executable instructions in the memory 302 on the electronic device. The electronic device may also include one or more power supplies 303, one or more wired or wireless network interfaces 304, one or more input or output interfaces 305, one or more keyboards 306, and the like.
In a specific embodiment, the electronic device is an image depth estimation device, specifically a background server, and includes a processor; and a memory configured to store computer-executable instructions that, when executed, cause the processor to implement the following:
Acquiring a plurality of images to be processed of a first object; each image to be processed in the plurality of images to be processed corresponds to each first shooting viewpoint in the first object one by one;
Performing depth estimation on each to-be-processed image in the plurality of to-be-processed images through a depth estimation model to obtain a depth image corresponding to each to-be-processed image;
The training data of the depth estimation model comprises a plurality of scene images of a second object and depth images corresponding to each scene image; each scene image corresponds to each second shooting viewpoint in the second object one by one; and obtaining a depth image corresponding to each scene image based on the point cloud data of the second object.
The image depth estimation apparatus in the embodiments of the present specification can realize the respective processes of the foregoing embodiments of the image depth estimation method and achieve the same effects and functions, and are not repeated here.
Another embodiment of the present specification also provides a computer-readable storage medium for storing computer-executable instructions that, when executed by a processor, implement the following flow:
Acquiring a plurality of images to be processed of a first object; each image to be processed in the plurality of images to be processed corresponds to each first shooting viewpoint in the first object one by one;
Performing depth estimation on each to-be-processed image in the plurality of to-be-processed images through a depth estimation model to obtain a depth image corresponding to each to-be-processed image;
The training data of the depth estimation model comprises a plurality of scene images of a second object and depth images corresponding to each scene image; each scene image corresponds to each second shooting viewpoint in the second object one by one; and obtaining a depth image corresponding to each scene image based on the point cloud data of the second object.
The storage medium in the embodiments of the present specification can implement the respective processes of the foregoing embodiments of the image depth estimation method and achieve the same effects and functions, and are not repeated here.
In various embodiments of the present disclosure, the computer-readable storage medium includes a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and so on.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must be written in a particular programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can be readily obtained merely by slightly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, or embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely in computer-readable program code, it is entirely possible to achieve the same functionality by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even the means for achieving various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each unit may be implemented in the same piece or pieces of software and/or hardware when implementing the embodiments of the present specification.
One skilled in the relevant art will recognize that one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
One or more embodiments of the present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, reference may be made to the corresponding parts of the description of the method embodiments.
The foregoing description is by way of example only and is not intended to limit the present disclosure. Various modifications and changes may occur to those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. that fall within the spirit and principles of the present document are intended to be included within the scope of the claims of the present document.

Claims (19)

1. An image depth estimation method, comprising:
Acquiring a plurality of images to be processed of a first object; each image to be processed in the plurality of images to be processed corresponds to each first shooting viewpoint in the first object one by one;
Performing depth estimation on each to-be-processed image in the plurality of to-be-processed images through a depth estimation model to obtain a depth image corresponding to each to-be-processed image;
The training data of the depth estimation model comprises a plurality of scene images of a second object and depth images corresponding to each scene image; each scene image corresponds to each second shooting viewpoint in the second object one by one; and obtaining a depth image corresponding to each scene image based on the point cloud data of the second object.
2. The method according to claim 1, further comprising, after performing depth estimation on each of the plurality of images to be processed to obtain a depth image corresponding to each of the images to be processed:
generating point cloud data corresponding to each first shooting viewpoint according to the depth image corresponding to each image to be processed;
And constructing a three-dimensional model of the first object according to the point cloud data corresponding to each first shooting viewpoint.
3. The method according to claim 1, wherein the depth image corresponding to the scene image is determined by:
generating a pose of each second shooting viewpoint according to the plurality of scene images;
Generating point cloud data of the second object according to the plurality of scene images;
And generating a depth image corresponding to each scene image according to the pose of each second shooting viewpoint and the point cloud data of the second object.
4. A method according to claim 3, wherein said generating a pose for each of said second shooting viewpoints from said plurality of scene images comprises:
Selecting a plurality of first images meeting first image matching requirements from the plurality of scene images, wherein the plurality of first images form an image set;
Determining the pose of the second shooting view point corresponding to each first image;
Determining the pose of the second shooting view point corresponding to the rest scene images based on the image feature matching relation between each first image and the rest scene images; the rest scene images are scene images except for each first image in the plurality of scene images;
And determining the pose of each second shooting viewpoint according to the pose of the second shooting viewpoint corresponding to each first image and the poses of the second shooting viewpoints corresponding to the rest scene images.
5. The method of claim 4, wherein determining the pose of the second shooting viewpoint corresponding to each of the first images comprises:
determining the relative pose between the second shooting viewpoints corresponding to the first images according to the image coordinates of the first image feature points matched with each other of the first images;
And setting the pose of the second shooting view point corresponding to any one of the first images, and determining the pose of the second shooting view point corresponding to the rest of the first images according to the set pose and the relative pose.
6. The method of claim 4, wherein determining the pose of the second shooting viewpoint corresponding to the remaining scene images based on the image feature matching relationship between each of the first images and the remaining scene images comprises:
Selecting a second image which meets the second image matching requirement in the image matching relation between the rest scene images and any one of the first images;
Determining the pose of the second shooting viewpoint corresponding to the second image according to the second image characteristic points matched with each other between the second image and each first image;
adding the processed second image as a new first image to the image set;
and circularly executing the actions of selecting a second image and determining the pose of the second shooting view point corresponding to the second image based on the increased image set until all the rest scene images are selected.
7. The method according to claim 6, wherein determining the pose of the second shooting viewpoint corresponding to the second image according to the second image feature points matched with each of the first images, includes:
Splitting the second image and the second image characteristic points matched with each other between the first images into first sub-characteristic points and second sub-characteristic points respectively; the point cloud data corresponding to the first sub-feature point are located in the point cloud data corresponding to the first image; the point cloud data corresponding to the second sub-feature points are not located in the point cloud data corresponding to the first image;
And determining the pose of the second shooting viewpoint corresponding to the second image according to the point cloud data and the image coordinates corresponding to the first sub-feature points matched with each other between the second image and each first image.
8. The method of claim 4, wherein determining the pose of each of the second shooting viewpoints according to the pose of the second shooting viewpoint corresponding to each of the first images and the pose of the second shooting viewpoints corresponding to the remaining scene images comprises:
Taking the pose of the second shooting view point corresponding to each first image and the poses of the second shooting view points corresponding to the rest scene images as optimization objects, and optimizing the optimization objects by solving the mode that the re-projection error is minimum;
And taking the optimized pose of the second shooting view point corresponding to each first image and the poses of the second shooting view points corresponding to the rest scene images as the pose of each second shooting view point.
9. A method according to claim 3, wherein said generating point cloud data for said second object from said plurality of scene images comprises:
Selecting a plurality of first images meeting first image matching requirements from the plurality of scene images, wherein the plurality of first images form an image set;
generating point cloud data corresponding to the first image feature points matched with each other of each first image;
Determining point cloud data corresponding to image feature points in the rest scene images based on an image feature matching relation between each first image and the rest scene images; the rest scene images are scene images except for each first image in the plurality of scene images;
And constructing the point cloud data of the second object based on the point cloud data corresponding to the first image feature points and the point cloud data corresponding to the image feature points in the rest scene images.
10. The method of claim 9, wherein generating the point cloud data corresponding to the first image feature points of each of the first images that match each other comprises:
based on a point cloud triangulation technology, according to the image coordinates of the first image feature points, the first image feature points are projected to a three-dimensional space, and point cloud data corresponding to the first image feature points are obtained.
11. The method according to claim 9, wherein the determining the point cloud data corresponding to the image feature points in the remaining scene images based on the image feature matching relationship between each of the first images and the remaining scene images includes:
Selecting a second image which meets the second image matching requirement in the image matching relation between the rest scene images and any one of the first images;
According to the second image and the second image feature points matched with each other between the first images, determining point cloud data corresponding to the image feature points in the second images;
adding the processed second image as a new first image to the image set;
and circularly executing the actions of selecting a second image and determining point cloud data corresponding to image feature points in the second image based on the increased image set until all the rest scene images are selected.
12. The method according to claim 11, wherein determining the point cloud data corresponding to the image feature points in the second image according to the second image feature points matched with each of the first images, includes:
Splitting the second image and the second image characteristic points matched with each other between the first images into first sub-characteristic points and second sub-characteristic points respectively; the point cloud data corresponding to the first sub-feature point are located in the point cloud data corresponding to the first image; the point cloud data corresponding to the second sub-feature points are not located in the point cloud data corresponding to the first image;
And based on a point cloud triangulation technology, projecting the second sub-feature points into a three-dimensional space according to the image coordinates of the second sub-feature points to obtain point cloud data corresponding to the second sub-feature points.
13. The method of claim 9, wherein the constructing the point cloud data of the second object based on the point cloud data corresponding to the first image feature point and the point cloud data corresponding to the image feature points in the remaining scene images comprises:
Taking point cloud data corresponding to the first image feature points and point cloud data corresponding to the image feature points in the rest scene images as optimization objects, and optimizing the optimization objects by solving the mode that the re-projection error is minimum;
and combining the optimized point cloud data corresponding to the first image feature points and the point cloud data corresponding to the image feature points in the rest scene images into the point cloud data of the second object.
14. A method according to claim 3, wherein generating a depth image corresponding to each scene image from the pose of each second shooting viewpoint and the point cloud data of the second object comprises:
according to the pose of each second shooting viewpoint, the point cloud data of the second object are projected to each second shooting viewpoint, and the point cloud data corresponding to each second shooting viewpoint are obtained;
and generating a depth image corresponding to each scene image according to the point cloud data corresponding to each second shooting viewpoint and the pose of each second shooting viewpoint.
15. The method of claim 1, wherein the depth estimation model is trained by:
Acquiring a depth estimation model after initial training is completed; and
Taking each scene image and a depth image corresponding to each scene image as optimized training data of a depth estimation model after the initial training is completed;
and optimizing the depth estimation model after the initial training through the optimized training data.
16. The method of claim 15, wherein optimizing the initial trained depth estimation model via the optimized training data comprises:
Inputting the optimized training data into the depth estimation model after initial training, and performing iterative learning for preset times on the optimized data through the depth estimation model after initial training;
And taking the depth estimation model after iterative learning as the depth estimation model after optimization.
17. An image depth estimation apparatus, comprising:
an image acquisition unit configured to acquire a plurality of images to be processed of a first object; each image to be processed in the plurality of images to be processed corresponds to each first shooting viewpoint in the first object one by one;
The depth estimation unit is used for carrying out depth estimation on each to-be-processed image in the plurality of to-be-processed images through a depth estimation model to obtain a depth image corresponding to each to-be-processed image;
The training data of the depth estimation model comprises a plurality of scene images of a second object and depth images corresponding to each scene image; each scene image corresponds to each second shooting viewpoint in the second object one by one; and obtaining a depth image corresponding to each scene image based on the point cloud data of the second object.
18. An electronic device, comprising:
A processor; and
A memory configured to store computer-executable instructions that, when executed, cause the processor to perform the steps of the method of any of the preceding claims 1-16.
19. A computer readable storage medium for storing computer executable instructions which when executed by a processor implement the steps of the method of any of the preceding claims 1-16.
CN202211676655.1A 2022-12-26 2022-12-26 Image depth estimation method, device, electronic equipment and storage medium Pending CN118261955A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211676655.1A CN118261955A (en) 2022-12-26 2022-12-26 Image depth estimation method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211676655.1A CN118261955A (en) 2022-12-26 2022-12-26 Image depth estimation method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118261955A true CN118261955A (en) 2024-06-28

Family

ID=91609905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211676655.1A Pending CN118261955A (en) 2022-12-26 2022-12-26 Image depth estimation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118261955A (en)

Similar Documents

Publication Publication Date Title
CN112819944B (en) Three-dimensional human body model reconstruction method and device, electronic equipment and storage medium
CN109753910B (en) Key point extraction method, model training method, device, medium and equipment
CN112348885B (en) Construction method, visual positioning method, device and storage medium of visual feature library
CN110648397A (en) Scene map generation method and device, storage medium and electronic equipment
CN116188632A (en) Image generation method and device, storage medium and electronic equipment
CN111028279A (en) Point cloud data processing method and device, electronic equipment and storage medium
CN109697446B (en) Image key point extraction method and device, readable storage medium and electronic equipment
CN117635822A (en) Model training method and device, storage medium and electronic equipment
CN112017304B (en) Method, apparatus, electronic device and medium for presenting augmented reality data
CN116330306B (en) Object grabbing method and device, storage medium and electronic equipment
CN117173002A (en) Model training, image generation and information extraction methods and devices and electronic equipment
CN116245961B (en) Fusion sensing method and system based on multi-class sensor information
CN115187307B (en) Advertisement putting processing method and device for virtual world
CN118261955A (en) Image depth estimation method, device, electronic equipment and storage medium
CN109040612B (en) Image processing method, device and equipment of target object and storage medium
CN114944015A (en) Image processing method and device, electronic equipment and storage medium
CN118279366A (en) Image depth estimation method, device, equipment and storage medium
CN115731375B (en) Method and device for updating virtual image
CN111292365A (en) Method, device, electronic equipment and computer readable medium for generating depth map
CN113128277A (en) Generation method of face key point detection model and related equipment
CN115953706B (en) Virtual image processing method and device
CN112767484B (en) Fusion method of positioning model, positioning method and electronic device
CN117726907B (en) Training method of modeling model, three-dimensional human modeling method and device
CN115830196B (en) Virtual image processing method and device
CN116797863A (en) Sample generation, model training and material retrieval methods and devices, and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination