
Three-dimensional image reconstruction method, three-dimensional image reconstruction device, computer equipment and storage medium

Info

Publication number
CN117876608A
CN117876608A
Authority
CN
China
Prior art keywords
dimensional image
image reconstruction
images
common
pixels
Prior art date
Legal status
Pending
Application number
CN202410270409.9A
Other languages
Chinese (zh)
Inventor
谢子锐
张如高
虞正华
Current Assignee
Magic Vision Intelligent Technology Wuhan Co ltd
Original Assignee
Magic Vision Intelligent Technology Wuhan Co ltd
Priority date
Filing date
Publication date
Application filed by Magic Vision Intelligent Technology Wuhan Co ltd
Priority to CN202410270409.9A
Publication of CN117876608A

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the technical field of image processing, and discloses a three-dimensional image reconstruction method, a three-dimensional image reconstruction device, computer equipment, and a storage medium.

Description

Three-dimensional image reconstruction method, three-dimensional image reconstruction device, computer equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a three-dimensional image reconstruction method, apparatus, computer device, and storage medium.
Background
In recent years, with the vigorous development of autonomous driving technology and business, three-dimensional reconstruction has been applied ever more widely in autonomous driving scenes.
In the current state of the art, neural radiance fields (Neural Radiance Fields, NeRF) are introduced for purely visual three-dimensional reconstruction. The traditional NeRF is a three-dimensional reconstruction method based on deep learning; it relies on a large amount of data to train a model, generally annotates three-dimensional information through a three-dimensional mesh or point cloud model, and is widely used.
Autonomous driving scenes are mostly open scenes. The traditional NeRF focuses on the central objects of a scene; however, many objects in the scene appear at the two sides of the picture with low appearance frequency, and such objects cannot be reconstructed by the traditional NeRF alone, so the reconstruction accuracy is low. In addition, the traditional NeRF uses only the pixel colors of the image as supervision; it focuses more on the reconstructed rendered image rather than on the accuracy of the three-dimensional surface reconstruction, and owing to this lack of constraint the model converges slowly, which greatly increases the training cost of the model.
Disclosure of Invention
In view of the above, the present invention provides a three-dimensional image reconstruction method, apparatus, computer device and storage medium, so as to solve the problems of low image reconstruction accuracy and high training cost of the conventional model.
In a first aspect, the present invention provides a three-dimensional image reconstruction method, the method comprising:
Acquiring multiple groups of images of a target scene, wherein each group of images comprises a pair of left view images and right view images which are shot at the same time;
extracting common view areas between left view images and right view images in each group of images, and acquiring pixel information of each common view area; the pixel information comprises azimuth vectors, depths, colors, semantic tags of pixels in the common area and directed distances between the pixels and the surface of the nearest object;
training an initial three-dimensional image reconstruction model based on the common view area and pixel information of the common view area of each group of images to obtain a trained three-dimensional image reconstruction model; wherein the loss function during training is obtained based on the color of the pixel, the semantic label, and the directed distance;
and carrying out three-dimensional image reconstruction on the target scene based on the trained three-dimensional image reconstruction model.
According to the method, multiple groups of left view images and right view images of a target scene are obtained, the pixel information of the common view area between each group of left and right view images is extracted, and unnecessary pixels in the picture are filtered out. An initial three-dimensional image reconstruction model is then trained based on the pixel information, where the loss function in the training process is obtained based on the colors, semantic labels, and directed distances of the pixels, which accelerates the convergence of the model, reduces the training cost, and improves the model precision. Finally, three-dimensional image reconstruction is performed on the target scene based on the trained three-dimensional image reconstruction model, obtaining a surface model of the target scene.
In an alternative embodiment, the training process of the three-dimensional image reconstruction model includes:
sampling pixels in the common region based on azimuth vectors and depths of the pixels in the common region, and performing position coding on the obtained multiple pixel sampling points to obtain implicit characterization of each pixel sampling point;
based on the implicit characterization of each pixel sampling point, obtaining a color predicted value, a semantic tag predicted value and a directed distance predicted value corresponding to each pixel sampling point;
calculating the value of a loss function according to the loss of the color predicted value of each pixel sampling point, the loss of the semantic label predicted value and the loss of the directional distance predicted value;
and updating parameters of the three-dimensional image reconstruction model based on the value of the loss function to obtain the trained three-dimensional image reconstruction model.
The method comprises the steps of obtaining predicted values corresponding to colors, semantic tags and directed distances of pixels, calculating a loss function based on the colors, the semantic tags and the directed distances of the pixels, training an initial three-dimensional image reconstruction model, accelerating convergence of the model, reducing training cost and improving model accuracy.
In an alternative embodiment, sampling pixels in the common area based on the azimuth vector and the depth of the pixels in the common area, and performing position coding on the obtained plurality of pixel sampling points to obtain implicit characterization of each pixel sampling point, including:
sampling pixels whose azimuth vectors lie on the same ray, along the direction of increasing depth;
and carrying out hash coding based on the obtained positions of the plurality of pixel sampling points to obtain implicit characterization of each pixel sampling point.
Thus, the implicit representation of the sampling point is obtained by carrying out position coding on the sampling point so as to predict other characteristic values of the sampling point later.
In an alternative embodiment, based on the trained three-dimensional image reconstruction model, performing three-dimensional image reconstruction on the target scene includes:
uniformly sampling pixels in a target common area corresponding to a target scene to obtain a plurality of target pixel sampling points;
inputting a plurality of target pixel sampling points into a trained three-dimensional image reconstruction model to obtain target directional distance predicted values of the target pixel sampling points;
and reconstructing a three-dimensional image of the target scene based on the target directional distance predicted value of each target pixel sampling point.
The distance between each point and the surface can be obtained through the SDF predicted value of each target pixel sampling point in the target scene, and then the surface model of the target scene is obtained, so that three-dimensional image reconstruction is carried out on the target scene.
In an alternative embodiment, acquiring multiple sets of images of a target scene includes:
a plurality of left view images of the target scene are acquired by the first image acquisition device, and a plurality of right view images of the target scene are acquired by the second image acquisition device.
Therefore, the accuracy of the model is improved by acquiring the multi-view image of the target scene.
In an alternative embodiment, extracting the common view area between the left view image and the right view image in each group of images, and acquiring the pixel information of each common view area includes:
and carrying out coordinate system conversion on the pixels in each common region based on the pose, the internal parameters and the external parameters respectively corresponding to the first image acquisition device and the second image acquisition device, so as to obtain the azimuth vector of the pixels in each common region in the world coordinate system.
In an alternative embodiment, extracting the common view area between the left view image and the right view image in each group of images, and acquiring the pixel information of each common view area, further includes:
inputting left view images and right view images in each group of images into a preset neural network to obtain optical flow data of each common view area; the optical flow data includes correspondence between pixels in the left view image and pixels in the right view image;
Based on external parameters respectively corresponding to the first image acquisition device and the second image acquisition device, obtaining a relative position relationship between the first image acquisition device and the second image acquisition device;
and obtaining the depth of the pixels in each common view area based on the optical flow data and the relative position relation of each common view area.
In a second aspect, the present invention provides a three-dimensional image reconstruction apparatus comprising:
the acquisition module is used for acquiring a plurality of groups of images of the target scene, wherein each group of images comprises a pair of left view images and right view images which are shot at the same time;
the first processing module is used for extracting common view areas between left view images and right view images in each group of images and acquiring pixel information of each common view area; the pixel information comprises azimuth vectors, depths, colors, semantic tags of pixels in the common area and directed distances between the pixels and the surface of the nearest object;
the second processing module is used for training a preset three-dimensional image reconstruction model based on the common view area and the pixel information of the common view area of each group of images to obtain a trained three-dimensional image reconstruction model; wherein the loss function during training is obtained based on the color of the pixel, the semantic label, and the directed distance;
And the third processing module is used for reconstructing the three-dimensional image of the target scene based on the trained three-dimensional image reconstruction model.
In a third aspect, the present invention provides a computer device, comprising: a memory and a processor communicatively connected to each other, the memory storing computer instructions, and the processor executing the computer instructions so as to perform the three-dimensional image reconstruction method of the first aspect or any embodiment corresponding to the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon computer instructions for causing a computer to execute the three-dimensional image reconstruction method of the first aspect or any of the embodiments corresponding thereto.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a three-dimensional image reconstruction method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another three-dimensional image reconstruction method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a hash encoding process according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multi-layer perceptron network, in accordance with an embodiment of the present invention;
FIG. 5 is a flow chart of yet another three-dimensional image reconstruction method according to an embodiment of the present invention;
FIG. 6 is a schematic application diagram of a three-dimensional image reconstruction method according to an embodiment of the present invention;
fig. 7 is a block diagram of a three-dimensional image reconstruction apparatus according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In recent years, with the vigorous development of autonomous driving technology and business, three-dimensional reconstruction has been applied ever more widely in autonomous driving scenes. In the current state of the art, neural radiance fields (Neural Radiance Fields, NeRF) are introduced for purely visual three-dimensional reconstruction. The traditional NeRF is a three-dimensional reconstruction method based on deep learning; it relies on a large amount of data to train a model, generally annotates three-dimensional information through a three-dimensional mesh or point cloud model, and is widely used.
In an autonomous driving scenario, three-dimensional reconstruction based on deep learning relies heavily on large amounts of data to train the network. The ground truth of the training data often depends on a large amount of manual labeling, a process that is costly, inefficient, and of low labeling accuracy. On a purely visual platform in particular, the traditional manual labeling method cannot, in principle, label three-dimensional information (such as depth or three-dimensional object detection); one can only rely on introducing additional sensors that provide three-dimensional information, or on repeated re-projection inspection and correction. Either method greatly increases the labeling cost and fails to achieve good accuracy.
Autonomous driving scenes are mostly open scenes. The traditional NeRF focuses on the central objects of a scene; however, many objects in the scene appear at the two sides of the picture with low appearance frequency, and such objects cannot be reconstructed by the traditional NeRF alone, so the reconstruction accuracy is low. In addition, the traditional NeRF uses only the pixel colors of the image as supervision; it focuses more on the reconstructed rendered image rather than on the accuracy of the three-dimensional surface reconstruction, and owing to this lack of constraint the model converges slowly, which greatly increases the training cost of the model.
Therefore, the embodiment of the invention provides a three-dimensional image reconstruction scheme. By acquiring multiple groups of left view images and right view images of a target scene and extracting the pixel information of the common view area between each group of left and right view images, unnecessary pixels in the picture are filtered out. An initial three-dimensional image reconstruction model is then trained based on the pixel information, where the loss function in the training process is obtained based on the colors, semantic labels, and directed distances of the pixels, so that convergence of the model is accelerated, training cost is reduced, and model precision is improved. Finally, three-dimensional image reconstruction is performed on the target scene based on the trained three-dimensional image reconstruction model, obtaining a surface model of the target scene.
In accordance with an embodiment of the present invention, a three-dimensional image reconstruction method embodiment is provided, it being noted that the steps illustrated in the flowchart of the figures may be performed in a computer system, such as a set of computer executable instructions, and, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.
In this embodiment, a three-dimensional image reconstruction method is provided, which may be used in a computer device or an electronic device for performing three-dimensional image reconstruction, such as a mobile phone, a tablet computer, a control chip, etc., fig. 1 is a flowchart of a three-dimensional image reconstruction method according to an embodiment of the present invention, and as shown in fig. 1, the flowchart includes the following steps:
in step S101, a plurality of sets of images of a target scene are acquired, each set of images including a pair of left view images and right view images captured simultaneously.
Specifically, the left view image and the right view image of the target scene photographed at the same time can be acquired by an image acquisition device such as a binocular camera, thereby obtaining a multi-view image of the target scene and improving the accuracy of the model.
Step S102, extracting common view areas between left view images and right view images in each group of images, and acquiring pixel information of each common view area.
Wherein the pixel information includes, for each pixel in the common view region, an azimuth vector, a depth, a color, a semantic label, and a directed distance between the pixel and the nearest object surface. The directed distance, i.e., the SDF value (Signed Distance Function), may be obtained from the signed distance field corresponding to the image, giving for each pixel the distance between that pixel and the surface of the nearest object.
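As a purely illustrative aside (not part of the patent), the sign convention of such a signed distance function can be sketched for a simple analytic surface; the function name and the choice of a sphere are assumptions made for clarity:

```python
import numpy as np

def sphere_sdf(points: np.ndarray, center: np.ndarray, radius: float) -> np.ndarray:
    """Signed distance from 3-D points to a sphere surface: positive outside,
    zero on the surface, negative inside (the usual SDF convention)."""
    return np.linalg.norm(points - center, axis=-1) - radius

# A point 2 m from the center of a unit sphere lies 1 m outside the surface.
print(sphere_sdf(np.array([[0.0, 0.0, 2.0]]), np.zeros(3), 1.0))  # [1.]
```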
Since autonomous driving scenes are mostly open scenes, many objects appear at both sides of the picture with low appearance frequency. Therefore, unnecessary pixels in the picture are filtered out by extracting the common view area between the left view image and the right view image of the target scene, thereby improving the accuracy of the three-dimensional image reconstruction.
Step S103, training an initial three-dimensional image reconstruction model based on the common view area and pixel information of the common view area of each group of images to obtain a trained three-dimensional image reconstruction model.
The loss function in the training process is obtained based on the colors, the semantic tags and the directed distances of the pixels, so that parameters of the three-dimensional image reconstruction model are optimized by fusing the semantic tags, the colors and the directed distance information of the pixels, the convergence of the model is accelerated, the training cost of the model is reduced, and the accuracy of the model is improved.
Step S104, based on the trained three-dimensional image reconstruction model, three-dimensional image reconstruction is carried out on the target scene.
Specifically, pixel sampling is performed on the target scene, SDF values of the sampling points are predicted through a three-dimensional image reconstruction model, and the distance between each point and the surface is obtained, so that the surface model of the target scene is obtained.
According to the three-dimensional image reconstruction method, multiple groups of left view images and right view images of the target scene are obtained, the pixel information of the common view area between each group of left and right view images is extracted, and unnecessary pixels in the picture are filtered out. An initial three-dimensional image reconstruction model is then trained based on the pixel information, where the loss function in the training process is obtained based on the colors, semantic labels, and directed distances of the pixels, so that convergence of the model is accelerated, training cost is reduced, and model precision is improved. Finally, a surface model of the target scene is obtained based on the trained three-dimensional image reconstruction model, and three-dimensional image reconstruction is performed on the target scene.
In this embodiment, a three-dimensional image reconstruction method is provided, which may be used in a computer device or an electronic device for performing three-dimensional image reconstruction, such as a mobile phone, a tablet computer, a control chip, etc., fig. 2 is a flowchart of the three-dimensional image reconstruction method according to an embodiment of the present invention, and as shown in fig. 2, the flowchart includes the following steps:
In step S201, a plurality of sets of images of the target scene are acquired, each set of images including a pair of left view images and right view images photographed at the same time.
Specifically, a plurality of left view images of the target scene may be acquired by the first image acquisition device, while a plurality of right view images of the target scene may be acquired by the second image acquisition device. The first image capturing device may be a left camera of a binocular camera, and the second image capturing device may be a right camera of the binocular camera.
Step S202, a common view area between the left view image and the right view image in each group of images is extracted, and pixel information of each common view area is obtained.
In some optional embodiments, the step S202 includes: and carrying out coordinate system conversion on the pixels in each common region based on the pose, the internal parameters and the external parameters respectively corresponding to the first image acquisition device and the second image acquisition device, so as to obtain the azimuth vector of the pixels in each common region in the world coordinate system.
Specifically, taking three-dimensional image reconstruction in an autonomous driving scene as an example, a rough initial trajectory can be obtained through devices such as an on-board inertial measurement unit (IMU), a global navigation satellite system (GNSS), a wheel-speed odometer, and the binocular camera, so as to obtain the poses respectively corresponding to the left camera (the first image acquisition device) and the right camera (the second image acquisition device) of the binocular camera. It should be noted that, since the common view area is determined by the placement of the binocular camera, the common view area is fixed once the position of the binocular camera is fixed.
Next, for each pixel of each image (each image containing $H \times W$ pixels in total), the camera-to-world transform is composed from the external parameters $T_{ext}$ of the camera corresponding to the image (obtained by camera calibration) and the pose $T_{pose}$ of that camera, and the position $o_w$ of the camera optical center in the world coordinate system is its translation part:

$$T_{wc} = T_{pose}\,T_{ext} = \begin{bmatrix} R_{wc} & t_{wc} \\ 0 & 1 \end{bmatrix}, \qquad o_w = t_{wc}$$

where $T_{wc}$ represents a Euclidean transformation, in which $R_{wc}$ indicates rotation and $t_{wc}$ represents translation.

Next, through the external parameters $T_{ext}$, the internal parameters $K$ of the camera (obtained by camera calibration), and the camera pose $T_{pose}$, the azimuth vector $d_w$ of each pixel of the image in the world coordinate system is obtained according to the following formula:

$$d_w = R_{wc}\,K^{-1}\,[u, v, 1]^T$$

where $[u, v, 1]^T$ represents the homogeneous coordinates of the pixel in the pixel coordinate system and $K$ represents the internal parameters.
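A minimal NumPy sketch of the two formulas above, assuming 4x4 homogeneous matrices T_pose (camera body to world) and T_ext (camera to body) and a 3x3 intrinsic matrix K; all names are illustrative, not taken from the patent:

```python
import numpy as np

def pixel_ray(T_pose: np.ndarray, T_ext: np.ndarray, K: np.ndarray, u: float, v: float):
    """Return the optical center o_w and unit azimuth vector d_w of pixel (u, v)
    in the world coordinate system."""
    T_wc = T_pose @ T_ext                           # camera-to-world Euclidean transform
    o_w = T_wc[:3, 3]                               # optical center = translation part
    d_c = np.linalg.inv(K) @ np.array([u, v, 1.0])  # back-project pixel to camera frame
    d_w = T_wc[:3, :3] @ d_c                        # rotate the direction into the world frame
    return o_w, d_w / np.linalg.norm(d_w)
```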
In some optional embodiments, the step S202 further includes:
step a1, inputting left view images and right view images in each group of images into a preset neural network to obtain optical flow data of each common view area; the optical flow data includes correspondence between pixels in the left view image and pixels in the right view image.
Specifically, the left view image and the right view image are input into a deep learning network for predicting optical flow data, for example the RAFT (Recurrent All-Pairs Field Transforms) algorithm, or a deep-learning pipeline based on SuperPoint and SuperGlue for extracting and matching image feature points, thereby obtaining the optical flow of the common view part of the left and right images, i.e., the pixel position in the right view image corresponding to each pixel of the left view image.
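As one illustration of this step, the optical flow of a stereo pair could be predicted with the RAFT implementation shipped in torchvision; the patent does not prescribe this particular library, and the preprocessing shown simply follows the torchvision weights' own transforms:

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()

# left, right: (N, 3, H, W) image batches; H and W must be divisible by 8
transforms = weights.transforms()
left, right = transforms(torch.rand(1, 3, 384, 512), torch.rand(1, 3, 384, 512))

with torch.no_grad():
    flows = model(left, right)   # list of iteratively refined flow fields
flow = flows[-1]                 # (N, 2, H, W): left-to-right pixel displacement
```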
And a2, obtaining the relative position relation between the first image acquisition device and the second image acquisition device based on the external parameters respectively corresponding to the first image acquisition device and the second image acquisition device.
Specifically, taking a binocular camera as an example, a positional relationship (for example, a transformation relationship from the left camera to the right camera) of the left camera and the right camera relative to each other can be obtained by external parameters (obtained by camera calibration) of the left camera and the right camera, and the positional relationship is a fixed value.
And a step a3 of obtaining the depth of the pixels in each common view area based on the optical flow data and the relative position relation of each common view area.
Specifically, by combining the correspondence between pixels in the left view angle image and the right view angle image and the relative positional relationship of the left and right cameras, the depth of each pixel in the common view part of the left and right images can be obtained through triangulation, and the region without common view can be directly discarded.
The triangulation process is as follows. The relative positional relationship of the left and right cameras is known, for example the transformation (rotation $R$, translation $t$) from the left camera to the right camera. Let $x_1$ be the homogeneous coordinates of a pixel point on the left camera and $x_2$ the homogeneous coordinates of the matched pixel point on the right camera; then:

$$s_2 x_2 = s_1 R x_1 + t$$

Left-multiplying both sides by $x_2^{\wedge}$ (the skew-symmetric matrix of $x_2$) gives:

$$s_2\, x_2^{\wedge} x_2 = 0 = s_1\, x_2^{\wedge} R x_1 + x_2^{\wedge} t$$

where $x_2^{\wedge} x_2$ vanishes because it is the cross product of collinear vectors, $s_1$ is the depth of the pixel in the left view image, and $s_2$ is the depth of the pixel in the right view image. The left side of the above equation is 0, so the right side can be treated as an equation in $s_1$ and solved directly for $s_1$; substituting $s_1$ back then yields $s_2$, and the depth of the pixel is thus obtained.
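A small NumPy sketch of this triangulation, solving the relation above for $s_1$ in a least-squares sense; the function and variable names mirror the formulas and are otherwise assumptions:

```python
import numpy as np

def skew(v: np.ndarray) -> np.ndarray:
    """Skew-symmetric (cross-product) matrix v^ such that skew(v) @ u = v x u."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def triangulate_depths(x1: np.ndarray, x2: np.ndarray, R: np.ndarray, t: np.ndarray):
    """Depths (s1, s2) of a match with normalized homogeneous coordinates
    x1 (left) and x2 (right), given the left-to-right transform (R, t)."""
    a = skew(x2) @ R @ x1                              # coefficient of s1 in s1*x2^Rx1 + x2^t = 0
    b = -skew(x2) @ t
    s1 = float(a @ b) / float(a @ a)                   # least-squares solution for s1
    s2 = float((s1 * R @ x1 + t)[2]) / float(x2[2])    # substitute back to recover s2
    return s1, s2
```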
In some optional embodiments, the step S202 further includes: inputting the image into a semantic segmentation deep learning network, such as one built on the MMSeg (MMSegmentation) framework, to obtain the semantic label corresponding to each pixel; the category set of the semantic labels is not limited, as long as it is unified across the whole data set.
In some optional embodiments, the step S202 further includes: obtaining the directed distance corresponding to a pixel in the image from the signed distance field corresponding to the image; the color of the pixel is obtained directly from the image.
Step S203, training the initial three-dimensional image reconstruction model based on the common view area and the pixel information of the common view area of each group of images to obtain a trained three-dimensional image reconstruction model.
Specifically, the step S203 includes:
step S2031, sampling pixels in the common area based on the azimuth vector and the depth of the pixels in the common area, and performing position coding on the obtained plurality of pixel sampling points to obtain implicit characterization of each pixel sampling point.
In some optional embodiments, step S2031 includes:
step b1, sampling pixels of the vector on the same ray along the direction of depth increase.
In particular, starting from the position $o_w$ of the camera optical center in the world coordinate system, sampling advances along the azimuth vector $d_w$ to the depth of the pixel, in the vicinity of the object surface. The sampling mode may adopt multi-stage sampling, average sampling, or weighted sampling, so as to obtain the sampling points ray_pts.
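An illustrative sketch of average (uniform) sampling along one such ray; the sampling band half-width and the stratified jitter are assumptions borrowed from common NeRF practice, not values fixed by the patent:

```python
import numpy as np

def sample_along_ray(o_w, d_w, depth, n_samples=64, band=0.5, jitter=True):
    """Sample points ray_pts = o_w + s * d_w around the pixel's depth."""
    s = np.linspace(depth - band, depth + band, n_samples)     # average sampling
    if jitter:
        step = s[1] - s[0]
        s = s + np.random.uniform(0.0, step, size=n_samples)   # stratified perturbation
    ray_pts = o_w[None, :] + s[:, None] * d_w[None, :]         # (n_samples, 3)
    return ray_pts, s
```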
And b2, carrying out hash coding based on the obtained positions of the plurality of pixel sampling points to obtain implicit characterization of each pixel sampling point.
Specifically, as shown in fig. 3, each sampling point ray_pts is used as an input and is input into a hash coding network, and in the hash coding network, each output of the hash coding is spliced with a position code according to the position of the sampling point ray_pts to obtain the coded implicit representation y. The sampling point ray_pts may be position coded according to the following formula:
$$\gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right)$$

where $p = (x, y, z)$ is the coordinates of the sampling point, the encoding is applied to each coordinate independently, and $L$ is the number of frequency bands.
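A direct transcription of this encoding in PyTorch, applied independently to each coordinate of p; the number of frequency bands L is treated here as a free hyperparameter:

```python
import torch

def positional_encoding(p: torch.Tensor, num_bands: int = 10) -> torch.Tensor:
    """NeRF-style encoding gamma(p) for points p of shape (..., 3)."""
    freqs = (2.0 ** torch.arange(num_bands)) * torch.pi   # 2^k * pi, k = 0..L-1
    angles = p[..., None] * freqs                         # (..., 3, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., 3, 2L)
    return enc.flatten(-2)                                # (..., 6L)
```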
Thus, the implicit representation of the sampling point is obtained by carrying out position coding on the sampling point so as to predict other characteristic values of the sampling point later.
Step S2032, obtaining a color predicted value, a semantic tag predicted value and a directed distance predicted value corresponding to each pixel sampling point based on the implicit characterization of each pixel sampling point.
Specifically, the implicit representation y of the pixel sampling point is input into a multi-layer perceptron network (Multi-Layer Perceptron). As shown in fig. 4, the multi-layer perceptron network is a neural network formed by multiple fully-connected layers; in actual use, the number of network layers can be adjusted according to the scene size, and generally 4-8 layers are used. The multi-layer perceptron predicts the RGB color, a weight $w$, and the semantic label; taking the logarithm of the weight $w$ yields the SDF value of the pixel sampling point.
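A compact PyTorch sketch of such a multi-head perceptron; the head layout (RGB, weight w, semantic logits, with log w taken as the SDF value) mirrors the description above, while the layer widths, activations, and class count are assumptions:

```python
import torch
import torch.nn as nn

class ReconstructionMLP(nn.Module):
    """Predict RGB color, an SDF value (log of a positive weight w), and
    semantic logits from the implicit representation y of a sampling point."""
    def __init__(self, in_dim: int, hidden: int = 256, layers: int = 6, classes: int = 19):
        super().__init__()
        blocks, d = [], in_dim
        for _ in range(layers):                 # 4-8 fully-connected layers per the text
            blocks += [nn.Linear(d, hidden), nn.ReLU(inplace=True)]
            d = hidden
        self.trunk = nn.Sequential(*blocks)
        self.rgb_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())
        self.weight_head = nn.Sequential(nn.Linear(hidden, 1), nn.Softplus())
        self.semantic_head = nn.Linear(hidden, classes)

    def forward(self, y: torch.Tensor):
        h = self.trunk(y)
        w = self.weight_head(h)                 # positive weight
        sdf = torch.log(w + 1e-8)               # logarithm of the weight -> SDF value
        return self.rgb_head(h), sdf, self.semantic_head(h)
```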
Step S2033, calculating a value of the loss function according to the loss of the color predicted value, the loss of the semantic label predicted value, and the loss of the directional distance predicted value of each pixel sampling point.
Specifically, the loss function is calculated from the true values and predicted values respectively corresponding to the color, the semantic label, and the directed distance of the pixel sampling points, and the gradients are back-propagated to update the network. The loss function $L$ comprises the loss of the RGB values $L_{rgb}$, the loss of the SDF values $L_{sdf}$, and the loss of the semantic labels $L_{sem}$:

$$L = L_{rgb} + \lambda_1 L_{sdf} + \lambda_2 L_{sem}$$

where $\lambda_1$ and $\lambda_2$ both represent weights.
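One plausible realization of this weighted sum in PyTorch; the individual terms (L1 for color and SDF, cross-entropy for semantics) and the weight values are assumptions, since the text only fixes the overall form of L:

```python
import torch
import torch.nn.functional as F

def total_loss(rgb_pred, rgb_gt, sdf_pred, sdf_gt, sem_logits, sem_gt,
               lambda1: float = 0.1, lambda2: float = 0.05) -> torch.Tensor:
    l_rgb = F.l1_loss(rgb_pred, rgb_gt)               # color supervision
    l_sdf = F.l1_loss(sdf_pred, sdf_gt)               # signed-distance supervision
    l_sem = F.cross_entropy(sem_logits, sem_gt)       # semantic-label supervision
    return l_rgb + lambda1 * l_sdf + lambda2 * l_sem  # L = L_rgb + λ1·L_sdf + λ2·L_sem
```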
And step S2034, updating parameters of the three-dimensional image reconstruction model based on the value of the loss function to obtain a trained three-dimensional image reconstruction model.
The method comprises the steps of obtaining true values and predicted values corresponding to colors, semantic tags and directed distances of pixels respectively, calculating a loss function based on the colors, the semantic tags and the directed distances of the pixels, training an initial three-dimensional image reconstruction model, accelerating convergence of the model, reducing training cost and improving model accuracy.
Step S204, based on the trained three-dimensional image reconstruction model, three-dimensional image reconstruction is carried out on the target scene. Please refer to step S104 in the embodiment shown in fig. 1 in detail, which is not described herein.
According to the three-dimensional image reconstruction method, multiple groups of left view images and right view images of the target scene are obtained, the pixel information of the common view area between each group of left and right view images is extracted, and unnecessary pixels in the picture are filtered out. Then, the predicted values respectively corresponding to the colors, semantic labels, and directed distances of the pixels are obtained, the loss function is calculated based on the losses of these quantities, and the initial three-dimensional image reconstruction model is trained, so that convergence of the model is accelerated, training cost is reduced, and model accuracy is improved. Finally, a surface model of the target scene is obtained based on the trained three-dimensional image reconstruction model, and three-dimensional image reconstruction is performed on the target scene.
In this embodiment, a three-dimensional image reconstruction method is provided, which may be used in a computer device or an electronic device for performing three-dimensional image reconstruction, such as a mobile phone, a tablet computer, a control chip, etc., fig. 5 is a flowchart of the three-dimensional image reconstruction method according to an embodiment of the present invention, and as shown in fig. 5, the flowchart includes the following steps:
in step S501, a plurality of sets of images of a target scene are acquired, each set of images including a pair of left view images and right view images captured simultaneously. Please refer to step S201 in the embodiment shown in fig. 2 in detail, which is not described herein.
Step S502, extracting common view areas between left view images and right view images in each group of images, and acquiring pixel information of each common view area. Please refer to step S202 in the embodiment shown in fig. 2, which is not described herein.
Step S503, training an initial three-dimensional image reconstruction model based on the common view area and pixel information of the common view area of each group of images to obtain a trained three-dimensional image reconstruction model. Please refer to step S203 in the embodiment shown in fig. 2 in detail, which is not described herein.
Step S504, based on the trained three-dimensional image reconstruction model, three-dimensional image reconstruction is carried out on the target scene.
Specifically, the step S504 includes:
in step S5041, pixels in the target common area corresponding to the target scene are uniformly sampled to obtain a plurality of target pixel sampling points.
Specifically, pixels in a target common area corresponding to a target scene are uniformly sampled, and a plurality of target pixel sampling points ray_pts are obtained.
Step S5042, inputting the plurality of target pixel sampling points into the trained three-dimensional image reconstruction model to obtain target directional distance predicted values of the target pixel sampling points.
Specifically, a plurality of target pixel sampling points ray_pts are input into a trained three-dimensional image reconstruction model, and target SDF predicted values of the target pixel sampling points ray_pts are obtained.
In step S5043, the three-dimensional image of the target scene is reconstructed based on the target directional distance prediction values of the target pixel sampling points.
Specifically, based on the target SDF predicted value of each target pixel sampling point ray_pts, the distance between each point and the surface can be obtained, and then the surface model of the target scene is obtained, so that three-dimensional image reconstruction is carried out on the target scene.
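For illustration, once SDF values have been predicted on a regular grid of sampling points, the zero level set can be extracted as a mesh with marching cubes; the patent does not name a specific surface-extraction algorithm, so this is one common choice, and the input file name is hypothetical:

```python
import numpy as np
from skimage import measure

# sdf_grid: (D, H, W) array of predicted SDF values on a regular grid (hypothetical file)
sdf_grid = np.load("sdf_grid.npy")

# The object surface is the zero level set of the signed distance field.
verts, faces, normals, _ = measure.marching_cubes(sdf_grid, level=0.0)
print(verts.shape, faces.shape)   # mesh vertices and triangle indices
```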
According to the three-dimensional image reconstruction method, multiple groups of left view images and right view images of the target scene are acquired, the pixel information of the common view area between each group of left and right view images is extracted, and unnecessary pixels in the picture are filtered out. Then, the predicted values respectively corresponding to the colors, semantic labels, and directed distances of the pixels are obtained and the loss function is calculated, training the initial three-dimensional image reconstruction model, so that convergence of the model is accelerated, training cost is reduced, and model precision is improved. Finally, based on the trained three-dimensional image reconstruction model, the SDF predicted values corresponding to the target pixel sampling points in the target scene are obtained, a surface model of the target scene is obtained, and three-dimensional image reconstruction is performed on the target scene.
The three-dimensional image reconstruction method according to the embodiment of the present invention will be described in further detail by way of a specific application example.
As shown in fig. 6, multiple sets of left view images and right view images of the target scene are acquired by two sets of cameras, and the position $o_w$ of each camera optical center in the world coordinate system and the azimuth vector $d_w$ of each pixel of the common view region of each set of left and right view images in the world coordinate system are acquired. Sampling is then performed based on $o_w$ and $d_w$ to obtain a plurality of pixel sampling points ray_pts; the pixel sampling points ray_pts are hash-coded based on their positions, and the resulting implicit representation is input into the multi-layer perceptron to predict the RGB values, SDF values, and semantic labels of the pixel sampling points ray_pts. At the same time, model training is performed based on the losses of the RGB values, the SDF values, and the semantic labels, which accelerates model convergence. Finally, the SDF values of sampling points are predicted with the trained model, thereby obtaining a surface model of the target scene and carrying out three-dimensional image reconstruction on the target scene.
In this embodiment, a three-dimensional image reconstruction device is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and will not be described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a three-dimensional image reconstruction apparatus, as shown in fig. 7, including:
an acquisition module 701, configured to acquire a plurality of groups of images of a target scene, each group of images including a pair of left view images and right view images captured simultaneously;
a first processing module 702, configured to extract a common view region between a left view image and a right view image in each group of images, and obtain pixel information of each common view region; the pixel information comprises azimuth vectors, depths, colors, semantic tags of pixels in the common area and directed distances between the pixels and the surface of the nearest object;
the second processing module 703 is configured to train a preset three-dimensional image reconstruction model based on the co-view area of each group of images and the pixel information of the co-view area, so as to obtain a trained three-dimensional image reconstruction model; wherein the loss function during training is obtained based on the color of the pixel, the semantic label, and the directed distance;
and a third processing module 704, configured to reconstruct a three-dimensional image of the target scene based on the trained three-dimensional image reconstruction model.
In some alternative embodiments, the acquisition module 701 includes:
the first processing unit is used for acquiring a plurality of left view images of the target scene through the first image acquisition device, and acquiring a plurality of right view images of the target scene through the second image acquisition device.
In some alternative embodiments, the first processing module 702 includes:
and the second processing unit is used for carrying out coordinate system conversion on the pixels in each common area based on the pose, the internal parameters and the external parameters respectively corresponding to the first image acquisition device and the second image acquisition device, so as to obtain the azimuth vector of the pixels in each common area in the world coordinate system.
In some alternative embodiments, the first processing module 702 further comprises:
the third processing unit is used for inputting left view images and right view images in each group of images into a preset neural network to obtain optical flow data of each common view area; the optical flow data includes correspondence between pixels in the left view image and pixels in the right view image;
the fourth processing unit is used for obtaining the relative position relationship between the first image acquisition device and the second image acquisition device based on the external parameters respectively corresponding to the first image acquisition device and the second image acquisition device;
and a fifth processing unit for obtaining the depth of the pixels in each common view area based on the optical flow data and the relative position relationship of each common view area.
In some alternative embodiments, the second processing module 703 includes:
The sixth processing unit is used for sampling the pixels in the common area based on the azimuth vectors and the depths of the pixels in the common area, and performing position coding on the obtained multiple pixel sampling points to obtain implicit characterization of each pixel sampling point;
the seventh processing unit is used for obtaining a color predicted value, a semantic tag predicted value and a directed distance predicted value corresponding to each pixel sampling point based on the implicit representation of each pixel sampling point;
an eighth processing unit, configured to calculate a value of a loss function according to a loss of the color predicted value, a loss of the semantic tag predicted value, and a loss of the directional distance predicted value of each pixel sampling point;
and the ninth processing unit is used for updating parameters of the three-dimensional image reconstruction model based on the value of the loss function to obtain a trained three-dimensional image reconstruction model.
In some alternative embodiments, the sixth processing unit comprises:
a first processing subunit configured to sample pixels whose azimuth vectors lie on the same ray, along the direction in which the depth increases;
and the second processing subunit is used for carrying out hash coding based on the obtained positions of the plurality of pixel sampling points to obtain implicit representation of each pixel sampling point.
In some alternative embodiments, the third processing module 704 includes:
a tenth processing unit, configured to uniformly sample pixels in a target common area corresponding to a target scene, to obtain a plurality of target pixel sampling points;
the eleventh processing unit is used for inputting a plurality of target pixel sampling points into the trained three-dimensional image reconstruction model to obtain target directional distance predicted values of the target pixel sampling points;
and the twelfth processing unit is used for reconstructing a three-dimensional image of the target scene based on the target directional distance predicted value of each target pixel sampling point. Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The three-dimensional image reconstruction apparatus in this embodiment is presented in the form of functional units, where the units refer to ASIC (Application Specific Integrated Circuit) circuits, processors and memory executing one or more software or fixed programs, and/or other devices that can provide the above-described functions.
The embodiment of the invention also provides computer equipment, which is provided with the three-dimensional image reconstruction device shown in the figure 1.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in fig. 8, the computer device includes: one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 8.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
Wherein the memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform the methods shown in implementing the above embodiments.
The memory 20 may include a storage program area that may store an operating system, at least one application program required for functions, and a storage data area; the storage data area may store data created according to the use of the computer device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The computer device also includes a communication interface 30 for the computer device to communicate with other devices or communication networks.
The embodiments of the present invention also provide a computer-readable storage medium. The method according to the above embodiments of the present invention may be implemented in hardware or firmware, or as computer code that may be recorded on a storage medium, or as computer code originally stored on a remote storage medium or a non-transitory machine-readable storage medium and downloaded over a network to be stored on a local storage medium, so that the method described herein may be processed by such software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or special-purpose hardware. The storage medium can be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk, or the like; further, the storage medium may also comprise a combination of memories of the above kinds. It will be appreciated that the computer, processor, microprocessor controller, or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the methods illustrated by the above embodiments.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (10)

1. A method of three-dimensional image reconstruction, the method comprising:
acquiring multiple groups of images of a target scene, wherein each group of images comprises a pair of left view images and right view images which are shot at the same time;
extracting common view areas between left view images and right view images in each group of images, and acquiring pixel information of each common view area; the pixel information comprises azimuth vectors, depths, colors, semantic tags of pixels in the common area and directed distances between the pixels and the surface of the nearest object;
training an initial three-dimensional image reconstruction model based on the common view area of each group of images and pixel information of the common view area to obtain a trained three-dimensional image reconstruction model; wherein the loss function during training is obtained based on the color of the pixel, the semantic label, and the directed distance;
and carrying out three-dimensional image reconstruction on the target scene based on the trained three-dimensional image reconstruction model.
2. The three-dimensional image reconstruction method according to claim 1, wherein the training process of the three-dimensional image reconstruction model includes:
sampling pixels in the common region based on azimuth vectors and depths of the pixels in the common region, and performing position coding on the obtained multiple pixel sampling points to obtain implicit characterization of each pixel sampling point;
based on the implicit characterization of each pixel sampling point, obtaining a color predicted value, a semantic tag predicted value and a directed distance predicted value corresponding to each pixel sampling point;
calculating the value of a loss function according to the loss of the color predicted value of each pixel sampling point, the loss of the semantic label predicted value and the loss of the directional distance predicted value;
and updating parameters of the three-dimensional image reconstruction model based on the value of the loss function to obtain a trained three-dimensional image reconstruction model.
3. The three-dimensional image reconstruction method according to claim 2, wherein the sampling the pixels in the common region based on the azimuth vector and the depth of the pixels in the common region, and performing position encoding on the obtained plurality of pixel sampling points to obtain the implicit representation of each pixel sampling point, includes:
sampling pixels whose azimuth vectors lie on the same ray, along the direction of increasing depth;
and carrying out hash coding based on the obtained positions of the plurality of pixel sampling points to obtain implicit characterization of each pixel sampling point.
4. The three-dimensional image reconstruction method according to claim 1, wherein the reconstructing the three-dimensional image of the target scene based on the trained three-dimensional image reconstruction model comprises:
uniformly sampling pixels in a target common area corresponding to a target scene to obtain a plurality of target pixel sampling points;
inputting the plurality of target pixel sampling points into a trained three-dimensional image reconstruction model to obtain target directional distance predicted values of the target pixel sampling points;
and reconstructing a three-dimensional image of the target scene based on target directional distance predicted values of all target pixel sampling points.
5. The three-dimensional image reconstruction method according to claim 1, wherein the acquiring a plurality of sets of images of the target scene includes:
a plurality of left view images of the target scene are acquired by the first image acquisition device, and a plurality of right view images of the target scene are acquired by the second image acquisition device.
6. The method according to claim 5, wherein extracting the common view region between the left view image and the right view image in each group of images and acquiring the pixel information of each common view region comprises:
and carrying out coordinate system conversion on pixels in each common area based on the pose, the internal parameters and the external parameters respectively corresponding to the first image acquisition device and the second image acquisition device, so as to obtain azimuth vectors of the pixels in each common area in a world coordinate system.
7. The method according to claim 5, wherein the extracting the common view region between the left view image and the right view image in each group of images and acquiring the pixel information of each common view region further comprises:
inputting left view images and right view images in each group of images into a preset neural network to obtain optical flow data of each common view area; the optical flow data includes correspondence between pixels in a left view image and pixels in a right view image;
based on external parameters corresponding to the first image acquisition device and the second image acquisition device respectively, obtaining a relative position relationship between the first image acquisition device and the second image acquisition device;
And obtaining the depth of the pixels in each common view area based on the optical flow data of each common view area and the relative position relation.
8. A three-dimensional image reconstruction apparatus, comprising:
an acquisition module configured to acquire a plurality of groups of images of a target scene, wherein each group of images comprises a pair of left view and right view images shot at the same time;
a first processing module configured to extract common view areas between the left view images and the right view images in each group of images and to acquire pixel information of each common view area, wherein the pixel information comprises azimuth vectors, depths, colors and semantic labels of the pixels in the common view area, and directed distances between the pixels and the surface of the nearest object;
a second processing module configured to train a preset three-dimensional image reconstruction model based on the common view area of each group of images and the pixel information of the common view area to obtain a trained three-dimensional image reconstruction model, wherein a loss function used during training is obtained based on the colors, the semantic labels and the directed distances of the pixels;
and a third processing module configured to reconstruct a three-dimensional image of the target scene based on the trained three-dimensional image reconstruction model.
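Claim 8 states only that the loss combines pixel colors, semantic labels and directed distances; a minimal PyTorch sketch of such a composite loss follows. The tensor layout and the weights `w_sem`/`w_sdf` are assumptions for illustration, not values from the patent.

```python
import torch.nn.functional as F

def reconstruction_loss(pred, target, w_sem=0.1, w_sdf=0.5):
    """Composite supervision of the kind claim 8 describes. `pred` and
    `target` are dicts of tensors: "rgb" (N, 3) colors, "sem_logits"
    (N, C) class scores vs. "sem" (N,) integer labels (LongTensor), and
    "sdf" (N,) directed distances to the nearest object surface."""
    color_loss = F.mse_loss(pred["rgb"], target["rgb"])            # rendered vs. true color
    sem_loss = F.cross_entropy(pred["sem_logits"], target["sem"])  # per-pixel semantics
    sdf_loss = F.l1_loss(pred["sdf"], target["sdf"])               # directed distance term
    return color_loss + w_sem * sem_loss + w_sdf * sdf_loss
```

Weighting the geometric (directed distance) term alongside the photometric one is what supplies the surface constraint that, per the background discussion, plain color supervision lacks.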
9. A computer device, comprising:
a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the three-dimensional image reconstruction method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the three-dimensional image reconstruction method of any one of claims 1 to 7.
CN202410270409.9A 2024-03-11 2024-03-11 Three-dimensional image reconstruction method, three-dimensional image reconstruction device, computer equipment and storage medium Pending CN117876608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410270409.9A CN117876608A (en) 2024-03-11 2024-03-11 Three-dimensional image reconstruction method, three-dimensional image reconstruction device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410270409.9A CN117876608A (en) 2024-03-11 2024-03-11 Three-dimensional image reconstruction method, three-dimensional image reconstruction device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117876608A 2024-04-12

Family

ID=90596977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410270409.9A Pending CN117876608A (en) 2024-03-11 2024-03-11 Three-dimensional image reconstruction method, three-dimensional image reconstruction device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117876608A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189382A (en) * 2019-05-31 2019-08-30 东北大学 Motion calibration method for multiple binocular cameras without a common field of view
CN110349251A (en) * 2019-06-28 2019-10-18 深圳数位传媒科技有限公司 Three-dimensional reconstruction method and device based on a binocular camera
CN111914715A (en) * 2020-07-24 2020-11-10 廊坊和易生活网络科技股份有限公司 Real-time detection and positioning method for intelligent vehicle targets based on bionic vision
WO2022143237A1 (en) * 2020-12-31 2022-07-07 华为技术有限公司 Target positioning method and system, and related device
WO2023015414A1 (en) * 2021-08-09 2023-02-16 中国科学院深圳先进技术研究院 Method for eliminating uncertainty in self-supervised three-dimensional reconstruction
WO2024012405A1 (en) * 2022-07-11 2024-01-18 华为技术有限公司 Calibration method and apparatus
CN115375836A (en) * 2022-07-29 2022-11-22 杭州易现先进科技有限公司 Point cloud fusion three-dimensional reconstruction method and system based on multivariate confidence filtering
CN117635991A (en) * 2023-12-22 2024-03-01 上海友道智途科技有限公司 Optimization method for image feature point matching based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
M. Fu et al., "Image Stitching Techniques Applied to Plane or 3-D Models: A Review," IEEE Sensors Journal, vol. 23, 15 April 2023 (2023-04-15), pages 8060-8079 *
Li Shikun, "Research on Depth Estimation and Point Cloud Registration Algorithms in Three-Dimensional Reconstruction," China Doctoral Dissertations Full-text Database, Social Sciences II, 15 February 2024 (2024-02-15), pages 123-8 *
Chen Jinjia, "Learning-Based Monocular Dense Semantic Three-Dimensional Reconstruction," China Masters' Theses Full-text Database, Information Science and Technology, 15 July 2020 (2020-07-15), pages 138-1244 *

Similar Documents

Publication Publication Date Title
CN110176027B (en) Video target tracking method, device, equipment and storage medium
CN110108258B (en) Monocular vision odometer positioning method
CN110160502A (en) Map elements extracting method, device and server
CN108230384B (en) Image depth calculation method and device, storage medium and electronic equipment
US20180137611A1 (en) Novel View Synthesis Using Deep Convolutional Neural Networks
CN109472828B (en) Positioning method, positioning device, electronic equipment and computer readable storage medium
CN110176032B (en) Three-dimensional reconstruction method and device
CN113674421B (en) 3D target detection method, model training method, related device and electronic equipment
CN113192646A (en) Target detection model construction method and different target distance monitoring method and device
CN115035235A (en) Three-dimensional reconstruction method and device
CN114219855A (en) Point cloud normal vector estimation method and device, computer equipment and storage medium
CN114758337A (en) Semantic instance reconstruction method, device, equipment and medium
CN115457208A (en) Three-dimensional modeling method and device for MEP equipment, storage medium and electronic device
CN114170290A (en) Image processing method and related equipment
CN113592015B (en) Method and device for positioning and training feature matching network
CN114972646A (en) Method and system for extracting and modifying independent ground objects of live-action three-dimensional model
CN109034214B (en) Method and apparatus for generating a mark
CN115620264B (en) Vehicle positioning method and device, electronic equipment and computer readable medium
CN115880555B (en) Target detection method, model training method, device, equipment and medium
CN117132737A (en) Three-dimensional building model construction method, system and equipment
CN117197388A (en) Live-action three-dimensional virtual reality scene construction method and system based on generation of antagonistic neural network and oblique photography
CN116012875A (en) Human body posture estimation method and related device
CN117876608A (en) Three-dimensional image reconstruction method, three-dimensional image reconstruction device, computer equipment and storage medium
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium
CN116310408B (en) Method and device for establishing data association between event camera and frame camera

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination