CN116824026A - Three-dimensional reconstruction method, device, system and storage medium - Google Patents

Three-dimensional reconstruction method, device, system and storage medium

Info

Publication number
CN116824026A
CN116824026A (application CN202311084904.2A)
Authority
CN
China
Prior art keywords
image
image sequence
matrix
model
dimensional reconstruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311084904.2A
Other languages
Chinese (zh)
Other versions
CN116824026B (en)
Inventor
肖美华
李承欢
谭睿霄
徐锐涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University filed Critical East China Jiaotong University
Priority to CN202311084904.2A
Publication of CN116824026A
Application granted
Publication of CN116824026B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10Geometric effects
    • G06T15/20Perspective computation
    • G06T15/205Image-based rendering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/10Constructive solid geometry [CSG] using solid primitives, e.g. cylinders, cubes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20Finite element generation, e.g. wire-frame surface description, tesselation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)

Abstract

The application provides a three-dimensional reconstruction method, device, system, and storage medium, belonging to the field of image processing. The method comprises the following steps: importing an original video and segmenting it to obtain an image sequence; analyzing the image sequence to obtain a model view perspective matrix and a mask image; constructing an original matrix set, and constructing an initial reconstruction model through the original matrix set; and rendering the model view perspective matrix and the mask image through the initial reconstruction model to obtain a rendered image. The application supervises the model using only 2D image information, and the generated 3D model has a well-formed vertex-and-face structure and a manageable number of vertices and faces, which is of great significance in the fields of games, the virtual reality industry, and digital cultural relics, and reduces the manpower, financial, and time costs of manual modeling.

Description

Three-dimensional reconstruction method, device, system and storage medium
Technical Field
The application mainly relates to the technical field of image processing, and in particular to a three-dimensional reconstruction method, device, system, and storage medium.
Background
Three-dimensional reconstruction is the process of recovering a three-dimensional scene or object from two-dimensional images taken from multiple perspectives, which can save considerable cost compared with manually modeling a 3D scene. Based on how the model is represented, three-dimensional reconstruction techniques can be divided into implicit and explicit reconstruction. Common implicit reconstructions represent shape with voxels, a signed distance function (SDF), or an occupancy field (OF). However, implicit reconstruction methods usually rely on the Marching Cubes algorithm when the final model is synthesized, so the number of points in the model exceeds the capacity of conventional modeling software; the generated model can only be used with special viewing software, and exporting an OBJ file is difficult. Implicit reconstruction also often depends on 3D supervision information, which limits the generality of its applications. Conventional explicit reconstruction typically requires a large amount of input data, particularly high-resolution sensor data; for complex scenes or large-scale objects this is a significant challenge, demanding a great deal of time and computational resources as well as expensive sensor equipment, which makes explicit reconstruction difficult to popularize.
Disclosure of Invention
The technical problem to be solved by the application is to provide a three-dimensional reconstruction method, device, system, and storage medium that address the defects of the prior art.
The technical scheme for solving the technical problems is as follows: a three-dimensional reconstruction method comprising the steps of:
importing an original video, and dividing the original video to obtain a plurality of image sequences;
analyzing each image sequence to obtain a model view perspective matrix corresponding to each image sequence and a mask image corresponding to each image sequence;
constructing an original matrix set, and constructing an initial reconstruction model through the original matrix set;
rendering the model view perspective matrixes and the mask images corresponding to the image sequences respectively through the initial reconstruction model to obtain rendered images corresponding to the image sequences;
optimizing the initial reconstruction model according to all the image sequences and all the rendering images to obtain a three-dimensional reconstruction model;
importing an image to be reconstructed, and performing three-dimensional reconstruction on the image to be reconstructed through the three-dimensional reconstruction model to obtain a three-dimensional reconstruction result;
the process of analyzing each image sequence to obtain a model view perspective matrix corresponding to each image sequence and a mask image corresponding to each image sequence comprises the following steps:
extracting affine transformation matrix corresponding to each image sequence, image height corresponding to each image sequence, image width corresponding to each image sequence and camera focal length corresponding to each image sequence from each image sequence by utilizing a motion structure algorithm;
extracting mask images corresponding to the image sequences from the image sequences by using a Python tool;
and respectively carrying out matrix calculation on an affine transformation matrix corresponding to each image sequence, an image height corresponding to each image sequence, an image width corresponding to each image sequence and a camera focal length corresponding to each image sequence to obtain a model view perspective matrix corresponding to each image sequence.
The other technical scheme for solving the technical problems is as follows: a three-dimensional reconstruction apparatus comprising:
the importing module is used for importing the original video;
the segmentation module is used for segmenting the original video to obtain a plurality of image sequences;
the analysis module is used for respectively analyzing each image sequence to obtain a model view perspective matrix corresponding to each image sequence and a mask image corresponding to each image sequence;
the construction module is used for constructing an original matrix set, and constructing an initial reconstruction model through the original matrix set;
the rendering module is used for respectively rendering each model view perspective matrix and mask images corresponding to each image sequence through the initial reconstruction model to obtain rendered images corresponding to each image sequence;
the optimization module is used for optimizing the initial reconstruction model according to all the image sequences and all the rendering images to obtain a three-dimensional reconstruction model;
the importing module is also used for importing an image to be reconstructed;
the three-dimensional reconstruction result obtaining module is used for carrying out three-dimensional reconstruction on the image to be reconstructed through the three-dimensional reconstruction model to obtain a three-dimensional reconstruction result;
the analysis module is used for:
extracting affine transformation matrix corresponding to each image sequence, image height corresponding to each image sequence, image width corresponding to each image sequence and camera focal length corresponding to each image sequence from each image sequence by utilizing a motion structure algorithm;
extracting mask images corresponding to the image sequences from the image sequences by using a Python tool;
and respectively carrying out matrix calculation on an affine transformation matrix corresponding to each image sequence, an image height corresponding to each image sequence, an image width corresponding to each image sequence and a camera focal length corresponding to each image sequence to obtain a model view perspective matrix corresponding to each image sequence.
Based on the three-dimensional reconstruction method, the application further provides a three-dimensional reconstruction system.
The other technical scheme for solving the technical problems is as follows: a three-dimensional reconstruction system comprising a memory, a processor and a computer program stored in the memory and executable on the processor, which when executed by the processor implements a three-dimensional reconstruction method as described above.
Based on the three-dimensional reconstruction method, the application further provides a computer readable storage medium.
The other technical scheme for solving the technical problems is as follows: a computer readable storage medium storing a computer program which, when executed by a processor, implements a three-dimensional reconstruction method as described above.
The beneficial effects of the application are as follows: a plurality of image sequences are obtained by segmenting an original video; a model view perspective matrix and a mask image are obtained by analyzing the image sequences; an initial reconstruction model is constructed from an original matrix set; a rendered image is obtained by rendering the model view perspective matrix and the mask image through the initial reconstruction model; a three-dimensional reconstruction model is obtained by optimizing the initial reconstruction model according to the image sequences and the rendered images; and a three-dimensional reconstruction result is obtained by three-dimensionally reconstructing the image to be reconstructed through the three-dimensional reconstruction model. By supervising the model with only 2D image information, the generated 3D model has a well-formed vertex-and-face structure and a manageable number of vertices and faces.
Drawings
FIG. 1 is a schematic flow chart of a three-dimensional reconstruction method according to an embodiment of the present application;
FIG. 2 is a diagram of one of tetrahedral structures of a three-dimensional reconstruction method according to an embodiment of the present application;
FIG. 3 is a diagram showing a second tetrahedral structure of a three-dimensional reconstruction method according to an embodiment of the present application;
FIG. 4 is a diagram of a third tetrahedral structure of a three-dimensional reconstruction method according to an embodiment of the present application;
FIG. 5 is a diagram showing a tetrahedral structure of a three-dimensional reconstruction method according to an embodiment of the present application;
fig. 6 is a block diagram of a three-dimensional reconstruction device according to an embodiment of the present application.
Detailed Description
The principles and features of the present application are described below with reference to the drawings; the examples are provided only to illustrate the application and are not to be construed as limiting its scope.
Fig. 1 is a schematic flow chart of a three-dimensional reconstruction method according to an embodiment of the present application.
As shown in fig. 1, a three-dimensional reconstruction method includes the following steps:
importing an original video, and dividing the original video to obtain a plurality of image sequences;
analyzing each image sequence to obtain a model view perspective matrix corresponding to each image sequence and a mask image corresponding to each image sequence;
constructing an original matrix set, and constructing an initial reconstruction model through the original matrix set;
rendering the model view perspective matrixes and the mask images corresponding to the image sequences respectively through the initial reconstruction model to obtain rendered images corresponding to the image sequences;
optimizing the initial reconstruction model according to all the image sequences and all the rendering images to obtain a three-dimensional reconstruction model;
importing an image to be reconstructed, and performing three-dimensional reconstruction on the image to be reconstructed through the three-dimensional reconstruction model to obtain a three-dimensional reconstruction result;
the process of analyzing each image sequence to obtain a model view perspective matrix corresponding to each image sequence and a mask image corresponding to each image sequence comprises the following steps:
extracting affine transformation matrix corresponding to each image sequence, image height corresponding to each image sequence, image width corresponding to each image sequence and camera focal length corresponding to each image sequence from each image sequence by utilizing a motion structure algorithm;
extracting mask images corresponding to the image sequences from the image sequences by using a Python tool;
and respectively carrying out matrix calculation on an affine transformation matrix corresponding to each image sequence, an image height corresponding to each image sequence, an image width corresponding to each image sequence and a camera focal length corresponding to each image sequence to obtain a model view perspective matrix corresponding to each image sequence.
It should be appreciated that the original video may be a 360-degree video of the product to be reconstructed; the obtained video is segmented into a sequence of images.
It will be appreciated that the video is segmented into an image sequence according to a frames-per-second parameter.
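As an illustration of this segmentation step, the following Python sketch extracts every n-th frame of the video with OpenCV; the sampling stride and file layout are illustrative assumptions rather than part of the claimed method.

```python
import os

import cv2

def split_video(video_path: str, out_dir: str, every_n_frames: int = 10) -> int:
    """Split a video into an image sequence, keeping every n-th frame."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    kept = idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        if idx % every_n_frames == 0:
            cv2.imwrite(os.path.join(out_dir, f"{kept:04d}.png"), frame)
            kept += 1
        idx += 1
    cap.release()
    return kept

# Example: split_video("product_360.mp4", "frames", every_n_frames=10)
```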
It will be appreciated that, based on the resulting initial model, the MVP (Model View Perspective) matrix (i.e., the model view perspective matrix), and the sequence of mask images, an image of the initial model is rendered.
It should be appreciated that the SFM algorithm (i.e., the structure-from-motion algorithm, referred to above as the motion structure algorithm) is applied to the acquired image sequence to obtain an affine transformation matrix from the camera coordinate system to the world coordinate system, together with the image height, the image width, and the camera focal length.
Specifically, the steps of the structure-from-motion algorithm are as follows:

For the resulting image sequence $\mathcal{I} = \{I_i\}$, correspondences are found between overlapping images of the product, and the projections of the same points in the overlapping images are verified; the output is a set of geometrically verified image pairs $\mathcal{C}$ and an image projection map for each point. For the photographed product we make the following assumption: the product conforms to rigid-body motion, i.e., for any two points $a$, $b$ on the surface of the product with world coordinate vectors $\mathbf{x}_a$, $\mathbf{x}_b$, the distance $\lVert \mathbf{x}_a - \mathbf{x}_b \rVert$ remains constant over time. In other words, the product is not a fluid, smoke, or similar surface-deformable object. For any point $p$ on the product, let $\mathbf{x}_w$ be its world coordinates and $\mathbf{x}_c$ its camera coordinates; the two are related by a rigid-body displacement $\mathbf{x}_c = R\,\mathbf{x}_w + \mathbf{t}$, and this rigid-body displacement between camera coordinates and world coordinates is called the extrinsic parameters of the camera. For known camera coordinates $\mathbf{x}_c = (x_c, y_c, z_c)^{\mathsf T}$, the (two-dimensional) image coordinates $(u, v)$ satisfy

$$ z_c \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K \begin{pmatrix} x_c \\ y_c \\ z_c \end{pmatrix}, \qquad K = \begin{pmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}, $$

where $z_c$ acts as the depth scaling factor, $K$ is the intrinsic matrix of the camera, $f$ is the camera focal length, $s$ is a scaling (skew) factor, and $(c_x, c_y)$ is the center offset. These variables satisfy $f_x = f / \rho_w$ and $f_y = f / \rho_h$, where $\rho_w$ is the size of a horizontal pixel in [meters per pixel], $\rho_h$ is similarly the size of a vertical pixel, $f_x$ is the horizontal focal length, $f_y$ is similarly the vertical focal length, and $\eta = f_y / f_x$ is the aspect ratio. $P = K\,[R \mid \mathbf{t}]$ is the projection matrix from camera coordinates onto image coordinates. Based on the above assumptions and definitions, for each image $I_i$, SFM (structure from motion) detects a set of local features at positions $\mathbf{x}_j$; this set is denoted $F_i$. We assume that the feature set remains unchanged under illumination and geometric transformations, so that SFM can identify the same features in multiple images. Next, SFM uses the feature sets $F_i$ as an appearance description of each image to find image pairs that observe the same scene part; the output is a set of potentially overlapping image pairs $\mathcal{C}$ and the corresponding feature relations $\mathcal{M}$. Finally, the camera intrinsic and extrinsic parameters are estimated through the PnP (Perspective-n-Point) problem; because the 2D-3D correspondences are often polluted by outliers, the pose of the calibrated camera is estimated using RANSAC (Random Sample Consensus) and a minimal pose solver, and is stored in an .npy file that can be read and written quickly. The file content is stored as an array comprising N rows and 17 columns of data, where N is the number of images and the 17 columns store the camera parameters for each image: the first 12 columns define a 3×4 camera-to-world affine transformation matrix; the next 3 columns define the image height, the image width, and the camera focal length; and the last 2 columns define 2 depth values, a near boundary value and a far boundary value, used to scale the range of the product.
It will be appreciated that the mask image corresponding to each image sequence is extracted from that image sequence using Photoshop or a segmentation-mask library function of a Python tool.
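One possible realization is sketched below; the choice of the rembg matting library is an assumption for illustration, since the method only requires some Python segmentation-mask tool.

```python
from pathlib import Path

import numpy as np
from PIL import Image
from rembg import remove  # assumed choice of segmentation library

def extract_masks(frame_dir: str, mask_dir: str) -> None:
    """Write a binary foreground mask for every frame in frame_dir."""
    out = Path(mask_dir)
    out.mkdir(parents=True, exist_ok=True)
    for frame in sorted(Path(frame_dir).glob("*.png")):
        rgba = remove(Image.open(frame))  # RGBA image with an alpha matte
        alpha = np.asarray(rgba)[:, :, 3]
        mask = (alpha > 127).astype(np.uint8) * 255
        Image.fromarray(mask).save(out / frame.name)
```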
In the above embodiment, a plurality of image sequences are obtained by segmenting the original video; a model view perspective matrix and a mask image are obtained by analyzing the image sequences; an initial reconstruction model is constructed from the original matrix set; a rendered image is obtained by rendering the model view perspective matrix and the mask image through the initial reconstruction model; a three-dimensional reconstruction model is obtained by optimizing the initial reconstruction model according to the image sequences and the rendered images; and a three-dimensional reconstruction result is obtained by three-dimensionally reconstructing the image to be reconstructed through the three-dimensional reconstruction model. The generated 3D model has a well-formed vertex-and-face structure and a manageable number of vertices and faces, which is of great significance in the fields of games, the virtual reality industry, and digital cultural relics, and reduces the manpower, financial, and time costs of manual modeling.
Alternatively, as one embodiment of the present application, the affine transformation matrix includes a sum of an x-axis origin and a center offset, a sum of a y-axis origin and a center offset, an x-axis center offset, and a y-axis center offset,
the process of respectively performing matrix calculation on the affine transformation matrix corresponding to each image sequence, the image height corresponding to each image sequence, the image width corresponding to each image sequence and the camera focal length corresponding to each image sequence to obtain the model view perspective matrix corresponding to each image sequence comprises the following steps:
performing matrix calculation on a sum of an x-axis origin and a center offset corresponding to each image sequence, a sum of a y-axis origin and a center offset corresponding to each image sequence, an x-axis center offset corresponding to each image sequence, a y-axis center offset corresponding to each image sequence, an image height corresponding to each image sequence, an image width corresponding to each image sequence and a camera focal length corresponding to each image sequence respectively by a first formula to obtain a model view perspective matrix corresponding to each image sequence, wherein the first formula is as follows:
$$ \mathrm{MVP} = \mathrm{Persp} \times \mathrm{MV}, $$

$$ \mathrm{Persp} = \begin{pmatrix} \dfrac{1}{\eta \tan(\theta/2)} & 0 & 0 & 0 \\ 0 & \dfrac{1}{\tan(\theta/2)} & 0 & 0 \\ 0 & 0 & -\dfrac{n_f + n_n}{n_f - n_n} & -\dfrac{2\, n_f\, n_n}{n_f - n_n} \\ 0 & 0 & -1 & 0 \end{pmatrix}, \qquad \theta = 2\arctan\frac{h}{2f}, \qquad \eta = \frac{w}{h}, $$

and the model view matrix $\mathrm{MV}$ is derived from the camera-to-world affine transformation matrix by removing the center offset, i.e. $t_x' = t_x - c_x$ and $t_y' = t_y - c_y$;

wherein $\mathrm{MVP}$ is the model view perspective matrix, $\mathrm{Persp}$ is the perspective matrix, $\mathrm{MV}$ is the model view matrix, $\theta$ is the vertical viewing angle range of the camera, $\eta$ is the aspect ratio, $n_f$ is the preset far boundary value, $n_n$ is the preset near boundary value, $f$ is the camera focal length, $s_x$ and $s_y$ are scaling factors, $t_x$ is the sum of the x-axis origin and the center offset, $c_x$ is the x-axis center offset, $t_y$ is the sum of the y-axis origin and the center offset, $c_y$ is the y-axis center offset, $h$ is the image height, and $w$ is the image width.
It should be appreciated that the MVP (Model View Perspective) matrix is calculated from the affine transformation matrix from the camera coordinate system to the world coordinate system, together with the image height, the image width, and the camera focal length.
Specifically, the intrinsic quantities satisfy $f_x = f / \rho_w$ and $f_y = f / \rho_h$, where $f$ is the camera focal length, $\rho_w$ is the size of a horizontal pixel in [meters per pixel], $\rho_h$ is similarly the size of a vertical pixel, $f_x$ is the horizontal focal length, $f_y$ is similarly the vertical focal length, $\eta = f_y / f_x$ is the aspect ratio, $(c_x, c_y)$ is the center offset, $h$ is the image height, and $w$ is the image width.

Specifically, the MVP (Model View Perspective) matrix is the product of the three matrices Model, View, and Perspective, which are here treated as a single overall 4×4 matrix.

The Perspective matrix equals the following formula:

$$ \mathrm{Persp} = \begin{pmatrix} \dfrac{1}{\eta \tan(\theta/2)} & 0 & 0 & 0 \\ 0 & \dfrac{1}{\tan(\theta/2)} & 0 & 0 \\ 0 & 0 & -\dfrac{n_f + n_n}{n_f - n_n} & -\dfrac{2\, n_f\, n_n}{n_f - n_n} \\ 0 & 0 & -1 & 0 \end{pmatrix}, $$

where $n_f$ is the far boundary value and $n_n$ is the near boundary value.

The MV (Model View) matrix is obtained from the camera-to-world affine transformation matrix, where $t_x'$ and $t_y'$ are the values of $t_x$ and $t_y$ after the center offset is removed, i.e. $t_x' = t_x - c_x$ and $t_y' = t_y - c_y$.
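A NumPy sketch of this matrix assembly is given below; the field-of-view relation $\theta = 2\arctan(h/2f)$ and the inversion of the camera-to-world matrix to obtain MV are assumptions consistent with the symbols above, not verbatim patent text.

```python
import numpy as np

def perspective(fovy: float, aspect: float, near: float, far: float) -> np.ndarray:
    """Standard 4x4 perspective projection matrix."""
    t = np.tan(0.5 * fovy)
    return np.array([
        [1.0 / (aspect * t), 0.0, 0.0, 0.0],
        [0.0, 1.0 / t, 0.0, 0.0],
        [0.0, 0.0, -(far + near) / (far - near), -2.0 * far * near / (far - near)],
        [0.0, 0.0, -1.0, 0.0],
    ])

def mvp_from_pose(cam2world_3x4: np.ndarray, h: float, w: float,
                  focal: float, near: float, far: float) -> np.ndarray:
    """Assemble MVP = Persp @ MV from one row of the pose file."""
    c2w = np.eye(4)
    c2w[:3, :4] = cam2world_3x4
    mv = np.linalg.inv(c2w)                  # world -> camera (model view)
    fovy = 2.0 * np.arctan(0.5 * h / focal)  # vertical viewing angle range
    return perspective(fovy, w / h, near, far) @ mv
```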
In the above embodiment, the model view perspective matrix is obtained by performing matrix calculation on the affine transformation matrix, the image height, the image width, and the camera focal length. By supervising the model with only 2D image information, the generated 3D model has a well-formed vertex-and-face structure and a manageable number of vertices and faces, which is of great significance in the fields of games, the virtual reality industry, and digital cultural relics, and reduces the manpower, financial, and time costs of manual modeling.
Alternatively, as an embodiment of the present application, as shown in FIGS. 1 to 5, the original matrix set includes a tetrahedron vertex three-dimensional coordinate matrix and a vertex index matrix,
the process of constructing an original matrix set and constructing an initial reconstruction model through the original matrix set comprises the following steps:
s31: counting the number of three-dimensional coordinates in the tetrahedron vertex three-dimensional coordinate matrix to obtain the total number of the tetrahedron vertex three-dimensional coordinates;
s32: carrying out random assignment on the total number of the three-dimensional coordinates of the tetrahedron vertexes to obtain a plurality of SDF values, and constructing an SDF value matrix through all the SDF values;
s33: and constructing a model of the tetrahedron vertex three-dimensional coordinate matrix, the vertex index matrix and the SDF value matrix by using a marching tetrahedron algorithm to obtain an initial reconstruction model.
It should be understood that the geometric training model is built by initialization, specifically: the tetrahedron vertex three-dimensional coordinate matrix and the vertex index matrix of a single tetrahedral grid are loaded from a compressed file and stored; an SDF value matrix is created for the tetrahedron vertex three-dimensional coordinate matrix; a displacement matrix is created; and these are registered as training parameters.
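A minimal PyTorch sketch of this initialization follows, assuming the tetrahedral grid ships as a compressed .npz file with 'vertices' and 'indices' arrays; the file name and array names are illustrative.

```python
import numpy as np
import torch

tets = np.load("tet_grid.npz")
verts = torch.tensor(tets["vertices"], dtype=torch.float32)  # (V, 3) vertex coordinates
tet_idx = torch.tensor(tets["indices"], dtype=torch.long)    # (T, 4) vertex index matrix

# One random SDF value per tetrahedron vertex, plus a per-vertex displacement
# matrix, both registered as training parameters.
sdf = torch.nn.Parameter(torch.randn(verts.shape[0]))
deform = torch.nn.Parameter(torch.zeros_like(verts))
optimizer = torch.optim.Adam([sdf, deform], lr=1e-3)
```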
It should be understood that, as shown in FIG. 2, $v_a$, $v_b$, $v_c$, $v_d$ are the four vertices of a tetrahedron $T_k$, where $k$ labels the tetrahedron.
It should be understood that, as shown in FIGS. 3 to 5, what is generated is not the tetrahedra themselves but triangles, determined (up to scaling, symmetry, and rotation) by the configuration of the SDF values at the vertices: when a vertex's SDF value is a positive number the vertex lies outside the generated model, when negative it lies inside, and when the SDF value is 0 the vertex lies exactly on the surface of the model.
Specifically, as shown in FIGS. 3 to 5, the marching tetrahedra algorithm is as follows: the encoded SDF is converted into an explicit triangular mesh using the Marching Tetrahedra (MT) algorithm. Given the SDF values $s(v_a), s(v_b), s(v_c), s(v_d)$ at a tetrahedron's vertices, MT determines the surface type inside the tetrahedron according to the signs of these values; the total number of configurations is $2^4 = 16$, which can be reduced to 3 unique cases in view of rotational symmetry. Once the surface type inside the tetrahedron is determined, the vertex positions of the iso-surface are computed at the zero crossings of the linear interpolation along the tetrahedron's edges. In the figures, $v_a, v_b, v_c, v_d$ are the four vertices of the tetrahedron; a diamond represents a newly generated vertex, for example $v'_{ab}$ is the new vertex generated on the edge between $v_a$ and $v_b$; a crossed rectangle represents a vertex whose SDF value is negative; and a circle represents a vertex whose SDF value is positive. The newly generated vertex is given by the formula:

$$ v'_{ab} = \frac{v_a\, s(v_b) - v_b\, s(v_a)}{s(v_b) - s(v_a)}. $$
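The interpolation formula transcribes directly into code; the short PyTorch sketch below also checks it on a single edge.

```python
import torch

def interp_vertex(v_a: torch.Tensor, v_b: torch.Tensor,
                  s_a: torch.Tensor, s_b: torch.Tensor) -> torch.Tensor:
    """Iso-surface vertex on edge (v_a, v_b); s_a and s_b must have opposite signs."""
    return (v_a * s_b - v_b * s_a) / (s_b - s_a)

# Edge from the origin to (1, 0, 0) with SDF -1 at one end and +3 at the other:
# the surface crosses a quarter of the way along the edge.
v = interp_vertex(torch.zeros(3), torch.tensor([1.0, 0.0, 0.0]),
                  torch.tensor(-1.0), torch.tensor(3.0))
print(v)  # tensor([0.2500, 0.0000, 0.0000])
```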
It should be understood that there are three unique surface configurations in MT, where the vertex color represents the sign of the signed distance value. Note that flipping the signs of all vertices results in the same surface configuration; the positions of the new vertices are linearly interpolated along the edges where the sign changes.
It should be understood that the geometry and the material are trained jointly. In detail: from the tetrahedron vertex three-dimensional coordinate matrix, the SDF value matrix, and the vertex index matrix, the marching tetrahedra algorithm is used to obtain an initial model (i.e., the initial reconstruction model).
In the above embodiment, the original matrix set is constructed, and the initial reconstruction model is constructed through the original matrix set. By supervising the model with only 2D image information, the generated 3D model has a well-formed vertex-and-face structure and a manageable number of vertices and faces, thereby reducing the manpower, financial, and time costs of manual modeling.
Optionally, as an embodiment of the present application, the process of optimizing the initial reconstruction model according to all the image sequences and all the rendered images to obtain a three-dimensional reconstruction model includes:
performing loss function calculation on all the image sequences and all the rendered images to obtain a target loss function;
and updating the parameters of the tetrahedron vertex three-dimensional coordinate matrix and the SDF value matrix according to the target loss function, returning to S33 after each update until a preset number of iterations is reached, and then taking the initial reconstruction model as the three-dimensional reconstruction model.
Preferably, the preset number of iterations may be 100.
It will be appreciated that a mean square error loss is computed between the resulting image of the initial model (i.e., the rendered image) and the original captured image (i.e., the image sequence), and the gradients are back-propagated to update the SDF value matrix and the displacement matrix.
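A sketch of one optimization step is given below; render(...) is a hypothetical stand-in for the differentiable rasterization of the current mesh under one MVP matrix and mask, and its signature is an assumption for illustration.

```python
import torch

def training_step(render, mvps, masks, targets, sdf, deform, optimizer):
    """One update: render all views, take the mean square error, back-propagate."""
    optimizer.zero_grad()
    loss = torch.zeros(())
    for mvp, mask, target in zip(mvps, masks, targets):
        rendered = render(sdf, deform, mvp, mask)  # (H, W, 3) image
        loss = loss + torch.mean((rendered - target) ** 2)
    loss = loss / len(targets)
    loss.backward()  # gradients flow to the SDF values and displacements
    optimizer.step()
    return float(loss)

# for step in range(100):  # preset number of iterations
#     training_step(render, mvps, masks, frames, sdf, deform, optimizer)
```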
In the above embodiment, the initial reconstruction model is optimized according to all the image sequences and all the rendered images to obtain the three-dimensional reconstruction model. By supervising the model with only 2D image information, the generated 3D model has a well-formed vertex-and-face structure and a manageable number of vertices and faces, reducing the manpower, financial, and time costs of manual modeling.
Optionally, as an embodiment of the present application, the process of performing a loss function calculation on all the image sequences and all the rendered images to obtain a target loss function includes:
performing loss function calculation on all the image sequences and all the rendered images through a second formula to obtain a target loss function, wherein the second formula is:

$$ \mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \left( I_i - \hat{I}_i \right)^2, $$

wherein $\mathcal{L}$ is the target loss function, $I_i$ is the $i$-th original image of the image sequence, $\hat{I}_i$ is the corresponding rendered image, and $n$ is the total number of images in the image sequence.
In the above embodiment, the target loss function is obtained by performing the loss function calculation on all the image sequences and all the rendered images through the second formula. By supervising the model with only 2D image information, the generated 3D model has a well-formed vertex-and-face structure and a manageable number of vertices and faces, which is of great significance in the fields of games, the virtual reality industry, and digital cultural relics, and reduces the manpower, financial, and time costs of manual modeling.
Optionally, as another embodiment of the present application, an explicit reconstruction method is adopted but combined with the implicit SDF representation: a deformable tetrahedral grid is used to predict an SDF defined on the grid, and the SDF is then converted into a surface mesh through marching tetrahedra. By supervising the model with only 2D image information, the generated 3D model has a well-formed vertex-and-face structure and a manageable number of vertices and faces, which is of great significance in the fields of games, the virtual reality industry, and digital cultural relics, and reduces the manpower, financial, and time costs of manual modeling.
Optionally, as another embodiment of the present application, the present application further includes initializing a texture training model, specifically: creating three-channel values for a color map, a highlight map, and a normal map; limiting the three-channel values of the color map and the highlight map to 0 to 1 and the three-channel values of the normal map to -1 to 1; and storing them, together with an MLP positional encoding, as a material dictionary.
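A minimal sketch of this texture initialization in PyTorch; the texture resolution and the dictionary keys are illustrative assumptions.

```python
import torch

RES = 1024  # texture resolution (assumed)

# Three-channel maps registered as training parameters.
color_map = torch.nn.Parameter(torch.rand(RES, RES, 3))
highlight_map = torch.nn.Parameter(torch.rand(RES, RES, 3))
normal_map = torch.nn.Parameter(torch.zeros(RES, RES, 3))

def material_dict() -> dict:
    """Material dictionary with each map limited to its valid range."""
    return {
        "color": color_map.clamp(0.0, 1.0),        # limited to 0-1
        "highlight": highlight_map.clamp(0.0, 1.0),
        "normal": normal_map.clamp(-1.0, 1.0),     # limited to -1-1
    }
```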
Fig. 6 is a block diagram of a three-dimensional reconstruction device according to an embodiment of the present application.
Alternatively, as another embodiment of the present application, as shown in fig. 6, a three-dimensional reconstruction apparatus includes:
the importing module is used for importing the original video;
the segmentation module is used for segmenting the original video to obtain a plurality of image sequences;
the analysis module is used for respectively analyzing each image sequence to obtain a model view perspective matrix corresponding to each image sequence and a mask image corresponding to each image sequence;
the construction module is used for constructing an original matrix set, and constructing an initial reconstruction model through the original matrix set;
the rendering module is used for respectively rendering each model view perspective matrix and mask images corresponding to each image sequence through the initial reconstruction model to obtain rendered images corresponding to each image sequence;
the optimization module is used for optimizing the initial reconstruction model according to all the image sequences and all the rendering images to obtain a three-dimensional reconstruction model;
the importing module is also used for importing an image to be reconstructed;
the three-dimensional reconstruction result obtaining module is used for carrying out three-dimensional reconstruction on the image to be reconstructed through the three-dimensional reconstruction model to obtain a three-dimensional reconstruction result;
the analysis module is used for:
extracting affine transformation matrix corresponding to each image sequence, image height corresponding to each image sequence, image width corresponding to each image sequence and camera focal length corresponding to each image sequence from each image sequence by utilizing a motion structure algorithm;
extracting mask images corresponding to the image sequences from the image sequences by using a Python tool;
and respectively carrying out matrix calculation on an affine transformation matrix corresponding to each image sequence, an image height corresponding to each image sequence, an image width corresponding to each image sequence and a camera focal length corresponding to each image sequence to obtain a model view perspective matrix corresponding to each image sequence.
Alternatively, another embodiment of the present application provides a three-dimensional reconstruction system including a memory, a processor, and a computer program stored in the memory and executable on the processor, which when executed by the processor, implements the three-dimensional reconstruction method as described above. The system may be a computer or the like.
Alternatively, another embodiment of the present application provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the three-dimensional reconstruction method as described above.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus and units described above may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present application.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.

Claims (8)

1. A three-dimensional reconstruction method, characterized by comprising the following steps:
importing an original video, and dividing the original video to obtain a plurality of image sequences;
analyzing each image sequence to obtain a model view perspective matrix corresponding to each image sequence and a mask image corresponding to each image sequence;
constructing an original matrix set, and constructing an initial reconstruction model through the original matrix set;
rendering the model view perspective matrixes and the mask images corresponding to the image sequences respectively through the initial reconstruction model to obtain rendered images corresponding to the image sequences;
optimizing the initial reconstruction model according to all the image sequences and all the rendering images to obtain a three-dimensional reconstruction model;
importing an image to be reconstructed, and performing three-dimensional reconstruction on the image to be reconstructed through the three-dimensional reconstruction model to obtain a three-dimensional reconstruction result;
the process of analyzing each image sequence to obtain a model view perspective matrix corresponding to each image sequence and a mask image corresponding to each image sequence comprises the following steps:
extracting affine transformation matrix corresponding to each image sequence, image height corresponding to each image sequence, image width corresponding to each image sequence and camera focal length corresponding to each image sequence from each image sequence by utilizing a motion structure algorithm;
extracting mask images corresponding to the image sequences from the image sequences by using a Python tool;
and respectively carrying out matrix calculation on an affine transformation matrix corresponding to each image sequence, an image height corresponding to each image sequence, an image width corresponding to each image sequence and a camera focal length corresponding to each image sequence to obtain a model view perspective matrix corresponding to each image sequence.
2. The three-dimensional reconstruction method according to claim 1, wherein the affine transformation matrix comprises a sum of an x-axis origin and a center offset, a sum of a y-axis origin and a center offset, an x-axis center offset, and a y-axis center offset,
the process of respectively performing matrix calculation on the affine transformation matrix corresponding to each image sequence, the image height corresponding to each image sequence, the image width corresponding to each image sequence and the camera focal length corresponding to each image sequence to obtain the model view perspective matrix corresponding to each image sequence comprises the following steps:
performing matrix calculation on a sum of an x-axis origin and a center offset corresponding to each image sequence, a sum of a y-axis origin and a center offset corresponding to each image sequence, an x-axis center offset corresponding to each image sequence, a y-axis center offset corresponding to each image sequence, an image height corresponding to each image sequence, an image width corresponding to each image sequence and a camera focal length corresponding to each image sequence respectively by a first formula to obtain a model view perspective matrix corresponding to each image sequence, wherein the first formula is as follows:
$$ \mathrm{MVP} = \mathrm{Persp} \times \mathrm{MV}, $$

$$ \mathrm{Persp} = \begin{pmatrix} \dfrac{1}{\eta \tan(\theta/2)} & 0 & 0 & 0 \\ 0 & \dfrac{1}{\tan(\theta/2)} & 0 & 0 \\ 0 & 0 & -\dfrac{n_f + n_n}{n_f - n_n} & -\dfrac{2\, n_f\, n_n}{n_f - n_n} \\ 0 & 0 & -1 & 0 \end{pmatrix}, \qquad \theta = 2\arctan\frac{h}{2f}, \qquad \eta = \frac{w}{h}, $$

and the model view matrix $\mathrm{MV}$ is derived from the camera-to-world affine transformation matrix by removing the center offset, i.e. $t_x' = t_x - c_x$ and $t_y' = t_y - c_y$;

wherein $\mathrm{MVP}$ is the model view perspective matrix, $\mathrm{Persp}$ is the perspective matrix, $\mathrm{MV}$ is the model view matrix, $\theta$ is the vertical viewing angle range of the camera, $\eta$ is the aspect ratio, $n_f$ is the preset far boundary value, $n_n$ is the preset near boundary value, $f$ is the camera focal length, $s_x$ and $s_y$ are scaling factors, $t_x$ is the sum of the x-axis origin and the center offset, $c_x$ is the x-axis center offset, $t_y$ is the sum of the y-axis origin and the center offset, $c_y$ is the y-axis center offset, $h$ is the image height, and $w$ is the image width.
3. The three-dimensional reconstruction method according to claim 1, wherein the original matrix group comprises a tetrahedral vertex three-dimensional coordinate matrix and a vertex index matrix,
the process of constructing an original matrix set and constructing an initial reconstruction model through the original matrix set comprises the following steps:
s31: counting the number of three-dimensional coordinates in the tetrahedron vertex three-dimensional coordinate matrix to obtain the total number of the tetrahedron vertex three-dimensional coordinates;
s32: carrying out random assignment on the total number of the three-dimensional coordinates of the tetrahedron vertexes to obtain a plurality of SDF values, and constructing an SDF value matrix through all the SDF values;
s33: and constructing a model of the tetrahedron vertex three-dimensional coordinate matrix, the vertex index matrix and the SDF value matrix by using a marching tetrahedron algorithm to obtain an initial reconstruction model.
4. The method of claim 3, wherein optimizing the initial reconstruction model based on all of the image sequences and all of the rendered images to obtain a three-dimensional reconstruction model comprises:
performing loss function calculation on all the image sequences and all the rendered images to obtain a target loss function;
and updating parameters of the tetrahedron vertex three-dimensional coordinate matrix and the SDF value matrix according to the target loss function, returning to S33 after updating until the preset iteration times are reached, and taking the initial reconstruction model as a three-dimensional reconstruction model.
5. The three-dimensional reconstruction method according to claim 4, wherein the step of performing a loss function calculation on all the image sequences and all the rendered images to obtain an objective loss function comprises:
performing loss function calculation on all the image sequences and all the rendered images through a second formula to obtain a target loss function, wherein the second formula is as follows:
$$ \mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \left( I_i - \hat{I}_i \right)^2, $$

wherein $\mathcal{L}$ is the target loss function, $I_i$ is the $i$-th original image of the image sequence, $\hat{I}_i$ is the corresponding rendered image, and $n$ is the total number of images in the image sequence.
6. A three-dimensional reconstruction apparatus, comprising:
the importing module is used for importing the original video;
the segmentation module is used for segmenting the original video to obtain a plurality of image sequences;
the analysis module is used for respectively analyzing each image sequence to obtain a model view perspective matrix corresponding to each image sequence and a mask image corresponding to each image sequence;
the construction module is used for constructing an original matrix set, and constructing an initial reconstruction model through the original matrix set;
the rendering module is used for respectively rendering each model view perspective matrix and mask images corresponding to each image sequence through the initial reconstruction model to obtain rendered images corresponding to each image sequence;
the optimization module is used for optimizing the initial reconstruction model according to all the image sequences and all the rendering images to obtain a three-dimensional reconstruction model;
the importing module is also used for importing an image to be reconstructed;
the three-dimensional reconstruction result obtaining module is used for carrying out three-dimensional reconstruction on the image to be reconstructed through the three-dimensional reconstruction model to obtain a three-dimensional reconstruction result;
the analysis module is used for:
extracting affine transformation matrix corresponding to each image sequence, image height corresponding to each image sequence, image width corresponding to each image sequence and camera focal length corresponding to each image sequence from each image sequence by utilizing a motion structure algorithm;
extracting mask images corresponding to the image sequences from the image sequences by using a Python tool;
and respectively carrying out matrix calculation on an affine transformation matrix corresponding to each image sequence, an image height corresponding to each image sequence, an image width corresponding to each image sequence and a camera focal length corresponding to each image sequence to obtain a model view perspective matrix corresponding to each image sequence.
7. A three-dimensional reconstruction system comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the three-dimensional reconstruction method according to any one of claims 1 to 5 is implemented when the computer program is executed by the processor.
8. A computer readable storage medium storing a computer program, characterized in that the three-dimensional reconstruction method according to any one of claims 1 to 5 is implemented when the computer program is executed by a processor.
CN202311084904.2A 2023-08-28 2023-08-28 Three-dimensional reconstruction method, device, system and storage medium Active CN116824026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311084904.2A CN116824026B (en) 2023-08-28 2023-08-28 Three-dimensional reconstruction method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311084904.2A CN116824026B (en) 2023-08-28 2023-08-28 Three-dimensional reconstruction method, device, system and storage medium

Publications (2)

Publication Number Publication Date
CN116824026A
CN116824026B (en) 2024-01-09

Family

ID=88120565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311084904.2A Active CN116824026B (en) 2023-08-28 2023-08-28 Three-dimensional reconstruction method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN116824026B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020249076A1 (en) * 2019-06-14 2020-12-17 华为技术有限公司 Face calibration method and electronic device
CN112784469A (en) * 2021-02-25 2021-05-11 广州虎牙科技有限公司 Model parameter generation method and device, electronic equipment and readable storage medium
CN113160296A (en) * 2021-03-31 2021-07-23 清华大学 Micro-rendering-based three-dimensional reconstruction method and device for vibration liquid drops
CN113256718A (en) * 2021-05-27 2021-08-13 浙江商汤科技开发有限公司 Positioning method and device, equipment and storage medium
CN114119849A (en) * 2022-01-24 2022-03-01 阿里巴巴(中国)有限公司 Three-dimensional scene rendering method, device and storage medium
CN115115780A (en) * 2022-06-29 2022-09-27 聚好看科技股份有限公司 Three-dimensional reconstruction method and system based on multi-view RGBD camera
CN115439607A (en) * 2022-09-01 2022-12-06 中国民用航空总局第二研究所 Three-dimensional reconstruction method and device, electronic equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LIQUAN YU; MEIHUA XIAO: "Review and Evaluation of Classification Algorithms Enhancing Internet Security", 2010 International Conference on Web Information Systems and Mining
MEIHUA XIAO ET AL: "A formal analysis method for composition protocol based on model checking", Scientific Reports
MIAO YONGWEI; FENG XIAOHONG; YU LIJIE; CHEN JIAZHOU; LI YONGSHUI: "Interactive progressive modeling of three-dimensional buildings based on a single image", Journal of Computer-Aided Design & Computer Graphics, no. 09
LUO GUOLIANG; CHEN QIANG; WANG RUI; XIAO MEIHUA; YANG HUI: "A large-scale three-dimensional face synthesis system with sample similarity suppression", East China Jiaotong University

Also Published As

Publication number Publication date
CN116824026B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
Gortler et al. The lumigraph
CN110363858B (en) Three-dimensional face reconstruction method and system
CN109118582B (en) Commodity three-dimensional reconstruction system and reconstruction method
Digne et al. Scale space meshing of raw data point sets
Poulin et al. Interactively modeling with photogrammetry
CN111127633A (en) Three-dimensional reconstruction method, apparatus, and computer-readable medium
CN105453139A (en) Sparse GPU voxelization for 3D surface reconstruction
Long et al. Neuraludf: Learning unsigned distance fields for multi-view reconstruction of surfaces with arbitrary topologies
Gibson et al. Interactive reconstruction of virtual environments from video sequences
CN108665530B (en) Three-dimensional modeling implementation method based on single picture
Ramanarayanan et al. Feature-based textures
Sarkar et al. Structured low-rank matrix factorization for point-cloud denoising
CN115439607A (en) Three-dimensional reconstruction method and device, electronic equipment and storage medium
JP2000268179A (en) Three-dimensional shape information obtaining method and device, two-dimensional picture obtaining method and device and record medium
Laycock et al. Aligning archive maps and extracting footprints for analysis of historic urban environments
Fua et al. Reconstructing surfaces from unstructured 3d points
CN110706332B (en) Scene reconstruction method based on noise point cloud
CN110335275B (en) Fluid surface space-time vectorization method based on three-variable double harmonic and B spline
CN116824026B (en) Three-dimensional reconstruction method, device, system and storage medium
Mi et al. 3D reconstruction based on the depth image: A review
Bullinger et al. 3D Surface Reconstruction from Multi-Date Satellite Images
Zach et al. Accurate Dense Stereo Reconstruction using Graphics Hardware.
Nie et al. Physics-preserving fluid reconstruction from monocular video coupling with SFS and SPH
CN114049423A (en) Automatic realistic three-dimensional model texture mapping method
Nguyen et al. Modelling of 3d objects using unconstrained and uncalibrated images taken with a handheld camera

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant