CN117078867B - Three-dimensional reconstruction method, three-dimensional reconstruction device, storage medium and electronic equipment

Info

Publication number
CN117078867B
CN117078867B (application number CN202311330012.6A)
Authority
CN
China
Prior art keywords
model
target
dimensional
reconstruction
generation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311330012.6A
Other languages
Chinese (zh)
Other versions
CN117078867A (en)
Inventor
方顺
汪成峰
冯星
崔铭
张志恒
王海龙
于飞
米凌峰
温思远
刘志伟
马晓宏
武鹏涛
王冉
李向楠
李雪瑶
张智毅
张羽
寇智博
程宇
徐春阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xuanguang Technology Co ltd
Original Assignee
Beijing Xuanguang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xuanguang Technology Co ltd filed Critical Beijing Xuanguang Technology Co ltd
Priority to CN202311330012.6A priority Critical patent/CN117078867B/en
Publication of CN117078867A publication Critical patent/CN117078867A/en
Application granted granted Critical
Publication of CN117078867B publication Critical patent/CN117078867B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

Some embodiments of the present application provide a three-dimensional reconstruction method, apparatus, storage medium, and electronic device, where the method includes: determining reconstruction parameters for three-dimensional reconstruction of a target object, wherein the target object comprises: a target picture and/or a target sentence, and the reconstruction parameters comprise: a three-dimensional format and a target simulation scale; inputting the target object into a target three-dimensional reconstruction model matched with the reconstruction parameters, and outputting the target three-dimensional model of the target object, wherein the target three-dimensional reconstruction model comprises: a first generation model and/or a second generation model. Some embodiments of the application may improve the efficiency and effectiveness of three-dimensional modeling.

Description

Three-dimensional reconstruction method, three-dimensional reconstruction device, storage medium and electronic equipment
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a three-dimensional reconstruction method, apparatus, storage medium, and electronic device.
Background
With the continuous development of rendering technology, the demand for image-based three-dimensional modeling technology is also increasing.
Currently, three-dimensional reconstruction of an image is performed by analyzing the features of the image to rebuild a three-dimensional model. Because images come in many forms, this feature-analysis approach involves a heavy workload, and the efficiency of three-dimensional reconstruction is low.
Therefore, how to provide an efficient three-dimensional reconstruction method is a technical problem to be solved.
Disclosure of Invention
The application aims to provide a three-dimensional reconstruction method, an apparatus, a storage medium and an electronic device.
In a first aspect, some embodiments of the present application provide a method of three-dimensional reconstruction, comprising: determining reconstruction parameters for three-dimensional reconstruction of a target object, wherein the target object comprises: a target picture and/or a target sentence, and the reconstruction parameters comprise: a three-dimensional format and a target simulation scale; inputting the target object into a target three-dimensional reconstruction model matched with the reconstruction parameters, and outputting a target three-dimensional model of the target object, wherein the target three-dimensional reconstruction model comprises: a first generation model and/or a second generation model.
In some embodiments of the present application, after the reconstruction parameters of the target object are determined, the target object is input into the corresponding target three-dimensional reconstruction model to obtain the target three-dimensional model corresponding to the target object. This realizes three-dimensional reconstruction from a single picture and/or a target sentence with high efficiency and a good reconstruction effect, which further improves the efficiency of subsequent work.
In some embodiments, before the inputting the target object into the target three-dimensional reconstruction model that matches the reconstruction parameters, the method further comprises: training an initial reconstruction model by using a first training data set based on a pre-trained target generation model to acquire the first generation model; and/or training an initial language model by using a second training data set based on the target generation model to acquire the second generation model; the pre-trained target generation model is obtained by pre-training an initial generation model through the first training data set, and the target generation model comprises: a decoder, a simulation scale and a network estimator.
In some embodiments of the present application, the initial reconstruction model and the initial language model can be trained through the training data sets and the target generation model to obtain the first generation model and the second generation model, which provides effective model support for subsequent three-dimensional reconstruction.
In some embodiments, the first training data set comprises: a plurality of three-dimensional model samples, and a plurality of pictures corresponding to each of the plurality of three-dimensional model samples; the second training data set comprises: the plurality of three-dimensional model samples and label information corresponding to each three-dimensional model sample in the plurality of three-dimensional model samples.
Some embodiments of the present application may provide efficient data support for model training through the first training data set and the second training data set.
In some embodiments, the training the initial reconstructed model with the first training data set based on the pre-trained target generation model to obtain the first generation model includes: inputting each three-dimensional model sample corresponding to each picture in the plurality of pictures to the target generation model to obtain a first prediction result; respectively inputting a plurality of pictures corresponding to each three-dimensional model sample into the initial reconstruction model, and generating a model based on the target to generate a second prediction result; and optimizing the initial reconstruction model by using the first prediction result and the second prediction result to obtain the first generation model.
In some embodiments, the training the initial language model with the second training data set based on the target generation model, to obtain the second generation model includes: inputting label information corresponding to each three-dimensional model sample into the initial language model, and generating a model based on the target to obtain a third prediction result; and optimizing the initial language model by using the first prediction result and the third prediction result to obtain the second generation model.
In some embodiments, the plurality of three-dimensional model samples are characterized by a point cloud, voxel, grid, or signed distance function (SDF).
Some embodiments of the application can support three-dimensional model samples in various forms, and have wider adaptability.
In some embodiments, the inputting each three-dimensional model sample corresponding to each of the plurality of pictures to the target generation model to obtain a first prediction result includes: scaling each three-dimensional model sample so that the size of a bounding box of each three-dimensional model sample meets a set threshold; dividing the bounding box of each three-dimensional model sample according to a preset size to obtain a model sample block corresponding to each three-dimensional model sample; generating the first prediction result through the model sample block.
Some embodiments of the present application may provide support for model training by inputting into a target generation model, processing each three-dimensional model sample to generate a first prediction result.
In a second aspect, some embodiments of the present application provide an apparatus for three-dimensional reconstruction, comprising: a parameter determining module, configured to determine a reconstruction parameter for three-dimensionally reconstructing a target object, where the target object includes: target pictures and/or target sentences, wherein the reconstruction parameters comprise: three-dimensional format and target simulation scale; the model reconstruction module is used for inputting the target object into a target three-dimensional reconstruction model matched with the reconstruction parameters and outputting a target three-dimensional model of the target object, wherein the target three-dimensional reconstruction model comprises: the first generative model and/or the second generative model.
In a third aspect, some embodiments of the application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs a method according to any of the embodiments of the first aspect.
In a fourth aspect, some embodiments of the application provide an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor is operable to implement a method according to any of the embodiments of the first aspect when executing the program.
In a fifth aspect, some embodiments of the application provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor, is adapted to carry out the method according to any of the embodiments of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of some embodiments of the present application, the drawings that are required to be used in some embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be construed as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 is a system diagram of a three-dimensional reconstruction provided by some embodiments of the present application;
FIG. 2 is a first schematic diagram of a network model structure provided by some embodiments of the present application;
FIG. 3 is a second schematic diagram of a network model structure provided by some embodiments of the present application;
FIG. 4 is a third schematic diagram of a network model structure provided by some embodiments of the present application;
FIG. 5 is a flow chart of a method for three-dimensional reconstruction provided by some embodiments of the present application;
FIG. 6 is a block diagram of a three-dimensional reconstruction apparatus according to some embodiments of the present application;
fig. 7 is a schematic diagram of an electronic device according to some embodiments of the present application.
Detailed Description
The technical solutions of some embodiments of the present application will be described below with reference to the drawings in some embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
In the related art, in industries such as movies and games, the three-dimensional model creation process is very time-consuming and occupies most of the cost of a project. Therefore, how to improve the efficiency of three-dimensional model reconstruction and reduce its cost is a problem in the prior art.
In view of this, some embodiments of the present application provide a three-dimensional reconstruction method, where after a reconstruction parameter of a three-dimensional reconstruction of a target object is selected and determined, the target object may be input into a matched target three-dimensional reconstruction model, so as to obtain a target three-dimensional model corresponding to the target object. According to the method, the corresponding target three-dimensional reconstruction model can be selected through the reconstruction parameters, and the adaptability is wide; meanwhile, the three-dimensional reconstruction of the target object through the target three-dimensional reconstruction model is high in efficiency and good in effect, and the subsequent working efficiency is guaranteed.
The overall composition of a three-dimensional reconstruction system provided by some embodiments of the present application is described below by way of example with reference to fig. 1.
As shown in fig. 1, some embodiments of the present application provide a system for three-dimensional reconstruction, the system for three-dimensional reconstruction comprising: a terminal 100 and a processing server 200. The terminal 100 may transmit the target object, which needs to be three-dimensionally reconstructed, to the processing server 200, and the user may select reconstruction parameters on the terminal 100 and transmit them to the processing server 200. The processing server 200 may input the target object into the target three-dimensional reconstruction model by matching the reconstruction parameters with the corresponding target three-dimensional reconstruction model, obtain a target three-dimensional model of the target object, and send the target three-dimensional model to the terminal 100 for display to the user.
In some embodiments of the present application, the terminal 100 may be a mobile terminal or a non-portable computer terminal, and embodiments of the present application are not limited herein.
In some embodiments of the present application, the target object may be a target picture, a target sentence, or a combination of a target picture and a target sentence. In one embodiment, when the target object is a target picture, the target three-dimensional reconstruction model is a first generation model obtained by training in advance. In one embodiment, when the target object is a target sentence, the target three-dimensional reconstruction model is a second generation model obtained by training in advance. In another embodiment, when the target object is a combination of the target picture and the target sentence, the target three-dimensional reconstruction model comprises both the first generation model and the second generation model, and the outputs of the two generation models correct and supplement each other to obtain the final target three-dimensional model. The reconstruction method may be selected according to the actual situation, and the embodiment of the present application is not limited herein.
In order to implement the fast three-dimensional reconstruction of the target object, the relevant model needs to be trained first to obtain the target three-dimensional reconstruction model, so the implementation procedure for obtaining the target three-dimensional reconstruction model, which is performed by the processing server 200 according to some embodiments of the present application, is exemplarily described below with reference to fig. 2. From the foregoing, it is appreciated that the target three-dimensional reconstruction model includes a first generative model and/or a second generative model, and that the training processes of the first generative model and the second generative model are both related to the target generative model, and thus, the process of acquiring the target generative model is first described in the following by way of example.
Referring to fig. 2, fig. 2 is a block diagram of acquiring a target generation model according to some embodiments of the present application.
In some embodiments of the application, the method of three-dimensional reconstruction further comprises: training a target generation model. The target generation model is obtained by pre-training an initial generation model through a first training data set. The target generation model includes: a decoder, a simulation scale and a network estimator. The first training data set comprises: a plurality of three-dimensional model samples and a plurality of pictures corresponding to each of the plurality of three-dimensional model samples.
For example, in some embodiments of the application, the initial generation model may employ a VQ-VAE (Vector Quantized Variational Autoencoder, here also referred to as 3D-VQ-VAE) network structure. As shown in fig. 2, the VQ-VAE includes an encoder, a decoder, and a Transformer. The VQ-VAE is trained on the training data set; during training, the decoder, the Codebook (as a specific example of the simulation scale) and the Transformer (as a specific example of the network estimator) of the trained VQ-VAE are obtained by optimizing loss functions (e.g., reconstruction loss, target loss, etc.).
For example, as one specific example of the present application, the training procedure is as follows. As shown in the process 1→2→3→4→5→6 of fig. 2, the input 3D (i.e., three-dimensional) model sample (e.g., the automobile model of fig. 2) is first subjected to blocking by the VQ-VAE, and then passed through the encoder, the Transformer and the decoder to generate a predicted model sample (i.e., the generated 3D model of fig. 2). The predicted model sample is compared with the complete model corresponding to the 3D model sample, and training is optimized through a pixel-wise L2 loss function and a Transformer loss function (namely, a cross-entropy loss function with L2 regularization, trained in a masked manner), thereby obtaining the trained decoder, Codebook and Transformer. The trained Codebook is the coding part, whose output is a hidden vector; it can simulate and generate three-dimensional models at different coverage proportions such as 45%, 65% or 85%, and the lower the proportion, the greater the diversity of the generated three-dimensional models. The remaining proportion of the three-dimensional model is predicted by the trained Transformer. For example, the trained Codebook simulates 85%, and the trained Transformer predicts the remaining 15%. In practical applications, the proportion of simulated generation (also called the simulation scale) may be set according to the actual situation, and embodiments of the present application are not limited herein.
Additionally, in other embodiments of the present application, the convolutional neural network (CNN) in the decoder of the 3D-VQ-VAE may be replaced with a fully convolutional network (FCN), in which case the 3D model samples need not be scaled to the set threshold.
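The patent gives no source code, so the following is a minimal sketch of a 3D-VQ-VAE of the kind described above, assuming a single-channel voxel-grid input and standard vector quantization; all module shapes and names (VQVAE3D, codebook_size, etc.) are illustrative assumptions rather than the patented architecture, and the masked Transformer over codebook indices (trained with cross-entropy, as described above) is omitted for brevity.
```python
# Minimal 3D-VQ-VAE sketch (illustrative assumptions, not the patented network).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQVAE3D(nn.Module):
    def __init__(self, codebook_size=512, dim=64):
        super().__init__()
        # Encoder: voxel grid -> latent grid (e.g. 128^3 -> 32^3).
        self.enc = nn.Sequential(
            nn.Conv3d(1, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(dim, dim, 4, stride=2, padding=1),
        )
        # Codebook: the discrete latent space (the "simulation scale" component).
        self.codebook = nn.Embedding(codebook_size, dim)
        # Decoder: quantized latents -> reconstructed voxel grid.
        self.dec = nn.Sequential(
            nn.ConvTranspose3d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(dim, 1, 4, stride=2, padding=1),
        )

    def quantize(self, z):
        # Map each latent vector to its nearest codebook entry.
        b, c, d, h, w = z.shape
        flat = z.permute(0, 2, 3, 4, 1).reshape(-1, c)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        q = self.codebook(idx).view(b, d, h, w, c).permute(0, 4, 1, 2, 3)
        return q, idx

    def forward(self, x):
        z = self.enc(x)
        q, idx = self.quantize(z)
        recon = self.dec(z + (q - z).detach())  # straight-through estimator
        # Pixel-wise L2 reconstruction loss plus the usual VQ commitment terms.
        loss = (F.mse_loss(recon, x)
                + F.mse_loss(q, z.detach())
                + 0.25 * F.mse_loss(z, q.detach()))
        return recon, idx, loss
```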
The acquisition process of the first generation model is exemplarily described below.
In some embodiments of the present application, a method of obtaining a first generation model includes:
s1, acquiring a first training data set.
In some embodiments of the application, the first training data set comprises: a plurality of three-dimensional model samples, and a plurality of pictures corresponding to each of the plurality of three-dimensional model samples. The characterization mode of the plurality of three-dimensional model samples is a point cloud, voxel, grid, or signed distance function (SDF).
For example, in some embodiments of the present application, a first training data set is prepared first, in which the three-dimensional (3D) model samples may be characterized as: point clouds, voxels, mesh (i.e., grid), signed distance function (SDF, Signed Distance Field), etc. In practice, the input data for training the model is paired (picture + 3D model sample); the two correspond one to one, i.e. one 3D model sample corresponds to pictures from different angles or with different details.
Specifically, a plurality of model samples (i.e., 3D model samples) in the different characterization modes may be collected, and for each model sample in each characterization mode, 6 single side views are generated (for example, one every 45 degrees, as a specific example of the plurality of pictures). For each characterization mode, 20,000 3D models and the corresponding 120,000 pictures are prepared (480,000 pictures in total). The number of three-dimensional model samples may be set according to the actual situation, and the embodiment of the present application is not limited thereto.
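As a sketch of how such paired data might be organized (the file layout, names and formats are assumptions for illustration; the patent only specifies that pictures and 3D samples are paired):
```python
# Hypothetical paired dataset: one 3D sample, several rendered views of it.
import numpy as np
from torch.utils.data import Dataset

class PairedViewsDataset(Dataset):
    def __init__(self, samples):
        # samples: list of (voxel_path, [view_path, ...]) pairs, flattened so
        # that each picture becomes one training item paired with its 3D sample.
        self.items = [(v, p) for v, views in samples for p in views]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        voxel_path, pic_path = self.items[i]
        voxels = np.load(voxel_path)    # e.g. a 128^3 occupancy grid
        picture = np.load(pic_path)     # e.g. an HxWx3 rendered side view
        return picture, voxels
```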
And S2, training an initial reconstruction model by using a first training data set based on a pre-trained target generation model, and obtaining the first generation model.
For example, in some embodiments of the present application, the first generation model is trained through the network model structure shown in FIG. 3. The network model structure in fig. 3 includes: an initial reconstruction model composed of a residual network and an up-convolution network, together with the Codebook, Transformer and decoder of the trained target generation model.
In some embodiments of the application, S2 may comprise: S21, inputting each three-dimensional model sample corresponding to each picture in the plurality of pictures into the target generation model to obtain a first prediction result; S22, respectively inputting the plurality of pictures corresponding to each three-dimensional model sample into the initial reconstruction model, and generating a second prediction result based on the target generation model; S23, optimizing the initial reconstruction model by using the first prediction result and the second prediction result to obtain the first generation model.
For example, in some embodiments of the present application, the three-dimensional model sample corresponding to each picture in the first training data set is input into the target generation model to generate a first prediction result Z, each picture is input into the network model shown in fig. 3 to obtain a second prediction result Z', and the initial reconstruction model composed of the residual network and the up-convolution network is optimized by calculating the loss between Z and Z', so as to obtain the first generation model. The "input 3D model" corresponds to the "input picture"; for example, if the 3D model is an automobile, the input picture should also be a picture of this automobile in order to train the network. The training process shown in fig. 3 includes: 7→8→9→5→6, namely: the 3D model sample corresponding to the picture is input into the target generation model to generate Z. The picture is input into the network model in fig. 3, and Z' is generated by the trained Codebook described above. The loss between Z' and Z is calculated through a loss function to optimize the initial reconstruction model, and when the number of training iterations reaches an upper limit or the loss converges to a preset value, the trained model is output to obtain the first generation model. It should be noted that the feature vectors output by process 5 in fig. 3 are complete, e.g., 64×64 feature vectors, whereas the feature vectors generated by the Codebook are limited, e.g., 32×32. The role of the Transformer is to use these 32×32 feature vectors to estimate the 64×64 feature vectors, thereby completing the estimation of the features. That is, the embodiment of the application trains the initial reconstruction model, formed by the residual network and the up-convolution network, with the assistance of the trained target generation model to obtain the final first generation model.
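A minimal sketch of this step, assuming target_model is the frozen pre-trained generation model whose encoding of a 3D sample yields Z (the encode method and the optimizer settings here are assumptions), and recon_model is the trainable residual-plus-up-convolution image branch producing Z':
```python
# Sketch of S2: optimize the image branch against the frozen target model.
import torch
import torch.nn.functional as F

def train_first_generation_model(recon_model, target_model, loader,
                                 epochs=10, lr=1e-4, device="cuda"):
    target_model.eval()  # pre-trained target generation model stays frozen
    opt = torch.optim.Adam(recon_model.parameters(), lr=lr)
    for _ in range(epochs):
        for pictures, samples in loader:
            pictures, samples = pictures.to(device), samples.to(device)
            with torch.no_grad():
                z = target_model.encode(samples)   # first prediction result Z
            z_prime = recon_model(pictures)        # second prediction result Z'
            loss = F.mse_loss(z_prime, z)          # loss between Z' and Z
            opt.zero_grad()
            loss.backward()
            opt.step()
    return recon_model  # the first generation model
```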
In some embodiments of the present application, S21 may further include: scaling each three-dimensional model sample so that the size of a bounding box of each three-dimensional model sample meets a set threshold; dividing the bounding box of each three-dimensional model sample according to a preset size to obtain a model sample block corresponding to each three-dimensional model sample; generating the first prediction result through the model sample block.
For example, in some embodiments of the present application, the blocking process in fig. 2 and the segmentation performed to obtain Z in S21 above may specifically include the following. The target generation model scales the bounding box of the input 3D model as close as possible to a fixed value (e.g., 128×128×128, as a specific example of the set threshold), in order to keep the sizes of the input models uniform. The set threshold may be chosen according to the actual situation, and embodiments of the present application are not limited herein. Then, according to a certain length, width and height (e.g., 1×1×1, as a specific example of the preset size), the bounding box is cut into a number of small blocks, obtaining the model sample blocks of the 3D model sample, for example, 512 patches. The specific splitting method may include: for example, splitting the bounding box into 8×8×8 blocks; or splitting the bounding box into 64×64×64 blocks, deleting the invalid blocks, merging every 8 adjacent blocks into a final patch, and automatically padding outer layers that have fewer than 8 blocks, finally forming 512 blocks; or first wrapping the model with one large bounding box, splitting it into 8 blocks and deleting the useless blocks, then splitting each remaining block into 8 blocks and deleting the useless blocks, repeating this twice more, and then merging upwards by combining every 8 adjacent blocks into one block and padding where blocks are missing, to finally form 8×8×8, 512 blocks in total. The splitting mode may be selected according to the actual situation, and the embodiment of the application is not limited thereto.
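The splitting described above can be sketched with plain array operations; the sizes below (a 128³ grid cut into 8×8×8 = 512 patches of 16³ voxels) follow the example figures in the text, while the zero-padding is a simplifying assumption:
```python
# Sketch: normalize a voxel grid to 128^3 and split it into 512 patches.
import numpy as np

def split_into_patches(voxels, grid=128, patches_per_axis=8):
    # Pad (or crop) so the bounding box has a uniform size, e.g. 128^3.
    padded = np.zeros((grid, grid, grid), dtype=voxels.dtype)
    d, h, w = (min(s, grid) for s in voxels.shape)
    padded[:d, :h, :w] = voxels[:d, :h, :w]
    # Cut into 8 x 8 x 8 = 512 blocks of 16^3 voxels each.
    p = grid // patches_per_axis
    blocks = (padded
              .reshape(patches_per_axis, p, patches_per_axis, p,
                       patches_per_axis, p)
              .transpose(0, 2, 4, 1, 3, 5)
              .reshape(-1, p, p, p))
    return blocks  # shape (512, 16, 16, 16)
```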
The acquisition process of the second generative model is exemplarily described below.
In some embodiments of the application, a method of obtaining a second generative model comprises:
s3, acquiring a second training data set.
In some embodiments of the application, the second training data set comprises: the three-dimensional model samples and the label information corresponding to each three-dimensional model sample in the three-dimensional model samples.
For example, in some embodiments of the application, each three-dimensional model sample is given three levels of sentence labels (as a specific example of label information). The first level is a category label (e.g., an entity such as a chair, table, house, or car); the second level is a key-information label, i.e., key descriptive information about the entity (e.g., a yellow chair, a three-legged chair, a wooden chair); the third level is a detail label describing specific details (e.g., a seat cushion or a box placed on a yellow chair). As with the first training data set, 20,000 3D models are prepared, along with the corresponding 60,000 labels (240,000 labels in total). In training, the input data is paired, i.e., three-level sentence labels + 3D model sample; the two correspond to each other (one-to-many in training, i.e., one model corresponds to sentences with different labels).
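For illustration, one such three-level label might be stored as a simple structure (the storage format is not specified in the patent; this is an assumption):
```python
# Hypothetical representation of a three-level sentence label.
label = {
    "category": "chair",                                  # level 1: category
    "key_info": "yellow three-legged wooden chair",       # level 2: key information
    "detail":   "a seat cushion is placed on the chair",  # level 3: specific details
}
```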
And S4, training the initial language model by using a second training data set based on the target generation model, and obtaining the second generation model.
For example, in some embodiments of the present application, the second generation model is obtained by training through the network model structure shown in FIG. 4. The network model structure in fig. 4 includes: an initial language model formed by a BERT model and an up-convolution network, together with the Codebook, Transformer and decoder of the trained target generation model.
In some embodiments of the application, S4 may comprise: inputting the label information corresponding to each three-dimensional model sample into the initial language model, and obtaining a third prediction result based on the target generation model; and optimizing the initial language model by using the first prediction result and the third prediction result to obtain the second generation model.
For example, in some embodiments of the present application, the label information in the second training data set is input into the network model shown in fig. 4 to generate a third prediction result Z″, and the initial language model in fig. 4 is optimized by computing the loss between Z″ and the Z obtained in S21, so as to obtain a second generation model that meets the requirements. For example, the training process shown in fig. 4 includes: 10→11→12→5→6, namely: one item of label information in fig. 4 is "a black Mercedes-Benz G-Class off-road vehicle", and the 3D model Z″ shown in fig. 4 is generated through the training process 11→12→5→6. The up-convolution network in fig. 4 is trained using the already-trained Transformer, VQ-VAE decoder and Codebook (BERT is already pre-trained). The loss function in this training process is a pixel-wise L2 loss function on the generated 3D model Z″.
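A sketch of this text branch, assuming a frozen pre-trained BERT encoder from the Hugging Face transformers library and a small trainable up-convolution head producing Z″; the checkpoint name, projection sizes and pooling choice are assumptions:
```python
# Sketch of S4: a frozen BERT encoder plus a trainable up-convolution head.
import torch
import torch.nn as nn
from transformers import BertModel

class TextToLatent(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        self.latent_dim = latent_dim
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        for p in self.bert.parameters():  # BERT is already pre-trained; freeze it
            p.requires_grad = False
        # Trainable head: sentence embedding -> coarse 3D latent grid.
        self.proj = nn.Linear(768, latent_dim * 4 * 4 * 4)
        self.upconv = nn.ConvTranspose3d(latent_dim, latent_dim,
                                         4, stride=2, padding=1)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).pooler_output
        z = self.proj(h).view(-1, self.latent_dim, 4, 4, 4)
        return self.upconv(z)  # third prediction result Z'' (an 8^3 latent grid)

# Training then minimizes the L2 loss between Z'' and Z, as described above.
```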
It should be noted that the training process provided above is applicable to training target three-dimensional reconstruction models for each of the different characterization forms. That is, the 3D model data may be represented by a point cloud, voxels, a grid, an SDF, etc., and each representation corresponds to one target three-dimensional reconstruction model, with its own Codebook, Transformer and VQ-VAE decoder. For example, if the representation of the 3D model samples is a point cloud, the target three-dimensional reconstruction model obtained by training on the training data set in that representation is used to generate target three-dimensional models in point-cloud form. Therefore, the processing server 200 may separately train, through the above training process, target three-dimensional reconstruction models for the point cloud, voxel, grid and SDF forms, and store them in a local model library. That is, the local model library stores a plurality of trained models with the same network structure but different parameters, so that the generation of target three-dimensional models in various data formats (such as point cloud, voxel, grid and SDF) can be supported.
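The local model library described here amounts to a lookup keyed by the reconstruction parameters; a minimal sketch, with hypothetical key names and weight paths:
```python
# Sketch: match reconstruction parameters to a stored reconstruction model.
MODEL_LIBRARY = {
    # (input type, 3D format) -> trained weights (paths are hypothetical)
    ("picture",  "point_cloud"): "models/first_gen_pointcloud.pt",
    ("sentence", "point_cloud"): "models/second_gen_pointcloud.pt",
    ("picture",  "voxel"):       "models/first_gen_voxel.pt",
    ("sentence", "voxel"):       "models/second_gen_voxel.pt",
    # ... grid and SDF variants follow the same pattern
}

def match_model(input_type: str, three_d_format: str) -> str:
    try:
        return MODEL_LIBRARY[(input_type, three_d_format)]
    except KeyError:
        raise ValueError(f"no trained model for {input_type}/{three_d_format}")
```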
The following is an exemplary description of a specific process for three-dimensional reconstruction provided by some embodiments of the present application in connection with fig. 5.
Referring to fig. 5, fig. 5 is a flowchart of a method for three-dimensional reconstruction according to some embodiments of the present application, where the method for three-dimensional reconstruction includes:
s510, determining reconstruction parameters for three-dimensional reconstruction of a target object, wherein the target object comprises: target pictures and/or target sentences, wherein the reconstruction parameters comprise: three-dimensional format and target simulation scale.
For example, in some embodiments of the present application, a user may input or select, at the terminal 100, the 3D model data format (as a specific example of the three-dimensional format) to be output after three-dimensional reconstruction of the target object, select a target simulation scale (e.g., 85%), and send them to the processing server 200. The target object may be a single picture (as a specific example of the target picture), a target sentence, i.e., a piece of text containing three-level sentence labels, or a combination of a single picture and three-level sentence labels. The embodiments of the present application are not particularly limited herein.
S520, inputting the target object into a target three-dimensional reconstruction model matched with the reconstruction parameters, and outputting a target three-dimensional model of the target object, wherein the target three-dimensional reconstruction model comprises: the first generative model and/or the second generative model.
For example, in some embodiments of the present application, the processing server 200 may match a corresponding target three-dimensional reconstruction model from the local model library based on the type of the target object, the 3D model data format, and the target simulation scale. The target three-dimensional reconstruction models are obtained in advance through the training method provided above and stored in the model library. For example, if the target object is a single picture, the 3D model data format is a point cloud, and the target simulation scale is 85%, then a first generation model that generates point-cloud output is matched from the local model library.
For example, as another specific example of the present application, the target object is a combination of a single picture and three-level sentence labels, the 3D model data format is a point cloud, and the target simulation scale is 85%; in this case, a first generation model and a second generation model that generate point-cloud output are matched from the local model library. The single picture is input into the first generation model to obtain a first generated 3D model, and the three-level sentence labels are input into the second generation model to obtain a second generated 3D model. Finally, detail correction is performed on the second generated 3D model using the first generated 3D model to obtain the target three-dimensional model; or detail correction is performed on the first generated 3D model using the second generated 3D model to obtain the target three-dimensional model. For example, the single picture shows a square table, and a table 3D model (i.e., the first generated 3D model) can be obtained through the first generation model. The three-level sentence label is "a square table with a cylindrical vase", and a 3D model of a square table with a cylindrical vase placed on it (i.e., the second generated 3D model) can be obtained through the second generation model. The table 3D model is corrected using the 3D model of the square table with the cylindrical vase, obtaining a 3D model of a table with a vase placed on it (i.e., the target three-dimensional model).
In some embodiments of the present application, S520 may include: after the target object is input into the target three-dimensional reconstruction model, generating corresponding hidden feature vectors based on the target simulation scale while predicting the remaining feature vectors through the network estimator; and obtaining, by the decoder, the target three-dimensional model based on the hidden feature vectors and the remaining feature vectors.
For example, in some embodiments of the present application, the target three-dimensional reconstruction model contains a trained Codebook (as a specific example of the target simulation scale) and a trained Transformer (as a specific example of the network estimator). After the target object is input into the target three-dimensional reconstruction model, 85% of the hidden feature vectors of the final target three-dimensional model are first generated through the Codebook, the remaining 15% are predicted by the trained Transformer to generate the predicted remaining feature vectors, and finally the two sets of vectors are combined to generate the target 3D model (i.e., the target three-dimensional model). It should be appreciated that the smaller the target simulation scale, the greater the diversity of the generated target 3D models; the larger the target simulation scale, the more closely the generated target 3D model resembles the target object.
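A sketch of that generation step: the Codebook supplies the fraction of latent tokens given by the target simulation scale, the Transformer fills in the rest, and the decoder maps the completed latent grid to the 3D model. The codebook_sampler and transformer.predict interfaces are assumptions; the patent fixes only the split between simulated and predicted tokens.
```python
# Sketch: generate a 3D model with an 85% / 15% Codebook-vs-Transformer split.
import torch

@torch.no_grad()
def generate(codebook_sampler, transformer, decoder,
             num_tokens=512, sim_scale=0.85):
    n_sim = int(num_tokens * sim_scale)
    tokens = torch.full((num_tokens,), -1, dtype=torch.long)
    # 1) Simulate 85% of the latent tokens from the learned Codebook.
    sim_positions = torch.randperm(num_tokens)[:n_sim]
    tokens[sim_positions] = codebook_sampler(n_sim)
    # 2) Let the Transformer predict the remaining 15% (mask filling).
    missing = (tokens == -1).nonzero(as_tuple=True)[0]
    tokens[missing] = transformer.predict(tokens, missing)
    # 3) Decode the completed token grid into the target 3D model.
    return decoder(tokens.view(1, 8, 8, 8))
```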
Referring to fig. 6, fig. 6 illustrates a block diagram of an apparatus for three-dimensional reconstruction according to some embodiments of the present application. It should be understood that the three-dimensional reconstruction apparatus corresponds to the above method embodiments, and is capable of performing the steps involved in the above method embodiments, and specific functions of the three-dimensional reconstruction apparatus may be referred to the above description, and detailed descriptions thereof are omitted herein as appropriate to avoid redundancy.
The apparatus of three-dimensional reconstruction of fig. 6 includes at least one software functional module that can be stored in a memory in the form of software or firmware, or solidified in the apparatus. The apparatus of three-dimensional reconstruction comprises: a parameter determining module 610, configured to determine reconstruction parameters for three-dimensional reconstruction of a target object, where the target object includes: a target picture and/or a target sentence, and the reconstruction parameters include: a three-dimensional format and a target simulation scale; and a model reconstruction module 620, configured to input the target object into a target three-dimensional reconstruction model that matches the reconstruction parameters, and output a target three-dimensional model of the target object, where the target three-dimensional reconstruction model includes: a first generation model and/or a second generation model.
In some embodiments of the present application, the apparatus for three-dimensional reconstruction further includes a training module (not shown in the figure) configured, before the model reconstruction module 620 operates, to train an initial reconstruction model using a first training data set based on a pre-trained target generation model to obtain the first generation model; and/or to train an initial language model using a second training data set based on the target generation model to obtain the second generation model; the target generation model is obtained by pre-training an initial generation model through the first training data set, and the target generation model comprises: a decoder, a simulation scale and a network estimator.
In some embodiments of the application, the first training data set comprises: a plurality of three-dimensional model samples, and a plurality of pictures corresponding to each of the plurality of three-dimensional model samples; the second training dataset comprises: the three-dimensional model samples and the label information corresponding to each three-dimensional model sample in the three-dimensional model samples.
In some embodiments of the present application, a training module is configured to input each three-dimensional model sample corresponding to each of the plurality of pictures to the target generation model to obtain a first prediction result; respectively inputting a plurality of pictures corresponding to each three-dimensional model sample into the initial reconstruction model, and generating a model based on the target to generate a second prediction result; and optimizing the initial reconstruction model by using the first prediction result and the second prediction result to obtain the first generation model.
In some embodiments of the present application, the training module is further configured to input the label information corresponding to each three-dimensional model sample into the initial language model, obtain a third prediction result based on the target generation model, and optimize the initial language model by using the first prediction result and the third prediction result to obtain the second generation model.
In some embodiments of the present application, the plurality of three-dimensional model samples are characterized by a point cloud, voxel, grid, or signed distance function (SDF).
In some embodiments of the present application, the training module is configured to perform scaling processing on each three-dimensional model sample, so that a size of a bounding box of each three-dimensional model sample meets a set threshold; dividing the bounding box of each three-dimensional model sample according to a preset size to obtain a model sample block corresponding to each three-dimensional model sample; generating the first prediction result through the model sample block.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method for the specific working procedure of the apparatus described above, and this will not be repeated here.
Some embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the operations of the method according to any of the embodiments described above.
Some embodiments of the present application also provide a computer program product, where the computer program product includes a computer program which, when executed by a processor, can implement the operations of the method according to any of the embodiments described above.
As shown in fig. 7, some embodiments of the present application provide an electronic device 700, the electronic device 700 comprising: memory 710, processor 720, and a computer program stored on memory 710 and executable on processor 720, wherein processor 720 may implement a method as in any of the embodiments described above when reading the program from memory 710 and executing the program via bus 730.
Processor 720 may process digital signals and may include various computing structures, such as a complex instruction set computer architecture, a reduced instruction set computer architecture, or an architecture that implements a combination of instruction sets. In some examples, processor 720 may be a microprocessor.
Memory 710 may be used for storing instructions to be executed by processor 720 or data related to execution of the instructions. Such instructions and/or data may include code to implement some or all of the functions of one or more of the modules described in embodiments of the present application. The processor 720 of the disclosed embodiments may be configured to execute instructions in the memory 710 to implement the methods shown above. Memory 710 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (7)

1. A method of three-dimensional reconstruction, comprising:
determining reconstruction parameters for three-dimensional reconstruction of a target object, wherein the target object comprises: target pictures and target sentences, wherein the reconstruction parameters comprise: three-dimensional format and target simulation scale;
inputting the target object into a target three-dimensional reconstruction model matched with the reconstruction parameters, and outputting the target three-dimensional model of the target object, wherein the target three-dimensional reconstruction model comprises: a first generation model and a second generation model;
the method further comprises the steps of:
acquiring a first training data set and a second training data set, wherein the first training data set comprises: a plurality of three-dimensional model samples, and a plurality of pictures corresponding to each of the plurality of three-dimensional model samples; the second training dataset comprises: the label information corresponding to each three-dimensional model sample in the plurality of three-dimensional model samples;
the first generation model is obtained by the following method: inputting each three-dimensional model sample corresponding to each picture in the plurality of pictures to a pre-trained target generation model to obtain a first prediction result; respectively inputting a plurality of pictures corresponding to each three-dimensional model sample into an initial reconstruction model, and generating a model based on the pre-trained target to generate a second prediction result; optimizing the initial reconstruction model by using the first prediction result and the second prediction result to obtain the first generation model;
the second generative model is obtained by the following method: inputting label information corresponding to each three-dimensional model sample into an initial language model, and generating a model based on the pre-trained target to obtain a third prediction result; optimizing the initial language model by using the first prediction result and the third prediction result to obtain the second generation model;
carrying out detail correction on the second generation model by utilizing the first generation model to obtain the target three-dimensional model; or, carrying out detail correction on the first generation model by using the second generation model to obtain the target three-dimensional model.
2. The method of claim 1, wherein the pre-trained target generation model is obtained by pre-training an initial generation model through the first training data set, the pre-trained target generation model comprising: a decoder, a simulation scale and a network estimator.
3. The method of claim 1, wherein the plurality of three-dimensional model samples are characterized by a point cloud, voxel, grid, or signed distance function (SDF).
4. The method of claim 1, wherein inputting each three-dimensional model sample corresponding to each of the plurality of pictures to the target generation model to obtain a first prediction result comprises:
scaling each three-dimensional model sample so that the size of a bounding box of each three-dimensional model sample meets a set threshold;
dividing the bounding box of each three-dimensional model sample according to a preset size to obtain a model sample block corresponding to each three-dimensional model sample;
generating the first prediction result through the model sample block.
5. An apparatus for three-dimensional reconstruction, characterized in that the apparatus is adapted to perform the method according to any of claims 1-4, the apparatus comprising:
a parameter determining module, configured to determine a reconstruction parameter for three-dimensionally reconstructing a target object, where the target object includes: target pictures and target sentences, wherein the reconstruction parameters comprise: three-dimensional format and target simulation scale;
the model reconstruction module is used for inputting the target object into a target three-dimensional reconstruction model matched with the reconstruction parameters and outputting a target three-dimensional model of the target object, wherein the target three-dimensional reconstruction model comprises: a first generative model and a second generative model.
6. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program, wherein the computer program when run by a processor performs the method according to any of claims 1-4.
7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and running on the processor, wherein the computer program when run by the processor performs the method of any one of claims 1-4.
CN202311330012.6A 2023-10-16 2023-10-16 Three-dimensional reconstruction method, three-dimensional reconstruction device, storage medium and electronic equipment Active CN117078867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311330012.6A CN117078867B (en) 2023-10-16 2023-10-16 Three-dimensional reconstruction method, three-dimensional reconstruction device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN117078867A CN117078867A (en) 2023-11-17
CN117078867B (en) 2023-12-12

Family

ID=88717449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311330012.6A Active CN117078867B (en) 2023-10-16 2023-10-16 Three-dimensional reconstruction method, three-dimensional reconstruction device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117078867B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101714262A (en) * 2009-12-10 2010-05-26 北京大学 Method for reconstructing three-dimensional scene of single image
CN110599592A (en) * 2019-09-12 2019-12-20 北京工商大学 Three-dimensional indoor scene reconstruction method based on text
WO2023077816A1 (en) * 2021-11-03 2023-05-11 中国华能集团清洁能源技术研究院有限公司 Boundary-optimized remote sensing image semantic segmentation method and apparatus, and device and medium
CN116310148A (en) * 2023-05-17 2023-06-23 山东捷瑞数字科技股份有限公司 Digital twin three-dimensional scene construction method, device, equipment and medium
CN116597087A (en) * 2023-05-23 2023-08-15 中国电信股份有限公司北京研究院 Three-dimensional model generation method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant