CN115393526A - Three-dimensional object reconstruction method, device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN115393526A
Authority
CN
China
Prior art keywords
shape
rgb image
model
dimensional
processed
Prior art date
Legal status
Pending
Application number
CN202211104970.7A
Other languages
Chinese (zh)
Inventor
郑秋宏
张云庚
丁鹏
沈云
刘丹
Current Assignee
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202211104970.7A
Publication of CN115393526A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application belongs to the technical field of artificial intelligence, and relates to a three-dimensional object reconstruction method, a three-dimensional object reconstruction device, a storage medium and electronic equipment. The method comprises the following steps: acquiring a single-viewpoint RGB image to be processed, inputting the single-viewpoint RGB image to be processed into a reconstruction model, encoding the single-viewpoint RGB image to be processed through the reconstruction model to acquire a shape code corresponding to the single-viewpoint RGB image to be processed, and decoding the shape code to acquire a triangular patch corresponding to the single-viewpoint RGB image to be processed, wherein the reconstruction model is obtained based on three-dimensional model training corresponding to different object categories; and constructing a three-dimensional object corresponding to the to-be-processed single-viewpoint RGB image according to the triangular patch. The method and the device can make full use of the prior knowledge of three-dimensional object shapes, improve the robustness of shape estimation, and improve the authenticity and integrity of the reconstructed three-dimensional object.

Description

Three-dimensional object reconstruction method, device, storage medium and electronic equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a three-dimensional object reconstruction method, a three-dimensional object reconstruction apparatus, a computer storage medium, and an electronic device.
Background
With the development of computer vision and graphics technologies, technologies for reconstructing three-dimensional shapes from natural images are applied to the fields of robotics, augmented reality, electronic commerce, visualization, animation, indoor design, indoor scene reconstruction, and the like.
Traditional three-dimensional reconstruction methods usually need to capture multi-viewpoint images of an object and require known camera parameters to reconstruct the three-dimensional coordinates of pixels; in many application scenarios, these conditions are difficult to satisfy.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present application.
Disclosure of Invention
The present application is directed to a three-dimensional object reconstruction method, a three-dimensional object reconstruction apparatus, a computer storage medium, and an electronic device, which improve, at least to a certain extent, the authenticity and integrity of a three-dimensional shape reconstructed from a single-viewpoint RGB image.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to a first aspect of the present application, there is provided a three-dimensional object reconstruction method comprising:
acquiring a single-viewpoint RGB image to be processed, inputting the single-viewpoint RGB image to be processed into a reconstruction model, coding the single-viewpoint RGB image to be processed through the reconstruction model to acquire a shape code corresponding to the single-viewpoint RGB image to be processed, and decoding the shape code to acquire a triangular patch corresponding to the single-viewpoint RGB image to be processed, wherein the reconstruction model is obtained based on three-dimensional model training corresponding to different object types;
and constructing a three-dimensional object corresponding to the single-viewpoint RGB image to be processed according to the triangular patch.
According to a second aspect of the present application, there is provided a three-dimensional object reconstruction apparatus comprising:
the image processing module is used for acquiring a single-viewpoint RGB image to be processed, inputting the single-viewpoint RGB image to be processed into a reconstruction model, encoding the single-viewpoint RGB image to be processed through the reconstruction model to acquire a shape code corresponding to the single-viewpoint RGB image to be processed, and decoding the shape code to acquire a triangular patch corresponding to the single-viewpoint RGB image to be processed, wherein the reconstruction model is obtained based on three-dimensional model training corresponding to different object types;
and the reconstruction module is used for constructing a three-dimensional object corresponding to the to-be-processed single-viewpoint RGB image according to the triangular patch.
According to a third aspect of the present application, a computer storage medium is provided, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the above-mentioned three-dimensional object reconstruction method.
According to a fourth aspect of the present application, there is provided an electronic apparatus, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the above-described three-dimensional object reconstruction method via execution of the executable instructions.
As can be seen from the foregoing technical solutions, the three-dimensional object reconstruction method, the three-dimensional object reconstruction apparatus, the computer storage medium, and the electronic device in the exemplary embodiments of the present application have at least the following advantages and positive effects:
the three-dimensional object reconstruction method comprises the steps of coding a to-be-processed single-viewpoint RGB image through a reconstruction model to obtain a shape code corresponding to the to-be-processed single-viewpoint RGB image, decoding the shape code to obtain a triangular patch corresponding to the to-be-processed single-viewpoint RGB image, and constructing a three-dimensional object corresponding to the to-be-processed single-viewpoint RGB image according to the triangular patch, wherein the reconstruction model is obtained based on three-dimensional model training corresponding to different object categories. In the application, the reconstruction model is obtained based on the three-dimensional model training corresponding to different object types, so that the reconstruction model fully excavates and learns the prior knowledge of the shape of the three-dimensional object, the prior knowledge of the shape of the three-dimensional object can be fully utilized when the three-dimensional object reconstruction is carried out on the single-viewpoint RGB image, and the authenticity and the integrity of the reconstructed three-dimensional object are improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 schematically shows a structural diagram of a system architecture to which the three-dimensional object reconstruction method in the embodiment of the present application is applied.
Fig. 2 schematically shows a flow chart of a three-dimensional object reconstruction method in an embodiment of the present application.
Fig. 3 schematically shows a structural diagram of a reconstruction model in an embodiment of the present application.
Fig. 4 schematically shows a structural diagram of the shape coding predictor model 301 in the embodiment of the present application.
Fig. 5 schematically shows a structural diagram of a shape representation sub-model 302 in an embodiment of the present application.
Fig. 6 schematically shows a flowchart of training a reconstruction model in an embodiment of the present application.
Fig. 7 schematically shows a flowchart of training the to-be-trained shape coding predictor model in the embodiment of the present application.
Fig. 8 schematically shows a block diagram of the three-dimensional object reconstruction apparatus according to the present application.
FIG. 9 schematically illustrates a block diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The terms "a," "an," "the," and "said" are used in this specification to denote the presence of one or more elements/components/parts/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. other than the listed elements/components/etc.; the terms "first" and "second", etc. are used merely as labels, and are not limiting on the number of their objects.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flowcharts shown in the figures are illustrative only and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
In the related art, the representation of three-dimensional shape is a key technical module in object reconstruction, and three-dimensional voxel representation and triangular patch mesh representation are the most common representation methods. Three-dimensional voxels can represent shapes with large topological differences, but are limited by the resolution of the voxel representation: their capacity for representing shape details is insufficient, and reconstructed shapes exhibit obvious step-like artifacts. Three-dimensional reconstruction methods based on triangular patch representations, in turn, have limitations in dealing with shapes that have topological differences.
With the development of artificial intelligence, various object reconstruction methods based on deep learning have appeared. For example, methods such as the 3D Recurrent Reconstruction Neural Network (3D-R2N2) and Pix2Vox extract shape features of an image with a two-dimensional convolutional encoder, then use a three-dimensional convolutional decoder to receive the feature embedding and reconstruct a voxel representation of the three-dimensional shape. Methods such as Pixel2Mesh and Pixel2Mesh++ extract image features with a two-dimensional convolutional network, feed the features to a graph convolutional network that deforms a given triangular patch template shape, and gradually fit the three-dimensional shape of the target. However, these methods directly predict the voxel representation or triangular patch mesh representation of the shape, lack prior mining and utilization of object structure, and may generate incomplete, unrealistic three-dimensional shapes.
As another example, Mem3D uses a memory network to retrieve the three-dimensional shapes most relevant to an RGB image and uses them as a structural prior to estimate the voxel representation of the three-dimensional shape. Mesh R-CNN converts a predicted voxel representation into a triangular mesh representation and refines it in two stages, but may still generate irregular, unrealistic shapes.
Aiming at the problems in the related art, the application provides a three-dimensional object reconstruction method.
Before explaining the technical solutions in the embodiments of the present application in detail, the terms that may be involved in the embodiments are first explained.
(1) Single-viewpoint image: generally refers to a 2D image captured from a single viewpoint.
(2) SDF: Signed Distance Field, an implicit way of describing a shape by the signed distance of each spatial point to the nearest surface.
(3) Multi-head pooling attention: Multi-Head Pooling Attention, abbreviated MHPA, which enables multi-scale Transformers to operate at gradually varying spatio-temporal resolution.
(4) Marching Cubes algorithm: the MC algorithm, also called the "iso-surface extraction" algorithm because it treats a series of two-dimensional slice data as a three-dimensional data field, extracts the material at a certain threshold from the field, and connects it into triangular patches with a certain topological form.
Having introduced the terms that may be involved in the embodiments of the present application, the three-dimensional object reconstruction method in the present application will now be described in detail.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the solution of the present application applies.
As shown in fig. 1, system architecture 100 may include terminal device 101, server 102, and network 103. The terminal device 101 may include various electronic devices with a display screen, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart television, a smart car terminal, and a head-mounted device. The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing a cloud computing service. The network 103 may be a communication medium of various connection types capable of providing a communication link between the terminal device 101 and the server 102, and may be, for example, a wired communication link or a wireless communication link.
The system architecture in the embodiments of the present application may have any number of terminal devices, networks, and servers, according to implementation needs. For example, the server may be a server group consisting of a plurality of server devices.
The technical solution provided in the embodiment of the present application may be applied to the terminal device 101 or the server 102, and when the server 102 executes the three-dimensional object reconstruction method in the present application, the server may be a cloud server providing cloud computing services.
Cloud computing (cloud computing) is a computing model that distributes computing tasks over a large pool of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user as if they are infinitely expandable and can be acquired at any time, used on demand, expanded at any time, and paid for use.
As a basic capability provider of cloud computing, a cloud computing resource pool (referred to as an IaaS (Infrastructure as a Service) platform for short) is established, and multiple types of virtual resources are deployed in the resource pool for external clients to use selectively.
According to logical function division, a Platform as a Service (PaaS) layer can be deployed on the Infrastructure as a Service (IaaS) layer, and a Software as a Service (SaaS) layer can be deployed on the PaaS layer; the SaaS layer can also be deployed directly on the IaaS layer. PaaS is a platform on which software runs, such as a database or a web container. SaaS is various kinds of business software, such as a web portal or an SMS mass-texting service. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.
In the embodiment of the application, the three-dimensional object reconstruction is carried out on the single-viewpoint RGB image through the reconstruction model, and the reconstruction model belongs to a machine learning model and relates to artificial intelligence.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Its infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers, instead of human eyes, to identify and measure targets and perform further graphic processing, so that the processed result becomes an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image information annotation, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The following detailed description will be made of technical solutions of a three-dimensional object reconstruction method, a three-dimensional object reconstruction device, a computer-readable medium, and an electronic device provided by the present application, with reference to specific embodiments.
Fig. 2 shows a flow chart of a three-dimensional object reconstruction method, which, as shown in fig. 2, comprises:
step S210: acquiring a single-viewpoint RGB image to be processed, inputting the single-viewpoint RGB image to be processed into a reconstruction model, coding the single-viewpoint RGB image to be processed through the reconstruction model to acquire a shape code corresponding to the single-viewpoint RGB image to be processed, and decoding the shape code to acquire a triangular patch corresponding to the single-viewpoint RGB image to be processed, wherein the reconstruction model is obtained based on three-dimensional model training corresponding to different object types;
step S220: and constructing a three-dimensional object corresponding to the to-be-processed single-viewpoint RGB image according to the triangular patch.
The three-dimensional object reconstruction method encodes the to-be-processed single-viewpoint RGB image through a reconstruction model to obtain the corresponding shape code, decodes the shape code to obtain the corresponding triangular patch, and then constructs the three-dimensional object corresponding to the image according to the triangular patch, wherein the reconstruction model is obtained based on training with three-dimensional models corresponding to different object types. Because the reconstruction model is trained on three-dimensional models of different object categories, it fully mines and learns the prior knowledge of three-dimensional object shapes; this prior knowledge can therefore be fully utilized when reconstructing a three-dimensional object from a single-viewpoint RGB image, improving the robustness of shape estimation and the authenticity and integrity of the reconstructed three-dimensional object.
The respective steps of the three-dimensional object reconstruction method shown in fig. 2 will be described in detail below.
In step S210, a to-be-processed single-viewpoint RGB image is obtained, the to-be-processed single-viewpoint RGB image is input to a reconstruction model, the to-be-processed single-viewpoint RGB image is encoded by the reconstruction model to obtain a shape code corresponding to the to-be-processed single-viewpoint RGB image, and the shape code is decoded to obtain a triangular patch corresponding to the to-be-processed single-viewpoint RGB image, where the reconstruction model is obtained based on three-dimensional model training corresponding to different object classes.
In an exemplary embodiment of the present application, the object to be subjected to three-dimensional object reconstruction is a to-be-processed single-viewpoint RGB image, which may be a two-dimensional RGB image; in particular, it may be a two-dimensional RGB image of any of various object categories, such as an airplane, a train, or a sofa. When performing three-dimensional object reconstruction on the to-be-processed single-viewpoint RGB image, the embodiment of the present application uses a reconstruction model, which is a model with encoding and decoding functions: it can encode the to-be-processed single-viewpoint RGB image to obtain the corresponding shape code, and can further decode the shape code to obtain the corresponding triangular patch.
Fig. 3 schematically shows a structural diagram of a reconstruction model. As shown in fig. 3, the reconstruction model 300 includes a shape coding predictor model 301, a shape representation sub-model 302, and a three-dimensional iso-surface determination sub-model 303. The shape coding predictor model 301 is configured to encode the input to-be-processed single-viewpoint RGB image to obtain the corresponding shape code; the shape representation sub-model 302 is configured to decode the shape code to obtain the SDF set corresponding to the to-be-processed single-viewpoint RGB image, where the SDF set is the SDF representation used for predicting the object shape; the three-dimensional iso-surface determination sub-model 303 extracts an iso-surface from the SDF set according to the contained SDF values, and obtains, from the iso-surface, the triangular patch used for reconstructing the three-dimensional object corresponding to the to-be-processed single-viewpoint RGB image.
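For illustration only, this encode-decode pipeline can be sketched as follows; the class and attribute names, and the use of PyTorch, are assumptions rather than part of the patent disclosure.

```python
import torch.nn as nn

class ReconstructionModel(nn.Module):
    """Minimal sketch of reconstruction model 300: sub-model 301 encodes the
    image into a shape code, sub-model 302 decodes the code into an SDF set."""

    def __init__(self, shape_code_predictor, shape_representation, grid_coords):
        super().__init__()
        self.shape_code_predictor = shape_code_predictor  # sub-model 301
        self.shape_representation = shape_representation  # sub-model 302
        self.grid_coords = grid_coords  # coordinate set c sampled from a 3D checkerboard

    def forward(self, rgb_image):
        shape_code = self.shape_code_predictor(rgb_image)      # image -> S
        sdf_set = self.shape_representation(shape_code,
                                            self.grid_coords)  # S -> SDF set D
        return sdf_set  # sub-model 303 then extracts the iso-surface from D
```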
Next, the detailed configuration and image processing flow of the shape coding prediction submodel 301, the shape representation submodel 302, and the three-dimensional iso-surface determination submodel 303 will be described in detail.
Fig. 4 schematically shows a structural diagram of the shape coding predictor model 301. As shown in fig. 4, the shape coding predictor model 301 includes an image block feature embedding layer 401, a plurality of Transformer multi-scale coding layers 402, and a shape embedding prediction layer 403, which are connected in sequence; the plurality of Transformer multi-scale coding layers 402 may specifically include three layers, namely a first Transformer multi-scale coding layer 402-1, a second Transformer multi-scale coding layer 402-2, and a third Transformer multi-scale coding layer 402-3.
After the to-be-processed single-viewpoint RGB image is input into the shape coding predictor model 301, the shape coding predictor model 301 first divides the image into a plurality of image blocks, and the image blocks are then input into the image block feature embedding layer 401, which performs feature extraction on each image block to obtain an image block feature sequence G. Specifically, when performing image blocking, a single-viewpoint RGB image with an input size of h × w may be divided into blocks of size h_0 × w_0, forming an image sequence of length l = (h/h_0) × (w/w_0). When the to-be-processed single-viewpoint RGB image cannot be divided into an integer number of h_0 × w_0 blocks, it can be resampled so that its size is converted from h × w to an h' × w' divisible by h_0 × w_0, and the resampled RGB image is then partitioned to obtain an image sequence of length l = (h'/h_0) × (w'/w_0). The resampling may be upsampling or downsampling, which is not specifically limited in this embodiment of the application; h_0 and w_0 may be any values greater than zero and less than h and w respectively. In feature extraction, features may be extracted from each image block and assembled into an image block feature sequence, which may be written as G ∈ R^(l × m), where m is the dimension of each image block feature embedding.
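A possible realization of this blocking-and-embedding step is sketched below, assuming PyTorch; the resampling policy (bilinear, to the nearest divisible size) is an illustrative choice, since the patent leaves it open.

```python
import torch.nn as nn
import torch.nn.functional as F

def embed_image_blocks(image, h0, w0, embed):
    """Split a (B, 3, h, w) image into h0 x w0 blocks and embed each block,
    resampling first if h or w is not an integer multiple of the block size."""
    _, c, h, w = image.shape
    if h % h0 != 0 or w % w0 != 0:
        h = max(round(h / h0), 1) * h0   # nearest divisible size h'
        w = max(round(w / w0), 1) * w0   # nearest divisible size w'
        image = F.interpolate(image, size=(h, w), mode="bilinear",
                              align_corners=False)
    # unfold -> (B, c*h0*w0, l) with sequence length l = (h/h0) * (w/w0)
    blocks = F.unfold(image, kernel_size=(h0, w0), stride=(h0, w0))
    return embed(blocks.transpose(1, 2))  # G in R^(l x m)

# The fully connected embedding layer 401 could then simply be
# embed = nn.Linear(3 * h0 * w0, m).
```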
In an exemplary embodiment of the present application, the image partition feature embedding layer 401 is composed of a fully connected layer capable of performing feature extraction on each image partition and generating an image partition feature sequence from all the extracted features.
After the image block feature sequence is generated, the image block feature embedding layer 401 outputs the sequence G to the plurality of Transformer multi-scale coding layers, and feature extraction at different dimensions is performed on the sequence by each Transformer multi-scale coding layer in turn, to obtain the high-dimensional feature embedding vector corresponding to the to-be-processed single-viewpoint RGB image. In the embodiment of the present application, because the Transformer structure performs bidirectional feature extraction, long-range associations among shape features can be constructed; compared with convolution operations, which only realize local interaction, this can improve the accuracy of three-dimensional object reconstruction.
In an exemplary embodiment of the present application, each Transformer multi-scale coding layer includes a plurality of coding units composed of a multi-head pooling attention layer and a fully connected layer, and different Transformer multi-scale coding layers contain different numbers of coding units. The Transformer multi-scale coding layer can be regarded as a multi-scale vision Transformer: its multi-head pooling attention layers pool the latent tensor sequences to reduce the length of the input sequence, and the sequence length corresponds to the spatio-temporal resolution. Under the action of the multi-head pooling attention layers, the Transformer multi-scale coding layers can therefore operate at gradually varying spatio-temporal resolution; that is, after receiving the image block feature sequence, the coding layers gradually increase the feature embedding dimension m while reducing the sequence length l, representing local low-dimensional shape feature information with higher-spatial-resolution features and high-dimensional shape feature information with lower-spatial-resolution features.
In an exemplary embodiment of the present application, the plurality of Transformer multi-scale coding layers 402 may specifically include a first Transformer multi-scale coding layer 402-1, a second Transformer multi-scale coding layer 402-2, and a third Transformer multi-scale coding layer 402-3, where each coding layer corresponds to one resolution, so that the coding layers in this embodiment can extract information at three resolution levels: the first layer 402-1 extracts features G1 at a higher spatial resolution, the third layer 402-3 extracts features G3 at a lower spatial resolution, and the spatial resolution of the features G2 extracted by the second layer 402-2 lies between those of layers 402-1 and 402-3.
For each coding unit consisting of a multi-head pooling attention layer and a fully connected layer, the extracted features can be determined according to equations (1)-(2):
G'=MHPA(LN(G))+G (1)
E(G)=MLP(LN(G'))+G' (2)
where G is the image feature input to the coding unit, MHPA is the multi-head pooling attention layer, MLP is the fully connected layer, LN denotes layer normalization, and E is the coding unit.
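Equations (1)-(2) translate directly into a residual block. The sketch below assumes PyTorch and an externally supplied MHPA module, and assumes for simplicity that MHPA preserves the sequence length (in a multi-scale Transformer, the pooled residual path would shrink accordingly).

```python
import torch.nn as nn

class CodingUnit(nn.Module):
    """One coding unit E per equations (1)-(2)."""

    def __init__(self, dim, mhpa, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)   # LN in equation (1)
        self.mhpa = mhpa               # multi-head pooling attention layer
        self.ln2 = nn.LayerNorm(dim)   # LN in equation (2)
        self.mlp = nn.Sequential(      # fully connected layer(s)
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, g):
        g = self.mhpa(self.ln1(g)) + g      # G' = MHPA(LN(G)) + G
        return self.mlp(self.ln2(g)) + g    # E(G) = MLP(LN(G')) + G'
```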
After the features of the input image are extracted sequentially by the coding units in each Transformer multi-scale coding layer, features at different spatial resolutions are obtained. Accordingly, after the image block feature sequence passes sequentially through the first Transformer multi-scale coding layer 402-1, the second Transformer multi-scale coding layer 402-2, and the third Transformer multi-scale coding layer 402-3, the high-dimensional feature embedding vector G3 corresponding to the to-be-processed single-viewpoint RGB image is obtained.
Next, the third Transformer multi-scale coding layer 402-3 outputs the high-dimensional feature embedding vector to the shape embedding prediction layer 403, which performs feature extraction on it to obtain the shape code S corresponding to the to-be-processed single-viewpoint RGB image. The shape code S uniquely determines the SDF corresponding to the three-dimensional shape of the object, so the corresponding SDF representation can be obtained by decoding the shape code S. In the embodiment of the present application, the shape embedding prediction layer 403 is composed of a fully connected layer. At this point, the reconstruction model has completed encoding the to-be-processed single-viewpoint RGB image, and the shape code S then needs to be decoded by the shape representation sub-model.
Fig. 5 schematically shows a structural diagram of the shape representation sub-model 302. As shown in fig. 5, the shape representation sub-model 302 includes a coordinate transformation layer 501 and an implicit template processing layer 502. The coordinate transformation layer 501 is used to establish a correspondence between the three-dimensional space coordinates of a given shape instance and the standard coordinate space where the implicit template shape is located. When a shape code S and a set c of three-dimensional space coordinates are input into the coordinate transformation layer 501 together, the coordinate transformation layer 501 transforms the three-dimensional space coordinates of the corresponding shape into the standard coordinate space according to the shape code S, obtaining the set c' of three-dimensional space coordinates on the corresponding implicit template shape. After receiving the coordinate set c' output by the coordinate transformation layer 501, the implicit template processing layer 502 outputs the SDF value of the implicit template shape at each coordinate position in c', which is the SDF value, at each of those positions, of the shape corresponding to the shape code S. The three-dimensional space coordinate set c is formed by the coordinates of a plurality of sampling points obtained by sampling a three-dimensional checkerboard; the checkerboard may be sampled at intervals of a preset number of grid cells, or by other rules, which is not specifically limited in this embodiment of the application.
In an exemplary embodiment of the present application, the coordinate transformation layer 501 includes coordinate transformation units composed of an LSTM layer 501-1 and an affine transformation layer 501-2. As shown in fig. 5, the shape code S and the three-dimensional space coordinate set c are input into the coordinate transformation layer 501 together; the first coordinate transformation unit performs feature extraction and coordinate transformation on S and c and outputs first transformation information, the second unit performs feature extraction and coordinate transformation on S and the first transformation information and outputs second transformation information, and so on, until the last unit outputs the transformed coordinate set c'. The implicit template processing layer is composed of fully connected layers and builds a deep SDF representation network of the implicit template, representing the SDF of the three-dimensional object shape in a compact form; unlike voxel representations of shape, which are limited by resolution, this deep SDF representation has better capacity for shape details.
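One way to realize a coordinate transformation unit, under the assumption (not spelled out in the patent) that the LSTM layer consumes the shape code and the affine layer predicts a 3 × 3 linear map plus a translation applied to the current coordinates, is sketched below; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class CoordTransformUnit(nn.Module):
    """LSTM layer 501-1 followed by affine transformation layer 501-2."""

    def __init__(self, code_dim, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTMCell(code_dim, hidden_dim)
        self.affine = nn.Linear(hidden_dim, 12)  # 9 linear terms + 3 translations

    def forward(self, shape_code, coords, state=None):
        # shape_code: (B, code_dim); coords: (B, N, 3)
        h, c = self.lstm(shape_code, state)
        params = self.affine(h)
        A = params[:, :9].view(-1, 3, 3)   # linear part of the affine map
        t = params[:, 9:].view(-1, 1, 3)   # translation part
        coords = torch.einsum("bij,bnj->bni", A, coords) + t
        return coords, (h, c)  # transformation info passed to the next unit
```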
The implicit template processing layer 502 outputs the SDF set D corresponding to the three-dimensional space coordinate set c', with each SDF value in the set corresponding to a coordinate point in c'. After receiving the SDF set, the three-dimensional iso-surface determination sub-model 303 determines iso-surfaces from equal SDF values in the set using the Marching Cubes method, determines the triangular patches from the iso-surfaces, and the three-dimensional object corresponding to the to-be-processed single-viewpoint RGB image can then be constructed from the obtained triangular patches, as shown by the dotted arrows in fig. 5.
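The iso-surface extraction of sub-model 303 corresponds to a standard Marching Cubes call; the sketch below uses scikit-image's implementation on a dense SDF grid, with the grid resolution as an assumed parameter.

```python
import numpy as np
from skimage import measure

def extract_triangle_patches(sdf_set, grid_res, voxel_size=1.0):
    """Run Marching Cubes on the SDF set D (zero level set = object surface)
    and return the vertices and triangular faces of the reconstructed mesh."""
    volume = np.asarray(sdf_set).reshape(grid_res, grid_res, grid_res)
    verts, faces, normals, _ = measure.marching_cubes(
        volume, level=0.0, spacing=(voxel_size,) * 3)
    return verts, faces
```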
In step S220, a three-dimensional object corresponding to the to-be-processed single viewpoint RGB image is constructed according to the triangle patch.
In an exemplary embodiment of the present application, after the triangular patch representation is obtained, surface rendering may be performed on it, by means of computer graphics techniques and an illumination model, to fit the three-dimensional object.
In the exemplary embodiment of the present application, before performing three-dimensional reconstruction on a to-be-processed single-viewpoint RGB image by using a reconstruction model, the to-be-trained reconstruction model needs to be trained to obtain a reconstruction model with stable performance.
In the embodiment of the present application, the iso-surface is determined by the Marching Cubes method and the triangular patch is determined from the iso-surface; since the Marching Cubes step itself involves no learnable parameters, as long as the accuracy of the obtained SDF value set is ensured, the accuracy of the three-dimensional model obtained after three-dimensional reconstruction is ensured.
Next, the training methods of the to-be-trained shape coding predictor model and the to-be-trained shape representation sub-model are described in detail.
In an exemplary embodiment of the present application, a first data set may be acquired, and the to-be-trained reconstruction model may be trained on the first data set to obtain the reconstruction model used for reconstructing three-dimensional objects. The first data set is constructed from a data set of three-dimensional CAD models covering a plurality of object categories; for example, it may be constructed from the ShapeNet data set, though other data sets may also be used, which is not specifically limited in this embodiment of the present invention. After the three-dimensional CAD models of the plurality of object categories are obtained, the point coordinates contained in each model are first normalized so that the model lies within a unit sphere; the normalized models are then sampled to obtain the SDF value corresponding to each sampling point, an SDF set is formed from the sampling points and their SDF values, and finally the first data set is formed from the three-dimensional CAD models and the SDF sets. When sampling a three-dimensional CAD model, sampling points may be taken inside, outside, and on the surface of the model, and the perpendicular distance from each sampling point to the model surface is recorded as its SDF value. Since interior and exterior sampling points point toward the surface from different directions, they are distinguished by sign: the SDF value of an interior sampling point is recorded as negative, and that of an exterior sampling point as positive. To improve the authenticity and integrity of the three-dimensional reconstruction, the number of sampling points near the model surface is increased during sampling, so that the shape representation sub-model can learn more features of the template shapes.
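A sketch of this sampling procedure, assuming the trimesh library (whose signed_distance is positive inside a mesh, so the sign is flipped to match the convention above); the sample counts and noise scale are illustrative assumptions.

```python
import numpy as np
import trimesh

def build_sdf_samples(mesh, n_near=200_000, n_uniform=50_000, noise=0.01):
    """Normalize a CAD model into the unit sphere and sample SDF pairs,
    concentrating samples near the surface."""
    mesh.apply_translation(-mesh.centroid)
    mesh.apply_scale(1.0 / np.linalg.norm(mesh.vertices, axis=1).max())

    surface_pts, _ = trimesh.sample.sample_surface(mesh, n_near)
    near = surface_pts + np.random.normal(scale=noise, size=surface_pts.shape)
    uniform = np.random.uniform(-1.0, 1.0, size=(n_uniform, 3))
    points = np.vstack([near, uniform])

    # trimesh: positive inside; the patent records interior points as negative.
    sdf = -trimesh.proximity.signed_distance(mesh, points)
    return points.astype(np.float32), sdf.astype(np.float32)
```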
During training, the to-be-trained shape coding predictor model and the to-be-trained shape representation sub-model are trained separately: the to-be-trained shape representation sub-model is trained first; after its training is completed, its parameters are fixed, and the to-be-trained shape coding predictor model is then trained.
Fig. 6 schematically illustrates a flow chart of training a reconstruction model, as shown in fig. 6, in step S601, the shape representation submodel to be trained is trained according to the first data set to obtain the shape representation submodel; in step S602, a second data set is obtained, and a third data set composed of a single-viewpoint RGB image and a shape code corresponding to the single-viewpoint RGB image is constructed according to the shape representation sub-model and the second data set, where the second data set includes the single-viewpoint RGB image and a three-dimensional model corresponding to the single-viewpoint RGB image; in step S603, the shape coding predictor model to be trained is trained according to the third data set to obtain the shape coding predictor model.
In step S601, the to-be-trained shape representation sub-model includes a to-be-trained coordinate transformation layer and a to-be-trained implicit template processing layer. The coordinate transformation layer transforms the three-dimensional space coordinates corresponding to a given three-dimensional CAD model in the first data set into the standard coordinate system corresponding to the implicit template; the implicit template processing layer then determines a predicted SDF set from the standard space coordinates. Finally, a first loss function is constructed from the predicted SDF set and the SDF set in the first data set, and the model parameters are adjusted according to the first loss function to obtain the shape representation sub-model. The to-be-trained coordinate transformation layer has two inputs: an initialized shape code corresponding to the three-dimensional CAD model, and the three-dimensional space coordinate set generated by sampling the three-dimensional checkerboard.
The first loss function combines an SDF reconstruction loss and a regularization term, minimized jointly over the learnable parameters and the shape code, as shown in formulas (3) to (6):

min over θ_H, θ_T, S of L_rec + L_reg (3)

L_rec: the SDF reconstruction loss between the predicted SDF set D̂ and the ground-truth SDF set D, both clamped by ε as in formula (6) (4)

L_reg: the regularization term, computed with the Huber operator h(·) over the P coordinate transformation units T_1, …, T_P (5)

clamp(D, ε) = min(ε, max(−ε, D)) (6)

wherein θ_H denotes the learnable parameters in the to-be-trained implicit template processing layer; θ_T denotes the learnable parameters in the to-be-trained coordinate transformation layer; S is the shape code, initialized to Gaussian noise before training; L_rec is the SDF reconstruction loss; L_reg is the regularization term; D̂ is the predicted SDF set; D is the SDF set in the first data set; ε is a value in [0, 1]; T_i is the i-th coordinate transformation unit in the to-be-trained coordinate transformation layer; P is the total number of coordinate transformation units; and h(·) is the Huber operator.
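The clamped reconstruction term is straightforward to express in code; this sketch assumes PyTorch and uses F.huber_loss as one plausible realization of the Huber operator h(·), omitting the regularization term L_reg.

```python
import torch
import torch.nn.functional as F

def sdf_reconstruction_loss(pred_sdf, gt_sdf, eps=0.1):
    """L_rec between clamp(D_hat, eps) and clamp(D, eps), per formula (6)."""
    pred = torch.clamp(pred_sdf, -eps, eps)
    gt = torch.clamp(gt_sdf, -eps, eps)
    return F.huber_loss(pred, gt)
```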
When training the to-be-trained shape representation sub-model, θ_H, θ_T, and S are the parameters to be optimized. Since the shape code is one input of the to-be-trained coordinate transformation layer, S is initialized to Gaussian noise before training, and the three-dimensional space coordinates sampled from the three-dimensional checkerboard serve as the other input of the to-be-trained shape representation sub-model. After receiving the shape code S and the coordinate set formed by the three-dimensional space coordinates of the sampling points, the to-be-trained coordinate transformation layer transforms the coordinates into the standard coordinate system and outputs the transformed coordinates; the to-be-trained implicit template processing layer then processes the transformed coordinate set to obtain the corresponding predicted SDF set. The goal of model training is to drive the deviation between the predicted SDF set and the SDF set corresponding to the three-dimensional CAD model toward zero, so the parameters of the to-be-trained shape representation sub-model can be optimized by minimizing the first loss function, yielding the optimal model parameters and a shape representation sub-model with stable performance.
Because three-dimensional CAD models of different object categories are used to train the to-be-trained shape representation sub-model, the trained shape representation sub-model learns the shape priors of various objects, and can therefore make accurate three-dimensional shape predictions when reconstructing different single-viewpoint RGB images.
After the training of the to-be-trained shape representation sub-model is completed, a data set consisting of single-viewpoint RGB images and corresponding shape codes can be constructed to train the to-be-trained shape coding predictor model. Its structure is the same as that of the shape coding predictor model, comprising a to-be-trained image block feature embedding layer, a plurality of to-be-trained Transformer multi-scale coding layers, and a to-be-trained shape embedding prediction layer.
In step S602, a second data set is obtained, containing pairs of single-viewpoint RGB images and corresponding three-dimensional models. The shape code corresponding to each single-viewpoint RGB image can be determined from its three-dimensional model in the second data set and the trained shape representation sub-model, and the third data set can then be constructed from the single-viewpoint RGB images and their corresponding shape codes.
When determining the shape code corresponding to a single-viewpoint RGB image from its three-dimensional model in the second data set and the trained shape representation sub-model, the SDF ground-truth set is first determined from the three-dimensional model; the learned parameters θ_H and θ_T in formula (3) are then fixed, and the first loss function is optimized over the shape code alone to obtain the shape code S corresponding to the single-viewpoint RGB image.
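This shape-code recovery step can be sketched as a small optimization loop in which the trained sub-model is frozen and only S is updated, reusing the sdf_reconstruction_loss sketch above; the optimizer, step count, and code dimension are assumptions.

```python
import torch

def infer_shape_code(shape_repr, coords, gt_sdf,
                     code_dim=256, steps=800, lr=5e-3):
    """Minimize the first loss over the shape code S only, with the learned
    parameters of the shape representation sub-model held fixed."""
    for p in shape_repr.parameters():
        p.requires_grad_(False)
    s = torch.randn(1, code_dim, requires_grad=True)  # Gaussian initialization
    opt = torch.optim.Adam([s], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = sdf_reconstruction_loss(shape_repr(s, coords), gt_sdf)
        loss.backward()
        opt.step()
    return s.detach()  # S* paired with the RGB image in the third data set
```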
Further, the shape coding predictor model to be trained can be trained according to the third data set. Fig. 7 schematically illustrates a flow chart of training a shape-coded predictor model to be trained, and as shown in fig. 7, in step S701, the single-viewpoint RGB images in the third data set are input to the shape-coded predictor model to be trained, and feature extraction is performed on the single-viewpoint RGB images by the shape-coded predictor model to be trained, so as to obtain predicted shape codes corresponding to the single-viewpoint RGB images; in step S702, a second loss function is constructed according to the predicted shape code and the shape code corresponding to the single-view RGB image, and a model parameter is adjusted according to the second loss function to obtain the shape-coded predictor model.
In step S701, the process of extracting the features of the single-viewpoint RGB image by the to-be-trained shape-coded predictor model is the same as the process of extracting the features of the single-viewpoint RGB image by the shape-coded predictor model in the above embodiment, and details are not repeated here. The calculation formula of the second loss function in step S702 is shown in formula (7):
L_Enc = ||Enc(I) − S*||_2 (7)
where I is the input single-viewpoint RGB image, Enc(I) is the predicted shape code, and S* is the shape code corresponding to the single-viewpoint RGB image.
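Equation (7) is an ordinary L2 regression loss on shape codes; a minimal sketch, PyTorch assumed:

```python
import torch

def encoder_loss(predicted_code, target_code):
    """L_Enc = ||Enc(I) - S*||_2, averaged over the batch."""
    return torch.norm(predicted_code - target_code, dim=-1).mean()
```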
Through the training process described in the above embodiments, the training of the to-be-trained reconstruction model can be completed, yielding a reconstruction model with stable performance for reconstructing three-dimensional objects from newly acquired single-viewpoint RGB images. In addition, in the embodiment of the present application, the SDF values corresponding to the three-dimensional object in the single-viewpoint RGB image are predicted, and the triangular patches used to construct the three-dimensional object are then determined from those SDF values.
The three-dimensional object reconstruction method in the present application can be applied to industries or fields related to computer vision, such as the construction industry, the medical industry, augmented reality, animation, indoor design, and the like. In order to make the three-dimensional object reconstruction method in the present application clearer, the three-dimensional object reconstruction method in the present application is specifically described below by taking augmented reality as an example.
Augmented reality is a technology that skillfully fuses virtual information with the real world. It widely applies technical means such as multimedia, three-dimensional modeling, real-time tracking and registration, intelligent interaction, and sensing, applying computer-generated virtual information such as text, images, three-dimensional models, music, and video to the real world after simulation; the two kinds of information complement each other, realizing an "enhancement" of the real world.
As an example, the three-dimensional object reconstruction apparatus may be disposed in an AR headset. The headset captures a real scene and sends the generated single-viewpoint RGB image to the apparatus, which calls the reconstruction model to encode the image and generate the corresponding shape code, then decodes the shape code: the three-dimensional space coordinates of the corresponding shape are converted into coordinates in the standard coordinate space, the SDF set for those coordinates is obtained, the iso-surface is determined from the SDF set, and the triangular patches are determined from the iso-surface. Finally, the three-dimensional model is fitted from the determined triangular patches to generate, in the virtual space, the three-dimensional object corresponding to the single-viewpoint RGB image, thereby increasing the realism and immersion of the augmented reality and improving user experience.
The three-dimensional object reconstruction method in the embodiment of the application encodes the to-be-processed single-viewpoint RGB image through a reconstruction model to obtain the corresponding shape code, decodes the shape code to obtain the corresponding triangular patch information, and constructs the three-dimensional object corresponding to the image according to the triangular patch information, wherein the reconstruction model is obtained based on training with three-dimensional models corresponding to different object categories. Because the reconstruction model is trained on three-dimensional models of different object types, it fully mines and learns the prior knowledge of three-dimensional object shapes; this prior knowledge can therefore be fully utilized when reconstructing a three-dimensional object from a single-viewpoint RGB image, improving the robustness of shape estimation and the authenticity and integrity of the reconstructed three-dimensional object.
The present application further provides a three-dimensional object reconstruction apparatus, fig. 8 shows a schematic structural diagram of the three-dimensional object reconstruction apparatus, and as shown in fig. 8, the three-dimensional object reconstruction apparatus 800 may include an image processing module 801 and a reconstruction module 802. Wherein:
the image processing module 801 is configured to acquire a single-viewpoint RGB image to be processed, input the single-viewpoint RGB image to be processed to a reconstruction model, encode the single-viewpoint RGB image to be processed through the reconstruction model to acquire a shape code corresponding to the single-viewpoint RGB image to be processed, and decode the shape code to acquire a triangular patch corresponding to the single-viewpoint RGB image to be processed, where the reconstruction model is obtained based on three-dimensional model training corresponding to different object types;
a reconstructing module 802, configured to construct a three-dimensional object corresponding to the to-be-processed single viewpoint RGB image according to the triangle patch.
In one embodiment of the application, the reconstruction model comprises a shape-coded predictor model, a shape representation sub-model and a three-dimensional iso-surface determination sub-model; the image processing module 801 includes:
the encoding unit is used for encoding the to-be-processed single-viewpoint RGB image through the shape coding predictor sub-model so as to obtain the shape code;
a decoding unit, configured to decode the shape code through the shape representation sub-model to obtain a signed distance field (SDF) set corresponding to the shape code;
and the patch obtaining unit is used for determining an isosurface according to the SDF set through the three-dimensional isosurface determining sub-model and determining the triangular patch based on the isosurface.
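For illustration, the following is a minimal sketch of how these three units could be chained at inference time. The encoder/decoder interfaces, the grid resolution, and the use of skimage's marching cubes for the iso-surface step are assumptions made for the sketch, not the patent's actual implementation.

```python
import torch
from skimage import measure  # marching cubes for iso-surface extraction

def reconstruct(rgb_image, shape_encoder, shape_decoder, grid_res=64):
    """Hypothetical pipeline: RGB image -> shape code -> SDF grid -> triangular patches."""
    # 1. Encode the single-viewpoint RGB image into a shape code.
    shape_code = shape_encoder(rgb_image.unsqueeze(0))           # (1, code_dim)

    # 2. Sample a regular 3D grid and decode an SDF value at every point.
    axis = torch.linspace(-1.0, 1.0, grid_res)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    points = grid.reshape(-1, 3)                                 # (N, 3)
    sdf = shape_decoder(shape_code.expand(points.shape[0], -1), points)
    sdf = sdf.reshape(grid_res, grid_res, grid_res).detach().numpy()

    # 3. Extract the zero iso-surface as triangular patches (marching cubes).
    #    Vertices are returned in voxel coordinates; rescale to [-1, 1] if needed.
    verts, faces, _, _ = measure.marching_cubes(sdf, level=0.0)
    return verts, faces
```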
In one embodiment of the present application, the shape coding predictor sub-model includes an image block feature embedding layer, a plurality of Transformer multi-scale coding layers, and a shape embedding prediction layer, which are connected in sequence; the encoding unit is configured to:
partitioning the to-be-processed single-viewpoint RGB image through the image block feature embedding layer, and performing feature extraction on the image blocks obtained by partitioning to obtain an image block feature sequence;
extracting features of different dimensions from the image block feature sequence through each Transformer multi-scale coding layer to obtain a high-dimensional feature embedding vector corresponding to the to-be-processed single-viewpoint RGB image;
and performing feature extraction on the high-dimensional feature embedding vector through the shape embedding prediction layer to obtain the shape code.
In an embodiment of the present application, the image block feature embedding layer and the shape embedding prediction layer are each composed of a fully-connected layer, the plurality of Transformer multi-scale coding layers include a first Transformer multi-scale coding layer, a second Transformer multi-scale coding layer, and a third Transformer multi-scale coding layer, and each Transformer multi-scale coding layer includes a plurality of coding units composed of a multi-head pooling attention layer and a fully-connected layer.
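A simplified sketch of such a predictor follows. Standard multi-head attention followed by average pooling of the token sequence stands in for the multi-head pooling attention layer described above, and all dimensions, module names, and the constant channel width across the three stages are assumptions.

```python
import torch
import torch.nn as nn

class PoolingAttentionBlock(nn.Module):
    """Stand-in for one coding unit: multi-head attention, token pooling,
    then a fully-connected layer (the exact pooling attention is assumed)."""
    def __init__(self, dim, heads=4, pool=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = nn.AvgPool1d(pool)      # halves the token sequence per stage
        self.fc = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

    def forward(self, x):                   # x: (B, N, dim)
        x = x + self.attn(x, x, x)[0]
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)   # (B, N/pool, dim)
        return self.fc(x)

class ShapeCodePredictor(nn.Module):
    """Hypothetical encoder: patch embedding -> 3 multi-scale stages -> shape code."""
    def __init__(self, patch=16, dim=256, code_dim=128):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(3 * patch * patch, dim)     # fully-connected patch embedding
        self.stages = nn.ModuleList(PoolingAttentionBlock(dim) for _ in range(3))
        self.head = nn.Linear(dim, code_dim)               # shape embedding prediction layer

    def forward(self, img):                                # img: (B, 3, H, W)
        B, C, H, W = img.shape
        p = self.patch
        patches = img.unfold(2, p, p).unfold(3, p, p)      # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        tokens = self.embed(patches)                       # image block feature sequence
        for stage in self.stages:
            tokens = stage(tokens)                         # multi-scale features
        return self.head(tokens.mean(dim=1))               # shape code
```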
In one embodiment of the present application, the shape representation submodel includes a coordinate transformation layer and an implicit template processing layer; the decoding unit is configured to:
inputting the shape code and a three-dimensional space coordinate set to the coordinate transformation layer, the three-dimensional space coordinate set being generated by sampling a regular three-dimensional grid;
converting the three-dimensional space coordinates in the three-dimensional space coordinate set into target three-dimensional space coordinates in a standard coordinate space through the coordinate transformation layer, wherein the standard coordinate space is the coordinate space corresponding to the implicit template;
and determining, through the implicit template processing layer, the SDF set of the implicit template shape corresponding to the shape code according to the target three-dimensional space coordinates.
In an exemplary embodiment of the present application, the coordinate transformation layer includes a plurality of coordinate transformation units, each composed of a long short-term memory (LSTM) network layer and an affine transformation layer.
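The following sketch illustrates one plausible reading of this sub-model: each coordinate transformation unit uses an LSTM cell to predict an affine transform that moves a query point toward the standard (template) coordinate space, after which a small MLP serves as the implicit template mapping canonical coordinates to SDF values. The exact wiring, layer sizes, and names are assumptions.

```python
import torch
import torch.nn as nn

class CoordTransformUnit(nn.Module):
    """One coordinate transformation unit: an LSTM layer predicts the parameters
    of an affine transform (3x3 matrix + translation) from the shape code and
    the current point; the exact wiring is an assumption."""
    def __init__(self, code_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTMCell(code_dim + 3, hidden)
        self.affine = nn.Linear(hidden, 12)          # 9 matrix entries + 3-d translation

    def forward(self, x, code, state):
        h, c = self.lstm(torch.cat([code, x], dim=-1), state)
        params = self.affine(h)
        A = params[:, :9].reshape(-1, 3, 3)
        t = params[:, 9:]
        return torch.bmm(A, x.unsqueeze(-1)).squeeze(-1) + t, (h, c)

class ShapeRepresentation(nn.Module):
    """Coordinate transformation layer (P stacked units) + implicit template MLP."""
    def __init__(self, code_dim=128, units=4, hidden=128):
        super().__init__()
        self.hidden = hidden
        self.units = nn.ModuleList(CoordTransformUnit(code_dim, hidden) for _ in range(units))
        self.template = nn.Sequential(               # implicit template: canonical coords -> SDF
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, code, points):                 # code: (N, code_dim), points: (N, 3)
        state = (points.new_zeros(points.shape[0], self.hidden),
                 points.new_zeros(points.shape[0], self.hidden))
        x = points
        for unit in self.units:                      # map into the standard coordinate space
            x, state = unit(x, code, state)
        return self.template(x).squeeze(-1)          # SDF value per point
```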
In an exemplary embodiment of the present application, the patch acquisition unit is configured to:
and extracting the same SDF value from the SDF set, and constructing the isosurface according to the three-dimensional space coordinates corresponding to the same SDF value.
In an exemplary embodiment of the present application, the three-dimensional object reconstruction apparatus 800 further includes:
and the model training module is used for acquiring a first data set before the to-be-processed single-viewpoint RGB image is input into the reconstruction model, and training the to-be-trained reconstruction model according to the first data set to acquire the reconstruction model.
In an exemplary embodiment of the present application, when acquiring the first data set, the model training module is configured to:
acquiring a data set of three-dimensional CAD models covering a plurality of object categories;
normalizing the point coordinates contained in the three-dimensional CAD model corresponding to each object category, and sampling the normalized three-dimensional CAD models to obtain an SDF set formed by a plurality of sampling points and the SDF values corresponding to the sampling points;
and constructing the first data set according to the three-dimensional CAD model and the SDF set.
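As a hedged illustration of this data preparation step, the sketch below normalizes a CAD mesh into the unit sphere with trimesh and then samples points together with their SDF values. The mix of near-surface and uniform samples, the noise scale, and the function names are assumptions.

```python
import numpy as np
import trimesh

def build_sdf_samples(mesh_path, n_points=20000):
    """Hypothetical first-data-set sample for one CAD model."""
    mesh = trimesh.load(mesh_path, force='mesh')

    # Normalize point coordinates: center the mesh and scale it into the unit sphere.
    mesh.apply_translation(-mesh.bounding_box.centroid)
    mesh.apply_scale(1.0 / np.linalg.norm(mesh.vertices, axis=1).max())

    # Sample near-surface points (perturbed surface samples) plus uniform points.
    surface, _ = trimesh.sample.sample_surface(mesh, n_points // 2)
    near = surface + np.random.normal(scale=0.02, size=surface.shape)
    uniform = np.random.uniform(-1.0, 1.0, size=(n_points // 2, 3))
    points = np.concatenate([near, uniform], axis=0)

    # trimesh returns positive distances inside the surface; negate to get the
    # common convention of negative inside / positive outside.
    sdf = -trimesh.proximity.signed_distance(mesh, points)
    return points.astype(np.float32), sdf.astype(np.float32)
```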
In an exemplary embodiment of the present application, the to-be-trained reconstruction model includes a to-be-trained shape coding prediction submodel and a to-be-trained shape representation submodel; the model training module comprises:
the first training unit is used for training the shape representation submodel to be trained according to the first data set so as to obtain the shape representation submodel;
a data set constructing unit, configured to acquire a second data set, and construct a third data set composed of a single-viewpoint RGB image and a shape code corresponding to the single-viewpoint RGB image according to the shape representation sub-model and the second data set, where the second data set includes the single-viewpoint RGB image and a three-dimensional model corresponding to the single-viewpoint RGB image;
and the second training unit is used for training the shape coding predictor model to be trained according to the third data set so as to obtain the shape coding predictor model.
In an exemplary embodiment of the application, the shape representation submodel to be trained comprises a coordinate transformation layer to be trained and an implicit template processing layer to be trained; the first training unit includes:
the coordinate transformation unit is used for converting the three-dimensional space coordinates corresponding to the same three-dimensional CAD model in the first data set into a standard coordinate system through the coordinate transformation layer to be trained so as to obtain standard space coordinates, wherein the standard coordinate system is the coordinate system corresponding to the implicit template;
the SDF generating unit is used for determining a predicted SDF set according to the standard space coordinates through the implicit template processing layer to be trained;
and the first training unit is used for constructing a first loss function according to the predicted SDF set and the SDF set in the first data set, and adjusting model parameters according to the first loss function to obtain the shape representation submodel.
In an exemplary embodiment of the present application, the first training unit is configured to:
the calculation formula of the first loss function is shown in formulas (1) to (4):
Figure BDA0003841335920000191
Figure BDA0003841335920000192
Figure BDA0003841335920000193
clamp(D,ε)=min(ε,max(-ε,D)) (4)
wherein, theta H For learnable parameters, θ, in the implicit template processing layer to be trained T For learnable parameters in the coordinate transformation layer to be trained, S is shape coding, S is initialized to Gaussian noise before training, lrec is SDF reconstruction loss, lreg is regularization term,
Figure BDA0003841335920000194
for predicting the SDF set, D is the SDF set in the first data set, and ε is [0,1 ]]The value of (1) is (b),
Figure BDA0003841335920000195
the ith coordinate transformation unit in the coordinate transformation layer to be trained is P, the total number of the coordinate transformation units is P, and h (-) is a Huber operator.
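A minimal PyTorch sketch of formulas (1) to (4), together with the auto-decoder style loop they imply (one shape code per CAD model, initialized to Gaussian noise as stated for S above, and optimized jointly with the network parameters), might look as follows. The values of ε and the regularization weight, the learning rates, and all names are assumptions; the per-unit offsets for formula (3) are omitted from the loop for brevity.

```python
import torch
import torch.nn.functional as F

def first_loss(sdf_pred, sdf_gt, unit_offsets=(), eps=0.1, lam=1e-2):
    """Sketch of formulas (1)-(4); eps and lam are assumed hyper-parameters."""
    # (2) reconstruction loss between clamped SDF values, with clamp as in (4)
    l_rec = F.l1_loss(sdf_pred.clamp(-eps, eps), sdf_gt.clamp(-eps, eps))
    # (3) regularization: Huber operator h applied to each coordinate
    #     transformation unit's offset T_i(x) - x, summed over the P units
    l_reg = sum(F.huber_loss(d, torch.zeros_like(d)) for d in unit_offsets)
    return l_rec + lam * l_reg  # (1) joint objective over theta_H, theta_T and S

def train_shape_representation(model, loader, num_shapes, code_dim=128, epochs=100):
    # One shape code per CAD model, optimized jointly with the sub-model parameters.
    codes = torch.randn(num_shapes, code_dim, requires_grad=True)
    optim = torch.optim.Adam([{"params": model.parameters(), "lr": 1e-4},
                              {"params": [codes], "lr": 1e-3}])
    for _ in range(epochs):
        for idx, points, sdf_gt in loader:        # samples from the first data set
            code = codes[idx].expand(points.shape[0], -1)
            sdf_pred = model(code, points)
            loss = first_loss(sdf_pred, sdf_gt)   # unit offsets omitted in this sketch
            optim.zero_grad()
            loss.backward()
            optim.step()
    return model, codes
```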
In an exemplary embodiment of the present application, the data set construction unit is configured to:
determining an SDF truth set according to a three-dimensional model which is contained in the second data set and corresponds to the single-viewpoint RGB image;
fixing the learnable parameters in the first loss function, and optimizing the shape code according to the SDF truth set so as to minimize the first loss function, thereby obtaining a shape code corresponding to the single-viewpoint RGB image;
and constructing the third data set according to the single-view RGB image and the shape code corresponding to the single-view RGB image.
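A sketch of this shape-code fitting step, reusing the assumed interfaces above: the trained shape representation sub-model is frozen and only the code is optimized against the SDF truth set of the image's 3D model. Step counts and learning rate are assumptions.

```python
import torch
import torch.nn.functional as F

def fit_shape_code(model, points, sdf_truth, code_dim=128, steps=500, eps=0.1):
    """Optimize only the shape code for the 3D model paired with one RGB image."""
    for p in model.parameters():
        p.requires_grad_(False)                    # fix the learnable parameters
    code = torch.zeros(1, code_dim, requires_grad=True)
    optim = torch.optim.Adam([code], lr=1e-3)
    for _ in range(steps):
        sdf_pred = model(code.expand(points.shape[0], -1), points)
        loss = F.l1_loss(sdf_pred.clamp(-eps, eps), sdf_truth.clamp(-eps, eps))
        optim.zero_grad()
        loss.backward()
        optim.step()
    return code.detach()   # stored with the RGB image to form the third data set
```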
In an exemplary embodiment of the present application, the second training unit includes:
the shape coding prediction unit is used for inputting the single-viewpoint RGB images in the third data set into the shape coding predictor sub-model to be trained, and performing feature extraction on the single-viewpoint RGB images through the shape coding predictor sub-model to be trained so as to obtain predicted shape codes corresponding to the single-viewpoint RGB images;
and the second training unit is used for constructing a second loss function according to the predicted shape code and the shape code corresponding to the single-view RGB image, and adjusting model parameters according to the second loss function to obtain the shape coding predictor model.
In an exemplary embodiment of the present application, the second training unit is configured to compute the second loss function according to formula (5):

$$L_{Enc} = \left\lVert Enc(I) - S^{*} \right\rVert_2 \tag{5}$$

wherein $I$ is the input single-viewpoint RGB image, $Enc(I)$ is the predicted shape code, and $S^{*}$ is the shape code corresponding to the single-viewpoint RGB image.
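A one-step sketch of training against formula (5); the encoder interface and the optimizer choice are assumptions.

```python
import torch

def encoder_step(encoder, optimizer, image, target_code):
    """One training step implementing formula (5): minimize ||Enc(I) - S*||_2."""
    pred_code = encoder(image)                                # Enc(I)
    loss = torch.linalg.vector_norm(pred_code - target_code)  # L2 distance to S*
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```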
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods herein are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
Fig. 9 schematically shows a block diagram of a computer system for implementing an electronic device according to an embodiment of the present application, where the electronic device may be provided in a terminal device or a server.
It should be noted that the computer system 900 of the electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the application scope of the embodiments of the present application.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit 901 (CPU) that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory 902 (ROM) or a program loaded from a storage section 908 into a Random Access Memory 903 (RAM). The RAM 903 also stores various programs and data necessary for system operation. The CPU 901, the ROM 902 and the RAM 903 are connected to one another via a bus 904. An Input/Output interface 905 (I/O interface) is also connected to the bus 904.
In some embodiments, the following components are connected to the input/output interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output portion 907 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a local area network card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the input/output interface 905 as necessary. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as necessary, so that a computer program read therefrom is installed into the storage section 908 as necessary.
In particular, the processes described in the various method flowcharts may be implemented as computer software programs, according to embodiments of the present application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The computer program, when executed by the central processor 901, performs various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium, by contrast, may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be understood that the present application is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (18)

1. A method of reconstructing a three-dimensional object, comprising:
acquiring a single-viewpoint RGB image to be processed, inputting the single-viewpoint RGB image to be processed into a reconstruction model, encoding the single-viewpoint RGB image to be processed through the reconstruction model to acquire a shape code corresponding to the single-viewpoint RGB image to be processed, and decoding the shape code to acquire a triangular patch corresponding to the single-viewpoint RGB image to be processed, wherein the reconstruction model is obtained based on three-dimensional model training corresponding to different object categories;
and constructing a three-dimensional object corresponding to the single-viewpoint RGB image to be processed according to the triangular patch.
2. The method of claim 1, wherein the reconstruction model comprises a shape-coded predictor sub-model, a shape representation sub-model and a three-dimensional iso-surface determination sub-model;
the encoding the to-be-processed single-viewpoint RGB image through the reconstruction model to obtain a shape code corresponding to the to-be-processed single-viewpoint RGB image, and decoding the shape code to obtain a triangular patch corresponding to the to-be-processed single-viewpoint RGB image, includes:
coding the single-viewpoint RGB image to be processed through the shape coding predictor model to obtain the shape coding;
decoding the shape code through the shape representation submodel to obtain a signed distance field (SDF) set corresponding to the shape code;
and determining an isosurface according to the SDF set through the three-dimensional isosurface determining sub-model, and determining the triangular patch based on the isosurface.
3. The method of claim 2, wherein the shape coding predictor model comprises an image block feature embedding layer, a plurality of Transformer multi-scale coding layers and a shape embedding prediction layer connected in sequence;
the encoding the to-be-processed single view RGB image by the shape coding predictor model to obtain the shape coding includes:
partitioning the to-be-processed single-viewpoint RGB image through the image block feature embedding layer, and performing feature extraction on the image blocks obtained by partitioning to obtain an image block feature sequence;
extracting features of different dimensions from the image block feature sequence through each Transformer multi-scale coding layer to obtain a high-dimensional feature embedding vector corresponding to the to-be-processed single-viewpoint RGB image;
and performing feature extraction on the high-dimensional feature embedding vector through the shape embedding prediction layer to obtain the shape code.
4. The method of claim 3, wherein the image block feature embedding layer and the shape embedding prediction layer are each composed of fully-connected layers, wherein the plurality of Transformer multi-scale coding layers comprises a first Transformer multi-scale coding layer, a second Transformer multi-scale coding layer, and a third Transformer multi-scale coding layer, and wherein each Transformer multi-scale coding layer comprises a plurality of coding units composed of a multi-head pooling attention layer and a fully-connected layer.
5. The method of claim 2, wherein the shape representation submodel comprises a coordinate transformation layer and an implicit template processing layer;
the decoding the shape code through the shape representation submodel to obtain the signed distance field (SDF) set corresponding to the shape code comprises:
inputting the shape code and a three-dimensional space coordinate set to the coordinate transformation layer, the three-dimensional space coordinate set being generated by sampling a regular three-dimensional grid;
converting three-dimensional space coordinates in the three-dimensional space coordinate set into target three-dimensional space coordinates in a standard coordinate space through the coordinate transformation layer, wherein the standard coordinate space is a coordinate space corresponding to the implicit template;
and determining, through the implicit template processing layer, the SDF set of the implicit template shape corresponding to the shape code according to the target three-dimensional space coordinates.
6. The method as claimed in claim 5, wherein the coordinate transformation layer comprises a plurality of coordinate transformation units, each consisting of a long short-term memory (LSTM) network layer and an affine transformation layer.
7. The method of claim 2, wherein determining the iso-surface from the set of SDFs by the three-dimensional iso-surface determination submodel comprises:
and extracting the same SDF value from the SDF set, and constructing the isosurface according to the three-dimensional space coordinates corresponding to the same SDF value.
8. The method of claim 1, wherein prior to inputting the single viewpoint RGB image to be processed to a reconstruction model, the method further comprises:
and acquiring a first data set, and training a to-be-trained reconstruction model according to the first data set to acquire the reconstruction model.
9. The method of claim 8, wherein the obtaining a first data set comprises:
acquiring a data set of three-dimensional CAD models covering a plurality of object categories;
normalizing the point coordinates contained in the three-dimensional CAD model corresponding to each object category, and sampling the normalized three-dimensional CAD models to obtain an SDF set formed by a plurality of sampling points and the SDF values corresponding to the sampling points;
and constructing the first data set according to the three-dimensional CAD model and the SDF set.
10. The method according to claim 8 or 9, wherein the to-be-trained reconstruction model comprises a to-be-trained shape-coded predictor sub-model and a to-be-trained shape representation sub-model;
the training a reconstruction model to be trained according to the first data set to obtain the reconstruction model includes:
training the shape representation submodel to be trained according to the first data set to obtain the shape representation submodel;
acquiring a second data set, and constructing a third data set consisting of a single-viewpoint RGB image and a shape code corresponding to the single-viewpoint RGB image according to the shape representation sub-model and the second data set, wherein the second data set comprises the single-viewpoint RGB image and a three-dimensional model corresponding to the single-viewpoint RGB image;
and training the shape coding predictor model to be trained according to the third data set to obtain the shape coding predictor model.
11. The method of claim 10, wherein the shape representation submodel to be trained comprises a coordinate transformation layer to be trained and an implicit template processing layer to be trained;
the training the shape representation submodel to be trained according to the first data set to obtain the shape representation submodel comprises:
converting the three-dimensional space coordinates corresponding to the same three-dimensional CAD model in the first data set into a standard coordinate system through the coordinate transformation layer to be trained to obtain standard space coordinates, wherein the standard coordinate system is the coordinate system corresponding to the implicit template;
determining a predicted SDF set according to the standard space coordinates through the implicit template processing layer to be trained;
and constructing a first loss function according to the predicted SDF set and the SDF set in the first data set, and adjusting model parameters according to the first loss function to obtain the shape representation submodel.
12. The method of claim 11, wherein constructing a first loss function from the set of predicted SDF values and the set of SDF values in the first data set comprises:
the calculation formula of the first loss function is shown in formulas (1) to (4):
$$\min_{\theta_H,\,\theta_T,\,S}\; L_{rec} + L_{reg} \tag{1}$$

$$L_{rec} = \left\lVert \operatorname{clamp}\big(\hat{D},\,\varepsilon\big) - \operatorname{clamp}\big(D,\,\varepsilon\big) \right\rVert_1 \tag{2}$$

$$L_{reg} = \sum_{i=1}^{P} h\big(\lVert T_i(x) - x \rVert_2\big) \tag{3}$$

$$\operatorname{clamp}(D,\,\varepsilon) = \min\big(\varepsilon,\, \max(-\varepsilon,\, D)\big) \tag{4}$$

wherein $\theta_H$ denotes the learnable parameters in the implicit template processing layer to be trained, $\theta_T$ denotes the learnable parameters in the coordinate transformation layer to be trained, $S$ is the shape code, which is initialized to Gaussian noise before training, $L_{rec}$ is the SDF reconstruction loss, $L_{reg}$ is the regularization term, $\hat{D}$ is the predicted SDF set, $D$ is the SDF set in the first data set, $\varepsilon$ is a value in $[0,1]$, $T_i$ is the $i$-th coordinate transformation unit in the coordinate transformation layer to be trained, $P$ is the total number of coordinate transformation units, and $h(\cdot)$ is the Huber operator.
13. The method of claim 12, wherein the obtaining the second data set, and constructing a third data set consisting of a single-view RGB image and a shape code corresponding to the single-view RGB image from the shape representation submodel and the second data set comprises:
determining an SDF truth set according to a three-dimensional model corresponding to the single-viewpoint RGB image contained in the second data set;
fixing the learnable parameters in the first loss function, and optimizing the shape code according to the SDF truth set so as to minimize the first loss function, thereby obtaining a shape code corresponding to the single-viewpoint RGB image;
and constructing the third data set according to the single-view RGB image and the shape code corresponding to the single-view RGB image.
14. The method according to claim 10 or 13, wherein the training of the shape-coded predictor model to be trained according to the third data set to obtain the shape-coded predictor model comprises:
inputting the single-viewpoint RGB images in the third data set into the to-be-trained shape coding predictor model, and performing feature extraction on the single-viewpoint RGB images through the to-be-trained shape coding predictor model to obtain a predicted shape code corresponding to the single-viewpoint RGB images;
and constructing a second loss function according to the predicted shape code and the shape code corresponding to the single-view RGB image, and adjusting model parameters according to the second loss function to obtain the shape coding predictor model.
15. The method of claim 14, wherein said constructing a second loss function based on said predictive shape coding and a shape coding corresponding to said single view RGB image comprises:
the calculation formula of the second loss function is shown in formula (5):
$$L_{Enc} = \left\lVert Enc(I) - S^{*} \right\rVert_2 \tag{5}$$

wherein $I$ is the input single-viewpoint RGB image, $Enc(I)$ is the predicted shape code, and $S^{*}$ is the shape code corresponding to the single-viewpoint RGB image.
16. A three-dimensional object reconstruction apparatus, comprising:
the image processing module is used for acquiring a single-viewpoint RGB image to be processed, inputting the single-viewpoint RGB image to be processed into a reconstruction model, encoding the single-viewpoint RGB image to be processed through the reconstruction model to acquire a shape code corresponding to the single-viewpoint RGB image to be processed, and decoding the shape code to acquire a triangular patch corresponding to the single-viewpoint RGB image to be processed, wherein the reconstruction model is obtained based on three-dimensional model training corresponding to different object categories;
and the reconstruction module is used for constructing a three-dimensional object corresponding to the to-be-processed single-viewpoint RGB image according to the triangular patch.
17. A computer storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the three-dimensional object reconstruction method according to any one of claims 1 to 15.
18. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the three-dimensional object reconstruction method of any one of claims 1 to 15 via execution of the executable instructions.