CN116843832A - Single-view three-dimensional object reconstruction method, device, equipment and storage medium


Info

Publication number
CN116843832A
Authority
CN
China
Prior art keywords
reconstructed
dimensional
voxel
image
neural network
Prior art date
Legal status
Pending
Application number
CN202310800060.0A
Other languages
Chinese (zh)
Inventor
贺磊
谢宇
耿进步
曹植纲
牛玉坤
周鼎
韩晓鹏
Current Assignee
Network Communication and Security Zijinshan Laboratory
Original Assignee
Network Communication and Security Zijinshan Laboratory
Priority date
Filing date
Publication date
Application filed by Network Communication and Security Zijinshan Laboratory filed Critical Network Communication and Security Zijinshan Laboratory
Priority to CN202310800060.0A priority Critical patent/CN116843832A/en
Publication of CN116843832A publication Critical patent/CN116843832A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a single-view three-dimensional object reconstruction method, device, equipment and storage medium. The method comprises the following steps: acquiring a single-view image to be reconstructed and at least one three-dimensional shape voxel to be reconstructed corresponding to the image type of the single-view image to be reconstructed; inputting the single-view image to be reconstructed and each three-dimensional shape voxel to be reconstructed into a three-dimensional reconstruction neural network model, and determining the output result of the three-dimensional reconstruction neural network model as the three-dimensional reconstruction result. The three-dimensional reconstruction neural network model is obtained by small-sample training and comprises at least a rough feature fusion module for fusing image features and voxel features. Because the model generalizes well and closely associates and fuses the image features of the single-view image to be reconstructed with the voxel features of the three-dimensional shape voxels to be reconstructed, the accuracy of the three-dimensional reconstruction result is improved.

Description

Single-view three-dimensional object reconstruction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computer vision, and in particular, to a method, an apparatus, a device, and a storage medium for reconstructing a single-view three-dimensional object.
Background
The ability to infer a complete three-dimensional shape from a partial view is one of the fundamental capabilities of human vision. For decades, computer vision has attempted to replicate this capability, with dramatic success in the deep learning era.
However, while neural network models predict increasingly accurate and detailed shapes, there is still no consensus on why neural networks can accomplish this task. One key conjecture is that the neural network model may not learn to "predict" geometry, but merely performs well at "memorizing" training data. That is, it is suspected that in performing three-dimensional shape inference the neural network model relies solely on nearest-neighbor search over what it has seen, rather than understanding the underlying geometry. To better reconstruct the target three-dimensional shape from a single-view image, a model will typically overfit the shapes of the training-set categories; the trained model then reconstructs three-dimensional shapes of known categories well, while its performance on unknown categories drops significantly.
Meanwhile, existing trained neural network models for three-dimensional shape reconstruction often extract visual features from a two-dimensional input image through a pre-trained transformer and then obtain voxel features through an autoencoder, a regression network or the like. Such models therefore have certain limitations and struggle with high-precision, large-scale three-dimensional reconstruction tasks.
Disclosure of Invention
The invention provides a method, a device, equipment and a storage medium for reconstructing a single-view three-dimensional object, wherein a three-dimensional reconstruction neural network model with stronger generalization capability is used for reconstructing a single-view image to be reconstructed, and the influence of image characteristics and voxel characteristics on a reconstruction result is considered in the reconstruction process, so that the three-dimensional reconstruction neural network model with higher reconstruction performance is provided, and the accuracy of reconstructing the three-dimensional object based on the single-view image is improved.
In a first aspect, an embodiment of the present invention provides a method for reconstructing a single-view three-dimensional object, including:
acquiring a single-view image to be reconstructed and at least one three-dimensional shape voxel to be reconstructed corresponding to the image type of the single-view image to be reconstructed;
inputting the single-view image to be reconstructed and each three-dimensional shape voxel to be reconstructed into a three-dimensional reconstruction neural network model, and determining an output result of the three-dimensional reconstruction neural network model as a three-dimensional reconstruction result;
the three-dimensional reconstruction neural network model is obtained based on small sample training; the three-dimensional reconstruction neural network model at least comprises a rough feature fusion module used for fusing image features and voxel features.
In a second aspect, an embodiment of the present invention further provides a single-view three-dimensional object reconstruction apparatus, including:
the system comprises a to-be-reconstructed data acquisition module, a reconstruction module and a reconstruction module, wherein the to-be-reconstructed data acquisition module is used for acquiring a to-be-reconstructed single-view image and at least one to-be-reconstructed three-dimensional shape voxel corresponding to the image type of the to-be-reconstructed single-view image;
the three-dimensional reconstruction module is used for inputting the single-view image to be reconstructed and each three-dimensional shape voxel to be reconstructed into the three-dimensional reconstruction neural network model, and determining the output result of the three-dimensional reconstruction neural network model as a three-dimensional reconstruction result;
the three-dimensional reconstruction neural network model is obtained based on small sample training; the three-dimensional reconstruction neural network model at least comprises a rough feature fusion module used for fusing image features and voxel features.
In a third aspect, an embodiment of the present invention further provides a single view three-dimensional object reconstruction apparatus, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the single view three-dimensional object reconstruction method provided by the embodiments of the present invention.
In a fourth aspect, embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the method for reconstructing a single-view three-dimensional object provided by embodiments of the present invention.
The embodiment of the invention provides a single-view three-dimensional object reconstruction method, device, equipment and storage medium: a single-view image to be reconstructed and at least one three-dimensional shape voxel to be reconstructed corresponding to the image type of the single-view image to be reconstructed are acquired; the single-view image to be reconstructed and each three-dimensional shape voxel to be reconstructed are input into a three-dimensional reconstruction neural network model, and the output result of the three-dimensional reconstruction neural network model is determined as the three-dimensional reconstruction result; the three-dimensional reconstruction neural network model is obtained by small-sample training and comprises at least a rough feature fusion module for fusing image features and voxel features. With this technical scheme, because the three-dimensional reconstruction neural network model used for reconstructing the three-dimensional object is obtained by small-sample training, it can be considered a neural network model with good generalization capability, and it can better support three-dimensional reconstruction of single-view images whose types differ from those of the pre-training samples. The three-dimensional reconstruction neural network model contains a rough feature fusion module that fuses the input image features and voxel features, so the model can better associate and fuse the image features of the single-view image to be reconstructed with the voxel features of the three-dimensional shape voxels to be reconstructed, and can better attend to local features during three-dimensional object reconstruction; the three-dimensional reconstruction result obtained from the associated features is therefore more accurate.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a single-view three-dimensional object reconstruction method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a single-view three-dimensional object reconstruction method according to a second embodiment of the present invention;
Fig. 3 is a flowchart illustrating the process of inputting a single-view image to be reconstructed and three-dimensional shape voxels to be reconstructed into the rough feature fusion module according to the second embodiment of the present invention;
Fig. 4 is a flowchart illustrating the process of inputting a sub-voxel sequence and image features into a cross-modal feature fusion unit according to the second embodiment of the present invention;
Fig. 5 is a structural example diagram of a cross-modal feature fusion unit according to the second embodiment of the present invention;
Fig. 6 is a structural example diagram of a three-dimensional reconstruction neural network model according to the second embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a single-view three-dimensional object reconstruction device according to a third embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a single-view three-dimensional object reconstruction apparatus according to a fourth embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a single-view three-dimensional object reconstruction method according to an embodiment of the present invention, where the method may be performed by a single-view three-dimensional object reconstruction device, and the single-view three-dimensional object reconstruction device may be implemented by software and/or hardware, and the single-view three-dimensional object reconstruction device may be configured in a single-view three-dimensional object reconstruction apparatus. Alternatively, the single-view three-dimensional object reconstruction device may be a notebook, a desktop computer, an intelligent tablet, or the like, which is not limited in the embodiment of the present invention.
As shown in fig. 1, the method for reconstructing a single-view three-dimensional object provided by the embodiment of the invention specifically includes the following steps:
s101, acquiring a single-view image to be reconstructed and at least one three-dimensional shape voxel to be reconstructed corresponding to the image type of the single-view image to be reconstructed.
In this embodiment, the single-view image to be reconstructed is specifically understood as a two-dimensional image acquired from a single view that needs to be reconstructed in three dimensions. The image type is specifically understood as the type of the object to be reconstructed in the single-view image to be reconstructed. The three-dimensional shape voxels to be reconstructed are specifically understood as volume elements that correspond to the image type of the single-view image to be reconstructed, i.e., to the object whose reconstruction is desired. For example, assuming the single-view image to be reconstructed is an image of a cat acquired from a single view, the image type corresponding to the single-view image to be reconstructed may be considered the cat type; correspondingly, a three-dimensional shape voxel to be reconstructed corresponding to the image type may be a three-dimensional individual of another cat different from the cat contained in the single-view image to be reconstructed. For instance, the single-view image to be reconstructed may show a blue cat, while the three-dimensional shape voxels to be reconstructed are a tabby cat, a Ragdoll cat, and the like.
Specifically, when a three-dimensional object needs to be reconstructed, the single-view image to be reconstructed is first acquired; the type of the object to be reconstructed in that image is then determined and taken as the image type of the single-view image to be reconstructed; and at least one three-dimensional individual of the same image type is acquired from an existing database or online from a network and taken as a three-dimensional shape voxel to be reconstructed.
S102, inputting the single-view image to be reconstructed and each three-dimensional shape voxel to be reconstructed into a three-dimensional reconstruction neural network model, and determining an output result of the three-dimensional reconstruction neural network model as a three-dimensional reconstruction result.
The three-dimensional reconstruction neural network model is obtained based on small sample training; the three-dimensional reconstruction neural network model at least comprises a rough feature fusion module used for fusing image features and voxel features.
In the present embodiment, the three-dimensional reconstruction neural network model can be understood as specifically a neural network model that generates a three-dimensional object corresponding to a two-dimensional image based on the two-dimensional image and the three-dimensional shape voxel corresponding to the two-dimensional image type input thereto.
In this embodiment, small-sample training is specifically understood as a training method in which, after the three-dimensional reconstruction neural network model has been trained on a basic class set, it is tested on a novel class set whose classes are mutually exclusive with all classes in the basic class set. Because the trained model sees very limited data from the novel classes, passing the test indicates that the three-dimensional reconstruction capability learned from the basic class set transfers to the novel classes, which improves the generalization of the trained three-dimensional reconstruction neural network model. During training, the model can be trained on each category in the basic class set separately, and after training is completed it can be tested on the novel class set. The rough feature fusion module is specifically understood as a set of neural network layers in the three-dimensional reconstruction neural network model that cross-fuses the image features extracted from the single-view image to be reconstructed with the voxel features obtained by processing the three-dimensional shape voxels to be reconstructed. By way of example, the basic class set may be assumed to include RGB images of a vehicle type, a cat type, a headset type and a book type together with the corresponding three-dimensional shape voxels, while the novel class set may include RGB images of a dog type together with the corresponding three-dimensional shape voxels; the classes of the basic class set and the novel class set are mutually exclusive. A minimal sketch of assembling one training episode under these assumptions is given below.
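The following PyTorch sketch shows, for illustration only, how one such small-sample training episode could be assembled for a category: a target image, its ground-truth voxels, and K same-type support voxels. All names, shapes, and the random stand-in data are assumptions, not details fixed by this embodiment.

```python
import torch

# Assemble one small-sample training episode for a category: a target image
# I_tgt, its ground-truth voxels V_tgt, and K same-type support voxels.
K = 4
dataset = {  # category -> (single-view images, 32^3 voxel grids); random stand-ins
    "cat": (torch.rand(10, 3, 224, 224), torch.rand(10, 32, 32, 32).round()),
    "car": (torch.rand(10, 3, 224, 224), torch.rand(10, 32, 32, 32).round()),
}

def sample_episode(category: str):
    images, voxels = dataset[category]
    idx = torch.randperm(len(images))
    target_image = images[idx[0]]            # I_tgt: the view to reconstruct
    target_voxel = voxels[idx[0]]            # V_tgt: ground truth for the loss
    support_voxels = voxels[idx[1 : K + 1]]  # K three-dimensional voxel examples
    return target_image, support_voxels, target_voxel

img, support, gt = sample_episode("cat")
print(img.shape, support.shape, gt.shape)
```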
Optionally, when performing small-sample training of the three-dimensional reconstruction neural network model, the reconstruction result of the model may be supervised by a Dice loss function so as to optimize the entire neural network model end to end. Assume that the target image input during training on one category is $I_{tgt}$, that there are also $K$ three-dimensional voxel examples $\{V_i\}_{i=1}^{K}$ of the same type as the target image, and that the target three-dimensional object expected from reconstruction is $V_{tgt}$. The loss function for model training can then be expressed as:

$$\mathcal{L}_{Dice} = 1 - \frac{2\sum_{i}\hat{p}_i\, p_i}{\sum_{i}\hat{p}_i + \sum_{i} p_i}$$

where $\hat{p}$ represents the predicted value of the three-dimensional object reconstructed for the input target image, and $p$ represents the true value of the target three-dimensional object corresponding to the input target image, the true value being determined from $V_{tgt}$.
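A minimal PyTorch sketch of this Dice supervision, using the standard binary Dice form above (function name and shapes are illustrative):

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Binary Dice loss between predicted occupancies and ground-truth voxels.

    pred:   (B, 32, 32, 32) per-voxel predictions p_hat in [0, 1]
    target: (B, 32, 32, 32) binary ground truth p derived from V_tgt
    """
    pred, target = pred.flatten(1), target.flatten(1)
    inter = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

loss = dice_loss(torch.rand(2, 32, 32, 32), torch.rand(2, 32, 32, 32).round())
print(loss.item())
```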
Specifically, the single-view image to be reconstructed and each three-dimensional shape voxel to be reconstructed are input into the three-dimensional reconstruction neural network model. The rough feature fusion module in the model extracts features from the single-view image to be reconstructed, and refines, splits and extracts features from each three-dimensional shape voxel, obtaining the corresponding image features and voxel features. The image features are fused with each set of voxel features to obtain fusion features containing three-dimensional information, and the obtained three-dimensional features are optimized and reconstructed to produce the output of the three-dimensional reconstruction neural network model, which can then be determined as the three-dimensional reconstruction result.
Optionally, the image features in the coarse feature fusion module are local features of the image input to the three-dimensional reconstructed neural network model.
Specifically, when the rough feature fusion module extracts the image features of the to-be-reconstructed single-view image input into the rough feature fusion module, the rough feature fusion module can be realized by a feature extractor focusing more on the extraction of the local features of the image, so that the local features and the structural information can be reserved as much as possible when the three-dimensional reconstruction neural network model performs three-dimensional reconstruction, and the detail accuracy of the three-dimensional reconstruction is improved.
According to the technical scheme of this embodiment, a single-view image to be reconstructed and at least one three-dimensional shape voxel to be reconstructed corresponding to the image type of the single-view image to be reconstructed are acquired; the single-view image to be reconstructed and each three-dimensional shape voxel to be reconstructed are input into a three-dimensional reconstruction neural network model, and the output result of the model is determined as the three-dimensional reconstruction result; the model is obtained by small-sample training and comprises at least a rough feature fusion module for fusing image features and voxel features. Because the three-dimensional reconstruction neural network model used for reconstructing the three-dimensional object is obtained by small-sample training, it can be considered a neural network model with good generalization capability, and it can better support three-dimensional reconstruction of single-view images whose types differ from those of the pre-training samples. The model's rough feature fusion module fuses the input image features and voxel features, so the model can better associate and fuse the image features of the single-view image to be reconstructed with the voxel features of the three-dimensional shape voxels to be reconstructed, and can better attend to local features during three-dimensional object reconstruction; the three-dimensional reconstruction result obtained from the associated features is therefore more accurate.
Example two
Fig. 2 is a flowchart of a single-view three-dimensional object reconstruction method provided by a second embodiment of the present invention. The technical solution of the present invention is further optimized on the basis of the above alternative technical solutions. Cross-modal feature fusion units in the rough feature fusion module are used to fuse image features and voxel features based on a cross-attention mechanism, and the three-dimensional feature determining module in the three-dimensional reconstruction neural network model generates weights for the rough fusion features corresponding to the three-dimensional shape voxels to be reconstructed. This improves the accuracy of the three-dimensional shape feature vectors generated according to the weights, reduces the influence of the degree of association between category and three-dimensional shape on the determination of the three-dimensional shape feature vectors, and improves the accuracy of the three-dimensional reconstruction results obtained after refinement and mapping according to the three-dimensional shape feature vectors. Before the three-dimensional reconstruction neural network model is put into use, it is tested with a test sample set different from the training sample set used to train it; during testing, the model is fine-tuned using the test images and test three-dimensional shape voxels in the test sample set, which increases the samples available to the model and improves its reconstruction accuracy.
In this embodiment, the three-dimensional reconstruction neural network model further includes a three-dimensional feature determination module and a three-dimensional object reconstruction module. The three-dimensional feature determining module is specifically understood as a set of neural network layers for generating weights for coarse fusion features input into the three-dimensional feature determining module and re-fusing the coarse fusion features containing the weights to generate three-dimensional features. The three-dimensional object reconstruction module is specifically understood as a set of neural network layers for mapping to a three-dimensional voxel space to generate a three-dimensional object after performing optimization operations such as refinement on the three-dimensional shape feature vectors input therein.
As shown in fig. 2, the method for reconstructing a single-view three-dimensional object provided in the second embodiment of the present invention specifically includes the following steps:
s201, acquiring a single-view image to be reconstructed and at least one three-dimensional shape voxel to be reconstructed corresponding to the image type of the single-view image to be reconstructed.
S202, inputting the single view image to be reconstructed and each three-dimensional shape voxel to be reconstructed into a rough feature fusion module, so as to fuse the image features of the single view image to be reconstructed with the voxel features of each three-dimensional shape voxel to be reconstructed respectively, and determining rough fusion features corresponding to each three-dimensional shape voxel to be reconstructed.
Specifically, inputting a single view image to be reconstructed and each three-dimensional shape voxel to be reconstructed into a coarse feature fusion module, carrying out feature extraction on the single view image to be reconstructed to obtain image features corresponding to the single view image to be reconstructed, splitting each three-dimensional shape voxel to be reconstructed into non-coincident regular voxels, determining voxel features of the three-dimensional shape voxels to be reconstructed based on the relation among the split regular voxels belonging to the same three-dimensional shape voxel to be reconstructed, carrying out feature cross fusion on the voxel features and the image features of each three-dimensional shape voxel to be reconstructed as a group, and determining the obtained fusion features corresponding to the three-dimensional shape voxels to be reconstructed as coarse fusion features, wherein the number of the coarse fusion features depends on the number of the three-dimensional shape voxels to be reconstructed.
Optionally, the coarse feature fusion module includes an image feature extraction unit, a voxel sequence determination unit and a cross-modal feature fusion unit, and fig. 3 is a flowchart illustrating inputting a single view image to be reconstructed and voxels of three-dimensional shapes to be reconstructed into the coarse feature fusion module, as shown in fig. 3, and specifically includes the following steps:
S2021, inputting the single-view image to be reconstructed into an image feature extraction unit, and extracting to obtain the image features of the single-view image to be reconstructed.
In the present embodiment, the image feature extraction unit can be understood as a feature extractor for extracting features of an image input thereto.
Specifically, the single-view image to be reconstructed is input to an image feature extraction unit for image feature extraction, and the extracted features are determined as the image features of the single-view image to be reconstructed.
Alternatively, the feature extractor applied by the image feature extraction unit may be any feature extractor that can perform image feature extraction, such as a ResNet convolutional network or a vision transformer (Vision Transformer, ViT); the encoding operations of different feature extractors differ slightly. If a ResNet convolutional network is used as the feature extractor, the two-dimensional feature map $f_{img} \in R^{D \times H \times W}$ of the single-view image to be reconstructed is finally flattened into the one-dimensional feature map $f_{its} \in R^{D \times L1}$, where $L1 = H \times W$, $D$ is the feature dimension, $H$ is the image height, and $W$ is the image width. When ViT is used as the feature extractor, the single-view image to be reconstructed can be divided into patches and converted into sequence data, features can be extracted by a multi-layer transformer encoder, and classification prediction can be performed by a multilayer perceptron.
It can be understood that in the embodiment of the present invention, the image features that are expected to be processed in the three-dimensional reconstruction neural network model are local features, and ViT as a feature extractor can be more focused on the extraction of the local features and the structural information of the single view image to be reconstructed, so ViT can be used to replace the general convolutional neural network as the feature extractor in the image feature extraction unit in the embodiment of the present invention, so as to improve the occupation ratio of the local features in the extracted image features.
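The flattening step described above can be sketched as follows; the tiny convolutional backbone stands in for a pretrained ResNet or ViT trunk and is an assumption for illustration.

```python
import torch
import torch.nn as nn

class ImageFeatureExtractor(nn.Module):
    """Sketch of the image feature extraction unit: a backbone produces a
    2-D feature map f_img, which is flattened into the sequence f_its."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(  # toy stand-in for a ResNet/ViT trunk
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        f_img = self.backbone(img)  # f_img in R^{B x D x H x W}
        return f_img.flatten(2)     # f_its in R^{B x D x L1}, L1 = H * W

f_its = ImageFeatureExtractor()(torch.rand(1, 3, 224, 224))
print(f_its.shape)  # torch.Size([1, 256, 784])
```

For the cross-modal fusion sketched later, $f_{its}$ would be transposed to shape (B, L1, D) so that each of the L1 positions is one image token.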
S2022, inputting each three-dimensional shape voxel to be reconstructed into the voxel sequence determining unit, splitting each three-dimensional shape voxel to be reconstructed, flattening each split sub-voxel, applying a linear projection and a position embedding to form a sequence, and determining the sequence as the sub-voxel sequence corresponding to that three-dimensional shape voxel to be reconstructed.
In this embodiment, the voxel sequence determining unit may be specifically understood as a data processing unit configured to split the input three-dimensional shape voxels to be reconstructed into a plurality of preset shape voxels, and perform information configuration and combination on each of the preset shape voxels obtained after the splitting to generate a data format meeting the input requirement sequence of the cross-modal feature fusion unit. The sub-voxels may be specifically understood as preset shape voxels formed after splitting the three-dimensional shape voxels to be reconstructed, alternatively, in the embodiment of the present invention, the preset shape voxels may be set as regular square voxels, and in different situations, the preset shape voxels may also be set as complete or incomplete voxels of different types according to actual situations, which is not limited in the embodiment of the present invention.
Specifically, each three-dimensional shape voxel to be reconstructed is input into the voxel sequence determining unit, which splits it into at least one sub-voxel. The shape and size of the sub-voxels, and hence the number obtained after splitting, can be set according to the actual situation, which is not limited in the embodiment of the present invention. Because sub-voxels carry three-dimensional information, they must be flattened and linearly projected to be suitable for fusion with image features; and in order to indicate the positions of different sub-voxels within the three-dimensional shape voxel to be reconstructed, a position embedding can be added to each sub-voxel. The sequence formed by the processed sub-voxels is determined as the sub-voxel sequence corresponding to the three-dimensional shape voxel to be reconstructed.
For example, assume that the number of three-dimensional shape voxels to be reconstructed input into the three-dimensional reconstruction neural network model is $K$, that each is a three-dimensional example object in a $32 \times 32 \times 32$ voxel space, and that each three-dimensional shape voxel to be reconstructed is represented as $V_i \in R^{32 \times 32 \times 32}$, $i = 1, \dots, K$. When a three-dimensional shape voxel to be reconstructed is input into the voxel sequence determining unit, it is split into 64 mutually non-overlapping cubes, each of size $8 \times 8 \times 8$; each cube is one sub-voxel of the three-dimensional shape voxel. After each sub-voxel is flattened and linearly projected, and the position embedding is added, the sub-voxel sequence $f_{vts} \in R^{L2 \times D}$ corresponding to the three-dimensional shape voxel to be reconstructed is obtained, where $L2 = 64$.
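Under the concrete sizes of this example (a 32×32×32 grid split into 64 cubes of 8×8×8), the voxel sequence determining unit could be sketched as follows; the module name and the learned position embedding are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VoxelSequenceUnit(nn.Module):
    """Split a 32x32x32 voxel grid into 64 cubes of 8x8x8, flatten each cube,
    apply a linear projection, and add a position embedding."""
    def __init__(self, dim: int = 256, cube: int = 8, grid: int = 32):
        super().__init__()
        n_cubes = (grid // cube) ** 3                          # 64 sub-voxels
        self.cube, self.grid = cube, grid
        self.proj = nn.Linear(cube ** 3, dim)                  # linear projection
        self.pos = nn.Parameter(torch.zeros(1, n_cubes, dim))  # position embedding

    def forward(self, vox: torch.Tensor) -> torch.Tensor:      # vox: (B, 32, 32, 32)
        c, g = self.cube, self.grid
        b = vox.shape[0]
        # carve the grid into non-overlapping cubes and flatten each one
        vox = vox.reshape(b, g // c, c, g // c, c, g // c, c)
        vox = vox.permute(0, 1, 3, 5, 2, 4, 6).reshape(b, -1, c ** 3)  # (B, 64, 512)
        return self.proj(vox) + self.pos                       # f_vts: (B, 64, D)

seq = VoxelSequenceUnit()(torch.rand(2, 32, 32, 32))
print(seq.shape)  # torch.Size([2, 64, 256])
```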
It is understood that there is no obvious sequence between S2021 and S2022, and they may be executed simultaneously or in any sequence, and in the embodiment of the present invention, the execution sequence of S2021 to S2022 is only taken as an example.
S2023, inputting the sub-voxel sequences and the image features into a cross-modal feature fusion unit for each sub-voxel sequence to determine voxel features corresponding to the sub-voxel sequences, fusing the voxel features with the image features, and determining rough fusion features of voxels corresponding to the sub-voxel sequences and having three-dimensional shapes to be reconstructed.
In this embodiment, the cross-modal feature fusion unit may be specifically understood as a unit for performing association fusion on two different types of features based on a cross-attention mechanism, and in this embodiment, is a unit for fusing voxel features and image features. The rough fusion feature is specifically understood as a feature obtained by fusing an image feature of a single view image to be reconstructed and a voxel feature of a voxel of a three-dimensional shape to be reconstructed.
Specifically, the cross-modal feature fusion unit can process multiple sub-voxel sequences and the image features at the same time; the embodiment of the invention takes the processing of one sub-voxel sequence as an example. After the sub-voxel sequence and the image features are input into the cross-modal feature fusion unit, the unit determines the relations among the sub-voxels in the sequence and uses them as the voxel features corresponding to the sub-voxel sequence, and then deforms the voxel features under the guidance of the image features based on a cross-attention mechanism. By stacking a preset number of cross-modal feature fusion units for one sub-voxel sequence and the image features, the rough fusion feature of the three-dimensional shape voxel to be reconstructed corresponding to that sub-voxel sequence is obtained.
Optionally, Fig. 4 is a flowchart illustrating the process of inputting a sub-voxel sequence and image features into a cross-modal feature fusion unit according to the second embodiment of the present invention; as shown in Fig. 4, the process specifically includes the following steps:
s20231, determining the relation among all sub-voxels in the sub-voxel sequence based on a self-attention mechanism, and determining the voxel characteristics corresponding to the sub-voxel sequence.
Following the above example, when a sub-voxel sequence $f_{vts}$ is input to the cross-modal feature fusion unit, the relations among the sub-voxels in the sequence can be determined based on a self-attention mechanism together with the layer normalization function LayerNorm and the softmax function, and the residual sum of these relations is determined as the voxel feature corresponding to the sub-voxel sequence:

$$f'_{vts} = \mathrm{LayerNorm}\!\left(f_{vts} + \mathrm{softmax}\!\left(\frac{(f_{vts}W^{Q})(f_{vts}W^{K})^{\top}}{\sqrt{d_k}}\right) f_{vts}W^{V}\right)$$

where $f'_{vts}$ is the voxel feature corresponding to the three-dimensional shape voxel to be reconstructed, $W^{Q}$, $W^{K}$ and $W^{V}$ are linear projection weights, and $d_k$ is a parameter related to the attention head dimension.
S20232, carrying out linear mapping on the voxel characteristics and the image characteristics, and fusing the voxel characteristics and the image characteristics after linear mapping based on a cross attention mechanism to obtain primary fusion characteristics.
Following the above example, the image feature $f_{its}$ input to the cross-modal feature fusion unit is linearly mapped to obtain $K_{its} = f_{its}^{\top}W^{K}$ and $V_{its} = f_{its}^{\top}W^{V}$, and the voxel feature $f'_{vts}$ obtained in the preceding step is linearly mapped to obtain $Q_{vts} = f'_{vts}W^{Q}$, where $W^{Q}$, $W^{K}$ and $W^{V}$ are linear projection weights. Based on the cross-attention mechanism together with the layer normalization function LayerNorm and the softmax function, the image features guide the deformation of the voxel features, and the preliminary fusion feature $f''_{vts}$ is obtained:

$$f''_{vts} = \mathrm{LayerNorm}\!\left(f'_{vts} + \mathrm{softmax}\!\left(\frac{Q_{vts}K_{its}^{\top}}{\sqrt{d_k}}\right) V_{its}\right)$$
s20233, inputting the primary fusion characteristics into a full-connection layer for characteristic synthesis, and obtaining rough fusion characteristics of the voxels corresponding to the sub-voxel sequences and the three-dimensional shape to be reconstructed.
Specifically, in addition to the attention layers formed by the self-attention and cross-attention mechanisms, the cross-modal feature fusion unit contains a fully connected feedforward network before its final output, i.e., the fully connected layer. Taking the preliminary fusion feature as the input of the fully connected layer yields the fusion feature of the cross-modal feature fusion unit for one sub-voxel sequence.
Following the above example, the feature-synthesis function applied to the preliminary fusion feature can be expressed as:

$$\mathrm{FFN}(f''_{vts}) = \max\!\left(0,\ f''_{vts}W_1 + b_1\right)W_2 + b_2$$

where $W_1$, $W_2$, $b_1$ and $b_2$ are linear layer parameters.
Fig. 5 is a structural example diagram of a cross-modal feature fusion unit according to the second embodiment of the present invention. As shown in Fig. 5, the cross-modal feature fusion unit may at least include a self-attention layer, a first residual normalization layer, a cross-attention layer, a second residual normalization layer, a fully connected layer, and a third residual normalization layer, and a group consisting of one sub-voxel sequence and the image features may be processed by stacking N cross-modal feature fusion units. The sub-voxel sequence is input to the self-attention layer and to the first residual normalization layer, and the output of the self-attention layer is also input to the first residual normalization layer; the result obtained is the voxel feature corresponding to the sub-voxel sequence. The voxel feature is input to the cross-attention layer and to the second residual normalization layer, the image features are linearly mapped and then input to the cross-attention layer, and the output of the cross-attention layer is input to the second residual normalization layer; the result obtained is the preliminary fusion feature. The preliminary fusion feature is processed by the fully connected layer and then input to the third residual normalization layer, into which the preliminary fusion feature itself is also input, giving the fusion result of one cross-modal feature fusion unit for the group of the sub-voxel sequence and the image features. After N such fusion results are stacked, the rough fusion feature corresponding to the three-dimensional shape voxel to be reconstructed of that sub-voxel sequence is obtained.
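Under the structure of Fig. 5, one cross-modal feature fusion unit can be sketched in PyTorch as follows; the head count, dimensions, and use of nn.MultiheadAttention are illustrative assumptions rather than details fixed by the embodiment.

```python
import torch
import torch.nn as nn

class CrossModalFusionUnit(nn.Module):
    """Sketch of Fig. 5: self-attention over sub-voxels, image-guided
    cross-attention, and a feedforward layer, each followed by a residual
    connection and LayerNorm."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, f_vts: torch.Tensor, f_its: torch.Tensor) -> torch.Tensor:
        # f_vts: (B, L2, D) sub-voxel sequence; f_its: (B, L1, D) image features
        f1 = self.norm1(f_vts + self.self_attn(f_vts, f_vts, f_vts)[0])  # f'_vts
        f2 = self.norm2(f1 + self.cross_attn(f1, f_its, f_its)[0])       # f''_vts
        return self.norm3(f2 + self.ffn(f2))

# one sub-voxel sequence (L2 = 64) fused with flattened image features (L1 = 784)
out = CrossModalFusionUnit()(torch.rand(1, 64, 256), torch.rand(1, 784, 256))
print(out.shape)  # torch.Size([1, 64, 256])
```

Stacking N such units, each taking the previous unit's output together with the same image features, would then yield the rough fusion feature for one sub-voxel sequence.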
S203, inputting each rough fusion feature to a three-dimensional feature determining module to generate weight of each rough fusion feature, and connecting products of each rough fusion feature and corresponding weight to obtain a three-dimensional shape feature vector.
Optionally, the three-dimensional feature determining module may include an adaptive pooling layer, and correspondingly, inputting each coarse fusion feature into the three-dimensional feature determining module, including: and inputting each rough fusion feature to a self-adaptive pooling layer of the three-dimensional feature determining module, and determining the weight of each rough fusion feature.
Specifically, after processing by the rough feature fusion module, the same number of rough fusion features as three-dimensional shape voxels to be reconstructed is obtained. A traditional small-sample learning algorithm would determine the average of the rough fusion features as the prototype feature for the subsequent three-dimensional object reconstruction. However, the three-dimensional reconstruction task differs from a classification task in the traditional sense: the category information of the object to be reconstructed may have no direct relation to its three-dimensional shape; for example, lamps are classified by function, yet the three-dimensional shapes of different lamps may differ greatly. Therefore, the embodiment of the invention introduces an adaptive pooling layer into the three-dimensional feature determining module, generates a corresponding weight for each rough fusion feature input into the module through the adaptive pooling layer, and then connects the products of the rough fusion features and their corresponding weights to obtain a three-dimensional shape feature vector that is less dependent on the category information.
Following the above example, assume that each rough fusion feature is represented as $f^{i}_{coarse} \in R^{L2 \times D}$, where $i = 1, \dots, K$ and $K$ is the number of three-dimensional shape voxels to be reconstructed input into the three-dimensional reconstruction neural network model, and that the weights assigned to the rough fusion features by the adaptive pooling layer are represented as $W_{coarse} \in R^{K \times L2 \times D}$. The three-dimensional shape feature vector $f_{coarse}$ can then be expressed as:

$$f_{coarse} = \mathrm{Concat}\!\left(W^{1}_{coarse} \odot f^{1}_{coarse},\ \dots,\ W^{K}_{coarse} \odot f^{K}_{coarse}\right)$$

where $\odot$ denotes element-wise multiplication.
alternatively, the pooling function in the adaptive pooling layer may select a softmax function or a gum softmax function, and generate soft weights and hard weights, respectively, which the embodiments of the present invention do not limit.
S204, inputting the three-dimensional shape feature vector into a three-dimensional object reconstruction module to refine the three-dimensional shape feature vector, and mapping the refined three-dimensional shape feature vector into a voxel space to serve as output of the three-dimensional object reconstruction module.
Specifically, the three-dimensional object reconstruction module may be composed of a plurality of transformer blocks and a voxel reconstruction layer. After the three-dimensional shape feature vector is input into the three-dimensional object reconstruction module, it is refined by the transformer blocks to obtain an optimized three-dimensional shape feature vector, which is then input into the voxel reconstruction layer to be linearly mapped back to the voxel space. The three-dimensional object obtained after mapping to the voxel space is taken as the output of the three-dimensional object reconstruction module, i.e., as the output of the three-dimensional reconstruction neural network model.
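A sketch of this module under the same illustrative shapes (64 tokens, one per 8×8×8 cube of the 32×32×32 output grid; the input dimension of 1024 matches the pooling sketch above with K = 4, D = 256); the transformer depth and head count are assumptions.

```python
import torch
import torch.nn as nn

class VoxelReconstructionModule(nn.Module):
    """Refine the shape feature vector with transformer blocks, then map each
    refined token back to one 8x8x8 cube of the 32x32x32 output grid."""
    def __init__(self, dim: int = 1024, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.refine = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_voxel = nn.Linear(dim, 8 ** 3)  # voxel reconstruction layer

    def forward(self, f_coarse: torch.Tensor) -> torch.Tensor:
        # f_coarse: (B, 64, dim) shape feature vector, one token per cube
        tokens = self.refine(f_coarse)
        cubes = torch.sigmoid(self.to_voxel(tokens))  # (B, 64, 512) occupancies
        b = cubes.shape[0]
        cubes = cubes.reshape(b, 4, 4, 4, 8, 8, 8)
        return cubes.permute(0, 1, 4, 2, 5, 3, 6).reshape(b, 32, 32, 32)

v = VoxelReconstructionModule()(torch.rand(2, 64, 1024))
print(v.shape)  # torch.Size([2, 32, 32, 32])
```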
S205, determining an output result of the three-dimensional reconstruction neural network model as a three-dimensional reconstruction result.
Optionally, before acquiring the single view image to be reconstructed and at least one voxel of the three-dimensional shape to be reconstructed corresponding to the image type of the single view image to be reconstructed, the method further comprises:
and fine tuning the three-dimensional reconstruction neural network model through the test image and the test three-dimensional shape voxels in the test sample set.
The test sample set is mutually disjoint with image types contained in a training sample set used for training the three-dimensional reconstruction neural network model.
In the embodiment of the invention, the trained model can be fine-tuned with a small amount of target data based on the concept of transfer learning. The three-dimensional reconstruction neural network model in the embodiment of the invention is trained on the small-sample training principle, so a test sample set whose image categories are mutually disjoint from those of the training sample set must be selected for testing; that is, the test sample set meets the "small amount of target data" standard of the transfer learning concept. When the three-dimensional reconstruction model is tested, fine-tuning can be carried out in the model inference stage through the test three-dimensional shape voxels and test single-view images in the test sample set. The fine-tuning process is consistent with the model training process and uses the same loss function as training to fine-tune the whole network, thereby increasing the samples available to the three-dimensional reconstruction neural network model.
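A minimal sketch of that inference-stage fine-tuning, reusing the dice_loss and episode sketches above; the model interface, optimizer, learning rate, and step count are all assumptions.

```python
import torch

def finetune(model, episodes, steps: int = 50, lr: float = 1e-5):
    """Fine-tune a trained model on a handful of novel-category test episodes.

    episodes: iterator yielding (image, support_voxels, target_voxel) tuples,
    e.g. built from sample_episode above on the test sample set.
    """
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        image, support_voxels, target_voxel = next(episodes)
        pred = model(image, support_voxels)
        loss = dice_loss(pred, target_voxel)  # same loss as in training
        optim.zero_grad()
        loss.backward()
        optim.step()
    return model
```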
Fig. 6 is a structural example diagram of the three-dimensional reconstruction neural network model according to the second embodiment of the present invention, in which the output of the image feature extraction unit is written as the $I_{tgt}$ tokens, each sub-voxel sequence output by the voxel sequence determining unit is written as $f^{i}_{vts}$, $i = 1, \dots, K$, and each rough fusion feature output by the rough feature fusion module is written as $f_1, \dots, f_K$. Fig. 6 shows the training data flow of the three-dimensional reconstruction neural network model when, for one category, the input target image is $I_{tgt}$ and there are $K$ three-dimensional voxel examples $\{V_i\}_{i=1}^{K}$ of the same type as the target image.
According to the technical scheme of this embodiment, the cross-modal feature fusion units in the rough feature fusion module realize cross-attention-based fusion of image features and voxel features, and the three-dimensional feature determining module in the three-dimensional reconstruction neural network model generates weights for the rough fusion features corresponding to the three-dimensional shape voxels to be reconstructed. This improves the accuracy of the three-dimensional shape feature vectors generated according to the weights, reduces the influence of the degree of association between category and three-dimensional shape on the determination of the three-dimensional shape feature vectors, and improves the accuracy of the three-dimensional reconstruction results obtained after refinement and mapping according to the three-dimensional shape feature vectors. Before the three-dimensional reconstruction neural network model is put into use, it is tested with a test sample set different from the training sample set used to train it; during testing, the model is fine-tuned using the test images and test three-dimensional shape voxels in the test sample set, which increases the samples available to the model and improves its reconstruction accuracy.
Example III
Fig. 7 is a schematic structural diagram of a single-view three-dimensional object reconstruction device according to a third embodiment of the present invention, where, as shown in fig. 7, the single-view three-dimensional object reconstruction device includes a data acquisition module 31 to be reconstructed and a three-dimensional reconstruction module 32.
The to-be-reconstructed data obtaining module 31 is configured to obtain a to-be-reconstructed single-view image, and at least one to-be-reconstructed three-dimensional shape voxel corresponding to an image type of the to-be-reconstructed single-view image; the three-dimensional reconstruction module 32 is configured to input the single view image to be reconstructed and each three-dimensional shape voxel to be reconstructed into a three-dimensional reconstruction neural network model, and determine an output result of the three-dimensional reconstruction neural network model as a three-dimensional reconstruction result; the three-dimensional reconstruction neural network model is obtained based on small sample training; the three-dimensional reconstruction neural network model at least comprises a rough feature fusion module used for fusing image features and voxel features.
According to the technical scheme of this embodiment, because the three-dimensional reconstruction neural network model used for reconstructing the three-dimensional object is obtained by small-sample training, it can be considered a neural network model with good generalization capability, and it can better support three-dimensional reconstruction of single-view images whose types differ from those of the pre-training samples. The model's rough feature fusion module fuses the input image features and voxel features, so the model can better associate and fuse the image features of the single-view image to be reconstructed with the voxel features of the three-dimensional shape voxels to be reconstructed, and can better attend to local features during three-dimensional object reconstruction; the three-dimensional reconstruction result obtained from the associated features is therefore more accurate.
Optionally, the image features are local features of an image input to the three-dimensional reconstruction neural network model.
Optionally, the three-dimensional reconstruction neural network model further comprises a three-dimensional feature determination module and a three-dimensional object reconstruction module. Accordingly, the three-dimensional reconstruction module 32 includes:
the rough feature determining unit is configured to input the single-view image to be reconstructed and each three-dimensional shape voxel to be reconstructed into the rough feature fusion module, so as to fuse the image features of the single-view image to be reconstructed with the voxel features of each three-dimensional shape voxel to be reconstructed and determine the rough fusion feature corresponding to each three-dimensional shape voxel to be reconstructed.
The three-dimensional vector determining unit is configured to input each rough fusion feature into the three-dimensional feature determination module to generate a weight for each rough fusion feature, and to concatenate the products of each rough fusion feature and its corresponding weight to obtain the three-dimensional shape feature vector.
The three-dimensional object determining unit is configured to input the three-dimensional shape feature vector into the three-dimensional object reconstruction module to refine it, and to map the refined three-dimensional shape feature vector into voxel space as the output of the three-dimensional object reconstruction module.
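For concreteness, the data flow through these three units can be summarized in a minimal PyTorch sketch. Everything below (class names, tensor shapes, and the flatten-based concatenation of weighted features) is an illustrative assumption rather than the patent's reference implementation:

```python
# Illustrative PyTorch sketch of the data flow through the three units above.
# All names, shapes, and the flatten-based concatenation are assumptions.
import torch
import torch.nn as nn

class SingleViewReconstructor(nn.Module):
    def __init__(self, coarse_fusion, feature_determination, object_reconstruction):
        super().__init__()
        self.coarse_fusion = coarse_fusion                  # image/voxel fusion
        self.feature_determination = feature_determination  # weight generation
        self.object_reconstruction = object_reconstruction  # refine + map to voxels

    def forward(self, image, candidate_voxels):
        # candidate_voxels: list of K three-dimensional shape voxels to be
        # reconstructed, each a (B, G, G, G) occupancy grid.
        coarse = torch.stack(
            [self.coarse_fusion(image, v) for v in candidate_voxels], dim=1
        )                                                   # (B, K, D)
        weights = self.feature_determination(coarse)        # (B, K, 1)
        # Concatenate the products of each rough fusion feature and its weight
        # into a single three-dimensional shape feature vector.
        shape_vector = (coarse * weights).flatten(1)        # (B, K*D)
        # Refine the shape feature vector and map it into voxel space.
        return self.object_reconstruction(shape_vector)     # (B, R, R, R) logits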
Optionally, the rough feature fusion module includes an image feature extraction unit, a voxel sequence determination unit, and a cross-modal feature fusion unit. Correspondingly, the rough feature determining unit is specifically configured for:
inputting the single-view image to be reconstructed into the image feature extraction unit to extract the image features of the single-view image to be reconstructed;
inputting each three-dimensional shape voxel to be reconstructed into the voxel sequence determination unit, splitting each three-dimensional shape voxel to be reconstructed into sub-voxels, flattening each split sub-voxel, applying linear projection and position embedding to form a sequence, and determining the sequence as the sub-voxel sequence corresponding to that three-dimensional shape voxel to be reconstructed;
and, for each sub-voxel sequence, inputting the sub-voxel sequence and the image features into the cross-modal feature fusion unit to determine the voxel features corresponding to the sub-voxel sequence, fusing the voxel features with the image features, and determining the rough fusion feature of the three-dimensional shape voxel to be reconstructed corresponding to the sub-voxel sequence.
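The voxel sequence determination step is, in effect, a three-dimensional analogue of patch embedding in vision transformers. The sketch below assumes a cubic occupancy grid and illustrative sizes (a 32-cell grid, 4-cell sub-voxels, embedding width 256); none of these values are fixed by the patent:

```python
import torch
import torch.nn as nn

class VoxelSequenceUnit(nn.Module):
    """Split a voxel grid into sub-voxels and embed them as a token sequence,
    a 3D analogue of ViT patch embedding. All sizes are illustrative."""
    def __init__(self, grid=32, sub=4, dim=256):
        super().__init__()
        assert grid % sub == 0
        self.sub = sub
        n_tokens = (grid // sub) ** 3                   # number of sub-voxels
        self.proj = nn.Linear(sub ** 3, dim)            # linear projection
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))  # position embedding

    def forward(self, voxels):
        # voxels: (B, G, G, G) float-valued occupancy grid
        B, G = voxels.shape[0], voxels.shape[1]
        s = self.sub
        # split into (G/s)^3 sub-voxels and flatten each one
        x = voxels.reshape(B, G // s, s, G // s, s, G // s, s)
        x = x.permute(0, 1, 3, 5, 2, 4, 6).reshape(B, -1, s ** 3)
        return self.proj(x) + self.pos                  # (B, n_tokens, dim)
```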
Optionally, inputting the sub-voxel sequence and the image features into the cross-modal feature fusion unit includes:
determining the relations among the sub-voxels in the sub-voxel sequence based on a self-attention mechanism, and determining the voxel features corresponding to the sub-voxel sequence;
linearly mapping the voxel features and the image features, and fusing the linearly mapped voxel features and image features based on a cross-attention mechanism to obtain primary fusion features;
and inputting the primary fusion features into a fully connected layer for feature synthesis to obtain the rough fusion feature of the three-dimensional shape voxel to be reconstructed corresponding to the sub-voxel sequence.
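These three steps map directly onto standard attention primitives. In the following sketch, assigning the voxel features to the queries and the image features to the keys and values, the two-layer synthesis head, and the mean pooling over tokens are all assumptions made for concreteness:

```python
import torch.nn as nn

class CrossModalFusionUnit(nn.Module):
    """Self-attention over sub-voxel tokens, cross-attention against image
    features, then a fully connected synthesis head. The query/key/value
    assignment and the pooling are assumptions, not fixed by the patent."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_map = nn.Linear(dim, dim)    # linear mapping of voxel features
        self.kv_map = nn.Linear(dim, dim)   # linear mapping of image features
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, voxel_tokens, image_tokens):
        # relations among sub-voxels via self-attention -> voxel features
        v, _ = self.self_attn(voxel_tokens, voxel_tokens, voxel_tokens)
        # cross-attention fuses the linearly mapped voxel and image features
        q, kv = self.q_map(v), self.kv_map(image_tokens)
        primary, _ = self.cross_attn(q, kv, kv)          # primary fusion features
        # the fully connected layer synthesizes the rough fusion feature;
        # pooling over tokens yields one vector per candidate voxel
        return self.fc(primary).mean(dim=1)              # (B, dim)
```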
Optionally, the three-dimensional vector determining unit is specifically configured to:
inputting each rough fusion feature into the adaptive pooling layer of the three-dimensional feature determination module to determine the weight of each rough fusion feature.
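A minimal sketch of this weight generation follows, assuming average adaptive pooling over the feature dimension and a softmax normalization that the patent does not specify:

```python
import torch
import torch.nn as nn

class FeatureDetermination(nn.Module):
    """Weight generation via an adaptive pooling layer. The softmax
    normalization is an assumption; the patent does not specify one."""
    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)   # adaptive pooling over features

    def forward(self, coarse):
        # coarse: (B, K, D) rough fusion features, one per candidate voxel
        scores = self.pool(coarse).squeeze(-1)             # (B, K)
        return torch.softmax(scores, dim=1).unsqueeze(-1)  # (B, K, 1) weights
```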
Optionally, before the single-view image to be reconstructed and the at least one three-dimensional shape voxel to be reconstructed corresponding to the image type of the single-view image to be reconstructed are acquired, the following is further performed:
fine-tuning the three-dimensional reconstruction neural network model through the test images and the test three-dimensional shape voxels in a test sample set;
wherein the image types contained in the test sample set and the image types contained in the training sample set used for training the three-dimensional reconstruction neural network model are mutually disjoint.
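A hypothetical fine-tuning loop over such a disjoint test sample set might look as follows; the voxel-wise binary cross-entropy loss and the step budget are assumptions:

```python
import torch

def fine_tune(model, test_loader, optimizer, max_steps=100):
    """Hypothetical fine-tuning loop on the disjoint test sample set; the
    voxel-wise binary cross-entropy loss and step budget are assumptions."""
    criterion = torch.nn.BCEWithLogitsLoss()
    model.train()
    for step, (image, candidates, target_voxels) in enumerate(test_loader):
        if step >= max_steps:
            break
        logits = model(image, candidates)        # predicted occupancy logits
        loss = criterion(logits, target_voxels)  # compare with test voxels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```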
The single-view three-dimensional object reconstruction device provided by this embodiment of the present invention can execute the single-view three-dimensional object reconstruction method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
Example IV
Fig. 8 is a schematic structural diagram of a single-view three-dimensional object reconstruction apparatus according to a fourth embodiment of the present invention. The single-view three-dimensional object reconstruction device 40 may be an electronic device intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 8, the single-view three-dimensional object reconstruction device 40 includes at least one processor 41 and memory communicatively connected to the at least one processor 41, such as a read-only memory (ROM) 42 and a random access memory (RAM) 43, in which a computer program executable by the at least one processor is stored. The processor 41 can perform various appropriate actions and processes according to the computer program stored in the ROM 42 or loaded from the storage unit 48 into the RAM 43. The RAM 43 may also store various programs and data required for the operation of the single-view three-dimensional object reconstruction device 40. The processor 41, the ROM 42 and the RAM 43 are connected to each other via a bus 44; an input/output (I/O) interface 45 is also connected to the bus 44.
Various components in the single-view three-dimensional object reconstruction device 40 are connected to the I/O interface 45, including: an input unit 46 such as a keyboard or a mouse; an output unit 47 such as various types of displays and speakers; a storage unit 48 such as a magnetic disk or an optical disk; and a communication unit 49 such as a network card, a modem, or a wireless communication transceiver. The communication unit 49 allows the single-view three-dimensional object reconstruction device 40 to exchange information/data with other devices via a computer network such as the Internet and/or various telecommunication networks.
The processor 41 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the processor 41 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, or microcontroller. The processor 41 performs the various methods and processes described above, such as the single-view three-dimensional object reconstruction method.
In some embodiments, the single view three-dimensional object reconstruction method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 48. In some embodiments, part or all of the computer program may be loaded and/or installed onto the single view three-dimensional object reconstruction device 40 via the ROM 42 and/or the communication unit 49. When the computer program is loaded into RAM 43 and executed by processor 41, one or more steps of the single view three-dimensional object reconstruction method described above may be performed. Alternatively, in other embodiments, processor 41 may be configured to perform the single view three-dimensional object reconstruction method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or a middleware component (e.g., an application server), or a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability in traditional physical hosts and VPS (Virtual Private Server) services.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solution of the present invention are achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for reconstructing a single view three-dimensional object, comprising:
acquiring a single-view image to be reconstructed and at least one three-dimensional shape voxel to be reconstructed corresponding to the image type of the single-view image to be reconstructed;
inputting the single-view image to be reconstructed and each voxel of the three-dimensional shape to be reconstructed into a three-dimensional reconstruction neural network model, and determining an output result of the three-dimensional reconstruction neural network model as a three-dimensional reconstruction result;
wherein the three-dimensional reconstruction neural network model is a neural network model obtained based on small-sample training; and the three-dimensional reconstruction neural network model comprises at least a rough feature fusion module for fusing image features and voxel features.
2. The method of claim 1, wherein the rough feature fusion module comprises an image feature extraction unit, a voxel sequence determination unit, and a cross-modal feature fusion unit;
and wherein inputting the single-view image to be reconstructed and each three-dimensional shape voxel to be reconstructed into the three-dimensional reconstruction neural network model comprises:
inputting the single-view image to be reconstructed into the image feature extraction unit, and extracting to obtain image features of the single-view image to be reconstructed;
inputting each three-dimensional shape voxel to be reconstructed into the voxel sequence determination unit, splitting each three-dimensional shape voxel to be reconstructed into sub-voxels, flattening each split sub-voxel, applying linear projection and position embedding to form a sequence, and determining the sequence as the sub-voxel sequence corresponding to the three-dimensional shape voxel to be reconstructed;
and, for each sub-voxel sequence, inputting the sub-voxel sequence and the image features into the cross-modal feature fusion unit to determine voxel features corresponding to the sub-voxel sequence, fusing the voxel features with the image features, and determining the rough fusion feature of the three-dimensional shape voxel to be reconstructed corresponding to the sub-voxel sequence.
3. The method of claim 2, wherein the inputting the sub-voxel sequence and the image features to the cross-modal feature fusion unit comprises:
determining the relations among the sub-voxels in the sub-voxel sequence based on a self-attention mechanism, and determining the voxel features corresponding to the sub-voxel sequence;
performing linear mapping on the voxel characteristics and the image characteristics, and fusing the voxel characteristics and the image characteristics after linear mapping based on a cross attention mechanism to obtain primary fusion characteristics;
and inputting the primary fusion features into a fully connected layer for feature synthesis to obtain the rough fusion feature of the three-dimensional shape voxel to be reconstructed corresponding to the sub-voxel sequence.
4. The method of any of claims 1-3, wherein the three-dimensional reconstruction neural network model further comprises a three-dimensional feature determination module and a three-dimensional object reconstruction module;
wherein inputting the single-view image to be reconstructed and each three-dimensional shape voxel to be reconstructed into the three-dimensional reconstruction neural network model comprises:
inputting the single view image to be reconstructed and each three-dimensional shape voxel to be reconstructed into the rough feature fusion module so as to fuse the image features of the single view image to be reconstructed with the voxel features of each three-dimensional shape voxel to be reconstructed respectively and determine rough fusion features corresponding to each three-dimensional shape voxel to be reconstructed;
inputting each rough fusion feature into the three-dimensional feature determination module to generate a weight for each rough fusion feature, and concatenating the products of each rough fusion feature and the corresponding weight to obtain a three-dimensional shape feature vector;
and inputting the three-dimensional shape feature vector into the three-dimensional object reconstruction module to refine the three-dimensional shape feature vector, and mapping the refined three-dimensional shape feature vector into a voxel space to serve as output of the three-dimensional object reconstruction module.
5. The method of claim 4, wherein said inputting each of said coarse fusion features to said three-dimensional feature determination module to generate a weight for each of said coarse fusion features comprises:
and inputting each rough fusion feature to an adaptive pooling layer of the three-dimensional feature determining module, and determining the weight of each rough fusion feature.
6. The method of claim 1, wherein the image features are local features of an image input to the three-dimensional reconstruction neural network model.
7. The method of claim 1, further comprising, prior to the acquiring the single view image to be reconstructed and the at least one three-dimensional shape voxel to be reconstructed corresponding to the image type of the single view image to be reconstructed:
fine-tuning the three-dimensional reconstruction neural network model through test images and test three-dimensional shape voxels in a test sample set;
wherein the image types contained in the test sample set and the image types contained in a training sample set used for training the three-dimensional reconstruction neural network model are mutually disjoint.
8. A single view three-dimensional object reconstruction apparatus, comprising:
the device comprises a to-be-reconstructed data acquisition module, a reconstruction module and a reconstruction module, wherein the to-be-reconstructed data acquisition module is used for acquiring a to-be-reconstructed single-view image and at least one to-be-reconstructed three-dimensional shape voxel corresponding to the image type of the to-be-reconstructed single-view image;
the three-dimensional reconstruction module is used for inputting the single-view image to be reconstructed and each three-dimensional shape voxel to be reconstructed into a three-dimensional reconstruction neural network model, and determining an output result of the three-dimensional reconstruction neural network model as a three-dimensional reconstruction result;
wherein the three-dimensional reconstruction neural network model is a neural network model obtained based on small-sample training; and the three-dimensional reconstruction neural network model comprises at least a rough feature fusion module for fusing image features and voxel features.
9. A single view three-dimensional object reconstruction apparatus, comprising:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the single view three dimensional object reconstruction method of any one of claims 1-7.
10. A storage medium containing computer executable instructions which, when executed by a computer processor, are for performing the single view three dimensional object reconstruction method as claimed in any one of claims 1 to 7.
CN202310800060.0A 2023-06-30 2023-06-30 Single-view three-dimensional object reconstruction method, device, equipment and storage medium Pending CN116843832A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310800060.0A CN116843832A (en) 2023-06-30 2023-06-30 Single-view three-dimensional object reconstruction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116843832A 2023-10-03

Family

ID=88166431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310800060.0A Pending CN116843832A (en) 2023-06-30 2023-06-30 Single-view three-dimensional object reconstruction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116843832A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496075A (en) * 2024-01-02 2024-02-02 中南大学 Single-view three-dimensional reconstruction method, system, equipment and storage medium
CN117496075B (en) * 2024-01-02 2024-03-22 中南大学 Single-view three-dimensional reconstruction method, system, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination