CN114255328A - Three-dimensional reconstruction method for ancient cultural relics based on single view and deep learning - Google Patents


Info

Publication number
CN114255328A
Authority
CN
China
Prior art keywords
convolution, dimensional, network, ancient cultural relic
Prior art date
Legal status (assumption, not a legal conclusion)
Pending
Application number
CN202111510170.0A
Other languages
Chinese (zh)
Inventor
李红波
叶成庆
杨杰
Current Assignee (the listed assignee may be inaccurate)
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202111510170.0A
Publication of CN114255328A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00: Computer-aided design [CAD]
    • G06F 30/20: Design optimisation, verification or simulation
    • G06F 30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Geometry (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Graphics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the technical field of three-dimensional reconstruction from pictures of ancient cultural relics, and discloses a three-dimensional reconstruction method for ancient cultural relics based on a single view and deep learning, comprising the following steps. Step 1: inputting a data set of ancient cultural relics and of artworks similar to them. Step 2: following an encoder-decoder network structure, using an encoder equipped with a 3D-ResVGG network and a multi-path channel attention module to perform deep information mining and feature extraction on the data set, and generating an ancient cultural relic three-dimensional network model. Step 3: preprocessing the picture of the ancient cultural relic to be reconstructed with an intelligent AI tool to generate a picture of a preset type. Step 4: inputting the preset-type picture, as a single view, into the ancient cultural relic three-dimensional network model to generate a complete three-dimensional model of the relic in the picture. The invention provides a novel 3D-ResVGG network, which mitigates the over-fitting problem of the original VGG architecture through multi-branch model training.

Description

Three-dimensional reconstruction method for ancient cultural relics based on single view and deep learning
Technical Field
The invention relates to the technical field of three-dimensional reconstruction from pictures of ancient cultural relics, and in particular to a three-dimensional reconstruction method for ancient cultural relics based on a single view and deep learning.
Background
Ancient cultural relics are the remains and traces left by human beings in social activities that carry historical, artistic, and scientific value; they are a precious historical and cultural heritage. In many cases, people can observe and understand ancient cultural relics only through pictures, which limits how deeply the relics can be understood. We live in a three-dimensional world, but what the human eye sees is a two-dimensional projection of an object, and the form captured by a camera is likewise a two-dimensional image; image-based three-dimensional reconstruction is therefore a core problem in many fields such as computer vision, computer animation, and industrial manufacturing. In the real world, all objects exist in three dimensions and have characteristics such as shape, appearance, and texture. In engineering surveying and mapping, three-dimensional reconstruction helps remote-sensing systems acquire the required surface spatial information in real time; in human-computer interaction, it lays the foundation for high-level human-machine interfaces and for genuinely immersive, interactive virtual reality; and in building informatization, it is an important tool for protecting historic buildings, for inspecting and managing engineering quality, and for managing demolition, renovation, and decoration.
Meanwhile, cultural relic data serves as a carrier for spreading, preserving, and displaying history, culture, and social landscapes to the public, and its protection has gradually gained public attention. Three-dimensional reconstruction has by now become the leading application technology for preserving historical relic data: it can store the geometric information of real cultural relics over the long term, greatly facilitating applied research on relics and in-depth data analysis.
In addition, in the field of visualization, the digitization of physical cultural relic information and computer visualization techniques are both being applied at a growing scale, giving the cultural and educational industries more opportunities to exercise the social education function of cultural relics and to further meet the public's appreciation needs.
Among traditional three-dimensional reconstruction methods, stereoscopic vision theory is mature. Based on the pinhole imaging principle, stereoscopic vision uses cameras to simulate the geometry of human binocular viewing, computing parallax and from it deriving depth. In recent years, with the popularity of smartphones and manufacturers' pursuit of lens bokeh, three-dimensional reconstruction based on binocular vision has also matured: a disparity map is computed from a dual-camera shot, a depth map is computed from the disparity map, and the three-dimensional scene is then recovered. Alongside this development, a large number of stereoscopic vision devices, such as stereo cameras and the Kinect, have appeared. Nevertheless, most ancient cultural relic pictures that are readily available online are still monocular single-view images, rather than binocular RGB-D images or images with given camera parameters, and single-view pictures lacking such associated information are difficult to reconstruct with stereoscopic vision theory. This raises a new computer vision problem: three-dimensional reconstruction from a single view.
Single-view three-dimensional reconstruction is a very difficult computer vision task. Traditional methods rely on hand-crafted features: operators extract image textures and contours, optical flow is computed to extract motion information, and light-dark relations are judged from gray levels; reconstruction of the three-dimensional model is then completed from these features together with auxiliary assumptions. Limited by the amount of information such features contain, the results of traditional methods are not ideal.
Disclosure of Invention
Aiming at the current lack of ancient cultural relic view pictures and of three-dimensional model data sets, and building on existing research and on the construction of an ancient cultural relic three-dimensional model data set, a three-dimensional reconstruction method for ancient cultural relics based on a single view and deep learning is provided. Once the ancient cultural relic three-dimensional network model has been trained, a model can be generated from a single input view picture, realizing the mapping from a two-dimensional image to a three-dimensional model and performing three-dimensional reconstruction from a single view picture. Moreover, the algorithm needs no image annotations or classification labels for training, and it overcomes previously unsolved problems such as missing texture and wide-baseline feature matching.
The invention is realized by the following technical scheme:
A three-dimensional reconstruction method for ancient cultural relics based on a single view and deep learning comprises the following steps:
step 1: inputting a data set of an ancient cultural relic and of artworks similar to it, wherein the data set comprises multi-angle, multi-view pictures of the ancient cultural relic and the similar artworks and a voxel model for training;
step 2: following an encoder-decoder network structure, using an encoder equipped with a 3D-ResVGG network and a multi-path channel attention module to perform deep information mining and feature extraction on the data set and generate an ancient cultural relic three-dimensional network model;
step 3: preprocessing the ancient cultural relic picture to be reconstructed with an intelligent AI tool to generate a picture of a preset type;
step 4: inputting the preset-type picture, as a single view, into the ancient cultural relic three-dimensional network model to generate a complete three-dimensional model of the relic in the picture.
As an optimization, in step 2 the ancient cultural relic three-dimensional network model comprises an encoder, an iterative convolution module, and a decoder. The encoder is composed of a residual network structure, a multi-path channel attention module, and a 3D-ResVGG network; the iterative convolution module is composed of a group of iterative convolution units arranged spatially in a three-dimensional grid, each unit being responsible for reconstructing one element of the finally output voxel probability result.
As an optimization, the 3D-ResVGG network is obtained by adding two groups of 1 × 1 convolutions to the forward-propagation part of a ResNet network, combining the ResNet residual module with the VGG network architecture.
As an optimization, the multi-path channel attention module performs grouped convolution during the feature extraction carried out by the 3D-ResVGG network, mining the data information deeply in the form of multi-path, multi-group convolution kernels to obtain the depth information of the data.
As an optimization, the encoder in step 2 performs depth information mining and feature extraction on the data set in the following specific steps:
Step 2.1: the 3D-ResVGG network receives the data set and performs feature extraction on it to generate feature maps;
Step 2.2: the multi-path channel attention module performs grouped extraction on the feature maps generated by the 3D-ResVGG network: the feature maps are divided into several groups according to their characteristics and the convolution kernels, and several groups of adapted convolution kernels are selected to perform grouped convolution on them;
Step 2.3: after convolution, each group of feature maps, imitating the residual network structure, undergoes a group of 1 × 1 convolutions, yielding the corresponding groups of feature maps;
Step 2.4: the groups of feature maps obtained in step 2.3 are spliced and combined into a feature map of higher scale and dimensionality with richer information content.
As an optimization, in step 2.2 three groups of convolution kernels are selected to perform grouped convolution on the feature maps.
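The grouping-then-splicing flow of steps 2.1 to 2.4 can be sketched in NumPy. This is an illustrative sketch only: the 32 × 32 spatial size, the 12-channel feature map, the random weights, and the modelling of each group's convolution as a 1 × 1 per-pixel channel mix are assumptions chosen for the example, not values from the patent.

```python
import numpy as np

def conv1x1(fmap, weights):
    """A 1x1 convolution is a per-pixel linear mix of channels:
    (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)."""
    return fmap @ weights

def grouped_extract(fmap, n_groups=3, rng=None):
    """Split channels into groups, convolve each group separately
    (steps 2.2-2.3), then splice the results back together (step 2.4)."""
    if rng is None:
        rng = np.random.default_rng(0)
    groups = np.split(fmap, n_groups, axis=-1)       # step 2.2: grouping
    outs = []
    for g in groups:
        w = rng.standard_normal((g.shape[-1], g.shape[-1]))
        outs.append(conv1x1(g, w))                   # step 2.3: per-group 1x1 conv
    return np.concatenate(outs, axis=-1)             # step 2.4: splice and combine

feature_map = np.random.default_rng(1).standard_normal((32, 32, 12))
combined = grouped_extract(feature_map, n_groups=3)
print(combined.shape)   # (32, 32, 12): same spatial size, channels regrouped
```

The spatial size is preserved; only the channel dimension is processed group by group before being concatenated back.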
As an optimization, the iterative convolution unit is composed of a long-short convolution network. The long-short convolution network selects its number of convolution units according to the specific format of the input feature map, performs iterative convolution on the feature maps extracted by the encoder from a three-dimensional perspective, obtains the voxel probability result of each convolution over the feature map in the manner of a classification network, and collects and retains a group of three-dimensional probability information.
As an optimization, in step 2 the specific process by which the iterative convolution units perform the convolution operation on the input groups of feature maps is as follows: the iterative convolution units perform iterative convolution over the groups of multi-angle feature maps, cyclically extracting their information and dynamically updating the voxel probability associated with each unit; finally, the voxel probability data of all iterative convolution units are combined into a group forming a three-dimensional voxel probability model.
As an optimization, in the iterative convolution module the specific structure of the iterative convolution units is determined by the resolution of the feature map: an arrangement of iterative convolution units of size X is used to process the data of a feature map of resolution X.
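The view-by-view refinement performed by the iterative convolution units can be caricatured as a running update of a voxel-probability grid. The sketch below replaces the patent's learned long-short convolution units with a plain running average, and the 4 × 4 × 4 grid and the random per-view "evidence" arrays are invented for illustration only.

```python
import numpy as np

def iterative_voxel_update(view_evidence, grid=4):
    """Keep one probability per voxel cell and refine it as each view is
    processed, mimicking the dynamic update of the iterative convolution
    units (here: a simple running average, not a learned unit)."""
    prob = np.full((grid, grid, grid), 0.5)    # uninformed prior
    for n, evidence in enumerate(view_evidence, start=1):
        prob += (evidence - prob) / (n + 1)    # incremental running average
    return prob

rng = np.random.default_rng(0)
views = [rng.uniform(0, 1, (4, 4, 4)) for _ in range(5)]
voxel_prob = iterative_voxel_update(views)
occupancy = voxel_prob > 0.5                   # threshold into a voxel model
```

Each additional view nudges the current probability model toward the new evidence, which is the qualitative behaviour the paragraph above describes.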
As an optimization, whether the ancient cultural relic three-dimensional network model is reliable is verified through a loss function, where the loss function is a feature-regularized softmax function obtained by improving the cross-entropy function.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method can generate a corresponding three-dimensional voxel model from a single input picture of an ancient cultural relic: the ancient cultural relic three-dimensional network model is trained with a number of two-dimensional images, and the trained model is then used to reconstruct the two-dimensional image under test.
2. Aiming at the high video-memory occupation and low flexibility of the traditional ResNet network and the over-fitting of the traditional VGG network at greater depths, the invention provides a new 3D-ResVGG network. Building on the traditional ResNet and VGG architectures and trained as a multi-branch model, the network turns the 3 × 3 residual operator of ResNet into a 3 × 3 convolution combined with a 1 × 1 convolution branch, and adds a ReLU activation after each group of large branches, thereby alleviating the over-fitting problem of the traditional VGG network.
3. The invention remedies shortcomings of traditional encoder networks, such as the inability to extract information deeply and slow training, and improves the precision and depth of the overall three-dimensional reconstruction.
Drawings
In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be considered limiting of its scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort. In the drawings:
FIG. 1 is a schematic structural diagram of an ancient cultural relic three-dimensional model of an ancient cultural relic three-dimensional reconstruction method based on a single view and deep learning according to the invention;
FIG. 2 is a flow chart of a three-dimensional ancient cultural relic reconstruction method based on single view and deep learning according to the invention;
FIG. 3 is a flow chart of an encoder in a three-dimensional ancient cultural relic reconstruction method based on single view and deep learning according to the present invention;
FIG. 4 is a block diagram of a multi-path channel attention module in accordance with the present invention;
FIG. 5 is a structural diagram of the 3D-ResVGG network in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1
A three-dimensional reconstruction method for ancient cultural relics based on a single view and deep learning comprises the following steps:
Step 1: inputting a data set of ancient cultural relics and of artworks similar to them. The data set here may use the existing ShapeNetVox data set, which includes voxel models of objects similar to ancient cultural relics, such as china, lamps, bowls, tables, and chairs.
Step 2: according to a network structure of an encoder-decoder, an encoder with a 3D-ResVGG network and a multipath channel attention module is used for carrying out deep information mining and feature extraction on the data set, and an ancient cultural relic three-dimensional network model is generated;
and step 3: preprocessing an ancient cultural relic picture needing three-dimensional reconstruction through an intelligent AI tool to generate a preset type picture; the intelligent AI tool herein may include but is not limited to PhotoScissor, which uses a photo processing tool such as PhotoScissor to perform preprocessing such as dividing, clipping, etc. on a picture.
Step 4: inputting the preset-type picture, as a single view, into the ancient cultural relic three-dimensional network model to generate a complete three-dimensional model of the relic in the picture.
In this embodiment, in step 2 the ancient cultural relic three-dimensional network model includes an encoder, an iterative convolution module, and a decoder. The encoder is composed of a residual network structure, a multi-path channel attention module, and a 3D-ResVGG network, where the multi-path channel attention module is an efficient multi-scale channel attention mechanism network. The input of the encoder is the two-dimensional image of a single view; its output is a two-dimensional feature vector, which must then be converted into three-dimensional information.
the iterative convolution module uses a set of iterative convolution units, each iterative convolution unit is composed of a set of long and short convolution networks, namely continuous long and short time and LRU networks, the iterative convolution units are distributed in a three-dimensional grid structure in space, each iterative convolution unit is responsible for reconstructing a voxel probability module which is finally output, specifically, the iterative convolution units select specific convolution unit number according to the specific format of a network input picture, for example, the specific convolution unit number is determined according to the spatial resolution of the picture, if the spatial resolution is 32, 32 × 32 convolution units are provided, the feature map extracted in the above mode is subjected to iterative convolution from the three-dimensional angle, the voxel probability result of each convolution in the feature map is obtained in the form of a classification network, and a set of three-dimensional probability information is collected and reserved, wherein the specific format of the feature map is determined according to the size of the picture, the method is equivalent to cutting a picture into a plurality of small blocks, wherein each small block represents the voxel probability of the position of the small block; the input of the decoder is three-dimensional information obtained by processing of the iterative convolution module, and the output of the decoder is three-dimensional prediction voxel occupation of a single image.
As shown in FIG. 5, in this embodiment the 3D-ResVGG network is obtained by adding two groups of 1 × 1 convolutions to the forward-propagation part of a ResNet network, combining the ResNet residual module with the VGG network architecture. By combining the advantages of both, the 3D-ResVGG network achieves high-precision feature extraction while using only 3 × 3 and 1 × 1 convolutions.
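Because convolution is linear in its kernel, a 3 × 3 branch summed with a parallel 1 × 1 branch is mathematically equivalent to a single 3 × 3 convolution whose kernel is the 3 × 3 kernel plus the zero-padded 1 × 1 kernel. The toy single-channel 2-D convolution below demonstrates only that identity; it is not the patent's network code, and the sizes and random values are arbitrary.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive single-channel 2-D convolution with zero 'same' padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * k).sum()
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k3 = rng.standard_normal((3, 3))
k1 = rng.standard_normal((1, 1))

branch_sum = conv2d_same(x, k3) + conv2d_same(x, k1)   # two branches (training view)
k_fused = k3 + np.pad(k1, 1)                           # pad the 1x1 into a 3x3 centre
fused = conv2d_same(x, k_fused)                        # one branch (inference view)
print(np.allclose(branch_sum, fused))                  # True
```

This linearity is the usual reason a multi-branch 3 × 3 + 1 × 1 design costs nothing extra at inference time.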
In this embodiment, the multi-path channel attention module performs grouped convolution during the feature extraction carried out by the 3D-ResVGG network, applies the activation function ReLU after the convolution, mines the data information deeply in the form of multi-path, multi-group convolution kernels, and acquires the depth information of the data.
Specifically, as shown in FIGS. 3 and 4, in this embodiment the encoder in step 2 performs depth information mining and feature extraction on the data set in the following specific steps:
Step 2.1: the 3D-ResVGG network receives the data set and performs feature extraction on it to generate feature maps;
Step 2.2: the multi-path channel attention module performs grouped extraction on the feature maps generated by the 3D-ResVGG network. The feature maps are divided into several groups according to their characteristics and the convolution kernels, and several groups of adapted convolution kernels are selected for grouped convolution, which both reduces the parameter count and improves the extraction precision of the feature maps. Whether the feature maps are divided into three or four groups is determined by their number, and the convolution kernels are selected according to the actual training requirements: for example, when accuracy matters most, kernels with a smaller kernel count may be selected, and when training time matters most, kernels with a larger count. For instance, the feature maps may be divided into groups according to their characteristics over the kernels 1 × 1, 3 × 3, 5 × 5, 7 × 7, and 9 × 9, and three adapted groups selected for grouped convolution. The parameter reduction can be understood as follows: without grouping, the three-dimensional neural network would generate 12 feature maps, which may be too large for the server to process at once, whereas grouping them into four groups of 3 lets the server process 3 at a time.
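The parameter saving from grouped convolution can be made concrete with a weight count. The channel numbers below (12 in, 12 out, 3 × 3 kernels, four groups) are chosen to echo the 12-feature-map example and are assumptions for illustration:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a 2-D convolution layer: each group connects
    c_in/groups input channels to c_out/groups output channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * (c_out // groups) * k * k * groups

full    = conv_params(12, 12, 3)            # ungrouped: 1296 weights
grouped = conv_params(12, 12, 3, groups=4)  # four groups of 3 channels: 324 weights
print(full, grouped, full // grouped)       # 1296 324 4
```

In general, splitting into g groups divides the weight count of the layer by g.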
Step 2.3: to prevent over-fitting of the result, after convolution each group of feature maps, imitating the residual network structure, undergoes a group of 1 × 1 convolutions, yielding the corresponding groups of feature maps; in the specific example of step 2.2, three sets of feature maps are obtained.
Step 2.4: the groups of feature maps produced by the weighting and residual processing of steps 2.1 to 2.3 are spliced and combined into a feature map of higher scale and dimensionality with richer information content.
In FIG. 4, "input" is the plurality of groups of feature maps, the Split node groups them, conv is a convolution kernel, path is a path, SE is a residual module, and "output" is the feature map after combination and splicing.
In this embodiment, in step 2 the specific process by which the iterative convolution units perform the convolution operation on the input groups of feature maps is as follows. The units perform iterative convolution over the groups of multi-angle feature maps, cyclically extracting their information, and the voxel probability associated with each unit is dynamically updated; in the multi-view case, the processing of each view updates the currently generated voxel probability model into a more accurate one, and finally the voxel probability data of all units are combined into a group forming a three-dimensional voxel probability model. Note that "cyclically extracting" the feature map information means that a hidden layer between the input and output layers of the iterative convolution unit processes the generated feature-map information a second time, the hidden layers of the iterative convolution units (long-short time networks) being arranged in the three-dimensional grid.
In this embodiment, in the iterative convolution module, a specific structure of the iterative convolution unit is determined by the resolution of the feature map, the iterative convolution unit of X × X is configured to process data of the feature map with X resolution, and X is a positive integer.
In this embodiment, whether the ancient cultural relic three-dimensional network model is reliable is verified through a loss function, where the loss function is a feature-regularized softmax function obtained by improving the cross-entropy function. Data normalization is added on top of the existing A-softmax and L-softmax: when groups of data differ in quality, features are extracted according to the condition of the specific picture data, remedying problems such as the failure of concentrated sample attention caused by picture quality.
The specific formula is as follows:
L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s\,\cos(m\theta_{y_i})}}{e^{s\,\cos(m\theta_{y_i})} + \sum_{j \neq y_i} e^{s\,\cos\theta_j}}
where θ is defined as the interval angle; m is an introduced parameter factor, defined as the angular feature-distance parameter, which adjusts the distance between features and is fixed at 4; and s is a scaling factor, fixed at 30.
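As a rough sketch of what a feature-regularized, margin-based softmax with m = 4 and s = 30 could look like, the NumPy function below L2-normalizes features and class weights, applies a multiplicative angular margin cos(mθ) to the target class, and takes the scaled cross entropy. The exact form of the patent's loss is not reproduced in the source, so this implementation is an assumption in the spirit of A-softmax/L-softmax.

```python
import numpy as np

def margin_softmax_loss(features, weights, labels, m=4, s=30):
    """Sketch of a feature-regularised softmax with an angular margin.
    Features and class weights are L2-normalised, the target-class logit
    uses cos(m*theta), and all logits are scaled by s. (The monotone
    extension of cos(m*theta) used by A-/L-softmax is omitted here.)"""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = np.clip(f @ w, -1.0, 1.0)               # cos(theta) for every class
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    logits = s * cos
    logits[rows, labels] = s * np.cos(m * theta[rows, labels])  # widen the margin
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()

rng = np.random.default_rng(0)
loss = margin_softmax_loss(rng.standard_normal((8, 16)),   # 8 samples, 16-dim features
                           rng.standard_normal((16, 5)),   # 5 classes
                           rng.integers(0, 5, 8))
```

Because the margin shrinks the target-class logit, minimizing this loss forces features of the same class into a tighter angular cone than a plain softmax would.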
The encoder is a 3D-ResVGG network improved from the existing ResNet network; combining the advantages of the ResNet residual module and the VGG architecture, it achieves high-precision feature extraction while using only 3 × 3 and 1 × 1 convolutions. The iterative convolution module is composed of a group of iterative convolution units, each producing one voxel probability, so that multi-view image input yields a complete three-dimensional voxel probability model. The decoder is composed of a traditional ResNet network, implemented with the prior art; it prevents the over-fitting that overly deep network learning can cause, counteracts the instability in voxel-probability accuracy that the encoder and iterative convolution module may introduce, and completes the convolutional neural network. The input of the encoder is a single-view two-dimensional image and its output a two-dimensional feature vector, which must be converted into three-dimensional information; the iterative convolution module completes the voxel probability model distributed over the three-dimensional grid; the input of the decoder is the three-dimensional information produced by the iterative convolution module, and its output is the three-dimensional predicted voxel occupancy of the single image.
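The encoder / iterative-convolution / decoder data flow can be summarized by its tensor shapes. The stub below uses random placeholders for all three stages; only the assumed 127 × 127 × 3 input and 32 × 32 × 32 voxel output sizes echo the training setup, and none of the layer contents are the patent's.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(image):
    """Stand-in encoder: single-view image -> flat 2-D feature vector."""
    return rng.standard_normal(1024)

def iterative_module(feature_vec, grid=32):
    """Stand-in iterative convolution module: features -> voxel probabilities."""
    logits = rng.standard_normal((grid, grid, grid))
    return 1.0 / (1.0 + np.exp(-logits))       # sigmoid -> probabilities in (0, 1)

def decoder(voxel_prob, t=0.5):
    """Stand-in decoder: probabilities -> predicted voxel occupancy."""
    return voxel_prob > t

image = rng.uniform(0, 1, (127, 127, 3))       # single-view input picture
occupancy = decoder(iterative_module(encoder(image)))
print(occupancy.shape, occupancy.dtype)        # (32, 32, 32) bool
```

The point of the sketch is only the interface: a 2-D image goes in, a binary 3-D occupancy grid comes out.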
The feasibility of the scheme of Example 1 was verified through specific experiments, described in detail below:
1) Experimental data set
ShapeNet data set
This data set consists of three-dimensional CAD models of objects and is to date the largest richly annotated three-dimensional model data set. The models are organized under the WordNet taxonomy, and each carries rich semantic annotations, including physical dimensions, keywords, and so on; the annotations can be browsed through a Web-based interface for visualizing object attributes. ShapeNet contains over 3 million models in total, of which 220,000 are classified into 3,135 categories.
2) Evaluation criteria
IoU:Intersection-over-Union
IoU is used as the evaluation index; the larger the IoU, the better. In the formula below, I(x) is the indicator function, t represents the threshold, p_(i,j,k) is the predicted probability that the voxel at position (i,j,k) is occupied, and y_(i,j,k) is the ground-truth occupancy of that voxel; a voxel is predicted to exist at a position when its probability exceeds the threshold. The specific formula is as follows:

IoU = Σ_(i,j,k) [ I(p_(i,j,k) > t) · I(y_(i,j,k)) ] / Σ_(i,j,k) [ I( I(p_(i,j,k) > t) + I(y_(i,j,k)) ) ]
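A minimal numpy sketch of this metric follows: the predicted probabilities are thresholded into an occupancy grid and compared against the ground truth. The threshold value 0.4 and the toy grids are illustrative assumptions.

```python
import numpy as np

def voxel_iou(pred_prob, gt_occ, t=0.4):
    # I(p > t): a voxel is predicted occupied when its probability exceeds t
    pred = pred_prob > t
    gt = gt_occ.astype(bool)
    inter = np.logical_and(pred, gt).sum()  # occupied in both prediction and GT
    union = np.logical_or(pred, gt).sum()   # occupied in either
    return inter / union if union else 1.0

pred = np.array([[[0.9, 0.2], [0.6, 0.1]]])  # toy 1 x 2 x 2 probability grid
gt = np.array([[[1, 0], [0, 0]]])            # ground-truth occupancy
print(voxel_iou(pred, gt))                   # 0.5
```

Here the prediction marks two voxels occupied (probabilities 0.9 and 0.6) while the ground truth marks one, so the intersection is 1 voxel and the union is 2.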
3) training process
The invention trains the ancient cultural relic three-dimensional network model on the ShapeNet data set. An image of 127 × 127 × 3 is input and, after passing through the ancient cultural relic three-dimensional network model, a voxel space of 32 × 32 × 32 is output.
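The tensor shapes in this training setup can be traced with a toy stand-in pipeline. The function bodies below are random placeholders that only reproduce the shapes of the data flowing between the three components, not the learned mappings; the feature-vector width of 1024 is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(image):
    # Stand-in for the 3D-ResVGG encoder: a single-view 2D image is
    # reduced to a flat feature vector (the width 1024 is illustrative).
    return rng.standard_normal(1024)

def iterative_conv_module(features, grid=32):
    # Stand-in for the grid of iterative convolution units: one voxel
    # probability per unit, arranged in a grid x grid x grid volume.
    logits = rng.standard_normal((grid, grid, grid))
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> per-voxel probabilities

def decoder(voxel_prob, t=0.5):
    # Stand-in for the ResNet decoder: the probability volume is
    # refined into a binary occupancy prediction.
    return voxel_prob > t

image = rng.standard_normal((127, 127, 3))  # single-view input image
occ = decoder(iterative_conv_module(encoder(image)))
print(occ.shape)                            # (32, 32, 32)
```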
In summary, the embodiment of the present invention verifies the feasibility of the scheme in embodiment 1 through the experimental process, the experimental data, and the experimental results, and the three-dimensional object reconstruction algorithm provided by the embodiment of the present invention has a good three-dimensional reconstruction capability for the two-dimensional image.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A three-dimensional ancient cultural relic reconstruction method based on single view and deep learning is characterized by comprising the following steps:
step 1: inputting a data set of an ancient cultural relic and a handicraft similar to the ancient cultural relic, wherein the data set comprises a multi-angle multi-view picture of the ancient cultural relic and the handicraft similar to the ancient cultural relic and a voxel model for training;
step 2: according to a network structure of an encoder-decoder, an encoder with a 3D-ResVGG network and a multipath channel attention module is used for carrying out deep information mining and feature extraction on the data set, and an ancient cultural relic three-dimensional network model is generated;
and step 3: preprocessing an ancient cultural relic picture needing three-dimensional reconstruction through an intelligent AI tool to generate a preset type picture;
and 4, step 4: and inputting the preset type picture as a single view into the ancient cultural relic three-dimensional network model to generate a complete ancient cultural relic three-dimensional model of the ancient cultural relic picture needing three-dimensional reconstruction.
2. The method for reconstructing the ancient cultural relics based on the single view and the deep learning as claimed in claim 1, wherein in the step 2, the ancient cultural relics three-dimensional network model comprises an encoder, an iterative convolution module and a decoder, the encoder is composed of a residual network structure combined with a multipath channel attention module and a 3D-ResVGG network, the iterative convolution module is composed of a group of iterative convolution units, the iterative convolution units are spatially distributed in a three-dimensional grid structure, and each iterative convolution unit is responsible for reconstructing a final output voxel probability result.
3. The method of claim 1, wherein the 3D-ResVGG network is obtained by adding two groups of 1 x 1 convolutions to a forward propagation part of a ResNet network and combining a ResNet residual module and a VGG network architecture.
4. The method as claimed in claim 3, wherein the multipath channel attention module is used for: performing grouped convolution while the 3D-ResVGG network performs feature extraction, and deeply mining the data information in the form of multipath, multi-group convolution kernels to obtain the depth information of the data.
5. The method as claimed in claim 4, wherein the encoder in step 2 performs depth information mining and feature extraction on the data set by the following specific steps:
2.1, the 3D-ResVGG network receives a data set and carries out feature extraction on the data set to generate a feature map;
step 2.2, the multipath channel attention module performs grouping extraction on the feature map generated by the 3D-ResVGG network: dividing the characteristic graph into a plurality of groups according to the characteristics of the characteristic graph and convolution kernels, and selecting a plurality of groups of adaptive convolution kernels to perform grouping convolution on the characteristic graph;
2.3, performing a group of 1 × 1 convolution on each group of the feature map simulation residual error network structure after convolution to obtain a plurality of corresponding groups of feature maps;
and 2.4, splicing and combining the plurality of groups of feature maps obtained in the step 2.3 to obtain a feature map with high scale, high dimensionality and richer information content.
6. The method as claimed in claim 5, wherein in step 2.2, 3 groups of information convolution kernels are selected to perform the grouping convolution on the feature map.
7. The ancient cultural relic three-dimensional reconstruction method based on the single view and the deep learning according to claim 6, wherein the iterative convolution unit is composed of a long and short convolution network, the long and short convolution network selects the number of convolution units according to the specific format of the input feature map, performs iterative convolution from a three-dimensional angle on the feature map extracted by the encoder, obtains the voxel probability result of each convolution in the feature map in the manner of a classification network, and collects and retains a group of three-dimensional probability information.
8. The method according to claim 7, wherein in step 2, the iterative convolution unit performs convolution operation on the input multiple groups of feature maps by using a specific process that is: and the iterative convolution unit performs iterative convolution operation on a plurality of groups of multi-angle characteristic images to circularly extract the information of the characteristic images, dynamically updates the voxel probability related to the iterative convolution unit, and finally combines the voxel probability data of all the iterative convolution units to synthesize a group of three-dimensional voxel probability models.
9. The method according to claim 8, wherein in the iterative convolution module, the specific structure of the iterative convolution unit is determined by the resolution of the feature map, and the iterative convolution unit of X is used for processing data of the feature map of X resolution.
10. The ancient cultural relic three-dimensional reconstruction method based on the single view and the deep learning as claimed in any one of claims 1 to 9, wherein whether the ancient cultural relic three-dimensional network model is reliable or not is verified through a loss function, wherein the loss function is a feature regularization softmax function after cross entropy function improvement.
CN202111510170.0A 2021-12-10 2021-12-10 Three-dimensional reconstruction method for ancient cultural relics based on single view and deep learning Pending CN114255328A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111510170.0A CN114255328A (en) 2021-12-10 2021-12-10 Three-dimensional reconstruction method for ancient cultural relics based on single view and deep learning


Publications (1)

Publication Number Publication Date
CN114255328A true CN114255328A (en) 2022-03-29

Family

ID=80794841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111510170.0A Pending CN114255328A (en) 2021-12-10 2021-12-10 Three-dimensional reconstruction method for ancient cultural relics based on single view and deep learning

Country Status (1)

Country Link
CN (1) CN114255328A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011075082A1 (en) * 2009-12-14 2011-06-23 Agency For Science, Technology And Research Method and system for single view image 3 d face synthesis
ES2400277A2 (en) * 2009-05-21 2013-04-08 Intel Corporation Techniques for rapid stereo reconstruction from images
CN112862949A (en) * 2021-01-18 2021-05-28 北京航空航天大学 Object 3D shape reconstruction method based on multiple views
CN113724361A (en) * 2021-08-23 2021-11-30 西安工业大学 Single-view three-dimensional reconstruction method based on deep learning


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHIA-HO HSU: "Fast Single-View 3D Object Reconstruction with Fine Details Through Dilated Downsample and Multi-Path Upsample Deep Neural Network", ICASSP 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 9 April 2020 (2020-04-09) *
YE CHENGQING: "Research on three-dimensional reconstruction technology of ancient cultural relics based on single view and deep learning", CNKI OUTSTANDING MASTER'S DISSERTATIONS FULL-TEXT DATABASE, 15 June 2023 (2023-06-15) *
ZHU LI; CHEN HUI: "Single-image three-dimensional reconstruction algorithm based on deep learning", JOURNAL OF JILIN INSTITUTE OF CHEMICAL TECHNOLOGY, no. 01, 15 January 2020 (2020-01-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116502548A (en) * 2023-06-29 2023-07-28 湖北工业大学 Three-dimensional toy design method based on deep learning
CN116502548B (en) * 2023-06-29 2023-09-15 湖北工业大学 Three-dimensional toy design method based on deep learning

Similar Documents

Publication Publication Date Title
Pittaluga et al. Revealing scenes by inverting structure from motion reconstructions
CN108921926B (en) End-to-end three-dimensional face reconstruction method based on single image
US11328481B2 (en) Multi-resolution voxel meshing
CN115100339A (en) Image generation method and device, electronic equipment and storage medium
CN113159232A (en) Three-dimensional target classification and segmentation method
Lu et al. Design and implementation of virtual interactive scene based on unity 3D
CN112634456B (en) Real-time high-realism drawing method of complex three-dimensional model based on deep learning
CN116977531A (en) Three-dimensional texture image generation method, three-dimensional texture image generation device, computer equipment and storage medium
Choi et al. Balanced spherical grid for egocentric view synthesis
CN114255328A (en) Three-dimensional reconstruction method for ancient cultural relics based on single view and deep learning
Chen et al. Fast virtual view synthesis for an 8k 3d light-field display based on cutoff-nerf and 3d voxel rendering
CN117557721A (en) Method, system, equipment and medium for reconstructing detail three-dimensional face of single image
CN116935008A (en) Display interaction method and device based on mixed reality
Wang et al. Integrated design system of voice-visual VR based on multi-dimensional information analysis
CN115686202A (en) Three-dimensional model interactive rendering method across Unity/Optix platform
JP2019149112A (en) Composition device, method, and program
Xu et al. Real-time panoramic map modeling method based on multisource image fusion and three-dimensional rendering
Yao et al. Neural Radiance Field-based Visual Rendering: A Comprehensive Review
Xie Dance Performance in New Rural Areas Based on 3D Image Reconstruction Technology
Gao et al. Research on digital protection of bamboo tube based on hani nationality
Liu [Retracted] Ming‐Style Furniture Display Design Based on Immersive 5G Virtual Reality
Rebollo et al. Three-dimensional trees for virtual globes
Liao et al. Multi-View Image Reconstruction Algorithm Based on Virtual Reality Technology
Jiang et al. PMPI: Patch-Based Multiplane Images for Real-Time Rendering of Neural Radiance Fields
He Interactive Virtual Reality Indoor Space Roaming System Based on 3D Vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination