CN116051746A - Improved method for three-dimensional reconstruction and neural rendering network

Improved method for three-dimensional reconstruction and neural rendering network

Info

Publication number
CN116051746A
CN116051746A (application CN202310038592.5A)
Authority
CN
China
Prior art keywords: network, view, dimensional, NeRF, input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202310038592.5A
Other languages
Chinese (zh)
Inventor
党凤月
任振宁
李会朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Huichuan IoT Technology Co., Ltd.
Original Assignee
Shandong Huichuan IoT Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Huichuan Iot Technology Co ltd filed Critical Shandong Huichuan Iot Technology Co ltd
Priority to CN202310038592.5A priority Critical patent/CN116051746A/en
Publication of CN116051746A publication Critical patent/CN116051746A/en
Withdrawn legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00: 3D [Three Dimensional] image rendering
    • G06T15/10: Geometric effects
    • G06T15/20: Perspective computation
    • G06T15/205: Image-based rendering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757: Matching configurations of points or features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20084: Artificial neural networks [ANN]
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an improved method for a three-dimensional reconstruction and neural rendering network, comprising two-dimensional view feature extraction, feature matching, model decomposition, a single-view neural rendering network, a multi-view neural rendering network, and a commodity three-dimensional modeling workflow. The beneficial effects of the invention are as follows: by combining the advantages of FastNeRF and PixelNeRF, deep optimization is performed on top of NeRF and a complete network structure is provided, so that the model has high inference speed, requires few input views, and generalizes well. Addressing the low computational efficiency of NeRF, FastNeRF can render high-fidelity photorealistic images at 200 Hz on a consumer-grade GPU, roughly 3000 times faster than the original NeRF algorithm; in addition, PixelNeRF is designed to synthesize new views well when only a few views are known, which greatly improves the generalization of the model.

Description

Improved method for three-dimensional reconstruction and neural rendering network
Technical Field
The invention relates to a neural rendering method, in particular to an improved method for a three-dimensional reconstruction and neural rendering network, and belongs to the technical field of neural rendering for software-based three-dimensional modeling.
Background
In today's intelligent era, where digital technology and media converge and advanced technologies change by the day, people's demands keep growing. In recent years, three-dimensional models have been widely applied across industries, such as surveying and mapping, geographic information systems, teaching demonstrations, urban planning, architectural construction, game production, smart cities, smart scenic areas, and the digital archiving and protection of ancient cultural relics. Three-dimensional modeling technology makes life more convenient and provides a better experience, and three-dimensional reconstruction has very broad applications. In image entertainment, an object can be reconstructed to obtain a three-dimensional model that can be 3D printed or driven by a human body to build entertaining applications. For virtual fitting, once the human body is reconstructed, clothes of different sizes can be automatically adapted to different body shapes and heights. In smart-home scenarios, many shopping apps let users place virtual furniture to check whether it matches their home and whether it fits the actual space. For cultural-relic reconstruction and AR tourism, many museums and tourist attractions already offer similar products, such as AR Xihu (West Lake). In autonomous driving, high-precision maps can be constructed. For large scenes, three-dimensional reconstruction enables virtual roaming. Yet in most shopping apps, a consumer can only see a two-dimensional commodity image and cannot see the real effect of the commodity.
The neural radiance field (NeRF), proposed in 2020, has become the most popular algorithm in the field of three-dimensional modeling. NeRF-based neural rendering is an emerging three-dimensional modeling and rendering technique that uses a neural network to implicitly represent the shape, texture, and material of an object and is trained directly end to end, yielding highly faithful rendering results from arbitrary viewing angles. However, NeRF is still a relatively early technology: its training speed, inference speed, modeling robustness, and the lack of an explicit three-dimensional representation have all seriously limited its application, and many researchers around the world have conducted in-depth research and exploration on this basis.
Disclosure of Invention
The invention aims to solve at least one of the above technical problems, and provides an improved method for a three-dimensional reconstruction and neural rendering network, which improves the model's inference speed and generalization; the model supports a small number of inputs, i.e., a three-dimensional commodity can be reconstructed from only 3-4 two-dimensional commodity images.
The invention realizes the above purpose through the following technical scheme: an improved method of three-dimensional reconstruction and neural rendering network, comprising the steps of:
step one, two-dimensional view feature extraction: the NeRF method requires a large number of pictures from known viewing angles as input and takes a long time to train; PixelNeRF allows the network to be trained across multiple scenes to learn scene priors, i.e., to acquire prior knowledge of scenes, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views (one or a few); PixelNeRF adds a fully convolutional image encoder in front of the NeRF network to encode the input image, pixel by pixel, into a pixel-aligned feature grid, where NeRF (neural radiance field) is a method for generating novel views of complex scenes and essentially constructs an implicit rendering pipeline;
step two, feature matching: after obtaining the features corresponding to points in the input view, in order to know where a point given in world coordinates projects onto the input view, the point's coordinates are converted into camera space, and the corresponding features are then sampled from the feature map according to the normalized coordinates;
step three, model decomposition: NeRF is essentially a function that maps a three-dimensional position p and a two-dimensional representation of the viewing direction d to a three-dimensional color c and a scalar density σ, where the density depends only on the position while the color depends on both position and direction; the basic idea of FastNeRF is to trade some storage for caching so as to improve computational efficiency and reduce the time required for rendering; NeRF is decomposed into two neural networks: a position-dependent network that generates a deep radiance map and a direction-dependent network that generates weights, and the inner product of the weights and the deep radiance map estimates the color seen at the specified position from the specified direction in the scene; the FastNeRF architecture can be cached efficiently, significantly improving test-time efficiency while maintaining NeRF's visual quality;
step four, single-view neural rendering network: the model combines the two-dimensional input-view features with the decomposition network and consists of two parts, the first part being the pixel-aligned feature network obtained by encoding the input image pixel by pixel in steps one and two, and the second part being the neural network decomposed in step three into two cache-friendly networks, which caches spatial coordinate information together with the corresponding encoded features and outputs color and density values; neural rendering here means achieving explicit or implicit control of scene properties (illumination, camera parameters, pose, geometry, appearance, and semantic structure) through deep image or video generation methods;
step five, multi-view neural rendering network: for multiple input views, the coordinates and corresponding features are processed independently in each view's coordinate frame according to the single-view rendering method; the per-view rendering results are then aggregated by averaging, the aggregated result is passed to an MLP to obtain the predicted density and color, and finally the mean squared error between the computed result and the ground truth is minimized;
step six, commodity three-dimensional modeling workflow: the improved network structure is applied to commodity three-dimensional modeling; data is submitted first, with 3-4 two-dimensional commodity images as input, one picture for each angle of the commodity, and the picture background should be as simple and clean as possible, preferably predominantly white; the pictures are then preprocessed; the processed pictures are fed into the model for three-dimensional modeling and neural rendering; finally, the generated three-dimensional commodity is uploaded to the app for consumers to view.
As a still further aspect of the invention: in step one, PixelNeRF uses a pretrained ResNet34 to extract picture features; the initial 7x7 large-kernel convolutional layer is replaced with three 3x3 small convolution kernels, and the corresponding final average-pooling kernel is changed to 4x4, which not only reduces parameters but also deepens the network to increase its capacity and complexity; the improved network mainly consists of 16 basic units, 3x3 convolutional layers, and 1 fully connected layer, 36 layers in total, and the size of the final feature map is given by the formula shown as an image in the original publication (Figure BDA0004050398500000041).
Here ResNet34 refers to a residual network with 34 convolutional layers, used as the backbone for extracting feature maps.
As a still further aspect of the invention: step two specifically includes:
knowing the camera position p of the input view and the rotation matrix R, the point's coordinates in camera space are s_c = R^(-1)(s_i - p);
since the rotation matrix is orthogonal, this can be written as s_c = R^T(s_i - p) = (x_c, y_c, z_c);
the point is projected onto the camera plane, giving image-plane coordinates (projection formula shown as an image in the original publication, Figure BDA0004050398500000042);
after normalization, the coordinates s_uv are obtained (normalization formula shown as an image in the original publication, Figure BDA0004050398500000043);
finally, the corresponding features are sampled from the feature map according to s_uv.
As a still further aspect of the invention: step three specifically includes:
the same task as NeRF is divided into two cache-friendly neural networks;
the position-dependent network F_pos outputs a deep radiance map (u, v, w) containing D components;
the direction-dependent network F_dir takes the ray direction as input and outputs the component weights (β_1, ..., β_D);
this split architecture allows the position-dependent and ray-direction-dependent outputs to be cached independently, which greatly improves performance when cached.
As a still further aspect of the invention: step four specifically includes:
for a single input image I, the coordinate system is first fixed to the view space of the input image, and positions and camera rays are specified in this coordinate system;
the feature volume W^(i) of the input image is extracted with the improved ResNet34 network;
for a point x^(i) on a camera ray, the known camera intrinsics are used to project x^(i) onto the image coordinates π(x^(i)), and the corresponding image feature vector W^(i)(π(x^(i))) is then extracted by interpolating between pixel features;
finally, the image features, together with the position (x, y, z) and the viewing direction (θ, φ), are passed to the decomposition network, which outputs the color and density (RGBσ).
As a still further aspect of the invention: in step six, the submitted data includes 2D pictures and model information, and preprocessing of the pictures includes image segmentation and format conversion.
The beneficial effects of the invention are as follows: input-view features are acquired; the same task is divided into two cache-friendly neural networks, a position-dependent network that generates a deep radiance map and a direction-dependent network that generates weights; the input-view features are combined with the decomposition network to obtain single-view color and density outputs; the single-view results are aggregated by averaging and passed to an MLP to obtain the predicted density and color; by combining the advantages of FastNeRF and PixelNeRF, deep optimization is performed on top of NeRF and a complete network structure is provided, so that the model has high inference speed, requires few input views, and generalizes well; addressing the low computational efficiency of NeRF, FastNeRF can render high-fidelity photorealistic images at 200 Hz on a consumer-grade GPU, roughly 3000 times faster than the original NeRF algorithm; in addition, PixelNeRF is designed to synthesize new views well when only a few views are known, which greatly improves the generalization of the model.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a schematic diagram of an improved network architecture according to the present invention;
FIG. 3 is a diagram of a network architecture for model decomposition of the present invention;
FIG. 4 is a block diagram of a single view neural rendering network of the present invention;
FIG. 5 is a block diagram of a multi-view neural rendering network of the present invention;
FIG. 6 is a flow chart of a three-dimensional modeling scheme for a commodity according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in FIGS. 1 to 6, an improved method for a three-dimensional reconstruction and neural rendering network comprises the following steps:
step one, two-dimensional view feature extraction: the NeRF method requires a large number of pictures from known viewing angles as input and takes a long time to train; PixelNeRF allows the network to be trained across multiple scenes to learn scene priors, i.e., to acquire prior knowledge of scenes, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views (one or a few); PixelNeRF adds a fully convolutional image encoder in front of the NeRF network to encode the input image, pixel by pixel, into a pixel-aligned feature grid, where NeRF (neural radiance field) is a method for generating novel views of complex scenes and essentially constructs an implicit rendering pipeline;
step two, feature matching: after obtaining the features corresponding to points in the input view, in order to know where a point given in world coordinates projects onto the input view, the point's coordinates are converted into camera space, and the corresponding features are then sampled from the feature map according to the normalized coordinates;
step three, model decomposition: NeRF is essentially a function that maps a three-dimensional position p and a two-dimensional representation of the viewing direction d to a three-dimensional color c and a scalar density σ, where the density depends only on the position while the color depends on both position and direction; the basic idea of FastNeRF is to trade some storage for caching so as to improve computational efficiency and reduce the time required for rendering; NeRF is decomposed into two neural networks: a position-dependent network that generates a deep radiance map and a direction-dependent network that generates weights, and the inner product of the weights and the deep radiance map estimates the color seen at the specified position from the specified direction in the scene; the FastNeRF architecture can be cached efficiently, significantly improving test-time efficiency while maintaining NeRF's visual quality;
step four, single-view neural rendering network: the model combines the two-dimensional input-view features with the decomposition network and consists of two parts, the first part being the pixel-aligned feature network obtained by encoding the input image pixel by pixel in steps one and two, and the second part being the neural network decomposed in step three into two cache-friendly networks, which caches spatial coordinate information together with the corresponding encoded features and outputs color and density values; neural rendering here means achieving explicit or implicit control of scene properties (illumination, camera parameters, pose, geometry, appearance, and semantic structure) through deep image or video generation methods;
step five, multi-view neural rendering network: for multiple input views, the coordinates and corresponding features are processed independently in each view's coordinate frame according to the single-view rendering method; the per-view rendering results are then aggregated by averaging, the aggregated result is passed to an MLP to obtain the predicted density and color, and finally the mean squared error between the computed result and the ground truth is minimized;
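For illustration only, the following Python (PyTorch-style) sketch shows one way the step-five aggregation could be organized: per-view intermediate outputs are averaged and passed to an MLP that predicts density and color, trained with a mean-squared-error loss. The module names, tensor shapes, and layer widths are assumptions of this sketch and are not specified by the present disclosure.

import torch
import torch.nn as nn

class MultiViewAggregator(nn.Module):
    # Step-five sketch: process each view independently, average, then predict (sigma, rgb).
    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        # per-view branch: camera-space point (3) + pixel-aligned image feature (feat_dim)
        self.per_view = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # final MLP applied to the view-averaged representation
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),   # (sigma, r, g, b)
        )

    def forward(self, coords_per_view, feats_per_view):
        # inputs: lists with one (N, 3) / (N, feat_dim) tensor per input view,
        # each expressed in that view's own camera coordinate frame
        hs = [self.per_view(torch.cat([c, f], dim=-1))
              for c, f in zip(coords_per_view, feats_per_view)]
        pooled = torch.stack(hs, dim=0).mean(dim=0)   # aggregate by averaging over views
        out = self.head(pooled)
        sigma = torch.relu(out[..., :1])
        rgb = torch.sigmoid(out[..., 1:])
        return sigma, rgb

def photometric_loss(pred_rgb, gt_rgb):
    # training objective: mean squared error against the ground-truth pixel colors
    return torch.mean((pred_rgb - gt_rgb) ** 2)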
step six, commodity three-dimensional modeling workflow: the improved network structure is applied to commodity three-dimensional modeling; data is submitted first, with 3-4 two-dimensional commodity images as input, one picture for each angle of the commodity, and the picture background should be as simple and clean as possible, preferably predominantly white; the pictures are then preprocessed; the processed pictures are fed into the model for three-dimensional modeling and neural rendering; finally, the generated three-dimensional commodity is uploaded to the app for consumers to view.
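Purely as an illustration of the step-six workflow, the sketch below wires the stages together; the helper names, the target image size, and the upload hook are assumptions made for this sketch rather than interfaces defined by the disclosure.

from pathlib import Path
from PIL import Image

def preprocess(path: Path, size=(512, 512)) -> Image.Image:
    # format conversion; a production pipeline would also segment the foreground
    # so that the commodity sits on a clean, preferably white, background
    img = Image.open(path).convert("RGB")
    return img.resize(size)

def build_commodity_asset(image_paths, reconstruct_fn, upload_fn):
    # image_paths: 3-4 views of the commodity, one per angle
    assert 3 <= len(image_paths) <= 4, "the scheme expects 3-4 commodity views"
    views = [preprocess(Path(p)) for p in image_paths]
    asset = reconstruct_fn(views)   # three-dimensional modeling + neural rendering model
    upload_fn(asset)                # publish the generated 3D commodity to the app
    return asset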
Example 2
In addition to all the technical features in the first embodiment, the present embodiment further includes:
in the first step, PixelNeRF uses a pretrained ResNet34 to extract picture features; the initial 7x7 large-kernel convolutional layer is replaced with three 3x3 small convolution kernels, and the corresponding final average-pooling kernel is changed to 4x4, which not only reduces parameters but also deepens the network to increase its capacity and complexity; the improved network mainly consists of 16 basic units, 3x3 convolutional layers, and 1 fully connected layer, 36 layers in total, and the size of the final feature map is given by the formula shown as an image in the original publication (Figure BDA0004050398500000081).
Here ResNet34 refers to a residual network with 34 convolutional layers, used as the backbone for extracting feature maps.
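A minimal sketch of the encoder modification described above, assuming a torchvision ResNet-34 backbone; the channel widths, strides, and the exact 36-layer layout of the original figure are illustrative assumptions here, not values taken from the patent.

import torch.nn as nn
from torchvision.models import resnet34

def build_modified_resnet34() -> nn.Module:
    net = resnet34()  # in practice, ImageNet-pretrained weights would be loaded for the unchanged layers
    # replace the initial 7x7 convolution with a stack of three 3x3 convolutions
    net.conv1 = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(32), nn.ReLU(inplace=True),
        nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(32), nn.ReLU(inplace=True),
        nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1, bias=False),
    )
    # replace the final global average pooling with a 4x4 average-pooling kernel
    net.avgpool = nn.AvgPool2d(kernel_size=4)
    return net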
In the second step, specifically, the method includes:
knowing the camera coordinate p of the input view angle and the rotation matrix R, the point coordinate is s in the camera space c =R -1 (s i -p);
Since the rotation matrix is an orthogonal matrix, the coordinate is s c =R T (s i -p)=(x c ,y c ,z c );
Projected onto a camera plane with coordinates of
Figure BDA0004050398500000091
Its coordinates after normalization are
Figure BDA0004050398500000092
Finally according to s uv Corresponding features are sampled in the feature map.
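The transform and sampling above can be sketched as follows under a standard pinhole-camera assumption with intrinsics K; since the projection and normalization formulas appear only as images in the original publication, this is an illustrative reconstruction rather than the patented formula.

import torch
import torch.nn.functional as F

def sample_pixel_aligned_features(s_i, R, p, K, feat_map, image_hw):
    # s_i: (N, 3) world-space points, R: (3, 3) rotation, p: (3,) camera position
    # K: (3, 3) pinhole intrinsics, feat_map: (1, C, Hf, Wf), image_hw: (H, W) of the view
    s_c = (s_i - p) @ R                       # rows equal R^T (s_i - p), R being orthogonal
    uv = s_c @ K.T                            # pinhole projection onto the camera plane
    uv = uv[:, :2] / uv[:, 2:3]               # perspective divide -> pixel coordinates
    H, W = image_hw
    # normalize pixel coordinates to [-1, 1] as expected by grid_sample
    grid = torch.stack([2.0 * uv[:, 0] / (W - 1) - 1.0,
                        2.0 * uv[:, 1] / (H - 1) - 1.0], dim=-1).view(1, 1, -1, 2)
    feats = F.grid_sample(feat_map, grid, mode="bilinear", align_corners=True)
    return feats.reshape(feat_map.shape[1], -1).T   # (N, C) pixel-aligned features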
Example 3
In addition to all the technical features in the first embodiment, the present embodiment further includes:
in the third step, the method specifically includes:
the same task as NeRF is divided into two cache-friendly neural networks;
the position-dependent network F_pos outputs a deep radiance map (u, v, w) containing D components;
the direction-dependent network F_dir takes the ray direction as input and outputs the component weights (β_1, ..., β_D);
this split architecture allows the position-dependent and ray-direction-dependent outputs to be cached independently, which greatly improves performance when cached.
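As a minimal sketch of this factorization (the number of components D, the layer widths, and the activation choices are illustrative assumptions rather than values fixed by the disclosure), the position branch and the direction branch can each be cached independently and combined by an inner product:

import torch
import torch.nn as nn

class FactorizedRadianceField(nn.Module):
    def __init__(self, D: int = 8, hidden: int = 256):
        super().__init__()
        self.D = D
        # position-dependent network F_pos: density sigma plus D (u, v, w) map components
        self.f_pos = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 3 * D),
        )
        # direction-dependent network F_dir: the D scalar weights beta_1 ... beta_D
        self.f_dir = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, D),
        )

    def forward(self, pos, view_dir):
        out = self.f_pos(pos)                                    # cacheable over positions
        sigma = torch.relu(out[..., :1])
        uvw = out[..., 1:].reshape(*pos.shape[:-1], self.D, 3)   # deep radiance map
        beta = self.f_dir(view_dir)                              # cacheable over directions
        rgb = torch.sigmoid((beta.unsqueeze(-1) * uvw).sum(dim=-2))  # inner product -> color
        return sigma, rgb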
In the fourth step, specifically include:
for a single input image i, firstly fixing a coordinate system as a view space of the input image, and specifying a position and camera rays in the coordinate system;
extracting the characteristic quantity W of the input image through the improved ResNet34 network (i)
For a point x on the camera ray (i) By using known internal references, x is calculated (i) Projected onto the image coordinates pi (x (i) ) On, then extracting corresponding image feature vectors W between pixel features (i) (π(x (i) ));
Finally, the image features are transferred to the decomposition network along with the position (x, y, z) and the view direction (θ, φ), and the view angle (rgb σ) is output.
In step six, the submitted data includes 2D pictures and model information, and preprocessing of the pictures includes image segmentation and format conversion.
Working principle: building on the prior art, the NeRF algorithm is deeply optimized, improving the model's inference speed and generalization; the model supports a small number of inputs, i.e., a three-dimensional commodity can be reconstructed from only 3-4 two-dimensional commodity images.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted merely for clarity, and the specification should be taken as a whole, as the technical solutions in the various embodiments may be combined as appropriate to form other embodiments that will be apparent to those skilled in the art.

Claims (6)

1. An improved method for a three-dimensional reconstruction and neural rendering network, characterized in that the method comprises the following steps:
step one, two-dimensional view feature extraction,
the NeRF method requires a large number of pictures from known viewing angles as input and takes a long time to train;
PixelNeRF allows the network to be trained across multiple scenes to learn scene priors, i.e., to acquire prior knowledge of scenes, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views;
PixelNeRF adds a fully convolutional image encoder in front of the NeRF network to encode the input image, pixel by pixel, into a pixel-aligned feature grid, wherein NeRF (neural radiance field) is a method for generating novel views of complex scenes and constructs an implicit rendering pipeline;
step two, feature matching: after obtaining the features corresponding to points in the input view, in order to know where a point given in world coordinates projects onto the input view, the point's coordinates are converted into camera space, and the corresponding features are then sampled from the feature map according to the normalized coordinates;
step three, model decomposition: NeRF is essentially a function that maps a three-dimensional position p and a two-dimensional representation of the viewing direction d to a three-dimensional color c and a scalar density σ, where the density depends only on the position while the color depends on both position and direction;
the basic idea of FastNeRF is to trade some storage for caching so as to improve computational efficiency and reduce the time required for rendering;
NeRF is decomposed into two neural networks: a position-dependent network that generates a deep radiance map and a direction-dependent network that generates weights, and the inner product of the weights and the deep radiance map estimates the color seen at the specified position from the specified direction in the scene;
the FastNeRF architecture can be cached efficiently, significantly improving test-time efficiency while maintaining NeRF's visual quality;
step four, single-view neural rendering network: the model combines the two-dimensional input-view features with the decomposition network and consists of two parts, the first part being the pixel-aligned feature network obtained by encoding the input image pixel by pixel in steps one and two, and the second part being the neural network decomposed in step three into two cache-friendly networks, which caches spatial coordinate information together with the corresponding encoded features and outputs color and density values;
step five, multi-view neural rendering network: for multiple input views, the coordinates and corresponding features are processed independently in each view's coordinate frame according to the single-view rendering method; the per-view rendering results are then aggregated by averaging, the aggregated result is passed to an MLP to obtain the predicted density and color, and finally the mean squared error between the computed result and the ground truth is minimized;
step six, commodity three-dimensional modeling workflow: the improved network structure is applied to commodity three-dimensional modeling; data is submitted first, with 3-4 two-dimensional commodity images as input, one picture for each angle of the commodity, and the picture background should be as simple and clean as possible, preferably predominantly white; the pictures are then preprocessed; the processed pictures are fed into the model for three-dimensional modeling and neural rendering; finally, the generated three-dimensional commodity is uploaded to the app for consumers to view.
2. The improved method for a three-dimensional reconstruction and neural rendering network according to claim 1, characterized in that: in step one, PixelNeRF uses a pretrained ResNet34 to extract picture features; the initial 7x7 large-kernel convolutional layer is replaced with 3x3 small convolution kernels, and the corresponding final average-pooling kernel is changed to 4x4, which not only reduces parameters but also deepens the network to increase its capacity and complexity; the improved network mainly consists of 16 basic units, 3x3 convolutional layers, and 1 fully connected layer, 36 layers in total, and the size of the final feature map is given by the formula shown as an image in the original publication (Figure FDA0004050398480000031);
wherein ResNet34 refers to a residual network with 34 convolutional layers, used as the backbone for extracting feature maps.
3. The improved method for a three-dimensional reconstruction and neural rendering network according to claim 1, characterized in that step two specifically comprises:
knowing the camera position p of the input view and the rotation matrix R, the point's coordinates in camera space are s_c = R^(-1)(s_i - p);
since the rotation matrix is orthogonal, this can be written as s_c = R^T(s_i - p) = (x_c, y_c, z_c);
the point is projected onto the camera plane, giving image-plane coordinates (projection formula shown as an image in the original publication, Figure FDA0004050398480000032);
after normalization, the coordinates s_uv are obtained (normalization formula shown as an image in the original publication, Figure FDA0004050398480000033);
finally, the corresponding features are sampled from the feature map according to s_uv.
4. The improved method for a three-dimensional reconstruction and neural rendering network according to claim 1, characterized in that step three specifically comprises:
the same task as NeRF is divided into two cache-friendly neural networks;
the position-dependent network F_pos outputs a deep radiance map (u, v, w) containing D components;
the direction-dependent network F_dir takes the ray direction as input and outputs the component weights (β_1, ..., β_D);
this split architecture allows the position-dependent and ray-direction-dependent outputs to be cached independently, which greatly improves performance when cached.
5. The improved method for a three-dimensional reconstruction and neural rendering network according to claim 1, characterized in that step four specifically comprises:
for a single input image I, the coordinate system is first fixed to the view space of the input image, and positions and camera rays are specified in this coordinate system;
the feature volume W^(i) of the input image is extracted with the improved ResNet34 network;
for a point x^(i) on a camera ray, the known camera intrinsics are used to project x^(i) onto the image coordinates π(x^(i)), and the corresponding image feature vector W^(i)(π(x^(i))) is then extracted by interpolating between pixel features;
finally, the image features, together with the position (x, y, z) and the viewing direction (θ, φ), are passed to the decomposition network, which outputs the color and density (RGBσ).
6. The improved method for a three-dimensional reconstruction and neural rendering network according to claim 1, characterized in that: in step six, the submitted data includes 2D pictures and model information, and preprocessing of the pictures includes image segmentation and format conversion.
CN202310038592.5A 2023-01-12 2023-01-12 Improved method for three-dimensional reconstruction and neural rendering network Withdrawn CN116051746A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310038592.5A CN116051746A (en) 2023-01-12 2023-01-12 Improved method for three-dimensional reconstruction and neural rendering network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310038592.5A CN116051746A (en) 2023-01-12 2023-01-12 Improved method for three-dimensional reconstruction and neural rendering network

Publications (1)

Publication Number Publication Date
CN116051746A true CN116051746A (en) 2023-05-02

Family

ID=86132846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310038592.5A Withdrawn CN116051746A (en) 2023-01-12 2023-01-12 Improved method for three-dimensional reconstruction and neural rendering network

Country Status (1)

Country Link
CN (1) CN116051746A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20230502)