CN116681818B - New view angle reconstruction method, training method and device of new view angle reconstruction network - Google Patents

New view angle reconstruction method, training method and device of new view angle reconstruction network Download PDF

Info

Publication number
CN116681818B
CN116681818B
Authority
CN
China
Prior art keywords
reconstruction
view angle
network
new view
nerve
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211336428.4A
Other languages
Chinese (zh)
Other versions
CN116681818A (en)
Inventor
陈兵
高崇军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202211336428.4A priority Critical patent/CN116681818B/en
Publication of CN116681818A publication Critical patent/CN116681818A/en
Application granted granted Critical
Publication of CN116681818B publication Critical patent/CN116681818B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Graphics (AREA)
  • Biophysics (AREA)
  • Geometry (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a new view angle reconstruction method, a training method of a new view angle reconstruction network, and a corresponding device. The training method supervises learning of the new view angle reconstruction network jointly with depth information and appearance information (rendering), so that the obtained neural encoding volume expresses the geometry and appearance of a scene more accurately, which in turn improves the accuracy and resolution of the finally output new view angle image. The new view angle reconstruction method constructs the corresponding neural encoding volume and neural radiance field from multiple views of a reconstruction object, and finally renders the image of the reconstruction object at a new view angle from the neural radiance field. Because the neural encoding volume can infer and learn the geometric and appearance information of the reconstruction object, no per-scene training optimization is required and no large number of training samples needs to be input.

Description

New view angle reconstruction method, training method and device of new view angle reconstruction network
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to a new view angle reconstruction method, a training method for a new view angle reconstruction network, and a training device for the new view angle reconstruction network.
Background
New view synthesis is widely applied in the fields of 3D reconstruction, VR (virtual reality), AR (augmented reality) and the like. New view synthesis uses multiple known images of an object or scene to generate a high-definition image at an arbitrary view angle.
Currently, neural radiance fields (Neural Radiance Fields, NeRF) have become the mainstream method of new view synthesis. The NeRF technique is very successful at optimizing the volumetric geometry and appearance of the observed images, so that vivid new views can be rendered. However, this method requires training optimization for each scene to obtain a high-quality neural radiance field, which is time-consuming and demands a lot of hardware, thereby limiting the use of the technology on intelligent terminals (e.g., mobile phones). In addition, the accuracy of the new views synthesized by current new view angle reconstruction methods is low.
Disclosure of Invention
In view of this, the present application provides a new view angle reconstruction method, a new view angle reconstruction network training method and a device, so as to solve at least some of the above problems, and the disclosed technical solution is as follows:
in a first aspect, the present application provides a training method of a new view angle reconstruction network, applied to a terminal device, where the terminal device runs a training network, and the training network includes the new view angle reconstruction network and an image depth estimation network; the method includes: receiving, through the new view angle reconstruction network, image samples of a plurality of reconstruction objects, where each reconstruction object corresponds to image samples of a plurality of different view angles, and the plurality of image samples of each reconstruction object include a reference image sample; obtaining, through the new view angle reconstruction network, a neural encoding volume of the reconstruction object based on all image samples of the same reconstruction object, and obtaining a rendering loss value based on the neural encoding volume; obtaining, through the depth estimation network, a depth loss value corresponding to the reference image sample based on the neural encoding volume; and optimizing parameters of the training network based on the depth loss value and the rendering loss value of the same reconstruction object. In this way, the training method supervises learning of the new view angle reconstruction network jointly with depth information and appearance information (rendering), so that the obtained neural encoding volume expresses the geometric and appearance information of the scene more accurately, which improves the accuracy and resolution of the finally output new view angle image.
In a possible implementation manner of the first aspect, obtaining, through the new view angle reconstruction network, the neural encoding volume of a reconstruction object based on all image samples of the same reconstruction object includes: extracting depth features from each image sample of the same reconstruction object through the new view angle reconstruction network to obtain feature maps; and obtaining, through the new view angle reconstruction network, the neural encoding volume corresponding to the reconstruction object by using all feature maps of the same reconstruction object.
In another possible implementation manner of the first aspect, obtaining a rendering loss value based on the neural encoding volume includes: constructing a neural radiance field corresponding to the reconstruction object based on the neural encoding volume; rendering a target view angle based on the neural radiance field of the reconstruction object to obtain a target view; and obtaining the rendering loss value by using the target view and the real view of the reconstruction object at the target view angle.
In a further possible implementation manner of the first aspect, optimizing parameters of the training network according to the depth loss value and the rendering loss value of the same reconstruction object includes: acquiring a first weight corresponding to the depth loss value and a second weight corresponding to the rendering loss value; calculating a first product of the depth loss value and the first weight and a second product of the rendering loss value and the second weight, and summing the first product and the second product to obtain a total loss value; and adjusting parameters of the training network according to the total loss value. In this way, the weight coefficients of the depth loss value and the rendering loss value can be adjusted according to actual requirements, which adjusts the influence of each loss on the total loss value and provides greater flexibility.
In a further possible implementation manner of the first aspect, obtaining, through the new view angle reconstruction network, the neural encoding volume corresponding to the reconstruction object by using all feature maps of the same reconstruction object includes: performing homography transformation on all feature maps of the same reconstruction object to project them onto parallel planes of the feature map corresponding to the reference image sample, so as to obtain a feature volume corresponding to each feature map; aggregating all feature volumes corresponding to the same reconstruction object into one three-dimensional cost volume; and encoding the three-dimensional cost volume to obtain the neural encoding volume corresponding to the reconstruction object. The neural encoding volume in this scheme can effectively infer and propagate the geometric and appearance information of scenes, so that no per-scene training optimization is required and no large number of training samples needs to be input.
In another possible implementation manner of the first aspect, aggregating all feature volumes corresponding to the same reconstruction object into one three-dimensional cost volume includes: calculating the variance of all feature volumes corresponding to the same reconstruction object to obtain the three-dimensional cost volume corresponding to the reconstruction object. The variance-based 3D cost volume obtained in this scheme encodes the image appearance changes across the different input views, so it can express the appearance changes caused by the scene geometry and by view-dependent shading effects, which in turn allows the neural encoding volume to express the geometry and appearance of the scene more accurately.
In a further possible implementation manner of the first aspect, constructing the neural radiance field corresponding to the reconstruction object based on the neural encoding volume includes: encoding a first position vector of any three-dimensional position in the neural encoding volume to obtain a second position vector; encoding a first direction vector of any observation direction of the neural encoding volume to obtain a second direction vector; and obtaining the voxel density and color information corresponding to the three-dimensional position in the observation direction by using the neural encoding volume corresponding to the reconstruction object, the position vector, the direction vector, and the color information at the positions of the image samples of the reconstruction object onto which the three-dimensional position is mapped.
In another possible implementation manner of the first aspect, encoding the first position vector of any three-dimensional position in the neural encoding volume to obtain the second position vector includes: encoding the first position vector with a positional encoding to obtain the second position vector, where the dimension of the second position vector is higher than that of the first position vector; and encoding the first direction vector of any observation direction of the neural encoding volume to obtain the second direction vector includes: encoding the first direction vector with a positional encoding to obtain the second direction vector, where the dimension of the second direction vector is higher than that of the first direction vector. By adopting a positional encoding, position and direction vectors of higher dimension can be obtained, which improves the high-frequency information, i.e. the detail information, of the rendered image.
In a second aspect, the present application provides a new view angle reconstruction method, applied to a terminal device, where the terminal device runs a new view angle reconstruction network obtained by the training method of the new view angle reconstruction network in the first aspect; the method includes: receiving, through the new view angle reconstruction network, images of a plurality of different view angles of a target reconstruction object, where the images of the plurality of different view angles include a reference image; and obtaining, through the new view angle reconstruction network, the image of the target reconstruction object at a target view angle based on all the images of the target reconstruction object. This new view angle reconstruction method constructs the corresponding neural encoding volume and neural radiance field from multiple views of the reconstruction object, and finally renders the image of the reconstruction object at the new view angle from the neural radiance field. Because the neural encoding volume can infer and learn the geometric and appearance information of the reconstruction object, no per-scene training optimization is required and no large number of training samples needs to be input.
In a possible implementation manner of the second aspect, the terminal device includes a movable camera; acquiring images of a plurality of different perspectives of a target reconstructed object, comprising: and shooting images of the target reconstruction object corresponding to a plurality of different visual angles through a movable camera, and selecting any one image from the images of the different visual angles to be determined as a reference image. Therefore, the movable camera can shoot images of the current scene or the object at a plurality of different visual angles, namely a plurality of views, in the moving process, and the user does not need to shoot the images of the different visual angles manually, so that the cooperation requirement on the user is reduced, and the user experience is improved.
In a third aspect, the present application provides an electronic device including one or more processors, a memory, and a touch screen; the memory is configured to store program code; and the processor is configured to execute the program code to cause the electronic device to implement the training method of the new view angle reconstruction network according to any one of the first aspect or the new view angle reconstruction method according to any one of the second aspect.
In a fourth aspect, the present application also provides a computer readable storage medium having instructions stored thereon, which when run on a terminal device, cause the terminal device to perform the training method of the new view reconstruction network according to any of the first aspects or the new view reconstruction method according to any of the second aspects.
In a fifth aspect, the present application further provides a computer program product storing instructions which, when run on a terminal device, cause the terminal device to implement the training method of the new view angle reconstruction network according to any one of the first aspect or the new view angle reconstruction method according to any one of the second aspect.
It should be appreciated that the description of technical features, aspects, benefits or similar language in this application does not imply that all of the features and advantages may be realized with any single embodiment. Conversely, it should be understood that the description of features or advantages is intended to include, in at least one embodiment, the particular features, aspects, or advantages. Therefore, the description of technical features, technical solutions or advantageous effects in this specification does not necessarily refer to the same embodiment. Furthermore, the technical features, technical solutions and advantageous effects described in the present embodiment may also be combined in any appropriate manner. Those of skill in the art will appreciate that an embodiment may be implemented without one or more particular features, aspects, or benefits of a particular embodiment. In other embodiments, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an interface schematic diagram of a terminal device provided in an embodiment of the present application;
fig. 2 is an interface schematic diagram of another terminal device provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a terminal device provided in an embodiment of the present application;
fig. 4 is a schematic software architecture diagram of a terminal device according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a training network structure of a new view angle reconstruction network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a new view reconstruction network according to an embodiment of the present application;
fig. 7 is a flowchart of a training method of a new view angle reconstruction network according to an embodiment of the present application.
Detailed Description
The terms first, second, third and the like in the description and in the claims and drawings are used for distinguishing between different objects and not for limiting the specified sequence.
In the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
The new view angle synthesis method and device provided by the embodiments of the application can be applied to terminal devices such as mobile phones, tablet computers, unmanned aerial vehicles, VR devices, AR devices and other portable devices. For example, in one scenario, a camera on the terminal device may capture images of multiple view angles of a scene, and the new view angle synthesis method then synthesizes a 3D representation of the scene, so that the 3D scene can be viewed directly on the terminal device, i.e. the scene can be displayed at any view angle on the terminal device.
For example, in one possible implementation, as shown in fig. 1 (1), the user may click on a camera icon 101 on the desktop interface 100, jumping to the photo interface shown in fig. 1 (2), which includes a preview display area 102 and a function item area 103.
For example, in the embodiment of the present application, the function item area 103 may include a 3D scene function item 104. In one possible implementation of the present application, the user clicks on the 3D scene function item 104, entering an interface that captures multiple different views of the scene or object.
The 3D scene function 104 may implement images of multiple different perspectives (i.e., multiple different views) of a scene or object captured by a user, thereby creating a 3D model of the scene or object that the user may view at any angle.
For example, in another possible implementation, as shown in FIG. 2 (1), the user may click on an icon 201 of a gallery application on the desktop interface 200, entering a gallery interface, which may include a photo page, an album page, a time of day page, and a discovery page.
The gallery interface includes a page control area 203 that a user may enter different pages by clicking on different controls within the area 203. The interface 202 shown in fig. 2 (2) is an interface schematic diagram of a discovery page, where the discovery page 202 includes functions such as micro-movie creation, jigsaw creation, 3D scene creation 204, and the like. In an exemplary embodiment of the present application, after a user clicks on the 3D scene creation 204, a 3D model of the scene or object may be constructed according to multiple views of the same scene or object, and the user may view an image of the scene or object at any view angle. For example, multiple views of each room in a home may be acquired by the end device, and a 3D house model may be ultimately synthesized in which images of any view of any room may be viewed.
Referring to fig. 3, a schematic structural diagram of a terminal device provided in an embodiment of the present application is shown.
As shown in fig. 3, the terminal device may include a processor 110, a camera 120, a display 130, and a memory 140.
It will be appreciated that the structure illustrated in this embodiment does not constitute a specific limitation on the terminal device. In other embodiments, the terminal device may include more or less components than illustrated, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, for example, processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a digital signal processor (digital signal processor, DSP), a baseband processor, etc., where the different processing units may be separate devices or integrated in one or more processors, for example, the modem processor and the baseband processor may be integrated in one processor. Wherein the ISP is used to process the data fed back by the camera 120.
The camera 120 is used to capture still images or video. In some embodiments, the electronic device may include 1 or N cameras 120, N being a positive integer greater than 1.
In one embodiment of the present application, the camera employs a movable camera, wherein "movable" includes, but is not limited to, at least one of the following:
1. The camera can move in any direction within a plane parallel to the plane of the terminal device's screen. For example, in a mobile phone, the camera may move along a track provided on the rear case and parallel to the screen. It is understood that the shape of the track includes, but is not limited to, arcs, straight lines, etc.
2. The camera can rotate around the fixed shaft within a certain angle range.
It can be understood that the movable camera can shoot images of a current scene or an object at a plurality of different visual angles, namely a plurality of views, in the moving process, and the user does not need to shoot the images of the different visual angles manually, so that the cooperation requirement on the user is reduced, and the user experience is improved.
The display screen 130 is used to display images, videos, and the like. In some embodiments, the electronic device may include 1 or N display screens 130, N being a positive integer greater than 1.
Memory 140 may be used to store computer executable program code that includes instructions. The processor 110 executes various functional applications of the electronic device and data processing by executing instructions stored in the memory 140. For example, in the present embodiment, the processor 110 may synthesize a new view angle image of an object or scene by executing instructions stored in the memory 140.
In addition, an operating system is run on the components.
Taking a mobile phone and a tablet computer as examples, the operating system may be, for example, an Android system or the like. The operating system may employ a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiments of the application, an Android system with a layered architecture is taken as an example to illustrate the software structure of the electronic device. In some embodiments, the operating system typically runs on an application processor.
Fig. 4 is a software structural block diagram of a terminal device provided in an embodiment of the present application.
The Android system includes four layers, which from top to bottom are the application layer, the application framework layer, the Android runtime (Android Runtime) and system libraries, and the kernel layer.
The application layer may include a series of application packages, for example, the application packages may include applications for cameras, gallery, calendar, conversation, map, navigation, WLAN, bluetooth, music, video, short messages, etc. For example, in embodiments of the present application, the application package may also include a 3D reconstruction application.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes a number of predefined functions.
The system library may include a plurality of functional modules. For example: surface manager (surface manager), media library (Media Libraries), etc.
The Android Runtime includes a core library and a virtual machine, and is responsible for scheduling and management of the Android system.
The core library consists of two parts: one part is the functions that need to be called by the Java language, and the other part is the core library of Android.
The application layer and the application framework layer run in a virtual machine. The virtual machine executes java files of the application program layer and the application program framework layer as binary files. The virtual machine is used for executing the functions of object life cycle management, stack management, thread management, security and exception management, garbage collection and the like.
The kernel layer is a layer between hardware and software. For example, the kernel layer may include various drivers such as a display driver, a camera driver, and the like.
It should be noted that, although the embodiment of the present application is described taking the Android system as an example, the basic principle is equally applicable to electronic devices based on other operating systems.
Referring to fig. 5, a block diagram of a training network architecture of a new view angle reconstruction network provided in an embodiment of the present application is shown. This embodiment shows a new view reconstruction network architecture for the training phase.
As shown in fig. 5, the training network architecture of the new view reconstruction network may include: a feature extraction module 101, a cost volume construction module 102, a neural coding module 103, an image depth estimation module 104, a neural radiation field construction module 105, and a rendering module 106.
The input of the training network architecture is a plurality of views of a plurality of reconstructed objects, each reconstructed object comprising N image samples of different perspectives (N being a positive integer greater than 2), one view being the reference image sample.
In this embodiment, the reconstructed object may be an object or a scene (such as a spatial environment, etc.), which is not limited in this application.
In an application scenario, an electronic device includes a main camera and a movable sub-camera, where the resolution of the main camera is higher than that of the sub-camera, and in the application scenario, an image acquired by the main camera may be used as a reference image sample, and an image captured by the sub-camera may be used as other views.
In another application scenario, the electronic device includes a camera, and the camera is movable, and in this application scenario, an image may be selected from a plurality of images captured by the movable camera as a reference image.
It will be appreciated that the larger the value of N, the higher the accuracy of the new view angle reconstruction network.
As described above, the secondary camera of the terminal device provided in this embodiment is a rotatable camera, which can capture images of the current scene or object from multiple different view angles during its rotation or movement, so that the user does not need to manually move the terminal device to capture different views; this reduces the cooperation required from the user and improves the user experience.
The input of the image feature extraction module 101 is N source image samples, and the module is used to extract depth features representing the geometric information contained in each source image; that is, the output of the module is N depth feature maps F_i (i = 1, 2, …, N).
The input of the cost volume construction module 102 is the N feature maps output by the feature extraction module 101, and the output is a 3D cost volume. The module projects the N feature maps onto a plurality of parallel planes of the feature map corresponding to the reference image to form N feature volumes FV_i, and then aggregates the N feature volumes to obtain the 3D cost volume.
The input of the neural encoding module 103 is the 3D cost volume C output by the cost volume construction module 102, and the output is a neural encoding volume S corresponding to the 3D cost volume C.
The input of the image depth estimation module 104 is the neural encoding volume S output by the neural encoding module 103, and the output is the depth information of the reference image.
Further, according to the depth estimation value and the depth truth value of the reference image, a depth loss value is calculated.
The inputs of the neural radiance field construction module 105 are the neural encoding volume S output by the neural encoding module 103, together with an arbitrary 3D position x within the neural encoding volume S, a viewing direction d, and the color c at that position; the output is the neural radiance field information of the neural encoding volume S.
The input of the rendering module 106 is the neural radiance field (RGB, σ) output by the neural radiance field construction module 105, and the output is the rendered new view angle image, i.e. the synthesized view.
Further, according to the new visual angle image data obtained by rendering and the real image data of the visual angle, a rendering loss value is calculated.
The depth loss value and the rendering loss value are combined to supervise learning of the training network architecture of the whole new view angle reconstruction network, so that the obtained neural encoding volume expresses the geometric and appearance information of the scene more accurately, and the accuracy and resolution of the finally output new view are higher.
Referring to fig. 6, a schematic structural diagram of a new view angle reconstruction network according to an embodiment of the present application is shown. As shown in fig. 6, the new view angle reconstruction network includes a feature extraction module 101, a cost volume construction module 102, a neural encoding module 103, a neural radiance field construction module 105, and a rendering module 106. The functions of these modules are the same as those of the corresponding modules in the training network, and are not described again here.
Referring to fig. 7, a flowchart of a training method of a new view angle reconstruction network according to an embodiment of the present application is shown, where the method is applied to the architecture of the training network of the new view angle reconstruction network shown in fig. 5. As shown in fig. 7, the method may include the steps of:
s10, the feature extraction module acquires N images with different visual angles of the same scene or object.
One of the N views (i.e. the source images) is used as the reference image, and a high-resolution image is typically selected as the reference image.
It will be appreciated that the input sample image may comprise images of a plurality of different scenes, each scene image comprising N views.
S11, extracting depth features of each view by a feature extraction module to obtain N feature graphs.
In an exemplary embodiment, the image feature extraction module 101 may extract depth features representing the local image appearance from the N images by using a 2D convolutional neural network (Convolutional Neural Network, CNN) to obtain N feature maps F_i (i = 1, 2, …, N).
In one possible implementation, the 2D CNN downsamples each spatial dimension of the N images by a factor of 4 to obtain N feature maps of 32 channels each.
Although the feature map F_i is obtained from the source image by downsampling, the neighborhood information of the pixels in the image is retained and encoded into the 32-channel feature descriptors, which provide rich semantic information; that is, the extracted feature maps contain rich semantic information. Therefore, compared with performing feature matching on the source images, performing feature matching with the feature maps can significantly improve the reconstruction quality of the new view angle image.
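For illustration only, a minimal sketch of such a 2D CNN feature extractor is given below, written with PyTorch; the concrete layer configuration is an assumption, since the embodiment only specifies the ×4 downsampling and the 32 output channels, and the sketch does not limit the embodiment.

```python
import torch
import torch.nn as nn

class FeatureExtractor2D(nn.Module):
    """Minimal 2D CNN that downsamples each spatial dimension by 4 and
    outputs a 32-channel feature map, as described for module 101.
    The exact layer configuration is an assumption."""
    def __init__(self, in_channels: int = 3, out_channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(8, 16, kernel_size=5, stride=2, padding=2),            # H/2, W/2
            nn.ReLU(inplace=True),
            nn.Conv2d(16, out_channels, kernel_size=5, stride=2, padding=2), # H/4, W/4
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (N, 3, H, W) -> features: (N, 32, H/4, W/4)
        return self.net(images)

# N views of the same scene produce N feature maps F_i
views = torch.rand(5, 3, 512, 640)
feature_maps = FeatureExtractor2D()(views)   # shape (5, 32, 128, 160)
```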
S12, the feature extraction module inputs the N feature maps to the cost volume construction module.
S13, the cost volume construction module projects the N feature maps onto parallel planes of the reference feature map to obtain N feature volumes FV_i.
The cost volume construction module 102 projects the N feature maps onto a plurality of parallel planes of the feature map corresponding to the reference image through a differentiable homography transformation, forming N feature volumes.
Homography, also known as projective transformation, is used to describe the mapping of points on two planes, and can be understood to describe the mapping of objects between world coordinates and pixel coordinates, and the corresponding transformation matrix is known as Homography. It can be seen that the homography matrix is used to describe the mapping relationship between two planes.
The N feature maps extracted by the 2D CNN are projected through homography transformations onto a plurality of parallel planes of the feature map F_1 of the reference image, forming N feature volumes FV_i. The homography determines the coordinate transformation from a feature map to the cost volume at a depth value d, and H_i(d) denotes the homography transformation matrix between F_i (i = 1, 2, …, N) and F_1 at depth value d. With n_1 denoting the principal axis direction of the reference camera, the homography matrix can be expressed as:
H_i(d) = K_i · R_i · (I − (t_1 − t_i) · n_1^T / d) · R_1^T · K_1^(−1)   (1)
where K_1 denotes the intrinsic matrix of the camera corresponding to feature map F_1, K_i denotes the intrinsic matrix of the camera corresponding to feature map F_i, R_1 denotes the rotation matrix corresponding to feature map F_1, R_i denotes the rotation matrix corresponding to feature map F_i, t_1 denotes the translation vector corresponding to feature map F_1, t_i denotes the translation vector corresponding to feature map F_i, n_1^T denotes the normal vector of the common plane onto which F_1 is mapped, and I denotes the identity matrix.
The feature volume FV_i is obtained from the feature map F_i (i = 1, 2, …, N) and H_i(d):
FV_i = F_i · H_i(d)   (2)
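For illustration only, the following sketch shows how each feature map could be warped onto the fronto-parallel depth planes of the reference view in the spirit of equations (1) and (2); the camera convention (world-to-camera extrinsics, plane at depth d in the reference frame) and the use of grid_sample are assumptions and do not limit the embodiment.

```python
import torch
import torch.nn.functional as F

def homography(K_ref, K_src, R_ref, R_src, t_ref, t_src, d):
    """Plane-induced homography mapping reference pixels to source pixels for
    the fronto-parallel plane at depth d. Extrinsics follow X_cam = R X_world + t,
    with t as (3, 1) column vectors; this convention is an assumption."""
    n = torch.tensor([[0.0], [0.0], [1.0]])           # plane normal = reference principal axis
    R_rel = R_src @ R_ref.t()
    t_rel = t_src - R_rel @ t_ref
    return K_src @ (R_rel + t_rel @ n.t() / d) @ torch.inverse(K_ref)

def feature_volume(feat_src, Hs, ref_hw):
    """Warp one source feature map (C, H, W) onto D reference depth planes,
    giving a feature volume of shape (C, D, H_ref, W_ref)."""
    H_ref, W_ref = ref_hw
    ys, xs = torch.meshgrid(torch.arange(H_ref), torch.arange(W_ref), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], -1).float().reshape(-1, 3)
    planes = []
    for H_d in Hs:                                    # one homography per depth plane
        warped = (H_d @ pix.t()).t()
        warped = warped[:, :2] / warped[:, 2:3]       # perspective divide
        # normalise pixel coordinates to [-1, 1] for grid_sample
        warped[:, 0] = 2 * warped[:, 0] / (feat_src.shape[2] - 1) - 1
        warped[:, 1] = 2 * warped[:, 1] / (feat_src.shape[1] - 1) - 1
        grid = warped.view(1, H_ref, W_ref, 2)
        planes.append(F.grid_sample(feat_src[None], grid, align_corners=True)[0])
    return torch.stack(planes, dim=1)
```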
S14, the cost volume construction module aggregates the N feature volumes to obtain a 3D cost volume corresponding to the scene or the object, and transmits the 3D cost volume to the neural encoding module.
The cost volume construction module 102 aggregates the N feature volumes FV_i into one 3D cost volume C. The cost volume expresses the changes in the geometry (such as depth information) and the appearance (such as color information) of the scene, so that the network can better learn a representation of the geometry and appearance.
To accommodate any number of input views, embodiments of the present application employ a variance-based cost metric Var, which measures the similarity between the N feature volumes and can be expressed as:
C = Var(FV_1, FV_2, …, FV_N)   (3)
This variance-based cost volume encodes the image appearance changes across the different input views, accounting for the appearance changes caused by the geometric information of the scene and by view-dependent shading effects.
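For illustration only, formula (3) corresponds to an element-wise variance over the N feature volumes, which might be computed as in the following sketch (the shapes shown are assumptions):

```python
import torch

def variance_cost_volume(feature_volumes: torch.Tensor) -> torch.Tensor:
    """Aggregate N per-view feature volumes into one variance-based 3D cost
    volume, C = Var(FV_1, ..., FV_N), computed element-wise over the views."""
    # feature_volumes: (N, C, D, H, W) -> cost volume: (C, D, H, W)
    return feature_volumes.var(dim=0, unbiased=False)

fvs = torch.rand(5, 32, 48, 128, 160)    # 5 views, 32 channels, 48 depth planes
cost_volume = variance_cost_volume(fvs)  # shape (32, 48, 128, 160)
```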
S15, the neural encoding module encodes the 3D cost volume using a neural network to obtain a neural encoding volume corresponding to the 3D cost volume, and transmits the neural encoding volume to the depth estimation module and the neural radiance field construction module respectively.
In one embodiment of the present application, the neural encoding module 103 utilizes 3D CNN learning to construct a neural encoding volume with voxel neural features, each of which may represent information of scene geometry and appearance.
In one possible implementation, the 3D CNN adopts a 3D UNet structure with downsampling and upsampling, so that information about the geometry and appearance of the scene can be effectively inferred and propagated, thereby obtaining a meaningful scene encoding volume.
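For illustration only, a heavily simplified stand-in for such a 3D UNet-style encoder is sketched below with a single downsampling/upsampling stage and one skip connection; the real network described in the embodiment is deeper, so the layer configuration here is purely an assumption.

```python
import torch
import torch.nn as nn

class NeuralEncoder3D(nn.Module):
    """Minimal 3D CNN with one downsampling and one upsampling stage,
    standing in for the 3D UNet described for module 103. Input: variance
    cost volume C; output: neural encoding volume S of the same shape."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv3d(channels, 2 * channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.up = nn.Sequential(
            nn.ConvTranspose3d(2 * channels, channels, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.out = nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, cost_volume: torch.Tensor) -> torch.Tensor:
        # cost_volume: (1, C, D, H, W) with even D, H, W
        x = self.down(cost_volume)
        x = self.up(x)
        # skip connection in the UNet spirit
        return self.out(torch.cat([x, cost_volume], dim=1))
```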
Further, the neural encoding module provides the neural encoding volume to the depth estimation module and to the neural radiance field construction module.
S16, the depth estimation module predicts the depth information of the reference image based on the neural encoding volume.
The neural encoding volume S contains both scene geometry information and appearance information, so the depth information of the reference image, i.e. a depth prediction value of the reference image, can be extracted using a multi-layer perceptron (Multilayer Perceptron, MLP) network (i.e. a second MLP network).
The depth estimate output by the second MLP network has size H×W×1, where H represents the height of the image and W represents the width of the image.
The layers of the MLP are fully connected, the lowest layer is an input layer, the middle layer is a hidden layer, and the last layer is an output layer.
And S17, calculating a depth loss value based on the depth predicted value and the depth true value of the reference image.
Further, the depth estimation module 104 calculates the depth loss L_depth from the depth estimate dep_est of the reference image and the depth ground truth dep_gt of the reference image using a loss function (formula (4)), where dep_est denotes the depth estimate of the reference image and dep_gt denotes the depth truth value of the reference image.
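For illustration only, one possible form of the second MLP network and of the depth loss is sketched below; regressing the depth as the expectation over D depth hypotheses and using an L1 norm for L_depth are assumptions, since the embodiment does not fix these details.

```python
import torch
import torch.nn as nn

class DepthHead(nn.Module):
    """Sketch of the second MLP: per-voxel features of the neural encoding
    volume S are turned into a probability over the D depth hypotheses, and
    the depth estimate is their expectation. The exact head is not specified
    by the embodiment."""
    def __init__(self, feat_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, S: torch.Tensor, depth_values: torch.Tensor) -> torch.Tensor:
        # S: (C, D, H, W), depth_values: (D,) -> depth map: (H, W)
        C, D, H, W = S.shape
        scores = self.mlp(S.permute(1, 2, 3, 0).reshape(-1, C))   # (D*H*W, 1)
        prob = torch.softmax(scores.reshape(D, H, W), dim=0)      # probability over depth
        return (prob * depth_values.view(D, 1, 1)).sum(dim=0)

def depth_loss(depth_est: torch.Tensor, depth_gt: torch.Tensor) -> torch.Tensor:
    # L_depth between dep_est and dep_gt; the L1 norm here is an assumption
    return (depth_est - depth_gt).abs().mean()
```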
S18, the neural radiance field construction module constructs the neural radiance field of the neural encoding volume and transmits the neural radiance field to the rendering module.
The inputs of the neural radiance field construction module 105 are the neural encoding volume S, together with any 3D position x within S, a viewing direction d, and the color c at that position; the output is the neural radiance field information of the neural encoding volume S, namely the voxel density σ and the radiance (i.e. RGB color information):
RGB, σ = MLP_A(x, d, c, f)   (5)
where f = Tri(x, S) is the feature obtained by trilinear interpolation of the neural encoding volume S at position x.
In the embodiments of the application, positional encoding is applied to the position vector x and the direction vector d to convert these low-dimensional vectors into high-dimensional vectors, which enhances the high-frequency signal, i.e. the detail information, in the finally output synthesized view.
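For illustration only, the positional encoding mentioned above might take the standard sin/cos form sketched below; the number of frequency bands is an assumption.

```python
import torch

def positional_encoding(v: torch.Tensor, num_freqs: int = 10) -> torch.Tensor:
    """Map a low-dimensional vector (e.g. a 3D position or view direction)
    to a higher-dimensional vector using sines and cosines at multiple
    frequencies; the exact encoding used in the embodiment is not specified."""
    out = [v]
    for k in range(num_freqs):
        freq = 2.0 ** k * torch.pi
        out.append(torch.sin(freq * v))
        out.append(torch.cos(freq * v))
    return torch.cat(out, dim=-1)   # dimension grows from D to D * (2 * num_freqs + 1)

x = torch.rand(1024, 3)                         # 3D positions
gamma_x = positional_encoding(x)                # shape (1024, 63)
d = torch.rand(1024, 3)                         # view directions
gamma_d = positional_encoding(d, num_freqs=4)   # shape (1024, 27)
```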
In an exemplary embodiment of the present application, as shown in fig. 5, the neural radiance field construction module may construct the neural radiance field of the neural encoding volume S using a first MLP network.
It will be appreciated that the first MLP network has a structure similar to that of the second MLP network employed by the depth estimation module, but the two MLP networks may have different numbers of layers and different parameters, and therefore implement different functions.
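For illustration only, the first MLP network MLP_A of formula (5) and the trilinear interpolation f = Tri(x, S) might be sketched as follows; the layer widths, the activation used for σ, and the normalisation of the query coordinates are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def trilinear_features(S: torch.Tensor, x_norm: torch.Tensor) -> torch.Tensor:
    """f = Tri(x, S): trilinearly interpolate the neural encoding volume S
    at query positions. x_norm holds coordinates already normalised to
    [-1, 1] in (x, y, z) order, shape (P, 3)."""
    grid = x_norm.view(1, -1, 1, 1, 3)                     # (1, P, 1, 1, 3)
    f = F.grid_sample(S[None], grid, align_corners=True)   # (1, C, P, 1, 1)
    return f[0, :, :, 0, 0].t()                            # (P, C)

class RadianceFieldMLP(nn.Module):
    """Sketch of MLP_A in formula (5): maps the encoded position, encoded
    direction, the colours c sampled from the input views at the projections
    of x, and the interpolated feature f to (RGB, sigma)."""
    def __init__(self, in_dim: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
        )
        self.sigma = nn.Linear(hidden, 1)
        self.rgb = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, gamma_x, gamma_d, c, f):
        h = self.body(torch.cat([gamma_x, gamma_d, c, f], dim=-1))
        return self.rgb(h), torch.relu(self.sigma(h)).squeeze(-1)  # RGB, sigma
```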
S19, the rendering module renders the target view angle in the neural radiance field to obtain an image of the target view angle (i.e. a synthesized view).
The target viewing angle may be any viewing angle within a scene range.
In one possible implementation, the rendering module 106 may use physically based volume rendering to obtain the new view angle image. Specifically, the color of a pixel is calculated by casting a ray through the pixel and accumulating the radiance at the shading points sampled along the ray, according to the following formula:
c = Σ_n T_n · (1 − exp(−σ_n)) · r_n   (6)
where T_n = exp(−Σ_{m<n} σ_m) denotes the volumetric transmittance and c is the final output color.
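For illustration only, the accumulation of formula (6) along a single ray might be implemented as in the following sketch, which assumes unit spacing between the sampled points:

```python
import torch

def render_ray(sigma: torch.Tensor, rgb: torch.Tensor) -> torch.Tensor:
    """Formula (6): accumulate colours along one ray.
    sigma: (M,) densities at the M samples, rgb: (M, 3) colours."""
    alpha = 1.0 - torch.exp(-sigma)                 # opacity of each sample
    # T_n: transmittance, i.e. how much light reaches sample n unblocked
    T = torch.exp(-torch.cumsum(torch.cat([torch.zeros(1), sigma[:-1]]), dim=0))
    weights = T * alpha                             # (M,)
    return (weights[:, None] * rgb).sum(dim=0)      # final pixel colour c
```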
In other embodiments, the rendering module 106 may also adopt a neural network rendering manner, which is not limited in the embodiments of the present application.
S110, the rendering module calculates a rendering loss value based on the synthesized view and the real image of that view angle.
Further, a loss function (formula (7)) is used to calculate the rendering loss value L_rec between the new view angle image obtained by rendering (i.e. the synthesized view) and the real image I_gt corresponding to the new view angle.
And S111, optimizing network parameters of the whole training network based on the depth loss value and the rendering loss value.
In the embodiments of the application, the depth loss L_depth and the rendering loss L_rec are jointly considered, and the network parameters of the whole training network are trained end to end.
The calculation formula of the total loss value is as follows:
L = ω_1·L_depth + ω_2·L_rec   (8)
where ω_1 is the weight coefficient of the depth loss L_depth and ω_2 is the weight coefficient of the rendering loss L_rec.
The values of the two weight coefficients can be determined according to actual requirements, e.g., ω_1 = 0.1 and ω_2 = 1.
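For illustration only, combining the two losses according to formula (8) might look as follows; the weight values shown are the example values given above.

```python
import torch

def total_loss(loss_depth: torch.Tensor, loss_rec: torch.Tensor,
               w1: float = 0.1, w2: float = 1.0) -> torch.Tensor:
    """Formula (8): weighted sum of the depth loss and the rendering loss."""
    return w1 * loss_depth + w2 * loss_rec
```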
And performing direct supervised learning on the whole training network by using the depth loss, and performing indirect supervised learning on the whole training network by using the rendering loss.
It will be appreciated that the process of training the new view angle reconstruction network is a process of optimizing the network based on the latest total loss value repeatedly, for example, after calculating the total loss value for the first time, adjusting parameters of the whole training network based on the total loss value, and further, inputting the sample image into the new training network again to calculate the new total loss value again. This process is repeatedly performed until the calculated total loss value satisfies a preset condition, for example, the loss value is not reduced any more, or the number of cycles satisfies a preset number of times, or the like.
It should be understood that the new training network herein refers to a training network of a new view reconstruction network obtained by adjusting network parameters of the entire training network.
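For illustration only, the repeated optimization described above might be organized as in the following sketch; the optimizer choice, learning rate, stopping criterion, and the names train_network, sample_batches and compute_losses are hypothetical.

```python
import torch

def train(train_network, sample_batches, compute_losses,
          lr: float = 1e-3, max_iters: int = 100000, patience: int = 1000):
    """End-to-end training loop: compute the total loss, adjust the parameters
    of the whole training network, and repeat until the loss stops decreasing
    or a preset number of iterations is reached."""
    optimizer = torch.optim.Adam(train_network.parameters(), lr=lr)
    best, stale = float("inf"), 0
    for step, batch in zip(range(max_iters), sample_batches()):
        loss_depth, loss_rec = compute_losses(train_network, batch)
        loss = 0.1 * loss_depth + 1.0 * loss_rec     # total loss, formula (8)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < best:
            best, stale = loss.item(), 0
        else:
            stale += 1
            if stale >= patience:                    # loss no longer decreasing
                break
```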
The process shown in S10 to S111 in the embodiment of the present application is the training process of the new view angle reconstruction network. The training process supervises learning of the new view angle reconstruction network jointly through geometric information (depth) and appearance information (rendering), so that the obtained neural encoding volume expresses the geometric and appearance information of the scene more accurately, which improves the accuracy and resolution of the finally output new view angle image.
The process of synthesizing a new view using the trained new view reconstruction network will be described below in connection with S20-S26 in fig. 6:
S20, the application responds to the operation of starting the 3D scene reconstruction function item, acquires N views of the scene and transmits the N views to the feature extraction module.
Where N is a positive integer greater than 2, it is understood that the greater the value of N, the higher the sharpness and resolution of the resulting new view angle image.
For example, the user may perform a clicking operation on the 3D scene button 104 in the interface shown in fig. 1 (2), or a clicking operation on the 3D scene reconstruction function card 204 in the interface shown in fig. 2 (2) may initiate a 3D scene reconstruction function.
Taking a mobile phone as an example, N views can be obtained by shooting through a main camera and a rotatable auxiliary camera arranged on the mobile phone.
In an exemplary embodiment of the application, the movable camera can shoot multiple views of a current scene or an object in the moving process, and a user does not need to manually shoot images with different visual angles, so that the cooperation requirement on the user is reduced, and the user experience is improved.
In one possible implementation, the higher resolution image captured by the primary camera may be used as the reference image.
S21, the feature extraction module extracts the depth features of the N views respectively to obtain N feature maps, and transmits the N feature maps to the cost volume construction module.
S22, the cost volume construction module projects the N feature maps onto parallel planes of the reference feature map to obtain N feature volumes.
S23, the cost volume construction module aggregates the N feature volumes to obtain the 3D cost volume corresponding to the scene or the object, and transmits the 3D cost volume to the neural encoding module.
S24, the neural encoding module encodes the 3D cost volume using a neural network to obtain the neural encoding volume corresponding to the 3D cost volume, and transmits the neural encoding volume to the neural radiance field construction module.
S25, the neural radiance field construction module constructs the neural radiance field of the neural encoding volume and transmits the neural radiance field to the rendering module.
S26, the rendering module renders the target view angle in the neural radiance field to obtain an image of the target view angle.
The process shown in S22 to S26 is the same as the implementation process of the corresponding step in the training network process, and will not be described here again.
It can be seen that, in the new view angle reconstruction method provided in this embodiment, the N acquired views of a scene are input into a new view angle reconstruction network trained with the training method provided in this application, so as to obtain an image at a new view angle. The new view angle reconstruction network is obtained through joint supervision by the depth information and the appearance information of the image, and the obtained neural encoding volume expresses the depth and appearance information of the scene more accurately, so that the accuracy and resolution of the finally output new view angle image are improved. Moreover, because the neural encoding volume can infer and learn the geometric and appearance information of the reconstruction object, scene reconstruction and rendering can be achieved with only a small number of input views, no per-scene retraining and optimization is required, and no large number of training samples needs to be input during training.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of functional modules is illustrated, and in practical application, the above-described functional allocation may be implemented by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to implement all or part of the functions described above. The specific working processes of the above-described systems, devices and units may refer to the corresponding processes in the foregoing method embodiments, which are not described herein.
In the several embodiments provided in this embodiment, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present embodiment may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present embodiment may be essentially or a part contributing to the prior art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to perform all or part of the steps of the method described in the respective embodiments. And the aforementioned storage medium includes: flash memory, removable hard disk, read-only memory, random access memory, magnetic or optical disk, and the like.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A training method of a new view angle reconstruction network, characterized by being applied to a terminal device, wherein the terminal device runs a training network, and the training network comprises a feature extraction module, a cost volume construction module, a neural encoding module, a depth estimation module, a neural radiance field construction module and a rendering module; the method comprises the following steps:
the feature extraction module acquires a plurality of image samples of different view angles corresponding to the same reconstruction object and extracts the depth features of each view to obtain a plurality of feature maps, wherein the plurality of image samples of each reconstruction object comprise a reference image sample;
the cost volume construction module projects the plurality of feature maps onto parallel planes of the feature map corresponding to the reference image sample to obtain a plurality of feature volumes, and aggregates the plurality of feature volumes to obtain a 3D cost volume corresponding to the reconstruction object;
the neural encoding module encodes the 3D cost volume to obtain a neural encoding volume corresponding to the 3D cost volume;
the depth estimation module obtains a depth estimate dep_est of the reference image sample based on the depth information predicted from the neural encoding volume, and calculates a depth loss L_depth from dep_est and dep_gt, wherein dep_est represents the depth estimation value of the reference image sample and dep_gt represents the depth truth value of the reference image sample;
the neural radiance field construction module constructs a neural radiance field corresponding to the neural encoding volume of the reconstruction object;
the rendering module renders a target view angle in the neural radiance field to obtain a synthesized image of the target view angle, and calculates a rendering loss value based on the synthesized image and the real image of the target view angle;
and optimizing network parameters of the whole training network based on the depth loss value and the rendering loss value corresponding to the same reconstruction object.
2. The method of claim 1, wherein the optimizing network parameters of the entire training network based on the depth loss value and the rendering loss value of the same reconstructed object comprises:
acquiring a first weight corresponding to the depth loss value and a second weight corresponding to the rendering loss value;
calculating a first product of the depth loss value and the first weight and a second product of the rendering loss value and the second weight, and summing the first product and the second product to obtain a total loss value;
and adjusting parameters of the training network according to the total loss value.
3. The method according to claim 1, wherein the aggregating the plurality of feature volumes to obtain a 3D cost volume corresponding to the reconstructed object comprises:
and calculating the variance of all feature volumes corresponding to the same reconstruction object to obtain the three-dimensional cost volume corresponding to the reconstruction object.
4. The method of claim 1, wherein the neural radiance field construction module constructing, based on the neural encoding volume, the neural radiance field corresponding to the neural encoding volume of the reconstruction object comprises:
encoding a first position vector of any three-dimensional position in the neural encoding volume to obtain a second position vector;
encoding a first direction vector of any observation direction of the neural encoding volume to obtain a second direction vector;
and obtaining the voxel density and color information corresponding to the arbitrary three-dimensional position in the arbitrary observation direction by using the neural encoding volume corresponding to the reconstruction object, the position vector, the direction vector, and the color information corresponding to the positions of the image samples of the reconstruction object onto which the arbitrary three-dimensional position is mapped.
5. The method of claim 4, wherein encoding the first position vector of any three-dimensional position within the neural encoding volume to obtain the second position vector comprises:
encoding the first position vector by means of positional encoding to obtain the second position vector, wherein the dimension of the second position vector is higher than that of the first position vector;
and wherein encoding the first direction vector of any observation direction of the neural encoding volume to obtain the second direction vector comprises:
encoding the first direction vector by means of positional encoding to obtain the second direction vector, wherein the dimension of the second direction vector is higher than that of the first direction vector.
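Claim 5 lifts the position and direction vectors to higher-dimensional vectors via positional encoding. A minimal sketch of the standard NeRF-style frequency encoding, assuming sin/cos features at exponentially growing frequencies (the frequency counts in the usage lines are illustrative):

```python
import math
import torch

def positional_encoding(x: torch.Tensor, num_freqs: int) -> torch.Tensor:
    """Maps a low-dimensional vector (3D position or viewing direction) to a
    higher-dimensional one by appending sin/cos features; the output dimension
    is dim(x) * (1 + 2 * num_freqs), so it always exceeds the input dimension."""
    feats = [x]
    for i in range(num_freqs):
        feats.append(torch.sin((2.0 ** i) * math.pi * x))
        feats.append(torch.cos((2.0 ** i) * math.pi * x))
    return torch.cat(feats, dim=-1)

# usage: a 3-dimensional first position vector becomes a 63-dimensional second one
# p2 = positional_encoding(torch.randn(3), num_freqs=10)   # shape (63,)
# d2 = positional_encoding(torch.randn(3), num_freqs=4)    # shape (27,) for directions
```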
6. A new view angle reconstruction method, characterized in that the method is applied to a terminal device, the terminal device runs a new view angle reconstruction network, and the new view angle reconstruction network is obtained by using the training method of the new view angle reconstruction network according to any one of claims 1 to 5; the method comprises the following steps:
receiving images of a plurality of different view angles of a target reconstruction object through the new view angle reconstruction network, wherein the images of the different view angles comprise a reference image;
and obtaining, through the new view angle reconstruction network, the image of the target reconstruction object at a target view angle based on all the images of the target reconstruction object.
7. The method of claim 6, wherein the terminal device comprises a movable camera;
the acquiring the images of the target reconstruction object at a plurality of different view angles includes:
and capturing, by the movable camera, images of the target reconstruction object corresponding to a plurality of different view angles, and selecting any one of the images of the different view angles as the reference image.
8. A terminal device, characterized in that the terminal device comprises: one or more processors, a memory, and a touch screen; the memory is configured to store program code; the processor is configured to run the program code so that the terminal device implements the training method of the new view angle reconstruction network according to any one of claims 1 to 5 or the new view angle reconstruction method according to any one of claims 6 to 7.
9. A computer-readable storage medium having instructions stored thereon which, when run on a terminal device, cause the terminal device to perform the training method of the new view angle reconstruction network according to any one of claims 1 to 5 or the new view angle reconstruction method according to any one of claims 6 to 7.
CN202211336428.4A 2022-10-28 2022-10-28 New view angle reconstruction method, training method and device of new view angle reconstruction network Active CN116681818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211336428.4A CN116681818B (en) 2022-10-28 2022-10-28 New view angle reconstruction method, training method and device of new view angle reconstruction network


Publications (2)

Publication Number Publication Date
CN116681818A (en) 2023-09-01
CN116681818B (en) 2024-04-09 (granted)

Family

ID=87779693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211336428.4A Active CN116681818B (en) 2022-10-28 2022-10-28 New view angle reconstruction method, training method and device of new view angle reconstruction network

Country Status (1)

Country Link
CN (1) CN116681818B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220239844A1 (en) * 2021-01-27 2022-07-28 Facebook Technologies, Llc Neural 3D Video Synthesis

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022099613A1 (en) * 2020-11-13 2022-05-19 华为技术有限公司 Training method for image generation model, and new view angle image generation method and apparatus
WO2022217470A1 (en) * 2021-04-13 2022-10-20 Shanghaitech University Hair rendering system based on deep neural network
CN113706714A (en) * 2021-09-03 2021-11-26 中科计算技术创新研究院 New visual angle synthesis method based on depth image and nerve radiation field
CN114241113A (en) * 2021-11-26 2022-03-25 浙江大学 Efficient nerve radiation field rendering method based on depth-guided sampling
CN114004941A (en) * 2022-01-04 2022-02-01 苏州浪潮智能科技有限公司 Indoor scene three-dimensional reconstruction system and method based on nerve radiation field
CN114882537A (en) * 2022-04-15 2022-08-09 华南理工大学 Finger new visual angle image generation method based on nerve radiation field
CN114898028A (en) * 2022-04-29 2022-08-12 厦门大学 Scene reconstruction and rendering method based on point cloud, storage medium and electronic equipment
CN115115688A (en) * 2022-05-31 2022-09-27 荣耀终端有限公司 Image processing method and electronic equipment

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Zhu Fang. 3D Scene Representation: A Survey of Recent Advances in Neural Radiance Fields (NeRF). Journal of Communication University of China (Natural Science Edition), 2022-10-20, pp. 64-77. *
Barbara Roessle, et al. Dense Depth Priors for Neural Radiance Fields from Sparse Input Views. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, full text. *
Ben Mildenhall, et al. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. European Conference on Computer Vision (ECCV) 2020, vol. 12346, pp. 405-421. *
Kangle Deng, et al. Depth-supervised NeRF: Fewer Views and Faster Training for Free. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, full text. *
Anpei Chen, et al. MVSNeRF: Fast Generalizable Radiance Field Reconstruction from Multi-View Stereo. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), full text. *
Alex Yu, et al. pixelNeRF: Neural Radiance Fields from One or Few Images. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021-11-02, full text. *
Deng Wu; Zhang Xudong; Xiong Wei; Wang Yizhi. Light Field Super-Resolution Reconstruction Fusing Global and Local Viewpoints. Application Research of Computers, 2018-04-08 (No. 05), full text. *


Similar Documents

Publication Publication Date Title
Attal et al. MatryODShka: Real-time 6DoF video view synthesis using multi-sphere images
CN109887003B (en) Method and equipment for carrying out three-dimensional tracking initialization
Lin et al. Deep multi depth panoramas for view synthesis
US11393158B2 (en) Utilizing voxel feature transformations for deep novel view synthesis
US20090295791A1 (en) Three-dimensional environment created from video
CN110381268B (en) Method, device, storage medium and electronic equipment for generating video
WO2013074561A1 (en) Modifying the viewpoint of a digital image
WO2018138582A1 (en) Computer systems and computer-implemented methods specialized in processing electronic image data
US11006141B2 (en) Methods and systems for using atlas frames to process data representative of a scene
CN115690382B (en) Training method of deep learning model, and method and device for generating panorama
US11044398B2 (en) Panoramic light field capture, processing, and display
US11625813B2 (en) Automatically removing moving objects from video streams
US20110242271A1 (en) Synthesizing Panoramic Three-Dimensional Images
US20240020915A1 (en) Generative model for 3d face synthesis with hdri relighting
US11961266B2 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN115529835A (en) Neural blending for novel view synthesis
CN113220251A (en) Object display method, device, electronic equipment and storage medium
WO2022208440A1 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
WO2024002064A1 (en) Method and apparatus for constructing three-dimensional model, and electronic device and storage medium
CN116681818B (en) New view angle reconstruction method, training method and device of new view angle reconstruction network
CN113592875B (en) Data processing method, image processing method, storage medium, and computing device
CN116645468B (en) Human body three-dimensional modeling method, method and device for training human body structure to generate model
Tenze et al. altiro3d: scene representation from single image and novel view synthesis
WO2024007968A1 (en) Methods and system for generating an image of a human
CN116630744A (en) Image generation model training method, image generation device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant