CN116091705A - Variable topology dynamic scene reconstruction and editing method and device based on neural radiance field

Variable topology dynamic scene reconstruction and editing method and device based on neural radiance field

Info

Publication number
CN116091705A
Authority
CN
China
Prior art keywords
dynamic scene
image
point
image sequence
key point
Prior art date
Legal status
Pending
Application number
CN202310283264.1A
Other languages
Chinese (zh)
Inventor
徐枫
郑成伟
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202310283264.1A
Publication of CN116091705A
Legal status: Pending

Classifications

    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N 3/02, G06N 3/08: Neural networks; learning methods
    • G06T 7/20, G06T 7/269: Analysis of motion using gradient-based methods
    • G06T 7/50: Depth or shape recovery
    • G06T 2200/04: Indexing scheme involving 3D image data
    • G06T 2207/10004, G06T 2207/10012: Still image; stereo images
    • G06T 2207/10016, G06T 2207/10021: Video; stereoscopic video or image sequence
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • Y02T 10/40: Engine management systems

Abstract

The application provides a variable topology dynamic scene reconstruction and editing method based on a neural radiance field, which comprises the following steps: acquiring a monocular video of a dynamic scene; identifying 3D key points of the dynamic scene according to an image sequence of the monocular video, and initializing the positions of the 3D key points in the image sequence using optical flow and depth images; inputting the image sequence and the 3D key point positions into a dynamic scene reconstruction network model, training the dynamic scene reconstruction network model according to a joint loss function, and optimizing the 3D key point positions of each frame of image; modeling the topology-variable dynamic scene using the optimized 3D key point positions to generate a topology-variable dynamic scene; and editing the topology-variable dynamic scene by modifying the 3D key point positions to generate new-view videos or images of the topology-variable dynamic scene. The method and the device can model a dynamic scene with variable topology using 3D key points and edit the dynamic scene by controlling the 3D key points, and the operation is simple and intuitive.

Description

Variable topology dynamic scene reconstruction and editing method and device based on neural radiance field
Technical Field
The application relates to the technical fields of computer vision and computer graphics, and in particular to a method and a device for reconstructing and editing a variable topology dynamic scene based on a neural radiance field.
Background
The reconstruction and editing of dynamic scenes has long been a central problem in computer graphics and computer vision research. In recent years, the introduction of Neural Radiance Fields (NeRF) has markedly improved the quality of novel-view reconstruction of scenes, but such methods do not support scene editing. When topology changes occur in a dynamic scene, the spatial motion discontinuities they cause make the reconstruction and editing of the dynamic scene even harder to solve.
Disclosure of Invention
The present application aims to solve, at least to some extent, one of the technical problems in the related art.
Therefore, a first object of the present application is to provide a method for reconstructing and editing a topology-variable dynamic scene based on a neural radiance field, which addresses the technical problems that existing methods struggle to model dynamic scenes with topology changes and do not support dynamic scene editing. The method can model a topology-variable dynamic scene using 3D key points and edit the dynamic scene by controlling the 3D key points, with simple and intuitive operation.
A second object of the present application is to provide a device for reconstructing and editing a variable topology dynamic scene based on a neural radiance field.
A third object of the present application is to propose a computer device.
A fourth object of the present application is to propose a non-transitory computer readable storage medium.
To achieve the above objective, an embodiment of the first aspect of the present application provides a method for reconstructing and editing a variable topology dynamic scene based on a neural radiance field, including: acquiring a monocular video of a dynamic scene; identifying 3D key points of the dynamic scene according to an image sequence of the monocular video, and initializing the positions of the 3D key points in the image sequence using optical flow and depth images; inputting the image sequence and the 3D key point positions into a dynamic scene reconstruction network model, training the dynamic scene reconstruction network model according to a joint loss function, and optimizing the 3D key point positions of each frame of image; modeling the topology-variable dynamic scene using the optimized 3D key point positions to generate a topology-variable dynamic scene; and editing the topology-variable dynamic scene by modifying the 3D key point positions to generate new-view videos or images of the topology-variable dynamic scene.
Optionally, in one embodiment of the present application, identifying 3D keypoints of a dynamic scene from an image sequence of a monocular video and initializing a position of the 3D keypoints in the image sequence using optical flow and depth images includes:
according to an image sequence of a monocular video, obtaining a camera position of each frame of image through a multi-view reconstruction method, and modeling topological motion states of different space points in a dynamic scene by using additional dimensions to obtain a depth image of the dynamic scene;
and taking a space point with the largest change of the extra dimensional coordinates in the sequence in the modeling process as a 3D key point, and initializing the position of the 3D key point in the image sequence by using the optical flow and the depth image, wherein different topological motion states of the same space point correspond to different extra dimensional coordinates.
Optionally, in one embodiment of the present application, taking a spatial point with the largest change in the sequence of the additional dimensional coordinates in the modeling process as a 3D key point includes:
establishing three-dimensional space voxels in the modeled dynamic scene;
storing the additional dimensional coordinates of each surface point of each frame of image in the image sequence to a corresponding voxel;
and calculating the coordinate variance of the voxels of each surface point, carrying out three-dimensional Gaussian filtering on the voxels of each surface point, and selecting the central point of all the voxels with the local variance extremum as a 3D key point.
Optionally, in one embodiment of the present application, initializing the position of the 3D keypoints in the image sequence using optical flow and depth images includes:
taking the first frame image in which a 3D key point lies on the object surface as the reference frame image, by comparing the depth value at the pixel position onto which the 3D key point projects in the depth image with the distance between the 3D key point and the camera;
projecting the 3D key points into the reference frame image to obtain two-dimensional image coordinates of the 3D key points;
calculating the optical flow between frames by using an optical flow estimation method, and sequentially transmitting two-dimensional image coordinates to all frames according to the optical flow;
and projecting the two-dimensional image coordinates back to a three-dimensional space through the depth image to obtain the initial positions of the 3D key points of all frames in the image sequence.
Optionally, in one embodiment of the present application, inputting the image sequence and the 3D keypoint locations into the dynamic scene reconstruction network model includes:
obtaining a space sampling point through uniform sampling and probability sampling according to an image sequence;
inputting the space sampling points into a deformation field based on a multilayer perceptron to align them to a standard space, obtaining standard coordinates of the space sampling points;
calculating weights between the space sampling points and the 3D key points through a weight network, and carrying out weighted summation on the 3D key point coordinates according to the weights to obtain weighted key point vectors;
taking the weighted key point vector as an additional input of a multi-layer perceptron network based on a neural radiance field to obtain opacity and color information of the space sampling points;
and obtaining a reconstructed image by using a volume rendering method according to the opacity and the color information of the space sampling points.
Optionally, in one embodiment of the present application, training the dynamic scene reconstruction network model according to the joint loss function optimizes the 3D keypoint location of each frame image, including:
constructing motion loss according to the two-dimensional relative positions of the 3D key points and the optical flow values between the images of adjacent frames of the image sequence;
constructing geometric loss according to the distance from the 3D key point in each frame of image of the image sequence to the camera and the depth value in the depth map;
constructing reconstruction loss according to the image sequence and the reconstructed image sequence obtained by rendering;
based on the motion loss, the geometric loss, the reconstruction loss and the regular loss of the deformation field, constructing a joint loss function, training a dynamic scene reconstruction network model according to the joint loss function, and optimizing the position of the 3D key point of each frame of image.
To achieve the above object, an embodiment of a second aspect of the present application provides a device for reconstructing and editing a variable topology dynamic scene based on a neural radiance field, including:
the acquisition module is used for acquiring monocular videos of the dynamic scene;
the identification module is used for identifying the 3D key points of the dynamic scene according to the image sequence of the monocular video and initializing the positions of the 3D key points in the image sequence by using the optical flow and the depth image;
the training module is used for inputting the image sequence and the 3D key points into the dynamic scene reconstruction network model, training the dynamic scene reconstruction network model according to the joint loss function, and optimizing the positions of the 3D key points of each frame of image;
the generating module is used for modeling the topology-variable dynamic scene by utilizing the optimized 3D key point position to generate a topology-variable dynamic scene;
and the editing module is used for editing the variable topology dynamic scene by modifying the position of the 3D key point to generate a new view angle video or image of the variable topology dynamic scene.
Optionally, in an embodiment of the present application, the identification module is specifically configured to:
according to an image sequence of a monocular video, obtaining a camera position of each frame of image through a multi-view reconstruction method, and modeling topological motion states of different space points in a dynamic scene by using additional dimensions to obtain a depth image of the dynamic scene;
and taking a space point with the largest change of the extra dimensional coordinates in the sequence in the modeling process as a 3D key point, and initializing the position of the 3D key point in the image sequence by using the optical flow and the depth image, wherein different topological motion states of the same space point correspond to different extra dimensional coordinates.
To achieve the above objective, an embodiment of a third aspect of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method for reconstructing and editing a variable topology dynamic scene based on a neural radiance field according to the above embodiments when executing the computer program.
To achieve the above object, an embodiment of a fourth aspect of the present application provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for reconstructing and editing a variable topology dynamic scene based on a neural radiance field.
The method, the device, the computer equipment and the non-transitory computer-readable storage medium for reconstructing and editing a variable topology dynamic scene based on a neural radiance field address the technical problems that existing methods struggle to model dynamic scenes with topology changes and do not support dynamic scene editing; they can model a dynamic scene with topology changes using 3D key points and edit the dynamic scene by controlling the 3D key points, with simple and intuitive operation.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 is a flowchart of a variable topology dynamic scene reconstruction and editing method based on a neural radiance field according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a variable topology dynamic scene reconstruction and editing device based on a neural radiance field according to a second embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.
Reconstruction and editing of dynamic scenes has long been a central problem in computer graphics and vision research. In terms of reconstruction, the neural radiance field uses a multi-layer perceptron (Multilayer Perceptron, MLP for short) to model the opacity and color of each spatial point in a 3D scene and enables very high-quality novel-view image rendering. However, this class of methods models the scene with an implicit representation, and therefore the modeled scene cannot be edited directly.
In three-dimensional reconstruction and editing, topology changes have long been a troublesome problem, because they cause spatial motion discontinuities that are very difficult to model with conventional geometry and motion representations, whereas the implicit representations used in neural radiance fields can model such complex topology changes to some extent. The application provides a variable topology dynamic scene reconstruction and editing method based on a neural radiance field, which can model and edit a dynamic scene with variable topology.
The following describes a method and an apparatus for reconstructing and editing a variable topology dynamic scene based on a neural radiation field according to an embodiment of the present application with reference to the accompanying drawings.
Fig. 1 is a flowchart of a variable topology dynamic scene reconstruction and editing method based on a neural radiance field according to an embodiment of the present application.
As shown in fig. 1, the method for reconstructing and editing a variable topology dynamic scene based on a neural radiance field comprises the following steps:
step 101, obtaining a monocular video of a dynamic scene;
step 102, identifying 3D key points of the dynamic scene according to an image sequence of the monocular video, and initializing the positions of the 3D key points in the image sequence using optical flow and depth images;
step 103, inputting the image sequence and the 3D key points into a dynamic scene reconstruction network model, training the dynamic scene reconstruction network model according to the joint loss function, and optimizing the positions of the 3D key points of each frame of image;
step 104, modeling the topology-variable dynamic scene using the optimized 3D key point positions to generate a topology-variable dynamic scene;
and step 105, editing the topology-variable dynamic scene by modifying the 3D key point positions to generate new-view videos or images of the topology-variable dynamic scene.
According to the variable topology dynamic scene reconstruction and editing method based on the neural radiance field, a monocular video of a dynamic scene is obtained; 3D key points of the dynamic scene are identified according to an image sequence of the monocular video, and the positions of the 3D key points in the image sequence are initialized using optical flow and depth images; the image sequence and the 3D key point positions are input into a dynamic scene reconstruction network model, the dynamic scene reconstruction network model is trained according to a joint loss function, and the 3D key point positions of each frame of image are optimized; the topology-variable dynamic scene is modeled using the optimized 3D key point positions to generate a topology-variable dynamic scene; and the topology-variable dynamic scene is edited by modifying the 3D key point positions to generate new-view videos or images of the topology-variable dynamic scene. The technical problems that existing methods struggle to model dynamic scenes with topology changes and do not support dynamic scene editing are thereby addressed: 3D key points can be used to model a dynamic scene with topology changes, the 3D key points can be controlled to edit the dynamic scene, and the operation is simple and intuitive.
According to the variable topology dynamic scene editing method based on the neural radiance field, 3D key points can be used to model topology-variable dynamics, and the 3D key points can be controlled to edit the dynamic scene.
The application provides a variable topology dynamic scene reconstruction and editing method based on a neural radiance field. The input is a monocular video of a dynamic scene; a network is then trained on this input to complete the modeling of the scene, yielding an editable model of the scene that allows a user to edit it and generate images and videos from new viewpoints. The method uses 3D key points to assist in modeling topologically variable dynamics; the key points lie on the surfaces of dynamic objects and move together with them, and each key point in the scene controls the dynamics of a region and its resulting effects, such as shadows and reflections.
The reconstruction flow of the dynamic scene comprises two parts: initialization and network training. At initialization, 3D key points in the dynamic scene are identified based on intermediate results of existing methods, and the positions of the key points over the full sequence are initialized. The network is then trained, during which the 3D key point position of each frame of the sequence is optimized and the topology-variable dynamics are modeled using the key point positions. The user may then edit the dynamics in the scene by modifying the locations of the 3D key points.
For any sampling point in space, the network of the application first aligns the sampling point to a standard space through a deformation field that models small motions such as jitter, obtaining the standard coordinates of the sampling point. The network then uses weighted key points as an additional input to the neural radiance field for modeling topologically variable dynamics. Each point in space has, for every key point, a weight that characterizes the extent to which it is affected by that key point, and the weighted key point vector is computed by summing the coordinates of all key points weighted by these values. This vector is then input into the neural radiance field network to obtain opacity σ and color c information, so that the opacity and color of any sampling point in space can be obtained using the network, and a rendered image can be obtained using a volume rendering method.
Further, in the embodiment of the present application, identifying 3D keypoints of a dynamic scene according to an image sequence of a monocular video, and initializing a position of the 3D keypoints in the image sequence using optical flow and a depth image includes:
according to an image sequence of a monocular video, obtaining a camera position of each frame of image through a multi-view reconstruction method, and modeling topological motion states of different space points in a dynamic scene by using additional dimensions to obtain a depth image of the dynamic scene;
and taking a space point with the largest change of the extra dimensional coordinates in the sequence in the modeling process as a 3D key point, and initializing the position of the 3D key point in the image sequence by using the optical flow and the depth image, wherein different topological motion states of the same space point correspond to different extra dimensional coordinates.
During initialization, the input of the method is a video or image sequence obtained by shooting a dynamic scene with a single freely moving camera. The method obtains the camera position of each frame with a conventional multi-view reconstruction method, and reconstructs the captured scene using an existing topology-variable dynamic scene reconstruction technique. That technique also uses neural radiance fields to model topologically variable dynamic scenes, but the scenes it models cannot be edited semantically. It uses extra dimensions to model the topological motion states of different spatial points in the scene, where different topological motion states of the same spatial point correspond to different extra-dimension coordinates. Because the key points of the present method are intended to model topology-variable dynamics, the spatial points whose extra-dimension coordinates change the most over the sequence in that prior technique are taken as key points, and optical flow is used to initialize the key points over the whole sequence.
Further, in the embodiment of the present application, taking a spatial point with the largest change of the coordinates of the additional dimension in the sequence in the modeling process as a 3D key point includes:
establishing three-dimensional space voxels in the modeled dynamic scene;
storing the additional dimensional coordinates of each surface point of each frame of image in the image sequence to a corresponding voxel;
and calculating the coordinate variance of the voxels of each surface point, carrying out three-dimensional Gaussian filtering on the voxels of each surface point, and selecting the central point of all the voxels with the local variance extremum as a 3D key point.
In the embodiment of the application, the specific key point selection method is as follows: establish spatial three-dimensional voxels; for each surface point of each frame in the sequence, store its extra-dimension coordinate into the corresponding voxel; calculate the coordinate variance within each voxel; apply three-dimensional Gaussian filtering to the voxels, and select the centers of all voxels at local variance extrema as reference key points (that is, the voxel center is taken as the key point). In addition, the existing topology-variable dynamic scene reconstruction technique can also generate a depth map, so the first frame in which a reference key point lies exactly on the object surface is taken as the reference frame by comparing the depth map with the position of the reference key point. In the application, each key point independently selects a reference frame and is initialized over the whole sequence.
The establishment of spatial three-dimensional voxels refers to the uniform establishment of three-dimensional voxels in the region of the dynamic scene in the three-dimensional space, wherein each voxel is a small cube.
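By way of illustration only, the following Python sketch shows one possible realization of this key point selection step (voxelization, per-voxel variance of the extra-dimension coordinate, 3D Gaussian filtering, local-extremum selection). The voxel resolution, array shapes and helper names are assumptions of the sketch, not part of the application.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def select_keypoints(surface_pts, ambient_coords, bbox_min, bbox_max, res=64, sigma=1.0):
    """Pick 3D key points as centres of voxels whose extra-dimension
    (ambient) coordinate varies most over the sequence.

    surface_pts:    (M, 3) surface points gathered from all frames
    ambient_coords: (M,)   their extra-dimension coordinates
    """
    size = (bbox_max - bbox_min) / res
    idx = np.clip(((surface_pts - bbox_min) / size).astype(int), 0, res - 1)
    flat = np.ravel_multi_index(idx.T, (res, res, res))

    # Per-voxel variance of the ambient coordinate: Var = E[w^2] - E[w]^2.
    cnt = np.bincount(flat, minlength=res**3)
    s1 = np.bincount(flat, weights=ambient_coords, minlength=res**3)
    s2 = np.bincount(flat, weights=ambient_coords**2, minlength=res**3)
    var = np.zeros(res**3)
    occ = cnt > 0
    var[occ] = s2[occ] / cnt[occ] - (s1[occ] / cnt[occ]) ** 2
    var = var.reshape(res, res, res)

    # 3D Gaussian filtering, then keep the centres of local-maximum voxels.
    var = gaussian_filter(var, sigma=sigma)
    local_max = (var == maximum_filter(var, size=3)) & (var > 0)
    centers = np.argwhere(local_max) * size + bbox_min + 0.5 * size
    return centers
```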
Further, in an embodiment of the present application, initializing the position of the 3D keypoints in the image sequence using the optical flow and the depth image includes:
the method comprises the steps of taking a first frame image with a 3D key point on the surface of an object as a reference frame image by comparing a depth value of the 3D key point projected to a pixel position in a depth image and a distance between the 3D key point and a camera;
projecting the 3D key points into the reference frame image to obtain two-dimensional image coordinates of the 3D key points;
calculating the optical flow between frames by using an optical flow estimation method, and sequentially transmitting two-dimensional image coordinates to all frames according to the optical flow;
and projecting the two-dimensional image coordinates back to a three-dimensional space through the depth image to obtain the initial positions of the 3D key points of all frames in the image sequence.
In the embodiment of the application, the initialization of the full key point sequence is performed based on the reference key point and the reference frame image. First, the reference key point is projected into the reference frame image to obtain its two-dimensional image coordinates. Then an existing optical flow estimation method is used to obtain the optical flow between frames, and the two-dimensional image coordinates are propagated to all frames in sequence according to the optical flow. Finally, the two-dimensional coordinates are projected back into three-dimensional space through the depth maps of the existing variable topology dynamic scene reconstruction technique, yielding the initial 3D key point positions for all frames in the image sequence. These initial positions may contain significant accumulated errors, so they are continuously optimized in the subsequent training process.
Wherein the first frame with the 3D keypoint just located on the object surface is taken as the reference frame image by comparing the depth value of the 3D keypoint projected at the pixel location in the depth image with the distance of the 3D keypoint to the camera.
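By way of illustration only, the following sketch shows how a single reference key point could be propagated through the sequence with optical flow and lifted back to 3D with the depth maps. The projection and back-projection helpers, the flow convention and the crude handling of the backward direction are assumptions of the sketch, not part of the application.

```python
import numpy as np

def init_keypoint_track(kp_ref, ref_idx, flows, depths, Ks, c2ws, project, backproject):
    """Propagate one reference 3D key point through the whole sequence.

    kp_ref:  (3,) reference key point in world space
    flows:   list of (H, W, 2) optical flow maps; flows[t] maps frame t -> t+1
    depths:  list of (H, W) depth maps from the variable-topology reconstruction
    Ks, c2ws: per-frame intrinsics and camera-to-world poses
    project / backproject: pinhole projection helpers (assumed available)
    """
    T = len(depths)
    uv = np.empty((T, 2))
    xyz = np.empty((T, 3))
    uv[ref_idx] = project(kp_ref, Ks[ref_idx], c2ws[ref_idx])

    # Propagate the 2D coordinate forwards and backwards with optical flow.
    for t in range(ref_idx, T - 1):
        u, v = np.round(uv[t]).astype(int)
        uv[t + 1] = uv[t] + flows[t][v, u]
    for t in range(ref_idx, 0, -1):
        u, v = np.round(uv[t]).astype(int)
        uv[t - 1] = uv[t] - flows[t - 1][v, u]   # crude inverse of the forward flow

    # Lift every 2D position back to 3D with the per-frame depth map.
    for t in range(T):
        u, v = np.round(uv[t]).astype(int)
        xyz[t] = backproject(uv[t], depths[t][v, u], Ks[t], c2ws[t])
    return xyz  # initial 3D key point positions, refined later during training
```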
Further, in an embodiment of the present application, inputting the image sequence and the 3D keypoint locations into the dynamic scene reconstruction network model includes:
obtaining a space sampling point through uniform sampling and probability sampling according to an image sequence;
inputting the space sampling points into a deformation field based on a multilayer perceptron, and aligning the deformation field to a standard space to obtain standard coordinates of the space sampling points;
calculating weights between the space sampling points and the 3D key points through a weight network, and carrying out weighted summation on the 3D key point coordinates according to the weights to obtain weighted key point vectors;
taking the weighted key point vector as an additional input of a multi-layer perceptron network based on a neural radiance field to obtain opacity and color information of the space sampling points;
and obtaining a reconstructed image by using a volume rendering method according to the opacity and the color information of the space sampling points.
In the embodiment of the application, each image in the image sequence is sampled independently through uniform sampling and probability sampling to obtain spatial sampling points. Sampling proceeds in two steps: the first step samples uniformly along each viewing ray; the second step performs probability sampling according to the contribution of each first-step sample to the final result, so that regions with larger contributions are sampled with higher probability and receive more sampling points.
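By way of illustration only, the two-step sampling along one viewing ray could look as follows; the bin construction and the inverse-transform sampling follow the usual NeRF-style hierarchical sampling and are assumptions of the sketch rather than a prescription of the application.

```python
import torch

def sample_ray(near, far, weights_coarse=None, n_uniform=64, n_importance=64):
    """Two-step sampling along one viewing ray.

    Step 1: uniform samples in [near, far].
    Step 2 (if coarse contribution weights are given): probability sampling,
    drawing more points where the coarse pass contributed more.
    """
    t_uniform = torch.linspace(near, far, n_uniform)

    if weights_coarse is None:
        return t_uniform

    # Piecewise-constant PDF over the coarse bins, built from their contributions.
    mids = 0.5 * (t_uniform[1:] + t_uniform[:-1])
    pdf = weights_coarse[1:-1] + 1e-5
    pdf = pdf / pdf.sum()
    cdf = torch.cat([torch.zeros(1), torch.cumsum(pdf, dim=0)])

    # Inverse-transform sampling of the CDF.
    u = torch.rand(n_importance)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, len(mids) - 1)
    below, above = cdf[idx - 1], cdf[idx]
    frac = (u - below) / (above - below + 1e-8)
    t_importance = mids[idx - 1] + frac * (mids[idx] - mids[idx - 1])

    return torch.sort(torch.cat([t_uniform, t_importance])).values
```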
The dynamic scene reconstruction network models the scene with an implicit, function-based representation: given the coordinates of a point in space, it outputs the opacity and color information of that point. For a sampling point x in space, the dynamic scene reconstruction network of the application first aligns it to a standard space through a deformation field based on a multi-layer perceptron, obtaining the standard coordinate x′, where the deformation field models small motions such as jitter and is expressed as:
x′ = T(x, β_t)
where β_t is the deformation latent variable of frame t, obtained through optimization during network training, x is a sampling point in space, T(·) is the deformation field network, and x′ is the standard coordinate of the sampling point x.
The dynamic scene reconstruction network of the present application constructs weighted key point vectors to assist in modeling topologically variable dynamics. Since different points in space are affected by different key points, each spatial point has, for every key point, a weight ω characterizing the extent to which it is affected. The application computes this weight with a weight network W, consisting of multi-layer perceptrons and a softmax, expressed as:
ω = W(x′)
where ω represents the weights between the sampling point x in space and the key points, W(·) is the weight network, and x′ is the standard coordinate of the sampling point x.
The coordinates of all key points are then summed, weighted by these values, to obtain the weighted key point vector, expressed as:
p_t(x′) = Σ_{i=1}^{N} ω^i · k_t^i
where p_t(x′) is the weighted key point vector of the sampling point x in space, the superscript i indexes the key points, N is the total number of key points, ω^i is the weight between the sampling point x and the i-th key point, and k_t^i denotes the coordinates of the i-th key point in frame t.
The weighted key point vector is used as an additional input to the subsequent multi-layer perceptron network based on the neural radiance field, which outputs the opacity σ and color c of the sampling point in space, expressed as:
(c, σ) = H(x′, p_t(x′), d)
where H(·) is the neural radiance field network, x′ is the standard coordinate of the sampling point x, p_t(x′) is the weighted key point vector of the sampling point x, and d is the viewing direction.
The opacity and color information of any sampling point in the space can be obtained by using the dynamic scene reconstruction network, and the rendered image can be obtained by using the volume rendering method.
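By way of illustration only, the forward pass described above (deformation field T, weight network W with softmax, weighted key point vector p_t(x′), and radiance field H) could be sketched in PyTorch as follows; positional encodings, layer sizes and module names are simplified assumptions of the sketch. Volume rendering then integrates the returned colors and opacities along each ray.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out, width=128, depth=4):
    layers, d = [], d_in
    for _ in range(depth - 1):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers += [nn.Linear(d, d_out)]
    return nn.Sequential(*layers)

class KeypointNeRF(nn.Module):
    """Sketch of the reconstruction network: x -> x' -> (omega, p_t) -> (c, sigma)."""
    def __init__(self, n_keypoints, latent_dim=8):
        super().__init__()
        self.T = mlp(3 + latent_dim, 3)   # deformation field: offset to standard space
        self.W = mlp(3, n_keypoints)      # weight network (softmax applied in forward)
        self.H = mlp(3 + 3 + 3, 4)        # radiance field: (x', p_t, d) -> (rgb, sigma)

    def forward(self, x, beta_t, keypoints_t, view_dir):
        # x: (B, 3) samples, beta_t: (latent_dim,), keypoints_t: (N, 3), view_dir: (B, 3)
        b = beta_t.expand(x.shape[0], -1)
        x_std = x + self.T(torch.cat([x, b], dim=-1))      # x' = T(x, beta_t)
        w = torch.softmax(self.W(x_std), dim=-1)           # omega = W(x')
        p_t = w @ keypoints_t                              # weighted key point vector
        out = self.H(torch.cat([x_std, p_t, view_dir], dim=-1))
        rgb = torch.sigmoid(out[:, :3])
        sigma = torch.relu(out[:, 3:])
        return rgb, sigma
```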
Further, in the embodiment of the present application, training a dynamic scene reconstruction network model according to a joint loss function optimizes a 3D keypoint location of each frame image, including:
constructing motion loss according to the two-dimensional relative positions of the 3D key points and the optical flow values between the images of adjacent frames of the image sequence;
constructing geometric loss according to the distance from the key point in each frame of image of the image sequence to the camera and the depth value in the depth map;
constructing reconstruction loss according to the image sequence and the reconstructed image sequence obtained by rendering;
based on the motion loss, the geometric loss, the reconstruction loss and the regular loss of the deformation field, constructing a joint loss function, training a dynamic scene reconstruction network model according to the joint loss function, and optimizing the position of the 3D key point of each frame of image.
In the embodiment of the application, the optical flow ensures the consistency of the key points' image positions between adjacent frames of the sequence, and the depth map ensures that the key points always lie on the object surface. In addition, a reconstruction loss term is added during network training to ensure that the reconstructed image rendered by the model is consistent with the input image, and a regularization term on the deformation field constrains the model to represent only tiny motions.
Specifically, during the training process, all network parameters (the deformation field network, the weight network and the neural radiance field network), the deformation latent variables and the key points are optimized. Because 3D key points are introduced, the method supervises the key point positions with optical flow and the depth map in order to ensure their semantic consistency. The optical flow is used to construct a motion loss term, which constrains the 2D relative positions of key points between adjacent frames of the sequence to be consistent with the optical flow values obtained by an optical flow estimation method, so that the key points of different frames correspond to the same object surface points in image space. The depth map is used to construct a geometric loss term, which constrains the distance from each frame's key points to the camera to be consistent with the depth values in the depth map obtained by the existing variable topology dynamic scene reconstruction technique (i.e., it constrains the distance from each frame's 3D key points to the camera to match the depth value at the pixel position onto which each 3D key point projects), so that the key points always remain on the object surface.
In addition, a reconstruction loss term is added during network training to ensure that the reconstructed image rendered by the model is close to the input image; this term constrains the rendered image to be close to the input image in RGB distance. A regularization loss term on the deformation field is also added so that the deformation field only models tiny motions; this term constrains the coordinate values of surface points before and after deformation to be as close as possible.
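By way of illustration only, the four loss terms could be combined as follows; the loss weights, tensor shapes and distance choices (L1 versus L2) are assumptions of the sketch, not part of the application.

```python
import torch
import torch.nn.functional as F

def joint_loss(kp_2d, flow_gt, kp_cam_dist, depth_at_kp,
               rgb_pred, rgb_gt, x_before, x_after,
               w_motion=1.0, w_geo=1.0, w_rec=1.0, w_reg=0.1):
    """Joint training loss: motion + geometric + reconstruction + deformation regularizer.

    kp_2d:          (T, N, 2) projected key point image coordinates per frame
    flow_gt:        (T-1, N, 2) optical flow sampled at the key point pixels
    kp_cam_dist:    (T, N) key-point-to-camera distances
    depth_at_kp:    (T, N) depth map values at the projected key point pixels
    rgb_pred/rgb_gt:(R, 3) rendered and input pixel colors
    x_before/after: (B, 3) sample coordinates before and after the deformation field
    """
    # Motion loss: 2D key point displacement between adjacent frames should match the flow.
    motion = F.l1_loss(kp_2d[1:] - kp_2d[:-1], flow_gt)
    # Geometric loss: key points should stay on the object surface (match the depth map).
    geo = F.l1_loss(kp_cam_dist, depth_at_kp)
    # Reconstruction loss: rendered image close to the input image in RGB.
    rec = F.mse_loss(rgb_pred, rgb_gt)
    # Regularizer: the deformation field should only model tiny motions.
    reg = (x_after - x_before).norm(dim=-1).mean()
    return w_motion * motion + w_geo * geo + w_rec * rec + w_reg * reg
```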
After training is completed, the 3D coordinates of the key points can be modified to correspondingly edit the scene modeled by the network, and multi-view rendering can then be performed.
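By way of illustration only, editing after training could then amount to shifting the optimized key point coordinates before rendering a new view; the rendering helper and argument names below are assumptions of the sketch.

```python
import torch

def edit_and_render(model, keypoints_t, beta_t, rays, render_fn, kp_index, offset):
    """Move one key point and re-render a novel view (names are illustrative).

    keypoints_t: (N, 3) optimized key points of the frame being edited
    offset:      (3,)   translation applied to the selected key point
    render_fn:   volume-rendering helper that queries `model` along `rays`
    """
    edited = keypoints_t.clone()
    edited[kp_index] = edited[kp_index] + offset   # user edit in 3D
    with torch.no_grad():
        image = render_fn(model, rays, beta_t, edited)
    return image
```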
Fig. 2 is a schematic structural diagram of a variable topology dynamic scene reconstruction and editing device based on a neural radiance field according to a second embodiment of the present application.
As shown in fig. 2, the device for reconstructing and editing a variable topology dynamic scene based on a neural radiance field comprises:
an acquisition module 10, configured to acquire a monocular video of a dynamic scene;
the identifying module 20 is configured to identify a 3D key point of the dynamic scene according to an image sequence of the monocular video, and initialize a position of the 3D key point in the image sequence using the optical flow and the depth image;
the training module 30 is configured to input the image sequence and the 3D keypoints into a dynamic scene reconstruction network model, train the dynamic scene reconstruction network model according to the joint loss function, and optimize the 3D keypoint position of each frame of image;
a generating module 40, configured to model a topology-variable dynamic scene by using the optimized 3D keypoint location, and generate a topology-variable dynamic scene;
the editing module 50 is configured to edit the topology-changing dynamic scene by modifying the 3D key point position, and generate a new view angle video or image of the topology-changing dynamic scene.
The variable topology dynamic scene reconstruction and editing device based on the neural radiance field comprises an acquisition module for acquiring a monocular video of a dynamic scene; an identification module for identifying 3D key points of the dynamic scene according to the image sequence of the monocular video and initializing the positions of the 3D key points in the image sequence using optical flow and depth images; a training module for inputting the image sequence and the 3D key points into a dynamic scene reconstruction network model, training the dynamic scene reconstruction network model according to a joint loss function, and optimizing the 3D key point positions of each frame of image; a generating module for modeling the topology-variable dynamic scene using the optimized 3D key point positions to generate a topology-variable dynamic scene; and an editing module for editing the topology-variable dynamic scene by modifying the 3D key point positions to generate new-view videos or images of the topology-variable dynamic scene. The technical problems that existing methods struggle to model dynamic scenes with topology changes and do not support dynamic scene editing are thereby addressed: 3D key points can be used to model a dynamic scene with topology changes, the 3D key points can be controlled to edit the dynamic scene, and the operation is simple and intuitive.
Further, in the embodiment of the present application, the identification module is specifically configured to:
according to an image sequence of a monocular video, obtaining a camera position of each frame of image through a multi-view reconstruction method, and modeling topological motion states of different space points in a dynamic scene by using additional dimensions to obtain a depth image of the dynamic scene;
and taking a space point with the largest change of the extra dimensional coordinates in the sequence in the modeling process as a 3D key point, and initializing the position of the 3D key point in the image sequence by using the optical flow and the depth image, wherein different topological motion states of the same space point correspond to different extra dimensional coordinates.
In order to implement the above embodiments, the application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method for reconstructing and editing a variable topology dynamic scene based on a neural radiance field of the above embodiments when executing the computer program.
In order to implement the above embodiments, the application further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the method for reconstructing and editing a variable topology dynamic scene based on a neural radiance field of the above embodiments.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. As with the other embodiments, if implemented in hardware, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. Although embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (10)

1. A variable topology dynamic scene reconstruction and editing method based on a neural radiance field, characterized by comprising the following steps:
acquiring a monocular video of a dynamic scene;
identifying 3D key points of the dynamic scene according to the image sequence of the monocular video, and initializing the positions of the 3D key points in the image sequence by using optical flows and depth images;
inputting the image sequence and the 3D key points into a dynamic scene reconstruction network model, training the dynamic scene reconstruction network model according to a joint loss function, and optimizing the positions of the 3D key points of each frame of image;
modeling a topology-variable dynamic scene by using the optimized 3D key point positions to generate a topology-variable dynamic scene;
and editing the variable topology dynamic scene by modifying the position of the 3D key point to generate a new view angle video or image of the variable topology dynamic scene.
2. The method of claim 1, wherein the identifying the 3D keypoints of the dynamic scene from the image sequence of the monocular video and initializing the positions of the 3D keypoints in the image sequence using optical flow and depth images comprises:
according to an image sequence of a monocular video, obtaining a camera position of each frame of image through a multi-view reconstruction method, and modeling topological motion states of different space points in a dynamic scene by using additional dimensions to obtain a depth image of the dynamic scene;
and taking a space point with the largest change of the extra dimensional coordinates in the sequence in the modeling process as a 3D key point, and initializing the position of the 3D key point in the image sequence by using an optical flow and the depth image, wherein different topological motion states of the same space point correspond to different extra dimensional coordinates.
3. The method of claim 2, wherein the taking the spatial point in the modeling process where the additional dimensional coordinates vary the most in the sequence as the 3D keypoint comprises:
establishing three-dimensional space voxels in the modeled dynamic scene;
storing the additional dimensional coordinates of each surface point of each frame of image in the image sequence to a corresponding voxel;
and calculating the coordinate variance of the voxels of each surface point, carrying out three-dimensional Gaussian filtering on the voxels of each surface point, and selecting the central point of all the voxels with the local variance extremum as a 3D key point.
4. The method of claim 3, wherein initializing the position of the 3D keypoint in the image sequence using optical flow and depth images comprises:
taking the first frame image in which the 3D key point lies on the object surface as a reference frame image, by comparing the depth value at the pixel position onto which the 3D key point projects in the depth image with the distance between the 3D key point and the camera;
projecting the 3D key points into the reference frame image to obtain two-dimensional image coordinates of the 3D key points;
calculating the optical flow between frames by using an optical flow estimation method, and sequentially transmitting the two-dimensional image coordinates to all frames according to the optical flow;
and projecting the two-dimensional image coordinates back to a three-dimensional space through the depth image to obtain initial positions of 3D key points of all frames in the image sequence.
5. The method of claim 1, wherein the inputting the image sequence and 3D keypoint locations into the dynamic scene reconstruction network model comprises:
obtaining a space sampling point through uniform sampling and probability sampling according to an image sequence;
inputting the space sampling points into a deformation field based on a multilayer perceptron to align them to a standard space, obtaining standard coordinates of the space sampling points;
calculating the weight between the space sampling point and the 3D key point through a weight network, and carrying out weighted summation on the 3D key point coordinates according to the weight to obtain a weighted key point vector;
taking the weighted key point vector as an additional input of a multi-layer perceptron network based on a neural radiance field to obtain the opacity and color information of the space sampling points;
and obtaining a reconstructed image by using a volume rendering method according to the opacity and the color information of the space sampling points.
6. The method of claim 5, wherein training the dynamic scene reconstruction network model based on the joint loss function optimizes the 3D keypoint locations for each frame of images, comprising:
constructing motion loss according to the two-dimensional relative positions of the 3D key points and the optical flow values between the images of adjacent frames of the image sequence;
constructing geometric losses according to the distance from the 3D key point in each frame of image of the image sequence to the camera and the depth value in the depth map;
constructing reconstruction loss according to the image sequence and the reconstructed image sequence obtained by rendering;
and constructing a joint loss function based on the motion loss, the geometric loss, the reconstruction loss and the regular loss of the deformation field, training a dynamic scene reconstruction network model according to the joint loss function, and optimizing the position of the 3D key point of each frame of image.
7. A variable topology dynamic scene reconstruction and editing device based on a neural radiance field, characterized by comprising:
the acquisition module is used for acquiring monocular videos of the dynamic scene;
the identification module is used for identifying the 3D key points of the dynamic scene according to the image sequence of the monocular video and initializing the positions of the 3D key points in the image sequence by using optical flows and depth images;
the training module is used for inputting the image sequence and the 3D key points into a dynamic scene reconstruction network model, training the dynamic scene reconstruction network model according to the joint loss function and optimizing the positions of the 3D key points of each frame of image;
the generating module is used for modeling the topology-variable dynamic scene by utilizing the optimized 3D key point position to generate a topology-variable dynamic scene;
and the editing module is used for editing the variable topology dynamic scene by modifying the position of the 3D key point to generate a new view angle video or image of the variable topology dynamic scene.
8. The apparatus of claim 7, wherein the identification module is specifically configured to:
according to an image sequence of a monocular video, obtaining a camera position of each frame of image through a multi-view reconstruction method, and modeling topological motion states of different space points in a dynamic scene by using additional dimensions to obtain a depth image of the dynamic scene;
and taking a space point with the largest change of the extra dimensional coordinates in the sequence in the modeling process as a 3D key point, and initializing the position of the 3D key point in the image sequence by using an optical flow and the depth image, wherein different topological motion states of the same space point correspond to different extra dimensional coordinates.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of claims 1-6 when executing the computer program.
10. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any of claims 1-6.

Priority Applications (1)

Application Number: CN202310283264.1A
Priority Date: 2023-03-21
Filing Date: 2023-03-21
Title: Variable topology dynamic scene reconstruction and editing method and device based on neural radiance field

Publications (1)

Publication Number: CN116091705A
Publication Date: 2023-05-09

Family ID: 86202783

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958449A (en) * 2023-09-12 2023-10-27 北京邮电大学 Urban scene three-dimensional modeling method and device and electronic equipment
CN116958449B (en) * 2023-09-12 2024-04-30 北京邮电大学 Urban scene three-dimensional modeling method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination