CN117934677A - Image processing method and device and training method and device of image processing model


Info

Publication number
CN117934677A
Authority
CN
China
Prior art keywords
information
image processing
attribute information
dimensional gaussian
point
Prior art date
Legal status
Pending
Application number
CN202311544261.5A
Other languages
Chinese (zh)
Inventor
戴玉超
惠乐
卢致澄
陈天睿
郭相
杨敏
唐晓
Current Assignee
Northwestern Polytechnical University
Samsung China Semiconductor Co Ltd
Samsung Electronics Co Ltd
Original Assignee
Northwestern Polytechnical University
Samsung China Semiconductor Co Ltd
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University, Samsung China Semiconductor Co Ltd, Samsung Electronics Co Ltd filed Critical Northwestern Polytechnical University
Priority to CN202311544261.5A
Publication of CN117934677A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an image processing method and device and a training method and device of an image processing model. The image processing method comprises the following steps: acquiring a plurality of input images; determining canonical attribute information and structural feature information of a point cloud included in the plurality of input images under a canonical space; transforming the canonical attribute information into time domain attribute information of the point cloud in a time domain space by using a transformation network based on the structural feature information; and obtaining rendered images of the plurality of input images based on the time domain attribute information.

Description

Image processing method and device and training method and device of image processing model
Technical Field
The present disclosure relates to the field of image processing, and more particularly, to an image processing method and apparatus, and a training method and apparatus of an image processing model.
Background
How to synthesize realistic images and videos is at the core of image processing (especially computer graphics) and is a hot spot of recent research. Novel view synthesis is a technique that, given a set of input images and their corresponding camera poses, renders the corresponding images of the scene from new camera poses. Conventional novel view synthesis techniques typically generate images with rendering algorithms such as rasterization and ray tracing. In recent years, differentiable rendering and neural rendering techniques have driven a new wave of novel view image synthesis. Neural rendering techniques, represented by neural radiance fields (NeRF), synthesize observed images of the real world by combining ideas from classical computer graphics and machine learning. However, existing image synthesis techniques still have limitations in many scenarios; for example, they cannot model dynamic scenes, and their storage overhead is too large. Therefore, there is a need for an image processing technique that enables dynamic scene modeling over continuous time.
The foregoing information is presented merely as background information to aid in understanding the disclosure. No determination has been made, and no assertion is made, as to whether any of the above is applicable as prior art with regard to the present disclosure.
Disclosure of Invention
Embodiments of the present disclosure provide an image processing method and apparatus and a training method and apparatus of an image processing model to solve at least the above-mentioned problems and/or disadvantages.
According to a first aspect of embodiments of the present disclosure, there is provided an image processing method including: acquiring a plurality of input images; determining canonical attribute information and structural feature information of a point cloud included in the plurality of input images under a canonical space; transforming the canonical attribute information into time domain attribute information of the point cloud in a time domain space by using a transformation network based on the structural feature information; and obtaining rendered images of the plurality of input images based on the time domain attribute information.
Optionally, the step of determining canonical attribute information of the point cloud included in the plurality of input images under the canonical space includes: determining canonical attribute information of each three-dimensional gaussian point in a set of three-dimensional gaussian points corresponding to the point cloud under a canonical space, wherein the canonical attribute information and/or the time domain attribute information comprises at least one of position information, rotation information, and size information.
Optionally, the step of determining the structural feature information of the point cloud included in the plurality of input images under the canonical space includes: performing feature extraction and feature fusion based on the position information of each three-dimensional Gaussian point to obtain the structural feature information of each three-dimensional Gaussian point.
Optionally, the step of transforming the canonical attribute information into the time domain attribute information of the point cloud in time domain space by using a transformation network based on the structural feature information includes: feature decoding is performed based on the canonical attribute information and the structural feature information of each three-dimensional gaussian point to determine the time-domain attribute information of each three-dimensional gaussian point by using the transformation network.
Optionally, the step of obtaining the structural feature information of each three-dimensional Gaussian point by performing feature extraction and feature fusion based on the position information of each three-dimensional Gaussian point includes: obtaining structural information of voxels by performing feature extraction on one or more three-dimensional Gaussian points using a UNet structure based on the position information of each three-dimensional Gaussian point; obtaining point feature information of the three-dimensional Gaussian points by performing feature extraction on each three-dimensional Gaussian point using a first neural network model based on the position information of each three-dimensional Gaussian point; and obtaining the structural feature information of each three-dimensional Gaussian point by performing feature fusion using a second neural network model based on the structural information of the voxels and the point feature information of the three-dimensional Gaussian points.
Optionally, the step of performing feature decoding based on the canonical attribute information and the structural feature information of each three-dimensional Gaussian point to determine the time domain attribute information of each three-dimensional Gaussian point by using the transformation network includes: determining time-based change attribute information of each three-dimensional Gaussian point by performing a Gaussian transformation on the canonical attribute information using a third neural network model based on the structural feature information; and determining the time-based time domain attribute information of each three-dimensional Gaussian point based on the canonical attribute information of each three-dimensional Gaussian point and the time-based change attribute information, wherein the time-based change attribute information includes at least one of position change information, rotation change information, and size change information.
Optionally, the plurality of input images comprises a plurality of monocular images captured at different times and/or different locations.
According to a second aspect of embodiments of the present disclosure, there is provided a training method of an image processing model, the training method including: acquiring a plurality of training images; performing the image processing method described above for the plurality of training images using the image processing model; calculating a loss based on a rendered image obtained by the image processing method and a training image corresponding to the rendered image among the plurality of training images; and training the image processing model by adjusting parameters of the image processing model based on the loss.
Optionally, the step of training the image processing model by adjusting parameters of the image processing model based on the loss includes: when training the image processing model, determining, for each training iteration, gradient information of the three-dimensional Gaussian points in the three-dimensional Gaussian point set corresponding to the point cloud; and determining whether to perform a density control operation based on the time domain attribute information and the gradient information, wherein the density control operation includes changing the number of three-dimensional Gaussian points.
According to a third aspect of embodiments of the present disclosure, there is provided an image processing apparatus including: an image acquisition unit configured to acquire a plurality of input images; an attribute determining unit configured to determine canonical attribute information and structural feature information of a point cloud included in the plurality of input images under a canonical space; a transformation processing unit configured to transform the canonical attribute information into time domain attribute information of the point cloud in a time domain space by using a transformation network based on the structural feature information; and an image rendering unit configured to obtain rendered images of the plurality of input images based on the time domain attribute information.
Optionally, the attribute determining unit is configured to determine canonical attribute information of the point clouds included in the plurality of input images under the canonical space by: determining canonical attribute information of each three-dimensional gaussian point in a set of three-dimensional gaussian points corresponding to the point cloud under a canonical space, wherein the canonical attribute information and/or the time domain attribute information comprises at least one of position information, rotation information, and size information.
Optionally, the attribute determining unit is configured to determine the structural feature information of the point cloud included in the plurality of input images under the canonical space by: performing feature extraction and feature fusion based on the position information of each three-dimensional Gaussian point to obtain the structural feature information of each three-dimensional Gaussian point.
Optionally, the transformation processing unit is configured to transform the canonical attribute information into time domain attribute information of the point cloud in time domain space by using a transformation network based on the structural feature information by: feature decoding is performed based on the canonical attribute information and the structural feature information of each three-dimensional gaussian point to determine the time-domain attribute information of each three-dimensional gaussian point by using the transformation network.
Optionally, the attribute determining unit is configured to obtain the structural feature information of each three-dimensional Gaussian point by performing feature extraction and feature fusion based on the position information of each three-dimensional Gaussian point by: obtaining structural information of voxels by performing feature extraction on one or more three-dimensional Gaussian points using a UNet structure based on the position information of each three-dimensional Gaussian point; obtaining point feature information of the three-dimensional Gaussian points by performing feature extraction on each three-dimensional Gaussian point using a first neural network model based on the position information of each three-dimensional Gaussian point; and obtaining the structural feature information of each three-dimensional Gaussian point by performing feature fusion using a second neural network model based on the structural information of the voxels and the point feature information of the three-dimensional Gaussian points.
Optionally, the transformation processing unit is configured to perform feature decoding based on the canonical attribute information and the structural feature information of each three-dimensional Gaussian point by using the transformation network to determine the time domain attribute information of each three-dimensional Gaussian point by: determining time-based change attribute information of each three-dimensional Gaussian point by performing a Gaussian transformation on the canonical attribute information using a third neural network model based on the structural feature information; and determining the time-based time domain attribute information of each three-dimensional Gaussian point based on the canonical attribute information of each three-dimensional Gaussian point and the time-based change attribute information, wherein the time-based change attribute information includes at least one of position change information, rotation change information, and size change information.
Optionally, the plurality of input images comprises a plurality of monocular images captured at different times and/or different locations.
According to a fourth aspect of embodiments of the present disclosure, there is provided a training apparatus of an image processing model, the training apparatus including: an image acquisition unit configured to acquire a plurality of training images; a model prediction unit configured to perform the image processing method described above using the image processing model for the plurality of training images; a loss calculation unit configured to calculate a loss based on the rendered image and a training image corresponding to the rendered image among the plurality of training images; and a parameter adjustment unit configured to train the image processing model by adjusting parameters of the image processing model based on the loss.
Optionally, the parameter adjustment unit is configured to train the image processing model by adjusting parameters of the image processing model based on the loss by: when training the image processing model, determining, for each training iteration, gradient information of the three-dimensional Gaussian points in the three-dimensional Gaussian point set corresponding to the point cloud; and determining whether to perform a density control operation based on the time domain attribute information and the gradient information, wherein the density control operation includes changing the number of three-dimensional Gaussian points.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device comprising: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the method as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium, wherein instructions in the computer readable storage medium, when executed by at least one processor, cause the at least one processor to perform the method as described above.
According to the image processing method and apparatus and the training method and apparatus of the image processing model of the embodiments of the present disclosure, modeling of a dynamic scene over continuous time can be achieved, and the mapping relation between the canonical space and the time domain space is determined, so that changes and motion are predicted accurately and storage overhead is saved. The image processing method and apparatus and the training method and apparatus of the image processing model according to embodiments of the present disclosure can also model dynamic scenes in which time and position (or viewing angle) change simultaneously; for example, a rendered image of the scene of the input images can be observed from different angles or at different times. The image processing method and apparatus and the training method and apparatus of the image processing model according to embodiments of the present disclosure are also capable of realizing novel view image synthesis based only on monocular images or videos.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other aspects, features and elements of certain embodiments of the present disclosure will become more apparent from the following description when taken in conjunction with the accompanying drawings in which:
Fig. 1 is a flowchart illustrating an image processing method according to an embodiment of the present disclosure;
Fig. 2 is a block diagram illustrating an image processing method according to an embodiment of the present disclosure;
Fig. 3 is a diagram illustrating an example of a UNet structure according to an embodiment of the present disclosure;
Fig. 4 is a block diagram illustrating an example of a transformation network structure according to an embodiment of the present disclosure;
Fig. 5 is a flowchart illustrating a training method of an image processing model according to an embodiment of the present disclosure;
Fig. 6 is a block diagram illustrating the structure of an image processing apparatus according to an embodiment of the present disclosure;
Fig. 7 is a block diagram illustrating the structure of a training apparatus of an image processing model according to an embodiment of the present disclosure; and
Fig. 8 is a block diagram illustrating the structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be noted that, in this disclosure, "at least one of" a set of items covers three parallel cases: any one of the items, any combination of the items, or all of the items. For example, "including at least one of A and B" covers three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "performing at least one of step one and step two" covers three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
The rapid development of computer technology has expanded what is possible in image processing. For example, the recently emerging algorithms based on three-dimensional (3D) Gaussian splatting enable real-time rendering of radiance fields, achieve state-of-the-art (SOTA) visual quality in less training time, and allow high-quality real-time (30 fps) novel view image synthesis at 1080p resolution for scenes captured with multiple photographs and/or multiple videos.
Regarding research on 3D Gaussian based image synthesis, a real-time rendering method for high-resolution images using a splatting algorithm has been proposed. Specifically: starting from the sparse points generated during camera calibration, the scene is represented with 3D Gaussian points, which retain the desirable properties of a continuous volumetric radiance field for scene optimization while avoiding unnecessary computation in empty space; interleaved optimization/density control is performed on the 3D Gaussian points, and anisotropic covariance is optimized to represent the scene accurately; and a fast visibility-aware rendering algorithm supporting anisotropic splatting is developed, which accelerates training and enables real-time rendering. However, the real world is often a dynamic scene, and this real-time rendering method can only model static scenes; it cannot model dynamic scenes.
In addition, methods that model dynamic scenes to solve dynamic novel view synthesis and six-degrees-of-freedom tracking have also been studied; such methods use synchronized multi-view videos (27 training cameras, 4 test cameras) from a dataset for experiments and train over 150 time steps. However, this kind of dynamic scene modeling merely concatenates static scenes over a number of time steps; that is, the representation of the dynamic scene is still discrete in time and cannot achieve dynamic scene modeling over truly continuous time. In addition, such discrete dynamic scene modeling requires storing data at every time step, consuming a large amount of storage resources.
In order to solve at least the above problems in the related art, the present disclosure proposes an image processing technique that realizes dynamic scene modeling over continuous time by introducing a time variable. Specifically, the present disclosure provides an image processing method and apparatus and a training method and apparatus of an image processing model that transform image data in a canonical space into a time domain space and describe the process of movement and/or change of the image data in the time domain space, thereby achieving modeling of dynamic scenes over continuous time and obtaining a more accurate image synthesis effect. An image processing method and apparatus and a training method and apparatus of an image processing model according to embodiments of the present disclosure will be described in detail below with reference to fig. 1 to 8.
First, some terms involved in the present disclosure will be briefly described.
Herein, a "point cloud" may include, but is not limited to, a massive set of points expressing the target spatial distribution and target surface characteristics under the same spatial reference frame, e.g., a set of points obtained by acquiring the spatial coordinates of sampling points of the object surface in the image.
In this context, a "canonical space" may include, but is not limited to, a standard volumetric space or range, and data and/or information under the canonical space may be processed by a neural network, e.g., position coordinate information under the position-coded canonical space may be processed as input to the neural network.
Herein, the "time domain space" is a time-associated spatial domain corresponding to the "canonical space", and the data and/or information under the time domain space according to the embodiment of the present disclosure is time-variable-related data and/or information obtained by transforming the data and/or information under the canonical space.
Herein, a "three-dimensional (3 d) gaussian point" is each point in a point cloud described by a three-dimensional gaussian form, which may have certain information of size, rotation direction, position, color, and the like, and thus may also be referred to as a three-dimensional gaussian kernel or simply a three-dimensional gaussian. In other words, a point cloud according to embodiments of the present disclosure may correspond to a set of three-dimensional gaussian points, and points in the point cloud may correspond to three-dimensional gaussian points.
An image processing method according to an embodiment of the present disclosure is described with reference to fig. 1 to 4. Fig. 1 is a flowchart illustrating an image processing method according to an embodiment of the present disclosure. Fig. 2 is a block diagram of an image processing method according to an embodiment of the present disclosure. According to an embodiment, the image processing method of the present disclosure may include an image rendering method, an image synthesizing method, and/or an image estimating method. According to an embodiment, the method illustrated in fig. 1 may be implemented by an image processing model (which may also be referred to as an image rendering model) of the present disclosure.
Referring to fig. 1, in step S110, a plurality of input images are acquired. According to an embodiment, the input image may be used as a training image for the image processing model of the present disclosure. Referring to fig. 2, the plurality of input images may include a sequence of images that varies with time, for example, the plurality of input images may include a plurality of images including a moving person.
According to an embodiment, manners of acquiring an input image may include, but are not limited to, receiving an input image (such as receiving image data), capturing an input image using a video or image capture device (such as a camera or video camera), and reading an input image (such as reading an input image from a memory).
According to the embodiment, in the case where a video is acquired, a plurality of input images may be acquired from the video at predetermined intervals; for example, video frames may be selected from the video at intervals of a predetermined number of frames, and each selected video frame may be determined as an input image.
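As a hedged illustration of sampling input frames from a video at a predetermined interval; OpenCV is used here only as an example capture backend, and the interval value is an assumption:

```python
import cv2

def sample_frames(video_path: str, frame_interval: int = 10):
    """Read a video and keep every `frame_interval`-th frame as an input image."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_interval == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # store as RGB
        index += 1
    cap.release()
    return frames
```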
According to an embodiment, the plurality of input images may comprise a plurality of monocular images captured at different times and/or different locations. In particular, a monocular image may include, but is not limited to, an image captured using a single video or image capture device. The plurality of monocular images according to embodiments of the present disclosure may include, but are not limited to, images captured at different times using a single video or image capture device, e.g., a continuous sequence of images of a dynamic scene. According to an embodiment, the types of input images may include, but are not limited to, Red-Green-Blue (RGB) format images, and the like.
In step S120, canonical attribute information and structural feature information of a point cloud included in the plurality of input images are determined under a canonical space.
For example, the point cloud included in the images may be determined by reconstruction using a Structure from Motion (SfM) algorithm, or an initial point cloud may be determined by random initialization.
According to an embodiment, the step of determining canonical attribute information of the point cloud included in the plurality of input images under the canonical space may include: and determining the canonical attribute information of each three-dimensional Gaussian point in the three-dimensional Gaussian point set corresponding to the point cloud under the canonical space.
According to an embodiment, the attribute information of the point cloud and/or the three-dimensional gaussian point may include, but is not limited to, at least one of position information, rotation information, and size information, and the attribute information may be used to render an image.
According to an embodiment, the position information may be represented as, but is not limited to, a position coordinate p(x0, y0, z0) in the canonical space, where x0, y0, and z0 may represent the coordinates in the x-axis, y-axis, and z-axis directions, respectively.
According to an embodiment, the rotation information may be represented as, but is not limited to, a three-dimensional rotation R(a1, a2) parameterized by six-dimensional (6D) parameters, where each of a1 and a2 is a three-dimensional vector; accordingly, the rotation R(a1, a2) represents a total of 6 variables, and these 6 variables can be transformed into a rotation matrix of the rotation group SO(3), as will be described below.
According to an embodiment, the size information may be represented as, but is not limited to, the size s(sx, sy, sz) of the three-dimensional Gaussian point, where sx, sy, and sz may represent the sizes in the x-axis, y-axis, and z-axis directions, respectively.
According to an embodiment, the attribute information may further include density information; for example, the density information may be represented as a density d(o).
According to an embodiment, the attribute information may also include color information, which may include, by way of example but not limitation, the color values (r, g, b) in RGB format.
Accordingly, the canonical attribute information according to an embodiment may include at least one of position information, rotation information, and size information. In addition, the canonical attribute information may further include density information and color information.
According to an embodiment, the canonical attribute information of each three-dimensional Gaussian point in the three-dimensional Gaussian point set corresponding to the point cloud under the canonical space may be determined by assigning attribute information to each three-dimensional Gaussian point in the point cloud or in the corresponding three-dimensional Gaussian point set. For example, the attribute information may be assigned randomly at initialization, or assigned by inheriting previously stored information or information fed back from subsequent image processing. That is, the attribute information of the point cloud (including the canonical attribute information of the three-dimensional Gaussian points) according to embodiments of the present disclosure is learnable: it can be randomly determined during initialization or continuously adjusted through learning.
Since the point cloud data of the point cloud has structural characteristics (such as characteristics related to the position information), in order to make full use of the structural characteristics of the point cloud, in the present disclosure, it is also proposed to extract structural feature information related to the structural characteristics from the point cloud.
According to an embodiment, the step of determining structural feature information of a point cloud included in the plurality of input images under a canonical space includes: based on the position information of each three-dimensional Gaussian point, feature extraction and feature fusion are performed to obtain structural feature information of each three-dimensional Gaussian point. For example, referring to fig. 2, feature extraction and feature fusion are performed based on a point cloud or three-dimensional gaussian points determined from the point cloud to determine the input for subsequent feature decoding.
Specifically, a process of feature extraction using a UNet structure is described in detail with reference to fig. 3, and fig. 3 is a diagram of an example of the UNet structure according to an embodiment of the present disclosure.
According to an embodiment, based on the position information of each three-dimensional Gaussian point, structural information of voxels is obtained by performing feature extraction on one or more three-dimensional Gaussian points using a UNet structure (such as a UNet structure based on sparse convolution), where the one or more three-dimensional Gaussian points may be included within one voxel (or grid cell). According to an embodiment, the input of the UNet structure is the voxelized voxel coordinates determined based on the position information of the point cloud and/or the three-dimensional Gaussian points, and the structural information of the voxels (e.g., per-voxel features) is output through sparse convolution computation using the UNet structure.
In particular, since a point cloud may contain a large number of points, performing a convolution computation for each point would consume substantial computing resources; therefore, the present disclosure uses sparse convolution. Sparse convolution skips the empty regions, which make up most of the space, divides the space into a number of grid cells, treats the points falling within the same cell as one point, and uses the cell coordinates (i.e., voxel coordinates) as the per-cell feature for input and output, thereby reducing the output dimension. Since the UNet structure is based on sparse convolution and its input parameters are voxel coordinates, the output of the sparse convolution necessarily includes neighborhood information of the points; accordingly, the output structural information (e.g., per-voxel features) may be regarded as structural information shared by the one or more points within a voxel.
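As a hedged illustration of the voxelization described above (the framework and the voxel size are assumptions, not values from the disclosure):

```python
import torch

def voxelize(positions: torch.Tensor, voxel_size: float = 0.05):
    """Group 3D Gaussian points into voxels: points falling in the same grid cell
    share one voxel coordinate, which is used as the sparse-convolution input."""
    voxel_coords = torch.floor(positions / voxel_size).long()  # (N, 3) grid indices
    unique_coords, point_to_voxel = torch.unique(voxel_coords, dim=0, return_inverse=True)
    return unique_coords, point_to_voxel  # voxel list and each point's voxel index
```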
Referring to fig. 3 (a), the UNet structure according to an embodiment may include 5 layers (or blocks): a first downsampling block (DownVoxelBlock), a second downsampling block, a residual block (ResidualBlock), a first upsampling block (UpVoxelBlock), and a second upsampling block, where the input of the UNet structure is the initial voxel coordinates determined based on the position information of each three-dimensional Gaussian point and the output is the structural information of the voxels (e.g., per-voxel features); the connection of the blocks of the UNet structure is illustrated in fig. 3 (b). Although the UNet structure according to an embodiment of the present disclosure is illustrated in fig. 3 (a) as including 5 layers, embodiments of the present disclosure are not limited thereto. As an example, the convolution kernel size of each block may be 3 x 3, the strides of the first downsampling block, the second downsampling block, the first upsampling block, and the second upsampling block may be 2, the stride of the residual block may be 1, the output channels of the first downsampling block and the first upsampling block may be 64, the output channels of the second downsampling block and the residual block may be 128, and the output channels of the second upsampling block may be 32. In this case, the input layer may have a kernel size of 5 x 5, a stride of 1, and 32 output channels, and the output layer may have a kernel size of 1 x 1, a stride of 1, and 32 output channels. It should be understood that embodiments of the present disclosure include but are not limited to this example, and a UNet structure according to the present disclosure may be implemented in any suitable form. Furthermore, referring to fig. 3 (b), a cross-layer (skip) connection may exist between layers, such as between the second downsampling block and the first upsampling block, and the UNet structure with the cross-layer connection can increase representational power. It should be noted that although the upsampling blocks and downsampling blocks in the UNet structure according to the embodiment of the present disclosure are shown as two each and the residual block is shown as one, embodiments of the present disclosure are not limited thereto, and the number of upsampling blocks, downsampling blocks, and residual blocks may each be one or more. Furthermore, although the UNet structure in the present disclosure has a cross-layer connection, embodiments of the present disclosure are not limited thereto; UNet structures without a cross-layer connection or with other connection schemes may also be used, and the cross-layer connection of the UNet structure of the present disclosure is not limited to the connection shown in fig. 3.
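For illustration only, the following Python sketch shows a block arrangement of this kind. It is a dense stand-in using standard 3D convolutions rather than the sparse convolution described above; the channel widths follow the example values in the text, while all other choices (framework, dense convolution, activation functions) are assumptions:

```python
import torch
import torch.nn as nn

class TinyVoxelUNet(nn.Module):
    """Dense stand-in for the sparse-convolution UNet described above: two
    downsampling blocks, one residual block, two upsampling blocks, with a skip
    connection from the second down block into the first up block."""
    def __init__(self, in_channels: int = 32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv3d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv3d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.res = nn.Sequential(nn.Conv3d(128, 128, 3, stride=1, padding=1), nn.ReLU())
        self.up1 = nn.Sequential(
            nn.ConvTranspose3d(256, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU())
        self.up2 = nn.ConvTranspose3d(64, 32, 3, stride=2, padding=1, output_padding=1)

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        d1 = self.down1(voxels)               # (B, 64, D/2, H/2, W/2)
        d2 = self.down2(d1)                   # (B, 128, D/4, H/4, W/4)
        r = self.res(d2)                      # (B, 128, D/4, H/4, W/4)
        u1 = self.up1(torch.cat([r, d2], 1))  # skip connection: concatenate, then upsample
        return self.up2(u1)                   # (B, 32, D, H, W) per-voxel structural features
```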
Local information (such as structural information) can be extracted from the point cloud by the sparse convolution of the UNet structure according to embodiments of the present disclosure, which provides subsequent processing with more accurate inputs containing aggregated features, thereby enabling accurate prediction of motion and deformation.
In addition, point feature information of the three-dimensional Gaussian points may be obtained by performing feature extraction on each three-dimensional Gaussian point using a first neural network model, such as a multilayer perceptron (MLP), based on the position information of each three-dimensional Gaussian point. According to an embodiment, the input of the first neural network model may be the position information of each three-dimensional Gaussian point, and the output may be the point feature information of the point itself (i.e., an attribute vector of the point itself); for example, the input may be a three-dimensional vector and the output may be a 64-dimensional vector, in which case the output 64-dimensional vector may be regarded as the point feature information.
For example, the neural network model may be an MLP.
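For illustration, a minimal sketch of such a first neural network model mapping a 3-dimensional position to a 64-dimensional point feature; the hidden width and the framework are assumptions:

```python
import torch.nn as nn

# Assumed layer sizes; the text only fixes the 3-D input and 64-D output.
point_feature_mlp = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 64),
)
# point_features = point_feature_mlp(positions)  # (N, 3) -> (N, 64)
```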
According to an embodiment, a neural network model may comprise a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weights and biases as network parameters, and an intermediate neural network layer may perform a neural network computation on the computation results of the previous layer together with the plurality of weights and biases and provide the result to the next layer; the network parameters may be adjusted or determined through learning. Since many detailed descriptions of neural network models already exist in the related art, a detailed description thereof is omitted here. Examples of neural networks include, but are not limited to, convolutional neural networks (CNNs), deep neural networks (DNNs), recurrent neural networks (RNNs), restricted Boltzmann machines (RBMs), deep belief networks (DBNs), bidirectional recurrent deep neural networks (BRDNNs), generative adversarial networks (GANs), and deep Q-networks. In addition, the neural network models may also be implemented by other neural network models. The above examples are merely exemplary, and the present disclosure is not limited thereto.
According to the embodiment, the first neural network model extracts unique point feature information (such as attributes of the point itself) for each three-dimensional Gaussian point in the point cloud or in the corresponding three-dimensional Gaussian point set; therefore, the features extracted by the first neural network model may differ from point to point, while the structural features extracted by the UNet structure are structural features shared by the points located within the same voxel. To make full use of the structural characteristics of the point cloud, the two kinds of features (i.e., the structural information and the point feature information) need to be used in combination.
According to an embodiment, when the structural information and the point feature information have been determined by the UNet structure and the first neural network model, respectively, referring to fig. 2, the structural feature information of each three-dimensional Gaussian point is obtained by performing feature fusion using a second neural network model (such as an MLP) based on the structural information of the voxels and the point feature information of the three-dimensional Gaussian point. For example, the structural information of the voxel may be concatenated with the point feature information of the three-dimensional Gaussian point, and the concatenation result may be input into the second neural network model (such as an MLP) to output the fused structural feature information, which can be used for the subsequent feature decoding process (such as the transformation). According to an embodiment, the input of the second neural network model (such as an MLP) may be the concatenation of the structural information of the voxel and the point feature information of the corresponding three-dimensional Gaussian point, and the output may be the structural feature information of the corresponding three-dimensional Gaussian point. The feature fusion process according to embodiments of the present disclosure may also be implemented by other feature fusion schemes, and the present disclosure is not limited thereto.
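For illustration, a minimal sketch of this feature fusion by concatenation followed by a second MLP; all dimensions and the framework are assumptions:

```python
import torch
import torch.nn as nn

# Assumed dimensions: 32-D voxel structural feature + 64-D point feature.
fusion_mlp = nn.Sequential(
    nn.Linear(32 + 64, 128), nn.ReLU(),
    nn.Linear(128, 64),
)

def fuse(voxel_feat: torch.Tensor, point_feat: torch.Tensor) -> torch.Tensor:
    """Concatenate each point's voxel-level feature with its own point feature."""
    return fusion_mlp(torch.cat([voxel_feat, point_feat], dim=-1))  # (N, 64)
```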
By fusing the output of the UNet structure and the output of the first neural network model, structural feature information of each three-dimensional Gaussian point can be obtained that includes both the point's own unique features and the structural features (e.g., geometric features) of the voxel to which it belongs, thereby providing accurate input data for subsequent processing.
Through the above processing, the canonical attribute information and the structural feature information of the point cloud or the three-dimensional gaussian point in the canonical space can be determined, and then the information in the time domain space can be obtained by introducing the time variable t and performing the transformation based on the canonical attribute information and the structural feature information.
Returning to fig. 1, at step S130, the canonical attribute information is transformed into time domain attribute information of the point cloud in time domain space by using a transformation network based on the structural feature information. For example, the transformation of attribute information from canonical space to time domain space may be achieved by introducing a time variable t using a transformation network (or referred to as a transformation field) constrained by structural feature information.
The time domain attribute information according to an embodiment may include at least one of position information, rotation information, and size information. Further, according to an embodiment, the time domain attribute information may further include density information and color information. According to an embodiment, the time domain attribute information is expressed under a time domain space, has similar characteristics to the specification attribute information, and details of the specification attribute information are applicable to the time domain attribute information, and are not repeated here.
According to an embodiment, the step of transforming the canonical attribute information into the time domain attribute information of the point cloud in the time domain space by using the transformation network based on the structural feature information includes: performing feature decoding based on the canonical attribute information and the structural feature information of each three-dimensional Gaussian point by using the transformation network to determine the time domain attribute information of each three-dimensional Gaussian point. Referring to fig. 2, feature decoding is performed on the fused structural feature information and the determined canonical attribute information together with a time embedding, so as to obtain the time domain attribute information in the time domain space.
Specifically, based on the structural feature information, time-based change attribute information of each three-dimensional Gaussian point (such as position change information (Δx, Δy, Δz), rotation change information (a1', a2'), and size change information (Δsx, Δsy, Δsz)) is determined by performing a feature decoding operation (such as a Gaussian transformation) on the canonical attribute information using a third neural network model (such as an MLP).
The feature decoding is described in detail below with reference to fig. 2 and 4. Fig. 4 is a block diagram illustrating an example of a transformation network structure according to an embodiment of the present disclosure.
According to an embodiment, the change attribute information may be calculated by using the transformation network F_Θ (such as the Gaussian transformation network structure in fig. 4, i.e., the third neural network model) based on the position-encoded position information r(p), the position-encoded time variable r(t), and the structural feature information, where Θ represents the network parameters (such as weights and biases) of the transformation network (or transformation field). The specific calculation may be represented as the following equation:
(Δx, Δy, Δz, a1', a2', Δsx, Δsy, Δsz) = F_Θ(r(x0, y0, z0), r(t), feature) (1)
where feature represents the structural feature information, and r() represents a position-encoding function, which may be expressed as the following equation:
r(x) = (sin(2^0·π·x), cos(2^0·π·x), ..., sin(2^(L-1)·π·x), cos(2^(L-1)·π·x)) (2)
where x represents a position or time input parameter of the transformation field, i.e., p(x0, y0, z0) or t; L is a dimensionless number associated with the frequency components and represents the number of sin-cos pairs, with L being 10 for the position coordinate (i.e., p(x0, y0, z0)) and 6 for the time variable t. The larger L is, the higher the frequencies, and the more high-frequency detail the equation can describe.
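A minimal sketch of the position-encoding function r(), assuming the sinusoidal encoding form given in equation (2) (the framework is an assumption):

```python
import math
import torch

def positional_encoding(x: torch.Tensor, num_freqs: int) -> torch.Tensor:
    """r(x): concatenate sin/cos of the input at num_freqs (L) frequencies.
    L = 10 for position coordinates and L = 6 for the time variable, per the text."""
    outs = []
    for i in range(num_freqs):
        freq = (2.0 ** i) * math.pi
        outs.append(torch.sin(freq * x))
        outs.append(torch.cos(freq * x))
    return torch.cat(outs, dim=-1)
```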
As can be seen from equations (1) and (2), the inputs of the transformation network are the position-encoded position information (such as r(p)) and the position-encoded time variable (such as r(t)), and the outputs are the position change information (Δx, Δy, Δz), rotation change information (a1', a2'), and size change information (Δsx, Δsy, Δsz) based on time t determined by the transformation network. In other words, the output is an expression of the corresponding information that contains the time variable t.
Referring to fig. 4, the transformation network F_Θ structure (also referred to as a feature decoding structure, decoder, etc.) may be constructed from 5 neural network layers; the position-encoded position information r(p), the position-encoded time variable r(t), and the structural feature information are input into the transformation network, and there is a cross-layer connection in the transformation network that connects the inputs to an intermediate neural network layer (simply referred to as a "layer"). The outputs of the transformation network are the position change information (Δx, Δy, Δz), rotation change information (a1', a2'), and size change information (Δsx, Δsy, Δsz) based on time t. Although a cross-layer connection to an intermediate layer is shown, according to embodiments of the present disclosure the cross-layer connection may be omitted or implemented between other layers, and the present disclosure is not limited thereto. Since each of a1' and a2' of the rotation change information includes 3 parameters, the output has 12 parameters in total; accordingly, the last layer has 12 neurons and the other layers have 256 neurons. The number of layers and the number of neurons of the neural network structure according to embodiments of the present disclosure may also be any other number, and the present disclosure is not limited thereto. According to embodiments of the present disclosure, any one neural network model may also be implemented as a plurality of neural network models, as long as the corresponding computation can be implemented.
The network parameters (such as weights and biases) of the transformation network F_Θ structure according to embodiments of the present disclosure are learnable; for example, the network parameters may be adjusted when training the image processing model according to embodiments of the present disclosure.
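For illustration only, a sketch of a five-layer transformation network with a skip connection and a 12-dimensional output, under assumed input dimensions and an assumed skip position:

```python
import torch
import torch.nn as nn

class DeformationNet(nn.Module):
    """Sketch of the transformation network F_Theta: five linear layers of width 256,
    a skip connection that re-injects the inputs at an intermediate layer, and a
    12-dimensional output (dx, dy, dz, a1', a2', dsx, dsy, dsz)."""
    def __init__(self, pos_enc_dim: int = 3 * 2 * 10, time_enc_dim: int = 1 * 2 * 6,
                 feat_dim: int = 64):
        super().__init__()
        in_dim = pos_enc_dim + time_enc_dim + feat_dim
        self.l1 = nn.Linear(in_dim, 256)
        self.l2 = nn.Linear(256, 256)
        self.l3 = nn.Linear(256 + in_dim, 256)  # skip: inputs concatenated again here
        self.l4 = nn.Linear(256, 256)
        self.out = nn.Linear(256, 12)
        self.act = nn.ReLU()

    def forward(self, r_p, r_t, feature):
        x_in = torch.cat([r_p, r_t, feature], dim=-1)
        h = self.act(self.l1(x_in))
        h = self.act(self.l2(h))
        h = self.act(self.l3(torch.cat([h, x_in], dim=-1)))
        h = self.act(self.l4(h))
        return self.out(h)  # (N, 12) change attribute information based on time t
```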
Through the transformation network in fig. 4, the canonical attribute information can be subjected to the Gaussian transformation to obtain the time-based change attribute information.
When the change attribute information has been determined, referring to fig. 2, the time-based time domain attribute information of each three-dimensional Gaussian point in the time domain space is determined based on the canonical attribute information of each three-dimensional Gaussian point and the time-based change attribute information. For example, the position information in the time-based time domain attribute information can be obtained by the following equation (3):
(x_t, y_t, z_t) = (x0 + Δx, y0 + Δy, z0 + Δz) (3)
where x_t represents the coordinate of the three-dimensional Gaussian point in the x-axis direction at time t in the time domain space, y_t represents the coordinate in the y-axis direction at time t, and z_t represents the coordinate in the z-axis direction at time t.
For example, the size information in the time-based time domain attribute information may be obtained by the following equation (4):
(sx_t, sy_t, sz_t) = (sx + Δsx, sy + Δsy, sz + Δsz) (4)
where sx_t represents the size of the three-dimensional Gaussian point in the x-axis direction at time t in the time domain space, sy_t represents the size in the y-axis direction at time t, and sz_t represents the size in the z-axis direction at time t.
For example, the rotation information in the time-based time domain attribute information may be obtained by the following equation (5):
R=6DR2Matrix(a'1,a'2)×6DR2Matrix(a1,a2) (5)
where 6DR2Matrix() represents the transformation that converts the six-dimensional continuous rotation representation into a rotation matrix of the rotation group SO(3); the 6DR2Matrix() function may be represented by the following equation (6) and equation (7):
6DR2Matrix(a1, a2) = [b1 | b2 | b3] (6)
b1 = N(a1), b2 = N(a2 − (b1 · a2) b1), b3 = b1 × b2 (7)
where the vertical line in equation (6) indicates that the vectors are column vectors (e.g., a1 is a column vector with three rows), and N() represents a normalization function.
Through the above equations, the time domain attribute information of the three-dimensional Gaussian points, e.g., the predicted three-dimensional Gaussian points after the change at a specific time, can be expressed in the time domain space.
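The following sketch applies equations (3) to (5), assuming the Gram-Schmidt construction of the 6D-to-rotation-matrix mapping shown in equations (6) and (7); the framework and tensor layout are assumptions:

```python
import torch

def rotation_6d_to_matrix(a1: torch.Tensor, a2: torch.Tensor) -> torch.Tensor:
    """6DR2Matrix: map a six-dimensional rotation representation to an SO(3) matrix
    via Gram-Schmidt orthonormalization, following equations (6) and (7)."""
    b1 = torch.nn.functional.normalize(a1, dim=-1)
    b2 = torch.nn.functional.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-1)  # columns b1 | b2 | b3

def apply_deltas(positions, scales, rot6d, deltas):
    """Sketch of equations (3)-(5): add the predicted changes to the canonical
    position and size, and compose the predicted rotation with the canonical one."""
    d_pos, d_rot6d, d_scale = deltas[:, 0:3], deltas[:, 3:9], deltas[:, 9:12]
    positions_t = positions + d_pos                        # equation (3)
    scales_t = scales + d_scale                            # equation (4)
    r_canon = rotation_6d_to_matrix(rot6d[:, :3], rot6d[:, 3:])
    r_delta = rotation_6d_to_matrix(d_rot6d[:, :3], d_rot6d[:, 3:])
    rotations_t = r_delta @ r_canon                        # equation (5)
    return positions_t, rotations_t, scales_t
```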
Referring back to fig. 1, in step S140, rendered images of a plurality of input images are obtained based on the time domain attribute information.
Referring to fig. 2, a rendered image may be obtained by differentiable rasterization based on the time domain attribute information. Specifically, when the specific time domain attribute information of a three-dimensional Gaussian point is determined, the attribute information of the three-dimensional Gaussian point at a specific time (such as time t0) may be determined, and the three-dimensional Gaussian point may be rendered into a two-dimensional image based on the attribute information (such as at least one of position information, size information, rotation information, color information, and density information) through differentiable rasterization. Specific details of the rasterization operation already exist in the related art and are not described in detail herein.
By the image processing method described above, a rendered image at any position (such as any viewing angle) and any time can be generated based on the plurality of input images, thereby achieving a good novel view image synthesis effect in dynamic scenes. Further, since embodiments of the present disclosure generate the rendered image from time-related information instead of storing image data at every time step, embodiments of the present disclosure can save storage resources and avoid unnecessary storage overhead.
Fig. 5 is a flowchart illustrating a training method of an image processing model according to an embodiment of the present disclosure. The image processing model of the present disclosure may be obtained through training on an electronic device or server.
Referring to fig. 5, in step S510, a plurality of training images are acquired. According to an embodiment, the step of acquiring the training image is similar to step S110 described above with reference to fig. 1, and the specific details of step S110 may be applied to step S510. In addition, multiple training images may also be obtained from various data sets (such as synthetic data sets including the scenes "Hell Warrior", "instant", "Hook", "Bouncing Balls", "T-Rex", "Stand Up", "Jumping Jacks", and "Lego").
In step S520, the image processing method described with reference to fig. 1 to 4, that is, the prediction processing of the rendered image is performed using the image processing model for a plurality of training images. The image processing method described with reference to fig. 1 to 4 may be applied to step S520, wherein the training image of step S510 may be directly used as an input image or may be processed to be used as an input image, and detailed description thereof will not be given here.
In step S530, a loss is calculated based on the rendered image obtained by the image processing method and a training image corresponding to the rendered image, such as a ground truth (GT) image, among the plurality of training images. For example, in each training iteration, a viewing angle is randomly sampled, a rendered image corresponding to that viewing angle is determined using the image processing model, and the loss is calculated based on the rendered image and the truth image. For example, the loss may be calculated by comparing the rendered image produced by the image processing at a particular time with the known corresponding truth image at that time. For example, the loss can be calculated by the loss function of the following equation (8):
L = (1 − λ)·||Ĉ − C||_1 + λ·(1 − SSIM(Ĉ, C)) (8)
where L represents the loss, Ĉ represents the rendered image obtained by the image processing of step S520, C represents the corresponding truth image, λ is a weighting coefficient, and SSIM() represents a function that calculates the structural similarity.
For example, the truth image may be one of the known images among the training images.
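A minimal sketch of how a loss of this form could be computed is shown below, assuming the rendered and truth images are tensors with values in [0, 1] and that the structural-similarity term is provided by the third-party pytorch-msssim package; this is not asserted to be the exact implementation of the disclosure.

```python
import torch
from pytorch_msssim import ssim  # assumed third-party dependency

def render_loss(pred, gt):
    """pred, gt: (B, 3, H, W) tensors in [0, 1]; mirrors equation (8) above."""
    l1_term = (pred - gt).abs().mean()                # L1 photometric difference
    ssim_term = 1.0 - ssim(pred, gt, data_range=1.0)  # structural-similarity penalty
    return l1_term + ssim_term
```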
In step S540, the image processing model is trained by adjusting parameters of the image processing model based on the loss.
According to an embodiment, the iterative process of steps S520 to S540 is repeated until the loss satisfies a predetermined condition or a predetermined number of iterations is reached. For example, when the loss is greater than or equal to the threshold, the iterative process of adjusting the parameters and performing the image processing is repeated until the loss is less than the threshold.
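For orientation only, a skeletal training loop along the lines of steps S520 to S540 might look as follows; model, views, gt_for, and render_loss (the sketch above) are hypothetical names, and the stopping criteria are placeholders.

```python
import torch

def train(model, views, gt_for, max_iters=30000, loss_threshold=1e-3, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for it in range(max_iters):                               # predetermined number of iterations
        view = views[torch.randint(len(views), (1,)).item()]  # randomly extract a view angle
        pred = model.render(view.camera, view.time)           # step S520: predict a rendered image
        loss = render_loss(pred, gt_for(view))                # step S530: compare with the truth image
        optimizer.zero_grad()
        loss.backward()                                       # gradients flow back to the Gaussian points
        optimizer.step()                                      # step S540: adjust model parameters
        if loss.item() < loss_threshold:                      # predetermined condition on the loss
            break
```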
According to an embodiment, the parameters of the image processing model include, but are not limited to, parameters related to UNet structures used in the above-described image processing method, parameters of the neural network model, parameters of the transformation network, and the like.
According to an embodiment, in the image processing of each iteration process, the time domain attribute information obtained by the transformation may be mapped back into the canonical space and converted back into canonical attribute information for use in subsequent iteration processes.
According to an embodiment, in the image processing of each iteration process, the specification attribute information under the specification space may be stored or recorded for use in the subsequent iteration process.
According to an embodiment, in the process of determining the canonical attribute information described with reference to fig. 1 or 5, the initialized point cloud and the corresponding initialized canonical attribute information may be determined, and canonical attribute information that has been stored or recorded, or that is fed back from the time domain attribute information, may also be determined.
The specific parameter adjustment based on the loss function is similar to model training methods in the related art, provided that the inputs, outputs, and embedded parameters fall within the protection scope of the claims of the present disclosure, and is not described in detail herein. Based on the generated rendered image and the truth image, learning and training may be performed using semi-supervised, fully supervised, or other existing learning modes. In this context, learning means that a predefined operating rule or artificial intelligence model having desired characteristics is formed by applying a learning algorithm to a plurality of pieces of learning data (including but not limited to attribute information and parameters of the respective structures or models). According to an embodiment, learning may be performed in the apparatus itself and/or may be implemented by a separate server/device/system.
In the related art, such as the paper 3D Gaussian Splatting for Real-Time Radiance Field Rendering, three-dimensional Gaussian points may be subjected to adaptive density control before an image is rendered. For example, each time the model is trained for one round while reconstructing a static scene, gradient values are passed back to each three-dimensional Gaussian point, and if a gradient value is greater than a predetermined gradient threshold, density control of the corresponding three-dimensional Gaussian point is required. Specifically, the density control operations may include a cloning operation and a splitting operation; for example, on the premise that the gradient value is greater than the predetermined gradient threshold, the cloning operation or the splitting operation is performed depending on whether the size of the three-dimensional Gaussian point is smaller or larger than a predetermined size threshold. In the present disclosure, a time variable is introduced, and therefore the density control operation should be time-dependent.
According to an embodiment, the step of training the image processing model by adjusting parameters of the image processing model based on the loss comprises: when training an image processing model, determining gradient information of three-dimensional Gaussian points in a three-dimensional Gaussian point set corresponding to the point cloud for each training; determining whether to perform a density control operation based on the time domain attribute information and the gradient information, wherein the density control operation includes changing a number of three-dimensional gaussian points.
According to an embodiment, the density control operation may include, but is not limited to, changing the number of three-dimensional gaussian points (e.g., the density of a point cloud or a set of three-dimensional gaussian points), for example, by a cloning operation or a splitting operation.
According to an embodiment, in each iterative training, a three-dimensional Gaussian point processed using the image processing model is projected onto an image plane; that is, a three-dimensional Gaussian point having three coordinate values such as (x, y, z) is projected as a two-dimensional Gaussian point having two-dimensional coordinates such as (x, y, 0), and a two-dimensional coordinate gradient value is calculated based on the coordinates of the two-dimensional Gaussian point on the image plane and is determined as the gradient information of the three-dimensional Gaussian point. In each iterative training, gradient information of the three-dimensional Gaussian points can thus be determined. Although the embodiments of the present disclosure only show that the gradient information includes a two-dimensional coordinate gradient, the embodiments of the present disclosure are not limited thereto.
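The following fragment sketches one way such two-dimensional coordinate gradients could be collected after each backward pass, assuming the projected centers means2d were created inside the computation graph and retain_grad() was called on them; the names and accumulation scheme are hypothetical.

```python
import torch

def accumulate_grad_stats(means2d, grad_accum, grad_count):
    """Call after loss.backward(); means2d: (N, 2) projected centers with .grad populated,
    grad_accum: (N,) running sum of gradient magnitudes, grad_count: number of accumulations."""
    if means2d.grad is not None:
        grad_accum += means2d.grad.norm(dim=-1)  # per-point magnitude of the 2D coordinate gradient
        grad_count += 1
    return grad_accum, grad_count
```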
For example, according to an embodiment, in training of the image processing model, the gradient information is passed back to the three-dimensional Gaussian points in each training iteration, and whether to perform a density control operation is determined based on the time domain attribute information and the gradient information at a specific time (e.g., the current time). As an example, the density control determination is performed once every predetermined number of training iterations (for example, every 100 iterations). As an example, the gradient values accumulated over the predetermined number of iterations may be averaged, and whether density control needs to be performed may be determined according to the magnitude of the average gradient value and the time domain attribute information (such as size (or scale) information) of the three-dimensional Gaussian point at the specific time (such as the current time). That is, in the present disclosure, whether to perform the density control operation is determined at a specific time in the time domain space, rather than in the canonical space.
When a condition for performing the density control operation is satisfied (e.g., the gradient information satisfies a gradient threshold condition (e.g., is greater than or equal to a predetermined gradient threshold) and the size (or scale) of the three-dimensional Gaussian point at the particular time is greater than or equal to a predetermined size threshold), a density control operation including a cloning operation and/or a splitting operation may be performed. In the present disclosure, in order to perform the density control operation on a three-dimensional Gaussian point in the canonical space, it is assumed that the same operation as that performed on the corresponding three-dimensional Gaussian point at the specific time should be performed in the canonical space.
For example, when a cloning operation is performed on a three-dimensional Gaussian point, since the cloning operation in the canonical space is completely equivalent to the cloning operation in the time domain space, the cloning operation may be performed in the time domain space or directly in the canonical space. When a splitting operation is performed on a three-dimensional Gaussian point, since the splitting operation in the canonical space is not equivalent to the splitting operation in the time domain space, the splitting operation may be performed in the time domain space and the result of the splitting operation may be mapped back to the canonical space, so that the splitting operation is performed indirectly in the canonical space. Specific details regarding the density control operation already exist in the related art and are not described in detail herein. By performing adaptive density control to change the number of three-dimensional Gaussian points, the image processing can be better optimized.
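A simplified sketch of the clone/split decision described above is given below; avg_grad is the averaged two-dimensional coordinate gradient per Gaussian, scales_t the per-Gaussian size at the current time (a time domain attribute), and the thresholds, the perturbation magnitude, and the omission of the explicit map-back-to-canonical step are all assumptions made for illustration.

```python
import torch

def density_control(canon_xyz, scales_t, avg_grad, grad_thresh=2e-4, size_thresh=0.01):
    """canon_xyz: (N, 3) canonical centers; scales_t: (N, 3) sizes at time t; avg_grad: (N,)."""
    size_t = scales_t.max(dim=-1).values
    needs_control = avg_grad >= grad_thresh

    # Cloning: equivalent in canonical and time-domain space, so duplicate the canonical point.
    clone_mask = needs_control & (size_t < size_thresh)
    cloned = canon_xyz[clone_mask]

    # Splitting: performed in time-domain space in the disclosure and mapped back to the
    # canonical space; here we only jitter the canonical centers as a stand-in for that mapping.
    split_mask = needs_control & (size_t >= size_thresh)
    parents = canon_xyz[split_mask]
    jitter = torch.randn_like(parents) * 0.01
    children = torch.cat([parents + jitter, parents - jitter], dim=0)

    return torch.cat([canon_xyz, cloned, children], dim=0)
```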
By using the training method described above, an image processing model including a three-dimensional gaussian model can be extended from modeling of a static scene to modeling of a dynamic scene over a continuous time.
An image processing apparatus that performs the above-described image processing method and a training apparatus for an image processing model that performs the training method for an image processing model will be described in detail below.
Fig. 6 is a block diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure.
Referring to fig. 6, an image processing apparatus 600 according to an embodiment may include an image acquisition unit 610, an attribute determination unit 620, a transformation processing unit 630, and an image rendering unit 640.
According to an embodiment, the image obtaining unit 610 may be configured to obtain a plurality of input images, the attribute determining unit 620 may be configured to determine canonical attribute information and structural feature information of the point cloud in the canonical space included in the plurality of input images, the transformation processing unit 630 may be configured to transform the canonical attribute information into time domain attribute information of the point cloud in the time domain space by using the transformation network based on the structural feature information, and the image rendering unit 640 may be configured to obtain a rendered image of the plurality of input images based on the time domain attribute information.
That is, the image acquisition unit 610 may perform operations corresponding to step S110 of the image processing method described above with reference to fig. 1, the attribute determination unit 620 may perform operations corresponding to step S120, the transformation processing unit 630 may perform operations corresponding to step S130, and the image rendering unit 640 may perform operations corresponding to step S140. Details of the image processing method described above with reference to fig. 1 to 4 are applicable to the operations of the corresponding units of fig. 6.
According to an embodiment, the attribute determination unit 620 may be configured to determine canonical attribute information of the point cloud included in the plurality of input images under a canonical space by: determining specification attribute information of each three-dimensional Gaussian point in the three-dimensional Gaussian point set corresponding to the point cloud under a specification space, wherein the specification attribute information and/or the time domain attribute information includes at least one of position information, rotation information, and size information.
According to an embodiment, the attribute determining unit 620 may be configured to determine structural feature information of the point clouds included in the plurality of input images under the canonical space by: based on the position information of each three-dimensional Gaussian point, feature extraction and feature fusion are performed to obtain structural feature information of each three-dimensional Gaussian point.
According to an embodiment, the transformation processing unit 630 may be configured to transform the canonical attribute information into time domain attribute information of the point cloud in time domain space by using a transformation network based on the structural feature information by: feature decoding is performed based on the canonical attribute information and the structural feature information of each three-dimensional gaussian point by using a transformation network to determine time domain attribute information of each three-dimensional gaussian point.
According to an embodiment, the attribute determining unit 620 may be configured to obtain structural feature information of each three-dimensional gaussian point by performing feature extraction and feature fusion based on the position information of each three-dimensional gaussian point by: based on the position information of each three-dimensional Gaussian point, obtaining the structural information of the voxels by carrying out feature extraction on one or more three-dimensional Gaussian points by using a UNet structure; based on the position information of each three-dimensional Gaussian point, obtaining point characteristic information of the three-dimensional Gaussian points by carrying out characteristic extraction on each three-dimensional Gaussian point by using a first neural network model; based on the structural information of the voxels and the point characteristic information of the three-dimensional Gaussian points, the structural characteristic information of each three-dimensional Gaussian point is obtained by carrying out characteristic fusion by using a second neural network model.
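Purely as an illustration of this three-stage path (a UNet over voxels, a per-point MLP, and a fusion MLP), a possible module layout is sketched below; the voxelization and UNet step is abstracted into a callable voxel_unet, and all layer widths are hypothetical.

```python
import torch
import torch.nn as nn

class StructureFeatureNet(nn.Module):
    def __init__(self, voxel_unet, voxel_dim=32, point_dim=32, out_dim=64):
        super().__init__()
        self.voxel_unet = voxel_unet                      # returns per-point structural info of voxels
        self.point_mlp = nn.Sequential(                   # first neural network model
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, point_dim))
        self.fuse_mlp = nn.Sequential(                    # second neural network model
            nn.Linear(voxel_dim + point_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, xyz):                               # xyz: (N, 3) canonical positions
        voxel_feat = self.voxel_unet(xyz)                 # (N, voxel_dim) structural information of voxels
        point_feat = self.point_mlp(xyz)                  # (N, point_dim) point feature information
        return self.fuse_mlp(torch.cat([voxel_feat, point_feat], dim=-1))
```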
According to an embodiment, the transformation processing unit 630 may be configured to perform feature decoding based on the canonical attribute information and the structural feature information of each three-dimensional gaussian point to determine the time-domain attribute information of each three-dimensional gaussian point by using a transformation network by: determining time-based variation attribute information of each three-dimensional gaussian point by performing gaussian transformation on the specification attribute information using a third neural network model based on the structural feature information; time-based time-domain attribute information of each three-dimensional gaussian point is determined based on canonical attribute information of each three-dimensional gaussian point and time-based variation attribute information, wherein the time-based variation attribute information includes at least one of position variation information, rotation variation information, and size variation information.
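Along the same lines, the feature-decoding step could be organized as below, where a small MLP (standing in for the third neural network model) maps structural features, canonical attributes, and a time value to position, rotation, and size changes that are added to the canonical attributes; the dimensions and the additive update are assumptions, not the asserted implementation.

```python
import torch
import torch.nn as nn

class DeformationDecoder(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        # input: structural feature + canonical position (3) + rotation quaternion (4) + size (3) + time (1)
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 3 + 4 + 3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 3))                 # position / rotation / size changes

    def forward(self, feat, xyz, rot, scale, t):
        t_col = torch.full_like(xyz[:, :1], float(t))     # broadcast the time value to every point
        delta = self.net(torch.cat([feat, xyz, rot, scale, t_col], dim=-1))
        d_xyz, d_rot, d_scale = delta.split([3, 4, 3], dim=-1)
        # time-domain attributes = canonical attributes + time-based change attributes
        return xyz + d_xyz, rot + d_rot, scale + d_scale
```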
According to an embodiment, the plurality of input images comprises a plurality of monocular images captured at different times and/or different locations.
The specific manner in which the respective units of the image processing apparatus 600 in the above-described embodiments perform operations has been described in detail in the embodiments of the related image processing method, and will not be described in detail here.
Fig. 7 is a block diagram showing the structure of a training apparatus of an image processing model according to an embodiment of the present disclosure.
Referring to fig. 7, a training apparatus 700 of an image processing model according to an embodiment may include an image acquisition unit 710, a model prediction unit 720, a loss calculation unit 730, and a parameter adjustment unit 740.
According to an embodiment, the image acquisition unit 710 may be configured to acquire a plurality of training images, the model prediction unit 720 may be configured to perform the image processing method as described above using the image processing model for the plurality of training images, the loss calculation unit 730 may be configured to calculate a loss based on the rendered image and a training image corresponding to the rendered image among the plurality of training images, and the parameter adjustment unit 740 may be configured to train the image processing model by adjusting parameters of the image processing model based on the loss.
That is, the image acquisition unit 710 may perform operations corresponding to step S510 of the training method of the image processing model described above with reference to fig. 5, the model prediction unit 720 may perform operations corresponding to step S520, the loss calculation unit 730 may perform operations corresponding to step S530, and the parameter adjustment unit 740 may perform operations corresponding to step S540. Details of the training method of the image processing model described above with reference to fig. 5 are applicable to the operations of the corresponding units of fig. 7.
According to an embodiment, the parameter adjustment unit 740 is configured to train the image processing model by adjusting parameters of the image processing model based on the loss by: when training the image processing model, determining gradient information of three-dimensional Gaussian points in a three-dimensional Gaussian point set corresponding to the point cloud for each training; determining whether to perform a density control operation based on the time domain attribute information and the gradient information, wherein the density control operation includes changing a number of three-dimensional gaussian points.
According to an embodiment, the model prediction unit 720 may further perform operations corresponding to the image processing methods described above with reference to fig. 1 to 4. Alternatively, the model prediction unit 720 may also be implemented as the image processing apparatus 600 as described above.
The specific manner in which the respective units of the training apparatus 700 for an image processing model perform operations in the above-described embodiments have been described in detail in the embodiments of the training method for an image processing model concerned, and will not be described in detail here.
Further, it should be understood that various units in the image processing apparatus 600 and the training apparatus 700 of the image processing model according to embodiments of the present disclosure may be implemented as hardware components and/or software components. The individual units may be implemented, for example, using a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC), depending on the processing or operations performed by the individual units as defined.
According to an embodiment of the present disclosure, there is also provided an electronic apparatus. Fig. 8 is a block diagram illustrating a structure of an electronic device according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the electronic device 800 may include at least one processor 810 and at least one memory 820 storing computer-executable instructions. The computer executable instructions, when executed by the at least one processor 810, cause the at least one processor 810 to perform the image processing method or training method of the image processing model as described above.
According to an embodiment of the present disclosure, electronic device 800 may be a PC computer, tablet, personal digital assistant, smart phone, or other device capable of executing the above-described set of instructions. Here, the electronic device 800 is not necessarily a single electronic device, but may be any device or aggregate of circuits capable of executing the above-described instructions (or instruction set) alone or in combination. The electronic device 800 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with either locally or remotely (e.g., via wireless transmission).
In electronic device 800, processor 810 may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 810 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and so forth.
Processor 810 may execute instructions or code stored in memory 820, wherein memory 820 may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory 820 may be integrated with the processor 810, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. In addition, the memory 820 may include a separate device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 820 and the processor 810 may be operatively coupled or may communicate with each other, such as through an I/O port, network connection, etc., such that the processor 810 is capable of reading files stored in the memory 820.
In addition, the electronic device 800 may also include a display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 800 may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the image processing method or the training method of the image processing model as described above.
Examples of computer readable storage media according to exemplary embodiments of the present disclosure include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, hard disk drives (HDD), solid state drives (SSD), card-type memories (such as multimedia cards, Secure Digital (SD) cards, or eXtreme Digital (XD) cards), magnetic tapes, floppy disks, magneto-optical data storage devices, hard disks, solid state disks, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide the computer program and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the program. The computer program in the computer readable storage medium described above can be run in an environment deployed in an electronic device, such as a client, host, proxy device, server, etc. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to the image processing method and device and the training method and device of the image processing model, modeling of a dynamic scene on continuous time can be achieved, and the mapping transformation relation between a standard space and a time domain space is determined, so that accurate change and motion prediction are achieved, and storage cost is saved. The image processing method and apparatus according to the embodiments of the present disclosure and the training method and apparatus of an image processing model can also model under dynamic scenes in which time and position (or viewing angle) change simultaneously, for example, a rendered image can be observed from different angles, or at different times, under a scene of an input image. The image processing method and apparatus and the training method and apparatus of the image processing model according to the embodiments of the present disclosure are also capable of realizing new view image composition based on only monocular images or videos.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the claims.

Claims (12)

1. An image processing method, comprising:
Acquiring a plurality of input images;
Determining canonical attribute information and structural feature information of point clouds included in the plurality of input images under a canonical space;
Transforming the canonical attribute information into time domain attribute information of the point cloud in a time domain space by using a transformation network based on the structural feature information;
Obtaining rendered images of the plurality of input images based on the time domain attribute information.
2. The image processing method according to claim 1, wherein the step of determining canonical attribute information of the point cloud included in the plurality of input images under a canonical space includes:
determining canonical attribute information of each three-dimensional gaussian point in a set of three-dimensional gaussian points corresponding to the point cloud in a canonical space,
Wherein the specification attribute information and/or the time domain attribute information includes at least one of position information, rotation information, and size information.
3. The image processing method according to claim 2, wherein the step of determining structural feature information of the point clouds included in the plurality of input images under a canonical space includes: performing feature extraction and feature fusion based on the position information of each three-dimensional Gaussian point to obtain structural feature information of each three-dimensional Gaussian point, and
Wherein the step of transforming the canonical attribute information into time domain attribute information of the point cloud in time domain space by using a transformation network based on the structural feature information includes: feature decoding is performed based on the canonical attribute information and the structural feature information of each three-dimensional gaussian point to determine the time-domain attribute information of each three-dimensional gaussian point by using the transformation network.
4. The image processing method according to claim 3, wherein the step of performing feature extraction and feature fusion to obtain structural feature information of each three-dimensional gaussian point based on the position information of each three-dimensional gaussian point comprises:
obtaining structural information of voxels by performing feature extraction on one or more three-dimensional gaussian points using UNet structure based on the position information of each three-dimensional gaussian point;
Based on the position information of each three-dimensional Gaussian point, obtaining point characteristic information of the three-dimensional Gaussian points by carrying out characteristic extraction on each three-dimensional Gaussian point by using a first neural network model;
based on the structural information of the voxels and the point characteristic information of the three-dimensional Gaussian points, the structural characteristic information of each three-dimensional Gaussian point is obtained by carrying out characteristic fusion by using a second neural network model.
5. The image processing method according to claim 3, wherein the step of feature-decoding based on the specification attribute information and the structural feature information of each three-dimensional gaussian point to determine the time-domain attribute information of each three-dimensional gaussian point by using the transformation network comprises:
Determining time-based variation attribute information of each three-dimensional gaussian point by performing gaussian transformation on the specification attribute information using a third neural network model based on the structural feature information;
Determining time-based time-domain attribute information of each three-dimensional gaussian point based on the canonical attribute information of each three-dimensional gaussian point and the time-based variation attribute information,
Wherein the time-based change attribute information includes at least one of position change information, rotation change information, and size change information.
6. The image processing method according to any one of claims 1 to 5, wherein the plurality of input images includes a plurality of monocular images captured at different times and/or different positions.
7. A method of training an image processing model, comprising:
Acquiring a plurality of training images;
Performing the image processing method according to any one of claims 1 to 6 using the image processing model for the plurality of training images;
calculating a loss based on a rendered image obtained by the image processing method and a training image corresponding to the rendered image among the plurality of training images;
The image processing model is trained by adjusting parameters of the image processing model based on the loss.
8. The training method of claim 7, wherein training the image processing model by adjusting parameters of the image processing model based on the loss comprises:
When training the image processing model, determining gradient information of three-dimensional Gaussian points in a three-dimensional Gaussian point set corresponding to the point cloud for each training;
Determining whether to perform a density control operation based on the time domain attribute information and the gradient information, wherein the density control operation includes changing a number of three-dimensional gaussian points.
9. An image processing apparatus comprising:
An image acquisition unit configured to acquire a plurality of input images;
an attribute determining unit configured to determine specification attribute information and structural feature information of a point cloud included in the plurality of input images under a specification space;
a transformation processing unit configured to transform the specification attribute information into time domain attribute information of the point cloud in a time domain space by using a transformation network based on the structural feature information;
An image rendering unit configured to obtain rendered images of the plurality of input images based on the time domain attribute information.
10. A training apparatus for an image processing model, comprising:
An image acquisition unit configured to acquire a plurality of training images;
a model prediction unit configured to perform the image processing method according to any one of claims 1 to 6 using the image processing model for the plurality of training images;
a loss calculation unit configured to calculate a loss based on the rendered image and a training image corresponding to the rendered image among the plurality of training images;
and a parameter adjustment unit configured to train the image processing model by adjusting parameters of the image processing model based on the loss.
11. An electronic device, comprising:
At least one processor;
At least one memory storing computer-executable instructions,
Wherein the computer executable instructions, when executed by the at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 8.
12. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 8.
CN202311544261.5A 2023-11-17 2023-11-17 Image processing method and device and training method and device of image processing model Pending CN117934677A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311544261.5A CN117934677A (en) 2023-11-17 2023-11-17 Image processing method and device and training method and device of image processing model

Publications (1)

Publication Number Publication Date
CN117934677A true CN117934677A (en) 2024-04-26

Family

ID=90754462

Country Status (1)

Country Link
CN (1) CN117934677A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118196306A (en) * 2024-05-15 2024-06-14 广东工业大学 3D modeling reconstruction system, method and device based on point cloud information and Gaussian cloud cluster

Legal Events

Date Code Title Description
PB01 Publication