CN117939257A - Automatic video synthesis method based on three-dimensional Gaussian neural radiation field - Google Patents

Automatic video synthesis method based on three-dimensional Gaussian neural radiation field

Info

Publication number
CN117939257A
Authority
CN
China
Prior art keywords
video
target
information
light field
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410225433.0A
Other languages
Chinese (zh)
Inventor
刘祖渊
杨白云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Star River Vision Technology Beijing Co ltd
Original Assignee
Star River Vision Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Star River Vision Technology Beijing Co ltd filed Critical Star River Vision Technology Beijing Co ltd
Priority to CN202410225433.0A priority Critical patent/CN117939257A/en
Publication of CN117939257A publication Critical patent/CN117939257A/en
Pending legal-status Critical Current

Abstract

The application relates to an automatic video synthesis method based on a three-dimensional Gaussian neural radiation field, which is applied to the technical field of video implantation and comprises the following steps: acquiring a target composite video, wherein the target composite video comprises a target object; performing camera pose estimation on the target composite video based on COLMAP, and determining camera pose information of the target composite video; extracting light field information from the target composite video based on a DEPI algorithm, and determining video light field information of the target composite video; generating a three-dimensional model of the target object based on the camera pose information and a three-dimensional Gaussian neural radiation field model; and automatically generating a composite video containing the target object based on the target composite video, the video light field information, and the three-dimensional model. The application has the effect of improving the video quality of the composite video.

Description

Automatic video synthesis method based on three-dimensional Gaussian neural radiation field
Technical Field
The application relates to the technical field of video implantation, in particular to an automatic video synthesis method based on a three-dimensional Gaussian neural radiation field.
Background
When editing video content, the current practice is to decompose the object video to be implanted into a sequence of consecutive single-frame pictures, blur those pictures by a certain coefficient, and directly fuse the processed pictures with the original video in the order of decomposition.
However, in this processing scheme the implanted object video or pictures often depend on a huge material library. If the material library lacks pictures of certain viewing angles, or the pictures are of poor quality, the final composite video may be distorted and cannot be well fused with the original video content. Improving the quality of the fusion therefore requires increasing the quantity and quality of the materials, which raises the fusion cost. In addition, because different videos are shot in different environments, the same material is presented differently in each video, so merely fusing pictures from the material library still leaves the composite video seriously distorted.
Disclosure of Invention
In order to improve the video quality of the synthesized video, the application provides an automatic video synthesis method based on a three-dimensional Gaussian neural radiation field.
In a first aspect, the application provides an automated video synthesis method based on a three-dimensional Gaussian neural radiation field, which adopts the following technical scheme:
An automated video synthesis method based on a three-dimensional Gaussian neural radiation field, comprising:
acquiring a target synthetic video, wherein the target synthetic video comprises a target object;
performing camera pose estimation on the target synthesized video based on COLMAP, and determining camera pose information of the target synthesized video;
Extracting light field information of the target synthesized video based on a DEPI algorithm, and determining video light field information of the target synthesized video;
generating a three-dimensional model of the target object based on the camera pose information and a three-dimensional Gaussian neural radiation field model;
automatically generating a composite video containing the target object based on the target composite video, the video light field information, and the three-dimensional model.
By adopting the technical scheme, a better video editing effect can be obtained from a smaller material library at a lower cost; by combining the camera pose information of the video with the light field information, the final composite video is more realistic, so that the video quality of the composite video is improved; and a technical closed loop from the original video to the composite video is formed, which reduces manual operation, lowers labor cost, and shortens the project cycle.
Optionally, the performing camera pose estimation on the target synthesized video based on COLMAP and determining the camera pose information of the target synthesized video includes:
Dividing the target composite video into a sequence of images based on frames;
Inputting the sequence of images to the COLMAP;
Extracting and matching feature points based on the image sequence by COLMAP, and generating matching relationships between the feature points;
And determining camera pose information of the target synthesized video based on the feature points and the matching relation.
Optionally, the video light field information includes depth information and angle information; and the extracting light field information of the target synthesized video based on the DEPI algorithm and determining the video light field information of the target synthesized video includes:
performing key feature extraction on each frame of the image sequence to determine key visual features;
determining depth information based on the DEPI algorithm and the key visual features;
generating a parallax plane image based on the DEPI algorithm and the depth information;
and determining angle information based on the parallax plane image.
Optionally, the generating the three-dimensional model of the target object based on the camera pose information and the three-dimensional gaussian neural radiation field model includes:
acquiring an object image of the target object;
determining a pose to be used based on the object image and the camera pose information;
extracting key features of the object image to generate key feature points;
Converting the key feature points into Gaussian splatter points;
and constructing a three-dimensional model of the target object based on the Gaussian splatter points.
Optionally, the constructing the three-dimensional model of the target object based on the Gaussian splatter points includes:
fusing the Gaussian splatter points, and taking the fused model as the three-dimensional model of the target object.
Optionally, the automatically generating the composite video containing the target object based on the target composite video, the video light field information, and the three-dimensional model includes:
generating rendering parameters based on the camera pose information and the video light field information;
performing projection plane processing on the Gaussian splatter points to generate a processed image;
fusing the processed image and the rendering parameters to generate a rendered image;
locating a replacement position of the target object based on the target composite video;
and generating a composite video containing the target object based on the rendered image, the replacement position, and the three-dimensional model.
Optionally, the locating the replacement position of the target object based on the target composite video includes:
performing feature extraction and object recognition based on OA-SLAM and the target composite video;
and associating the extracted features with the recognized objects, and determining the replacement position based on the association result.
In a second aspect, the application provides an automated video synthesis device based on a three-dimensional Gaussian neural radiation field, which adopts the following technical scheme:
An automated video synthesis device based on a three-dimensional Gaussian neural radiation field, comprising:
a target video acquisition module, which is used for acquiring a target synthetic video, wherein the target synthetic video comprises a target object;
the pose information determining module is used for performing camera pose estimation on the target synthetic video based on COLMAP and determining camera pose information of the target synthetic video;
the light field information determining module is used for extracting light field information of the target synthetic video based on a DEPI algorithm and determining video light field information of the target synthetic video;
the three-dimensional model generation module is used for generating a three-dimensional model of the target object based on the camera pose information and the three-dimensional Gaussian neural radiation field model;
and the synthesized video generation module is used for automatically generating synthesized video containing the target object based on the target synthesized video, the video light field information and the three-dimensional model.
By adopting the technical scheme, a better video editing effect can be obtained from a smaller material library at a lower cost; by combining the camera pose information of the video with the light field information, the final composite video is more realistic, so that the video quality of the composite video is improved; and a technical closed loop from the original video to the composite video is formed, which reduces manual operation, lowers labor cost, and shortens the project cycle.
In a third aspect, the present application provides an electronic device, which adopts the following technical scheme:
An electronic device comprising a processor coupled with a memory;
The processor is configured to execute a computer program stored in the memory to cause the electronic device to perform the automated video synthesis method based on a three-dimensional Gaussian neural radiation field according to any one of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium, which adopts the following technical scheme:
A computer readable storage medium storing a computer program capable of being loaded by a processor and executing the automated video compositing method based on a three-dimensional gaussian neural radiation field of any of the first aspects.
Drawings
Fig. 1 is a schematic flow chart of an automated video synthesis method based on a three-dimensional gaussian neural radiation field according to an embodiment of the present application.
Fig. 2 is a block diagram of an automated video synthesis device based on a three-dimensional gaussian neural radiation field according to an embodiment of the present application.
Fig. 3 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the accompanying drawings.
The embodiment of the application provides an automated video synthesis method based on a three-dimensional Gaussian neural radiation field, which can be executed by an electronic device. The electronic device may be a server or a terminal device; the server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The terminal device may be, but is not limited to, a smart phone, a tablet computer, a desktop computer, or the like.
Fig. 1 is a schematic flow chart of an automated video synthesis method based on a three-dimensional gaussian neural radiation field according to an embodiment of the present application.
As shown in fig. 1, the main flow of the method is described as follows (steps S101 to S105):
step S101, a target composite video is acquired, wherein the target composite video includes a target object.
In this embodiment, the target composite video is an original video to be synthesized, which is directly captured by a camera, and the target composite video includes at least one target object, and when the target object is synthesized, the synthesis process is performed by using the target object as a main body.
Step S102, camera pose estimation is carried out on the target synthesized video based on COLMAP, and camera pose information of the target synthesized video is determined.
For step S102, dividing the target composite video into an image sequence based on frames; inputting the image sequence to COLMAP; extracting and matching feature points based on the image sequence by COLMAP to generate matching relationships between the feature points; and determining camera pose information of the target composite video based on the feature points and the matching relationships.
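As an illustrative, non-limiting sketch of this step (not the claimed implementation), the following Python code splits a video into an image sequence with OpenCV and invokes the COLMAP command-line tools for feature extraction, matching and sparse mapping; it assumes OpenCV and the COLMAP executable are installed, and the file and directory names are hypothetical.

```python
import subprocess
from pathlib import Path

import cv2  # OpenCV, used only to decompose the video into frames


def split_video_to_frames(video_path: str, frame_dir: str) -> None:
    """Decompose the target composite video into an ordered image sequence."""
    Path(frame_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(f"{frame_dir}/{index:05d}.png", frame)
        index += 1
    cap.release()


def run_colmap(frame_dir: str, workspace: str) -> None:
    """Feature extraction, exhaustive matching and sparse mapping via the COLMAP CLI."""
    db = f"{workspace}/database.db"
    sparse = f"{workspace}/sparse"
    Path(sparse).mkdir(parents=True, exist_ok=True)
    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", db, "--image_path", frame_dir], check=True)
    subprocess.run(["colmap", "exhaustive_matcher", "--database_path", db], check=True)
    subprocess.run(["colmap", "mapper", "--database_path", db,
                    "--image_path", frame_dir, "--output_path", sparse], check=True)


split_video_to_frames("target_composite.mp4", "frames")   # hypothetical file names
run_colmap("frames", "colmap_workspace")
```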
In this embodiment, the target composite video is divided into a plurality of individual frames, the individual frames are extracted as images to obtain an image sequence, and the image sequence is input to COLMAP, which creates a database containing each frame image. COLMAP then performs feature point detection on each frame image in the image sequence. The feature points are typically salient points in the image, such as corner points, edges or other distinct texture regions, and are selected because they can be easily identified and tracked across different frames. For each detected feature point a feature descriptor is extracted; the descriptor is a numerical vector that encodes information about the region around the feature point, such as texture, color and local shape. COLMAP then attempts to match feature points between different frames by comparing their descriptors: if feature points in two different frames have similar descriptors, they are considered matched. Feature matching typically uses a metric such as Euclidean distance to evaluate the similarity between descriptors, so that COLMAP can confirm the correspondence of feature points between different frames. COLMAP generally extracts features with the SIFT algorithm, and a descriptor generally takes the form [c1, c2, c3, ..., cn, s, θ, g1, g2, g3, ..., gm], where ci represents a component of the color information, typically between 0 and 255; s represents scale information, a positive real number; θ represents orientation information, typically between 0 and 360 degrees; and gj represents a component of the gradient histogram.
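In one example, the feature detection and descriptor matching described above can be approximated with OpenCV's SIFT implementation and a brute-force matcher using Euclidean (L2) distance; this sketch is illustrative only, is not COLMAP's internal implementation, and the frame paths are hypothetical.

```python
import cv2


def match_adjacent_frames(path_a: str, path_b: str):
    """Detect SIFT feature points in two frames and match their descriptors
    by Euclidean (L2) distance with Lowe's ratio test."""
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)              # Euclidean distance between descriptors
    candidates = matcher.knnMatch(des_a, des_b, k=2)
    good = [pair[0] for pair in candidates
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
    pts_a = [kp_a[m.queryIdx].pt for m in good]       # matched point coordinates in frame A
    pts_b = [kp_b[m.trainIdx].pt for m in good]       # corresponding coordinates in frame B
    return pts_a, pts_b


pts_a, pts_b = match_adjacent_frames("frames/00000.png", "frames/00001.png")
```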
With the feature points already matched, COLMAP performs the operation of initializing the camera pose. The camera pose information includes the position of the camera and the direction of the camera, and the initialization generally includes two main steps: estimating a fundamental matrix and performing triangulation. Before initialization, the camera pose is roughly denoted R, t, where R represents a rotation matrix and t represents a translation vector; it is typically initialized to null or set to an initial value, which may be a zero matrix, an identity matrix, or a preliminary estimate derived from some heuristic method. During initialization, COLMAP estimates the fundamental matrix from the matched feature points using methods such as RANSAC; the fundamental matrix describes the geometric relationship between the two cameras, including rotation and translation. The matrix is then decomposed to obtain R and t, and, using the known intrinsic parameters of the cameras together with this matrix, COLMAP triangulates the matched feature points to obtain their coordinates in three-dimensional space, thereby obtaining the initialized camera pose information R' and t'.
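The initialization described above can be approximated for two frames with known intrinsics K by estimating the essential matrix with RANSAC, decomposing it into R and t, and triangulating the matched points; the sketch below uses OpenCV and is a simplified stand-in for COLMAP's initializer (it works with the essential rather than the fundamental matrix because the intrinsics are assumed known, and the intrinsic values are illustrative).

```python
import numpy as np
import cv2


def initialize_pose(pts_a, pts_b, K):
    """Estimate the relative pose (R, t) between two frames from matched feature
    points and triangulate those points into 3D coordinates."""
    pts_a = np.asarray(pts_a, dtype=np.float64)
    pts_b = np.asarray(pts_b, dtype=np.float64)
    # RANSAC-based estimation of the essential matrix from the matched points
    E, _ = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC,
                                prob=0.999, threshold=1.0)
    # Decompose into rotation R and translation t (the initialized camera pose)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K)
    # Triangulate the matches to obtain their coordinates in 3D space
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # first camera at the origin
    P1 = K @ np.hstack([R, t])                           # second camera at (R, t)
    pts_h = cv2.triangulatePoints(P0, P1, pts_a.T, pts_b.T)
    pts_3d = (pts_h[:3] / pts_h[3]).T                    # homogeneous -> Euclidean
    return R, t, pts_3d


K = np.array([[1000.0, 0.0, 960.0],      # illustrative intrinsic parameters
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R, t, points_3d = initialize_pose(pts_a, pts_b, K)
```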
After the camera pose information is determined, COLMAP uses multi-view triangulation: for each pair of matched feature points, COLMAP uses their positions in the respective images together with the camera poses to establish a three-dimensional coordinate system, and by solving a geometric problem, namely how to make the projections of points in three-dimensional space best agree with their positions in the respective images, the position of each point in three-dimensional space can be determined, so that the three-dimensional point cloud of the scene is computed. As new video frames and feature points are added, COLMAP continues to update and optimize the three-dimensional point cloud and the camera poses. This step includes minimizing the re-projection error, that is, reducing the difference between the re-projected positions of the three-dimensional points in the camera views and their actually observed positions. An optimization step, specifically global bundle adjustment, is then performed with the aim of improving the consistency and accuracy of the three-dimensional point cloud and camera poses across the whole scene, so that the three-dimensional structures reconstructed from different frames are globally consistent.
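The re-projection error that bundle adjustment minimizes can be written as a short check; the following sketch (OpenCV, illustrative only) measures the mean distance between observed feature points and the projections of the triangulated 3D points under a given pose.

```python
import numpy as np
import cv2


def mean_reprojection_error(points_3d, observed_2d, R, t, K):
    """Mean distance between observed feature points and the projections of the
    corresponding 3D points under pose (R, t); global bundle adjustment jointly
    refines all camera poses and 3D points to minimize this quantity."""
    rvec, _ = cv2.Rodrigues(R)                       # rotation matrix -> rotation vector
    projected, _ = cv2.projectPoints(np.asarray(points_3d, dtype=np.float64),
                                     rvec, np.asarray(t, dtype=np.float64), K, None)
    projected = projected.reshape(-1, 2)
    observed = np.asarray(observed_2d, dtype=np.float64)
    return float(np.mean(np.linalg.norm(projected - observed, axis=1)))


# Points triangulated in the first view should re-project close to their
# observations in the second view (pts_b from the earlier sketch).
print(mean_reprojection_error(points_3d, pts_b, R, t, K))
```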
Step S103, extracting light field information of the target synthetic video based on the DEPI algorithm, and determining video light field information of the target synthetic video.
For step S103, the video light field information includes depth information and angle information; performing key feature extraction on each frame of the image sequence to determine key visual features; determining depth information based on the DEPI algorithm and the key visual features; generating a parallax plane image based on the DEPI algorithm and the depth information; and determining angle information based on the parallax plane image.
In this embodiment, the DEPI algorithm first processes each video frame to extract key visual features such as edges, textures and colors. These features are critical for understanding the geometry and depth information of the scene. Based on these features, the DEPI algorithm creates a depth map that shows the relative depths of different objects in the scene of the target composite video. The depth map is a two-dimensional representation in which the value of each pixel represents the depth of the corresponding scene point, and it yields the depth information of the light field: the depth of each point in the video relative to the camera, typically represented as a two-dimensional matrix matching the resolution of the video frame. For example, for a video frame of 1920x1080 resolution, the depth information would be a 1920x1080 matrix in which each element represents the depth value of the corresponding pixel. Next, the DEPI algorithm captures the ray paths in the video using depth parallax plane images (DPI, also known as epipolar plane images). The angle information describes the paths of rays when the scene is viewed from different angles and is used to determine how rays reflect and refract on different surfaces of an object. Its representation is more complex because spatial position and viewing angle must be considered simultaneously; it can be represented by a high-dimensional array that includes the spatial coordinates (x, y) and the viewing direction, e.g., azimuth and pitch angles. With two angular dimensions, the angle information for 1920x1080 video frames may be an array of size 1920x1080xNxN, where N is the number of different viewing angles.
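The exact DEPI algorithm is not specified beyond the description above; as a hedged illustration of how an epipolar plane image can yield depth-related information, the following NumPy sketch slices a stack of frames at one image row to form an EPI and estimates per-pixel disparity from the local slope of the EPI lines (depth then follows from disparity given the camera geometry). It is an assumption-laden stand-in, not the patented procedure.

```python
import numpy as np


def epipolar_plane_image(frames: np.ndarray, row: int) -> np.ndarray:
    """Slice a (num_views, H, W) grayscale frame stack at one image row to form
    an epipolar plane image (EPI) of shape (num_views, W)."""
    return frames[:, row, :]


def disparity_from_epi(epi: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Estimate per-pixel disparity for the central view from the slope of the
    lines traced in the EPI: a scene point moves across views with a slope
    proportional to its disparity, and depth = focal_length * baseline / disparity."""
    g_s, g_x = np.gradient(epi.astype(np.float64))        # view-axis and spatial-axis gradients
    safe_gx = np.where(np.abs(g_x) < eps, eps, g_x)
    slope = -g_s / safe_gx                                 # line slope ~ disparity
    return slope[epi.shape[0] // 2]                        # disparity row for the central view


# Collecting disparity_from_epi over all rows yields an H x W depth-related map;
# angle information can additionally be stored as an (H, W, N, N) array over viewing directions.
```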
And step S104, generating a three-dimensional model of the target object based on the camera pose information and the three-dimensional Gaussian neural radiation field model.
For step S104, acquiring an object image of the target object; determining a pose to be used based on the object image and the camera pose information; extracting key features of the object image to generate key feature points; converting the key feature points into Gaussian splatter points; and constructing a three-dimensional model of the target object based on the Gaussian splatter points.
Further, the Gaussian splatter points are fused, and the fused model is used as the three-dimensional model of the target object.
In this embodiment, an object image of the target object and the camera pose information corresponding to the object image are input into the three-dimensional Gaussian radiation field model, where the camera pose is the pose initialized and adjusted in step S102. The three-dimensional Gaussian radiation field performs key feature extraction on the input object image, that is, extracts key feature points with significant features, converts each key feature point into a Gaussian splatter point, and then fuses all the Gaussian splatter points together to construct a complete three-dimensional model of the target object. A Gaussian splatter point can accurately represent the position and shape of the original key feature point in three-dimensional space, and includes a position, a distribution range, and an intensity or density. The position information is a three-dimensional coordinate representing the location of the Gaussian splatter point in three-dimensional space. The distribution range is the standard deviation, or radius, of the Gaussian distribution and represents the spatial extent of the splatter point; it determines the degree of influence of the splatter point on the surrounding space: the larger the standard deviation, the larger the influenced area, and the smaller the standard deviation, the more limited the area of influence. The peak of the Gaussian distribution represents the intensity or density of the Gaussian splatter point and is used to represent its physical characteristics in three-dimensional modeling, such as color intensity, reflectance and transparency. When multiple Gaussian splatter points overlap in space, the overlapping splatter points are summed with different weights, and the weights can be determined according to factors such as distance and density.
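As an illustrative data structure (not the model's actual internal representation), a Gaussian splatter point can be recorded with its position, distribution range and intensity, and overlapping points can be fused with distance-dependent weights as described above; the sketch below assumes isotropic Gaussians for simplicity, and the field names are assumptions made for this sketch.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class GaussianSplat:
    position: np.ndarray   # 3D centre (x, y, z) of the splatter point
    sigma: float           # standard deviation: spatial extent / degree of influence
    intensity: float       # peak of the Gaussian: colour intensity, reflectance or opacity proxy
    color: np.ndarray      # RGB colour in [0, 1]


def fuse_overlapping(splats, query: np.ndarray) -> np.ndarray:
    """Weighted fusion of overlapping Gaussian splatter points at a 3D query point;
    each contribution falls off with distance according to its own Gaussian."""
    total, blended = 0.0, np.zeros(3)
    for s in splats:
        d2 = float(np.sum((query - s.position) ** 2))
        w = s.intensity * np.exp(-d2 / (2.0 * s.sigma ** 2))   # Gaussian falloff weight
        total += w
        blended += w * s.color
    return blended / total if total > 0.0 else blended
```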
The successfully constructed three-dimensional model includes geometric information, surface details, color information, physical properties and depth information. The geometric information includes the shape and size of the object; through the spatial distribution and fusion of the Gaussian splatter points, the three-dimensional geometric form of the target object can be accurately depicted. The surface details include the texture and the subtle geometric variations and details of the object surface. The color information comes from the input image and is integrated into the three-dimensional model, so that the model not only has the geometric shape of the object but also presents its true color and texture. The physical properties contain information about the physical characteristics of the surface of the target object, such as reflectivity, transparency and roughness, and are used to achieve a more realistic visual effect in the subsequent rendering process. The depth information is used to create a better stereoscopic effect.
Step S105 automatically generates a composite video containing the target object based on the target composite video, the video light field information, and the three-dimensional model.
For step S105, generating rendering parameters based on the camera pose information and the video light field information; performing projection plane processing on the Gaussian splatter points to generate a processed image; fusing the processed image and the rendering parameters to generate a rendered image; locating a replacement position of the target object based on the target composite video; and generating a composite video containing the target object based on the rendered image, the replacement position, and the three-dimensional model.
Further, feature extraction and object recognition are performed based on OA-SLAM and the target composite video; the extracted features are associated with the recognized objects, and the replacement position is determined based on the association result.
In this embodiment, the viewing-angle parameters for rendering, including the camera position, direction and field of view, are set based on the camera pose information obtained by COLMAP, which ensures that the rendering viewpoint is consistent with the viewpoint of the original image; the light source parameters for rendering, such as the direction and intensity of the light source, are set based on the light field information extracted by the DEPI algorithm, so that real-world illumination effects can be simulated. The viewing-angle parameters and the light source parameters together serve as the rendering parameters. The Gaussian splatter points in three-dimensional space are then projected onto a two-dimensional viewing plane, i.e., a projection plane, which refers to a two-dimensional space used to display the image mapped from three-dimensional space. Specifically, the position and size of each Gaussian splatter point on the two-dimensional plane are calculated, and on the viewing plane the Gaussian kernel of each splatter point is appropriately transformed and scaled to match its distribution in three-dimensional space; the transformation and scaling are determined by the Gaussian kernel, i.e., the Gaussian distribution, of each splatter point.
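For illustration, the viewing-angle and light source parameters can be gathered into a single rendering-parameter structure; the field names below are assumptions made for this sketch rather than terms defined by the application.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class RenderParams:
    R: np.ndarray            # camera rotation (viewing direction), from the camera pose information
    t: np.ndarray            # camera translation / position
    fov_deg: float           # horizontal field of view derived from the intrinsics
    light_dir: np.ndarray    # dominant light direction estimated from the angle information
    light_intensity: float   # light intensity estimated from the light field


def make_render_params(R, t, K, image_width, light_dir, light_intensity):
    fov = 2.0 * np.degrees(np.arctan(image_width / (2.0 * K[0, 0])))   # horizontal FOV
    unit_dir = np.asarray(light_dir, dtype=np.float64)
    return RenderParams(R, t, fov, unit_dir / np.linalg.norm(unit_dir), light_intensity)
```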
On the two-dimensional plane there will be overlapping Gaussian splatter points, which need to be fused, typically by weighted averaging or another mathematical operation on the Gaussian distributions of the overlapping region.
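A heavily simplified projection-and-blending sketch follows: each isotropic Gaussian splatter point (using the GaussianSplat fields sketched earlier) is projected through a pinhole model, its kernel is scaled by depth, and overlapping contributions are combined by a weighted average. Full 3D Gaussian splatting rasterization additionally handles anisotropic covariances and view-dependent color, which are omitted here.

```python
import numpy as np


def render_splats(splats, R, t, K, height, width):
    """Project isotropic Gaussian splatter points onto the image plane with a
    pinhole model, scale each kernel by depth, and blend overlaps by weighted averaging."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    image = np.zeros((height, width, 3))
    weight = np.zeros((height, width, 1))
    for s in splats:
        p_cam = R @ s.position + np.ravel(t)
        if p_cam[2] <= 0:
            continue                                    # behind the camera
        u = fx * p_cam[0] / p_cam[2] + cx
        v = fy * p_cam[1] / p_cam[2] + cy
        sigma_px = max(fx * s.sigma / p_cam[2], 0.5)    # kernel size shrinks with depth
        r = int(3 * sigma_px)
        v0, v1 = max(int(v) - r, 0), min(int(v) + r + 1, height)
        u0, u1 = max(int(u) - r, 0), min(int(u) + r + 1, width)
        if v0 >= v1 or u0 >= u1:
            continue                                    # projected outside the view plane
        vs, us = np.mgrid[v0:v1, u0:u1]
        w = s.intensity * np.exp(-((us - u) ** 2 + (vs - v) ** 2) / (2.0 * sigma_px ** 2))
        image[v0:v1, u0:u1] += w[..., None] * s.color
        weight[v0:v1, u0:u1] += w[..., None]
    return image / np.clip(weight, 1e-8, None)          # weighted average of overlapping splats
```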
OA-SLAM (Object-Aware Simultaneous Localization and Mapping) first extracts key frames, i.e., key moments or views, from the input video stream or image sequence. The OA-SLAM algorithm detects feature points in the key frames; feature points are salient, easily tracked positions in the image, such as corner points and edges. OA-SLAM then recognizes specific objects in the images of the image sequence, e.g., chairs, tables and doors, and creates a database for each recognized object, recording its characteristics and appearance information. In the process of recognizing a specific object, the three-dimensional structure and position of the object can be understood more accurately by inputting the point cloud data of the object; the point cloud data represent the object surfaces in the environment, and by combining the point cloud data, object recognition is no longer limited to a two-dimensional plane but can also be performed in three-dimensional space, which improves the accuracy and reliability of the recognition. By tracking the movement of the feature points and the exact position of each Gaussian splatter point provided by the point cloud data, the OA-SLAM algorithm can simultaneously locate the position of the camera and build a three-dimensional map of the environment, and it combines the recognized object positions with the built map, thereby achieving spatial localization of these objects. The recognized objects are associated with feature points to accurately represent their locations in the map, and as new data are continuously acquired the algorithm updates the map, including the locations of the objects and the layout of the environment. Loop detection is then used to detect whether a place has been revisited so as to correct any accumulated positioning error, and an optimization technique is used to adjust the map and the camera trajectory to improve overall accuracy. The final output of OA-SLAM includes a dynamic three-dimensional map of the environment and the locations of objects in it, and provides information about the precise position and orientation of the recognized objects for the specific application. Finally, the rendered object video is combined with the original video, that is, the region at the target position of each original video frame is replaced by the rendered video.
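The final replacement of the target region can be expressed as a simple alpha blend of the rendered patch into the original frame at the located replacement position; the sketch below is illustrative, with the mask and coordinates assumed to come from the localization and rendering steps above.

```python
import numpy as np


def composite_frame(original: np.ndarray, rendered: np.ndarray,
                    alpha: np.ndarray, top_left: tuple) -> np.ndarray:
    """Blend a rendered object patch into the original frame at the located
    replacement position; alpha is a per-pixel coverage mask in [0, 1]."""
    y, x = top_left
    h, w = rendered.shape[:2]
    out = original.copy()
    roi = out[y:y + h, x:x + w].astype(np.float64)
    blend = alpha[..., None] * rendered + (1.0 - alpha[..., None]) * roi
    out[y:y + h, x:x + w] = blend.astype(original.dtype)
    return out
```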
In this embodiment, if the target object's surroundings are incompletely captured, the point cloud data generated by 3D Gaussian splatting need to be used to locate the spatial position: a point cloud can be selected to fit the 3D coordinate position, so as to adjust the reference target position. For example, when a bottle of cola placed on a table is filmed, the table may be only partially captured, so the placement position of the cola would be inaccurate; the position is therefore determined from the point cloud data in order to generate a complete and accurate video containing the rendered object.
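One way to realize the point-cloud-based position adjustment in the cola-on-a-table example is to fit the supporting surface with a least-squares plane and project the tentative placement point onto it; the following sketch is an assumption about how such a fit could be done, not the application's prescribed procedure.

```python
import numpy as np


def fit_support_plane(points: np.ndarray):
    """Least-squares plane fit (via SVD) to point-cloud samples of the supporting
    surface, e.g. the partially captured table; returns the centroid and unit normal."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                                  # direction of least variance = plane normal
    return centroid, normal / np.linalg.norm(normal)


def snap_to_plane(position: np.ndarray, centroid: np.ndarray, normal: np.ndarray):
    """Project a tentative placement position onto the fitted plane."""
    return position - np.dot(position - centroid, normal) * normal
```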
Fig. 2 is a block diagram of an automated video synthesis device 200 based on a three-dimensional Gaussian neural radiation field according to an embodiment of the present application.
As shown in fig. 2, the automated video synthesis device 200 based on a three-dimensional Gaussian neural radiation field mainly includes:
a target video acquisition module 201, configured to acquire a target composite video, where the target composite video includes a target object;
a pose information determining module 202, configured to perform camera pose estimation on the target composite video based on COLMAP and determine camera pose information of the target composite video;
the light field information determining module 203 is configured to perform light field information extraction on the target synthesized video based on a DEPI algorithm, and determine video light field information of the target synthesized video;
A three-dimensional model generation module 204, configured to generate a three-dimensional model of the target object based on the camera pose information and the three-dimensional gaussian neural radiation field model;
The composite video generating module 205 is configured to automatically generate a composite video including the target object based on the target composite video, the video light field information and the three-dimensional model.
As an optional implementation manner of this embodiment, the pose information determining module 202 is specifically configured to divide the target composite video into an image sequence based on frames; input the image sequence to COLMAP; extract and match feature points based on the image sequence by COLMAP to generate matching relationships between the feature points; and determine camera pose information of the target composite video based on the feature points and the matching relationships.
As an optional implementation manner of this embodiment, the light field information determining module 203 is specifically configured to perform key feature extraction on each frame of the image sequence and determine key visual features; determine depth information based on the DEPI algorithm and the key visual features; generate a parallax plane image based on the DEPI algorithm and the depth information; and determine angle information based on the parallax plane image.
As an alternative implementation of the present embodiment, the three-dimensional model generating module 204 includes:
the object image acquisition module is used for acquiring an object image of the target object;
the to-be-used pose determining module is used for determining the pose to be used based on the object image and the camera pose information;
the key feature extraction module is used for extracting key features of the object image and generating key feature points;
the Gaussian splatter conversion module is used for converting the key feature points into Gaussian splatter points;
and the three-dimensional model construction module is used for constructing a three-dimensional model of the target object based on the Gaussian splatter points.
In this optional embodiment, the three-dimensional model construction module is specifically configured to fuse the Gaussian splatter points and take the fused model as the three-dimensional model of the target object.
As an alternative implementation of this embodiment, the composite video generating module 205 includes:
the rendering parameter generation module is used for generating rendering parameters based on the camera pose information and the video light field information;
the processed image generation module is used for performing projection plane processing on the Gaussian splatter points to generate a processed image;
the rendered image generation module is used for fusing the processed image with the rendering parameters to generate a rendered image;
the replacement position determining module is used for locating a replacement position of the target object based on the target composite video;
and the composite video creation module is used for generating a composite video containing the target object based on the rendered image, the replacement position and the three-dimensional model.
In this alternative embodiment, the replacement position determining module is specifically configured to perform feature extraction and object recognition based on OA-SLAM and the target composite video, associate the extracted features with the recognized objects, and determine the replacement position based on the association result.
In one example, a module in any of the above apparatuses may be one or more integrated circuits configured to implement the above methods, for example: one or more application-specific integrated circuits (ASIC), or one or more digital signal processors (DSP), or one or more field-programmable gate arrays (FPGA), or a combination of at least two of these integrated circuit forms.
For another example, when a module in an apparatus may be implemented in the form of a scheduler of processing elements, the processing elements may be general-purpose processors, such as a central processing unit (central processing unit, CPU) or other processor that may invoke a program. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
Fig. 3 is a block diagram of an electronic device 300 according to an embodiment of the present application.
As shown in FIG. 3, the electronic device 300 includes a processor 301 and a memory 302, and may further include one or more of an information input/information output (I/O) interface 303, a communication component 304, and a communication bus 305.
Wherein the processor 301 is configured to control the overall operation of the electronic device 300 to perform all or part of the steps of the above-described automated video synthesis method based on a three-dimensional Gaussian neural radiation field; the memory 302 is used to store various types of data to support operation at the electronic device 300, which may include, for example, instructions for any application or method operating on the electronic device 300, as well as application-related data. The memory 302 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as one or more of static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The I/O interface 303 provides an interface between the processor 301 and other interface modules, which may be a keyboard, a mouse, buttons, etc. These buttons may be virtual buttons or physical buttons. The communication component 304 is used for wired or wireless communication between the electronic device 300 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, near-field communication (NFC), 2G, 3G or 4G, or a combination of one or more of them; accordingly, the communication component 304 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
The electronic device 300 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components for performing the automated video synthesis method based on a three-dimensional Gaussian neural radiation field given in the above embodiments.
The communication bus 305 may include a pathway to transfer information between the aforementioned components. The communication bus 305 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus 305 may be divided into an address bus, a data bus, a control bus, and the like.
The electronic device 300 may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), car terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like, and may also be a server, and the like.
The application also provides a computer readable storage medium, wherein the computer readable storage medium is stored with a computer program, and the computer program realizes the steps of the automatic video synthesis method based on the three-dimensional Gaussian neural radiation field when being executed by a processor.
The computer readable storage medium may include: a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the application is not limited to the specific combinations of the features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the application, for example embodiments in which the above features are interchanged with technical features of similar function disclosed in (but not limited to) the present application.

Claims (10)

1. An automated video synthesis method based on a three-dimensional Gaussian neural radiation field, characterized by comprising:
acquiring a target synthetic video, wherein the target synthetic video comprises a target object;
performing camera pose estimation on the target synthesized video based on COLMAP, and determining camera pose information of the target synthesized video;
Extracting light field information of the target synthesized video based on a DEPI algorithm, and determining video light field information of the target synthesized video;
generating a three-dimensional model of the target object based on the camera pose information and a three-dimensional Gaussian neural radiation field model;
automatically generating a composite video containing the target object based on the target composite video, the video light field information, and the three-dimensional model.
2. The method of claim 1, wherein the performing camera pose estimation on the target composite video based on COLMAP and determining camera pose information of the target composite video comprises:
Dividing the target composite video into a sequence of images based on frames;
Inputting the sequence of images to the COLMAP;
Extracting and matching feature points based on the image sequence by COLMAP, and generating matching relationships between the feature points;
And determining camera pose information of the target synthesized video based on the feature points and the matching relation.
3. The method of claim 2, wherein the video light field information comprises depth information and angle information; and the extracting light field information of the target synthesized video based on the DEPI algorithm and determining the video light field information of the target synthesized video comprises:
performing key feature extraction on each frame of the image sequence to determine key visual features;
determining depth information based on the DEPI algorithm and the key visual features;
generating a parallax plane image based on the DEPI algorithm and the depth information;
and determining angle information based on the parallax plane image.
4. The method of claim 1, wherein the generating a three-dimensional model of the target object based on the camera pose information and a three-dimensional gaussian neural radiation field model comprises:
acquiring an object image of the target object;
determining a pose to be used based on the object image and the camera pose information;
extracting key features of the object image to generate key feature points;
Converting the key feature points into Gaussian splatter points;
and constructing a three-dimensional model of the target object based on the Gaussian splatter points.
5. The method of claim 4, wherein constructing the three-dimensional model of the target object based on the Gaussian splatter points comprises:
fusing the Gaussian splatter points, and taking the fused model as the three-dimensional model of the target object.
6. The method of claim 4, wherein automatically generating a composite video containing the target object based on the target composite video, the video light field information, and the three-dimensional model comprises:
generating rendering parameters based on the camera pose information and the video light field information;
performing projection plane processing on the Gaussian splatter points to generate a processed image;
fusing the processed image and the rendering parameters to generate a rendered image;
locating a replacement position of the target object based on the target composite video;
and generating a composite video containing the target object based on the rendered image, the replacement position, and the three-dimensional model.
7. The method of claim 6, wherein the locating the replacement position of the target object based on the target composite video comprises:
performing feature extraction and object recognition based on OA-SLAM and the target composite video;
and associating the extracted features with the recognized objects, and determining the replacement position based on the association result.
8. An automated video synthesis device based on a three-dimensional Gaussian neural radiation field, characterized by comprising:
a target video acquisition module, which is used for acquiring a target synthetic video, wherein the target synthetic video comprises a target object;
a pose information determining module, which is used for performing camera pose estimation on the target synthetic video based on COLMAP and determining camera pose information of the target synthetic video;
the light field information determining module is used for extracting light field information of the target synthetic video based on a DEPI algorithm and determining video light field information of the target synthetic video;
the three-dimensional model generation module is used for generating a three-dimensional model of the target object based on the camera pose information and the three-dimensional Gaussian neural radiation field model;
and the synthesized video generation module is used for automatically generating synthesized video containing the target object based on the target synthesized video, the video light field information and the three-dimensional model.
9. An electronic device comprising a processor coupled to a memory;
The processor is configured to execute a computer program stored in the memory to cause the electronic device to perform the method of any one of claims 1 to 7.
10. A computer readable storage medium comprising a computer program or instructions which, when run on a computer, cause the computer to perform the method of any of claims 1 to 7.
CN202410225433.0A 2024-02-29 2024-02-29 Automatic video synthesis method based on three-dimensional Gaussian neural radiation field Pending CN117939257A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410225433.0A CN117939257A (en) 2024-02-29 2024-02-29 Automatic video synthesis method based on three-dimensional Gaussian neural radiation field


Publications (1)

Publication Number Publication Date
CN117939257A true CN117939257A (en) 2024-04-26

Family

ID=90763137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410225433.0A Pending CN117939257A (en) 2024-02-29 2024-02-29 Automatic video synthesis method based on three-dimensional Gaussian neural radiation field

Country Status (1)

Country Link
CN (1) CN117939257A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination