CN115761565A - Video generation method, device, equipment and computer readable storage medium - Google Patents

Video generation method, device, equipment and computer readable storage medium Download PDF

Info

Publication number
CN115761565A
Authority
CN
China
Prior art keywords
information
key
sampling
points
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211231054.XA
Other languages
Chinese (zh)
Other versions
CN115761565B (en)
Inventor
周彧聪
王志浩
杨斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xiyu Jizhi Technology Co ltd
Original Assignee
Mingzhimeng Shanghai Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mingzhimeng Shanghai Technology Co ltd filed Critical Mingzhimeng Shanghai Technology Co ltd
Priority to CN202211231054.XA priority Critical patent/CN115761565B/en
Priority to PCT/CN2022/143214 priority patent/WO2024077791A1/en
Publication of CN115761565A publication Critical patent/CN115761565A/en
Application granted granted Critical
Publication of CN115761565B publication Critical patent/CN115761565B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof


Abstract

The invention discloses a video generation method, apparatus, device and computer-readable storage medium. The method comprises: acquiring first information of a plurality of sampling points on a first light ray, the first information comprising spatial coordinates and an azimuth viewing angle; acquiring, a plurality of times, second information of a plurality of first key points of a target object, the second information comprising the spatial coordinates of the key points and the features of the key points; for the plurality of sampling points and the second information of the plurality of first key points acquired each time, generating a plurality of first key point fusion features from the first information and the second information; performing an offset operation on the spatial coordinates of the plurality of sampling points based on the first key point fusion features to obtain offset spatial coordinates; inputting, in pairs, the offset spatial coordinates of the plurality of sampling points and the corresponding first key point fusion features into a pre-trained neural radiance field model to obtain a plurality of static images of the target object; and synthesizing the plurality of static images into a video of the target object.

Description

Video generation method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of computer vision technologies, and in particular, to a video generation method, apparatus, device, and computer-readable storage medium.
Background
In recent years, computer vision techniques based on deep learning, such as target tracking and image segmentation, have developed rapidly. Research on the reconstruction and rendering of 3D scenes has also made great progress. A neural radiance field (NeRF) is a continuous, implicit representation of a three-dimensional static scene: it flexibly represents the geometry and appearance of the scene and enables realistic synthesis of two-dimensional images from new viewing angles. However, NeRF can only produce two-dimensional images of a static scene and cannot meet the requirement of three-dimensional video reconstruction.
For the synthesis of three-dimensional video, the prior art usually adds a time parameter to the 5D input to train the neural network, so that a 3D image at any moment can be obtained and a video synthesized from these images. However, directly adding a dimension in this way greatly increases the amount of training data and the training time, and the efficiency is low. Another common approach is to implement 3D video generation of dynamic scenes using time-based latent codes.
Therefore, the current 3D video generation mainly depends on time directly or indirectly, and a video generation method independent of time parameters is urgently needed.
Disclosure of Invention
The invention mainly aims to provide a video generation method, a video generation device, video generation equipment and a computer readable storage medium, and aims to solve the technical problem that the conventional video generation method depends on time parameters. The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a video generation method, including:
acquiring first information of a plurality of sampling points on a first light ray, wherein the first information comprises a space coordinate and an azimuth viewing angle;
acquiring second information of a plurality of first key points of the target object for a plurality of times, wherein the second information comprises the space coordinates of the key points and the characteristics of the key points;
aiming at a plurality of sampling points and second information of a plurality of first key points acquired at each time, respectively generating a plurality of first key point fusion characteristics according to the first information and the second information, wherein each first key point fusion characteristic in the plurality of first key point fusion characteristics corresponds to one sampling point in the plurality of sampling points;
for a first sampling point in the plurality of sampling points, carrying out offset operation on the spatial coordinate of the first sampling point according to the spatial coordinate of the first sampling point and a first key point fusion feature corresponding to the first sampling point to obtain the offset spatial coordinate, wherein the first sampling point is any one of the plurality of sampling points;
for second information of a plurality of first key points acquired each time, inputting the shifted spatial coordinates of the plurality of sampling points and the fusion characteristics of the plurality of first key points corresponding to the plurality of sampling points into a pre-trained NeRF model in a paired manner, so as to acquire a plurality of static images of the target object, wherein the number of the static images is equal to the number of times of acquiring the second information of the first key points for a plurality of times;
and synthesizing the plurality of static images into the video of the target object.
In a second aspect, an embodiment of the present application provides a video generating apparatus, including:
the light ray acquisition module is used for acquiring first information of a plurality of sampling points on a first light ray, wherein the first information comprises a space coordinate and an azimuth viewing angle;
the key point acquisition module is used for acquiring second information of a plurality of first key points of the target object for a plurality of times, wherein the second information comprises the spatial coordinates of the key points and the characteristics of the key points;
the key point coding module is used for respectively generating a plurality of first key point fusion characteristics according to a plurality of sampling points and second information of a plurality of first key points acquired each time according to the first information and the second information, wherein each first key point fusion characteristic in the plurality of first key point fusion characteristics corresponds to one sampling point in the plurality of sampling points;
the light ray bending module is used for carrying out offset operation on the spatial coordinate of a first sampling point in the plurality of sampling points according to the spatial coordinate of the first sampling point and a first key point fusion characteristic corresponding to the first sampling point to obtain the offset spatial coordinate, wherein the first sampling point is any one of the plurality of sampling points;
the neural radiance field module is used for inputting, in pairs, the offset spatial coordinates of the sampling points and the fusion features of the first key points corresponding to the sampling points into a pre-trained NeRF model for the second information of the first key points acquired each time, so as to acquire a plurality of static images of the target object, wherein the number of static images is equal to the number of times the second information of the first key points is acquired;
and the video generating module is used for synthesizing the plurality of static images into the video of the target object.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method as described above.
In a fourth aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of the method as described above.
In the embodiment of the invention, the spatial coordinates and azimuth viewing angle information of a plurality of sampling points on a first light ray are acquired, a plurality of first key points of a target object are obtained, and second information of the first key points (the spatial coordinates and the features of the key points) is acquired a plurality of times. Each sampling point on the first light ray is fused with the second information acquired each time to generate a first key point fusion feature; the spatial coordinates of each sampling point are offset according to the first key point fusion feature to obtain the offset spatial coordinates; the offset spatial coordinates of the sampling points and the first key point fusion features are input into a pre-trained neural radiance field model to generate a static image, a corresponding static image being generated for the second information of the plurality of first key points input each time; and a video is synthesized from the plurality of static images. By sequentially inputting the second information of the plurality of first key points of the target object, each static image generated from the neural radiance field for the first light ray is associated with the second information of different key points input each time, and the change of each image in a dynamic scene is simulated by fusing the changing second information of the key points. The video is then synthesized from the generated pictures, so that 3D video synthesis is achieved while being decoupled from time; the synthesis method is simple, and a video of the target object can be synthesized with the user only specifying a viewing angle.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is an exemplary schematic diagram of a video generation method provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a video generation method according to an embodiment of the present application;
fig. 3 is a key point diagram of a video generation method according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram illustrating ray bending according to key points in a video generation method according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart illustrating a detailed process of generating a plurality of first keypoint fusion features in a video generation method according to an embodiment of the present application;
fig. 6 is an overall flowchart of a video generation method provided in an embodiment of the present application;
fig. 7 is a schematic flowchart of a refinement process for determining a first key point in a video generation method according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a video generating device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The video generating device may be a terminal device such as a mobile phone, a computer, a tablet computer, a smart watch or a vehicle-mounted device, or a module in a terminal device that implements the video generation method. The video generating device may acquire first information of a plurality of sampling points on a first light ray, the first information comprising spatial coordinates and an azimuth viewing angle; acquire, a plurality of times, second information of a plurality of first key points of a target object, the second information comprising the spatial coordinates of the key points and the features of the key points; for the plurality of sampling points and the second information of the plurality of first key points acquired each time, generate a plurality of first key point fusion features from the first information and the second information, wherein each first key point fusion feature corresponds to one of the plurality of sampling points; for a first sampling point among the plurality of sampling points (the first sampling point being any one of them), perform an offset operation on the spatial coordinates of the first sampling point according to its spatial coordinates and the corresponding first key point fusion feature to obtain offset spatial coordinates; for the second information of the plurality of first key points acquired each time, input, in pairs, the offset spatial coordinates of the plurality of sampling points and the corresponding first key point fusion features into a pre-trained NeRF model, so as to obtain a plurality of static images of the target object, the number of static images being equal to the number of times the second information of the first key points is acquired; and synthesize the plurality of static images into a video of the target object.
Referring to fig. 1, an exemplary schematic diagram of a video generation method provided in an embodiment of the present application shows the process of synthesizing a 3D video of a target object. In an actual application scene, a light ray or shooting-direction viewing angle may be obtained according to the viewing angle from which the target object is to be watched; the NeRF model is driven by the key point information of the target object to obtain a plurality of 3D still pictures corresponding to the light ray, and the 3D video is synthesized from the plurality of still pictures.
The following describes the video generation method provided by the present application in detail with reference to specific embodiments.
Please refer to fig. 2, which provides a schematic flow chart of a video generation method according to an embodiment of the present application. As shown in fig. 2, the method of the embodiment of the present application may include the following steps S10-S60.
S10, acquiring first information of a plurality of sampling points on a first light ray, wherein the first information comprises a space coordinate and an azimuth viewing angle;
s20, obtaining second information of a plurality of first key points of the target object for a plurality of times, wherein the second information comprises the space coordinates of the key points and the characteristics of the key points;
s30, aiming at a plurality of sampling points and second information of a plurality of first key points acquired at each time, respectively generating a plurality of first key point fusion characteristics according to the first information and the second information, wherein each first key point fusion characteristic in the plurality of first key point fusion characteristics corresponds to one sampling point in the plurality of sampling points;
s40, aiming at a first sampling point in the plurality of sampling points, carrying out offset operation on the spatial coordinate of the first sampling point according to the spatial coordinate of the first sampling point and a first key point fusion characteristic corresponding to the first sampling point to obtain the offset spatial coordinate, wherein the first sampling point is any one of the plurality of sampling points;
s50, aiming at second information of a plurality of first key points acquired each time, inputting the shifted spatial coordinates of the plurality of sampling points and the fusion characteristics of the plurality of first key points corresponding to the plurality of sampling points into a pre-trained NeRF model in a paired manner, so as to acquire a plurality of static images of the target object, wherein the number of the plurality of static images is equal to the number of times of acquiring the second information of the first key points for a plurality of times;
and S60, synthesizing the plurality of static images into the video of the target object.
In this embodiment, a three-dimensional video synthesis method is proposed based on the conventional NeRF model. NeRF uses a multi-layer perceptron (MLP) to implicitly learn a static 3D scene. For each static 3D scene, a large number of pictures with known camera parameters need to be provided to train the NeRF model. The trained NeRF model can reconstruct a three-dimensional model, such as a human body, a building or a vehicle, from any angle. The input is a 5-dimensional coordinate comprising the three-dimensional spatial coordinates (x, y, z) and the viewing direction d = (θ, φ); the output of the intermediate layer of the NeRF model is the color c and the volume density σ (which can be understood approximately as opacity: the smaller the value, the more transparent). The model can thus be understood as a function mapping the 5D coordinates to the corresponding volume density and view-dependent color. The image is then generated using a volume rendering technique. During rendering, for each ray (which need not actually exist physically), the pixel value of the corresponding point is obtained by volume rendering: many points are first sampled on the ray, the corresponding volume density and color of each point are obtained as described above, and the following rendering equation is then used:
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$$
where t_n and t_f denote the near and far bounds of the ray, and the function T(t) represents the accumulated transmittance along the ray from t_n to t, i.e. the probability that the ray travels from t_n to t without hitting any particle. By computing the integral C(r) for the ray of each pixel of the virtual camera, a view of the continuous neural radiance field is drawn and the image can be rendered from any angle.
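In practice the integral C(r) is approximated by numerical quadrature over the sampled points. Below is a minimal sketch of this discrete compositing step in PyTorch, assuming the per-sample densities and colors are already available; the function and variable names are illustrative and not taken from the patent.

```python
import torch

def composite_ray(sigmas, colors, t_vals):
    """Discrete volume rendering along one ray (quadrature of C(r)).

    sigmas: (N,) volume densities at the N sample points
    colors: (N, 3) RGB colors at the N sample points
    t_vals: (N,) sample depths along the ray, in ascending order
    """
    # Distances between adjacent samples; the last interval is left open-ended.
    deltas = torch.cat([t_vals[1:] - t_vals[:-1], torch.full((1,), 1e10)])
    # alpha_i = 1 - exp(-sigma_i * delta_i): opacity contributed by sample i.
    alphas = 1.0 - torch.exp(-sigmas * deltas)
    # T_i = prod_{j<i} (1 - alpha_j): transmittance accumulated up to sample i.
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = alphas * trans                       # contribution of each sample
    rgb = (weights[:, None] * colors).sum(dim=0)   # final pixel color
    return rgb, weights
```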
The above describes the reconstruction of static 3D images with NeRF; clearly no dynamic 3D video can be obtained in this way. In theory, a time parameter could be added to the 5D coordinates used in NeRF to train the neural network, so that a 3D image at any moment can be obtained and a dynamic video synthesized using the time parameter. However, directly adding a new parameter in this way greatly increases the amount of training data, greatly prolongs the training time and clearly reduces efficiency. Meanwhile, other three-dimensional video synthesis methods in the prior art also depend directly or indirectly on the time parameter. Based on the above, the application provides a 3D video synthesis method that is decoupled from time parameters, so that the 3D video depends neither on time information nor on the order of the frame images in the video used for training.
The respective steps will be described in detail below:
s10, acquiring first information of a plurality of sampling points on a first light ray, wherein the first information comprises a space coordinate and an azimuth viewing angle;
specifically, the first ray is a virtual ray obtained according to a video viewing angle, which can be regarded as a ray from a ray origin point such as a human eye or a camera to a target object. First information of a plurality of sampling points on the first light ray is acquired, and the first information can comprise the spatial coordinates and the azimuth viewing angle of the sampling points. It will be appreciated that since the conventional NeRF model input requires spatial coordinates and an azimuthal view, in this scenario at least the spatial coordinates and azimuthal view of the sample points are also taken.
S20, obtaining second information of a plurality of first key points of the target object for a plurality of times, wherein the second information comprises the space coordinates of the key points and the characteristics of the key points;
specifically, the target object is dynamic content included in the video that needs to be generated, and may be a certain object, a person, a scene, and the like. For example, a video of a person speaking needs to be generated, then the target object is the head of the person, and when the person speaks, the facial expression of the person changes, for example, the lips open or close, and the like. Referring to fig. 3, fig. 3 is a schematic diagram of key points of a video generation method according to an embodiment of the present application, where black points in the diagram are key points of the head of the person. It is understood that the number of key points may be determined according to the target object, and generally speaking, the greater the number of the first key points, the higher the precision of the simulated motion of the generated video.
It should be noted that the feature of the key point does not change, and what is changed is the spatial coordinate of the key point.
S30, aiming at a plurality of sampling points and second information of a plurality of first key points acquired at each time, respectively generating a plurality of first key point fusion characteristics according to the first information and the second information, wherein each first key point fusion characteristic in the plurality of first key point fusion characteristics corresponds to one sampling point in the plurality of sampling points;
specifically, for a dynamic scene, the same key point may move in space along with different actions, and the difficulty of dynamic NeRF is: how to simulate the dynamic change under the same light angle. Referring to fig. 4, fig. 4 is a schematic diagram illustrating ray bending according to key points in a video generation method according to an embodiment of the present application, where a dot marked in the diagram is a first key point on a tooth, and a ray with an arrow in the diagram is a first ray. Assuming that the mouth is closed on the left side in fig. 4 and is open on the right side, the light needs to be bent so that the volume density and color at this position can still be obtained after the light is bent. In other words, the (x, y, z) plus some offset of the originally static key point is changed into (x ', y ', z ') during the dynamic time, so that the key point still corresponds to the position of the tooth. Therefore, before the light is bent, the coordinates of the sampling point on the first light, the coordinates of the key point and the key point feature are associated or bound with each other through feature fusion, so that the NeRF is driven by the key point. Taking generation of a picture as an example, first information of each sampling point on the first light ray and second information of a plurality of first key points acquired at a certain time are respectively fused to generate a first key point fusion feature corresponding to each sampling point. The first key point fusion characteristics comprise the coordinates of the sampling points, the coordinates of the key points and key point characteristic information after fusion, and also comprise the orientation view angles of the sampling points.
S40, aiming at a first sampling point in the plurality of sampling points, carrying out offset operation on the spatial coordinate of the first sampling point according to the spatial coordinate of the first sampling point and a first key point fusion characteristic corresponding to the first sampling point to obtain the offset spatial coordinate, wherein the first sampling point is any one of the plurality of sampling points;
specifically, after the first key point fusion feature is obtained, the first light ray is bent, that is, coordinates of each sampling point on the light ray are shifted. And acquiring a first sampling point from the plurality of sampling points, inputting the spatial coordinates of the first sampling point and the first key point fusion characteristics corresponding to the first sampling point into the trained light ray bending module, and acquiring the spatial coordinates of the first sampling point after deviation from the light ray bending module. The ray bending module can be obtained based on neural network training.
S50, aiming at the second information of the first key points acquired each time, inputting the shifted spatial coordinates of the sampling points and the first key point fusion characteristics corresponding to the sampling points into a pre-trained NeRF model in a paired mode, and accordingly acquiring a plurality of static images of the target object, wherein the number of the static images is equal to the number of times of acquiring the second information of the first key points for a plurality of times;
specifically, in step S40, the coordinates of the shifted keypoints are obtained, and then the coordinates of the shifted keypoints are combined with the first keypoint fusion features and input into the trained NeRF model. Alternatively, the multisolution hash coding Hashgrid scheme of InstantNG of NVIDIA is used to optimize NeRF coding, since the conventional frequency coding (encoding) is an implicit encoding, whereas Hashgrid is an explicit encoding, which in combination have a better effect and allow the same rendering quality with less computation. The output of the pre-trained NeRF model is the RGB values and the bulk density, and meanwhile, a static image is generated according to the RGB values and the bulk density volume rendering technology. The volume rendering technology is a prior art, and is not described herein.
And S60, synthesizing the plurality of static images into the video of the target object.
For example, the generated still images are used as the successive frames of a video and are stitched in order to obtain the video. It can be understood that, assuming the generated video is a video of a person speaking, the data collected by the user during pre-training is a video of the person speaking; frames are sampled from it (for example at 60 FPS), the spatial coordinates of the key points in each frame image are obtained, and the corresponding second information is generated. During video synthesis, the acquired second information of the plurality of first key points is input in sequence, the static images are correspondingly generated in sequence, and the video is obtained by directly stitching them. The order in which the images are stitched corresponds to the order in which the first key points are input, not to the temporal order of the frame images in the video used to train the NeRF model.
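A minimal sketch of this stitching step, assuming the rendered still images are available as H×W×3 uint8 arrays; the use of imageio and the 60 FPS value follow the example above and are otherwise assumptions.

```python
import imageio.v2 as imageio

def frames_to_video(frames, out_path="target_object.mp4", fps=60):
    """Splice the generated still images, in keypoint-input order, into a video.
    Writing .mp4 requires the imageio-ffmpeg plugin."""
    writer = imageio.get_writer(out_path, fps=fps)
    for frame in frames:   # order follows the order of the input second information, not time
        writer.append_data(frame)
    writer.close()
```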
In the embodiment of the application, the spatial coordinates and azimuth viewing angle information of a plurality of sampling points on a first light ray are acquired, a plurality of first key points of a target object are obtained, and second information of the first key points (the spatial coordinates and features of the key points) is acquired a plurality of times. Each sampling point on the first light ray is fused with the second information acquired each time to generate a first key point fusion feature; the spatial coordinates of each sampling point are offset according to the first key point fusion feature to obtain the offset spatial coordinates of each sampling point; the offset spatial coordinates of the sampling points and the first key point fusion features are input into a pre-trained neural radiance field model to generate a static image, a corresponding static image being generated for the second information of the plurality of first key points input each time; and a video is synthesized from the plurality of static images. By sequentially inputting the second information of the plurality of first key points of the target object, each static image generated from the neural radiance field for the first light ray is associated with the second information of different key points input each time, the change of each image in a dynamic scene is simulated by fusing the changing second information of the key points, and the video is then synthesized from the generated pictures. 3D video synthesis is thus achieved while being decoupled from time; the synthesis method is simple, and a video of the target object can be synthesized with the user only specifying a viewing angle.
Referring to fig. 5, a schematic detailed flow chart for generating a plurality of first keypoint fusion features in a video synthesis method according to the embodiment of the present application is provided. As shown in fig. 5, the method of the embodiment of the present application may include the following steps S31 to S32.
S31, determining at least one second keypoint associated with the first sampling point from a plurality of first keypoints;
in the embodiment, when the first keypoint fusion feature is generated, part of the second keypoints are selected from the plurality of first keypoints to perform feature fusion with the first information of the first sampling point. It will be appreciated that the sample point P (x, y, z) in space will not be associated with all the keypoints landmark (x, y, z), e.g. keypoints near the eye drive eye movement, keypoints near the mouth drive mouth movement, and keypoints near the eye will not drive mouth movement. Therefore, it is necessary to select a second keypoint associated with the first sampling point from the first keypoints, thereby making the keypoint drive more accurate. Specifically, at least one second keypoint determined from the plurality of first keypoints and associated with the first sampling point can be determined by training a neural network, and the associated features are input, so that the neural network learns the associated features between the keypoints and the sampling points, and then the second keypoints in the first keypoints can be obtained by performing association prediction through the trained neural network. Optionally, by setting a correspondence between the key points and the sampling points, when at least one second key point related to the first sampling point needs to be determined from the plurality of first key points, the second key point can be obtained from the corresponding relationship mapping table.
And S32, performing attention calculation on the first information of the first sampling point and the second information of the at least one second key point to obtain the first key point fusion feature.
Specifically, after the second key points associated with the first sampling point are confirmed, the second information of the second key points and the first information of the first sampling point are used to perform attention calculation, so that the first sampling point and the second key points become associated; after the attention mechanism is applied, the obtained first key point fusion feature represents the result of the interaction between the key point information of the target object and the light ray information.
For example, the sampling point P(x, y, z) and landmark(x, y, z) are in fact points in the same space, and the influence of a key point on P(x, y, z) is related to its spatial position, so a cross-attention based encoding method is used here (a code sketch of this encoding follows the list below), for example as follows:
the sampling point P(x, y, z), a 1x3 tensor, is taken as the query;
landmark(x, y, z), an Mx3 tensor, is taken as the key;
considering that the M landmarks carry corresponding semantics, a corresponding landmark feature is set for landmark(x, y, z) and taken as the value, an Mx3 embedding;
an attention operation is performed on the query, the key and the value to obtain the final landmark encoding, i.e. the key point fusion feature.
Further, in an embodiment, the determining at least one second keypoint associated with the first sample point from among the plurality of first keypoints comprises:
s311, calculating the distances between the spatial coordinates of the first sampling point and the spatial coordinates of the first key points;
s312, determining at least one first key point with the distance smaller than or equal to a preset threshold value as the at least one second key point.
Specifically, the distance between the first sampling point P(x, y, z) and each of the first key points landmark(x, y, z) is calculated, and at least one first key point whose distance is smaller than or equal to a preset threshold is determined as the at least one second key point associated with the first sampling point. Q multiplied by K in the following formula represents a similarity and is also a measure of distance. It should be noted that the attention calculation may directly adopt the formula of the prior art, specifically:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where Q is formed from the coordinates (x, y, z) of the input sampling points, K from the coordinates (x, y, z) of the landmarks, and V is a learnable landmark feature ("learnable" means it is initialized to random values and then updated as the network parameters are updated during training); d_k is the embedding dimension of Q or K. For example, suppose Q is 200x2048 and K and V are 200x2048, where d_k is 2048.
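A short sketch of the distance-based selection of the second key points described in steps S311–S312 above; the threshold value is an illustrative assumption.

```python
import torch

def select_second_keypoints(point_xyz, landmark_xyz, threshold=0.1):
    """Return the landmarks whose Euclidean distance to the sampling point
    is less than or equal to the preset threshold (the 'second key points')."""
    dists = torch.linalg.norm(landmark_xyz - point_xyz[None, :], dim=-1)  # (M,)
    mask = dists <= threshold
    return landmark_xyz[mask], mask
```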
Further, in an embodiment, the method of the embodiment of the present application may include the following steps S33 to S34.
S34, determining at least one second keypoint associated to the first sampling point from a plurality of first keypoints;
and S34, splicing the first information of the first sampling point and the second information of the at least one second key point to generate the first key point fusion feature.
In an embodiment, after the second key points are determined from the plurality of first key points, the first key point fusion feature is generated by splicing (concatenating) the first information of the first sampling point with the second information of the at least one second key point. Specifically, the second key point coordinates (x, y, z) are flattened into a 1-dimensional vector and then concatenated with the first sampling point P(x, y, z) as the input of the subsequent NeRF model. It should be noted that, compared with feature fusion by attention, directly concatenating the key point coordinates and the sampling point coordinates gives a somewhat poorer effect, but it is simple and fast; this method can be used to increase the speed of video generation when the quality requirement on the synthesized video is not high.
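A sketch of this simpler concatenation-based fusion; shapes and names are illustrative assumptions.

```python
import torch

def concat_fusion(point_xyz, second_keypoints):
    """Flatten the associated key point coordinates into a 1-D vector and
    concatenate them with the sampling point coordinates."""
    landmark_vec = second_keypoints.reshape(-1)           # (K*3,)
    return torch.cat([point_xyz, landmark_vec], dim=0)    # (3 + K*3,) input for the NeRF model
```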
Referring to fig. 6, fig. 6 is a flowchart illustrating the overall video generation method according to an embodiment of the present disclosure. Fig. 6 shows how each first sampling point (the sampling point in the figure) and the corresponding acquired second information of the first key points (the key points in the figure) are processed; the core processing modules are the key point coding module, the ray bending module and the neural radiance field module. First, the coordinates of the sampling point are taken as the Query, the coordinates of the key points as the Key and the features of the key points as the Value, and they are fused through the attention mechanism to obtain the key point fusion feature; the azimuth viewing angle of the sampling point is implicitly included in the fusion feature for subsequent input into the NeRF model. Then, the key point fusion feature and the sampling point coordinates are input into the light-bending multilayer perceptron in the ray bending module, which outputs the offset sampling point coordinates; the offset sampling point coordinates and the key point fusion feature are input into the neural radiance field module, and NeRF, combined with Hashgrid, generates the color RGB and volume density corresponding to the sampling point. All first sampling points on the first light ray are input into the modules in turn to obtain the color RGB and volume density of all sampling points on the first light ray, and a static picture is generated based on them. The number of static pictures is equal to the number of times the second information of the first key points is acquired, and the video is generated from the obtained static pictures.
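Putting the modules of fig. 6 together, the following sketch renders the samples of one ray for one set of key point information, reusing composite_ray from the earlier sketch; the module interfaces (keypoint_encoder, ray_bend_mlp, nerf_model) are assumptions for illustration, not the patent's exact API.

```python
import torch

def render_frame(sample_points, view_dirs, t_vals, landmarks,
                 keypoint_encoder, ray_bend_mlp, nerf_model):
    """Render one ray of a still image: encode key points, bend the ray,
    query the NeRF model, then composite into a pixel color."""
    colors, sigmas = [], []
    for p, d in zip(sample_points, view_dirs):
        fused = keypoint_encoder(p, landmarks)                 # key point fusion feature
        p_shifted = p + ray_bend_mlp(torch.cat([p, fused]))    # offset spatial coordinate
        rgb, sigma = nerf_model(p_shifted, d, fused)           # color and volume density
        colors.append(rgb)
        sigmas.append(sigma)
    return composite_ray(torch.stack(sigmas), torch.stack(colors), t_vals)
```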
In this embodiment, when the first key point fusion feature is generated, the second key points associated with the first sampling point are selected from the plurality of first key points, and feature fusion is performed on the second information of the second key points and the first information of the first sampling point to generate the first key point fusion feature. Specifically, attention calculation is performed on the first information of the first sampling point and the second information of the at least one second key point, or they are directly concatenated, to generate the first key point fusion feature, so that the key point information interacts with the information of the sampling points on the light ray, and the subsequent neural radiance field model can generate the corresponding picture according to the input key point information.
Please refer to fig. 7, which provides a detailed flowchart of determining a first keypoint in a video generation method according to an embodiment of the present application. As shown in fig. 7, the method of the embodiment of the present application may include the following steps S71-S72.
S71, determining the type of the target object;
in one embodiment, the target objects may be of multiple types, and the key points to be extracted are different for different types of target objects. It can be understood that if the target object is a character avatar, key points corresponding to the character avatar need to be acquired; if the target object is an animal, acquiring a key point corresponding to the animal; or, if the target object is a human body limb, key points corresponding to the limb need to be acquired. Specifically, the type division of the target object may be determined according to actual conditions.
S72, selecting a key point extraction model based on the type of the target object, and determining the first key point according to the key point extraction model.
In an embodiment, after the type of the target object is determined, a key point extraction model for that type is obtained, and the first key points are obtained through the key point extraction model. The key point extraction model may adopt an existing open-source model or be trained with a convolutional neural network. For example, the face key point extraction model may use Dlib, a popular open-source face recognition library. Assuming the target object is a human face, 68 key points of the face in the input picture can be extracted with the Dlib key point extraction model.
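A sketch of extracting the 68 facial key points with Dlib; the predictor file name refers to Dlib's commonly distributed landmark model, which must be downloaded separately, so treat the path as an assumption.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed local path

def extract_face_keypoints(image_rgb: np.ndarray) -> np.ndarray:
    """Return a (68, 2) array of key point pixel coordinates for the first detected face."""
    faces = detector(image_rgb, 1)
    if not faces:
        return np.empty((0, 2))
    shape = predictor(image_rgb, faces[0])
    return np.array([[p.x, p.y] for p in shape.parts()])
```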
Optionally, the present application further provides a training method for the neural radiance field model used for video generation, comprising the following steps S81 to S82:
S81, establishing an initialized neural radiance field model;
and S82, training the initialized neural radiance field model with pre-acquired training images to obtain a trained neural radiance field model, wherein the first information of the sampling points, the spatial coordinates of all key points and the features of the key points are annotated in each training image.
Specifically, the neural radiance field model in this application differs to some extent from the traditional neural radiance field model: its input data also includes key point information. Therefore, an initialized neural radiance field model needs to be created and trained with training images, so that the model learns the first information of the sampling points and the spatial coordinates and features of all key points. The pre-acquired training images are captured from a video. For example, if the target to be synthesized is a video of a person walking seen from a side view, pictures are captured during training from a video of the person walking shot from the side, e.g. 30 pictures from 1 second of video. For each picture, 100 key points are labeled, the spatial positions and features of the key points in each picture are acquired, and a piece of key point information corresponding to each picture is generated. Feature fusion and sampling point offset are performed for the plurality of sampling points on the light ray corresponding to the side view and each piece of key point information; the offset spatial coordinates of the plurality of sampling points and the corresponding key point fusion features are input in pairs into the initialized neural radiance field (NeRF) model to generate an experimental image; the experimental image is compared with the training image corresponding to the key point information used in training, and the neural radiance field model is obtained by iteratively computing the loss function.
Optionally, the feature fusion step and the spatial coordinate bending step in this application may be implemented by the key point coding module and the ray bending module respectively, and the corresponding key point coding model and ray bending model may be trained together with the neural radiance field model. Illustratively, the training steps may be as follows (a code sketch follows the steps):
(1) Acquiring training video data, and performing frame sampling on the training video data to generate a training image set;
(2) Acquiring the spatial coordinates and azimuth viewing angles of all pixel points of each image in the training image set, extracting the key points of each image in the training image set, and inputting the spatial coordinates, the azimuth viewing angles and the key points into an initial key point coding model to obtain initial key point fusion features;
(3) Inputting the initial key point fusion features and the spatial coordinates into an initial ray bending model, and outputting initial corrected three-dimensional coordinates;
(4) Inputting the initial corrected three-dimensional coordinates and the initial key point fusion features into the initialized neural radiance field model to render and generate an experimental image;
(5) Iteratively calculating a preset loss function based on the experimental image and the training image set until the loss function meets a preset condition, thereby obtaining the trained key point coding model, ray bending model and neural radiance field model.
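A condensed sketch of the joint training loop for the three modules, reusing render_frame from the earlier sketch; the optimizer, learning rate, loss and data-loading format are assumptions made for illustration.

```python
import torch

def train(dataloader, keypoint_encoder, ray_bend_mlp, nerf_model,
          n_epochs=10, lr=5e-4):
    """Jointly optimize the key point coding, ray bending and NeRF modules
    against the labeled training images."""
    params = (list(keypoint_encoder.parameters())
              + list(ray_bend_mlp.parameters())
              + list(nerf_model.parameters()))
    optim = torch.optim.Adam(params, lr=lr)
    for _ in range(n_epochs):
        # Each batch is assumed to hold the rays, key point info and ground-truth
        # pixels of one annotated training image.
        for rays, landmarks, target_pixels in dataloader:
            pred_pixels = []
            for pts, dirs, t_vals in rays:
                rgb, _ = render_frame(pts, dirs, t_vals, landmarks,
                                      keypoint_encoder, ray_bend_mlp, nerf_model)
                pred_pixels.append(rgb)
            loss = torch.nn.functional.mse_loss(torch.stack(pred_pixels), target_pixels)
            optim.zero_grad()
            loss.backward()
            optim.step()
```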
Further, in an embodiment, for a first sampling point of the plurality of sampling points, according to the spatial coordinate of the first sampling point and a first key point fusion feature corresponding to the first sampling point, the spatial coordinate of the first sampling point is subjected to a shift operation to obtain a shifted spatial coordinate, which is implemented by using a multilayer perceptron.
It can be understood that when a person speaks and the face moves, the key points on the face move, and what we want is for the light rays to move along with the key points. Specifically, the offset of the light ray is learned automatically by a multi-layer perceptron (MLP): where the light ray should be shifted is determined from the information provided by the key points, which gives the positions of the shifted sampling points. During training, a constraint is set so that the generated coordinates of a frame match the actual positions of the lips and teeth in that frame, that is, the generated picture is compared with the original (training) picture to train the model.
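A sketch of such a light-bending multilayer perceptron, predicting a coordinate offset from the sampling point coordinate concatenated with its fusion feature; layer sizes are assumptions.

```python
import torch.nn as nn

class RayBendMLP(nn.Module):
    """Predicts an (x, y, z) offset so that the bent ray follows the key points."""
    def __init__(self, feature_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),   # predicted coordinate offset
        )

    def forward(self, point_and_feature):
        return self.net(point_and_feature)
```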
Further, in an embodiment, the operation of generating the plurality of first key point fusion features from the first information and the second information, for the plurality of sampling points and the second information of the plurality of first key points acquired each time, may also be implemented with a multi-layer perceptron; the specific training manner is similar to the above steps and is not repeated here.
In the embodiment of the application, the type of the target object is determined, a key point extraction model is then selected according to the type, and the first key points are determined according to the key point extraction model. By matching a corresponding key point extraction model to each type of target object, more accurate key point information can be provided for the target object in the generated video, improving the accuracy of video synthesis. Meanwhile, extracting the key points with a pre-trained key point extraction model, instead of temporary manual key point annotation, improves the efficiency of acquiring key point information and thus makes the video generation process faster.
The video generation apparatus provided in the embodiment of the present application will be described in detail below with reference to fig. 8. It should be noted that, the video generating apparatus in fig. 8 is used for executing the method of the embodiment shown in fig. 2 to fig. 7 of the present application, and for convenience of description, only the portion related to the embodiment of the present application is shown, and details of the specific technology are not disclosed, please refer to the embodiment shown in fig. 2 to fig. 7 of the present application.
Please refer to fig. 8, which shows a schematic structural diagram of a video generating apparatus according to an exemplary embodiment of the present application. The video generating apparatus may be implemented as all or part of a device in software, hardware, or a combination of both. The apparatus comprises a light ray acquisition module 10, a key point acquisition module 20, a key point coding module 30, a light ray bending module 40, a neural radiance field module 50 and a video generation module 60.
The light ray acquisition module 10 is used for acquiring first information of a plurality of sampling points on a first light ray, wherein the first information comprises a spatial coordinate and an azimuth viewing angle;
the key point obtaining module 20 is configured to obtain second information of a plurality of first key points of the target object for a plurality of times, where the second information includes spatial coordinates of the key points and features of the key points;
the keypoint encoding module 30 is configured to generate, according to the first information and the second information, a plurality of first keypoint fusion features respectively for a plurality of sampling points and second information of a plurality of first keypoints acquired each time, where each of the plurality of first keypoint fusion features corresponds to one of the plurality of sampling points;
the light ray bending module 40 is configured to, for a first sampling point of the multiple sampling points, perform a shift operation on a spatial coordinate of the first sampling point according to the spatial coordinate of the first sampling point and a first key point fusion feature corresponding to the first sampling point, so as to obtain a shifted spatial coordinate, where the first sampling point is any one of the multiple sampling points;
the nerve radiation field module 50 is configured to input pre-trained NeRF models in a paired manner to obtain a plurality of static images of the target object, where the number of the plurality of static images is equal to the number of times of acquiring the second information of the first keypoints for a plurality of times, and the shifted spatial coordinates of the plurality of sampling points and the fusion features of the plurality of first keypoints corresponding to the plurality of sampling points are fused, so as to obtain a plurality of static images of the target object;
a video generating module 60, configured to synthesize the plurality of still images into a video of the target object.
Optionally, the key point coding module 30 is specifically configured to, for the first sampling point and the second information of the plurality of first key points acquired each time,
determine at least one second key point associated with the first sampling point from the plurality of first key points; and
perform attention calculation on the first information of the first sampling point and the second information of the at least one second key point to obtain the first key point fusion feature.
Optionally, the key point coding module 30 is specifically configured to, for the first sampling point and the second information of the plurality of first key points acquired each time,
determine at least one second key point associated with the first sampling point from the plurality of first key points; and
splice the first information of the first sampling point and the second information of the at least one second key point to generate the first key point fusion feature.
Optionally, the key point coding module 30 is specifically configured to calculate the distances between the spatial coordinate of the first sampling point and the spatial coordinates of the plurality of first key points,
and determine at least one first key point whose distance is less than or equal to a preset threshold as the at least one second key point.
Optionally, the key point coding module 30 uses a multi-layer perceptron to perform the operation of generating, for the plurality of sampling points and the second information of the plurality of first key points acquired each time, the plurality of first key point fusion features according to the first information and the second information.
Optionally, the light ray bending module 40 uses a multi-layer perceptron to perform the operation of offsetting, for a first sampling point of the plurality of sampling points, the spatial coordinate of the first sampling point according to the spatial coordinate of the first sampling point and the first key point fusion feature corresponding to the first sampling point, so as to obtain the shifted spatial coordinate.
Optionally, the apparatus further includes a key point extraction module, where the key point extraction module is configured to determine the type of the target object, select a key point extraction model based on the type of the target object, and determine the first key points according to the key point extraction model.
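As a rough sketch of how the modules described above could be composed, the following PyTorch-style skeleton wires hypothetical implementations of the six modules together in the order of the method. Every class, argument, and method name here is an illustrative assumption, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

class VideoGenerationApparatus(nn.Module):
    """Minimal sketch of the module layout in fig. 8 (all names are illustrative)."""

    def __init__(self, ray_sampler, keypoint_source, keypoint_encoder,
                 ray_bender, nerf_model, video_writer):
        super().__init__()
        self.ray_sampler = ray_sampler            # light ray acquisition module 10
        self.keypoint_source = keypoint_source    # key point acquisition module 20
        self.keypoint_encoder = keypoint_encoder  # key point coding module 30
        self.ray_bender = ray_bender              # light ray bending module 40
        self.nerf_model = nerf_model              # neural radiance field module 50 (pre-trained)
        self.video_writer = video_writer          # video generation module 60

    def generate(self, first_ray, num_acquisitions):
        # First information: spatial coordinates and azimuth viewing angles of the sampling points.
        coords, view_dirs = self.ray_sampler(first_ray)
        frames = []
        for t in range(num_acquisitions):
            # Second information: key point coordinates and key point features acquired at time t.
            kp_coords, kp_feats = self.keypoint_source(t)
            # One first key point fusion feature per sampling point.
            fused = self.keypoint_encoder(coords, view_dirs, kp_coords, kp_feats)
            # Offset (bend) the sampling points conditioned on the fusion features.
            bent_coords = self.ray_bender(coords, fused)
            # Query the pre-trained NeRF with (shifted coordinate, fusion feature) pairs.
            frames.append(self.nerf_model(bent_coords, view_dirs, fused))
        # Synthesize the per-acquisition static images into a video.
        return self.video_writer(frames)
```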
It should be noted that when the video generation apparatus provided in the foregoing embodiments executes the video generation method, the division into the above functional modules is merely illustrative; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the video generation apparatus and the video generation method provided in the above embodiments belong to the same concept; for details of the implementation process, refer to the method embodiments, which are not repeated here.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the video generation method of the embodiments shown in fig. 2 to 7 is implemented; for the specific execution process, refer to the descriptions of those embodiments, which are not repeated here.
Referring to fig. 9, a schematic structural diagram of a video generating apparatus according to an exemplary embodiment of the present application is shown. The video generation device in the present application may comprise one or more of the following components: a processor 110, a memory 120, an input device 130, an output device 140, and a bus 150. The processor 110, memory 120, input device 130, and output device 140 may be coupled by a bus 150.
Processor 110 may include one or more processing cores. The processor 110 connects various parts of the entire video generation device using various interfaces and lines, and performs various functions of the device and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and by calling data stored in the memory 120. Alternatively, the processor 110 may be implemented in hardware in at least one of the forms of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 110 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 110 but instead be implemented by a separate communication chip.
The memory 120 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 120 includes a non-transitory computer-readable storage medium. The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the above method embodiments, and the like. The operating system may be an Android system (including systems developed in depth on the basis of the Android system), an iOS system developed by Apple (including systems developed in depth on the basis of the iOS system), or another system.
The memory 120 may be divided into an operating system space, in which the operating system runs, and a user space, in which native and third-party applications run. To ensure that different third-party applications achieve good performance, the operating system allocates corresponding system resources to them. However, different application scenarios within the same third-party application place different requirements on system resources; for example, in a local resource loading scenario the third-party application has a high requirement on disk reading speed, while in an animation rendering scenario it has a high requirement on GPU performance. Because the operating system and the third-party application are independent of each other, the operating system often cannot perceive the current application scenario of the third-party application in time and therefore cannot adapt system resources to the specific application scenario of the third-party application.
To enable the operating system to distinguish the specific application scenario of a third-party application, data communication between the third-party application and the operating system needs to be established, so that the operating system can obtain the current scenario information of the third-party application at any time and adapt system resources accordingly.
The input device 130 is used for receiving input instructions or data, and includes, but is not limited to, a keyboard, a mouse, a camera, a microphone, or a touch device. The output device 140 is used for outputting instructions or data, and includes, but is not limited to, a display device, a speaker, and the like. In one example, the input device 130 and the output device 140 may be combined into a single touch display screen.
The touch display screen may be designed as a full screen, a curved screen, or an irregularly shaped screen, or as a combination of a full screen and a curved screen or of an irregularly shaped screen and a curved screen, which is not limited in the embodiments of the present application.
In addition, those skilled in the art will appreciate that the structure of the video generation device shown in the above figures does not constitute a limitation on the video generation device; the device may include more or fewer components than shown, combine some components, or use a different arrangement of components. For example, the video generation device may further include a radio frequency circuit, an input unit, a sensor, an audio circuit, a Wireless Fidelity (Wi-Fi) module, a power supply, a Bluetooth module, and other components, which are not described here again.
In the video generating device shown in fig. 9, the processor 110 may be configured to call up a computer program stored in the memory 120 and specifically perform the following operations:
acquiring first information of a plurality of sampling points on a first light ray, wherein the first information comprises a space coordinate and an azimuth viewing angle;
acquiring second information of a plurality of first key points of the target object for a plurality of times, wherein the second information comprises the spatial coordinates of the key points and the characteristics of the key points;
aiming at a plurality of sampling points and second information of a plurality of first key points acquired at each time, respectively generating a plurality of first key point fusion characteristics according to the first information and the second information, wherein each first key point fusion characteristic in the plurality of first key point fusion characteristics corresponds to one sampling point in the plurality of sampling points;
for a first sampling point in the plurality of sampling points, carrying out offset operation on the spatial coordinate of the first sampling point according to the spatial coordinate of the first sampling point and a first key point fusion feature corresponding to the first sampling point to obtain the offset spatial coordinate, wherein the first sampling point is any one of the plurality of sampling points;
for second information of a plurality of first key points acquired each time, inputting the shifted spatial coordinates of the plurality of sampling points and the fusion characteristics of the plurality of first key points corresponding to the plurality of sampling points into a pre-trained NeRF model in a paired manner, so as to acquire a plurality of static images of the target object, wherein the number of the static images is equal to the number of times of acquiring the second information of the first key points for a plurality of times;
and synthesizing the plurality of static images into a video of the target object.
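The embodiment does not spell out how the pre-trained NeRF model turns the paired inputs into a static image, or how the static images are packed into a video. The sketch below assumes the standard NeRF volume-rendering integral along each ray and an off-the-shelf video writer; the function names, the output path, and the frame rate are illustrative assumptions.

```python
import torch

def composite_ray(densities, colors, deltas):
    """Standard NeRF volume rendering along one ray (assumed here; the embodiment
    only states that the pre-trained NeRF model outputs the static images)."""
    # densities: (N,), colors: (N, 3), deltas: (N,) spacing between adjacent samples
    alphas = 1.0 - torch.exp(-densities * deltas)                        # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)  # transmittance
    weights = alphas * trans                                             # contribution of each sample
    return (weights.unsqueeze(-1) * colors).sum(dim=0)                   # RGB of the pixel

def frames_to_video(frames, path="target_object.mp4", fps=25):
    """Synthesize the per-acquisition static images into a video file
    (uses imageio with the imageio-ffmpeg backend; path and fps are illustrative)."""
    import imageio.v2 as imageio
    imageio.mimsave(path, [f.clamp(0, 1).mul(255).byte().cpu().numpy() for f in frames], fps=fps)
```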
In one embodiment, when generating, for the plurality of sampling points and the second information of the plurality of first key points acquired each time, the plurality of first key point fusion features according to the first information and the second information, the processor 110 specifically performs the following operations:
for the first sampling point and the second information of the plurality of first key points acquired each time,
determining at least one second keypoint associated with the first sampling point from among a plurality of first keypoints;
and performing attention calculation on the first information of the first sampling point and the second information of the at least one second key point to obtain the first key point fusion feature.
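As a concrete illustration of the attention-based fusion, the sketch below treats the sampling point's first information as the query and the second information of its associated second key points as keys and values in scaled dot-product attention. The input dimensions and projection layers are assumptions, not values given in the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionKeypointFusion(nn.Module):
    """Fuse a sampling point's first information with the second information of its
    associated second key points via attention (dimensions are assumed)."""

    def __init__(self, sample_dim=6, keypoint_dim=131, hidden_dim=64):
        super().__init__()
        self.to_q = nn.Linear(sample_dim, hidden_dim)    # query from (coordinate, azimuth viewing angle)
        self.to_k = nn.Linear(keypoint_dim, hidden_dim)  # keys from (key point coordinate, key point feature)
        self.to_v = nn.Linear(keypoint_dim, hidden_dim)  # values from the same second information

    def forward(self, first_info, second_info):
        # first_info: (sample_dim,); second_info: (K, keypoint_dim) for K second key points
        q = self.to_q(first_info).unsqueeze(0)                      # (1, hidden_dim)
        k, v = self.to_k(second_info), self.to_v(second_info)       # (K, hidden_dim) each
        attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)      # (1, K) attention weights
        return (attn @ v).squeeze(0)                                # first key point fusion feature
```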
In one embodiment, when generating, for the plurality of sampling points and the second information of the plurality of first key points acquired each time, the plurality of first key point fusion features according to the first information and the second information, the processor 110 specifically performs the following operations:
for the first sampling point and the second information of the plurality of first key points acquired each time,
determining at least one second keypoint associated with the first sampling point from a plurality of first keypoints;
and concatenating the first information of the first sampling point and the second information of the at least one second key point to generate the first key point fusion feature.
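The concatenation variant can be sketched even more simply; shapes and names are again assumptions.

```python
import torch

def concat_keypoint_fusion(first_info, second_info):
    """Generate the first key point fusion feature by concatenation (shapes assumed)."""
    # first_info: (sample_dim,) spatial coordinate and azimuth viewing angle of the sampling point
    # second_info: (K, keypoint_dim) second information of the K associated second key points
    return torch.cat([first_info, second_info.reshape(-1)], dim=0)
```

In practice the second key points would typically be truncated or padded to a fixed number so that the fused feature has a constant size for any downstream multi-layer perceptron; that detail is an assumption not stated in the embodiment.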
In one embodiment, the processor 110, when performing the determining of the at least one second keypoint associated with the first sample point from the plurality of first keypoints, specifically performs the following operations:
calculating the distance between the spatial coordinates of the first sampling point and the spatial coordinates of a plurality of first key points;
and determining at least one first key point with the distance smaller than or equal to a preset threshold value as the at least one second key point.
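A minimal sketch of this selection rule, assuming Euclidean distance and an arbitrary threshold value:

```python
import torch

def select_second_keypoints(sample_coord, keypoint_coords, threshold=0.1):
    """Return indices of first key points whose distance to the sampling point is
    less than or equal to a preset threshold (the threshold value is an assumption)."""
    # sample_coord: (3,), keypoint_coords: (K, 3)
    dists = torch.linalg.norm(keypoint_coords - sample_coord, dim=-1)
    return torch.nonzero(dists <= threshold, as_tuple=False).squeeze(-1)
```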
In one embodiment, when generating, for the plurality of sampling points and the second information of the plurality of first key points acquired each time, the plurality of first key point fusion features according to the first information and the second information, the processor 110 performs the operation by using a multi-layer perceptron.
In one embodiment, when offsetting, for a first sampling point of the plurality of sampling points, the spatial coordinate of the first sampling point according to the spatial coordinate of the first sampling point and the first key point fusion feature corresponding to the first sampling point to obtain the shifted spatial coordinate, the processor 110 performs the operation by using a multi-layer perceptron.
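A minimal sketch of such a multi-layer perceptron for the offset (ray-bending) operation follows; the layer widths are assumptions, and an analogous perceptron could implement the fusion-feature generation of the preceding embodiment.

```python
import torch
import torch.nn as nn

class RayBendingMLP(nn.Module):
    """Predict a coordinate offset from (sampling point coordinate, fusion feature)
    with a small multi-layer perceptron; layer widths are assumed."""

    def __init__(self, fusion_dim=64, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + fusion_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3),              # predicted offset (dx, dy, dz)
        )

    def forward(self, coord, fusion_feature):
        offset = self.net(torch.cat([coord, fusion_feature], dim=-1))
        return coord + offset                      # shifted spatial coordinate
```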
In one embodiment, the first key points are human key points, including facial key points and limb key points, and before acquiring the second information of the plurality of first key points of the target object a plurality of times, the processor 110 further performs the following operations:
determining a type of the target object;
selecting a key point extraction model based on the type of the target object, and determining the first key point according to the key point extraction model.
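A minimal sketch of selecting the key point extraction model by object type; the registry of extractors and the type names are assumptions.

```python
def extract_first_keypoints(image, object_type, extractors):
    """Pick the key point extraction model matching the target object's type.
    `extractors` maps a type name to a pre-trained extractor, e.g.
    {"human": face_and_limb_model, "animal": animal_model} (both assumptions)."""
    try:
        extractor = extractors[object_type]
    except KeyError:
        raise ValueError(f"no key point extraction model registered for type '{object_type}'")
    return extractor(image)  # key point spatial coordinates and key point features
```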
In the embodiments of the present application, the spatial coordinates and azimuth viewing angle information of a plurality of sampling points on a first light ray are acquired, a plurality of first key points of a target object are determined, and second information of the first key points, comprising the spatial coordinates of the key points and the features of the key points, is acquired a plurality of times. Each sampling point on the first light ray is fused with the second information acquired each time to generate a first key point fusion feature, the spatial coordinate of each sampling point is offset according to the first key point fusion feature to obtain the shifted spatial coordinate, and the shifted spatial coordinates of the sampling points together with the first key point fusion features are input into a pre-trained neural radiance field model, so that a corresponding static image is generated for each set of second information of the plurality of first key points; a video is then synthesized from the plurality of static images.
By sequentially inputting the second information of the plurality of first key points of the target object, each static image generated for the first light ray from the neural radiance field is associated with the second information of different key points input each time, and the changing key point information thus simulates the change of each frame in a dynamic scene. Synthesizing a video from the generated images therefore achieves 3D video synthesis while decoupling time; the synthesis method is simple, and the user only needs to specify a viewing angle to synthesize a video of the target object.
In addition, when generating the first key point fusion feature, a second key point associated with the first sampling point is selected from the plurality of first key points, and the second information of the second key point is fused with the first information of the first sampling point, either by performing attention calculation on the first information of the first sampling point and the second information of the at least one second key point or by directly concatenating them. In this way, the key point information interacts with the information of the sampling points on the light ray, so that the subsequent neural radiance field model can generate the corresponding image according to the input key point information. Furthermore, the first key points are determined by determining the type of the target object, selecting a key point extraction model according to that type, and extracting the first key points with the selected model.
By matching each type of target object with the corresponding key point extraction model, more accurate key point information can be provided for the target object in the generated video, which improves the accuracy of video synthesis. Meanwhile, because the key points are extracted by a pre-trained key point extraction model and no temporary manual key point annotation is needed, the efficiency of acquiring key point information is improved, making the video generation process faster.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit its scope; equivalent variations and modifications made in accordance with the claims of the present application remain within the scope of the present application.

Claims (10)

1. A method of video generation, comprising:
acquiring first information of a plurality of sampling points on a first light ray, wherein the first information comprises a space coordinate and an azimuth viewing angle;
acquiring second information of a plurality of first key points of the target object for a plurality of times, wherein the second information comprises the space coordinates of the key points and the characteristics of the key points;
aiming at a plurality of sampling points and second information of a plurality of first key points acquired at each time, respectively generating a plurality of first key point fusion characteristics according to the first information and the second information, wherein each first key point fusion characteristic in the plurality of first key point fusion characteristics corresponds to one sampling point in the plurality of sampling points;
for a first sampling point in the plurality of sampling points, carrying out offset operation on the spatial coordinate of the first sampling point according to the spatial coordinate of the first sampling point and a first key point fusion feature corresponding to the first sampling point to obtain the offset spatial coordinate, wherein the first sampling point is any one of the plurality of sampling points;
for second information of a plurality of first key points acquired each time, inputting the shifted spatial coordinates of the plurality of sampling points and the plurality of first key point fusion features corresponding to the plurality of sampling points into a pre-trained NeRF model in a paired manner, thereby acquiring a plurality of static images of the target object, wherein the number of the plurality of static images is equal to the number of times of acquiring the second information of the first key points for a plurality of times;
and synthesizing the plurality of static images into a video of the target object.
2. The method as claimed in claim 1, wherein the generating a plurality of first keypoint fusion features from the first information and the second information for a plurality of sampling points and a plurality of first keypoints acquired at each time, respectively, comprises:
second information for the first sampling point and the plurality of first keypoints acquired each time,
determining at least one second keypoint associated with the first sampling point from a plurality of first keypoints;
and performing attention calculation on the first information of the first sampling point and the second information of the at least one second key point to obtain the first key point fusion feature.
3. The method as claimed in claim 1, wherein the generating a plurality of first keypoint fusion features from the first information and the second information for a plurality of sampling points and a plurality of first keypoints acquired at each time, respectively, comprises:
second information for the first sampling point and the plurality of first keypoints acquired each time,
determining at least one second keypoint associated with the first sampling point from a plurality of first keypoints;
and concatenating the first information of the first sampling point and the second information of the at least one second key point to generate the first key point fusion feature.
4. A method as claimed in claim 2 wherein said determining at least one second keypoint associated with said first sample point from among a plurality of first keypoints comprises:
calculating the distance between the spatial coordinates of the first sampling point and the spatial coordinates of a plurality of first key points;
and determining at least one first key point with the distance smaller than or equal to a preset threshold value as the at least one second key point.
5. The method as claimed in claim 2, wherein the generating of the plurality of first keypoint fusion features from the first information and the second information for the plurality of sampling points and the plurality of first keypoints obtained at each time, respectively, is implemented by using a multi-layer perceptron.
6. The method of claim 1, wherein for a first sampling point in the plurality of sampling points, the spatial coordinate of the first sampling point is shifted according to the spatial coordinate of the first sampling point and a first key point fusion feature corresponding to the first sampling point, and the shifted spatial coordinate is obtained by using a multi-layer perceptron.
7. The method of claim 1, wherein the first keypoints are human keypoints, the first keypoints include facial keypoints and limb keypoints, and before obtaining the second information of the plurality of first keypoints of the target object a plurality of times, the method further comprises:
determining a type of the target object;
selecting a key point extraction model based on the type of the target object, and determining the first key point according to the key point extraction model.
8. A video generation apparatus, comprising:
the light ray acquisition module is used for acquiring first information of a plurality of sampling points on a first light ray, wherein the first information comprises a space coordinate and an azimuth viewing angle;
the key point acquisition module is used for acquiring second information of a plurality of first key points of the target object for a plurality of times, wherein the second information comprises the spatial coordinates of the key points and the characteristics of the key points;
the key point coding module is used for respectively generating a plurality of first key point fusion characteristics according to a plurality of sampling points and second information of a plurality of first key points acquired each time according to the first information and the second information, wherein each first key point fusion characteristic in the plurality of first key point fusion characteristics corresponds to one sampling point in the plurality of sampling points;
the light ray bending module is used for carrying out offset operation on the spatial coordinate of a first sampling point in the plurality of sampling points according to the spatial coordinate of the first sampling point and a first key point fusion characteristic corresponding to the first sampling point to obtain the offset spatial coordinate, wherein the first sampling point is any one of the plurality of sampling points;
the neural radiance field module is used for inputting the shifted spatial coordinates of the plurality of sampling points and the plurality of first key point fusion features corresponding to the plurality of sampling points into a pre-trained neural radiance field (NeRF) model in a paired manner aiming at second information of a plurality of first key points acquired each time, so as to acquire a plurality of static images of the target object, wherein the number of the static images is equal to the number of times of acquiring the second information of the first key points for a plurality of times;
and the video generation module is used for synthesizing the plurality of static images into the video of the target object.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of the method according to any one of claims 1 to 7.
CN202211231054.XA 2022-10-09 2022-10-09 Video generation method, device, equipment and computer readable storage medium Active CN115761565B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211231054.XA CN115761565B (en) 2022-10-09 2022-10-09 Video generation method, device, equipment and computer readable storage medium
PCT/CN2022/143214 WO2024077791A1 (en) 2022-10-09 2022-12-29 Video generation method and apparatus, device, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211231054.XA CN115761565B (en) 2022-10-09 2022-10-09 Video generation method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN115761565A true CN115761565A (en) 2023-03-07
CN115761565B CN115761565B (en) 2023-07-21

Family

ID=85350905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211231054.XA Active CN115761565B (en) 2022-10-09 2022-10-09 Video generation method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN115761565B (en)
WO (1) WO2024077791A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118505842A (en) * 2024-07-17 2024-08-16 腾讯科技(深圳)有限公司 Data processing method, device, equipment and readable storage medium
CN118505842B (en) * 2024-07-17 2024-10-25 腾讯科技(深圳)有限公司 Data processing method, device, equipment and readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118250529A (en) * 2024-05-27 2024-06-25 暗物智能科技(广州)有限公司 Voice-driven 2D digital human video generation method and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018119808A1 (en) * 2016-12-29 2018-07-05 浙江工商大学 Stereo video generation method based on 3d convolutional neural network
CN112019828A (en) * 2020-08-14 2020-12-01 上海网达软件股份有限公司 Method for converting 2D (two-dimensional) video into 3D video
CN112489225A (en) * 2020-11-26 2021-03-12 北京邮电大学 Method and device for fusing video and three-dimensional scene, electronic equipment and storage medium
CN114758081A (en) * 2022-06-15 2022-07-15 之江实验室 Pedestrian re-identification three-dimensional data set construction method and device based on nerve radiation field
CN114972632A (en) * 2022-04-21 2022-08-30 阿里巴巴达摩院(杭州)科技有限公司 Image processing method and device based on nerve radiation field

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109951654B (en) * 2019-03-06 2022-02-15 腾讯科技(深圳)有限公司 Video synthesis method, model training method and related device
US11544894B2 (en) * 2021-02-26 2023-01-03 Meta Platforms Technologies, Llc Latency-resilient cloud rendering
US12039657B2 (en) * 2021-03-17 2024-07-16 Adobe Inc. View synthesis of a dynamic scene
CN113538659B (en) * 2021-07-05 2024-08-09 广州虎牙科技有限公司 Image generation method, device, storage medium and equipment
CN115082639B (en) * 2022-06-15 2023-06-27 北京百度网讯科技有限公司 Image generation method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2024077791A1 (en) 2024-04-18
CN115761565B (en) 2023-07-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address
Address after: 200233, Room 896, 2nd Floor, No. 25-1 Hongcao Road, Xuhui District, Shanghai
Patentee after: Shanghai Xiyu Jizhi Technology Co.,Ltd.
Country or region after: China
Address before: 200336 Room 301-561, No. 106, Lane 1225, Xianxia Road, Changning District, Shanghai
Patentee before: Mingzhimeng (Shanghai) Technology Co.,Ltd.
Country or region before: China