WO2024077791A1 - Video generation method, apparatus, device, and computer-readable storage medium - Google Patents

Video generation method, apparatus, device, and computer-readable storage medium

Info

Publication number
WO2024077791A1
WO2024077791A1 (PCT/CN2022/143214)
Authority
WO
WIPO (PCT)
Prior art keywords
key
information
point
points
sampling
Prior art date
Application number
PCT/CN2022/143214
Other languages
English (en)
French (fr)
Inventor
周彧聪
王志浩
杨斌
Original Assignee
名之梦(上海)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 名之梦(上海)科技有限公司
Publication of WO2024077791A1 publication Critical patent/WO2024077791A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof

Definitions

  • the present invention relates to the field of computer vision technology, and in particular to a video generation method, device, equipment and a computer-readable storage medium.
  • A neural radiance field (NeRF) is a continuous, implicit representation of a static three-dimensional scene. It flexibly represents the geometry and appearance of the 3D scene and enables realistic novel-view two-dimensional image synthesis.
  • however, NeRF alone can only produce two-dimensional images, which does not meet the need for three-dimensional video reconstruction.
  • the existing technology usually trains the neural network by adding a time parameter on top of the 5D input, so that a 3D image at any time can be obtained and a video synthesized.
  • however, this method directly adds a dimension, which greatly increases the amount of training data and the training time, and is therefore inefficient.
  • Another commonly used method is to use time-based latent codes to achieve 3D video generation of dynamic scenes.
  • the current 3D video generation still depends mainly on time directly or indirectly, so it is urgent to propose a video generation method that does not depend on time parameters.
  • the main purpose of the present invention is to provide a video generation method, device, equipment and computer-readable storage medium, aiming to solve the technical problem that the existing video generation method depends on time parameters.
  • the technical solution is as follows:
  • an embodiment of the present application provides a video generation method, comprising:
  • acquiring first information of multiple sampling points on a first light ray, wherein the first information includes spatial coordinates and azimuth viewing angle;
  • acquiring, multiple times, second information of multiple first key points of a target object, wherein the second information includes spatial coordinates of the key points and features of the key points;
  • for the multiple sampling points and the multiple first key points obtained each time, generating multiple first key point fusion features according to the first information and the second information, wherein each first key point fusion feature in the multiple first key point fusion features corresponds to one sampling point in the multiple sampling points
  • For a first sampling point among the multiple sampling points performing an offset operation on the spatial coordinates of the first sampling point according to the spatial coordinates of the first sampling point and a first key point fusion feature corresponding to the first sampling point to obtain offset spatial coordinates, wherein the first sampling point is any one sampling point among the multiple sampling points;
  • the offset spatial coordinates of the plurality of sampling points and the fusion features of the plurality of first key points corresponding to the plurality of sampling points are input into the pre-trained NeRF model in pairs, so as to obtain a plurality of static images of the target object, wherein the number of the plurality of static images is equal to the number of acquisitions of the second information of the first key points;
  • the plurality of static images are synthesized into a video of the target object.
  • an embodiment of the present application provides a video generating device, including:
  • a light acquisition module used to acquire first information of a plurality of sampling points on a first light, wherein the first information includes spatial coordinates and azimuth viewing angle;
  • a key point acquisition module used for acquiring second information of a plurality of first key points of a target object for multiple times, wherein the second information includes spatial coordinates of the key points and features of the key points;
  • a key point encoding module for generating a plurality of first key point fusion features according to the first information and the second information for a plurality of sampling points and a plurality of first key points obtained each time, wherein each first key point fusion feature in the plurality of first key point fusion features corresponds to a sampling point in the plurality of sampling points;
  • a light bending module configured to perform an offset operation on the spatial coordinates of a first sampling point among the multiple sampling points according to the spatial coordinates of the first sampling point and a fusion feature of a first key point corresponding to the first sampling point, so as to obtain offset spatial coordinates, wherein the first sampling point is any one sampling point among the multiple sampling points;
  • a neural radiation field module for each time obtaining the second information of the plurality of first key points, inputting the offset spatial coordinates of the plurality of sampling points and the fusion features of the plurality of first key points corresponding to the plurality of sampling points into a pre-trained NeRF model in pairs, thereby obtaining a plurality of static images of the target object, wherein the number of the plurality of static images is equal to the number of times the second information of the first key points is obtained multiple times;
  • a video generation module is used to synthesize the multiple static images into a video of the target object.
  • an embodiment of the present application provides an electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program implements the steps of the above method when executed by the processor.
  • an embodiment of the present application provides a computer storage medium, wherein the computer storage medium stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor and executing the steps of the above method.
  • each sampling point on the first light ray is fused with the second information obtained each time to generate a first key point fusion feature
  • an offset operation is performed on the spatial coordinates of each sampling point according to the first key point fusion feature to obtain the offset spatial coordinates of each sampling point
  • the offset spatial coordinates of the sampling point and the first key point fusion feature are input into a pre-trained neural radiation field model to generate a static image corresponding to each sampling point
  • a corresponding static image is generated for each input of the second information of multiple first key points
  • a video is synthesized based on the multiple static images.
  • each static image is actually associated with the second information of different key points input each time.
  • the changes of each image in the dynamic scene are simulated by incorporating the changing second information of the key points, and then the video is synthesized based on the generated pictures. This realizes 3D video synthesis while decoupling from time.
  • the synthesis method is simple, and the user only needs to specify the viewing angle to synthesize the video of the target object.
  • FIG. 1 is a schematic example diagram of a video generation method provided in an embodiment of the present application.
  • FIG. 2 is a schematic flow chart of a video generation method provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of key points in a video generation method provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of light bending according to key points in a video generation method provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the refined process of generating multiple first key point fusion features in a video generation method provided in an embodiment of the present application.
  • FIG. 6 is an overall flow chart of a video generation method provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the refined process of determining the first key points in a video generation method provided in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a video generation apparatus provided in an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a video generation device provided in an embodiment of the present application.
  • the video generating device may be a terminal device such as a mobile phone, a computer, a tablet computer, a smart watch or an in-vehicle device, or may be a module in the terminal device for implementing the video generating method.
  • the video generating device may obtain first information of multiple sampling points on a first light ray, wherein the first information includes spatial coordinates and azimuth viewing angles, obtain second information of multiple first key points of a target object multiple times, wherein the second information includes spatial coordinates of the key points and features of the key points, and generate multiple first key point fusion features respectively according to the first information and the second information for multiple sampling points and multiple first key points obtained each time, wherein each of the multiple first key point fusion features corresponds to one sampling point in the multiple sampling points.
  • an offset operation is performed on the spatial coordinates of the first sampling point to obtain the offset spatial coordinates, wherein the first sampling point is any sampling point among the multiple sampling points, and for the second information of the multiple first key points obtained each time, the offset spatial coordinates of the multiple sampling points and the multiple first key point fusion features corresponding to the multiple sampling points are paired and input into the pre-trained NeRF model, so as to obtain multiple static images of the target object, wherein the number of the multiple static images is equal to the number of times the second information of the first key point is obtained multiple times, and the multiple static images are synthesized into a video of the target object.
  • Figure 1 provides an example schematic diagram of a video generation method for an embodiment of the present application.
  • the figure shows the process of synthesizing a 3D video of the target object.
  • a ray of light or a shooting direction perspective can be obtained according to the desired viewing angle of the target object, and then the NeRF model is driven according to the key point information of the target object to obtain multiple 3D static images corresponding to the ray, and then the 3D video is synthesized based on the multiple static images.
  • Figure 2 is a schematic flow chart of a video generation method according to an embodiment of the present application. As shown in Figure 2, the method according to the embodiment of the present application may include the following steps S10-S60.
  • the offset spatial coordinates of the multiple sampling points and the fusion features of the multiple first key points corresponding to the multiple sampling points are input into a pre-trained neural radiation field NeRF model in pairs, so as to obtain multiple static images of the target object, wherein the number of the multiple static images is equal to the number of times the second information of the first key points is acquired multiple times;
  • NeRF uses a multi-layer perceptron (MLP) to implicitly learn a static 3D scene. For each static 3D scene, a large number of pictures with known camera parameters need to be provided to train the NeRF model.
  • the input of the NeRF model is a 5D coordinate, consisting of the 3D spatial position (x, y, z) and the viewing direction d = (θ, φ); the output of its intermediate layers is the color c and the volume density σ (which can be approximately understood as opacity: the smaller the value, the more transparent).
  • the function can be understood as mapping the 5D coordinate to the corresponding volume density and view-dependent color. The image is then generated using volume rendering.
  • during rendering, for each ray (which need not physically exist), volume rendering is used to obtain the pixel value of the corresponding point. Specifically, many points are first sampled along the ray, the volume density and color of each point are obtained as above, and the following rendering equation is applied: C(r) = ∫_{t_n}^{t_f} T(t) σ(r(t)) c(r(t), d) dt, with T(t) = exp(−∫_{t_n}^{t} σ(r(s)) ds)
  • t_n and t_f represent the part of the ray in the scene that we want to render
  • T(t) represents the cumulative transmittance along the ray from t_n to t, that is, the probability that the ray does not hit any particle between t_n and t.
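For reference, a minimal NumPy sketch of the discretized form of this rendering equation (the quadrature commonly used with NeRF) is given below. The densities and colors would come from the NeRF model; all names and the toy inputs here are illustrative assumptions, not part of the application.

```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Numerically integrate the rendering equation along one ray.

    sigmas : (N,) volume densities at the N sampled points
    colors : (N, 3) RGB colors at the N sampled points
    t_vals : (N,) distances of the samples along the ray (t_n .. t_f)
    """
    # Distances between adjacent samples; the last interval is treated as very large.
    deltas = np.diff(t_vals, append=1e10)
    # alpha_i = 1 - exp(-sigma_i * delta_i): opacity contributed by each segment.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # T_i: accumulated transmittance, i.e. probability the ray reaches sample i.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas + 1e-10]))[:-1]
    weights = trans * alphas
    # Final pixel color C(r) is the weighted sum of the sampled colors.
    return (weights[:, None] * colors).sum(axis=0)

# Toy usage with random densities/colors for 64 samples on one ray.
t = np.linspace(2.0, 6.0, 64)
rgb = render_ray(np.random.rand(64), np.random.rand(64, 3), t)
print(rgb)
```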
  • the first ray is a virtual ray obtained according to the video viewing angle, which can be considered as a ray from the starting point of the ray such as the human eye or the camera to the target object.
  • First information of multiple sampling points on the first ray is obtained,
  • the first information may include the spatial coordinates and azimuth viewing angle of the sampling point. It is understandable that, since the traditional NeRF model input requires spatial coordinates and azimuth viewing angle, in this solution, at least the spatial coordinates and azimuth viewing angle of the sampling point are also obtained.
  • the target object is the dynamic content contained in the video to be generated, which can be an object, a person, a scene, etc.
  • the target object is the person's head.
  • the facial expression of the person will change, such as the lips will open or close, etc.
  • the second information of multiple first key points can be obtained.
  • Figure 3 is a schematic diagram of the key points of a video generation method provided by an embodiment of the present application, and the black points in the figure are the key points of the person's head. It can be understood that the number of key points can be determined according to the target object. Generally speaking, the more the number of first key points, the higher the accuracy of the simulated action of the generated video.
  • FIG4 is a schematic diagram of light bending according to key points in a video generation method provided by an embodiment of the present application.
  • the dot marked in the figure is a first key point on the teeth, and the light with an arrow in the figure is the first light. Assuming that the mouth on the left side of FIG4 is closed and the mouth on the right side is open, it is necessary to bend the light so that the volume density and color of this position can still be obtained after the light is bent.
  • in other words, some offsets are added to the (x, y, z) of the originally static key point so that it becomes (x', y', z') and still corresponds to the position of the tooth. Therefore, before bending the light, the coordinates of the sampling points on the first light are first associated or bound, through feature fusion, with the coordinates and features of the key points, so that key points can be used to drive NeRF.
  • the first information of each sampling point on the first light ray and the second information of multiple first key points acquired at a time are fused respectively to generate the first key point fusion feature corresponding to each sampling point.
  • the first key point fusion feature includes not only the fused sampling point coordinates, key point coordinates and key point feature information, but also the azimuth viewing angle of the sampling point.
  • the first light is bent, that is, the coordinates of each sampling point on the light are offset.
  • a first sampling point is obtained from multiple sampling points, and then the spatial coordinates of the first sampling point and the first key point fusion feature corresponding to the first sampling point are input into the trained light bending module, and the spatial coordinates of the offset first sampling point are obtained through the light bending module.
  • the light bending module can be obtained based on neural network training.
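As an illustration only, a minimal PyTorch sketch of such a light bending module is shown below: an MLP that takes a sampling point's coordinates together with its key point fusion feature and predicts a coordinate offset. The layer sizes and the fusion-feature dimension are assumptions, not values from the application.

```python
import torch
import torch.nn as nn

class LightBendingMLP(nn.Module):
    """Predicts a (dx, dy, dz) offset for a sampling point, conditioned on
    the key point fusion feature associated with that point."""

    def __init__(self, fusion_dim: int = 256, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + fusion_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3),  # offset (dx, dy, dz)
        )

    def forward(self, xyz: torch.Tensor, fusion_feat: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) sampling point coordinates; fusion_feat: (N, fusion_dim)
        offset = self.net(torch.cat([xyz, fusion_feat], dim=-1))
        return xyz + offset  # offset spatial coordinates

# Example: bend 64 sampling points on one light ray.
bend = LightBendingMLP()
bent_pts = bend(torch.rand(64, 3), torch.rand(64, 256))  # (64, 3)
```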
  • the offset spatial coordinates of the plurality of sampling points and the fusion features of the plurality of first key points corresponding to the plurality of sampling points are input into a pre-trained NeRF model in pairs, thereby obtaining a plurality of static images of the target object, wherein the number of the plurality of static images is equal to the number of acquisitions of the second information of the first key points;
  • in step S40 we obtain the offset sampling point coordinates, which are then paired with the first key point fusion features and input into the trained NeRF model.
  • NVIDIA's Instant-NGP multi-resolution hash encoding (Hashgrid) scheme is used to optimize the NeRF encoding.
  • the reason is that traditional frequency encoding is an implicit encoding, while Hashgrid is an explicit encoding. Combining the two gives better results and allows the same rendering quality to be achieved with less computation.
  • the output of the pre-trained NeRF model is RGB values and volume density, and a static image is then generated from the RGB values and volume densities using volume rendering. Volume rendering is an existing, publicly known technique and is not described further here.
  • the generated static images are used as frames in the video, and the images are stitched together in order to obtain the video.
  • the generated video is a video of a person speaking
  • frame sampling is performed, for example at 60 frames per second (FPS)
  • the spatial coordinates of the key points in each frame of the image are obtained to generate the corresponding second information.
  • the second information of the multiple first key points obtained are input in order, and the static images are generated in order accordingly, and the video can be obtained by directly stitching.
  • the order in which the images are spliced corresponds to the order in which the multiple first key points are input, rather than the time order of each frame of the image in the video when the NeRF model is trained.
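A minimal sketch of this stitching step is given below, using OpenCV's VideoWriter; the file name, frame rate, and the random frames standing in for the rendered static images are placeholders.

```python
import cv2
import numpy as np

def frames_to_video(frames, path="output.mp4", fps=30):
    """Write a list of HxWx3 uint8 RGB frames to a video file, in the order
    in which the key point second information was input."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))  # OpenCV expects BGR
    writer.release()

# Toy usage with 30 random frames.
frames_to_video([np.random.randint(0, 255, (256, 256, 3), np.uint8) for _ in range(30)])
```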
  • each sampling point on the first light ray is fused with the second information acquired each time to generate a first key point fusion feature
  • an offset operation is performed on the spatial coordinates of each sampling point according to the first key point fusion feature to obtain the offset spatial coordinates of each sampling point
  • the offset spatial coordinates of the sampling point and the first key point fusion feature are input into a pre-trained neural radiation field model to generate a static image corresponding to each sampling point
  • a corresponding static image is generated for each input of the second information of multiple first key points
  • a video is synthesized based on the multiple static images.
  • each static image is actually associated with the second information of different key points input each time.
  • the changes of each image in the dynamic scene are simulated by incorporating the changing second information of the key points, and then the video is synthesized based on the generated pictures. This realizes 3D video synthesis while decoupling from time.
  • the synthesis method is simple, and the user only needs to specify the viewing angle to synthesize the video of the target object.
  • a detailed flow diagram of generating multiple first key point fusion features in a video synthesis method is provided in an embodiment of the present application.
  • the method in the embodiment of the present application may include the following steps S31-S32.
  • some second key points are selected from multiple first key points to perform feature fusion with the first information of the first sampling point.
  • the sampling point P (x, y, z) in the space will not be associated with all the key points landmark (x, y, z).
  • the key points near the eyes drive the movement of the eyes
  • the key points near the mouth drive the movement of the mouth
  • the key points near the eyes do not drive the movement of the mouth. Therefore, it is necessary to select the second key point associated with the first sampling point from the first key point, so that the key point driving is more accurate.
  • at least one second key point associated with the first sampling point can be determined from the multiple first key points by training a neural network: association features are input so that the network learns the associations between key points and sampling points, and the trained network can then predict relevance to obtain the second key points among the first key points.
  • alternatively, the correspondence between key points and sampling points can be set directly.
  • S32 Perform attention calculation on the first information of the first sampling point and the second information of the at least one second key point to obtain a fusion feature of the first key point.
  • the second information of the second key point is obtained and the first information of the first sampling point is used for attention calculation, so as to associate the first sampling point with the second key point.
  • the obtained first key point fusion feature represents the feature obtained after the interaction between the key point information of the target object and the light ray information.
  • Point P represents the sampling point
  • P (x, y, z) and landmark (x, y, z) are actually points in the same space, and the influence of the key point on P (x, y, z) is related to its spatial position, so the encoding method based on cross-attention is adopted here, for example, as follows:
  • the sampling point P(x, y, z) is a 1×3 tensor and is used as the query;
  • landmark(x, y, z) is an M×3 tensor and is used as the key;
  • a corresponding landmark feature is set for landmark(x, y, z); this embedding (one learnable feature vector per landmark, M in total) is used as the value;
  • determining at least one second key point associated with the first sampling point from the plurality of first key points comprises:
  • S311: Calculate the distance between the spatial coordinates of the first sampling point and the spatial coordinates of the multiple first key points;
  • S312: Determine at least one first key point whose distance is less than or equal to a preset threshold as the at least one second key point.
  • the distance between the first sampling point P(x, y, z) and all the first key points landmark(x, y, z) is calculated, and at least one first key point whose distance is less than or equal to the preset threshold is determined as at least one second key point associated with the first sampling point.
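A small sketch of this distance-based selection follows; the threshold value is an arbitrary example, not one from the application.

```python
import torch

def select_nearby_keypoints(p_xyz, landmarks_xyz, threshold=0.1):
    """Return indices of first key points within `threshold` of sampling point
    p_xyz; these are the second key points associated with that point."""
    dists = torch.linalg.norm(landmarks_xyz - p_xyz, dim=-1)  # (M,)
    return torch.nonzero(dists <= threshold, as_tuple=False).squeeze(-1)

# Example: 68 key points, keep those within 0.1 of the sampling point.
idx = select_nearby_keypoints(torch.rand(3), torch.rand(68, 3))
```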
  • Q multiplied by Kᵀ itself represents a similarity, which is also a measure of distance. It should be noted that the attention calculation can directly adopt the standard formula from the prior art: Attention(Q, K, V) = softmax(Q·Kᵀ/√d_k)·V, where:
  • Q is the coordinates (x, y, z) of the input sampling point
  • K is the coordinates (x, y, z) of the landmarks
  • V is the learnable landmark feature (learnable means initialized to random values that are then updated together with the network parameters during training)
  • d_k is the embedding dimension of Q or K. For example, assuming Q is 200×2048 and K and V are 200×2048, then d_k is 2048.
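Putting the above together, a minimal PyTorch sketch of this cross-attention fusion is shown below. The projection dimensions and the number of landmarks M are assumptions; the learnable landmark features play the role of the value described above.

```python
import math
import torch
import torch.nn as nn

class KeyPointCrossAttention(nn.Module):
    """Fuses sampling points (query) with landmark coordinates (key) and
    learnable landmark features (value) via scaled dot-product attention."""

    def __init__(self, num_landmarks: int = 68, dim: int = 128):
        super().__init__()
        self.landmark_feat = nn.Parameter(torch.randn(num_landmarks, dim))  # value, learnable
        self.to_q = nn.Linear(3, dim)   # embeds sampling point coordinates
        self.to_k = nn.Linear(3, dim)   # embeds landmark coordinates
        self.dim = dim

    def forward(self, p_xyz: torch.Tensor, landmarks_xyz: torch.Tensor) -> torch.Tensor:
        # p_xyz: (N, 3) sampling points; landmarks_xyz: (M, 3) key point coordinates.
        q = self.to_q(p_xyz)                # (N, dim)
        k = self.to_k(landmarks_xyz)        # (M, dim)
        v = self.landmark_feat              # (M, dim)
        attn = torch.softmax(q @ k.t() / math.sqrt(self.dim), dim=-1)  # (N, M)
        return attn @ v                     # (N, dim) key point fusion features

# Example: fuse 64 sampling points with 68 facial key points.
enc = KeyPointCrossAttention()
fused = enc(torch.rand(64, 3), torch.rand(68, 3))  # (64, 128)
```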
  • the method of the embodiment of the present application may include the following steps S33-S34.
  • S33: Determine at least one second key point associated with the first sampling point from the plurality of first key points;
  • S34: Concatenate the first information of the first sampling point and the second information of the at least one second key point to generate the first key point fusion feature.
  • the method for generating the first key point fusion feature is to splice the first information of the first sampling point and the second information of at least one second key point.
  • the second key point coordinates (x, y, z) are directly transformed into a 1-dimensional vector, and then spliced together with the first sampling point P (x, y, z), and then used as the input of the subsequent NeRF model.
  • the feature fusion method of splicing the key point coordinates directly with the sampling point coordinates is worse than the method of feature fusion through attention.
  • this method is simple and fast, and can be used to increase the speed of generating videos when the quality requirements of the target synthetic video are not high.
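For comparison, a sketch of this simpler concatenation-based fusion is shown below; fixing the number of selected key points is an assumption made here so that the concatenated vector has a constant length.

```python
import torch

def concat_fusion(p_xyz, selected_landmarks_xyz, selected_landmark_feats):
    """Flatten the selected second key points and splice them onto the
    sampling point coordinates to form a fusion feature."""
    flat = torch.cat([selected_landmarks_xyz.reshape(-1),
                      selected_landmark_feats.reshape(-1)])
    return torch.cat([p_xyz, flat])  # 1-D fusion feature for this sampling point

# Example: one sampling point, 8 nearby key points with 16-D features each.
feat = concat_fusion(torch.rand(3), torch.rand(8, 3), torch.rand(8, 16))  # (155,)
```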
  • Figure 6 is an overall flow chart of a video generation method provided by an embodiment of the present application.
  • Figure 6 shows the processing process of each first sampling point (i.e., the sampling point in the figure) and the corresponding second information of the first key point (i.e., the key point in the figure), and the core processing module includes a key point encoding module, a light bending module, and a neural radiation field module.
  • the sampling point coordinates are used as Query
  • the key point coordinates are used as Key
  • the key point features are used as Value to be fused through the attention mechanism to obtain the key point fusion features.
  • the azimuth viewing angle of the sampling point is implicitly included in the fusion feature for subsequent input into the NeRF model; the key point fusion feature and the sampling point coordinates are then input into the light bending multi-layer perceptron in the light bending module, which outputs the offset sampling point coordinates. The offset sampling point coordinates and the key point fusion feature are then input into the neural radiance field module.
  • through NeRF combined with Hashgrid, the RGB color and volume density corresponding to a sampling point are generated. All the first sampling points on the first light ray are passed through the above modules in turn to obtain the RGB colors and volume densities of all the sampling points on the first light ray, and a static picture is generated from them. The number of static pictures is equal to the number of times the second information of the first key points is obtained, and a video is then generated from the multiple static pictures.
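Tying the modules of FIG. 6 together, a schematic per-frame inference loop might look as follows. The three callables are stand-ins for the trained key point encoding, light bending, and NeRF modules (see the earlier sketches for plausible shapes); their interfaces and the stubbed volume rendering are assumptions for illustration only.

```python
import torch

# Stand-ins for the trained modules; real modules would replace these lambdas.
keypoint_encoder = lambda pts, lm_xyz, lm_feat: torch.rand(pts.shape[0], 128)  # fusion features
ray_bender       = lambda pts, fused: pts + 0.01 * torch.rand_like(pts)        # offset coords
nerf_model       = lambda pts, fused: (torch.rand(pts.shape[0], 3),            # RGB
                                        torch.rand(pts.shape[0]))              # volume density

def render_frame(ray_points, landmarks_xyz, landmark_feats):
    """One static image's worth of work for a single light ray."""
    fused = keypoint_encoder(ray_points, landmarks_xyz, landmark_feats)
    bent = ray_bender(ray_points, fused)
    rgb, sigma = nerf_model(bent, fused)
    # Volume rendering (see the earlier quadrature sketch) would turn rgb/sigma
    # into a pixel value; stubbed here as a density-weighted mean.
    return (rgb * sigma[:, None]).sum(0) / (sigma.sum() + 1e-8)

# One frame per acquisition of key point second information.
frames = []
for lm_xyz, lm_feat in [(torch.rand(68, 3), torch.rand(68, 16)) for _ in range(30)]:
    frames.append(render_frame(torch.rand(64, 3), lm_xyz, lm_feat))
```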
  • a second key point associated with the first sampling point is selected from the multiple first key points, and the second information of the second key point is feature-fused with the first information of the first sampling point to generate the first key point fusion feature.
  • attention calculation is performed on the first information of the first sampling point and the second information of at least one second key point, or the first information of the first sampling point and the second information of at least one second key point are directly spliced to generate the first key point fusion feature, so as to realize the interaction between the key point information and the sampling point information on the light, so that the subsequent neural radiation field model can generate the corresponding picture according to the input key point information.
  • Figure 7 is a schematic diagram of a detailed process of determining the first key point in a video generation method according to an embodiment of the present application. As shown in Figure 7, the method according to the embodiment of the present application may include the following steps S71-S72.
  • the target object can be of multiple types, and different types of target objects require different key points to be extracted. It is understandable that if the target object is a head portrait of a person, it is necessary to obtain the key points corresponding to the head portrait of the person; if the target object is an animal, it is necessary to obtain the key points corresponding to the animal; or, if the target object is a human limb, it is necessary to obtain the key points corresponding to the limb.
  • the type classification of the target object can be determined according to actual conditions.
  • S72 Select a key point extraction model based on the type of the target object, and determine the first key point according to the key point extraction model.
  • a key point extraction model of the target object is obtained, and the first key point is obtained through the key point extraction model.
  • the key point extraction model can adopt an existing open source model, or can be extracted by training a convolutional neural network.
  • the face key point extraction model can adopt the more popular open source library Dlib for face recognition. Assuming that the target object is a face, using the Dlib key point extraction model, 68 key points on the face in the input image can be extracted.
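A short sketch of extracting the 68 facial key points with Dlib follows; it assumes the standard pre-trained `shape_predictor_68_face_landmarks.dat` file is available locally, and the image path is a placeholder.

```python
import dlib
import cv2

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("frame_0001.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for face in detector(gray):
    shape = predictor(gray, face)
    # 68 (x, y) image-plane key points; obtaining the 3D key point coordinates
    # used by the application would require an additional step not shown here.
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    print(len(landmarks))
```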
  • the present application also proposes a training method for generating a neural radiation field model for a video, comprising the following steps S81-S82:
  • each training image is annotated with first information of the sampling points, spatial coordinates of all key points, and features of the key points.
  • since the neural radiance field model in this application differs from the traditional neural radiance field model in that its input data also includes key point information, it is necessary to create an initialized neural radiance field model and train it on training images so that it learns the first information of the sampling points together with the spatial coordinates and features of all the key points.
  • the pre-acquired training images are images captured from the video. For example, if the target to be synthesized is a video of a person walking seen from a side perspective, then during training, pictures can be captured from a video of a person walking shot from the side, for example, 30 pictures are captured from a 1-second video, and 100 key points are marked for each picture.
  • the spatial position and features of the key points in each picture are obtained, and a key point information is generated for each picture.
  • the multiple sampling points on the light corresponding to the side perspective are fused with each key point information and the sampling points are offset.
  • the offset spatial coordinates of the multiple sampling points and the multiple key point fusion features corresponding to the multiple sampling points are paired and input into the initialized neural radiance field (NeRF) model to generate an experimental image.
  • the experimental image is compared with the training image corresponding to the key point information used during training, and the neural radiation field model is trained by iteratively calculating the loss function.
  • the steps of feature fusion and spatial coordinate bending in the present application can be implemented by a key point encoding module and a light bending module respectively, and the corresponding key point encoding model and light bending model can be trained together with the neural radiation field model.
  • the training steps can be as follows:
  • the initial corrected three-dimensional coordinates and the initial key point driving features are input into the initialized neural radiation field model to render and generate experimental images;
  • the preset loss function is iteratively calculated until the loss function meets the preset conditions.
  • the key point encoding model, light bending model and neural radiation field model are trained.
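A highly simplified sketch of such joint training is given below; the module definitions, data shapes, and learning rate are placeholders, and the loss is a plain per-pixel MSE between the rendered experimental image and the training image, as described above. The rendering step is crudely stubbed; a real implementation would use the volume rendering quadrature sketched earlier.

```python
import torch
import torch.nn as nn

# Placeholder networks standing in for the key point encoder, light bending MLP,
# and NeRF; in practice these would be the modules sketched earlier.
encoder = nn.Linear(3 + 68 * 3, 128)
bender  = nn.Linear(3 + 128, 3)
nerf    = nn.Linear(3 + 128, 4)          # outputs RGB + density per sample

params = list(encoder.parameters()) + list(bender.parameters()) + list(nerf.parameters())
optim = torch.optim.Adam(params, lr=5e-4)

def training_step(ray_points, landmarks, target_pixels):
    """ray_points: (P, S, 3) samples per pixel; landmarks: (68, 3); target_pixels: (P, 3)."""
    lm = landmarks.reshape(-1).expand(ray_points.shape[0], ray_points.shape[1], -1)
    fused = encoder(torch.cat([ray_points, lm], -1))            # key point fusion features
    bent = ray_points + bender(torch.cat([ray_points, fused], -1))
    out = nerf(torch.cat([bent, fused], -1))
    rgb, sigma = out[..., :3].sigmoid(), out[..., 3].relu()
    w = sigma / (sigma.sum(-1, keepdim=True) + 1e-8)            # crude stand-in for volume rendering
    pred = (w.unsqueeze(-1) * rgb).sum(1)                       # (P, 3) rendered pixels
    loss = nn.functional.mse_loss(pred, target_pixels)
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()

# Toy iteration: 1024 pixels, 32 samples per ray.
print(training_step(torch.rand(1024, 32, 3), torch.rand(68, 3), torch.rand(1024, 3)))
```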
  • performing an offset operation on the spatial coordinates of the first sampling point according to the spatial coordinates of the first sampling point and a fusion feature of a first key point corresponding to the first sampling point to obtain the offset spatial coordinates is achieved by using a multi-layer perceptron.
  • the multi-layer perceptron automatically learns the offset of the light, and determines where the light deflects based on the information provided by the key points, so as to obtain the position of the key points after the offset.
  • constraints are set to constrain the coordinates generated in this frame to be the same as the actual position of the lips and teeth in this frame, that is, the model is trained by comparing the generated image with the original image (training image).
  • generating the multiple first key point fusion features from the first information of the multiple sampling points and the second information of the multiple first key points obtained each time can also be implemented using a multi-layer perceptron.
  • the specific training method is similar to the above steps and will not be repeated here.
  • the type of the target object is determined, and then a key point extraction model is selected according to the type of the target object, and the first key point is determined according to the key point extraction model.
  • the video generation device provided in the embodiment of the present application will be described in detail below in conjunction with FIG8. It should be noted that the video generation device in FIG8 is used to execute the method of the embodiment shown in FIG2-FIG7 of the present application. For the convenience of description, only the part related to the embodiment of the present application is shown. For the specific technical details not disclosed, please refer to the embodiment shown in FIG2-FIG7 of the present application.
  • FIG8 shows a schematic diagram of the structure of a video generation device provided by an exemplary embodiment of the present application.
  • the video generation device can be implemented as all or part of the device through software, hardware, or a combination of both.
  • the device includes a light acquisition module 10, a key point acquisition module 20, a key point encoding module 30, a light bending module 40, a neural radiation field module 50, and a video generation module 60.
  • a light acquisition module 10 configured to acquire first information of a plurality of sampling points on a first light, wherein the first information includes spatial coordinates and azimuth viewing angle;
  • a key point acquisition module 20 used for repeatedly acquiring second information of a plurality of first key points of a target object, wherein the second information includes spatial coordinates of the key points and features of the key points;
  • a light bending module 40 configured to perform an offset operation on the spatial coordinates of a first sampling point among the plurality of sampling points according to the spatial coordinates of the first sampling point and a fusion feature of a first key point corresponding to the first sampling point, so as to obtain offset spatial coordinates, wherein the first sampling point is any one of the plurality of sampling points;
  • a neural radiation field module 50 is used to input the offset spatial coordinates of the plurality of sampling points and the fusion features of the plurality of first key points corresponding to the plurality of sampling points into a pre-trained NeRF model in pairs for each acquisition of the second information of the plurality of first key points, so as to obtain a plurality of static images of the target object, wherein the number of the plurality of static images is equal to the number of acquisitions of the second information of the first key points;
  • the video generation module 60 is used to synthesize the multiple static images into a video of the target object.
  • the key point encoding module 30 is specifically configured to, for the first sampling point and the second information of the plurality of first key points obtained each time, determine at least one second key point associated with the first sampling point from the plurality of first key points, and perform attention calculation on the first information of the first sampling point and the second information of the at least one second key point to obtain the first key point fusion feature;
  • alternatively, the key point encoding module 30 is specifically configured to concatenate the first information of the first sampling point and the second information of the at least one second key point to generate the first key point fusion feature.
  • the key point encoding module 30 is specifically used to calculate the distance between the spatial coordinates of the first sampling point and the spatial coordinates of multiple first key points;
  • At least one first key point whose distance is less than or equal to a preset threshold is determined as the at least one second key point.
  • the key point encoding module 30 uses a multi-layer perceptron to perform an operation of generating a plurality of first key point fusion features respectively according to the first information and the second information for a plurality of sampling points and a plurality of first key points obtained each time;
  • the light bending module 40 uses a multi-layer perceptron to perform the operation of offsetting the spatial coordinates of a first sampling point among the multiple sampling points according to the spatial coordinates of the first sampling point and the first key point fusion feature corresponding to the first sampling point, so as to obtain the offset spatial coordinates;
  • the device further comprises a key point extraction module, wherein the key point extraction module is used to determine the type of the target object;
  • a key point extraction model is selected based on the type of the target object, and the first key point is determined according to the key point extraction model.
  • it should be noted that, when the video generating device provided in the above embodiment executes the video generating method, the division into the above functional modules is only used as an example; in practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the video generating device provided in the above embodiment and the video generating method embodiment belong to the same concept, and the implementation process thereof is detailed in the method embodiment, which will not be repeated here.
  • an embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the video generation method of the embodiments shown in Figures 2 to 7 is implemented.
  • the specific execution process can be found in the specific description of the embodiment shown in Figures 2 to 7, which will not be repeated here.
  • FIG. 9 shows a schematic diagram of the structure of a video generation device provided by an exemplary embodiment of the present application.
  • the video generation device in the present application may include one or more of the following components: a processor 110, a memory 120, an input device 130, an output device 140, and a bus 150.
  • the processor 110, the memory 120, the input device 130, and the output device 140 may be connected via a bus 150.
  • the processor 110 may include one or more processing cores.
  • the processor 110 uses various interfaces and lines to connect various parts in the entire video generation device, and executes various functions and processes data of the terminal 100 by running or executing instructions, programs, code sets or instruction sets stored in the memory 120, and calling data stored in the memory 120.
  • the processor 110 can be implemented in at least one hardware form of digital signal processing (DSP), field-programmable gate array (FPGA), and programmable logic array (PLA).
  • the processor 110 can integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), and a modem.
  • the CPU mainly processes the operating system, user pages, and applications;
  • the GPU is responsible for rendering and drawing display content; and the modem is used to process wireless communications. It can be understood that the above-mentioned modem may not be integrated into the processor 110, but may be implemented separately through a communication chip
  • the memory 120 may include a random access memory (RAM) or a read-only memory (ROM).
  • the memory 120 includes a non-transitory computer-readable medium (Non-Transitory Computer-Readable Storage Medium).
  • the memory 120 may be used to store instructions, programs, codes, code sets or instruction sets.
  • the memory 120 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, an image playback function, etc.), instructions for implementing the above-mentioned various method embodiments, etc.
  • the operating system may be an Android system, including a system deeply developed based on the Android system, an IOS system developed by Apple, including a system deeply developed based on the IOS system, or other systems.
  • the memory 120 can be divided into an operating system space and a user space.
  • the operating system runs in the operating system space, and native and third-party applications run in the user space.
  • the operating system allocates corresponding system resources to different third-party applications.
  • the requirements for system resources in different application scenarios in the same third-party application are also different. For example, in the local resource loading scenario, the third-party application has higher requirements for disk reading speed; in the animation rendering scenario, the third-party application has higher requirements for GPU performance.
  • the operating system and third-party applications are independent of each other, and the operating system often cannot perceive the current application scenario of the third-party application in a timely manner, resulting in the operating system being unable to perform targeted system resource adaptation according to the specific application scenario of the third-party application.
  • the input device 130 is used to receive input commands or data, and includes but is not limited to a keyboard, a mouse, a camera, a microphone, or a touch device.
  • the output device 140 is used to output commands or data, and includes but is not limited to a display device and a speaker. In one example, the input device 130 and the output device 140 can be combined, and the input device 130 and the output device 140 are touch screen displays.
  • the touch display screen can be designed as a full screen, a curved screen or a special-shaped screen.
  • the touch display screen can also be designed as a combination of a full screen and a curved screen, or a combination of a special-shaped screen and a curved screen, which is not limited in the embodiments of the present application.
  • the structure of the video generating device shown in the above drawings does not constitute a limitation on the video generating device, and the video generating device may include more or fewer components than shown in the figure, or combine certain components, or arrange the components differently.
  • the video generating device may also include a radio frequency circuit, an input unit, a sensor, an audio circuit, a wireless fidelity (Wi-Fi) module, a power supply, a Bluetooth module, and other components, which will not be described further here.
  • the processor 110 may be used to call a computer program stored in the memory 120 and specifically perform the following operations:
  • acquiring first information of multiple sampling points on a first light ray, wherein the first information includes spatial coordinates and azimuth viewing angle;
  • acquiring, multiple times, second information of multiple first key points of a target object, wherein the second information includes spatial coordinates of the key points and features of the key points;
  • for the multiple sampling points and the multiple first key points obtained each time, generating multiple first key point fusion features according to the first information and the second information, wherein each first key point fusion feature in the multiple first key point fusion features corresponds to one sampling point in the multiple sampling points
  • For a first sampling point among the multiple sampling points performing an offset operation on the spatial coordinates of the first sampling point according to the spatial coordinates of the first sampling point and a first key point fusion feature corresponding to the first sampling point to obtain offset spatial coordinates, wherein the first sampling point is any one sampling point among the multiple sampling points;
  • the offset spatial coordinates of the plurality of sampling points and the fusion features of the plurality of first key points corresponding to the plurality of sampling points are input into the pre-trained NeRF model in pairs, so as to obtain a plurality of static images of the target object, wherein the number of the plurality of static images is equal to the number of acquisitions of the second information of the first key points;
  • the plurality of static images are synthesized into a video of the target object.
  • when the processor 110 generates multiple first key point fusion features according to the first information and the second information for the multiple sampling points and the multiple first key points obtained each time, the processor 110 specifically performs the following operations:
  • at least one second key point associated with the first sampling point is determined from the plurality of first key points, and attention calculation is performed on the first information of the first sampling point and the second information of the at least one second key point to obtain the first key point fusion feature.
  • alternatively, when the processor 110 generates a plurality of first key point fusion features according to the first information and the second information for a plurality of sampling points and a plurality of first key points obtained each time, the processor 110 specifically performs the following operations:
  • the first information of the first sampling point and the second information of the at least one second key point are concatenated to generate the first key point fusion feature.
  • when determining at least one second key point associated with the first sampling point from the plurality of first key points, the processor 110 specifically performs the following operations:
  • the distance between the spatial coordinates of the first sampling point and the spatial coordinates of the multiple first key points is calculated, and at least one first key point whose distance is less than or equal to a preset threshold is determined as the at least one second key point.
  • the processor 110 uses a multilayer perceptron to generate multiple first key point fusion features according to the first information and the second information for multiple sampling points and multiple first key points obtained each time.
  • the processor 110 performs an offset operation on the spatial coordinates of a first sampling point among the multiple sampling points according to the spatial coordinates of the first sampling point and a first key point fusion feature corresponding to the first sampling point to obtain the offset spatial coordinates by using a multi-layer perceptron.
  • the processor 110 performs the following operations before acquiring the second information of the first key points of the target object multiple times:
  • a key point extraction model is selected based on the type of the target object, and the first key point is determined according to the key point extraction model.
  • each sampling point on the first light ray is fused with the second information obtained each time to generate a first key point fusion feature
  • an offset operation is performed on the spatial coordinates of each sampling point according to the first key point fusion feature to obtain the offset spatial coordinates of each sampling point
  • the offset spatial coordinates of the sampling point and the first key point fusion feature are input into a pre-trained neural radiation field model to generate a static image corresponding to each sampling point
  • a corresponding static image is generated for each input of the second information of multiple first key points
  • a video is synthesized based on the multiple static images.
  • each static image is actually associated with the second information of different key points input each time, then the change of each image in the dynamic scene is simulated by integrating the second information of the changing key points, and then the video is synthesized according to the generated picture, and 3D video synthesis is realized while decoupling from time.
  • the synthesis method is simple, and the video of the target object can be synthesized only by the user specifying the viewing angle.
  • the second key point associated with the first sampling point is selected from the multiple first key points, and the second information of the second key point is feature-fused with the first information of the first sampling point to generate the first key point fusion feature.
  • the first information of the first sampling point and the second information of at least one second key point are calculated by attention or the first information of the first sampling point and the second information of at least one second key point are directly spliced to generate the first key point fusion feature, so as to realize the interaction between the key point information and the sampling point information on the light, so that the subsequent neural radiation field model can generate the corresponding picture according to the input key point information.
  • the first key point is determined according to the key point extraction model.
  • by matching corresponding key point extraction models to different types of target objects, more accurate key point information can be provided for the target objects in the generated video, thereby improving the accuracy of video synthesis.
  • the efficiency of key point information collection can be improved, thereby making the video generation process faster.
  • the storage medium can be a disk, an optical disk, a read-only memory (ROM) or a random access memory (RAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)
  • Microscopes, Condensers (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a video generation method, apparatus, device, and computer-readable storage medium. The method includes: acquiring first information of multiple sampling points on a first light ray; acquiring, multiple times, second information of multiple first key points of a target object; generating multiple first key point fusion features according to the first information and the second information; offsetting the spatial coordinates of the multiple sampling points based on the first key point fusion features; inputting the offset spatial coordinates of the multiple sampling points and the corresponding multiple first key point fusion features, in pairs, into a pre-trained neural radiance field model to obtain multiple static images of the target object; and synthesizing these images into a video of the target object.

Description

Video generation method, apparatus, device, and computer-readable storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202211231054X, filed on October 9, 2022, the entire contents of which are hereby incorporated by reference for all purposes.
Technical field
The present invention relates to the field of computer vision, and in particular to a video generation method, apparatus, device, and computer-readable storage medium.
Background
In recent years, deep-learning-based computer vision has made great progress in scenarios such as object tracking and image segmentation, and research on the reconstruction and rendering of 3D scenes has likewise advanced considerably. A neural radiance field (NeRF) is a continuous, implicit representation of a static three-dimensional scene; it flexibly represents the geometry and appearance of the 3D scene and enables realistic novel-view two-dimensional image synthesis. However, NeRF alone only yields two-dimensional images and cannot satisfy the need for three-dimensional video reconstruction.
For three-dimensional video synthesis, the prior art usually trains the neural network by adding a time parameter on top of the 5D input, so that a 3D image at any moment can be obtained and a video synthesized. However, this approach directly adds a dimension, which greatly increases the amount of training data and the training time, and is therefore inefficient. Another commonly used approach generates 3D video of dynamic scenes from time-based latent codes.
Therefore, current 3D video generation still depends, directly or indirectly, on time, and a video generation method that does not depend on a time parameter is urgently needed.
Summary
The main purpose of the present invention is to provide a video generation method, apparatus, device, and computer-readable storage medium, aiming to solve the technical problem that existing video generation methods depend on a time parameter. The technical solution is as follows:
In a first aspect, an embodiment of the present application provides a video generation method, comprising:
acquiring first information of multiple sampling points on a first light ray, the first information including spatial coordinates and azimuth viewing angle;
acquiring, multiple times, second information of multiple first key points of a target object, the second information including spatial coordinates of the key points and features of the key points;
for the multiple sampling points and the second information of the multiple first key points acquired each time, generating multiple first key point fusion features according to the first information and the second information, wherein each of the multiple first key point fusion features corresponds to one of the multiple sampling points;
for a first sampling point among the multiple sampling points, performing an offset operation on the spatial coordinates of the first sampling point according to the spatial coordinates of the first sampling point and the first key point fusion feature corresponding to the first sampling point, to obtain offset spatial coordinates, wherein the first sampling point is any one of the multiple sampling points;
for the second information of the multiple first key points acquired each time, inputting the offset spatial coordinates of the multiple sampling points and the corresponding multiple first key point fusion features, in pairs, into a pre-trained NeRF model, thereby obtaining multiple static images of the target object, wherein the number of static images equals the number of times the second information of the first key points is acquired;
synthesizing the multiple static images into a video of the target object.
In a second aspect, an embodiment of the present application provides a video generation apparatus, comprising:
a light acquisition module, configured to acquire first information of multiple sampling points on a first light ray, the first information including spatial coordinates and azimuth viewing angle;
a key point acquisition module, configured to acquire, multiple times, second information of multiple first key points of a target object, the second information including spatial coordinates of the key points and features of the key points;
a key point encoding module, configured to generate, for the multiple sampling points and the second information of the multiple first key points acquired each time, multiple first key point fusion features according to the first information and the second information, wherein each of the multiple first key point fusion features corresponds to one of the multiple sampling points;
a light bending module, configured to perform, for a first sampling point among the multiple sampling points, an offset operation on the spatial coordinates of the first sampling point according to the spatial coordinates of the first sampling point and the first key point fusion feature corresponding to the first sampling point, to obtain offset spatial coordinates, wherein the first sampling point is any one of the multiple sampling points;
a neural radiance field module, configured to input, for the second information of the multiple first key points acquired each time, the offset spatial coordinates of the multiple sampling points and the corresponding multiple first key point fusion features, in pairs, into a pre-trained NeRF model, thereby obtaining multiple static images of the target object, wherein the number of static images equals the number of times the second information of the first key points is acquired;
a video generation module, configured to synthesize the multiple static images into a video of the target object.
In a third aspect, an embodiment of the present application provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the above method.
In a fourth aspect, an embodiment of the present application provides a computer storage medium storing multiple instructions, the instructions being adapted to be loaded by a processor to execute the steps of the above method.
In an embodiment of the present invention, the spatial coordinates and azimuth viewing angles of multiple sampling points on a first light ray are acquired, multiple first key points of a target object are determined, and the second information of the first key points is acquired multiple times, the second information including the spatial coordinates and features of the key points. Each sampling point on the first light ray is fused with the second information obtained each time to generate a first key point fusion feature; the spatial coordinates of each sampling point are offset according to the first key point fusion feature to obtain the offset spatial coordinates of each sampling point; the offset spatial coordinates of the sampling points and the first key point fusion features are input into a pre-trained neural radiance field model; one static image is generated for each input of the second information of the multiple first key points; and a video is then synthesized from the multiple static images. Because the second information of the multiple first key points of the target object is input in sequence, each static image generated by the neural radiance field for the first light ray is in fact associated with the second information of the different key points input each time; the change of each image in the dynamic scene is simulated by incorporating the changing second information of the key points, and the video is then synthesized from the generated images. This achieves 3D video synthesis while decoupling from time; the synthesis method is simple, and the user only needs to specify a viewing angle to synthesize a video of the target object.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic example diagram of a video generation method provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a video generation method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of key points in a video generation method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of light bending according to key points in a video generation method provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of the refined process of generating multiple first key point fusion features in a video generation method provided in an embodiment of the present application;
FIG. 6 is an overall flow chart of a video generation method provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of the refined process of determining the first key points in a video generation method provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a video generation apparatus provided in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a video generation device provided in an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The video generation apparatus may be a terminal device such as a mobile phone, a computer, a tablet computer, a smart watch, or an in-vehicle device, or may be a module in a terminal device for implementing the video generation method. The video generation apparatus can acquire first information of multiple sampling points on a first light ray, the first information including spatial coordinates and azimuth viewing angle; acquire, multiple times, second information of multiple first key points of a target object, the second information including spatial coordinates of the key points and features of the key points; for the multiple sampling points and the second information of the multiple first key points acquired each time, generate multiple first key point fusion features according to the first information and the second information, each of which corresponds to one of the multiple sampling points; for a first sampling point among the multiple sampling points, perform an offset operation on the spatial coordinates of the first sampling point according to the spatial coordinates of the first sampling point and the first key point fusion feature corresponding to the first sampling point, to obtain offset spatial coordinates, the first sampling point being any one of the multiple sampling points; for the second information of the multiple first key points acquired each time, input the offset spatial coordinates of the multiple sampling points and the corresponding multiple first key point fusion features, in pairs, into a pre-trained NeRF model, thereby obtaining multiple static images of the target object, the number of static images being equal to the number of times the second information of the first key points is acquired; and synthesize the multiple static images into a video of the target object.
Referring to FIG. 1, which is a schematic example diagram of a video generation method provided in an embodiment of the present application, the figure shows the process of synthesizing a 3D video of a target object. In a practical application scenario, a light ray, that is, a shooting-direction viewing angle, is obtained according to the desired viewing angle of the target object; the NeRF model is then driven by the key point information of the target object to obtain multiple 3D static images corresponding to this light ray, and the 3D video is synthesized from the multiple static images.
The video generation method provided in the present application is described in detail below with reference to specific embodiments.
Referring to FIG. 2, which is a schematic flow chart of a video generation method provided in an embodiment of the present application, as shown in FIG. 2, the method of the embodiment of the present application may include the following steps S10-S60.
S10,获取第一光线上的多个采样点的第一信息,所述第一信息包括空间坐标和方位视角;
S20,多次获取目标对象的多个第一关键点的第二信息,所述第二信息包括关键点的空间坐标和关键点的特征;
S30,针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征,其中,所述多个第一关键点融合特征中的每个第一关键点融合特征对应于所述多个采样点中的一个采样点;
S40,针对所述多个采样点中的第一采样点,根据所述第一采样点的空间坐标和所述第一采样点对应的第一关键点融合特征,对所述第一采样点的空间坐标进行偏移操作,获得偏移后的空间坐标,其中,所述第一采样点为所述多个采样点中的任意一个采样点;
S50,针对每次获取的多个第一关键点的第二信息,将所述多个采样点的偏移后的空间坐标和所述多个采样点对应的所述多个第一关键点融合特征,配对地输入预训练的神经辐射场NeRF模型,从而获得所述目标对象的多个静态图像,其中,所述多个静态图像的数量和多次获取第一关键点的第二信息的次数相等;
S60,将所述多个静态图像合成为所述目标对象的视频。
本实施例中,基于传统的NeRF模型,提出了一种三维视频合成方法。NeRF利用MLP多层感知机去隐式地学习一个静态的3D场景。针对每一个静态3D场景,需要提供大量的已知相机参数的图片,来训练NeRF模型。训练好的NeRF模型,可以实现从任意角度重建三维模型,例如人体、建筑物、交通工具等。其输入是5维的坐标,该5维的坐标包括三维的空间坐标(x,y,z)和视角方向d=(θ,φ),NeRF模型的中间层的输出则是颜色c和体积密度σ(可近似理解为不透明度,取值越小越透明),该函数可以理解为将5D坐标映射到相应的体积密度和方向的颜色。然后利用体渲染技术生成图像。在渲染时,针对每一条光线ray(并非必须实际存在),我们使用立体渲染的方式得到该光线对应像素的像素值,具体而言,我们首先在一条光线上采样很多个点,之后通过上述映射得到每个点对应的体积密度以及颜色,接下来利用下面的渲染方程:
C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t),d)\,dt
其中
T(t) = \exp\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right)
其中t_n和t_f代表着一条光线在场景中我们想要渲染的那一部分,函数T(t)表示光线从t_n到t沿着光线累积的透射率,即光线从t_n到t不碰到任何粒子的概率。通过对虚拟相机的每个像素的光线计算积分C(r)来为连续的神经辐射场绘制视图,从而就可以从任意角度渲染出图像。
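在具体实现中,上述积分通常在每条光线上取有限个采样点,用数值求和来近似。下面给出一段基于PyTorch的离散体渲染示意代码,仅用于说明该渲染方程的计算方式,其中的张量维度与变量名均为举例,并非本申请的具体实现:

```python
import torch

def volume_render(sigma, rgb, t_vals, ray_dir):
    """对单条光线上的采样点做离散体渲染,近似积分C(r)。

    sigma:   (N,)   每个采样点的体积密度
    rgb:     (N, 3) 每个采样点的颜色
    t_vals:  (N,)   采样点在光线上的深度,t_1 < ... < t_N
    ray_dir: (3,)   光线方向,用于把深度间隔换算成实际距离
    """
    # 相邻采样点之间的间隔delta_i,最后一个间隔取一个足够大的值
    deltas = t_vals[1:] - t_vals[:-1]
    deltas = torch.cat([deltas, torch.full((1,), 1e10)])
    deltas = deltas * torch.norm(ray_dir)

    # alpha_i = 1 - exp(-sigma_i * delta_i)
    alpha = 1.0 - torch.exp(-sigma * deltas)

    # 累积透射率T_i = prod_{j<i}(1 - alpha_j),对应公式中的T(t)
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)

    weights = alpha * trans                        # 各采样点对像素颜色的贡献权重
    color = (weights[:, None] * rgb).sum(dim=0)    # 渲染得到的像素颜色C(r)
    return color, weights
```

其中weights即各采样点对最终像素颜色的贡献,对应公式中T(t)σ(r(t))的离散形式。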
上面描述的是利用NeRF重建静态的3D图像,显然不能得到动态的3D视频。理论上,在NeRF中所用5D坐标的基础上加入时间参数训练神经网络,就能够得到任意时间的3D图像,从而利用时间参数来合成动态视频。但是这种方式,会直接增加了一个全新的参数,训练时的数据量大大增加,训练时间也大量延长,效率明显降低。同时,现有技术中的其他三维视频合成方法也需要直接地或者间接地依赖时间参数。基于此,本申请提出了一种能与时间参数解耦的3D视频合成方法,使得3D视频不依赖于时间信息,也不依赖于训练时所使用的视频中每一帧图像的顺序。
以下将对各个步骤进行详细说明:
S10,获取第一光线上的多个采样点的第一信息,所述第一信息包括空间坐标和方位视角;
具体的,第一光线为根据视频观看视角得到的一条虚拟光线,其可以认为是从人眼或摄像机等光线始发点到目标对象处的一条光线。获取第一光线上多个采样点的第一信息,
第一信息可以包括采样点的空间坐标和方位视角。可以理解的,由于传统的NeRF的模型输入需要空间坐标和方位视角,故在本方案中同样至少要获取采样点的空间坐标和方位视角。
S20,多次获取目标对象的多个第一关键点的第二信息,所述第二信息包括关键点的空间坐标和关键点的特征;
具体的,目标对象为所需要生成视频中包含的动态内容,可以是某个物体、人物、场景等等。例如,需要生成一个人说话的视频,那么目标对象就是该人头部,在说话的时候,人物的面部表情会发生变化,比如嘴唇会张开或者闭合等,通过获取人物面部的多个第一关键点,并跟踪人物在说话时这些关键点空间坐标的具体发生的变化,可以得到多个第一关键点的第二信息。参照图3,图3为本申请实施例提供的一种视频生成方法的关键点示意图,图中黑色的点为该人物头部的关键点。可以理解的,关键点的数量可以根据目标对象来确定,通常来说,第一关键点的数量越多则生成视频的模拟动作精度越高。
需要说明的是,关键点的特征并不会发生改变,会发生改变的是关键点的空间坐标。
S30,针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征,其中,所述多个第一关键点融合特征中的每个第一关键点融合特征对应于所述多个采样点中的一个采样点;
具体的,对于动态的场景,相同的关键点会随着不同动作在空间中出现空间位置的移动,则动态NeRF的难点在于:在同一个光线角度下,如何模拟出该种动态的变化。参照图4,图4为本申请实施例提供的一种视频生成方法中根据关键点进行光线弯曲的示意图,图中标注的圆点为牙齿上的一个第一关键点,图中带箭头的光线为第一光线。假设要从图4中左边闭合的嘴,变到右边张开的嘴,那么就需要对光线进行弯曲,使得光线弯曲之后依然能得到这个位置的体积密度和颜色。换句话说就是在动态的时候把这个原本静态的关键点的(x,y,z)加一些偏移量变成(x',y',z'),使得该关键点仍然对应牙齿这个位置。因此,在对光线进行弯曲之前,通过特征融合,先将第一光线上的采样点坐标与关键点的坐标以及关键点特征相互关联或绑定起来,从而实现利用关键点来驱动NeRF。以生成一张图片为例,将第一光线上的每个采样点的第一信息和某一次获取的多个第一关键点的第二信息分别进行融合,生成每个采样点对应的第一关键点融合特征。其中,第一关键点融合特征不仅包括了融合后的采样点坐标、关键点坐标和关键点特征信息,还包括了采样点的方位视角。
S40,针对所述多个采样点中的第一采样点,根据所述第一采样点的空间坐标和所述第一采样点对应的第一关键点融合特征,对所述第一采样点的空间坐标进行偏移操作,获得偏移后的空间坐标,其中,所述第一采样点为所述多个采样点中的任意一个采样点;
具体的,在获得第一关键点融合特征之后,对第一光线进行弯曲,也即对光线上每一个采样点坐标进行偏移。从多个采样点中获取一个第一采样点,再将第一采样点的空间坐标和第一采样点对应的第一关键点融合特征输入训练好的光线弯曲模块,通过光线弯曲模块获得偏移后的第一采样点的空间坐标。光线弯曲模块可以基于神经网络训练得到。
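作为一种示意,光线弯曲模块可以由一个小型多层感知机构成,输入为采样点坐标与其对应的第一关键点融合特征,输出坐标偏移量。下面的PyTorch代码仅为在该假设下的简化示例,网络结构与维度均非本申请限定的实现:

```python
import torch
import torch.nn as nn

class RayBending(nn.Module):
    """根据采样点坐标与其对应的关键点融合特征,预测坐标偏移量并返回偏移后的坐标。"""

    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 3),   # 输出三维偏移量
        )

    def forward(self, xyz: torch.Tensor, fused_feat: torch.Tensor) -> torch.Tensor:
        # xyz: (B, 3) 采样点空间坐标;fused_feat: (B, feat_dim) 第一关键点融合特征
        offset = self.mlp(torch.cat([xyz, fused_feat], dim=-1))
        return xyz + offset          # 偏移后的空间坐标(x', y', z')
```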
S50,针对每次获取的多个第一关键点的第二信息,将所述多个采样点的偏移后的空间坐标和所述多个采样点对应的所述多个第一关键点融合特征,配对地输入预训练的NeRF模型,从而获得所述目标对象的多个静态图像,其中,所述多个静态图像的数量和多次获取第一关键点的第二信息的次数相等;
具体的,在步骤S40中我们得到了偏移后的采样点坐标,再和第一关键点融合特征结合,输入已经训练好的NeRF模型中。可选的,采用NVIDIA的Instant-NGP的多分辨率哈希编码Hashgrid方案来优化NeRF的编码,原因是传统的频率编码(encoding)是一种隐式encoding,而Hashgrid是一种显式encoding,两者结合能有更好的效果,并且允许用更少的计算量实现相同的渲染质量。预训练的NeRF模型的输出是RGB的值以及体积密度density,同时,根据RGB的值和体积密度,利用体渲染技术生成静态图像。体渲染技术为现有公开技术,在此不加赘述。
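为帮助理解Hashgrid编码的思路,下面给出一个单分辨率空间哈希查表的极简示意;Instant-NGP实际采用多分辨率网格并对网格顶点特征做三线性插值,此处的哈希函数、表大小、特征维度等均为举例,并非本申请限定的实现:

```python
import torch
import torch.nn as nn

class HashGridEncoding(nn.Module):
    """单分辨率哈希网格编码的极简示意:把空间坐标映射到可学习的特征向量。"""

    PRIMES = (1, 2654435761, 805459861)   # 常用的空间哈希质数

    def __init__(self, table_size: int = 2 ** 19, feat_dim: int = 2, resolution: int = 64):
        super().__init__()
        self.table_size = table_size
        self.resolution = resolution
        # 哈希表中的特征是可学习参数,随训练更新
        self.table = nn.Parameter(torch.randn(table_size, feat_dim) * 1e-4)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (B, 3),假设已归一化到[0, 1]
        grid = torch.floor(xyz * self.resolution).long()        # 所在网格顶点的整数坐标
        h = (grid[:, 0] * self.PRIMES[0]
             ^ grid[:, 1] * self.PRIMES[1]
             ^ grid[:, 2] * self.PRIMES[2]) % self.table_size   # 空间哈希
        return self.table[h]                                    # 查表得到可学习特征 (B, feat_dim)
```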
S60,将所述多个静态图像合成为所述目标对象的视频。
例如,将生成的静态图像作为视频中一帧一帧的图像,将图像按顺序进行拼接得到视频。可以理解的,假设生成的视频为人物说话视频,则在预先训练的过程中,我们采集的数据是人物说话的视频,并进行帧采样,如FPS为60,并获取每一帧图像中关键点的空间坐标,生成对应的第二信息,那么在视频合成的过程中将获取的多个第一关键点的第二信息按顺序输入,则对应的将顺序生成静态图像,直接进行拼接即可得到视频。这里的图像被拼接的顺序对应的是多个第一关键点输入的顺序,而不是对NeRF模型进行训练时的视频中的每一帧图像的时间顺序。
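按顺序把静态图像写成视频文件,可以借助通用的视频写出库完成,例如下面基于imageio的示意代码;输出路径与帧率仅为举例,写出mp4通常还需要安装ffmpeg后端:

```python
import imageio
import numpy as np

def frames_to_video(frames, out_path="output.mp4", fps=60):
    """frames: 按第一关键点第二信息的输入顺序排列的静态图像列表,每帧为HxWx3的uint8数组。"""
    writer = imageio.get_writer(out_path, fps=fps)   # 需要imageio-ffmpeg等后端支持mp4
    for frame in frames:
        writer.append_data(np.asarray(frame, dtype=np.uint8))
    writer.close()
```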
在本申请实施例中,通过获取第一光线上多个采样点的空间坐标和方位视角信息,并获取目标对象的多个第一关键点,再多次获取第一关键点的第二信息,其中,第二信息包括关键点的空间坐标和关键点的特征,将第一光线上的每个采样点与每次获得的第二信息进行融合,生成第一关键点融合特征,根据第一关键点融合特征对每个采样点的空间坐标进行偏移操作,获得每个采样点偏移后的空间坐标,将采样点偏移后的空间坐标和第一关键点融合特征输入预训练的神经辐射场模型,生成每个采样点对应的静态图像,对每次输入的多个第一关键点的第二信息都生成一张对应的静态图像,再根据多张静态图像合成视频。通过依次输入目标对象多个第一关键点的第二信息,使得在根据神经辐射场生成第一光线对应的静态图像时,每一张静态图像实际是与每次输入的不同关键点的第二信息相关联的,则通过融入变化的关键点第二信息来模拟动态的场景中每一张图像的变化,再根据生成的图片来合成视频,在与时间解耦的同时实现了3D视频合成,合成方法简单,只需要用户指定视角即可合成目标对象的视频。
参见图5,为本申请实施例提供了一种视频生成方法中生成多个第一关键点融合特征的细化流程示意图。如图5所示,本申请实施例的所述方法可以包括以下步骤S31-S32。
S31,从多个第一关键点中确定与所述第一采样点相关联的至少一个第二关键点;
在本实施例中,在生成第一关键点融合特征时,从多个第一关键点中选取部分第二关键点与第一采样点的第一信息进行特征融合。可以理解的,空间中的采样点P(x,y,z)不会跟所有的关键点landmark(x,y,z)都存在关联,例如,眼睛附近的关键点驱动眼睛的运动,嘴巴附近的关键点驱动嘴巴的运动,眼睛附近的关键点不会驱动嘴巴运动。因此,需要从第一关键点中选取与第一采样点相关联的第二关键点,从而使得关键点驱动更加准确。具体的,从多个第一关键点中确定与第一采样点相关联的至少一个第二关键点可以通过训练神经网络来确定,输入关联特征,使神经网络学习关键点与采样点之间的关联特征,则可以通过训练得到的神经网络进行关联性预测,获取第一关键点中的第二关键点。可选的,还可以通过设定关键点与采样点的对应关系,当需要从多个第一关键点中确定与第一采样点相关的至少一个第二关键点时,可以从对应的关系映射表得到。
S32,对所述第一采样点的第一信息和所述至少一个第二关键点的第二信息进行注意力计算,获取所述第一关键点融合特征。
具体的,当确认与第一采样点相关的第二关键点后,获取第二关键点的第二信息与第一采样点的第一信息进行注意力计算,从而将第一采样点与第二关键点之间进行关联,经过注意力机制attention之后,得到的第一关键点融合特征feature代表着目标对象关键点信息与光线信息交互后的特征。
例如,以点P代表采样点,P(x,y,z)和landmark(x,y,z)实际上是同一空间中的点,并且关键点对P(x,y,z)的影响和其空间位置有关,所以这里采用了基于cross-attention的编码方法,例如如下方式(示意代码见下方列表之后):
采样点P(x,y,z)是1x3的tensor,将其作为query;
landmark(x,y,z)是Mx3的tensor,将其作为key;
考虑到M个landmark有对应的语义,所以给landmark(x,y,z)设置了对应的landmark feature,为Mx3的embedding,将其作为value;
对query、key、value做attention操作,得到最终的landmark encoding也即关键点融合特征。
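上述基于cross-attention的编码过程可以用如下PyTorch示意代码表示;其中把坐标先线性投影到统一维度属于实现上的一种常见做法,具体维度与投影方式均为举例,并非本申请限定:

```python
import torch
import torch.nn as nn

class LandmarkEncoder(nn.Module):
    """用cross-attention把采样点坐标(query)与关键点坐标(key)、关键点特征(value)融合。"""

    def __init__(self, num_landmarks: int, d_model: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(3, d_model)       # 采样点P(x, y, z) -> query
        self.k_proj = nn.Linear(3, d_model)       # landmark(x, y, z) -> key
        # M个关键点各自对应一个可学习的特征,作为value
        self.landmark_feat = nn.Parameter(torch.randn(num_landmarks, d_model) * 0.01)
        self.scale = d_model ** -0.5

    def forward(self, sample_xyz: torch.Tensor, landmark_xyz: torch.Tensor) -> torch.Tensor:
        # sample_xyz: (B, 3) 批量采样点坐标;landmark_xyz: (M, 3) 关键点坐标
        q = self.q_proj(sample_xyz)                        # (B, d)
        k = self.k_proj(landmark_xyz)                      # (M, d)
        v = self.landmark_feat                             # (M, d)
        attn = torch.softmax(q @ k.t() * self.scale, -1)   # (B, M),即softmax(QK^T / sqrt(d_k))
        return attn @ v                                    # (B, d) 第一关键点融合特征
```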
进一步地,在一实施例中,所述从多个第一关键点中确定与所述第一采样点相关联的至少一个第二关键点,包括:
S311,计算所述第一采样点的空间坐标与多个第一关键点的空间坐标的距离;
S312,确定所述距离小于或等于预设阈值的至少一个第一关键点为所述至少一个第二关键点。
具体的,对第一采样点P(x,y,z)和所有的第一关键点landmark(x,y,z)进行距离的计算,确定距离小于或等于预设阈值的至少一个第一关键点为与第一采样点相关联 的至少一个第二关键点。下述公式中的Q乘以K本身表示一种相似度,也是一种距离的衡量。需要说明的是,注意力计算attention可以直接采用现有技术中的计算公式,具体为下面的公式所示:
Attention(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
式中,Q是输入的采样点的坐标(x,y,z),K是landmark的坐标(x,y,z),V是可学习的landmark特征(可学习是指初始化为一些随机值,然后这些随机值可以在训练的时候随着网络参数的更新而更新),d_k是Q或者K的嵌入(embedding)维度。这里举个例子,假设Q是200x2048,K和V也是200x2048,这里的d_k就是2048。
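对应步骤S311-S312中按距离阈值筛选第二关键点的做法,可以参考下面的示意代码(阈值数值仅为举例):

```python
import torch

def select_second_keypoints(sample_xyz: torch.Tensor,
                            landmark_xyz: torch.Tensor,
                            threshold: float = 0.1):
    """返回与第一采样点距离小于或等于预设阈值的第一关键点(即第二关键点)的索引。

    sample_xyz:   (3,)   第一采样点的空间坐标
    landmark_xyz: (M, 3) 多个第一关键点的空间坐标
    """
    dist = torch.norm(landmark_xyz - sample_xyz[None, :], dim=-1)      # (M,) 欧氏距离
    return torch.nonzero(dist <= threshold, as_tuple=False).squeeze(-1)
```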
进一步地,在一实施例中,本申请实施例的所述方法可以包括以下步骤S33-S34。
S33,从多个第一关键点中确定与所述第一采样点相关联的至少一个第二关键点;
S34,将所述第一采样点的第一信息和所述至少一个第二关键点的第二信息进行拼接,生成所述第一关键点融合特征。
在一实施例中,从多个第一关键点中确认第二关键点后,生成第一关键点融合特征的方法为,将第一采样点的第一信息和至少一个第二关键点的第二信息进行拼接。具体的,把第二关键点坐标(x,y,z)直接变换成1维向量,然后和第一采样点P(x,y,z)拼接在一起,然后作为后续NeRF模型的输入。需要说明的是,将关键点坐标直接与采样点坐标进行拼接的特征融合方式,相较于通过注意力进行特征融合的方法,其效果更差,然而该方法简单、快速,在目标合成视频的质量要求不高的情况下可以采用该种方式提高生成视频的速度。
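该拼接方式可以用如下示意代码表示(维度仅为举例):

```python
import torch

def concat_fusion(sample_xyz: torch.Tensor, second_keypoints_xyz: torch.Tensor) -> torch.Tensor:
    """把若干第二关键点坐标展平成一维向量,与第一采样点坐标直接拼接,作为后续NeRF模型的输入。

    sample_xyz:           (3,)    第一采样点坐标
    second_keypoints_xyz: (K, 3)  与该采样点相关联的第二关键点坐标
    """
    return torch.cat([sample_xyz, second_keypoints_xyz.reshape(-1)], dim=0)  # (3 + 3K,)
```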
参照图6,图6为本申请实施例提供了一种视频生成方法的整体流程图。图6示出的为对每个第一采样点(也即图中的采样点)和对应的获取的一次第一关键点(也即图中的关键点)的第二信息的处理过程,核心处理模块包括关键点编码模块、光线弯曲模块和神经辐射场模块。首先将采样点坐标作为Query,关键点的坐标作为Key,关键点的特征作为Value通过注意力机制进行融合,得到关键点融合特征,需要说明的是,采样点的方位视角隐含包括在融合特征中以用于后续输入NeRF模型中;然后,将关键点融合特征和采样点坐标输入光线弯曲模块中的光线弯曲多层感知机,输出偏移后的采样点坐标,再将偏移后的采样点坐标和关键点融合特征输入神经辐射场模块,通过NeRF,结合Hashgrid,生成一个采样点对应的颜色RGB和体积密度,将第一光线上所有的第一采样点分别输入上述各个模块中,得到第一光线上所有采样点的颜色RGB和体积密度,基于所有采样点的颜色RGB和体积密度生成一张静态图片,静态图片的数量与多次获取第一关键点的第二信息的次数相等,再根据获取的多张静态图片生成视频。
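结合图6,单帧图像的生成流程可以概括为如下示意代码,其中encoder、bender、nerf_model、renderer分别对应关键点编码模块、光线弯曲模块、神经辐射场模块和体渲染步骤;这些接口名称均为说明用的假设,体渲染部分可参考前文的离散体渲染示意:

```python
import torch

def render_one_frame(rays, landmark_xyz, encoder, bender, nerf_model, renderer):
    """给定第一光线集合与一次获取的第一关键点第二信息,渲染出一张静态图像的像素。

    rays: 列表,每个元素为(t_vals, sample_xyz, view_dir),对应一条光线上的采样点
    landmark_xyz: (M, 3) 本次获取的第一关键点空间坐标
    """
    pixels = []
    for t_vals, sample_xyz, view_dir in rays:
        fused = encoder(sample_xyz, landmark_xyz)            # (N, d) 第一关键点融合特征
        bent_xyz = bender(sample_xyz, fused)                 # (N, 3) 偏移后的空间坐标
        rgb, sigma = nerf_model(bent_xyz, view_dir, fused)   # 每个采样点的颜色与体积密度
        color, _ = renderer(sigma, rgb, t_vals, view_dir)    # 体渲染得到该光线对应像素的颜色
        pixels.append(color)
    return torch.stack(pixels)   # 之后按图像分辨率reshape成一张静态图像
```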
本实施例中在生成第一关键点融合特征时,从多个第一关键点中选取与第一采样点相关联的第二关键点,并将第二关键点的第二信息与第一采样点的第一信息进行特征融合,生成第一关键点融合特征,具体的,对第一采样点的第一信息和至少一个第二关键点的第二信息进行注意力计算,或者是直接将第一采样点的第一信息和至少一个第二关键点的第二信息进行拼接,生成第一关键点融合特征,实现关键点信息与光线上采样点信息的交互,使得后续神经辐射场模型能根据输入的关键点信息生成相应的图片。
请参见图7,为本申请实施例提供了一种视频生成方法中确定第一关键点的细化流程示意图。如图7所示,本申请实施例的所述方法可以包括以下步骤S71-S72。
S71,确定所述目标对象的类型;
在一实施例中,目标对象可以为多种类型,不同类型的目标对象所需要提取的关键点不同。可以理解的,如果目标对象是人物头像,则需要获取人物头像对应的关键点;如果目标对象是动物,则需要获取动物对应的关键点;又或者,目标对象为人体肢体,则需要获取肢体对应的关键点。具体的,目标对象的类型划分可以根据实际情况确定。
S72,基于所述目标对象的类型选择关键点提取模型,根据所述关键点提取模型确定所述第一关键点。
在一实施例中,在确定目标对象的类型后,获取该目标对象对应的关键点提取模型,并通过关键点提取模型得到第一关键点。关键点提取模型可以采用现有的开源模型,也可以通过卷积神经网络训练得到。例如,人脸关键点提取模型可以采用较流行的人脸识别开源库Dlib。假设目标对象为人脸,使用Dlib关键点提取模型,则可以提取到输入的图片中人脸上的68个关键点。
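以Dlib为例,提取人脸68个关键点的示意代码如下(模型文件路径为举例,需要预先下载Dlib提供的68点预测模型文件):

```python
import dlib
import cv2

# 人脸检测器与68点关键点预测器(模型文件路径为举例)
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_face_landmarks(image_path: str):
    """返回图片中第一张人脸的68个关键点的像素坐标列表。"""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)            # 第二个参数为上采样次数
    if len(faces) == 0:
        return []
    shape = predictor(gray, faces[0])
    return [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```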
可选的,本申请还提出一种用于生成视频的神经辐射场模型的训练方法,包括以下步骤S81-S82:
S81,创建初始化的神经辐射场模型;
S82,利用预先获取的训练图像对所述初始化的神经辐射场模型进行训练,获得训练好的神经辐射场模型,其中,每张训练图像中标注有采样点的第一信息、所有关键点的空间坐标和关键点的特征。
具体的,由于本申请中神经辐射场模型与传统的神经辐射场模型存在一定的区别,其输入数据还包括了关键点信息,故需要创建初始化的神经辐射场模型,并通过训练图像进行模型训练,使得神经辐射场模型学习采样点的第一信息和所有关键点的空间坐标及关键点特征。其中,预先获取的训练图像为从视频中截取出的图像,例如待合成的目标为侧面视角看到的人物行走视频,那么在训练时则可以从侧面拍摄的一个人物行走视频中,截取图片,例如从1秒的视频中截取30张图片,为每一张图片标注100个关键点,并获取每一张图片中关键点的空间位置和关键点的特征,为每一张图片对应的生成一份关键点信息,再将侧面视角对应的光线上的多个采样点,与每一份关键点信息进行特征融合和采样点偏移,将多个采样点的偏移后的空间坐标和多个采样点对应的多个关键点融合特征,配对地输入初始化的神经辐射场NeRF模型,生成一张实验图像,将实验图像与训练时使用的关键点信息对应的训练图像进行比较,通过迭代计算损失函数,训练得到神经辐射场模型。
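上述训练过程中单次迭代的损失计算可以示意如下,其中各模块接口与批数据组织方式均为假设,损失采用渲染像素与训练图像像素之间的L2(光度)损失:

```python
import torch

def train_step(batch, encoder, bender, nerf_model, renderer, optimizer):
    """一次训练迭代的示意:用渲染结果与训练图像对应像素做L2损失并反向传播。

    batch包含:一条光线上采样点的深度t值、坐标与方位视角,该训练图像标注的关键点坐标,
    以及该光线对应的真实像素颜色gt_color。
    """
    t_vals, sample_xyz, view_dir, landmark_xyz, gt_color = batch

    fused = encoder(sample_xyz, landmark_xyz)            # 关键点融合特征
    bent_xyz = bender(sample_xyz, fused)                 # 光线弯曲后的采样点坐标
    rgb, sigma = nerf_model(bent_xyz, view_dir, fused)   # NeRF输出颜色与体积密度
    pred_color, _ = renderer(sigma, rgb, t_vals, view_dir)

    loss = torch.mean((pred_color - gt_color) ** 2)      # 光度重建损失
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```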
可选的,本申请中进行特征融合和进行空间坐标弯曲的步骤可分别通过关键点编码模块和光线弯曲模块实现,对应的关键点编码模型和光线弯曲模型可以与神经辐射场模型一同训练。示例性的,训练步骤可如下:
获取训练视频数据,对训练视频数据进行帧采样生成训练图像集;
获取训练图像集中各图像的所有像素点的空间坐标和方位视角,并提取训练图像集中各图像的关键点,将空间坐标、方位视角和关键点输入初始关键点编码模型,获得初始关键点融合特征;
将初始关键点融合特征和空间坐标输入初始光线弯曲模型,并输出初始校正三维坐标;
将初始校正三维坐标和初始关键点融合特征输入初始化的神经辐射场模型,渲染生成实验图像;
基于实验图像与训练图像集,迭代计算预设的损失函数,直到损失函数满足预设条件时,训练得到关键点编码模型、光线弯曲模型和神经辐射场模型。
进一步地,在一实施例中,所述针对所述多个采样点中的第一采样点,根据所述第一采样点的空间坐标和所述第一采样点对应的第一关键点融合特征,对所述第一采样点的空间坐标进行偏移操作,获得偏移后的空间坐标,是利用多层感知机实现的。
可以理解的,在人说话脸动的时候,脸上的关键点也会动,那么我们要实现的是光线随着关键点动。具体的,通过多层感知机(MLP)自动学习光线的偏移,根据关键点提供的信息来决定光线往哪里偏,才能得到偏移后关键点的位置。在训练的时候会设定约束,约束这一帧生成的坐标要和实际的这一帧嘴唇、牙齿的位置是一样的,也即通过生成的图片与原始图片(训练图片)进行对比训练模型。
进一步地,在一实施例中,所述针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征,也可以利用多层感知机实现,具体训练方式与上述步骤类似,在此不加赘述。
在本申请实施例中,通过确定目标对象的类型,再根据目标对象的类型选择关键点提取模型,根据关键点提取模型来确定第一关键点。通过为不同类型的目标对象匹配对应的关键点提取模型,能够为生成视频中的目标对象提供更加准确的关键点信息,从而提高视频合成的准确性;同时,通过预训练的关键点提取模型提取关键点,而不需要临时进行人工关键点标注,能够提高关键点信息采集效率,从而使得视频生成的过程更加快速。
下面将结合附图8,对本申请实施例提供的视频生成装置进行详细介绍。需要说明的是,附图8中的视频生成装置,用于执行本申请图2-图7所示实施例的方法,为了便于说明,仅示出了与本申请实施例相关的部分,具体技术细节未揭示的,请参照本申请图2-图7所示的实施例。
请参见图8,其示出了本申请一个示例性实施例提供的视频生成装置的结构示意图。该视频生成装置可以通过软件、硬件或者两者的结合实现成为装置的全部或一部分。该装置包括光线获取模块10、关键点获取模块20、关键点编码模块30、光线弯曲模块40、神经辐射场模块50和视频生成模块60。
光线获取模块10,用于获取第一光线上的多个采样点的第一信息,所述第一信息包括空间坐标和方位视角;
关键点获取模块20,用于多次获取目标对象的多个第一关键点的第二信息,所述第二信息包括关键点的空间坐标和关键点的特征;
关键点编码模块30,用于针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征,其中,所述多个第一关键点融合特征中的每个第一关键点融合特征对应于所述多个采样点中的一个采样点;
光线弯曲模块40,用于针对所述多个采样点中的第一采样点,根据所述第一采样点的空间坐标和所述第一采样点对应的第一关键点融合特征,对所述第一采样点的空间坐标进行偏移操作,获得偏移后的空间坐标,其中,所述第一采样点为所述多个采样点中的任意一个采样点;
神经辐射场模块50,用于针对每次获取的多个第一关键点的第二信息,将所述多个采样点的偏移后的空间坐标和所述多个采样点对应的所述多个第一关键点融合特征,配对地输入预训练的NeRF模型,从而获得所述目标对象的多个静态图像,其中,所述多个静态图像的数量和多次获取第一关键点的第二信息的次数相等;
视频生成模块60,用于将所述多个静态图像合成为所述目标对象的视频。
可选的,所述关键点编码模块30具体用于针对所述第一采样点和每次获取的多个第一关键点的第二信息,
从多个第一关键点中确定与所述第一采样点相关联的至少一个第二关键点;
对所述第一采样点的第一信息和所述至少一个第二关键点的第二信息进行注意力计算,获取所述第一关键点融合特征。
可选的,所述关键点编码模块30具体用于针对所述第一采样点和每次获取的多个第一关键点的第二信息,
从多个第一关键点中确定与所述第一采样点相关联的至少一个第二关键点;
将所述第一采样点的第一信息和所述至少一个第二关键点的第二信息进行拼接,生成所述第一关键点融合特征。
可选的,所述关键点编码模块30具体用于计算所述第一采样点的空间坐标与多个第一关键点的空间坐标的距离;
确定所述距离小于或等于预设阈值的至少一个第一关键点为所述至少一个第二关键点。
可选的,所述关键点编码模块30是利用多层感知机执行针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征的操作;
可选的,所述光线弯曲模块40是利用多层感知机执行针对所述多个采样点中的第一采样点,根据所述第一采样点的空间坐标和所述第一采样点对应的第一关键点融合特征,对所述第一采样点的空间坐标进行偏移操作,获得偏移后的空间坐标的操作;
可选的,所述装置还包括关键点提取模块,所述关键点提取模块用于确定所述目标对象的类型;
基于所述目标对象的类型选择关键点提取模型,根据所述关键点提取模型确定所述第一关键点。
需要说明的是,上述实施例提供的视频生成装置在执行视频生成方法时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的视频生成装置与视频生成方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
本申请实施例还提供了一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,所述计算机程序被处理器执行时实现如上述图2-图7所示实施例的所述视频生成方法,具体执行过程可以参见图2-图7所示实施例的具体说明,在此不进行赘述。
请参考图9,其示出了本申请一个示例性实施例提供的视频生成设备的结构示意图。本申请中的视频生成设备可以包括一个或多个如下部件:处理器110、存储器120、输入装置130、输出装置140和总线150。处理器110、存储器120、输入装置130和输出装置140之间可以通过总线150连接。
处理器110可以包括一个或者多个处理核心。处理器110利用各种接口和线路连接整个视频生成设备内的各个部分,通过运行或执行存储在存储器120内的指令、程序、代码集或指令集,以及调用存储在存储器120内的数据,执行终端100的各种功能和处理数据。可选地,处理器110可以采用数字信号处理(Digital Signal Processing,DSP)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)、可编程逻辑阵列(Programmable Logic Array,PLA)中的至少一种硬件形式来实现。处理器110可集成中央处理器(Central Processing Unit,CPU)、图像处理器(Graphics Processing Unit,GPU)和调制解调器等中的一种或几种的组合。其中,CPU主要处理操作系统、用户页面和应用程序等;GPU用于负责显示内容的渲染和绘制;调制解调器用于处理无线通信。可以理解的是,上述调制解调器也可以不集成到处理器110中,单独通过一块通信芯片进行实现。
存储器120可以包括随机存储器(Random Access Memory,RAM),也可以包括只读存储器(Read-Only Memory,ROM)。可选地,该存储器120包括非瞬时性计算机可读介质(Non-Transitory Computer-Readable Storage Medium)。存储器120可用于存储指令、程序、代码、代码集或指令集。存储器120可包括存储程序区和存储数据区,其中,存储程序区可存储用于实现操作系统的指令、用于实现至少一个功能的指令(比如触控功能、声音播放功能、图像播放功能等)、用于实现上述各个方法实施例的指令等,该操作系统可以是安卓(Android)系统,包括基于Android系统深度开发的系统、苹果公司开发的IOS系统,包括基于IOS系统深度开发的系统或其它系统。
存储器120可分为操作系统空间和用户空间,操作系统即运行于操作系统空间,原生及第三方应用程序即运行于用户空间。为了保证不同第三方应用程序均能够达到较好的运行效果,操作系统针对不同第三方应用程序为其分配相应的系统资源。然而,同一第三方应用程序中不同应用场景对系统资源的需求也存在差异,比如,在本地资源加载场景下,第三方应用程序对磁盘读取速度的要求较高;在动画渲染场景下,第三方应用程序则对GPU性能的要求较高。而操作系统与第三方应用程序之间相互独立,操作系统往往不能及时感知第三方应用程序当前的应用场景,导致操作系统无法根据第三方应用程序的具体应用场景进行针对性的系统资源适配。
为了使操作系统能够区分第三方应用程序的具体应用场景,需要打通第三方应用程序与操作系统之间的数据通信,使得操作系统能够随时获取第三方应用程序当前的场景信息,进而基于当前场景进行针对性的系统资源适配。
其中,输入装置130用于接收输入的指令或数据,输入装置130包括但不限于键盘、鼠标、摄像头、麦克风或触控设备。输出装置140用于输出指令或数据,输出装置140包括但不限于显示设备和扬声器等。在一个示例中,输入装置130和输出装置140可以合设,输入装置130和输出装置140为触摸显示屏。
所述触摸显示屏可被设计成为全面屏、曲面屏或异型屏。触摸显示屏还可被设计成为全面屏与曲面屏的结合,异型屏与曲面屏的结合,本申请实施例对此不加以限定。
除此之外,本领域技术人员可以理解,上述附图所示出的视频生成设备的结构并不构成对视频生成设备的限定,视频生成设备可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。比如,视频生成设备中还包括射频电路、输入单元、传感器、音频电路、无线保真(Wireless Fidelity,Wi-Fi)模块、电源、蓝牙模块等部件,在此不再赘述。
在图9所示的视频生成设备中,处理器110可以用于调用存储器120中存储的计算机程序,并具体执行以下操作:
获取第一光线上的多个采样点的第一信息,所述第一信息包括空间坐标和方位视角;
多次获取目标对象的多个第一关键点的第二信息,所述第二信息包括关键点的空间坐标和关键点的特征;
针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征,其中,所述多个第一关键点融合特征中的每个第一关键点融合特征对应于所述多个采样点中的一个采样点;
针对所述多个采样点中的第一采样点,根据所述第一采样点的空间坐标和所述第一采样点对应的第一关键点融合特征,对所述第一采样点的空间坐标进行偏移操作,获得偏移后的空间坐标,其中,所述第一采样点为所述多个采样点中的任意一个采样点;
针对每次获取的多个第一关键点的第二信息,将所述多个采样点的偏移后的空间坐标和所述多个采样点对应的所述多个第一关键点融合特征,配对地输入预训练的NeRF模型,从而获得所述目标对象的多个静态图像,其中,所述多个静态图像的数量和多次获取第一关键点的第二信息的次数相等;
将所述多个静态图像合成为所述目标对象的视频。
在一个实施例中,所述处理器110在执行针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征时,具体执行以下操作:
针对所述第一采样点和每次获取的多个第一关键点的第二信息,
从多个第一关键点中确定与所述第一采样点相关联的至少一个第二关键点;
对所述第一采样点的第一信息和所述至少一个第二关键点的第二信息进行注意力计算,获取所述第一关键点融合特征。
在一个实施例中,所述处理器110在执行针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征时,具体执行以下操作:
针对所述第一采样点和每次获取的多个第一关键点的第二信息,
从多个第一关键点中确定与所述第一采样点相关联的至少一个第二关键点;
将所述第一采样点的第一信息和所述至少一个第二关键点的第二信息进行拼接,生成所述第一关键点融合特征。
在一个实施例中,所述处理器110在执行从多个第一关键点中确定与所述第一采样点相关联的至少一个第二关键点时,具体执行以下操作:
计算所述第一采样点的空间坐标与多个第一关键点的空间坐标的距离;
确定所述距离小于或等于预设阈值的至少一个第一关键点为所述至少一个第二关键点。
在一个实施例中,所述处理器110在执行针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征时,是利用多层感知机实现的。
在一个实施例中,所述处理器110在执行针对所述多个采样点中的第一采样点,根据所述第一采样点的空间坐标和所述第一采样点对应的第一关键点融合特征,对所述第一采样点的空间坐标进行偏移操作,获得偏移后的空间坐标时,是利用多层感知机实现的。
在一个实施例中,所述第一关键点为人体关键点,所述第一关键点包括面部关键点和肢体关键点,所述处理器110在多次获取目标对象的多个第一关键点的第二信息之前,还执行以下操作:
确定所述目标对象的类型;
基于所述目标对象的类型选择关键点提取模型,根据所述关键点提取模型确定所述第一关键点。
在本申请实施例中,通过获取第一光线上多个采样点的空间坐标和方位视角信息,并获取目标对象的多个第一关键点,再多次获取第一关键点的第二信息,其中,第二信息包括关键点的空间坐标和关键点的特征,将第一光线上的每个采样点与每次获得的第二信息进行融合,生成第一关键点融合特征,根据第一关键点融合特征对每个采样点的空间坐标进行偏移操作,获得每个采样点偏移后的空间坐标,将采样点偏移后的空间坐标和第一关键点融合特征输入预训练的神经辐射场模型,生成每个采样点对应的静态图像,对每次输入的多个第一关键点的第二信息都生成一张对应的静态图像,再根据多张静态图像合成视频。通过依次输入目标对象多个第一关键点的第二信息,使得在根据神经辐射场生成第一光线对应的静态图像时,每一张静态图像实际是与每次输入的不同关键点的第二信息相关联的,则通过融入变化的关键点第二信息来模拟动态的场景中每一张图像的变化,再根据生成的图片来合成视频,在与时间解耦的同时实现了3D视频合成,合成方法简单,只需要用户指定视角即可合成目标对象的视频。并且,在生成第一关键点融合特征时,从多个第一关键点中选取与第一采样点相关联的第二关键点,并将第二关键点的第二信息与第一采样点的第一信息进行特征融合,生成第一关键点融合特征,具体的,对第一采样点的第一信息和至少一个第二关键点的第二信息进行注意力计算,或者是直接将第一采样点的第一信息和至少一个第二关键点的第二信息进行拼接,生成第一关键点融合特征,实现关键点信息与光线上采样点信息的交互,使得后续神经辐射场模型能根据输入的关键点信息生成相应的图片。此外,通过确定目标对象的类型,再根据目标对象的类型选择关键点提取模型,根据关键点提取模型来确定第一关键点。通过为不同类型的目标对象匹配对应的关键点提取模型,能够为生成视频中的目标对象提供更加准确的关键点信息,从而提高视频合成的准确性,同时通过预训练的关键点提取模型提取关键点,而不需要临时进行人工关键点标注,能够提高关键点信息采集效率,从而使得视频生成的过程更加快速。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,所述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)或随机存储记忆体(Random Access Memory,RAM)等。
以上所揭露的仅为本申请较佳实施例而已,当然不能以此来限定本申请之权利范围,因此依本申请权利要求所作的等同变化,仍属本申请所涵盖的范围。

Claims (10)

  1. 一种视频生成方法,其特征在于,包括:
    获取第一光线上的多个采样点的第一信息,所述第一信息包括空间坐标和方位视角;
    多次获取目标对象的多个第一关键点的第二信息,所述第二信息包括关键点的空间坐标和关键点的特征;
    针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征,其中,所述多个第一关键点融合特征中的每个第一关键点融合特征对应于所述多个采样点中的一个采样点;
    针对所述多个采样点中的第一采样点,根据所述第一采样点的空间坐标和所述第一采样点对应的第一关键点融合特征,对所述第一采样点的空间坐标进行偏移操作,获得偏移后的空间坐标,其中,所述第一采样点为所述多个采样点中的任意一个采样点;
    针对每次获取的多个第一关键点的第二信息,将所述多个采样点的偏移后的空间坐标和所述多个采样点对应的所述多个第一关键点融合特征,配对地输入预训练的神经辐射场NeRF模型,从而获得所述目标对象的多个静态图像,其中,所述多个静态图像的数量和多次获取第一关键点的第二信息的次数相等;
    将所述多个静态图像合成为所述目标对象的视频。
  2. 如权利要求1所述的方法,其特征在于,所述针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征,包括:
    针对所述第一采样点和每次获取的多个第一关键点的第二信息,
    从多个第一关键点中确定与所述第一采样点相关联的至少一个第二关键点;
    对所述第一采样点的第一信息和所述至少一个第二关键点的第二信息进行注意力计算,获取所述第一关键点融合特征。
  3. 如权利要求1所述的方法,其特征在于,所述针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征,包括:
    针对所述第一采样点和每次获取的多个第一关键点的第二信息,
    从多个第一关键点中确定与所述第一采样点相关联的至少一个第二关键点;
    将所述第一采样点的第一信息和所述至少一个第二关键点的第二信息进行拼接,生成所述第一关键点融合特征。
  4. 如权利要求2所述的方法,其特征在于,所述从多个第一关键点中确定与所述第一采样点相关联的至少一个第二关键点,包括:
    计算所述第一采样点的空间坐标与多个第一关键点的空间坐标的距离;
    确定所述距离小于或等于预设阈值的至少一个第一关键点为所述至少一个第二关键点。
  5. 如权利要求2所述的方法,其特征在于,所述针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征,是利用多层感知机实现的。
  6. 如权利要求1所述的方法,其特征在于,所述针对所述多个采样点中的第一采样点,根据所述第一采样点的空间坐标和所述第一采样点对应的第一关键点融合特征,对所述第一采样点的空间坐标进行偏移操作,获得偏移后的空间坐标,是利用多层感知机实现的。
  7. 如权利要求1所述的方法,其特征在于,所述第一关键点为人体关键点,所述第一关键点包括面部关键点和肢体关键点,所述在多次获取目标对象的多个第一关键点的第二信息之前,还包括:
    确定所述目标对象的类型;
    基于所述目标对象的类型选择关键点提取模型,根据所述关键点提取模型确定所述第一关键点。
  8. 一种视频生成装置,其特征在于,包括:
    光线获取模块,用于获取第一光线上的多个采样点的第一信息,所述第一信息包括空间坐标和方位视角;
    关键点获取模块,用于多次获取目标对象的多个第一关键点的第二信息,所述第二信息包括关键点的空间坐标和关键点的特征;
    关键点编码模块,用于针对多个采样点和每次获取的多个第一关键点的第二信息,根据所述第一信息和所述第二信息,分别生成多个第一关键点融合特征,其中,所述多个第一关键点融合特征中的每个第一关键点融合特征对应于所述多个采样点中的一个采样点;
    光线弯曲模块,用于针对所述多个采样点中的第一采样点,根据所述第一采样点的空间坐标和所述第一采样点对应的第一关键点融合特征,对所述第一采样点的空间坐标进行偏移操作,获得偏移后的空间坐标,其中,所述第一采样点为所述多个采样点中的任意一个采样点;
    神经辐射场模块,用于针对每次获取的多个第一关键点的第二信息,将所述多个采样点的偏移后的空间坐标和所述多个采样点对应的所述多个第一关键点融合特征,配对地输入预训练的神经辐射场NeRF模型,从而获得所述目标对象的多个静态图像,其中,所述多个静态图像的数量和多次获取第一关键点的第二信息的次数相等;
    视频生成模块,用于将所述多个静态图像合成为所述目标对象的视频。
  9. 一种电子设备,其特征在于,包括:存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述计算机程序被所述处理器执行时实现如权利要求1至7中任一项所述方法的步骤。
  10. 一种计算机可读存储介质,其特征在于,所述计算机存储介质存储有多条指令,所述指令适于由处理器加载并执行如权利要求1至7中任一项所述方法的步骤。
PCT/CN2022/143214 2022-10-09 2022-12-29 视频生成方法、装置、设备与计算机可读存储介质 WO2024077791A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211231054.XA CN115761565B (zh) 2022-10-09 2022-10-09 视频生成方法、装置、设备与计算机可读存储介质
CN202211231054.X 2022-10-09

Publications (1)

Publication Number Publication Date
WO2024077791A1 true WO2024077791A1 (zh) 2024-04-18

Family

ID=85350905

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/143214 WO2024077791A1 (zh) 2022-10-09 2022-12-29 视频生成方法、装置、设备与计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN115761565B (zh)
WO (1) WO2024077791A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118250529A (zh) * 2024-05-27 2024-06-25 暗物智能科技(广州)有限公司 一种语音驱动的2d数字人视频生成方法及可读存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109951654A (zh) * 2019-03-06 2019-06-28 腾讯科技(深圳)有限公司 一种视频合成的方法、模型训练的方法以及相关装置
CN113538659A (zh) * 2021-07-05 2021-10-22 广州虎牙科技有限公司 一种图像生成方法、装置、存储介质及设备
CN114972632A (zh) * 2022-04-21 2022-08-30 阿里巴巴达摩院(杭州)科技有限公司 基于神经辐射场的图像处理方法及装置
WO2022182441A1 (en) * 2021-02-26 2022-09-01 Meta Platforms Technologies, Llc Latency-resilient cloud rendering
CN115082639A (zh) * 2022-06-15 2022-09-20 北京百度网讯科技有限公司 图像生成方法、装置、电子设备和存储介质
US20220301252A1 (en) * 2021-03-17 2022-09-22 Adobe Inc. View synthesis of a dynamic scene

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018119808A1 (zh) * 2016-12-29 2018-07-05 浙江工商大学 一种基于3d卷积神经网络的立体视频生成方法
CN112019828B (zh) * 2020-08-14 2022-07-19 上海网达软件股份有限公司 一种视频的2d到3d的转换方法
CN112489225A (zh) * 2020-11-26 2021-03-12 北京邮电大学 视频与三维场景融合的方法、装置、电子设备和存储介质
CN114758081A (zh) * 2022-06-15 2022-07-15 之江实验室 基于神经辐射场的行人重识别三维数据集构建方法和装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109951654A (zh) * 2019-03-06 2019-06-28 腾讯科技(深圳)有限公司 一种视频合成的方法、模型训练的方法以及相关装置
WO2022182441A1 (en) * 2021-02-26 2022-09-01 Meta Platforms Technologies, Llc Latency-resilient cloud rendering
US20220301252A1 (en) * 2021-03-17 2022-09-22 Adobe Inc. View synthesis of a dynamic scene
CN113538659A (zh) * 2021-07-05 2021-10-22 广州虎牙科技有限公司 一种图像生成方法、装置、存储介质及设备
CN114972632A (zh) * 2022-04-21 2022-08-30 阿里巴巴达摩院(杭州)科技有限公司 基于神经辐射场的图像处理方法及装置
CN115082639A (zh) * 2022-06-15 2022-09-20 北京百度网讯科技有限公司 图像生成方法、装置、电子设备和存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MILDENHALL BEN; SRINIVASAN PRATUL P.; TANCIK MATTHEW; BARRON JONATHAN T.; RAMAMOORTHI RAVI; NG REN: "NeRF", COMMUNICATIONS OF THE ACM, ASSOCIATION FOR COMPUTING MACHINERY, INC, UNITED STATES, vol. 65, no. 1, 17 December 2021 (2021-12-17), United States , pages 99 - 106, XP058662055, ISSN: 0001-0782, DOI: 10.1145/3503250 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118250529A (zh) * 2024-05-27 2024-06-25 暗物智能科技(广州)有限公司 一种语音驱动的2d数字人视频生成方法及可读存储介质

Also Published As

Publication number Publication date
CN115761565A (zh) 2023-03-07
CN115761565B (zh) 2023-07-21

Similar Documents

Publication Publication Date Title
CN110515452B (zh) 图像处理方法、装置、存储介质和计算机设备
CN113810587B (zh) 一种图像处理方法及装置
CN110503703B (zh) 用于生成图像的方法和装置
US11049310B2 (en) Photorealistic real-time portrait animation
KR102697772B1 (ko) 메시징 시스템 내의 3d 데이터를 포함하는 증강 현실 콘텐츠 생성기들
CN109377544A (zh) 一种人脸三维图像生成方法、装置和可读介质
US12106554B2 (en) Image sequence processing using neural networks
CN108961369A (zh) 生成3d动画的方法和装置
KR20220051376A (ko) 메시징 시스템에서의 3d 데이터 생성
CN109754464B (zh) 用于生成信息的方法和装置
US11157773B2 (en) Image editing by a generative adversarial network using keypoints or segmentation masks constraints
WO2020211573A1 (zh) 用于处理图像的方法和装置
US11748913B2 (en) Modeling objects from monocular camera outputs
KR20230079264A (ko) 증강 현실 콘텐츠 생성기들에 대한 수집 파이프라인
CN115937033A (zh) 图像生成方法、装置及电子设备
KR20230162107A (ko) 증강 현실 콘텐츠에서의 머리 회전들에 대한 얼굴 합성
WO2024077791A1 (zh) 视频生成方法、装置、设备与计算机可读存储介质
CN116958344A (zh) 虚拟形象的动画生成方法、装置、计算机设备及存储介质
CN117252791A (zh) 图像处理方法、装置、电子设备及存储介质
CN114358112A (zh) 视频融合方法、计算机程序产品、客户端及存储介质
CN115731326A (zh) 虚拟角色生成方法及装置、计算机可读介质和电子设备
CN109816791B (zh) 用于生成信息的方法和装置
CN117115331A (zh) 一种虚拟形象的合成方法、合成装置、设备及介质
Tous Pictonaut: movie cartoonization using 3D human pose estimation and GANs
CN115714888B (zh) 视频生成方法、装置、设备与计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22961968

Country of ref document: EP

Kind code of ref document: A1