WO2024077792A1 - Video generation method, apparatus, device and computer-readable storage medium - Google Patents

Video generation method, apparatus, device and computer-readable storage medium

Info

Publication number
WO2024077792A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
key
key point
light
points
Prior art date
Application number
PCT/CN2022/143239
Other languages
English (en)
French (fr)
Inventor
周彧聪
王志浩
杨斌
Original Assignee
名之梦(上海)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 名之梦(上海)科技有限公司 filed Critical 名之梦(上海)科技有限公司
Publication of WO2024077792A1 publication Critical patent/WO2024077792A1/zh

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof

Definitions

  • the present invention relates to the field of artificial intelligence technology, and in particular to a video generation method, device, equipment and computer-readable storage medium.
  • NeRF: Neural Radiance Fields
  • MLP: Multi-Layer Perceptron
  • the neural network can render pictures from any angle.
  • the neural network used by NeRF (an 11-layer MLP) is itself very small, but rendering a pixel requires sampling many points on a ray (e.g., hundreds), which results in a very large amount of computation to render a single image.
  • NeRF can only reconstruct static 3D images, while the reconstruction of dynamic 3D video could conceivably be realized by directly adding a time parameter, similar to existing NeRF implementations.
  • the main purpose of the present invention is to provide a video generation method, device, equipment and computer-readable storage medium, aiming to solve the technical problems of the existing video generation method with slow rendering speed and dependence on time parameters.
  • the technical solution is as follows:
  • an embodiment of the present application provides a video generation method, including: obtaining first information characterizing a first light ray; obtaining, multiple times, second information of multiple first key points of a target object, the second information including the spatial coordinates of the key points and the features of the key points; generating, from the first information and the second information obtained each time, multiple first key point fusion features corresponding to the first light ray; inputting the first information, paired with each of the multiple first key point fusion features, into a pre-trained neural light field NeLF model multiple times, thereby obtaining multiple static images of the target object, wherein the number of static images is equal to the number of times the second information of the first key points is obtained, and each input to the NeLF model is formed by pairing the first information with one first key point fusion feature; and synthesizing the multiple static images into a video.
  • an embodiment of the present application provides a video generating device, including: a light information acquisition module, used to acquire first information characterizing a first light ray; a key point information acquisition module, used to acquire, multiple times, second information of multiple first key points of a target object, the second information including the spatial coordinates of the key points and the features of the key points; a key point encoding module, used to generate, from the first information and the second information acquired each time, multiple first key point fusion features corresponding to the first light ray; an image acquisition module, used to input the first information, paired with each of the multiple first key point fusion features, into a pre-trained neural light field NeLF model multiple times, thereby obtaining multiple static images of the target object, wherein the number of static images is equal to the number of times the second information of the first key points is acquired, and each input to the NeLF model is formed by pairing the first information with one first key point fusion feature; and a video synthesis module, used to synthesize the multiple static images into a video.
  • an embodiment of the present application provides an electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the computer program is executed by the processor, the steps of any one of the methods in the first aspect are implemented.
  • an embodiment of the present application provides a computer storage medium, which stores a plurality of instructions, and the instructions are suitable for being loaded by a processor and executing the steps of any method in the first aspect described above.
  • by inputting the second information of the multiple first key points of the target object in sequence, each static image generated from the neural light field for the first light ray is also associated with the key point information input at that time.
  • although the static images all correspond to the first light ray, the key points differ each time, so the static image generated each time may be different; the key points thereby drive the static image to "move", and the video is then synthesized from the generated static images.
  • This not only realizes 3D video synthesis, but also enables the generation of the video to be decoupled from time information or time parameters.
  • the fast speed of the neural light field can also increase the speed of video generation.
  • FIG1 is a schematic diagram of an example of a video generation method provided in an embodiment of the present application.
  • FIG2 is a schematic diagram of a flow chart of a video generation method provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of key points of a video generation method provided in an embodiment of the present application.
  • FIG4 is a schematic diagram comparing a neural radiance field model and a neural light field model of a video generation method provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of the structure of a video generating device provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the structure of video generation equipment provided in an embodiment of the present application.
  • the video generation device can be a terminal device such as a mobile phone, a computer, a tablet computer, a smart watch or an in-vehicle device, or a module in the terminal device for implementing the video generation method.
  • the video generation device can obtain the first information representing the first light, and obtain the second information of multiple first key points of the target object multiple times.
  • the second information includes the spatial coordinates of the key point and the features of the key point.
  • the video generation device can generate multiple first key point fusion features corresponding to the first light according to the first information and the second information obtained each time, and then pair the first information with each of the multiple first key point fusion features and input them into the pre-trained neural light field NeLF model multiple times, so as to obtain multiple static images of the target object, wherein the number of static images is equal to the number of times the second information of the first key points is obtained, and each input to the NeLF model is the first information paired with one first key point fusion feature.
  • the video generation device can also synthesize multiple static images into a video.
  • FIG. 1 is a schematic diagram of an example of a video generation method provided by the embodiment of the present application.
  • the figure shows the process of synthesizing a 3D video of a target object.
  • a ray of light can be obtained according to the desired viewing angle of the target object.
  • this ray, or viewing angle, does not necessarily exist in reality; it can be a viewing angle that was not present when the NeLF model was trained, that is, a completely new viewing angle.
  • the NeLF model is driven according to the key point information of the target object to obtain multiple 3D static images corresponding to the ray of light, and then the 3D video is synthesized according to the multiple static images.
  • the desired viewing angle in FIG. 1 is the desired viewing angle of the target object.
  • Figure 2 is a schematic flow chart of a video generation method according to an embodiment of the present application. As shown in Figure 2, the method according to the embodiment of the present application may include the following steps S10-S50.
  • the first information and the plurality of first key point fusion features are paired and input into a pre-trained neural light field NeLF model multiple times, thereby obtaining multiple static images of the target object.
  • the number of static images is equal to the number of times the second information of the first key points is obtained, and each input to the NeLF model is the first information paired with one first key point fusion feature.
  • a video generation method is proposed based on NeLF, which can synthesize three-dimensional video without requiring time parameters and has a faster rendering speed.
  • NeLF and NeRF have similar functions and can both be used to render 3D target objects.
  • the input of NeRF is a point in the scene (for example, the input represents the spatial coordinates of the point and the direction of the line of sight of the point), and the corresponding output is the color RGB and opacity of the point, so that the 3D target object can be rendered according to the color and opacity of each point.
  • the input of NeLF is a ray, and the corresponding output is directly the pixel value on the image corresponding to the ray.
  • NeLF has an obvious advantage, that is, fast speed.
  • to obtain the RGB of one pixel in the image, NeLF only needs to run the neural network computation once, whereas NeRF needs to run it hundreds of times.
  • the first ray can be represented by a vector of a virtual ray obtained according to the video viewing angle, or by multiple sampling points, for example by 16 sampling points together with the positional relationship between adjacent sampling points. For example, after the azimuth viewing angle is determined at the starting point of the ray, a ray is obtained; multiple sampling points are then uniformly sampled on the ray and connected into a vector to characterize the first ray.
  • the relative positions between adjacent sampling points in the multiple sampling points are obtained, and the multiple sampling points are not combined into a vector, but the information of the sampling points and the relative positional relationship information between the sampling points are directly used to characterize the first ray.
  • first information representing the first light is obtained.
  • the first information is information representing multiple sampling points of the first light, or the first information is information representing a vector of the first light.
  • the first information is spatial coordinates and viewing angles of 20 sampling points.
  • the first information is vector information, and the vector can reflect the position and viewing angle of the first light in space, such as the vector is formed by connecting at least two sampling points on the first light.
  • the first information in this solution may vary according to the input parameters of the NeLF model actually used.
  • the target object is the object in the desired generated video, which can be an object, a person, a building, etc.
  • for example, to generate a video of a person speaking, the target object can be the person's head, the upper body of the person, the entire human body, etc.
  • the facial expression of the person will change when speaking, such as the lips will open or close, the position of the eyebrows will change, the cheek contour will change, etc.
  • a plurality of first key points can be set on the face of the person, and the plurality of first key points of the face of the person can be obtained, and the specific changes in the spatial coordinates of these key points when the person speaks can be tracked, and the second information of the plurality of first key points can be obtained.
  • hundreds of first key points can be set on the face of a person, such as more than 400 first key points.
  • the first key points of the target object change with the target object, such as facial key points, human key points, car key points, etc.
  • Figure 3 is a schematic diagram of key points of a video generation method provided by an embodiment of the present application, and the black points in the figure are the key points of the character's head. It can be understood that the number of key points can be determined according to the target object. Generally speaking, the more the number of first key points, the higher the accuracy of the simulated action of the generated video.
  • the features of the key points will not change, but the spatial coordinates of the key points will change.
  • the features of the key points in the embodiments of the present application can also be understood as the semantic features of the key points, which give the key points corresponding semantics.
  • the semantics of the key points of the corners of the mouth are the corners of the mouth, so that even if the key points change their positions in space with the expression, they still correspond to the same semantics or features.
  • step S30 may associate or bind the first light and the first key point, so that the key point can be used to drive NeLF.
  • the first information only needs to be acquired once, while the second information is acquired multiple times. For example, the second information is acquired continuously, and each time the second information is acquired, a first key point fusion feature is generated, thereby continuously obtaining the first key point fusion feature.
  • At least one second key point associated with the first light is determined from the multiple first key points, and attention calculation is performed on the first information and the second information of the at least one second key point to obtain the first key point fusion feature.
  • the attention calculation in this embodiment can adopt an existing calculation method, which is not limited here.
  • At least one second key point can be determined from the multiple first key points according to the positional relationship between each sampling point in the multiple sampling points and the multiple first key points. For example, assuming that there are 12 sampling points and 456 first key points, the distance between each sampling point in the 12 sampling points and the 456 first key points is calculated respectively, and the first key point whose distance is less than or equal to the preset threshold is determined to be the second key point. For another example, in addition to the distance, the direction angle between the sampling point and the first key point can be further considered. For example, a reference plane is selected, and the angle between the sampling point, the first key point and the reference plane is calculated. The first key point whose angle is greater than the preset angle is determined to be not the second key point.
  • At least one second key point can be determined from the multiple first key points according to the positional relationship between the vector and the multiple first key points. For example, the projection distance from each first key point to the vector is calculated or the vertical distance from each first key point to the vector is determined, and the first key point whose projection distance or vertical distance is less than or equal to a preset threshold is determined as the second key point.
  • the direction angle between the point on the vector and the first key point can be further considered. For example, a reference plane is selected and the point on the vector closest to the first key point is determined, and the angle between the point, the first key point and the reference plane is calculated. The first key point whose angle is greater than the preset angle is determined to be not the second key point.
  • the second key point can be obtained from a corresponding relationship mapping table.
  • determining the second key point can reduce the number of key points associated with the first light, thereby reducing the amount of calculation, saving computing resources and speeding up the processing speed.
  • the key points near the eyes drive the movement of the eyes
  • the key points near the mouth drive the movement of the mouth
  • the key points near the eyes do not drive the movement of the mouth. Therefore, it is necessary to select the second key point associated with the first sampling point from the first key point, so that the key point driving is faster.
  • the first information and the plurality of first key point fusion features are paired and input into a pre-trained neural light field NeLF model multiple times, thereby obtaining multiple static images of the target object.
  • the number of static images is equal to the number of times the second information of the first key points is obtained, and each input to the NeLF model is the first information paired with one first key point fusion feature.
  • the first information and the first key point fusion features are inputs of the NeLF model, and the trained NeLF model can render different three-dimensional images according to the first information and different first key point fusion features.
  • the NeLF model of the neural light field in the present application can adopt the existing NeLF model, but it needs to be trained in advance. For example, when training the existing NeLF model, only the first information and the corresponding image need to be labeled, so that the input of the trained NeLF model is the first information and the output is a three-dimensional image. When training the NeLF model in this embodiment, it is necessary to label the first information and the first key point fusion feature and the corresponding image, so that the input of the trained NeLF model is both the first information and the first key point fusion feature.
  • Figure 4 is a schematic diagram comparing the neural radiance field model and the neural light field model of a video generation method provided in this embodiment.
  • Figure 4 shows that the amount of data used to train the neural radiance field is much larger than that of the neural light field.
  • the neural radiance field needs to be trained on N sampling points per ray, while the neural light field uses, for example, a vector to represent a ray and thus trains per ray; the amount of training data is therefore one Nth of that of the neural radiance field. Due to the significant reduction in the amount of training data and the difference in network structure, the training speed is significantly improved.
  • the generated static image is used as an image of a frame in the video, and multiple images are synthesized into a video.
  • assuming the generated video is a video of a person speaking, the data collected during pre-training is video of the person speaking
  • frame sampling is performed, for example at 60 FPS
  • the spatial coordinates of the key points in each frame image are obtained, the corresponding second information is generated, and the NeLF model is then trained.
  • executing the above steps S10-S40 can continuously obtain multiple static images, so that real-time dynamic video can be obtained using multiple static images.
  • the second information of at least one key point is input, which can be obtained using the existing key point extraction method.
  • the video generation device provided in the embodiment of the present application will be described in detail below in conjunction with FIG. 5. It should be noted that the video generation device in FIG. 5 is used to execute the method of the embodiment shown in FIG. 2 to FIG. 4 of the present application. For the convenience of description, only the part related to the embodiment of the present application is shown. For the specific technical details not disclosed, please refer to the embodiment shown in FIG. 2 to FIG. 4 of the present application.
  • FIG5 shows a schematic diagram of the structure of a video generation device provided by an exemplary embodiment of the present application.
  • the video generation device can be implemented as all or part of the device through software, hardware, or a combination of both.
  • the device 1 includes a light information acquisition module 10, a key point information acquisition module 20, a key point encoding module 30, an image acquisition module 40, and a video synthesis module 50.
  • the light information acquisition module 10 is used to acquire first information representing the first light.
  • the key point information acquisition module 20 is used to obtain the second information of multiple first key points of the target object multiple times, and the second information includes the spatial coordinates of the key points and the features of the key points.
  • the key point encoding module 30 is used to generate a plurality of first key point fusion features corresponding to the first light according to the first information and the second information obtained multiple times.
  • the image acquisition module 40 is used to input the first information and the multiple first key point fusion features into the pre-trained neural light field NeLF model in pairs multiple times, so as to obtain multiple static images of the target object, wherein the number of static images is equal to the number of times the second information of the first key points is obtained, and each input to the NeLF model is the first information paired with one first key point fusion feature.
  • the video synthesis module 50 is used to synthesize multiple static images into a video.
  • the key point encoding module 30 determines at least one second key point associated with the first light from multiple first key points for the first information and the second information obtained each time; performs attention calculation on the first information and the second information of at least one second key point to obtain the first key point fusion feature.
  • the first information is information representing a plurality of sampling points of the first light ray; or, the first information is information representing a vector of the first light ray.
  • the key point encoding module 30 is further configured to determine at least one second key point associated with the plurality of sampling points from the plurality of first key points according to a positional relationship between the plurality of sampling points and the plurality of first key points.
  • the key point encoding module 30 is further used to determine at least one second key point associated with the multiple sampling points from the multiple first key points based on the positional relationship between the vector and the multiple first key points.
  • the key point encoding module 30 is further used to calculate the distance between the spatial coordinates of each sampling point in the multiple sampling points and the spatial coordinates of the multiple first key points; and determine at least one first key point whose distance is less than or equal to a preset threshold as at least one second key point.
  • when the video generating device provided in the above embodiment executes the video generating method, the division into the above functional modules is only an example; in practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the video generating device provided in the above embodiment and the video generating method embodiment belong to the same concept, and the implementation process thereof is detailed in the method embodiment, which will not be repeated here.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the video generation method of the embodiments shown in Figures 2 to 4 is implemented.
  • the specific execution process can be found in the specific description of the embodiments shown in Figures 2 to 4, which will not be repeated here.
  • FIG. 6 shows a schematic diagram of the structure of a video generation device provided by an exemplary embodiment of the present application.
  • the video generation device in the present application may include one or more of the following components: a processor 110, a memory 120, an input device 130, an output device 140, and a bus 150.
  • the processor 110, the memory 120, the input device 130, and the output device 140 may be connected via the bus 150.
  • the processor 110 may include one or more processing cores.
  • the processor 110 uses various interfaces and lines to connect various parts in the entire video generation device, and executes various functions and processes data of the terminal 100 by running or executing instructions, programs, code sets or instruction sets stored in the memory 120, and calling data stored in the memory 120.
  • the processor 110 can be implemented in at least one hardware form of digital signal processing (DSP), field-programmable gate array (FPGA), and programmable logic array (PLA).
  • the processor 110 can integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), and a modem.
  • the CPU mainly processes the operating system, user pages, and applications;
  • the GPU is responsible for rendering and drawing display content; and the modem is used to process wireless communications. It can be understood that the above-mentioned modem may not be integrated into the processor 110, but may be implemented separately through a communication chip
  • the memory 120 may include a random access memory (RAM) or a read-only memory (ROM).
  • the memory 120 includes a non-transitory computer-readable medium (Non-Transitory Computer-Readable Storage Medium).
  • the memory 120 may be used to store instructions, programs, codes, code sets or instruction sets.
  • the memory 120 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, an image playback function, etc.), instructions for implementing the above-mentioned various method embodiments, etc.
  • the operating system may be an Android system, including a system deeply developed based on the Android system, an IOS system developed by Apple, including a system deeply developed based on the IOS system, or other systems.
  • the memory 120 can be divided into an operating system space and a user space.
  • the operating system runs in the operating system space, and native and third-party applications run in the user space.
  • the operating system allocates corresponding system resources to different third-party applications.
  • the requirements for system resources in different application scenarios in the same third-party application are also different. For example, in the local resource loading scenario, the third-party application has higher requirements for disk reading speed; in the animation rendering scenario, the third-party application has higher requirements for GPU performance.
  • the operating system and third-party applications are independent of each other, and the operating system often cannot perceive the current application scenario of the third-party application in a timely manner, resulting in the operating system being unable to perform targeted system resource adaptation according to the specific application scenario of the third-party application.
  • the input device 130 is used to receive input commands or data, and includes but is not limited to a keyboard, a mouse, a camera, a microphone, or a touch device.
  • the output device 140 is used to output commands or data, and includes but is not limited to a display device and a speaker. In one example, the input device 130 and the output device 140 can be combined, and the input device 130 and the output device 140 are touch screen displays.
  • the touch display screen can be designed as a full screen, a curved screen or a special-shaped screen.
  • the touch display screen can also be designed as a combination of a full screen and a curved screen, or a combination of a special-shaped screen and a curved screen, which is not limited in the embodiments of the present application.
  • the structure of the video generating device shown in the above drawings does not constitute a limitation on the video generating device, and the video generating device may include more or fewer components than shown in the figure, or combine certain components, or arrange the components differently.
  • the video generating device also includes a radio frequency circuit, an input unit, a sensor, an audio circuit, a wireless fidelity (Wireless Fidelity, Wi-Fi) module, a power supply, a Bluetooth module and other components, which will not be repeated here.
  • the processor 110 may be used to call a computer program stored in the memory 120 and implement the method described in the above method embodiment.
  • the storage medium can be a disk, an optical disk, a read-only memory (ROM) or a random access memory (RAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Studio Devices (AREA)
  • Microscopes, Condensers (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present invention discloses a video generation method, apparatus, device and computer-readable storage medium. The method includes: obtaining first information characterizing a first ray; obtaining, multiple times, second information of multiple first key points of a target object; generating, from the first information and the second information obtained each time, multiple first key point fusion features corresponding to the first ray; inputting the first information, paired with each of the multiple first key point fusion features, into a pre-trained neural light field (NeLF) model multiple times, thereby obtaining multiple static images of the target object; and synthesizing the multiple static images into a video.

Description

Video generation method, apparatus, device and computer-readable storage medium
Cross-reference to related applications
This application claims priority to Chinese application No. 2022112261806 filed on October 9, 2022, the entire contents of which are hereby incorporated herein by reference for all purposes.
Technical Field
The present invention relates to the field of artificial intelligence, and in particular to a video generation method, apparatus, device and computer-readable storage medium.
Background
The neural light field proposed in recent years is a powerful tool for solving the novel view synthesis problem. The conventional neural radiance field (NeRF, Neural Radiance Fields) uses a multi-layer perceptron (MLP, Multi-Layer Perceptron) neural network to implicitly learn a static three-dimensional (3D) scene. For each static 3D scene, a large number of images with known camera parameters must be provided to train the neural network. The trained neural network can then render images of the scene from any angle.
The neural network used by NeRF (an 11-layer MLP) is itself very small, but rendering a single pixel requires sampling many points (for example, hundreds) along a ray, so the amount of computation needed to render one image is very large. In addition, NeRF can only reconstruct static 3D images; dynamic 3D video reconstruction could conceivably be realized by directly adding a time parameter, in a manner similar to existing NeRF implementations.
However, video generation based on NeRF requires a large amount of rendering time and, moreover, can only be realized with a time parameter. How to increase the generation speed of three-dimensional video, and further decouple it from time parameters, is therefore an urgent problem to be solved.
Summary of the Invention
The main purpose of the present invention is to provide a video generation method, apparatus, device and computer-readable storage medium, aiming to solve the technical problems that existing video generation approaches render slowly and depend on time parameters. The technical solution is as follows:
In a first aspect, an embodiment of the present application provides a video generation method, including: obtaining first information characterizing a first ray; obtaining, multiple times, second information of multiple first key points of a target object, the second information including the spatial coordinates of the key points and the features of the key points; generating, from the first information and the second information obtained each time, multiple first key point fusion features corresponding to the first ray; inputting the first information, paired with each of the multiple first key point fusion features, into a pre-trained neural light field NeLF model multiple times, thereby obtaining multiple static images of the target object, wherein the number of static images equals the number of times the second information of the first key points is obtained, and each input to the NeLF model is formed by pairing the first information with one first key point fusion feature; and synthesizing the multiple static images into a video.
In a second aspect, an embodiment of the present application provides a video generation apparatus, including: a ray information acquisition module, configured to obtain first information characterizing a first ray; a key point information acquisition module, configured to obtain, multiple times, second information of multiple first key points of a target object, the second information including the spatial coordinates of the key points and the features of the key points; a key point encoding module, configured to generate, from the first information and the second information obtained each time, multiple first key point fusion features corresponding to the first ray; an image acquisition module, configured to input the first information, paired with each of the multiple first key point fusion features, into a pre-trained neural light field NeLF model multiple times, thereby obtaining multiple static images of the target object, wherein the number of static images equals the number of times the second information of the first key points is obtained, and each input to the NeLF model is formed by pairing the first information with one first key point fusion feature; and a video synthesis module, configured to synthesize the multiple static images into a video.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the computer program is executed by the processor, the steps of any one of the methods of the first aspect are implemented.
In a fourth aspect, an embodiment of the present application provides a computer storage medium storing a plurality of instructions, the instructions being suitable for being loaded by a processor to execute the steps of any one of the methods of the first aspect.
In the embodiments of the present invention, the second information of the multiple first key points of the target object is input in sequence, so that when the static images corresponding to the first ray are generated from the neural light field, each static image is also associated with the key point information input at that time. Thus, although the static images all correspond to the first ray, the key points differ each time, so the static image generated each time may be different; the key points thereby drive the static image to "move", and a video is then synthesized from the generated static images. This not only realizes 3D video synthesis but also decouples video generation from time information or time parameters; in addition, the fast speed of the neural light field further increases the speed of video generation.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an example of a video generation method provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a video generation method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of key points of a video generation method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram comparing the neural radiance field model and the neural light field model of a video generation method provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of the structure of a video generation apparatus provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of the structure of video generation equipment provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The video generation apparatus may be a terminal device such as a mobile phone, a computer, a tablet computer, a smart watch or an in-vehicle device, or a module in a terminal device for implementing the video generation method. The video generation apparatus can obtain first information characterizing a first ray and obtain, multiple times, second information of multiple first key points of a target object, the second information including the spatial coordinates of the key points and the features of the key points. The video generation apparatus can generate, from the first information and the second information obtained each time, multiple first key point fusion features corresponding to the first ray, and then input the first information, paired with each of the multiple first key point fusion features, into a pre-trained neural light field NeLF model multiple times, thereby obtaining multiple static images of the target object, wherein the number of static images equals the number of times the second information of the first key points is obtained, and each input to the NeLF model is formed by pairing the first information with one first key point fusion feature. The video generation apparatus can also synthesize the multiple static images into a video.
Referring to FIG. 1, which is a schematic diagram of an example of a video generation method provided in an embodiment of the present application, the figure shows the process of synthesizing a 3D video of a target object. In a practical application scenario, a ray can be obtained from the viewing angle at which the target object is to be viewed. This ray, or viewing angle, need not actually exist; it may be a viewing angle that was not present when the NeLF model was trained, that is, a completely new viewing angle. The NeLF model is then driven by the key point information of the target object to obtain multiple 3D static images corresponding to this ray, and the 3D video is synthesized from the multiple static images. The desired viewing angle in FIG. 1 is the viewing angle at which the target object is to be viewed.
The video generation method provided in the present application is described in detail below with reference to specific embodiments.
Referring to FIG. 2, which is a schematic flow chart of a video generation method provided in an embodiment of the present application, as shown in FIG. 2, the method of the embodiment of the present application may include the following steps S10-S50.
S10: Obtain first information characterizing a first ray.
S20: Obtain, multiple times, second information of multiple first key points of a target object, the second information including the spatial coordinates of the key points and the features of the key points.
S30: Generate, from the first information and the second information obtained each time, multiple first key point fusion features corresponding to the first ray.
S40: Input the first information, paired with each of the multiple first key point fusion features, into a pre-trained neural light field NeLF model multiple times, thereby obtaining multiple static images of the target object, wherein the number of static images equals the number of times the second information of the first key points is obtained, and each input to the NeLF model is formed by pairing the first information with one first key point fusion feature.
S50: Synthesize the multiple static images into a video.
In this embodiment, a video generation method based on NeLF is proposed that can synthesize three-dimensional video without requiring a time parameter and has a faster rendering speed.
In the field of computer vision, NeLF and NeRF serve similar purposes: both can be used to render a 3D target object. The input of NeRF is a point in the scene (for example, parameters characterizing the spatial coordinates of the point and the direction of the line of sight through the point), and the corresponding output is the RGB color and opacity of that point, so that the 3D target object can be rendered from the color and opacity of every point. The input of NeLF is a ray, and the corresponding output is directly the pixel value on the image corresponding to that ray. For 3D image reconstruction, NeLF has an obvious advantage: speed. Obtaining the RGB of one pixel in the image only requires running the neural network once, whereas NeRF requires running the neural network hundreds of times. Moreover, when rendering with NeRF, many points need to be sampled on each ray and the image resolution is high, so rendering is slow. We therefore propose to optimize with NeLF, obtaining the color and other parameters of a ray directly, and exploiting NeLF's fast rendering to achieve efficient dynamic rendering.
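The speed contrast described in the preceding paragraph can be illustrated with a minimal sketch. The assumed PyTorch code below uses toy layer sizes rather than the networks of this application; it shows that NeRF must evaluate its MLP at every sampling point along a ray and composite the results, whereas NeLF maps the whole ray to a pixel color in a single forward pass.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real networks (sizes are illustrative only).
nerf_mlp = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 4))  # point + dir -> RGB + density
nelf_mlp = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 3))  # ray -> RGB

def nerf_render_pixel(origin, direction, n_samples=128, near=0.0, far=1.0):
    """One pixel costs n_samples MLP evaluations plus volume compositing."""
    t = torch.linspace(near, far, n_samples).unsqueeze(-1)           # (N, 1)
    points = origin + t * direction                                   # (N, 3) samples on the ray
    dirs = direction.expand(n_samples, 3)                             # (N, 3) viewing direction
    out = nerf_mlp(torch.cat([points, dirs], dim=-1))                 # (N, 4)
    rgb, sigma = torch.sigmoid(out[:, :3]), torch.relu(out[:, 3])
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)                           # per-sample opacity
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha[:-1] + 1e-10]), dim=0)  # prod of (1 - alpha) before each sample
    weights = alpha * transmittance
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)                   # composited pixel color

def nelf_render_pixel(origin, direction):
    """One pixel costs a single MLP evaluation on the ray itself."""
    ray = torch.cat([origin, direction], dim=-1)                      # (6,)
    return torch.sigmoid(nelf_mlp(ray))                               # pixel RGB
```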
Each step is described in detail below.
S10: Obtain first information characterizing a first ray.
Optionally, the first ray can be represented by a vector of a virtual ray obtained from the video viewing angle, or by multiple sampling points, for example by 16 sampling points together with the positional relationship between adjacent sampling points. For example, after the azimuth viewing angle is determined at the starting point of the ray, a ray is obtained; multiple sampling points are then sampled uniformly on the ray and connected into a vector, which characterizes the first ray. As another example, continuing the previous example, after the multiple sampling points are sampled uniformly on the ray, the relative positions between adjacent sampling points are obtained; instead of combining the sampling points into a vector, the information of the sampling points and the relative positional relationships between them are used directly to characterize the first ray.
In step S10, first information characterizing the first ray is obtained. Optionally, the first information is information characterizing multiple sampling points of the first ray, or the first information is information characterizing a vector of the first ray. For example, the first information is the spatial coordinates and viewing angles of 20 sampling points. As another example, the first information is vector information, the vector reflecting the position and viewing angle of the first ray in space, e.g. a vector formed by connecting at least two sampling points on the first ray.
It can be understood that, since different NeLF models may require different input parameters, the first information in this solution may vary with the input parameters of the NeLF model actually used.
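As an illustration of the two representations of the first information described above, the following sketch (assumed NumPy code; the number of sampling points, the near/far range and the encoding layout are illustrative choices, not taken from the application) builds the first information either as the sampling points on the ray or as a single vector.

```python
import numpy as np

def first_info_as_samples(origin, direction, n_samples=16, near=0.0, far=1.0):
    """Represent the first ray by n_samples points plus the viewing direction."""
    direction = direction / np.linalg.norm(direction)
    t = np.linspace(near, far, n_samples)[:, None]             # (N, 1)
    points = origin[None, :] + t * direction[None, :]           # (N, 3) sampling points
    dirs = np.repeat(direction[None, :], n_samples, axis=0)     # (N, 3) viewing angle per point
    return np.concatenate([points, dirs], axis=1)               # (N, 6)

def first_info_as_vector(origin, direction, n_samples=16, near=0.0, far=1.0):
    """Represent the first ray by one flat vector connecting the sampling points."""
    return first_info_as_samples(origin, direction, n_samples, near, far).reshape(-1)

# Example: a ray defined by a camera position and a viewing direction.
ray_samples = first_info_as_samples(np.array([0.0, 0.0, 2.0]), np.array([0.0, 0.0, -1.0]))
ray_vector = first_info_as_vector(np.array([0.0, 0.0, 2.0]), np.array([0.0, 0.0, -1.0]))
```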
S20: Obtain, multiple times, second information of multiple first key points of the target object, the second information including the spatial coordinates of the key points and the features of the key points.
Specifically, the target object is the object in the video to be generated; it can be an object, a person, a building, and so on. For example, to generate a video of a person speaking, the target object can be the person's head, the person's upper body, the entire human body, etc. If the target object is a person's head, the person's facial expression changes while speaking: the lips open or close, the eyebrows move, the cheek contour changes, and so on. Multiple first key points can be set on the person's face, the multiple first key points of the face can be obtained, and the changes in the spatial coordinates of these key points while the person speaks can be tracked, yielding the second information of the multiple first key points. For example, several hundred first key points can be set on a person's face, such as more than 400 first key points. The first key points of the target object vary with the target object, e.g. facial key points, human body key points, key points of a car, and so on.
Referring to FIG. 3, which is a schematic diagram of key points of a video generation method provided in an embodiment of the present application, the black points in the figure are the key points of the person's head. It can be understood that the number of key points can be determined according to the target object; generally speaking, the more first key points there are, the more accurately the generated video simulates the motion.
It should be noted that the features of the key points do not change; what changes are the spatial coordinates of the key points. The features of the key points in the embodiments of the present application can also be understood as semantic features that give each key point a corresponding meaning; for example, the semantics of a key point at the corner of the mouth is the corner of the mouth, so that even if the key point changes position in space with the expression, it still corresponds to the same semantics or feature.
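One simple way to hold the second information described above is sketched below (assumed Python; the field names are hypothetical): each key point carries a fixed semantic feature while its spatial coordinates are updated frame by frame.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class KeyPoint:
    coords: np.ndarray   # (3,) spatial coordinates, change with expression or pose
    feature: np.ndarray  # (F,) semantic feature (e.g. "corner of the mouth"), stays fixed

def second_info_for_frame(keypoints, frame_coords):
    """Per-frame second information: updated coordinates, unchanged semantic features."""
    return [KeyPoint(coords=c, feature=kp.feature) for kp, c in zip(keypoints, frame_coords)]
```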
S30: Generate, from the first information and the second information obtained each time, multiple first key point fusion features corresponding to the first ray.
Specifically, step S30 can associate or bind the first ray with the first key points, so that the key points can be used to drive NeLF. In step S30, the first information only needs to be obtained once, whereas the second information is obtained multiple times; for example, the second information is obtained continuously, and each time it is obtained a corresponding first key point fusion feature is generated, so that first key point fusion features are obtained continuously.
Optionally, for the first information and the second information obtained each time, at least one second key point associated with the first ray is determined from the multiple first key points, and attention computation is performed on the first information and the second information of the at least one second key point to obtain the first key point fusion feature. The attention computation in this embodiment can use an existing computation method, which is not limited here.
When the first information is information characterizing multiple sampling points of the first ray, at least one second key point can be determined from the multiple first key points according to the positional relationship between each of the multiple sampling points and the multiple first key points. For example, assuming there are 12 sampling points and 456 first key points, the distance between each of the 12 sampling points and the 456 first key points is computed, and first key points whose distance is less than or equal to a preset threshold are determined to be second key points. As another example, besides the distance, the direction angle between the sampling point and the first key point can also be considered: a reference plane is selected, the angle formed by the sampling point, the first key point and the reference plane is computed, and first key points whose angle is greater than a preset angle are determined not to be second key points.
When the first information is information characterizing a vector of the first ray, at least one second key point can be determined from the multiple first key points according to the positional relationship between the vector and the multiple first key points. For example, the projection distance from each first key point to the vector is computed, or the perpendicular distance from each first key point to the vector is determined, and first key points whose projection distance or perpendicular distance is less than or equal to a preset threshold are determined to be second key points. Similarly, besides the distance, the direction angle between a point on the vector and the first key point can also be considered: a reference plane is selected, the point on the vector closest to the first key point is determined, the angle formed by that point, the first key point and the reference plane is computed, and first key points whose angle is greater than a preset angle are determined not to be second key points.
Optionally, a correspondence between key points and sampling points can also be set in advance, so that when at least one second key point related to a first sampling point needs to be determined from the multiple first key points, it can be obtained from the corresponding mapping table.
In this implementation, determining the second key points reduces the number of key points associated with the first ray, thereby reducing the amount of computation, saving computing resources and speeding up processing. For example, the key points near the eyes drive the movement of the eyes, the key points near the mouth drive the movement of the mouth, and the key points near the eyes do not drive the movement of the mouth. It is therefore necessary to select, from the first key points, the second key points associated with the first sampling points, which makes key point driving faster.
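A minimal sketch of the selection and fusion described above is given below (assumed PyTorch code; the distance threshold and feature sizes are arbitrary, and plain scaled dot-product attention stands in for whatever attention computation is actually used).

```python
import torch

def select_second_keypoints(sample_points, kp_coords, threshold=0.2):
    """Keep first key points lying within `threshold` of any sampling point on the ray."""
    # sample_points: (S, 3), kp_coords: (K, 3)
    dists = torch.cdist(sample_points, kp_coords)       # (S, K) pairwise distances
    return dists.min(dim=0).values <= threshold          # (K,) bool mask of second key points

def fuse_keypoints(ray_feat, kp_coords, kp_feats, mask):
    """Scaled dot-product attention of the ray feature over the selected key points."""
    selected = torch.cat([kp_coords[mask], kp_feats[mask]], dim=-1)   # (K', 3 + F) second information
    q = ray_feat.unsqueeze(0)                                          # (1, D) query from first info
    k = v = selected                                                   # keys/values from second info
    # Assumes D == 3 + F so the dot product is defined; add linear projections otherwise.
    attn = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)       # (1, K')
    return (attn @ v).squeeze(0)                                       # (3 + F,) fusion feature
```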
S40: Input the first information, paired with each of the multiple first key point fusion features, into the pre-trained neural light field NeLF model multiple times, thereby obtaining multiple static images of the target object, wherein the number of static images equals the number of times the second information of the first key points is obtained, and each input to the NeLF model is formed by pairing the first information with one first key point fusion feature.
In this embodiment, the first information and the first key point fusion feature are the inputs of the NeLF model; the trained NeLF model can render different three-dimensional images from the first information and different first key point fusion features.
The neural light field NeLF model in the present application can adopt an existing NeLF model, but it needs to be trained in advance. For example, when training an existing NeLF model, only the first information and the corresponding image need to be labeled, so the input of the trained NeLF model is the first information and the output is a three-dimensional image. When training the NeLF model of this embodiment, the first information, the first key point fusion feature and the corresponding image need to be labeled, so the input of the trained NeLF model is both the first information and the first key point fusion feature.
Referring to FIG. 4, which is a schematic diagram comparing the neural radiance field model and the neural light field model of a video generation method provided in this embodiment, FIG. 4 shows that the amount of data used to train the neural radiance field is much larger than that of the neural light field: the neural radiance field must be trained on N sampling points per ray, while the neural light field represents a ray by, for example, a vector and is trained per ray, so its training data volume is one Nth that of the neural radiance field. Because of the greatly reduced training data volume and the different network structure, the training speed is significantly improved.
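To make the input/output relationship concrete, the following sketch (assumed PyTorch code; layer sizes, feature dimensions and the loss are illustrative, not taken from the application) shows a NeLF-style network whose input is the first information concatenated with one first key point fusion feature and whose output is a pixel color, together with one hypothetical training step against labeled pixels.

```python
import torch
import torch.nn as nn

class KeypointConditionedNeLF(nn.Module):
    """NeLF-style network: (encoded first ray, first key point fusion feature) -> pixel RGB."""
    def __init__(self, ray_dim=96, fusion_dim=67, hidden=256):
        # ray_dim=96 could be e.g. 16 sampling points x (3 coords + 3 view dirs); fusion_dim is 3 + F.
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ray_dim + fusion_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, first_info, fusion_feature):
        return self.net(torch.cat([first_info, fusion_feature], dim=-1))

# One hypothetical training step: rays and fusion features are labeled with ground-truth pixel colors.
model = KeypointConditionedNeLF()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
first_info = torch.randn(32, 96)       # batch of encoded first rays
fusion_feature = torch.randn(32, 67)   # matching first key point fusion features
target_rgb = torch.rand(32, 3)         # labeled pixel colors from the training frames
loss = nn.functional.mse_loss(model(first_info, fusion_feature), target_rgb)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```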
S50: Synthesize the multiple static images into a video.
Specifically, each generated static image is used as one frame of the video, and the multiple images are synthesized into a video. It can be understood that, assuming the generated video is a video of a person speaking, the data collected during pre-training is video of the person speaking; the video is frame-sampled, for example at 60 FPS, the spatial coordinates of the key points in each frame are obtained, the corresponding second information is generated, and the NeLF model is then trained. During video synthesis, executing the above steps S10-S40 continuously produces multiple static images, from which a real-time dynamic video can be obtained. It can be understood that, during video synthesis, the second information of at least one key point is input; it can be obtained using an existing key point extraction method.
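The synthesis step itself can be as simple as writing each rendered static image out as one video frame, for example with OpenCV. The sketch below assumes the NeLF output is an H×W×3 RGB array with values in [0, 1]; the 60 FPS value follows the frame-sampling example above.

```python
import cv2
import numpy as np

def synthesize_video(frames, path="output.mp4", fps=60):
    """Write a list of rendered static images ((H, W, 3) RGB float arrays) as a video."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        img = (np.clip(frame, 0.0, 1.0) * 255).astype(np.uint8)
        writer.write(cv2.cvtColor(img, cv2.COLOR_RGB2BGR))  # OpenCV expects BGR order
    writer.release()
```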
The video generation apparatus provided in the embodiments of the present application is described in detail below with reference to FIG. 5. It should be noted that the video generation apparatus in FIG. 5 is used to execute the methods of the embodiments shown in FIG. 2 to FIG. 4 of the present application; for ease of description, only the parts related to the embodiments of the present application are shown. For specific technical details not disclosed, refer to the embodiments shown in FIG. 2 to FIG. 4 of the present application.
Referring to FIG. 5, which shows a schematic structural diagram of a video generation apparatus provided by an exemplary embodiment of the present application, the video generation apparatus can be implemented as all or part of a device through software, hardware, or a combination of the two. The apparatus 1 includes a ray information acquisition module 10, a key point information acquisition module 20, a key point encoding module 30, an image acquisition module 40 and a video synthesis module 50.
The ray information acquisition module 10 is configured to obtain first information characterizing a first ray.
The key point information acquisition module 20 is configured to obtain, multiple times, second information of multiple first key points of a target object, the second information including the spatial coordinates of the key points and the features of the key points.
The key point encoding module 30 is configured to generate, from the first information and the second information obtained each time, multiple first key point fusion features corresponding to the first ray.
The image acquisition module 40 is configured to input the first information, paired with each of the multiple first key point fusion features, into a pre-trained neural light field NeLF model multiple times, thereby obtaining multiple static images of the target object, wherein the number of static images equals the number of times the second information of the first key points is obtained, and each input to the NeLF model is formed by pairing the first information with one first key point fusion feature.
The video synthesis module 50 is configured to synthesize the multiple static images into a video.
Optionally, for the first information and the second information obtained each time, the key point encoding module 30 determines at least one second key point associated with the first ray from the multiple first key points, and performs attention computation on the first information and the second information of the at least one second key point to obtain the first key point fusion feature.
Optionally, the first information is information characterizing multiple sampling points of the first ray, or the first information is information characterizing a vector of the first ray.
Optionally, the key point encoding module 30 is further configured to determine, from the multiple first key points, at least one second key point associated with the multiple sampling points according to the positional relationship between the multiple sampling points and the multiple first key points.
Optionally, the key point encoding module 30 is further configured to determine, from the multiple first key points, at least one second key point associated with the multiple sampling points according to the positional relationship between the vector and the multiple first key points.
Optionally, the key point encoding module 30 is further configured to compute the distance between the spatial coordinates of each of the multiple sampling points and the spatial coordinates of the multiple first key points, and to determine at least one first key point whose distance is less than or equal to a preset threshold as the at least one second key point.
It should be noted that when the video generation apparatus provided in the above embodiment executes the video generation method, the division into the above functional modules is only an example; in practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the video generation apparatus provided in the above embodiment and the embodiments of the video generation method belong to the same concept; its implementation process is detailed in the method embodiments and is not repeated here.
The serial numbers of the above embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments.
An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the video generation method of the embodiments shown in FIG. 2 to FIG. 4 is implemented. For the specific execution process, refer to the specific description of the embodiments shown in FIG. 2 to FIG. 4, which is not repeated here.
Referring to FIG. 6, which shows a schematic structural diagram of video generation equipment provided by an exemplary embodiment of the present application, the video generation equipment in the present application may include one or more of the following components: a processor 110, a memory 120, an input device 130, an output device 140 and a bus 150. The processor 110, the memory 120, the input device 130 and the output device 140 may be connected via the bus 150.
The processor 110 may include one or more processing cores. The processor 110 uses various interfaces and lines to connect the various parts of the video generation equipment, and executes various functions of the terminal 100 and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 120 and calling data stored in the memory 120. Optionally, the processor 110 can be implemented in at least one hardware form of digital signal processing (DSP), field-programmable gate array (FPGA) and programmable logic array (PLA). The processor 110 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user pages and applications; the GPU is responsible for rendering and drawing display content; the modem is used to handle wireless communication. It can be understood that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.
The memory 120 may include random access memory (RAM) or read-only memory (ROM). Optionally, the memory 120 includes a non-transitory computer-readable storage medium. The memory 120 may be used to store instructions, programs, code, code sets or instruction sets. The memory 120 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, an image playback function, etc.), instructions for implementing the above method embodiments, and the like. The operating system may be an Android system (including systems developed in depth based on Android), the iOS system developed by Apple (including systems developed in depth based on iOS), or another system.
The memory 120 can be divided into an operating system space and a user space; the operating system runs in the operating system space, and native and third-party applications run in the user space. To ensure that different third-party applications all run well, the operating system allocates corresponding system resources to different third-party applications. However, different application scenarios within the same third-party application also have different requirements for system resources; for example, in a local resource loading scenario, the third-party application has higher requirements for disk read speed, whereas in an animation rendering scenario, it has higher requirements for GPU performance. The operating system and third-party applications are independent of each other, and the operating system often cannot perceive the current application scenario of a third-party application in a timely manner, so it cannot adapt system resources specifically to the particular application scenario of the third-party application.
In order for the operating system to distinguish the specific application scenario of a third-party application, data communication between the third-party application and the operating system needs to be established, so that the operating system can obtain the current scenario information of the third-party application at any time and then adapt system resources accordingly based on the current scenario.
The input device 130 is used to receive input instructions or data, and includes but is not limited to a keyboard, a mouse, a camera, a microphone or a touch device. The output device 140 is used to output instructions or data, and includes but is not limited to a display device, a speaker, and the like. In one example, the input device 130 and the output device 140 can be combined, with both being a touch display screen.
The touch display screen can be designed as a full screen, a curved screen or a special-shaped screen; it can also be designed as a combination of a full screen and a curved screen, or of a special-shaped screen and a curved screen, which is not limited in the embodiments of the present application.
In addition, those skilled in the art can understand that the structure of the video generation equipment shown in the above drawings does not constitute a limitation on the video generation equipment; the video generation equipment may include more or fewer components than shown, combine certain components, or arrange the components differently. For example, the video generation equipment also includes components such as a radio frequency circuit, an input unit, a sensor, an audio circuit, a wireless fidelity (Wi-Fi) module, a power supply and a Bluetooth module, which are not described here again.
In the video generation equipment shown in FIG. 6, the processor 110 may be used to call the computer program stored in the memory 120 and implement the methods described in the above method embodiments.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program; the program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is only a preferred embodiment of the present application and certainly cannot limit the scope of the rights of the present application; equivalent changes made according to the claims of the present application therefore still fall within the scope covered by the present application.

Claims (10)

  1. A video generation method, characterized by comprising:
    obtaining first information characterizing a first ray;
    obtaining, multiple times, second information of multiple first key points of a target object, the second information including the spatial coordinates of the key points and the features of the key points;
    generating, from the first information and the second information obtained each time, multiple first key point fusion features corresponding to the first ray;
    inputting the first information, paired with each of the multiple first key point fusion features, into a pre-trained neural light field NeLF model multiple times, thereby obtaining multiple static images of the target object, wherein the number of the multiple static images equals the number of times the second information of the first key points is obtained, and each input to the NeLF model is formed by pairing the first information with one first key point fusion feature;
    synthesizing the multiple static images into a video.
  2. The method according to claim 1, characterized in that generating, from the first information and the second information obtained each time, the multiple first key point fusion features corresponding to the first ray comprises:
    for the first information and the second information obtained each time,
    determining, from the multiple first key points, at least one second key point associated with the first ray;
    performing attention computation on the first information and the second information of the at least one second key point to obtain the first key point fusion feature.
  3. The method according to claim 1 or 2, characterized in that obtaining the first information characterizing the first ray comprises:
    the first information being information characterizing multiple sampling points of the first ray; or
    the first information being information characterizing a vector of the first ray.
  4. The method according to claim 3, characterized in that, when the first information is information characterizing multiple sampling points of the first ray, determining, from the multiple first key points, the at least one second key point associated with the first ray comprises:
    determining, from the multiple first key points, at least one second key point associated with the multiple sampling points according to the positional relationship between the multiple sampling points and the multiple first key points.
  5. The method according to claim 3, characterized in that, when the first information is information characterizing a vector of the first ray, determining, from the multiple first key points, the at least one second key point associated with the first ray comprises:
    determining, from the multiple first key points, at least one second key point associated with the multiple sampling points according to the positional relationship between the vector and the multiple first key points.
  6. The method according to claim 4, characterized in that determining, from the multiple first key points, the at least one second key point associated with the multiple sampling points comprises:
    computing the distance between the spatial coordinates of each of the multiple sampling points and the spatial coordinates of the multiple first key points;
    determining at least one first key point whose distance is less than or equal to a preset threshold as the at least one second key point.
  7. The method according to claim 5, characterized in that determining, from the multiple first key points, the at least one second key point associated with the vector comprises:
    computing the distance between the vector and the spatial coordinates of the multiple first key points;
    determining at least one first key point whose distance is less than or equal to a preset threshold as the at least one second key point.
  8. A video generation apparatus, characterized by comprising:
    a ray information acquisition module, configured to obtain first information characterizing a first ray;
    a key point information acquisition module, configured to obtain, multiple times, second information of multiple first key points of a target object, the second information including the spatial coordinates of the key points and the features of the key points;
    a key point encoding module, configured to generate, from the first information and the second information obtained each time, multiple first key point fusion features corresponding to the first ray;
    an image acquisition module, configured to input the first information, paired with each of the multiple first key point fusion features, into a pre-trained neural light field NeLF model multiple times, thereby obtaining multiple static images of the target object, wherein the number of the multiple static images equals the number of times the second information of the first key points is obtained, and each input to the NeLF model is formed by pairing the first information with one first key point fusion feature;
    a video synthesis module, configured to synthesize the multiple static images into a video.
  9. An electronic device, characterized by comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the computer program is executed by the processor, the steps of the method according to any one of claims 1 to 7 are implemented.
  10. A computer-readable storage medium, characterized in that the computer storage medium stores a plurality of instructions, the instructions being suitable for being loaded by a processor to execute the steps of the method according to any one of claims 1 to 7.
PCT/CN2022/143239 2022-10-09 2022-12-29 Video generation method, apparatus, device and computer-readable storage medium WO2024077792A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211226180.6 2022-10-09
CN202211226180.6A CN115714888B (zh) 2022-10-09 2022-10-09 Video generation method, apparatus, device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2024077792A1 true WO2024077792A1 (zh) 2024-04-18

Family

ID=85231014

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/143239 WO2024077792A1 (zh) 2022-10-09 2022-12-29 Video generation method, apparatus, device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN115714888B (zh)
WO (1) WO2024077792A1 (zh)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112887698A (zh) * 2021-02-04 2021-06-01 中国科学技术大学 基于神经辐射场的高质量人脸语音驱动方法
CN113099208A (zh) * 2021-03-31 2021-07-09 清华大学 基于神经辐射场的动态人体自由视点视频生成方法和装置
CN113112592A (zh) * 2021-04-19 2021-07-13 浙江大学 一种可驱动的隐式三维人体表示方法
CN113822969A (zh) * 2021-09-15 2021-12-21 宿迁硅基智能科技有限公司 训练神经辐射场模型和人脸生成方法、装置及服务器
WO2022023442A1 (en) * 2020-07-28 2022-02-03 Deepmind Technologies Limited Semi-supervised keypoint based models
CN114663574A (zh) * 2020-12-23 2022-06-24 宿迁硅基智能科技有限公司 基于单视角照片的三维人脸自动建模方法、系统及装置
CN114926553A (zh) * 2022-05-12 2022-08-19 中国科学院计算技术研究所 基于神经辐射场的三维场景一致性风格化方法及系统

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013030833A1 (en) * 2011-08-29 2013-03-07 I.C.V.T. Ltd. Controlling a video content system
FR3066304A1 (fr) * 2017-05-15 2018-11-16 B<>Com Procede de compositon d'une image d'un utilisateur immerge dans une scene virtuelle, dispositif, equipement terminal, systeme de realite virtuelle et programme d'ordinateur associes
CN108985259B (zh) * 2018-08-03 2022-03-18 百度在线网络技术(北京)有限公司 人体动作识别方法和装置
CN109951654B (zh) * 2019-03-06 2022-02-15 腾讯科技(深圳)有限公司 一种视频合成的方法、模型训练的方法以及相关装置
CN111402290B (zh) * 2020-02-29 2023-09-12 华为技术有限公司 一种基于骨骼关键点的动作还原方法以及装置
US11816404B2 (en) * 2020-03-20 2023-11-14 Nvidia Corporation Neural network control variates
CN111694429B (zh) * 2020-06-08 2023-06-02 北京百度网讯科技有限公司 虚拟对象驱动方法、装置、电子设备及可读存储
CN112468796B (zh) * 2020-11-23 2022-04-29 平安科技(深圳)有限公司 注视点生成方法、系统及设备
CN112733616B (zh) * 2020-12-22 2022-04-01 北京达佳互联信息技术有限公司 一种动态图像的生成方法、装置、电子设备和存储介质
CN116569218A (zh) * 2020-12-24 2023-08-08 华为技术有限公司 图像处理方法和图像处理装置
CN113920230A (zh) * 2021-09-15 2022-01-11 上海浦东发展银行股份有限公司 人物形象视频生成方法、装置、计算机设备和存储介质
CN113793408B (zh) * 2021-09-15 2023-05-30 宿迁硅基智能科技有限公司 一种实时音频驱动人脸生成方法、装置及服务器

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022023442A1 (en) * 2020-07-28 2022-02-03 Deepmind Technologies Limited Semi-supervised keypoint based models
CN114663574A (zh) * 2020-12-23 2022-06-24 宿迁硅基智能科技有限公司 基于单视角照片的三维人脸自动建模方法、系统及装置
CN112887698A (zh) * 2021-02-04 2021-06-01 中国科学技术大学 基于神经辐射场的高质量人脸语音驱动方法
CN113099208A (zh) * 2021-03-31 2021-07-09 清华大学 基于神经辐射场的动态人体自由视点视频生成方法和装置
CN113112592A (zh) * 2021-04-19 2021-07-13 浙江大学 一种可驱动的隐式三维人体表示方法
CN113822969A (zh) * 2021-09-15 2021-12-21 宿迁硅基智能科技有限公司 训练神经辐射场模型和人脸生成方法、装置及服务器
CN114926553A (zh) * 2022-05-12 2022-08-19 中国科学院计算技术研究所 基于神经辐射场的三维场景一致性风格化方法及系统

Also Published As

Publication number Publication date
CN115714888B (zh) 2023-08-29
CN115714888A (zh) 2023-02-24

Similar Documents

Publication Publication Date Title
EP4054161A1 (en) Call control method and related product
EP3889915A2 (en) Method and apparatus for generating virtual avatar, device, medium and computer program product
US11989845B2 (en) Implementation and display of augmented reality
US11587280B2 (en) Augmented reality-based display method and device, and storage medium
EP3917131A1 (en) Image deformation control method and device and hardware device
CN109754464B (zh) 用于生成信息的方法和装置
WO2020211573A1 (zh) 用于处理图像的方法和装置
US20220375258A1 (en) Image processing method and apparatus, device and storage medium
CN112581635B (zh) 一种通用的快速换脸方法、装置、电子设备和存储介质
CN110288532B (zh) 生成全身图像的方法、装置、设备及计算机可读存储介质
CN113344776B (zh) 图像处理方法、模型训练方法、装置、电子设备及介质
CN112991208B (zh) 图像处理方法及装置、计算机可读介质和电子设备
CN111368668B (zh) 三维手部识别方法、装置、电子设备及存储介质
CN110059739B (zh) 图像合成方法、装置、电子设备和计算机可读存储介质
US11935176B2 (en) Face image displaying method and apparatus, electronic device, and storage medium
CN109816791B (zh) 用于生成信息的方法和装置
WO2024077792A1 (zh) 视频生成方法、装置、设备与计算机可读存储介质
US20240177409A1 (en) Image processing method and apparatus, electronic device, and readable storage medium
US20220198828A1 (en) Method and apparatus for generating image
WO2020155981A1 (zh) 表情图像效果生成方法、装置和电子设备
WO2024077791A1 (zh) 视频生成方法、装置、设备与计算机可读存储介质
US11836437B2 (en) Character display method and apparatus, electronic device, and storage medium
US11954779B2 (en) Animation generation method for tracking facial expression and neural network training method thereof
Gan et al. Research on realistic effect generation algorithm of rendering images based on GAN
CN112967362A (zh) 动画生成方法和装置、存储介质和电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22961969

Country of ref document: EP

Kind code of ref document: A1