CN111432233B - Method, apparatus, device and medium for generating video - Google Patents

Method, apparatus, device and medium for generating video

Info

Publication number
CN111432233B
Authority
CN
China
Prior art keywords
image
target person
audio frame
audio
target
Prior art date
Legal status
Active
Application number
CN202010199332.2A
Other languages
Chinese (zh)
Other versions
CN111432233A (en)
Inventor
Yin Xiang (殷翔)
Current Assignee
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN202010199332.2A
Publication of CN111432233A
Application granted
Publication of CN111432233B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233 Processing of audio elementary streams
    • H04N21/2335 Processing of audio elementary streams involving reformatting operations of audio signals, e.g. by converting from one coding standard to another
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/2343 Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/4398 Processing of audio elementary streams involving reformatting operations of audio signals
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/47 End-user applications
    • H04N21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47205 End-user interface for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

Embodiments of the present disclosure disclose methods, apparatuses, devices and media for generating video. One embodiment of the method comprises: acquiring a target voice audio and a target person image; for each audio frame included in the target voice audio, generating, based on the audio frame and the target person image, an image sequence representing the target person indicated by the target person image performing the action corresponding to the audio frame; and generating, based on the target voice audio and the generated image sequences, a video representing the target person performing the action corresponding to the target voice audio. In this embodiment, a video showing a person performing the actions corresponding to a voice audio is generated from the acquired voice audio and person image, which enriches the ways in which video can be generated and improves the flexibility of video generation.

Description

Method, apparatus, device and medium for generating video
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for generating a video.
Background
The popularity of video is a trend in modern society. This phenomenon stems not only from technical factors such as the emergence of smartphones and the spread of 4G networks, but also from users themselves: people's behavior and habits are changing, and more and more users acquire information and record their lives through video.
At present, users' demands for video production are becoming increasingly diverse. For example, a user typically shares videos he or she has shot or made with friends and publishes them on video platforms. In many cases, users want the images, sounds and motions presented in a video to be more aesthetically pleasing, while keeping the shooting and production processes simple and convenient to operate.
Disclosure of Invention
The present disclosure presents methods, apparatuses, devices and media for generating video.
In a first aspect, an embodiment of the present disclosure provides a method for generating a video, the method including: acquiring a target voice audio and a target person image; for each audio frame included in the target voice audio, generating, based on the audio frame and the target person image, an image sequence representing the target person indicated by the target person image performing the action corresponding to the audio frame; and generating, based on the target voice audio and the generated image sequences, a video representing the target person performing the action corresponding to the target voice audio.
In some embodiments, generating, based on the audio frame and the target person image, a sequence of images characterizing the target person indicated by the target person image to perform an action corresponding to the audio frame includes: generating fusion deformation information corresponding to the audio frame based on the phoneme information indicated by the audio frame; and generating an image sequence representing that the target person indicated by the target person image performs the action corresponding to the audio frame based on the fusion deformation information corresponding to the audio frame and the target person image.
In some embodiments, based on the audio frame and the target person image, generating a sequence of images characterizing the target person indicated by the target person image to perform an action corresponding to the audio frame includes: inputting the audio frame and the target person image into a pre-trained image generation model, and generating an image sequence for representing the target person indicated by the target person image to perform the action corresponding to the audio frame, wherein the image generation model is used for generating the image sequence for representing the action corresponding to the input audio frame performed by the person indicated by the input person image.
In some embodiments, inputting the audio frame and the target person image to a pre-trained image generation model, generating a sequence of images characterizing the target person indicated by the target person image to perform an action corresponding to the audio frame, includes: inputting the audio frame into a first network model in a pre-trained image generation model to obtain phoneme information indicated by the audio frame, wherein the first network model is used for determining the phoneme information indicated by the input audio frame; inputting the phoneme information indicated by the audio frame and the target person image into a second network model in the image generation model, and generating an image sequence for representing that the target person indicated by the target person image performs the action corresponding to the audio frame, wherein the second network model is used for representing the corresponding relation among the phoneme information, the person image and the image sequence.
In some embodiments, the first network model is trained by: acquiring a training sample set, wherein training samples in the training sample set comprise audio frames and phoneme information indicated by the audio frames; and training to obtain a first network model by using a machine learning algorithm and using the audio frames in the training sample set as input data and using the phoneme information indicated by the audio frames as expected output data.
In some embodiments, the image generation model is trained by: acquiring a preset number of target videos, wherein the target videos are videos obtained by recording voice audio and images of a person; extracting voice audio and an image sequence matched with the extracted voice audio from a preset number of target videos; acquiring an initial model for training to obtain an image generation model; initializing model parameters corresponding to the trained model parameters of the first network model in the initial model by using the trained model parameters of the first network model to obtain an intermediate model; and (3) adopting a machine learning algorithm, taking the audio frame in the extracted voice audio as input data of the intermediate model, taking the image sequence matched with the audio frame as expected output data of the intermediate model, and training to obtain the image generation model.
In some embodiments, training an image generation model by using an audio frame in the extracted speech audio as input data of an intermediate model and using an image sequence matched with the audio frame as expected output data of the intermediate model includes: responding to the fact that a preset training end condition is not met, inputting the audio frames in the extracted voice audio into an intermediate model, obtaining actual output data of the intermediate model, and adjusting model parameters of the intermediate model based on the actual output data and expected output data, wherein the actual output data represent an image sequence actually obtained by the intermediate model, and the expected output data represent the extracted image sequence matched with the audio frames; and in response to the preset training end condition being met, taking the intermediate model meeting the preset training end condition as an image generation model.
In some embodiments, the preset training end condition comprises at least one of: matching the image sequence represented by the actual output data with the audio frame; and the correlation degree of two adjacent images in the image sequence represented by the actual output data is greater than or equal to a preset correlation degree threshold value, wherein the correlation degree is used for representing the adjacent probability of the two target person images in the video.
In some embodiments, obtaining the target speech audio and the target person image comprises: acquiring target voice audio and a plurality of target person images of a target person; and for an audio frame included in the target voice audio, generating an image sequence representing that the target person indicated by the target person image performs an action corresponding to the audio frame based on the audio frame and the target person image, including: carrying out feature extraction on a plurality of target personnel images to obtain image feature information; and generating an image sequence representing that the target person indicated by the target person image executes the action corresponding to the audio frame based on the audio frame and the image characteristic information aiming at the audio frame included by the target voice audio.
In some embodiments, the target person image comprises a facial image of the target person; the action corresponding to the audio frame characterizes the target person uttering the speech indicated by the audio frame; and the action corresponding to the target speech audio characterizes the target person uttering the speech indicated by the target speech audio.
In a second aspect, an embodiment of the present disclosure provides an apparatus for generating a video, the apparatus including: an acquisition unit configured to acquire a target voice audio and a target person image; a first generation unit configured to generate, for an audio frame included in the target speech audio, an image sequence representing that the target person indicated by the target person image performs an action corresponding to the audio frame, based on the audio frame and the target person image; and a second generating unit configured to generate a video representing that the target person performs an action corresponding to the target voice audio, based on the target voice audio and the generated respective image sequences.
In a third aspect, an embodiment of the present disclosure provides an electronic device for generating a video, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the method of any of the embodiments of the method for generating video as described above.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium for generating a video, on which a computer program is stored, which when executed by a processor, implements the method of any of the embodiments of the method for generating a video as described above.
According to the method, apparatus, device and medium for generating video provided by the embodiments of the present disclosure, a target voice audio and a target person image are acquired; then, for each audio frame included in the target voice audio, an image sequence representing the target person indicated by the target person image performing the action corresponding to the audio frame is generated based on the audio frame and the target person image; finally, a video representing the target person performing the action corresponding to the target voice audio is generated based on the target voice audio and the generated image sequences. A video representing a person performing the actions corresponding to a voice audio can thus be generated from the acquired voice audio and person image, which enriches the ways in which video can be generated and improves the flexibility of video generation.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a method for generating video in accordance with the present disclosure;
FIG. 3 is a schematic diagram of one application scenario of a method for generating video in accordance with the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method for generating video in accordance with the present disclosure;
FIG. 5 is a block diagram of one embodiment of an apparatus for generating video according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system of an electronic device suitable for implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the disclosure and are not limiting of the disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and the features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 of an embodiment of a method for generating video or an apparatus for generating video to which embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to receive or transmit data (e.g., transmit target voice audio and/or target person images), and so on. The terminal devices 101, 102, 103 may have various client applications installed thereon, such as video playing software, video processing applications, news information applications, image processing applications, web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having information processing functions, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server that provides various services, such as a background video processing server that generates video representing that the target person performs an action corresponding to the target voice audio based on the target voice audio and the target person image transmitted by the terminal devices 101, 102, 103. The background video processing server can analyze and process the received data such as the target voice audio and the target person image, and the like, so that a video representing that the target person executes the action corresponding to the target voice audio is generated. Optionally, the background video processing server may also feed back the generated video to the terminal device for the terminal device to play. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be further noted that the method for generating a video provided by the embodiments of the present disclosure may be executed by a server, by a terminal device, or by the server and the terminal device in cooperation with each other. Accordingly, the various parts (e.g., the various units, sub-units, modules, and sub-modules) included in the apparatus for generating video may all be disposed in the server, may all be disposed in the terminal device, or may be disposed in the server and the terminal device, respectively.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation. The system architecture may only include the electronic device (e.g., server or terminal device) on which the method for generating video is running, when the electronic device on which the method for generating video is running does not need to be in data transfer with other electronic devices.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for generating video in accordance with the present disclosure is shown. The method for generating the video comprises the following steps:
step 201, acquiring a target voice audio and a target person image.
In this embodiment, an execution subject of the method for generating a video (for example, the server or a terminal device shown in fig. 1) may obtain the target voice audio and the target person image locally or from other electronic devices through a wired or wireless connection.
The target voice audio may be any voice audio. As an example, the target voice audio may be foreign language audio (e.g., english audio) input by the user.
In some cases, the target voice audio may also be a voice audio of a user reading text. As an example, a terminal device used by a user may present text to be read (for example, english text), and during the process of reading the text, the terminal device may record voice uttered by the user, and use the recorded voice audio as a target voice audio. Optionally, the target voice audio may also be a voice audio pre-stored by the user.
Here, in a case where the execution subject is a terminal device, the execution subject may directly record the voice audio uttered by a user, thereby obtaining the target voice audio; in a case where the execution subject is a server, the execution subject may acquire the voice audio (i.e., the target voice audio) uttered by the user from the terminal device used by the user.
The target person image may be an image of a target person, and the target person may be any person. As an example, the target person may be a star, and the target person image may be a star image; the target person may also be a user and the target person image may also be a user image. Further, the target person image may be a partial image (e.g., a facial image) of the target person; the target person image may also be a whole-body image of the target person.
Here, the target person image may include one person image of one person, or may include a plurality of person images of one person (i.e., a plurality of person images indicate the same person). The speech audio (including the target speech audio) may be composed of one or more frames of audio.
And 202, generating an image sequence which is used for representing that the target person indicated by the target person image executes the action corresponding to the audio frame based on the audio frame and the target person image aiming at the audio frame included by the target voice audio.
In this embodiment, for an audio frame included in the target speech audio obtained in step 201, the executing body may generate an image sequence representing that the target person indicated by the target person image executes an action corresponding to the audio frame, based on the audio frame and the target person image obtained in step 201.
Wherein the action corresponding to the audio frame may be characterized as: the target person utters the speech indicated by the audio frame. For example, if the audio frame is audio of "o", since the mouth needs to open when the voice indicated by the audio frame is uttered, the action corresponding to the audio frame may represent: the target person opens his mouth. In addition, the action corresponding to the audio frame may also be characterized as: the target person performs a limb action corresponding to the audio frame. For example, if the audio frame is "salute" audio, the action corresponding to the audio frame may characterize: the target person salutes.
In some optional implementations of this embodiment, the execution subject may perform step 202 in the following manner:
step one, based on the phoneme information indicated by the audio frame, generating fusion deformation (Blendshap) information corresponding to the audio frame. The phoneme information may indicate phonemes (e.g., monophonic or triphone, etc.) or may be a predetermined posterior probability (i.e., a numerical value that is not normalized) of each phoneme in the phoneme set. The fusion deformation information may characterize predetermined deformation information of the respective elements. As an example, the predetermined element may include, but is not limited to, any of the following in the image: eyes, mouth, eyebrows, etc. In practice, typically at the time of a smile, the actions of blinking and mouth corner lifting may be performed. In this case, the fusion deformation information may characterize blinking and mouth corner lift.
As an example, the execution body may input phoneme information indicated by the audio frame to a predetermined information generation model, and generate blending warp (Blendshap) information corresponding to the audio frame. The information generation model may represent a correspondence between the phoneme information indicated by the audio frame and the fusion deformation information. For example, the information generation model may be a two-dimensional table or a database representing the correspondence between the phoneme information indicated by the audio frame and the fusion deformation information, or may be a model obtained by training using a machine learning algorithm.
And step two, generating an image sequence representing that the target person indicated by the target person image executes the action corresponding to the audio frame based on the fusion deformation information corresponding to the audio frame and the target person image acquired in the step 201.
As an example, after obtaining the fusion deformation information, the execution subject may use it to adjust the target person image acquired in step 201, so as to obtain an image sequence representing the target person indicated by the target person image performing the action corresponding to the audio frame. For example, if the fusion deformation information represents blinking and mouth-corner raising, the execution subject may adjust the target person image acquired in step 201 to obtain an image sequence representing the target person indicated by the target person image blinking and raising the corners of the mouth.
Here, the playback time length of the audio frame is generally the same as the playback time length of the video composed of the image sequence of the motion corresponding to the audio frame.
It is understood that the above alternative implementation may generate an image sequence based on the phoneme information and the fusion deformation information, so that the matching degree between the image (i.e., the video frame) in the finally generated video (the video generated by the subsequent step 203) and the target voice audio obtained in step 201 may be improved, and the accuracy of representing the action corresponding to the audio frame performed by the target person indicated by the target person image may be improved.
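The two steps above can be pictured as a small numerical pipeline. The following sketch is illustrative only and is not part of the disclosure: the linear phoneme-to-blend-shape mapping, the landmark-based representation of the target person image, and all array shapes are assumptions.

    import numpy as np

    NUM_BLENDSHAPES = 52   # assumed number of fusion deformation (blend shape) targets
    NUM_LANDMARKS = 68     # assumed number of facial landmarks in the target person image

    def phonemes_to_blendshapes(phoneme_posteriors: np.ndarray,
                                mapping: np.ndarray) -> np.ndarray:
        """Step one: map per-frame phoneme posteriors (T, P) to fusion deformation
        weights (T, B). `mapping` stands in for the information generation model."""
        return phoneme_posteriors @ mapping            # (T, P) x (P, B) -> (T, B)

    def apply_blendshapes(neutral_landmarks: np.ndarray,
                          basis: np.ndarray,
                          weights: np.ndarray) -> np.ndarray:
        """Step two: deform the target person's neutral landmarks frame by frame.

        neutral_landmarks: (L, 2); basis: (B, L, 2); weights: (T, B).
        Returns a (T, L, 2) landmark sequence; rendering these landmarks back onto
        the target person image would give the image sequence for the audio frame."""
        return neutral_landmarks[None] + np.einsum("tb,blc->tlc", weights, basis)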
In some optional implementations of this embodiment, the execution subject may perform step 201 as follows: acquiring the target voice audio and a plurality of target person images of the target person.
Based on this, the execution subject may further perform step 202 as follows:
firstly, feature extraction is carried out on a plurality of target person images to obtain image feature information. Wherein, the image feature information may include, but is not limited to, at least one of the following information: location information of key points, relative location information between respective key points, and the like. Wherein the key points may include, but are not limited to, at least one of: eyes, eyebrows, mouth, arms, legs, knees, etc.
Then, for an audio frame included in the target speech audio, based on the audio frame and the image feature information, an image sequence is generated that characterizes the target person indicated by the target person image to perform an action corresponding to the audio frame.
As an example, for an audio frame included in the target speech audio, the execution subject may input the audio frame and the image feature information to a first generation model trained in advance, and generate an image sequence representing the target person indicated by the target person image performing the action corresponding to the audio frame. The first generation model may represent the correspondence among audio frames, image feature information, and image sequences representing the target person indicated by the target person image performing the actions corresponding to the audio frames. The first generation model may be obtained by training with a machine learning algorithm.
It can be understood that, compared with the technical scheme of using one target person image, the alternative implementation manner can obtain richer image feature information, and further generate a video containing richer image feature information through subsequent steps.
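As a rough illustration of this alternative implementation, the sketch below pools keypoint-based image feature information from several images of the target person and passes it, together with an audio frame, to a pre-trained first generation model. The helper `detect_keypoints` and the model interface are assumptions introduced purely for illustration.

    import numpy as np

    def extract_image_features(person_images, detect_keypoints):
        """Pool keypoint information (eyes, eyebrows, mouth, limbs, ...) from several
        images of the same person into one feature vector. `detect_keypoints` is a
        stand-in for any landmark/pose detector returning a (K, 2) array."""
        per_image = []
        for img in person_images:
            kps = detect_keypoints(img)                   # absolute keypoint positions
            rel = kps - kps.mean(axis=0, keepdims=True)   # relative positions between keypoints
            per_image.append(np.concatenate([kps.ravel(), rel.ravel()]))
        return np.mean(per_image, axis=0)                 # average over the image set

    def generate_sequence(first_generation_model, audio_frame, image_features):
        """Feed one audio frame plus the pooled image features to the (assumed)
        pre-trained first generation model to obtain an image sequence."""
        return first_generation_model(audio_frame, image_features)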
And step 203, generating a video representing that the target person performs the action corresponding to the target voice audio based on the target voice audio and the generated image sequences.
In this embodiment, the executing body may generate a video representing that the target person performs an action corresponding to the target voice audio based on the target voice audio acquired in step 201 and each image sequence generated in step 202.
It will be appreciated that video in general comprises both audio and image sequences (i.e. sequences of video frames). The audio included in the video generated in step 203 may be the target voice audio acquired in step 201, and the image sequence included in the video generated in step 203 may be composed of the image sequences generated in step 202.
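A minimal sketch of this assembly step is shown below, assuming the generated image sequences are lists of equally sized frames and that OpenCV plus an external ffmpeg binary are available for writing and muxing; neither tool is prescribed by the disclosure.

    import subprocess
    import cv2

    def assemble_video(image_sequences, audio_path, out_path, fps=25):
        """Concatenate the per-audio-frame image sequences into a video track and
        mux it with the target voice audio (here via an external ffmpeg call)."""
        frames = [frame for seq in image_sequences for frame in seq]
        height, width = frames[0].shape[:2]
        writer = cv2.VideoWriter("video_only.mp4",
                                 cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
        for frame in frames:
            writer.write(frame)
        writer.release()
        # Pair the silent video track with the target voice audio.
        subprocess.run(["ffmpeg", "-y", "-i", "video_only.mp4", "-i", audio_path,
                        "-c:v", "copy", "-c:a", "aac", "-shortest", out_path],
                       check=True)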
With continuing reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating video according to the present embodiment. In the application scenario of fig. 3, the terminal device 301 first obtains a target voice audio 303 (e.g., english voice audio of "Hi, I'm Tom" input by the user) and a target person image 302 (e.g., a self-portrait image uploaded by the user). Then, for an audio frame included in the target speech audio 303, the terminal device 301 generates an image sequence 304 representing that the target person indicated by the target person image performs an action corresponding to the audio frame, based on the audio frame and the target person image 302. Finally, the terminal device 301 generates a video 305 representing that the target person performs an action corresponding to the target voice audio based on the target voice audio 303 and the generated respective image sequences 304. By way of example, the generated video 305 may be a video of a target person (e.g., the user) reading english (e.g., "Hi, I'm Tom"), and the video 305 may present actions of the target person such as mouth, eyebrow, and eyes, for example, the mouth opening and closing matching the content being read. Optionally, the user may also forward the generated video 305 for other users to play, compare, comment, like, or score. For example, the user may rate based on whether the audio sounds in the video are accurate, the fluency of the audio, etc., to correct the spoken language of the person in the video.
According to the method provided by the above embodiment of the disclosure, the target voice audio and the target person image are acquired; then, for each audio frame included in the target voice audio, an image sequence representing the target person indicated by the target person image performing the action corresponding to the audio frame is generated based on the audio frame and the target person image; finally, a video representing the target person performing the action corresponding to the target voice audio is generated based on the target voice audio and the generated image sequences. A video representing a person performing the actions corresponding to a voice audio can thus be generated from the acquired voice audio and person image, which enriches the ways in which video can be generated and improves the flexibility of video generation. It also improves the expressiveness of body language in the generated video. The method provided by the above embodiment can generate a dynamic video from only a small number of static images (one or more images); for example, a video of the person indicated by the static images speaking can be generated.
In some alternative implementations of the present embodiment, the target person image includes a facial image of the target person. The action corresponding to the audio frame characterizes the target person uttering the speech indicated by the audio frame. The action corresponding to the target speech audio characterizes the target person uttering the speech indicated by the target speech audio.
It can be understood that, in the above alternative implementation manner, the facial image of the target person may be adjusted according to the voice audio (for example, the mouth shape, eyes and eyebrows in the facial image are adjusted to match the voice audio), so as to obtain a video representing the target person uttering the speech indicated by the target voice audio, which further enriches the ways in which video can be generated and improves the flexibility of video generation. It also makes the facial expressions in the generated video richer.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for generating a video is shown. The flow 400 of the method for generating a video comprises the following steps:
step 401, acquiring a target voice audio and a target person image.
In this embodiment, step 401 is substantially the same as step 201 in the corresponding embodiment of fig. 2, and is not described here again.
Step 402, aiming at an audio frame included in the target voice audio, inputting the audio frame and the target person image into a pre-trained image generation model, and generating an image sequence representing that the target person indicated by the target person image executes an action corresponding to the audio frame.
In this embodiment, for an audio frame included in the target speech audio, an executing subject (e.g., the server or the terminal device shown in fig. 1) of the method for generating the video may input the audio frame and the target person image to a pre-trained image generation model, and generate an image sequence representing that the target person indicated by the target person image performs an action corresponding to the audio frame. Wherein the image generation model is used for generating an image sequence which represents that the person indicated by the input person image performs the action corresponding to the input audio frame.
As an example, the image generation model may be a generative adversarial network trained by a machine learning algorithm, or may be a two-dimensional table or database storing audio frames, person images, and image sequences representing the person indicated by the person image performing the action corresponding to the audio frame.
For example, the execution subject may input the audio frame and the target person image to the generative adversarial network, thereby generating an image sequence representing the target person indicated by the target person image performing the action corresponding to the audio frame.
As an example, the generative adversarial network described above may include a generative model and a discriminative model. In the training process of the generative adversarial network, the generative model may be used to generate an image sequence from the input audio frame and the target person image. The discriminative model may be used to determine whether the generated image sequence corresponds to the input audio frame. As an example, the discriminative model may determine whether the generated target person image sequence corresponds to the input audio frame according to the magnitude relationship between the calculated function value of a loss function and a preset threshold value.
It should be understood that the playing of an audio frame usually lasts for a certain period of time, so an image of the target person corresponding to the audio frame can be generated every predetermined period of time within that duration, thereby obtaining the image sequence of the target person. For example, if the playing duration of the audio frame is 22 ms, a target person image corresponding to the audio frame may be generated every 2 ms, thereby sequentially obtaining the target person images at 0 ms, 2 ms, 4 ms, 6 ms, 8 ms, 10 ms, 12 ms, 14 ms, 16 ms, 18 ms, 20 ms, and 22 ms. Here, if the audio frame is an "o" audio, the mouth generally needs to change from closed to open and then back to closed while the voice indicated by the audio frame is uttered; in this case, the generated image sequence of the target person may show the mouth gradually transitioning from closed to open and then from open to closed.
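The 22 ms example above amounts to sampling image timestamps at a fixed step inside each audio frame, as in the following sketch (the 2 ms step size is purely illustrative):

    def image_timestamps_ms(frame_duration_ms=22, step_ms=2):
        """Moments (in ms) at which one target person image is generated for an audio frame."""
        return list(range(0, frame_duration_ms + 1, step_ms))

    # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22] -> 12 images for a 22 ms audio frame
    print(image_timestamps_ms())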
And step 403, generating a video representing that the target person performs the action corresponding to the target voice audio based on the target voice audio and the generated image sequences.
In this embodiment, step 403 is substantially the same as step 203 in the corresponding embodiment of fig. 2, and is not described herein again.
It should be noted that, besides the above-mentioned contents, the embodiment of the present application may further include the same or similar features and effects as the embodiment corresponding to fig. 2, and details are not repeated herein.
As can be seen from fig. 4, the flow 400 of the method for generating a video in the present embodiment may employ an image generation model to generate an image sequence representing an action performed by a target person indicated by an image of the target person corresponding to the audio frame, thereby helping to improve the matching degree of the image and the voice in the generated video.
In some optional implementations of this embodiment, for an audio frame included in the target speech audio, the executing main body may further execute the step 402 by:
step one, inputting the audio frame into a first network model in a pre-trained image generation model to obtain phoneme information indicated by the audio frame. Wherein the first network model is used to determine phoneme information indicated by the input audio frames.
And step two, inputting the phoneme information indicated by the audio frame and the target person image into a second network model in the image generation model, and generating an image sequence representing that the target person indicated by the target person image executes the action corresponding to the audio frame. The second network model is used for representing the corresponding relation among the phoneme information, the personnel images and the image sequence.
It can be understood that, in the above alternative implementation manner, firstly, the phoneme information indicated by the audio frame is obtained through the first network model, and then, the second network model generates an image sequence that characterizes the target person indicated by the target person image to perform the action corresponding to the audio frame, so that the matching degree between the image (i.e., the video frame) in the finally generated video and the target speech audio can be further improved, and the accuracy of characterizing the target person indicated by the target person image to perform the action corresponding to the audio frame can be improved.
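A minimal PyTorch-style sketch of this two-stage structure is given below. The layer sizes, sequence length, and phoneme-set size are assumptions; the disclosure only fixes the roles of the two networks (audio frame to phoneme information, then phoneme information plus person image to image sequence).

    import torch
    import torch.nn as nn

    class FirstNetwork(nn.Module):
        """Audio frame -> phoneme information (e.g. un-normalised posteriors)."""
        def __init__(self, audio_dim=80, num_phonemes=60):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(),
                                     nn.Linear(256, num_phonemes))
        def forward(self, audio_frame):
            return self.net(audio_frame)

    class SecondNetwork(nn.Module):
        """Phoneme information + person image -> image sequence of T frames."""
        def __init__(self, num_phonemes=60, seq_len=12, img_size=64):
            super().__init__()
            self.seq_len, self.img_size = seq_len, img_size
            self.net = nn.Sequential(
                nn.Linear(num_phonemes + 3 * img_size * img_size, 1024), nn.ReLU(),
                nn.Linear(1024, seq_len * 3 * img_size * img_size))
        def forward(self, phoneme_info, person_image):
            x = torch.cat([phoneme_info, person_image.flatten(1)], dim=1)
            out = self.net(x)
            return out.view(-1, self.seq_len, 3, self.img_size, self.img_size)

    class ImageGenerationModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.first, self.second = FirstNetwork(), SecondNetwork()
        def forward(self, audio_frame, person_image):
            return self.second(self.first(audio_frame), person_image)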
In some application scenarios of the foregoing alternative implementation, the first network model may be obtained by training, by the execution subject or an electronic device communicatively connected to the execution subject, through the following steps:
a set of training samples is obtained. Wherein the training samples in the training sample set comprise audio frames and phoneme information indicated by the audio frames.
And training to obtain a first network model by using a machine learning algorithm and using the audio frames in the training sample set as input data and using the phoneme information indicated by the audio frames as expected output data.
In practice, in the training process, the audio frames in the training sample set may be used as the input data of the initial model, so as to obtain the actual output data of the initial model. Wherein, the actual output data may be phoneme information calculated by the initial model. And then, adjusting parameters of the initial model by adopting a gradient descent method based on the actual output data and the expected output data so as to obtain the initial model meeting the preset conditions, and taking the initial model meeting the preset conditions as a first network model. Alternatively, a model structure other than the output layer in the initial model satisfying the preset condition may be used as the first network model.
The initial model may be a convolutional neural network including model structures such as convolutional layers and output layers. The preset condition may include, but is not limited to, at least one of the following: the training times are more than or equal to the preset times, the training time exceeds the preset time, and the function value of the loss function calculated based on the expected output data and the actual output data is less than or equal to the preset threshold.
It can be understood that, in the application scenario described above, the first network model may be obtained by training using a machine learning algorithm, so that the accuracy of determining the phoneme information may be improved.
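A hedged sketch of such a training procedure is shown below, assuming a PyTorch data loader of (audio frame, phoneme label) pairs; the optimizer, loss, and stopping thresholds are illustrative stand-ins for the preset conditions described above.

    import torch
    import torch.nn as nn

    def train_first_network(model, loader, max_epochs=50, loss_threshold=0.05, lr=1e-3):
        """Audio frames are the input data; phoneme labels are the expected output data."""
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # gradient descent
        criterion = nn.CrossEntropyLoss()
        for epoch in range(max_epochs):                         # "training times" condition
            epoch_loss = 0.0
            for audio_frames, phoneme_labels in loader:
                optimizer.zero_grad()
                actual_output = model(audio_frames)             # actual output data
                loss = criterion(actual_output, phoneme_labels) # vs. expected output data
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()
            if epoch_loss / len(loader) <= loss_threshold:      # loss-threshold condition
                break
        return model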
In some examples of the above application scenarios, the image generation model is trained by:
step one, obtaining a preset number of target videos. The target video is a video obtained by recording voice audio and images of a person.
Here, the preset number of target videos may be obtained by recording audio and video of different persons. As an example, the preset number may be 1000, the videos correspond to people one to one, and the playing time of each video may be 2 to 3 minutes.
And step two, extracting voice audio and an image sequence matched with the extracted voice audio from a preset number of target videos.
Wherein the image sequence matching the voice audio may show the person performing the action indicated by the voice audio.
And step three, acquiring an initial model for training to obtain the image generation model.
Wherein, the initial model in step three may include, but is not limited to, the following model structure: convolutional layers, fully connected layers, output layers, etc. The initial model and the first network model may contain the same model structure and model parameters.
And step four, initializing model parameters corresponding to the trained model parameters of the first network model in the initial model by adopting the trained model parameters of the first network model to obtain an intermediate model.
Wherein the model parameters corresponding to the model parameters of the first network model are the parameters in the initial model that occupy the same positions in the shared model structure as the trained parameters of the first network model.
And step five, adopting a machine learning algorithm, taking the audio frame in the extracted voice audio as input data of the intermediate model, taking an image sequence matched with the audio frame as expected output data of the intermediate model, and training to obtain an image generation model.
It is to be understood that, in the above example, the trained model parameters of the first network model are first used to initialize the corresponding model parameters in the initial model to obtain an intermediate model, and the image generation model is then trained starting from the intermediate model. In this way, information related to the phoneme information (for example, the posterior probability, i.e., the un-normalized numerical value, of each predetermined phoneme indicated by the audio frame) can serve as intermediate feature information of the finally trained image generation model, thereby improving the accuracy of the image sequences generated by the image generation model.
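Step four can be sketched as a partial weight transfer, for example with PyTorch state dicts. The attribute name `first` mirrors the two-stage sketch above and is an assumption, not a name used in the disclosure.

    import torch

    def build_intermediate_model(initial_model, trained_first_network):
        """Initialise the parameters of the initial model that correspond to the
        first network with the trained values; all other parameters keep their
        initial (e.g. random) values."""
        first_state = trained_first_network.state_dict()
        # Assumes the initial model exposes the shared structure under `first`,
        # mirroring the ImageGenerationModel sketch above.
        initial_model.first.load_state_dict(first_state)
        return initial_model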
In some cases of the above examples, a machine learning algorithm may be employed to perform the above step five by:
firstly, under the condition that a preset training end condition is not met, inputting an audio frame in the extracted voice audio into an intermediate model to obtain actual output data of the intermediate model, and adjusting model parameters of the intermediate model based on the actual output data and expected output data. Wherein the actual output data represents the sequence of images actually obtained by the intermediate model and the expected output data represents the sequence of images extracted matching the audio frame.
Here, the "adjusting the model parameters of the intermediate model" may be adjusting all the model parameters of the intermediate model, or may be adjusting part of the model parameters of the intermediate model.
Then, in the case where the preset training end condition is satisfied, the intermediate model satisfying the preset training end condition is taken as the image generation model.
The training end condition may be used to determine whether to end the model training. As an example, the training end condition may include, but is not limited to, at least one of: the training times are more than or equal to the preset times, the training time exceeds the preset time, and the function value of the loss function calculated based on the expected output data and the actual output data is less than or equal to the preset threshold.
In addition, the preset training end condition may also include at least one of the following:
first, the sequence of images represented by the actual output data is matched to the audio frame.
The execution subject can judge whether the image sequence represented by the actual output data is matched with the audio frame through a first judgment model trained in advance. The first decision model can be obtained by training with a machine learning algorithm. Illustratively, the first decision model may be a frame discriminator (frame discriminator).
And in the second item, the correlation degree of two adjacent images in the image sequence represented by the actual output data is greater than or equal to a preset correlation degree threshold value. The relevancy is used for representing the probability that two target person images are adjacent in the video.
The execution subject may determine whether a correlation between two adjacent images in the image sequence represented by the actual output data is greater than or equal to a preset correlation threshold through a second determination model trained in advance. The second determination model can be obtained by training through a machine learning algorithm. Illustratively, the second determination model may be a Sequence discriminator (Sequence discriminator).
In practice, whether the image sequence represented by the actual output data is matched with the audio frame or not can be represented by the function value of the loss function, and/or the size relation between the correlation degree of two adjacent images in the image sequence represented by the actual output data and the preset correlation degree threshold value can be represented. For example, in a case where the image sequence represented by the actual output data does not match the audio frame, or the correlation between two adjacent images in the image sequence represented by the actual output data is smaller than a preset correlation threshold, the function value of the calculated loss function may be larger than the preset threshold. In this case, the model parameters of the intermediate model can be adjusted to finally obtain the image generation model, so that the image sequence in the finally generated video is more matched with the audio frame, the connectivity of two adjacent images in the finally generated video is improved, and the generated video is closer to the directly recorded video.
As an example, in the above case, the initial model for training the image generation model may include an Encoder-decoder (Encoder-decoder) model and a discriminator (discriminator) model.
It can be understood that when the preset training end condition includes the first item, the image sequence in the finally obtained video can be made to be more matched with the audio frame; when the preset training end condition includes the second item, the connectivity of two adjacent images in the finally obtained video can be improved, so that the generated video is closer to the directly recorded video.
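As an illustration only, the two end conditions could be checked with a frame discriminator and a sequence discriminator roughly as follows; both discriminators, their score convention, and the correlation threshold are assumptions consistent with the text rather than a prescribed implementation.

    import torch

    def training_should_end(frame_discriminator, sequence_discriminator,
                            image_seq, audio_frame, corr_threshold=0.8):
        """image_seq: (T, C, H, W) image sequence actually output by the intermediate model."""
        # First condition: the generated image sequence matches the audio frame.
        matches_audio = bool(frame_discriminator(image_seq, audio_frame) > 0.5)
        # Second condition: every pair of adjacent images is judged likely to be
        # adjacent in a real video (correlation degree >= threshold).
        pair_scores = torch.stack([sequence_discriminator(image_seq[i], image_seq[i + 1])
                                   for i in range(image_seq.shape[0] - 1)])
        adjacent_ok = bool((pair_scores >= corr_threshold).all())
        return matches_audio and adjacent_ok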
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of an apparatus for generating a video. The apparatus embodiment corresponds to the method embodiment shown in fig. 2 and, in addition to the features described below, may include the same or corresponding features as that method embodiment and produce the same or corresponding effects. The apparatus can be applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for generating a video of the present embodiment includes: an acquisition unit 501, a first generation unit 502, and a second generation unit 503. The acquiring unit 501 is configured to acquire a target voice audio and a target person image; a first generating unit 502 configured to generate, for an audio frame included in the target speech audio, an image sequence representing that the target person indicated by the target person image performs an action corresponding to the audio frame, based on the audio frame and the target person image; a second generating unit 503 configured to generate a video representing that the target person performs an action corresponding to the target voice audio based on the target voice audio and the generated respective image sequences.
In this embodiment, the acquiring unit 501 of the apparatus 500 for generating a video may acquire the target voice audio and the target person image from another electronic device, or locally, through a wired or wireless connection.
In this embodiment, for an audio frame included in the target speech audio acquired by the acquisition unit 501, the first generation unit 502 may generate an image sequence representing that the target person indicated by the target person image performs an action corresponding to the audio frame, based on the audio frame and the target person image acquired by the acquisition unit 501.
In this embodiment, the second generation unit 503 may generate a video representing that the target person performs an action corresponding to the target voice audio based on the target voice audio acquired by the acquisition unit 501 and each image sequence generated by the first generation unit 502.
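As a purely illustrative aid, the cooperation of the three units may be pictured with the following minimal Python sketch; the callables passed to the class stand in for the concrete unit implementations described herein and are assumptions, not part of the present disclosure.

from typing import Callable, List, Sequence, Tuple

class ApparatusForGeneratingVideo:
    # Hypothetical sketch of apparatus 500: three units wired together.
    def __init__(self,
                 acquisition_unit: Callable[[], Tuple[Sequence[object], object]],
                 first_generation_unit: Callable[[object, object], List[object]],
                 second_generation_unit: Callable[[Sequence[object], List[List[object]]], object]):
        self.acquisition_unit = acquisition_unit              # unit 501
        self.first_generation_unit = first_generation_unit    # unit 502
        self.second_generation_unit = second_generation_unit  # unit 503

    def generate(self) -> object:
        # Unit 501: acquire the target voice audio (here, a sequence of audio frames)
        # and the target person image.
        audio_frames, person_image = self.acquisition_unit()
        # Unit 502: one image sequence per audio frame.
        image_sequences = [self.first_generation_unit(frame, person_image)
                           for frame in audio_frames]
        # Unit 503: assemble the video from the audio and the generated image sequences.
        return self.second_generation_unit(audio_frames, image_sequences)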
In some optional implementations of the present embodiment, the first generating unit 502 includes: a first generating subunit (not shown in the figure) configured to generate fusion deformation information corresponding to the audio frame based on the phoneme information indicated by the audio frame; and a second generating subunit (not shown in the figure) configured to generate an image sequence representing that the target person indicated by the target person image performs the action corresponding to the audio frame, based on the fusion deformation information corresponding to the audio frame and the target person image.
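A minimal sketch of this fusion-deformation (blend shape) path is given below, assuming a hypothetical phoneme-to-blend-shape lookup table and a hypothetical render_blendshapes function; neither is specified by the present disclosure.

from typing import Dict, List, Sequence

def generate_sequence_via_blendshapes(phonemes: Sequence[str],
                                      person_image,
                                      phoneme_to_blendshape: Dict[str, List[float]],
                                      render_blendshapes) -> list:
    # For each phoneme indicated by the audio frame, look up fusion deformation
    # (blend shape) coefficients and apply them to the target person image.
    images = []
    for phoneme in phonemes:
        coefficients = phoneme_to_blendshape.get(phoneme, [0.0])
        images.append(render_blendshapes(person_image, coefficients))
    return images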
In some optional implementations of the present embodiment, the first generating unit 502 includes: and a third generating subunit (not shown in the figure) configured to input the audio frame and the target person image into a pre-trained image generation model, and generate an image sequence representing that the target person indicated by the target person image performs the action corresponding to the audio frame, wherein the image generation model is used for generating the image sequence representing that the person indicated by the input person image performs the action corresponding to the input audio frame.
In some optional implementations of this embodiment, the third generating subunit includes: a first input module (not shown in the figure) configured to input the audio frame into a first network model in a pre-trained image generation model, resulting in phoneme information indicated by the audio frame, wherein the first network model is used for determining the phoneme information indicated by the input audio frame; and a second input module (not shown in the figure) configured to input the phoneme information indicated by the audio frame and the target person image into a second network model in the image generation model, and generate an image sequence representing that the target person indicated by the target person image performs an action corresponding to the audio frame, wherein the second network model is used for representing the corresponding relationship among the phoneme information, the person image and the image sequence.
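The two-network structure can be illustrated with the following minimal PyTorch sketch. The layer types and sizes are arbitrary assumptions made only to show how phoneme information produced by a first network and the target person image can be combined by a second network into an image sequence; they are not the architecture of the present disclosure.

import torch
import torch.nn as nn

class FirstNetwork(nn.Module):
    # Hypothetical first network: maps an audio frame to phoneme information.
    def __init__(self, audio_dim=80, num_phonemes=50):
        super().__init__()
        self.classifier = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(),
                                        nn.Linear(256, num_phonemes))

    def forward(self, audio_frame):            # (B, audio_dim)
        return self.classifier(audio_frame)    # (B, num_phonemes) phoneme logits

class SecondNetwork(nn.Module):
    # Hypothetical second network: maps (phoneme information, person image) to an image sequence.
    def __init__(self, num_phonemes=50, seq_len=5, image_size=64):
        super().__init__()
        self.seq_len, self.image_size = seq_len, image_size
        self.image_encoder = nn.Sequential(nn.Conv2d(3, 16, 4, 2, 1), nn.ReLU(),
                                           nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.decoder = nn.Linear(16 + num_phonemes, seq_len * 3 * image_size * image_size)

    def forward(self, phoneme_info, person_image):   # (B, P), (B, 3, H, W)
        features = torch.cat([self.image_encoder(person_image), phoneme_info], dim=1)
        frames = self.decoder(features)
        return frames.view(-1, self.seq_len, 3, self.image_size, self.image_size)

class ImageGenerationModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.first_network, self.second_network = FirstNetwork(), SecondNetwork()

    def forward(self, audio_frame, person_image):
        phoneme_info = self.first_network(audio_frame)
        return self.second_network(phoneme_info, person_image)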
In some optional implementations of this embodiment, the first network model is trained by: acquiring a training sample set, wherein training samples in the training sample set comprise audio frames and phoneme information indicated by the audio frames; and training to obtain a first network model by using a machine learning algorithm and using the audio frames in the training sample set as input data and using the phoneme information indicated by the audio frames as expected output data.
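A minimal sketch of such supervised training of the first network is shown below, assuming that the phoneme information is given as class indices and using a standard cross-entropy objective; the dataset format and hyperparameters are assumptions, not requirements of the present disclosure.

import torch
import torch.nn as nn

def train_first_network(model: nn.Module, training_samples, epochs=10, lr=1e-3):
    # training_samples: iterable of (audio_frame_tensor, phoneme_index) pairs,
    # where audio_frame_tensor is a 1-D feature vector for one audio frame.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for audio_frame, phoneme_index in training_samples:
            logits = model(audio_frame.unsqueeze(0))              # actual output: phoneme info
            loss = criterion(logits, torch.tensor([phoneme_index]))  # expected output: phoneme label
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model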
In some optional implementations of this embodiment, the image generation model is trained by: acquiring a preset number of target videos, wherein the target videos are videos obtained by recording voice audio and images of a person; extracting voice audio and an image sequence matched with the extracted voice audio from a preset number of target videos; acquiring an initial model for training to obtain an image generation model; initializing model parameters corresponding to the trained model parameters of the first network model in the initial model by using the trained model parameters of the first network model to obtain an intermediate model; and training to obtain an image generation model by using a machine learning algorithm and using an audio frame in the extracted voice audio as input data of the intermediate model and using an image sequence matched with the audio frame as expected output data of the intermediate model.
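The initialization step can be pictured with the following sketch, which assumes that the initial model exposes its first-network part as an attribute named first_network and that helper functions build_initial_model and extract_pairs exist; these names are hypothetical and are used only for illustration.

def build_intermediate_model(trained_first_network, build_initial_model):
    # Copy the trained first-network parameters into the corresponding part of the
    # initial model; the result is the intermediate model.
    initial_model = build_initial_model()   # e.g. the ImageGenerationModel sketch above
    initial_model.first_network.load_state_dict(trained_first_network.state_dict())
    return initial_model

def make_training_pairs(target_videos, extract_pairs):
    # Collect (audio frame, matched image sequence) pairs extracted from the target videos.
    pairs = []
    for video in target_videos:
        pairs.extend(extract_pairs(video))
    return pairs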
In some optional implementations of this embodiment, taking an audio frame in the extracted speech audio as input data of the intermediate model, taking an image sequence matched with the audio frame as expected output data of the intermediate model, and training to obtain the image generation model includes: responding to the fact that a preset training end condition is not met, inputting the audio frames in the extracted voice audio into an intermediate model, obtaining actual output data of the intermediate model, and adjusting model parameters of the intermediate model based on the actual output data and expected output data, wherein the actual output data represent an image sequence actually obtained by the intermediate model, and the expected output data represent the extracted image sequence matched with the audio frames; and in response to the preset training end condition being met, taking the intermediate model meeting the preset training end condition as an image generation model.
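A minimal sketch of this train-until-end-condition loop is given below. The triple format of the training samples, the L1 reconstruction loss, the learning rate and the end_condition_met callback (which could, for example, wrap the discriminator check sketched earlier) are assumptions rather than requirements of the present disclosure.

import torch
import torch.nn.functional as F

def train_image_generation_model(intermediate_model, training_samples,
                                 end_condition_met, lr=1e-4, max_epochs=1000):
    # training_samples: iterable of (audio_frame, person_image, expected_image_sequence),
    # where the expected image sequence is the one extracted from the target video.
    optimizer = torch.optim.Adam(intermediate_model.parameters(), lr=lr)
    for _ in range(max_epochs):
        for audio_frame, person_image, expected_sequence in training_samples:
            actual_sequence = intermediate_model(audio_frame.unsqueeze(0),
                                                 person_image.unsqueeze(0))
            loss = F.l1_loss(actual_sequence.squeeze(0), expected_sequence)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if end_condition_met(intermediate_model):
            break
    # The intermediate model satisfying the end condition is used as the image generation model.
    return intermediate_model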
In some optional implementations of this embodiment, the preset training end condition includes at least one of: matching the image sequence represented by the actual output data with the audio frame; and the correlation degree of two adjacent images in the image sequence represented by the actual output data is greater than or equal to a preset correlation degree threshold value, wherein the correlation degree is used for representing the adjacent probability of the two target person images in the video.
In some optional implementations of this embodiment, the obtaining unit 501 includes: an acquisition subunit (not shown in the figure) configured to acquire the target voice audio and a plurality of target person images of the target person. In addition, the first generation unit 502 includes: an extraction subunit (not shown in the figure) configured to perform feature extraction on the plurality of target person images to obtain image feature information; and a fourth generating subunit (not shown in the figure) configured to generate, for an audio frame included in the target voice audio, an image sequence representing that the target person indicated by the target person image performs an action corresponding to the audio frame, based on the audio frame and the image feature information.
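A minimal sketch of this multi-image variant follows, assuming a hypothetical image_encoder that maps a person image to a feature vector and a hypothetical generator that produces an image sequence from an audio frame and the pooled image feature information; average pooling is used only as an example of combining features from several images.

import torch

def generate_from_multiple_images(audio_frames, person_images, image_encoder, generator):
    # Extract features from each target person image and pool them into a single
    # image feature information vector.
    with torch.no_grad():
        features = torch.stack([image_encoder(image.unsqueeze(0)).squeeze(0)
                                for image in person_images])
        image_feature_info = features.mean(dim=0)   # simple average pooling (illustrative)
    # Generate one image sequence per audio frame, conditioned on the pooled features.
    return [generator(audio_frame, image_feature_info) for audio_frame in audio_frames]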
In some optional implementations of this embodiment, the target person image includes a face image of the target person; the action corresponding to the audio frame characterizes that the target person utters the voice indicated by the audio frame; and the action corresponding to the target voice audio characterizes that the target person utters the voice indicated by the target voice audio.
According to the apparatus provided by the above embodiment of the present disclosure, the acquiring unit 501 acquires the target voice audio and the target person image; then, for each audio frame included in the target voice audio, the first generation unit 502 generates, based on the audio frame and the target person image, an image sequence representing that the target person indicated by the target person image performs the action corresponding to the audio frame; and finally the second generation unit 503 generates, based on the target voice audio and the generated image sequences, a video representing that the target person performs the action corresponding to the target voice audio. In this way, a video representing a person performing actions corresponding to voice audio can be generated from the acquired voice audio and person image, which enriches the ways in which video can be generated and improves the flexibility of video generation. Moreover, the expressiveness of body language in the generated video can be improved. The apparatus provided by the above embodiment requires only a small number of still images (one or more) to generate a dynamic video, for example, a video in which the person indicated by the still images speaks.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for implementing an electronic device of embodiments of the present disclosure. The electronic device shown in fig. 6 is only an example and should not impose any limitation on the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read therefrom is installed into the storage section 608 as necessary.
In particular, the processes described above with reference to the flow diagrams may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The above-described functions defined in the method of the present disclosure are performed when the computer program is executed by a Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Python, Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In accordance with one or more embodiments of the present disclosure, there is provided a method for generating a video, the method comprising: acquiring a target voice audio and a target person image; for an audio frame included in the target voice audio, generating, based on the audio frame and the target person image, an image sequence representing that the target person indicated by the target person image performs an action corresponding to the audio frame; and generating, based on the target voice audio and the generated respective image sequences, a video representing that the target person performs an action corresponding to the target voice audio.
According to one or more embodiments of the present disclosure, in a method for generating a video provided by the present disclosure, based on the audio frame and the target person image, generating an image sequence representing that a target person indicated by the target person image performs an action corresponding to the audio frame includes: generating fusion deformation information corresponding to the audio frame based on the phoneme information indicated by the audio frame; and generating an image sequence representing that the target person indicated by the target person image performs the action corresponding to the audio frame based on the fusion deformation information corresponding to the audio frame and the target person image.
According to one or more embodiments of the present disclosure, in a method for generating a video provided by the present disclosure, based on the audio frame and the target person image, generating an image sequence representing that a target person indicated by the target person image performs an action corresponding to the audio frame includes: and inputting the audio frame and the target person image into a pre-trained image generation model, and generating an image sequence for representing that the target person indicated by the target person image performs the action corresponding to the audio frame, wherein the image generation model is used for generating the image sequence for representing that the person indicated by the input person image performs the action corresponding to the input audio frame.
According to one or more embodiments of the present disclosure, in a method for generating a video, the audio frame and the target person image are input to a pre-trained image generation model, and an image sequence is generated, wherein the image sequence characterizes that the target person indicated by the target person image performs an action corresponding to the audio frame, and the method comprises: inputting the audio frame into a first network model in a pre-trained image generation model to obtain phoneme information indicated by the audio frame, wherein the first network model is used for determining the phoneme information indicated by the input audio frame; inputting the phoneme information indicated by the audio frame and the target person image into a second network model in the image generation model, and generating an image sequence for representing that the target person indicated by the target person image performs the action corresponding to the audio frame, wherein the second network model is used for representing the corresponding relation among the phoneme information, the person image and the image sequence.
According to one or more embodiments of the present disclosure, in the method for generating a video provided by the present disclosure, the first network model is trained by the following steps: acquiring a training sample set, wherein training samples in the training sample set comprise audio frames and phoneme information indicated by the audio frames; and training to obtain a first network model by using a machine learning algorithm and using the audio frames in the training sample set as input data and using the phoneme information indicated by the audio frames as expected output data.
According to one or more embodiments of the present disclosure, in a method for generating a video provided by the present disclosure, an image generation model is trained by the following steps: acquiring a preset number of target videos, wherein the target videos are videos obtained by recording voice audio and images of a person; extracting voice audio and an image sequence matched with the extracted voice audio from a preset number of target videos; acquiring an initial model for training to obtain an image generation model; initializing model parameters corresponding to the model parameters of the trained first network model in the initial model by using the model parameters of the trained first network model to obtain an intermediate model; and training to obtain an image generation model by using a machine learning algorithm and using an audio frame in the extracted voice audio as input data of the intermediate model and using an image sequence matched with the audio frame as expected output data of the intermediate model.
According to one or more embodiments of the present disclosure, in the method for generating a video provided by the present disclosure, training to obtain the image generation model by using an audio frame in the extracted voice audio as input data of the intermediate model and using an image sequence matched with the audio frame as expected output data of the intermediate model includes: responding to the fact that a preset training end condition is not met, inputting the audio frames in the extracted voice audio into an intermediate model, obtaining actual output data of the intermediate model, and adjusting model parameters of the intermediate model based on the actual output data and expected output data, wherein the actual output data represent an image sequence actually obtained by the intermediate model, and the expected output data represent the extracted image sequence matched with the audio frame; and in response to the preset training end condition being met, taking the intermediate model meeting the preset training end condition as an image generation model.
According to one or more embodiments of the present disclosure, in a method for generating a video provided by the present disclosure, the preset training end condition includes at least one of: matching the image sequence represented by the actual output data with the audio frame; and the correlation degree of two adjacent images in the image sequence represented by the actual output data is greater than or equal to a preset correlation degree threshold value, wherein the correlation degree is used for representing the adjacent probability of the two target person images in the video.
According to one or more embodiments of the present disclosure, in the method for generating a video provided by the present disclosure, acquiring the target voice audio and the target person image includes: acquiring the target voice audio and a plurality of target person images of the target person; and generating, for an audio frame included in the target voice audio, an image sequence representing that the target person indicated by the target person image performs the action corresponding to the audio frame based on the audio frame and the target person image includes: performing feature extraction on the plurality of target person images to obtain image feature information; and, for an audio frame included in the target voice audio, generating, based on the audio frame and the image feature information, an image sequence representing that the target person indicated by the target person image performs the action corresponding to the audio frame.
According to one or more embodiments of the present disclosure, in the method for generating a video provided by the present disclosure, the target person image includes a face image of the target person; the action corresponding to the audio frame characterizes that the target person utters the voice indicated by the audio frame; and the action corresponding to the target voice audio characterizes that the target person utters the voice indicated by the target voice audio.
In accordance with one or more embodiments of the present disclosure, there is provided an apparatus for generating a video, the apparatus including: an acquisition unit configured to acquire a target voice audio and a target person image; a first generation unit configured to generate, for an audio frame included in the target speech audio, an image sequence representing that the target person indicated by the target person image performs an action corresponding to the audio frame, based on the audio frame and the target person image; a second generating unit configured to generate a video representing that the target person performs an action corresponding to the target voice audio, based on the target voice audio and the generated respective image sequences.
According to one or more embodiments of the present disclosure, in an apparatus for generating a video, a first generating unit includes: a first generation subunit configured to generate fusion deformation information corresponding to the audio frame based on the phoneme information indicated by the audio frame; and a second generating subunit configured to generate an image sequence representing that the target person indicated by the target person image performs the action corresponding to the audio frame, based on the fusion deformation information corresponding to the audio frame and the target person image.
According to one or more embodiments of the present disclosure, in an apparatus for generating a video, a first generating unit includes: a third generating subunit configured to input the audio frame and the target person image into a pre-trained image generating model, and generate an image sequence representing that the target person indicated by the target person image performs an action corresponding to the audio frame, wherein the image generating model is used for generating the image sequence representing that the person indicated by the input person image performs an action corresponding to the input audio frame.
According to one or more embodiments of the present disclosure, in an apparatus for generating a video, a third generation subunit includes: a first input module configured to input the audio frame into a first network model in a pre-trained image generation model, and obtain phoneme information indicated by the audio frame, wherein the first network model is used for determining the phoneme information indicated by the input audio frame; and the second input module is configured to input the phoneme information indicated by the audio frame and the target person image into a second network model in the image generation model, and generate an image sequence for representing that the target person indicated by the target person image performs the action corresponding to the audio frame, wherein the second network model is used for representing the corresponding relation among the phoneme information, the person image and the image sequence.
According to one or more embodiments of the present disclosure, in an apparatus for generating a video provided by the present disclosure, a first network model is obtained by training: acquiring a training sample set, wherein training samples in the training sample set comprise audio frames and phoneme information indicated by the audio frames; and training to obtain a first network model by using a machine learning algorithm and using the audio frames in the training sample set as input data and using the phoneme information indicated by the audio frames as expected output data.
According to one or more embodiments of the present disclosure, in an apparatus for generating a video provided by the present disclosure, an image generation model is obtained by training: acquiring a preset number of target videos, wherein the target videos are videos obtained by recording voice audio and images of a person; extracting voice audio and an image sequence matched with the extracted voice audio from a preset number of target videos; acquiring an initial model for training to obtain an image generation model; initializing model parameters corresponding to the trained model parameters of the first network model in the initial model by using the trained model parameters of the first network model to obtain an intermediate model; and training to obtain an image generation model by using a machine learning algorithm and using an audio frame in the extracted voice audio as input data of the intermediate model and using an image sequence matched with the audio frame as expected output data of the intermediate model.
According to one or more embodiments of the present disclosure, in an apparatus for generating a video, training an image generation model by using an audio frame in extracted speech audio as input data of an intermediate model and using an image sequence matched with the audio frame as expected output data of the intermediate model, includes: responding to the fact that a preset training end condition is not met, inputting the audio frames in the extracted voice audio into an intermediate model, obtaining actual output data of the intermediate model, and adjusting model parameters of the intermediate model based on the actual output data and expected output data, wherein the actual output data represent an image sequence actually obtained by the intermediate model, and the expected output data represent the extracted image sequence matched with the audio frames; and in response to the preset training end condition being met, taking the intermediate model meeting the preset training end condition as an image generation model.
According to one or more embodiments of the present disclosure, in an apparatus for generating a video provided by the present disclosure, the preset training end condition includes at least one of: matching the image sequence represented by the actual output data with the audio frame; and the correlation degree of two adjacent images in the image sequence represented by the actual output data is greater than or equal to a preset correlation degree threshold value, wherein the correlation degree is used for representing the adjacent probability of the two target person images in the video.
According to one or more embodiments of the present disclosure, in an apparatus for generating a video, an obtaining unit includes: an acquisition subunit configured to acquire a target voice audio and a plurality of target person images of a target person; and the first generation unit includes: the extraction subunit is configured to perform feature extraction on the images of the plurality of target persons to obtain image feature information; and a fourth generating subunit configured to generate, for an audio frame included in the target speech audio, an image sequence representing that the target person indicated by the target person image performs an action corresponding to the audio frame, based on the audio frame and the image feature information.
According to one or more embodiments of the present disclosure, in the apparatus for generating a video provided by the present disclosure, the target person image includes a face image of the target person; the action corresponding to the audio frame characterizes that the target person utters the voice indicated by the audio frame; and the action corresponding to the target voice audio characterizes that the target person utters the voice indicated by the target voice audio.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a first generation unit, and a second generation unit. Where the names of these units do not in some cases constitute a limitation on the unit itself, for example, the capturing unit may also be described as a "unit that captures target voice audio and target person image".
As another aspect, the present disclosure also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a target voice audio and a target person image; for an audio frame included in the target voice audio, generate, based on the audio frame and the target person image, an image sequence representing that the target person indicated by the target person image performs the action corresponding to the audio frame; and generate, based on the target voice audio and the generated respective image sequences, a video representing that the target person performs an action corresponding to the target voice audio.
The foregoing description is only exemplary of preferred embodiments of the present disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to technical solutions formed by the particular combination of features described above, and also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.

Claims (9)

1. A method for generating video, comprising:
acquiring a target voice audio and a target person image;
generating an image sequence representing that the target person indicated by the target person image performs an action corresponding to an audio frame included in the target voice audio based on the audio frame and the target person image;
generating a video representing the target person performing an action corresponding to the target voice audio based on the target voice audio and the generated respective image sequences;
wherein the generating of the image sequence representing that the target person indicated by the target person image performs the action corresponding to the audio frame based on the audio frame and the target person image comprises:
inputting the audio frame and the target person image into a pre-trained image generation model, and generating an image sequence representing that the target person indicated by the target person image performs the action corresponding to the audio frame, wherein the image generation model is used for generating an image sequence representing that the person indicated by the input person image performs the action corresponding to the input audio frame, and the action corresponding to the audio frame comprises: the target person performing the limb action corresponding to the audio frame;
inputting the audio frame and the target person image into a pre-trained image generation model, and generating an image sequence representing that a target person indicated by the target person image performs an action corresponding to the audio frame, including:
inputting the audio frame into a first network model in a pre-trained image generation model to obtain phoneme information indicated by the audio frame, wherein the first network model is used for determining the phoneme information indicated by the input audio frame;
inputting the phoneme information indicated by the audio frame and the target person image into a second network model in the image generation model, and generating an image sequence for representing that the target person indicated by the target person image performs an action corresponding to the audio frame, wherein the second network model is used for representing the corresponding relation among the phoneme information, the person image and the image sequence;
wherein the training end condition for training the image generation model comprises the following two items:
matching the image sequence represented by the actual output data with the audio frame; and the correlation degree of two adjacent images in the image sequence represented by the actual output data is greater than or equal to a preset correlation degree threshold value, wherein the correlation degree is used for representing the adjacent probability of the two images in the video.
2. The method of claim 1, wherein the first network model is trained by:
acquiring a training sample set, wherein training samples in the training sample set comprise audio frames and phoneme information indicated by the audio frames;
and training to obtain a first network model by using a machine learning algorithm and using the audio frames in the training sample set as input data and using phoneme information indicated by the audio frames as expected output data.
3. The method of claim 2, wherein the image generation model is trained by:
acquiring a preset number of target videos, wherein the target videos are videos obtained by recording voice audio and images of a person;
extracting voice audio and an image sequence matched with the extracted voice audio from the preset number of target videos;
acquiring an initial model for training to obtain an image generation model;
initializing model parameters corresponding to the model parameters of the trained first network model in the initial model by adopting the model parameters of the trained first network model to obtain an intermediate model;
and training to obtain an image generation model by using a machine learning algorithm and using an audio frame in the extracted voice audio as input data of the intermediate model and using an image sequence matched with the audio frame as expected output data of the intermediate model.
4. The method of claim 3, wherein training the extracted audio frame in the speech audio as input data of the intermediate model and the image sequence matched with the audio frame as expected output data of the intermediate model to obtain the image generation model comprises:
in response to the fact that the preset training end condition is not met, inputting the audio frame in the extracted voice audio into an intermediate model to obtain actual output data of the intermediate model, and adjusting model parameters of the intermediate model on the basis of the actual output data and expected output data, wherein the actual output data represent an image sequence actually obtained by the intermediate model, and the expected output data represent an extracted image sequence matched with the audio frame;
and in response to the preset training end condition being met, taking the intermediate model meeting the preset training end condition as an image generation model.
5. The method of any of claims 1-4, wherein said obtaining target speech audio and target person images comprises:
acquiring target voice audio and a plurality of target person images of a target person; and
the generating, for an audio frame included in the target voice audio, an image sequence representing that the target person indicated by the target person image performs the action corresponding to the audio frame based on the audio frame and the target person image comprises:
extracting the features of the images of the plurality of target persons to obtain image feature information;
and generating an image sequence which is used for representing the target person indicated by the target person image to execute the action corresponding to the audio frame based on the audio frame and the image characteristic information.
6. The method of one of claims 1 to 4, wherein the target person image comprises a facial image of the target person; the action corresponding to the audio frame is characterized by: the target person sends out voice indicated by the audio frame; an action characterization corresponding to the target speech audio: the target person utters a voice indicated by the target voice audio.
7. An apparatus for generating video, comprising:
an acquisition unit configured to acquire a target voice audio and a target person image;
a first generating unit configured to generate, for an audio frame included in the target speech audio, an image sequence representing that a target person indicated by the target person image performs an action corresponding to the audio frame, based on the audio frame and the target person image;
a second generation unit configured to generate a video representing that the target person performs an action corresponding to the target voice audio based on the target voice audio and the generated respective image sequences;
wherein the first generation unit includes: a third generating subunit configured to input the audio frame and the target person image into a pre-trained image generating model, and generate an image sequence representing that the target person indicated by the target person image performs an action corresponding to the audio frame, wherein the image generating model is used for generating an image sequence representing that the person indicated by the input person image performs an action corresponding to the input audio frame; wherein the action corresponding to the audio frame comprises: representing the target person to execute the limb action corresponding to the audio frame;
wherein the third generation subunit includes:
a first input module configured to input the audio frame into a first network model in a pre-trained image generation model, and obtain phoneme information indicated by the audio frame, wherein the first network model is used for determining the phoneme information indicated by the input audio frame;
a second input module configured to input the phoneme information indicated by the audio frame and the target person image into a second network model in the image generation model, and generate an image sequence representing that a target person indicated by the target person image performs an action corresponding to the audio frame, wherein the second network model is used for representing a corresponding relationship among the phoneme information, the person image and the image sequence;
wherein the training end condition for training the image generation model comprises the following two items:
matching the image sequence represented by the actual output data with the audio frame; and the correlation degree of two adjacent images in the image sequence represented by the actual output data is greater than or equal to a preset correlation degree threshold value, wherein the correlation degree is used for representing the probability of the two images being adjacent in the video.
8. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
9. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN202010199332.2A 2020-03-20 2020-03-20 Method, apparatus, device and medium for generating video Active CN111432233B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010199332.2A CN111432233B (en) 2020-03-20 2020-03-20 Method, apparatus, device and medium for generating video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010199332.2A CN111432233B (en) 2020-03-20 2020-03-20 Method, apparatus, device and medium for generating video

Publications (2)

Publication Number Publication Date
CN111432233A CN111432233A (en) 2020-07-17
CN111432233B true CN111432233B (en) 2022-07-19

Family

ID=71548219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010199332.2A Active CN111432233B (en) 2020-03-20 2020-03-20 Method, apparatus, device and medium for generating video

Country Status (1)

Country Link
CN (1) CN111432233B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749711B (en) * 2020-08-04 2023-08-25 腾讯科技(深圳)有限公司 Video acquisition method and device and storage medium
CN112383721B (en) * 2020-11-13 2023-04-07 北京有竹居网络技术有限公司 Method, apparatus, device and medium for generating video
CN112383722B (en) * 2020-11-13 2023-04-07 北京有竹居网络技术有限公司 Method and apparatus for generating video
CN112887789B (en) * 2021-01-22 2023-02-21 北京百度网讯科技有限公司 Video generation model construction method, video generation device, video generation equipment and video generation medium
CN112989935A (en) 2021-02-05 2021-06-18 北京百度网讯科技有限公司 Video generation method, device, equipment and storage medium
CN113113046B (en) * 2021-04-14 2024-01-19 杭州网易智企科技有限公司 Performance detection method and device for audio processing, storage medium and electronic equipment
CN113409208A (en) * 2021-06-16 2021-09-17 北京字跳网络技术有限公司 Image processing method, device, equipment and storage medium
CN113507627B (en) * 2021-07-08 2022-03-25 北京的卢深视科技有限公司 Video generation method and device, electronic equipment and storage medium
CN115776597A (en) * 2021-08-30 2023-03-10 海信集团控股股份有限公司 Audio and video generation method and device and electronic equipment
CN113747086A (en) * 2021-09-30 2021-12-03 深圳追一科技有限公司 Digital human video generation method and device, electronic equipment and storage medium
CN114173188B (en) * 2021-10-18 2023-06-02 深圳追一科技有限公司 Video generation method, electronic device, storage medium and digital person server
CN114093384A (en) * 2021-11-22 2022-02-25 上海商汤科技开发有限公司 Speaking video generation method, device, equipment and storage medium
CN114245230A (en) * 2021-11-29 2022-03-25 网易(杭州)网络有限公司 Video generation method and device, electronic equipment and storage medium
CN114900733B (en) * 2022-04-28 2023-07-21 北京生数科技有限公司 Video generation method, related device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6560575B1 (en) * 1998-10-20 2003-05-06 Canon Kabushiki Kaisha Speech processing apparatus and method
CN106127696A (en) * 2016-06-13 2016-11-16 西安电子科技大学 A kind of image based on BP neutral net matching sports ground removes method for reflection
CN110598048A (en) * 2018-05-25 2019-12-20 北京中科寒武纪科技有限公司 Video retrieval method and video retrieval mapping relation generation method and device
CN110674790A (en) * 2019-10-15 2020-01-10 山东建筑大学 Abnormal scene processing method and system in video monitoring

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8266314B2 (en) * 2009-12-16 2012-09-11 International Business Machines Corporation Automated audio or video subset network load reduction
CN104347068B (en) * 2013-08-08 2020-05-22 索尼公司 Audio signal processing device and method and monitoring system
CN108230438B (en) * 2017-12-28 2020-06-19 清华大学 Face reconstruction method and device for voice-driven auxiliary side face image
CN109377539B (en) * 2018-11-06 2023-04-11 北京百度网讯科技有限公司 Method and apparatus for generating animation
CN109545193B (en) * 2018-12-18 2023-03-14 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN109545192B (en) * 2018-12-18 2022-03-08 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN110446066B (en) * 2019-08-28 2021-11-19 北京百度网讯科技有限公司 Method and apparatus for generating video
CN110880315A (en) * 2019-10-17 2020-03-13 深圳市声希科技有限公司 Personalized voice and video generation system based on phoneme posterior probability
CN111415677B (en) * 2020-03-16 2020-12-25 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating video

Also Published As

Publication number Publication date
CN111432233A (en) 2020-07-17

Similar Documents

Publication Publication Date Title
CN111432233B (en) Method, apparatus, device and medium for generating video
CN111415677B (en) Method, apparatus, device and medium for generating video
CN107818798B (en) Customer service quality evaluation method, device, equipment and storage medium
CN109726624B (en) Identity authentication method, terminal device and computer readable storage medium
US11705096B2 (en) Autonomous generation of melody
US11670015B2 (en) Method and apparatus for generating video
CN107153496B (en) Method and device for inputting emoticons
CN107481720B (en) Explicit voiceprint recognition method and device
US20200075024A1 (en) Response method and apparatus thereof
CN109254669B (en) Expression picture input method and device, electronic equipment and system
JP7108144B2 (en) Systems and methods for domain adaptation in neural networks using cross-domain batch normalization
WO2020081872A1 (en) Characterizing content for audio-video dubbing and other transformations
CN109189544B (en) Method and device for generating dial plate
CN116484318B (en) Lecture training feedback method, lecture training feedback device and storage medium
CN109582825B (en) Method and apparatus for generating information
CN112420014A (en) Virtual face construction method and device, computer equipment and computer readable medium
US20200013389A1 (en) Word extraction device, related conference extraction system, and word extraction method
US11076091B1 (en) Image capturing assistant
CN109697978B (en) Method and apparatus for generating a model
CN117635383A (en) Virtual teacher and multi-person cooperative talent training system, method and equipment
CN117152308A (en) Virtual person action expression optimization method and system
CN111862279A (en) Interaction processing method and device
CN109949213B (en) Method and apparatus for generating image
CN112383721B (en) Method, apparatus, device and medium for generating video
CN114138960A (en) User intention identification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant