CN113223555A - Video generation method and device, storage medium and electronic equipment - Google Patents

Video generation method and device, storage medium and electronic equipment Download PDF

Info

Publication number
CN113223555A
Authority
CN
China
Prior art keywords
target
pronunciation
target user
audio
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110481188.6A
Other languages
Chinese (zh)
Inventor
殷翔 (Yin Xiang)
马泽君 (Ma Zejun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110481188.6A priority Critical patent/CN113223555A/en
Publication of CN113223555A publication Critical patent/CN113223555A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The disclosure relates to a video generation method, a video generation device, a storage medium and an electronic device. The video generation method comprises the following steps: determining a standard pronunciation video corresponding to the target text, wherein the standard pronunciation video comprises standard pronunciation audio and a standard pronunciation facial image frame sequence; obtaining target pronunciation audio with the target user's tone characteristics according to the standard pronunciation audio and the tone information of the target user; obtaining a target pronunciation facial image frame sequence with the facial features of the target user according to the standard pronunciation facial image frame sequence and the facial image template of the target user; and synthesizing the target pronunciation video from the target pronunciation audio and the target pronunciation facial image frame sequence. In this way, a target pronunciation video having the target user's timbre and facial features can be generated from arbitrary text.

Description

Video generation method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of video technologies, and in particular, to a video generation method, an apparatus, a storage medium, and an electronic device.
Background
In a language teaching scenario, a teaching text is usually given, and a user clicks an audio button to play standard pronunciation audio corresponding to the teaching text. The user learns the correct pronunciation of the teaching text by imitating the standard pronunciation audio. However, it is difficult to judge how to articulate from audio alone, and because timbres differ greatly from person to person, imitating the standard audio becomes even harder, which makes pronunciation learning less effective and less efficient.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a video generation method, including: determining a standard pronunciation video corresponding to the target text, wherein the standard pronunciation video comprises standard pronunciation audio and a standard pronunciation facial image frame sequence; obtaining a target pronunciation audio with the target user tone characteristic according to the standard pronunciation audio and the tone information of the target user; obtaining a target pronunciation face image frame sequence with the facial features of the target user according to the standard pronunciation face image frame sequence and the face image template of the target user; and synthesizing the target pronunciation video by the target pronunciation audio and the target pronunciation face image frame sequence.
In a second aspect, the present disclosure provides a video generation apparatus, the apparatus comprising: the determining module is used for determining a standard pronunciation video corresponding to the target text, wherein the standard pronunciation video comprises standard pronunciation audio and a standard pronunciation facial image frame sequence; the first execution module is used for obtaining a target pronunciation audio with the target user tone characteristic according to the standard pronunciation audio and the tone information of the target user; the second execution module is used for obtaining a target pronunciation face image frame sequence with the facial features of the target user according to the standard pronunciation face image frame sequence and the face image template of the target user; and the synthesis module is used for synthesizing the target pronunciation video from the target pronunciation audio and the target pronunciation face image frame sequence.
In a third aspect, the present disclosure provides a computer-readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect described above.
In a fourth aspect, the present disclosure provides an electronic device comprising: a storage device having a computer program stored thereon; processing means for executing said computer program in said storage means to implement the steps of the method of the first aspect described above.
By adopting the technical scheme, the following technical effects can be at least achieved:
Standard pronunciation audio and a standard pronunciation facial image frame sequence corresponding to the target text are determined. Target pronunciation audio with the target user's tone characteristics is obtained according to the standard pronunciation audio and the target user's tone information. Meanwhile, a target pronunciation facial image frame sequence with the target user's facial features is obtained according to the standard pronunciation facial image frame sequence and the target user's facial image template. The target pronunciation video is then synthesized from the target pronunciation audio and the target pronunciation facial image frame sequence. In this way, a target pronunciation video having the target user's timbre and facial features can be generated from arbitrary text. In a pronunciation teaching scene, teaching the target user with a target pronunciation video that is standard in pronunciation yet carries the target user's own tone and facial features can improve the target user's pronunciation learning efficiency.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
fig. 1 is a flowchart illustrating a video generation method according to an exemplary embodiment of the present disclosure.
Fig. 2 is a diagram illustrating a tone conversion model architecture according to an exemplary embodiment of the present disclosure.
Fig. 3 is a block diagram illustrating a video generation apparatus according to an exemplary embodiment of the present disclosure.
Fig. 4 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Before the detailed embodiment of the technical solution of the present disclosure is explained, it should be stated that the information related to the user, such as the user audio, video, facial image, etc., used in the embodiment of the present disclosure is collected/used under the condition of obtaining the user authorization or consent.
Fig. 1 is a flowchart illustrating a video generation method according to an exemplary embodiment of the present disclosure, as illustrated in fig. 1, the video generation method including the steps of:
and S11, determining a standard pronunciation video corresponding to the target text, wherein the standard pronunciation video comprises standard pronunciation audio and a standard pronunciation facial image frame sequence.
The target text is text of any length, consisting of one or more phonemes, words, or sentences. The standard pronunciation facial image frame sequence is used to compose the silent (image-only) part of the video.
An implementable embodiment, said determining a standard pronunciation video corresponding to the target text comprises: and inputting the target text into a video generation model to obtain the standard pronunciation video output by the video generation model.
The video generation model can be obtained through the following modes:
and collecting video of a standard speaker reading the text sample in a scene of a clean recording studio, and accumulating the video sample with the time length of 10 hours. The video sample includes audio and images of facial movements (e.g., lip movements) of a standard speaker when reciting the text sample. And training a video generation model according to the text sample, the video sample and the labeling information of the sample. In one embodiment, the video generation model may use the same or similar architecture as the TTSA (Text To Speech & Animation) model. The trained video generation model can generate corresponding voice and animation corresponding to the voice according to the text, and the animation can be facial animation or lip animation.
And S12, obtaining a target pronunciation audio with the target user tone characteristic according to the standard pronunciation audio and the target user tone information.
It is worth mentioning that the audio in the present disclosure includes tone color information, accent information, and text content information. Tone information and accent information in the standard pronunciation audio are determined by a standard speaker corresponding to a video sample used for training the video generation model, and text content information is determined by a text sample used for training the video generation model.
By replacing the tone information in the standard pronunciation audio with the tone information of the target user, the standard pronunciation audio is converted into audio that keeps the standard pronunciation but carries the target user's tone characteristics; this converted audio is the target pronunciation audio.
The tone information of the target user can be obtained as follows. First, a speech classification model is trained based on the speech of the target user and the speech of a large number of other users. Second, the trained encoding module in the speech classification model is used to encode the speech of the target user, and the tone information of the target user is obtained as the output of the encoding module.
The training process of the speech classification model is as follows: the encoding module in the speech classification model (initially with random parameters) encodes the speech of the target user and the speech of the other users to obtain their speech feature vectors. A classification module in the speech classification model then classifies the speech feature vector of the target user and the speech feature vectors of the other users, the classification result is fed back to the encoding module, and the encoding module is adjusted accordingly. Training continues until the classification module can accurately separate the speech feature vector of the target user from those of the other users, at which point the trained speech classification model is obtained.
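The following PyTorch sketch illustrates this kind of speaker-classification training: an encoder maps utterances to embeddings (the timbre information), and a classifier learns to separate the target user from other users. The mel-feature shapes, network sizes, and the random stand-in data are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)

    def forward(self, mel):                  # mel: (B, T, n_mels)
        _, h = self.rnn(mel)
        return h[-1]                         # (B, emb_dim) speaker embedding

encoder = SpeakerEncoder()
classifier = nn.Linear(256, 2)               # class 0: target user, class 1: other users
opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                      # toy loop with random stand-in features
    mel = torch.randn(8, 120, 80)            # 8 utterances, 120 mel frames each
    speaker = torch.randint(0, 2, (8,))      # ground-truth speaker labels
    loss = loss_fn(classifier(encoder(mel)), speaker)
    opt.zero_grad(); loss.backward(); opt.step()

# After training, encoder(mel_of_target_user) serves as the target user's timbre vector.
```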
And S13, obtaining a target pronunciation face image frame sequence with the facial features of the target user according to the standard pronunciation face image frame sequence and the facial image template of the target user.
In an implementable embodiment, obtaining the target pronunciation facial image frame sequence with the target user's facial features from the standard pronunciation facial image frame sequence and the target user's facial image template includes: extracting lip key point information for each frame of standard pronunciation facial image; and combining the lip key point information of each frame of standard pronunciation facial image with the facial image template of the target user to obtain the target pronunciation facial image frame sequence. The lip key point information is used to describe the motion of the lips.
Specifically, for each frame in the standard pronunciation facial image frame sequence, the lip key point information in that frame is extracted. A target user facial image with the same angle as that frame is determined from the target user's facial image template, and the lip key point information in the target user facial image is replaced with the lip key point information from that frame, yielding a target user facial image that carries the lip key point information of the standard pronunciation frame. In this way, a target user facial image sequence corresponding to the standard pronunciation facial image frame sequence is obtained; this sequence is the target pronunciation facial image frame sequence.
It should be noted that the lip region in each image of the target pronunciation facial image frame sequence is obtained by adjusting the target user's lip image according to the motion described by the lip key point information.
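The sketch below illustrates the per-frame combination described above: lip key points extracted from each standard pronunciation frame drive an adjustment of the target user's face template. Both helper functions are simplified stand-ins (a dummy landmark detector and a trivial "warp" that only marks the key points); a real system would use an actual landmark detector and proper image warping.

```python
import numpy as np

def detect_lip_keypoints(face_image: np.ndarray) -> np.ndarray:
    """Placeholder for a real facial-landmark detector; returns (N, 2) lip key points.

    Here it returns a fixed dummy mouth contour so the sketch runs end to end.
    """
    h, w = face_image.shape[:2]
    return np.array([[w // 2 - 10, h - 30], [w // 2, h - 25], [w // 2 + 10, h - 30]])

def apply_lip_keypoints(template_face: np.ndarray, lip_points: np.ndarray) -> np.ndarray:
    """Stand-in for warping the template's lip region to match `lip_points`.

    A real implementation would deform the lip image; here we only mark the points.
    """
    out = template_face.copy()
    for x, y in lip_points:
        out[int(y), int(x)] = 255  # mark each key point
    return out

def build_target_frames(standard_frames, target_face_template):
    frames = []
    for frame in standard_frames:                  # each standard pronunciation frame
        lips = detect_lip_keypoints(frame)         # lip motion for this frame
        frames.append(apply_lip_keypoints(target_face_template, lips))
    return frames                                  # target pronunciation face image frame sequence

standard_frames = [np.zeros((128, 128), dtype=np.uint8) for _ in range(3)]
template = np.zeros((128, 128), dtype=np.uint8)
print(len(build_target_frames(standard_frames, template)))  # 3
```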
Wherein, the face image template of the target user can be obtained by the following method: collecting a multi-angle face image of the target user; and determining a face image template of the target user according to the multi-angle face image.
Illustratively, the target user face image template may be a three-dimensional target user face image model, which may be synthesized from multi-angle face images of the target user. Or, the face image template of the target user can be a three-dimensional target user head image model, and the three-dimensional target user head image model can be synthesized according to the multi-angle head image of the target user.
When the target user is a new user and the user allows access to the terminal's video recording device, the user terminal can prompt the user to record video in the required modes (e.g., facing front, turning to the side, opening the mouth, blinking) so as to obtain multi-angle facial images of the target user.
And S14, synthesizing the target pronunciation video by the target pronunciation audio and the target pronunciation face image frame sequence.
The target pronunciation facial image frame sequence forms a silent video, and the target pronunciation video is synthesized from the target pronunciation audio carrying the target user's tone characteristics and the target pronunciation facial image frame sequence carrying the target user's facial features. The target pronunciation video shows, with the target user's own face and tone, how to pronounce the target text accurately.
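As a sketch of this synthesis step, the snippet below muxes a rendered face-frame sequence with the converted audio using the ffmpeg command-line tool; the file names, frame rate, and codec choices are assumptions for illustration, not values taken from the disclosure.

```python
import subprocess

def mux_frames_and_audio(frame_pattern="frames/%05d.png",
                         audio_path="target_pronunciation.wav",
                         out_path="target_pronunciation.mp4",
                         fps=25):
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frame_pattern,   # silent face-frame sequence
        "-i", audio_path,                              # timbre-converted audio
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                                   # stop at the shorter stream
        out_path,
    ], check=True)

# mux_frames_and_audio()  # produces the target pronunciation video
```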
According to the above technical solution, the standard pronunciation audio and the standard pronunciation facial image frame sequence corresponding to the target text are determined. The target pronunciation audio with the target user's tone characteristics is obtained according to the standard pronunciation audio and the target user's tone information. Meanwhile, a target pronunciation facial image frame sequence with the target user's facial features is obtained according to the standard pronunciation facial image frame sequence and the target user's facial image template. The target pronunciation video is synthesized from the target pronunciation audio and the target pronunciation facial image frame sequence. In this way, a target pronunciation video having the target user's timbre and facial features can be generated from arbitrary text.
The method can be applied on a server and can also be applied on a client. In a pronunciation teaching scenario, if the method is applied on a client, the target text is a pronunciation teaching example sentence, and the method may further include: displaying the target text and the target pronunciation video to the target user, so that the target user pronounces the target text according to the target pronunciation video.
In a pronunciation teaching scene, teaching the target user with a target pronunciation video that pronounces correctly in the target user's own tone and with the target user's own face reduces the imitation difficulty caused by mimicking a voice of a different timbre, and seeing the pronunciation movements on the target user's own face helps the target user learn how to articulate, thereby improving the target user's pronunciation learning efficiency.
The technical solution of the present disclosure can be combined with pronunciation evaluation applications in the related art: when the user's pronunciation is found to be inaccurate, a target pronunciation video of the correct pronunciation is generated for the mispronounced text, so as to provide targeted pronunciation teaching to the target user.
In one possible implementation, the step S12 of obtaining the target pronunciation audio with the target user's tone characteristics according to the standard pronunciation audio and the target user's tone information includes: inputting the standard pronunciation audio into a tone conversion model, and outputting the target pronunciation audio by the tone conversion model in combination with the tone information of the target user.
The voice quality conversion model comprises a characteristic extraction module, a voice quality conversion module and a vocoder module, wherein the characteristic extraction module is used for extracting a phoneme posterior probability vector of the standard pronunciation voice frequency; the tone conversion module is used for combining the phoneme posterior probability vector with the tone information of the target user to obtain target pronunciation Mel feature information; the vocoder module is used for obtaining the target pronunciation audio according to the target pronunciation Mel characteristic information.
Referring to fig. 2, the feature extraction module may adopt an ASR (Automatic Speech Recognition) model, i.e., a model that converts human speech into text. In the present disclosure, the feature extraction module is used to extract the phoneme posterior probability vector (PPG) of the standard pronunciation audio. Alternatively, the feature extraction module may extract BN (bottleneck) feature information of the standard pronunciation audio. The PPG feature information output by the feature extraction module is input into the tone conversion module, which combines the PPG feature information with the tone information of the target user to obtain the target pronunciation Mel feature information. The target pronunciation Mel feature information is then input into a vocoder module (e.g., a universal WaveRNN) to obtain the target pronunciation audio.
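To make the pipeline concrete, the following PyTorch sketch wires the three modules together in the order described above: an ASR-style feature extractor producing PPGs, a conversion module combining the PPG with the target user's timbre embedding into Mel features, and a vocoder turning Mel features into audio. All layer choices, dimensions, and the lambda standing in for the vocoder are assumptions made for this illustration, not the patented implementation.

```python
import torch
import torch.nn as nn

class TimbreConversionModel(nn.Module):
    def __init__(self, n_mels=80, n_phones=72, emb_dim=256, hidden=256):
        super().__init__()
        self.feature_extractor = nn.Sequential(        # stands in for the ASR module
            nn.Linear(n_mels, hidden), nn.ReLU(), nn.Linear(hidden, n_phones))
        self.conversion = nn.GRU(n_phones + emb_dim, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)
        self.vocoder = lambda mel: mel.flatten(1)      # placeholder for e.g. a WaveRNN vocoder

    def forward(self, standard_mel, timbre_embedding):
        ppg = self.feature_extractor(standard_mel).softmax(-1)       # (B, T, n_phones) PPG
        timbre = timbre_embedding.unsqueeze(1).expand(-1, ppg.size(1), -1)
        h, _ = self.conversion(torch.cat([ppg, timbre], dim=-1))     # combine PPG + timbre
        target_mel = self.to_mel(h)                                  # target pronunciation Mel features
        return self.vocoder(target_mel)                              # target pronunciation audio

model = TimbreConversionModel()
audio = model(torch.randn(1, 200, 80), torch.randn(1, 256))
print(audio.shape)
```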
The tone conversion model can adopt an accent conversion model. The training process of the tone conversion model includes back transformation and speaker-adaptive training, and the tone conversion model is obtained by training in the following way:
Audio of the target user reading a given example sentence is collected; the tone information of the target user determined from this audio, together with the standard pronunciation audio, is used as training data of the tone conversion model; and the tone conversion model is obtained by training on this training data.
Specifically, the training process of the tone conversion model includes three parts: training of the ASR module, training of the vocoder module, and training of the tone conversion module. In the adversarial (generative adversarial) training process of the tone conversion module, the PPG is combined with the tone information of the target user to obtain target pronunciation Mel feature information. Speaker classification training is then carried out using the synthesized target pronunciation Mel feature information and the Mel feature information corresponding to the audio of the given example sentence read aloud by the target user; this speaker classification training is used to assist in adjusting the tone conversion model until the speaker classifier cannot distinguish the synthesized target pronunciation Mel feature information from the Mel feature information corresponding to the audio of the given example sentence read aloud by the target user, thereby obtaining the trained tone conversion model.
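A minimal sketch of the adversarial fine-tuning step described above is given below: a speaker classifier is trained to separate synthesized target pronunciation Mel features from Mel features of the target user's own recordings, while the conversion module is updated until the classifier can no longer tell them apart. The shapes, optimizers, and single-layer classifier are illustrative assumptions.

```python
import torch
import torch.nn as nn

speaker_classifier = nn.Sequential(nn.Flatten(), nn.Linear(200 * 80, 1))  # real vs synthesized
d_opt = torch.optim.Adam(speaker_classifier.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def adversarial_step(converted_mel, real_user_mel, conversion_optimizer):
    # 1) classifier step: label the target user's real mels 1, converted mels 0
    d_loss = bce(speaker_classifier(real_user_mel), torch.ones(real_user_mel.size(0), 1)) + \
             bce(speaker_classifier(converted_mel.detach()), torch.zeros(converted_mel.size(0), 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # 2) conversion step: push converted mels to be classified as the real target user
    g_loss = bce(speaker_classifier(converted_mel), torch.ones(converted_mel.size(0), 1))
    conversion_optimizer.zero_grad(); g_loss.backward(); conversion_optimizer.step()
    return d_loss.item(), g_loss.item()
```

Here `converted_mel` would come from the tone conversion module (with gradients enabled) and `conversion_optimizer` would hold that module's parameters; training stops when the classifier's accuracy falls to chance level.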
Since audio includes tone color information, accent information, and text content information, an implementable embodiment may preserve both the accent information and the text content information in the PPG when training the ASR module. In detail, the tone conversion model comprises a feature extraction module, a tone conversion module and a vocoder module; the feature extraction module is used for extracting the phoneme posterior probability vector of the standard pronunciation audio and the accent information of the standard pronunciation audio; the tone conversion module is used for combining the phoneme posterior probability vector, the accent information of the standard pronunciation audio and the tone information of the target user to obtain target pronunciation Mel feature information; and the vocoder module is used for obtaining the target pronunciation audio according to the target pronunciation Mel feature information.
In another implementation, accent information in standard pronunciation audio may be added to the tone color conversion module. In detail, the tone conversion model comprises a feature extraction module, a tone conversion module and a vocoder module, wherein the feature extraction module is used for extracting a phoneme posterior probability vector of the standard pronunciation audio; the tone conversion module is used for combining the phoneme posterior probability vector, the tone information of the target user and the accent information in the standard pronunciation audio to obtain target pronunciation Mel characteristic information; the vocoder module is used for obtaining the target pronunciation audio according to the target pronunciation Mel characteristic information.
The accent information in the standard pronunciation audio can be directly extracted from the standard pronunciation audio. The accent information includes at least a fundamental frequency F0.
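As a small illustration of extracting this accent-related information, the snippet below computes a per-frame F0 contour with librosa's pYIN tracker; the sampling rate, pitch range, and the synthetic test tone are assumptions made so the example runs stand-alone.

```python
import numpy as np
import librosa

sr = 16000
# In practice `y` would be the standard pronunciation audio loaded from file,
# e.g. y, sr = librosa.load("standard_pronunciation.wav", sr=sr); here a 200 Hz
# test tone stands in so the snippet runs without any external file.
y = np.sin(2 * np.pi * 200 * np.arange(sr) / sr).astype(np.float32)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
print(np.nanmedian(f0))  # ~200.0: the per-frame F0 contour (NaN for unvoiced frames)
```

The resulting F0 contour can be fed to the tone conversion module together with the PPG and the target user's timbre vector, as described above.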
In this way, the standard pronunciation audio is input into the tone conversion model, and the tone conversion model combines the tone information of the target user, the accent information in the standard pronunciation audio, and the text content information in the standard pronunciation audio to obtain the target pronunciation audio. Target pronunciation audio with an authentic, native-like accent, spoken in the target user's timbre, is thereby obtained.
Fig. 3 is a block diagram illustrating a video generating apparatus according to an exemplary embodiment of the present disclosure, and as shown in fig. 3, the apparatus 300 includes:
a determining module 310, configured to determine a standard pronunciation video corresponding to the target text, where the standard pronunciation video includes a standard pronunciation audio and a standard pronunciation facial image frame sequence; the first execution module 320 is configured to obtain a target pronunciation audio with a target user tone characteristic according to the standard pronunciation audio and the target user tone information; a second executing module 330, configured to obtain a target pronunciation facial image frame sequence with a target user facial feature according to the standard pronunciation facial image frame sequence and the target user facial image template; and the synthesizing module 340 is configured to synthesize the target pronunciation audio and the target pronunciation face image frame sequence into a target pronunciation video.
In a possible implementation, the first execution module 320 includes an input sub-module, configured to input the standard pronunciation audio into a tone conversion model, and the tone conversion model outputs the target pronunciation audio in combination with the tone information of the target user; the training submodule is used for taking the audio of the given example sentence read by the target user, the tone information of the target user determined according to the audio and the standard pronunciation audio as training data of the tone conversion model; and training according to the training data to obtain the tone conversion model.
In a possible implementation manner, the tone conversion model includes a feature extraction module, a tone conversion module and a vocoder module, the feature extraction module is configured to extract a phoneme posterior probability vector of the standard pronunciation audio; the tone conversion module is used for combining the phoneme posterior probability vector with the tone information of the target user to obtain target pronunciation Mel feature information; the vocoder module is used for obtaining the target pronunciation audio according to the target pronunciation Mel characteristic information.
In a possible implementation manner, the tone conversion model includes a feature extraction module, a tone conversion module, and a vocoder module, the feature extraction module is configured to extract a phoneme posterior probability vector of the standard pronunciation audio and accent information of the standard pronunciation audio; the tone conversion module is used for combining the phoneme posterior probability vector, the accent information of the standard pronunciation audio and the tone information of the target user to obtain target pronunciation Mel characteristic information; the vocoder module is used for obtaining the target pronunciation audio according to the target pronunciation Mel characteristic information.
In a possible implementation, the second performing module 330 is specifically configured to extract lip keypoint information for each frame of the standard pronunciation face image; combining the lip key point information of each frame of standard pronunciation facial image with the facial image template of the target user to obtain the target pronunciation facial image frame sequence.
In a possible implementation manner, the target text is a pronunciation teaching example sentence, and the apparatus further includes a display module configured to display the target text and the target pronunciation video to the target user, so that the target user pronounces the target text according to the target pronunciation video.
In one possible embodiment, the face image template of the target user is obtained by: collecting a multi-angle face image of the target user; and determining a face image template of the target user according to the multi-angle face image.
With such an apparatus 300, a target pronunciation video having the target user's timbre and facial features can be generated from arbitrary text. In a pronunciation teaching scene, teaching the target user with a target pronunciation video that pronounces correctly in the target user's own tone and with the target user's own face reduces the imitation difficulty caused by mimicking a voice of a different timbre, and seeing the pronunciation movements on the target user's own face helps the target user learn how to articulate, thereby improving the target user's pronunciation learning efficiency.
Referring now to FIG. 4, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 4, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 4 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: determining a standard pronunciation video corresponding to the target text, wherein the standard pronunciation video comprises standard pronunciation audio and a standard pronunciation facial image frame sequence; obtaining a target pronunciation audio with the target user tone characteristic according to the standard pronunciation audio and the tone information of the target user; obtaining a target pronunciation face image frame sequence with the facial features of the target user according to the standard pronunciation face image frame sequence and the face image template of the target user; and synthesizing the target pronunciation video by the target pronunciation audio and the target pronunciation face image frame sequence.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. Wherein the name of a module in some cases does not constitute a limitation on the module itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a video generation method, including determining a standard pronunciation video corresponding to a target text, the standard pronunciation video including a standard pronunciation audio and a standard pronunciation facial image frame sequence; obtaining a target pronunciation audio with the target user tone characteristic according to the standard pronunciation audio and the tone information of the target user; obtaining a target pronunciation face image frame sequence with the facial features of the target user according to the standard pronunciation face image frame sequence and the face image template of the target user; and synthesizing the target pronunciation video by the target pronunciation audio and the target pronunciation face image frame sequence.
Example 2 provides the method of example 1, the obtaining target pronunciation audio with target user timbre features according to the standard pronunciation audio and target user timbre information including: inputting the standard pronunciation audio into a tone conversion model, and outputting the target pronunciation audio by the tone conversion model in combination with tone information of the target user; wherein the tone conversion model is obtained by training in the following way: audio of the target user reading a given example sentence is collected, and the tone information of the target user determined according to the audio, together with the standard pronunciation audio, is used as training data of the tone conversion model; and the tone conversion model is obtained by training according to the training data.
Example 3 provides the method of example 2, the tone conversion model including a feature extraction module to extract a phoneme posterior probability vector of the standard pronunciation audio, a tone conversion module, and a vocoder module, according to one or more embodiments of the present disclosure; the tone conversion module is used for combining the phoneme posterior probability vector with the tone information of the target user to obtain target pronunciation Mel feature information; the vocoder module is used for obtaining the target pronunciation audio according to the target pronunciation Mel characteristic information.
Example 4 provides the method of example 2, the tone conversion model including a feature extraction module to extract a phoneme posterior probability vector of the standard pronunciation audio and accent information of the standard pronunciation audio, a tone conversion module, and a vocoder module, according to one or more embodiments of the present disclosure; the tone conversion module is used for combining the phoneme posterior probability vector, the accent information of the standard pronunciation audio and the tone information of the target user to obtain target pronunciation Mel characteristic information; the vocoder module is used for obtaining the target pronunciation audio according to the target pronunciation Mel characteristic information.
Example 5 provides the method of example 1, the deriving a target pronunciation facial image frame sequence having facial features of a target user from the standard pronunciation facial image frame sequence and the facial image template of the target user according to one or more embodiments of the present disclosure, including: extracting lip key point information aiming at each frame of standard pronunciation face image; combining the lip key point information of each frame of standard pronunciation facial image with the facial image template of the target user to obtain the target pronunciation facial image frame sequence.
Example 6 provides the method of examples 1-5, the target text being a pronunciation tutorial example sentence, the method further comprising: and displaying the target text and the target pronunciation video to the target user so that the target user pronounces the target text according to the target pronunciation video.
Example 7 provides the methods of examples 1-5, the target user's facial image template being obtained by: collecting a multi-angle face image of the target user; and determining a face image template of the target user according to the multi-angle face image.
Example 8 provides, in accordance with one or more embodiments of the present disclosure, a video generation apparatus comprising: the determining module is used for determining a standard pronunciation video corresponding to the target text, wherein the standard pronunciation video comprises standard pronunciation audio and a standard pronunciation facial image frame sequence; the first execution module is used for obtaining a target pronunciation audio with the target user tone characteristic according to the standard pronunciation audio and the tone information of the target user; the second execution module is used for obtaining a target pronunciation face image frame sequence with the facial features of the target user according to the standard pronunciation face image frame sequence and the face image template of the target user; and the synthesis module is used for synthesizing the target pronunciation video from the target pronunciation audio and the target pronunciation face image frame sequence.
Example 9 provides the apparatus of example 8, the first execution module including an input submodule to input the standard pronunciation audio into a timbre conversion model, the timbre conversion model outputting the target pronunciation audio in combination with timbre information of the target user; the training submodule is used for taking the audio of the given example sentence read by the target user, the tone information of the target user determined according to the audio and the standard pronunciation audio as training data of the tone conversion model; and training according to the training data to obtain the tone conversion model.
Example 10 provides the apparatus of example 9, the tone conversion model comprising a feature extraction module to extract a phoneme posterior probability vector of the standard pronunciation audio, a tone conversion module, and a vocoder module, according to one or more embodiments of the present disclosure; the tone conversion module is used for combining the phoneme posterior probability vector with the tone information of the target user to obtain target pronunciation Mel feature information; the vocoder module is used for obtaining the target pronunciation audio according to the target pronunciation Mel characteristic information.
Example 11 provides the apparatus of example 9, the tone conversion model including a feature extraction module to extract a phoneme posterior probability vector of the standard pronunciation audio and accent information of the standard pronunciation audio, a tone conversion module, and a vocoder module, according to one or more embodiments of the present disclosure; the tone conversion module is used for combining the phoneme posterior probability vector, the accent information of the standard pronunciation audio and the tone information of the target user to obtain target pronunciation Mel characteristic information; the vocoder module is used for obtaining the target pronunciation audio according to the target pronunciation Mel characteristic information.
Example 12 provides the apparatus of example 8, the second execution module to extract lip keypoint information, in particular for each frame of the standard pronunciation face image; combining the lip key point information of each frame of standard pronunciation facial image with the facial image template of the target user to obtain the target pronunciation facial image frame sequence.
Example 13 provides the apparatus of examples 8-12, the target text being a pronunciation teaching example sentence, the apparatus further comprising a presentation module to present the target text and the target pronunciation video to the target user to cause the target user to pronounce the target text according to the target pronunciation video.
Example 14 provides the apparatus of examples 8-12, further comprising an acquisition module to acquire multi-angle facial images of the target user, in accordance with one or more embodiments of the present disclosure; and determining a face image template of the target user according to the multi-angle face image.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure, for example, technical solutions formed by mutually replacing the above features with (but not limited to) features disclosed in this disclosure that have similar functions.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (10)

1. A method of video generation, the method comprising:
determining a standard pronunciation video corresponding to a target text, wherein the standard pronunciation video comprises standard pronunciation audio and a standard pronunciation face image frame sequence;
obtaining target pronunciation audio having a timbre feature of a target user according to the standard pronunciation audio and timbre information of the target user;
obtaining a target pronunciation face image frame sequence having facial features of the target user according to the standard pronunciation face image frame sequence and a face image template of the target user;
and synthesizing a target pronunciation video from the target pronunciation audio and the target pronunciation face image frame sequence.
2. The method of claim 1, wherein obtaining the target pronunciation audio having the timbre feature of the target user according to the standard pronunciation audio and the timbre information of the target user comprises:
inputting the standard pronunciation audio into a tone conversion model, the tone conversion model outputting the target pronunciation audio in combination with the timbre information of the target user;
wherein the tone conversion model is trained in the following way:
collecting audio of the target user reading a given example sentence, and using the timbre information of the target user determined from the audio, together with the standard pronunciation audio, as training data of the tone conversion model;
and training according to the training data to obtain the tone conversion model.
3. The method of claim 2, wherein the tone conversion model comprises a feature extraction module, a tone conversion module and a vocoder module, the feature extraction module being configured to extract a phoneme posterior probability vector of the standard pronunciation audio;
the tone conversion module is configured to combine the phoneme posterior probability vector with the timbre information of the target user to obtain target pronunciation Mel feature information;
and the vocoder module is configured to obtain the target pronunciation audio according to the target pronunciation Mel feature information.
4. The method according to claim 2, wherein the tone conversion model comprises a feature extraction module, a tone conversion module and a vocoder module, the feature extraction module being configured to extract a phoneme posterior probability vector of the standard pronunciation audio and accent information of the standard pronunciation audio;
the tone conversion module is configured to combine the phoneme posterior probability vector, the accent information of the standard pronunciation audio and the timbre information of the target user to obtain target pronunciation Mel feature information;
and the vocoder module is configured to obtain the target pronunciation audio according to the target pronunciation Mel feature information.
5. The method of claim 1, wherein obtaining the target pronunciation face image frame sequence having the facial features of the target user according to the standard pronunciation face image frame sequence and the face image template of the target user comprises:
extracting lip key point information for each frame of the standard pronunciation face image;
and combining the lip key point information of each frame of the standard pronunciation face image with the face image template of the target user to obtain the target pronunciation face image frame sequence.
6. The method of any one of claims 1-5, wherein the target text is a pronunciation teaching example sentence, the method further comprising:
and displaying the target text and the target pronunciation video to the target user so that the target user pronounces the target text according to the target pronunciation video.
7. The method according to any one of claims 1-5, wherein the face image template of the target user is obtained by:
collecting multi-angle face images of the target user;
and determining the face image template of the target user according to the multi-angle face images.
8. A video generation apparatus, characterized in that the apparatus comprises:
a determining module, configured to determine a standard pronunciation video corresponding to a target text, wherein the standard pronunciation video comprises standard pronunciation audio and a standard pronunciation face image frame sequence;
a first execution module, configured to obtain target pronunciation audio having a timbre feature of a target user according to the standard pronunciation audio and timbre information of the target user;
a second execution module, configured to obtain a target pronunciation face image frame sequence having facial features of the target user according to the standard pronunciation face image frame sequence and a face image template of the target user;
and a synthesis module, configured to synthesize a target pronunciation video from the target pronunciation audio and the target pronunciation face image frame sequence.
9. A computer-readable medium, on which a computer program is stored, characterized in that the program, when executed by a processing device, implements the steps of the method of any one of claims 1 to 7.
10. An electronic device, comprising:
a storage device having a computer program stored thereon;
a processing device configured to execute the computer program in the storage device to carry out the steps of the method according to any one of claims 1 to 7.
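For illustration only, a minimal sketch of the final synthesizing step of claim 1 is given below: it assumes the target pronunciation face image frame sequence has been written as numbered PNG files, that the target pronunciation audio is a WAV file, and that ffmpeg is installed; the paths and frame rate are placeholders.

    import subprocess
    from pathlib import Path

    def synthesize_pronunciation_video(frames_dir: Path, audio_path: Path,
                                       out_path: Path, fps: int = 25) -> Path:
        """Mux the target pronunciation face frames with the target pronunciation audio."""
        subprocess.run(
            ["ffmpeg", "-y",
             "-framerate", str(fps), "-i", str(frames_dir / "%05d.png"),  # face frames
             "-i", str(audio_path),                                       # target audio
             "-c:v", "libx264", "-pix_fmt", "yuv420p",
             "-c:a", "aac", "-shortest", str(out_path)],
            check=True)
        return out_path

    # Example usage (placeholder paths):
    # synthesize_pronunciation_video(Path("work/frames"), Path("work/target_audio.wav"),
    #                                Path("work/target_pronunciation.mp4"))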
CN202110481188.6A 2021-04-30 2021-04-30 Video generation method and device, storage medium and electronic equipment Pending CN113223555A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110481188.6A CN113223555A (en) 2021-04-30 2021-04-30 Video generation method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110481188.6A CN113223555A (en) 2021-04-30 2021-04-30 Video generation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113223555A (en) 2021-08-06

Family

ID=77090514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110481188.6A Pending CN113223555A (en) 2021-04-30 2021-04-30 Video generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113223555A (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007022054A2 (en) * 2005-08-15 2007-02-22 Tepper Daniel A System and method for simultaneous demonstration of mouth movements together with visual presentation of an image that represents a letter(s) or word(s) being pronounced
CN101563698A (en) * 2005-09-16 2009-10-21 富利克索尔股份有限公司 Personalizing a video
US20200058288A1 (en) * 2018-08-16 2020-02-20 National Taiwan University Of Science And Technology Timbre-selectable human voice playback system, playback method thereof and computer-readable recording medium
US20200234690A1 (en) * 2019-01-18 2020-07-23 Snap Inc. Text and audio-based real-time face reenactment
CN110223705A (en) * 2019-06-12 2019-09-10 腾讯科技(深圳)有限公司 Phonetics transfer method, device, equipment and readable storage medium storing program for executing
TW202109474A (en) * 2019-08-15 2021-03-01 亞東技術學院 Language pronunciation learning system and method
CN112259072A (en) * 2020-09-25 2021-01-22 北京百度网讯科技有限公司 Voice conversion method and device and electronic equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114222077A (en) * 2021-12-14 2022-03-22 惠州视维新技术有限公司 Video processing method and device, storage medium and electronic equipment
CN114938476A (en) * 2022-05-31 2022-08-23 深圳市优必选科技股份有限公司 Speaking head video synthesis method and device, terminal equipment and readable storage medium
CN114938476B (en) * 2022-05-31 2023-09-22 深圳市优必选科技股份有限公司 Method and device for synthesizing voice head video, terminal equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111933110B (en) Video generation method, generation model training method, device, medium and equipment
CN111369967B (en) Virtual character-based voice synthesis method, device, medium and equipment
CN111402842B (en) Method, apparatus, device and medium for generating audio
US11670015B2 (en) Method and apparatus for generating video
CN112967212A (en) Virtual character synthesis method, device, equipment and storage medium
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN111883107B (en) Speech synthesis and feature extraction model training method, device, medium and equipment
CN113592985B (en) Method and device for outputting mixed deformation value, storage medium and electronic device
CN110880198A (en) Animation generation method and device
CN113299312B (en) Image generation method, device, equipment and storage medium
CN113257218B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113223555A (en) Video generation method and device, storage medium and electronic equipment
CN111897976A (en) Virtual image synthesis method and device, electronic equipment and storage medium
CN112785670A (en) Image synthesis method, device, equipment and storage medium
WO2022194044A1 (en) Pronunciation assessment method and apparatus, storage medium, and electronic device
CN111369968A (en) Sound reproduction method, device, readable medium and electronic equipment
CN114429658A (en) Face key point information acquisition method, and method and device for generating face animation
CN112383721B (en) Method, apparatus, device and medium for generating video
CN111354362A (en) Method and device for assisting hearing-impaired communication
CN112381926A (en) Method and apparatus for generating video
CN111862933A (en) Method, apparatus, device and medium for generating synthesized speech
CN111916050A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN113079328B (en) Video generation method and device, storage medium and electronic equipment
CN113205797B (en) Virtual anchor generation method, device, computer equipment and readable storage medium
CN113079327A (en) Video generation method and device, storage medium and electronic equipment

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination