CN111415662A - Method, apparatus, device and medium for generating video


Info

Publication number
CN111415662A
Authority
CN
China
Prior art keywords: information, video, generating, feedback information, audio
Legal status: Pending
Application number: CN202010182273.8A
Other languages: Chinese (zh)
Inventor: 殷翔
Current Assignee: Beijing ByteDance Network Technology Co Ltd
Original Assignee: Beijing ByteDance Network Technology Co Ltd
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to: CN202010182273.8A
Publication of: CN111415662A


Classifications

    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/047: Architecture of speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L21/10: Transforming speech into visible information
    • G10L25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L2015/223: Execution procedure of a spoken command
    • G10L2015/225: Feedback of the input speech
    • G10L2021/105: Synthesis of the lips movements from speech, e.g. for talking heads

Abstract

Embodiments of the present disclosure disclose methods, apparatuses, devices and media for generating video. One embodiment of the method for generating a video includes: acquiring user interaction information of a target user; generating feedback information for the user interaction information based on the user interaction information; and generating, based on the feedback information, a video for instructing a preset person to perform an action corresponding to the feedback information. The embodiment can interact with the user by generating video, which increases the diversity of interaction modes and helps the user avoid the shyness produced when interacting with a real person.

Description

Method, apparatus, device and medium for generating video
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for generating a video.
Background
Human-computer interaction (HCI) technology is a product of the development of information technology; it realizes human-machine dialogue in an effective manner through computer input and output devices. It enables a shift from humans adapting to computers toward computers adapting to humans.
At present, human-computer interaction is no longer limited to keyboard input and handle (gamepad) operation. More novel modalities, such as subtle finger motions, sound-wave vibrations in the air, and movements of the eyes or tongue, can also transmit information and complete the "dialogue" between human and machine.
Disclosure of Invention
The present disclosure presents methods, apparatuses, devices and media for generating video.
In a first aspect, an embodiment of the present disclosure provides a method for generating a video, the method including: acquiring user interaction information of a target user; generating feedback information for the user interaction information based on the user interaction information; and generating, based on the feedback information, a video for instructing a preset person to perform an action corresponding to the feedback information.
In some embodiments, the feedback information is text information; and generating a video for instructing a preset person to perform an action corresponding to the feedback information based on the feedback information, including: determining phoneme information and intonation information corresponding to the text information; and generating a video for instructing a preset person to perform an action corresponding to the feedback information based on the phoneme information and the intonation information.
In some embodiments, generating a video for instructing a preset person to perform an action corresponding to the feedback information based on the phoneme information and the intonation information includes: determining, based on the phoneme information, the number of audio frames of the voice audio to be generated and the number of images of the image sequence to be generated, wherein the number of audio frames is equal to the number of images; generating voice audio and an image sequence based on the phoneme information and the intonation information, wherein the voice audio contains the determined number of audio frames, the image sequence contains the determined number of images, and the image sequence shows the preset person performing the action corresponding to the voice audio; and synthesizing the voice audio and the image sequence to obtain a video for instructing the preset person to perform the action corresponding to the voice audio.
In some embodiments, audio frames in the speech audio correspond one-to-one to images in the image sequence, and the mouth shape of the images in the image sequence characterizes: the preset person utters a voice indicated by the audio frame corresponding to the image.
In some embodiments, the user interaction information includes a user video; and generating feedback information for the user interaction information based on the user interaction information includes: in response to the voice audio in the user video satisfying a preset intonation adjustment condition, generating feedback information for instructing the user to adjust the intonation of the audio; and in response to the mouth shape in an image in the user video satisfying a preset mouth shape adjustment condition, generating feedback information for instructing the user to adjust the mouth shape.
In some embodiments, generating, based on the feedback information, a video for instructing a preset person to perform an action corresponding to the feedback information includes: in response to the user interaction information including voice audio, determining, from a predetermined set of emotion categories, the emotion category to which the voice audio belongs; and generating, based on the feedback information and the determined emotion category, a video for instructing the preset person to perform a target action, wherein the target action corresponds to the feedback information and to the emotion indicated by the determined emotion category.
In some embodiments, the user interaction information includes foreign language voice audio, and the preset person is a foreign teacher; and generating, based on the feedback information, a video for instructing the preset person to perform an action corresponding to the feedback information includes: in response to the feedback information being text information, inputting the text information into a generative model pre-trained for the foreign teacher, to generate voice audio corresponding to the text information and an image sequence showing the foreign teacher uttering the generated voice audio, wherein the generative model is used to generate voice audio corresponding to input text information and an image sequence showing the foreign teacher uttering that voice audio; and generating, based on the generated image sequence and the voice audio corresponding to the text information, a video showing the foreign teacher uttering the voice audio corresponding to the text information.
In some embodiments, the generative model is trained by: acquiring a target video, wherein the target video is obtained by capturing images of and recording the voice of the foreign teacher, and the playing duration of the target video is greater than or equal to a preset threshold; extracting matched images and audio frames from the target video to obtain a training sample set, wherein a training sample in the training sample set includes an audio frame, the image matched with the audio frame, and the text information corresponding to the audio frame; and training, with a machine learning algorithm, a generative model using the text information included in the training samples as input data and the audio frames and images included in the training samples as expected output data.
In a second aspect, an embodiment of the present disclosure provides an apparatus for generating a video, the apparatus including: an acquisition unit configured to acquire user interaction information of a target user; a first generating unit configured to generate feedback information for the user interaction information based on the user interaction information; and a second generating unit configured to generate a video for instructing a preset person to perform an action corresponding to the feedback information, based on the feedback information.
In a third aspect, an embodiment of the present disclosure provides an electronic device for generating a video, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the method of any of the embodiments of the method for generating video as described above.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium for generating a video, on which a computer program is stored, which when executed by a processor, implements the method of any of the embodiments of the method for generating a video as described above.
According to the method, apparatus, device and medium for generating video, user interaction information of a target user is acquired, feedback information for the user interaction information is generated based on the user interaction information, and a video for instructing a preset person to perform the action corresponding to the feedback information is then generated based on the feedback information. Information interaction with the user can therefore be carried out by generating video, which increases the diversity of interaction modes, helps the user avoid the shyness produced when interacting with a real person, and improves the user's expression ability.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a method for generating video in accordance with the present disclosure;
FIG. 3 is a schematic diagram of one application scenario of a method for generating video in accordance with the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method for generating video in accordance with the present disclosure;
FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for generating video in accordance with the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the disclosure and are not limiting of the disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 of an embodiment of a method for generating video or an apparatus for generating video to which embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 101, 102, 103 to interact with a server 105 over a network 104 to receive or transmit data (e.g., user interaction information) and the like. The terminal devices 101, 102, 103 may have various client applications installed thereon, such as video playing software, video processing applications, news information applications, image processing applications, web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having information processing functions, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server providing various services, such as a background video processing server generating a video for instructing a preset person to perform a corresponding action based on the user interaction information sent by the terminal devices 101, 102, 103. Optionally, the background video processing server may also feed back the generated video to the terminal device for the terminal device to play. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be further noted that the method for generating the video provided by the embodiment of the present disclosure may be executed by a server, may also be executed by a terminal device, and may also be executed by the server and the terminal device in cooperation with each other. Accordingly, the various parts (e.g., the various units, sub-units, modules, and sub-modules) included in the apparatus for generating video may be all disposed in the server, may be all disposed in the terminal device, and may be disposed in the server and the terminal device, respectively.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. The system architecture may only include the electronic device (e.g., server or terminal device) on which the method for generating video operates, when the electronic device on which the method for generating video operates does not require data transmission with other electronic devices.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for generating video in accordance with the present disclosure is shown. The method for generating the video comprises the following steps:
step 201, obtaining user interaction information of a target user.
In this embodiment, an execution subject (for example, a server or a terminal device shown in fig. 1) of the method for generating a video may obtain user interaction information of a target user from other electronic devices or locally through a wired connection manner or a wireless connection manner.
The target user may be any user. The user interaction information may be information used by the target user to interact with the execution body. By way of example, the user interaction information may include, but is not limited to, information in the following forms: text, voice, image, video, and the like.
Here, when the execution main body is a terminal device, the execution main body may acquire user interaction information of a target user through at least one of a voice acquisition device, an image acquisition device, a mouse, a keyboard, and a touch screen provided thereon; when the execution main body is a server, the execution main body may acquire the user interaction information of the target user from the terminal device after the terminal device used by the target user acquires the user interaction information of the target user through at least one of a voice acquisition device, an image acquisition device, a mouse, a keyboard, and a touch screen.
Step 202, generating feedback information aiming at the user interaction information based on the user interaction information.
In this embodiment, based on the user interaction information acquired in step 201, the execution subject may generate feedback information for the user interaction information. When the user interaction information indicates an operation instruction of a user, the feedback information may represent whether the operation instruction is completed; when the user interaction information indicates a question of the user, the feedback information may be reply information to the question.
As an example, the execution body or an electronic device communicatively connected to it may store user interaction information and its feedback information in association in advance. Thus, after performing step 201, the execution body may search locally, or in an electronic device communicatively connected to it, for the feedback information stored in association with the acquired user interaction information, and use the found feedback information as the feedback information generated in step 202 for the user interaction information.
As yet another example, the execution body or an electronic device communicatively connected to it may first train a feedback information generation model with a machine learning algorithm, based on training samples that include user interaction information and the feedback information for that user interaction information. The feedback information generation model can be used to generate feedback information for user interaction information. After the feedback information generation model is obtained, the execution body may input the user interaction information into it, thereby generating the feedback information for the user interaction information.
In practice, when the user interaction information includes text information or voice audio, the execution body may employ a conversational agent (chat robot) to generate the feedback information for the user interaction information.
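A minimal sketch of this feedback-generation step, combining the stored-association lookup with a model fallback. The lookup table and the FeedbackModel class are illustrative assumptions for the sketch, not an API defined by the disclosure:

```python
# Minimal sketch of step 202: generate feedback for user interaction information.
# The lookup table and FeedbackModel are illustrative assumptions, not the patent's API.
from typing import Optional

lookup_table = {
    "good morning": "Morning!",
    "how are you": "I am fine, thank you. And you?",
}

class FeedbackModel:
    """Placeholder for a feedback information generation model trained with a machine learning algorithm."""
    def generate(self, text: str) -> str:
        # A real model (e.g., a trained conversational agent) would be called here.
        return "Could you say that again?"

def generate_feedback(user_text: str, model: Optional[FeedbackModel] = None) -> str:
    # First try feedback stored in association with the interaction information.
    stored = lookup_table.get(user_text.strip().lower())
    if stored is not None:
        return stored
    # Otherwise fall back to the feedback information generation model.
    return (model or FeedbackModel()).generate(user_text)

print(generate_feedback("Good morning"))  # -> "Morning!"
```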
And step 203, generating a video for instructing the preset person to execute the action corresponding to the feedback information based on the feedback information.
In this embodiment, based on the feedback information generated in step 202, the execution subject may generate a video for instructing a preset person to execute an action corresponding to the feedback information.
The preset person may be any person. As an example, the preset person may be a predetermined person, or may be a person selected by the target user from a predetermined group of persons.
Here, the correspondence between feedback information and actions may be established in advance; for example, the feedback information may be stored in association with action information characterizing the action, thereby establishing the correspondence. In addition, the execution body may also use a model trained with a machine learning algorithm to generate an image sequence for instructing the preset person to perform the action corresponding to the feedback information, thereby obtaining a video composed of the generated image sequence.
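Putting steps 201 to 203 together, a skeletal sketch of the overall flow; the stub bodies are placeholders, and the concrete sub-steps are elaborated in the optional implementations below:

```python
# Skeletal composition of steps 201-203; the stub bodies are placeholders for the sketch.
def acquire_user_interaction_info() -> str:              # step 201
    return "Good morning"                                 # e.g., recognized text from the user's speech

def generate_feedback_info(user_info: str) -> str:        # step 202
    return "Morning!"                                      # feedback for the user interaction information

def generate_video_for_feedback(feedback: str) -> list:   # step 203
    # In the disclosure this produces a video of the preset person performing the matching action;
    # here it is just a placeholder list of (audio frame, image) pairs.
    return [("audio_frame", "image")]

video = generate_video_for_feedback(generate_feedback_info(acquire_user_interaction_info()))
print(len(video))
```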
In some optional implementations of this embodiment, in a case that the user interaction information includes a voice audio, the executing main body may execute the step 203 in the following manner:
the first step is that the emotion type to which the voice audio belongs is determined from a predetermined emotion type set. As an example, the emotion categories in the set of emotion categories may characterize any of the following emotions: joy, anger, worry, anxiety, etc. It can be understood that the emotion represented by the emotion category in the emotion category set can be determined according to actual needs, and is not limited herein.
It is understood that the voice audio uttered by the target user may carry emotions such as joy, anger, worry, or anxiety. In practice, the emotion category to which the voice audio belongs may be determined with a hidden Markov model and Gaussian mixture model (HMM-GMM) based method; an SVM (Support Vector Machine) based classification method may also be used; deep neural networks, end-to-end methods, or other speech emotion recognition methods may likewise be adopted.
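As a toy illustration of the SVM-based route just mentioned, the sketch below trains a classifier on precomputed acoustic feature vectors; the feature vectors, labels, and emotion names are synthetic placeholders, not data or choices fixed by the disclosure:

```python
# Illustrative SVM-based emotion classification over acoustic feature vectors.
# The feature vectors and labels below are synthetic placeholders, not real data.
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["joy", "anger", "worry", "anxiety"]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 26))           # e.g., per-utterance MFCC mean/std statistics
y_train = rng.integers(0, len(EMOTIONS), 200)  # emotion category indices

clf = SVC(kernel="rbf")  # support vector machine classifier
clf.fit(X_train, y_train)

def emotion_of(features: np.ndarray) -> str:
    """Return the emotion category to which a voice audio's feature vector belongs."""
    return EMOTIONS[int(clf.predict(features.reshape(1, -1))[0])]

print(emotion_of(rng.normal(size=26)))
```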
And secondly, generating a video for indicating a preset person to execute the target action based on the feedback information and the determined emotion type. Wherein the target action corresponds to the feedback information and the emotion indicated by the determined emotion classification.
Here, when the feedback information indicates speech, the action corresponding to the feedback information and to the emotion indicated by the determined emotion category (i.e., the target action) may characterize: the preset person uttering that speech with the emotion indicated by the determined emotion category. For example, if the feedback information indicates the audio "o" and the determined emotion category indicates "panic", the target action may characterize the preset person opening his or her mouth with a panicked expression. Further, when the feedback information indicates an image, the target action may characterize: the preset person performing the body movement corresponding to the image with the emotion indicated by the determined emotion category. For example, if the feedback information indicates an image of "smiling" and the determined emotion category indicates "embarrassment", the target action may characterize the preset person smiling awkwardly.
Here, the execution subject may input the feedback information and the determined emotion category to a video generation model trained in advance, thereby obtaining a video for instructing a preset person to execute the target action. The video generation model can represent the corresponding relation among the feedback information, the emotion types and videos for indicating preset personnel to execute the target action.
As an example, the video generation model may be a generative adversarial network (GAN) based model trained with a machine learning algorithm, or it may be a two-dimensional table or database storing feedback information, emotion categories, and videos for instructing the preset person to perform target actions.
For example, the generative adversarial network described above may include a generation network and a discrimination network. The generation network may be used to generate, from the input feedback information and emotion category, a video for instructing the preset person to perform the target action. The discrimination network may be used to determine whether the generated video meets a preset condition; as an example, it may do so according to the relationship between the value of a computed loss function and a preset threshold. As an example, the preset condition may include: the similarity between the generated video and an actually recorded video is greater than or equal to a preset threshold. The video generation model may thus be the generation network of a trained generative adversarial network.
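A compressed sketch of such a conditional generative adversarial network, assuming for illustration that the feedback information and emotion category are already encoded as fixed-size vectors and that the "video" is a small frame tensor; all dimensions are arbitrary placeholders, not values from the disclosure:

```python
# Skeleton of a conditional GAN for video generation; all sizes are illustrative placeholders.
import torch
import torch.nn as nn

FEEDBACK_DIM, EMOTION_DIM, NOISE_DIM = 64, 8, 32
FRAMES, H, W = 16, 32, 32          # a tiny "video": 16 frames of 32x32 grayscale images

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEEDBACK_DIM + EMOTION_DIM + NOISE_DIM, 512), nn.ReLU(),
            nn.Linear(512, FRAMES * H * W), nn.Tanh(),
        )
    def forward(self, feedback, emotion, noise):
        x = torch.cat([feedback, emotion, noise], dim=1)
        return self.net(x).view(-1, FRAMES, H, W)   # generated frame sequence

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FRAMES * H * W + FEEDBACK_DIM + EMOTION_DIM, 512), nn.ReLU(),
            nn.Linear(512, 1), nn.Sigmoid(),
        )
    def forward(self, video, feedback, emotion):
        x = torch.cat([video.flatten(1), feedback, emotion], dim=1)
        return self.net(x)   # probability that the video looks like a real recording

G, D = Generator(), Discriminator()
feedback = torch.randn(4, FEEDBACK_DIM)
emotion = torch.randn(4, EMOTION_DIM)
fake_video = G(feedback, emotion, torch.randn(4, NOISE_DIM))
realism_score = D(fake_video, feedback, emotion)  # the discrimination network's "preset condition" check
print(fake_video.shape, realism_score.shape)
```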
It will be appreciated that the target action indicated by the video generated in the above alternative implementation corresponds to the emotion indicated by the determined emotion category, and thus the accuracy of the generated video can be improved.
In some optional implementations of this embodiment, when the user interaction information includes foreign language voice audio, the preset person is a foreign teacher, and the feedback information is text information, the execution body may perform step 203 as follows:
firstly, inputting text information into a generation model trained in advance for the foreign education, generating voice audio corresponding to the text information, and an image sequence for instructing the foreign education to emit the generated voice audio. Wherein the generative model is to generate a speech audio corresponding to the input text information and an image sequence to instruct a foreign education to emit the speech audio corresponding to the input text information. Here, each external teaching can correspond to one generative model, and different external teaching can correspond to different generative models.
In some application scenarios of the above alternative implementation, the generative model may be trained by the following steps:
step one, obtaining a target video. The target video is obtained by shooting images and recording voice for the external education, and the playing time of the target video is greater than or equal to a preset threshold value.
And step two, extracting matched images and audio frames from the target video to obtain a training sample set (a sketch of this extraction follows step three below). A training sample in the training sample set includes an audio frame, the image matched with the audio frame, and the text information corresponding to the audio frame.
And step three, training a generative model with a machine learning algorithm, using the text information included in the training samples in the training sample set as input data and the audio frames and images included in the training samples as expected output data.
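The frame/audio alignment in step two can be done by timestamp. A minimal sketch under assumed values (25 video frames per second, 10 ms audio frame shift, and a subtitle-style transcript as the source of the text information; all of these are assumptions of the sketch, not requirements of the disclosure):

```python
# Illustrative extraction of matched (audio frame, image, text) training samples from a target video.
# The frame rate, audio frame shift, and transcript format are assumptions for the sketch.
import numpy as np

FPS = 25                 # video frames per second (assumed)
AUDIO_SHIFT_S = 0.01     # audio analysis frame shift in seconds (assumed)

def build_training_samples(images, audio_frames, transcript):
    """images: list of video frames; audio_frames: list of audio frames;
    transcript: list of (start_s, end_s, text) giving the text for each time span."""
    samples = []
    for i, image in enumerate(images):
        t = i / FPS                                              # timestamp of this image
        a = min(int(t / AUDIO_SHIFT_S), len(audio_frames) - 1)   # matching audio frame index
        text = next((txt for s, e, txt in transcript if s <= t < e), "")
        samples.append({"audio_frame": audio_frames[a], "image": image, "text": text})
    return samples

# Toy data standing in for a recorded target video of the foreign teacher.
images = [np.zeros((4, 4)) for _ in range(50)]            # 2 seconds of video at 25 fps
audio_frames = [np.zeros(160) for _ in range(200)]        # 2 seconds of audio at 10 ms shift
transcript = [(0.0, 1.0, "Good"), (1.0, 2.0, "morning")]
print(len(build_training_samples(images, audio_frames, transcript)))  # 50 matched samples
```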
It can be understood that, in the above application scenario, a target video with a relatively long playing duration (greater than or equal to the preset threshold) is obtained first, and training samples are then obtained by extracting matched images and audio frames from it, so that the video produced by the trained generative model is closer to a genuinely recorded video, improving the accuracy of video generation.
Optionally, the generative model may also be obtained by training a generative adversarial network with a machine learning algorithm. For example, the generative adversarial network may include a generation network and a discrimination network. The generation network may be used to generate the voice audio corresponding to the text information and the image sequence showing the foreign teacher uttering the generated voice audio. The discrimination network may be used to determine whether the generated voice audio and/or image sequence meets a preset condition; as an example, it may do so according to the relationship between the value of a computed loss function and a preset threshold. The preset condition may include at least one of the following: the similarity between the generated voice audio and actually recorded voice audio is greater than or equal to a preset threshold; the similarity between the generated image sequence and the image sequence in an actually recorded video is greater than or equal to a preset threshold; the degree of matching between the generated voice audio and the image sequence showing the foreign teacher uttering that audio is greater than or equal to a preset matching threshold. It can be appreciated that if the generated voice audio indicates a sound that requires an open mouth while the generated image sequence shows the foreign teacher with a closed mouth, the generated voice audio can be determined not to match the image sequence (i.e., the degree of matching is less than the preset matching threshold).
The second step is to generate, based on the generated image sequence and the voice audio corresponding to the text information, a video showing the foreign teacher uttering the voice audio corresponding to the text information.
Typically, a video includes audio and an image sequence (i.e., a sequence of video frames). The audio included in the video generated in this second step may be the voice audio corresponding to the text information generated in the first step, and the image sequence included in that video may be the image sequence generated in the first step.
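The disclosure does not tie the final synthesis of the image sequence and the voice audio into a playable video to any particular tool. As one possible sketch, the generated frames can be written to disk and muxed with the audio using ffmpeg; the file names and the ffmpeg invocation below are assumptions of the sketch, not part of the patent:

```python
# One possible way to mux a generated image sequence and voice audio into a video.
# Assumes the frames are saved as frame_0000.png, frame_0001.png, ... and the audio as speech.wav.
import subprocess

def mux_video(frame_pattern: str = "frame_%04d.png",
              audio_path: str = "speech.wav",
              out_path: str = "out.mp4",
              fps: int = 25) -> None:
    subprocess.run(
        ["ffmpeg", "-y",
         "-framerate", str(fps), "-i", frame_pattern,   # the generated image sequence
         "-i", audio_path,                              # the generated voice audio
         "-c:v", "libx264", "-pix_fmt", "yuv420p",
         "-c:a", "aac", "-shortest", out_path],
        check=True,
    )

# mux_video()  # requires ffmpeg on PATH and the frame/audio files on disk
```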
It can be appreciated that, at present, more and more people choose one-to-one lessons with a foreign teacher in order to learn a more authentic and idiomatic foreign language (such as English). As foreign languages are used more and more widely, a private foreign teacher's advantage over a domestic teacher lies in spoken language, with inherent strengths in pronunciation accuracy and word usage. However, one-to-one lessons with a foreign teacher have great limitations in terms of learning time, learning place, learning cost, and so on. Moreover, when practising spoken language with a real person, users often feel shy, which lowers learning efficiency. This optional implementation can generate, based on the generated image sequence and the voice audio corresponding to the text information, a video showing the foreign teacher uttering that voice audio, so it can be applied to interactive foreign language learning scenarios: the user interacts with the generated video of the foreign teacher, which reduces the shyness felt during real-person spoken practice and relieves the limitations of one-to-one foreign teacher lessons in learning time, place, cost, and so on.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating video according to the present embodiment. In the application scenario of fig. 3, the terminal device 301 first obtains user interaction information 303 ("Good morning" in the figure) of the target user 302; the terminal device 301 then generates feedback information 304 ("Morning" in the figure) for the user interaction information 303 based on the user interaction information 303; and the terminal device 301 then generates, based on the feedback information 304, a video 305 for instructing a preset person to perform the action corresponding to the feedback information. Optionally, after the terminal device 301 generates the video 305, it may also play the video 305 to interact with the target user 302. Here, for example, the video 305 may show the preset person uttering the speech corresponding to the feedback information 304 (i.e., "Morning"); that is, the mouth shape of the preset person presented in the video 305 matches the speech corresponding to the feedback information 304.
According to the method provided by the embodiments of the present disclosure, the user interaction information of the target user is acquired, feedback information for the user interaction information is generated based on it, and a video for instructing a preset person to perform the action corresponding to the feedback information is then generated based on the feedback information. Information interaction with the user can therefore be carried out by generating video, which increases the diversity of interaction modes, helps avoid the shyness produced when the user interacts with a real person, and improves the user's expression ability.
In some optional implementations of this embodiment, the user interaction information may include a user video. Specifically, the user video may be a video obtained by image-capturing and voice-recording a target user. Thus, the executing body may execute the step 202 by:
in a case where the voice audio in the user video satisfies the preset intonation adjustment condition, the execution main body may generate feedback information for instructing the user to adjust the intonation of the audio.
Wherein, the preset intonation adjustment condition may include: the intonation of the voice audio in the user video does not match the intonation previously associated with the voice audio.
It can be understood that, in the case that the voice audio in the user video satisfies the preset intonation adjustment condition, the intonation of the voice audio of the target user is usually incorrect, and therefore, the above-mentioned alternative implementation manner may correct the intonation of the user by generating feedback information for instructing the user to adjust the intonation of the audio, so as to correct the pronunciation of the target user.
In some optional implementations of this embodiment, the user interaction information may include a user video. For example, the user video may be a video obtained by image-taking and voice-recording the target user. Thus, the executing body may further execute the step 202 by:
and generating feedback information for indicating the user to adjust the mouth shape under the condition that the mouth shape in the image in the user video accords with the preset mouth shape adjusting condition.
Wherein, the preset mouth shape adjusting condition may include: the mouth shape in the image in the user's video does not match the speech indicated by the speech audio (or audio frame) corresponding to the image.
It can be understood that, in the case that the mouth shape in the image in the user video conforms to the preset mouth shape adjustment condition, the pronunciation mouth shape of the target user is usually incorrect, and therefore, the above alternative implementation manner can correct the mouth shape of the user by generating feedback information for instructing the user to adjust the mouth shape so as to correct the pronunciation of the target user.
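A toy sketch covering both adjustment conditions above, assuming the user video has already been analysed into a detected tone sequence and per-image viseme (mouth-shape) labels; the label names, reference data, and feedback wording are illustrative assumptions, not values defined by the disclosure:

```python
# Illustrative checks for the preset intonation and mouth shape adjustment conditions.
# Tone labels, viseme labels, and the reference data are assumptions for the sketch.
from typing import List, Optional

def intonation_feedback(detected_tones: List[str], reference_tones: List[str]) -> Optional[str]:
    """Feedback when the intonation of the user's speech does not match the reference intonation."""
    if detected_tones != reference_tones:
        return "Please adjust your intonation: expected " + " ".join(reference_tones)
    return None

def mouth_shape_feedback(image_visemes: List[str], audio_visemes: List[str]) -> Optional[str]:
    """Feedback when the mouth shape in the user video does not match the corresponding speech."""
    for i, (seen, expected) in enumerate(zip(image_visemes, audio_visemes)):
        if seen != expected:
            return f"Please adjust your mouth shape around frame {i}: expected '{expected}'"
    return None

print(intonation_feedback(["rising", "falling"], ["falling", "falling"]))
print(mouth_shape_feedback(["closed", "open"], ["open", "open"]))
```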
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for generating a video is shown. The flow 400 of the method for generating a video comprises the steps of:
step 401, obtaining user interaction information of a target user.
In this embodiment, step 401 is substantially the same as step 201 in the corresponding embodiment of fig. 2, and is not described here again.
Step 402, generating feedback information aiming at the user interaction information based on the user interaction information.
In this embodiment, an execution subject (e.g., a server or a terminal device shown in fig. 1) of the method for generating a video may generate feedback information for the user interaction information based on the user interaction information. Wherein the feedback information is text information.
In this embodiment, step 402 is substantially the same as step 202 in the corresponding embodiment of fig. 2, and is not described herein again.
Step 403, determining phoneme information and intonation information corresponding to the text information.
In the present embodiment, the execution body may determine the phoneme information and intonation (tone) information corresponding to the text information.
As an example, the execution body may input text information to a phoneme and intonation training model trained in advance, thereby obtaining phoneme information and intonation information corresponding to the text information. The phoneme and intonation training model may be used to determine phoneme information and intonation information corresponding to the text information. For example, the phoneme and intonation training model can be trained by the following steps:
a set of training samples is obtained. The training samples in the training sample set comprise text information, phoneme information and intonation information corresponding to the text information. The phoneme information may indicate phonemes (e.g., monophonic or triphone, etc.) or may be a predetermined posterior probability (i.e., a numerical value that is not normalized) of each phoneme in the set of phonemes.
A phoneme and intonation training model is then trained with a machine learning algorithm, using the text information in the training sample set as input data and the phoneme information and intonation information corresponding to the input text information as expected output data.
It can be understood that, in the above implementation, the phoneme and intonation training model is obtained through training with a machine learning algorithm, which improves the accuracy of determining the phoneme information and intonation information.
Optionally, the execution body may also pre-establish a correspondence between text information and the phoneme information and intonation information corresponding to it, and then take the phoneme information and intonation information associated with the text information as the phoneme information and intonation information corresponding to the text information.
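As a minimal illustration of such a pre-established correspondence, a tiny lexicon lookup is sketched below; the lexicon entries, phoneme symbols, and tone labels are made up for the sketch, since the disclosure does not fix a particular phoneme set or tone inventory:

```python
# Illustrative lookup of phoneme and intonation information for text information.
# The tiny lexicon and tone labels are assumptions for the sketch.
LEXICON = {
    "good":    {"phonemes": ["g", "uh", "d"], "intonation": "flat"},
    "morning": {"phonemes": ["m", "ao", "r", "n", "ih", "ng"], "intonation": "falling"},
}

def phoneme_and_intonation(text: str):
    phonemes, intonations = [], []
    for word in text.lower().split():
        entry = LEXICON.get(word, {"phonemes": ["sil"], "intonation": "flat"})
        phonemes.extend(entry["phonemes"])
        intonations.append(entry["intonation"])
    return phonemes, intonations

print(phoneme_and_intonation("Good morning"))
```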
And step 404, generating a video for instructing a preset person to execute an action corresponding to the feedback information based on the phoneme information and the intonation information.
In this embodiment, the execution body may generate a video for instructing a preset person to execute an action corresponding to the feedback information based on the phoneme information and the intonation information.
In some optional implementations of this embodiment, the executing main body may execute the step 404 by:
first, based on phoneme information, the number of audio frames of speech audio to be generated and the number of images of an image sequence to be generated are determined. Wherein the number of audio frames is equal to the number of images.
It is to be understood that, since the number of phonemes indicated by the phoneme information determines the playing duration of the speech audio, the number of audio frames of the speech audio to be generated may be determined from the number of phonemes indicated by the phoneme information.
Then, based on the phoneme information and the intonation information, the speech audio and the image sequence are generated. The speech audio contains the determined number of audio frames, the image sequence contains the determined number of images, and the image sequence shows the preset person performing the action corresponding to the speech audio. As an example, the execution body may first generate the speech audio based on the phoneme information and the intonation information, and then generate an image sequence showing the preset person performing the action corresponding to the speech audio (e.g., uttering the speech indicated by the speech audio).
And finally, synthesizing the voice audio and the image sequence to obtain a video for indicating a preset person to execute the action corresponding to the voice audio.
It can be understood that the above alternative implementation generates the video for instructing the preset person to perform the action corresponding to the voice audio from equal numbers of audio frames and images, which further improves how well the audio and the images match during playback.
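A numeric sketch of this frame-count bookkeeping, assuming a fixed per-phoneme duration, a 10 ms audio frame shift, and placeholder audio frames and images; all constants are assumptions of the sketch rather than values given in the disclosure:

```python
# Illustrative computation of equal audio frame and image counts from phoneme information.
# The per-phoneme duration, frame shift, and placeholder audio/images are assumptions.
import numpy as np

PHONEME_DURATION_S = 0.08   # assumed average duration of one phoneme
FRAME_SHIFT_S = 0.01        # assumed audio frame shift

def plan_counts(phonemes):
    duration = len(phonemes) * PHONEME_DURATION_S          # playing duration of the speech audio
    n_audio_frames = int(round(duration / FRAME_SHIFT_S))
    n_images = n_audio_frames                              # number of images equals number of audio frames
    return n_audio_frames, n_images

def synthesize(phonemes, intonation):
    n_audio, n_images = plan_counts(phonemes)
    audio_frames = [np.zeros(160) for _ in range(n_audio)]     # placeholder speech audio frames
    images = [np.zeros((64, 64)) for _ in range(n_images)]     # placeholder mouth-shape images
    # One-to-one pairing: image i shows the preset person uttering audio frame i.
    return list(zip(audio_frames, images))

video = synthesize(["g", "uh", "d"], "flat")
print(len(video))  # 24 paired (audio frame, image) units
```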
It should be noted that, besides the above-mentioned contents, the embodiment of the present application may further include the same or similar features and effects as the embodiment corresponding to fig. 2, and details are not repeated herein.
As can be seen from fig. 4, the flow 400 of the method for generating a video in the present embodiment may generate a video for instructing a preset person to perform an action corresponding to the feedback information, by using the phoneme information and the intonation information, thereby helping to improve the degree of matching between an image and a voice in the generated video.
In some application scenarios of the above alternative implementation, audio frames in the speech audio correspond to images in the image sequence one-to-one. Mouth shape characterization of images in an image sequence: the preset person utters a voice indicated by the audio frame corresponding to the image.
It can be understood that in the above application scenario, the mouth shape of each image in the image sequence represents the preset person uttering the speech indicated by the audio frame corresponding to that image, so that the mouth shapes in the finally generated video better match the speech audio, further improving the degree of matching between images and speech in the generated video.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of an apparatus for generating a video, the apparatus embodiment corresponding to the method embodiment shown in fig. 2, and the apparatus embodiment may include the same or corresponding features as the method embodiment shown in fig. 2 and produce the same or corresponding effects as the method embodiment shown in fig. 2, in addition to the features described below. The device can be applied to various electronic equipment.
As shown in fig. 5, the apparatus 500 for generating a video of the present embodiment includes: an obtaining unit 501 configured to obtain user interaction information of a target user; a first generating unit 502 configured to generate feedback information for the user interaction information based on the user interaction information; a second generating unit 503 configured to generate a video for instructing a preset person to perform an action corresponding to the feedback information based on the feedback information.
In this embodiment, the obtaining unit 501 of the apparatus 500 for generating a video may obtain the user interaction information of the target user from other electronic devices through a wired connection manner or a wireless connection manner, or locally.
In this embodiment, based on the user interaction information acquired by the acquisition unit 501, the first generation unit 502 may generate feedback information for the user interaction information.
In this embodiment, based on the feedback information generated by the first generating unit 502, the second generating unit 503 may generate a video for instructing a preset person to perform an action corresponding to the feedback information.
In some optional implementations of this embodiment, the feedback information is text information; and, the second generating unit 503 includes: a first determining subunit (not shown in the figure) configured to determine phoneme information and intonation information corresponding to the text information; a first generating subunit (not shown in the figure) configured to generate a video for instructing a preset person to perform an action corresponding to the feedback information, based on the phoneme information and the intonation information.
In some optional implementations of this embodiment, the first generating subunit includes: a determining module (not shown in the figure) configured to determine, based on the phoneme information, the number of audio frames of the speech audio to be generated and the number of images of the image sequence to be generated, wherein the number of audio frames is equal to the number of images; a generating module (not shown in the figure) configured to generate speech audio and an image sequence based on the phoneme information and the intonation information, wherein the speech audio contains the determined number of audio frames, the image sequence contains the determined number of images, and the image sequence shows the preset person performing the action corresponding to the speech audio; and a synthesis module (not shown in the figure) configured to synthesize the speech audio and the image sequence to obtain a video for instructing the preset person to perform the action corresponding to the speech audio.
In some optional implementations of this embodiment, the audio frames in the speech audio correspond to the images in the image sequence one-to-one, and the mouth shape of the images in the image sequence represents: the preset person utters a voice indicated by the audio frame corresponding to the image.
In some optional implementations of this embodiment, the user interaction information includes a user video; and the first generating unit includes: a second generating subunit (not shown in the figure) configured to generate, in response to the voice audio in the user video satisfying a preset intonation adjustment condition, feedback information for instructing the user to adjust the intonation of the audio; and a third generating subunit (not shown in the figure) configured to generate, in response to the mouth shape in an image in the user video satisfying a preset mouth shape adjustment condition, feedback information for instructing the user to adjust the mouth shape.
In some optional implementations of this embodiment, the second generating unit includes: a second determining subunit (not shown in the figure) configured to determine, in response to the user interaction information including the voice audio, an emotion category to which the voice audio belongs from the predetermined emotion category set; and a fourth generating subunit (not shown in the figure) configured to generate a video for instructing the preset person to perform a target action based on the feedback information and the determined emotion category, wherein the target action corresponds to the emotion indicated by the feedback information and the determined emotion category.
In some optional implementations of this embodiment, the user interaction information includes foreign language voice audio, and the preset person is a foreign teacher; and the second generating unit includes: an input subunit (not shown in the figure) configured to, in response to the feedback information being text information, input the text information into a generative model trained in advance for the foreign teacher, generating voice audio corresponding to the text information and an image sequence showing the foreign teacher uttering the generated voice audio, wherein the generative model is used to generate voice audio corresponding to input text information and an image sequence showing the foreign teacher uttering that voice audio; and a fifth generating subunit (not shown in the figure) configured to generate, based on the generated image sequence and the voice audio corresponding to the text information, a video showing the foreign teacher uttering the voice audio corresponding to the text information.
In some optional implementations of this embodiment, the generative model is trained by: acquiring a target video, wherein the target video is obtained by capturing images of and recording the voice of the foreign teacher, and the playing duration of the target video is greater than or equal to a preset threshold; extracting matched images and audio frames from the target video to obtain a training sample set, wherein a training sample in the training sample set includes an audio frame, the image matched with the audio frame, and the text information corresponding to the audio frame; and training a generative model with a machine learning algorithm, using the text information included in the training samples as input data and the audio frames and images included in the training samples as expected output data.
In the apparatus provided by the above embodiment of the present disclosure, the obtaining unit 501 obtains the user interaction information of the target user, the first generating unit 502 generates feedback information for the user interaction information based on it, and the second generating unit 503 then generates, based on the feedback information, a video for instructing a preset person to perform the action corresponding to the feedback information. Information interaction with the user can thus be carried out by generating video, which increases the diversity of interaction modes, helps avoid the shyness produced when the user interacts with a real person, and improves the user's expression ability.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use with the electronic device implementing embodiments of the present disclosure. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, and the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as necessary. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read therefrom is installed into the storage section 608 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The above-described functions defined in the method of the present disclosure are performed when the computer program is executed by a Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Python, Java, Smalltalk, or C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In accordance with one or more embodiments of the present disclosure, there is provided a method for generating a video, the method comprising: acquiring user interaction information of a target user; generating feedback information aiming at the user interaction information based on the user interaction information; and generating a video for instructing the preset person to perform the action corresponding to the feedback information based on the feedback information.
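As a minimal sketch of the three-step flow summarized above (acquire interaction information, generate feedback, generate a video), the Python snippet below chains placeholder functions; every function name is hypothetical, and the real dialogue and synthesis logic is stubbed out.

def acquire_user_interaction_info(source):
    # Could be text, recorded audio, or a short user video in a real system.
    return source

def generate_feedback(user_info):
    # Placeholder for the dialogue / evaluation logic that produces feedback.
    return f"Feedback for: {user_info}"

def generate_instruction_video(feedback):
    # Placeholder for synthesizing voice audio and an image sequence of the preset person.
    return {"feedback": feedback, "audio": None, "frames": []}

video = generate_instruction_video(generate_feedback(acquire_user_interaction_info("Hello")))
print(video["feedback"])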
According to one or more embodiments of the present disclosure, in a method for generating a video provided by the present disclosure, feedback information is text information; and generating a video for instructing a preset person to perform an action corresponding to the feedback information based on the feedback information, including: determining phoneme information and intonation information corresponding to the text information; and generating a video for instructing a preset person to perform an action corresponding to the feedback information based on the phoneme information and the intonation information.
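For illustration only, the Python sketch below derives phoneme and intonation information from feedback text; the patent does not specify how either is determined, so a toy lexicon stands in for grapheme-to-phoneme conversion and a punctuation rule stands in for prosody prediction.

TOY_LEXICON = {"please": ["P", "L", "IY", "Z"], "repeat": ["R", "IH", "P", "IY", "T"]}

def text_to_phonemes(text):
    phonemes = []
    for word in text.lower().strip(".!?").split():
        phonemes.extend(TOY_LEXICON.get(word, list(word.upper())))  # fallback: spell it out
    return phonemes

def text_to_intonation(text):
    # Crude rule: rising contour for questions, falling otherwise.
    return "rising" if text.strip().endswith("?") else "falling"

text = "Please repeat?"
print(text_to_phonemes(text), text_to_intonation(text))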
According to one or more embodiments of the present disclosure, in the method for generating a video provided by the present disclosure, generating, based on the phoneme information and the intonation information, a video for instructing the preset person to perform the action corresponding to the feedback information includes: determining, based on the phoneme information, the number of audio frames of the voice audio to be generated and the number of images of the image sequence to be generated, wherein the number of audio frames is equal to the number of images; generating voice audio and an image sequence based on the phoneme information and the intonation information, wherein the number of audio frames included in the voice audio is the determined number of audio frames, the number of images in the image sequence is the determined number of images, and the image sequence instructs the preset person to perform the action corresponding to the voice audio; and synthesizing the voice audio and the image sequence to obtain the video for instructing the preset person to perform the action corresponding to the voice audio.
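The bookkeeping described above, deciding a frame count from the phoneme information and keeping the audio-frame count equal to the image count, might look roughly like the Python sketch below; the constant FRAMES_PER_PHONEME and the string placeholders for frames are assumptions.

FRAMES_PER_PHONEME = 5  # assumed constant; the patent does not fix a duration model

def plan_frame_counts(phonemes):
    n_audio_frames = len(phonemes) * FRAMES_PER_PHONEME
    n_images = n_audio_frames  # kept equal so audio and video stay aligned
    return n_audio_frames, n_images

def synthesize(phonemes, intonation):
    n_audio_frames, n_images = plan_frame_counts(phonemes)
    audio_frames = [f"audio<{p}:{intonation}>" for p in phonemes for _ in range(FRAMES_PER_PHONEME)]
    images = [f"mouth<{p}>" for p in phonemes for _ in range(FRAMES_PER_PHONEME)]
    assert len(audio_frames) == n_audio_frames and len(images) == n_images
    return list(zip(audio_frames, images))  # paired (audio frame, image) track

print(len(synthesize(["P", "L", "IY", "Z"], "falling")), "paired frames")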
According to one or more embodiments of the present disclosure, in the method for generating a video provided by the present disclosure, the audio frames in the voice audio correspond one-to-one to the images in the image sequence, and the mouth shape in an image of the image sequence represents that the preset person is uttering the voice indicated by the audio frame corresponding to that image.
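A small worked example of this one-to-one alignment, under assumed parameters (16 kHz audio, 25 fps video) that the patent itself does not fix: each image spans 640 audio samples, so an audio position can be mapped back to its image.

SAMPLE_RATE = 16_000
FPS = 25
SAMPLES_PER_AUDIO_FRAME = SAMPLE_RATE // FPS  # 640 samples span exactly one image

def image_index_for_sample(sample_index: int) -> int:
    # Map an audio sample position to the image it is aligned with.
    return sample_index // SAMPLES_PER_AUDIO_FRAME

print(SAMPLES_PER_AUDIO_FRAME)        # 640
print(image_index_for_sample(1_599))  # 2: the third image covers samples 1280-1919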
According to one or more embodiments of the present disclosure, in the method for generating a video provided by the present disclosure, the user interaction information includes a user video; and generating feedback information for the user interaction information based on the user interaction information includes: generating, in response to the voice audio in the user video satisfying a preset intonation adjustment condition, feedback information for instructing the user to adjust the intonation of the audio; and generating, in response to the mouth shape in an image of the user video satisfying a preset mouth shape adjustment condition, feedback information for instructing the user to adjust the mouth shape.
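The two feedback rules above could be sketched as follows; the concrete intonation and mouth-shape adjustment conditions are left open by the patent, so the pitch band and mouth-opening threshold used here are illustrative assumptions.

PITCH_RANGE_HZ = (80.0, 300.0)  # assumed acceptable pitch band
MIN_MOUTH_OPENING = 0.15        # assumed normalized mouth-opening threshold

def feedback_for_user_video(mean_pitch_hz, mouth_opening):
    feedback = []
    if not (PITCH_RANGE_HZ[0] <= mean_pitch_hz <= PITCH_RANGE_HZ[1]):
        feedback.append("Please adjust the intonation of your speech.")
    if mouth_opening < MIN_MOUTH_OPENING:
        feedback.append("Please open your mouth wider for this sound.")
    return feedback

print(feedback_for_user_video(mean_pitch_hz=350.0, mouth_opening=0.05))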
According to one or more embodiments of the present disclosure, in the method for generating a video provided by the present disclosure, generating, based on the feedback information, a video for instructing the preset person to perform the action corresponding to the feedback information includes: determining, in response to the user interaction information including voice audio, the emotion category to which the voice audio belongs from a predetermined emotion category set; and generating, based on the feedback information and the determined emotion category, a video for instructing the preset person to perform a target action, wherein the target action corresponds to the feedback information and to the emotion indicated by the determined emotion category.
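As a hedged sketch of the emotion-category step, the snippet below picks a category from a small predetermined set and lets the target action depend on both the feedback and the emotion; the category set, the energy-based heuristic, and the function names are assumptions rather than anything specified in the patent.

import numpy as np

EMOTION_CATEGORIES = ("calm", "excited")  # example predetermined category set

def classify_emotion(audio: np.ndarray) -> str:
    # Energy-based heuristic standing in for a trained classifier.
    rms = float(np.sqrt(np.mean(np.square(audio)))) if audio.size else 0.0
    return "excited" if rms > 0.1 else "calm"

def choose_target_action(feedback: str, emotion: str) -> str:
    # The target action corresponds to both the feedback and the emotion.
    style = "lively" if emotion == "excited" else "gentle"
    return f"say '{feedback}' with a {style} expression"

audio = np.random.default_rng(0).normal(0.0, 0.2, 16_000).astype(np.float32)
print(choose_target_action("Well done!", classify_emotion(audio)))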
According to one or more embodiments of the present disclosure, in the method for generating a video provided by the present disclosure, the user interaction information includes foreign-language voice audio, and the preset person is a foreign teacher; and generating, based on the feedback information, a video for instructing the preset person to perform the action corresponding to the feedback information includes: in response to the feedback information being text information, inputting the text information into a generation model pre-trained for the foreign teacher, to generate voice audio corresponding to the text information and an image sequence for instructing the foreign teacher to utter the generated voice audio, wherein the generation model is used for generating the voice audio corresponding to the input text information and the image sequence for instructing the foreign teacher to utter that voice audio; and generating, based on the generated image sequence and the voice audio corresponding to the text information, a video for instructing the foreign teacher to utter the voice audio corresponding to the text information.
According to one or more embodiments of the present disclosure, in the method for generating a video provided by the present disclosure, the generation model is trained by the following steps: acquiring a target video, wherein the target video is obtained by filming the foreign teacher and recording his or her speech, and the playing duration of the target video is greater than or equal to a preset threshold; extracting matched images and audio frames from the target video to obtain a training sample set, wherein a training sample in the training sample set includes an audio frame, the image matched with the audio frame, and text information corresponding to the audio frame; and training, by using a machine learning algorithm, with the text information included in the training samples as input data and the audio frames and images included in the training samples as expected output data, to obtain the generation model.
In accordance with one or more embodiments of the present disclosure, there is provided an apparatus for generating a video, the apparatus including: an acquisition unit configured to acquire user interaction information of a target user; a first generating unit configured to generate feedback information for the user interaction information based on the user interaction information; and a second generating unit configured to generate a video for instructing a preset person to perform an action corresponding to the feedback information, based on the feedback information.
According to one or more embodiments of the present disclosure, in an apparatus for generating a video provided by the present disclosure, the feedback information is text information; and the second generating unit includes: a first determining subunit configured to determine phoneme information and intonation information corresponding to the text information; a first generation subunit configured to generate a video for instructing a preset person to perform an action corresponding to the feedback information, based on the phoneme information and the intonation information.
According to one or more embodiments of the present disclosure, in the apparatus for generating a video provided by the present disclosure, the first generation subunit includes: a determining module configured to determine, based on the phoneme information, the number of audio frames of the voice audio to be generated and the number of images of the image sequence to be generated, wherein the number of audio frames is equal to the number of images; a generating module configured to generate voice audio and an image sequence based on the phoneme information and the intonation information, wherein the number of audio frames included in the voice audio is the determined number of audio frames, the number of images in the image sequence is the determined number of images, and the image sequence instructs the preset person to perform the action corresponding to the voice audio; and a synthesis module configured to synthesize the voice audio and the image sequence to obtain a video for instructing the preset person to perform the action corresponding to the voice audio.
According to one or more embodiments of the present disclosure, in the apparatus for generating a video provided by the present disclosure, the audio frames in the voice audio correspond one-to-one to the images in the image sequence, and the mouth shape in an image of the image sequence represents that the preset person is uttering the voice indicated by the audio frame corresponding to that image.
According to one or more embodiments of the present disclosure, in the apparatus for generating a video provided by the present disclosure, the user interaction information includes a user video; and the first generation unit includes: a second generation subunit configured to generate, in response to the voice audio in the user video satisfying a preset intonation adjustment condition, feedback information for instructing the user to adjust the intonation of the audio; and a third generation subunit configured to generate, in response to the mouth shape in an image of the user video satisfying a preset mouth shape adjustment condition, feedback information for instructing the user to adjust the mouth shape.
According to one or more embodiments of the present disclosure, in the apparatus for generating a video provided by the present disclosure, the second generating unit includes: a second determining subunit configured to determine, in response to the user interaction information including voice audio, the emotion category to which the voice audio belongs from the predetermined emotion category set; and a fourth generating subunit configured to generate, based on the feedback information and the determined emotion category, a video for instructing the preset person to perform a target action, wherein the target action corresponds to the feedback information and to the emotion indicated by the determined emotion category.
According to one or more embodiments of the present disclosure, in the apparatus for generating a video provided by the present disclosure, the user interaction information includes foreign-language voice audio, and the preset person is a foreign teacher; and the second generating unit includes: an input subunit configured to, in response to the feedback information being text information, input the text information into the generation model pre-trained for the foreign teacher, to obtain voice audio corresponding to the text information and an image sequence for instructing the foreign teacher to utter the generated voice audio, wherein the generation model is used to generate the voice audio corresponding to the input text information and the image sequence for instructing the foreign teacher to utter that voice audio; and a fifth generating subunit configured to generate, based on the generated image sequence and the voice audio corresponding to the text information, a video for instructing the foreign teacher to utter the voice audio corresponding to the text information.
According to one or more embodiments of the present disclosure, in the apparatus for generating a video provided by the present disclosure, the generation model is trained by: acquiring a target video, wherein the target video is obtained by filming the foreign teacher and recording his or her speech, and the playing duration of the target video is greater than or equal to a preset threshold; extracting matched images and audio frames from the target video to obtain a training sample set, wherein a training sample in the training sample set includes an audio frame, the image matched with the audio frame, and text information corresponding to the audio frame; and training, by using a machine learning algorithm, with the text information included in the training samples as input data and the audio frames and images included in the training samples as expected output data, to obtain the generation model.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a first generation unit, and a second generation unit. The names of these units do not in some cases form a limitation on the units themselves, and for example, the acquiring unit may also be described as a "unit that acquires user interaction information of a target user".
As another aspect, the present disclosure also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring user interaction information of a target user; generating feedback information aiming at the user interaction information based on the user interaction information; and generating a video for instructing the preset person to perform the action corresponding to the feedback information based on the feedback information.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure. For example, the above features and (but not limited to) the features disclosed in this disclosure having similar functions are replaced with each other to form the technical solution.

Claims (11)

1. A method for generating video, comprising:
acquiring user interaction information of a target user;
generating feedback information aiming at the user interaction information based on the user interaction information;
and generating a video for instructing a preset person to execute an action corresponding to the feedback information based on the feedback information.
2. The method of claim 1, wherein the feedback information is textual information; and
the generating of the video for instructing a preset person to execute the action corresponding to the feedback information based on the feedback information includes:
determining phoneme information and intonation information corresponding to the text information;
and generating a video for instructing a preset person to execute an action corresponding to the feedback information based on the phoneme information and the intonation information.
3. The method of claim 2, wherein the generating a video for instructing a preset person to perform an action corresponding to the feedback information based on the phoneme information and the intonation information comprises:
determining the number of audio frames of the voice audio to be generated and the number of images of the image sequence to be generated based on the phoneme information, wherein the number of audio frames is equal to the number of images;
generating voice audio and an image sequence based on the phoneme information and the intonation information, wherein the number of audio frames included in the voice audio is the determined number of audio frames, the number of images in the image sequence is the determined number of images, and the image sequence instructs the preset person to perform the action corresponding to the voice audio;
and synthesizing the voice audio and the image sequence to obtain a video for instructing the preset person to perform the action corresponding to the voice audio.
4. The method of claim 3, wherein the audio frames in the voice audio correspond one-to-one to the images in the image sequence, and the mouth shape of an image in the image sequence represents that the preset person utters the voice indicated by the audio frame corresponding to the image.
5. The method of one of claims 1-4, wherein the user interaction information comprises user video; and
the generating feedback information for the user interaction information based on the user interaction information comprises:
generating, in response to the voice audio in the user video satisfying a preset intonation adjustment condition, feedback information for instructing the user to adjust the intonation of the audio;
and generating, in response to the mouth shape in an image of the user video satisfying a preset mouth shape adjustment condition, feedback information for instructing the user to adjust the mouth shape.
6. The method according to one of claims 1 to 4, wherein the generating, based on the feedback information, a video for instructing a preset person to perform an action corresponding to the feedback information comprises:
in response to the user interaction information comprising voice audio, determining an emotion category to which the voice audio belongs from a predetermined emotion category set;
and generating, based on the feedback information and the determined emotion category, a video for instructing the preset person to perform a target action, wherein the target action corresponds to the feedback information and the emotion indicated by the determined emotion category.
7. The method according to one of claims 1 to 4, wherein the user interaction information comprises foreign-language voice audio, and the preset person is a foreign teacher; and
the generating of the video for instructing a preset person to execute the action corresponding to the feedback information based on the feedback information includes:
in response to the feedback information being text information, inputting the text information into a generation model pre-trained for the foreign teacher, to generate voice audio corresponding to the text information and an image sequence for instructing the foreign teacher to utter the generated voice audio, wherein the generation model is used for generating the voice audio corresponding to the input text information and the image sequence for instructing the foreign teacher to utter the voice audio corresponding to the input text information;
and generating, based on the generated image sequence and the voice audio corresponding to the text information, a video for instructing the foreign teacher to utter the voice audio corresponding to the text information.
8. The method of claim 7, wherein the generative model is trained by:
acquiring a target video, wherein the target video is obtained by filming the foreign teacher and recording his or her speech, and the playing duration of the target video is greater than or equal to a preset threshold;
extracting matched images and audio frames from the target video to obtain a training sample set, wherein the training samples in the training sample set comprise the audio frames, the images matched with the audio frames and text information corresponding to the audio frames;
and training, by using a machine learning algorithm, with the text information included in the training samples in the training sample set as input data and the audio frames and images included in the training samples as expected output data, to obtain the generation model.
9. An apparatus for generating video, comprising:
an acquisition unit configured to acquire user interaction information of a target user;
a first generating unit configured to generate feedback information for the user interaction information based on the user interaction information;
a second generating unit configured to generate a video for instructing a preset person to perform an action corresponding to the feedback information based on the feedback information.
10. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
11. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-8.
CN202010182273.8A 2020-03-16 2020-03-16 Method, apparatus, device and medium for generating video Pending CN111415662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010182273.8A CN111415662A (en) 2020-03-16 2020-03-16 Method, apparatus, device and medium for generating video

Publications (1)

Publication Number Publication Date
CN111415662A true CN111415662A (en) 2020-07-14

Family

ID=71493003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010182273.8A Pending CN111415662A (en) 2020-03-16 2020-03-16 Method, apparatus, device and medium for generating video

Country Status (1)

Country Link
CN (1) CN111415662A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063903A (en) * 2010-09-25 2011-05-18 中国科学院深圳先进技术研究院 Speech interactive training system and speech interactive training method
CN107340859A (en) * 2017-06-14 2017-11-10 北京光年无限科技有限公司 The multi-modal exchange method and system of multi-modal virtual robot
CN108326855A (en) * 2018-01-26 2018-07-27 上海器魂智能科技有限公司 A kind of exchange method of robot, device, equipment and storage medium
US20190251859A1 (en) * 2018-02-15 2019-08-15 International Business Machines Corporation Customer care training with situational feedback generation
CN110688911A (en) * 2019-09-05 2020-01-14 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735371A (en) * 2020-12-28 2021-04-30 出门问问(苏州)信息科技有限公司 Method and device for generating speaker video based on text information
CN112735371B (en) * 2020-12-28 2023-08-04 北京羽扇智信息科技有限公司 Method and device for generating speaker video based on text information

Similar Documents

Publication Publication Date Title
CN111415677B (en) Method, apparatus, device and medium for generating video
CN107945786B (en) Speech synthesis method and device
CN111432233B (en) Method, apparatus, device and medium for generating video
US11475897B2 (en) Method and apparatus for response using voice matching user category
US20230223010A1 (en) Video generation method, generation model training method and apparatus, and medium and device
CN108962255B (en) Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
CN110347867B (en) Method and device for generating lip motion video
CN107657017A (en) Method and apparatus for providing voice service
CN111899719A (en) Method, apparatus, device and medium for generating audio
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN107481715B (en) Method and apparatus for generating information
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
WO2022170848A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN110880198A (en) Animation generation method and device
CN109754783A (en) Method and apparatus for determining the boundary of audio sentence
CN111402842A (en) Method, apparatus, device and medium for generating audio
CN109582825B (en) Method and apparatus for generating information
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN109697978B (en) Method and apparatus for generating a model
CN114882861A (en) Voice generation method, device, equipment, medium and product
CN113223555A (en) Video generation method and device, storage medium and electronic equipment
US20210407504A1 (en) Generation and operation of artificial intelligence based conversation systems
CN112381926A (en) Method and apparatus for generating video
CN117635383A (en) Virtual teacher and multi-person cooperative talent training system, method and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination