CN117275485B - Audio and video generation method, device, equipment and storage medium - Google Patents

Audio and video generation method, device, equipment and storage medium

Info

Publication number
CN117275485B
CN117275485B, CN202311560630.XA
Authority
CN
China
Prior art keywords
mouth shape
phoneme
video
phonemes
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311560630.XA
Other languages
Chinese (zh)
Other versions
CN117275485A (en)
Inventor
廖少毅
陈钧浩
董伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yidong Huanqiu Shenzhen Digital Technology Co ltd
Original Assignee
Yidong Huanqiu Shenzhen Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yidong Huanqiu Shenzhen Digital Technology Co ltd filed Critical Yidong Huanqiu Shenzhen Digital Technology Co ltd
Priority to CN202311560630.XA
Publication of CN117275485A
Application granted
Publication of CN117275485B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 Transforming into visible information
    • G10L 21/18 Details of the transformation process
    • G10L 2021/105 Synthesis of the lips movements from speech, e.g. for talking heads
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/265 Mixing

Abstract

The embodiment of the application discloses an audio and video generation method, device, equipment and storage medium. The audio and video generation method comprises the following steps: obtaining reply voice data fed back for collected voice data and the phonemes contained in the reply voice data; since each phoneme corresponds to one mouth shape adjustment parameter, obtaining the mouth shape adjustment parameter corresponding to each phoneme; generating a video segment for every two adjacent phonemes based on the mouth shape adjustment parameters corresponding to the two phonemes; splicing the video segments corresponding to every two adjacent phonemes according to the time sequence of the phonemes in the reply voice data to obtain a digital human video matched with the reply voice data; and constructing an audio and video based on the digital human video and the reply voice data, and playing the audio and video. With the embodiments of the invention, audio and video generation is completed directly at the front end, the requirements on network bandwidth and back-end server performance are reduced, and digital human deployment becomes easier and can be widely popularized and used.

Description

Audio and video generation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a device, and a storage medium for generating audio and video.
Background
With the rapid development of artificial intelligence technology, human-machine conversation has moved from the once inconceivable to reality, and its forms of presentation have become increasingly diversified. One form of human-machine conversation replies to the user's voice through a constructed digital human figure, and the digital human makes the corresponding mouth shapes and limb movements along with the reply content. However, the video clips of the digital human's mouth shape changes and limb movements are generated at the back end, and the generated video clips and the reply voice are transmitted to the front end for playing. This approach requires large network bandwidth and high network quality, makes it difficult to support many users at the same time, and places extremely high demands on the back-end server's performance when generating high-resolution video. Therefore, how to reduce the network bandwidth requirement and the performance requirement of the back-end server, so that digital humans are easier to deploy and can be widely popularized and used, is a problem to be solved.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present application is to provide an audio and video generation method, apparatus, device, and storage medium that reduce the network bandwidth requirement and the performance requirement of the back-end server, so that digital humans are easier to deploy and can be widely popularized and used.
In a first aspect, an embodiment of the present application provides an audio and video generating method, including:
acquiring reply voice data fed back for the acquired voice data and one or more phonemes contained in the reply voice data;
obtaining a mouth shape adjustment parameter corresponding to each phoneme; wherein the mouth shape adjustment parameter is used for indicating the adjustment parameters required to adjust the mouth shape of the digital person from a preset mouth shape to the mouth shape corresponding to the corresponding phoneme;
generating video clips corresponding to every two adjacent phonemes based on the mouth shape adjustment parameters corresponding to every two adjacent phonemes in the one or more phonemes; the video clip is used for representing the change from the mouth shape corresponding to the former phoneme to the mouth shape corresponding to the latter phoneme in every two adjacent phonemes;
splicing video clips corresponding to each two adjacent phonemes according to the time sequence of one or more phonemes in the reply voice data to obtain digital human video matched with the reply voice data;
and constructing an audio and video based on the digital human video and the reply voice data, and playing the audio and video.
It can be seen that, in this embodiment of the present application, the reply voice data fed back for the collected voice data and the one or more phonemes contained in the reply voice data are obtained. Since one phoneme corresponds to one mouth shape adjustment parameter, the mouth shape adjustment parameter corresponding to each phoneme can be obtained; based on the mouth shape adjustment parameters corresponding to every two adjacent phonemes, a video segment corresponding to every two adjacent phonemes is generated at the front end; the video segments corresponding to every two adjacent phonemes are spliced according to the time sequence of the phonemes in the reply voice data to obtain a digital human video matched with the reply voice data; and an audio and video is constructed based on the digital human video and the reply voice data and played. In this way, the generation of the audio and video is completed directly at the front end, the requirements on network bandwidth and back-end server performance are reduced, and digital humans are easier to deploy and can be widely popularized and used.
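For illustration only (this is not part of the claims), the steps of the first aspect could be organized on the front end roughly as in the following sketch; every type name, function name, and signature here is a hypothetical placeholder rather than the disclosed implementation.

```typescript
// A minimal sketch of the claimed steps; all names and signatures are hypothetical.
interface MouthShapeParams { displacement: number[]; rotation: number[]; scale: number[]; }
interface Frame { keypoints: number[][]; }   // one rendered image of the digital person
type VideoSegment = Frame[];

declare function fetchReply(voice: Blob): Promise<{ replyAudio: AudioBuffer; phonemes: string[] }>;
declare function lookupMouthParams(phoneme: string): MouthShapeParams;
declare function buildSegment(prev: MouthShapeParams, next: MouthShapeParams): VideoSegment;
declare function playAudioVideo(frames: Frame[], audio: AudioBuffer): void;

async function generateAudioVideo(collectedVoice: Blob): Promise<void> {
  // Step 1: obtain the reply voice data and the phonemes it contains.
  const { replyAudio, phonemes } = await fetchReply(collectedVoice);
  // Step 2: one mouth shape adjustment parameter per phoneme.
  const params = phonemes.map(lookupMouthParams);
  // Step 3: a video segment for every two adjacent phonemes.
  const segments: VideoSegment[] = [];
  for (let i = 0; i + 1 < params.length; i++) {
    segments.push(buildSegment(params[i], params[i + 1]));
  }
  // Step 4: splice the segments in the time order of the phonemes.
  const digitalHumanVideo = segments.flat();
  // Step 5: construct the audio and video and play it.
  playAudioVideo(digitalHumanVideo, replyAudio);
}
```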
In an alternative embodiment, generating a video segment corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes includes:
acquiring the sounding time length from the former phoneme to the latter phoneme in each two adjacent phonemes;
determining that the video duration of the video clips corresponding to each two adjacent phonemes is the same as the sounding duration;
determining the number of image frames contained in the video clip based on the video duration; the number of image frames contained in the video clip and the video duration show positive correlation trend;
generating a multi-frame image contained in the video clip based on the number of image frames contained in the video clip; the mouth shape of the digital person contained in the first frame image in the multi-frame images refers to the mouth shape corresponding to the previous phoneme, and the mouth shape of the digital person contained in the last frame image in the multi-frame images refers to the mouth shape corresponding to the next phoneme;
and splicing the multi-frame images to obtain video clips corresponding to each two adjacent phonemes.
In an alternative embodiment, generating the video segment corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes includes:
Acquiring the sounding time length from the former phoneme to the latter phoneme in each two adjacent phonemes;
determining a mouth shape adjustment step length based on the sounding time length and a preset mouth shape adjustment time period;
generating a target image corresponding to the previous phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the previous phoneme refers to the mouth shape corresponding to the previous phoneme;
adjusting the mouth shape corresponding to the previous phoneme based on the mouth shape adjustment step length to obtain a target image, wherein the mouth shape of the digital person contained in the target image is a mouth shape obtained by changing the mouth shape corresponding to the previous phoneme;
based on the mouth shape adjustment step length, adjusting the mouth shape obtained by the previous change to obtain another target image;
and if the mouth shape of the digital person contained in the recently obtained target image is the same as the mouth shape corresponding to the next phoneme, splicing all the target images to obtain the video clips corresponding to each two adjacent phonemes.
In an alternative embodiment, generating the video segment corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes includes:
Generating a target image corresponding to the previous phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the previous phoneme refers to the mouth shape corresponding to the previous phoneme;
generating an intermediate image based on the mouth shape adjustment parameters corresponding to the previous phoneme in every two adjacent phonemes, wherein the mouth shape of the digital person contained in the intermediate image refers to: changing the mouth shape corresponding to the previous phoneme to a target mouth shape, wherein the similarity between the target mouth shape and the preset mouth shape reaches a preset similarity threshold;
generating a target image corresponding to the latter phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the next phoneme refers to the mouth shape corresponding to the next phoneme;
and splicing the target image corresponding to the previous phoneme, the intermediate image and the target image corresponding to the next phoneme to obtain the video clips corresponding to each two adjacent phonemes.
In an alternative embodiment, the method further comprises:
collecting voice data of a target object;
interacting with a server to enable the server to analyze and process text data corresponding to the voice data and generate reply text data corresponding to the text data;
And obtaining the reply voice data corresponding to the reply text data.
In an alternative embodiment, the language type of the reply voice data is consistent with the language type of the collected voice data;
the method for obtaining the mouth shape adjustment parameters corresponding to each phoneme comprises the following steps:
acquiring the language type of the reply voice data; the language type is the language type appointed by the target object;
and acquiring the mouth shape adjusting parameters corresponding to each phoneme under the language type.
In a second aspect, an embodiment of the present application provides an audio/video generating device, where the device includes:
the acquisition unit is used for acquiring the reply voice data fed back for the acquired voice data and one or more phonemes contained in the reply voice data;
the acquisition unit is also used for acquiring the mouth shape adjustment parameter corresponding to each phoneme; wherein the mouth shape adjustment parameter is used for indicating the adjustment parameters required to adjust the mouth shape of the digital person from the preset mouth shape to the mouth shape corresponding to the corresponding phoneme;
the generating unit is used for generating video clips corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes; the video segment is used for representing that the mouth shape corresponding to the previous phoneme in each two adjacent phonemes is changed to the mouth shape corresponding to the next phoneme;
The splicing unit is used for splicing the video segments corresponding to each two adjacent phonemes according to the time sequence of the one or more phonemes in the reply voice data to obtain a digital human video matched with the reply voice data;
and the playing unit is used for constructing an audio and video based on the digital human video and the reply voice data and playing the audio and video.
In a third aspect, embodiments of the present application provide a computer device including a memory, a communication interface, and a processor, where the memory, the communication interface, and the processor are connected to each other; the memory stores a computer program and the processor invokes the computer program stored in the memory for implementing the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect described above.
In a fifth aspect, embodiments of the present application provide a computer program product comprising computer program code which, when run on a computer, causes the computer to perform the method of the first aspect described above.
In a sixth aspect, embodiments of the present application provide a computer program comprising computer program code which, when run on a computer, causes the computer to perform the method of the first aspect described above.
Drawings
In order to more clearly describe the embodiments of the present invention or the technical solutions in the background art, the following briefly describes the drawings required for the embodiments of the present invention or the background art.
Fig. 1 is a schematic system architecture diagram of an audio/video generating method according to an embodiment of the present application;
fig. 2 is a flowchart of an audio/video generating method provided in an embodiment of the present application;
fig. 3 is a schematic diagram of correspondence between each phoneme and a mouth shape according to an embodiment of the present application;
FIG. 4 is a flowchart of generating video clips corresponding to each two adjacent phonemes according to an embodiment of the present application;
FIG. 5 is a flowchart of another method for generating video segments corresponding to each two adjacent phonemes according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a mouth shape conversion provided in an embodiment of the present application;
fig. 7 is an application architecture diagram of an audio/video generating method according to an embodiment of the present application;
Fig. 8 is a schematic diagram of an audio/video generating device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
Referring to fig. 1, fig. 1 is a schematic system architecture diagram of an audio/video generation method according to an embodiment of the present application. As shown in fig. 1, the system may include a front end and a back end, where the front end includes at least one speaker and at least one microphone, and the front end may be any terminal device, a client, or a browser, and the audio/video generating method provided in the embodiments of the present application is executed at the front end.
The audio and video generation method can be applied to human-machine conversation: the voice of a target object is collected, the back-end server replies to the voice of the target object using artificial intelligence technology, the audio and video containing the digital human is generated in real time at the client by the audio and video generation method and played to the target object, and the mouth shape of the digital human contained in the audio and video corresponds to the reply content.
Referring to fig. 2, fig. 2 is a flowchart of an audio/video generating method provided in an embodiment of the present application, as shown in fig. 2.
S201, the reply voice data fed back for the collected voice data and one or more phonemes contained in the reply voice data are acquired.
In one embodiment, after the authorization agreement of the target object is obtained, voice data of the target object can be collected through a microphone; interaction is performed with the server, so that the server analyzes and processes text data corresponding to the voice data, reply text data corresponding to the text data is generated, and reply voice data corresponding to the reply text data is obtained. The front end can then recognize and extract the phonemes contained in the reply voice data transmitted by the server.
Text data corresponding to the voice data can be obtained through automatic speech recognition (ASR) technology; the reply text data corresponding to the text data can be obtained through an artificial intelligence model, such as a ChatGPT model; and the reply voice data corresponding to the reply text data can be obtained through a Text-to-Speech (TTS) model, which may include TTS models based on recurrent neural networks (RNN), such as Tacotron and Tacotron 2, and TTS models based on Variational Auto-Encoders (VAE), such as Deep Voice and Deep Voice 2.
In another embodiment, the server may analyze the text data corresponding to the voice data to generate reply text data corresponding to the text data, and obtain reply voice data corresponding to the reply text data and phonemes included in the reply voice data at the same time, and then send the reply voice data and phonemes included in the reply voice data to the front end.
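As a rough, non-authoritative sketch of the interaction described above, the front end could send the recorded voice to the server and receive the reply voice data together with its phonemes. The endpoint URL and the response shape below are assumptions for illustration, not the disclosed interface.

```typescript
// Hypothetical request/response shape; the real server interface is not specified here.
interface ReplyPayload {
  replyAudioUrl: string;                              // reply voice data synthesized by TTS
  phonemes: { symbol: string; startMs: number }[];    // phonemes with their sounding time points
}

async function requestReply(recorded: Blob): Promise<ReplyPayload> {
  const body = new FormData();
  body.append("voice", recorded, "input.webm");
  // The server runs ASR, the dialogue model and TTS, then returns reply voice plus phonemes.
  const resp = await fetch("/api/reply", { method: "POST", body });
  if (!resp.ok) throw new Error(`reply request failed: ${resp.status}`);
  return (await resp.json()) as ReplyPayload;
}
```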
S202, acquiring mouth shape adjustment parameters corresponding to each phoneme.
Phonemes are the smallest units of pronunciation in speech; the pronunciation of a character or word is composed of one or more syllables and a tone. The tone is not visible and is independent of the mouth shape, while a syllable is composed of one or more phonemes, and one phoneme corresponds to one mouth shape; please refer to fig. 3, which is a schematic diagram of the correspondence between phonemes and mouth shapes according to an embodiment of the present application. The mouth shape adjustment parameters may be used to indicate the adjustment parameters required to adjust the mouth shape of the digital person contained in the audio and video from a preset mouth shape to the mouth shape corresponding to the corresponding phoneme. The preset mouth shape of the digital person can be set when the figure of the digital person is initialized, and may be a closed mouth shape or a smiling mouth shape; the figure of the digital person may be a 3D figure.
Further, the phonemes of different language types are not exactly the same. For example, there are about 40 phonemes in English and about 100 phonemes in Japanese. These phonemes can be described and distinguished by different phonetic features such as place of articulation, manner of articulation, and tone. While phonemes of different language types may have some commonality, each language also has unique phonemes, because different language types perceive and express sound differently and therefore use different phonemes to represent speech. For example, some languages may have specific consonants or vowels that other languages do not.
In one embodiment, the language type of the reply voice data may be acquired, and based on the language type, the mouth shape adjustment parameters corresponding to each phoneme under that language type may be acquired, where the language type may be specified by the target object. For example, the target object may specify English as the language type of the reply voice data; the audio and video generation method obtains that the language type of the reply voice data specified by the user is English, and based on the English language type, obtains the mouth shape adjustment parameters corresponding to each phoneme of English.
In another embodiment, the language type of the reply voice data is consistent with the language type of the collected voice data of the target object, the collected voice data of the target object or the language type of the reply voice data can be obtained, and based on the obtained language type, the mouth shape adjustment parameters corresponding to each phoneme under the language type are obtained. For example, since the language type of the preset reply voice data is consistent with the language type of the collected voice data of the target object, the language type of the voice data of the target object is Chinese, the audio/video generation method can obtain that the language type of the voice data of the target object is Chinese, or the language type of the reply voice data fed back for the voice data of the target object is Chinese, and based on the Chinese language type, the mouth shape adjustment parameters corresponding to each phoneme of Chinese are obtained.
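The mouth shape adjustment parameters could, for example, be kept in a per-language lookup table keyed by phoneme, as in the sketch below; the phoneme symbols and parameter values are illustrative placeholders only, not values from this disclosure.

```typescript
// Hypothetical per-language table of mouth shape adjustment parameters.
interface MouthShapeParams { displacement: [number, number]; rotation: number; scale: number; }

const mouthParamTable: Record<string, Record<string, MouthShapeParams>> = {
  en: {
    AA: { displacement: [0, 0.8], rotation: 0, scale: 1.4 },  // wide-open vowel
    M:  { displacement: [0, 0.0], rotation: 0, scale: 0.9 },  // closed lips
  },
  zh: {
    a: { displacement: [0, 0.7], rotation: 0, scale: 1.3 },
    o: { displacement: [0, 0.5], rotation: 0, scale: 1.1 },
  },
};

function lookupMouthParams(language: string, phoneme: string): MouthShapeParams | undefined {
  return mouthParamTable[language]?.[phoneme];
}
```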
S203, generating video clips corresponding to every two adjacent phonemes based on the mouth shape adjustment parameters corresponding to every two adjacent phonemes in the one or more phonemes.
The video segment is used for representing that the mouth shape corresponding to the previous phoneme in each two adjacent phonemes is changed to the mouth shape corresponding to the next phoneme.
It will be appreciated that the mouth movement of a person speaking involves the coordinated movement of groups of muscles and bones. Therefore, to simulate the mouth movement of the digital person, multiple key points may be used to stand in for the action of the muscles and bones, and the key points may be adjusted based on the mouth shape adjustment parameters so that the mouth shape of the digital person is adjusted from the preset mouth shape to the mouth shape indicated by the mouth shape adjustment parameters. The mouth shape adjustment parameters may include a displacement adjustment parameter, a rotation adjustment parameter, and a scaling adjustment parameter.
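As a sketch of this idea, with an assumed parameter layout and key points expressed as 2D coordinates, a single key point could be adjusted with the displacement, rotation and scaling parameters as follows; this is an illustration, not the disclosed algorithm.

```typescript
// Hypothetical parameter layout: scale and rotate about the mouth centre, then displace.
interface KeyPoint { x: number; y: number; }
interface MouthShapeParams {
  displacement: { dx: number; dy: number };
  rotation: number;   // radians around the mouth centre
  scale: number;      // uniform scaling around the mouth centre
}

function adjustKeyPoint(p: KeyPoint, centre: KeyPoint, m: MouthShapeParams): KeyPoint {
  const rx = (p.x - centre.x) * m.scale;
  const ry = (p.y - centre.y) * m.scale;
  const cos = Math.cos(m.rotation);
  const sin = Math.sin(m.rotation);
  return {
    x: centre.x + rx * cos - ry * sin + m.displacement.dx,
    y: centre.y + rx * sin + ry * cos + m.displacement.dy,
  };
}
```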
If the reply voice data includes one phoneme, the mouth shape adjustment parameter corresponding to the phoneme is obtained. Based on the mouth shape adjustment parameter, an initial image of the digital person whose mouth shape is the preset mouth shape can be adjusted to obtain each frame image corresponding to the change of the mouth shape of the digital person from the preset mouth shape to the mouth shape corresponding to the phoneme, and each frame image is synthesized to obtain the video segment corresponding to the change from the preset mouth shape to the mouth shape of the phoneme.
For example, if a piece of reply voice data is obtained and the voice data includes two phonemes, namely phoneme 1 and phoneme 2 arranged in that order, the mouth shape adjustment parameter 1 corresponding to phoneme 1 and the mouth shape adjustment parameter 2 corresponding to phoneme 2 may be obtained. Based on the mouth shape adjustment parameter 1, the initial image of the digital person whose mouth shape is the preset mouth shape can be adjusted to obtain each frame image corresponding to the change of the digital person's mouth shape from the preset mouth shape to mouth shape 1 corresponding to phoneme 1, and each frame image is synthesized into video segment 1 corresponding to the change from the preset mouth shape to mouth shape 1. Based on the mouth shape adjustment parameter 2, the image of the digital person whose mouth shape is mouth shape 1 can be adjusted to obtain each frame image corresponding to the change of the digital person's mouth shape from mouth shape 1 to mouth shape 2 corresponding to phoneme 2, and these frame images are synthesized into video segment 2 corresponding to the change from mouth shape 1 to mouth shape 2.
In the embodiment of the present application, based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes, a specific manner of generating the video segment corresponding to each two adjacent phonemes may be referred to in the following related description of the embodiment.
S204, splicing video clips corresponding to each two adjacent phonemes according to the time sequence of one or more phonemes in the reply voice data to obtain the digital human video matched with the reply voice data.
Continuing the example in S203, in which the reply voice data includes two phonemes, phoneme 1 and phoneme 2: since phoneme 1 and phoneme 2 are arranged in the reply voice data in time order, video segment 1 and video segment 2 can be spliced based on the time order of phoneme 1 and phoneme 2 to obtain a digital human video matched with the reply voice data.
S205, constructing an audio and video based on the digital human video and the reply voice data, and playing the audio and video.
In the embodiment of the application, the reply voice data fed back for the collected voice data and the one or more phonemes contained in the reply voice data are obtained. Since one phoneme corresponds to one mouth shape adjustment parameter, the mouth shape adjustment parameter corresponding to each phoneme can be obtained; based on the mouth shape adjustment parameters corresponding to every two adjacent phonemes, video segments corresponding to every two adjacent phonemes are generated at the front end; the video segments are spliced according to the time sequence of the phonemes in the reply voice data to obtain a digital human video matched with the reply voice data; and an audio and video is constructed based on the digital human video and the reply voice data and played. In this way, the generation of the audio and video is completed directly at the front end, the requirements on network bandwidth and back-end server performance are reduced, and digital humans are easier to deploy and can be widely popularized and used.
In one embodiment, based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes, the specific manner of generating the video segment corresponding to each two adjacent phonemes may be:
and acquiring sounding time lengths from a previous phoneme to a next phoneme in each two adjacent phonemes in one or more phonemes contained in the reply voice data, determining that video time lengths of video clips corresponding to each two adjacent phonemes are the same as the sounding time lengths, and determining the number of image frames contained in the video clips based on the video time lengths, wherein the number of the image frames contained in the video clips and the video time lengths are in positive correlation trend.
Based on the number of image frames contained in the video segment, the multi-frame images contained in the video segment are generated, and the multi-frame images are spliced to obtain the video segment corresponding to every two adjacent phonemes. The mouth shape of the digital person contained in the first frame image refers to the mouth shape corresponding to the previous phoneme, and the mouth shape of the digital person contained in the last frame image refers to the mouth shape corresponding to the next phoneme. For example, if the reply speech includes two phonemes, the former being phoneme 1 and the latter being phoneme 2, and the sounding duration from phoneme 1 to phoneme 2 is 5 s, then the video duration of the video segment corresponding to phonemes 1 to 2 is determined to be 5 s; based on this video duration, the number of image frames in the video segment is determined to be 120 frames, the 120 frame images are generated, and the multi-frame images are spliced to obtain the video segment corresponding to phonemes 1 to 2.
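A minimal sketch of this step, assuming a fixed frame rate of 24 fps (which reproduces the 5 s, 120-frame example above) and simple linear interpolation of key points between the two mouth shapes; the frame rate and the interpolation scheme are assumptions, not part of the disclosure.

```typescript
// Frame count grows with video duration; the first frame shows the previous phoneme's
// mouth shape and the last frame shows the next phoneme's mouth shape.
const FPS = 24;
interface KeyPoint { x: number; y: number; }

function buildSegmentFrames(prevShape: KeyPoint[], nextShape: KeyPoint[], soundingSec: number): KeyPoint[][] {
  const frameCount = Math.max(2, Math.round(soundingSec * FPS));
  const frames: KeyPoint[][] = [];
  for (let f = 0; f < frameCount; f++) {
    const t = f / (frameCount - 1);   // 0 at the first frame, 1 at the last frame
    frames.push(prevShape.map((p, i) => ({
      x: p.x + (nextShape[i].x - p.x) * t,
      y: p.y + (nextShape[i].y - p.y) * t,
    })));
  }
  return frames;
}
```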
Therefore, the video segments corresponding to every two adjacent phonemes are spliced to obtain the digital person video matched with the reply voice data, the audio and video is constructed based on the digital person video and the reply voice data, and when the voice of the audio and video is played, the digital person contained in the digital person video presents the corresponding mouth shape according to the sounding time of the phonemes in a designated period, so that continuous mouth shape transformation of the whole section of speaking content is realized.
Optionally, based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes, a specific manner of generating the video segments corresponding to each two adjacent phonemes may be referred to as fig. 4, and fig. 4 is a flowchart of generating the video segments corresponding to each two adjacent phonemes according to an embodiment of the present application, as shown in fig. 4.
S401, acquiring sounding time lengths from a previous phoneme to a next phoneme in every two adjacent phonemes.
S402, determining a mouth shape adjustment step length based on the sounding time length and a preset mouth shape adjustment time period.
Since the mouth shape action of the digital person can be simulated by using a plurality of key points to replace the action of the muscle and the skeleton, the key points can be adjusted based on mouth shape adjusting parameters so that the mouth shape of the digital person is adjusted from a preset mouth shape to the mouth shape indicated by the mouth shape adjusting parameters, and the mouth shape adjusting parameters can comprise displacement adjusting parameters, rotation adjusting parameters and scaling adjusting parameters.
In one embodiment, a mouth shape adjustment period may be preset, and the mouth shape adjustment step size is determined based on the acquired sound emission time length from the previous phoneme to the next phoneme in each two adjacent phonemes and the preset mouth shape adjustment period.
For example, suppose two adjacent phonemes are phoneme 1 (the former) and phoneme 2 (the latter), mouth shape 1 corresponding to phoneme 1 contains key point 1, and mouth shape 2 corresponding to phoneme 2 also contains key point 1. Based on the indication of mouth shape adjustment parameter 1 corresponding to phoneme 1 and mouth shape adjustment parameter 2 corresponding to phoneme 2, it can be determined that mouth shape 1 changes to mouth shape 2 by moving key point 1 to the left by 5 cm.
If the sounding duration from phoneme 1 to phoneme 2 is 5 s and the preset mouth shape adjustment period is 1 s, then, since key point 1 in mouth shape 2 is obtained by moving key point 1 in mouth shape 1, the mouth shape adjustment step length can be determined to be 1 cm; that is, to adjust mouth shape 1 to mouth shape 2, key point 1 is moved 1 cm to the left every 1 s, five times in total.
In another embodiment, the adjustment step length of the mouth shape may be preset, and the adjustment time period of the mouth shape may be determined based on the acquired sounding time length from the previous phoneme to the next phoneme in each two adjacent phonemes and the preset adjustment step length.
For example, suppose two adjacent phonemes are phoneme 1 (the former) and phoneme 2 (the latter), mouth shape 1 corresponding to phoneme 1 contains key point 1, and mouth shape 2 corresponding to phoneme 2 also contains key point 1. Based on the indication of mouth shape adjustment parameter 1 corresponding to phoneme 1 and mouth shape adjustment parameter 2 corresponding to phoneme 2, it can be determined that mouth shape 1 changes to mouth shape 2 by moving key point 1 to the left by 5 cm.
If the sounding duration from phoneme 1 to phoneme 2 is 5 s and the preset mouth shape step length is 1 cm, then, since key point 1 in mouth shape 2 is obtained by moving key point 1 in mouth shape 1, the mouth shape adjustment period can be determined to be 1 s; that is, to adjust mouth shape 1 to mouth shape 2, key point 1 is moved 1 cm to the left every 1 s, five times in total.
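The relationship in the two examples above can be written as a one-line computation; the sketch below reproduces the 5 cm / 5 s / 1 s case and is purely illustrative.

```typescript
// Mouth shape adjustment step = total key-point displacement / number of adjustment periods.
function mouthAdjustmentStep(totalDisplacementCm: number, soundingSec: number, periodSec: number): number {
  const periods = soundingSec / periodSec;   // here 5 s / 1 s = 5 periods
  return totalDisplacementCm / periods;      // here 5 cm / 5 = 1 cm per period
}

// Example from the text: mouthAdjustmentStep(5, 5, 1) === 1
```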
S403, generating a target image corresponding to the previous phoneme.
The mouth shape of the digital person contained in the target image is obtained by adjusting according to the mouth shape adjustment parameter corresponding to the previous phoneme.
S404, based on the mouth shape adjustment step length, the mouth shape corresponding to the previous phoneme is adjusted, and a target image is obtained, wherein the mouth shape of the digital person contained in the target image is the mouth shape obtained by the mouth shape change corresponding to the previous phoneme.
For example, suppose two adjacent phonemes are phoneme 1 (the former) and phoneme 2 (the latter), mouth shape 1 corresponding to phoneme 1 contains key point 1, and mouth shape 2 corresponding to phoneme 2 also contains key point 1. Based on the indication of mouth shape adjustment parameter 1 corresponding to phoneme 1 and mouth shape adjustment parameter 2 corresponding to phoneme 2, it can be determined that mouth shape 1 changes to mouth shape 2 by moving key point 1 to the left by 2 cm. The sounding duration from phoneme 1 to phoneme 2 is 2 s, the preset mouth shape adjustment period is 1 s, and the mouth shape adjustment step length is 1 cm.
Key point 1 contained in mouth shape 1 is moved 1 cm to the left to obtain one target image 1; key point 1 is then moved a further 1 cm to the left from the mouth shape position of the digital person in target image 1 to obtain another target image 2.
S405, based on the mouth shape adjustment step length, the mouth shape obtained by the previous change is adjusted, and another target image is obtained.
The specific implementation is the same as the example in S404 described above.
And S406, if the mouth shape of the digital person contained in the recently obtained target image is the same as the mouth shape corresponding to the next phoneme, splicing the target images to obtain video clips corresponding to every two adjacent phonemes.
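An illustrative one-dimensional sketch of S403 to S406: a single key-point coordinate is stepped from the previous phoneme's position toward the next phoneme's position, one target image is recorded per step, and the images are returned in order for splicing. Representing a target image as a single coordinate is an assumption made only to keep the sketch short.

```typescript
// Hypothetical, simplified target image: one key-point coordinate per image.
interface TargetImage { keypointX: number; }

function buildTargetImages(prevX: number, nextX: number, stepCm: number): TargetImage[] {
  if (stepCm <= 0) throw new Error("step must be positive");
  const images: TargetImage[] = [{ keypointX: prevX }];   // target image for the previous phoneme
  let x = prevX;
  const dir = Math.sign(nextX - prevX);
  while (Math.abs(nextX - x) > 1e-9) {
    x += dir * Math.min(stepCm, Math.abs(nextX - x));      // adjust by one step, never overshooting
    images.push({ keypointX: x });                         // another target image after this adjustment
  }
  // The most recently obtained image now matches the next phoneme's mouth shape;
  // splicing the images in order yields the video clip for this pair of adjacent phonemes.
  return images;
}
```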
Optionally, based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes, a specific manner of generating the video segments corresponding to each two adjacent phonemes may be referred to as fig. 5, and fig. 5 is a flowchart provided in the embodiment of the present application for generating the video segments corresponding to each two adjacent phonemes, as shown in fig. 5.
S501, generating a target image corresponding to a previous phoneme.
Wherein, the mouth shape of the digital person contained in the target image corresponding to the previous phoneme refers to the mouth shape corresponding to the previous phoneme.
S502, based on the mouth shape adjustment parameters corresponding to the previous phoneme in every two adjacent phonemes, generating an intermediate image, wherein the mouth shape of the digital person contained in the intermediate image is changed from the mouth shape corresponding to the previous phoneme to the target mouth shape.
For example, if the preset mouth shape is a closed mouth shape, then when the previous phoneme is about to be finished and the next phoneme has not yet been spoken, the mouth shape may be adjusted toward the preset mouth shape according to the adjustment steps in S403 to S405 to obtain the target mouth shape, and the mouth shape is then adjusted and converted from the target mouth shape to the mouth shape corresponding to the next phoneme.
The similarity between the target mouth shape and the preset mouth shape reaches a preset similarity threshold. It can be understood that the target mouth shape may be a mouth shape reached partway through the process of adjusting the mouth shape corresponding to the previous phoneme toward the preset mouth shape, rather than the fully adjusted preset mouth shape.
In one embodiment, the target mouth shape may be determined by the sounding time point of the next phoneme; that is, the mouth shape reached at the sounding time point of the next phoneme, during the process of adjusting the mouth shape corresponding to the previous phoneme toward the preset mouth shape, is the target mouth shape.
S503, generating a target image corresponding to the latter phoneme.
S504, splicing the target image corresponding to the previous phoneme, the intermediate image and the target image corresponding to the next phoneme to obtain video clips corresponding to every two adjacent phonemes.
The conversion process of S501 to S504 may refer to fig. 6, which is a schematic diagram of mouth shape conversion provided in an embodiment of the present application. This process makes the mouth shape conversion more natural and improves the realism of the digital person.
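A minimal sketch of S501 to S504, assuming key points as 2D coordinates and using a fixed blend factor in place of the preset similarity threshold; both assumptions are for illustration only.

```typescript
interface KeyPoint { x: number; y: number; }

// Linear blend between two mouth shapes: t = 0 gives shape a, t = 1 gives shape b.
function blend(a: KeyPoint[], b: KeyPoint[], t: number): KeyPoint[] {
  return a.map((p, i) => ({ x: p.x + (b[i].x - p.x) * t, y: p.y + (b[i].y - p.y) * t }));
}

function buildSegmentWithIntermediate(
  prevShape: KeyPoint[],     // mouth shape corresponding to the previous phoneme
  nextShape: KeyPoint[],     // mouth shape corresponding to the next phoneme
  presetShape: KeyPoint[],   // preset mouth shape, e.g. a closed mouth
): KeyPoint[][] {
  // Intermediate image: the previous mouth shape moved most of the way toward the preset
  // mouth shape; 0.8 stands in for the preset similarity threshold and is an assumption.
  const targetShape = blend(prevShape, presetShape, 0.8);
  // Splice previous image, intermediate image, and next image into the video clip.
  return [prevShape, targetShape, nextShape];
}
```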
Next, an application of the audio and video generation method provided in the embodiment of the present application is described as an example; please refer to fig. 7, which is an application architecture diagram of the audio and video generation method provided in an embodiment of the present application. As shown in fig. 7, the front end executing the audio and video generation method may be a browser. The browser obtains the recording permission, collects the voice data of the user, and then sends the collected voice data to the back-end server in real time. After receiving the voice data, the back-end server converts the voice data into text data, uses the OpenAI server to generate reply text data for the text data, converts the reply text data into reply voice data, obtains the phonemes contained in the reply voice data, and sends the reply voice data and the phonemes contained therein to the browser.
After the browser obtains the reply voice data and the phonemes contained in it, the mouth shape adjustment parameter corresponding to each phoneme can be obtained. Based on the mouth shape adjustment parameters corresponding to every two adjacent phonemes, Javascript controls the changes of the key points contained in the digital person so that the mouth shape of the digital person changes from the mouth shape corresponding to the former phoneme to the mouth shape corresponding to the latter phoneme, and the video segment corresponding to every two adjacent phonemes is generated. The video segments are spliced according to the time sequence of the phonemes in the reply voice data to obtain a digital human video matched with the reply voice data, an audio and video is constructed based on the digital human video and the reply voice data, and the audio and video is played. The mouth shape of the digital person in the audio and video changes correspondingly along with the played reply voice, thereby realizing real-time human-machine voice interaction.
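As an illustrative, non-authoritative sketch of the browser-side playback described above: the reply audio could be played with the Web Audio API while requestAnimationFrame selects the mouth shape frame matching the current playback time. drawDigitalHuman is a hypothetical rendering function, not part of the disclosure.

```typescript
declare function drawDigitalHuman(frame: { keypoints: number[][] }): void;  // hypothetical renderer

function playAudioVideo(frames: { keypoints: number[][] }[], audio: AudioBuffer, ctx: AudioContext): void {
  const source = ctx.createBufferSource();
  source.buffer = audio;
  source.connect(ctx.destination);
  const startTime = ctx.currentTime;
  source.start(startTime);

  const fps = frames.length / audio.duration;   // keep mouth shapes aligned with the reply voice
  const render = () => {
    const elapsed = ctx.currentTime - startTime;
    const index = Math.min(frames.length - 1, Math.floor(elapsed * fps));
    drawDigitalHuman(frames[index]);
    if (elapsed < audio.duration) requestAnimationFrame(render);
  };
  requestAnimationFrame(render);
}
```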
Therefore, the generation of the audio and video is completed directly at the front end, the requirements on network bandwidth and back-end server performance are reduced, and digital humans are easier to deploy and can be widely popularized and used.
Based on the description of the related embodiments, the embodiments of the present application further provide an audio/video generating device, where the audio/video generating device may perform the operations performed by the front end shown in fig. 1 to 7. Referring to fig. 8, fig. 8 is a schematic diagram of an audio/video generating apparatus according to an embodiment of the present application. As shown in fig. 8, the audio/video generating apparatus may include, but is not limited to, an acquisition unit 801, a generation unit 802, a splicing unit 803, and a playback unit 804.
An obtaining unit 801, configured to obtain reply voice data fed back for the collected voice data and one or more phonemes contained in the reply voice data;
an obtaining unit 801, configured to obtain the mouth shape adjustment parameter corresponding to each phoneme; wherein the mouth shape adjustment parameter is used for indicating the adjustment parameters required to adjust the mouth shape of the digital person from the preset mouth shape to the mouth shape corresponding to the corresponding phoneme;
a generating unit 802, configured to generate a video segment corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes; the video segment is used for representing that the mouth shape corresponding to the previous phoneme in each two adjacent phonemes is changed to the mouth shape corresponding to the next phoneme;
A splicing unit 803, configured to splice video segments corresponding to each two adjacent phonemes according to a time sequence of the one or more phonemes in the reply voice data, so as to obtain a digital human video that matches the reply voice data;
and a playing unit 804, configured to construct an audio/video based on the digital personal video and the reply voice data, and play the audio/video.
In an alternative embodiment, the generating unit 802 generates the video segment corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes, including:
acquiring the sounding time length from the former phoneme to the latter phoneme in each two adjacent phonemes;
determining that the video duration of the video clips corresponding to each two adjacent phonemes is the same as the sounding duration;
determining the number of image frames contained in the video clip based on the video duration; the number of image frames contained in the video clip and the video duration show positive correlation trend;
generating a multi-frame image contained in the video clip based on the number of image frames contained in the video clip; the mouth shape of the digital person contained in the first frame image in the multi-frame images refers to the mouth shape corresponding to the previous phoneme, and the mouth shape of the digital person contained in the last frame image in the multi-frame images refers to the mouth shape corresponding to the next phoneme;
And splicing the multi-frame images to obtain video clips corresponding to each two adjacent phonemes.
In an alternative embodiment, the generating unit 802 generates the video segment corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes, including:
acquiring the sounding time length from the former phoneme to the latter phoneme in each two adjacent phonemes;
determining a mouth shape adjustment step length based on the sounding time length and a preset mouth shape adjustment time period;
generating a target image corresponding to the previous phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the previous phoneme refers to the mouth shape corresponding to the previous phoneme;
based on the mouth shape adjustment step length, the mouth shape corresponding to the previous phoneme is adjusted to obtain a target image, wherein the mouth shape of the digital person contained in the target image is a mouth shape obtained by changing the mouth shape corresponding to the previous phoneme;
based on the mouth shape adjustment step length, adjusting the mouth shape obtained by the previous change to obtain another target image;
and if the mouth shape of the digital person contained in the recently obtained target image is the same as the mouth shape corresponding to the next phoneme, splicing all the target images to obtain the video clips corresponding to each two adjacent phonemes.
In an alternative embodiment, the generating unit 802 generates the video segment corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes, including:
generating a target image corresponding to the previous phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the previous phoneme refers to the mouth shape corresponding to the previous phoneme;
generating an intermediate image based on the mouth shape adjustment parameters corresponding to the previous phoneme in every two adjacent phonemes, wherein the mouth shape of the digital person contained in the intermediate image refers to: changing the mouth shape corresponding to the previous phoneme to a target mouth shape, wherein the similarity between the target mouth shape and the preset mouth shape reaches a preset similarity threshold;
generating a target image corresponding to the latter phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the next phoneme refers to the mouth shape corresponding to the next phoneme;
and splicing the target image corresponding to the previous phoneme, the intermediate image and the target image corresponding to the next phoneme to obtain the video clips corresponding to each two adjacent phonemes.
In an alternative embodiment, the audio/video generating device further includes an acquisition unit 805.
The acquisition unit 805 is further configured to acquire voice data of the target object;
the generating unit 802 is further configured to interact with a server, so that the server performs analysis processing on text data corresponding to the voice data, and generates reply text data corresponding to the text data;
the obtaining unit 801 is further configured to obtain reply voice data corresponding to the reply text data.
In an alternative embodiment, the obtaining unit 801 obtains the mouth shape adjustment parameters corresponding to each phoneme, including:
acquiring the language type of the reply voice data; the language type is the language type appointed by the target object;
and acquiring the mouth shape adjusting parameters corresponding to each phoneme under the language type.
In an alternative embodiment, the language type of the reply voice data is consistent with the language type of the collected voice data;
the acquiring unit 801 acquires mouth shape adjustment parameters corresponding to each phoneme, including:
acquiring the language type of the collected voice data or the reply voice data;
And acquiring the mouth shape adjusting parameters corresponding to each phoneme under the language type.
In this embodiment of the present application, the obtaining unit 801 obtains the reply voice data fed back for the collected voice data and the one or more phonemes contained in the reply voice data. Since one phoneme corresponds to one mouth shape adjustment parameter, the obtaining unit 801 can obtain the mouth shape adjustment parameter corresponding to each phoneme; the generating unit 802 generates, at the front end, video segments corresponding to every two adjacent phonemes based on the mouth shape adjustment parameters corresponding to every two adjacent phonemes; the splicing unit 803 splices the video segments according to the time sequence of the phonemes in the reply voice data to obtain a digital human video matched with the reply voice data; and the playing unit 804 constructs an audio and video based on the digital human video and the reply voice data and plays the audio and video. In this way, the generation of the audio and video is completed directly at the front end, the requirements on network bandwidth and back-end server performance are reduced, and digital humans are easier to deploy and can be widely popularized and used.
The embodiment of the application also provides a computer device, please refer to fig. 9, and fig. 9 is a schematic structural diagram of the computer device provided in the embodiment of the application. As shown in fig. 9, the computer device includes at least a processor 901, a memory 902, and a communication interface 903, which may be connected by a bus 904 or otherwise, and in the present embodiment, is exemplified by connection via the bus 904. The processor 901 of the embodiment of the present application may execute the operations of the foregoing audio/video generation method by executing a computer program stored in the memory 902, for example:
Acquiring reply voice data fed back aiming at the collected voice data and one or more phonemes contained in the reply voice data;
obtaining a mouth shape adjustment parameter corresponding to each phoneme; wherein the mouth shape adjustment parameter is used for indicating the adjustment parameters required to adjust the mouth shape of the digital person from the preset mouth shape to the mouth shape corresponding to the corresponding phoneme;
generating video clips corresponding to each two adjacent phonemes based on mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes; the video segment is used for representing that the mouth shape corresponding to the previous phoneme in each two adjacent phonemes is changed to the mouth shape corresponding to the next phoneme;
splicing the video segments corresponding to each two adjacent phonemes according to the time sequence of the one or more phonemes in the reply voice data to obtain a digital human video matched with the reply voice data;
and constructing an audio and video based on the digital human video and the reply voice data, and playing the audio and video.
In an alternative embodiment, the processor 901 generates a video clip corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes, and is specifically configured to perform the following operations:
Acquiring the sounding time length from the former phoneme to the latter phoneme in each two adjacent phonemes;
determining that the video duration of the video clips corresponding to each two adjacent phonemes is the same as the sounding duration;
determining the number of image frames contained in the video clip based on the video duration; the number of image frames contained in the video clip and the video duration show positive correlation trend;
generating a multi-frame image contained in the video clip based on the number of image frames contained in the video clip; the mouth shape of the digital person contained in the first frame image in the multi-frame images refers to the mouth shape corresponding to the previous phoneme, and the mouth shape of the digital person contained in the last frame image in the multi-frame images refers to the mouth shape corresponding to the next phoneme;
and splicing the multi-frame images to obtain video clips corresponding to each two adjacent phonemes.
In an alternative embodiment, the processor 901 generates a video clip corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes, and is specifically configured to perform the following operations:
Acquiring the sounding time length from the former phoneme to the latter phoneme in each two adjacent phonemes;
determining a mouth shape adjustment step length based on the sounding time length and a preset mouth shape adjustment time period;
generating a target image corresponding to the previous phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the previous phoneme refers to the mouth shape corresponding to the previous phoneme;
based on the mouth shape adjustment step length, adjusting the mouth shape corresponding to the previous phoneme to obtain a target image, wherein the mouth shape of the digital person contained in the target image is: a mouth shape obtained by changing the mouth shape corresponding to the previous phoneme;
based on the mouth shape adjustment step length, adjusting the mouth shape obtained by the previous change to obtain another target image;
and if the mouth shape of the digital person contained in the most recently obtained target image is the same as the mouth shape corresponding to the next phoneme, splicing all the target images to obtain the video clips corresponding to each two adjacent phonemes.
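The step-length variant can be sketched in the same way. The assumptions are the same as above (numeric parameter vectors, hypothetical `render_mouth`), and the 0.04-second adjustment period is an illustrative value, not one specified by the application.

```python
import numpy as np

def make_clip_by_step(prev_params, next_params, sounding_duration,
                      render_mouth, adjust_period=0.04):
    """Step length = total change divided by the number of preset adjustment
    periods that fit into the sounding duration; keep adjusting the previous
    mouth shape until it matches the next phoneme's."""
    prev_params = np.asarray(prev_params, dtype=float)
    next_params = np.asarray(next_params, dtype=float)
    n_steps = max(1, int(round(sounding_duration / adjust_period)))
    step = (next_params - prev_params) / n_steps       # mouth shape adjustment step length

    current = prev_params.copy()
    frames = [render_mouth(current)]                   # target image for the previous phoneme
    for _ in range(n_steps):
        current = current + step                       # adjust the mouth shape obtained last time
        frames.append(render_mouth(current))
    # the most recently obtained image now shows the next phoneme's mouth shape
    return frames
```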
In an alternative embodiment, the processor 901 generates a video clip corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes, and is specifically configured to perform the following operations:
Generating a target image corresponding to the previous phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the previous phoneme refers to the mouth shape corresponding to the previous phoneme;
generating an intermediate image based on the mouth shape adjustment parameters corresponding to the previous phoneme in every two adjacent phonemes, wherein the mouth shape of the digital person contained in the intermediate image is: a target mouth shape obtained by changing the mouth shape corresponding to the previous phoneme, the similarity between the target mouth shape and the preset mouth shape reaching a preset similarity threshold;
generating a target image corresponding to the latter phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the next phoneme refers to the mouth shape corresponding to the next phoneme;
and splicing the target image corresponding to the previous phoneme, the intermediate image and the target image corresponding to the next phoneme to obtain the video clips corresponding to each two adjacent phonemes.
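This keyframe-style variant needs only three images per adjacent pair. The sketch below assumes numeric parameter vectors, a hypothetical `render_mouth` helper, and an all-zero vector standing in for the preset (closed) mouth shape; none of these are fixed by the application.

```python
import numpy as np

def make_clip_with_intermediate(prev_params, next_params, render_mouth,
                                similarity_threshold=0.9):
    """Three spliced images: the previous phoneme's mouth shape, an
    intermediate mouth shape pulled toward the preset closed mouth until the
    similarity threshold is reached, and the next phoneme's mouth shape."""
    prev_params = np.asarray(prev_params, dtype=float)
    next_params = np.asarray(next_params, dtype=float)
    closed = np.zeros_like(prev_params)                # assumed preset (closed) mouth shape

    # move the previous mouth shape toward the closed shape; a higher threshold
    # means the intermediate frame is closer to the preset mouth shape
    intermediate = (1.0 - similarity_threshold) * prev_params + similarity_threshold * closed

    return [render_mouth(prev_params),     # target image for the previous phoneme
            render_mouth(intermediate),    # intermediate image near the closed mouth
            render_mouth(next_params)]     # target image for the next phoneme
```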
In an alternative embodiment, the processor 901 further performs the following operations:
collecting voice data of a target object;
interacting with a server to enable the server to analyze and process text data corresponding to the voice data and generate reply text data corresponding to the text data;
And obtaining the reply voice data corresponding to the reply text data.
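At the front end, the collection and interaction step might look roughly as follows. The endpoint URL, the request and response fields, and the `record_audio`, `speech_to_text`, and `text_to_speech` helpers are all assumptions made for illustration; the application does not prescribe the transport or the exact division of work between the front end and the server.

```python
import requests  # assumed HTTP client; the actual transport is not specified

def fetch_reply_voice(record_audio, speech_to_text, text_to_speech,
                      server_url="https://example.invalid/dialog"):  # hypothetical endpoint
    """Sketch of the interaction: collect the target object's voice, let the
    server analyze the corresponding text data and produce reply text, then
    obtain the reply voice data (and its phonemes) for that reply text."""
    voice_data = record_audio()                        # collect voice data of the target object
    text_data = speech_to_text(voice_data)             # text data corresponding to the voice data
    resp = requests.post(server_url, json={"text": text_data}, timeout=10)
    reply_text = resp.json()["reply_text"]             # assumed response field
    reply_voice, phonemes = text_to_speech(reply_text)  # reply voice data plus its phonemes
    return reply_voice, phonemes
```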
In an alternative embodiment, the processor 901 obtains the mouth shape adjustment parameters corresponding to each phoneme, and specifically performs the following operations:
acquiring the language type of the reply voice data; the language type is the language type specified by the target object;
and acquiring the mouth shape adjusting parameters corresponding to each phoneme under the language type.
In an alternative embodiment, the language type of the reply voice data is consistent with the language type of the collected voice data; the processor 901 obtains the mouth shape adjustment parameters corresponding to each phoneme, and specifically performs the following operations:
acquiring the language type of the collected voice data or the reply voice data;
and acquiring the mouth shape adjusting parameters corresponding to each phoneme under the language type.
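Both language-type variants reduce to a two-level lookup, language type first and phoneme second. The table layout and values below are invented purely for illustration.

```python
# Hypothetical two-level table: language type -> phoneme -> mouth shape adjustment parameters.
MOUTH_PARAM_TABLES = {
    "zh": {"a": [0.9, 0.1], "o": [0.6, 0.5], "m": [0.0, 0.0]},
    "en": {"a": [0.8, 0.2], "o": [0.5, 0.6], "m": [0.0, 0.0]},
}

def mouth_params_for(phonemes, language_type):
    """Obtain the mouth shape adjustment parameters for each phoneme under the
    language type of the collected or reply voice data."""
    table = MOUTH_PARAM_TABLES[language_type]
    return [table[p] for p in phonemes]

# e.g. mouth_params_for(["m", "a"], "zh") -> [[0.0, 0.0], [0.9, 0.1]]
```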
In this embodiment, the processor 901 obtains the reply voice data fed back for the collected voice data and the one or more phonemes contained in the reply voice data. Because each phoneme corresponds to one mouth shape adjustment parameter, the mouth shape adjustment parameter corresponding to each phoneme can be obtained. Video segments corresponding to each two adjacent phonemes are then generated at the front end based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes, and are spliced according to the time sequence of the phonemes in the reply voice data to obtain a digital human video matched with the reply voice data; an audio and video is constructed based on the digital human video and the reply voice data and played. Therefore, the generation of the audio and video can be completed directly at the front end, which reduces the demands on network bandwidth and on the processing capacity of the back-end server, so that digital human deployment becomes easier and can be widely popularized and used.
The present application also provides a computer readable storage medium having a computer program stored thereon which, when executed by a processor, implements the steps of any of the method embodiments described above.
The present application also provides a computer program product comprising computer program code which, when run on a computer, causes the computer to perform the steps of any of the method embodiments described above.
The embodiment of the application further provides a chip, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for calling and running the computer program from the memory, so that the device provided with the chip executes the steps in any method embodiment.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The units in the device of the embodiment of the application can be combined, divided and deleted according to actual needs.

Claims (9)

1. An audio/video generation method, comprising:
acquiring reply voice data fed back aiming at the collected voice data and one or more phonemes contained in the reply voice data;
obtaining mouth shape adjustment parameters corresponding to each phoneme; wherein the mouth shape adjustment parameter is used for indicating: the adjustment parameters required to adjust the mouth shape of the digital person from the preset mouth shape to the mouth shape corresponding to the corresponding phoneme;
generating video clips corresponding to each two adjacent phonemes based on mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes; the video segment is used for representing that the mouth shape corresponding to the previous phoneme in each two adjacent phonemes is changed to the mouth shape corresponding to the next phoneme;
splicing the video segments corresponding to each two adjacent phonemes according to the time sequence of the one or more phonemes in the reply voice data to obtain a digital human video matched with the reply voice data;
constructing an audio and video based on the digital human video and the reply voice data, and playing the audio and video;
the generating the video segment corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes includes:
generating a target image corresponding to the previous phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the previous phoneme refers to the mouth shape corresponding to the previous phoneme;
Generating an intermediate image based on the mouth shape adjustment parameters corresponding to the previous phoneme in every two adjacent phonemes, wherein the mouth shape of the digital person contained in the intermediate image refers to: changing the mouth shape corresponding to the previous phoneme to a target mouth shape, wherein the similarity between the target mouth shape and the preset mouth shape reaches a preset similarity threshold value, and the preset mouth shape refers to a closed mouth shape;
generating a target image corresponding to the latter phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the next phoneme refers to the mouth shape corresponding to the next phoneme;
and splicing the target image corresponding to the previous phoneme, the intermediate image and the target image corresponding to the next phoneme to obtain the video clips corresponding to each two adjacent phonemes.
2. The method of claim 1, wherein generating the video segment corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes of the one or more phonemes comprises:
acquiring the sounding time length from the former phoneme to the latter phoneme in each two adjacent phonemes;
determining that the video duration of the video clips corresponding to each two adjacent phonemes is the same as the sounding duration;
Determining the number of image frames contained in the video clip based on the video duration; the number of image frames contained in the video clip and the video duration show positive correlation trend;
generating a multi-frame image contained in the video clip based on the number of image frames contained in the video clip; the mouth shape of the digital person contained in the first frame image in the multi-frame images refers to the mouth shape corresponding to the previous phoneme, and the mouth shape of the digital person contained in the last frame image in the multi-frame images refers to the mouth shape corresponding to the next phoneme;
and splicing the multi-frame images to obtain video clips corresponding to each two adjacent phonemes.
3. The method of claim 1, wherein generating the video segment corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes of the one or more phonemes comprises:
acquiring the sounding time length from the former phoneme to the latter phoneme in each two adjacent phonemes;
determining a mouth shape adjustment step length based on the sounding time length and a preset mouth shape adjustment time period;
generating a target image corresponding to the previous phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the previous phoneme refers to the mouth shape corresponding to the previous phoneme;
Based on the mouth shape adjustment step length, the mouth shape corresponding to the previous phoneme is adjusted to obtain a target image, wherein the mouth shape of the digital person contained in the target image is: a mouth shape obtained by mouth shape variation corresponding to the previous phoneme;
based on the mouth shape adjustment step length, adjusting the mouth shape obtained by the previous change to obtain another target image;
and if the mouth shape of the digital person contained in the recently obtained target image is the same as the mouth shape corresponding to the next phoneme, splicing all the target images to obtain the video clips corresponding to each two adjacent phonemes.
4. The method of claim 1, wherein the method further comprises:
collecting voice data of a target object;
interacting with a server to enable the server to analyze and process text data corresponding to the voice data and generate reply text data corresponding to the text data;
and obtaining the reply voice data corresponding to the reply text data.
5. The method of claim 1, wherein the obtaining the mouth shape adjustment parameter corresponding to each phoneme comprises:
acquiring the language type of the reply voice data; the language type is a language type specified by the target object;
And acquiring the mouth shape adjusting parameters corresponding to each phoneme under the language type.
6. The method of claim 1, wherein the language type of the reply voice data is consistent with the language type of the collected voice data;
the obtaining the mouth shape adjustment parameters corresponding to each phoneme comprises the following steps:
acquiring the language type of the collected voice data or the reply voice data;
and acquiring the mouth shape adjusting parameters corresponding to each phoneme under the language type.
7. An audio/video generation device, characterized in that the device comprises:
the acquisition unit is used for acquiring the reply voice data fed back for the acquired voice data and one or more phonemes contained in the reply voice data;
the acquisition unit is also used for acquiring mouth shape adjustment parameters corresponding to each phoneme; wherein the mouth shape adjustment parameter is used for indicating: the adjustment parameters required to adjust the mouth shape of the digital person from the preset mouth shape to the mouth shape corresponding to the corresponding phoneme;
the generating unit is used for generating video clips corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes; the video segment is used for representing that the mouth shape corresponding to the previous phoneme in each two adjacent phonemes is changed to the mouth shape corresponding to the next phoneme;
The splicing unit is used for splicing the video segments corresponding to each two adjacent phonemes according to the time sequence of the one or more phonemes in the reply voice data to obtain a digital human video matched with the reply voice data;
the playing unit is used for constructing an audio and video based on the digital human video and the reply voice data and playing the audio and video;
the generating unit generates a video clip corresponding to each two adjacent phonemes based on the mouth shape adjustment parameters corresponding to each two adjacent phonemes in the one or more phonemes, including:
generating a target image corresponding to the previous phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the previous phoneme refers to the mouth shape corresponding to the previous phoneme;
generating an intermediate image based on the mouth shape adjustment parameters corresponding to the previous phoneme in every two adjacent phonemes, wherein the mouth shape of the digital person contained in the intermediate image refers to: changing the mouth shape corresponding to the previous phoneme to a target mouth shape, wherein the similarity between the target mouth shape and the preset mouth shape reaches a preset similarity threshold value, and the preset mouth shape refers to a closed mouth shape;
Generating a target image corresponding to the latter phoneme; wherein, the mouth shape of the digital person contained in the target image corresponding to the next phoneme refers to the mouth shape corresponding to the next phoneme;
and splicing the target image corresponding to the previous phoneme, the intermediate image and the target image corresponding to the next phoneme to obtain the video clips corresponding to each two adjacent phonemes.
8. A computer device comprising a memory, a communication interface, and a processor, wherein the memory, the communication interface, and the processor are interconnected; the memory stores a computer program, and the processor invokes the computer program stored in the memory for implementing the method of any one of claims 1 to 6.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, implements the method according to any of claims 1 to 6.
CN202311560630.XA 2023-11-22 2023-11-22 Audio and video generation method, device, equipment and storage medium Active CN117275485B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311560630.XA CN117275485B (en) 2023-11-22 2023-11-22 Audio and video generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117275485A (en) 2023-12-22
CN117275485B (en) 2024-03-12

Family

ID=89218196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311560630.XA Active CN117275485B (en) 2023-11-22 2023-11-22 Audio and video generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117275485B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727303A (en) * 2024-02-08 2024-03-19 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB9005142D0 (en) * 1989-03-08 1990-05-02 Kokusai Denshin Denwa Co Ltd Picture synthesizing method and apparatus
JP2002108382A (en) * 2000-09-27 2002-04-10 Sony Corp Animation method and device for performing lip sinchronization
CN109872724A (en) * 2019-03-29 2019-06-11 广州虎牙信息科技有限公司 Virtual image control method, virtual image control device and electronic equipment
CN113539240A (en) * 2021-07-19 2021-10-22 北京沃东天骏信息技术有限公司 Animation generation method and device, electronic equipment and storage medium
CN116363268A (en) * 2023-02-20 2023-06-30 厦门黑镜科技有限公司 Method and device for generating mouth shape animation, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN117275485A (en) 2023-12-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant