CN114051105A - Multimedia data processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114051105A
Authority
CN
China
Prior art keywords
audio data
video
synthesized
data
synthetic
Prior art date
Legal status
Granted
Application number
CN202111320124.4A
Other languages
Chinese (zh)
Other versions
CN114051105B (en)
Inventor
胡天舒
韩钧宇
洪智滨
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111320124.4A
Publication of CN114051105A
Application granted
Publication of CN114051105B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047Architecture of speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • G10L2021/105Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The disclosure provides a multimedia data processing method, a multimedia data processing device, electronic equipment and a storage medium, and relates to the technical field of artificial intelligence, in particular to the fields of deep learning, computer vision and the like. The specific implementation scheme is as follows: obtaining synthesized audio data according to reference audio data in reference video data; mapping the synthesized audio data to the reference audio data to obtain a mapping result; and generating synthesized video data according to the mapping result and the mapping relation between the reference audio data and the video frames of the reference video data, wherein the audio data in the synthesized video data is the synthesized audio data. The disclosed embodiments can provide highly matched material for the use and improvement of speech-driven face technology.

Description

Multimedia data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to the field of deep learning, computer vision, and the like.
Background
Face driving means driving a character picture with a reference medium as the content, so that a character video matching the reference content is generated from that picture. In recent years, the popularity of short video and livestreaming has brought unprecedented prosperity to content creation, and the continuous upgrading of virtual reality technology has opened up further possibilities for it. Face-driving technology has become an important supporting technology behind this content creation.
It is generally believed that the more realistic a face-driven work is and the closer it is to a real scene, the better the effect. Improving the effect presented by face-driven works is therefore the key to improving face-driving technology.
Disclosure of Invention
The disclosure provides a multimedia data processing method, a multimedia data processing device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a multimedia data processing method including:
obtaining synthetic audio data according to reference audio data in the reference video data;
mapping the synthetic audio data and the reference audio data to obtain a mapping result;
and generating synthetic video data according to the mapping result and the mapping relation between the reference audio data and the video frames of the reference video data, wherein the audio data in the synthetic video data is synthetic audio data.
According to another aspect of the present disclosure, there is provided a model generation method including:
inputting the training video frame and the synthetic audio data into a voice-driven face model to be trained to obtain a voice-driven video frame; the synthesized audio data is the synthesized audio data provided by any one of the embodiments of the present disclosure;
and training the voice-driven face model to be trained according to the voice-driven video frame and the synthesized video data to obtain the voice-driven face model, wherein the synthesized video data is the synthesized video data provided by any one embodiment of the disclosure.
According to another aspect of the present disclosure, there is provided a multimedia data processing apparatus including:
the synthetic audio data acquisition module is used for acquiring synthetic audio data according to reference audio data in the reference video data;
the mapping module is used for mapping the synthetic audio data and the reference audio data to obtain a mapping result;
and the synthetic video data generation module is used for generating synthetic video data according to the mapping result and the mapping relation between the reference audio data and the video frame of the reference video data, wherein the audio data in the synthetic video data is synthetic audio data.
According to another aspect of the present disclosure, there is provided a model generation apparatus including:
the input module is used for inputting the training video frames and the synthetic audio data into a voice-driven face model to be trained to obtain voice-driven video frames; the synthesized audio data is the synthesized audio data provided by any one of the embodiments of the present disclosure;
the training module is configured to train the voice-driven face model to be trained according to the voice-driven video frame and the synthesized video data, so as to obtain the voice-driven face model, where the synthesized video data is synthesized video data provided in any embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method in any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the method in any of the embodiments of the present disclosure.
The synthesized video data generated according to the disclosed technology can provide the face-driving technology with material in which the synthesized speech and the video frames are highly matched, thereby giving better guidance for the use and development of face-driving technology and helping to improve the effect of the video data it presents.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a multimedia data processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a multimedia data processing method according to another embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a multimedia data processing method according to yet another embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a multimedia data processing method according to yet another embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a multimedia data processing method according to yet another embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a multimedia data processing method according to an example of the present disclosure;
FIG. 7 is a schematic diagram of a multimedia data processing apparatus according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a multimedia data processing apparatus according to another embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a multimedia data processing apparatus according to yet another embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a multimedia data processing apparatus according to yet another embodiment of the present disclosure;
FIG. 11 is a schematic diagram of a multimedia data processing apparatus according to yet another embodiment of the present disclosure;
fig. 12 is a block diagram of an electronic device for implementing a multimedia data processing method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
According to an embodiment of the present disclosure, a multimedia data processing method is provided. Fig. 1 is a flowchart of a multimedia data processing method according to an embodiment of the present disclosure. The method may be applied to a multimedia data processing apparatus, which may be deployed in a terminal, a server or another processing device, and may perform generation of synthesized audio data, mapping between different audio data, synthesis of audio data and video data, and the like. The terminal may be a User Equipment (UE), a mobile device, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, and so on. In some possible implementations, the method may also be implemented by a processor calling computer-readable instructions stored in a memory. As shown in fig. 1, the multimedia data processing method includes:
step S11: obtaining synthetic audio data according to reference audio data in the reference video data;
step S12: mapping the synthetic audio data and the reference audio data to obtain a mapping result;
step S13: and generating synthetic video data according to the mapping result and the mapping relation between the reference audio data and the video frames of the reference video data, wherein the audio data in the synthetic video data is synthetic audio data.
In this embodiment, the reference video data may include reference audio data and a pure video frame, that is, the reference audio data may be an audio data portion of the reference video data.
The reference video data may be real-person video data generated by recording, and the reference audio data may be real-person audio data generated by recording actual utterances of speakers in the reference video data.
The synthesized audio data obtained from the reference audio data in the reference video data may have the same content as the reference audio data, while attributes that need not match the reference audio data (such as timbre, volume, etc.) may be synthetic.
The synthesized audio data may be obtained from the reference audio data in the reference video data by, for example, directly adding an interfering sound wave to the reference audio data, changing the sound wave of the reference audio data, superimposing the reference audio data with other audio data, or applying voice-changing processing to the original voice of the speaker in the reference audio data.
Alternatively, synthesized audio data with the same content may be generated independently from the content of the reference audio data using technologies such as artificial-intelligence speech generation.
In the present embodiment, if necessary, an audio data portion in the reference video data may be extracted, and synthetic audio data may be generated from the extracted audio data portion.
The mapping result is obtained by mapping the synthesized audio data to the reference audio data, which may be done by associating the two; for example, if the content at the X-th moment of the synthesized audio data is consistent with the content at the Y-th moment of the reference audio data, an association relationship exists between the two moments, and the mapping result is obtained according to such association relationships.
The synthesized video data is generated according to the mapping result and the mapping relationship between the reference audio data and the video frames of the reference video data. In other words, the synthesized video data is generated by combining the video frame part of the reference video data with the synthesized audio data, where the correspondence between the video frames of the reference video data and the synthesized audio data is determined according to the mapping relationship between the reference audio data and the synthesized audio data. That is, the audio data in the reference video data is replaced with synthesized audio data having the same content, and the correspondence between the synthesized audio data and the video frames of the reference video data is determined according to the correspondence between the original reference audio data and those video frames.
In this embodiment, the audio data portion in the synthesized video data is synthesized audio data, and the audio data in the driving video data produced by face-driving technology is likewise artificially synthesized. The synthesized video data generated in this embodiment therefore has the same audio-synthesis characteristics as the video data used or generated by face-driving technology, the synthesized audio data is completely matched with the video frames, and the picture of the speaker in each video frame is a real picture from the reference video data, giving a face-driving effect closest to a real scene. The synthesized video data in this embodiment can thus provide better material for face-driving technology. For example, when a face-driving model is trained with the synthesized video data generated by this embodiment, the model can learn from it the most standard face-driving behavior, closest to the real effect, thereby improving the face-driving effect.
In one embodiment, mapping the synthesized audio data with the reference audio data to obtain a mapping result, as shown in fig. 2, includes:
step S21: calculating the waveform distance between each synthesized audio data sampling point in the synthesized audio data and each reference audio data sampling point in the reference audio data aiming at each synthesized audio data sampling point in the synthesized audio data;
step S22: aiming at each synthetic audio data sampling point, taking the reference audio data sampling point with the closest waveform distance as the reference audio data sampling point with a mapping relation with the synthetic audio data sampling point;
step S23: and taking the mapping relation between all the synthesized audio data sampling points and the reference audio data sampling points as a mapping result.
The synthesized audio data sampling points may be sampling points in the synthesized audio data, that is, sound wave points collected in sound waves of the synthesized audio data. The sampling rates of the synthesized audio data and the reference audio data may be equal, so that the total number of sampling points in the synthesized audio data and the reference audio data is the same.
The reference audio data sample points may be sample points in the reference audio data. In a specific implementation manner, the number of sampling points of the synthesized audio data is the same as the number of sampling points in the reference audio data, so that each synthesized audio data sampling point corresponds to one reference audio data sampling point.
In this embodiment, when the waveform distance between the synthesized audio data sampling point and each reference audio data sampling point in the reference audio data is calculated, the distance may be calculated in the same reference coordinate system, for example, the waveform distance may be calculated in the same coordinate system, such as a rectangular coordinate system and a polar coordinate system.
For each synthesized audio data sampling point, for example the synthesized audio data sampling point a, the distances between a and all the reference audio data sampling points can be calculated, and the reference audio data sampling point closest to a can be taken as the sampling point having a mapping relation with a.
In the mapping result, each synthesized audio data sampling point is mapped with a reference audio data sampling point correspondingly, and the reference audio data sampling points mapped by different synthesized audio data sampling points can be different.
In this embodiment, since the synthesized audio data is generated independently of the reference audio data, the synthesized audio data and the reference audio data are mapped, which is helpful for accurately corresponding the synthesized audio data and the reference video data according to the relationship between the reference audio data and the reference video data.
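For illustration only, the following Python sketch shows one possible realization of the mapping of steps S21 to S23, treating each sampling point as a (time, amplitude) point in a rectangular coordinate system as suggested above. The brute-force search, the function name and the toy signals are assumptions of this sketch, not requirements of the embodiment.

```python
import numpy as np

def map_synth_to_reference(synth: np.ndarray, ref: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Steps S21-S23: for every synthesized-audio sampling point, find the index of
    the reference-audio sampling point with the smallest waveform distance."""
    t_synth = np.arange(len(synth)) / sr
    t_ref = np.arange(len(ref)) / sr
    mapping = np.empty(len(synth), dtype=np.int64)
    for i, (t, a) in enumerate(zip(t_synth, synth)):
        # Step S21: waveform distance to every reference sample in the (time, amplitude) plane.
        d = np.hypot(t_ref - t, ref - a)
        # Step S22: the closest reference sample is the one mapped to this synthesized sample.
        mapping[i] = int(np.argmin(d))
    # Step S23: the full index array is the mapping result.
    return mapping

# Toy usage: two 0.2 s sine waves sampled at 16 kHz with slightly different amplitude and phase.
sr = 16000
t = np.arange(int(0.2 * sr)) / sr
ref = 0.8 * np.sin(2 * np.pi * 220 * t)
synth = 0.6 * np.sin(2 * np.pi * 220 * t + 0.1)
print(map_synth_to_reference(synth, ref, sr)[:5])
```

In practice the brute-force search would be replaced by a windowed search or by the Dynamic Time Warping matching used in the Fig. 6 example below.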
In one embodiment, generating composite video data according to the mapping result and the mapping relationship between the reference audio data and the video frame of the reference video data, as shown in fig. 3, includes:
step S31: dividing the reference audio data into a plurality of voice units according to the reference video data and the reference audio data, wherein each voice unit corresponds to one frame of video frame;
step S32: determining a video frame in reference video data corresponding to each voice unit according to the time of each reference audio data sampling point of each voice unit;
step S33: and generating synthesized video data according to the video frames corresponding to all the voice units and the mapping result.
In this embodiment, each speech unit may correspond to a plurality of reference audio data sampling points. The total number of the reference audio data sampling points can be divided by the total number of the video frames of the reference video data to obtain the total number of the voice units, and then the reference audio data is equally divided according to the total number of the voice units to obtain a plurality of voice units.
Because the reference audio data belongs to the reference video data, the two can always be made to correspond through time, and all the reference audio data sampling points in one speech unit correspond to the same video frame.
When the synthesized video data is generated according to the video frames corresponding to all the speech units and the mapping result, the synthesized audio data sampling points can be made to correspond indirectly to the video frames through the correspondence between the speech units and the video frames and the correspondence between the reference audio data sampling points in the speech units and the synthesized audio data sampling points.
Since the reference video data corresponding to the reference audio data is video data of a standard real scene under voice driving, and the facial changes of the person in it, such as mouth shapes, match the reference audio data exactly, once the synthesized audio data with the same content is made to correspond to the video frames, it also matches the mouth shapes, facial muscles and so on of the person in those frames exactly; standard reference material can thus be provided for face-driving technology.
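As a minimal sketch of steps S31 and S32 (the function and variable names are assumptions of this sketch), the reference audio is divided into equal speech units, one per video frame, and each unit records the frame it corresponds to.

```python
import numpy as np

def split_into_speech_units(ref_audio: np.ndarray, num_video_frames: int):
    """Steps S31-S32: divide the reference audio into one speech unit per video frame;
    every sampling point inside a unit corresponds to that unit's video frame."""
    boundaries = np.linspace(0, len(ref_audio), num_video_frames + 1, dtype=np.int64)
    units = []
    for frame_idx in range(num_video_frames):
        start, end = boundaries[frame_idx], boundaries[frame_idx + 1]
        units.append((frame_idx, start, end))  # (video frame index, first sample, last sample + 1)
    return units

# Example: 10 s of 16 kHz reference audio and a 25 fps reference video
# -> 250 speech units of 640 sampling points each.
units = split_into_speech_units(np.zeros(10 * 16000), num_video_frames=10 * 25)
print(len(units), units[0])
```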
In one embodiment, generating the synthesized video data according to the video frames corresponding to all the speech units and the mapping result, as shown in fig. 4, includes:
step S41: determining a video frame corresponding to each reference audio data sampling point according to the video frame corresponding to each voice unit;
step S42: determining a video frame corresponding to each synthesized audio data sampling point according to the mapping result and the video frame corresponding to each reference audio data sampling point;
step S43: and arranging the video frames corresponding to all the synthetic audio data sampling points according to the sequence of the synthetic audio data sampling points to generate synthetic video data.
In this embodiment, the video frame corresponding to each speech unit may be used as the video frame corresponding to the reference audio data sampling point in the speech unit. And taking the video frame corresponding to each reference audio data sampling point as a video frame corresponding to the synthesized audio data sampling point which has a mapping relation with the reference audio data sampling point.
The video frames corresponding to all the synthesized audio data sampling points are arranged in the order of those sampling points to generate the synthesized video data; each sampling point, or several consecutive sampling points, corresponds to one video frame, and the corresponding video frames are arranged consecutively to form the synthesized video data.
For example, the synthesized audio data may be divided into a plurality of speech units, each containing a plurality of synthesized audio data sampling points, and each sampling point corresponds to a video frame of the reference video data through its relationship with a reference audio data sampling point. For each speech unit of synthesized audio data, the video frame that occurs most often among the video frames corresponding to its sampling points can then be selected as the video frame for that speech unit. Specifically, if a speech unit B of the synthesized audio data contains 100 sampling points, 80 of which correspond to the same video frame C and the remaining 20 to other video frames, then B corresponds to video frame C.
According to this embodiment, an accurate, matched correspondence can be established between the synthesized speech and the video frames of the reference video data, so that after the two are combined into video data, the video frames in the combined video data are highly matched with the synthesized speech.
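The following sketch illustrates steps S41 to S43 under the simplifying assumption that the sample-level mapping result and the per-sample frame indices are already available as arrays; the representative-sample rule at the end is only one option, and the Fig. 6 example below refines it with a majority vote.

```python
import numpy as np

def arrange_frames_for_synth_audio(mapping: np.ndarray,
                                   ref_sample_to_frame: np.ndarray,
                                   samples_per_frame: int) -> np.ndarray:
    """Steps S41-S43: look up a reference video frame for every synthesized-audio
    sampling point via the mapping result, then arrange frames in sampling-point order."""
    # Steps S41/S42: video frame of each synthesized sampling point.
    frame_per_synth_sample = ref_sample_to_frame[mapping]
    # Step S43: emit one frame per block of `samples_per_frame` consecutive sampling points.
    return frame_per_synth_sample[::samples_per_frame]

# Toy usage: 12 synthesized sampling points, 4 samples per output frame, 3 reference frames.
mapping = np.array([0, 0, 1, 2, 4, 5, 5, 6, 8, 9, 10, 11])   # synth sample -> ref sample
ref_sample_to_frame = np.repeat(np.arange(3), 4)              # ref sample -> video frame
print(arrange_frames_for_synth_audio(mapping, ref_sample_to_frame, samples_per_frame=4))  # [0 1 2]
```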
In one embodiment, dividing reference audio data into a plurality of speech units according to reference video data and reference audio data comprises:
obtaining the duration of the reference audio data corresponding to each frame of video frame according to the duration of the reference audio data and the frame rate of the reference video data;
and dividing the reference audio data according to the duration of the reference audio data corresponding to each frame of video frame to obtain a voice unit.
In this embodiment, the duration of the reference audio data may be divided by the frame rate of the reference video data, and the obtained quotient is used as the duration of the reference audio data corresponding to each frame of video frame.
The duration of each speech unit is thus the duration of reference audio data corresponding to one video frame.
In this embodiment, according to the duration of the reference audio data and the frame rate of the reference video data, the duration of the reference audio data corresponding to each frame of video frame is divided and determined, so that the corresponding relationship between the reference audio data and the video frame can be reasonably determined, and the matching degree between the synthesized audio data in the synthesized video data and the video frame is higher.
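A short worked example of this division follows, with values assumed purely for illustration; since the reference audio and the reference video have the same length, the audio duration divided by the number of video frames equals the reciprocal of the frame rate.

```python
# Assumed values: 8 s reference clip, 25 fps reference video, 16 kHz reference audio.
duration_s, fps, sr = 8.0, 25.0, 16000

per_frame_audio_s = 1.0 / fps             # reference audio covered by one video frame: 0.04 s
num_units = int(round(duration_s * fps))  # number of speech units: 200
samples_per_unit = int(round(sr / fps))   # reference sampling points per unit: 640
print(per_frame_audio_s, num_units, samples_per_unit)
```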
In one embodiment, the synthesized audio data is Text To Speech (TTS) data. TTS is one type of speech synthesis application that converts files stored in a computer, such as help files or web pages, into natural speech output. TTS can not only help visually impaired people read information on a computer, but also increase the readability of text documents.
Generally, TTS speech cannot be exactly the same as a real person's utterance. However, in speech-driven technology the speech that is used and generated is usually synthesized speech such as TTS speech; using TTS speech as the synthesized speech therefore brings the synthesized video data closer to the video data produced by speech-driven technology and provides a better reference for the use and improvement of that technology, including speech-driven model training.
In one embodiment, obtaining synthetic audio data from reference audio data in reference video data comprises:
determining the text content according to the reference audio data;
and obtaining synthetic audio data according to the text content.
The text content determined according to the reference audio data may be the text content spoken in the reference audio data. The synthesized audio data obtained from the text content may be synthesized audio data generated so as to carry the same text content.
In this embodiment, the synthesized audio data is generated from the text content, so that it shares the non-human pronunciation characteristics of the audio data generated in voice-driven face technology, and can thus provide a better reference for the improvement and use of voice-driven technology.
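As an illustrative sketch of this embodiment, the snippet below generates synthesized (TTS) audio from a transcript. The pyttsx3 engine is just one off-the-shelf choice and the transcript is assumed to have been obtained from the reference audio already (for example by manual transcription or speech recognition); neither is prescribed by this disclosure.

```python
# Assumed tooling: pyttsx3 as an off-the-shelf offline TTS engine.
import pyttsx3

def synthesize_from_transcript(transcript: str, out_wav: str = "synth_audio.wav") -> str:
    engine = pyttsx3.init()
    engine.setProperty("rate", 170)           # timbre/rate/voice may differ from the real speaker
    engine.save_to_file(transcript, out_wav)  # synthesized audio with the same text content
    engine.runAndWait()
    return out_wav

synthesize_from_transcript("The text content spoken in the reference video.")
```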
An embodiment of the present disclosure further provides a model generation method, as shown in fig. 5, including:
step S51: inputting the training video frame and the synthetic audio data into a voice-driven face model to be trained to obtain a voice-driven video frame; the synthesized audio data is the synthesized audio data provided by any one of the embodiments of the present disclosure;
step S52: and training the voice-driven face model to be trained according to the voice-driven video frame and the synthesized video data to obtain the voice-driven face model, wherein the synthesized video data is the synthesized video data provided by any one embodiment of the disclosure.
With the advent of digital technology, virtual digital characters increasingly appear in people's lives, such as Non-Player Characters (NPCs) in games, virtual news anchors and animated characters in short videos, all of which can be active on screen like real people, interacting and conversing with people in the real world. An important technology behind this capability is voice-driven lip (voice-driven face) technology: a virtual figure on the screen is driven by a segment of speech so that it produces lip movements matching that speech. Training a voice-driven lip model requires collecting a large amount of paired speech and lip-movement video data as training samples.
Real-person voice and lip-motion video data of this kind is easy to collect in the real world, for example from news broadcasts, lectures and live streams. In practical applications, however, the speech used to drive the video is often machine-generated synthesized audio data such as TTS speech rather than real-person audio data. Differences remain between real-person audio data and synthesized audio data that machines cannot fully reproduce, and these differences cause a lip-shape model driven by TTS audio data to perform far worse than one driven by real-person audio data. Yet for synthesized audio data such as TTS speech, it is difficult to find lip-motion video data that matches it exactly, so the one-to-one speech and lip-motion video pairs required to train a speech-driven face model cannot be constructed directly.
In this embodiment, by contrast, synthesized video data containing synthesized audio data is used to train the voice-driven face model. Because the relationship between the synthesized audio data and the video frames in the synthesized video data is determined through the mapping between the synthesized audio data and the reference audio data, the synthesized audio data and the video frames are highly matched, and each video frame is a picture of a real person uttering reference audio data with the same text content. The model can therefore learn from the synthesized video data the face, mouth shape, muscle texture and other information corresponding to the pronunciation of the synthesized audio data, and the voice-driven face model obtained after training can generate more realistic and highly matched face-driven pictures when driven by other synthesized speech.
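A minimal PyTorch-style training sketch of steps S51 and S52 follows. The network, feature dimensions, loss and toy tensors are assumptions made only to show the data flow: a training video frame plus synthesized audio features go in, and the output is supervised against the matching frame of the synthesized video data; the disclosure does not fix a particular architecture.

```python
import torch
from torch import nn

class VoiceDrivenFaceModel(nn.Module):
    """Illustrative stand-in for the voice-driven face model to be trained."""
    def __init__(self):
        super().__init__()
        self.audio_enc = nn.Linear(80, 128)                  # assumed 80-dim audio features
        self.fuse = nn.Linear(128 + 3 * 64 * 64, 3 * 64 * 64)

    def forward(self, frame, audio_feat):
        a = torch.relu(self.audio_enc(audio_feat))
        x = torch.cat([a, frame.flatten(1)], dim=1)
        return torch.sigmoid(self.fuse(x)).view_as(frame)    # voice-driven video frame

model = VoiceDrivenFaceModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

# One toy batch standing in for (training frame, synthesized audio features,
# target frame taken from the synthesized video data).
train_frame = torch.rand(4, 3, 64, 64)
synth_audio_feat = torch.rand(4, 80)
target_frame = torch.rand(4, 3, 64, 64)

for _ in range(3):
    driven = model(train_frame, synth_audio_feat)   # step S51: obtain voice-driven frames
    loss = loss_fn(driven, target_frame)            # step S52: supervise with synthesized video
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```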
In an example of the present disclosure, referring to fig. 6, the multimedia data processing method includes the steps of:
step S61: real person video data is obtained. In the step, the matched real voice and lip shape video data are found, and the data can be easily obtained from news and lecture video data.
Step S62: and acquiring real person voice data in the real person video data.
Step S63: synthesized speech data is obtained.
In this step, TTS speech having the same content as the real-person speech can be generated through a speech generation model. Only the content of the TTS speech matches the real-person speech; other features, including but not limited to timbre, pitch, rate and cadence (and other parameters that can be controlled in synthesized speech), may differ from the real-person speech.
Step S64: and splitting the real person video data into a picture sequence.
In this step, the real-person video data is split frame by frame into pictures according to the order of the video frames; each picture is one image frame (video frame), and the sequence is indexed i = 1, 2, …, N.
Step S65: and representing the voice data part in the video data of the real person in a frequency spectrum form.
The real-person voice is read at a sampling rate of 16 kHz (i.e., 16000 points are taken from the audio waveform every second to represent the voice) and its frequency spectrum is obtained.
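For illustration, the snippet below reads a waveform at 16 kHz and computes a magnitude spectrum. librosa is an assumed choice (any library that resamples to 16 kHz and computes a spectrum would do), the file name is hypothetical, and the audio track is assumed to have been extracted from the real-person video beforehand.

```python
import numpy as np
import librosa

# 16 kHz means 16000 sampling points are taken from the audio waveform every second.
wave, sr = librosa.load("real_person_audio.wav", sr=16000)   # hypothetical extracted track

# Magnitude spectrum; the window and hop sizes are illustrative, not prescribed.
spectrum = np.abs(librosa.stft(wave, n_fft=512, hop_length=160))
print(wave.shape, spectrum.shape)   # (num_samples,), (257, num_spectral_frames)
```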
Step S66: and determining a video frame corresponding to each interval of the real person voice data according to the duration of the real person voice data and the frame rate of the video data.
In this step, according to the duration of the real-person voice and the frame rate of the real-person video data, the voice spectrum can be equally divided into N intervals, and all audio sampling points in each interval correspond to the same video frame among frames 1, …, N, thereby establishing the mapping relationship between the real-person voice and the real-person video data.
Step S67: the synthesized speech data is represented in the form of a frequency spectrum.
In this step, TTS speech is read at a sampling rate of 16kHz, or TTS synthesized speech is generated directly at a sampling rate of 16kHz, and a corresponding frequency spectrum is obtained.
Step S68: and establishing a mapping relation between the synthesized voice data and the real voice data.
TTS speech and real-person speech can be dynamically matched by the Dynamic Time Warping (DTW) method. Based on this method, each sampling point of the TTS speech finds the closest real-person speech sampling point, establishing a mapping between the spectrum of the TTS synthesized speech and the spectrum of the real-person speech.
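The following self-contained numpy sketch shows the kind of Dynamic Time Warping alignment described in this step, here over simple 1-D feature sequences; the feature choice, the distance measure and the function name are assumptions of the sketch, and in practice an optimized DTW implementation would be used.

```python
import numpy as np

def dtw_align(tts_feat: np.ndarray, real_feat: np.ndarray) -> np.ndarray:
    """Classic O(N*M) Dynamic Time Warping; returns, for every TTS index,
    the matched real-person index along the optimal warping path (step S68)."""
    n, m = len(tts_feat), len(real_feat)
    cost = np.abs(tts_feat[:, None] - real_feat[None, :])   # local distances
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack the optimal path, keeping one real-person index per TTS index.
    i, j, match = n, m, np.zeros(n, dtype=np.int64)
    while i > 0:
        match[i - 1] = j - 1
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return match

# Toy usage with short 1-D envelopes standing in for the two spectra.
tts = np.array([0.0, 0.2, 0.9, 0.4, 0.1])
real = np.array([0.0, 0.1, 0.3, 1.0, 0.5, 0.1])
print(dtw_align(tts, real))
```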
Step S69: and establishing a mapping relation between the synthesized voice data and the real person video data.
In this step, the mapping between the real-person voice and the synthesized voice is combined with the mapping between the real-person voice and the real-person video data, so that the correspondence between the spectrum of the TTS speech and the video frames of the real-person video data is obtained indirectly.
Step S610: composite video data is generated.
In this step, the number M of video frames required by the TTS synthesized speech may be calculated according to a desired frame rate (for example, the default frame rate at which the real-person video data was captured) and the duration of the TTS synthesized speech, and the spectrum of the TTS synthesized speech is divided into M equal sub-spectra (equivalent to the speech units of the synthesized speech in the foregoing embodiments). For each sub-spectrum, the real-person video frames corresponding to all of its sampling points are counted, and the most frequent video frame is taken as the real-person video frame corresponding to that sub-spectrum. Once the correspondence between the sub-spectra of the TTS synthesized speech and the video frames is established, the corresponding real-person image frames are taken from the video data in the order of the TTS synthesized speech to compose new video data, completing the accurate matching of the TTS synthesized speech with the real-person video data.
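A small numpy sketch of this final assembly step follows, assuming the DTW mapping from TTS sampling points to real-person sampling points and the per-sample frame indices are already available; all names are illustrative.

```python
import numpy as np

def frames_for_tts_units(tts_to_real_sample: np.ndarray,
                         real_sample_to_frame: np.ndarray,
                         num_tts_frames: int) -> np.ndarray:
    """Step S610: split the TTS speech into equal sub-spectra and, for each one,
    take the most frequent corresponding real-person video frame."""
    real_frame_per_tts_sample = real_sample_to_frame[tts_to_real_sample]
    units = np.array_split(real_frame_per_tts_sample, num_tts_frames)
    return np.array([np.bincount(u).argmax() for u in units])

# Toy usage: 12 TTS sampling points, 3 output frames, a 3-frame real-person video.
tts_to_real_sample = np.array([0, 1, 1, 2, 4, 5, 5, 6, 8, 9, 10, 11])
real_sample_to_frame = np.repeat(np.arange(3), 4)
selected = frames_for_tts_units(tts_to_real_sample, real_sample_to_frame, num_tts_frames=3)
print(selected)   # [0 1 2]: these real-person frames, in TTS order, form the new video data.
```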
An embodiment of the present disclosure further provides a multimedia data processing apparatus, as shown in fig. 7, including:
a synthesized audio data obtaining module 71, configured to obtain synthesized audio data according to reference audio data in the reference video data;
a mapping module 72, configured to map the synthesized audio data with the reference audio data to obtain a mapping result;
and a synthesized video data generating module 73, configured to generate synthesized video data according to the mapping result and the mapping relationship between the reference audio data and the video frame of the reference video data, where the audio data in the synthesized video data is synthesized audio data.
In one embodiment, as shown in FIG. 8, the mapping module may include:
a waveform distance unit 81 for calculating a waveform distance between a synthesized audio data sampling point and each reference audio data sampling point in the reference audio data for each synthesized audio data sampling point in the synthesized audio data;
a sampling point corresponding unit 82, configured to, for each synthesized audio data sampling point, use the reference audio data sampling point with the closest waveform distance as a reference audio data sampling point having a mapping relationship with the synthesized audio data sampling point;
a mapping result generating unit 83 for generating a mapping relationship between all the synthesized audio data sampling points and the reference audio data sampling points as a mapping result.
In one embodiment, as shown in fig. 9, the composite video data generation module includes:
a dividing unit 91, configured to divide the reference audio data into a plurality of voice units according to the reference video data and the reference audio data, where each voice unit corresponds to one frame of video frame;
a video frame corresponding unit 92, configured to determine, according to a time of each reference audio data sampling point of a speech unit, a video frame in reference video data corresponding to each speech unit;
and a mapping result processing unit 93, configured to generate synthesized video data according to the video frames and mapping results corresponding to all the speech units.
In one embodiment, the mapping result processing unit is further configured to:
determining a video frame corresponding to each reference audio data sampling point according to the video frame corresponding to each voice unit;
determining a video frame corresponding to each synthesized audio data sampling point according to the mapping result and the video frame corresponding to each reference audio data sampling point;
and arranging the video frames corresponding to all the synthetic audio data sampling points according to the sequence of the synthetic audio data sampling points to generate synthetic video data.
In one embodiment, the dividing unit is further configured to:
obtaining the duration of the reference audio data corresponding to each frame of video frame according to the duration of the reference audio data and the frame rate of the reference video data;
and dividing the reference audio data according to the duration of the reference audio data corresponding to each frame of video frame to obtain a voice unit.
In one embodiment, the synthesized audio data is Text To Speech (TTS) data.
In one embodiment, as shown in fig. 10, the synthetic audio data generation module includes:
a text content determining module 101, configured to determine text content according to the reference audio data;
and the word content processing module 102 is configured to obtain synthesized audio data according to the word content.
An embodiment of the present disclosure further provides a model generation apparatus, as shown in fig. 11, including:
the input module 111 is configured to input the training video frame and the synthesized audio data into a voice-driven face model to be trained, so as to obtain a voice-driven video frame; the synthetic audio data is generated in any embodiment of the disclosure;
the training module 112 is configured to train the voice-driven face model to be trained according to the voice-driven video frame and the synthesized video data, so as to obtain the voice-driven face model, where the synthesized video data is synthesized video data generated by any one of the embodiments of the present disclosure.
The embodiment of the disclosure can be applied to the technical fields of computer technology, artificial intelligence and the like, and particularly can be applied to the technical fields of deep learning, computer vision and the like.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 12 shows a schematic block diagram of an example electronic device 120 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the device 120 includes a computing unit 121 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 122 or a computer program loaded from a storage unit 128 into a Random Access Memory (RAM) 123. In the RAM 123, various programs and data required for the operation of the device 120 can also be stored. The computing unit 121, the ROM 122, and the RAM 123 are connected to each other via a bus 124. An input/output (I/O) interface 125 is also connected to the bus 124.
A number of components in device 120 are connected to I/O interface 125, including: an input unit 126 such as a keyboard, a mouse, and the like; an output unit 127 such as various types of displays, speakers, and the like; a storage unit 128 such as a magnetic disk, optical disk, or the like; and a communication unit 129 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 129 allows the device 120 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 121 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 121 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 121 performs the respective methods and processes described above, such as a multimedia data processing method. For example, in some embodiments, the multimedia data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 128. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 120 via ROM 122 and/or communications unit 129. When the computer program is loaded into the RAM 123 and executed by the computing unit 121, one or more steps of the multimedia data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 121 may be configured to perform the multimedia data processing method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A multimedia data processing method, comprising:
obtaining synthetic audio data according to reference audio data in the reference video data;
mapping the synthetic audio data and the reference audio data to obtain a mapping result;
and generating synthetic video data according to the mapping result and the mapping relation between the reference audio data and the video frame of the reference video data, wherein the audio data in the synthetic video data is the synthetic audio data.
2. The method of claim 1, wherein the mapping the synthesized audio data with the reference audio data to obtain a mapping result comprises:
calculating a waveform distance between each synthesized audio data sampling point in the synthesized audio data and each reference audio data sampling point in the reference audio data;
aiming at each synthesized audio data sampling point, taking a reference audio data sampling point with the closest waveform distance as a reference audio data sampling point having a mapping relation with the synthesized audio data sampling point;
and taking the mapping relation between all the synthetic audio data sampling points and the reference audio data sampling points as the mapping result.
3. The method according to claim 1 or 2, wherein the generating of the synthetic video data according to the mapping result and the mapping relationship between the reference audio data and the video frame of the reference video data comprises:
dividing the reference audio data into a plurality of voice units according to the reference video data and the reference audio data, wherein each voice unit corresponds to one frame of video frame;
determining a video frame in reference video data corresponding to each voice unit according to the time of each reference audio data sampling point of the voice unit;
and generating the synthesized video data according to the video frames corresponding to all the voice units and the mapping result.
4. The method of claim 3, wherein the generating the synthesized video data according to the video frames corresponding to all the speech units and the mapping result comprises:
determining a video frame corresponding to each reference audio data sampling point according to the video frame corresponding to each voice unit;
determining a video frame corresponding to each synthesized audio data sampling point according to the mapping result and the video frame corresponding to each reference audio data sampling point;
and arranging the video frames corresponding to all the synthetic audio data sampling points according to the sequence of the synthetic audio data sampling points to generate the synthetic video data.
5. The method of claim 3, wherein said dividing the reference audio data into a plurality of speech units according to the reference video data and the reference audio data comprises:
obtaining the duration of the reference audio data corresponding to each frame of video frame according to the duration of the reference audio data and the frame rate of the reference video data;
and dividing the reference audio data according to the duration of the reference audio data corresponding to each frame of video frame to obtain the voice unit.
6. The method of any of claims 1-5, wherein the synthesized audio data is Text To Speech (TTS) data.
7. The method of any of claims 1-6, wherein the obtaining synthetic audio data from reference audio data in reference video data comprises:
determining the text content according to the reference audio data;
and obtaining the synthetic audio data according to the text content.
8. A model generation method, comprising:
inputting the training video frame and the synthetic audio data into a voice-driven face model to be trained to obtain a voice-driven video frame; the synthetic audio data is the synthetic audio data of any one of claims 1-7;
training the voice-driven face model to be trained according to the voice-driven video frame and the synthesized video data to obtain the voice-driven face model, wherein the synthesized video data is the synthesized video data according to any one of claims 1 to 7.
9. A multimedia data processing apparatus comprising:
the synthetic audio data acquisition module is used for acquiring synthetic audio data according to reference audio data in the reference video data;
the mapping module is used for mapping the synthetic audio data and the reference audio data to obtain a mapping result;
and the synthetic video data generation module is used for generating synthetic video data according to the mapping result and the mapping relation between the reference audio data and the video frame of the reference video data, wherein the audio data in the synthetic video data is the synthetic audio data.
10. The apparatus of claim 9, wherein the mapping module comprises:
a waveform distance unit for calculating, for each synthesized audio data sampling point in the synthesized audio data, a waveform distance between the synthesized audio data sampling point and each reference audio data sampling point in the reference audio data;
the sampling point correspondence unit is used for taking, for each synthesized audio data sampling point, the reference audio data sampling point with the closest waveform distance as the reference audio data sampling point having a mapping relation with the synthesized audio data sampling point;
and the mapping result generating unit is used for taking the mapping relation between all the synthetic audio data sampling points and the reference audio data sampling points as the mapping result.
11. The apparatus of claim 9 or 10, wherein the synthetic video data generation module comprises:
the dividing unit is used for dividing the reference audio data into a plurality of voice units according to the reference video data and the reference audio data, each voice unit corresponding to one video frame;
the video frame correspondence unit is used for determining, for each voice unit, the corresponding video frame in the reference video data according to the times of the reference audio data sampling points in the voice unit;
and the mapping result processing unit is used for generating the synthesized video data according to the video frames corresponding to all the voice units and the mapping result.
12. The apparatus of claim 11, wherein the mapping result processing unit is further configured to:
determining a video frame corresponding to each reference audio data sampling point according to the video frame corresponding to each voice unit;
determining a video frame corresponding to each synthesized audio data sampling point according to the mapping result and the video frame corresponding to each reference audio data sampling point;
and arranging the video frames corresponding to all the synthetic audio data sampling points according to the sequence of the synthetic audio data sampling points to generate the synthetic video data.
13. The apparatus of claim 11, wherein the partitioning unit is further configured to:
obtaining the duration of the reference audio data corresponding to each video frame according to the duration of the reference audio data and the frame rate of the reference video data;
and dividing the reference audio data according to the duration of the reference audio data corresponding to each video frame to obtain the voice units.
14. The apparatus of any of claims 9-13, wherein the synthesized audio data is dialogue speech data converted from text.
15. The apparatus of any of claims 9-14, wherein the synthetic audio data acquisition module comprises:
the text content determining module is used for determining text content according to the reference audio data;
and the text content processing module is used for acquiring the synthetic audio data according to the text content.
16. A model generation apparatus comprising:
the input module is used for inputting the training video frames and the synthetic audio data into a voice-driven face model to be trained to obtain voice-driven video frames; the synthetic audio data is the synthetic audio data of any one of claims 9-15;
a training module, configured to train the voice-driven face model to be trained according to a voice-driven video frame and synthesized video data, to obtain the voice-driven face model, where the synthesized video data is the synthesized video data according to any one of claims 9 to 15.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 8.
CN202111320124.4A 2021-11-09 2021-11-09 Multimedia data processing method and device, electronic equipment and storage medium Active CN114051105B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111320124.4A CN114051105B (en) 2021-11-09 2021-11-09 Multimedia data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111320124.4A CN114051105B (en) 2021-11-09 2021-11-09 Multimedia data processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114051105A true CN114051105A (en) 2022-02-15
CN114051105B CN114051105B (en) 2023-03-10

Family

ID=80207560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111320124.4A Active CN114051105B (en) 2021-11-09 2021-11-09 Multimedia data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114051105B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107396177A (en) * 2017-08-28 2017-11-24 北京小米移动软件有限公司 Video broadcasting method, device and storage medium
CN109547843A (en) * 2019-02-01 2019-03-29 腾讯音乐娱乐科技(深圳)有限公司 The method and apparatus that audio-video is handled
CN112215927A (en) * 2020-09-18 2021-01-12 腾讯科技(深圳)有限公司 Method, device, equipment and medium for synthesizing face video
CN112420014A (en) * 2020-11-17 2021-02-26 平安科技(深圳)有限公司 Virtual face construction method and device, computer equipment and computer readable medium
CN112562705A (en) * 2019-05-05 2021-03-26 广州虎牙信息科技有限公司 Live broadcast interaction method and device, electronic equipment and readable storage medium
CN112866586A (en) * 2021-01-04 2021-05-28 北京中科闻歌科技股份有限公司 Video synthesis method, device, equipment and storage medium
CN112887789A (en) * 2021-01-22 2021-06-01 北京百度网讯科技有限公司 Video generation model construction method, video generation device, video generation equipment and video generation medium


Also Published As

Publication number Publication date
CN114051105B (en) 2023-03-10

Similar Documents

Publication Publication Date Title
CN110298906B (en) Method and device for generating information
CN112669417B (en) Virtual image generation method and device, storage medium and electronic equipment
CN112309365B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN109660865B (en) Method and device for automatically labeling videos, medium and electronic equipment
EP3917131A1 (en) Image deformation control method and device and hardware device
US10783884B2 (en) Electronic device-awakening method and apparatus, device and computer-readable storage medium
WO2021147157A1 (en) Game special effect generation method and apparatus, and storage medium and electronic device
CN112652041B (en) Virtual image generation method and device, storage medium and electronic equipment
CN110992963A (en) Network communication method, device, computer equipment and storage medium
CN114895817B (en) Interactive information processing method, network model training method and device
EP4167226A1 (en) Audio data processing method and apparatus, and device and storage medium
CN110516749A (en) Model training method, method for processing video frequency, device, medium and calculating equipment
CN110880324A (en) Voice data processing method and device, storage medium and electronic equipment
JP2023059937A (en) Data interaction method and device, electronic apparatus, storage medium and program
WO2021227308A1 (en) Video resource generation method and apparatus
CN114339069A (en) Video processing method and device, electronic equipment and computer storage medium
CN113469292A (en) Training method, synthesizing method, device, medium and equipment for video synthesizing model
KR20200013907A (en) Method for Audio Synthesis corresponding to Video Characteristics
CN114051105B (en) Multimedia data processing method and device, electronic equipment and storage medium
JP7372402B2 (en) Speech synthesis method, device, electronic device and storage medium
CN114255737B (en) Voice generation method and device and electronic equipment
CN111324710A (en) Virtual human-based online investigation method and device and terminal equipment
CN115619897A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112562733A (en) Media data processing method and device, storage medium and computer equipment
CN114360558B (en) Voice conversion method, voice conversion model generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant