CN116980643A - Data processing method, device, equipment and readable storage medium

Info

Publication number
CN116980643A
CN116980643A (Application No. CN202311015515.4A)
Authority
CN
China
Prior art keywords
data
action
audio data
dubbing
pronunciation
Prior art date
Legal status
Pending
Application number
CN202311015515.4A
Other languages
Chinese (zh)
Inventor
梁宇轩
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority claimed from CN202311015515.4A
Publication of CN116980643A
Legal status: Pending

Classifications

    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/233: Processing of audio elementary streams (server-side content processing)
    • H04N 21/23424: Processing of video elementary streams involving splicing one content stream with another, e.g. for inserting or substituting an advertisement
    • H04N 21/242: Synchronization processes, e.g. processing of PCR [Program Clock References]
    • H04N 21/4307: Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N 21/439: Processing of audio elementary streams (client-side)
    • H04N 21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N 21/44016: Processing of video elementary streams involving splicing one content stream with another, e.g. for substituting a video clip

Abstract

The application discloses a data processing method, device, equipment and readable storage medium, wherein the method comprises the following steps: acquiring dubbing audio data of a target object in media data; performing action conversion processing on the dubbing audio data to obtain a standard action of a pronunciation part corresponding to the dubbing audio data; adjusting the pronunciation action of the pronunciation part of the target object in the media data according to the standard action to obtain a matching action of the pronunciation part of the target object in the media data; and fusing the dubbing audio data, the matching action and the media data to obtain dubbing media data. By adopting the application, audio-visual synchronization of media data can be realized and the presentation effect of the media data can be optimized.

Description

Data processing method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, apparatus, device, and readable storage medium.
Background
With the rapid development of multimedia technology, a large number of media applications (such as video playing applications, game applications, etc.) have come online, and through these applications a large amount of media data (such as video) has entered people's daily lives and become a common form of entertainment. Media data such as video often requires post-processing; for example, a certain character in a video may need to be dubbed, so that the pronunciation part (such as the lips) of the character appears to emit the desired audio data.
In the related art, when post-processing is performed on media data, the dubbing of a character in a video is simply created or replaced, and the dubbing of different characters is not adjusted in combination with the specific content of the media data. As a result, when the media data is played, the output audio data may not match the content of the media data itself. For example, the output audio may be character A laughing out loud ("haha"), while the picture presented in the media data is not a laughing picture but one in which character A has a very calm expression. That is, a post-processing approach that simply creates or replaces the dubbing of a character in a video is prone to the problem of unsynchronized audio and video, so that the presentation effect of the finally presented media data is poor.
Disclosure of Invention
The embodiment of the application provides a data processing method, a device, equipment and a readable storage medium, which can realize audio-visual synchronization of media data and further optimize the presentation effect of the media data.
In one aspect, an embodiment of the present application provides a data processing method, including:
acquiring dubbing audio data of a target object in media data;
Performing action conversion processing on dubbing audio data to obtain standard actions of pronunciation parts corresponding to the dubbing audio data;
according to the standard action, adjusting and processing the pronunciation action of the pronunciation part of the target object in the media data to obtain the matching action of the pronunciation part of the target object in the media data;
and fusing the dubbing audio data, the matching action and the media data to obtain dubbing media data.
In one aspect, an embodiment of the present application provides a data processing apparatus, including:
the data acquisition module is used for acquiring dubbing audio data of a target object in the media data;
the data conversion module is used for performing action conversion processing on the dubbing audio data to obtain standard actions of the pronunciation parts corresponding to the dubbing audio data;
the action adjusting module is used for adjusting the pronunciation action of the pronunciation part of the target object in the media data according to the standard action to obtain the matching action of the pronunciation part of the target object in the media data;
and the fusion module is used for fusing the dubbing audio data, the matching action and the media data to obtain dubbing media data.
In one embodiment, after the fusing module fuses the dubbed audio data, the matching action and the media data to obtain the dubbed media data, the data processing apparatus further includes:
The subtitle acquisition module is used for acquiring subtitle data indicated by the dubbing audio data;
the time length determining module is used for obtaining the audio duration of the dubbing audio data and determining the caption display time length of the caption data according to the audio duration;
and the synchronous display module is used for synchronously displaying the caption data according to the caption display time length when the dubbing media data are played.
In one embodiment, the caption obtaining module obtains a specific implementation mode of caption data indicated by dubbing audio data, including:
performing action recognition processing on the matched action to obtain action language data indicated by the matched action;
the action language data indicated by the matching action is determined as subtitle data indicated by the dubbing audio data.
In one embodiment, the subtitle obtaining module performs an action recognition process on the matching action to obtain a specific implementation manner of action language data indicated by the matching action, including:
converting the matching action into a visual state sequence through a visual mapping model;
and performing character conversion processing on the visual state sequence to obtain action language data indicated by the matching action.
In one embodiment, the data conversion module performs an action conversion process on the dubbing audio data to obtain a specific implementation manner of a standard action of a pronunciation part corresponding to the dubbing audio data, where the specific implementation manner includes:
Acquiring the language category to which the dubbing audio data belongs, and determining a phoneme conversion rule of the dubbing audio data according to the language category;
converting dubbing audio data into a phoneme sequence according to a phoneme conversion rule;
and calling an action conversion model, and performing action conversion processing on the phoneme sequence of the dubbing audio data through the action conversion model to obtain a standard action corresponding to the dubbing audio data.
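For illustration only, the interface implied by this phoneme-to-action step can be sketched as below. The patent's action conversion model is a learned model and is not specified here; the phoneme-to-viseme lookup table, the LipKeyframe fields and all numeric values are hypothetical stand-ins used only to show the data flow from a phoneme sequence to a sequence of standard lip actions.

```typescript
// Hypothetical sketch: map a phoneme sequence to lip key-frames ("standard actions").
// The lookup table is a stand-in for the trained action conversion model.
interface LipKeyframe {
  openness: number; // how far the lips are apart, 0..1
  width: number;    // lateral lip stretch, 0..1
  rounding: number; // lip rounding, 0..1
}

// Assumed viseme table: phoneme -> canonical lip pose (values are illustrative).
const PHONEME_TO_LIP: Record<string, LipKeyframe> = {
  a: { openness: 0.9, width: 0.5, rounding: 0.1 },
  o: { openness: 0.6, width: 0.3, rounding: 0.9 },
  m: { openness: 0.0, width: 0.5, rounding: 0.2 },
};

function phonemesToStandardAction(phonemes: string[]): LipKeyframe[] {
  // Unknown phonemes fall back to a neutral, slightly open pose.
  return phonemes.map(
    (p) => PHONEME_TO_LIP[p] ?? { openness: 0.2, width: 0.4, rounding: 0.2 },
  );
}
```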
In one embodiment, the data conversion module determines a specific implementation of a phoneme conversion rule of dubbing audio data according to a language category, including:
if the language class is the first language class, determining a first conversion rule in the configuration conversion rule set as a phoneme conversion rule of the dubbing audio data;
if the language class is the second language class, determining a second conversion rule in the configuration conversion rule set as a phoneme conversion rule of the dubbing audio data.
In one embodiment, the language class to which the dubbing audio data belongs is a first language class, and the phoneme conversion rule is a first conversion rule;
the specific implementation mode of the data conversion module for converting dubbing audio data into a phoneme sequence according to a phoneme conversion rule comprises the following steps:
Preprocessing dubbing audio data according to a first conversion rule to obtain first preprocessed audio data;
carrying out semantic analysis processing on the first preprocessed audio data to obtain text data of dubbing audio data;
acquiring a sound dictionary; the sound dictionary comprises a phoneme mapping relation between text words and configuration phoneme sequences;
and determining a phoneme sequence indicated by the text data through a phoneme mapping relation between the text words and the configuration phoneme sequence in the sounding dictionary, and determining the phoneme sequence indicated by the text data as a phoneme sequence of the dubbing audio data.
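A minimal sketch of the dictionary lookup described above is given below, assuming a hypothetical in-memory sound dictionary that maps text words to configured phoneme sequences; the dictionary entries and phoneme symbols are invented for illustration and are not taken from the patent.

```typescript
// Sketch of the sound-dictionary lookup: text words -> configured phoneme sequences.
// Entries and phoneme symbols are illustrative assumptions.
const SOUND_DICTIONARY: Record<string, string[]> = {
  hello: ["h", "e", "l", "ou"],
  world: ["w", "er", "l", "d"],
};

function textToPhonemeSequence(textData: string): string[] {
  return textData
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .flatMap((word) => SOUND_DICTIONARY[word] ?? []); // unknown words are skipped here
}

// e.g. textToPhonemeSequence("hello world") -> ["h","e","l","ou","w","er","l","d"]
```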
In one embodiment, the data conversion module performs semantic analysis processing on the first preprocessed audio data to obtain a specific implementation manner of text data of dubbed audio data, including:
calling M semantic analysis models, and respectively performing semantic analysis processing on the first preprocessed audio data through each semantic analysis model to obtain an analysis text corresponding to each semantic analysis model;
and performing text comparison processing on the M analysis texts to determine text data of dubbing audio data.
In one embodiment, the data conversion module performs text comparison processing on the M analyzed texts to determine a specific implementation manner of text data of dubbing audio data, including:
Performing text comparison processing on the M analysis texts;
if the M analysis texts are different from each other, acquiring a high-precision semantic analysis model from the M semantic analysis models, and determining the analysis text corresponding to the high-precision semantic analysis model as text data of dubbing audio data;
if the same analysis text exists in the M analysis texts, determining the same analysis text in the M analysis texts as text data of dubbing audio data.
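The comparison rule described in this embodiment can be sketched as follows; this is a minimal illustration in which the index of the high-precision semantic analysis model is assumed to be known to the caller.

```typescript
// Sketch of the text-comparison rule: if any of the M parsed texts agree, use the
// agreed text; if all parses differ, fall back to the designated high-precision model.
function selectTextData(parsedTexts: string[], highPrecisionIndex: number): string {
  const counts = new Map<string, number>();
  for (const text of parsedTexts) {
    counts.set(text, (counts.get(text) ?? 0) + 1);
  }
  let best = "";
  let bestCount = 0;
  for (const [text, count] of counts) {
    if (count > bestCount) {
      best = text;
      bestCount = count;
    }
  }
  // All M analysis texts are different from each other: trust the high-precision model.
  if (bestCount <= 1) {
    return parsedTexts[highPrecisionIndex];
  }
  return best;
}
```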
In one embodiment, the language class to which the dubbing audio data belongs is a second language class, and the phoneme conversion rule is a second conversion rule;
the specific implementation mode of the data conversion module for converting dubbing audio data into a phoneme sequence according to a phoneme conversion rule comprises the following steps:
preprocessing dubbing audio data according to a second conversion rule to obtain second preprocessed audio data;
and carrying out phoneme recognition processing on the second preprocessed audio data to obtain a phoneme sequence converted by dubbing audio data.
In one embodiment, the action adjustment module adjusts the pronunciation action of the pronunciation part of the target object in the media data according to the standard action to obtain a specific implementation mode of the matching action of the pronunciation part of the target object in the media data, and the method comprises the following steps:
Acquiring an action model for representing the pronunciation action of a pronunciation part of a target object in media data;
acquiring a key region containing a pronunciation part of a target object in an action model;
dotting the pronunciation parts in the key areas according to standard actions to obtain dotting key points;
according to the dotting key points, adjusting the pronunciation actions of the pronunciation parts in the key areas to obtain the adjustment actions of the pronunciation parts in the key areas;
the adjustment action of the pronunciation parts in the key area is determined as a matching action of the target object in the media data.
In one embodiment, the action adjustment module performs dotting on the pronunciation parts in the key area according to the standard action to obtain dotting key points, including:
acquiring standard action feature points for composing standard actions and pronunciation part areas indicated by the standard actions;
acquiring the region proportion between the pronunciation part region and the key region;
acquiring a first position of a standard action feature point in a pronunciation part area and a second position of a pronunciation part in a key area;
determining a third position of the standard action feature point in the key region according to the first position, the second position and the region proportion;
And dotting is carried out on a third position in the key area, so that dotting key points are obtained.
In one aspect, an embodiment of the present application provides a computer device, including: a processor and a memory;
the memory stores a computer program that, when executed by the processor, causes the processor to perform the methods of embodiments of the present application.
In one aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, perform a method according to embodiments of the present application.
In one aspect of the present application, a computer program product is provided, the computer program product comprising a computer program stored in a computer readable storage medium. A processor of a computer device reads the computer program from a computer-readable storage medium, and the processor executes the computer program to cause the computer device to perform a method provided in an aspect of an embodiment of the present application.
In the embodiment of the application, the dubbing audio data of the target object in the media data can be converted into a standard action of the pronunciation part; the pronunciation action of the pronunciation part of the target object in the media data can then be adjusted according to the standard action, and the adjusted pronunciation action can be used as the matching action of the pronunciation part of the target object in the media data. Further, the dubbing audio data and the matching action can be fused with the media data to obtain the final dubbing media data. It should be understood that the standard action is obtained by performing action conversion processing on the dubbing audio data of the target object, so the standard action matches and is synchronized with the dubbing audio data; after the pronunciation action of the pronunciation part of the target object in the media data is adjusted based on the standard action, the resulting matching action also matches the dubbing audio data. When the dubbing audio data and the matching action are fused into the media data to obtain the dubbing media data, the dubbing audio data of the target object matches the corresponding pronunciation action; in other words, when the dubbing media data is output, the sound and the picture are synchronized, and the presentation effect of the media data is better. In conclusion, the application can realize audio-visual synchronization of media data and thereby optimize the presentation effect of the media data.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of a network architecture according to an embodiment of the present application;
fig. 2 is a schematic view of a scenario for adjusting a lip of a character according to dubbing according to an embodiment of the present application;
fig. 3 is a schematic diagram of an architecture for playing audio recording data and adjusting volume according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of dotting based on standard actions provided by an embodiment of the application;
FIG. 6 is a schematic diagram of a lip-adjusting architecture according to an embodiment of the present application;
fig. 7 is a schematic flow chart of converting dubbing audio data into standard actions according to an embodiment of the present application;
fig. 8 is a schematic diagram of an architecture for converting dubbing audio data into standard actions according to an embodiment of the present application;
FIG. 9 is a schematic diagram of another architecture for converting dubbing audio data into standard motion according to an embodiment of the present application;
FIG. 10 is a diagram of an overall architecture of a system provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The embodiment of the application also relates to related technologies such as artificial intelligence, and for convenience of understanding, the artificial intelligence and related concepts thereof will be described below:
artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
With research and advancement of artificial intelligence technology, research and application of artificial intelligence technology is being developed in various fields, such as common smart home, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned, automatic driving, unmanned aerial vehicles, robots, smart medical treatment, smart customer service, etc., and it is believed that with the development of technology, artificial intelligence technology will be applied in more fields and with increasing importance value.
The solution provided by the embodiments of the application relates to Machine Learning (ML) and Natural Language Processing (NLP) technologies in the field of artificial intelligence.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behaviour to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning, and the like.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques and the like.
The embodiments of the application can perform semantic analysis processing on the dubbing audio data of a certain object through natural language processing technology and convert the dubbing audio data into text data (see the description of the subsequent embodiments). Meanwhile, the embodiments of the application can train and optimize the related models (such as the action conversion model and the semantic analysis model mentioned later) through machine learning technology, so as to improve the accuracy of the models' output results.
For ease of understanding, FIG. 1 is a diagram of a network architecture of a data processing system according to an embodiment of the present application. As shown in fig. 1, the network architecture may include a service server 1000 and a terminal device cluster, which may include one or more terminal devices, the number of which will not be limited here. As shown in fig. 1, the plurality of terminal devices may include a terminal device 100a, a terminal device 100b, terminal devices 100c, …, a terminal device 100n; as shown in fig. 1, the terminal devices 100a, 100b, 100c, …, 100n may respectively perform network connection with the service server 1000, so that each terminal device may perform data interaction with the service server 1000 through the network connection. In addition, any terminal device in the terminal device cluster 100 may refer to an intelligent device running an operating system, and the operating system of the terminal device is not specifically limited in the embodiment of the present application.
The terminal device in the data processing system shown in fig. 1 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a palmtop computer, a desktop computer, a mobile internet device (MID), a point-of-sale (POS) machine, a smart speaker, a smart television, a smart watch, a smart in-vehicle terminal, a Virtual Reality (VR) device, an Augmented Reality (AR) device, and the like. The terminal device is usually equipped with a display device, which may be a display, a display screen, a touch screen, etc.; the touch screen may be a touch-control screen, a touch panel, etc.
The service server in the data processing system shown in fig. 1 may be a single physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms, and the like. The terminal device and the service server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
In one possible implementation, a client runs in a terminal device (e.g., terminal device 100a), such as a video client, a browser client, a game client, an educational client, etc.; the clients will not be listed one by one here. In the embodiments of the present application, a video client is taken as an example. An object (e.g., a user or an intelligent robot) may run the video client in the terminal device, and the video client may provide media data (e.g., video data) to the object, which the object may browse and play. Taking the media data as video data as an example, when an object plays video data, it hears the audio of different characters (such as a certain virtual character or a certain role in a television series) and at the same time sees pictures with various actions and expressions; that is, the terminal device outputs the audio and pictures of the video data synchronously. In order to optimize the audio-video synchronization effect of different characters in video data and optimize the presentation effect of the video data, the application provides a data processing method, which can adjust and optimize the pronunciation action of the pronunciation part of a target object in the video data according to the dubbing audio data of the target object, thereby realizing audio-video synchronization for the target object. The target object here refers to any object contained in the video data, and may be a virtual character or a character in a television series; the pronunciation part of the target object may refer to a part used for expressing language (or, put differently, for expressing an utterance). For example, the lips can express different lip language through different actions, so the lips may serve as a pronunciation part; likewise, the hands can express different sign language through different actions, so the hands may also serve as a pronunciation part.
In particular, it will be appreciated that for a target object in video data, it is often necessary to dub the target object at the time of post-processing, for example, for character a in a certain television series. The application can acquire the dubbing audio data of the target object in the video data, and then, the application can perform action conversion processing on the dubbing audio data, thereby obtaining the standard action of the pronunciation part corresponding to the dubbing audio data; here, the action conversion process may refer to an action of determining a sound emitting portion for expressing dubbing audio data based on the dubbing audio data, and the action may be a standard action. For the sake of understanding, taking the pronunciation part as the lip, since the lip generates different actions by moving, so that the lip has different shapes, and can generate different sounds, then after the dubbing audio data of the target object is acquired, performing the action conversion processing on the dubbing audio data may be understood as converting the dubbing audio data into a lip shape for generating the dubbing audio data, in other words, in the case that the pronunciation part is the lip, performing the action conversion processing on the dubbing audio data may be understood as performing the lip shape or the lip action conversion processing, so as to obtain a lip shape (or the lip action) for generating the dubbing audio data, and the lip shape may be understood as a standard lip shape (or a standard lip action) matched with the dubbing audio data.
Further, according to the standard action, the pronunciation action presented in the video data by the pronunciation part of the target object can be adjusted, so that the matching action of the pronunciation part of the target object in the media data can be obtained. For easy understanding, taking the pronunciation part as a lip, the pronunciation part of the target object is the lip of the target object, the pronunciation action presented by the pronunciation part of the target object in the video data may refer to the lip presented by the target object when speaking the dubbing audio data, and the lip presented by the target object when speaking the dubbing audio data may be adjusted based on the standard lip, so that the lip presented by the target object in the video data is the standard lip, which is synchronously matched with the dubbing audio data. In other words, the matching action is matched with the dubbing audio data, so that after the dubbing audio data and the matching action are fused into the video data, the dubbing of the target object and the pronunciation action of the pronunciation part are matched in the obtained dubbing video data, and when the dubbing video data is played, the dubbing of the target object and the picture are synchronously matched, and the problem of asynchronous audio and picture is avoided. The presentation effect of the video data can thus be optimized well.
It should be understood that, by the scheme provided by the embodiment of the application, the pronunciation action of the pronunciation part of the target object can be adjusted according to the standard action, so that the synchronization of sound and picture is realized. Specifically, the standard action is obtained after action conversion processing is performed based on dubbing audio data of a target object, the standard action is matched and synchronous with the dubbing audio data, then the obtained matching action is also matched with the dubbing audio data after the pronunciation action of a pronunciation part of the target object in the media data is adjusted based on the standard action, and based on the matching action, the dubbing audio data and the matching action are fused into the media data, and then the dubbing audio data of the target object is matched with the pronunciation action of the dubbing audio data, in other words, the dubbing media data is output, the sound and the picture are synchronous, and the media data presenting effect is good.
It will be appreciated that the method provided by the embodiments of the present application may be performed by a computer device, including but not limited to the terminal device or service server mentioned in fig. 1.
It should be noted that, in the specific embodiment of the present application, the related data related to the user information, the user data (such as the video data played and watched by the user), and the like are all obtained by the user manually authorizing the license (i.e. by the user agreeing). That is, when the above embodiments of the present application are applied to specific products or technologies, the methods and related functions provided by the embodiments of the present application are performed under the permission or agreement of the user (the functions provided by the embodiments of the present application may be actively started by the user), and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related region and territory.
For ease of understanding, further, please refer to fig. 2, fig. 2 is a schematic diagram of a scene of adjusting a lip shape of a character according to dubbing provided in an embodiment of the present application. The scene embodiment shown in fig. 2 is a scene described by taking media data as video data and the sound emitting part of the target object as a lip as an example. The server 2000 shown in fig. 2 may refer to the service server 1000 in the embodiment corresponding to fig. 1; the terminal device 200 shown in fig. 2 may refer to any terminal device in the terminal device cluster in the embodiment corresponding to fig. 1, for example, the terminal device 100a.
As shown in fig. 2, the terminal device 200 may be a recording device (e.g., a microphone device) assigned to the dubbing object 2a, and the dubbing object 2a may record dubbing through the recording device 200. The recording can be implemented by combining the getUserMedia method of the real-time audio-video interface (WebRTC) with an AudioContext; implementing recording with an AudioContext requires the ScriptProcessorNode interface. Specifically, after the dubbing object records dubbing data with the microphone device, the captured media stream can be initialized into a MediaStreamAudioSourceNode object using the getUserMedia method of WebRTC; this node can then be connected to a ScriptProcessorNode interface, the audio data can be obtained in the onaudioprocess callback of that interface, and the audio data can be stored to obtain the recording data. If the recording data needs to be played directly, it only needs to be connected to a speaker. For ease of understanding, please refer to fig. 3, which is a schematic diagram of an architecture for playing recording data and adjusting volume according to an embodiment of the present application. As shown in fig. 3, the architecture may include a device connection interface (MediaStreamAudioSourceNode), a filter (BiquadFilterNode), a JS processing interface (ScriptProcessorNode), a volume setting interface (GainNode) and an audio file connection interface (AudioBufferSourceNode). The interface components shown in fig. 3 all inherit from the general audio processing interface (AudioNode), the hub element through which the AudioContext processes audio data. In the architecture shown in fig. 3, the data source of the audio file connection interface is a decoded complete recording buffer instance; the volume setting interface is used for setting the volume; the filter may be used for filtering; the JS processing interface provides the onaudioprocess callback and can be used for analysing and processing the audio data; the device connection interface may be used to connect a microphone device. The interfaces of the components in the embodiment corresponding to fig. 3 are connected layer by layer; for example, the audio file connection interface is first connected to the volume setting interface, and the volume setting interface is then connected to the speaker, so that the volume can be adjusted.
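The browser-side chain described above can be sketched as follows. This is a minimal illustration of the getUserMedia + AudioContext + ScriptProcessorNode pattern referred to in the paragraph (ScriptProcessorNode is deprecated in current browsers but matches the described approach); the buffer size, gain value and function names are assumptions for illustration only.

```typescript
// Minimal sketch of the recording/playback chain described above:
// microphone -> MediaStreamAudioSourceNode -> ScriptProcessorNode (capture)
// and AudioBufferSourceNode -> GainNode -> speakers (playback with volume).
async function recordDubbing(ctx: AudioContext): Promise<Float32Array[]> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1); // buffer size is arbitrary
  const chunks: Float32Array[] = []; // filled asynchronously as audio arrives

  processor.onaudioprocess = (event) => {
    // Copy each block of PCM samples so it can be stored as recording data.
    chunks.push(new Float32Array(event.inputBuffer.getChannelData(0)));
  };

  source.connect(processor);
  processor.connect(ctx.destination); // keeps the node graph live
  return chunks;
}

function playWithVolume(ctx: AudioContext, buffer: AudioBuffer, volume: number): void {
  const source = ctx.createBufferSource(); // audio file connection interface
  source.buffer = buffer;                  // decoded recording buffer instance
  const gain = ctx.createGain();           // volume setting interface
  gain.gain.value = volume;
  source.connect(gain);
  gain.connect(ctx.destination);           // speaker
  source.start();
}
```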
It will be appreciated that, as shown in fig. 2, the audio recorded by the dubbing object 2a using the terminal device 200 is the audio of the video data H at a certain relative moment (e.g., the 5th minute of the video data H). Assuming that the video data H is episode 1 of the TV drama "Fate Symphony" and the relative moment is the 5th minute of episode 1, it may be understood that the dubbing recorded by the dubbing object 2a is what the object 20a should say at the 5th minute of episode 1. The audio recorded by the dubbing object 2a may be referred to as dubbing audio data, which the server 2000 may obtain. The server 2000 may then convert the dubbing audio data into a standard lip shape according to different conversion rules determined by the language category (including, but not limited to, Chinese, English, Arabic, French, etc.) to which the dubbing audio data belongs. The standard lip shape may be understood as the shape the lips should take when producing the sound through motion or action (i.e., the shape the lips should have when expressing the dubbing audio data). For example, as shown in fig. 2, assuming that the dubbing audio data is the audio "Yeah", the server 2000 converts it to obtain the standard lip shape 2001 shown in fig. 2, that is, the shape the lips should take when "Yeah" is uttered.
Further, as shown in fig. 2, at the 5th minute of the video data H, the server 2000 may obtain the lip shape of the sound producing portion Pos1 of the object 20a (i.e., the lips of the object 20a). This lip shape is not the lip shape that should be presented when "Yeah" is uttered, so the server 2000 may modify it based on the standard lip shape 2001 to match the dubbing audio data "Yeah". As shown in fig. 2, the lip shape of the sound producing portion of the object 20a may be adjusted based on the standard lip shape 2001; the adjusted lip shape is the same as the standard lip shape and matches the dubbing audio data, so that when the video data H is played, the audio and the picture of the different characters are synchronized at every moment.
It should be understood that, by the scheme provided by the embodiment of the application, the lip shape of the target object can be adjusted according to the standard lip shape of the dubbing audio, so that the lip shape of the character in the video data is matched with the dubbing, and the audio-video synchronization is realized.
For ease of understanding, the data processing method provided by the embodiment of the present application will be described in detail below with reference to the accompanying drawings. Referring to fig. 4, fig. 4 is a flowchart of a data processing method according to an embodiment of the application. The process may be performed by a computer device, which may be the service server shown in fig. 1 or any terminal device shown in fig. 1. For ease of understanding, the computer device will be described below taking the service server as an example. As shown in fig. 4, the flow may include at least the following steps S101 to S104:
Step S101, dubbing audio data of a target object in media data is acquired.
In the present application, the media data may refer to video data, and the target object may refer to any object included in the video data (such as a character in a television play, a virtual character in a game, a virtual character in a cartoon, etc.). The dubbing audio data of the target object may refer to audio data recorded for the target object by a dubbing object (such as a voice actor), and may specifically be the speech-line audio of the target object. The dubbing object may record the dubbing for a certain target object through a recording device, and the computer device may obtain the dubbing audio data.
Step S102, performing action conversion processing on the dubbing audio data to obtain standard actions of the pronunciation parts corresponding to the dubbing audio data.
In the present application, the pronunciation part may refer to a part for expressing an utterance, for example, a lip may generate a lip language by a movement or an action, and the lip may serve as a pronunciation part; for another example, the hands may generate sign language by movement or action, and the hands may also be used as pronunciation parts. The standard action may refer to an action when the pronunciation part emits the dubbing audio data, for example, when the pronunciation part is a lip, the standard action may refer to a lip shape when the lip emits the dubbing audio data, for example, when the dubbing audio data is "o", the standard action may refer to a lip shape when the lip emits "o"; when the pronunciation part is a hand, the standard motion may be a motion when the hand expresses dubbing audio data.
Step S103, according to the standard action, the pronunciation action of the pronunciation part of the target object in the media data is adjusted, and the matching action of the pronunciation part of the target object in the media data is obtained.
In the application, the pronunciation action of the pronunciation part of the target object in the media data can be adjusted according to the standard action, so that the pronunciation action of the pronunciation part of the target object can be adjusted to be matched with the dubbing audio data. For ease of understanding, taking the pronunciation part as the lip as an example, the standard action may refer to a standard lip shape, and the lip shape of the lip of the target object in the media data may be adjusted according to the standard lip shape, so that the lip shape of the lip of the target object is matched with the dubbing audio data.
The implementation way of adjusting the pronunciation action of the pronunciation part of the target object in the media data according to the standard action to obtain the matching action of the pronunciation part of the target object in the media data can be as follows: an action model for characterizing a pronunciation action of a pronunciation part of a target object in media data can be obtained; subsequently, a key region containing the pronunciation part of the target object can be acquired in the action model; the pronunciation parts in the key areas can be dotted according to standard actions, so that dotting key points can be obtained; according to the dotting key points, adjusting the pronunciation actions of the pronunciation parts in the key areas to obtain the adjustment actions of the pronunciation parts in the key areas; the adjustment action of the pronunciation parts in the key area is determined as a matching action of the target object in the media data.
It is to be understood that the motion model herein may refer to a three-dimensional model, which includes a pronunciation part of the target object and a pronunciation motion of the pronunciation part; in the action model, an area containing the pronunciation part of the target object may be determined as a key area, then, the pronunciation part in the key area may be dotted according to a standard action, the dotted points may be referred to as dotted key points, and the action formed by the dotted key points is the standard action, then, the pronunciation part in the key area may be adjusted step by step according to the dotted key points, so as to obtain an adjustment action, and the adjustment action may be determined as a matching action of the target object in the media data.
Wherein, the specific way for dotting the pronunciation parts in the key area according to the standard action to obtain the dotting key points can be as follows: standard action feature points for composing standard actions and pronunciation part areas indicated by the standard actions can be acquired; then, the region ratio between the pronunciation part region and the key region can be obtained; the method comprises the steps that a first position of a standard action feature point in a pronunciation part area and a second position of a pronunciation part in a key area in the key area can be obtained; according to the first position, the second position and the region proportion, a third position where the standard action feature point is located in the key region can be determined; and finally, dotting can be performed on a third position in the key area, so that dotting key points can be obtained.
In order to facilitate understanding of a specific manner of dotting the action model of the target object based on the standard action, please refer to fig. 5, which is a schematic diagram of dotting based on a standard action according to an embodiment of the present application. The embodiment shown in fig. 5 is described by taking the sound producing portion as a lip and the standard action as a standard lip shape. As shown in fig. 5, for the standard lip shape 2001, the feature points composing the standard lip shape are acquired, and the area indicated by the standard lip shape (area 1 shown in fig. 5) is acquired; as shown in fig. 5, area 1 can completely contain the lips, that is, the sound producing part area indicated by the standard action in the present application refers to an area that can completely contain the sound producing part. Subsequently, an action model representing the lips of the target object may be obtained, such as the action model 500a shown in fig. 5; this action model 500a includes lips 541, and the lips 541 are in a closed shape. Further, the region of the action model 500a containing the lips 541 may be obtained as area 2. The positions of the feature points in area 2 may be determined based on the region proportion between area 1 and area 2, the positions of the feature points of the standard lip shape in area 1, and the position of the lips 541 in area 2; these positions may then be dotted, and finally the lips 541 may be adjusted to the positions of the dotting key points. Thereby, the lips 541 are adjusted to a lip shape conforming to the standard lip shape; as shown in fig. 5, after the adjustment, the lips 541 take an open, smiling shape.
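One plausible reading of the position calculation described above (first position, second position and region proportion) is a simple proportional mapping of each standard-action feature point from the pronunciation-part region into the key region. The sketch below assumes axis-aligned rectangular regions and this particular mapping rule; both are illustrative assumptions, not the patent's stated formula.

```typescript
// Sketch of the dotting-position calculation: each standard-action feature point
// is mapped from the pronunciation-part region (area 1) into the key region
// (area 2) by scaling its relative offset with the region proportion.
interface Point { x: number; y: number; }
interface Region { x: number; y: number; width: number; height: number; }

function dottingKeyPoints(
  standardFeaturePoints: Point[], // points composing the standard action
  pronunciationRegion: Region,    // region indicated by the standard action (area 1)
  keyRegion: Region,              // region containing the pronunciation part (area 2)
): Point[] {
  const scaleX = keyRegion.width / pronunciationRegion.width;   // region proportion (x)
  const scaleY = keyRegion.height / pronunciationRegion.height; // region proportion (y)
  return standardFeaturePoints.map((p) => ({
    // first position: offset of the feature point inside the pronunciation region;
    // third position: the mapped point inside the key region.
    x: keyRegion.x + (p.x - pronunciationRegion.x) * scaleX,
    y: keyRegion.y + (p.y - pronunciationRegion.y) * scaleY,
  }));
}
```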
The present application is merely illustrative of a specific manner of adjusting the pronunciation operation of the pronunciation part of the target object according to the standard operation, but the present application is not limited to this specific manner, and for example, when the pronunciation part is a lip, the specific manner of adjusting the pronunciation operation of the pronunciation part of the target object according to the standard operation may be: based on the shape parameters of the standard lip, fusing the shape parameters with the parameters on the face of the target object, and thus obtaining the target object with the standard lip. For ease of understanding, please refer to fig. 6, fig. 6 is a schematic diagram of a lip-adjusting structure according to an embodiment of the present application.
The architecture shown in fig. 6 may include a three-dimensional reconstruction model, a region segmentation model and a decoding optimizer. The input image may refer to an image containing the face of the target object. The three-dimensional reconstruction model may extract three-dimensional face parameters from the input image; these may specifically include shape parameters and coarse texture parameters, where the shape parameters may refer to the shape parameters of the various parts of the face (such as the eyes, eyebrows, nose, lips, etc.), and the coarse texture parameters may include illumination parameters, pose parameters, texture parameters and so on (only the texture parameters and illumination parameters are shown in fig. 6). The region segmentation model can be used to extract the face region of the target object from the input image and extract a facial feature vector, and the decoding optimizer can optimize the coarse texture parameters (such as the texture parameters and illumination parameters) based on the facial feature vector to obtain fine texture parameters. The shape parameters and the fine texture parameters are then fused to generate a three-dimensional face model whose textures are clearer, and rendering the three-dimensional face model yields a fused image. The lip shape parameters can be replaced with the shape parameters of the standard lip shape, so that the shapes of the other parts in the generated three-dimensional face model remain unchanged while the lip shape is changed into the standard lip shape.
It can be understood that the three-dimensional reconstruction model in the application can be a 3DMM model, and the region segmentation model can be a pretrained FaceNet model; the decoding optimizer may refer to a GCN decoder and a GCN optimizer.
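The lip-shape substitution described above can be pictured as a parameter merge: keep the reconstructed face's shape parameters but overwrite the entries that drive the lips with the standard-lip parameters. The index range for the lip parameters below is hypothetical; a real 3DMM defines its own parameter layout.

```typescript
// Hypothetical sketch of the parameter substitution described above. The index
// range assumed to drive the lips is an illustration, not a real 3DMM layout.
const LIP_PARAM_START = 40; // assumed
const LIP_PARAM_END = 52;   // assumed (exclusive)

function fuseLipShape(
  faceShapeParams: number[],
  standardLipParams: number[], // length LIP_PARAM_END - LIP_PARAM_START
): number[] {
  const fused = faceShapeParams.slice();
  for (let i = LIP_PARAM_START; i < LIP_PARAM_END; i++) {
    fused[i] = standardLipParams[i - LIP_PARAM_START];
  }
  return fused; // other facial parts keep their original shape parameters
}
```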
Step S104, fusing the dubbing audio data, the matching action and the media data to obtain dubbing media data.
In the application, fusing the dubbing audio data, the matching action and the media data can be understood as fusing the dubbing audio data and the matching action into the media data, thereby obtaining dubbing media data in which the sound and the picture of the target object are synchronized. That is, when the dubbing media data is played, the dubbing audio data that is heard matches the pronunciation action of the pronunciation part of the target object that is observed.
It should be understood that sound and picture synchronization of the media data can be realized by adjusting the pronunciation action of the pronunciation part according to the dubbing. In addition, in order to further optimize the presentation effect of the media data, the application can also determine caption data based on the dubbing audio data and fuse the caption data into the media data as well, thereby realizing synchronization of the three elements of sound, picture and subtitle in the media data.
Specifically, after dubbing audio data, matching actions and media data are fused to obtain dubbing media data, subtitle data indicated by the dubbing audio data can be obtained; then, the audio duration of dubbing audio data can be obtained, and the caption display duration of caption data can be determined according to the audio duration; then, the subtitle data can be synchronously displayed according to the subtitle display time length when the dubbing media data is played.
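A small sketch of the subtitle timing rule described above is given below, assuming that the caption display duration simply equals the audio duration and that the dubbed segment has a known start offset in the media timeline; both assumptions are for illustration only.

```typescript
// Sketch of the subtitle timing described above: the caption display duration is
// derived from the audio duration of the dubbing, so the caption stays on screen
// in sync with the dubbed segment. The 1:1 duration rule is an assumption.
interface CaptionCue {
  text: string;
  startSeconds: number;
  endSeconds: number;
}

function buildCaptionCue(
  subtitleText: string,
  dubbingStartSeconds: number, // where the dubbed segment begins in the media
  audioDurationSeconds: number,
): CaptionCue {
  return {
    text: subtitleText,
    startSeconds: dubbingStartSeconds,
    endSeconds: dubbingStartSeconds + audioDurationSeconds,
  };
}
```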
The specific manner for acquiring the subtitle data indicated by the dubbing audio data may be: the matching action can be subjected to action recognition processing, so that action language data indicated by the matching action can be obtained; subsequently, the action language data indicated by the matching action may be determined as subtitle data indicated by the dubbing audio data. And when the pronunciation part is a lip, performing action recognition processing on the matching action, wherein the specific mode for obtaining the action language data indicated by the matching action can be as follows: through a visual mapping model, the matching action can be converted into a visual state sequence; and then, performing character conversion processing on the visual state sequence to obtain action language data indicated by the matching action.
It is understood that the action language data herein may refer to a language expressed by the matching action (or understood as an utterance or meaning expressed by the matching action), for example, when the pronunciation part is a lip, different lips may express different lips, and when the pronunciation part is a hand, different gestures may express different sign languages. By performing recognition processing on the matching action, the action language data indicated by the matching action can be determined.
Specifically, the visual mapping model here may be a Gaussian mixture model (Gaussian Mixture Model, GMM), a typical probabilistic graphical model that can be used to transform the matching action into a sequence of visual states when recognizing the action language data of the matching action. A visual element (viseme) may refer to the smallest visually distinguishable unit of the lips during the pronunciation of a word. In the lip language recognition process, the GMM is mainly used to construct an observation model for the visual elements: the input is a frame of the lip moving image, and the output is the probability of each state of the corresponding visual element. That is, through the GMM model, the matching action (i.e., the matched lip shape) can be translated into a visual state sequence.
To facilitate an understanding of the GMM model, a brief description thereof will be provided below:
GMM (Gaussian Mixture Model), the Gaussian mixture model, is a typical probabilistic graphical model, and more specifically a typical hybrid directed graphical model, which may also be referred to as a classical hybrid Bayesian network. The Gaussian mixture model can be regarded as a model formed by combining K single Gaussian models. For ease of understanding, the single Gaussian model and the Gaussian mixture model are briefly described as follows:
1. single Gaussian model
When the sample data X is one-dimensional (univariate), the Gaussian distribution follows the probability density function (Probability Density Function) shown in formula (1):

$P(x|\theta) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$  (1)

where μ is the data mean (expectation) and σ is the data standard deviation.

When the sample data X is multidimensional (multivariate), the Gaussian distribution follows the probability density function shown in formula (2):

$P(x|\theta) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}\exp\!\left(-\frac{(x-\mu)^{T}\Sigma^{-1}(x-\mu)}{2}\right)$  (2)

where μ is the data mean (expectation), Σ is the covariance (Covariance) matrix, and D is the data dimension.
2. Gaussian mixture model
The Gaussian mixture model can be regarded as a model composed of K single Gaussian models, where which of the K sub-models an observation belongs to is the hidden variable (Hidden Variable) of the mixture model. In general, a mixture model can use any probability distribution; a Gaussian mixture model is used here because of its very good mathematical properties and good computational performance.
For a Gaussian mixture model, the following can be defined first:

1) $x_{j}$ denotes the j-th observation, $j = 1, 2, \ldots, N$;

2) K is the number of Gaussian sub-models in the mixture model, $k = 1, 2, \ldots, K$;

3) $\alpha_{k}$ is the probability that an observation belongs to the k-th sub-model, with $\alpha_{k} \ge 0$ and $\sum_{k=1}^{K}\alpha_{k} = 1$;

4) $\phi(x|\theta_{k})$ is the Gaussian distribution density function of the k-th sub-model, with $\theta_{k} = (\mu_{k}, \sigma_{k}^{2})$; its expanded form is the same as that of the single Gaussian model described above;

5) $\gamma_{jk}$ denotes the probability that the j-th observation belongs to the k-th sub-model.

The probability distribution of the Gaussian mixture model can then be written as formula (3):

$P(x|\theta) = \sum_{k=1}^{K}\alpha_{k}\,\phi(x|\theta_{k})$  (3)

For this model, the parameters are $\theta = (\mu_{k}, \sigma_{k}^{2}, \alpha_{k})$, $k = 1, \ldots, K$, i.e. the expectation, the variance (or covariance), and the occurrence probability of each sub-model in the mixture. Learning these parameters is essentially a maximum likelihood estimation (commonly carried out with the EM algorithm). The GMM model will not be described further here. In the present application, the GMM model may be used to map the matching action (the matched lip shape) to visual states, finally yielding a visual state sequence. The visual state sequence can then be mapped to a character sequence, and the mapped character sequence can be understood as the lip language (i.e., the action language data) expressed by the matched lip shape in the application.
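For illustration only, the sketch below fits a GMM over per-frame lip features with scikit-learn, reads off a visual state sequence, and maps the states to characters. The feature layout, the number of states, and the state-to-character table are assumptions; a practical lip-reading system would typically combine such an observation model with a temporal model.

```python
# Turn per-frame lip features into a visual state sequence and then into characters.
import numpy as np
from sklearn.mixture import GaussianMixture

K = 8  # number of visual states (assumed)

# lip_features: one row per video frame, e.g. lip landmark coordinates (assumed shape).
rng = np.random.default_rng(0)
lip_features = rng.normal(size=(200, 6))

# Fit the mixture model; each component plays the role of one visual state.
gmm = GaussianMixture(n_components=K, covariance_type="full", random_state=0)
gmm.fit(lip_features)

# Per-frame posterior over visual states, and the most likely state sequence.
state_posteriors = gmm.predict_proba(lip_features)   # shape (n_frames, K)
visual_state_sequence = gmm.predict(lip_features)    # shape (n_frames,)

# Character conversion: a purely illustrative lookup from visual state to character.
state_to_char = {k: chr(ord("a") + k) for k in range(K)}
action_language = "".join(state_to_char[s] for s in visual_state_sequence)
```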
Finally, the identified action language data can be used as the subtitle data indicated by the dubbing audio data. The subtitle data can be displayed synchronously when the dubbing media data is played, and its display duration is the same as the audio duration of the dubbing audio data. That is, displaying the subtitle data synchronously according to the subtitle display duration can be understood as: displaying the subtitle data in synchronization with the audio, for a duration equal to the audio duration.
In the embodiment of the application, the dubbing audio data of the target object in the media data can be converted into a standard action of the pronunciation part; the pronunciation action of the pronunciation part of the target object in the media data can then be adjusted according to the standard action, and the adjusted pronunciation action can be used as the matching action of the pronunciation part of the target object in the media data. Further, the dubbing audio data and the matching action can be fused with the media data to obtain the final dubbing media data. It should be understood that, because the standard action is obtained by performing action conversion processing on the dubbing audio data of the target object, the standard action matches and is synchronized with the dubbing audio data. Therefore, after the pronunciation action of the pronunciation part of the target object in the media data is adjusted based on the standard action, the resulting matching action also matches the dubbing audio data, and fusing the dubbing audio data and the matching action into the media data yields dubbing media data in which the dubbing audio data of the target object matches its pronunciation action. In other words, sound and picture are synchronized when the dubbing media data is output, and the presentation effect of the media data is better. In conclusion, the application can realize audio-visual synchronization of the media data, thereby optimizing the presentation effect of the media data.
Further, referring to fig. 7, fig. 7 is a flowchart illustrating a process of converting dubbing audio data into standard actions according to an embodiment of the present application. The flow shown in fig. 7 may correspond to the implementation flow of the step of performing the motion conversion processing on the dubbing audio data to obtain the standard motion of the pronunciation part corresponding to the dubbing audio data in the embodiment corresponding to fig. 4. As shown in fig. 7, the flow may include at least the following steps S701 to S703:
step S701, the language category to which the dubbing audio data belongs is obtained, and a phoneme conversion rule of the dubbing audio data is determined according to the language category.
In particular, the language category here may refer to a language commonly used in a region, for example a Chinese category, a French category, an English category, an Arabic category, and so on; these will not be enumerated here. The application can set different phoneme conversion rules for different language categories. The phoneme conversion rule of one language category can be used as a configuration conversion rule, and the configuration conversion rules together form a configuration conversion rule set. After the language category to which the dubbing audio data belongs is obtained, the phoneme conversion rule of the dubbing audio data can be determined according to that language category.
Based on the above, the present application can determine the phoneme conversion rule of the dubbing audio data according to the language category to which the dubbing audio data belongs. The specific manner may be: if the language category is the first language category, a first conversion rule in the configuration conversion rule set is determined as the phoneme conversion rule of the dubbing audio data; if the language category is the second language category, a second conversion rule in the configuration conversion rule set may be determined as the phoneme conversion rule of the dubbing audio data. The first language category may refer to the English category, the first conversion rule may refer to the phoneme conversion rule configured for the English category, and the second language category may refer to any language category other than the English category.
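As a minimal illustration of selecting a phoneme conversion rule from the configuration conversion rule set, the sketch below keys the rules by language category; the rule names and language labels are assumptions for illustration.

```python
# A minimal sketch of the configuration conversion rule set (illustrative labels only).

CONFIG_CONVERSION_RULES = {
    "english": "first_conversion_rule",   # text-based route: audio -> text -> phonemes
    "default": "second_conversion_rule",  # direct route: audio -> phonemes
}

def select_phoneme_conversion_rule(language_category: str) -> str:
    # The first language category (assumed here to be English) uses the first rule;
    # any other language category falls back to the second rule.
    return CONFIG_CONVERSION_RULES.get(language_category, CONFIG_CONVERSION_RULES["default"])

print(select_phoneme_conversion_rule("english"))  # first_conversion_rule
print(select_phoneme_conversion_rule("french"))   # second_conversion_rule
```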
Step S702, dubbing audio data is converted into a phoneme sequence according to a phoneme conversion rule.
Specifically, the dubbing audio data can be converted into a phoneme sequence according to the phoneme conversion rule. A phoneme is the smallest speech unit divided according to the natural attributes of speech; it is analyzed according to the pronunciation actions within a syllable, and one action forms one phoneme. For example, "ma" includes the two pronunciation actions "m" and "a" and therefore consists of two phonemes. Sounds made by the same pronunciation action are the same phoneme, and sounds made by different pronunciation actions are different phonemes. In "ma-mi", the two "m" pronunciation actions are the same and thus the same phoneme, while the pronunciation action of "a" differs from that of "i", so they are different phonemes. The analysis of phonemes is generally described in terms of pronunciation actions. For example, the pronunciation action of "m" is: the upper lip and the lower lip are closed, the vocal cords vibrate, and the air flow exits through the nasal cavity to produce the sound.
When the language class to which the dubbing audio data belongs is a first language class, the phoneme conversion rule is a first conversion rule, and the specific way of converting the dubbing audio data into a phoneme sequence according to the phoneme conversion rule may be: preprocessing dubbing audio data according to a first conversion rule to obtain first preprocessed audio data; carrying out semantic analysis processing on the first preprocessed audio data to obtain text data of dubbing audio data; subsequently, a sound dictionary may be acquired; the sound dictionary comprises a phoneme mapping relation between text words and configuration phoneme sequences; the phoneme sequence indicated by the text data can be determined by the phoneme mapping relation between the text words and the configuration phoneme sequence in the sound dictionary, and then the phoneme sequence indicated by the text data can be determined as the phoneme sequence of the dubbing audio data.
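As an illustration of the sound-dictionary lookup step, the sketch below maps text words to configured phoneme sequences; the dictionary entries and the unknown-word handling are assumptions, not the application's actual lexicon.

```python
# Look up a phoneme sequence from a sound dictionary (illustrative entries only).

SOUND_DICTIONARY = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def text_to_phoneme_sequence(text_data: str) -> list[str]:
    phonemes = []
    for word in text_data.lower().split():
        phonemes.extend(SOUND_DICTIONARY.get(word, ["<UNK>"]))  # unknown words kept as <UNK>
    return phonemes

print(text_to_phoneme_sequence("hello world"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```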
It will be appreciated that the sound dictionary here may be a lexicon-style sound dictionary, by which text data can be converted into a sequence of possible phonemes. The preprocessing here may include DC offset removal, resampling, and voice detection. The specific way of performing semantic analysis processing on the first preprocessed audio data to obtain the text data of the dubbing audio data may be: M semantic analysis models can be called, and semantic analysis processing can be performed on the first preprocessed audio data through each semantic analysis model, so that the analysis text corresponding to each semantic analysis model is obtained, yielding M analysis texts; subsequently, text comparison processing may be performed on the M analysis texts, so that the text data of the dubbing audio data can be determined.
The specific manner of determining the text data of the dubbing audio data by performing text comparison processing on the M analysis texts may be: the M analysis texts can be subjected to text comparison processing; if the M analysis texts are different from each other, a high-precision semantic analysis model can be obtained from the M semantic analysis models, and then the analysis text corresponding to the high-precision semantic analysis model can be determined as text data of dubbing audio data; if the same analysis text exists in the M analysis texts, the same analysis text in the M analysis texts may be determined as text data of dubbing audio data.
It can be understood that a semantic analysis model here may refer to any model with semantic analysis capability that can recognize audio data as text data. Among these semantic analysis models, a semantic analysis model known to have high accuracy may be selected as the high-precision semantic analysis model, and each semantic analysis model recognizes one analysis text. If the analysis texts of the semantic analysis models are all different from each other, the analysis text recognized by the high-precision semantic analysis model may be determined as the text data of the dubbing audio data. If identical analysis texts exist among the analysis texts, for example two semantic analysis models output the same analysis text A, three semantic analysis models output the same analysis text B, and the remaining analysis texts are all different from one another, then analysis text B, which has the higher frequency (a frequency of 3), is used as the text data of the dubbing audio data. That is, if identical analysis texts exist among the M analysis texts and there are two or more groups of identical analysis texts, the frequency of each identical analysis text (i.e., how many semantic analysis models recognized it) may be counted, and the analysis text with the highest frequency is selected as the text data of the dubbing audio data; if there is only one group of identical analysis texts among the M analysis texts, that identical analysis text is directly determined as the text data of the dubbing audio data.
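As a concrete illustration of the text comparison step, the sketch below takes a majority vote over the M analysis texts and falls back to the high-precision semantic analysis model when all texts differ; the input format is an assumption for illustration.

```python
# Majority vote over the M analysis texts, with a high-precision fallback.
from collections import Counter

def pick_text_data(analysis_texts, high_precision_index=0):
    counts = Counter(analysis_texts)
    most_common_text, frequency = counts.most_common(1)[0]
    if frequency == 1:
        # All M analysis texts differ: trust the high-precision semantic analysis model.
        return analysis_texts[high_precision_index]
    # Otherwise return the analysis text shared by the most models.
    return most_common_text

print(pick_text_data(["I ate today", "I ate today", "I eat today"]))  # "I ate today"
print(pick_text_data(["a", "b", "c"], high_precision_index=1))        # "b"
```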
It should be noted that, based on the above, the subtitle data of the dubbing audio data can be obtained, and the subtitle data can be displayed synchronously when the dubbing media data is played. In this way, not only can sound and picture be synchronized when the media data is played, but the subtitle data can also be synchronized with the sound and picture; in other words, synchronization of sound, picture, and text can be realized. When the language category to which the dubbing audio data belongs is the first language category, converting the dubbing audio data into the standard action of the pronunciation part proceeds by first determining the text data indicated by the dubbing audio data, then converting the text data into a phoneme sequence, and then identifying the corresponding standard action based on the phoneme sequence.
When the language class to which the dubbing audio data belongs is a second language class and the phoneme conversion rule is a second conversion rule, the specific mode of converting the dubbing audio data into a phoneme sequence according to the phoneme conversion rule comprises the following steps:
The dubbing audio data is preprocessed according to the second conversion rule to obtain second preprocessed audio data; subsequently, phoneme recognition processing may be performed on the second preprocessed audio data to obtain the phoneme sequence converted from the dubbing audio data. That is, after preprocessing, the dubbing audio data may be directly subjected to speech recognition processing and converted into phonemes, resulting in a phoneme sequence. Here too, the preprocessing may include DC offset removal, resampling, and voice detection.
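As an illustration of this preprocessing chain, the sketch below removes the DC offset, resamples, and applies a naive energy-based voice detection; the thresholds, frame size, and target sample rate are assumptions, not values fixed by the application.

```python
# DC offset removal, resampling, and a simple energy-based voice detection.
import numpy as np
from scipy.signal import resample_poly

def preprocess(audio: np.ndarray, sr: int, target_sr: int = 16000,
               frame_len: int = 400, energy_threshold: float = 1e-4) -> np.ndarray:
    # 1) DC offset removal: subtract the mean of the signal.
    audio = audio - np.mean(audio)

    # 2) Resampling to the target sample rate.
    if sr != target_sr:
        audio = resample_poly(audio, up=target_sr, down=sr)

    # 3) Naive voice detection: keep frames whose mean energy exceeds a threshold.
    frames = [audio[i:i + frame_len] for i in range(0, len(audio), frame_len)]
    voiced = [f for f in frames if np.mean(f ** 2) > energy_threshold]
    return np.concatenate(voiced) if voiced else np.array([])
```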
Step S703, call the action conversion model, and perform action conversion processing on the phoneme sequence of the dubbing audio data through the action conversion model, so as to obtain a standard action corresponding to the dubbing audio data.
Specifically, the action conversion model may be any artificial intelligence model with a recognition function, in which a large set of candidate actions may be configured. After receiving a phoneme sequence, the action conversion model may select, from the candidate action set, the candidate action that best matches the phoneme sequence as the standard action of the dubbing audio data. In order to improve the accuracy of the output of the action conversion model, the application may pre-train and optimize the action conversion model by machine learning, and then apply the trained and optimized action conversion model to the scenario of outputting pronunciation-part actions. For example, a large amount of audio sample data carrying pronunciation action labels can be selected to train and optimize the action conversion model. Specifically, phoneme conversion can first be performed on the audio sample data, converting each piece of audio sample data into a phoneme sequence (which may be called a phoneme sequence sample); the phoneme sequence samples can then be input into the action conversion model, which performs computation based on its model parameters and finally outputs an action. An error value can be determined based on the action output by the action conversion model and the pronunciation action label of the audio sample data, and the model parameters of the action conversion model can be optimized through this error value. It should be appreciated that, through multiple rounds of parameter optimization and updating, the output of the action conversion model can be made to approach the pronunciation action labels more and more closely until the model convergence condition is satisfied.
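For illustration, the following sketch trains a small action conversion model on phoneme sequence samples with pronunciation action labels using PyTorch; the architecture, vocabulary size, and label set are assumptions, since the application does not prescribe a particular network.

```python
# A training sketch: phoneme sequence -> pronunciation-action label.
import torch
import torch.nn as nn

class ActionConversionModel(nn.Module):
    def __init__(self, n_phonemes=64, n_actions=32, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, phoneme_ids):            # (batch, seq_len) int64
        x = self.embed(phoneme_ids)
        _, h = self.encoder(x)                 # h: (1, batch, hidden)
        return self.head(h[-1])                # (batch, n_actions) action scores

model = ActionConversionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch: phoneme-sequence samples with pronunciation-action labels.
phoneme_batch = torch.randint(0, 64, (8, 20))
action_labels = torch.randint(0, 32, (8,))

for step in range(100):                        # multiple rounds of parameter updates
    optimizer.zero_grad()
    logits = model(phoneme_batch)
    loss = loss_fn(logits, action_labels)      # error between model output and action label
    loss.backward()
    optimizer.step()
```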
In the embodiment of the application, the dubbing audio data of the target object in the media data can be converted into a standard action of the pronunciation part; the pronunciation action of the pronunciation part of the target object in the media data can then be adjusted according to the standard action, and the adjusted pronunciation action can be used as the matching action of the pronunciation part of the target object in the media data. Further, the dubbing audio data and the matching action can be fused with the media data to obtain the final dubbing media data. It should be understood that, because the standard action is obtained by performing action conversion processing on the dubbing audio data of the target object, the standard action matches and is synchronized with the dubbing audio data. Therefore, after the pronunciation action of the pronunciation part of the target object in the media data is adjusted based on the standard action, the resulting matching action also matches the dubbing audio data, and fusing the dubbing audio data and the matching action into the media data yields dubbing media data in which the dubbing audio data of the target object matches its pronunciation action. In other words, sound and picture are synchronized when the dubbing media data is output, and the presentation effect of the media data is better.
Further, referring to fig. 8, fig. 8 is a schematic diagram of an architecture for converting dubbing audio data into standard actions according to an embodiment of the present application. As shown in fig. 8, the architecture may include an audio receiving component, a preprocessing component, an audio recognition component, a phoneme conversion component, a noise detection component, an action conversion component, and a time optimization component. The architecture shown in fig. 8 may be an architecture in which the language class to which the dubbing audio data belongs is a first language class, and for convenience of understanding, each component will be briefly described below:
an audio receiving component: may be used to receive dubbing audio data.
Preprocessing component: can be used to preprocess the dubbing audio data, including DC offset removal, resampling, and voice detection on the dubbing audio data.
An audio recognition component: the method can be used for carrying out recognition processing on the data obtained by voice detection to obtain text data corresponding to dubbing audio data.
A phoneme conversion component: can be used to convert the text data into phonemes to obtain a phoneme sequence. This process may rely on techniques such as sound dictionaries and acoustic models.
Noise detection component: can be used to detect and reject noise in the phoneme sequence.
A motion conversion component: can be used to convert the phoneme sequence into a standard action.
Time optimization component: can be used to match standard actions with the pictures of the target objects in the media data, so that the pronunciation parts of the corresponding target objects can be accurately adjusted based on the standard actions.
For specific implementation manners of the respective components, reference may be made to the descriptions in the foregoing respective embodiments, and a detailed description will not be given here. The beneficial effects thereof will not be described in detail herein.
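To make the data flow through the components of fig. 8 concrete, the following is a minimal sketch that chains the stages as caller-supplied callables; the function and parameter names are illustrative assumptions, not the application's actual interfaces.

```python
# Chain the fig. 8 stages for the first language category (all callables assumed).

def dubbing_to_standard_action(dubbing_audio, *, preprocess, recognize_text,
                               text_to_phonemes, remove_noise, phonemes_to_action,
                               align_to_frames):
    audio = preprocess(dubbing_audio)              # DC offset removal, resampling, voice detection
    text = recognize_text(audio)                   # audio recognition component
    phonemes = text_to_phonemes(text)              # phoneme conversion component (sound dictionary)
    phonemes = remove_noise(phonemes)              # noise detection component
    standard_action = phonemes_to_action(phonemes) # action conversion component
    return align_to_frames(standard_action)        # time optimization component
```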
Further, referring to fig. 9, fig. 9 is a schematic diagram of another architecture for converting dubbing audio data into standard actions according to an embodiment of the present application. As shown in fig. 9, the architecture may include an audio receiving component, a preprocessing component, a phoneme recognition component, a noise detection component, a motion conversion component, and a time optimization component. The architecture shown in fig. 9 may be an architecture in which the language class to which the dubbing audio data belongs is a second language class, and for convenience of understanding, each component will be briefly described below:
an audio receiving component: may be used to receive dubbing audio data.
Preprocessing component: can be used to preprocess the dubbing audio data, including DC offset removal, resampling, and voice detection on the dubbing audio data.
A phoneme recognition component: can be used to convert the preprocessed audio data into phonemes to obtain a phoneme sequence. This process may rely on speech recognition technology.
Noise detection component: can be used to detect and reject noise in the phoneme sequence.
A motion conversion component: can be used to convert the phoneme sequence into a standard action.
Time optimization component: can be used to match standard actions with the pictures of the target objects in the media data, so that the pronunciation parts of the corresponding target objects can be accurately adjusted based on the standard actions.
For specific implementation manners of the respective components, reference may be made to the descriptions in the foregoing respective embodiments, and a detailed description will not be given here. The beneficial effects thereof will not be described in detail herein.
Further, referring to fig. 10, fig. 10 is a diagram of an overall system architecture according to an embodiment of the present application. As shown in fig. 10, the architecture may include a dubbing reception component, a motion conversion component, a fusion component, a stitching component, a material arrangement component, an image input component, a dubbing input component, and a time verification component.
For ease of understanding, the various components of the architecture will be briefly described as follows:
A dubbing receiving component: may be used to receive dubbed audio data for a target object.
A motion conversion component: can be used to convert the dubbing audio data into the standard action of the pronunciation part. Specifically, the action conversion may be performed according to the language category to which the dubbing audio data belongs. When the language category to which the dubbing audio data belongs is the first language category, the dubbing audio data can first be converted into text data, the standard action is obtained from the text data, and the text data can also be used as the subtitle data of the dubbing audio data.
Fusion component: the method can be used for fusing the standard action with the pronunciation part of the target object, namely adjusting the pronunciation action of the pronunciation part of the target object based on the standard action.
A splicing component: can be used to splice the matching action, the dubbing audio data, and the subtitle data of the target object to obtain media data in which the dubbing, the pronunciation action, and the subtitles are synchronized.
A material arrangement component: can be used to extract the pronunciation actions of the pronunciation parts of different objects from sample media data. These pronunciation actions can be used as candidate actions (i.e., as materials) for the action conversion model, which can determine the standard action of the dubbing audio data based on these candidate actions when performing the action conversion.
An image input component: can be used to input an image of the pronunciation action including the pronunciation portion of the target object to the time verification component.
Dubbing input component: can be used to input dubbing audio data of the target object to the time verification component.
A time verification component: can be used to verify the matching and synchronization between the dubbing audio data and the pronunciation action of the target object. With the time verification component, the match between the dubbing and the pronunciation action of the target object can be verified more accurately, thereby improving the presentation effect of the media data.
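As a concrete illustration of the time verification idea, the sketch below compares the dubbing audio duration with the time span covered by the pronunciation-action frames; the frame rate, tolerance, and function names are assumptions for illustration.

```python
# Check that the dubbing audio duration matches the pronunciation-action frame span.

def verify_sync(audio_duration_s: float, action_frame_count: int, fps: float,
                tolerance_s: float = 0.04) -> bool:
    action_duration_s = action_frame_count / fps
    return abs(audio_duration_s - action_duration_s) <= tolerance_s

print(verify_sync(audio_duration_s=3.20, action_frame_count=80, fps=25))   # True
print(verify_sync(audio_duration_s=3.20, action_frame_count=90, fps=25))   # False
```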
Further, referring to fig. 11, fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing apparatus may be a computer program (including program code) running in a computer device, for example application software; the data processing apparatus may be used to perform the method shown in fig. 3. As shown in fig. 11, the data processing apparatus 1 may include: a data acquisition module 11, a data conversion module 12, an action adjustment module 13, and a fusion module 14.
A data acquisition module 11, configured to acquire dubbing audio data of a target object in media data;
The data conversion module 12 is configured to perform an action conversion process on the dubbing audio data, so as to obtain a standard action of a pronunciation part corresponding to the dubbing audio data;
the action adjusting module 13 is used for adjusting the pronunciation action of the pronunciation part of the target object in the media data according to the standard action to obtain the matching action of the pronunciation part of the target object in the media data;
and the fusion module 14 is used for fusing the dubbing audio data, the matching action and the media data to obtain dubbing media data.
The specific implementation manners of the data acquisition module 11, the data conversion module 12, the action adjustment module 13, and the fusion module 14 may be referred to the description of step S101 to step S104 in the embodiment corresponding to fig. 4, and will not be described herein.
In one embodiment, after the fusion module 14 fuses the dubbed audio data, the matching action, and the media data to obtain the dubbed media data, the data processing apparatus 1 further includes: a subtitle acquisition module 15, a duration determination module 16, and a synchronous display module 17.
A caption acquisition module 15, configured to acquire caption data indicated by dubbing audio data;
a duration determining module 16, configured to obtain an audio duration of dubbing audio data, and determine a caption display duration of the caption data according to the audio duration;
And the synchronous display module 17 is used for synchronously displaying the caption data according to the caption display time length when the dubbing media data is played.
The specific implementation manner of the subtitle acquiring module 15, the duration determining module 16, and the synchronous display module 17 may be referred to the description in step S104 in the embodiment corresponding to fig. 3, and will not be described herein.
In one embodiment, the caption acquiring module 15 acquires a specific implementation of caption data indicated by the dubbing audio data, including:
performing action recognition processing on the matched action to obtain action language data indicated by the matched action;
the action language data indicated by the matching action is determined as subtitle data indicated by the dubbing audio data.
In one embodiment, the subtitle obtaining module 15 performs an action recognition process on the matching action to obtain a specific implementation manner of the action language data indicated by the matching action, including:
converting the matching action into a visual state sequence through a visual mapping model;
and performing character conversion processing on the visual state sequence to obtain action language data indicated by the matching action.
In one embodiment, the data conversion module 12 performs an action conversion process on the dubbing audio data to obtain a specific implementation manner of a standard action of a pronunciation part corresponding to the dubbing audio data, where the specific implementation manner includes:
Acquiring the language category to which the dubbing audio data belongs, and determining a phoneme conversion rule of the dubbing audio data according to the language category;
converting dubbing audio data into a phoneme sequence according to a phoneme conversion rule;
and calling an action conversion model, and performing action conversion processing on the phoneme sequence of the dubbing audio data through the action conversion model to obtain a standard action corresponding to the dubbing audio data.
In one embodiment, the data conversion module 12 determines a specific implementation of the phoneme conversion rule of the dubbed audio data according to the language category, including:
if the language class is the first language class, determining a first conversion rule in the configuration conversion rule set as a phoneme conversion rule of the dubbing audio data;
if the language class is the second language class, determining a second conversion rule in the configuration conversion rule set as a phoneme conversion rule of the dubbing audio data.
In one embodiment, the language class to which the dubbing audio data belongs is a first language class, and the phoneme conversion rule is a first conversion rule;
the specific implementation manner of the data conversion module 12 for converting dubbing audio data into a phoneme sequence according to a phoneme conversion rule includes:
Preprocessing dubbing audio data according to a first conversion rule to obtain first preprocessed audio data;
carrying out semantic analysis processing on the first preprocessed audio data to obtain text data of dubbing audio data;
acquiring a sound dictionary; the sound dictionary comprises a phoneme mapping relation between text words and configuration phoneme sequences;
and determining a phoneme sequence indicated by the text data through a phoneme mapping relation between the text words and the configuration phoneme sequence in the sounding dictionary, and determining the phoneme sequence indicated by the text data as a phoneme sequence of the dubbing audio data.
In one embodiment, the data conversion module 12 performs semantic analysis processing on the first preprocessed audio data to obtain a specific implementation manner of text data of dubbed audio data, including:
calling M semantic analysis models, and respectively carrying out semantic analysis processing on the first preprocessed audio data through each semantic analysis model to obtain analysis texts respectively corresponding to each semantic analysis model;
and performing text comparison processing on the M analysis texts to determine text data of dubbing audio data.
In one embodiment, the data conversion module performs text comparison processing on the M analyzed texts to determine a specific implementation manner of text data of dubbing audio data, including:
Performing text comparison processing on the M analysis texts;
if the M analysis texts are different from each other, acquiring a high-precision semantic analysis model from the M semantic analysis models, and determining the analysis text corresponding to the high-precision semantic analysis model as text data of dubbing audio data;
if the same analysis text exists in the M analysis texts, determining the same analysis text in the M analysis texts as text data of dubbing audio data.
In one embodiment, the language class to which the dubbing audio data belongs is a second language class, and the phoneme conversion rule is a second conversion rule;
the specific implementation manner of the data conversion module 12 for converting dubbing audio data into a phoneme sequence according to a phoneme conversion rule includes:
preprocessing dubbing audio data according to a second conversion rule to obtain second preprocessed audio data;
and carrying out phoneme recognition processing on the second preprocessed audio data to obtain a phoneme sequence converted by dubbing audio data.
In one embodiment, the action adjustment module 13 adjusts the pronunciation action of the pronunciation part of the target object in the media data according to the standard action to obtain a specific implementation manner of the matching action of the pronunciation part of the target object in the media data, including:
Acquiring an action model for representing the pronunciation action of a pronunciation part of a target object in media data;
acquiring a key region containing a pronunciation part of a target object in an action model;
dotting the pronunciation parts in the key areas according to standard actions to obtain dotting key points;
according to the dotting key points, adjusting the pronunciation actions of the pronunciation parts in the key areas to obtain the adjustment actions of the pronunciation parts in the key areas;
the adjustment action of the pronunciation parts in the key area is determined as a matching action of the target object in the media data.
In one embodiment, the action adjustment module 13 performs dotting on the pronunciation parts in the key area according to the standard action to obtain dotting key points, including:
acquiring standard action feature points for composing standard actions and pronunciation part areas indicated by the standard actions;
acquiring the region proportion between the pronunciation part region and the key region;
acquiring a first position of a standard action feature point in a pronunciation part area and a second position of a pronunciation part in a key area;
determining a third position of the standard action feature point in the key region according to the first position, the second position and the region proportion;
And dotting is carried out on a third position in the key area, so that dotting key points are obtained.
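To make the position mapping above concrete, the following is a minimal sketch of how a standard action feature point could be placed inside the key region from the first position, the second position, and the region proportion; the 2D representation and the scaling convention are assumptions for illustration.

```python
# Scale a standard-action feature point from the pronunciation-part region into the key region.

def map_feature_point(first_position, second_position, region_scale):
    """first_position: feature point inside the pronunciation-part region (relative to it);
    second_position: location of the pronunciation part inside the key region;
    region_scale: (key region size) / (pronunciation-part region size) per axis (assumed)."""
    x1, y1 = first_position
    ox, oy = second_position
    sx, sy = region_scale
    # Third position: where the dotting key point lands inside the key region.
    return (ox + x1 * sx, oy + y1 * sy)

dotting_point = map_feature_point(first_position=(10, 4), second_position=(120, 80),
                                  region_scale=(1.5, 1.5))
print(dotting_point)  # (135.0, 86.0)
```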
In the embodiment of the application, the dubbing audio data of the target object in the media data can be converted into a standard action of the pronunciation part; the pronunciation action of the pronunciation part of the target object in the media data can then be adjusted according to the standard action, and the adjusted pronunciation action can be used as the matching action of the pronunciation part of the target object in the media data. Further, the dubbing audio data and the matching action can be fused with the media data to obtain the final dubbing media data. It should be understood that, because the standard action is obtained by performing action conversion processing on the dubbing audio data of the target object, the standard action matches and is synchronized with the dubbing audio data. Therefore, after the pronunciation action of the pronunciation part of the target object in the media data is adjusted based on the standard action, the resulting matching action also matches the dubbing audio data, and fusing the dubbing audio data and the matching action into the media data yields dubbing media data in which the dubbing audio data of the target object matches its pronunciation action. In other words, sound and picture are synchronized when the dubbing media data is output, and the presentation effect of the media data is better. In conclusion, the application can realize audio-visual synchronization of the media data, thereby optimizing the presentation effect of the media data.
Further, referring to fig. 12, fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 12, the above-described computer device 8000 may include: processor 8001, network interface 8004, and memory 8005, and further, the above-described computer device 8000 further includes: a user interface 8003, and at least one communication bus 8002. Wherein a communication bus 8002 is used to enable connected communications between these components. The user interface 8003 may include a Display screen (Display), a Keyboard (Keyboard), and the optional user interface 8003 may also include standard wired, wireless interfaces, among others. Network interface 8004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). Memory 8005 may be a high speed RAM memory or a non-volatile memory, such as at least one disk memory. Memory 8005 may optionally also be at least one memory device located remotely from the aforementioned processor 8001. As shown in fig. 12, an operating system, a network communication module, a user interface module, and a device control application program may be included in the memory 8005, which is one type of computer-readable storage medium.
In the computer device 8000 shown in fig. 12, the network interface 8004 may provide a network communication function; while user interface 8003 is primarily an interface for providing input to the user; and the processor 8001 may be used to invoke a device control application stored in the memory 8005 to implement:
acquiring dubbing audio data of a target object in media data;
performing action conversion processing on dubbing audio data to obtain standard actions of pronunciation parts corresponding to the dubbing audio data;
according to the standard action, adjusting and processing the pronunciation action of the pronunciation part of the target object in the media data to obtain the matching action of the pronunciation part of the target object in the media data;
and fusing the dubbing audio data, the matching action and the media data to obtain dubbing media data.
It should be understood that the computer device 8000 according to the embodiment of the present application may perform the description of the data processing method according to the embodiment of fig. 4 to 9, and may also perform the description of the data processing apparatus 1 according to the embodiment of fig. 11, which is not repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
Furthermore, it should be noted here that: the embodiment of the present application further provides a computer readable storage medium, where a computer program executed by the computer device 8000 for data processing mentioned above is stored, and the computer program includes program instructions, when the processor executes the program instructions, the description of the data processing method in the embodiment corresponding to fig. 4 to 9 can be executed, and therefore, will not be repeated herein. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer-readable storage medium according to the present application, please refer to the description of the method embodiments of the present application.
The computer readable storage medium may be the data processing apparatus provided in any one of the foregoing embodiments or an internal storage unit of the computer device, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (flash card) or the like, which are provided on the computer device. Further, the computer-readable storage medium may also include both internal storage units and external storage devices of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
In one aspect of the present application, a computer program product is provided, the computer program product comprising a computer program stored in a computer readable storage medium. A processor of a computer device reads the computer program from a computer-readable storage medium, and the processor executes the computer program to cause the computer device to perform a method provided in an aspect of an embodiment of the present application.
The terms first, second and the like in the description and in the claims and drawings of embodiments of the application are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the term "include" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the listed steps or modules but may, in the alternative, include other steps or modules not listed or inherent to such process, method, apparatus, article, or device.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and related apparatus provided in the embodiments of the present application are described with reference to the flowchart and/or schematic structural diagrams of the method provided in the embodiments of the present application, and each flow and/or block of the flowchart and/or schematic structural diagrams of the method may be implemented by computer program instructions, and combinations of flows and/or blocks in the flowchart and/or block diagrams. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or structural diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or structures.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (16)

1. A method of data processing, comprising:
acquiring dubbing audio data of a target object in media data;
performing action conversion processing on the dubbing audio data to obtain standard actions of a pronunciation part corresponding to the dubbing audio data;
according to the standard action, adjusting the pronunciation action of the pronunciation part of the target object in the media data to obtain the matching action of the pronunciation part of the target object in the media data;
and fusing the dubbing audio data, the matching action and the media data to obtain dubbing media data.
2. The method of claim 1, wherein after fusing the dubbed audio data, the matching action, and the media data to obtain dubbed media data, the method further comprises:
acquiring subtitle data indicated by the dubbing audio data;
acquiring the audio duration of the dubbing audio data, and determining the caption display duration of the caption data according to the audio duration;
And synchronously displaying the caption data according to the caption display time length when the dubbing media data is played.
3. The method of claim 2, wherein the obtaining the subtitle data indicated by the dubbing audio data comprises:
performing action recognition processing on the matching action to obtain action language data indicated by the matching action;
and determining the action language data indicated by the matching action as subtitle data indicated by the dubbing audio data.
4. A method according to claim 3, wherein said performing an action recognition process on said matching action to obtain action language data indicated by said matching action comprises:
converting the matching action into a sequence of visual states through a visual mapping model;
and performing character conversion processing on the visual state sequence to obtain action language data indicated by the matching action.
5. The method of claim 1, wherein the performing the motion conversion process on the dubbing audio data to obtain the standard motion of the pronunciation part corresponding to the dubbing audio data comprises:
Acquiring the language category to which the dubbing audio data belong, and determining a phoneme conversion rule of the dubbing audio data according to the language category;
converting the dubbing audio data into a phoneme sequence according to the phoneme conversion rule;
and calling an action conversion model, and performing action conversion processing on the phoneme sequence of the dubbing audio data through the action conversion model to obtain a standard action corresponding to the dubbing audio data.
6. The method of claim 5, wherein said determining phoneme conversion rules for the dubbed audio data according to the language category comprises:
if the language class is the first language class, determining a first conversion rule in a configuration conversion rule set as a phoneme conversion rule of the dubbing audio data;
and if the language class is a second language class, determining a second conversion rule in the configuration conversion rule set as a phoneme conversion rule of the dubbing audio data.
7. The method of claim 5, wherein the language class to which the dubbing audio data belongs is a first language class, and the phoneme conversion rule is a first conversion rule;
The converting the dubbing audio data into a phoneme sequence according to the phoneme conversion rule comprises the following steps:
preprocessing the dubbing audio data according to the first conversion rule to obtain first preprocessed audio data;
carrying out semantic analysis processing on the first preprocessed audio data to obtain text data of the dubbing audio data;
acquiring a sound dictionary; the sound dictionary comprises a phoneme mapping relation between text words and configuration phoneme sequences;
and determining a phoneme sequence indicated by the text data through a phoneme mapping relation between the text words and the configuration phoneme sequence in the sound dictionary, and determining the phoneme sequence indicated by the text data as the phoneme sequence of the dubbing audio data.
8. The method of claim 7, wherein the performing semantic analysis on the first preprocessed audio data to obtain text data of the dubbed audio data comprises:
calling M semantic analysis models, and respectively carrying out semantic analysis processing on the first preprocessed audio data through each semantic analysis model to obtain analysis texts respectively corresponding to each semantic analysis model;
And carrying out text comparison processing on the M analysis texts to determine text data of the dubbing audio data.
9. The method of claim 8, wherein the text comparison of the M parsed text to determine text data of the dubbed audio data comprises:
performing text comparison processing on the M analysis texts;
if the M analysis texts are different from each other, acquiring a high-precision semantic analysis model from the M semantic analysis models, and determining the analysis text corresponding to the high-precision semantic analysis model as text data of the dubbing audio data;
and if the same analysis text exists in the M analysis texts, determining the same analysis text in the M analysis texts as the text data of the dubbing audio data.
10. The method of claim 5, wherein the language class to which the dubbing audio data belongs is a second language class, and the phoneme conversion rule is a second conversion rule;
the converting the dubbing audio data into a phoneme sequence according to the phoneme conversion rule comprises the following steps:
preprocessing the dubbing audio data according to the second conversion rule to obtain second preprocessed audio data;
And carrying out phoneme recognition processing on the second preprocessed audio data to obtain a phoneme sequence of the dubbing audio data conversion.
11. The method according to claim 1, wherein the adjusting the pronunciation of the pronunciation part of the target object in the media data according to the standard action results in a matching of the pronunciation part of the target object in the media data, including:
acquiring an action model for representing the pronunciation action of the pronunciation part of the target object in the media data;
acquiring a key area containing a pronunciation part of the target object in the action model;
dotting the pronunciation parts in the key area according to the standard action to obtain dotting key points;
according to the dotting key points, adjusting the pronunciation actions of the pronunciation parts in the key areas to obtain the adjustment actions of the pronunciation parts in the key areas;
and determining the adjustment action of the pronunciation parts in the key area as the matching action of the target object in the media data.
12. The method of claim 11, wherein the dotting the pronunciation parts in the key area according to the standard action to obtain dotting key points, comprising:
Acquiring standard action feature points for composing the standard action and a pronunciation part area indicated by the standard action;
acquiring the region proportion between the pronunciation part region and the key region;
acquiring a first position of the standard action feature point in the pronunciation part area and a second position of the pronunciation part in the key area;
determining a third position of the standard action feature point in the key region according to the first position, the second position and the region proportion;
and dotting is carried out on a third position in the key area, so that the dotting key point is obtained.
13. A data processing apparatus, comprising:
the data acquisition module is used for acquiring dubbing audio data of a target object in the media data;
the data conversion module is used for performing action conversion processing on the dubbing audio data to obtain standard actions of the pronunciation parts corresponding to the dubbing audio data;
the action adjusting module is used for adjusting the pronunciation action of the pronunciation part of the target object in the media data according to the standard action to obtain the matching action of the pronunciation part of the target object in the media data;
And the fusion module is used for fusing the dubbing audio data, the matching action and the media data to obtain dubbing media data.
14. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is configured to provide a network communication function, the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the computer device to perform the method of any of claims 1-12.
15. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program adapted to be loaded by a processor and to perform the method of any of claims 1-12.
16. A computer program product, characterized in that the computer program product comprises a computer program stored in a computer readable storage medium, the computer program being adapted to be read and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1-12.