CN113362849A - Voice data processing method and device - Google Patents

Voice data processing method and device

Info

Publication number
CN113362849A
Authority
CN
China
Prior art keywords: data, audio, target, original, obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010135093.4A
Other languages
Chinese (zh)
Inventor
吴纲律
王加芳
王全占
古鉴
李名杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010135093.4A
Publication of CN113362849A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses an audio data processing method and device. The method comprises the following steps: acquiring original audio data corresponding to original video data; obtaining audio-related motion characteristic data in the original video data, wherein the audio-related motion characteristic data refers to motion state data associated with a sound production event corresponding to the original video data; analyzing the original audio data according to the audio-related motion characteristic data to obtain target audio data; and processing the target audio data according to a predetermined audio processing mode. With this method, the audio-related motion characteristic data in the original video data can be used to obtain the corresponding target audio data from the original audio data in the current scene, and the target audio data can then be subjected to data enhancement or data suppression in combination with the specific scene.

Description

Voice data processing method and device
Technical Field
The application relates to the technical field of computers, and in particular to a voice data processing method. The application also relates to a voice data processing apparatus and an electronic device.
Background
In most speech interaction scenarios, the collected audio signal includes not only the speech of the target speaker but also speech interference from other speakers and noise interference. The objective of speech separation is to separate the speech of the target speaker from these interferences.
Speech separation can be divided into three types according to the kind of interference: when the interference is noise, the separation process may be referred to as "speech enhancement"; when the interference is the voice of other speakers, the process may be referred to as "speaker separation"; and when the interference is a reflection of the target speaker's own voice, the process may be referred to as "dereverberation".
Speech separation can also be classified, according to the number of sensors or microphones, into monaural (single-microphone) methods and array (multi-microphone) methods. Typical monaural methods are speech enhancement and computational auditory scene analysis (CASA). In array methods, an array of two or more microphones with a suitable geometry applies a spatial filter to enhance signals arriving from a particular direction, thereby reducing interference from other directions.
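As background illustration only (not part of the method of this application), the array-separation idea can be sketched as a delay-and-sum beamformer that time-aligns the microphone signals toward an assumed source direction; the array geometry and parameters below are assumptions:

```python
# Minimal delay-and-sum beamformer sketch (illustration of array separation only).
# Assumes a uniform linear microphone array with known spacing.
import numpy as np

def delay_and_sum(mic_signals, mic_spacing, sample_rate, steer_angle_deg, c=343.0):
    """Align and average multi-microphone signals toward one arrival direction.

    mic_signals: array of shape (num_mics, num_samples)
    steer_angle_deg: desired source direction relative to array broadside
    """
    num_mics, num_samples = mic_signals.shape
    angle = np.deg2rad(steer_angle_deg)
    out = np.zeros(num_samples)
    for m in range(num_mics):
        # Time delay (seconds) of microphone m relative to the first microphone.
        tau = m * mic_spacing * np.sin(angle) / c
        shift = int(round(tau * sample_rate))
        # Compensate the delay so signals from the steered direction add coherently.
        out += np.roll(mic_signals[m], -shift)
    return out / num_mics
```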
However, existing voice separation methods are implemented only on the basis of speech or noise audio characteristics; the separation process is not effectively fused with the sound production scene, which affects the efficiency and accuracy of voice separation in that scene.
Disclosure of Invention
The embodiments of the application provide an audio data processing method, an audio data processing apparatus, and an electronic device, which aim to solve the problem that existing voice separation is inefficient and inaccurate because the separation process is not effectively fused with the sound production scene.
The embodiment of the application provides an audio data processing method, which comprises the following steps:
acquiring original audio data corresponding to the original video data;
obtaining audio-related motion characteristic data in the original video data, wherein the audio-related motion characteristic data refers to motion state data associated with a sound production event corresponding to the original video data;
analyzing the original audio data according to the audio-related motion characteristic data to obtain target audio data;
and processing the target audio data according to a preset audio processing mode.
Optionally, the analyzing the original audio data according to the audio-related motion feature data to obtain target audio data includes: obtaining target audio characteristic data matched with the audio-related motion characteristic data based on the original audio data;
and determining the audio data corresponding to the target audio characteristic data in the original audio data as target audio data.
Optionally, the obtaining, based on the original audio data, target audio feature data matched with the audio-related motion feature data includes:
and obtaining target audio characteristic data matched with the audio related motion characteristic data from a plurality of audio characteristic data corresponding to the original audio data.
Optionally, the obtaining audio-related motion feature data in the original video data includes: obtaining lip movement feature data in the original video data;
the obtaining of the target audio feature data matched with the audio-related motion feature data from the plurality of audio feature data corresponding to the original audio data includes: and obtaining target audio characteristic data matched with the lip movement characteristic data from the audio characteristic data corresponding to the original audio data.
Optionally, the lip movement characteristic data includes:
lip vibration frequency data;
the obtaining of the target audio feature data matched with the lip movement feature data from a plurality of audio feature data corresponding to the original audio data includes:
and matching the lip vibration frequency data with the speech rate characteristic data in the plurality of audio characteristic data to obtain target speech rate characteristic data matched with the lip vibration frequency data.
Optionally, the lip movement characteristic data includes:
lip shake phase data;
the obtaining of the target audio feature data matched with the lip movement feature data from a plurality of audio feature data corresponding to the original audio data includes:
and matching the lip vibration phase data with voice phase characteristic data in the plurality of audio characteristic data to obtain target voice phase characteristic data matched with the lip vibration phase data.
Optionally, the lip movement characteristic data includes:
lip vibration amplitude data;
the obtaining of the target audio feature data matched with the lip movement feature data from a plurality of audio feature data corresponding to the original audio data includes:
and matching the lip vibration amplitude data with voice amplitude characteristic data in the plurality of audio characteristic data to obtain target voice amplitude characteristic data matched with the lip vibration amplitude data.
Optionally, the obtaining audio-related motion feature data in the original video data includes: obtaining a plurality of audio-related motion characteristic data corresponding to a plurality of sound-producing subjects in original video data;
correspondingly, the analyzing the original audio data according to the audio-related motion characteristic data to obtain target audio data includes:
determining a target sound-emitting subject from the plurality of sound-emitting subjects;
obtaining target audio-related motion characteristic data corresponding to the target sounding subject;
and analyzing the original audio data according to the target audio related motion characteristic data to obtain target audio data.
Optionally, the obtaining audio-related motion feature data in the original video data includes: and obtaining audio-related motion characteristic data corresponding to the unique sound-emitting subject in the original video data.
Optionally, the obtaining of the original voice data corresponding to the original video data includes: original voice data corresponding to the original video data at an input time is obtained.
Optionally, the obtaining original voice data corresponding to the original video data in input time includes: original voice data corresponding to the original video data in input time and coming from a plurality of sound-producing subjects are obtained.
Optionally, the original video data includes part or all of the plurality of sound-generating subjects.
Optionally, the processing the target audio data according to a predetermined audio processing manner includes:
obtaining voice use level information corresponding to the target audio data;
and performing enhancement processing or suppression processing on the sound signal of the target voice data according to the voice use level information.
Optionally, the obtaining of the voice usage level information corresponding to the target audio data includes:
acquiring attribute information of a sounding body corresponding to the target audio data;
and acquiring the voice use level information corresponding to the target audio data according to the attribute information of the sound-producing main body corresponding to the target audio data.
Optionally, the obtaining attribute information of the sound-generating subject corresponding to the target audio data includes:
performing framing processing on the original video data to obtain a target image;
detecting the target image to obtain a face image corresponding to the audio-related motion characteristic data in the target image;
extracting the features of the facial image to obtain facial feature data;
and matching the facial feature data with a preset facial feature database to obtain the information of the main body corresponding to the facial feature data, and determining the information of the main body corresponding to the facial feature data as the attribute information of the sound-producing main body corresponding to the target audio data.
Optionally, the method further includes: and intercepting video data corresponding to the audio-related motion characteristic data in the original video data to obtain target video data corresponding to the target audio data.
Optionally, the method further includes: and storing the target audio data and the target video data in a mutual correlation mode.
Optionally, the method further includes: and combining a plurality of target video data corresponding to the same sounding main body in the original video data to obtain the video animation corresponding to the same sounding main body.
Optionally, the method further includes: and carrying out camera tracking on the sounding main body corresponding to the target audio data.
An embodiment of the present application further provides a voice data processing apparatus, including:
the original voice data obtaining unit is used for obtaining original voice data corresponding to the original video data;
the audio-related motion characteristic data acquisition unit is used for acquiring audio-related motion characteristic data in the original video data;
the target voice data obtaining unit is used for analyzing and obtaining target voice data from the original voice data according to the audio-related motion characteristic data;
and the target voice data processing unit is used for processing the target voice data according to a preset voice processing mode.
The embodiment of the present application further provides an electronic device, which is characterized by comprising a processor and a memory; wherein the memory is to store one or more computer instructions, wherein the one or more computer instructions are to be executed by the processor to:
obtaining audio-related motion characteristic data in original video data, wherein the audio-related motion characteristic data refers to motion state data associated with a sound production event corresponding to the original video data;
obtaining original audio data corresponding to the original video data;
analyzing the original audio data according to the audio-related motion characteristic data to obtain target audio data;
and processing the target audio data according to a preset audio processing mode.
Compared with the prior art, the embodiment of the application has the following advantages:
the audio data processing method provided by the embodiment of the application comprises the following steps: acquiring original audio data corresponding to the original video data; obtaining audio-related motion characteristic data in original video data, wherein the audio-related motion characteristic data refers to motion state data associated with a sound production event corresponding to the original video data; analyzing and obtaining target audio data from the original audio data according to the audio-related motion characteristic data; and processing the target audio data according to a preset audio processing mode. By using the method, the corresponding target audio data can be obtained from the original audio data in the current scene by utilizing the audio-related motion characteristic data in the original video data, and the target audio data is subjected to data enhancement or data suppression processing in combination with the specific scene, so that voices of different sounding subjects are separated.
Drawings
Fig. 1 is a flowchart of an audio data processing method according to a first embodiment of the present application;
FIG. 1-A is a schematic diagram of the acquisition of target audio data provided in the first embodiment of the application;
fig. 2 is a block diagram of units of an audio data processing apparatus according to a second embodiment of the present application;
fig. 3 is a schematic logical structure diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
Aiming at a voice separation scene, in order to reduce the complexity of a voice separation process and improve the voice separation efficiency, the application provides an audio data processing method, an audio data processing device corresponding to the method and electronic equipment. The following provides embodiments to explain the method, apparatus, and electronic device in detail.
A first embodiment of the present application provides an audio data processing method. The execution body of the method may be a computing device application for implementing voice separation. Fig. 1 is a flowchart of the audio data processing method provided in the first embodiment of the present application, and the method provided in this embodiment is described in detail below with reference to fig. 1. The following description refers to embodiments for the purpose of illustrating the principles of the method and is not intended to be limiting in actual use.
As shown in fig. 1, the audio data processing method provided by the present embodiment includes the following steps:
s101, original audio data corresponding to the original video data are obtained.
This step is used to obtain the original audio data corresponding to the original video data, that is, to obtain the original audio data corresponding to the original video data in the recording time, where the sound-generating subject corresponding to the original audio data may be wholly or partially present in the original video data.
In this embodiment, the obtained original audio data is preferably original voice data that corresponds to the original video data within the recording time and comes from multiple sound-producing subjects. For example, in a scene where multiple users order food at the same time, or in a multi-person interaction scene of an intelligent service robot (the intelligent robot talks with multiple persons at the same time), the obtained original audio data is mixed speech from multiple sound-producing subjects; as another example, in a symphony performance scene, the obtained original audio data is the mixed music produced when multiple musicians play multiple instruments.
S102, audio related motion characteristic data in the original video data are obtained.
After the original audio data corresponding to the original video data is obtained in the above step, this step is used to obtain audio-related motion feature data in the original video data, where the audio-related motion feature data refers to motion state data associated with a sound-emitting event corresponding to the original video data.
In this embodiment, the original video data is video data that is synchronously recorded when voice is recorded. For example, in an intelligent service scenario, the original video data may be video data including a plurality of users captured by an intelligent robot providing an intelligent service in a human-computer interaction process, and the audio-related motion characteristic data is motion state data associated with user vocalization, such as mouth shape change, lip vibration, and the like corresponding to the user speaking. For another example, in a symphony performance scene, the original video data may be an image sequence including performers captured by the audio/video recording device, and the audio-related motion characteristic data is motion state data associated with sounding of musical instruments, such as clicking and stretching of musical instruments by each musician in a music generation process.
It should be noted that the process of obtaining the audio-related motion feature data is not limited to the human-computer interaction scene of an intelligent service or to a symphony performance scene. In any audio/video recording scene, as long as the motion state associated with the sound produced by a sound-producing subject can be captured, the process of obtaining audio-related motion feature data is applicable; no limitation is imposed here.
In this embodiment, the sound-producing subjects captured in the original video data may be some or all of the multiple sound-producing subjects in the current sound production scene. Obtaining the audio-related motion feature data in the original video data may refer to obtaining multiple pieces of audio-related motion feature data corresponding to multiple sound-producing subjects in the original video data, or to obtaining the audio-related motion feature data corresponding to the only sound-producing subject in the original video data. For example, in an intelligent service scenario, when multiple users are within the shooting range of the intelligent robot and speak towards it, motion state data such as mouth shape changes and lip vibrations of these users can be obtained. When multiple users participate in the human-computer interaction process (speaking to the intelligent robot in the same time period) but only one of them is within the shooting range of the intelligent robot and speaks towards it, the motion state data of that user, such as mouth shape changes and lip vibrations, constitutes the audio-related motion feature data; likewise, when only one user speaks towards the intelligent robot and is within its shooting range, the motion state data of that user constitutes the audio-related motion feature data.
In this embodiment, obtaining the audio-related motion feature data in the original video data may refer to obtaining lip motion feature data in the original video data, and the lip motion feature data may be one or more of lip vibration frequency data, lip vibration phase data, and lip vibration amplitude data. The lip vibration frequency data represents the speech rate of the sound-producing subject, the lip vibration phase data represents the points in time at which the lips of the sound-producing subject open and close, and the lip vibration amplitude data represents the amplitude of the opening and closing of the lips. In this embodiment, the lip vibration frequency data, lip vibration phase data, and lip vibration amplitude data are feature vectors obtained by performing feature extraction on the lip images in the original video data.
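As a hedged illustration of how such feature vectors might be derived, the sketch below computes a dominant vibration frequency, its phase, and an opening amplitude from a per-frame lip-opening series; the lip-opening measurement itself (for example from lip landmarks) is assumed and not specified by this embodiment:

```python
# Hypothetical sketch of deriving lip movement feature data (vibration frequency,
# phase, amplitude) from a per-frame lip-opening sequence. How the opening values
# are measured (landmark detector, contour tracker, etc.) is an assumption.
import numpy as np

def lip_motion_features(lip_opening, fps):
    """lip_opening: 1-D array, vertical mouth opening (pixels) per video frame."""
    centered = lip_opening - lip_opening.mean()
    spectrum = np.fft.rfft(centered)
    freqs = np.fft.rfftfreq(len(centered), d=1.0 / fps)
    k = int(np.argmax(np.abs(spectrum[1:]))) + 1       # dominant vibration component
    return {
        "vibration_frequency_hz": float(freqs[k]),      # ~ speech-rate indicator
        "vibration_phase_rad": float(np.angle(spectrum[k])),   # ~ onset timing
        "vibration_amplitude": float(np.abs(centered).max()),  # ~ opening amplitude
    }
```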
In this embodiment, the implementation sequence of the step S101 and the step S102 is not limited, that is, the original audio data corresponding to the original video data may be obtained after the audio-related motion feature data in the original video data is obtained.
And S103, analyzing and obtaining target audio data from the original audio data according to the audio-related motion characteristic data.
After the original audio data corresponding to the original video data is obtained in the above steps and the audio-related motion feature data in the original video data is obtained, the step is used for analyzing and obtaining the target audio data from the original audio data according to the audio-related motion feature data, that is, performing voice separation on the original audio data according to the obtained information to obtain the target audio data corresponding to the specific sound-emitting subject in the original video data.
In this embodiment, the expressive features of sound on the image (the sound spectrum) can be represented by the movement of the lips; that is, for different sound-producing subjects, the speaking speed, the speaking start time, and the speech intensity can be relatively reflected by lip movements. When multiple sound-producing subjects speak, within a certain time interval, the relative strength, relative speed, and relative phase of the voices from the multiple subjects can be distinguished through differences in lip movement. Meanwhile, the relative relation between the speaking speed and the lip vibration frequency of each sound-producing subject is unique among the subjects speaking in different periods. For example, among multiple sound-producing subjects, the speaking speed of subject A relative to subject B is reflected in the difference between their lip vibration frequencies; the speaking start time of subject A relative to subject B is reflected in the difference between their lip vibration phases; and the relative strength of the voices of subject A and subject B is reflected in the difference between their lip opening-and-closing amplitudes. These relative relationships between subject A and subject B remain essentially constant within a certain time interval.
In this embodiment, when the audio-related motion feature data obtained in step S102 comprises multiple pieces of audio-related motion feature data corresponding to multiple sound-producing subjects, a specific manner of analyzing the original audio data according to the audio-related motion feature data to obtain the target audio data in this step may be: determining a target sound-producing subject from the multiple sound-producing subjects, for example, determining the sound-producing user having a dominant role among the multiple sound-producing users according to their positions, poses, facial expressions, estimated identity information, and the like, as they appear in the original video data; obtaining the target audio-related motion feature data corresponding to the target sound-producing subject, for example, one or more of the lip vibration frequency data, lip vibration phase data, and lip vibration amplitude data of the dominant sound-producing user; and analyzing the original audio data according to the obtained target audio-related motion feature data to obtain the target audio data.
It should be noted that, when the obtained audio-related motion feature data is the audio-related motion feature data corresponding to the only sound-producing subject, the audio-related motion feature data is the target audio-related motion feature data.
In this embodiment, analyzing the original audio data according to the target audio-related motion feature data to obtain the target audio data specifically includes: obtaining, based on the original audio data, target audio feature data matched with the target audio-related motion feature data. This process may specifically be: obtaining the target audio feature data matched with the target audio-related motion feature data from multiple pieces of audio feature data corresponding to the original audio data, where the multiple pieces of audio feature data may refer to audio features such as the pitch, loudness, timbre, amplitude, phase, frequency, and speech rate of the voices of the multiple sound-producing subjects contained in the original audio data. In this embodiment, obtaining the target audio feature data matched with the target audio-related motion feature data from the multiple pieces of audio feature data corresponding to the original audio data may specifically refer to: obtaining the target audio feature data matched with the lip motion feature data from the audio feature data corresponding to the original audio data, and determining the audio data corresponding to the target audio feature data in the original audio data as the target audio data.
In this embodiment, when the lip movement feature data is lip vibration frequency data, obtaining the target audio feature data matched with the lip movement feature data from the multiple pieces of audio feature data corresponding to the original audio data may be: matching the lip vibration frequency data with the speech rate feature data in the multiple pieces of audio feature data, for example matching it against the speech rate feature data of each of the multiple sound-producing subjects, to obtain target speech rate feature data matched with the lip vibration frequency data, where the speech rate is the number of language symbols delivered per unit time.
When the lip motion feature data is lip vibration phase data, the obtaining of the target audio feature data matched with the lip motion feature data from the multiple audio feature data corresponding to the original audio data may be: and matching the lip vibration phase data with the voice phase characteristic data in the plurality of audio characteristic data to obtain target voice phase characteristic data matched with the lip vibration phase data.
When the lip movement feature data is lip vibration amplitude data, the obtaining of the target audio feature data matched with the lip movement feature data from the plurality of audio feature data corresponding to the original audio data may be: and matching the lip vibration amplitude data with the voice amplitude characteristic data in the plurality of audio characteristic data to obtain target voice amplitude characteristic data matched with the lip vibration amplitude data.
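Taken together, the three matching steps above amount to comparing a small lip-derived feature vector with candidate audio feature vectors. A minimal sketch, under the assumption that both sides are summarized as scalar rate/phase/amplitude values, might look like the following (the representation and distance measure are illustrative, not prescribed by this embodiment):

```python
# Illustrative sketch (an assumption about one possible realization): choose, from
# several candidate audio feature vectors, the one whose speech-rate / phase /
# amplitude features best agree with the lip movement features.
import numpy as np

def match_audio_to_lips(lip_feats, candidate_audio_feats):
    """lip_feats / each candidate: dicts with 'rate', 'phase', 'amplitude' entries."""
    def score(audio):
        lips_vec = np.array([lip_feats["rate"], lip_feats["phase"], lip_feats["amplitude"]])
        audio_vec = np.array([audio["rate"], audio["phase"], audio["amplitude"]])
        # Smaller normalized distance -> better lip/audio agreement.
        return np.linalg.norm((lips_vec - audio_vec) / (np.abs(lips_vec) + 1e-8))
    best_index = min(range(len(candidate_audio_feats)),
                     key=lambda i: score(candidate_audio_feats[i]))
    return best_index
```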
In this embodiment, as shown in fig. 1-A, the process of analyzing the original audio data according to the audio-related motion feature data to obtain the target audio data may be implemented by a pre-trained deep neural network model. That is, the process of obtaining the target audio feature data matched with the lip motion feature data from the audio feature data corresponding to the original audio data, and determining the audio data corresponding to the target audio feature data in the original audio data as the target audio data, may be implemented by the internal algorithm of the pre-trained deep neural network model. The deep neural network model is trained with lip motion feature data and audio feature data as training samples, for example with lip motion feature data extracted from the lip image sequences of a large number of sound-producing subjects together with the audio feature data of the corresponding original audio data from those subjects. The trained deep neural network model can then output target audio data according to the input lip motion feature data and the input audio feature data.
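The embodiment does not prescribe a particular network architecture. As an illustration only, one possible shape of such a model is an audio-visual masking network that fuses per-frame lip motion features with the mixture spectrogram and predicts the target speaker's spectrogram; the layer types and sizes below are assumptions:

```python
# One assumed shape of the pre-trained deep neural network mentioned above: it fuses
# lip motion features with mixture spectrogram frames and predicts a mask that keeps
# the target speaker's audio. Architecture and dimensions are illustrative only.
import torch
import torch.nn as nn

class AudioVisualMaskNet(nn.Module):
    def __init__(self, n_freq_bins=257, lip_feat_dim=3, hidden=256):
        super().__init__()
        self.fuse = nn.Linear(n_freq_bins + lip_feat_dim, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq_bins)

    def forward(self, mix_spec, lip_feats):
        # mix_spec: (batch, frames, n_freq_bins) magnitude spectrogram of original audio
        # lip_feats: (batch, frames, lip_feat_dim) per-frame lip movement features
        x = torch.relu(self.fuse(torch.cat([mix_spec, lip_feats], dim=-1)))
        x, _ = self.rnn(x)
        mask = torch.sigmoid(self.mask(x))       # values in [0, 1] per time-frequency bin
        return mix_spec * mask                    # estimated target speaker spectrogram
```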
And S104, processing the target audio data according to a preset audio processing mode.
After the target audio data are obtained by analyzing the original audio data according to the audio-related motion characteristic data, the step is used for processing the target audio data according to a preset audio processing mode. Namely, according to the voice use requirement of the current scene aiming at the target audio data, the target audio data is strengthened or weakened.
In this embodiment, the processing of the target audio data according to a predetermined audio processing manner specifically includes the following contents:
First, the voice usage level information corresponding to the target audio data is obtained. The voice usage level information indicates the importance of the target audio data in the current scene, and the voice usage level corresponding to the target audio data can be determined in various ways. For example, it can be determined according to audio information such as the intensity and pitch of the target audio data (for example, the higher the intensity of the target audio data, the higher the corresponding voice usage level); or according to semantic information corresponding to the target audio data (for example, lip language recognition is performed on the target audio data according to the mouth shape change information in the audio-related motion feature data to obtain the corresponding semantic information, the importance of the instruction information corresponding to that semantic information is determined, and the voice usage level information is thereby determined); or according to attribute information (identity information, background information, corresponding role information, etc.) of the sound-producing subject corresponding to the target audio data. In this embodiment, it is preferable to determine the voice usage level corresponding to the target audio data by using the attribute information of the sound-producing subject corresponding to the target audio data, and the process specifically includes the following steps:
A. Acquire the attribute information of the sound-producing subject corresponding to the target audio data. For example, framing processing is performed on the original video data to acquire a target image containing a human body contour among the video images; three-dimensional face detection is performed on the target image to obtain the face image corresponding to the audio-related motion feature data in the target image (for example, if the audio-related motion feature data is the lip motion feature data corresponding to user A speaking, the obtained face image is the face image of user A); feature extraction is performed on the face image to obtain facial feature data; and the extracted facial feature data is matched against a preset facial feature database, which may contain the correspondence between facial feature data and user attribute information. Through this matching process, the attribute information (identity information, background information, corresponding role information, etc.) of the user corresponding to the facial feature data can be obtained and determined as the attribute information of the sound-producing subject corresponding to the target audio data (a code sketch covering steps A and B follows step B below).
B. And obtaining the voice use level information corresponding to the target audio data according to the attribute information of the sound-producing subject corresponding to the target audio data, for example, if the identity information of the sound-producing subject corresponding to the target audio data is a father, determining that the voice use level is a high level.
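A minimal sketch of steps A and B, assuming a hypothetical face-embedding extractor `extract_face_embedding`, a database of (embedding, attribute) pairs, and an illustrative role-to-level mapping, none of which are specified by this embodiment:

```python
# Hedged sketch of steps A and B; `extract_face_embedding`, the database layout,
# the similarity threshold, and the role-to-level mapping are all assumptions.
import numpy as np

def identify_speaker(face_image, face_feature_db, extract_face_embedding, threshold=0.6):
    """face_feature_db: list of (embedding_vector, attribute_info_dict) pairs."""
    query = extract_face_embedding(face_image)           # facial feature data (step A)
    best_attrs, best_sim = None, -1.0
    for ref_embedding, attrs in face_feature_db:
        sim = float(np.dot(query, ref_embedding) /
                    (np.linalg.norm(query) * np.linalg.norm(ref_embedding) + 1e-8))
        if sim > best_sim:
            best_attrs, best_sim = attrs, sim
    # Only accept a match above the similarity threshold.
    return best_attrs if best_sim >= threshold else None

def voice_usage_level(speaker_attributes):
    """Map sound-producing-subject attributes to a usage level (step B)."""
    privileged_roles = {"father", "host", "operator"}     # assumed role list
    if speaker_attributes and speaker_attributes.get("role") in privileged_roles:
        return "high"
    return "normal"
```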
Next, based on the voice usage level information corresponding to the target audio data, voice enhancement processing or voice suppression processing is performed on the sound signal of the target audio data. For example, when the voice usage level of the target audio data is high (i.e., its importance is high), voice enhancement processing is performed on the target audio data; specifically, spectral subtraction may be used to enhance the target audio data in the original audio data. Voice enhancement means that, after the voice signal of the target sound-producing subject is interfered with or submerged by noise or by other sound-producing subjects, the voice signal of the target subject is extracted from the noisy background, so as to suppress and reduce the interference; that is, a clean original voice signal, or a target voice signal serving a specific purpose, is extracted from the noisy speech.
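For the enhancement branch, spectral subtraction in its simplest magnitude-domain form can be sketched as follows; the noise-estimation window and spectral floor are illustrative parameters, not values fixed by this embodiment:

```python
# Minimal spectral-subtraction sketch for the enhancement step, assuming the first
# `noise_frames` STFT frames contain only interference; parameters are illustrative.
import numpy as np

def spectral_subtraction(mix_stft, noise_frames=10, floor=0.01):
    """mix_stft: complex STFT of the noisy signal, shape (freq_bins, frames)."""
    noise_mag = np.abs(mix_stft[:, :noise_frames]).mean(axis=1, keepdims=True)
    mag = np.abs(mix_stft)
    phase = np.angle(mix_stft)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)   # subtract the noise estimate
    return clean_mag * np.exp(1j * phase)                   # reuse the noisy phase
```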
In this embodiment, the video data corresponding to the audio-related motion feature data in the original video data may be intercepted by using the obtained correspondence between the target audio data and the audio-related motion feature data; the intercepted video data is the target video data corresponding to the target audio data, so that the voice data and the video data are matched. For example, if the audio-related motion feature data is the lip motion feature data of user A, the sound-producing subject of the target audio data is user A, and the video segments of user A in the original video data are intercepted; the intercepted video segments match the target audio data. In a conference recording scene with multiple participants, the matched target audio data and target video data can be stored in association with each other to automatically record the conference content. Furthermore, after the attribute information (identity information, background information, corresponding role information, etc.) of the sound-producing subject corresponding to the target audio data is obtained, semantic recognition can be performed on the target audio data to obtain the corresponding semantic information, and the semantic information can be recorded in association with the attribute information of the sound-producing subject, thereby automatically generating a text record of the conference content.
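A sketch of the interception and associated storage described above, under assumed file naming and an in-memory frame list (the embodiment does not prescribe a storage format):

```python
# Illustrative sketch of clipping the target video segment and storing it in
# association with the target audio data; file layout and naming are assumed.
import json
import numpy as np

def save_associated_clip(frames, fps, start_s, end_s, target_audio_path, out_prefix):
    """frames: list of decoded video frames from the original video data."""
    clip = np.asarray(frames[int(start_s * fps): int(end_s * fps)])   # target video data
    np.save(f"{out_prefix}.clip.npy", clip)
    record = {"video_clip": f"{out_prefix}.clip.npy",
              "audio": target_audio_path,
              "start_s": start_s, "end_s": end_s}
    with open(f"{out_prefix}.meta.json", "w") as f:
        json.dump(record, f)   # store target audio and target video in association
    return record
```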
In this embodiment, camera tracking may also be performed on the sounding main body corresponding to the target audio data, for example, for an audio/video entry scene, camera tracking may be performed on the user a by using a correspondence between the target audio data and the audio-related motion characteristic data, and video data corresponding to the audio data of the user a is captured, so as to meet a scene requirement in the audio/video entry process.
After the video data corresponding to the audio-related motion feature data in the original video data is intercepted based on the corresponding relationship between the target audio data and the audio-related motion feature data, a plurality of target video data corresponding to the same sound-producing subject in the original video data can be combined to obtain the video animation corresponding to the same sound-producing subject, for example, the video clips of the user a in the original video data are combined to obtain the video animation corresponding to the sound-producing subject.
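Merging the intercepted segments of one sound-producing subject into a single sequence can be sketched as follows, assuming each segment carries its start time; the ordering rule is an assumption:

```python
# Sketch of merging several clips of the same sound-producing subject into one
# sequence (the "video animation"); ordering by start time is an assumption.
import numpy as np

def merge_subject_clips(clips_with_times):
    """clips_with_times: list of (start_time_s, frames_ndarray) for one subject."""
    ordered = sorted(clips_with_times, key=lambda item: item[0])
    return np.concatenate([frames for _, frames in ordered], axis=0)
```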
For example, in a sound production scene in which the facial image data of the sound-producing subject can be acquired, existing voice separation methods do not consider the correlation between the facial image data of the sound-producing subject and the sound produced in the same time dimension; the facial image data in the scene is therefore not effectively utilized, the voice separation process is not effectively fused with the sound production scene, and the efficiency and accuracy of voice separation in that scene are affected.
In the audio data processing method provided by this embodiment, after the original audio data corresponding to the original video data is obtained, audio-related motion feature data in the original video data is obtained, where the audio-related motion feature data refers to motion state data associated with a sound-generating event corresponding to the original video data, and according to the audio-related motion feature data, target audio data is obtained by analyzing the original audio data, and finally, the target audio data is processed according to a predetermined audio processing manner. By using the method, in a scene in which audio data and video data are synchronously input, corresponding target audio data can be obtained from the original audio data in the current scene by using the audio-related motion characteristic data extracted from the original video data, and the target audio data is subjected to voice enhancement or voice suppression processing by combining the specific use scene of the audio data, so that voices of different sounding subjects are separated. The method combines and uses the video data and the audio data in the voice separation process, and has strong applicability to instruction analysis scenes in which a plurality of people send a plurality of different instructions at the same time.
The second embodiment of the present application also provides an audio data processing apparatus. Since the apparatus embodiment is substantially similar to the method embodiment, the description is relatively brief; for details of the related technical features, reference may be made to the corresponding description of the method embodiment provided above. The following description of the apparatus embodiment is only illustrative.
Please refer to fig. 2 for an understanding of this embodiment. Fig. 2 is a block diagram of the units of the apparatus provided in this embodiment. As shown in fig. 2, the apparatus provided in this embodiment includes:
an original audio data obtaining unit 201, configured to obtain original voice data corresponding to original video data;
an audio-related motion feature data obtaining unit 202, configured to obtain audio-related motion feature data in original video data;
a target audio data obtaining unit 203, configured to analyze the original audio data according to the audio-related motion feature data to obtain target audio data;
and the target audio data processing unit 204 is configured to process the target audio data according to a predetermined audio processing manner.
Analyzing and obtaining target audio data from the original audio data according to the audio-related motion characteristic data, wherein the target audio data comprises: obtaining target audio characteristic data matched with the audio-related motion characteristic data based on the original audio data;
and determining the audio data corresponding to the target audio characteristic data in the original audio data as the target audio data.
Obtaining target audio feature data matched with the audio-related motion feature data based on the original audio data, including:
and obtaining target audio characteristic data matched with the audio-related motion characteristic data from a plurality of audio characteristic data corresponding to the original audio data.
Obtaining audio-related motion feature data in raw video data, comprising: lip movement characteristic data in original video data are obtained;
obtaining target audio characteristic data matched with the audio-related motion characteristic data from a plurality of audio characteristic data corresponding to the original audio data, including: and obtaining target audio characteristic data matched with the lip movement characteristic data from the audio characteristic data corresponding to the original audio data.
Lip movement characteristic data includes:
lip vibration frequency data;
obtaining target audio characteristic data matched with the lip movement characteristic data from a plurality of audio characteristic data corresponding to the original audio data, wherein the target audio characteristic data comprises:
and matching the lip vibration frequency data with the speech rate characteristic data in the plurality of audio characteristic data to obtain target speech rate characteristic data matched with the lip vibration frequency data.
Lip movement characteristic data includes:
lip shake phase data;
obtaining target audio characteristic data matched with the lip movement characteristic data from a plurality of audio characteristic data corresponding to the original audio data, wherein the target audio characteristic data comprises:
and matching the lip vibration phase data with voice phase characteristic data in the plurality of audio characteristic data to obtain target voice phase characteristic data matched with the lip vibration phase data.
Lip movement characteristic data includes:
lip vibration amplitude data;
obtaining target audio characteristic data matched with the lip movement characteristic data from a plurality of audio characteristic data corresponding to the original audio data, wherein the target audio characteristic data comprises:
and matching the lip vibration amplitude data with the voice amplitude characteristic data in the plurality of audio characteristic data to obtain target voice amplitude characteristic data matched with the lip vibration amplitude data.
Obtaining audio-related motion feature data in raw video data, comprising: obtaining a plurality of audio-related motion characteristic data corresponding to a plurality of sound-producing subjects in original video data;
correspondingly, according to the audio-related motion characteristic data, analyzing the original audio data to obtain target audio data, including:
determining a target sounding main body from a plurality of sounding main bodies;
obtaining target audio-related motion characteristic data corresponding to a target sounding subject;
and analyzing the original audio data according to the relevant motion characteristic data of the target audio to obtain target audio data.
Obtaining audio-related motion feature data in raw video data, comprising: and obtaining audio-related motion characteristic data corresponding to the unique sound-emitting subject in the original video data.
Obtaining original voice data corresponding to original video data, including: original voice data corresponding to the original video data at an input time is obtained.
Obtaining original speech data corresponding to original video data at an input time, comprising: original voice data corresponding to the original video data in input time and coming from a plurality of sound-producing subjects is obtained.
The original video data includes part or all of the plurality of sound emission subjects.
Processing the target audio data according to a preset audio processing mode, wherein the processing comprises the following steps:
acquiring voice use level information corresponding to target audio data;
and performing enhancement processing or suppression processing on the sound signal of the target voice data according to the voice use level information.
Obtaining voice use level information corresponding to target audio data, including:
acquiring attribute information of a sounding body corresponding to target audio data;
and acquiring the voice use level information corresponding to the target audio data according to the attribute information of the sound-producing main body corresponding to the target audio data.
Obtaining attribute information of a sound emission subject corresponding to target audio data, including:
performing framing processing on original video data to obtain a target image;
detecting the target image to obtain a face image corresponding to the audio-related motion characteristic data in the target image;
extracting the features of the facial image to obtain facial feature data;
and matching the facial feature data with a preset facial feature database to obtain the information of the main body corresponding to the facial feature data, and determining the information of the main body corresponding to the facial feature data as the attribute information of the sound-producing main body corresponding to the target audio data.
The device also includes: and the video data intercepting unit is used for intercepting the video data corresponding to the audio-related motion characteristic data in the original video data to obtain target video data corresponding to the target audio data.
The device also includes: and the association storage unit is used for storing the target audio data and the target video data in an associated mode.
The device also includes: and the video animation obtaining unit is used for combining a plurality of target video data corresponding to the same sounding main body in the original video data to obtain the video animation corresponding to the same sounding main body.
The device also includes: and the camera tracking unit is used for carrying out camera tracking on the sounding main body corresponding to the target audio data.
By using the device, in a scene in which audio data and video data are synchronously input, corresponding target audio data can be obtained from the original audio data in the current scene by utilizing the audio-related motion characteristic data in the input original video data, and data enhancement or data suppression processing is carried out on the target audio data by combining the specific use scene of the audio data, so that voices of different sounding subjects are separated. The device combines and uses video and voice data in the voice separation process, and has strong applicability to instruction analysis scenes in which a plurality of people send out a plurality of different instructions at the same time.
In the foregoing embodiments, an audio data processing method and an audio data processing apparatus are provided, and in addition, a third embodiment of the present application also provides an electronic device, which is basically similar to the method embodiment and therefore is relatively simple to describe, and the details of the related technical features may be obtained by referring to the corresponding description of the method embodiment provided above, and the following description of the electronic device embodiment is only illustrative. The embodiment of the electronic equipment is as follows:
please refer to fig. 3 for understanding the present embodiment, fig. 3 is a schematic diagram of an electronic device provided in the present embodiment.
As shown in fig. 3, the electronic device includes: a processor 301; a memory 302;
the memory 302 is used for storing a program for audio data processing, and when the program is read and executed by the processor, the program performs the following operations:
acquiring original audio data corresponding to the original video data;
obtaining audio-related motion characteristic data in original video data, wherein the audio-related motion characteristic data refers to motion state data associated with a sound production event corresponding to the original video data;
analyzing and obtaining target audio data from the original audio data according to the audio-related motion characteristic data;
and processing the target audio data according to a preset audio processing mode.
The analyzing and obtaining target audio data from the original audio data according to the audio-related motion characteristic data includes: obtaining target audio characteristic data matched with the audio-related motion characteristic data based on the original audio data;
and determining the audio data corresponding to the target audio characteristic data in the original audio data as the target audio data.
Obtaining target audio feature data matched with the audio-related motion feature data based on the original audio data, including:
and obtaining target audio characteristic data matched with the audio-related motion characteristic data from a plurality of audio characteristic data corresponding to the original audio data.
Obtaining audio-related motion feature data in raw video data, comprising: lip movement characteristic data in original video data are obtained;
obtaining target audio characteristic data matched with the audio-related motion characteristic data from a plurality of audio characteristic data corresponding to the original audio data, including: and obtaining target audio characteristic data matched with the lip movement characteristic data from the audio characteristic data corresponding to the original audio data.
Lip movement characteristic data includes:
lip vibration frequency data;
obtaining target audio characteristic data matched with the lip movement characteristic data from a plurality of audio characteristic data corresponding to the original audio data, wherein the target audio characteristic data comprises:
and matching the lip vibration frequency data with the speech rate characteristic data in the plurality of audio characteristic data to obtain target speech rate characteristic data matched with the lip vibration frequency data.
Lip movement characteristic data includes:
lip shake phase data;
obtaining target audio characteristic data matched with the lip movement characteristic data from a plurality of audio characteristic data corresponding to the original audio data, wherein the target audio characteristic data comprises:
and matching the lip vibration phase data with voice phase characteristic data in the plurality of audio characteristic data to obtain target voice phase characteristic data matched with the lip vibration phase data.
Lip movement characteristic data includes:
lip vibration amplitude data;
obtaining target audio characteristic data matched with the lip movement characteristic data from a plurality of audio characteristic data corresponding to the original audio data, wherein the target audio characteristic data comprises:
and matching the lip vibration amplitude data with the voice amplitude characteristic data in the plurality of audio characteristic data to obtain target voice amplitude characteristic data matched with the lip vibration amplitude data.
Obtaining audio-related motion feature data in raw video data, comprising: obtaining a plurality of audio-related motion characteristic data corresponding to a plurality of sound-producing subjects in original video data;
correspondingly, according to the audio-related motion characteristic data, analyzing the original audio data to obtain target audio data, including:
determining a target sounding main body from a plurality of sounding main bodies;
obtaining target audio-related motion characteristic data corresponding to a target sounding subject;
and analyzing the original audio data according to the relevant motion characteristic data of the target audio to obtain target audio data.
Obtaining the audio-related motion characteristic data in the original video data includes: obtaining audio-related motion characteristic data corresponding to the sole sound-producing subject in the original video data.
Obtaining the original audio data corresponding to the original video data includes: obtaining original audio data corresponding to the original video data in input time.
Obtaining the original audio data corresponding to the original video data in input time includes: obtaining original audio data that corresponds to the original video data in input time and comes from a plurality of sound-producing subjects.
The original video data contains part or all of the plurality of sound-producing subjects.
Processing the target audio data according to the preset audio processing mode includes:
obtaining voice use level information corresponding to the target audio data;
and performing enhancement processing or suppression processing on the sound signal of the target audio data according to the voice use level information.
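A minimal sketch of such level-dependent processing, assuming a numeric use level and gain values chosen only for illustration:

    import numpy as np

    def apply_use_level(target_audio, use_level, enhance_threshold=2,
                        boost_db=6.0, cut_db=-12.0):
        # Subjects at or above the threshold are enhanced, others suppressed.
        gain_db = boost_db if use_level >= enhance_threshold else cut_db
        return np.asarray(target_audio) * (10.0 ** (gain_db / 20.0))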
Obtaining the voice use level information corresponding to the target audio data includes:
obtaining attribute information of the sound-producing subject corresponding to the target audio data;
and obtaining the voice use level information corresponding to the target audio data according to the attribute information of the sound-producing subject corresponding to the target audio data.
Obtaining the attribute information of the sound-producing subject corresponding to the target audio data includes:
performing framing processing on the original video data to obtain a target image;
detecting the target image to obtain a face image corresponding to the audio-related motion characteristic data in the target image;
extracting features of the face image to obtain facial feature data;
and matching the facial feature data with a preset facial feature database to obtain subject information corresponding to the facial feature data, and determining that subject information as the attribute information of the sound-producing subject corresponding to the target audio data.
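The following is a deliberately simplified sketch of that frame-detect-embed-match pipeline. The Haar cascade detector and the flatten-and-normalize embedding are stand-ins (a real system would likely use a dedicated face-embedding model), and face_db is an assumed in-memory mapping from subject identifiers to reference embeddings:

    import cv2
    import numpy as np

    def subject_attributes(video_path, face_db, cascade_path, frame_index=0):
        cap = cv2.VideoCapture(video_path)
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_index)   # framing: pick a target image
        ok, frame = cap.read()
        cap.release()
        if not ok:
            return None
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Detect the face associated with the observed lip motion.
        faces = cv2.CascadeClassifier(cascade_path).detectMultiScale(gray, 1.1, 5)
        if len(faces) == 0:
            return None
        x, y, w, h = faces[0]
        # Crude facial feature data: a normalized, resized grayscale crop.
        embedding = cv2.resize(gray[y:y + h, x:x + w], (32, 32)).astype(np.float32).ravel()
        embedding /= (np.linalg.norm(embedding) + 1e-9)
        # Match against the preset facial feature database by cosine similarity and
        # return the identifier of the best-matching subject.
        best_id = max(face_db, key=lambda sid: float(np.dot(face_db[sid], embedding)))
        return best_id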
The method further includes: intercepting, from the original video data, the video data corresponding to the audio-related motion characteristic data to obtain target video data corresponding to the target audio data.
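Cutting those segments can be as simple as slicing the decoded frames by the time ranges in which the motion was detected; the frame-index range representation is an assumption for this sketch:

    def cut_target_video(frames, motion_ranges):
        # frames: the decoded original video as a sequence of frames;
        # motion_ranges: (start, end) frame indices where audio-related motion occurs.
        return [frames[start:end] for start, end in motion_ranges]

Each cut clip can then be stored in association with its separated target audio, as described next.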
The method further includes: storing the target audio data and the target video data in association with each other.
The method further includes: combining a plurality of target video data corresponding to the same sound-producing subject in the original video data to obtain a video animation corresponding to that subject.
The method further includes: performing camera tracking on the sound-producing subject corresponding to the target audio data.
With the electronic device described above, in a scene where audio data and video data are input synchronously, the audio-related motion characteristic data in the input original video data can be used to obtain the corresponding target audio data from the original audio data of the current scene, and the target audio data can then be enhanced or suppressed according to the specific usage scene of the audio data, so that the voices of different sound-producing subjects are separated. Because video data and voice data are combined during voice separation, the method is well suited to instruction-parsing scenes in which several people issue different instructions at the same time.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
Although the present application has been described with reference to preferred embodiments, they are not intended to limit the application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the protection scope of the present application should be determined by the appended claims.

Claims (21)

1. A method of audio data processing, comprising:
obtaining original audio data corresponding to original video data;
obtaining audio-related motion characteristic data in the original video data, wherein the audio-related motion characteristic data refers to motion state data associated with a sound production event corresponding to the original video data;
analyzing the original audio data according to the audio-related motion characteristic data to obtain target audio data;
and processing the target audio data according to a preset audio processing mode.
2. The method of claim 1, wherein the analyzing the original audio data according to the audio-related motion characteristic data to obtain target audio data comprises:
obtaining target audio characteristic data matched with the audio-related motion characteristic data based on the original audio data;
and determining the audio data corresponding to the target audio characteristic data in the original audio data as target audio data.
3. The method of claim 2, wherein the obtaining target audio characteristic data matched with the audio-related motion characteristic data based on the original audio data comprises:
obtaining target audio characteristic data matched with the audio-related motion characteristic data from a plurality of audio characteristic data corresponding to the original audio data.
4. The method of claim 3, wherein the obtaining audio-related motion characteristic data in the original video data comprises: obtaining lip movement characteristic data in the original video data;
and the obtaining target audio characteristic data matched with the audio-related motion characteristic data from the plurality of audio characteristic data corresponding to the original audio data comprises: obtaining target audio characteristic data matched with the lip movement characteristic data from the plurality of audio characteristic data corresponding to the original audio data.
5. The method of claim 4, wherein the lip movement characteristic data comprises:
lip vibration frequency data;
and the obtaining target audio characteristic data matched with the lip movement characteristic data from the plurality of audio characteristic data corresponding to the original audio data comprises:
matching the lip vibration frequency data with speech rate characteristic data in the plurality of audio characteristic data to obtain target speech rate characteristic data matched with the lip vibration frequency data.
6. The method of claim 4, wherein the lip movement characteristic data comprises:
lip vibration phase data;
and the obtaining target audio characteristic data matched with the lip movement characteristic data from the plurality of audio characteristic data corresponding to the original audio data comprises:
matching the lip vibration phase data with voice phase characteristic data in the plurality of audio characteristic data to obtain target voice phase characteristic data matched with the lip vibration phase data.
7. The method of claim 4, wherein the lip movement characteristic data comprises:
lip vibration amplitude data;
and the obtaining target audio characteristic data matched with the lip movement characteristic data from the plurality of audio characteristic data corresponding to the original audio data comprises:
matching the lip vibration amplitude data with voice amplitude characteristic data in the plurality of audio characteristic data to obtain target voice amplitude characteristic data matched with the lip vibration amplitude data.
8. The method of claim 1, wherein the obtaining audio-related motion characteristic data in the original video data comprises: obtaining a plurality of audio-related motion characteristic data corresponding to a plurality of sound-producing subjects in the original video data;
and correspondingly, the analyzing the original audio data according to the audio-related motion characteristic data to obtain target audio data comprises:
determining a target sound-producing subject from the plurality of sound-producing subjects;
obtaining target audio-related motion characteristic data corresponding to the target sound-producing subject;
and analyzing the original audio data according to the target audio-related motion characteristic data to obtain the target audio data.
9. The method of claim 1, wherein the obtaining audio-related motion characteristic data in the original video data comprises: obtaining audio-related motion characteristic data corresponding to the sole sound-producing subject in the original video data.
10. The method of claim 1, wherein the obtaining original audio data corresponding to original video data comprises: obtaining original audio data corresponding to the original video data in input time.
11. The method of claim 10, wherein the obtaining original audio data corresponding to the original video data in input time comprises: obtaining original audio data that corresponds to the original video data in input time and comes from a plurality of sound-producing subjects.
12. The method of claim 11, wherein some or all of the plurality of sound-producing subjects are contained in the original video data.
13. The method of claim 1, wherein the processing the target audio data according to the preset audio processing mode comprises:
obtaining voice use level information corresponding to the target audio data;
and performing enhancement processing or suppression processing on the sound signal of the target audio data according to the voice use level information.
14. The method of claim 13, wherein the obtaining voice use level information corresponding to the target audio data comprises:
obtaining attribute information of the sound-producing subject corresponding to the target audio data;
and obtaining the voice use level information corresponding to the target audio data according to the attribute information of the sound-producing subject corresponding to the target audio data.
15. The method of claim 14, wherein the obtaining attribute information of the sound-producing subject corresponding to the target audio data comprises:
performing framing processing on the original video data to obtain a target image;
detecting the target image to obtain a face image corresponding to the audio-related motion characteristic data in the target image;
extracting features of the face image to obtain facial feature data;
and matching the facial feature data with a preset facial feature database to obtain subject information corresponding to the facial feature data, and determining the subject information corresponding to the facial feature data as the attribute information of the sound-producing subject corresponding to the target audio data.
16. The method of claim 1, further comprising:
intercepting, from the original video data, the video data corresponding to the audio-related motion characteristic data to obtain target video data corresponding to the target audio data.
17. The method of claim 16, further comprising:
storing the target audio data and the target video data in association with each other.
18. The method of claim 16, further comprising: combining a plurality of target video data corresponding to the same sound-producing subject in the original video data to obtain a video animation corresponding to that subject.
19. The method of claim 1, further comprising:
performing camera tracking on the sound-producing subject corresponding to the target audio data.
20. An audio data processing apparatus, comprising:
an original audio data obtaining unit, configured to obtain original audio data corresponding to original video data;
an audio-related motion characteristic data obtaining unit, configured to obtain audio-related motion characteristic data in the original video data;
a target audio data obtaining unit, configured to analyze the original audio data according to the audio-related motion characteristic data to obtain target audio data;
and a target audio data processing unit, configured to process the target audio data according to a preset audio processing mode.
21. An electronic device comprising a processor and a memory, wherein
the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method of any one of claims 1 to 19.
CN202010135093.4A 2020-03-02 2020-03-02 Voice data processing method and device Pending CN113362849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010135093.4A CN113362849A (en) 2020-03-02 2020-03-02 Voice data processing method and device

Publications (1)

Publication Number Publication Date
CN113362849A true CN113362849A (en) 2021-09-07

Family

ID=77523038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010135093.4A Pending CN113362849A (en) 2020-03-02 2020-03-02 Voice data processing method and device

Country Status (1)

Country Link
CN (1) CN113362849A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102194456A (en) * 2010-03-11 2011-09-21 索尼公司 Information processing device, information processing method and program
US20130021459A1 (en) * 2011-07-18 2013-01-24 At&T Intellectual Property I, L.P. System and method for enhancing speech activity detection using facial feature detection
CN105512348A (en) * 2016-01-28 2016-04-20 北京旷视科技有限公司 Method and device for processing videos and related audios and retrieving method and device
CN107221324A (en) * 2017-08-02 2017-09-29 上海木爷机器人技术有限公司 Method of speech processing and device
CN107862060A (en) * 2017-11-15 2018-03-30 吉林大学 A kind of semantic recognition device for following the trail of target person and recognition methods
CN109410957A (en) * 2018-11-30 2019-03-01 福建实达电脑设备有限公司 Positive human-computer interaction audio recognition method and system based on computer vision auxiliary
CN110210310A (en) * 2019-04-30 2019-09-06 北京搜狗科技发展有限公司 A kind of method for processing video frequency, device and the device for video processing
CN110223690A (en) * 2019-06-10 2019-09-10 深圳永顺智信息科技有限公司 The man-machine interaction method and device merged based on image with voice
CN110570850A (en) * 2019-07-30 2019-12-13 珠海格力电器股份有限公司 Voice control method, device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115223553A (en) * 2022-03-11 2022-10-21 广州汽车集团股份有限公司 Voice recognition method and driving assistance system
CN115223553B (en) * 2022-03-11 2023-11-17 广州汽车集团股份有限公司 Speech recognition method and driving assistance system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40059854; Country of ref document: HK)