CN116437027A - Audio data processing method and device and electronic equipment


Info

Publication number
CN116437027A
CN116437027A
Authority
CN
China
Prior art keywords
voice
audio data
dubbing
segment
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310485323.3A
Other languages
Chinese (zh)
Inventor
李娜 (Li Na)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu iQIYI Intelligent Innovation Technology Co Ltd
Original Assignee
Chengdu iQIYI Intelligent Innovation Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu iQIYI Intelligent Innovation Technology Co Ltd filed Critical Chengdu iQIYI Intelligent Innovation Technology Co Ltd
Priority to CN202310485323.3A
Publication of CN116437027A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00: Details of television systems
    • H04N5/222: Studio circuitry; Studio devices; Studio equipment
    • H04N5/262: Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233: Processing of audio elementary streams
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439: Processing of audio elementary streams
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application discloses an audio data processing method and device, and electronic equipment. The sound effect and volume of the human voice in the target audio are adaptively adjusted with reference to the original audio, so that the human voice in the processed audio data has a sound effect matched with the scene and a volume higher than that of the international sound. The user can therefore clearly hear the dubbed voice while watching the video, and the voice's sound effect matches the scene, giving the user a better listening experience.

Description

Audio data processing method and device and electronic equipment
Technical Field
The application belongs to the technical field of voice processing, and particularly relates to an audio data processing method, an audio data processing device and electronic equipment.
Background
Dubbing is an important link in the video production process. With the continued development of artificial intelligence (AI), AI-based dubbing schemes have emerged. In such a scheme, one dubbing actor dubs multiple characters in the video (possibly using the same timbre), and each character's dubbing is then converted into the timbre corresponding to that character. In a specific application, one dubbing actor can dub several characters, or even all of the characters, in the video. An AI-based dubbing scheme can greatly reduce labor costs.
However, compared with audio produced by a conventional dubbing workflow, the audio generated by an AI-based dubbing scheme still has a large quality gap, and the user's listening experience is poor.
Disclosure of Invention
In order to solve the above technical problems or at least partially solve the above technical problems, the application provides an audio data processing method, an audio data processing device and electronic equipment.
In a first aspect, the present application provides an audio data processing method, applied to an electronic device, the method including:
obtaining dubbing voice segments and mute segments in target audio data;
the method comprises the steps of obtaining an original piece of human voice fragment corresponding to the dubbing human voice fragment from original piece of audio data, wherein the target audio data and the original piece of audio data are audio data of the same object, and the original piece of audio data is audio data subjected to post-processing;
determining the sound effect type of the original piece of human voice fragment;
performing sound effect adjustment on the dubbing voice segments based on the sound effect types of the corresponding original voice segments;
obtaining international sound segments corresponding to the dubbing voice segments from the original audio data;
respectively determining the average energy of the dubbing voice segment subjected to sound effect adjustment and of the corresponding international sound segment;
performing volume adjustment on the dubbing voice segment subjected to the sound effect adjustment based on the average energy of the corresponding international sound segment, so that the average energy of the adjusted dubbing voice segment is higher than the average energy of the corresponding international sound segment;
and splicing the dubbing voice segment subjected to volume adjustment and the mute segment to obtain the processed audio data.
Optionally, the adjusting the sound effect of the dubbing human sound segment based on the sound effect type of the corresponding original human sound segment includes:
obtaining a parameter configuration template corresponding to the sound effect type of the original piece of human voice fragment, wherein the parameter configuration template comprises equalizer parameters and/or reverberator parameters;
and adjusting the dubbing voice segment according to the parameters in the parameter configuration template.
Optionally, the adjusting the volume of the dubbing voice segment with the adjusted sound effect based on the average energy of the corresponding international voice segment includes:
calculating the sum of the average energy of the international sound segment and a preset increment;
if the sum is greater than or equal to an energy threshold value, taking the sum as a target value, and if the sum is less than the energy threshold value, taking the energy threshold value as the target value;
And adjusting the volume of the dubbing voice segment subjected to the sound effect adjustment based on the target value.
Optionally, the adjusting the volume of the dubbing voice segment with the adjusted sound effect based on the average energy of the corresponding international voice segment further includes:
determining the average energy of international sound data in the original piece of audio data;
and determining the average energy of the international sound data in the original piece of audio data as the energy threshold value.
Optionally, the obtaining the dubbing vocal segments and the mute segments in the target audio data includes:
and detecting voice activity of the target audio data to obtain a dubbing voice segment and a mute segment.
Optionally, the obtaining the original piece of the human voice segment corresponding to the dubbing human voice segment from the original piece of audio data includes:
obtaining a human voice time sequence of a target role from the original audio data, wherein the target role is a role corresponding to the dubbing human voice segment;
voice activity detection is carried out on the voice time sequence of the target character so as to obtain an original voice fragment set of the target character;
and obtaining corresponding original piece voice fragments from the original piece voice fragment set of the target role based on the starting time and the ending time of the dubbing voice fragments.
Optionally, the obtaining the human voice time sequence of the target character from the original audio data includes:
obtaining voice data from the original audio data;
and carrying out segmentation clustering on the voice data to obtain the voice time sequence of the target role.
Optionally, the obtaining the voice data from the original audio data includes:
in the case that the original piece of audio data includes dialogue track data, obtaining the dialogue track data as voice data;
and carrying out blind source separation processing on the original piece of audio data to obtain voice data under the condition that the original piece of audio data does not contain voice track data.
Optionally, the obtaining the international sound segment corresponding to the dubbing voice segment from the original audio data includes:
obtaining international sound data from the original piece of audio data;
and obtaining the starting time and the ending time of the dubbing voice segment, and obtaining the international voice segment between the starting time and the ending time from the international voice data.
In a second aspect, the present application provides an audio data processing apparatus comprising:
the first audio data processing module is used for obtaining dubbing voice fragments and mute fragments in the target audio data;
the second audio data processing module is used for obtaining an original piece of human voice fragment corresponding to the dubbing human voice fragment from the original piece of audio data, wherein the target audio data and the original piece of audio data are audio data of the same object, and the original piece of audio data is audio data subjected to post-processing;
the sound effect type determining unit is used for determining the sound effect type of the original piece of human voice fragment;
the sound effect adjusting module is used for adjusting the sound effect of the dubbing human sound fragment based on the sound effect type of the corresponding original human sound fragment;
the international sound segment acquisition module is used for acquiring an international sound segment corresponding to the dubbing voice segment from the original audio data;
the average energy acquisition module is used for respectively determining average energy of the dubbing voice segment subjected to sound effect adjustment and the corresponding international voice segment;
the volume adjusting module is used for adjusting the volume of the dubbing voice segment with the adjusted sound effect based on the average energy of the corresponding international voice segment, so that the average energy of the adjusted dubbing voice segment is higher than the average energy of the corresponding international voice segment;
and the splicing module is used for splicing the dubbing voice segment subjected to volume adjustment and the mute segment to obtain the processed audio data.
In a third aspect, the present application provides an electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor is configured to execute the program to implement each step of any one of the methods described above.
Therefore, the beneficial effects of the application are as follows:
according to the audio data processing method, device and electronic equipment, a dubbing voice segment and a mute segment in target audio data are obtained, an original piece voice segment and an international voice segment corresponding to the dubbing voice segment are obtained from original piece audio data, then the sound effect type of the original piece voice segment is determined, sound effect adjustment is carried out on the dubbing voice segment based on the sound effect type of the corresponding original piece voice segment, average energy of the dubbing voice segment subjected to the sound effect adjustment and the corresponding international voice segment is respectively determined, volume adjustment is carried out on the dubbing voice segment subjected to the sound effect adjustment based on the average energy of the corresponding international voice segment, so that the average energy of the adjusted dubbing voice segment is higher than the average energy of the international voice segment, and then the dubbing voice segment subjected to the volume adjustment and the mute segment are spliced, so that processed audio data is obtained. According to the technical scheme, the sound effect and the volume of the voice in the target audio are adaptively adjusted by referring to the original audio (the voice has the sound effect matched with the scene), so that the voice in the processed audio data has the sound effect matched with the scene, the volume of the voice in the processed audio data is larger than that of the international voice, the voice in the dubbing can be clearly heard by a user in the process of watching the video, and the sound effect of the voice is matched with the scene, so that better hearing feeling is brought to the user.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an audio data processing method disclosed in the present application;
FIG. 2 is a flow chart of a method for obtaining an original piece of vocal segments corresponding to a dubbed vocal segment from original piece of audio data disclosed in the present application;
fig. 3 is a schematic structural diagram of an audio processing device disclosed in the present application;
fig. 4 is a block diagram of an electronic device disclosed in the present application.
Detailed Description
The applicant found that the main reasons for the poor listening experience of audio generated by AI-based dubbing schemes are the following. First, the volume of the human voice does not match the volume of the international sound (all sounds other than the human voice, including environmental background sound, added music, and effect sounds, commonly called the M&E track); for example, the human voice is drowned out by the international sound, or the human voice is excessively loud. Second, the human voice lacks sound effects matched to the scene.
Based on the above findings, the application discloses an audio data processing method, an audio data processing device and electronic equipment, which adaptively adjust the volume and the sound effect of target audio by referring to original audio, so that the adjusted audio provides better hearing feeling for users.
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
First, the original piece of audio data and the target audio data in the present application will be described.
The original audio data and the target audio data are audio data of the same object (a video, such as a film or TV work), and the human voice in the original audio data (including dialogue, and possibly narration) and the human voice in the target audio data are in different languages. The original audio data is audio data that has undergone post-processing: the human voice in the original audio data is usually performed by professional actors according to the script and refined in post-production (e.g., volume adjustment and added sound effects), so it has natural volume fluctuation and sound effects matched to the scenes. The target audio data is audio data generated by an AI-based dubbing scheme, and there is a large quality gap between it and the original audio data.
For example, when a domestically produced film or TV work is exported overseas, it needs to be dubbed in the language of the destination market; that is, the human voice in the original audio data of the work is Chinese, while the human voice in the target audio data is the destination language. Conversely, when an overseas film or TV work is introduced into China, it needs to be dubbed in Chinese; that is, the human voice in the original audio data is the language of the place of origin, and the human voice in the target audio data is Chinese.
Referring to fig. 1, fig. 1 is a flowchart of an audio data processing method disclosed in the present application. The method is performed by an electronic device and includes:
s101: and obtaining dubbing voice segments and mute segments in the target audio data.
The target audio data is obtained by a dubbing actor performing the characters' lines in a language different from that of the human voice in the original audio data. The target audio data contains human voice but does not contain international sound.
Optionally, voice activity detection is performed on the target audio data to obtain dubbing segments and mute segments. Silence segments refer to segments of target audio data that do not contain speech information.
In practice, voice activity detection is performed on the target audio data by using an activity detection model which is trained in advance, so as to obtain a dubbing voice segment and a mute segment.
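To make the detection step concrete, the following is a minimal Python sketch. It stands in for the pre-trained activity detection model with a simple frame-energy rule; the function name, frame length, and threshold are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def simple_vad(samples, sr, frame_ms=30, energy_thresh=1e-4):
    """Split audio into voiced and silent (start_s, end_s) segments.

    samples: 1-D numpy float array; frame_ms and energy_thresh are
    illustrative values, not values from the patent.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []                       # (label, start_s, end_s)
    cur_label, cur_start = None, 0.0
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        label = 'voice' if np.mean(frame ** 2) > energy_thresh else 'silence'
        if label != cur_label:
            if cur_label is not None:
                segments.append((cur_label, cur_start, i * frame_len / sr))
            cur_label, cur_start = label, i * frame_len / sr
    if cur_label is not None:
        segments.append((cur_label, cur_start, n_frames * frame_len / sr))
    voiced = [(s, e) for lab, s, e in segments if lab == 'voice']
    silent = [(s, e) for lab, s, e in segments if lab == 'silence']
    return voiced, silent
```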
S102: and obtaining the original piece of human voice fragment corresponding to the dubbing human voice fragment from the original piece of audio data.
The original human voice segment corresponding to a dubbing voice segment is the segment that belongs to the same character and whose start and end times are the same as, or close to, those of the dubbing voice segment. Because the human voice in the original audio data and the human voice in the target audio data are in different languages, the durations of the same line spoken in the two languages may be identical or may differ somewhat. For example, when the original film is a Mandarin Chinese version of a movie and the issuer needs to make a foreign-language version (e.g., an English or Japanese version), and hence a foreign-language audio file, the duration of a Mandarin line in the original audio data may differ somewhat from, rather than being identical to, the duration of the corresponding foreign-language line in the target audio data.
In the implementation, the starting time and the ending time of the dubbing voice segment are used as the basis, and the original voice segment corresponding to the dubbing voice segment is searched in the original voice segments of the same character. For example, the dubbing voice segment is compared with the starting time and the ending time of each original voice segment, and the original voice segment with the highest time overlap ratio is determined as the original voice segment corresponding to the dubbing voice segment.
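The overlap-based matching described above can be sketched as follows; representing segments as (start, end) tuples in seconds is an assumption for illustration.

```python
def overlap(a, b):
    """Length in seconds of the intersection of two (start, end) spans."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def match_original_segment(dub_seg, original_segs):
    """Return the original vocal segment with the highest time overlap
    with the dubbing vocal segment (both given as (start_s, end_s))."""
    return max(original_segs, key=lambda seg: overlap(dub_seg, seg))
```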
S103: and determining the sound effect type of the original human voice segment.
A sound effect refers to the acoustic character of a sound. Sound effect types include, but are not limited to: electronic (auto-tuned) voice, walkie-talkie, tunnel, cave, gymnasium, and enclosed-space effects. The sound effect of the human voice should match the scene so that the user perceives the audio as natural while watching the video. For example, if a character in a video speaks inside a cave, the character's voice during that time period needs to be given a cave sound effect.
In the implementation, the sound effect recognition model which is trained in advance can be utilized to detect the sound effect of the original piece of the human sound fragment so as to obtain the sound effect type of the original piece of the human sound fragment.
S104: and adjusting the sound effect of the dubbing voice segment based on the sound effect type of the corresponding original voice segment.
The dubbing human voice segment obtained from the target audio data has a correspondence with the original human voice segment obtained from the original audio data. In step S104, for each dubbing voice segment, an original piece of voice segment corresponding to the dubbing voice segment is obtained, and the dubbing voice segment is subjected to sound effect adjustment based on the sound effect type of the original piece of voice segment, so that the dubbing voice segment has the same sound effect as the corresponding original piece of voice segment.
Optionally, the audio effect of the dubbing audio segment is adjusted based on the audio effect type of the corresponding original audio segment, and the following scheme is adopted:
a1: obtaining a parameter configuration template corresponding to the sound effect type of the original human voice segment, wherein the parameter configuration template comprises equalizer parameters and/or reverberator parameters;
a2: and adjusting the dubbing voice segment according to the parameters in the parameter configuration template.
Wherein the equalizer parameters are used to control the frequency characteristics of the audio, for example: by adjusting equalizer parameters of the audio, the electric voice effect and the interphone sound effect can be realized. Reverberator parameters are used to add reverberation effects for audio, such as: by adjusting the reverberator parameters of the audio, the gymnasium sound effect, the tunnel sound effect and the cave sound effect can be realized.
In the implementation, the electronic device invokes the equalizer and the reverberator and processes the dubbing voice segment according to the parameters in the parameter configuration template, so that the processed dubbing voice segment has the corresponding sound effect. The equalizer is a code-implemented module for performing equalization processing on audio, and the reverberator is a code-implemented module for performing reverberation processing on audio.
In the above scheme, corresponding parameter configuration templates are respectively constructed in advance for each sound effect type, and in the process of adjusting the sound effect of the dubbing voice segment, the sound effect of the dubbing voice segment can be adjusted based on the parameters in the parameter configuration templates only by determining the sound effect type of the original piece voice segment corresponding to the dubbing voice segment and obtaining the parameter configuration templates corresponding to the sound effect types.
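The sketch below illustrates the parameter-template idea with two hypothetical sound effect types. The template contents (a band-pass equalizer for a walkie-talkie-like effect and a decaying-echo reverberator for a cave-like effect) are invented for illustration; the patent does not disclose concrete parameter values.

```python
import numpy as np
from scipy.signal import butter, lfilter, fftconvolve

# Illustrative templates only: the actual equalizer/reverberator
# parameters per sound effect type are not disclosed in the patent.
EFFECT_TEMPLATES = {
    'walkie_talkie': {'eq_band_hz': (300.0, 3400.0)},        # equalizer
    'cave': {'reverb_delay_s': 0.08, 'reverb_decay': 0.6},   # reverberator
}

def apply_template(samples, sr, template):
    out = np.asarray(samples, dtype=float)
    if 'eq_band_hz' in template:
        # Equalizer: shape the frequency response (here a band-pass filter).
        lo, hi = template['eq_band_hz']
        b, a = butter(4, [lo, hi], btype='bandpass', fs=sr)
        out = lfilter(b, a, out)
    if 'reverb_delay_s' in template:
        # Reverberator: convolve with a simple train of decaying echoes.
        delay = int(template['reverb_delay_s'] * sr)
        decay = template['reverb_decay']
        taps = 7
        ir = np.zeros(delay * taps + 1)
        ir[0] = 1.0
        for k in range(1, taps + 1):
            ir[k * delay] = decay ** k
        out = fftconvolve(out, ir)[:len(out)]
    return out
```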
S105: and obtaining the international sound segment corresponding to the dubbing voice segment from the original audio data.
In the implementation, the corresponding international sound segment is obtained from the original audio data based on the starting time and the ending time of the dubbing voice segment.
S106: and respectively determining the average energy of the dubbing voice segment subjected to sound effect adjustment and the corresponding international voice segment.
S107: and adjusting the volume of the dubbing voice segment subjected to the sound effect adjustment based on the average energy of the corresponding international voice segment, so that the average energy of the adjusted dubbing voice segment is higher than the average energy of the corresponding international voice segment.
In creating an audio file for video, audio data for each character needs to be fused with international sound data to form a final audio file. If the volume of the human voice in the audio data of the character is not matched with the volume of the international voice, the finally formed audio file has the following problems: the voice is covered by international sound or the volume of the voice is too loud. Therefore, in the present application, the volume adjustment is performed on the human voice in the target audio data with the international sound data as a reference. Note that, the target audio data does not include the international sound data, and therefore, it is necessary to obtain the international sound clip corresponding to the dubbing voice clip from the original audio data.
In the implementation, the average energy of the dubbing voice segment subjected to the sound effect adjustment is determined, and the following scheme is adopted: respectively obtaining the energy of a plurality of sampling points in the dubbing voice segment; an average of the energies of the plurality of sampling points is calculated. The average energy of the international sound fragment is determined by adopting the following scheme: respectively obtaining the energy of a plurality of sampling points in the international sound fragment; an average of the energies of the plurality of sampling points is calculated. The energy of any sampling point is as follows: the square of the amplitude of the sound wave at the sampling point.
Take the determination of the average energy of the sound-effect-adjusted dubbing voice segment as an example.
Suppose N sampling points (N > 1) are taken from the dubbing voice segment; they may be uniformly or non-uniformly spaced. The sound-wave amplitudes at the N sampling points are recorded as:

$\{s(n_0), s(n_1), s(n_2), \ldots, s(n_i), \ldots, s(n_{N-1})\}$

The total energy of the sound-effect-adjusted dubbing voice segment is:

$E = \sum_{i=0}^{N-1} s(n_i)^2$

The average energy of the sound-effect-adjusted dubbing voice segment is:

$\bar{E} = \frac{E}{N} = \frac{1}{N}\sum_{i=0}^{N-1} s(n_i)^2$
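In code, this average energy is simply the mean of the squared sample amplitudes; a minimal sketch:

```python
import numpy as np

def average_energy(samples):
    """Average energy of a segment: (1/N) * sum_i s(n_i)**2, where the
    energy of each sampling point is the squared sound-wave amplitude."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(samples ** 2))
```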
s108: and splicing the dubbing voice segment subjected to volume adjustment and the mute segment to obtain the processed dubbing audio data.
After the sound effect adjustment and the volume adjustment are carried out on each dubbing voice segment, each dubbing voice segment and each mute segment are spliced, and the processed audio data, namely the optimized audio data, are obtained. The dubbing audio segments and the mute segments have a start time and an end time, and the dubbing audio segments and the mute segments are sequentially arranged based on the start time and the end time of the dubbing audio segments and the mute segments and then spliced.
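A minimal splicing sketch, assuming each vocal or mute segment carries its start time, end time, and sample array, and that the segments jointly tile the timeline:

```python
import numpy as np

def splice(segments):
    """Concatenate (start_s, end_s, samples) pieces in time order.

    Assumes the vocal and mute segments jointly tile the timeline, as in
    step S108, so ordered concatenation reconstructs the full audio.
    """
    ordered = sorted(segments, key=lambda seg: seg[0])
    return np.concatenate([samples for _, _, samples in ordered])
```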
It can be understood that the sound effect and the volume of the processed audio data are adjusted according to the original audio data, so that the processed audio data have sound effects matched with scenes, and the volume of the human voice is larger than that of international voice, so that the user can clearly hear the human voice, and a better hearing feeling is brought to the user.
In the audio data processing method shown in fig. 1, obtaining the original human voice segment corresponding to the dubbing voice segment from the original audio data and obtaining the international sound segment corresponding to the dubbing voice segment from the original audio data are described as two separate steps. In implementation, the two steps may be combined into one step, that is, the original human voice segment and the international sound segment corresponding to the dubbing voice segment are obtained together from the original audio data.
It should be noted that the audio data processing method shown in fig. 1 is described for the processing of one character's audio data. In implementation, the audio data of each character is used in turn as the target audio data and the processing shown in fig. 1 is performed for each, so as to obtain the processed audio data of every character; the audio data of the multiple characters and the international sound data are then fused to obtain the audio file of the video.
According to the audio data processing method, dubbing voice segments and mute segments in the target audio data are obtained, and the original human voice segment and the international sound segment corresponding to each dubbing voice segment are obtained from the original audio data. The sound effect type of the original human voice segment is then determined, and the dubbing voice segment is sound-effect-adjusted based on the sound effect type of its corresponding original human voice segment. The average energies of the sound-effect-adjusted dubbing voice segment and of the corresponding international sound segment are respectively determined, and the dubbing voice segment is volume-adjusted based on the average energy of the corresponding international sound segment, so that the average energy of the adjusted dubbing voice segment is higher than that of the international sound segment. Finally, the volume-adjusted dubbing voice segments and the mute segments are spliced to obtain the processed audio data. In this technical scheme, the sound effect and volume of the human voice in the target audio are adaptively adjusted with reference to the original audio (whose human voice has sound effects matched to the scenes), so that the human voice in the processed audio data has sound effects matched to the scenes and a volume higher than that of the international sound. The user can therefore clearly hear the dubbed voice while watching the video, and the voice's sound effects match the scenes, giving the user a better listening experience.
In another embodiment of the present application, the volume of the dubbing voice segment subjected to the sound effect adjustment is adjusted based on the average energy of the corresponding international voice segment, and the following scheme is adopted:
B1: calculating the sum of the average energy E_b of the international sound segment and a preset increment inc;
B2: if the sum is greater than or equal to an energy threshold η_th, taking the sum as the target value E_b′; if the sum is less than the energy threshold, taking the energy threshold η_th as the target value E_b′;
B3: performing volume adjustment on the sound-effect-adjusted dubbing voice segment based on the target value E_b′.
According to the above scheme, the volume of the sound-effect-adjusted dubbing voice segment is adjusted with the average energy of the corresponding international sound segment as a reference. If the sum of the average energy of the international sound segment and the preset increment is greater than or equal to the energy threshold, that sum is used as the target value; that is, when the international sound is loud, the adjusted dubbing voice segment is guaranteed to be louder than the international sound, so that the user can clearly hear the dubbed voice while watching the video. If the sum is less than the energy threshold, the energy threshold is used as the target value; that is, when the international sound is soft, the dubbing voice segment is adjusted to a suitable volume, again so that the user can clearly hear the dubbed voice while watching the video.
Optionally, the volume of the sound-effect-adjusted dubbing voice segment is adjusted based on the target value E_b′ according to the following scheme: the gain is calculated according to

$g = \sqrt{E_b' / E_d}$

and the volume of the dubbing voice segment is adjusted based on the gain g, where E_d is the average energy of the dubbing voice segment and E_b′ is the target value. Since energy is the square of amplitude, scaling the amplitude by g brings the average energy from E_d to E_b′.
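Putting steps B1 and B2 and the gain formula together, a sketch under the same representation assumptions as the earlier snippets:

```python
import numpy as np

def adjust_volume(dub_samples, e_b, inc, eta_th):
    """Steps B1/B2 plus the gain g = sqrt(E_b' / E_d).

    dub_samples: sound-effect-adjusted dubbing vocal segment (numpy array);
    e_b: average energy of the matching international sound segment;
    inc: preset increment; eta_th: energy threshold.
    """
    target = e_b + inc if e_b + inc >= eta_th else eta_th   # B1 and B2
    e_d = float(np.mean(np.asarray(dub_samples, dtype=float) ** 2))
    gain = np.sqrt(target / e_d)      # energy scales with amplitude squared
    return dub_samples * gain         # adjusted average energy becomes target
```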
Optionally, the energy threshold η_th is an empirical value determined through a number of experiments.
Optionally, the volume adjustment is performed on the dubbing voice segment subjected to the sound effect adjustment based on the average energy of the corresponding international voice segment, and the method further includes: determining the average energy of international sound data in the original piece of audio data; an energy threshold is determined based on the average energy of the international sound data in the original piece of audio data.
That is, the energy threshold is determined based on the average energy of the complete international sound data in the original audio data. For example, the average energy of the international sound data in the original audio data may be determined as the energy threshold. Alternatively, a preset adjustment amount may be added to or subtracted from that average energy, and the result used as the energy threshold. In other words, the magnitude of the energy threshold is directly related to the volume of the international sound in the original audio data: if the international sound in the original audio data is soft overall, the energy threshold is a smaller value, and if it is loud overall, the energy threshold is a larger value.
In practice, the average energy of the international sound data is determined as follows: the energies of a plurality of sampling points in the international sound data are obtained respectively; the sum of the energies of the sampling points is calculated; and the average energy of the international sound data is calculated from the energy sum and the number of sampling points.
Another embodiment of the present application focuses on the scheme for obtaining the original human voice segment corresponding to a dubbing voice segment from the original audio data. Referring to fig. 2, the scheme specifically includes:
s201: and obtaining the human voice time sequence of the target character from the original audio data.
The human voice time sequence of each character is audio data with the same duration as the original audio data, and it contains all of that character's voice data. The target character here is the character corresponding to the dubbing voice segment. It should be noted that if the film or TV work contains narration (voice-over), the narrator is also regarded as a character.
S202: and detecting voice activity on the voice time sequence of the target character to obtain an original voice fragment set of the target character.
For any character, voice activity detection (VAD) is performed on that character's human voice time sequence to identify whether each frame of the audio contains voice information, thereby obtaining the character's original human voice segments in the original audio data.
In practice, voice activity detection is performed on the voice time sequence by using an activity detection model which is trained in advance so as to obtain an original voice fragment.
For example, in the original piece of audio data, the human voice section of character a includes: a human voice section 01 of 05 minutes 30 seconds to 06 minutes 00 seconds, a human voice section 02 of 08 minutes 00 seconds to 08 minutes 30 seconds, a human voice section 03 of 10 minutes 10 seconds to 10 minutes 20 seconds. Then the original set of pieces of vocal segments for character a includes the aforementioned vocal segment 01, vocal segment 02, and vocal segment 03.
S203: and obtaining corresponding original piece voice fragments from the original piece voice fragment set of the target role based on the starting time and the ending time of the dubbing voice fragments.
The dubbing vocal segments have a start time and an end time. In the implementation, the starting time and the ending time of the dubbing voice segment are used as the basis, and the original voice segment corresponding to the dubbing voice segment is searched in the original voice segment of the target character. For example, the dubbing voice segment is compared with the starting time and the ending time of each original voice segment, and the original voice segment with the highest time overlap ratio is determined as the original voice segment corresponding to the dubbing voice segment.
In the original audio data, a plurality of human voices of a character may be included in a certain period of time. Based on the situation, in the scheme, firstly, the human voice time sequence of the target character is obtained from the original audio data, then voice activity detection is carried out on the human voice time sequence of the target character so as to obtain the original human voice segment set of the target character, and then the original human voice segment corresponding to the dubbing human voice segment can be accurately obtained by searching in the original human voice segment set of the target character based on the starting time and the ending time of the dubbing human voice segment.
Optionally, the voice time sequence of the target character is obtained from the original audio data, and the following scheme is adopted:
c1: and obtaining the voice data from the original audio data.
In practice, if the original audio data includes a dialogue track, the dialogue track data is obtained as the human voice data; if the original audio data does not contain a dialogue track, blind source separation processing is performed on the original audio data to obtain the human voice data.
The object in the application can be a video such as a movie, a TV series, or a short video. Audio files of movies and TV series typically comprise multiple audio tracks, for example a dialogue track and an international sound track; in that case the dialogue track data is obtained as the human voice data. For videos whose audio tracks cannot be separated, such as short videos, blind source separation processing is performed on the video's audio data to obtain the human voice data in the original audio data.
C2: and carrying out segmentation clustering on the voice data to obtain the voice time sequence of the target role.
In practice, the classification model trained in advance can be utilized to perform segmentation clustering on the voice data so as to obtain the voice time sequence of each role.
Specifically: voice activity detection is performed on the human voice data to divide it into a plurality of vocal segments and mute segments; each vocal segment is input into the pre-trained classification model, which analyzes the segment to obtain the segment's character label; and for each character label, the vocal segments and mute segments bearing that label are spliced into a human voice time sequence.
Each vocal segment has a start time and an end time. For the vocal segments sharing the same character label, the segments are arranged in order of their start (or end) times, a mute segment is placed between adjacent vocal segments (its duration being the interval from the end time of the preceding vocal segment to the start time of the following one), and the arranged vocal segments and mute segments are spliced into that character's human voice time sequence.
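A sketch of rendering one character's human voice time sequence onto a silent timeline of the original duration, assuming the segmentation-clustering step yields (start, end, samples) triples per character:

```python
import numpy as np

def character_timeline(labeled_segs, total_dur_s, sr):
    """Render one character's vocal segments onto a silent timeline with
    the same duration as the original audio data.

    labeled_segs: [(start_s, end_s, samples)] for a single character, as
    produced by the segmentation-clustering step (representation assumed).
    """
    timeline = np.zeros(int(total_dur_s * sr))
    for start, _end, samples in labeled_segs:
        i = int(start * sr)
        timeline[i:i + len(samples)] = samples[:max(0, len(timeline) - i)]
    return timeline
```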
In another embodiment of the present application, a scheme for obtaining an international sound segment corresponding to a dubbing voice segment from original audio data is described, which specifically includes:
d1: international sound data is obtained from the original piece of audio data.
In implementation, if the original piece of audio data includes international audio track data, the international audio track data is obtained, and if the original piece of audio data does not include the international audio track data, blind source separation processing is performed on the original piece of audio data to obtain the international audio data.
D2: the starting time and the ending time of the dubbing voice segment are obtained, and the international voice segment between the starting time and the ending time is obtained from the international voice data.
In the implementation, the international sound segment corresponding to the dubbing voice segment is acquired from the international sound data based on the starting time and the ending time of the dubbing voice segment. For example, if a certain dubbing voice segment starts at 05 minutes 30 seconds and ends at 06 minutes 00 seconds, data from 05 minutes 30 seconds to 06 minutes 00 seconds is acquired from international voice data, and the data is the international voice segment corresponding to the dubbing voice segment.
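The time-based lookup reduces to slicing the international sound data by sample index; a sketch:

```python
def slice_by_time(audio, sr, start_s, end_s):
    """Extract the samples between start_s and end_s (in seconds), e.g. the
    international sound segment aligned with a dubbing vocal segment."""
    return audio[int(start_s * sr):int(end_s * sr)]
```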
The application discloses an audio data processing method and, correspondingly, an audio data processing apparatus; the descriptions of the method and the apparatus may refer to each other.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an audio data processing device disclosed in the present application. The audio data processing apparatus includes:
a first audio data processing module 301, configured to obtain a dubbing vocal segment and a mute segment in the target audio data;
the second audio data processing module 302 is configured to obtain an original piece of human voice segment corresponding to the dubbing human voice segment from the original piece of audio data, where the target audio data and the original piece of audio data are audio data of the same object, and the original piece of audio data is audio data subjected to post-processing;
a sound effect type determining unit 303, configured to determine the sound effect type of the original human voice segment;
the sound effect adjustment module 304 is configured to perform sound effect adjustment on the dubbing voice segments based on the sound effect types of the corresponding original voice segments;
an international sound segment obtaining module 305, configured to obtain an international sound segment corresponding to a dubbing voice segment from original audio data;
the average energy obtaining module 306 is configured to determine average energies of the dubbing voice segment and the corresponding international voice segment after the sound effect adjustment;
the volume adjustment module 307 is configured to perform volume adjustment on the dubbing voice segment subjected to the sound effect adjustment based on the average energy of the corresponding international voice segment, so that the average energy of the adjusted dubbing voice segment is higher than the average energy of the corresponding international voice segment;
and the splicing module 308 is used for splicing the dubbing voice segment and the mute segment subjected to volume adjustment to obtain the processed audio data.
According to the audio data processing device disclosed by the application, the sound effect and the volume of the voice in the target audio are adaptively adjusted by referring to the original audio (the voice has the sound effect matched with the scene), so that the voice in the processed audio data has the sound effect matched with the scene, and the volume of the voice in the processed audio data is larger than that of the international voice, so that a user can clearly hear the voice in the dubbing in the process of watching the video, and the sound effect of the voice is matched with the scene, thereby bringing better hearing feeling to the user.
Optionally, the sound effect adjustment module 304 includes:
the template acquisition unit is used for obtaining a parameter configuration template corresponding to the sound effect type of the original human voice segment, wherein the parameter configuration template comprises equalizer parameters and/or reverberator parameters;
and the sound effect processing unit is used for adjusting the dubbing voice segment according to the parameters in the parameter configuration template.
Optionally, the volume adjustment module 307 includes:
the target value determining unit is used for calculating the sum value of the average energy and the preset increment of the international sound segment, taking the sum value as a target value if the sum value is larger than or equal to the energy threshold value, and taking the energy threshold value as the target value if the sum value is smaller than the energy threshold value;
and the volume adjusting unit is used for adjusting the volume of the dubbing voice segment subjected to the sound effect adjustment based on the target value.
Optionally, the volume adjustment module 307 further includes:
and the threshold value determining unit is used for determining the average energy of the international sound data in the original piece of audio data and determining the average energy of the international sound data in the original piece of audio data as an energy threshold value.
Optionally, the first audio data processing module 301 is specifically configured to: and detecting voice activity of the target audio data to obtain dubbing voice segments and mute segments.
Optionally, the second audio data processing module 302 includes:
the human voice time sequence acquisition unit is used for acquiring a human voice time sequence of a target role from the original audio data, wherein the target role is a role corresponding to the dubbing human voice segment;
the voice segment set acquisition unit is used for detecting voice activity of the voice time sequence of the target character so as to acquire an original voice segment set of the target character;
and the original piece voice segment acquisition unit is used for acquiring corresponding original piece voice segments from the original piece voice segment set of the target role based on the starting time and the ending time of the dubbing voice segments.
Optionally, the voice time sequence acquiring unit is specifically configured to: obtaining voice data from the original audio data; and carrying out segmentation clustering on the voice data to obtain the voice time sequence of the target role.
Optionally, the voice time sequence obtaining unit obtains the voice data from the original audio data specifically by: in the case where the original audio data includes dialogue track data, obtaining the dialogue track data as the voice data; and in the case where the original audio data does not contain dialogue track data, performing blind source separation processing on the original audio data to obtain the voice data.
Optionally, the international sound fragment acquisition module 305 includes:
an international sound data acquisition unit for acquiring international sound data from the original piece of audio data;
and the international sound segment acquisition unit is used for acquiring the starting time and the ending time of the dubbing voice segment and acquiring the international sound segment between the starting time and the ending time from the international sound data.
The application also provides electronic equipment.
Referring to fig. 4, fig. 4 shows a hardware structure of an electronic device including: a processor 401, a memory 402, a communication interface 403, and a communication bus 404.
In the embodiment of the present application, there is at least one of each of the processor 401, the memory 402, the communication interface 403, and the communication bus 404, and the processor 401, the memory 402, and the communication interface 403 communicate with one another through the communication bus 404. The communication bus 404 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. Buses may be classified into address buses, data buses, control buses, and so on.
It should be noted that the structure shown in fig. 4 does not constitute a limitation on the electronic device; as will be understood by those skilled in the art, the electronic device may include more or fewer components than shown in fig. 4, may combine certain components, or may adopt a different arrangement of components.
The respective constituent elements of the electronic device are specifically described below with reference to fig. 4.
The processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device.
The processor 401 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
The memory 402 may include a random-access memory (RAM) and a read-only memory (ROM), and may further include a mass storage device such as at least one disk memory.
Wherein the memory 402 stores a program, the processor 401 may call the program stored in the memory, the program being for:
obtaining dubbing voice segments and mute segments in target audio data;
the method comprises the steps of obtaining an original piece of human voice fragment corresponding to a dubbing human voice fragment from original piece of audio data, wherein target audio data and the original piece of audio data are audio data of the same object, and the original piece of audio data is audio data subjected to post-processing;
determining the sound effect type of the original human voice segment;
performing sound effect adjustment on the dubbing voice segments based on the sound effect types of the corresponding original voice segments;
obtaining international sound segments corresponding to dubbing voice segments from the original audio data;
respectively determining average energy of dubbing voice segments and corresponding international voice segments subjected to sound effect adjustment;
the volume of the dubbing voice segment subjected to the sound effect adjustment is adjusted based on the average energy of the corresponding international voice segment, so that the average energy of the adjusted dubbing voice segment is higher than the average energy of the corresponding international voice segment;
and splicing the dubbing voice segment subjected to volume adjustment and the mute segment to obtain the processed audio data.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
It is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. The audio data processing device and the electronic equipment disclosed in the embodiments correspond to the audio data processing method disclosed in the embodiments, so that the description is simpler, and the relevant parts only need to be referred to in the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. An audio data processing method applied to an electronic device, the method comprising:
obtaining dubbing voice segments and mute segments in target audio data;
the method comprises the steps of obtaining an original piece of human voice fragment corresponding to the dubbing human voice fragment from original piece of audio data, wherein the target audio data and the original piece of audio data are audio data of the same object, and the original piece of audio data is audio data subjected to post-processing;
determining the sound effect type of the original piece of human voice fragment;
performing sound effect adjustment on the dubbing voice segments based on the sound effect types of the corresponding original voice segments;
Obtaining international sound segments corresponding to the dubbing voice segments from the original audio data;
respectively determining average energy of dubbing voice segments and corresponding international voice segments subjected to sound effect adjustment;
the volume of the dubbing voice segment subjected to the sound effect adjustment is adjusted based on the average energy of the corresponding international voice segment, so that the average energy of the adjusted dubbing voice segment is higher than the average energy of the corresponding international voice segment;
and splicing the dubbing voice segment subjected to volume adjustment and the mute segment to obtain the processed audio data.
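For the energy steps of claim 1, "average energy" can plausibly be read as mean squared sample amplitude, and the international sound segment as the music-and-effects mix without dialogue. Under those assumptions (the margin factor below is illustrative, not from the claim), a minimal sketch:

```python
import numpy as np

def average_energy(x: np.ndarray) -> float:
    # Mean squared amplitude -- one plausible reading of "average energy".
    return float(np.mean(x.astype(np.float64) ** 2))

def match_volume(dub: np.ndarray, intl: np.ndarray, margin: float = 1.5) -> np.ndarray:
    # Scale the dub so its average energy exceeds the international
    # segment's by a hypothetical margin (any margin > 1 satisfies
    # the claim's "higher than" condition).
    gain = np.sqrt(average_energy(intl) * margin / max(average_energy(dub), 1e-12))
    return dub * gain
```

Because amplitude scaling by a gain g scales energy by g squared, the square root maps the desired energy ratio back to an amplitude gain.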
2. The method of claim 1, wherein performing sound effect adjustment on the dubbing voice segment based on the sound effect type of the corresponding original voice segment comprises:
obtaining a parameter configuration template corresponding to the sound effect type of the original voice segment, wherein the parameter configuration template comprises equalizer parameters and/or reverberator parameters;
and adjusting the dubbing voice segment according to the parameters in the parameter configuration template.
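As one way to realize such a template, the sketch below maps each sound effect type to a crude "equalizer" (a band-pass filter) and a toy one-tap "reverberator"; the template fields, type names, and parameter values are all invented for illustration.

```python
import numpy as np
from scipy import signal

# Hypothetical parameter-configuration templates keyed by sound effect type.
TEMPLATES = {
    "telephone": {"band_hz": (300.0, 3400.0), "reverb_wet": 0.0},
    "hall":      {"band_hz": (80.0, 12000.0), "reverb_wet": 0.35},
}

def apply_effect_template(dub: np.ndarray, effect: str, sr: int = 48000) -> np.ndarray:
    cfg = TEMPLATES.get(effect)
    if cfg is None:
        return dub
    # Crude "equalizer": band-pass the dub to the template's frequency band.
    sos = signal.butter(4, cfg["band_hz"], btype="bandpass", fs=sr, output="sos")
    out = signal.sosfilt(sos, dub)
    if cfg["reverb_wet"] > 0:
        # Toy "reverberator": a single decaying echo mixed in at the wet ratio.
        echo = np.zeros_like(out)
        d = int(0.05 * sr)
        echo[d:] = out[:-d] * 0.6
        out = out + cfg["reverb_wet"] * echo
    return out
```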
3. The method of claim 1, wherein adjusting the volume of the sound-effect-adjusted dubbing voice segment based on the average energy of the corresponding international sound segment comprises:
calculating the sum of the average energy of the international sound segment and a preset increment;
if the sum is greater than or equal to an energy threshold, taking the sum as a target value; if the sum is less than the energy threshold, taking the energy threshold as the target value;
and adjusting the volume of the sound-effect-adjusted dubbing voice segment based on the target value.
4. The method of claim 3, wherein adjusting the volume of the sound-effect-adjusted dubbing voice segment based on the average energy of the corresponding international sound segment further comprises:
determining the average energy of the international sound data in the original audio data;
and using the average energy of the international sound data in the original audio data as the energy threshold.
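Claims 3 and 4 together define the target the dub is raised to: the international segment's average energy plus a preset increment, floored by the average energy of the whole international track. A minimal sketch (the increment value below is arbitrary):

```python
import numpy as np

def average_energy(x: np.ndarray) -> float:
    return float(np.mean(x.astype(np.float64) ** 2))

def target_energy(intl_segment: np.ndarray, intl_full: np.ndarray,
                  increment: float = 0.01) -> float:
    candidate = average_energy(intl_segment) + increment   # claim 3's sum
    threshold = average_energy(intl_full)                  # claim 4's threshold
    return max(candidate, threshold)   # claim 3: whichever is larger wins
```

The dub is then scaled so its average energy reaches this target, as in the sketch following claim 1.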
5. The method of claim 1, wherein obtaining dubbing voice segments and silent segments in the target audio data comprises:
performing voice activity detection on the target audio data to obtain the dubbing voice segments and the silent segments.
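Voice activity detection is a standard primitive; the sketch below uses the WebRTC VAD as one possible backend (the claim mandates no particular detector) to split 16-bit mono PCM into alternating voiced and silent runs.

```python
import webrtcvad  # pip install webrtcvad -- example VAD backend

def voice_and_silence(pcm16: bytes, sr: int = 16000, frame_ms: int = 30):
    # Classify fixed-size frames, then merge consecutive equal labels
    # into (label, start_ms, end_ms) runs.
    vad = webrtcvad.Vad(2)                  # aggressiveness 0-3
    step = sr * frame_ms // 1000 * 2        # bytes per frame (16-bit samples)
    flags = [vad.is_speech(pcm16[i:i + step], sr)
             for i in range(0, len(pcm16) - step + 1, step)]
    segments, start = [], 0
    for i in range(1, len(flags) + 1):
        if i == len(flags) or flags[i] != flags[start]:
            segments.append(("voice" if flags[start] else "silence",
                             start * frame_ms, i * frame_ms))
            start = i
    return segments
```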
6. The method of claim 1, wherein obtaining, from the original audio data, the original voice segment corresponding to the dubbing voice segment comprises:
obtaining a voice time sequence of a target character from the original audio data, wherein the target character is the character corresponding to the dubbing voice segment;
performing voice activity detection on the voice time sequence of the target character to obtain a set of original voice segments of the target character;
and obtaining the corresponding original voice segment from the set of original voice segments of the target character based on the start time and end time of the dubbing voice segment.
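The claim matches by start and end time without saying how; maximum temporal overlap is one natural criterion, assumed in the sketch below.

```python
def find_original_segment(dub_start: float, dub_end: float, originals):
    # `originals` is a list of (start, end) tuples in seconds for the
    # target character; pick the one overlapping the dub interval the most.
    def overlap(seg):
        s, e = seg
        return max(0.0, min(e, dub_end) - max(s, dub_start))
    best = max(originals, key=overlap, default=None)
    return best if best is not None and overlap(best) > 0 else None
```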
7. The method of claim 6, wherein obtaining the voice time sequence of the target character from the original audio data comprises:
obtaining voice data from the original audio data;
and performing segmentation and clustering on the voice data to obtain the voice time sequence of the target character.
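"Segmentation and clustering" of voice data is what speaker-diarization toolkits do. The sketch below uses pyannote.audio purely as an example; the model name is an assumption, the pipeline may require an access token, and the patent names no tool.

```python
from pyannote.audio import Pipeline  # example diarization toolkit

# Segmentation plus clustering ("speaker diarization") over the voice data;
# the pretrained model name is an assumption, not part of the disclosure.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipeline("original_voice.wav")

timeline = {}  # speaker label -> list of (start, end) in seconds
for turn, _, speaker in diarization.itertracks(yield_label=True):
    timeline.setdefault(speaker, []).append((turn.start, turn.end))
# The target character's voice time sequence is then timeline[<their label>].
```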
8. The method of claim 7, wherein obtaining the voice data from the original audio data comprises:
in the case that the original audio data includes dialogue track data, using the dialogue track data as the voice data;
and in the case that the original audio data does not include dialogue track data, performing blind source separation on the original audio data to obtain the voice data.
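The branch is: use the dialogue track when one exists, otherwise separate vocals blindly. Spleeter's two-stem model is used below only as an example separator.

```python
from typing import Optional

import numpy as np
from spleeter.separator import Separator  # example separation backend

def extract_voice(original_wave: np.ndarray,
                  dialogue_track: Optional[np.ndarray]) -> np.ndarray:
    # Prefer an explicit dialogue track; otherwise blind source separation.
    if dialogue_track is not None:
        return dialogue_track
    separator = Separator("spleeter:2stems")      # vocals / accompaniment
    stems = separator.separate(original_wave)     # expects (frames, channels)
    return stems["vocals"]
```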
9. The method of claim 1, wherein obtaining, from the original audio data, the international sound segment corresponding to the dubbing voice segment comprises:
obtaining international sound data from the original audio data;
and obtaining the start time and end time of the dubbing voice segment, and obtaining, from the international sound data, the international sound segment between the start time and the end time.
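Given the international sound data as a sample array, extracting the segment reduces to a slice between the dub's start and end times (the sample rate below is illustrative):

```python
import numpy as np

def cut_international(intl_data: np.ndarray, start_s: float, end_s: float,
                      sr: int = 48000) -> np.ndarray:
    # Samples of the international track between the dub's start and end times.
    return intl_data[int(start_s * sr):int(end_s * sr)]
```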
10. An audio data processing apparatus, comprising:
a first audio data processing module, configured to obtain dubbing voice segments and silent segments in target audio data;
a second audio data processing module, configured to obtain, from original audio data, an original voice segment corresponding to each dubbing voice segment, wherein the target audio data and the original audio data are audio data of the same object, and the original audio data is audio data that has undergone post-production processing;
a sound effect type determining unit, configured to determine the sound effect type of the original voice segment;
a sound effect adjusting module, configured to perform sound effect adjustment on each dubbing voice segment based on the sound effect type of the corresponding original voice segment;
an international sound segment acquisition module, configured to obtain, from the original audio data, the international sound segment corresponding to each dubbing voice segment;
an average energy acquisition module, configured to determine, respectively, the average energy of each sound-effect-adjusted dubbing voice segment and of its corresponding international sound segment;
a volume adjusting module, configured to adjust the volume of each sound-effect-adjusted dubbing voice segment based on the average energy of the corresponding international sound segment, so that the average energy of the adjusted dubbing voice segment is higher than that of the corresponding international sound segment;
and a splicing module, configured to splice the volume-adjusted dubbing voice segments with the silent segments to obtain processed audio data.
11. An electronic device, comprising a processor and a memory;
the memory being configured to store a program;
the processor being configured to execute the program so as to carry out the steps of the method according to any one of claims 1 to 9.
CN202310485323.3A 2023-04-28 2023-04-28 Audio data processing method and device and electronic equipment Pending CN116437027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310485323.3A CN116437027A (en) 2023-04-28 2023-04-28 Audio data processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN116437027A true CN116437027A (en) 2023-07-14

Family

ID=87085494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310485323.3A Pending CN116437027A (en) 2023-04-28 2023-04-28 Audio data processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116437027A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination