CN114842858A - Audio processing method and device, electronic equipment and storage medium


Info

Publication number
CN114842858A
CN114842858A (application number CN202210457487.0A)
Authority
CN
China
Prior art keywords
audio file
dubbing
audio
time period
target video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210457487.0A
Other languages
Chinese (zh)
Inventor
李海
文博龙
闫影
甘文东
陈海涛
郭凯旋
王松
李嘉文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu iQIYI Intelligent Innovation Technology Co Ltd
Original Assignee
Chengdu iQIYI Intelligent Innovation Technology Co Ltd
Application filed by Chengdu iQIYI Intelligent Innovation Technology Co Ltd filed Critical Chengdu iQIYI Intelligent Innovation Technology Co Ltd
Priority to CN202210457487.0A
Publication of CN114842858A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L21/013: Adapting to target pitch
    • G10L2021/0135: Voice conversion or morphing
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/57: Speech or voice analysis techniques specially adapted for processing of video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)

Abstract

The invention relates to an audio processing method and apparatus, an electronic device, and a storage medium. The audio processing method includes the following steps: acquiring a first audio file corresponding to a target video and extracting dubbing features corresponding to the dubbing content of a first dubbing object in the first audio file, wherein the first language of the first audio file is different from the second language of the acoustic audio file of the target video; acquiring timbre features corresponding to a second dubbing object, wherein the first dubbing object and the second dubbing object have different timbres; combining the dubbing features and the timbre features to obtain an audio spectrum; and performing audio reconstruction based on the audio spectrum to obtain a second audio file corresponding to the target video. The timbre of the first dubbing object can thus be automatically converted into the timbre of the second dubbing object while the dubbing content and emotion of the first dubbing object are preserved.

Description

Audio processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a storage medium.
Background
As Chinese culture increasingly goes abroad, a large number of domestic film and television dramas are exported overseas every year, and a large number of foreign-language film and television dramas are imported into the country. Dubbing into the local language has therefore become a major bottleneck both for bringing domestic dramas overseas and for bringing overseas dramas to the domestic market.
However, on the one hand, finding overseas dubbing actors in sufficient numbers and of quality matching the characters is far more difficult than finding Chinese dubbing actors, and the time and expense of sourcing dubbing resources are much higher; under current epidemic restrictions, it is almost impossible in some regions. On the other hand, the number of professionals available for Chinese dubbing is very small, the backlog of foreign-language film and television dramas is huge, and the cost of dubbing a single production is very high, so relying entirely on manual dubbing is unaffordable. Moreover, the characters in a drama to be dubbed generally impose certain requirements on timbre, so the skill and suitability of the dubbers are the main factors preventing manual dubbing from being used at scale.
Disclosure of Invention
In order to solve, or at least partially solve, the above technical problem, the present application provides an audio processing method and apparatus, an electronic device, and a storage medium.
In a first aspect, the present application provides an audio processing method, including:
acquiring a first audio file corresponding to a target video, and extracting dubbing features corresponding to the dubbing content of a first dubbing object in the first audio file, wherein the first language of the first audio file is different from the second language of the acoustic audio file of the target video;
acquiring timbre features corresponding to a second dubbing object, wherein the first dubbing object and the second dubbing object have different timbres;
combining the dubbing features and the timbre features to obtain an audio spectrum;
and performing audio reconstruction based on the audio spectrum to obtain a second audio file corresponding to the target video.
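Taken together, the four steps form a voice-conversion pipeline. The following is a minimal sketch of that pipeline in Python; every name in it (the feature extractor, timbre library, decoder, and vocoder) is a hypothetical stand-in supplied by the caller, not an interface defined by this application:

```python
# Illustrative sketch only; all callables are assumed stand-ins.
def convert_dubbing(first_audio, second_speaker_id,
                    extract_dubbing_features, timbre_library,
                    decoder, vocoder):
    # Step 1: keep only the dubbing features (content, prosody) of the
    # first dubbing object; its timbre is deliberately not extracted.
    dubbing_features = extract_dubbing_features(first_audio)
    # Step 2: timbre features of a second dubbing object with a different voice.
    timbre_features = timbre_library[second_speaker_id]
    # Step 3: merge the two feature sets into an audio (Mel) spectrum.
    audio_spectrum = decoder(dubbing_features, timbre_features)
    # Step 4: reconstruct a playable waveform, i.e. the second audio file.
    return vocoder(audio_spectrum)
```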
Optionally, acquiring the first audio file corresponding to the target video includes:
acquiring an acoustic audio file, a dubbing audio file, a first speech text, and a second speech text corresponding to the target video, wherein the first speech text is obtained by performing speech recognition on the dubbing audio file and does not contain role information, the dubbing audio file is obtained by dubbing in a first language different from the second language of the acoustic audio file of the target video, and the second speech text corresponds to the acoustic audio file and contains role information;
determining, according to the target video, the first speech text, and the acoustic audio file, the speaking time periods during which faces belonging to the same role speak and the speech content corresponding to those speaking time periods;
performing track separation on the dubbing audio file according to the speaking time periods, the speech content corresponding to the speaking time periods, and the second speech text, to obtain the speaking time periods of each role and the audio files corresponding to those time periods;
and determining the audio file corresponding to the speaking time period of any role as the first audio file corresponding to the target video.
Optionally, determining, according to the target video, the first speech text, and the acoustic audio file, the speaking time periods during which faces belonging to the same role speak and the speech content corresponding to those time periods includes:
extracting face-occurrence timestamps from the target video;
extracting voiceprint-segment occurrence timestamps from the acoustic audio file;
extracting first-language speech-segment occurrence timestamps from the first speech text;
matching the voiceprint-segment occurrence timestamps with the face-occurrence timestamps to obtain the speaking time periods during which faces belonging to the same role speak;
and matching the speaking time periods of faces belonging to the same role with the first-language speech-segment occurrence timestamps to obtain the speech content corresponding to the speaking time periods.
Optionally, performing track separation on the dubbing audio file according to the speaking time periods, the speech content corresponding to the speaking time periods, and the second speech text, to obtain the speaking time periods of each role and the audio files corresponding to those time periods, includes:
matching the speaking time periods and their corresponding speech content against the second speech text to obtain the time periods during which each role speaks;
and performing track separation on the dubbing audio file according to the time periods during which each role speaks, to obtain the audio file corresponding to each time period.
Optionally, the dubbing features include content features, and extracting the dubbing features corresponding to the dubbing content of the first dubbing object in the first audio file includes:
inputting the first audio file into a preset speech recognition encoder to obtain recognized content;
and inputting the recognized content into a preset content encoder to obtain the content features.
Optionally, the dubbing features include prosodic features, and extracting the dubbing features corresponding to the dubbing content of the first dubbing object in the first audio file includes:
inputting the first audio file into a preset self-supervised speech pre-training model to obtain output data;
and inputting the output data into a preset prosody encoder to obtain the prosodic features.
Optionally, acquiring the timbre features corresponding to the second dubbing object includes:
acquiring the acoustic audio file of the target video;
extracting the acoustic voiceprint features of the acoustic dubbing object from the acoustic audio file;
searching a preset voiceprint library for the voiceprint identifier corresponding to the acoustic voiceprint features;
and determining the timbre features of the dubbing object corresponding to that voiceprint identifier as the timbre features of the second dubbing object.
Optionally, after performing audio reconstruction based on the audio spectrum to obtain the second audio file corresponding to the target video, the method further includes:
acquiring the acoustic audio file of the target video;
performing volume detection on the acoustic audio file to obtain first volume values corresponding to a plurality of first timestamps;
searching the second audio file for the second volume value corresponding to each first timestamp;
and if the difference between a first volume value and the corresponding second volume value is greater than a preset threshold, adjusting the second volume value to the first volume value to obtain an adjusted second audio file.
Optionally, after performing audio reconstruction based on the audio spectrum to obtain the second audio file corresponding to the target video, the method further includes:
performing sound-effect detection on the acoustic audio file to obtain the sound-effect types corresponding to a plurality of second timestamps;
and adding sound effects to the second audio file according to the sound-effect types corresponding to the second timestamps to obtain an adjusted second audio file.
In a second aspect, the present application provides an audio processing apparatus, including:
a first acquisition module, configured to acquire a first audio file corresponding to a target video and to extract the content features and prosodic features of the dubbing content of a first dubbing object in the first audio file, wherein the first language of the first audio file is different from the second language of the acoustic audio file of the target video;
a second acquisition module, configured to acquire the timbre features corresponding to a second dubbing object, the first dubbing object and the second dubbing object having different timbres;
a merging module, configured to merge the content features, the prosodic features, and the timbre features to obtain an audio spectrum;
and a reconstruction module, configured to perform audio reconstruction based on the audio spectrum to obtain a second audio file corresponding to the target video.
Optionally, the first acquisition module includes:
a first acquisition unit, configured to acquire an acoustic audio file, a dubbing audio file, a first speech text, and a second speech text corresponding to the target video, wherein the first speech text is obtained by performing speech recognition on the dubbing audio file and does not contain role information, the dubbing audio file is obtained by dubbing in a first language different from the second language of the acoustic audio file of the target video, and the second speech text corresponds to the acoustic audio file and contains role information;
a first determining unit, configured to determine, according to the target video, the first speech text, and the acoustic audio file, the speaking time periods during which faces belonging to the same role speak and the speech content corresponding to those speaking time periods;
a track-separation unit, configured to perform track separation on the dubbing audio file according to the speaking time periods, the speech content corresponding to the speaking time periods, and the second speech text, to obtain the speaking time periods of each role and the audio files corresponding to those time periods;
and a second determining unit, configured to determine the audio file corresponding to the speaking time period of any role as the first audio file corresponding to the target video.
Optionally, the first determining unit includes:
a first extraction subunit, configured to extract face-occurrence timestamps from the target video;
a second extraction subunit, configured to extract voiceprint-segment occurrence timestamps from the acoustic audio file;
a third extraction subunit, configured to extract first-language speech-segment occurrence timestamps from the first speech text;
a first matching subunit, configured to match the voiceprint-segment occurrence timestamps with the face-occurrence timestamps to obtain the speaking time periods during which faces belonging to the same role speak;
and a second matching subunit, configured to match the speaking time periods of faces belonging to the same role with the first-language speech-segment occurrence timestamps to obtain the speech content corresponding to the speaking time periods.
Optionally, the track-separation unit includes:
a third matching subunit, configured to match the speaking time periods and their corresponding speech content against the second speech text to obtain the time periods during which each role speaks;
and a track-separation subunit, configured to perform track separation on the dubbing audio file according to the time periods during which each role speaks, to obtain the audio file corresponding to each time period.
Optionally, the dubbing features include content features, and the first acquisition module includes:
a first input unit, configured to input the first audio file into a preset speech recognition encoder to obtain recognized content;
and a second input unit, configured to input the recognized content into a preset content encoder to obtain the content features.
Optionally, the dubbing features include prosodic features, and the first acquisition module includes:
a third input unit, configured to input the first audio file into a preset self-supervised speech pre-training model to obtain output data;
and a fourth input unit, configured to input the output data into a preset prosody encoder to obtain the prosodic features.
Optionally, the second acquisition module includes:
a second acquisition unit, configured to acquire the acoustic audio file of the target video;
an extraction unit, configured to extract the acoustic voiceprint features of the acoustic dubbing object from the acoustic audio file;
a first search unit, configured to search a preset voiceprint library for the voiceprint identifier corresponding to the acoustic voiceprint features;
and a third determining unit, configured to determine the timbre features of the dubbing object corresponding to that voiceprint identifier as the timbre features of the second dubbing object.
Optionally, the apparatus further includes, after the reconstruction module:
a third acquisition module, configured to acquire the acoustic audio file of the target video;
a volume detection module, configured to perform volume detection on the acoustic audio file to obtain first volume values corresponding to a plurality of first timestamps;
a first search module, configured to search the second audio file for the second volume value corresponding to each first timestamp;
and a volume adjustment module, configured to adjust the second volume value to the first volume value if the difference between a first volume value and the corresponding second volume value is greater than a preset threshold, to obtain an adjusted second audio file.
Optionally, the apparatus further includes, after the reconstruction module:
a sound-effect detection module, configured to perform sound-effect detection on the acoustic audio file to obtain the sound-effect types corresponding to a plurality of second timestamps;
and a sound-effect adjustment module, configured to add sound effects to the second audio file according to the sound-effect types corresponding to the second timestamps, to obtain an adjusted second audio file.
In a third aspect, the present application provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
the memory is configured to store a computer program;
and the processor is configured to implement the audio processing method of any implementation of the first aspect when executing the program stored in the memory.
In a fourth aspect, the present application provides a computer-readable storage medium on which a program of an audio processing method is stored; when executed by a processor, the program implements the steps of the audio processing method of any implementation of the first aspect.
Compared with the prior art, the technical solution provided by the embodiments of the present application has the following advantages:
Only the dubbing features of the first dubbing object in the first audio file are retained; the timbre features of the first dubbing object are not used. The dubbing features of the first dubbing object are combined with the timbre features of the second dubbing object, so that the second audio file reconstructed from the resulting audio spectrum carries the timbre of the second dubbing object while retaining the dubbing features. The timbre of the first dubbing object is thus automatically converted into the timbre of the second dubbing object while the dubbing content and emotion of the first dubbing object are preserved. Furthermore, the timbre of the first dubbing object in all the first audio files corresponding to the target video can conveniently be converted into the timbres of the corresponding second dubbing objects, so that a single first dubbing object can dub with the timbres of multiple second dubbing objects without additional dubbing actors, while the emotion of the first dubbing object is preserved, meeting the dialogue requirements of film and television drama scenes.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a flowchart of an audio processing method according to an embodiment of the present application;
Fig. 2 is a flowchart of step S101 in Fig. 1 according to an embodiment of the present application;
Fig. 3 is a flowchart of step S102 in Fig. 1 according to an embodiment of the present application;
Fig. 4 is another flowchart of an audio processing method according to an embodiment of the present application;
Fig. 5 is another flowchart of an audio processing method according to an embodiment of the present application;
Fig. 6 is a schematic diagram of an audio processing method in a practical application according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a voice conversion model in a practical application according to an embodiment of the present application;
Fig. 8 is a schematic diagram of an audio processing method in another practical application according to an embodiment of the present application;
Fig. 9 is a schematic diagram of a voice conversion model in another practical application according to an embodiment of the present application;
Fig. 10 is a block diagram of an audio processing apparatus according to an embodiment of the present application;
Fig. 11 is a structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As noted above, on the one hand, finding overseas dubbing actors in sufficient numbers and of quality matching the characters is far more difficult than finding Chinese dubbing actors, and the time and expense of sourcing dubbing resources are much higher; under current epidemic restrictions, it is almost impossible in some regions. On the other hand, the number of professionals available for Chinese dubbing is very small, the backlog of foreign-language film and television dramas is huge, and the cost of dubbing a single production is very high, so relying entirely on manual dubbing is unaffordable. Moreover, the characters in a drama to be dubbed generally impose certain requirements on timbre, so the skill and suitability of the dubbers are the main factors preventing manual dubbing from being used at scale.
Therefore, the embodiments of the present application provide an audio processing method and apparatus, an electronic device, and a storage medium, for use when a video is exported or imported, that is, when a video originally dubbed in Chinese is exported overseas or a video originally dubbed in a non-Chinese language is imported domestically. Only the dubbing features of the first dubbing object in the first audio file are retained, without using the timbre features of the first dubbing object, and those dubbing features are combined with the timbre features of the second dubbing object, so that the second audio file reconstructed from the resulting audio spectrum carries the timbre of the second dubbing object while retaining the dubbing features. The timbre of the first dubbing object is thereby automatically converted into the timbre of the second dubbing object while the content and emotion of the first dubbing object's performance are preserved. Furthermore, the timbre of the first dubbing object in all the first audio files corresponding to the target video can conveniently be converted into the timbres of the corresponding second dubbing objects, so that one first dubbing object can dub with the timbres of multiple second dubbing objects without additional dubbing actors, while the rich emotion of the first dubbing object's lines is preserved, meeting the dialogue requirements of film and television drama scenes.
As shown in Fig. 1, an embodiment of the present application provides an audio processing method, which may include the following steps:
Step S101, acquiring a first audio file corresponding to a target video, and extracting the dubbing features corresponding to the dubbing content of a first dubbing object in the first audio file.
In one embodiment of the present application, the target video is a video to be exported overseas. The target video may correspond to an acoustic audio file and a dubbing audio file. The acoustic audio file is an audio file dubbed in Chinese; that is, the second language may, by way of example, be Chinese. In practice, for the target video to be exported overseas so that overseas audiences can follow it without reading subtitles, it must be dubbed into the corresponding language. The dubbing audio file is therefore an audio file in which the target video is dubbed in a language other than Chinese; that is, the first language may, by way of example, be a non-Chinese language. In practice, the dubbing audio file is generally dubbed by one first dubbing object, although to finish dubbing quickly it may also be completed by two or more first dubbing objects. The first dubbing object is a dubber who can dub the target video into a language other than Chinese, for example a language of the Southeast Asian region.
In another embodiment of the present application, the target video is a video to be imported. The target video may correspond to an acoustic audio file and a dubbing audio file. The acoustic audio file is an audio file dubbed in a non-Chinese language; that is, the second language may, by way of example, be a non-Chinese language, for instance a less widely spoken language of the Southeast Asian region. In practice, for the target video to be imported so that the domestic audience can follow it without reading subtitles, it must be dubbed from that language into Chinese. The dubbing audio file is therefore an audio file in which the target video is dubbed in Chinese; that is, the first language may, by way of example, be Chinese. In practice, the dubbing audio file is generally dubbed by one first dubbing object, although to finish dubbing quickly it may also be completed by two or more first dubbing objects. The first dubbing object is a dubber who can dub the target video into Chinese.
The first audio file corresponds to the roles that have lines in the target video; each role may correspond to at least one audio file, and the first audio file may be a portion of the dubbing audio file, so that the first language of the first audio file is different from the second language of the acoustic audio file of the target video.
The dubbing features characterize the line content, emotional state, prosody, and the like of the first dubbing object, and may include, for example, content features and prosodic features.
In this step, the first audio files corresponding to the target video may be acquired one by one in a preset order, and the dubbing features of the dubbing content of the first dubbing object may then be extracted from each first audio file.
Step S102, acquiring the timbre features corresponding to the second dubbing object.
Because the target video generally has multiple roles with lines, and the timbres of those roles generally differ, each role generally corresponds to one timbre. To dub the voices of different roles in the target video with different timbres, dubbers whose timbres differ from that of the first dubbing object, namely second dubbing objects, need to be selected.
In this embodiment of the present application, a timbre feature library may be constructed in advance, storing the timbre features of a plurality of second dubbing objects; in this step, the timbre features corresponding to the second dubbing object may be obtained from that library.
Step S103, combining the dubbing features and the timbre features to obtain an audio spectrum.
In this step, the dubbing features and the timbre features may be decoded by a decoder to obtain the audio spectrum, which may be a Mel spectrum.
Step S104, performing audio reconstruction based on the audio spectrum to obtain a second audio file corresponding to the target video.
A vocoder can be used to perform audio reconstruction on the audio spectrum, rebuilding a playable waveform file and thus producing the second audio file.
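As a concrete illustration of this reconstruction step, the snippet below inverts a Mel spectrum to a waveform with Griffin-Lim via librosa. This is a simple stand-in chosen for the example; the text only specifies "a vocoder", so a neural vocoder could equally be used, and the sample rate and FFT parameters here are assumptions:

```python
import librosa
import numpy as np
import soundfile as sf

def reconstruct_second_audio(audio_spectrum: np.ndarray, sr: int = 22050,
                             out_path: str = "second_audio.wav") -> None:
    # Griffin-Lim inversion of a Mel spectrum: a simple, non-neural
    # stand-in for the vocoder mentioned in the text.
    waveform = librosa.feature.inverse.mel_to_audio(
        audio_spectrum, sr=sr, n_fft=1024, hop_length=256)
    sf.write(out_path, waveform, sr)  # write a playable waveform file
```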
Based on the above steps, since only the dubbing features in the first audio file are retained and combined with the timbre features of the second dubbing object, the effect is equivalent to replacing the timbre features of the first dubbing object in the first audio file with those of the second dubbing object to obtain the second audio file. In practice, this process can be repeated for each first audio file within the dubbing audio file, replacing the timbre features of the first dubbing object in different first audio files with the timbre features of correspondingly different second dubbing objects. After the resulting second audio files are merged into a complete target dubbing file, the dubbing audio file produced by one first dubbing object has been converted into a target audio file dubbed by multiple second dubbing objects.
In practice, the second audio file generated by audio reconstruction may contain some mechanical artifacts, electrical hum, and a certain degree of noise, so DSP techniques are used to denoise and repair the audio and guarantee the playback quality of the second audio file.
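The text does not name a specific DSP algorithm for this repair, so the following sketch uses spectral-gating noise reduction from the noisereduce library purely as one plausible choice; the file names are illustrative and mono audio is assumed:

```python
import noisereduce as nr
import soundfile as sf

# Read the reconstructed second audio file (mono assumed), suppress
# stationary noise by spectral gating, and write the repaired version out.
audio, sr = sf.read("second_audio.wav")
cleaned = nr.reduce_noise(y=audio, sr=sr)
sf.write("second_audio_denoised.wav", cleaned, sr)
```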
In this way, only the dubbing features of the first dubbing object in the first audio file are retained, without using its timbre features, and they are combined with the timbre features of the second dubbing object, so that the second audio file reconstructed from the resulting audio spectrum carries the timbre of the second dubbing object while retaining the dubbing features. The timbre of the first dubbing object in all the first audio files corresponding to the target video is automatically converted into the timbre of the corresponding second dubbing object while the content and emotion of the dubbing are preserved; one first dubbing object can thus provide dubbing in the timbres of multiple second dubbing objects without additional dubbing actors, and the rich delivery of the first dubbing object's lines is preserved, meeting the dialogue requirements of film and television drama scenes.
In another embodiment of the present application, acquiring the first audio file corresponding to the target video in step S101, as shown in Fig. 2, includes:
Step S201, acquiring an acoustic audio file, a dubbing audio file, a first speech text, and a second speech text corresponding to the target video.
In this embodiment of the application, the first speech text is obtained by performing speech recognition on the dubbing audio file and does not contain role information; the dubbing audio file is obtained by dubbing in a first language different from the second language of the acoustic audio file of the target video; and the second speech text corresponds to the acoustic audio file and contains role information. Illustratively, the first-language lines may be obtained by translating the second speech text, which contains the role information, into the first language.
step S202, determining a speaking time period of a face speaking belonging to the same role and speech content corresponding to the speaking time period according to the target video, the first speech text and the acoustic audio file;
in this step, a face occurrence timestamp may be first extracted from the target video, specifically, face recognition may be performed on each image frame of the target video, and when a face is recognized, the time of the current image frame is recorded to obtain a face occurrence timestamp;
then, extracting a voiceprint fragment occurrence timestamp from the original voice file, specifically, carrying out voiceprint recognition in the original voice file, and recording the time when the voiceprint is recognized to obtain the voiceprint fragment occurrence timestamp;
extracting a first-language speech segment occurrence timestamp from the first speech text, specifically, performing character recognition on the first speech text, and recording the current time when the speech segment is recognized to obtain the first-language speech segment occurrence timestamp;
then matching the voiceprint fragment occurrence time stamp with the face occurrence time stamp, and determining faces belonging to the same role to obtain speaking time periods of faces belonging to the same role, namely, time periods of speaking of each face belonging to the same role;
finally, the time period of speaking of the face belonging to the same role can be matched with the occurrence time stamp of the first language speech segment, so that the speech content corresponding to the speaking time period can be obtained, namely what speech is spoken by each face belonging to the same role in the speaking time period.
Step S203, performing track separation on the dubbing audio file according to the speaking time periods, the speech content corresponding to the speaking time periods, and the second speech text, to obtain the speaking time periods of each role and the audio files corresponding to those time periods.
In this step, the speaking time periods and their corresponding speech content are matched against the second speech text to obtain the time periods during which each role speaks; the dubbing audio file is then separated into tracks according to the time periods during which each role speaks, yielding the audio file corresponding to each time period.
For example, suppose the first-language line content corresponding to certain faces over a period of time is:
"Face No. 3 01:10:01 Mom";
"Face No. 3 01:10:02 I'm going";
"Face No. 3 01:10:03 to school";
"Face No. 2 01:10:06 OK";
"Face No. 2 01:10:10 on the way";
"Face No. 2 01:10:11 be careful";
and the corresponding second-language line content in the second speech text is:
"Xiaohong 01:10:01 Mom";
"Xiaohong 01:10:02 I'm going";
"Xiaohong 01:10:03 to school";
"Mother 01:10:06 OK";
"Mother 01:10:10 on the way";
"Mother 01:10:11 be careful".
Matching the two shows that the role corresponding to face No. 3 is Xiaohong: the time period in which Xiaohong says the line "Mom, I'm going to school" is 01:10:01 to 01:10:03, and the time period in which the mother says "OK, be careful on the way" is 01:10:06 to 01:10:11. The dubbing audio file can therefore be separated into different tracks over the periods 01:10:01 to 01:10:03 and 01:10:06 to 01:10:11, yielding audio file A corresponding to Xiaohong in the first period and audio file B corresponding to the mother in the period 01:10:06 to 01:10:11.
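For illustration, the track separation itself can be sketched with pydub; the role-to-period mapping below is the one from the example, converted to seconds, and both the library choice and the function name are assumptions:

```python
from pydub import AudioSegment

def split_tracks(dubbing_audio_path, role_periods):
    """role_periods: {role: [(start_s, end_s), ...]} from the matching step.
    Returns one audio segment per role, concatenated from its speaking periods."""
    audio = AudioSegment.from_file(dubbing_audio_path)
    tracks = {}
    for role, periods in role_periods.items():
        track = AudioSegment.empty()
        for start_s, end_s in periods:
            track += audio[int(start_s * 1000):int(end_s * 1000)]  # pydub slices in ms
        tracks[role] = track
    return tracks

# 01:10:01 is 4201 s into the file, so the example above becomes:
# tracks = split_tracks("dubbing.wav", {"Xiaohong": [(4201, 4204)],
#                                       "Mother":   [(4206, 4212)]})
```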
In practice, before track separation, a speech recognition algorithm can be used to compare the line text with the recognized text in order to detect errors such as missed tracks; when a missed track is detected, it is corrected manually, improving the accuracy of the separated audio files.
Step S204, determining the audio file corresponding to the speaking time period of any role as the first audio file corresponding to the target video.
In this step, the audio files may be designated one by one, in a certain order, as the first audio file corresponding to the target video. For example: audio file A is designated as the first audio file and audio reconstruction is performed on it to obtain a second audio file; then audio file B is designated as the first audio file; and so on.
In this way, a complete dubbing audio file can be automatically divided into a plurality of first audio files, so that the timbre of the first dubbing object in the first audio file corresponding to each role is replaced with the timbre of a second dubbing object to obtain the second audio file.
In yet another embodiment of the present application, the dubbing features include content features and prosodic features, and extracting, in step S101, the content features and prosodic features corresponding to the dubbing content of the first dubbing object in the first audio file includes the following.
The first audio file is input into a preset speech recognition encoder to obtain recognized content, and the recognized content is input into a preset content encoder to obtain the content features.
For example, the first audio file (Source Audio) may be input into an end-to-end automatic speech recognition encoder (E2E ASR Encoder) to obtain the recognized content (BN), and the recognized content (BN) may be input into a content encoder (Content Encoder) to obtain the content features (content vector).
The first audio file is also input into a preset self-supervised speech pre-training model to obtain output data, and the output data is input into a preset prosody encoder to obtain the prosodic features.
For example, the first audio file (Source Audio) may be input into a self-supervised speech pre-training model (vq-wav2vec pre-trained model) to obtain output data (VQW2V), and the output data (VQW2V) may then be input into a prosody encoder (Prosody Encoder) to obtain the prosodic features (prosody vector).
In this way, the content features and the prosodic features can each be extracted automatically by a model, so that only the content features and prosodic features in the first audio file are retained; they are then combined with the timbre features of the second dubbing object to obtain the second audio file after audio reconstruction.
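The two branches can be summarized in code as follows. The encoders are passed in as callables because the application publishes neither their weights nor their exact signatures; this is a structural sketch, not the patent's implementation:

```python
import torch

def extract_dubbing_features(source_audio: torch.Tensor,
                             asr_encoder, content_encoder,
                             ssl_model, prosody_encoder):
    # Content branch: E2E ASR encoder -> bottleneck features (BN) -> content vector
    bn = asr_encoder(source_audio)
    content_vector = content_encoder(bn)
    # Prosody branch: self-supervised model (e.g. vq-wav2vec) -> prosody vector
    vqw2v = ssl_model(source_audio)
    prosody_vector = prosody_encoder(vqw2v)
    return content_vector, prosody_vector
```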
In another embodiment of the present application, step S102 of acquiring the timbre features corresponding to the second dubbing object, as shown in Fig. 3, includes:
Step S301, acquiring the acoustic audio file of the target video;
Step S302, extracting the acoustic voiceprint features of the acoustic dubbing object from the acoustic audio file.
In this embodiment of the present application, the acoustic dubbing object corresponds to the first audio file; that is, the acoustic dubbing object is the dubbing actor who, in the acoustic audio file, dubs the role corresponding to the first audio file.
In the acoustic audio file, different roles may be dubbed by different dubbing actors. To better reflect the vocal characteristics of each role and stay closer to the acoustic audio file, a dubbing object whose voiceprint is closer to that of the acoustic dubbing object can be selected when the timbre is replaced; this step therefore extracts the acoustic voiceprint features of the acoustic dubbing object.
Step S303, searching a preset voiceprint library for the voiceprint identifier corresponding to the acoustic voiceprint features.
To store different voiceprints and the timbre features corresponding to them, a voiceprint library can be constructed in advance, storing multiple groups of correspondences among voiceprint identifiers, voiceprint features, and timbre features. In practice, the speech audio of a plurality of second dubbing objects can be collected and subjected to voiceprint computation, with the computed voiceprint features stored in the voiceprint library; the speech audio is also input into a speaker encoder, which extracts a feature vector from a short utterance of the specified speaker, yielding the timbre features of the second dubbing object.
In this step, the acoustic voiceprint features extracted in step S302 can be queried against each voiceprint feature in the voiceprint library (a voiceprint query), yielding the voiceprint identifier of the successfully matched voiceprint feature.
Step S304, determining the timbre features of the dubbing object corresponding to that voiceprint identifier as the timbre features of the second dubbing object.
In this way, the timbre features of a dubbing object close to the acoustic dubbing object in the acoustic audio file can be acquired automatically, so that after replacement the timbre is closer to that of the dubbing actor in the acoustic audio file and better fits the plot.
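The library lookup in steps S303 and S304 amounts to a nearest-neighbour search over speaker embeddings. A minimal sketch with cosine similarity follows; the embedding layout and the 0.75 acceptance threshold are assumptions:

```python
import numpy as np

def lookup_voiceprint(acoustic_print, voiceprint_library, threshold=0.75):
    """voiceprint_library: {voiceprint_id: embedding (np.ndarray)}.
    Returns the identifier of the most similar stored voiceprint,
    or None if no similarity exceeds the threshold."""
    best_id, best_score = None, threshold
    for vp_id, emb in voiceprint_library.items():
        score = float(np.dot(acoustic_print, emb) /
                      (np.linalg.norm(acoustic_print) * np.linalg.norm(emb)))
        if score > best_score:
            best_id, best_score = vp_id, score
    return best_id

# e.g. timbre_features = timbre_library[lookup_voiceprint(print_vec, vp_library)]
```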
In another embodiment of the present application, after performing audio reconstruction based on the audio spectrum in step S104 to obtain the second audio file corresponding to the target video, as shown in Fig. 4, the method further includes:
Step S401, acquiring the acoustic audio file of the target video;
Step S402, performing volume detection on the acoustic audio file to obtain first volume values corresponding to a plurality of first timestamps.
The acoustic audio file in this embodiment may cover the same time period as the first audio file. For example, if the time period of the first audio file within the whole dubbing audio file is 00:05:20 to 00:05:30, then the audio clip from 00:05:20 to 00:05:30 is selected as the acoustic audio file.
In this step, a sliding-window mechanism may be used for volume detection on the acoustic audio file; the length of the window's time slice (TimeSlide) may be chosen as needed, for example 0.5 seconds, yielding a set of volume values with first timestamps, namely the first volume values corresponding to the first timestamps.
Step S403, searching the second audio file for the second volume value corresponding to each first timestamp.
In this step, the corresponding second volume values may be looked up in the second audio file, one first timestamp at a time, yielding the second volume values corresponding to the plurality of first timestamps.
Step S404, if the difference between a first volume value and the corresponding second volume value is greater than a preset threshold, adjusting the second volume value to the first volume value to obtain an adjusted second audio file.
The difference between the first volume value and the second volume value can be computed and compared with a preset threshold. If it exceeds the threshold, the gap between the two values is too large and the second volume value needs to be adjusted to bring it closer to, or equal to, the first volume value.
This ensures that the volume value at each time point in the second audio file is the same as, or close to, the volume value at the corresponding time point in the acoustic audio file, keeping the volume stable and preventing the second audio file from fluctuating between loud and quiet.
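A sketch of this sliding-window loudness alignment is given below, using the 0.5 s TimeSlide from the text; the 3 dB threshold and the RMS loudness measure are assumptions, and float waveforms are assumed:

```python
import numpy as np

def match_volume(acoustic, second, sr, window_s=0.5, threshold_db=3.0):
    """Compare per-window RMS levels of the acoustic and second audio files
    (float numpy arrays); rescale any window of the second file that deviates
    by more than threshold_db so that it matches the acoustic reference."""
    out = second.copy()
    win = int(window_s * sr)  # the 0.5 s TimeSlide from the text
    for start in range(0, min(len(acoustic), len(second)) - win + 1, win):
        ref_rms = np.sqrt(np.mean(acoustic[start:start + win] ** 2)) + 1e-12
        seg = out[start:start + win]
        seg_rms = np.sqrt(np.mean(seg ** 2)) + 1e-12
        if abs(20 * np.log10(seg_rms / ref_rms)) > threshold_db:
            out[start:start + win] = seg * (ref_rms / seg_rms)
    return out
```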
In another embodiment of the present application, after performing audio reconstruction based on the audio spectrum in step S104 to obtain the second audio file corresponding to the target video, as shown in Fig. 5, the method further includes:
Step S501, performing sound-effect detection on the acoustic audio file to obtain the sound-effect types corresponding to a plurality of second timestamps.
In this embodiment of the present application, a sound effect is an effect applied to sound: a special treatment added to the soundtrack to enhance the realism, atmosphere, or dramatic import of a scene, such as the effect of a voice heard through a telephone or a voice in a cave.
In practice, an end-to-end model can be used to sample the acoustic audio file for sound-effect detection, yielding timestamped sound-effect types. The end-to-end model in this embodiment can support nine sound-effect types, including: low-pass, high-pass, band-pass (with gain), all-pass, peaking, low-shelf, high-shelf, notch, and the like.
Step S502, adding sound effects to the second audio file according to the sound-effect types corresponding to the second timestamps, to obtain the adjusted second audio file.
Because different sound-effect types were added to the acoustic audio file to fit the needs of the plot, and the second audio file should reproduce the acoustic audio file as faithfully as possible with the same sound-effect types, the same sound effect can be added to the second audio file at each second timestamp according to the sound-effect type detected there, so as to follow the acoustic audio file more closely.
This ensures that the sound-effect type at each time point in the second audio file is the same as at the corresponding time point in the acoustic audio file, avoiding the comprehension problems caused by missing sound effects and improving viewers' immersion in the video.
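As an illustration of step S502, the sketch below applies one detected effect type to a span of the second audio file using Butterworth filters from SciPy. The cutoff frequencies, the filter order, and the subset of effect types covered are assumptions; the text's full list corresponds to the standard family of audio-EQ filters:

```python
import numpy as np
from scipy.signal import butter, lfilter

def apply_effect(second_audio, sr, start_s, end_s, effect):
    """Apply one detected effect type to the second audio file over the
    span where it was detected in the acoustic audio file."""
    filters = {
        "lowpass":  butter(4, 3000, btype="lowpass", fs=sr),   # e.g. telephone voice
        "highpass": butter(4, 300, btype="highpass", fs=sr),
        "bandpass": butter(4, [300, 3400], btype="bandpass", fs=sr),
    }
    if effect not in filters:
        return second_audio  # remaining effect types omitted in this sketch
    b, a = filters[effect]
    out = second_audio.copy()
    s, e = int(start_s * sr), int(end_s * sr)
    out[s:e] = lfilter(b, a, out[s:e])  # filter only the detected span
    return out
```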
For convenience of understanding, the present application further provides an embodiment of an audio processing method in a practical application scenario, as follows:
as shown in fig. 6, after the target video with the acoustic audio file in chinese is translated into non-chinese, the amateur non-chinese dubber a dubs the entire video to obtain a dubbed audio file.
The artificial dubbing material often generates errors such as track leakage and the like, so that the track leakage detection and error correction can be carried out on the dubbing audio file to obtain the dubbing audio file after error correction.
The subtitle file translated from chinese to non-chinese has no role information, which is a Voice Conversion model (VC) for selecting a role and information necessary for a second dubbing object, so that the subtitle role needs to be split to add role information to the subtitle file, specifically: extracting human face occurrence time stamps in a target video, extracting voiceprint occurrence segment time stamps in an acoustic audio file, extracting non-Chinese speech segment occurrence time stamps in a first speech text, combining the voiceprint occurrence segment time stamps, combining the human face occurrence time stamps with the non-Chinese speech segment occurrence time stamps, obtaining corresponding speech content of each human face in different time periods, matching corresponding speech content of each human face in different time periods with a second speech text, obtaining audio time periods corresponding to each role, performing intelligent track division based on the audio time periods corresponding to each role, obtaining audio files corresponding to each role in different time periods, namely: character track 1, character track 2 … … character track N, each character track corresponding to any one of the first audio files in the previous embodiments.
Each character track is respectively input into a voice conversion model, the voice conversion model is used for reserving emotion, rhythm, content and the like of an amateur non-Chinese dubber A, only the tone color characteristics of the voice conversion model are replaced by the tone color characteristics of a non-Chinese dubber B, a non-Chinese dubber C or a non-Chinese dubber D, specifically, as shown in fig. 7, any character track (namely, a first audio file) of the amateur non-Chinese dubber A is input into an encoder, the encoder extracts the content A and the rhythm A from the character track, and the tone color A in the character track can not be extracted in practical application. According to actual needs, voiceprint corner selection can be performed, that is, any dubbing timbre suitable for the dubbing scene of the movie and television drama is selected from the non-Chinese timbres B, C, D and the like, a suitable timbre ID (namely, the voiceprint identifier in the foregoing embodiment) is selected in the actual operation process, a timbre feature corresponding to the timbre ID is obtained, the content A, the rhythm A and the obtained timbre feature are input into a decoder, the decoder combines the three to obtain an audio frequency spectrum, and audio reconstruction is performed on the audio frequency spectrum to obtain a second audio file dubbed by the timbre B, C or D of the non-Chinese dubber.
The speech reconstructed from the audio spectrum generated based on the voice conversion model may have a certain mechanical sound, a certain current sound, and a certain degree of noise, so that the voice quality of the sound in the second audio file may be restored by using the DSP technique to remove the noise and restore the voice quality.
Because the reconstructed voice volume at different moments is possibly different, the volume of the original voice audio file and the corresponding timestamp can be detected, the volume of the voice in the second audio file is repaired accordingly, the sound effect of the original voice audio file and the corresponding timestamp can be detected, and the corresponding sound effect is added into the second audio file accordingly.
For convenience of understanding, the present application further provides an embodiment of an audio processing method in a practical application scenario, as follows:
as shown in fig. 8, after the target video whose original audio file is non-chinese is translated into chinese, an amateur chinese dubber a dubs the entire video to obtain a dubbed audio file.
The artificial dubbing material often generates errors such as track leakage and the like, so that the track leakage detection and error correction can be carried out on the dubbing audio file to obtain the dubbing audio file after error correction.
The subtitle file translated from non-chinese to chinese has no role information, and the role information is a Voice Conversion model (VC) for selecting a role and information necessary for a second dubbing object, so that the subtitle role needs to be split to add role information to the subtitle file, specifically: extracting human face occurrence time stamp in a target video, extracting voiceprint occurrence segment time stamp in an acoustic audio file, extracting Chinese speech segment occurrence time stamp in a first speech text, merging the voiceprint occurrence segment time stamp, human face occurrence time stamp and Chinese speech segment occurrence time stamp, obtaining corresponding speech content of each human face in different time periods, matching corresponding speech content of each human face in different time periods with a second speech text, obtaining audio time periods corresponding to each role, performing intelligent track division based on the audio time periods corresponding to each role, obtaining audio files corresponding to each role in different time periods, namely: character track 1, character track 2 … … character track N, each character track corresponding to any one of the first audio files in the previous embodiments.
Each character track is respectively input into a voice conversion model, the voice conversion model is used for reserving emotion, rhythm, content and the like of an amateur Chinese speaker A, and only the tone color characteristics of the amateur Chinese speaker A are replaced by the tone color characteristics of a Chinese speaker B, a Chinese speaker C or a Chinese speaker D, specifically, as shown in fig. 9, any character track (namely, a first audio file) of the amateur Chinese speaker A is input into an encoder, the encoder extracts the content A and the rhythm A from the character track, and the tone color A in the character track can not be extracted in practical application. According to actual needs, voiceprint corner selection can be performed, that is, any dubbing timbre suitable for the dubbing scene of the movie and television drama is selected from the Chinese timbres B, C, D and the like, a suitable timbre ID (namely, the voiceprint identifier in the foregoing embodiment) is selected in the actual operation process, a timbre feature corresponding to the timbre ID is obtained, the content A, the rhythm A and the obtained timbre feature are input into a decoder, the decoder combines the content A, the rhythm A and the obtained timbre feature to obtain an audio frequency spectrum, and audio reconstruction is performed on the audio frequency spectrum to obtain a second audio file dubbed by the timbre B, C or D of the Chinese dubber.
Speech reconstructed from the audio spectrum generated by the voice conversion model may carry some mechanical sound, electrical noise, and a degree of background noise, so DSP techniques can be used to remove the noise and restore the sound quality of the speech in the second audio file.
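The application does not name the specific DSP technique; as one hedged example, a basic spectral-subtraction pass, a common way to suppress stationary noise such as electrical hum, might look like this (the noise-estimation heuristic and spectral floor are assumptions):

```python
import numpy as np

def spectral_subtract(audio: np.ndarray, sr: int, n_fft: int = 1024):
    """Suppress stationary noise by subtracting an estimated noise spectrum.

    Overlap-add resynthesis; constant overlap-add gain is not normalized
    in this sketch.
    """
    hop = n_fft // 2
    window = np.hanning(n_fft)
    frames = [audio[i:i + n_fft] * window
              for i in range(0, len(audio) - n_fft, hop)]
    spectra = np.array([np.fft.rfft(f) for f in frames])
    mags, phases = np.abs(spectra), np.angle(spectra)
    # Assume the quietest 10% of frames are noise-only and average them.
    frame_energy = mags.sum(axis=1)
    quiet = mags[frame_energy.argsort()[: max(1, len(frames) // 10)]]
    noise_mag = quiet.mean(axis=0)
    clean_mag = np.maximum(mags - noise_mag, 0.05 * mags)  # spectral floor
    clean = np.zeros(len(audio))
    for i, spec in enumerate(clean_mag * np.exp(1j * phases)):
        start = i * hop
        clean[start:start + n_fft] += np.fft.irfft(spec, n=n_fft) * window
    return clean
```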
Here too, because the volume of the reconstructed speech may differ from moment to moment, the volume of the acoustic audio file and the corresponding timestamps can be detected and used to repair the speech volume in the second audio file; the sound effects of the acoustic audio file and their timestamps can likewise be detected and the corresponding sound effects added to the second audio file.
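As an illustration of this volume repair, the following sketch measures RMS loudness in a fixed window at each detected timestamp and corrects the gain when the difference exceeds a decibel threshold; the window size and threshold are illustrative assumptions, and both tracks are assumed to be float arrays in [-1, 1]:

```python
import numpy as np

def rms_db(x: np.ndarray) -> float:
    """RMS level of a sample block in decibels."""
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def match_volume(acoustic: np.ndarray, converted: np.ndarray, sr: int,
                 timestamps: list[float], win_sec: float = 0.5,
                 threshold_db: float = 3.0) -> np.ndarray:
    """Align the converted track's loudness to the acoustic track per timestamp."""
    out = converted.copy()
    win = int(win_sec * sr)
    for t in timestamps:
        i = int(t * sr)
        ref, seg = acoustic[i:i + win], out[i:i + win]
        if len(ref) == 0 or len(seg) == 0:
            continue
        diff = rms_db(ref) - rms_db(seg)
        if abs(diff) > threshold_db:
            out[i:i + win] = seg * (10 ** (diff / 20))  # apply gain correction
    return out
```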
In still another embodiment of the present application, there is also provided an audio processing apparatus, as shown in fig. 10, including:
a first obtaining module 11, configured to obtain a first audio file corresponding to a target video, and to extract content features and prosody features of the dubbing content of a first dubbing object in the first audio file, where the first language in the first audio file differs from the second language of the acoustic audio file of the target video;
a second obtaining module 12, configured to obtain a timbre feature corresponding to a second dubbing object, where the first dubbing object and the second dubbing object have different timbres;
a merging module 13, configured to merge the content feature, the prosody feature, and the timbre feature to obtain an audio frequency spectrum;
and a reconstruction module 14, configured to perform audio reconstruction based on the audio frequency spectrum to obtain a second audio file corresponding to the target video.
Optionally, the first obtaining module includes:
a first obtaining unit, configured to obtain an acoustic audio file, a dubbing audio file, a first speech text, and a second speech text corresponding to a target video, where the first speech text is obtained by performing speech recognition on the dubbing audio file and does not include role information, the dubbing audio file is obtained by dubbing in a first language different from a second language of the acoustic audio file of the target video, and the second speech text corresponds to the acoustic audio file and includes role information;
a first determining unit, configured to determine, according to the target video, the first speech text, and the acoustic audio file, a speaking time period during which a face belonging to the same role speaks and the speech content corresponding to that speaking time period;
the track dividing unit is used for carrying out audio track division on the dubbing audio file according to the speaking time period, the speech content corresponding to the speaking time period and the second speech text to obtain the speaking time period of each role and the audio file corresponding to the speaking time period;
and the second determining unit is used for determining the audio file corresponding to the speaking time period of any character as the first audio file corresponding to the target video.
Optionally, the first determining unit includes:
the first extraction subunit is used for extracting a face occurrence timestamp from the target video;
the second extraction subunit is used for extracting voiceprint fragment occurrence timestamps from the acoustic audio file;
a third extraction subunit, configured to extract a first-language speech segment occurrence timestamp from the first speech text;
the first matching subunit is used for matching the voiceprint fragment occurrence timestamp and the face occurrence timestamp to obtain a speaking time period of face speaking belonging to the same role;
and the second matching subunit is used for matching the time period of the face speaking belonging to the same role with the occurrence time stamp of the speech segment in the first language to obtain speech content corresponding to the speaking time period.
Optionally, the track splitting unit includes:
the third matching subunit is used for matching the speaking time period, the speech content corresponding to the speaking time period and the second speech text to obtain the time period for speaking of each role;
and the track dividing subunit is used for carrying out audio track division on the dubbing audio file according to the time period of speaking of each role to obtain an audio file corresponding to the time period.
Optionally, the dubbing features include content features, and the first obtaining module includes:
the first input unit is used for inputting the first audio file into a preset voice recognition encoder to obtain recognition content;
and the second input unit is used for inputting the identification content into a preset content encoder to obtain the content characteristics.
Optionally, the dubbing features include prosodic features, and the first obtaining module includes:
the third input unit is used for inputting the first audio file into a preset voice self-supervision learning pre-training model to obtain output data;
and the fourth input unit is used for inputting the output data into a preset prosody encoder to obtain the prosody characteristics.
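One hedged sketch of the prosody path handled by the third and fourth input units uses torchaudio's bundled wav2vec 2.0 model as a stand-in for the preset speech self-supervised learning pre-training model, which the application does not name; the file name, layer sizes, and choice of SSL model are all illustrative assumptions:

```python
import torch
import torch.nn as nn
import torchaudio

class ProsodyEncoder(nn.Module):
    """Compresses SSL output features into prosody features; sizes illustrative."""
    def __init__(self, in_dim=768, dim=256):
        super().__init__()
        self.rnn = nn.GRU(in_dim, dim, batch_first=True)

    def forward(self, ssl_features):
        out, _ = self.rnn(ssl_features)
        return out

bundle = torchaudio.pipelines.WAV2VEC2_BASE          # stand-in SSL model
ssl_model = bundle.get_model().eval()
# Hypothetical mono character-track file; (1, samples) acts as a batch of one.
waveform, sr = torchaudio.load("character_track_1.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
with torch.no_grad():
    features, _ = ssl_model.extract_features(waveform)  # list of layer outputs
prosody = ProsodyEncoder()(features[-1])              # (batch, frames, 256)
```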
Optionally, the second obtaining module includes:
the second acquisition unit is used for acquiring an acoustic audio file of the target video;
an extraction unit, configured to extract an acoustic voiceprint feature of an acoustic dubbing object in the acoustic audio file;
the first searching unit is used for searching a preset voiceprint library for the voiceprint identifier corresponding to the acoustic voiceprint features;
a third determining unit, configured to determine a timbre feature of the dubbing object corresponding to the voiceprint identifier as a second timbre feature of the second dubbing object.
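For illustration, the lookup performed by the first searching unit and third determining unit could compare speaker embeddings by cosine similarity; the embedding representation, library layout, and similarity threshold below are assumptions rather than the application's specified implementation:

```python
import numpy as np

def find_voiceprint_id(acoustic_embedding: np.ndarray,
                       voiceprint_library: dict[str, np.ndarray],
                       min_similarity: float = 0.7):
    """Return the library voiceprint ID closest to the acoustic speaker embedding."""
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    best_id, best_sim = None, min_similarity
    for vp_id, emb in voiceprint_library.items():
        sim = cosine(acoustic_embedding, emb)
        if sim > best_sim:
            best_id, best_sim = vp_id, sim
    return best_id  # timbre features are then fetched for this identifier
```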
Optionally, after the reconstruction module, the apparatus further includes:
the third acquisition module is used for acquiring an acoustic audio file of the target video;
the volume detection module is used for carrying out volume detection on the acoustic audio file to obtain first volume values corresponding to the first time stamps;
a first searching module, configured to search the second audio file for a second volume value corresponding to each of the first timestamps;
and the volume adjusting module is used for adjusting the second volume value to the first volume value if the difference value between the first volume value and the second volume value is larger than a preset threshold value, so as to obtain an adjusted second audio file.
Optionally, after the reconstruction module, the apparatus further includes:
the sound effect detection module is used for performing sound effect detection on the acoustic audio file to obtain sound effect types corresponding to the second timestamps;
and the sound effect adjusting module is used for adding sound effects into the second audio file according to the sound effect types corresponding to the second timestamps to obtain the adjusted second audio file.
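A sketch of the sound-effect pass performed by these two modules, assuming the detection step has already produced (timestamp, effect type) pairs and effects are mixed in from a preset bank; the data layout and mixing gain are illustrative assumptions:

```python
import numpy as np

def add_sound_effects(audio: np.ndarray, sr: int,
                      detections: list[tuple[float, str]],
                      effect_bank: dict[str, np.ndarray],
                      gain: float = 0.5) -> np.ndarray:
    """Mix preset effect clips into the second audio file at detected timestamps."""
    out = audio.copy()
    for t, effect_type in detections:
        clip = effect_bank.get(effect_type)
        if clip is None:
            continue                      # no preset clip for this effect type
        i = int(t * sr)
        n = min(len(clip), len(out) - i)
        if n > 0:
            out[i:i + n] += gain * clip[:n]
    return np.clip(out, -1.0, 1.0)        # prevent clipping after the mix
```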
In another embodiment of the present application, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the audio processing method of any one of the method embodiments when executing the program stored in the memory.
In the electronic device provided by this embodiment of the present invention, by executing the program stored in the memory, the processor achieves the effect of one first dubbing object dubbing with the timbres of a plurality of second dubbing objects: only the dubbing features of the first dubbing object in the first audio file are retained, and the timbre features of the first dubbing object are not used, so the second audio file reconstructed from the merged audio spectrum carries the timbre of the second dubbing object while preserving the dubbing features. The timbre of the first dubbing object is thus automatically converted into that of the second dubbing object while the content and emotion of the first dubbing object's delivery are preserved. Moreover, the timbre of the first dubbing object in all first audio files corresponding to the target video can be conveniently converted into the timbres of the corresponding second dubbing objects without engaging additional dubbing actors, and the rich emotion of the first dubbing object's speech is retained, meeting the dialogue requirements of film and television drama scenes.
The communication bus 1140 mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 1140 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 11, but this is not intended to represent only one bus or type of bus.
The communication interface 1120 is used for communication between the electronic device and other devices.
The memory 1130 may include a Random Access Memory (RAM), and may also include non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor 1110 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In a further embodiment of the present application, there is also provided a computer-readable storage medium having stored thereon a program of an audio processing method, which when executed by a processor, implements the steps of the audio processing method described in any of the method embodiments above.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is merely illustrative of particular embodiments of the invention that enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An audio processing method, comprising:
acquiring a first audio file corresponding to a target video, and extracting dubbing characteristics corresponding to dubbing contents of a first dubbing object in the first audio file, wherein a first language in the first audio file is different from a second language in an original sound audio file of the target video;
acquiring timbre features corresponding to a second dubbing object, wherein the first dubbing object and the second dubbing object have different timbres;
combining the dubbing features and the timbre features to obtain an audio frequency spectrum;
and carrying out audio reconstruction based on the audio frequency spectrum to obtain a second audio file corresponding to the target video.
2. The audio processing method of claim 1, wherein obtaining the first audio file corresponding to the target video comprises:
acquiring an acoustic audio file, a dubbing audio file, a first speech text and a second speech text corresponding to a target video, wherein the first speech text is obtained by performing speech recognition on the dubbing audio file and does not contain role information, the dubbing audio file is obtained by dubbing in a first language different from a second language of the acoustic audio file of the target video, and the second speech text corresponds to the acoustic audio file and contains the role information;
determining a speaking time period of a face speaking belonging to the same role and speech content corresponding to the speaking time period according to the target video, the first speech text and the acoustic audio file;
audio track separation is carried out on the dubbing audio file according to the speaking time period, the speech content corresponding to the speaking time period and the second speech text, and a speaking time period of each role and an audio file corresponding to the speaking time period are obtained;
and determining an audio file corresponding to the speaking time period of any role as a first audio file corresponding to the target video.
3. The audio processing method according to claim 2, wherein determining a speech time period during which a human face belonging to the same character speaks and speech content corresponding to the speech time period from the target video, the first speech text, and the acoustic audio file comprises:
extracting a face occurrence timestamp from the target video;
extracting a voiceprint fragment occurrence timestamp from the acoustic audio file;
extracting a first language speech segment occurrence timestamp from the first speech text;
matching the voiceprint fragment occurrence time stamp with the face occurrence time stamp to obtain a speaking time period of face speaking belonging to the same role;
matching the time period of speaking of the face belonging to the same role with the occurrence time stamp of the first language speech segment to obtain speech content corresponding to the speaking time period.
4. The audio processing method of claim 2, wherein performing audio track splitting on the dubbing audio file according to the speaking time period, the speech content corresponding to the speaking time period, and the second speech text to obtain a speaking time period of each character and an audio file corresponding to the speaking time period comprises:
matching the speaking time period, the speech content corresponding to the speaking time period and the second speech text to obtain a time period for each role to speak;
and performing audio track division on the dubbing audio file according to the time period of speaking of each role to obtain an audio file corresponding to the time period.
5. The audio processing method according to claim 1, wherein obtaining the timbre features corresponding to the second dubbing object comprises:
acquiring an acoustic audio file of the target video;
extracting acoustic voiceprint features of an acoustic dubbing object from the acoustic audio file;
searching a preset voiceprint library for a voiceprint identifier corresponding to the acoustic voiceprint features;
and determining the timbre features of the dubbing object corresponding to the voiceprint identifier as the timbre features of the second dubbing object.
6. The audio processing method according to claim 1, wherein after audio reconstruction based on the audio spectrum to obtain a second audio file corresponding to the target video, the method further comprises:
acquiring an acoustic audio file of the target video;
carrying out volume detection on the acoustic audio file to obtain first volume values corresponding to a plurality of first time stamps;
searching a second volume value corresponding to each first time stamp in the second audio file;
if the difference value between the first volume value and the second volume value is larger than a preset threshold value, the second volume value is adjusted to be the first volume value, and an adjusted second audio file is obtained.
7. The audio processing method according to claim 1, wherein after audio reconstruction based on the audio spectrum to obtain a second audio file corresponding to the target video, the method further comprises:
performing sound effect detection on the acoustic audio file to obtain sound effect types corresponding to a plurality of second timestamps;
and adding sound effects in the second audio file according to the sound effect types corresponding to the second timestamps to obtain the adjusted second audio file.
8. An audio processing apparatus, comprising:
the first obtaining module is used for obtaining a first audio file corresponding to a target video and extracting content characteristics and prosody characteristics of dubbing content of a first dubbing object in the first audio file, wherein a first language in the first audio file is different from a second language in an original sound audio file of the target video;
a second obtaining module, configured to obtain a timbre feature corresponding to a second dubbing object, where the first dubbing object and the second dubbing object have different timbres;
the merging module is used for merging the content features, the prosody features and the timbre features to obtain an audio frequency spectrum;
and the reconstruction module is used for carrying out audio reconstruction based on the audio frequency spectrum to obtain a second audio file corresponding to the target video.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the audio processing method according to any one of claims 1 to 7 when executing the program stored in the memory.
10. A computer-readable storage medium, characterized in that a program of an audio processing method is stored on the computer-readable storage medium, which when executed by a processor implements the steps of the audio processing method of any of claims 1-7.
CN202210457487.0A 2022-04-27 2022-04-27 Audio processing method and device, electronic equipment and storage medium Pending CN114842858A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210457487.0A CN114842858A (en) 2022-04-27 2022-04-27 Audio processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210457487.0A CN114842858A (en) 2022-04-27 2022-04-27 Audio processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114842858A true CN114842858A (en) 2022-08-02

Family

ID=82567323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210457487.0A Pending CN114842858A (en) 2022-04-27 2022-04-27 Audio processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114842858A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115312029A (en) * 2022-10-12 2022-11-08 之江实验室 Voice translation method and system based on voice depth characterization mapping


Similar Documents

Publication Publication Date Title
US11887578B2 (en) Automatic dubbing method and apparatus
US11699456B2 (en) Automated transcript generation from multi-channel audio
US11942093B2 (en) System and method for simultaneous multilingual dubbing of video-audio programs
JP3621686B2 (en) Data editing method, data editing device, data editing program
US20160021334A1 (en) Method, Apparatus and System For Regenerating Voice Intonation In Automatically Dubbed Videos
US20120245936A1 (en) Device to Capture and Temporally Synchronize Aspects of a Conversation and Method and System Thereof
KR102044689B1 (en) System and method for creating broadcast subtitle
CN110675886A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
JP2012181358A (en) Text display time determination device, text display system, method, and program
EP3839953A1 (en) Automatic caption synchronization and positioning
CN114842858A (en) Audio processing method and device, electronic equipment and storage medium
CN109376145B (en) Method and device for establishing movie and television dialogue database and storage medium
JP2009237285A (en) Personal name assignment apparatus and method
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
US20120154514A1 (en) Conference support apparatus and conference support method
KR101618777B1 (en) A server and method for extracting text after uploading a file to synchronize between video and audio
KR102160117B1 (en) a real-time broadcast content generating system for disabled
JP2006178334A (en) Language learning system
CN114155841A (en) Voice recognition method, device, equipment and storage medium
KR101920653B1 (en) Method and program for edcating language by making comparison sound
KR101783872B1 (en) Video Search System and Method thereof
CN115171645A (en) Dubbing method and device, electronic equipment and storage medium
Soens et al. On split dynamic time warping for robust automatic dialogue replacement
KR102076565B1 (en) Speech processing apparatus which enables identification of a speaking person through insertion of speaker identification noise and operating method thereof
CN109949828B (en) Character checking method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination