WO2019086044A1 - Audio file processing method, electronic device and storage medium - Google Patents

Audio file processing method, electronic device and storage medium

Info

Publication number
WO2019086044A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
time frame
frame information
replaced
audio segment
Prior art date
Application number
PCT/CN2018/114179
Other languages
English (en)
French (fr)
Inventor
赖春江
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2019086044A1
Priority to US16/844,283 (US11538456B2)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof
    • H04N 21/8106 Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 21/013 Adapting to target pitch
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 21/01 Correction of time axis
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B 27/036 Insert-editing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/233 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 Assembly of content; Generation of multimedia applications
    • H04N 21/854 Content authoring
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 21/013 Adapting to target pitch
    • G10L 2021/0135 Voice conversion or morphing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals

Definitions

  • the present application relates to the field of voice processing technologies, and in particular, to an audio file processing method, an electronic device, and a storage medium.
  • Dubbing is a time-consuming and laborious process in film and television production or recording.
  • the current dubbing of the leading character is mainly realized by manual dubbing, and the artificially recorded sound is synthesized into the film through post-processing of the movie.
  • However, the automation of the whole process is low; it is time-consuming and laborious, which leads to relatively high labor and time costs and reduces the resource utilization of audio processing equipment.
  • The embodiments of the present application provide an audio file processing method, an electronic device, and a storage medium, offering an automatic dubbing solution that can improve the time efficiency of audio replacement, consume fewer memory resources, and improve the resource utilization of the audio file processing apparatus.
  • the embodiment of the present application provides an audio file processing method, which is executed by an electronic device, and the method includes:
  • the embodiment of the present application further provides an electronic device, including a processor and a memory, where the memory stores instructions executable by the processor, when the instruction is executed, the processor is configured to:
  • the embodiment of the present application further provides a computer readable storage medium storing computer readable instructions, which can cause at least one processor to perform the method as described above.
  • FIG. 1 is a schematic diagram of an implementation environment involved in an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a method for processing an audio file according to an embodiment of the present application
  • 3a is a schematic diagram of an interface for a client to initiate a voice replacement request according to an embodiment of the present application
  • FIG. 3b is a schematic diagram of an interface for a client to initiate a voice replacement request according to another embodiment of the present application
  • FIG. 4 is a schematic flowchart of a method for processing an audio file according to another embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of candidate time frame information in an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of an audio file processing method according to still another embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an audio file processing apparatus according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an audio file processing apparatus according to another embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 1 is a schematic diagram of an implementation environment involved in an embodiment of the present application.
  • an audio file processing apparatus 110 includes a processor and a memory, and the method embodiments of the present application are executed by the processor executing instructions stored in the memory.
  • the audio file processing apparatus 110 includes a source file database 111, an audio sample database 112, a sound effect management database 113, and an audio replacement processing unit 114.
  • a client 130-1 is installed on the terminal device 130. The user 140 can view the video or listen to the audio after logging in to the client 130-1.
  • the source file database 111 stores a source video file or a source audio file to be replaced.
  • The audio sample database 112 stores various types of voice samples for replacement, such as pre-collected standard male voices, standard female voices, and celebrity voices; the sound effect management database 113 sets and stores various audio styles, emotion types, and the corresponding processing modules.
  • After logging in to the client 130-1, the user 140 may, while watching a video or listening to audio, wish to replace the voice of a target character (such as a lead actor or a speaker).
  • In that case, the user inputs an operation on the terminal 130 to initiate a voice replacement request for the target character.
  • The client 130-1 transmits the voice replacement request to the audio file processing apparatus 110. Based on the voice replacement request, the audio replacement processing unit 114 acquires the source video file or source audio file from the source file database 111, acquires audio samples from the audio sample database 112, generates the audio data to be dubbed, performs the audio replacement processing, and outputs the replaced video/audio file, that is, the re-dubbed file, which it returns to the client 130-1.
  • Before the audio replacement is performed, the audio data may also be filtered by the corresponding sound effect processing module in the sound effect management database 113.
  • the audio file processing apparatus 110 may be a server, or a server cluster composed of several servers, or a cloud computing service center.
  • the network 120 can connect the audio file processing device 110 and the terminal device 130 in a wireless or wired form.
  • the terminal device 130 can be a smart terminal, including a smart TV, a smart phone, a tablet computer, a laptop portable computer, and the like.
  • FIG. 2 is a schematic flowchart diagram of an audio file processing method according to an embodiment of the present application. This method can be applied to an electronic device such as an audio file processing device or a server. The method includes the following steps.
  • Step 201 Extract at least one audio segment from the first audio file.
  • the first audio file is a source file before audio replacement. According to different scenarios, there are two ways to obtain the first audio file:
  • Manner 1: receive a voice replacement request for the target character in a source video file from the client, and separate the first audio file from the source video file according to the voice replacement request.
  • the corresponding application scenario is that the user views video through the client, such as movies, TV dramas, entertainment programs, and the like.
  • The voice replacement request carries the identifier of the source video file and the identifier of the target character.
  • FIG. 3a is a schematic diagram of an interface for a client to initiate a voice replacement request according to an embodiment of the present application.
  • As shown in FIG. 3a, the playback screen of "TV drama: 琅琊榜, Episode 3" is displayed on the interface 310; the user clicks the play button 311 to watch the episode, and 312 is a progress control button.
  • When the user is dissatisfied with the voice of a certain lead actor and wishes to replace it, the user right-clicks on the interface 310 to pop up a window 313 and selects in that window the lead actor whose voice is to be replaced, for example, the lead actor "Hu**".
  • The client then sends a voice replacement request to the audio file processing apparatus, the request including the TV drama identifier of Episode 3 of "琅琊榜" and the identifier of the target character "Hu**".
  • the audio file processing apparatus After receiving the voice replacement request, the audio file processing apparatus obtains the source video file according to the identifier of the television drama in the request, and then separates the first audio file from the source video file. For example, the entire video file is read, and the pure audio file therein is extracted as the first audio file by transcoding.
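The application does not name a tool for this transcoding step. As a hedged illustration only, the audio track could be separated from a source video file with the ffmpeg command-line tool; the file names, codec, and sample rate below are assumptions, not values from the application.

```python
# Sketch: separate the pure audio track ("first audio file") from a source video
# file by transcoding with ffmpeg. Names and codec choices are illustrative only.
import subprocess

def separate_audio(source_video: str, output_wav: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", source_video,      # read the whole source video file
            "-vn",                   # drop the video stream
            "-acodec", "pcm_s16le",  # decode audio to plain PCM
            "-ar", "16000",          # resample to 16 kHz
            "-ac", "1",              # mono
            output_wav,
        ],
        check=True,
    )

# separate_audio("episode_03.mp4", "episode_03_audio.wav")
```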
  • Manner 2: receive a voice replacement request for the target character in a source audio file from the client, and determine the source audio file as the first audio file. The corresponding application scenario is that the user listens to audio through the client, such as audiobooks, lectures, and online courses.
  • In this case, the voice replacement request carries the identifier of the source audio file and the identifier of the target character.
  • In a specific application, since the target character in this scenario is the speaker and the speaker's voice is usually the only human voice in the audio file, the voice replacement request may carry only the identifier of the source audio file.
  • FIG. 3b is a schematic diagram of an interface for a client to initiate a voice replacement request according to another embodiment of the present application.
  • As shown in FIG. 3b, the playback interface of "Audiobook: Romance of the Three Kingdoms, Chapter 8" is displayed on the interface 320; the user clicks the play button 321 to listen to the audio, and 322 is a progress control button.
  • When the user wishes to replace the speaker's voice, the user right-clicks on the interface 320 to pop up a window 323 and selects the option "Replace the speaker's voice".
  • The client then transmits to the audio file processing apparatus a voice replacement request carrying the identifier "Audiobook: Romance of the Three Kingdoms, Chapter 8".
  • the audio file processing apparatus After receiving the voice replacement request, the audio file processing apparatus obtains the source audio file as the first audio file according to the identifier of the audio file in the request.
  • When extracting audio segments, the data in the first audio file can be voice-detected, and a piece of continuous data in which speech is detected is taken as one audio segment. For example, the appearance and disappearance of speech can be judged by detecting fluctuations of the acoustic energy in the audio data: the time point at which speech is detected to appear is taken as the start time, the time point at which speech is detected to disappear is taken as the end time, and the continuous audio data between the start time and the end time is taken as one audio segment. A sketch of this energy-based detection is given below.
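The following is a minimal sketch of the energy-fluctuation detection described above, not the application's implementation; the frame length, hop size, and energy threshold are assumed values.

```python
import numpy as np

def extract_segments(samples: np.ndarray, sr: int,
                     frame_ms: int = 25, hop_ms: int = 10,
                     threshold: float = 1e-3):
    """Return (start_sec, end_sec) pairs where continuous speech is detected."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    segments, start = [], None
    for i in range(0, len(samples) - frame, hop):
        energy = float(np.mean(samples[i:i + frame] ** 2))
        t = i / sr
        if energy > threshold and start is None:
            start = t                      # speech appears: start time
        elif energy <= threshold and start is not None:
            segments.append((start, t))    # speech disappears: end time
            start = None
    if start is not None:
        segments.append((start, len(samples) / sr))
    return segments
```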
  • Step 202 Identify at least one audio segment to be replaced that represents the target character from the at least one audio segment, and determine time frame information of each audio segment to be replaced in the first audio file.
  • Identifying, from the at least one audio segment, at least one audio segment to be replaced that represents the target character specifically includes the following two steps.
  • Step 2021: extract the audio features of each audio segment.
  • In a specific application, the audio features can be extracted by machine learning; specifically, each audio segment is input into a convolutional neural network for training to obtain the audio features of that audio segment.
  • the audio features include any one or more of the timbre, frequency, gender, mood, and sound crest distance.
  • A convolutional neural network typically includes multiple processing layers, such as a convolutional layer, a pooling layer, a fully connected layer, and an output layer.
  • The convolutional layer is provided with a convolution matrix serving as a filter, which can extract the audio features, also called the audio fingerprint, of an audio segment.
  • In practical applications, multiple convolutional layers can be designed for deep learning to extract multi-dimensional composite audio features.
  • Alternatively, a long short-term memory (LSTM) model based on deep learning may be used for training; through memory and association, it is suitable for extracting audio features from long audio files. A minimal sketch of such a feature extractor is given below.
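The application names convolutional and LSTM networks but gives no architecture. The PyTorch sketch below shows one possible toy 1-D CNN with the convolution/pooling/fully-connected/output structure mentioned above; the layer sizes and the 128-dimensional feature vector are assumptions.

```python
import torch
import torch.nn as nn

class AudioFingerprint(nn.Module):
    """Toy 1-D CNN that maps a fixed-length waveform to an audio feature vector."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=4), nn.ReLU(),   # convolutional layer
            nn.MaxPool1d(4),                                         # pooling layer
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),
        )
        self.fc = nn.Linear(32 * 8, feature_dim)                     # fully connected layer

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> (batch, 1, samples)
        x = self.conv(waveform.unsqueeze(1))
        return self.fc(x.flatten(1))                                 # output: the audio feature
```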
  • Step 2022 Identify at least one audio segment to be replaced that represents the target character from the at least one audio segment according to the audio feature.
  • In a specific application, a binary classification model is established based on the target character; each audio segment and its audio features are input into the binary classification model, training is performed based on a logistic regression algorithm, and the at least one audio segment to be replaced is determined according to the training result. A sketch of such a classifier is given below.
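A minimal sketch of the binary classifier, using scikit-learn's logistic regression as a stand-in for the Spark MLlib training mentioned later in this document; the feature matrix, labels, and decision threshold are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one feature vector per audio segment (e.g. the CNN features above).
# y: 1 if the segment is voiced by the target character, 0 otherwise;
#    training labels would come from a small annotated subset of segments.
def train_target_voice_classifier(X: np.ndarray, y: np.ndarray) -> LogisticRegression:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf

def segments_to_replace(clf, X_all, segments, threshold: float = 0.5):
    probs = clf.predict_proba(X_all)[:, 1]     # probability of "target character"
    return [seg for seg, p in zip(segments, probs) if p >= threshold]
```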
  • Step 203 Acquire audio data to be dubbed for each audio segment to be replaced, and replace the data in the audio segment to be replaced with audio data according to the time frame information to obtain a second audio file.
  • In this step, the time frame information includes a duration.
  • When acquiring the audio data to be dubbed, the line to be replaced that corresponds to the audio segment to be replaced may be determined from preset line text information according to the duration, and the audio data to be dubbed is generated according to the line to be replaced and preset audio sample data.
  • In this way, the data in the audio segment to be replaced and the audio data to be dubbed have the same duration; replacing the former with the latter yields the re-dubbed second audio file.
  • Thus, from the first audio file to the second audio file, the time frames are consistent, but the audio data they contain has been replaced. A sketch of this replacement is given below.
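A minimal sketch of the time-frame-based replacement, assuming the source audio and the dubbed audio share the same sample rate and that the dubbed clip matches the segment duration; function and variable names are illustrative.

```python
import numpy as np

def replace_segment(source: np.ndarray, dubbed: np.ndarray,
                    start_sec: float, duration_sec: float, sr: int) -> np.ndarray:
    """Overwrite one audio segment of `source` with `dubbed` audio of equal duration."""
    start = int(round(start_sec * sr))
    length = int(round(duration_sec * sr))
    out = source.copy()
    out[start:start + length] = dubbed[:length]   # time frame unchanged, data replaced
    return out
```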
  • In this embodiment, at least one audio segment is extracted from the first audio file, at least one audio segment to be replaced that represents the target character is identified from it, and the time frame information of each audio segment to be replaced in the first audio file is determined; for each audio segment to be replaced, audio data to be dubbed is acquired and the data in that segment is replaced with the audio data according to the time frame information, to obtain the second audio file.
  • This provides an automatic dubbing solution and achieves the purpose of automatically replacing the voice of the target character.
  • Compared with manual dubbing, if there are X target characters, Y replacement audio effects, and a time cost of T, the total cost is X*Y*T, whereas the embodiment of the present application involves no human factor: through parallel machine processing, the overall cost is only T.
  • This greatly saves the labor and time costs of dubbing, satisfies the user's demand for personalized voices, and improves the resource utilization of the dubbing device.
  • FIG. 4 is a schematic flowchart diagram of an audio file processing method according to another embodiment of the present application, and the method may be performed by an audio file processing apparatus or an electronic device such as a server. As shown in Figure 4, the following steps are included:
  • Step 401 Extract at least one audio segment from the first audio file based on the principle of short sentence division, and determine first candidate time frame information of each audio segment.
  • short sentences refer to sentences with simple structure and few words.
  • Long sentences refer to sentences with complex structures and many words.
  • a short sentence is included in a complete long sentence, and the upper and lower sentences are connected by a short pause. Then, when the audio segment is extracted from the first audio file, it can be extracted based on the principle of short sentence division, and each audio segment corresponds to one short sentence.
  • Each audio file corresponds to a time axis, and the first candidate time frame information of each audio segment is determined while the audio segment is extracted.
  • The so-called time frame information is used to represent the time segmentation information of an audio segment on the time axis, including the start time of the audio segment on the time axis, or the start time and end time, or the start time and duration, and the like.
  • the extracted plurality of audio segments and the corresponding first candidate time frame information are stored in an electronic device such as an audio file processing device or a server for use in subsequent replacement operations.
  • FIG. 5 is a schematic structural diagram of candidate time frame information in an embodiment of the present application.
  • As shown in FIG. 5, the first candidate time frame information 510 includes the start times of six audio segments, marked by black triangles on the time axis, and the corresponding line information is given in block 500.
  • The six audio segments each correspond to one short sentence of the lines, namely "策马迎风", "看人生起伏", "啸歌书景", "笑天地荒老", "以梦为马", and "驰骋流年".
  • The corresponding start times are 0 min 10 s, 0 min 12 s, 0 min 14.5 s, 0 min 17 s, 0 min 20.8 s, and 0 min 22 s.
  • Step 402: extract second candidate time frame information from the first audio file based on long sentences.
  • This step and step 401 above can be performed in parallel.
  • the principle of extraction is based on long sentence division.
  • the so-called long sentence refers to a sentence composed of several short sentences.
  • the method of extraction may be the same as the method described in step 201. Since the goal of this step is to acquire the second candidate time frame information, the audio segment corresponding to the long sentence may not be stored.
  • In the embodiment shown in FIG. 5, the second candidate time frame information 520 includes three (start time, duration) pairs corresponding to three long sentences, specifically (0 min 9.43 s, 3.50 s), (0 min 14.20 s, 4.95 s), and (0 min 20.35 s, 3.95 s).
  • Step 403: set line text information including third candidate time frame information in advance.
  • This step can also be performed in parallel with steps 401 and 402. During production of the audio file, the line text information is set in advance; it includes the lines and the corresponding third candidate time frame information.
  • the lines corresponding to the third candidate time frame information may be short sentences or long sentences.
  • the third candidate time frame information 530 includes three start times corresponding to three long sentences, which are 0 minutes 10 seconds, 0 minutes 15 seconds, and 0 minutes 20 seconds, respectively.
  • Step 404 Determine time frame information of each audio segment to be replaced according to one or more candidate time frame information.
  • the time frame information determines the time position of the replacement, thereby determining the accuracy of the replacement. According to the various possible candidate time frame information obtained in steps 401-403, the manner of determining is specifically divided into the following cases:
  • 1) Determining the time frame information according to the first candidate time frame information. The first candidate time frame information is obtained when the audio segments are extracted; in one embodiment, it can be used directly as the time frame information.
  • In another embodiment, considering that extracting the audio segments may introduce a certain deviation in time, the time deviation can be estimated in advance and then compensated for when determining the time frame information.
  • For example, if the first candidate time frame information includes the start times of N audio segments, where the start time of the i-th audio segment is t1_i (i = 1, ..., N), then the start time of the i-th audio segment in the time frame information is t0_i = t1_i + Δoffset, where Δoffset is the time deviation and its value may be positive or negative.
  • 2) Determining the time frame information according to the first and second candidate time frame information. In one embodiment, considering that the second candidate time frame information, as separate time data, may be more accurate at extraction time, it may be used directly as the time frame information.
  • In another embodiment, the first candidate time frame information is corrected according to the second candidate time frame information to determine the time frame information. Since the start time and the duration are related, the correction can be performed from these two angles respectively, in the following three manners.
  • Manner 1: correct the start times in the first candidate time frame information based on the start times in the second candidate time frame information.
  • First, two corresponding start times in the second and first candidate time frame information are determined, and the mean, maximum, or minimum of the two is taken as the start time in the time frame information. "Corresponding" here means that the two start times refer to the same position in the lines.
  • For example, suppose the second candidate time frame information includes the start times of M audio segments, where the start time of the j-th segment is t2_j (j = 1, ..., M). In the example of FIG. 5, the first start time in the first candidate time frame information is t1_1 = 10 s and the first start time in the second candidate time frame information is t2_1 = 9.43 s, so the start time in the time frame information is taken as their mean, i.e., t0_1 = (t1_1 + t2_1)/2 = 9.715 s.
  • Manner 2: correct the start times in the first candidate time frame information based on the durations in the second candidate time frame information.
  • First, a duration in the second candidate time frame information and the two corresponding start times in the first candidate time frame information are determined, and either start time is adjusted to ensure that the difference between the two start times is greater than the duration. For example, if the duration of the j-th segment in the second candidate time frame information is Δt2_j and the corresponding start times in the first candidate time frame information are t1_i and t1_{i+1}, it is judged whether the condition t1_{i+1} - t1_i > Δt2_j is satisfied; if not, t1_i is decreased or t1_{i+1} is increased until it is.
  • Manner 3: correct the durations in the first candidate time frame information based on the durations in the second candidate time frame information.
  • First, a duration in the second candidate time frame information and the corresponding one or more durations in the first candidate time frame information are determined, and the maximum of the durations corresponding to the same line length is used as the duration in the time frame information.
  • In a specific application, since the duration at dubbing time is variable, the maximum of the durations in the two kinds of candidate time frame information can be taken. For example, if the duration of the i-th segment in the first candidate time frame information is Δt1_i and Δt2_j corresponds to the two durations Δt1_i and Δt1_{i+1}, it is judged whether Δt1_i + Δt1_{i+1} < Δt2_j holds; if so, the value of Δt1_i or Δt1_{i+1} can be increased to satisfy the above condition.
  • Which of the above correction manners to use may be decided according to the specific information contained in the first and second candidate time frame information, that is, according to the relative numbers of start times and durations: if there are more start times, the start times are corrected; otherwise, the durations are corrected. Alternatively, considering that a duration can be defined by two adjacent start times and that the start-time values are more important, Manner 1 or Manner 2 is preferred.
  • 3) Determining the time frame information according to the first and third candidate time frame information.
  • In one embodiment, since the third candidate time frame information is derived from the line text information and represents standard time data, it may be used directly as the time frame information.
  • In another embodiment, when relying on the line text information alone is not reliable enough, the first candidate time frame information may be corrected according to the third candidate time frame information to determine the time frame information.
  • For the specific correction method, refer to the description in step 4042 above, substituting the third candidate time frame information for the second candidate time frame information; details are not repeated here.
  • 4) Determining the time frame information according to the first, second, and third candidate time frame information. In this case, the first candidate time frame information may be corrected according to both the second and the third candidate time frame information to determine the time frame information.
  • The correction may be performed with reference to the three manners in step 4042 above; what needs to be determined are the corresponding start-time and duration data in the three kinds of candidate time frame information.
  • "Corresponding" here means referring to the same position in the lines.
  • For example, with reference to Manner 1, the three corresponding start times in the three kinds of candidate time frame information are determined, and the mean, maximum, or minimum of the three is taken as the start time in the time frame information.
  • Alternatively, the two closest of the three start times may be selected, and their mean, maximum, or minimum taken as the start time in the time frame information.
  • With reference to Manner 3, the corresponding durations in the three kinds of candidate time frame information are determined, and the maximum of the durations corresponding to the same line length is used as the duration in the time frame information. A sketch combining these corrections is given below.
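The sketch below combines the correction manners in a simplified form, assuming the candidate start times and durations have already been matched to the same line positions; it is an illustration of the idea, not the application's algorithm.

```python
def merge_time_frames(starts1, starts2, starts3, durs1, durs2):
    """Combine candidate time frame information (indices assumed already matched
    to the same line positions across the three sources)."""
    # Manner 1: average the corresponding start times from the three sources.
    starts = [(a + b + c) / 3.0 for a, b, c in zip(starts1, starts2, starts3)]
    # Manner 3: keep the largest duration proposed for the same line.
    durations = [max(d1, d2) for d1, d2 in zip(durs1, durs2)]
    # Manner 2: make sure each start gap is at least the segment's duration.
    for i in range(len(starts) - 1):
        if starts[i + 1] - starts[i] < durations[i]:
            starts[i + 1] = starts[i] + durations[i]
    return list(zip(starts, durations))
```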
  • Step 405 Identify at least one audio segment to be replaced that represents the target character from the at least one audio segment.
  • For the identification method, reference may be made to the description of step 202 above; details are not repeated here.
  • Step 406 Acquire audio data to be dubbed for each audio segment to be replaced, and replace the data in the audio segment to be replaced with audio data according to the time frame information to obtain a second audio file.
  • For the replacement method, reference may be made to the description of step 203 above; details are not repeated here.
  • In the above embodiment, when determining the time frame information used for audio replacement, three kinds of candidate information are considered together, derived respectively from the short-sentence extraction of the audio segments (i.e., the first candidate time frame information), the long-sentence extraction over the entire audio file (i.e., the second candidate time frame information), and the line text information (i.e., the third candidate time frame information).
  • This additional redundant time information facilitates determining the precise position of each audio segment on the time axis, thereby ensuring the accuracy of the audio replacement.
  • FIG. 6 is a schematic flowchart diagram of an audio file processing method according to an embodiment of the present application.
  • The method can be executed by an electronic device such as an audio file processing apparatus or a server, and is a process of replacing the voice of a lead actor in a source video file. Specifically, it includes the following steps.
  • Step 601 Receive a voice replacement request for a target role in the source video file from the client, and separate the first audio file from the source video file according to the voice replacement request.
  • As described in Manner 1 of step 201, the voice replacement request carries the identifier of the source video file and the identifier of the target character.
  • the audio file processing apparatus After receiving the voice replacement request, the audio file processing apparatus obtains the source video file according to the identifier of the source video file in the request, and then separates the first audio file from the source video file.
  • Step 602 Extract at least one audio segment from the first audio file based on the principle of short sentence division, and determine first candidate time frame information of each audio segment.
  • Step 603 extracting second candidate time frame information from the first audio file based on the principle of long sentence division.
  • Step 604 the line text information including the third candidate time frame information is set in advance.
  • Step 605 Determine time frame information of each audio segment to be replaced according to one or more candidate time frame information.
  • Step 606: sample the audio segments to determine training data and test data for the binary classification model.
  • In practice, since the number of audio segments may be large, it is impossible to train on the videos of all episodes; considering the coverage requirement, sampling can be used to select the training data and test data for training the binary classification model.
  • For example, sampling is done at a 6:4 ratio, i.e., 60% of the data is used for training and 40% for testing.
  • The sampling process can use stratified sampling; for example, if the whole TV drama includes 40 episodes, a certain amount of training data is extracted from every episode so that the training data covers all episodes. A sketch of such a split is given below.
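A minimal sketch of the 6:4 stratified split described above, grouping segments by episode so that every episode contributes training data; the `episode_of` accessor and the fixed random seed are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(segments, episode_of, train_ratio: float = 0.6, seed: int = 0):
    """Split audio segments 60/40 into train/test, sampling within every episode."""
    rng = random.Random(seed)
    by_episode = defaultdict(list)
    for seg in segments:
        by_episode[episode_of(seg)].append(seg)
    train, test = [], []
    for eps_segments in by_episode.values():
        rng.shuffle(eps_segments)
        cut = int(len(eps_segments) * train_ratio)
        train.extend(eps_segments[:cut])   # 60% of each episode for training
        test.extend(eps_segments[cut:])    # 40% of each episode for testing
    return train, test
```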
  • Step 607 Identify, by using the two-category model, at least one audio segment to be replaced that represents the target character from the at least one audio segment, and perform score optimization.
  • As described in step 202, the recognition between the lead role and supporting roles in an audio file is a 0-1 relationship, i.e., the lead role is 1 and the supporting roles are 0, which fits a binary classification model; therefore, a binary classification model is established.
  • When recognizing the lead actor's voice, a machine learning algorithm based on logistic regression can be used to train the binary classification model.
  • In a specific application, logistic regression, as a mature machine learning algorithm, can be integrated in Spark MLlib, and Spark can be used for concurrent training.
  • In addition, the output of the binary classification model training is the probability that an audio segment is 0 or 1.
  • Since training results based on machine learning may be subject to misjudgment, the AUC (Area Under the ROC Curve) model can be used to score and optimize the whole training process: every audio segment is given a 0-1 score, where 0 marks an audio segment that does not hit the lead role and 1 marks an audio segment that does.
  • The solution in this embodiment considers the scoring of label 1 and requires the AUC to reach an accuracy of 0.9 or above, i.e., the accuracy of deciding 1 must reach 90% or more; when this requirement is not met, training is performed again.
  • The AUC model is a standard for measuring the quality of a classification model. It is based on ROC (Receiver Operating Characteristic) analysis.
  • The main analysis tool is an ROC curve drawn on a two-dimensional plane, and the AUC value is the area under the ROC curve. A sketch of computing this score is given below.
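A minimal sketch of the AUC check, using scikit-learn's `roc_auc_score`; the 0.9 threshold follows the requirement stated above, while the retraining loop itself is left to the caller.

```python
from sklearn.metrics import roc_auc_score

def model_is_good_enough(y_true, y_score, min_auc: float = 0.9) -> bool:
    """Score the classifier on held-out test segments; retrain while this returns False."""
    auc = roc_auc_score(y_true, y_score)   # area under the ROC curve
    return auc >= min_auc
```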
  • Step 608 For each audio segment to be replaced, generate audio data to be dubbed according to the preset speech text information and the audio sample data.
  • When the time frame information includes a duration, the line to be replaced corresponding to the audio segment to be replaced may be determined from the preset line text information according to the duration, and the audio data to be dubbed is generated according to the line to be replaced and the preset audio sample data.
  • For example, if the audio sample data is a standard male voice, the sample data of the standard male voice corresponding to each character in the line to be replaced is combined to obtain the audio data to be dubbed, as sketched below.
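A minimal sketch of combining per-character samples of a standard voice into the dubbing audio and fitting it to the duration of the replaced segment; the `samples` dictionary (one waveform per character) is an assumption, and a real system might instead use a text-to-speech engine.

```python
import numpy as np

def synthesize_line(line: str, samples: dict, sr: int, target_sec: float) -> np.ndarray:
    """Concatenate per-character sample waveforms (e.g. a standard male voice)
    into dubbing audio, then pad/trim to the duration of the replaced segment."""
    pieces = [samples[ch] for ch in line if ch in samples]
    audio = np.concatenate(pieces) if pieces else np.zeros(0, dtype=np.float32)
    target = int(target_sec * sr)
    if len(audio) < target:
        audio = np.pad(audio, (0, target - len(audio)))
    return audio[:target]
```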
  • Step 609 Receive a processing request for an audio effect from the client, and adjust an audio effect of the audio data according to the processing request.
  • audio effects include audio styles and audio moods.
  • The user's processing request reflects whether the replacement audio should change in the style of the voice or in the emotion of the lead actor.
  • The so-called audio styles include voice masking, voice distortion (such as raising, lowering, removal, and transformation), replacing a female voice with a male voice, replacing a male voice with a female voice, celebrity voices, special voices (ghost/guichu voices, magic voices, dolphin voices), and other distinctive voice styles.
  • the target audio style is selected from the preset at least one audio style based on the processing request; the audio data is filtered according to the target audio style.
  • For example, suppose the user's processing request is to replace the lead actor's male voice with a female voice, and the preset standard female voices include several female voices of different pitches; one of the standard female voices is then selected as the target audio style and used to filter the audio data.
  • the so-called audio emotion refers to the personal emotions that the protagonist expresses when expressing the lines, such as anger, happiness, sadness, etc., corresponding to the fluctuation component of the sound that will appear on the audio.
  • the target audio mood is selected from the preset at least one audio mood based on the audio emotion request; the speech spectral distribution corresponding to the target audio emotion is determined; and the audio data is filtered according to the speech spectral distribution.
  • For example, if the user feels that the lead actor's performance is not emotional enough and the processing request is to increase the sadness of the lead actor, and the preset audio emotions include several levels of sadness with corresponding speech spectral distributions, then one target emotion is selected from the several sad emotions and used to filter the audio data.
  • In addition, the filtering of the audio style or emotion may also be applied to the generated audio data based on analysis of the audio sample data, rather than on a processing request from the user. A sketch of such spectral filtering is given below.
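One plausible reading of "filtering the audio data according to the speech spectral distribution" is a per-frequency gain applied in the Fourier domain, as sketched below; the gain curve is an assumption, not the application's actual filter.

```python
import numpy as np

def apply_spectral_shape(audio: np.ndarray, sr: int, target_gain_db) -> np.ndarray:
    """Shape the spectrum of `audio` toward a target distribution.
    `target_gain_db(freqs)` returns a per-frequency gain in dB."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    gain = 10.0 ** (target_gain_db(freqs) / 20.0)
    return np.fft.irfft(spectrum * gain, n=len(audio))

# Example: a "sadder" timbre might attenuate frequencies above 2 kHz slightly.
# shaped = apply_spectral_shape(audio, 16000, lambda f: -6.0 * (f > 2000))
```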
  • Step 610 Replace data in the audio segment to be replaced with audio data according to time frame information to obtain a second audio file.
  • FIG. 7 is a schematic structural diagram of an audio file processing apparatus according to an embodiment of the present application. As shown in FIG. 7, the audio file processing apparatus 700 includes:
  • the extracting module 710 is configured to extract at least one audio segment from the first audio file
  • the identification module 720 is configured to identify, from the at least one audio segment extracted by the extraction module 710, at least one audio segment to be replaced that represents the target character;
  • the time frame determining module 730 is configured to determine time frame information of each audio segment to be replaced identified by the identifying module 720 in the first audio file;
  • the obtaining module 740 is configured to obtain audio data to be dubbed for each audio segment to be replaced that is identified by the identification module 720;
  • The replacement module 750 is configured to: for each audio segment to be replaced identified by the identification module 720, replace the data in the audio segment to be replaced with the audio data obtained by the obtaining module 740 according to the time frame information determined by the time frame determining module 730, to obtain the second audio file.
  • The above provides an automatic dubbing scheme that achieves the purpose of automatically replacing the voice of the target character. Compared with manual dubbing, if there are X target characters, Y replacement audio effects, and a time cost of T, the total cost is X*Y*T, whereas the embodiment of the present application involves no human factor.
  • Through parallel machine processing, the overall cost is only T, which greatly saves the labor and time costs of dubbing, satisfies the user's demand for personalized voices, and improves the resource utilization of the dubbing device.
  • FIG. 8 is a schematic structural diagram of an audio file processing apparatus according to another embodiment of the present application. As shown in FIG. 8, on the basis of the audio file processing apparatus 700 shown in FIG. 7, the audio file processing apparatus 800 further includes:
  • the first receiving module 760 is configured to receive, from the client, a voice replacement request for the target role in the source video file.
  • the audio file determining module 770 is configured to separate the first audio file from the source video file according to the voice replacement request received by the first receiving module 760.
  • the audio file processing apparatus 800 further includes:
  • the first receiving module 760 is configured to receive, from the client, a voice replacement request for the target role in the source audio file;
  • the audio file determining module 770 is configured to determine the source audio file identified in the voice replacement request received by the first receiving module 760 as the first audio file.
  • the identification module 720 includes:
  • a feature extraction unit 721, configured to extract audio features of each audio segment
  • the identifying unit 722 is configured to identify, according to the audio feature extracted by the feature extracting unit 721, at least one audio segment to be replaced that represents the target character from the at least one audio segment.
  • the extraction module 710 is configured to extract at least one audio segment from the first audio file based on the principle of short sentence division, and determine first candidate time frame information of each audio segment;
  • the time frame determining module 730 is configured to determine time frame information according to the first candidate time frame information determined by the extracting module 710.
  • the audio file processing apparatus 800 further includes:
  • the setting module 780 is configured to preset the line text information including the third candidate time frame information
  • the extracting module 710 is further configured to: extract second candidate time frame information from the first audio file based on long-sentence division;
  • the time frame determining module 730 is configured to correct the first candidate time frame information according to the second candidate time frame information extracted by the extracting module 710 and the third candidate time frame information set by the setting module 780, and determine the time frame information.
  • the time frame information includes a duration
  • the obtaining module 740 is configured to determine, according to the duration, the line to be replaced corresponding to the audio segment to be replaced from the preset line text information, and to generate the audio data to be dubbed according to the line to be replaced and the preset audio sample data.
  • the audio file processing apparatus 800 further includes:
  • a second receiving module 790 configured to receive, from the client, a processing request for an audio effect
  • the sound effect processing module 810 is configured to adjust an audio effect of the audio data acquired by the obtaining module 740 according to the processing request received by the second receiving module 790.
  • In the above apparatus, the second candidate time frame information and the third candidate time frame information are optimized as separate time features, and this additional redundant time information is advantageous for determining the precise position of the audio segment on the time axis, thereby ensuring the accuracy of audio feature replacement.
  • FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the electronic device 900 includes a processor 910, a memory 920, a port 930, and a bus 940.
  • Processor 910 and memory 920 are interconnected by a bus 940.
  • Processor 910 can receive and transmit data through port 930. among them,
  • the processor 910 is configured to execute a machine readable instruction module stored by the memory 920.
  • Memory 920 stores machine readable instruction modules executable by processor 910.
  • the instruction module executable by the processor 910 includes an extraction module 921, an identification module 922, a time frame determination module 923, an acquisition module 924, and a replacement module 925. among them,
  • the extracting module 921 may be executed by the processor 910 to: extract at least one audio segment from the first audio file;
  • the identification module 922 may be executed by the processor 910 to: identify at least one audio segment to be replaced that represents the target character from the at least one audio segment extracted by the extraction module 921;
  • the time frame determining module 923, when executed by the processor 910, may: determine the time frame information of each audio segment to be replaced identified by the identification module 922 in the first audio file;
  • the obtaining module 924, when executed by the processor 910, may: acquire the audio data to be dubbed for each audio segment to be replaced identified by the identification module 922;
  • the replacement module 925, when executed by the processor 910, may: for each audio segment to be replaced identified by the identification module 922, replace the data in the audio segment to be replaced with the audio data obtained by the obtaining module 924 according to the time frame information determined by the time frame determining module 923, to obtain the second audio file.
  • the instruction module executable by the processor 910 further includes a first receiving module 926 and an audio file determining module 927, where
  • the first receiving module 926 when executed by the processor 910, may be: receiving, from the client, a voice replacement request for the target role in the source video file;
  • the audio file determining module 927, when executed by the processor 910, may: separate the first audio file from the source video file according to the voice replacement request received by the first receiving module 926.
  • the first receiving module 926 when executed by the processor 910, may be: receiving, from the client, a voice replacement request for the target role in the source audio file;
  • the audio file determining module 927, when executed by the processor 910, may: determine the source audio file identified in the voice replacement request received by the first receiving module 926 as the first audio file.
  • the extraction module 921 when executed by the processor 910, may be: extracting at least one audio segment from the first audio file based on the principle of short sentence division, and determining a first candidate time of each audio segment Frame information
  • the time frame determining module 923, when executed by the processor 910, may: determine the time frame information according to the first candidate time frame information determined by the extraction module 921.
  • the instruction module executable by the processor 910 further includes a setting module 928, wherein
  • the setting module 928 may be executed by the processor 910 to: preset the line text information including the third candidate time frame information;
  • the extraction module 921, when executed by the processor 910, may further: extract second candidate time frame information based on long sentences from the first audio file;
  • the time frame determining module 923, when executed by the processor 910, may: correct the first candidate time frame information according to the second candidate time frame information extracted by the extraction module 921 and the third candidate time frame information set by the setting module 928, and determine the time frame information.
  • the instruction module executable by the processor 910 further includes a second receiving module 929 and a sound processing module 931, where
  • the second receiving module 929 may be executed by the processor 910 to: receive a processing request for an audio effect from the client;
  • the sound effect processing module 931, when executed by the processor 910, may: adjust the audio effect of the audio data acquired by the obtaining module 924 according to the processing request received by the second receiving module 929, the adjusted audio data being used by the replacement module 925 for replacement.
  • That is, when the instruction modules stored in the memory 920 are executed by the processor 910, the various functions of the extraction module, the identification module, the time frame determination module, the acquisition module, the replacement module, and the receiving, audio file determining, setting, and sound effect processing modules in the foregoing embodiments can be implemented.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each module may exist physically separately, or two or more modules may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • each of the embodiments of the present application can be implemented by a data processing program executed by a data processing device such as a computer.
  • the data processing program constitutes the present application.
  • A data processing program is usually stored in a storage medium and is executed by reading the program directly out of the storage medium, or by installing or copying the program onto a storage device (such as a hard disk and/or a memory) of the data processing device. Therefore, such a storage medium also constitutes the present application.
  • the storage medium can use any type of recording method, such as paper storage medium (such as paper tape, etc.), magnetic storage medium (such as floppy disk, hard disk, flash memory, etc.), optical storage medium (such as CD-ROM, etc.), magneto-optical storage medium (such as MO, etc.).
  • the present application also discloses a storage medium in which is stored a data processing program for performing any of the above-described embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Computer Security & Cryptography (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses an audio file processing method, an electronic device and a storage medium. The method is executed by an electronic device and includes: extracting at least one audio segment from a first audio file; identifying, from the at least one audio segment, at least one audio segment to be replaced that represents a target character, and determining time frame information of each audio segment to be replaced in the first audio file; and, for each audio segment to be replaced, acquiring audio data to be dubbed, and replacing the data in the audio segment to be replaced with the audio data according to the time frame information to obtain a second audio file.

Description

Audio file processing method, electronic device and storage medium
This application claims priority to the Chinese patent application No. 201711076391.5, entitled "Audio file processing method and apparatus", filed with the Chinese Patent Office on November 6, 2017.
Technical Field
The present application relates to the field of voice processing technologies, and in particular to an audio file processing method, an electronic device and a storage medium.
Background of the Invention
At present, when a user watches videos such as movies and TV dramas, or listens to audio files such as audiobooks and radio programs, the sound in the file has been recorded in advance, and the user cannot freely choose the voice of the lead actor or of the speaker; therefore, the user's personal preferences cannot be satisfied.
In film and television production or in recording, dubbing is a time-consuming and laborious process. Taking a film as an example, dubbing of the lead roles is currently realized mainly by manual dubbing, and the manually recorded sound is synthesized into the film through post-processing. The whole process is poorly automated, time-consuming and laborious, which leads to relatively high labor and time costs and reduces the resource utilization of audio processing equipment.
Summary of the Invention
In view of this, the embodiments of the present application provide an audio file processing method, an electronic device and a storage medium, offering an automatic dubbing solution that can improve the time efficiency of audio replacement, consume fewer memory resources, and improve the resource utilization of an audio file processing apparatus.
Specifically, the technical solutions of the embodiments of the present application are implemented as follows.
An embodiment of the present application provides an audio file processing method, executed by an electronic device, the method including:
extracting at least one audio segment from a first audio file;
identifying, from the at least one audio segment, at least one audio segment to be replaced that represents a target character, and determining time frame information of each audio segment to be replaced in the first audio file; and
for each audio segment to be replaced, acquiring audio data to be dubbed, and replacing the data in the audio segment to be replaced with the audio data according to the time frame information to obtain a second audio file.
An embodiment of the present application further provides an electronic device, including a processor and a memory, the memory storing instructions executable by the processor; when the instructions are executed, the processor is configured to:
extract at least one audio segment from a first audio file;
identify, from the at least one audio segment, at least one audio segment to be replaced that represents a target character;
determine time frame information of each audio segment to be replaced in the first audio file; and
for each audio segment to be replaced, acquire audio data to be dubbed, and replace the data in the audio segment to be replaced with the audio data according to the time frame information to obtain a second audio file.
An embodiment of the present application further provides a computer-readable storage medium storing computer-readable instructions that cause at least one processor to perform the method described above.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the drawings in the following description show merely some embodiments of the present application, and a person skilled in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of an implementation environment involved in an embodiment of the present application;
FIG. 2 is a schematic flowchart of an audio file processing method in an embodiment of the present application;
FIG. 3a is a schematic diagram of an interface on which a client initiates a voice replacement request in an embodiment of the present application;
FIG. 3b is a schematic diagram of an interface on which a client initiates a voice replacement request in another embodiment of the present application;
FIG. 4 is a schematic flowchart of an audio file processing method in another embodiment of the present application;
FIG. 5 is a schematic structural diagram of candidate time frame information in an embodiment of the present application;
FIG. 6 is a schematic flowchart of an audio file processing method in yet another embodiment of the present application;
FIG. 7 is a schematic structural diagram of an audio file processing apparatus in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an audio file processing apparatus in another embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
To make the objectives, features and advantages of the present application clearer and easier to understand, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are merely some rather than all of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
FIG. 1 is a schematic diagram of an implementation environment involved in an embodiment of the present application. As shown in FIG. 1, an automatic dubbing system 100 includes an audio file processing apparatus 110, a network 120, a terminal device 130 and a user 140. The audio file processing apparatus 110 includes a processor and a memory, and the method embodiments of the present application are performed by the processor executing instructions stored in the memory.
Specifically, the audio file processing apparatus 110 includes a source file database 111, an audio sample database 112, a sound effect management database 113 and an audio replacement processing unit 114. A client 130-1 is installed on the terminal device 130. After logging in to the client 130-1, the user 140 can watch videos or listen to audio.
In the embodiments of the present application, the source file database 111 stores source video files or source audio files to be replaced. The audio sample database 112 stores various voice samples for replacement, such as pre-collected standard male voices, standard female voices, and celebrity voices; the sound effect management database 113 sets and stores various audio styles, emotion types and the corresponding processing modules.
After logging in to the client 130-1, while watching a video or listening to audio, the user 140 may wish to replace the voice of a target character (such as a lead actor or a speaker), and inputs an operation on the terminal 130 to initiate a voice replacement request for the target character. The client 130-1 sends the voice replacement request to the audio file processing apparatus 110. Based on the voice replacement request, the audio replacement processing unit 114 acquires the source video file or source audio file from the source file database 111, acquires audio samples from the audio sample database 112, generates the audio data to be dubbed, performs the audio replacement processing, outputs the replaced video/audio file, that is, the re-dubbed file, and returns it to the client 130-1. Before the audio replacement is performed, the corresponding sound effect processing module in the sound effect management database 113 may also be invoked to filter the audio data.
The audio file processing apparatus 110 may be a server, a server cluster composed of several servers, or a cloud computing service center. The network 120 may connect the audio file processing apparatus 110 and the terminal device 130 in a wireless or wired form. The terminal device 130 may be a smart terminal, including a smart TV, a smartphone, a tablet computer, a laptop computer, and the like.
FIG. 2 is a schematic flowchart of an audio file processing method in an embodiment of the present application. The method may be applied to an electronic device such as an audio file processing apparatus or a server, and includes the following steps.
Step 201: extract at least one audio segment from a first audio file.
The first audio file is the source file before audio replacement. Depending on the scenario, there are two ways to obtain the first audio file.
Manner 1: a voice replacement request for a target character in a source video file is received from the client, and the first audio file is separated from the source video file according to the voice replacement request.
The corresponding application scenario is that the user watches videos through the client, such as movies, TV dramas and entertainment shows. The voice replacement request carries the identifier of the source video file and the identifier of the target character.
FIG. 3a is a schematic diagram of an interface on which a client initiates a voice replacement request in an embodiment of the present application. As shown in FIG. 3a, the playback screen of "TV drama: 琅琊榜, Episode 3" is displayed on the interface 310; the user clicks the play button 311 to watch the episode, and 312 is a progress control button. When the user is dissatisfied with the voice of a certain lead actor and wishes to replace it, the user right-clicks on the interface 310 to pop up a window 313 and selects in the window the lead actor whose voice is to be replaced, for example, selects to replace the voice of the lead actor "Hu**". The client then sends a voice replacement request to the audio file processing apparatus, the request including the TV drama identifier of Episode 3 of "琅琊榜" and the identifier of the target character "Hu**".
After receiving the voice replacement request, the audio file processing apparatus obtains the source video file according to the TV drama identifier in the request, and then separates the first audio file from the source video file. For example, the entire video file is read, and the pure audio file in it is extracted by transcoding as the first audio file.
Manner 2: a voice replacement request for a target character in a source audio file is received from the client, and the source audio file is determined as the first audio file.
The corresponding application scenario here is that the user listens to audio through the client, such as audiobooks, lectures and online courses. The voice replacement request carries the identifier of the source audio file and the identifier of the target character.
In a specific application, considering that the target character in this scenario is the speaker and the speaker's voice is usually the only human voice in the audio file, the voice replacement request may carry only the identifier of the source audio file.
FIG. 3b is a schematic diagram of an interface on which a client initiates a voice replacement request in another embodiment of the present application. As shown in FIG. 3b, the playback interface of "Audiobook: Romance of the Three Kingdoms, Chapter 8" is displayed on the interface 320; the user clicks the play button 321 to listen to the audio, and 322 is a progress control button. When the user wishes to replace the speaker's voice, the user right-clicks on the interface 320 to pop up a window 323 and selects the option "Replace the speaker's voice". The client then sends to the audio file processing apparatus a voice replacement request carrying the identifier "Audiobook: Romance of the Three Kingdoms, Chapter 8".
After receiving the voice replacement request, the audio file processing apparatus obtains the source audio file as the first audio file according to the identifier of the audio file in the request.
In addition, when extracting audio segments from the first audio file, voice detection may be performed on the data in the first audio file, and a piece of continuous data in which speech is detected is taken as one audio segment. For example, the appearance and disappearance of speech are judged by detecting fluctuations of the acoustic wave energy in the audio data: the time point at which speech is detected to appear is taken as the start time, the time point at which speech is detected to disappear is taken as the end time, and the continuous audio data between the start time and the end time is taken as one audio segment.
Step 202: identify, from the at least one audio segment, at least one audio segment to be replaced that represents the target character, and determine time frame information of each audio segment to be replaced in the first audio file.
In this step, identifying, from the at least one audio segment, at least one audio segment to be replaced that represents the target character specifically includes the following two steps.
Step 2021: extract the audio features of each audio segment.
In a specific application, the audio features may be extracted by machine learning. Specifically, each audio segment is input into a convolutional neural network for training to obtain the audio features of each audio segment. The audio features include any one or more of timbre, frequency, gender, emotion and sound peak distance.
A convolutional neural network usually includes multiple processing layers, for example a convolutional layer, a pooling layer, a fully connected layer and an output layer. The convolutional layer is provided with a convolution matrix serving as a filter, which can extract the audio features, or audio fingerprint, of an audio segment. In practice, multiple convolutional layers can be designed for deep learning so as to extract multi-dimensional composite audio features.
Alternatively, a long short-term memory (LSTM) model based on deep learning may be used for training; through memory and association, it is suitable for extracting audio features from long audio files.
Step 2022: identify, from the at least one audio segment, at least one audio segment to be replaced that represents the target character according to the audio features.
In a specific application, a binary classification model is established based on the target character; each audio segment and its audio features are input into the binary classification model, training is performed based on a logistic regression algorithm, and the at least one audio segment to be replaced is determined according to the training result.
Considering that, in an audio file, distinguishing the lead actor from supporting roles or the speaker from background voices is a 0-1 relationship, it fits a binary classification model. Therefore, when establishing the binary classification model, the target character may be set to 1 and non-target characters to 0. When recognizing the voice of the target character, a machine learning algorithm based on logistic regression may be used to train the binary classification model.
Step 203: for each audio segment to be replaced, acquire audio data to be dubbed, and replace the data in the audio segment to be replaced with the audio data according to the time frame information to obtain a second audio file.
In this step, the time frame information includes a duration. When acquiring the audio data to be dubbed, the line to be replaced corresponding to the audio segment to be replaced may be determined from preset line text information according to the duration, and the audio data to be dubbed is generated according to the line to be replaced and preset audio sample data.
In this way, the data in the audio segment to be replaced and the audio data to be dubbed have the same duration; replacing the former with the latter yields the re-dubbed second audio file. Thus, from the first audio file to the second audio file, the time frames are consistent, but the audio data contained in them has been replaced.
In this embodiment, at least one audio segment is extracted from the first audio file, at least one audio segment to be replaced that represents the target character is identified from the at least one audio segment, the time frame information of each audio segment to be replaced in the first audio file is determined, and, for each audio segment to be replaced, audio data to be dubbed is acquired and the data in the audio segment to be replaced is replaced with the audio data according to the time frame information to obtain the second audio file. This provides an automatic dubbing solution and achieves the purpose of automatically replacing the voice of the target character. Compared with manual dubbing, if there are X target characters, Y replacement audio effects and a time cost of T, the total cost is X*Y*T, whereas the embodiment of the present application involves no human factor: through parallel machine processing, the overall cost is only T. This greatly saves the labor and time costs of dubbing, satisfies the user's demand for personalized voices, and improves the resource utilization of the dubbing device.
FIG. 4 is a schematic flowchart of an audio file processing method in another embodiment of the present application. The method may be executed by an electronic device such as an audio file processing apparatus or a server. As shown in FIG. 4, it includes the following steps.
Step 401: extract at least one audio segment from the first audio file based on the principle of short-sentence division, and determine first candidate time frame information of each audio segment.
According to the habits of language expression, a short sentence is a sentence with a simple structure and few words, while a long sentence is a sentence with a complex structure and many words. Usually, a complete long sentence contains several short sentences, with adjacent short sentences connected by a brief pause. Therefore, when extracting audio segments from the first audio file, the extraction can be based on the principle of short-sentence division, with each audio segment corresponding to one short sentence.
Each audio file corresponds to a time axis, and the first candidate time frame information of each audio segment is determined while the audio segments are extracted. The so-called time frame information is used to represent the time segmentation information of an audio segment on the time axis, including the start time of the audio segment on the time axis, or the start time and end time, or the start time and duration, and the like.
The extracted audio segments and the corresponding first candidate time frame information are stored in an electronic device such as the audio file processing apparatus or a server for use in subsequent replacement operations.
FIG. 5 is a schematic structural diagram of candidate time frame information in an embodiment of the present application. As shown in FIG. 5, the first candidate time frame information 510 includes the start times of six audio segments, marked by black triangles on the time axis, and the corresponding line information is given in block 500. The six audio segments each correspond to one short sentence, namely "策马迎风", "看人生起伏", "啸歌书景", "笑天地荒老", "以梦为马" and "驰骋流年", with start times of 0 min 10 s, 0 min 12 s, 0 min 14.5 s, 0 min 17 s, 0 min 20.8 s and 0 min 22 s respectively.
Step 402: extract second candidate time frame information from the first audio file based on long sentences.
This step and step 401 above may be performed in parallel. The extraction is based on long-sentence division; a so-called long sentence is a sentence composed of several short sentences. The extraction method may be the same as that described in step 201. Since the goal of this step is to obtain the second candidate time frame information, the audio segments corresponding to the long sentences need not be stored.
In the embodiment shown in FIG. 5, the second candidate time frame information 520 includes three (start time, duration) pairs corresponding to three long sentences, specifically (0 min 9.43 s, 3.50 s), (0 min 14.20 s, 4.95 s) and (0 min 20.35 s, 3.95 s).
Step 403: set line text information including third candidate time frame information in advance.
This step may also be performed in parallel with steps 401 and 402. During production of the audio file, line text information is set in advance; the line text information includes the lines and the corresponding third candidate time frame information. The lines corresponding to the third candidate time frame information may be short sentences or long sentences.
In the embodiment shown in FIG. 5, the third candidate time frame information 530 includes three start times corresponding to three long sentences, namely 0 min 10 s, 0 min 15 s and 0 min 20 s.
Step 404: determine the time frame information of each audio segment to be replaced according to one or more kinds of candidate time frame information.
在进行音频替换时,时间帧信息决定了替换的时间位置,从而决定了替换的准确性。根据步骤401-403中获取到的各种可能的候选时间帧信息,确定的方式具体分为以下几种情况:
1)根据第一候选时间帧信息确定时间帧信息。
第一候选时间帧信息是在提取音频分段时获得的。那么,在一个实施例中,可以将第一候选时间帧信息直接作为时间帧信息。
在另一个实施例中,考虑到提取音频分段时在时间上可能具有一定的偏差,可以预先估计出提取音频分段时产生的时间偏差,然后在确定时间帧信息时根据该时间偏差进行补偿。例如,第一候选时间帧信息中包括N个音频分段的起始时刻,其中,第i个音频分段的起始时刻为t1 i,i=1,…N,那么时间帧信息中第i个音频分段的起始时刻t0 i=t1 i+Δoffset。其中,Δoffset为时间偏差,其数值可以为正值或者为负值。
2) Determine the time frame information according to the first candidate time frame information and the second candidate time frame information.
In one embodiment, considering that the second candidate time frame information is extracted as standalone time data and may therefore be more precise, the second candidate time frame information may be used directly as the time frame information.
In another embodiment, the first candidate time frame information is corrected according to the second candidate time frame information to determine the time frame information. Since the start time and the duration are related, the correction can be made from either of these two angles, in the following three ways.
Way one: correct the start times in the first candidate time frame information based on the start times in the second candidate time frame information.
First, two corresponding start times are determined, one from the second candidate time frame information and one from the first candidate time frame information; the mean, maximum, or minimum of the two start times is taken as the start time in the time frame information. "Corresponding" here means that the two start times correspond to the same position in the lines.
Suppose the second candidate time frame information includes the start times of M audio segments, where the start time of the j-th audio segment is t2_j, j = 1, ..., M. Taking the embodiment shown in FIG. 5 as an example, the first start time in the first candidate time frame information is t1_1 = 10 s and the first start time in the second candidate time frame information is t2_1 = 9.43 s; the corresponding start time in the time frame information is then taken as their mean, i.e. t0_1 = (t1_1 + t2_1) / 2 = 9.715 s.
Way two: correct the start times in the first candidate time frame information based on the durations in the second candidate time frame information.
First, a duration in the second candidate time frame information and the two corresponding start times in the first candidate time frame information are determined; either of the two start times is adjusted so that the difference between them is greater than the duration.
Suppose the second candidate time frame information includes the durations of M audio segments, where the duration of the j-th audio segment is Δt2_j, j = 1, ..., M. If Δt2_j is determined to correspond to two start times t1_i and t1_(i+1) in the first candidate time frame information, it is checked whether the condition
t1_(i+1) − t1_i > Δt2_j            (1)
is satisfied; if not, t1_i is decreased or t1_(i+1) is increased until the condition holds.
Way three: correct the durations in the first candidate time frame information based on the durations in the second candidate time frame information.
First, a duration in the second candidate time frame information and the one or more corresponding durations in the first candidate time frame information are determined; the maximum among the durations corresponding to the same length of lines is taken as the duration in the time frame information.
In practical applications, since the duration of dubbing is variable, the maximum of the durations in the two kinds of candidate time frame information may be taken. Suppose the first candidate time frame information includes the durations of N audio segments, where the duration of the i-th audio segment is Δt1_i, i = 1, ..., N. If Δt2_j is determined to correspond to two durations Δt1_i and Δt1_(i+1) in the first candidate time frame information, it is checked whether the condition
Δt1_i + Δt1_(i+1) < Δt2_j            (2)
holds; if it does, Δt1_i or Δt1_(i+1) may be increased so that the two durations together cover Δt2_j.
Which of the above correction ways to use can be decided according to the specific information contained in the first and second candidate time frame information, i.e. according to the relative numbers of start times and durations: if start times are more numerous, the start times are corrected; otherwise the durations are corrected. Alternatively, since a duration can be bounded by two adjacent start times and the start-time values are therefore more important, way one or way two above may be preferred.
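The following sketch illustrates ways one and three above; the data layout (individual start times and lists of durations) and the function names are assumptions chosen for the example.

```python
def correct_start_time(t1, t2):
    """Way one: fuse two corresponding start times, here by taking their mean."""
    return (t1 + t2) / 2.0

def correct_duration(first_candidate_durations, second_candidate_duration):
    """Way three: keep the largest duration among those covering the same lines."""
    return max(first_candidate_durations + [second_candidate_duration])

# Example from FIG. 5: short-sentence start 10.0 s vs. long-sentence start 9.43 s.
print(correct_start_time(10.0, 9.43))   # 9.715
```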
3) Determine the time frame information according to the first candidate time frame information and the third candidate time frame information.
In one embodiment, considering that the third candidate time frame information comes from the line text information and represents standard time data, the third candidate time frame information may be used directly as the time frame information.
In another embodiment, when relying solely on the line text information is not sufficiently reliable, the first candidate time frame information may be corrected according to the third candidate time frame information to determine the time frame information. For the specific correction methods, refer to the description in case 2) above, with the third candidate time frame information substituted for the second candidate time frame information; details are not repeated here.
4) Determine the time frame information according to the first, second, and third candidate time frame information.
In this case, the first candidate time frame information may be corrected according to both the second candidate time frame information and the third candidate time frame information to determine the time frame information.
When the time frame information is determined from these three kinds of candidate time frame information, the correction may follow the three ways described in case 2) above. What needs to be determined is the corresponding start times and durations across the three kinds of candidate time frame information, "corresponding" meaning that they refer to the same position in the lines.
For example, with reference to way one, three corresponding start times in the three kinds of candidate time frame information are determined, and the mean, maximum, or minimum of the three is taken as the start time in the time frame information. Alternatively, the two closest of the three start times may be selected, and their mean, maximum, or minimum taken as the start time in the time frame information.
For example, with reference to way two, the larger of the durations in the second and third candidate time frame information is determined, and the two corresponding start times in the first candidate time frame information are adjusted so that the difference between them is greater than this larger duration.
For example, with reference to way three, the corresponding durations in the three kinds of candidate time frame information are determined, and the maximum among the durations corresponding to the same length of lines is taken as the duration in the time frame information.
Step 405: Identify, from the at least one audio segment, at least one to-be-replaced audio segment representing the target character.
For the identification method, refer to the description of step 202 above; details are not repeated here.
Step 406: For each to-be-replaced audio segment, obtain the audio data to be dubbed, and replace the data in the to-be-replaced audio segment with the audio data according to the time frame information, to obtain the second audio file.
For the replacement method, refer to the description of step 203 above; details are not repeated here.
In the above embodiment, when the time frame information used for audio replacement is determined, three kinds of candidate information are considered together, originating respectively from short-sentence extraction of the audio segments (the first candidate time frame information), long-sentence extraction over the whole audio file (the second candidate time frame information), and the line text information (the third candidate time frame information). The second and third candidate time frame information are optimized as standalone temporal features, and this additional redundant time information helps pinpoint the exact position of an audio segment on the time axis, thereby ensuring the accuracy of the audio feature replacement.
FIG. 6 is a schematic flowchart of an audio file processing method according to an embodiment of this application. The method may be performed by an electronic device such as an audio file processing apparatus or a server, and replaces the leading actor's voice in a source video file. It specifically includes the following steps.
Step 601: Receive, from a client, a voice replacement request for the target character in a source video file, and separate the first audio file from the source video file according to the voice replacement request.
As described in way one of step 201, the voice replacement request carries the identifier of the source video file and the identifier of the target character. After receiving the voice replacement request, the audio file processing apparatus obtains the source video file according to the identifier of the source video file in the request, and then separates the first audio file from the source video file.
Step 602: Extract at least one audio segment from the first audio file based on the principle of short-sentence division, and determine the first candidate time frame information of each audio segment.
Step 603: Extract the second candidate time frame information from the first audio file based on the principle of long-sentence division.
Step 604: Preset line text information that includes the third candidate time frame information.
Step 605: Determine the time frame information of each to-be-replaced audio segment according to one or more kinds of candidate time frame information.
For the processing of steps 602-605, refer to the description of steps 401-404 above; details are not repeated here.
Step 606: Sample the audio segments to determine the training data and test data for the binary classification model.
In practice, since the number of audio segments may be large, it is not feasible to train on the video of an entire TV series. To meet the coverage requirement, sampling may be used to select the training data and test data for training the binary classification model.
For example, sampling is performed at a ratio of 6:4, i.e. of all the audio segments, 60% of the data is used for training and 40% for testing. The sampling may be stratified; for example, if the whole series has 40 episodes, a certain amount of training data is drawn from each episode, so that the training data covers all episodes.
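A minimal sketch of such stratified 6:4 sampling is given below, assuming each segment record carries an episode index; the data layout and the function name are assumptions for the example, while the 0.6 training ratio per episode follows the description above.

```python
import random
from collections import defaultdict

def stratified_split(segments, train_ratio=0.6, seed=0):
    """Split segments 60/40 per episode so the training data covers every episode."""
    rng = random.Random(seed)
    by_episode = defaultdict(list)
    for seg in segments:                       # each seg is e.g. {"episode": 3, ...}
        by_episode[seg["episode"]].append(seg)

    train, test = [], []
    for episode_segments in by_episode.values():
        rng.shuffle(episode_segments)
        cut = int(len(episode_segments) * train_ratio)
        train.extend(episode_segments[:cut])
        test.extend(episode_segments[cut:])
    return train, test
```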
Step 607: Identify, using the binary classification model, at least one to-be-replaced audio segment representing the target character from the at least one audio segment, and perform score-based optimization.
As described in step 202, distinguishing the leading actor from supporting roles in an audio file is a 0-1 relationship, with the leading actor as 1 and supporting roles as 0, and therefore fits a binary classification model. Accordingly, a binary classification model is built, and a machine learning algorithm based on logistic regression may be used to train it when recognizing the leading actor's voice. In a specific application, logistic regression is a mature machine learning algorithm integrated in Spark MLlib, and Spark may be used for concurrent training.
In addition, the output of the binary classification model training is the likelihood that an audio segment is 0 or 1. Since machine-learning-based training results may contain misjudgments, an AUC (Area Under ROC Curve) metric may be used for scoring to optimize the whole training process. That is, each audio segment is scored 0 or 1, where 0 indicates a segment that does not hit the leading actor and 1 indicates a segment that does. The solution in the embodiments of this application considers the scores for label 1 and requires an AUC-based accuracy of at least 0.9, i.e. the precision of segments judged as 1 must reach 90% or more. When this requirement is not met, training is performed again.
The AUC metric is a standard for measuring the quality of a classification model. It is based on ROC (Receiver Operating Characteristic) analysis, whose main analysis tool is an ROC curve drawn on a two-dimensional plane, and the AUC value is the area under the ROC curve.
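For illustration, the AUC of the trained classifier can be computed as in the sketch below; scikit-learn's `roc_auc_score` is one possible tool, the 0.9 threshold follows the description above, and the surrounding variable and function names are assumptions for the example.

```python
from sklearn.metrics import roc_auc_score

def evaluate_classifier(clf, X_test, y_test, min_auc=0.9):
    """Score the label-1 (leading actor) predictions and check the AUC >= 0.9 requirement."""
    scores = clf.predict_proba(X_test)[:, 1]   # probability that a segment hits the leading actor
    auc = roc_auc_score(y_test, scores)
    return auc, auc >= min_auc                  # if False, resample the data and train again
```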
Step 608: For each to-be-replaced audio segment, generate the audio data to be dubbed according to the preset line text information and the audio sample data.
When the time frame information includes a duration, the to-be-replaced line corresponding to the to-be-replaced audio segment may be determined from the preset line text information according to the duration, and the audio data to be dubbed is generated from the to-be-replaced line and the preset audio sample data. For example, if the audio sample data is a standard male voice, the sample data of the standard male voice is combined character by character according to the to-be-replaced line, to obtain the audio data to be dubbed.
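The sketch below illustrates this concatenative generation, assuming a lookup table of per-character sample waveforms for the standard voice; the table, the simple linear time-stretch used to fit the target duration, and the function name are assumptions made for the example.

```python
import numpy as np

def synthesize_line(line, voice_bank, sr, target_duration_sec):
    """Concatenate per-character samples of a standard voice, then stretch to the time frame."""
    pieces = [voice_bank[ch] for ch in line if ch in voice_bank]
    audio = np.concatenate(pieces) if pieces else np.zeros(1)

    target_len = int(target_duration_sec * sr)
    # Naive linear resampling so the dubbed line fits the duration of the replaced segment.
    positions = np.linspace(0, len(audio) - 1, target_len)
    return np.interp(positions, np.arange(len(audio)), audio)
```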
Step 609: Receive, from the client, a processing request for an audio effect, and adjust the audio effect of the audio data according to the processing request.
Specifically, the audio effect includes an audio style and an audio emotion. The user's processing request reflects whether the replacement audio should change in voice style or in the actor's emotion.
The audio style includes distinctive voice styles such as voice masking, voice distortion (e.g., pitch raising, lowering, removal, or transformation), replacing a female voice with a male voice, replacing a male voice with a female voice, celebrity voices, and special voices (e.g., "guichu" remix voices, demonic voices, dolphin-pitch voices), and so on.
When the audio effect refers to the audio style, a target audio style is selected from at least one preset audio style based on the processing request, and the audio data is filtered according to the target audio style. For example, if the user's processing request is to replace the leading actor's male voice with a female voice, and the preset audio styles include standard female voices of multiple pitches, one of the standard female voices is selected as the target audio style and used to filter the audio data.
The audio emotion refers to the personal emotion the actor expresses when delivering the lines, such as anger, joy, or sadness, which appears in the audio as fluctuating components of the voice.
When the audio effect refers to the audio emotion, a target audio emotion is selected from at least one preset audio emotion based on the audio emotion request; the speech spectrum distribution corresponding to the target audio emotion is determined; and the audio data is filtered according to the speech spectrum distribution. For example, if the user feels the actor's performance is not emotional enough and the processing request is to increase the actor's sadness, and the preset audio emotions include multiple degrees of sadness with their corresponding speech spectrum distributions, one of the sadness emotions is selected as the target emotion and used to filter the audio data.
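A minimal sketch of filtering the audio data against a target speech spectrum distribution is shown below; representing the distribution as per-frequency gains and applying them in the FFT domain is an assumption made for the example, not the filtering defined in this application.

```python
import numpy as np

def apply_spectral_profile(audio, sr, profile_freqs, profile_gains):
    """Filter the dubbed audio so its spectrum follows a target emotion/style profile."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    gains = np.interp(freqs, profile_freqs, profile_gains)  # per-bin gain from the profile
    return np.fft.irfft(spectrum * gains, n=len(audio))
```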
In other embodiments, instead of being based on the user's processing request, the filtering for audio style or emotion may be added to the generated audio data based on an analysis of the audio sample data.
Step 610: Replace the data in the to-be-replaced audio segment with the audio data according to the time frame information, to obtain the second audio file.
Through the above embodiment, in view of users' personalized demands regarding the voices in films and TV series, processing of audio style or emotion is added alongside the dubbing. This introduces a variety of voice possibilities into the dull, uniform dubbing of the existing dubbing industry, enriches the audio effects while dubbing automatically, and satisfies users' personalized needs.
FIG. 7 is a schematic structural diagram of an audio file processing apparatus according to an embodiment of this application. As shown in FIG. 7, the audio file processing apparatus 700 includes:
an extraction module 710, configured to extract at least one audio segment from a first audio file;
an identification module 720, configured to identify, from the at least one audio segment extracted by the extraction module 710, at least one to-be-replaced audio segment representing a target character;
a time frame determination module 730, configured to determine time frame information, in the first audio file, of each to-be-replaced audio segment identified by the identification module 720;
an obtaining module 740, configured to obtain, for each to-be-replaced audio segment identified by the identification module 720, audio data to be dubbed; and
a replacement module 750, configured to replace, for each to-be-replaced audio segment identified by the identification module 720, the data in the to-be-replaced audio segment with the audio data obtained by the obtaining module 740 according to the time frame information determined by the time frame determination module 730, to obtain a second audio file.
The above embodiment provides an automated dubbing solution that achieves automatic replacement of the target character's voice. Compared with manual dubbing, where X target characters, Y replacement audio effects, and a time cost of T yield a total cost of X*Y*T, the embodiments of this application involve no manual labor and, through parallel machine processing, reduce the overall cost to only T. The labor and time costs of dubbing are therefore greatly reduced, users' demands for personalized voices are satisfied, and the resource utilization of the dubbing device is improved.
FIG. 8 is a schematic structural diagram of an audio file processing apparatus according to another embodiment of this application. As shown in FIG. 8, on the basis of the audio file processing apparatus 700 shown in FIG. 7, the audio file processing apparatus 800 further includes:
a first receiving module 760, configured to receive, from a client, a voice replacement request for the target character in a source video file; and
an audio file determination module 770, configured to separate the first audio file from the source video file according to the voice replacement request received by the first receiving module 760.
In one embodiment, the audio file processing apparatus 800 further includes:
the first receiving module 760, configured to receive, from a client, a voice replacement request for the target character in a source audio file; and
the audio file determination module 770, configured to determine the source audio file identified in the voice replacement request received by the first receiving module 760 as the first audio file.
In one embodiment, the identification module 720 includes:
a feature extraction unit 721, configured to extract an audio feature of each audio segment; and
an identification unit 722, configured to identify, according to the audio features extracted by the feature extraction unit 721, the at least one to-be-replaced audio segment representing the target character from the at least one audio segment.
In one embodiment, the extraction module 710 is configured to extract the at least one audio segment from the first audio file based on the principle of short-sentence division, and to determine the first candidate time frame information of each audio segment;
and the time frame determination module 730 is configured to determine the time frame information according to the first candidate time frame information determined by the extraction module 710.
In one embodiment, the audio file processing apparatus 800 further includes:
a setting module 780, configured to preset line text information that includes the third candidate time frame information;
the extraction module 710 is further configured to extract, from the first audio file, the second candidate time frame information based on long sentences; and
the time frame determination module 730 is configured to correct the first candidate time frame information according to the second candidate time frame information extracted by the extraction module 710 and the third candidate time frame information set by the setting module 780, to determine the time frame information.
In one embodiment, the time frame information includes a duration, and the obtaining module 740 is configured to determine, from the preset line text information according to the duration, the to-be-replaced line corresponding to the to-be-replaced audio segment, and to generate the audio data to be dubbed from the to-be-replaced line and the preset audio sample data.
In one embodiment, the audio file processing apparatus 800 further includes:
a second receiving module 790, configured to receive, from a client, a processing request for an audio effect; and
an audio effect processing module 810, configured to adjust the audio effect of the audio data obtained by the obtaining module 740 according to the processing request received by the second receiving module 790.
According to the above embodiments, the second and third candidate time frame information are optimized as standalone temporal features, and this additional redundant time information helps pinpoint the exact position of an audio segment on the time axis, thereby ensuring the accuracy of the audio feature replacement. In addition, in view of users' personalized demands regarding the voices in films and TV series, processing of audio style or emotion is added alongside the dubbing, introducing a variety of voice possibilities into the dull, uniform dubbing of the existing dubbing industry, enriching the audio effects while dubbing automatically, and satisfying users' personalized needs.
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of this application. As shown in FIG. 9, the electronic device 900 includes a processor 910, a memory 920, a port 930, and a bus 940. The processor 910 and the memory 920 are interconnected through the bus 940. The processor 910 can receive and send data through the port 930.
The processor 910 is configured to execute machine-readable instruction modules stored in the memory 920.
The memory 920 stores the machine-readable instruction modules executable by the processor 910. The instruction modules executable by the processor 910 include an extraction module 921, an identification module 922, a time frame determination module 923, an obtaining module 924, and a replacement module 925, where:
when executed by the processor 910, the extraction module 921 is operable to extract at least one audio segment from a first audio file;
when executed by the processor 910, the identification module 922 is operable to identify, from the at least one audio segment extracted by the extraction module 921, at least one to-be-replaced audio segment representing a target character;
when executed by the processor 910, the time frame determination module 923 is operable to determine time frame information, in the first audio file, of each to-be-replaced audio segment identified by the identification module 922;
when executed by the processor 910, the obtaining module 924 is operable to obtain, for each to-be-replaced audio segment identified by the identification module 922, audio data to be dubbed; and
when executed by the processor 910, the replacement module 925 is operable to replace, for each to-be-replaced audio segment identified by the identification module 922, the data in that segment with the audio data obtained by the obtaining module 924 according to the time frame information determined by the time frame determination module 923, to obtain a second audio file.
In one embodiment, the instruction modules executable by the processor 910 further include a first receiving module 926 and an audio file determination module 927, where:
when executed by the processor 910, the first receiving module 926 is operable to receive, from a client, a voice replacement request for the target character in a source video file; and
when executed by the processor 910, the audio file determination module 927 is operable to separate the first audio file from the source video file according to the voice replacement request received by the first receiving module 926.
In another embodiment, when executed by the processor 910, the first receiving module 926 is operable to receive, from a client, a voice replacement request for the target character in a source audio file; and
when executed by the processor 910, the audio file determination module 927 is operable to determine the source audio file identified in the voice replacement request received by the first receiving module 926 as the first audio file.
In one embodiment, when executed by the processor 910, the extraction module 921 is operable to extract the at least one audio segment from the first audio file based on the principle of short-sentence division, and to determine the first candidate time frame information of each audio segment; and
when executed by the processor 910, the time frame determination module 923 is operable to determine the time frame information according to the first candidate time frame information determined by the extraction module 921.
In one embodiment, the instruction modules executable by the processor 910 further include a setting module 928, where:
when executed by the processor 910, the setting module 928 is operable to preset line text information that includes the third candidate time frame information;
when executed by the processor 910, the extraction module 921 is operable to extract, from the first audio file, the second candidate time frame information based on long sentences; and
when executed by the processor 910, the time frame determination module 923 is operable to correct the first candidate time frame information according to the second candidate time frame information extracted by the extraction module 921 and the third candidate time frame information set by the setting module 928, to determine the time frame information.
In one embodiment, the instruction modules executable by the processor 910 further include a second receiving module 929 and an audio effect processing module 931, where:
when executed by the processor 910, the second receiving module 929 is operable to receive, from a client, a processing request for an audio effect; and
when executed by the processor 910, the audio effect processing module 931 is operable to adjust the audio effect of the audio data obtained by the obtaining module 924 according to the processing request received by the second receiving module 929, the adjusted audio data being used by the replacement module 925 for the replacement.
It can thus be seen that, when the instruction modules stored in the memory 920 are executed by the processor 910, the various functions of the extraction module, identification module, time frame determination module, obtaining module, replacement module, first receiving module, audio file determination module, setting module, second receiving module, and audio effect processing module in the foregoing embodiments can be implemented.
In the above electronic device embodiments, the specific methods by which the modules and units implement their functions are described in the method embodiments and are not repeated here.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, each module may exist physically on its own, or two or more modules may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
In addition, each embodiment of this application may be implemented by a data processing program executed by a data processing device such as a computer; such a data processing program constitutes this application. Furthermore, a data processing program usually stored in a storage medium is executed by reading the program directly from the storage medium or by installing or copying the program into a storage device (such as a hard disk and/or memory) of the data processing device. Therefore, such a storage medium also constitutes this application. The storage medium may use any type of recording method, for example a paper storage medium (such as paper tape), a magnetic storage medium (such as a floppy disk, hard disk, or flash memory), an optical storage medium (such as a CD-ROM), or a magneto-optical storage medium (such as an MO disk).
Therefore, this application also discloses a storage medium storing a data processing program, the data processing program being used to perform any one of the embodiments of the above method of this application.
The foregoing descriptions are merely preferred embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the protection scope of this application.

Claims (16)

  1. An audio file processing method, performed by an electronic device, the method comprising:
    extracting at least one audio segment from a first audio file;
    identifying, from the at least one audio segment, at least one to-be-replaced audio segment representing a target character, and determining time frame information of each to-be-replaced audio segment in the first audio file; and
    for each to-be-replaced audio segment, obtaining audio data to be dubbed, and replacing the data in the to-be-replaced audio segment with the audio data according to the time frame information, to obtain a second audio file.
  2. The method according to claim 1, further comprising:
    receiving, from a client, a voice replacement request for the target character in a source video file; and
    separating the first audio file from the source video file according to the voice replacement request.
  3. The method according to claim 1, wherein the identifying, from the at least one audio segment, at least one to-be-replaced audio segment representing a target character comprises:
    extracting an audio feature of each audio segment; and
    identifying, according to the audio features, the at least one to-be-replaced audio segment representing the target character from the at least one audio segment.
  4. The method according to claim 3, wherein the identifying, according to the audio features, the at least one to-be-replaced audio segment representing the target character from the at least one audio segment comprises:
    building a binary classification model based on the target character; and
    inputting each audio segment and the audio feature of that audio segment into the binary classification model, performing training based on a logistic regression algorithm, and determining the at least one to-be-replaced audio segment according to the training result.
  5. The method according to claim 1, wherein the extracting at least one audio segment from a first audio file comprises:
    extracting the at least one audio segment from the first audio file based on the principle of short-sentence division, and determining first candidate time frame information of each audio segment;
    and the determining time frame information of each to-be-replaced audio segment in the first audio file comprises:
    determining the time frame information according to the first candidate time frame information.
  6. The method according to claim 5, further comprising:
    extracting, from the first audio file, second candidate time frame information based on long sentences;
    wherein the determining the time frame information according to the first candidate time frame information comprises:
    correcting the first candidate time frame information according to the second candidate time frame information, to determine the time frame information.
  7. The method according to claim 5, further comprising:
    presetting line text information comprising third candidate time frame information;
    wherein the determining the time frame information according to the first candidate time frame information comprises:
    correcting the first candidate time frame information according to the third candidate time frame information, to determine the time frame information.
  8. The method according to claim 1, wherein the time frame information comprises a duration, and the obtaining audio data to be dubbed comprises:
    determining, from preset line text information according to the duration, a to-be-replaced line corresponding to the to-be-replaced audio segment; and
    generating the audio data to be dubbed according to the to-be-replaced line and preset audio sample data.
  9. The method according to any one of claims 1 to 8, further comprising:
    receiving, from a client, a processing request for an audio effect; and
    adjusting the audio effect of the audio data according to the processing request.
  10. The method according to claim 9, wherein the replacing the data in the to-be-replaced audio segment with the audio data according to the time frame information comprises:
    replacing the data in the to-be-replaced audio segment with the adjusted audio data according to the time frame information.
  11. An electronic device, comprising a processor and a memory, the memory storing instructions executable by the processor, wherein, when executing the instructions, the processor is configured to:
    extract at least one audio segment from a first audio file;
    identify, from the at least one audio segment, at least one to-be-replaced audio segment representing a target character;
    determine time frame information of each to-be-replaced audio segment in the first audio file; and
    for each to-be-replaced audio segment, obtain audio data to be dubbed, and replace the data in the to-be-replaced audio segment with the audio data according to the time frame information, to obtain a second audio file.
  12. The electronic device according to claim 11, wherein, when executing the instructions, the processor is further configured to:
    extract an audio feature of each audio segment; and
    identify, according to the audio features, the at least one to-be-replaced audio segment representing the target character from the at least one audio segment.
  13. The electronic device according to claim 11, wherein, when executing the instructions, the processor is further configured to:
    extract the at least one audio segment from the first audio file based on the principle of short-sentence division, and determine first candidate time frame information of each audio segment; and
    determine the time frame information according to the first candidate time frame information.
  14. The electronic device according to claim 13, wherein, when executing the instructions, the processor is further configured to:
    preset line text information comprising third candidate time frame information;
    extract, from the first audio file, second candidate time frame information based on long sentences; and
    correct the first candidate time frame information according to the second candidate time frame information and the third candidate time frame information, to determine the time frame information.
  15. The electronic device according to any one of claims 11 to 14, wherein, when executing the instructions, the processor is further configured to:
    receive, from a client, a processing request for an audio effect; and
    adjust the audio effect of the audio data according to the processing request.
  16. A computer-readable storage medium storing computer-readable instructions that cause at least one processor to perform the method according to any one of claims 1 to 10.
PCT/CN2018/114179 2017-11-06 2018-11-06 Audio file processing method, electronic device and storage medium WO2019086044A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/844,283 US11538456B2 (en) 2017-11-06 2020-04-09 Audio file processing method, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711076391.5 2017-11-06
CN201711076391.5A CN108305636B (zh) 2017-11-06 2017-11-06 Audio file processing method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/844,283 Continuation US11538456B2 (en) 2017-11-06 2020-04-09 Audio file processing method, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2019086044A1 (zh) 2019-05-09

Family

ID=62869672

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/114179 WO2019086044A1 (zh) 2017-11-06 2018-11-06 Audio file processing method, electronic device and storage medium

Country Status (3)

Country Link
US (1) US11538456B2 (zh)
CN (1) CN108305636B (zh)
WO (1) WO2019086044A1 (zh)



Also Published As

Publication number Publication date
US11538456B2 (en) 2022-12-27
US20200234689A1 (en) 2020-07-23
CN108305636B (zh) 2019-11-15
CN108305636A (zh) 2018-07-20


Legal Events

121: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 18873050; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
122: PCT application non-entry in European phase (Ref document number: 18873050; Country of ref document: EP; Kind code of ref document: A1)