WO2024017110A1 - Speech noise reduction method, model training method, apparatus, device, medium and product - Google Patents

Speech noise reduction method, model training method, apparatus, device, medium and product

Info

Publication number
WO2024017110A1
WO2024017110A1 (PCT/CN2023/106951)
Authority
WO
WIPO (PCT)
Prior art keywords
audio frame
activity detection
detection result
noise reduction
sample
Prior art date
Application number
PCT/CN2023/106951
Other languages
English (en)
French (fr)
Inventor
魏善义
刘梁
Original Assignee
广州市百果园信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 广州市百果园信息技术有限公司
Publication of WO2024017110A1 publication Critical patent/WO2024017110A1/zh

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0224: Processing in the time domain
    • G10L21/0232: Processing in the frequency domain
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks

Definitions

  • This application relates to the field of audio processing technology, such as speech noise reduction methods, model training methods, devices, equipment, media and products.
  • the speech collected by the microphone of a terminal device usually contains a certain degree of noise; a speech noise reduction algorithm can suppress the noise carried in the speech, thereby improving its intelligibility and voice quality.
  • speech noise reduction solutions can be roughly divided into two categories: traditional noise reduction solutions and artificial intelligence (Artificial Intelligence, AI) noise reduction solutions.
  • Traditional noise reduction solutions use signal processing to achieve speech noise reduction and cannot eliminate non-stationary noise; that is, their ability to reduce sudden noise is weak. AI noise reduction solutions can reduce both stationary and non-stationary noise well, but they are data-driven and highly dependent on training samples: if a scenario is not considered during model training (for example, a very low signal-to-noise ratio), encountering that scenario in practice may produce unpredictable signal output or even crash the system.
  • the embodiments of this application provide speech noise reduction methods, model training methods, apparatuses, devices, media and products, which can effectively combine traditional noise reduction solutions and AI noise reduction solutions to improve the speech noise reduction effect.
  • a speech noise reduction method, the method including:
  • using a preset voice activity detection algorithm to detect the current audio frame to be processed, to obtain a corresponding algorithm activity detection result;
  • fusing the model activity detection result corresponding to the previous audio frame with the algorithm activity detection result corresponding to the current audio frame, to obtain the target activity detection result corresponding to the current audio frame, where the model activity detection result is output by a preset speech noise reduction network model;
  • performing noise estimation and noise elimination on the current audio frame based on the target activity detection result, to obtain an initial noise-reduced audio frame;
  • inputting the initial noise-reduced audio frame to the preset speech noise reduction network model, to output a target noise-reduced audio frame and the model activity detection result corresponding to the current audio frame.
  • a model training method, including:
  • using a preset voice activity detection algorithm to detect the current sample audio frame, to obtain a corresponding sample algorithm activity detection result, where the current sample audio frame is associated with an activity detection label and a clean audio frame;
  • fusing the sample model activity detection result corresponding to the previous sample audio frame with the sample algorithm activity detection result corresponding to the current sample audio frame, to obtain the target sample activity detection result corresponding to the current sample audio frame, where the sample model activity detection result is output by the speech noise reduction network model;
  • performing noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result, to obtain an initial noise-reduced sample audio frame;
  • inputting the initial noise-reduced sample audio frame to the speech noise reduction network model, to output a target sample noise-reduced audio frame and the sample model activity detection result corresponding to the current sample audio frame;
  • determining a first loss relationship based on the target sample noise-reduced audio frame and the clean audio frame, determining a second loss relationship based on the sample model activity detection result and the activity detection label, and training the speech noise reduction network model based on the first loss relationship and the second loss relationship.
  • a speech noise reduction apparatus, the apparatus including:
  • a voice activity detection module, configured to use a preset voice activity detection algorithm to detect the current audio frame to be processed, to obtain a corresponding algorithm activity detection result;
  • a detection result fusion module, configured to fuse the model activity detection result corresponding to the previous audio frame with the algorithm activity detection result corresponding to the current audio frame, to obtain the target activity detection result corresponding to the current audio frame, where the model activity detection result is output by a preset speech noise reduction network model;
  • a noise reduction processing module, configured to perform noise estimation and noise elimination on the current audio frame based on the target activity detection result, to obtain an initial noise-reduced audio frame;
  • a model input module, configured to input the initial noise-reduced audio frame to the preset speech noise reduction network model, to output a target noise-reduced audio frame and the model activity detection result corresponding to the current audio frame.
  • a model training apparatus, including:
  • a voice detection module, configured to use a preset voice activity detection algorithm to detect the current sample audio frame to be processed, to obtain a corresponding sample algorithm activity detection result, where the current sample audio frame is associated with an activity detection label and a clean audio frame;
  • a fusion module, configured to fuse the sample model activity detection result corresponding to the previous sample audio frame with the sample algorithm activity detection result corresponding to the current sample audio frame, to obtain the target sample activity detection result corresponding to the current sample audio frame, where the sample model activity detection result is output by the speech noise reduction network model;
  • a noise elimination module, configured to perform noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result, to obtain an initial noise-reduced sample audio frame;
  • a network model input module, configured to input the initial noise-reduced sample audio frame to the speech noise reduction network model, to output a target sample noise-reduced audio frame and the sample model activity detection result corresponding to the current sample audio frame;
  • a network model training module, configured to determine a first loss relationship based on the target sample noise-reduced audio frame and the clean audio frame, determine a second loss relationship based on the sample model activity detection result and the activity detection label, and train the speech noise reduction network model based on the first loss relationship and the second loss relationship.
  • an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, where
  • the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the speech noise reduction method and/or model training method described in any embodiment of the present application.
  • a computer-readable storage medium stores a computer program which, when executed by a processor, implements the speech noise reduction method and/or model training method described in any embodiment of the present application.
  • a computer program product includes a computer program that, when executed by a processor, implements the speech noise reduction method and/or model training method described in any embodiment of the present application.
  • the speech noise reduction solution provided in the embodiments of this application uses a preset voice activity detection algorithm to detect the current audio frame to be processed and obtains the corresponding algorithm activity detection result.
  • the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame are fused to obtain the target activity detection result corresponding to the current audio frame, where the model activity detection result is output by a preset speech noise reduction network model. Noise estimation and noise elimination are performed on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame.
  • the initial noise-reduced audio frame is input to the preset speech noise reduction network model to output a target noise-reduced audio frame and the model activity detection result corresponding to the current audio frame.
  • With this solution, the preset speech noise reduction network model can output model activity detection results; when the traditional speech noise reduction algorithm processes the current audio frame, the model activity detection result of the previous audio frame is combined with the algorithm activity detection result obtained by the traditional algorithm, so the traditional noise reduction algorithm obtains more activity detection information and determines the voice activity detection result more reasonably and accurately.
  • Noise estimation and noise elimination based on this result better protect speech and remove more noise, yielding a traditional noise reduction result with a higher signal-to-noise ratio; using the traditional noise reduction result as the input of the preset speech noise reduction network model yields a better noise-reduced audio frame and reduces the possibility that the preset speech noise reduction network model has to process harsh data.
  • The traditional noise reduction algorithm and the AI noise reduction method reinforce each other and provide good noise reduction capability for various noises, which improves the speech noise reduction effect and the stability and robustness of the overall speech noise reduction solution.
  • Figure 1 is a schematic flow chart of a speech noise reduction method provided by an embodiment of the present application.
  • Figure 2 is a schematic flow chart of yet another speech noise reduction method provided by an embodiment of the present application.
  • Figure 3 is a schematic diagram of the inference flow of a speech noise reduction method provided by an embodiment of the present application.
  • Figure 4 is a schematic flow chart of a model training method provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of the training process of a model training method provided by an embodiment of the present application.
  • Figure 6 is a structural block diagram of a speech noise reduction device provided by an embodiment of the present application.
  • Figure 7 is a structural block diagram of a model training device provided by an embodiment of the present application.
  • Figure 8 is a structural block diagram of an electronic device provided by an embodiment of the present application.
  • Figure 1 is a schematic flowchart of a speech noise reduction method provided by an embodiment of the present application.
  • This embodiment can be applied to speech noise reduction scenarios, for example voice calls, audio and video live streaming, multi-person conferences and other scenes.
  • the method can be executed by a speech noise reduction apparatus, which can be implemented in the form of hardware and/or software.
  • the speech noise reduction apparatus can be configured in electronic equipment such as a speech noise reduction device.
  • the electronic device may be a mobile device such as a mobile phone, a smart watch, a tablet computer, or a personal digital assistant; it may also be another device such as a desktop computer.
  • the method includes:
  • Step 101: Use the preset voice activity detection algorithm to detect the current audio frame to be processed, and obtain the corresponding algorithm activity detection result.
  • the current audio frame to be processed can be understood as the audio frame that currently needs to be processed for voice noise reduction, and the current audio frame can be included in an audio file or audio stream.
  • the current audio frame may be an original audio frame in an audio file or audio stream, or an audio frame obtained by preprocessing the original audio frame.
  • the entire speech noise reduction solution can be understood as a speech noise reduction system, and the current audio frame can be understood as an input signal of the speech noise reduction system.
  • the speech noise reduction solution can include traditional speech noise reduction algorithms and AI speech noise reduction models.
  • the type of traditional speech noise reduction algorithm can be, for example, the Adaptive Noise Suppression (ANS) algorithm in Web Real-Time Communication (WebRTC), a linear filtering method, spectral subtraction, a statistical model algorithm, or a subspace algorithm.
  • Traditional speech noise reduction algorithms mainly include three parts: Voice Activity Detection (VAD) estimation, noise estimation and noise elimination.
  • Voice Activity Detection (VAD), also known as voice endpoint detection or voice boundary detection, can identify long silent periods in a sound signal stream.
  • the preset voice activity detection algorithm in the embodiment of the present application can be a voice activity detection algorithm in any traditional voice noise reduction algorithm.
  • the preset speech noise reduction network model in this application can be an AI speech noise reduction model, for example the RNNoise model or the Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression (DTLN) noise reduction model.
  • the preset speech noise reduction network model includes two branches: one branch outputs the denoised speech (the noise reduction branch), and the other branch outputs the voice activity detection result (the detection branch).
  • For AI speech denoising models that already include a detection branch, the original model structure can be kept; for AI speech denoising models that do not include a detection branch, a detection branch can be added on top of the backbone network, and the network structure of the detection branch may include, for example, convolutional layers and/or fully connected layers (a minimal sketch of such an added detection head is shown below).
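  • As an illustration only: the sketch below attaches a per-frequency-point detection head to a shared recurrent backbone next to the noise reduction head. The class name, layer choices and dimensions (DenoiseWithVAD, a GRU backbone, n_bins = 257) are assumptions, since the patent only states that the detection branch may consist of convolutional and/or fully connected layers.

```python
import torch
import torch.nn as nn

class DenoiseWithVAD(nn.Module):
    """Hypothetical denoising backbone with an added VAD detection branch."""

    def __init__(self, n_bins: int = 257, hidden: int = 256):
        super().__init__()
        # Shared backbone over per-frame magnitude spectra.
        self.backbone = nn.GRU(n_bins, hidden, num_layers=2, batch_first=True)
        # Noise reduction branch: per-frequency-point mask in [0, 1].
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())
        # Added detection branch: a fully connected head predicting
        # per-frequency-point speech presence probabilities.
        self.vad_head = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, spec_mag: torch.Tensor):
        # spec_mag: (batch, frames, n_bins)
        h, _ = self.backbone(spec_mag)
        denoised = spec_mag * self.mask_head(h)  # noise reduction output
        vad = self.vad_head(h)                   # model activity detection result
        return denoised, vad
```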
  • RNNoise is a noise reduction solution that combines audio feature extraction with a deep neural network.
  • the detection results obtained by the preset voice activity detection algorithm can be recorded as algorithm activity detection results, and the activity detection results output by the preset speech noise reduction network model can be recorded as model activity detection results.
  • Step 102: Fuse the model activity detection result corresponding to the previous audio frame with the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame, where the model activity detection result is output by the preset speech noise reduction network model.
  • the previous audio frame can be understood as the latest audio frame before the current audio frame, that is, the previous audio frame is located before the current audio frame and the two frame numbers are adjacent.
  • the preset speech noise reduction network model can output the noise-reduced audio frame and model activity detection result corresponding to the previous audio frame, and the model activity detection result can be cached for use in the noise reduction processing of the current audio frame.
  • the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame can be combined to determine the activity detection result used in the traditional speech noise reduction algorithm, namely the target activity detection result.
  • In this way, the traditional noise reduction algorithm can obtain more VAD information and thus a more accurate noise estimate, which better protects speech, eliminates noise more accurately, and improves the output signal-to-noise ratio (Signal to Noise Ratio, SNR) of the traditional noise reduction algorithm.
  • Step 103: Perform noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame.
  • the noise estimation algorithm and noise elimination algorithm in the traditional speech noise reduction algorithm can be used to process the current audio frame accordingly, and the processed audio frame is recorded as the initial noise reduction audio frame.
  • Step 104: Input the initial noise-reduced audio frame to the preset speech noise reduction network model to output the target noise-reduced audio frame and the model activity detection result corresponding to the current audio frame.
  • the initial noise-reduced audio frame can be used directly as the input of the preset speech noise reduction network model, or it can be converted according to the characteristics of the preset speech noise reduction network model, for example into a signal in a preset dimension.
  • the preset dimension can be, for example, the frequency domain, the time domain, or another dimensional domain.
  • the speech noise reduction method uses a preset voice activity detection algorithm to detect the current audio frame to be processed and obtains the corresponding algorithm activity detection result.
  • the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame are fused to obtain the target activity detection result corresponding to the current audio frame, where the model activity detection result is output by the preset speech noise reduction network model. Based on the target activity detection result, noise estimation and noise elimination are performed on the current audio frame to obtain an initial noise-reduced audio frame, and the initial noise-reduced audio frame is input to the preset speech noise reduction network model to output the target noise-reduced audio frame and the model activity detection result corresponding to the current audio frame.
  • In this way, the preset speech noise reduction network model outputs model activity detection results; when the traditional speech noise reduction algorithm processes the current audio frame, the model activity detection result of the previous audio frame is combined with the algorithm activity detection result obtained by the traditional algorithm, so the traditional noise reduction algorithm obtains more activity detection information and determines the voice activity detection result more reasonably and accurately.
  • Noise estimation and noise elimination based on this result better protect speech, remove more noise, and yield a traditional noise reduction result with a higher signal-to-noise ratio; using the traditional noise reduction result as the input of the preset speech noise reduction network model yields a better noise-reduced audio frame and reduces the possibility that the preset speech noise reduction network model has to process harsh data.
  • The traditional noise reduction algorithm and the AI noise reduction method reinforce each other, provide better noise reduction capability for various noises, and improve the overall stability and robustness of the solution.
  • voice activity detection can be at the frame level or at the frequency point level, and the detection results can be represented by one or more probability values.
  • the algorithm activity detection result includes a first probability value corresponding to the presence of speech in the audio frame
  • the model activity detection result includes a second probability value corresponding to the existence of speech in the audio frame.
  • In one embodiment, fusing the model activity detection result corresponding to the previous audio frame with the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame includes: using a preset calculation method to combine the second probability value in the model activity detection result corresponding to the previous audio frame with the first probability value in the algorithm activity detection result corresponding to the current audio frame to obtain a third probability value, and determining the target activity detection result corresponding to the current audio frame according to the third probability value. With this setting, the target activity detection result can be determined accurately for frame-level speech activity detection.
  • the first probability value represents the probability, obtained by detecting an audio frame with the preset voice activity detection algorithm, that the frame contains speech.
  • the audio frame here can be any audio frame, for example the current audio frame or the previous audio frame, and different audio frames can have different first probability values.
  • the second probability value represents the probability, output by the preset speech noise reduction network model, that the corresponding audio frame contains speech; likewise, different audio frames can have different second probability values.
  • the first probability value in the algorithm activity detection result corresponding to the current audio frame (assumed to be marked as A) represents the probability, obtained by detecting the current audio frame with the preset voice activity detection algorithm, that the current audio frame contains speech; it can be recorded as Pa.
  • the second probability value in the model activity detection result corresponding to the previous audio frame (assumed to be marked as B) represents the probability, predicted by the preset speech noise reduction network model while performing speech noise reduction on the previous audio frame, that the previous audio frame contains speech; it can be recorded as Pb.
  • the third probability value (recorded as Pc) can be used as the target activity detection result corresponding to the current audio frame.
  • the preset calculation method is one of taking the maximum value, taking the minimum value, calculating the average, calculating the sum, calculating the weighted sum, and calculating the weighted average.
  • For example, when the preset calculation method takes the maximum value, the third probability value is Pc = max(Pa, Pb).
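  • A minimal sketch of this frame-level fusion, covering the calculation methods the text enumerates; the function name and the weight parameter w are illustrative only:

```python
def fuse_frame_vad(pa: float, pb: float, method: str = "max", w: float = 0.5) -> float:
    """Fuse Pa (algorithm VAD of the current frame) with Pb (model VAD
    cached from the previous frame) into Pc, the target detection result."""
    if method == "max":
        return max(pa, pb)              # Pc = max(Pa, Pb)
    if method == "min":
        return min(pa, pb)
    if method == "sum":
        return pa + pb                  # may exceed 1; clamp if a probability is needed
    if method == "mean":
        return 0.5 * (pa + pb)
    if method == "weighted":
        return w * pa + (1.0 - w) * pb  # weighted sum / weighted average
    raise ValueError(f"unknown fusion method: {method}")
```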
  • In one embodiment, the algorithm activity detection result includes a fourth probability value of speech presence for each of a preset number of frequency points in the corresponding audio frame, and the model activity detection result includes a fifth probability value of speech presence for each of the preset number of frequency points in the corresponding audio frame.
  • Fusing the model activity detection result corresponding to the previous audio frame with the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame then includes: for each of the preset number of frequency points, using a preset calculation method to combine the fifth probability value of the frequency point in the model activity detection result corresponding to the previous audio frame with the fourth probability value of the same frequency point in the algorithm activity detection result corresponding to the current audio frame to obtain a sixth probability value; and determining the target activity detection result corresponding to the current audio frame according to the preset number of sixth probability values.
  • the preset number (denoted as n) can be set according to actual needs; for example, it can be determined by the number of points used in the fast Fourier transform in the preprocessing stage, such as n = 256.
  • the fourth probability values corresponding to the current audio frame (assumed to be marked as A) represent, for each of the preset number of frequency points in the current audio frame, the probability that the frequency point contains speech, as obtained by detecting the current audio frame with the preset voice activity detection algorithm; they can be recorded as PA[n].
  • PA[n] can be understood as a vector of n elements, each with a value between 0 and 1, where the value of an element represents the probability that the corresponding frequency point contains speech.
  • the fifth probability values corresponding to the previous audio frame (assumed to be marked as B) represent the per-frequency-point speech presence probabilities predicted by the preset speech noise reduction network model while performing speech noise reduction on the previous audio frame; they can be recorded as PB[n]. PA[n] and PB[n] are combined with a preset calculation method to obtain the preset number of sixth probability values, recorded as PC[n]; for example, the vector of sixth probability values can be used as the target activity detection result corresponding to the current audio frame.
  • the preset calculation method is one of taking the maximum value, taking the minimum value, calculating the average, calculating the sum, calculating the weighted sum, and calculating the weighted average.
  • For example, when the preset calculation method takes the maximum value, PC[n] = max(PA[n], PB[n]) element-wise; that is, the maximum of the corresponding fourth and fifth probability values becomes the sixth probability value for the first frequency point in the current audio frame, and so on for the subsequent frequency points.
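  • At the frequency-point level the same fusion becomes an element-wise operation over the n probability values; a sketch with the maximum-value rule (array shapes assumed):

```python
import numpy as np

def fuse_bin_vad(pa: np.ndarray, pb: np.ndarray) -> np.ndarray:
    """Element-wise fusion PC[n] = max(PA[n], PB[n]) over n frequency points."""
    assert pa.shape == pb.shape
    return np.maximum(pa, pb)

# Example with n = 256 frequency points, as in the text.
pa = np.random.rand(256)   # PA[256]: algorithm VAD of the current frame
pb = np.random.rand(256)   # PB[256]: model VAD cached from the previous frame
pc = fuse_bin_vad(pa, pb)  # PC[256]: target activity detection result
```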
  • In one embodiment, inputting the initial noise-reduced audio frame to the preset speech noise reduction network model includes: performing feature extraction in a preset feature dimension on the initial noise-reduced audio frame to obtain a target input signal; and inputting the target input signal, or the target input signal together with the initial noise-reduced audio frame, to the preset speech noise reduction network model.
  • In this way, feature extraction can be carried out in a targeted manner, improving the prediction accuracy and precision of the preset speech noise reduction network model.
  • the preset feature dimensions include explicit feature dimensions, such as fundamental frequency features, for example pitch frequency (Pitch), per-channel energy normalization (PCEN) features, or Mel-Frequency Cepstral Coefficient (MFCC) features.
  • the preset feature dimensions can be determined based on the network structure or characteristics of the preset speech noise reduction network model.
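  • As a sketch of such explicit feature extraction, the snippet below computes MFCC and PCEN features with librosa; the concrete feature set, sampling rate and pooling are assumptions, since the patent leaves the choice to the characteristics of the network model:

```python
import numpy as np
import librosa

def extract_features(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Explicit feature extraction on the initially denoised signal S1,
    producing a target input signal S2 (assumed feature choices)."""
    # Mel-frequency cepstral coefficients.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    # Per-channel energy normalization over a mel spectrogram.
    mel = librosa.feature.melspectrogram(y=audio, sr=sr)
    pcen = librosa.pcen(mel * (2 ** 31), sr=sr)
    # Time-averaged for brevity; a real model would keep the frame axis.
    return np.concatenate([mfcc.mean(axis=1), pcen.mean(axis=1)])
```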
  • Figure 2 is a schematic flow chart of another speech noise reduction method provided by an embodiment of the present application; this method is optimized on the basis of the above optional embodiments.
  • Figure 3 is a schematic diagram of the inference flow of a speech noise reduction method provided by an embodiment of the present application; the solution of this embodiment can be understood by combining Figure 2 and Figure 3. As shown in Figure 2, the method may include:
  • Step 201: Obtain the original audio frame and preprocess it to obtain the current audio frame to be processed.
  • the original audio frame is included in an audio file or audio stream; for example, in a voice call scenario, the call audio stream needs to be noise-reduced.
  • Preprocessing can include operations such as framing, windowing, and the Fourier transform.
  • the preprocessed noisy speech frame is the current audio frame to be processed, and serves as the input signal of the preset traditional noise reduction algorithm (recorded as S0).
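  • A sketch of this preprocessing stage (framing, windowing and Fourier transform); the frame length, hop size and Hann window are assumed values:

```python
import numpy as np

def preprocess(x: np.ndarray, frame_len: int = 512, hop: int = 256):
    """Split a signal into overlapping windowed frames and yield their
    spectra, i.e. the per-frame input signal S0."""
    window = np.hanning(frame_len)
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window
        yield np.fft.rfft(frame)  # frame_len // 2 + 1 complex bins
```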
  • Step 202: Use the preset voice activity detection algorithm in the preset traditional noise reduction algorithm to detect the current audio frame to be processed and obtain the corresponding algorithm activity detection result.
  • the preset traditional noise reduction algorithm may be the ANS algorithm.
  • S0 is detected; assuming frequency-point-level detection, the speech presence probabilities of 256 frequency points, Pf[256], can be obtained, which is the algorithm activity detection result corresponding to S0.
  • Step 203: Determine whether the current audio frame has a previous audio frame; if so, perform step 204; otherwise, perform step 206, that is, perform noise estimation and noise elimination based on the algorithm activity detection result corresponding to the current audio frame.
  • Step 204: Obtain the model activity detection result corresponding to the previous audio frame, and fuse it with the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame.
  • the model activity detection result corresponding to the previous audio frame is output by the artificial-intelligence-based preset speech noise reduction network model; it can be the speech presence probabilities PF[256] of the 256 frequency points in the previous audio frame, and the fused result can be recorded as P[256].
  • Step 205: Based on the target activity detection result, use the preset traditional noise reduction algorithm to perform noise estimation and noise elimination on the current audio frame to obtain an initial noise-reduced audio frame, and execute step 207.
  • the preset traditional noise reduction algorithm performs noise estimation and noise elimination according to P[256], obtaining the speech signal S1 that has undergone traditional noise reduction processing, that is, the initial noise-reduced audio frame.
  • Step 206: Based on the algorithm activity detection result corresponding to the current audio frame, use the preset traditional noise reduction algorithm to perform noise estimation and noise elimination on the current audio frame to obtain an initial noise-reduced audio frame.
  • That is, the preset traditional noise reduction algorithm performs noise estimation and noise elimination according to Pf[256], obtaining the speech signal S1 that has undergone traditional noise reduction processing, that is, the initial noise-reduced audio frame.
  • Step 207: Extract features of the preset feature dimensions from the initial noise-reduced speech to obtain the target input signal.
  • S1 serves as the input signal of the preset speech noise reduction network model, and can be a signal in the frequency domain, time domain, or another dimensional domain.
  • Before the preset speech noise reduction network model, there may be an explicit feature extraction step, for example of pitch frequency features; the extracted feature information is recorded as the target input signal S2.
  • Step 208: Input the target input signal and/or the initial noise-reduced audio frame to the preset speech noise reduction network model to output the target noise-reduced audio frame and the model activity detection result corresponding to the current audio frame.
  • S1 or S2 can be used as the model input, or both S1 and S2 can be used as the model input, and input into the preset speech noise reduction network model for inference calculation to obtain the output signal.
  • the output signal contains two parts: the first part is the final denoised speech output S3 of the speech noise reduction method, and the second part is the model's VAD output PF[256], which is used by the traditional speech noise reduction algorithm when processing the next audio frame.
  • Step 209: Determine whether there is another original audio frame to be processed; if so, return to step 201; otherwise, end the process.
  • That is, if audio frames remain to be processed, the flow returns to step 201 to continue the noise reduction process.
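  • Putting steps 201 to 209 together, the per-frame inference loop might look like the sketch below. The objects traditional_ns (with vad and suppress methods) and model are hypothetical stand-ins for the preset traditional noise reduction algorithm and the preset speech noise reduction network model, and the maximum-value rule is assumed for the fusion:

```python
import numpy as np

def denoise_stream(frames, traditional_ns, model):
    """Per-frame inference loop with VAD feedback from the model."""
    pf_model_prev = None                      # PF[n] cached from the previous frame
    for s0 in frames:                         # steps 201/202: preprocessed frame S0
        pf_algo = traditional_ns.vad(s0)      # algorithm VAD, Pf[n]
        if pf_model_prev is None:             # step 203: no previous frame yet
            p_target = pf_algo                # step 206
        else:                                 # step 204: fuse with cached PF[n]
            p_target = np.maximum(pf_algo, pf_model_prev)
        s1 = traditional_ns.suppress(s0, p_target)  # steps 205/206: get S1
        s3, pf_model_prev = model(s1)         # steps 207/208: denoise, new VAD
        yield s3                              # final denoised output frame
```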
  • the speech noise reduction method provided by the embodiments of this application uses the artificial-intelligence-based preset speech noise reduction network model to feed information back to the traditional noise reduction algorithm, so that the traditional algorithm obtains more VAD information.
  • Both the traditional VAD estimation and the AI noise reduction operate at the frequency-point level, which yields a more accurate noise estimate, so the traditional noise reduction algorithm can better protect speech, eliminate more noise, and improve its output signal-to-noise ratio.
  • A high-quality traditional noise reduction result in turn enriches the input of the preset speech noise reduction network model, reduces the possibility of the model processing bad data, and improves the model's noise reduction effect and overall speech noise reduction performance.
  • Figure 4 is a schematic flowchart of a model training method provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of the training process of a model training method provided by an embodiment of the present application.
  • the embodiment of the present application can be understood in conjunction with Figures 4 and 5.
  • This embodiment can be applied to training a speech noise reduction network model based on artificial intelligence.
  • the model can be applied to various scenarios such as voice calls, audio and video live broadcasts, and multi-person conferences.
  • the method can be executed by a model training device, which can be implemented in the form of hardware and/or software, and which can be configured in electronic equipment such as model training equipment.
  • the electronic device may be a mobile device such as a mobile phone, a smart watch, a tablet computer, or a personal digital assistant; it may also be another device such as a desktop computer.
  • the speech noise reduction network model trained using the embodiments of this application can be applied to the speech noise reduction method provided by any embodiment of this application.
  • the method includes:
  • Step 401: Use the preset voice activity detection algorithm to detect the current sample audio frame to obtain the corresponding sample algorithm activity detection result, where the current sample audio frame is associated with an activity detection label and a clean audio frame.
  • a clean speech data set and a noise data set can be mixed into noisy speech data according to a preset mixing rule; the preset mixing rule can be set based on, for example, the signal-to-noise ratio or the room impulse response (Room Impulse Response, RIR).
  • the mixed noisy speech data set and the clean speech data set are used as the training set for the model.
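  • One possible form of such a signal-to-noise-ratio-based mixing rule is sketched below; RIR-based reverberation would add a convolution with a room impulse response, which is omitted here. The scaling derivation is standard practice, not a formula taken from the patent:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a clean utterance with noise scaled to a target SNR in dB."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    # Choose scale so that 10 * log10(p_clean / (scale^2 * p_noise)) == snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```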
  • the current sample audio frame can be an audio frame in the training set.
  • the current sample audio frame can carry an activity detection label, which can be added through manual annotation.
  • Taking the frame level as an example, the label can be 1 if the frame contains speech and 0 if it does not; taking the frequency-point level as an example, the label can be a vector containing a preset number of elements, each with value 1 or 0: the value is 1 if the corresponding frequency point contains speech and 0 if it does not.
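  • The patent obtains these labels through manual annotation; as an assumed automatic surrogate for illustration only, frequency-point-level labels can also be derived by thresholding the clean reference spectrum:

```python
import numpy as np

def frequency_vad_labels(clean_spec: np.ndarray, thresh_db: float = -60.0) -> np.ndarray:
    """Per-frequency-point labels (1 = speech present, 0 = absent) from the
    magnitude of one clean reference frame; the threshold is a guess."""
    mag_db = 20.0 * np.log10(np.abs(clean_spec) + 1e-12)
    return (mag_db > thresh_db).astype(np.float32)
```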
  • Step 402: Fuse the sample model activity detection result corresponding to the previous sample audio frame with the sample algorithm activity detection result corresponding to the current sample audio frame to obtain the target sample activity detection result corresponding to the current sample audio frame, where the sample model activity detection result is output by the speech noise reduction network model.
  • the activity detection result fusion in this step can be similar to the fusion in the speech noise reduction method provided by the embodiments of this application: it can be frequency-point-level or frame-level fusion, and a similar preset calculation method can be used to fuse the corresponding probability values. For details, refer to the relevant content above, which is not repeated here.
  • Step 403: Perform noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result to obtain an initial noise-reduced sample audio frame.
  • Step 404: Input the initial noise-reduced sample audio frame to the speech noise reduction network model to output the target sample noise-reduced audio frame and the sample model activity detection result corresponding to the current sample audio frame.
  • Step 405: Determine a first loss relationship based on the target sample noise-reduced audio frame and the clean audio frame, determine a second loss relationship based on the sample model activity detection result and the activity detection label, and train the speech noise reduction network model based on the first loss relationship and the second loss relationship.
  • the loss relationship characterizes the difference between two types of data and can be represented by a loss value, for example calculated with a loss function.
  • the first loss relationship characterizes the difference between the target sample noise-reduced audio frame and the clean audio frame; the second loss relationship characterizes the difference between the sample model activity detection result and the activity detection label.
  • the function types of the first loss function used to calculate the first loss relationship and of the second loss function used to calculate the second loss relationship can be set according to actual needs.
  • the target loss relationship may be calculated from the first loss relationship and the second loss relationship, for example by weighted summation, and the speech noise reduction network model is trained according to the target loss relationship.
  • the weight parameters of the speech noise reduction network model can be optimized continuously, for example by backpropagation, with the goal of minimizing the target loss value, until a preset training cutoff condition is met.
  • the training cutoff condition can be set according to actual needs, for example, it can be set based on the number of iterations, the degree of convergence of the loss value, or the accuracy of the model.
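  • A sketch of one training step combining the two loss relationships by weighted summation. The concrete loss functions (mean squared error for the denoising branch, binary cross-entropy for the detection branch) and the weights alpha and beta are assumptions, since the patent leaves the function types and the weighting to actual needs:

```python
import torch
import torch.nn.functional as F

def training_step(model, s1, clean, vad_label, alpha=1.0, beta=0.5):
    """One step: forward pass, combined loss, backpropagation."""
    denoised, vad_pred = model(s1)
    loss_denoise = F.mse_loss(denoised, clean)               # first loss relationship
    loss_vad = F.binary_cross_entropy(vad_pred, vad_label)   # second loss relationship
    loss = alpha * loss_denoise + beta * loss_vad            # target loss relationship
    loss.backward()  # optimizer.step() / zero_grad() omitted for brevity
    return loss.item()
```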
  • the model training method provided by the embodiments of this application treats the traditional noise reduction algorithm and the speech noise reduction network model as a whole during training, which avoids the data mismatch risk of concatenating a separately trained speech noise reduction network model after the traditional noise reduction algorithm; the trained model can be used for speech noise reduction and has good noise reduction capability for various noises, improving the noise reduction effect.
  • the sample algorithm activity detection result includes a first sample probability value corresponding to the presence of speech in the sample audio frame
  • the sample model activity detection result includes a second sample probability value corresponding to the presence of speech in the sample audio frame
  • fusing the sample model activity detection result corresponding to the previous sample audio frame with the sample algorithm activity detection result corresponding to the current sample audio frame to obtain the target sample activity detection result corresponding to the current sample audio frame includes: using a preset calculation method to combine the second sample probability value in the sample model activity detection result corresponding to the previous sample audio frame with the first sample probability value in the sample algorithm activity detection result corresponding to the current sample audio frame to obtain a third sample probability value, and determining the target sample activity detection result corresponding to the current sample audio frame according to the third sample probability value.
  • the sample algorithm activity detection result includes a fourth sample probability value of speech presence for each of a preset number of frequency points in the corresponding audio frame, and the sample model activity detection result includes a fifth sample probability value of speech presence for each of the preset number of frequency points in the corresponding audio frame.
  • fusing the sample model activity detection result corresponding to the previous sample audio frame with the sample algorithm activity detection result corresponding to the current sample audio frame to obtain the target sample activity detection result corresponding to the current sample audio frame then includes: for each of the preset number of frequency points, using a preset calculation method to combine the fifth sample probability value of the frequency point in the sample model activity detection result corresponding to the previous sample audio frame with the fourth sample probability value of the same frequency point in the sample algorithm activity detection result corresponding to the current sample audio frame to obtain a sixth sample probability value; and determining the target sample activity detection result corresponding to the current sample audio frame according to the preset number of sixth sample probability values.
  • inputting the initial noise reduction sample audio frame to the speech noise reduction network model includes: performing feature extraction of preset feature dimensions on the initial noise reduction sample audio frame to obtain a target input signal;
  • the target input signal is input to the speech noise reduction network model, or the target input signal and the initial noise reduction sample audio frame are input to the speech noise reduction network model.
  • Figure 6 is a structural block diagram of a voice noise reduction device provided by an embodiment of the present application.
  • the device can be implemented in software and/or hardware and can generally be integrated into electronic equipment such as a speech noise reduction device; it performs speech noise reduction by executing a speech noise reduction method.
  • the device includes:
  • the voice activity detection module 601 is configured to use a preset voice activity detection algorithm to detect the current audio frame to be processed, and obtain the corresponding algorithm activity detection result;
  • the detection result fusion module 602 is configured to fuse the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame, wherein,
  • the model activity detection result is output by a preset speech noise reduction network model;
  • the noise reduction processing module 603 is configured to perform noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise reduction audio frame;
  • the model input module 604 is configured to input the initial noise reduction audio frame to the preset speech noise reduction network model to output the target noise reduction audio frame and the model activity detection result corresponding to the current audio frame.
  • the speech noise reduction apparatus uses a preset voice activity detection algorithm to detect the current audio frame to be processed and obtains the corresponding algorithm activity detection result, fuses the model activity detection result corresponding to the previous audio frame with the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame, where the model activity detection result is output by the preset speech noise reduction network model, performs noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame, and inputs the initial noise-reduced audio frame to the preset speech noise reduction network model to output the target noise-reduced audio frame and the model activity detection result corresponding to the current audio frame.
  • In this way, the preset speech noise reduction network model outputs model activity detection results; when the traditional speech noise reduction algorithm processes the current audio frame, the model activity detection result of the previous audio frame is combined with the algorithm activity detection result obtained by the traditional algorithm, so the traditional noise reduction algorithm obtains more activity detection information and determines the voice activity detection result more reasonably and accurately.
  • Noise estimation and noise elimination based on this result better protect speech, remove more noise, and yield a traditional noise reduction result with a higher signal-to-noise ratio; using the traditional noise reduction result as the input of the preset speech noise reduction network model yields a better noise-reduced audio frame and reduces the possibility that the model has to process harsh data.
  • The traditional noise reduction algorithm and the AI noise reduction method reinforce each other, provide better noise reduction capability for various noises, and improve the overall stability and robustness of the solution.
  • the algorithm activity detection result includes a first probability value corresponding to the presence of speech in the audio frame
  • the model activity detection result includes a second probability value corresponding to the existence of speech in the audio frame
  • the detection result fusion module 602 is configured to fuse the model activity detection result and the algorithm activity detection result in the following manner to obtain the target activity detection result corresponding to the current audio frame: using a preset calculation method to combine the second probability value in the model activity detection result corresponding to the previous audio frame with the first probability value in the algorithm activity detection result corresponding to the current audio frame to obtain a third probability value, and determining the target activity detection result corresponding to the current audio frame according to the third probability value.
  • In one embodiment, the algorithm activity detection result includes a fourth probability value of speech presence for each of a preset number of frequency points in the corresponding audio frame, and the model activity detection result includes a fifth probability value of speech presence for each of the preset number of frequency points in the corresponding audio frame;
  • the detection result fusion module 602 is also configured to fuse the model activity detection result and the algorithm activity detection result in the following manner to obtain the target activity detection result corresponding to the current audio frame:
  • for each of the preset number of frequency points, a preset calculation method is used to combine the fifth probability value of the frequency point in the model activity detection result corresponding to the previous audio frame with the fourth probability value of the same frequency point in the algorithm activity detection result corresponding to the current audio frame, to obtain a sixth probability value; the target activity detection result corresponding to the current audio frame is determined based on the preset number of sixth probability values.
  • the preset calculation method is one of taking the maximum value, taking the minimum value, calculating the average, calculating the sum, calculating the weighted sum, and calculating the weighted average.
  • In one embodiment, the model input module includes:
  • a feature extraction unit configured to extract features of a preset feature dimension from the initial noise-reduced speech to obtain a target input signal
  • a signal input unit configured to input the target input signal to the preset speech noise reduction network model, or to input the target input signal and the initial noise reduction audio frame to the preset speech noise reduction network. model to output the target noise reduction audio frame and the model activity detection result corresponding to the current audio frame.
  • Figure 7 is a structural block diagram of a model training device provided by an embodiment of the present application.
  • the device can be implemented by software and/or hardware, and can generally be integrated in electronic equipment such as model training equipment. Model training can be performed by executing a model training method. .
  • the device includes:
  • the voice detection module 701 is configured to use a preset voice activity detection algorithm to detect the current sample audio frame to be processed, to obtain the corresponding sample algorithm activity detection result, where the current sample audio frame is associated with an activity detection label and a clean audio frame;
  • the fusion module 702 is configured to fuse the sample model activity detection result corresponding to the previous sample audio frame with the sample algorithm activity detection result corresponding to the current sample audio frame, to obtain the target sample activity detection result corresponding to the current sample audio frame, where the sample model activity detection result is output by the speech noise reduction network model;
  • the noise elimination module 703 is configured to perform noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result, to obtain an initial noise-reduced sample audio frame;
  • the network model input module 704 is configured to input the initial noise reduction sample audio frame to the speech noise reduction network model to output the target sample noise reduction audio frame and the sample model activity detection result corresponding to the current sample audio frame;
  • the network model training module 705 is configured to determine a first loss relationship based on the target sample noise-reduced audio frame and the clean audio frame, and determine a second loss relationship based on the sample model activity detection result and the activity detection label, and The speech noise reduction network model is trained based on the first loss relationship and the second loss relationship.
  • the model training apparatus treats the traditional noise reduction algorithm and the speech noise reduction network model as a whole during training, which avoids the data mismatch risk of concatenating a separately trained speech noise reduction network model after the traditional noise reduction algorithm; the trained model can be used for speech noise reduction and has good noise reduction capability for various noises, improving the noise reduction effect.
  • Figure 8 is a structural block diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device 800 includes a processor 801, and a memory 802 communicatively connected to the processor 801.
  • the memory 802 stores a computer program executable by the processor 801, and the computer program is executed by the processor 801 so that the processor 801 can execute the speech noise reduction method and/or model training method described in any embodiment of the present application.
  • the number of processors may be one or more. In FIG. 8 , one processor is taken as an example.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program which, when executed by a processor, implements the speech noise reduction method and/or model training method described in any embodiment of the present application.
  • Embodiments of the present application also provide a computer program product.
  • the computer program product includes a computer program. When executed by a processor, the computer program implements the speech noise reduction method and/or model training method as provided in the embodiments of the present application.
  • the speech noise reduction apparatus, model training apparatus, electronic device, storage medium and program product provided in the above embodiments can execute the speech noise reduction method or model training method provided by the corresponding embodiments of the present application, and have the corresponding functional modules and beneficial effects for executing the method.
  • For technical details not described in detail in the above embodiments, reference may be made to the speech noise reduction method or model training method provided by any embodiment of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephone Function (AREA)

Abstract

A speech noise reduction method, a model training method, an apparatus, a device, a medium and a product. The speech noise reduction method includes: detecting a current audio frame to be processed by using a preset voice activity detection algorithm to obtain a corresponding algorithm activity detection result [101]; performing fusion processing on a model activity detection result corresponding to a previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain a target activity detection result corresponding to the current audio frame, the model activity detection result being output by a preset speech noise reduction network model [102]; performing noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame [103]; and inputting the initial noise-reduced audio frame into the preset speech noise reduction network model to output a target noise-reduced audio frame and a model activity detection result corresponding to the current audio frame [104]. By adopting the above solution, the speech noise reduction effect can be improved, and the stability and robustness of the speech noise reduction solution can be enhanced.

Description

Speech noise reduction method, model training method, apparatus, device, medium and product
This disclosure claims priority to Chinese patent application No. 202210864010.4, filed with the China Patent Office on July 21, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of audio processing, for example, to a speech noise reduction method, a model training method, an apparatus, a device, a medium and a product.
Background
With the rapid development of multimedia technology, various conference, social and entertainment applications have emerged, involving scenarios such as voice calls, audio/video live streaming and multi-party conferences, and speech quality is an important indicator of application performance.
The speech captured by the microphone of a terminal device usually carries a certain amount of noise. A speech noise reduction algorithm can suppress the noise carried in the speech, thereby improving speech intelligibility and voice quality.
At present, speech noise reduction solutions fall roughly into two categories: traditional noise reduction solutions and artificial intelligence (AI) noise reduction solutions. Traditional noise reduction solutions implement speech noise reduction through signal processing and cannot eliminate non-stationary noise, i.e., they have weak noise reduction capability for burst noise. AI noise reduction solutions have good noise reduction capability for both stationary and non-stationary noise, but they are data-driven and rely heavily on training samples; if a scenario not considered during model training (for example, a very low signal-to-noise ratio) is encountered in practical application, it may lead to unpredictable signal output or even a system crash.
Summary
Embodiments of the present application provide a speech noise reduction method, a model training method, an apparatus, a device, a medium and a product, which can effectively combine traditional noise reduction solutions and AI noise reduction solutions to improve the speech noise reduction effect.
According to one aspect of the present application, a speech noise reduction method is provided, the method including:
detecting a current audio frame to be processed by using a preset voice activity detection algorithm to obtain a corresponding algorithm activity detection result;
performing fusion processing on a model activity detection result corresponding to a previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain a target activity detection result corresponding to the current audio frame, wherein the model activity detection result is output by a preset speech noise reduction network model;
performing noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame;
inputting the initial noise-reduced audio frame into the preset speech noise reduction network model to output a target noise-reduced audio frame and a model activity detection result corresponding to the current audio frame.
According to another aspect of the present application, a model training method is provided, including:
detecting a current sample audio frame by using a preset voice activity detection algorithm to obtain a corresponding sample algorithm activity detection result, wherein the current sample audio frame is associated with an activity detection label and a clean audio frame;
performing fusion processing on a sample model activity detection result corresponding to a previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame to obtain a target sample activity detection result corresponding to the current sample audio frame, wherein the sample model activity detection result is output by a speech noise reduction network model;
performing noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result to obtain an initial noise-reduced sample audio frame;
inputting the initial noise-reduced sample audio frame into the speech noise reduction network model to output a target sample noise-reduced audio frame and a sample model activity detection result corresponding to the current sample audio frame;
determining a first loss relationship based on the target sample noise-reduced audio frame and the clean audio frame, determining a second loss relationship based on the sample model activity detection result and the activity detection label, and training the speech noise reduction network model based on the first loss relationship and the second loss relationship.
According to another aspect of the present application, a speech noise reduction apparatus is provided, the apparatus including:
a voice activity detection module configured to detect a current audio frame to be processed by using a preset voice activity detection algorithm to obtain a corresponding algorithm activity detection result;
a detection result fusion module configured to perform fusion processing on a model activity detection result corresponding to a previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain a target activity detection result corresponding to the current audio frame, wherein the model activity detection result is output by a preset speech noise reduction network model;
a noise reduction processing module configured to perform noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame;
a model input module configured to input the initial noise-reduced audio frame into the preset speech noise reduction network model to output a target noise-reduced audio frame and a model activity detection result corresponding to the current audio frame.
According to another aspect of the present application, a model training apparatus is provided, including:
a speech detection module configured to detect a current sample audio frame to be processed by using a preset voice activity detection algorithm to obtain a corresponding sample algorithm activity detection result, wherein the current sample audio frame is associated with an activity detection label and a clean audio frame;
a fusion module configured to perform fusion processing on a sample model activity detection result corresponding to a previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame to obtain a target sample activity detection result corresponding to the current sample audio frame, wherein the sample model activity detection result is output by a speech noise reduction network model;
a noise elimination module configured to perform noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result to obtain an initial noise-reduced sample audio frame;
a network model input module configured to input the initial noise-reduced sample audio frame into the speech noise reduction network model to output a target sample noise-reduced audio frame and a sample model activity detection result corresponding to the current sample audio frame;
a network model training module configured to determine a first loss relationship based on the target sample noise-reduced audio frame and the clean audio frame, determine a second loss relationship based on the sample model activity detection result and the activity detection label, and train the speech noise reduction network model based on the first loss relationship and the second loss relationship.
According to another aspect of the present application, an electronic device is provided, the electronic device including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the speech noise reduction method and/or model training method described in any embodiment of the present application.
According to another aspect of the present application, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program, the computer program being used to enable a processor, when executed, to implement the speech noise reduction method and/or model training method described in any embodiment of the present application.
According to another aspect of the present application, a computer program product is provided, the computer program product including a computer program which, when executed by a processor, implements the speech noise reduction method and/or model training method described in any embodiment of the present application.
In the speech noise reduction solution provided by the embodiments of the present application, a preset voice activity detection algorithm is used to detect a current audio frame to be processed to obtain a corresponding algorithm activity detection result; fusion processing is performed on a model activity detection result corresponding to a previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain a target activity detection result corresponding to the current audio frame, the model activity detection result being output by a preset speech noise reduction network model; noise estimation and noise elimination are performed on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame; and the initial noise-reduced audio frame is input into the preset speech noise reduction network model to output a target noise-reduced audio frame and a model activity detection result corresponding to the current audio frame. With the above solution, the preset speech noise reduction network model can output a model activity detection result. When the traditional speech noise reduction algorithm processes the current audio frame, the model activity detection result of the previous audio frame can be combined with the algorithm activity detection result obtained by the traditional speech noise reduction algorithm, so that the traditional noise reduction algorithm obtains more activity detection information and determines the voice activity detection result more reasonably and accurately. Performing noise estimation and noise elimination based on this result better protects speech and removes more noise, yielding a traditional noise reduction result with a higher signal-to-noise ratio. This traditional noise reduction result is then used as the input of the preset speech noise reduction network model to obtain a noise-reduced audio frame with a better effect, reducing the possibility that the preset speech noise reduction network model has to handle adverse data. The traditional noise reduction algorithm and the AI noise reduction method reinforce each other, providing good noise reduction capability for various kinds of noise, improving the speech noise reduction effect, and enhancing the stability and robustness of the overall speech noise reduction solution.
Brief Description of the Drawings
The following introduces the drawings required in the description of the embodiments. The drawings described below are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a speech noise reduction method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of another speech noise reduction method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of an inference flow of a speech noise reduction method provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of a model training method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a training process of a model training method provided by an embodiment of the present application;
FIG. 6 is a structural block diagram of a speech noise reduction apparatus provided by an embodiment of the present application;
FIG. 7 is a structural block diagram of a model training apparatus provided by an embodiment of the present application;
FIG. 8 is a structural block diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the solutions of the present application, the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application. The described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first", "second", etc. in the specification and claims of the present application and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the data used in this way are interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units not clearly listed or inherent to the process, method, product or device.
FIG. 1 is a schematic flowchart of a speech noise reduction method provided by an embodiment of the present application. This embodiment is applicable to performing noise reduction on speech, for example, in various scenarios such as voice calls, audio/video live streaming and multi-party conferences. The method may be executed by a speech noise reduction apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device such as a speech noise reduction device. The electronic device may be a mobile device such as a mobile phone, a smart watch, a tablet computer or a personal digital assistant, or another device such as a desktop computer. As shown in FIG. 1, the method includes:
Step 101: detect a current audio frame to be processed by using a preset voice activity detection algorithm to obtain a corresponding algorithm activity detection result.
Exemplarily, the current audio frame to be processed can be understood as the audio frame that currently requires speech noise reduction processing; the current audio frame may be contained in an audio file or an audio stream. Optionally, the current audio frame may be an original audio frame in an audio file or audio stream, or an audio frame obtained by preprocessing an original audio frame.
In the embodiments of the present application, the speech noise reduction solution as a whole can be understood as a speech noise reduction system, and the current audio frame can be understood as the input signal of the speech noise reduction system. The speech noise reduction solution may include a traditional speech noise reduction algorithm and an AI speech noise reduction model.
The traditional speech noise reduction algorithm may be, for example, the Adaptive Noise Suppression (ANS) algorithm in Web Real-Time Communication (webRTC), a linear filtering method, spectral subtraction, a statistical model algorithm or a subspace algorithm. A traditional speech noise reduction algorithm mainly comprises three parts: voice activity detection (VAD) estimation, noise estimation and noise elimination. Voice activity detection, also known as speech endpoint detection or speech boundary detection, can identify long periods of silence in a sound signal stream. The preset voice activity detection algorithm in the embodiments of the present application may be the voice activity detection algorithm of any traditional speech noise reduction algorithm.
The preset speech noise reduction network model in the present application may be an AI speech noise reduction model, such as the RNNoise model or the Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression (DTLN) noise reduction model. The preset speech noise reduction network model includes two branches: one branch for outputting the noise-reduced speech (which may be referred to as the noise reduction branch), and the other branch for outputting the voice activity detection result (which may be referred to as the detection branch). For an AI speech noise reduction model that already contains a detection branch, the original model structure can be kept; for an AI speech noise reduction model without a detection branch, a detection branch can be added on top of the backbone network, and the network structure of the detection branch may include, for example, convolutional layers and/or fully connected layers.
RNNoise is a noise reduction solution combining audio feature extraction with a deep neural network.
Exemplarily, to distinguish voice activity detection results from different sources, after the current audio frame to be processed is detected using the preset voice activity detection algorithm, the obtained detection result may be denoted as the algorithm activity detection result, and the activity detection result output by the preset speech noise reduction network model may be denoted as the model activity detection result.
Step 102: perform fusion processing on the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain a target activity detection result corresponding to the current audio frame, wherein the model activity detection result is output by a preset speech noise reduction network model.
Exemplarily, the previous audio frame can be understood as the most recent audio frame before the current audio frame, i.e., the previous audio frame precedes the current audio frame and the two have adjacent frame numbers. When the previous audio frame underwent speech noise reduction processing, the preset speech noise reduction network model may have output the noise-reduced audio frame and the model activity detection result corresponding to the previous audio frame; this model activity detection result may be cached for use in the noise reduction processing of the current audio frame.
In the embodiments of the present application, when the current audio frame is processed, the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame can be combined to determine the activity detection result (the target activity detection result) used for noise estimation and noise elimination in the traditional speech noise reduction algorithm. Compared with performing voice activity detection purely with the traditional speech noise reduction algorithm, this enables the traditional noise reduction algorithm to obtain more VAD information, thereby obtaining a more accurate noise estimate, better protecting speech, eliminating noise more accurately, and improving the output signal-to-noise ratio (SNR) of the traditional noise reduction algorithm.
Step 103: perform noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame.
Exemplarily, after the target activity detection result is obtained, the noise estimation algorithm and noise elimination algorithm in the traditional speech noise reduction algorithm can be used to process the current audio frame accordingly, and the processed audio frame is denoted as the initial noise-reduced audio frame.
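As a minimal illustration only (the text does not spell out the internals of the traditional stage, so this generic VAD-guided recursive noise estimate with a Wiener-style gain is a stand-in, not the actual ANS implementation), the step could be sketched as follows in Python:

```python
import numpy as np

def denoise_frame(spec, noise_psd, p_speech, alpha=0.9, floor=0.05):
    """Toy VAD-guided noise estimation and elimination on one frame.

    spec      : magnitude spectrum of the current frame, shape (n_bins,)
    noise_psd : running noise power estimate, shape (n_bins,)
    p_speech  : fused target activity detection result P[n], values in [0, 1]
    """
    power = spec ** 2
    # Update the noise estimate mainly in bins where speech is unlikely:
    # p_speech = 1 freezes the estimate, p_speech = 0 gives plain smoothing.
    update = alpha + (1.0 - alpha) * p_speech
    noise_psd = update * noise_psd + (1.0 - update) * power
    # Wiener-style gain with a spectral floor to limit musical noise.
    gain = np.maximum(1.0 - noise_psd / np.maximum(power, 1e-12), floor)
    return gain * spec, noise_psd
```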
Step 104: input the initial noise-reduced audio frame into the preset speech noise reduction network model to output a target noise-reduced audio frame and a model activity detection result corresponding to the current audio frame.
Exemplarily, after the initial noise-reduced audio frame is obtained, it may be used directly as the input of the preset speech noise reduction network model, or it may be transformed according to the characteristics of the preset speech noise reduction network model, for example into a signal in a preset dimension, where the preset dimension may be the frequency domain, the time domain or another dimension domain.
In the speech noise reduction method provided by the embodiments of the present application, a preset voice activity detection algorithm is used to detect a current audio frame to be processed to obtain a corresponding algorithm activity detection result; fusion processing is performed on the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain a target activity detection result corresponding to the current audio frame, the model activity detection result being output by a preset speech noise reduction network model; noise estimation and noise elimination are performed on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame; and the initial noise-reduced audio frame is input into the preset speech noise reduction network model to output a target noise-reduced audio frame and a model activity detection result corresponding to the current audio frame. With the above solution, the preset speech noise reduction network model can output a model activity detection result. When the traditional speech noise reduction algorithm processes the current audio frame, the model activity detection result of the previous audio frame can be combined with the algorithm activity detection result obtained by the traditional speech noise reduction algorithm, so that the traditional noise reduction algorithm obtains more activity detection information and determines the voice activity detection result more reasonably and accurately. Performing noise estimation and noise elimination based on this result better protects speech and removes more noise, yielding a traditional noise reduction result with a higher signal-to-noise ratio. This traditional noise reduction result is then used as the input of the preset speech noise reduction network model to obtain a noise-reduced audio frame with a better effect, reducing the possibility that the preset speech noise reduction network model has to handle adverse data. The traditional noise reduction algorithm and the AI noise reduction method reinforce each other, providing good noise reduction capability for various kinds of noise and improving the stability and robustness of the overall solution.
In the embodiments of the present application, voice activity detection may be at the frame level or at the frequency-bin level, and the detection result may be represented by one or more probability values.
In some embodiments, the algorithm activity detection result includes a first probability value that speech is present in the corresponding audio frame, and the model activity detection result includes a second probability value that speech is present in the corresponding audio frame. The performing fusion processing on the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame includes: using a preset calculation method to calculate the second probability value in the model activity detection result corresponding to the previous audio frame and the first probability value in the algorithm activity detection result corresponding to the current audio frame to obtain a third probability value, and determining the target activity detection result corresponding to the current audio frame based on the third probability value. With this arrangement, for frame-level voice activity detection, the target activity detection result can be determined accurately.
The first probability value indicates the probability, obtained after detecting the corresponding audio frame with the preset voice activity detection algorithm, that the corresponding audio frame contains speech. The corresponding audio frame here may be any audio frame, whether the current audio frame or the previous audio frame, and different audio frames may have different first probability values. The second probability value indicates the probability, output by the preset speech noise reduction network model, that the corresponding audio frame contains speech; the corresponding audio frame here may likewise be any audio frame, and different audio frames may have different second probability values.
Exemplarily, the first probability value in the algorithm activity detection result corresponding to the current audio frame may indicate the probability that the current audio frame (denoted as A) contains speech after being detected by the preset voice activity detection algorithm, and may be denoted as Pa. The second probability value in the model activity detection result corresponding to the previous audio frame may indicate the probability, predicted by the preset speech noise reduction network model when the previous audio frame (denoted as B) underwent speech noise reduction processing, that the previous audio frame contains speech, and may be denoted as Pb. The preset calculation method is applied to Pa and Pb to obtain a third probability value, denoted as Pc. Exemplarily, the third probability value may be used as the target activity detection result corresponding to the current audio frame.
Exemplarily, the preset calculation method is one of taking a maximum value, taking a minimum value, calculating an average value, summing, calculating a weighted sum, and calculating a weighted average value. Taking the maximum as an example, Pc = max(Pa, Pb).
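Purely as an illustration, a minimal sketch of this frame-level fusion might look as follows; the function name and the selection of fusion operators are hypothetical, and the text leaves the preset calculation method open to any of the listed variants:

```python
def fuse_frame_vad(p_algo: float, p_model_prev: float, mode: str = "max") -> float:
    """Fuse Pa (algorithm VAD of the current frame) with Pb (model VAD
    cached from the previous frame) into Pc."""
    if mode == "max":
        return max(p_algo, p_model_prev)
    if mode == "min":
        return min(p_algo, p_model_prev)
    if mode == "mean":
        return 0.5 * (p_algo + p_model_prev)
    raise ValueError(f"unknown fusion mode: {mode}")

# Usage: Pc = fuse_frame_vad(0.3, 0.8) -> 0.8 with the maximum variant.
```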
In some embodiments, the algorithm activity detection result includes a fourth probability value that speech is present at each of a preset number of frequency bins in the corresponding audio frame, and the model activity detection result includes a fifth probability value that speech is present at each of the preset number of frequency bins in the corresponding audio frame. The performing fusion processing on the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame includes: for each of the preset number of frequency bins, using a preset calculation method to calculate the fifth probability value of a single frequency bin in the model activity detection result corresponding to the previous audio frame and the fourth probability value of the corresponding single frequency bin in the algorithm activity detection result corresponding to the current audio frame to obtain a sixth probability value; and determining the target activity detection result corresponding to the current audio frame based on the preset number of sixth probability values. With this arrangement, using frequency-bin-level voice activity detection, the target activity detection result can be determined even more precisely.
Exemplarily, the preset number (denoted as n) can be set according to actual requirements; for example, it may be determined by the number of points used in the fast Fourier transform of the preprocessing stage, e.g., n = 256. The fourth probability value corresponding to the current audio frame may indicate, for each of the preset number of frequency bins in the current audio frame (denoted as A) after detection by the preset voice activity detection algorithm, the probability that the bin contains speech, and may be denoted as PA[n]. PA[n] can be understood as a vector of n elements, each taking a value between 0 and 1, where the value of an element indicates the probability that the corresponding frequency bin contains speech. The fifth probability value corresponding to the previous audio frame may indicate, for each of the preset number of frequency bins in the previous audio frame (denoted as B), the probability predicted by the preset speech noise reduction network model that the bin contains speech when the previous audio frame underwent speech noise reduction processing, and may be denoted as PB[n]. The preset calculation method is applied to PA[n] and PB[n] to obtain the preset number of sixth probability values, which may be denoted as PC[n]. Exemplarily, the vector containing the sixth probability values may be used as the target activity detection result corresponding to the current audio frame.
Exemplarily, the preset calculation method is one of taking a maximum value, taking a minimum value, calculating an average value, summing, calculating a weighted sum, and calculating a weighted average value. Taking the maximum as an example, PC[n] = max(PA[n], PB[n]). For example, for the first frequency bin in the current audio frame, the maximum of the corresponding fourth and fifth probability values becomes the sixth probability value corresponding to the first frequency bin in the current audio frame, and so on for the subsequent bins.
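A short sketch of the bin-level variant, again for illustration only (the bin count of 256 is just the example value named in the text):

```python
import numpy as np

N_BINS = 256  # example value of the preset number n

def fuse_bin_vad(pa: np.ndarray, pb: np.ndarray) -> np.ndarray:
    """Element-wise fusion of PA[n] (algorithm VAD of the current frame)
    and PB[n] (model VAD of the previous frame), using the maximum variant:
    PC[n] = max(PA[n], PB[n])."""
    assert pa.shape == (N_BINS,) and pb.shape == (N_BINS,)
    return np.maximum(pa, pb)
```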
In some embodiments, the inputting the initial noise-reduced audio frame into the preset speech noise reduction network model includes: performing feature extraction in a preset feature dimension on the initial noise-reduced audio frame to obtain a target input signal; and inputting the target input signal into the preset speech noise reduction network model, or inputting the target input signal and the initial noise-reduced audio frame into the preset speech noise reduction network model. With this arrangement, feature extraction can be performed in a targeted manner, improving the prediction accuracy and precision of the preset speech noise reduction network model.
Optionally, the preset feature dimension includes an explicit feature dimension, which may be a fundamental-frequency feature such as pitch, a per-channel energy normalization (PCEN) feature, a Mel-frequency cepstral coefficient (MFCC) feature, etc. The preset feature dimension may be determined according to the network structure or characteristics of the preset speech noise reduction network model.
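For illustration, a sketch of one such explicit feature extraction using the third-party librosa library (an assumption; any pitch, PCEN or MFCC implementation would serve, and which feature is used depends on the model design):

```python
import numpy as np
import librosa  # assumed available; used here only for its MFCC routine

def extract_mfcc(snippet: np.ndarray, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    """Compute MFCCs on a short time-domain snippet of the initial
    noise-reduced signal; MFCC is only one of the feature options named
    in the text (pitch, PCEN, MFCC)."""
    mfcc = librosa.feature.mfcc(y=snippet.astype(np.float32), sr=sr,
                                n_mfcc=n_mfcc, n_fft=512, hop_length=256)
    return mfcc.mean(axis=1)  # one n_mfcc-dimensional vector per snippet
```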
FIG. 2 is a schematic flowchart of another speech noise reduction method provided by an embodiment of the present application; this method is optimized on the basis of the above optional embodiments. FIG. 3 is a schematic diagram of an inference flow of a speech noise reduction method provided by an embodiment of the present application; the solution of this embodiment can be understood with reference to FIG. 2 and FIG. 3. As shown in FIG. 2, the method may include:
Step 201: obtain an original audio frame and preprocess the original audio frame to obtain a current audio frame to be processed.
Exemplarily, the original audio frame is contained in an audio file or audio stream, for example an audio stream in a voice call scenario. To ensure call quality, the call audio needs to be denoised. Preprocessing may include framing, windowing and Fourier transform, etc. The preprocessed noisy speech frame is the current audio frame to be processed and serves as the input signal of the preset traditional noise reduction algorithm (denoted as S0).
Step 202: detect the current audio frame to be processed by using the preset voice activity detection algorithm of the preset traditional noise reduction algorithm to obtain a corresponding algorithm activity detection result.
Exemplarily, the preset traditional noise reduction algorithm may be the ANS algorithm. The preset voice activity detection algorithm corresponding to the VAD estimation function module of the ANS algorithm is used to detect S0. Assuming frequency-bin-level detection, the speech presence probabilities of 256 frequency bins, Pf[256], i.e., the algorithm activity detection result corresponding to S0, can be obtained.
Step 203: determine whether a previous audio frame exists for the current audio frame; if so, execute step 204; otherwise, execute step 206.
Exemplarily, for the first audio frame, no previous audio frame exists; therefore, there is no need to obtain the model activity detection result of a previous audio frame, and step 206 is executed to perform noise estimation and noise elimination based on the algorithm activity detection result corresponding to the current audio frame.
Step 204: obtain the model activity detection result corresponding to the previous audio frame, and perform fusion processing on the obtained model activity detection result and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame.
Exemplarily, the model activity detection result corresponding to the previous audio frame is output by the AI-based preset speech noise reduction network model and may be the speech presence probabilities PF[256] of the 256 frequency bins in the previous audio frame. The fused VAD estimation result (target activity detection result) can be obtained by taking the maximum: P[256] = max(Pf[256], PF[256]).
Step 205: based on the target activity detection result, perform noise estimation and noise elimination on the current audio frame by using the preset traditional noise reduction algorithm to obtain an initial noise-reduced audio frame, and execute step 207.
Exemplarily, the preset traditional noise reduction algorithm performs noise estimation and noise elimination according to P[256] to obtain the speech signal S1 processed by traditional noise reduction, i.e., the initial noise-reduced audio frame.
Step 206: based on the algorithm activity detection result corresponding to the current audio frame, perform noise estimation and noise elimination on the current audio frame by using the preset traditional noise reduction algorithm to obtain an initial noise-reduced audio frame.
Exemplarily, the preset traditional noise reduction algorithm performs noise estimation and noise elimination according to Pf[256] to obtain the speech signal S1 processed by traditional noise reduction, i.e., the initial noise-reduced audio frame.
Step 207: perform feature extraction in the preset feature dimension on the initial noise-reduced speech to obtain a target input signal.
Exemplarily, S1, as the input signal of the preset speech noise reduction network model, may be a frequency-domain, time-domain or other-dimension signal. Depending on the design of the preset speech noise reduction network model, there may be an explicit feature extraction computation, such as pitch features; the extracted feature information is denoted as the target input signal S2.
Step 208: input the target input signal and/or the initial noise-reduced audio frame into the preset speech noise reduction network model to output a target noise-reduced audio frame and a model activity detection result corresponding to the current audio frame.
Optionally, S1 or S2, or both S1 and S2, may be used as model input and fed into the preset speech noise reduction network model for inference computation to obtain the output signal. The output signal contains two parts: the first part is the final noise-reduced speech output S3 of the speech noise reduction method, and the second part is the model's VAD output PF[256], to be used by the traditional speech noise reduction algorithm when processing the next audio frame.
Step 209: determine whether there is still an original audio frame to be processed; if so, return to step 201; otherwise, end the flow.
Exemplarily, if the voice call ends and all original audio frames have been denoised, the flow can end; if there are still original audio frames that have not been denoised, return to step 201 to continue the noise reduction processing.
In the speech noise reduction method provided by the embodiments of the present application, the AI-based preset speech noise reduction network model feeds information back to the traditional noise reduction algorithm, so that the traditional noise reduction algorithm can obtain more VAD information. The VAD estimation of both traditional noise reduction and AI noise reduction is performed at the frequency-bin level, yielding a more precise noise estimate, so that the traditional noise reduction algorithm can better protect speech and eliminate more noise, improving the output signal-to-noise ratio of traditional noise reduction. After feature extraction, the high-SNR initial noise-reduced speech signal enriches the input of the preset speech noise reduction network model, reducing the possibility that the preset speech noise reduction network model has to handle adverse data while improving the model's speech noise reduction effect and the overall noise reduction performance.
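To make the per-frame loop of FIG. 2/FIG. 3 concrete, here is a minimal end-to-end sketch under stated assumptions: `traditional_vad`, `traditional_denoise` and `nn_model` are placeholder callables standing in for the ANS-style VAD/suppression stages and the noise reduction network, and their interfaces are hypothetical:

```python
import numpy as np

def denoise_stream(frames, traditional_vad, traditional_denoise, nn_model):
    """Per-frame inference loop: fuse VADs, run traditional noise
    reduction, then the network, and cache the model VAD for the next frame."""
    pf_prev = None                           # PF[256] from the previous frame
    for s0 in frames:                        # s0: preprocessed noisy frame
        pf_algo = traditional_vad(s0)        # Pf[256], algorithm VAD
        if pf_prev is None:                  # first frame: no model VAD yet
            p = pf_algo
        else:
            p = np.maximum(pf_algo, pf_prev)  # fused target VAD P[256]
        s1 = traditional_denoise(s0, p)      # traditional stage output S1
        s3, pf_prev = nn_model(s1)           # final output S3 and model VAD
        yield s3
```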
FIG. 4 is a schematic flowchart of a model training method provided by an embodiment of the present application, and FIG. 5 is a schematic diagram of a training process of a model training method provided by an embodiment of the present application; the embodiments of the present application can be understood with reference to FIG. 4 and FIG. 5. This embodiment is applicable to training an AI-based speech noise reduction network model, and the model is applicable to various scenarios such as voice calls, audio/video live streaming and multi-party conferences. The method may be executed by a model training apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device such as a model training device. The electronic device may be a mobile device such as a mobile phone, a smart watch, a tablet computer or a personal digital assistant, or another device such as a desktop computer. The speech noise reduction network model trained according to the embodiments of the present application can be applied to the speech noise reduction method provided by any embodiment of the present application.
As shown in FIG. 4, the method includes:
Step 401: detect a current sample audio frame by using a preset voice activity detection algorithm to obtain a corresponding sample algorithm activity detection result, wherein the current sample audio frame is associated with an activity detection label and a clean audio frame.
Exemplarily, a clean speech data set and a noise data set may be mixed into noisy speech data according to a preset mixing rule, which may be set based on, for example, the signal-to-noise ratio or the room impulse response (RIR). Optionally, the mixed noisy speech data set and the clean speech data set together serve as the training set of the model. The current sample audio frame may be an audio frame in the training set. The current sample audio frame may carry an activity detection label, which may be added by manual annotation. Taking the frame level as an example, the label may be 1 if the frame contains speech and 0 if it does not; taking the frequency-bin level as an example, the label may be a vector containing the preset number of elements, each element taking the value 1 or 0: 1 if the corresponding frequency bin contains speech, and 0 if it does not.
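A sketch of the SNR-based mixing rule mentioned above (the SNR range and the optional RIR convolution are design choices the text leaves open; only the SNR-based variant is shown, and the function name is illustrative):

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a clean utterance with noise scaled to a target SNR in dB."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(p_clean / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```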
Step 402: perform fusion processing on the sample model activity detection result corresponding to the previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame to obtain the target sample activity detection result corresponding to the current sample audio frame, wherein the sample model activity detection result is output by the speech noise reduction network model.
Exemplarily, the activity detection result fusion process in this step may be similar to the fusion process in the speech noise reduction method provided by the embodiments of the present application, e.g., frequency-bin-level fusion or frame-level fusion, and a similar preset calculation method may be used to fuse the corresponding probability values; for details, refer to the relevant content herein, which will not be repeated here.
Step 403: perform noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result to obtain an initial noise-reduced sample audio frame.
Step 404: input the initial noise-reduced sample audio frame into the speech noise reduction network model to output a target sample noise-reduced audio frame and a sample model activity detection result corresponding to the current sample audio frame.
Step 405: determine a first loss relationship based on the target sample noise-reduced audio frame and the clean audio frame, determine a second loss relationship based on the sample model activity detection result and the activity detection label, and train the speech noise reduction network model based on the first loss relationship and the second loss relationship.
Exemplarily, a loss relationship can be used to characterize the difference between two kinds of data and may be represented by a loss value, computed, for example, by a loss function. The first loss relationship characterizes the difference between the target sample noise-reduced audio frame and the clean audio frame, and the second loss relationship characterizes the difference between the sample model activity detection result and the activity detection label. The function types of the first loss function used to compute the first loss relationship and the second loss function used to compute the second loss relationship can be set according to actual requirements.
Exemplarily, a target loss relationship may be computed based on the first loss relationship and the second loss relationship, for example by weighted summation.
Exemplarily, the speech noise reduction network model is trained according to the target loss relationship. During training, the goal may be to minimize the target loss relationship, and training techniques such as backpropagation are used to continuously optimize the weight parameter values in the speech noise reduction network model until a preset training stop condition is met. The training stop condition can be set according to actual requirements, for example based on the number of iterations, the degree of loss convergence, or model accuracy.
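As one possible reading of this training step, the following PyTorch sketch assumes an L1 loss for the first loss relationship, binary cross-entropy for the second, and a weighted sum for the target loss; the text leaves all three choices to the implementer, and the model is assumed to return its two branch outputs with the VAD branch already passed through a sigmoid:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, s1, clean, vad_label, lam=0.5):
    """One joint-training step under assumed loss choices.

    s1        : initial noise-reduced sample frames from the traditional stage
    clean     : associated clean frames
    vad_label : frame- or bin-level activity labels in {0, 1}
    lam       : assumed weight of the VAD loss in the target loss
    """
    denoised, vad_prob = model(s1)                 # two output branches
    loss_denoise = F.l1_loss(denoised, clean)      # first loss relationship
    loss_vad = F.binary_cross_entropy(vad_prob, vad_label)  # second loss
    loss = loss_denoise + lam * loss_vad           # weighted target loss
    optimizer.zero_grad()
    loss.backward()                                # backpropagation
    optimizer.step()
    return loss.item()
```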
In the model training method provided by the embodiments of the present application, the traditional noise reduction algorithm and the speech noise reduction network model are treated as a whole during training, which avoids the data mismatch risk that arises when a traditional noise reduction algorithm is cascaded with a separately trained speech noise reduction network model. The trained model can be used for speech noise reduction, has good noise reduction capability for various kinds of noise, and improves the noise reduction effect.
Optionally, the sample algorithm activity detection result includes a first sample probability value that speech is present in the corresponding sample audio frame, and the sample model activity detection result includes a second sample probability value that speech is present in the corresponding sample audio frame;
wherein the performing fusion processing on the sample model activity detection result corresponding to the previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame to obtain the target sample activity detection result corresponding to the current sample audio frame includes: using a preset calculation method to calculate the second sample probability value in the sample model activity detection result corresponding to the previous sample audio frame and the first sample probability value in the sample algorithm activity detection result corresponding to the current sample audio frame to obtain a third sample probability value, and determining the target sample activity detection result corresponding to the current sample audio frame based on the third sample probability value.
Optionally, the sample algorithm activity detection result includes a fourth sample probability value that speech is present at each of a preset number of frequency bins in the corresponding audio frame, and the model activity detection result includes a fifth sample probability value that speech is present at each of the preset number of frequency bins in the corresponding audio frame;
wherein the performing fusion processing on the sample model activity detection result corresponding to the previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame to obtain the target sample activity detection result corresponding to the current sample audio frame includes: for each of the preset number of frequency bins, using a preset calculation method to calculate the fifth sample probability value of a single frequency bin in the sample model activity detection result corresponding to the previous sample audio frame and the fourth sample probability value of the corresponding single frequency bin in the sample algorithm activity detection result corresponding to the current sample audio frame to obtain a sixth sample probability value; and determining the target sample activity detection result corresponding to the current sample audio frame based on the preset number of sixth sample probability values.
Optionally, the inputting the initial noise-reduced sample audio frame into the speech noise reduction network model includes: performing feature extraction in a preset feature dimension on the initial noise-reduced sample audio frame to obtain a target input signal; and inputting the target input signal into the speech noise reduction network model, or inputting the target input signal and the initial noise-reduced sample audio frame into the speech noise reduction network model.
FIG. 6 is a structural block diagram of a speech noise reduction apparatus provided by an embodiment of the present application. The apparatus may be implemented in software and/or hardware, can generally be integrated in an electronic device such as a speech noise reduction device, and can perform speech noise reduction by executing the speech noise reduction method. As shown in FIG. 6, the apparatus includes:
a voice activity detection module 601 configured to detect a current audio frame to be processed by using a preset voice activity detection algorithm to obtain a corresponding algorithm activity detection result;
a detection result fusion module 602 configured to perform fusion processing on the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain a target activity detection result corresponding to the current audio frame, wherein the model activity detection result is output by a preset speech noise reduction network model;
a noise reduction processing module 603 configured to perform noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame;
a model input module 604 configured to input the initial noise-reduced audio frame into the preset speech noise reduction network model to output a target noise-reduced audio frame and a model activity detection result corresponding to the current audio frame.
In the speech noise reduction apparatus provided by the embodiments of the present application, a preset voice activity detection algorithm is used to detect a current audio frame to be processed to obtain a corresponding algorithm activity detection result; fusion processing is performed on the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain a target activity detection result corresponding to the current audio frame, the model activity detection result being output by a preset speech noise reduction network model; noise estimation and noise elimination are performed on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame; and the initial noise-reduced audio frame is input into the preset speech noise reduction network model to output a target noise-reduced audio frame and a model activity detection result corresponding to the current audio frame. With the above solution, the preset speech noise reduction network model can output a model activity detection result. When the traditional speech noise reduction algorithm processes the current audio frame, the model activity detection result of the previous audio frame can be combined with the algorithm activity detection result obtained by the traditional speech noise reduction algorithm, so that the traditional noise reduction algorithm obtains more activity detection information and determines the voice activity detection result more reasonably and accurately. Performing noise estimation and noise elimination based on this result better protects speech and removes more noise, yielding a traditional noise reduction result with a higher signal-to-noise ratio. This traditional noise reduction result is then used as the input of the preset speech noise reduction network model to obtain a noise-reduced audio frame with a better effect, reducing the possibility that the preset speech noise reduction network model has to handle adverse data. The traditional noise reduction algorithm and the AI noise reduction method reinforce each other, providing good noise reduction capability for various kinds of noise and improving the stability and robustness of the overall solution.
Optionally, the algorithm activity detection result includes a first probability value that speech is present in the corresponding audio frame, and the model activity detection result includes a second probability value that speech is present in the corresponding audio frame;
wherein the detection result fusion module 602 is configured to perform fusion processing on the model activity detection result and the algorithm activity detection result in the following manner to obtain the target activity detection result corresponding to the current audio frame:
using a preset calculation method, calculating the second probability value in the model activity detection result corresponding to the previous audio frame and the first probability value in the algorithm activity detection result corresponding to the current audio frame to obtain a third probability value, and determining the target activity detection result corresponding to the current audio frame based on the third probability value.
Optionally, the algorithm activity detection result includes a fourth probability value that speech is present at each of a preset number of frequency bins in the corresponding audio frame, and the model activity detection result includes a fifth probability value that speech is present at each of the preset number of frequency bins in the corresponding audio frame;
wherein the detection result fusion module 602 is further configured to perform fusion processing on the model activity detection result and the algorithm activity detection result in the following manner to obtain the target activity detection result corresponding to the current audio frame:
for each of the preset number of frequency bins, using a preset calculation method, calculating the fifth probability value of a single frequency bin in the model activity detection result corresponding to the previous audio frame and the fourth probability value of the corresponding single frequency bin in the algorithm activity detection result corresponding to the current audio frame to obtain a sixth probability value; and determining the target activity detection result corresponding to the current audio frame based on the preset number of sixth probability values.
Optionally, the preset calculation method is one of taking a maximum value, taking a minimum value, calculating an average value, summing, calculating a weighted sum, and calculating a weighted average value.
Optionally, the model input module includes:
a feature extraction unit configured to perform feature extraction in a preset feature dimension on the initial noise-reduced speech to obtain a target input signal;
a signal input unit configured to input the target input signal into the preset speech noise reduction network model, or to input the target input signal and the initial noise-reduced audio frame into the preset speech noise reduction network model, to output a target noise-reduced audio frame and a model activity detection result corresponding to the current audio frame.
FIG. 7 is a structural block diagram of a model training apparatus provided by an embodiment of the present application. The apparatus may be implemented in software and/or hardware, can generally be integrated in an electronic device such as a model training device, and can perform model training by executing the model training method. As shown in FIG. 7, the apparatus includes:
a speech detection module 701 configured to detect a current sample audio frame to be processed by using a preset voice activity detection algorithm to obtain a corresponding sample algorithm activity detection result, wherein the current sample audio frame is associated with an activity detection label and a clean audio frame;
a fusion module 702 configured to perform fusion processing on the sample model activity detection result corresponding to the previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame to obtain the target sample activity detection result corresponding to the current sample audio frame, wherein the sample model activity detection result is output by the speech noise reduction network model;
a noise elimination module 703 configured to perform noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result to obtain an initial noise-reduced sample audio frame;
a network model input module 704 configured to input the initial noise-reduced sample audio frame into the speech noise reduction network model to output a target sample noise-reduced audio frame and the sample model activity detection result corresponding to the current sample audio frame;
a network model training module 705 configured to determine a first loss relationship based on the target sample noise-reduced audio frame and the clean audio frame, determine a second loss relationship based on the sample model activity detection result and the activity detection label, and train the speech noise reduction network model based on the first loss relationship and the second loss relationship.
In the model training apparatus provided by the embodiments of the present application, the traditional noise reduction algorithm and the speech noise reduction network model are treated as a whole during training, which avoids the data mismatch risk that arises when a traditional noise reduction algorithm is cascaded with a separately trained speech noise reduction network model. The trained model can be used for speech noise reduction, has good noise reduction capability for various kinds of noise, and improves the noise reduction effect.
An embodiment of the present application provides an electronic device in which the speech noise reduction apparatus and/or model training apparatus provided by the embodiments of the present application can be integrated. FIG. 8 is a structural block diagram of an electronic device provided by an embodiment of the present application. The electronic device 800 includes a processor 801 and a memory 802 communicatively connected to the processor 801, wherein the memory 802 stores a computer program executable by the processor 801, and the computer program is executed by the processor 801 so that the processor 801 can execute the speech noise reduction method and/or model training method described in any embodiment of the present application. The number of processors may be one or more; one processor is taken as an example in FIG. 8.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program, the computer program being used to enable a processor, when executed, to implement the speech noise reduction method and/or model training method described in any embodiment of the present application.
Embodiments of the present application also provide a computer program product including a computer program which, when executed by a processor, implements the speech noise reduction method and/or model training method provided by the embodiments of the present application.
The speech noise reduction apparatus, model training apparatus, electronic device, storage medium and product provided in the above embodiments can execute the speech noise reduction method or model training method provided by the corresponding embodiments of the present application, and have functional modules for executing the method and corresponding beneficial effects. For technical details not described in detail in the above embodiments, refer to the speech noise reduction method or model training method provided by any embodiment of the present application.

Claims (11)

  1. A speech noise reduction method, comprising:
    detecting a current audio frame to be processed by using a preset voice activity detection algorithm to obtain a corresponding algorithm activity detection result;
    performing fusion processing on a model activity detection result corresponding to a previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain a target activity detection result corresponding to the current audio frame, wherein the model activity detection result is output by a preset speech noise reduction network model;
    performing noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame;
    inputting the initial noise-reduced audio frame into the preset speech noise reduction network model to output a target noise-reduced audio frame and a model activity detection result corresponding to the current audio frame.
  2. The method according to claim 1, wherein the algorithm activity detection result comprises a first probability value that speech is present in the corresponding audio frame, and the model activity detection result comprises a second probability value that speech is present in the corresponding audio frame;
    the performing fusion processing on the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame comprises:
    using a preset calculation method, calculating the second probability value in the model activity detection result corresponding to the previous audio frame and the first probability value in the algorithm activity detection result corresponding to the current audio frame to obtain a third probability value, and determining the target activity detection result corresponding to the current audio frame based on the third probability value.
  3. The method according to claim 1, wherein the algorithm activity detection result comprises a fourth probability value that speech is present at each of a preset number of frequency bins in the corresponding audio frame, and the model activity detection result comprises a fifth probability value that speech is present at each of the preset number of frequency bins in the corresponding audio frame;
    the performing fusion processing on the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame comprises:
    for each of the preset number of frequency bins, using a preset calculation method, calculating the fifth probability value of a single frequency bin in the model activity detection result corresponding to the previous audio frame and the fourth probability value of the corresponding single frequency bin in the algorithm activity detection result corresponding to the current audio frame to obtain a sixth probability value;
    determining the target activity detection result corresponding to the current audio frame based on the preset number of sixth probability values.
  4. The method according to claim 2 or 3, wherein the preset calculation method is one of taking a maximum value, taking a minimum value, calculating an average value, summing, calculating a weighted sum, and calculating a weighted average value.
  5. The method according to claim 1, wherein the inputting the initial noise-reduced audio frame into the preset speech noise reduction network model comprises:
    performing feature extraction in a preset feature dimension on the initial noise-reduced audio frame to obtain a target input signal;
    inputting the target input signal into the preset speech noise reduction network model, or inputting the target input signal and the initial noise-reduced audio frame into the preset speech noise reduction network model.
  6. A model training method, comprising:
    detecting a current sample audio frame by using a preset voice activity detection algorithm to obtain a corresponding sample algorithm activity detection result, wherein the current sample audio frame is associated with an activity detection label and a clean audio frame;
    performing fusion processing on a sample model activity detection result corresponding to a previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame to obtain a target sample activity detection result corresponding to the current sample audio frame, wherein the sample model activity detection result is output by a speech noise reduction network model;
    performing noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result to obtain an initial noise-reduced sample audio frame;
    inputting the initial noise-reduced sample audio frame into the speech noise reduction network model to output a target sample noise-reduced audio frame and a sample model activity detection result corresponding to the current sample audio frame;
    determining a first loss relationship based on the target sample noise-reduced audio frame and the clean audio frame, determining a second loss relationship based on the sample model activity detection result and the activity detection label, and training the speech noise reduction network model based on the first loss relationship and the second loss relationship.
  7. A speech noise reduction apparatus, comprising:
    a voice activity detection module configured to detect a current audio frame to be processed by using a preset voice activity detection algorithm to obtain a corresponding algorithm activity detection result;
    a detection result fusion module configured to perform fusion processing on a model activity detection result corresponding to a previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain a target activity detection result corresponding to the current audio frame, wherein the model activity detection result is output by a preset speech noise reduction network model;
    a noise reduction processing module configured to perform noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame;
    a model input module configured to input the initial noise-reduced audio frame into the preset speech noise reduction network model to output a target noise-reduced audio frame and a model activity detection result corresponding to the current audio frame.
  8. A model training apparatus, comprising:
    a speech detection module configured to detect a current sample audio frame to be processed by using a preset voice activity detection algorithm to obtain a corresponding sample algorithm activity detection result, wherein the current sample audio frame is associated with an activity detection label and a clean audio frame;
    a fusion module configured to perform fusion processing on a sample model activity detection result corresponding to a previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame to obtain a target sample activity detection result corresponding to the current sample audio frame, wherein the sample model activity detection result is output by a speech noise reduction network model;
    a noise elimination module configured to perform noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result to obtain an initial noise-reduced sample audio frame;
    a network model input module configured to input the initial noise-reduced sample audio frame into the speech noise reduction network model to output a target sample noise-reduced audio frame and the sample model activity detection result corresponding to the current sample audio frame;
    a network model training module configured to determine a first loss relationship based on the target sample noise-reduced audio frame and the clean audio frame, determine a second loss relationship based on the sample model activity detection result and the activity detection label, and train the speech noise reduction network model based on the first loss relationship and the second loss relationship.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to execute the speech noise reduction method according to any one of claims 1-5 and/or the model training method according to claim 6.
  10. A computer-readable storage medium storing a computer program, wherein the computer program is used to enable a processor, when executed, to implement the speech noise reduction method according to any one of claims 1-5 and/or the model training method according to claim 6.
  11. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the speech noise reduction method according to any one of claims 1-5 and/or the model training method according to claim 6.
PCT/CN2023/106951 2022-07-21 2023-07-12 Speech noise reduction method, model training method, apparatus, device, medium and product WO2024017110A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210864010.4 2022-07-21
CN202210864010.4A CN115273880A (zh) 2022-07-21 2022-07-21 Speech noise reduction method, model training method, apparatus, device, medium and product

Publications (1)

Publication Number Publication Date
WO2024017110A1 true WO2024017110A1 (zh) 2024-01-25

Family

ID=83767239

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/106951 WO2024017110A1 (zh) Speech noise reduction method, model training method, apparatus, device, medium and product 2022-07-21 2023-07-12

Country Status (2)

Country Link
CN (1) CN115273880A (zh)
WO (1) WO2024017110A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273880A (zh) 2022-07-21 2022-11-01 百果园技术(新加坡)有限公司 Speech noise reduction method, model training method, apparatus, device, medium and product

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017218386A1 (en) * 2016-06-13 2017-12-21 Med-El Elektromedizinische Geraete Gmbh Recursive noise power estimation with noise model adaptation
US20200286501A1 (en) * 2017-10-12 2020-09-10 Huawei Technologies Co., Ltd. Apparatus and a method for signal enhancement
CN108428456A (zh) * 2018-03-29 2018-08-21 浙江凯池电子科技有限公司 语音降噪算法
CN114255778A (zh) * 2021-12-21 2022-03-29 广州欢城文化传媒有限公司 一种音频流降噪方法、装置、设备及存储介质
CN114495969A (zh) * 2022-01-20 2022-05-13 南京烽火天地通信科技有限公司 一种融合语音增强的语音识别方法
CN114596870A (zh) * 2022-03-07 2022-06-07 广州博冠信息科技有限公司 实时音频处理方法和装置、计算机存储介质、电子设备
CN115273880A (zh) * 2022-07-21 2022-11-01 百果园技术(新加坡)有限公司 语音降噪方法、模型训练方法、装置、设备、介质及产品

Also Published As

Publication number Publication date
CN115273880A (zh) 2022-11-01

Similar Documents

Publication Publication Date Title
Li et al. Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement
CN107393550B (zh) Speech processing method and apparatus
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
CN111785288B (zh) Speech enhancement method, apparatus, device and storage medium
CN112004177B (zh) Howling detection method, microphone volume adjustment method and storage medium
US20190132452A1 (en) Acoustic echo cancellation based sub band domain active speaker detection for audio and video conferencing applications
WO2024017110A1 (zh) Speech noise reduction method, model training method, apparatus, device, medium and product
WO2022141868A1 (zh) Method, apparatus, terminal and storage medium for extracting speech features
CN112053702B (zh) Speech processing method and apparatus, and electronic device
CN112949708A (zh) Emotion recognition method and apparatus, computer device and storage medium
CN112602150A (zh) Noise estimation method, noise estimation apparatus, speech processing chip and electronic device
CN112309417A (zh) Audio signal processing method, apparatus and system for wind noise suppression, and readable medium
Hidayat et al. A Modified MFCC for Improved Wavelet-Based Denoising on Robust Speech Recognition.
Bonet et al. Speech enhancement for wake-up-word detection in voice assistants
CN112289337A (zh) Method and apparatus for filtering out residual noise after machine-learning speech enhancement
Zhou et al. Speech Enhancement via Residual Dense Generative Adversarial Network.
CN116312561A (zh) Voiceprint recognition, authentication, noise reduction and speech enhancement method, system and apparatus for power dispatching system personnel
WO2020107455A1 (zh) Speech processing method and apparatus, storage medium and electronic device
CN114333912B (zh) Voice activity detection method and apparatus, electronic device and storage medium
CN115440240A (zh) Training method for speech noise reduction, speech noise reduction system and speech noise reduction method
CN115083440A (zh) Audio signal noise reduction method, electronic device and storage medium
CN111048096B (zh) Speech signal processing method and apparatus, and terminal
CN114743571A (zh) Audio processing method and apparatus, storage medium and electronic device
JP2013235050A (ja) Information processing apparatus and method, and program
Ram et al. Enhancement of speech using deep neural network with discrete cosine transform

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23842175

Country of ref document: EP

Kind code of ref document: A1