WO2016192410A1 - Audio signal enhancement method and apparatus - Google Patents

Audio signal enhancement method and apparatus

Info

Publication number
WO2016192410A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
frame
frames
noise
spectral envelope
Prior art date
Application number
PCT/CN2016/073792
Other languages
English (en)
French (fr)
Inventor
夏丙寅
周璇
苗磊
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2016192410A1 publication Critical patent/WO2016192410A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • The present invention relates to the field of communications, and in particular, to an audio signal enhancement method and apparatus.
  • In communication systems, audio signals are often subject to noise, resulting in degradation of audio signal quality.
  • Audio enhancement technology is mainly used to extract a signal that is as clean as possible from a noise-contaminated audio signal, to improve audio signal quality.
  • Because terminal devices are limited in computing capability, storage space, and cost, a network device is often used to enhance the audio signal.
  • This involves completely decoding, enhancing, and re-encoding the audio signal. Since the audio signal must be completely decoded before the decoded data is processed, the computational complexity and additional delay of the current audio signal enhancement process are relatively high.
  • Embodiments of the present invention provide an audio signal enhancement method and apparatus, which can reduce computational complexity and additional delay in an enhancement process of an audio signal.
  • an embodiment of the present invention provides an audio signal enhancement method, including:
  • Quantizing the pure estimate to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replacing, with the quantization index, the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame.
  • the method further includes:
  • Counting, in N frames of the audio signal that include the audio signal frame, the number of frames of each noise type included in the N frames, and selecting the noise type with the largest number of frames as the noise type included in the audio signal, where N is an integer greater than or equal to 1.
  • The performing noise classification on the audio signal frame by using the spectral envelope parameter, to obtain the noise type of the audio signal frame, includes:
  • obtaining, from the bit stream of the input audio signal, a codebook gain parameter corresponding to the audio signal frame, calculating, by using the codebook gain parameter and the spectral envelope parameter, a posterior probability of the audio signal frame for each of M preset noise models, and selecting the noise model with the largest posterior probability among the M noise models as the noise type of the audio signal frame, where M is an integer greater than or equal to 1.
  • the method further includes:
  • The performing enhancement processing on the spectral envelope parameter of the to-be-enhanced frame by using the neural network, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame, includes:
  • The counting, in N frames of the audio signal that include the audio signal frame, of the number of frames of each noise type included in the N frames, and the selecting of the noise type with the largest number of frames as the noise type included in the audio signal, includes:
  • the method further includes:
  • Counting, within the consecutive multiple frames, the number of frames of each noise type included in the consecutive multiple frames, and selecting the noise type with the largest number of frames as the current noise type of the audio signal.
  • the spectral envelope parameter of the to-be-enhanced frame of the audio signal is enhanced using a neural network pre-set for the current noise type of the audio signal to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
  • The neural network includes: a recurrent deep neural network.
  • the present invention provides an audio signal enhancement apparatus, including: a decoding unit, an enhancement unit, and a replacement unit, where:
  • the decoding unit is configured to decode a bit stream of the input audio signal, and acquire a spectral envelope parameter of the to-be-enhanced frame of the audio signal;
  • the enhancement unit is configured to perform enhancement processing on a spectral envelope parameter of the to-be-enhanced frame of the audio signal by using a neural network preset for a noise type included in the audio signal, to acquire a spectrum of the to-be-enhanced frame a pure estimate of the envelope parameter;
  • The replacement unit is configured to quantize the pure estimate, obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replace, with the quantization index, the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame.
  • the decoding unit is further configured to: decode a bit stream of the input audio signal, and acquire a spectral envelope parameter of the audio signal frame of the audio signal;
  • the device also includes:
  • a classifying unit configured to perform noise classification on the audio signal frame by using the spectral envelope parameter to obtain a noise type of the audio signal frame
  • a statistics unit, configured to count, in N frames of the audio signal that include the audio signal frame, the number of frames of each noise type included in the N frames, and select the noise type with the largest number of frames as the noise type included in the audio signal, where N is an integer greater than or equal to 1;
  • The classification unit is configured to obtain, from the bit stream of the input audio signal, a codebook gain parameter corresponding to the audio signal frame, calculate, by using the codebook gain parameter and the spectral envelope parameter, a posterior probability of the audio signal frame for each of M preset noise models, and select the noise model with the largest posterior probability among the M noise models as the noise type of the audio signal frame.
  • the apparatus further includes:
  • an adjustment unit, configured to jointly adjust the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame, and separately quantize the jointly adjusted adaptive codebook gain and algebraic codebook gain, to obtain a quantization index of the jointly adjusted adaptive codebook gain and a quantization index of the jointly adjusted algebraic codebook gain of the to-be-enhanced frame, where the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame are obtained by performing a decoding operation on the to-be-enhanced frame;
  • the replacement unit is further configured to replace, with the quantization index of the jointly adjusted adaptive codebook gain of the to-be-enhanced frame, the bits corresponding to the adaptive codebook gain of the to-be-enhanced frame, and replace, with the quantization index of the jointly adjusted algebraic codebook gain of the to-be-enhanced frame, the bits corresponding to the algebraic codebook gain of the to-be-enhanced frame.
  • the enhancing unit includes:
  • a first calculation unit, configured to calculate the mean of the spectral envelope parameters of the to-be-enhanced frame and several frames of the audio signal, where the several frames are frames in the audio signal preceding the to-be-enhanced frame;
  • a second calculation unit, configured to calculate the de-averaged spectral envelope parameter of the to-be-enhanced frame, where the de-averaged spectral envelope parameter is the difference between the spectral envelope parameter of the to-be-enhanced frame and the mean;
  • a third calculation unit, configured to perform enhancement processing on the de-averaged spectral envelope parameter by using a neural network preset for the noise type of the audio signal, to obtain a pure estimate of the de-averaged spectral envelope parameter;
  • a fourth calculation unit, configured to add the pure estimate of the de-averaged spectral envelope parameter to a pre-acquired mean of clean audio spectral envelope parameters, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame (a Python sketch follows this list).
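  • As a minimal illustration of how these four calculation units chain together, the following Python sketch assumes ISF vectors stored as NumPy arrays; `neural_net` (the noise-type-specific network) and `clean_isf_mean` (the pre-acquired clean-envelope mean) are hypothetical inputs, not part of the patent text.

```python
import numpy as np

def enhance_envelope(isf_frame, isf_history, neural_net, clean_isf_mean):
    """Sketch of the first-to-fourth calculation units for one frame."""
    # First unit: mean over the to-be-enhanced frame and preceding frames.
    noisy_mean = np.mean(np.vstack([isf_history, isf_frame]), axis=0)
    # Second unit: de-averaged spectral envelope parameter.
    x_noisy = isf_frame - noisy_mean
    # Third unit: enhancement with the noise-type-specific neural network.
    x_clean_hat = neural_net(x_noisy)
    # Fourth unit: add back the pre-acquired clean-envelope mean.
    return x_clean_hat + clean_isf_mean
```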
  • The statistics unit is configured to count, in N frames of a beginning segment of the audio signal that include the audio signal frame, the number of frames of each noise type included in the N frames, and select the noise type with the largest number of frames as the noise type included in the audio signal; or
  • the statistics unit is configured to count, in N frames of the audio signal that include the audio signal frame and contain no speech signal, the number of frames of each noise type included in the N frames, and select the noise type with the largest number of frames as the noise type included in the audio signal.
  • The statistics unit is further configured to: when it is detected that the noise type of consecutive multiple frames in the audio signal differs from the previously determined noise type included in the audio signal, count, within the consecutive multiple frames, the number of frames of each noise type included in the consecutive multiple frames, and select the noise type with the largest number of frames as the current noise type of the audio signal;
  • the enhancement unit is configured to perform enhancement processing on a spectral envelope parameter of the to-be-enhanced frame of the audio signal by using a neural network preset for a current noise type of the audio signal, to obtain a spectral envelope of the to-be-enhanced frame A pure estimate of the parameter.
  • The neural network includes: a recurrent deep neural network.
  • In summary: decoding a bit stream of the input audio signal to obtain a spectral envelope parameter of the to-be-enhanced frame of the audio signal; performing enhancement processing on the spectral envelope parameter of the to-be-enhanced frame by using the neural network preset for the noise type included in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame; and quantizing the pure estimate to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replacing, with the quantization index, the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame.
  • In this way, only the bits corresponding to the spectral envelope parameter of the audio signal frame need to be decoded, that is, only partial decoding is performed, thereby reducing the computational complexity and additional delay in the enhancement process of the audio signal, as sketched below.
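  • For illustration only, the following Python sketch mirrors this partial-decoding flow at a high level. Every codec helper used here (extract_envelope_bits, dequantize_envelope, quantize_envelope, replace_envelope_bits) is a hypothetical placeholder standing in for the codec-specific bit layout, not a function of any real codec API.

```python
def enhance_bitstream_frame(frame_bits, neural_net, codec):
    """Toy sketch of the partial-decoding enhancement loop for one frame."""
    # 1. Partial decoding: extract and dequantize only the envelope bits.
    envelope_bits = codec.extract_envelope_bits(frame_bits)
    isf = codec.dequantize_envelope(envelope_bits)
    # 2. Enhance the envelope with the network preset for the noise type.
    isf_clean = neural_net(isf)
    # 3. Re-quantize the pure estimate and splice its quantization index
    #    back over the original envelope bits; all other bits pass through.
    new_bits = codec.quantize_envelope(isf_clean)
    return codec.replace_envelope_bits(frame_bits, new_bits)
```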
  • FIG. 1 is a schematic flowchart of an audio signal enhancement method according to an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of another audio signal enhancement method according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of an RDNN model according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of another RDNN model provided by an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a GMM model according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of another audio signal enhancement method according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of an audio signal enhancement apparatus according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of another audio signal enhancement apparatus according to an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of another audio signal enhancement apparatus according to an embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of another audio signal enhancement apparatus according to an embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of another audio signal enhancement apparatus according to an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of an audio signal enhancement method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
  • the to-be-enhanced frame can be understood as the current frame of the audio signal, that is, the audio signal frame currently input in the audio signal. Additionally, the above inputs may be understood as inputs to the method, or inputs to the apparatus performing the method.
  • Step 101 can also be understood as decoding only the bits corresponding to the spectral envelope parameter in the to-be-enhanced frame, where those bits are the bits in the bit stream of the audio signal frame that encode the spectral envelope parameter.
  • The spectral envelope parameters may include: line spectral frequencies (LSF), immittance spectral frequencies (ISF), or linear prediction coefficients (LPC).
  • The audio signal may be any audio signal whose bit stream includes a spectral envelope parameter, such as a voice signal or a music signal.
  • A plurality of neural networks may be preset, each corresponding to one noise type, so that once the noise type of the audio signal is determined, the neural network corresponding to that noise type can be selected for enhancement processing.
  • the type of noise included in the audio signal may be obtained before the decoding of the to-be-enhanced frame, for example, by using the noise type statistics of several frames of the initial segment of the audio signal.
  • Alternatively, the noise type included in the audio signal may be obtained from noise type statistics of several frames adjacent to the to-be-enhanced frame.
  • the type of noise included in the audio signal may be confirmed according to the source of the audio signal.
  • For a call voice signal, the noise type may be confirmed according to the geographic locations of the two parties, the duration of the call, or the noise type of historical voice signals.
  • For example, if the geographic location shows that one party is at a construction site, it can be determined that the noise type of the current voice signal is the noise type corresponding to that site; or, if the noise type of nine out of the last ten voice signals output by the user during calls was noise type A, it is determined from this history that the noise type included in the voice signal output by the user in the next call is noise type A.
  • When the to-be-enhanced frame is decoded, only the spectral envelope parameter of the to-be-enhanced frame may be obtained, without decoding the other parameters of the to-be-enhanced frame; in step 103, the quantization index then replaces the bits corresponding to the spectral envelope parameter in the bit stream of the to-be-enhanced frame, so that the bit stream of the enhanced frame is obtained.
  • the foregoing method can be applied to any smart device having a decoding and computing function, such as a server, a network side device, a personal computer (PC), a notebook computer, a mobile phone, a tablet computer, and the like.
  • In the foregoing technical solution, the bit stream of the input audio signal is decoded to obtain the spectral envelope parameter of the to-be-enhanced frame of the audio signal; enhancement processing is performed on the spectral envelope parameter of the to-be-enhanced frame by using a neural network preset for the noise type included in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame; and the pure estimate is quantized to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and the quantization index replaces the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame.
  • FIG. 2 is a schematic flowchart of another audio signal enhancement method according to an embodiment of the present invention. As shown in FIG. 2, the method includes the following steps:
  • Step 202 may include: calculating the mean of the spectral envelope parameters of the to-be-enhanced frame and several preceding frames; calculating the de-averaged spectral envelope parameter of the to-be-enhanced frame; performing enhancement processing on the de-averaged spectral envelope parameter by using the preset neural network, to obtain a pure estimate of the de-averaged spectral envelope parameter; and adding that pure estimate to the pre-acquired mean of clean audio spectral envelope parameters.
  • The neural network may be a recurrent deep neural network (RDNN) or another type of neural network.
  • Because time-domain recursive connections exist in the RDNN, the temporal smoothness of the spectral envelope adjustment result can be effectively improved, which improves audio signal quality.
  • The RDNN-based spectral envelope parameter adjustment method can also avoid the instability of the LPC filter adjusted by existing methods, thereby improving the robustness of the algorithm.
  • In addition, the RDNN-based spectral envelope estimation method has low computational complexity, which can effectively improve the running speed.
  • The RDNN can be as shown in FIG. 3, where the symbols of the RDNN model in FIG. 3 are explained as follows:
  • x_noisy(m) represents the de-averaged spectral envelope parameter (for example, the de-averaged ISF feature of the noisy speech);
  • x̂_clean(m) represents the pure estimate of the de-averaged spectral envelope parameter (for example, the estimate of the clean-speech de-averaged ISF feature);
  • h_1, h_2, h_3 are the hidden layer states;
  • W_1, W_2, W_3, W_4 are the weight matrices between layers;
  • b_1, b_2, b_3, b_4 are the bias vectors of each layer;
  • U is the recursive connection matrix;
  • m is the frame index.
  • The mappings between the layers of the RDNN model shown in FIG. 3 are as follows:
  • input layer to hidden layer 1: h_1(m) = σ(W_1 · x_noisy(m) + b_1)
  • hidden layer 1 to hidden layer 2: h_2(m) = σ(W_2 · h_1(m) + b_2)
  • hidden layer 2 to hidden layer 3: h_3(m) = σ(W_3 · (h_2(m) + U · h_2(m-1)) + b_3)
  • hidden layer 3 to output layer: x̂_clean(m) = W_4 · h_3(m) + b_4
  • where σ(·) is the sigmoid activation function (a NumPy sketch follows).
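  • A minimal NumPy sketch of the FIG. 3 forward pass is given below. The layer dimensions are assumed consistent, and the linear output layer follows the reconstructed mapping above; this is an illustrative sketch, not the patent's reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rdnn_fig3_forward(x_noisy_seq, W, b, U):
    """Run the FIG. 3 RDNN over a sequence of de-averaged ISF frames.
    W = (W1, W2, W3, W4), b = (b1, b2, b3, b4), U = recursive matrix."""
    W1, W2, W3, W4 = W
    b1, b2, b3, b4 = b
    h2_prev = np.zeros_like(b2)          # h2(m-1); zero for the first frame
    outputs = []
    for x_m in x_noisy_seq:              # x_m: de-averaged ISF of frame m
        h1 = sigmoid(W1 @ x_m + b1)
        h2 = sigmoid(W2 @ h1 + b2)
        # Time-domain recursive connection on hidden layer 2:
        h3 = sigmoid(W3 @ (h2 + U @ h2_prev) + b3)
        outputs.append(W4 @ h3 + b4)     # pure estimate of de-averaged ISF
        h2_prev = h2
    return np.asarray(outputs)
```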
  • The RDNN may also be as shown in FIG. 4, where the symbols of the RDNN model in FIG. 4 are explained as follows:
  • x_noisy(m) represents the de-averaged spectral envelope parameter (for example, the de-averaged ISF feature of the noisy speech);
  • x̂_clean(m) represents the pure estimate of the de-averaged spectral envelope parameter (for example, the estimate of the clean-speech de-averaged ISF feature);
  • h_1, h_2, h_3 are the hidden layer states;
  • W_1, W_2, W_3, W_4 are the weight matrices between layers;
  • b_1, b_2, b_3, b_4 are the bias vectors of each layer;
  • U_1, U_2, U_3 are the recursive connection matrices;
  • m is the frame index.
  • The mappings between the layers of the RDNN model shown in FIG. 4 are as follows:
  • input layer to hidden layer 1: h_1(m) = σ(W_1 · x_noisy(m) + b_1)
  • hidden layer 1 to hidden layer 2: h_2(m) = σ(W_2 · (h_1(m) + U_1 · h_1(m-1)) + b_2)
  • hidden layer 2 to hidden layer 3: h_3(m) = σ(W_3 · (h_2(m) + U_2 · h_2(m-1)) + b_3)
  • hidden layer 3 to output layer: x̂_clean(m) = W_4 · (h_3(m) + U_3 · h_3(m-1)) + b_4
  • Compared with the model of FIG. 3, this model structure adds recursive connections at hidden layer 1 and hidden layer 3. More recursive connections help the RDNN model capture the temporal correlation of the spectral envelope of the speech signal.
  • The RDNN models may be pre-acquired, for example, received in advance from user input or from another device.
  • The RDNN model can also be obtained by pre-training.
  • The following takes ISF parameters and voice signals as an example.
  • To train the RDNN model, features of noisy speech are used as the model input, and features of clean speech are used as the target output of the model.
  • The clean-speech and noisy-speech features need to be paired: after a feature is extracted from a clean speech signal, noise is added to the signal, and the noisy speech feature is then extracted to form a pair of training features.
  • The input feature of the RDNN model is the de-averaged ISF feature of the noisy speech signal, obtained as follows:
  • x_noisy(m) = ISF_noisy(m) - ISF_mean_noisy
  • where ISF_noisy(m) is the ISF feature of the m-th frame, and ISF_mean_noisy is the mean of the noisy-speech ISF parameters, calculated over all noisy-speech ISF parameters under a given noise condition in the training database.
  • The target output of the RDNN model is the de-averaged ISF parameter of the clean speech signal, obtained as follows:
  • x_clean(m) = ISF_clean(m) - ISF_mean_clean
  • where ISF_clean(m) is the clean-speech ISF parameter, and ISF_mean_clean is the mean of the clean-speech ISF parameters, obtained from the ISF parameters of all clean speech signals in the training database.
  • For training, this embodiment adopts an objective function in the form of a weighted mean square error, which can be written as:
  • L_w = Σ_m Σ_d F_w(d) · (x̂_clean,d(m) - x_clean,d(m))²
  • where F_w is the weight function. The weighted objective function L_w takes into account that the reconstruction errors of the different dimensions of the ISF feature affect speech quality differently, and assigns a different weight to the reconstruction error of each ISF dimension (see the sketch below).
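  • The feature preparation and the weighted objective can be sketched in Python as follows; `f_w` stands for the per-dimension weight vector F_w, whose exact form is not given in this excerpt.

```python
import numpy as np

def demean_features(isf_noisy, isf_clean):
    """Paired training features: each mean is taken over the training
    database, as described above (per noise condition for the noisy set)."""
    x_noisy = isf_noisy - isf_noisy.mean(axis=0)  # ISF_noisy(m) - ISF_mean_noisy
    x_clean = isf_clean - isf_clean.mean(axis=0)  # ISF_clean(m) - ISF_mean_clean
    return x_noisy, x_clean

def weighted_mse(x_clean_hat, x_clean, f_w):
    """Weighted mean-square-error objective L_w summed over all frames."""
    err = x_clean_hat - x_clean
    return float(np.sum(f_w * err ** 2))
```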
  • one RDNN model can be trained for each pre-selected noise type by the above training method.
  • The RDNN model used in this embodiment is not limited to three hidden layers; the number of hidden layers may be increased or decreased as needed.
  • the foregoing method may further include the following steps:
  • step 201 may include:
  • The bit stream of the input audio signal is decoded, and the spectral envelope parameter, the adaptive codebook gain, and the algebraic codebook gain of the to-be-enhanced frame of the audio signal are obtained.
  • That is, step 201 decodes the bits corresponding to the spectral envelope parameter, the adaptive codebook gain, and the algebraic codebook gain of the to-be-enhanced frame.
  • The adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame are then jointly adjusted.
  • The joint adjustment can be performed by using an energy conservation criterion.
  • For convenience, the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame can be defined as the first adaptive codebook gain and the first algebraic codebook gain, respectively, and the jointly adjusted adaptive codebook gain and algebraic codebook gain of the to-be-enhanced frame can be defined as the second adaptive codebook gain and the second algebraic codebook gain, respectively. The specific adjustment process can be as follows:
  • the second adaptive codebook gain is determined based on the first adaptive codebook gain and the second algebraic codebook gain.
  • The step of adjusting the first algebraic codebook gain to obtain the second algebraic codebook gain may include:
  • when the to-be-enhanced frame is the first type of subframe, acquiring the second algebraic codebook vector and the second adaptive codebook vector of the to-be-enhanced frame (a toy sketch of an energy-conserving joint adjustment follows).
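  • The exact joint-adjustment formulas are not reproduced in this excerpt. Purely as an illustration of an energy-conservation-style adjustment, the toy Python sketch below attenuates the algebraic codebook gain by a hypothetical factor `alpha` and rescales the adaptive codebook gain so that the excitation energy of the subframe (ignoring the cross term) is preserved.

```python
import numpy as np

def joint_adjust_gains(g_p1, g_c1, adaptive_vec, code_vec, alpha):
    """Toy joint adjustment of (g_p1, g_c1) -> (g_p2, g_c2)."""
    e_v = float(np.dot(adaptive_vec, adaptive_vec))   # adaptive vector energy
    e_c = float(np.dot(code_vec, code_vec))           # algebraic vector energy
    e_total = g_p1 ** 2 * e_v + g_c1 ** 2 * e_c       # cross term ignored
    g_c2 = alpha * g_c1                               # second algebraic gain
    e_left = max(e_total - g_c2 ** 2 * e_c, 0.0)
    g_p2 = float(np.sqrt(e_left / e_v))               # second adaptive gain
    return g_p2, g_c2
```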
  • The execution order of steps 204 and 205 is not limited in this embodiment.
  • For example, steps 205 and 203 may be performed together or separately, or step 204 may be performed before step 203.
  • Counting, in N frames of the audio signal that include the audio signal frame, the number of frames of each noise type included in the N frames, and selecting the noise type with the largest number of frames as the noise type included in the audio signal, where N is an integer greater than or equal to 1.
  • The audio signal frame may be understood as any frame in the audio signal, or as the current frame; alternatively, it may be understood that a partial decoding operation is performed on each frame of the audio signal.
  • Specifically, noise classification may be performed on the spectral envelope parameter, and the noise type of the spectral envelope parameter is then used as the noise type included in the audio signal frame.
  • The above step counts the number of frames for each noise type, and thereby selects the noise type with the largest number of frames as the noise type of the audio signal.
  • The N frames may be partial frames in the audio signal; for example, the N frames are the beginning segment of the audio signal, or the frames between the T-th frame and the (N+T)-th frame in the audio signal, where T can be set by the user.
  • The decoding of the audio signal frames may be performed for every frame, while the noise classification may be performed for every frame or only for some frames.
  • the step of selecting the noise type of the audio signal may be performed only once, or periodically according to time, and the like.
  • For example, once the noise type of the audio signal has been selected, the noise type of the audio signal may always be considered to be the selected noise type during the processing of the audio signal; or the selected noise type may be used as the noise type for a specific period in the processing of the audio signal; or, after the noise type of the audio signal is selected, the noise type of each frame continues to be identified, and when the noise type of several consecutive frames is recognized to be different from the previously selected noise type, the audio signal can be classified again (see the sketch below).
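  • The counting and re-classification policy described above can be sketched as follows; the threshold `min_run` for "consecutive multiple frames" is a hypothetical choice, since the text does not fix a number.

```python
from collections import Counter

def select_noise_type(frame_noise_types):
    """Majority vote over the per-frame noise types of N frames."""
    return Counter(frame_noise_types).most_common(1)[0][0]

def maybe_reclassify(current_type, recent_types, min_run=20):
    """Re-select the noise type when the last `min_run` consecutive
    frames all disagree with the previously selected type."""
    recent = recent_types[-min_run:]
    if len(recent) == min_run and all(t != current_type for t in recent):
        return select_noise_type(recent)
    return current_type
```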
  • the step of performing noise classification on the audio signal frame by using the spectral envelope parameter to obtain a noise type of the audio signal frame may include:
  • obtaining, from the bit stream of the input audio signal, a codebook gain parameter corresponding to the audio signal frame, calculating, by using the codebook gain parameter and the spectral envelope parameter, a posterior probability of the audio signal frame for each of M preset noise models, and selecting the noise model with the largest posterior probability among the M noise models as the noise type of the audio signal frame, where M is an integer greater than or equal to 1.
  • The noise model may be a Gaussian mixture model (GMM).
  • the RDNN model corresponding to the current noise environment can be selected when the spectral envelope parameter is adjusted, which helps to improve the adaptability of the algorithm to the complex noise environment.
  • The codebook gain parameter may include a long-term average of the adaptive codebook gain and a variance of the algebraic codebook gain.
  • The long-term average of the adaptive codebook gain may be calculated from the adaptive codebook gains of the current frame and the L-1 frames preceding it, for example as:
  • ḡ_p(m) = (1/L) · Σ_{i=0}^{L-1} g_p(m-i)
  • where g_p(m-i) represents the adaptive codebook gain of the (m-i)-th frame, and L is an integer greater than 1.
  • The variance of the algebraic codebook gain may be calculated from the algebraic codebook gains of the current frame and the L-1 frames preceding it, for example as:
  • σ_c²(m) = (1/L) · Σ_{i=0}^{L-1} (g_c(m-i) - ḡ_c(m))², where ḡ_c(m) is the mean of g_c over the same L frames,
  • and g_c(m-i) represents the algebraic codebook gain of the (m-i)-th frame, as transcribed below.
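  • A direct NumPy transcription of these two statistics, assuming per-frame gain sequences `g_p` and `g_c` and a window covering the current frame m plus the preceding L-1 frames (so m >= L-1):

```python
import numpy as np

def codebook_gain_features(g_p, g_c, m, L):
    """Long-term average of the adaptive codebook gain and variance of
    the algebraic codebook gain over frames m-L+1 .. m."""
    window_p = g_p[m - L + 1 : m + 1]
    window_c = g_c[m - L + 1 : m + 1]
    gp_longterm = float(np.mean(window_p))   # (1/L) * sum g_p(m-i)
    gc_variance = float(np.var(window_c))    # (1/L) * sum (g_c - mean)^2
    return gp_longterm, gc_variance
```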
  • The GMMs of the various noise types in the noise pool may be acquired in advance, for example, received in advance from the user or from another device, or trained in advance for each noise type.
  • The feature vector used in GMM training is composed of the ISF parameters, the long-term average of the adaptive codebook gain, and the variance of the algebraic codebook gain; the feature dimension can be 18, as shown in FIG. 5.
  • The expectation-maximization (EM) algorithm can be used to train a separate GMM for each noise type in the noise database (the number of noise types being M), as sketched below.
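  • The training and classification step can be sketched with scikit-learn's EM-based GaussianMixture: the mixture order `n_components` is an assumption (the excerpt does not state it), and with equal priors the maximum posterior reduces to the maximum per-model log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_noise_gmms(features_by_type, n_components=8):
    """Fit one GMM per noise type with EM.
    features_by_type: {noise_type: (num_frames, 18) feature matrix}."""
    return {t: GaussianMixture(n_components=n_components).fit(X)
            for t, X in features_by_type.items()}

def classify_frame(gmms, feature_vec):
    """Select the noise model with the largest posterior probability."""
    scores = {t: g.score_samples(np.asarray(feature_vec).reshape(1, -1))[0]
              for t, g in gmms.items()}
    return max(scores, key=scores.get)
```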
  • Optionally, the counting, in N frames of the audio signal that include the audio signal frame, of the number of frames of each noise type included in the N frames, and the selecting of the noise type with the largest number of frames,
  • may include: counting, in N frames of a beginning segment of the audio signal that include the audio signal frame, the number of frames of each noise type included in the N frames, and selecting the noise type with the largest number of frames as the noise type included in the audio signal.
  • This embodiment determines the noise type of the audio signal by using the frames of the beginning segment of the audio signal, so that subsequent frames can be directly enhanced by using the neural network corresponding to that noise type.
  • Alternatively, the counting and selecting
  • may include: counting, in N frames of the audio signal that include the audio signal frame and contain no speech signal, the number of frames of each noise type included in the N frames, and selecting the noise type with the largest number of frames as the noise type included in the audio signal.
  • This embodiment determines the noise type of the audio signal by using N frames in which no speech signal is present; because an audio signal frame without speech reflects the noise type more readily than a frame containing speech, using such frames makes it easier to analyze and determine the noise type of the audio signal.
  • This embodiment can use voice activity detection (VAD) to determine whether speech exists in the current frame, so that the statistics can be performed on frames for which VAD determines that no speech is present. Alternatively, when the encoder has enabled the discontinuous transmission (DTX) mode, the VAD information in the code stream can be used to determine whether speech exists; if the encoder has not enabled the DTX mode, the ISF parameters and the codebook gain parameters can be used as features to judge whether speech exists.
  • When it is detected that the noise type of consecutive multiple frames in the audio signal differs from the previously determined noise type, the number of frames of each noise type included in the consecutive multiple frames is counted within those frames, and the noise type with the largest number of frames is selected as the current noise type of the audio signal;
  • the spectral envelope parameter of the to-be-enhanced frame of the audio signal is then enhanced by using a neural network preset for the current noise type of the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
  • This embodiment enables timely adjustment of the noise type of the audio signal, because an audio signal often includes multiple audio signal frames, possibly of different noise types; the above steps ensure that the neural network corresponding to the currently correct noise type is used for enhancement in time, improving the quality of the audio signal.
  • FIG. 6 is a schematic diagram of another audio signal enhancement method according to an embodiment of the present invention.
  • an ISF parameter is used as an example. As shown in FIG. 6, the following steps are included:
  • The background noise is classified by a Gaussian mixture model (GMM).
  • The codebook gain related parameters may include the average of the adaptive codebook gain and the variance of the algebraic codebook gain.
  • A recurrent deep neural network (RDNN) model is introduced to adjust the spectral envelope parameters (such as ISF parameters) of the noisy speech; because time-domain recursive connections exist in the model, the time-domain smoothness of the spectral envelope parameter adjustment result can be effectively improved, improving voice quality.
  • the spectral envelope parameter adjustment method based on RDNN can avoid the problem of unstable LPC filter in the existing method and improve the robustness of the algorithm.
  • the RDNN model corresponding to the current noise environment can be selected during spectral envelope adjustment, which helps to improve the adaptability of the algorithm to complex noise environments.
  • the spectral envelope estimation method based on RDNN has low computational complexity and can effectively improve the running speed.
  • The following apparatus embodiments of the present invention are used to perform the methods of Embodiments 1 and 2 of the present invention. For ease of description, only the parts related to the embodiments of the present invention are shown; for specific technical details not disclosed, refer to Embodiment 1 and Embodiment 2 of the present invention.
  • FIG. 7 is a schematic structural diagram of an audio signal enhancement apparatus according to an embodiment of the present invention. As shown in FIG. 7, the apparatus includes: a decoding unit 71, an enhancement unit 72, and a replacement unit 73, where:
  • the decoding unit 71 is configured to decode a bit stream of the input audio signal, to obtain the spectral envelope parameter of the to-be-enhanced frame of the audio signal.
  • the to-be-enhanced frame can be understood as the current frame of the audio signal, that is, the audio signal frame currently input in the audio signal. Additionally, the above inputs may be understood as inputs to the method, or inputs to the apparatus performing the method.
  • The decoding unit 71 may also be understood to decode only the bits corresponding to the spectral envelope parameter in the to-be-enhanced frame, where those bits are the bits in the bit stream of the audio signal frame that encode the spectral envelope parameter.
  • The spectral envelope parameters may include: line spectral frequencies (LSF), immittance spectral frequencies (ISF), or linear prediction coefficients (LPC).
  • The audio signal may be any audio signal whose bit stream includes a spectral envelope parameter, such as a voice signal or a music signal.
  • The enhancement unit 72 is configured to perform enhancement processing on the spectral envelope parameter of the to-be-enhanced frame of the audio signal by using a neural network preset for the noise type included in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
  • A plurality of neural networks may be preset, each corresponding to one noise type, so that once the noise type of the audio signal is determined, the neural network corresponding to that noise type can be selected for enhancement processing.
  • the type of noise included in the audio signal may be obtained before the decoding of the to-be-enhanced frame, for example, by using the noise type statistics of several frames of the initial segment of the audio signal.
  • Alternatively, the noise type included in the audio signal may be obtained from noise type statistics of several frames adjacent to the to-be-enhanced frame.
  • the type of noise included in the audio signal may be confirmed according to the source of the audio signal.
  • For a call voice signal, the noise type may be confirmed according to the geographic locations of the two parties, the duration of the call, or the noise type of historical voice signals.
  • For example, if the geographic location shows that one party is at a construction site, it can be determined that the noise type of the current voice signal is the noise type corresponding to that site; or, if the noise type of nine out of the last ten voice signals output by the user during calls was noise type A, it is determined from this history that the noise type included in the voice signal output by the user in the next call is noise type A.
  • The replacement unit 73 is configured to quantize the pure estimate, obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replace, with the quantization index, the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame.
  • When the to-be-enhanced frame is decoded, only the spectral envelope parameter of the to-be-enhanced frame may be obtained, without decoding the other parameters of the to-be-enhanced frame; the quantization index then replaces the bits corresponding to the spectral envelope parameter in the bit stream of the to-be-enhanced frame, so that the bit stream of the enhanced frame is obtained.
  • the foregoing apparatus can be applied to any smart device having a decoding and computing function, such as a server, a network side device, a personal computer (PC), a notebook computer, a mobile phone, a tablet computer, and the like.
  • In the foregoing apparatus, the bit stream of the input audio signal is decoded to obtain the spectral envelope parameter of the to-be-enhanced frame of the audio signal; enhancement processing is performed on the spectral envelope parameter of the to-be-enhanced frame by using a neural network preset for the noise type included in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame; and the pure estimate is quantized to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and the quantization index replaces the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame.
  • FIG. 8 is a schematic structural diagram of another audio signal enhancement apparatus according to an embodiment of the present invention. As shown in FIG. 8, the apparatus includes: a decoding unit 81, an enhancement unit 82, and a replacement unit 83, where:
  • the decoding unit 81 is configured to decode a bit stream of the input audio signal, and acquire a spectral envelope parameter of the to-be-enhanced frame of the audio signal.
  • the enhancement unit 82 is configured to perform enhancement processing on the spectral envelope parameter of the to-be-enhanced frame of the audio signal by using a neural network preset for the noise type included in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
  • the enhancement unit 82 may include:
  • a first calculation unit 821, configured to calculate the mean of the spectral envelope parameters of the to-be-enhanced frame and several frames of the audio signal, where the several frames are frames in the audio signal preceding the to-be-enhanced frame;
  • a second calculation unit 822, configured to calculate the de-averaged spectral envelope parameter of the to-be-enhanced frame, where the de-averaged spectral envelope parameter is the difference between the spectral envelope parameter of the to-be-enhanced frame and the mean;
  • a third calculation unit 823, configured to perform enhancement processing on the de-averaged spectral envelope parameter by using a neural network preset for the noise type included in the audio signal, to obtain a pure estimate of the de-averaged spectral envelope parameter;
  • a fourth calculation unit 824, configured to add the pure estimate of the de-averaged spectral envelope parameter to a pre-acquired mean of clean audio spectral envelope parameters, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
  • The neural network may be a recurrent deep neural network (RDNN) or another type of neural network.
  • Because time-domain recursive connections exist in the RDNN, the temporal smoothness of the spectral envelope adjustment result can be effectively improved, which improves audio signal quality.
  • The RDNN-based spectral envelope parameter adjustment method can also avoid the instability of the LPC filter adjusted by existing methods, thereby improving the robustness of the algorithm.
  • In addition, the RDNN-based spectral envelope estimation method has low computational complexity, which can effectively improve the running speed.
  • The replacement unit 83 is configured to quantize the pure estimate, obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replace, with the quantization index, the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame.
  • the foregoing apparatus may further include:
  • The adjustment unit 84 is configured to jointly adjust the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame, and separately quantize the jointly adjusted adaptive codebook gain and algebraic codebook gain, to obtain a quantization index of the jointly adjusted adaptive codebook gain and a quantization index of the jointly adjusted algebraic codebook gain of the to-be-enhanced frame, where the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame are obtained by performing a decoding operation on the to-be-enhanced frame;
  • the replacement unit 83 may be further configured to replace, with the quantization index of the jointly adjusted adaptive codebook gain of the to-be-enhanced frame, the bits corresponding to the adaptive codebook gain of the to-be-enhanced frame, and to replace, with the quantization index of the jointly adjusted algebraic codebook gain of the to-be-enhanced frame, the bits corresponding to the algebraic codebook gain of the to-be-enhanced frame.
  • The adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame may be obtained by performing a decoding operation on the to-be-enhanced frame.
  • Specifically, the decoding unit 81 may be configured to decode the bit stream of the input audio signal, and obtain the spectral envelope parameter, the adaptive codebook gain, and the algebraic codebook gain of the to-be-enhanced frame of the audio signal by performing a decoding operation on the to-be-enhanced frame.
  • That is, the decoding unit 81 decodes the bits corresponding to the spectral envelope parameter, the adaptive codebook gain, and the algebraic codebook gain of the to-be-enhanced frame.
  • The joint adjustment of the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame may use an energy conservation criterion. For example, the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame may be defined as the first adaptive codebook gain and the first algebraic codebook gain, respectively, and the jointly adjusted adaptive codebook gain and algebraic codebook gain of the to-be-enhanced frame may be defined as the second adaptive codebook gain and the second algebraic codebook gain, respectively. The specific adjustment process can be as follows:
  • the second adaptive codebook gain is determined based on the first adaptive codebook gain and the second algebraic codebook gain.
  • This embodiment can enhance the spectral envelope parameter, the adaptive codebook gain, and the algebraic codebook gain of the to-be-enhanced frame.
  • the decoding unit 81 may be further configured to decode a bit stream of the input audio signal, and acquire a spectral envelope parameter of the audio signal frame of the audio signal;
  • the apparatus may further include:
  • a classifying unit 85 configured to perform noise classification on the audio signal frame by using the spectral envelope parameter to obtain a noise type of the audio signal frame;
  • the statistics unit 86 is configured to count, in N frames of the audio signal that include the audio signal frame, the number of frames of each noise type included in the N frames, and select the noise type with the largest number of frames as the noise type included in the audio signal, where N is an integer greater than or equal to 1.
  • The audio signal frame may be understood as any frame in the audio signal, or as the current frame; alternatively, it may be understood that a partial decoding operation is performed on each frame of the audio signal.
  • Specifically, noise classification may be performed on the spectral envelope parameter, and the noise type of the spectral envelope parameter is then used as the noise type included in the audio signal frame.
  • The above configuration counts the number of frames for each noise type, and thereby selects the noise type with the largest number of frames as the noise type of the audio signal.
  • The N frames may be partial frames in the audio signal; for example, the N frames are the beginning segment of the audio signal, or the frames between the T-th frame and the (N+T)-th frame in the audio signal, where T can be set by the user.
  • The decoding of the audio signal frames may be performed for every frame, while the noise classification may be performed for every frame or only for some frames.
  • the step of selecting the noise type of the audio signal may be performed only once, or periodically according to time, and the like.
  • For example, once the noise type of the audio signal has been selected, the noise type of the audio signal may always be considered to be the selected noise type during the processing of the audio signal; or the selected noise type may be used as the noise type for a specific period in the processing of the audio signal; or, after the noise type of the audio signal is selected, the noise type of each frame continues to be identified, and when the noise type of several consecutive frames is recognized to be different from the previously selected noise type, the audio signal can be classified again.
  • the classification unit 85 may be configured to obtain a codebook gain parameter corresponding to the audio signal frame from a bit stream of the input audio signal, and calculate the location by using the codebook gain parameter and the spectral envelope parameter. The posterior probability of the audio signal frame for each of the preset M noise models is selected, and the noise model with the largest posterior probability among the M noise models is selected as the noise type of the audio signal frame.
  • The noise model may be a Gaussian mixture model (GMM).
  • the RDNN model corresponding to the current noise environment can be selected when the spectral envelope parameter is adjusted, which helps to improve the adaptability of the algorithm to the complex noise environment.
  • The codebook gain parameter may include a long-term average of the adaptive codebook gain and a variance of the algebraic codebook gain.
  • The long-term average of the adaptive codebook gain may be calculated from the adaptive codebook gains of the current frame and the L-1 frames preceding it, for example as:
  • ḡ_p(m) = (1/L) · Σ_{i=0}^{L-1} g_p(m-i)
  • where g_p(m-i) represents the adaptive codebook gain of the (m-i)-th frame, and L is an integer greater than 1.
  • The variance of the algebraic codebook gain may be calculated from the algebraic codebook gains of the current frame and the L-1 frames preceding it, for example as:
  • σ_c²(m) = (1/L) · Σ_{i=0}^{L-1} (g_c(m-i) - ḡ_c(m))², where ḡ_c(m) is the mean of g_c over the same L frames,
  • and g_c(m-i) represents the algebraic codebook gain of the (m-i)-th frame.
  • The GMMs of the various noise types in the noise pool may be acquired in advance, for example, received in advance from the user or from another device, or trained in advance for each noise type.
  • The feature vector used in GMM training is composed of the ISF parameters, the long-term average of the adaptive codebook gain, and the variance of the algebraic codebook gain; the feature dimension is 18, as shown in FIG. 5.
  • The expectation-maximization (EM) algorithm can be used to train a separate GMM for each noise type in the noise database (the number of noise types being M).
  • The statistics unit 86 may be configured to count, in N frames of a beginning segment of the audio signal that include the audio signal frame, the number of frames of each noise type included in the N frames, and select the noise type with the largest number of frames as the noise type included in the audio signal.
  • This embodiment determines the noise type of the audio signal by using the frames of the beginning segment of the audio signal, so that subsequent frames can be directly enhanced by using the neural network corresponding to that noise type.
  • Alternatively, the statistics unit 86 may be configured to count, in N frames of the audio signal that include the audio signal frame and contain no speech signal, the number of frames of each noise type included in the N frames, and select the noise type with the largest number of frames as the noise type included in the audio signal.
  • This embodiment determines the noise type of the audio signal by using N frames in which no speech signal is present; because an audio signal frame without speech reflects the noise type more readily than a frame containing speech, using such frames makes it easier to analyze and determine the noise type of the audio signal.
  • This embodiment can use voice activity detection (VAD) to determine whether speech exists in the current frame, so that the statistics can be performed on frames for which VAD determines that no speech is present. Alternatively, when the encoder has enabled the discontinuous transmission (DTX) mode, the VAD information in the code stream can be used to determine whether speech exists; if the encoder has not enabled the DTX mode, the ISF parameters and the codebook gain parameters can be used as features to judge whether speech exists.
  • The statistics unit 86 is further configured to: when it is detected that the noise type of consecutive multiple frames in the audio signal differs from the previously determined noise type included in the audio signal, count, within the consecutive multiple frames, the number of frames of each noise type included in the consecutive multiple frames, and select the noise type with the largest number of frames as the current noise type of the audio signal;
  • the enhancement unit 82 may then be configured to perform enhancement processing on the spectral envelope parameter of the to-be-enhanced frame of the audio signal by using a neural network preset for the current noise type of the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
  • This embodiment enables timely adjustment of the noise type of the audio signal, because an audio signal often includes multiple audio signal frames, possibly of different noise types; the above configuration ensures that the neural network corresponding to the currently correct noise type is used for enhancement in time, improving the quality of the audio signal.
  • FIG. 11 is a schematic structural diagram of another audio signal enhancement apparatus according to an embodiment of the present invention.
  • As shown in FIG. 11, the apparatus includes a processor 111, a network interface 112, a memory 113, and a communication bus 114.
  • The communication bus 114 is configured to implement connection and communication among the processor 111, the network interface 112, and the memory 113, and the processor 111 executes a program stored in the memory 113 to implement the following method:
  • decoding a bit stream of the input audio signal to obtain a spectral envelope parameter of the to-be-enhanced frame of the audio signal; performing enhancement processing on the spectral envelope parameter of the to-be-enhanced frame of the audio signal by using a neural network preset for the noise type included in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame; and
  • quantizing the pure estimate to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replacing, with the quantization index, the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame.
  • the step performed by the processor 111 may further include:
  • counting, in N frames of the audio signal that include the audio signal frame, the number of frames of each noise type included in the N frames, and selecting the noise type with the largest number of frames as the noise type included in the audio signal, where N is an integer greater than or equal to 1.
  • the step of performing noise classification on the audio signal frame by using the spectral envelope parameter to obtain the noise type of the audio signal frame may include:
  • obtaining, from the bit stream of the input audio signal, a codebook gain parameter corresponding to the audio signal frame, calculating, by using the codebook gain parameter and the spectral envelope parameter, a posterior probability of the audio signal frame for each of M preset noise models, and selecting the noise model with the largest posterior probability among the M noise models as the noise type of the audio signal frame, where M is an integer greater than or equal to 1.
  • the step performed by the processor 111 may further include:
  • the step of obtaining the pure estimate of the spectral envelope parameter of the to-be-enhanced frame may include:
  • The step of counting, in N frames of the audio signal that include the audio signal frame, the number of frames of each noise type included in the N frames, and selecting the noise type with the largest number of frames as the noise type included in the audio signal may include:
  • the step performed by the processor 111 may further include:
  • when it is detected that the noise type of consecutive multiple frames in the audio signal differs from the previously determined noise type, counting, within the consecutive multiple frames, the number of frames of each noise type included in the consecutive multiple frames, and selecting the noise type with the largest number of frames as the current noise type of the audio signal;
  • The step, performed by the processor 111, of performing enhancement processing on the spectral envelope parameter of the to-be-enhanced frame of the audio signal by using a neural network preset for the noise type of the audio signal, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame,
  • may include:
  • performing enhancement processing on the spectral envelope parameter of the to-be-enhanced frame of the audio signal by using a neural network preset for the current noise type of the audio signal, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
  • The foregoing neural network may include: a recurrent deep neural network.
  • In the foregoing implementation, the bit stream of the input audio signal is decoded to obtain the spectral envelope parameter of the to-be-enhanced frame of the audio signal; enhancement processing is performed on the spectral envelope parameter of the to-be-enhanced frame by using a neural network preset for the noise type included in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame; and the pure estimate is quantized to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and the quantization index replaces the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Abstract

An embodiment of the present invention discloses an audio signal enhancement method and apparatus. The method may include: decoding a bit stream of an input audio signal to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal; performing enhancement processing on the spectral envelope parameter of the to-be-enhanced frame of the audio signal by using a neural network preset for a noise type included in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame; and quantizing the pure estimate to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replacing, with the quantization index, the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame. Embodiments of the present invention can reduce the computational complexity and additional delay in the enhancement process of an audio signal.

Description

Audio signal enhancement method and apparatus
This application claims priority to Chinese Patent Application No. 201510295355.2, filed with the Chinese Patent Office on June 2, 2015 and entitled "Audio Signal Enhancement Method and Apparatus", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of communications, and in particular to an audio signal enhancement method and apparatus.
Background
In communication systems, audio signals are frequently corrupted by noise, which degrades audio signal quality. At present, the communications field mainly relies on audio enhancement technology to extract a signal that is as clean as possible from a noise-contaminated audio signal, so as to improve audio signal quality. Because the limited computing power, storage space, and cost of terminal devices must be considered in practice, a network device is usually used to enhance the audio signal. During speech enhancement of an audio signal by a network device, the audio signal is fully decoded, enhanced, and then re-encoded. Since the audio signal has to be fully decoded before the decoded data can be processed, the computational complexity and additional delay of current audio signal enhancement are both relatively high.
Summary
Embodiments of the present invention provide an audio signal enhancement method and apparatus, which can reduce the computational complexity and additional delay of the audio signal enhancement process.
According to a first aspect, an embodiment of the present invention provides an audio signal enhancement method, including:
decoding a bit stream of an input audio signal to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal;
enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame;
quantizing the pure estimate to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replacing the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame with the quantization index.
In a first possible implementation of the first aspect, the method further includes:
decoding the bit stream of the input audio signal to obtain a spectral envelope parameter of an audio signal frame of the audio signal;
performing noise classification on the audio signal frame using the spectral envelope parameter, to obtain a noise type of the audio signal frame;
counting, within N frames of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal, where N is an integer greater than or equal to 1.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the performing noise classification on the audio signal frame using the spectral envelope parameter to obtain a noise type of the audio signal frame includes:
obtaining, from the bit stream of the input audio signal, codebook gain parameters corresponding to the audio signal frame; computing, using the codebook gain parameters and the spectral envelope parameter, the posterior probability of the audio signal frame with respect to each of M preset noise models; and selecting the noise model with the largest posterior probability among the M noise models as the noise type of the audio signal frame, where M is an integer greater than or equal to 1.
With reference to the first aspect, or the first or second possible implementation of the first aspect, in a third possible implementation of the first aspect, the method further includes:
jointly adjusting an adaptive codebook gain and an algebraic codebook gain of the to-be-enhanced frame, and quantizing the jointly adjusted adaptive codebook gain and algebraic codebook gain separately, to obtain a quantization index of the jointly adjusted adaptive codebook gain and a quantization index of the jointly adjusted algebraic codebook gain of the to-be-enhanced frame, where the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame are obtained by performing a decoding operation on the to-be-enhanced frame;
replacing the bits corresponding to the adaptive codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted adaptive codebook gain of the to-be-enhanced frame, and replacing the bits corresponding to the algebraic codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted algebraic codebook gain of the to-be-enhanced frame.
With reference to the first aspect, or the first or second possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame, includes:
computing the mean of the spectral envelope parameters of the to-be-enhanced frame of the audio signal and several frames, where the several frames are frames of the audio signal preceding the to-be-enhanced frame;
computing a de-meaned spectral envelope parameter of the to-be-enhanced frame, where the de-meaned spectral envelope parameter is the difference between the spectral envelope parameter of the to-be-enhanced frame and the mean;
enhancing the de-meaned spectral envelope parameter using the neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the de-meaned spectral envelope parameter;
adding the pure estimate of the de-meaned spectral envelope parameter to a pre-obtained mean of clean audio spectral envelope parameters, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
With reference to the first possible implementation of the first aspect, in a fifth possible implementation of the first aspect, the counting, within N frames of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal includes:
counting, within N frames of the starting segment of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal; or
counting, within N frames of the audio signal that include the audio signal frame and contain no speech signal, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal.
With reference to the first aspect, or the first or second possible implementation of the first aspect, in a sixth possible implementation of the first aspect, the method further includes:
when it is detected that the noise type of multiple consecutive frames of the audio signal differs from the previously determined type of noise contained in the audio signal, counting, within the consecutive frames, the number of frames of each noise type contained in them, and selecting the noise type with the largest number of frames as the current noise type of the audio signal;
the enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the noise type of the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame, then includes:
enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the current noise type of the audio signal, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
With reference to any one of the foregoing possible implementations of the first aspect, in a seventh possible implementation of the first aspect, the neural network includes:
a recurrent deep neural network.
According to a second aspect, the present invention provides an audio signal enhancement apparatus, including a decoding unit, an enhancement unit, and a replacement unit, where:
the decoding unit is configured to decode a bit stream of an input audio signal to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal;
the enhancement unit is configured to enhance the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame;
the replacement unit is configured to quantize the pure estimate to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replace the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame with the quantization index.
In a first possible implementation of the second aspect, the decoding unit is further configured to decode the bit stream of the input audio signal to obtain a spectral envelope parameter of an audio signal frame of the audio signal;
the apparatus further includes:
a classification unit, configured to perform noise classification on the audio signal frame using the spectral envelope parameter, to obtain a noise type of the audio signal frame;
a statistics unit, configured to count, within N frames of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and select the noise type with the largest number of frames as the type of noise contained in the audio signal, where N is an integer greater than or equal to 1.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the classification unit is configured to obtain, from the bit stream of the input audio signal, codebook gain parameters corresponding to the audio signal frame, compute, using the codebook gain parameters and the spectral envelope parameter, the posterior probability of the audio signal frame with respect to each of M preset noise models, and select the noise model with the largest posterior probability among the M noise models as the noise type of the audio signal frame.
With reference to the second aspect, or the first or second possible implementation of the second aspect, in a third possible implementation of the second aspect, the apparatus further includes:
an adjustment unit, configured to jointly adjust an adaptive codebook gain and an algebraic codebook gain of the to-be-enhanced frame, and quantize the jointly adjusted adaptive codebook gain and algebraic codebook gain separately, to obtain a quantization index of the jointly adjusted adaptive codebook gain and a quantization index of the jointly adjusted algebraic codebook gain of the to-be-enhanced frame, where the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame are obtained by performing a decoding operation on the to-be-enhanced frame;
the replacement unit is further configured to replace the bits corresponding to the adaptive codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted adaptive codebook gain of the to-be-enhanced frame, and replace the bits corresponding to the algebraic codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted algebraic codebook gain of the to-be-enhanced frame.
With reference to the second aspect, or the first or second possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the enhancement unit includes:
a first computing unit, configured to compute the mean of the spectral envelope parameters of the to-be-enhanced frame of the audio signal and several frames, where the several frames are frames of the audio signal preceding the to-be-enhanced frame;
a second computing unit, configured to compute a de-meaned spectral envelope parameter of the to-be-enhanced frame, where the de-meaned spectral envelope parameter is the difference between the spectral envelope parameter of the to-be-enhanced frame and the mean;
a third computing unit, configured to enhance the de-meaned spectral envelope parameter using the neural network set in advance for the noise type of the audio signal, to obtain a pure estimate of the de-meaned spectral envelope parameter;
a fourth computing unit, configured to add the pure estimate of the de-meaned spectral envelope parameter to a pre-obtained mean of clean audio spectral envelope parameters, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, the statistics unit is configured to count, within N frames of the starting segment of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and select the noise type with the largest number of frames as the type of noise contained in the audio signal; or
the statistics unit is configured to count, within N frames of the audio signal that include the audio signal frame and contain no speech signal, the number of frames of each noise type contained in the N frames, and select the noise type with the largest number of frames as the type of noise contained in the audio signal.
With reference to the first possible implementation of the second aspect, in a sixth possible implementation of the second aspect, the statistics unit is further configured to: when it is detected that the noise type of multiple consecutive frames of the audio signal differs from the previously determined type of noise contained in the audio signal, count, within the consecutive frames, the number of frames of each noise type contained in them, and select the noise type with the largest number of frames as the current noise type of the audio signal;
the enhancement unit is configured to enhance the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the current noise type of the audio signal, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
With reference to any one of the foregoing possible implementations of the second aspect, in a seventh possible implementation of the second aspect, the neural network includes:
a recurrent deep neural network.
In the foregoing technical solutions, a bit stream of an input audio signal is decoded to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal; the spectral envelope parameter of the to-be-enhanced frame is enhanced using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame; the pure estimate is quantized to obtain a quantization index of the pure estimate, and the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame are replaced with the quantization index. In this way, only the bits corresponding to the spectral envelope parameters of the audio signal frames need to be decoded, that is, only partial decoding is performed, which reduces the computational complexity and additional delay of the audio signal enhancement process.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative effort.
FIG. 1 is a schematic flowchart of an audio signal enhancement method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of another audio signal enhancement method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an RDNN model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another RDNN model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a GMM model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of another audio signal enhancement method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an audio signal enhancement apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of another audio signal enhancement apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of another audio signal enhancement apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of another audio signal enhancement apparatus according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of another audio signal enhancement apparatus according to an embodiment of the present invention.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments. Apparently, the described embodiments are merely some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of an audio signal enhancement method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
101. Decode a bit stream of an input audio signal to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal.
In this embodiment, the to-be-enhanced frame may be understood as the current frame of the audio signal, that is, the audio signal frame of the audio signal that is currently input. The input may be understood as the input of this method, or the input of the apparatus performing this method.
Step 101 may also be understood as decoding only the bits corresponding to the spectral envelope parameter in the to-be-enhanced frame, where those bits may be the bits of the spectral envelope parameter in the bit stream of the audio signal frame. The spectral envelope parameter may include Line Spectral Frequencies (LSF), Immittance Spectral Frequencies (ISF), Linear Prediction Coefficients (LPC), or other equivalent parameters.
In this embodiment, the audio signal may be any audio signal whose bit stream contains spectral envelope parameters, such as a speech signal or a music signal.
102. Enhance the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
In this embodiment, multiple neural networks may be set in advance, each corresponding to one noise type. Once the noise type of the audio signal is determined, the neural network corresponding to that noise type can be selected for the enhancement processing.
In addition, in this embodiment, the type of noise contained in the audio signal may be obtained before the to-be-enhanced frame is decoded, for example: from statistics over the noise types of several frames in the starting segment of the audio signal; or from statistics over the noise types of several frames of the audio signal that contain no speech; or from statistics over the noise types of several frames adjacent to the to-be-enhanced frame. The type of noise contained in the audio signal may also be determined from the source of the audio signal. For example, for the speech signal of a phone call, the noise type may be determined from information such as the geographic locations of the two parties, the time of the call, or the noise types of historical speech signals: if the geographic location shows that one party is at a construction site, the noise type of the current speech signal can be determined to be that of a construction site; or, if nine out of ten of a user's previous calls carried noise of type A, it can be determined from this history that the speech signal of the user's next call will contain noise of type A.
103. Quantize the pure estimate to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replace the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame with the quantization index.
Since only the spectral envelope parameter of the to-be-enhanced frame is obtained when the to-be-enhanced frame is decoded, and the other parameters of the to-be-enhanced frame need not be decoded, once step 103 has replaced the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame with the quantization index of the pure estimate, the bit stream of the enhanced frame is obtained.
In addition, in this embodiment, the above method can be applied to any smart device with decoding and computing capabilities, such as a server, a network-side device, a personal computer (PC), a laptop, a mobile phone, or a tablet.
In this embodiment, a bit stream of an input audio signal is decoded to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal; the spectral envelope parameter of the to-be-enhanced frame is enhanced using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame; the pure estimate is quantized to obtain a quantization index, and the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame are replaced with the quantization index. In this way, only the bits corresponding to the spectral envelope parameter of an audio signal frame need to be decoded, that is, only partial decoding is performed, which reduces the computational complexity and additional delay of the audio signal enhancement process.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of another audio signal enhancement method according to an embodiment of the present invention. As shown in FIG. 2, the method includes the following steps:
201. Decode a bit stream of an input audio signal to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal.
202. Enhance the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
In this embodiment, step 202 may include the following sub-steps (a code sketch follows the list):
computing the mean of the spectral envelope parameters of the to-be-enhanced frame of the audio signal and several frames, where the several frames are frames of the audio signal preceding the to-be-enhanced frame;
computing a de-meaned spectral envelope parameter of the to-be-enhanced frame, where the de-meaned spectral envelope parameter is the difference between the spectral envelope parameter of the to-be-enhanced frame and the mean;
enhancing the de-meaned spectral envelope parameter using the neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the de-meaned spectral envelope parameter;
adding the pure estimate of the de-meaned spectral envelope parameter to a pre-obtained mean of clean audio spectral envelope parameters, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
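As a concrete illustration of these four sub-steps, the following minimal Python sketch shows one possible flow; the function names, the `rdnn` callable, and the way preceding frames are buffered are hypothetical, not taken from the patent:

```python
import numpy as np

def enhance_frame_isf(isf_noisy, history, rdnn, isf_mean_clean):
    """Illustrative de-mean -> network -> re-mean flow for one frame.

    isf_noisy:      ISF vector of the to-be-enhanced frame
    history:        list of ISF vectors of the preceding frames of this signal
    rdnn:           enhancement network chosen for the detected noise type
    isf_mean_clean: pre-obtained mean of clean-speech ISF parameters
    """
    # Mean over the to-be-enhanced frame and the several preceding frames
    mean_noisy = np.mean(np.vstack(history + [isf_noisy]), axis=0)
    # De-meaned spectral envelope parameter of the frame
    x_noisy = isf_noisy - mean_noisy
    # Pure estimate of the de-meaned parameter from the noise-type-specific network
    x_clean_hat = rdnn(x_noisy)
    # Add back the clean-speech mean to obtain the final pure estimate
    return x_clean_hat + isf_mean_clean
```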
In this embodiment, the neural network may be a recurrent deep neural network or another neural network. When a Recurrent Deep Neural Network (RDNN) is used, the recurrent connections in the time domain within the RDNN effectively improve the smoothness of the adjusted spectral envelope and thus the audio signal quality. In addition, the RDNN-based spectral envelope parameter adjustment avoids the unstable adjusted LPC filters that existing methods can produce, which improves the robustness of the algorithm. Moreover, the RDNN-based spectral envelope estimation method has relatively low computational complexity, which effectively increases processing speed.
The RDNN used in this embodiment is described in detail below:
The RDNN may be as shown in FIG. 3. The symbols of the RDNN model in FIG. 3 are explained as follows: Xnoisy denotes the de-meaned spectral envelope parameter (for example, the de-meaned ISF feature of noisy speech), X̂clean denotes the pure estimate of the de-meaned spectral envelope parameter (for example, the estimate of the de-meaned ISF feature of clean speech), h1, h2, h3 are hidden-layer states, W1, W2, W3, W4 are the weight matrices between layers, b1, b2, b3, b4 are the bias vectors of the layers, U is the recurrent connection matrix, and m is the frame index. The mappings between the layers of the RDNN model shown in FIG. 3 are as follows:
Visible layer to hidden layer 1:
h1(m) = σ(W1 Xnoisy(m) + b1)
Hidden layer 1 to hidden layer 2:
h2(m) = σ(W2 h1(m) + b2)
Hidden layer 2 to hidden layer 3:
h3(m) = σ(W3 (h2(m) + U h2(m−1)) + b3)
Hidden layer 3 to output layer:
X̂clean(m) = W4 h3(m) + b4
where σ is the sigmoid activation function.
In addition, the RDNN may also be as shown in FIG. 4. The symbols of the RDNN model in FIG. 4 are explained as follows: Xnoisy denotes the de-meaned spectral envelope parameter (for example, the de-meaned ISF feature of noisy speech), X̂clean denotes the pure estimate of the de-meaned spectral envelope parameter (for example, the estimate of the de-meaned ISF feature of clean speech), h1, h2, h3 are hidden-layer states, W1, W2, W3, W4 are the weight matrices between layers, b1, b2, b3, b4 are the bias vectors of the layers, U1 and U2 are recurrent connection matrices, and m is the frame index. The mappings between the layers of the RDNN model shown in FIG. 4 are as follows:
Visible layer to hidden layer 1:
h1(m) = σ(W1 Xnoisy(m) + b1)
Hidden layer 1 to hidden layer 2:
h2(m) = σ(W2 (h1(m) + U1 h1(m−1)) + b2)
Hidden layer 2 to hidden layer 3:
h3(m) = σ(W3 (h2(m) + U2 h2(m−1)) + b3)
Hidden layer 3 to output layer:
X̂clean(m) = W4 h3(m) + b4
Compared with the RDNN model structure shown in FIG. 3, this model adds recurrent connections at hidden layer 1 and hidden layer 3. More recurrent connections help the RDNN model capture the temporal correlation of the speech signal's spectral envelope.
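The mappings above translate almost line-for-line into code. Below is a minimal per-frame forward pass of the FIG. 4 variant, written as a sketch (the FIG. 3 variant is recovered by removing the U1 recurrence); treating the output layer as linear is an assumption, since the text applies the sigmoid σ only in the hidden-layer formulas:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RDNN:
    """Recurrent deep neural network following the FIG. 4 layer mappings."""
    def __init__(self, W, b, U1, U2):
        self.W1, self.W2, self.W3, self.W4 = W   # inter-layer weight matrices
        self.b1, self.b2, self.b3, self.b4 = b   # per-layer bias vectors
        self.U1, self.U2 = U1, U2                # recurrent connection matrices
        self.h1_prev = np.zeros_like(self.b1)    # h1(m-1), zero at the first frame
        self.h2_prev = np.zeros_like(self.b2)    # h2(m-1), zero at the first frame

    def step(self, x_noisy):
        h1 = sigmoid(self.W1 @ x_noisy + self.b1)
        h2 = sigmoid(self.W2 @ (h1 + self.U1 @ self.h1_prev) + self.b2)
        h3 = sigmoid(self.W3 @ (h2 + self.U2 @ self.h2_prev) + self.b3)
        x_clean_hat = self.W4 @ h3 + self.b4     # linear output layer (assumed)
        self.h1_prev, self.h2_prev = h1, h2      # keep states for frame m+1
        return x_clean_hat
```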
In addition, the above RDNN models may all be obtained in advance, for example, received in advance from user input or from another device.
Of course, the RDNN models may also be trained in advance. The following uses ISF parameters and speech signals as an example. In RDNN training, the features of noisy speech serve as the model input, and the features of clean speech serve as the model's target output. The clean-speech and noisy-speech features must be paired: after features are extracted from a segment of clean speech, noise is added to that segment and features are extracted from the resulting noisy speech, forming one pair of training features.
The input feature of the RDNN model is the de-meaned ISF feature of the noisy speech signal, obtained as follows:
Xnoisy(m) = ISFnoisy(m) − ISFmean_noisy
where ISFnoisy(m) is the ISF feature of the m-th frame, and ISFmean_noisy is the mean of the noisy-speech ISF parameters, computed from all noisy-speech ISF parameters under a given noise condition in the training database.
The target output of the RDNN model is the de-meaned ISF parameter of the clean speech signal, obtained as follows:
Xclean(m) = ISFclean(m) − ISFmean_clean
where ISFclean(m) is the clean-speech ISF parameter, and ISFmean_clean is the mean of the clean-speech ISF parameters, computed from the ISF parameters of all clean speech signals in the training database.
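A sketch of how one paired training example set could be assembled under the two formulas above; the array handling and function name are hypothetical:

```python
import numpy as np

def make_training_pair(isf_noisy_frames, isf_clean_frames,
                       isf_mean_noisy, isf_mean_clean):
    """De-meaned noisy ISF features as input, de-meaned clean ISF features as target."""
    x_noisy = np.asarray(isf_noisy_frames) - isf_mean_noisy   # model input Xnoisy(m)
    x_clean = np.asarray(isf_clean_frames) - isf_mean_clean   # target output Xclean(m)
    return x_noisy, x_clean
```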
Unlike a conventional DNN, this embodiment uses an objective function in the form of a weighted mean squared error, expressed as:
Lw = Σm Σd Fw(d) · (X̂clean,d(m) − Xclean,d(m))²
where Fw is the weight function and d indexes the ISF dimensions. Compared with the plain mean-squared-error objective, the weighted objective Lw takes into account that reconstruction errors in different dimensions of the ISF feature affect speech quality differently, and assigns a different weight to the reconstruction error of each ISF dimension.
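As a sketch, this weighted objective can be implemented in a few lines; the per-dimension weight vector `fw` stands in for the weight function Fw, whose exact values the text does not specify:

```python
import numpy as np

def weighted_mse(x_clean_hat, x_clean, fw):
    """Weighted MSE over frames (rows) and ISF dimensions (columns)."""
    err = x_clean_hat - x_clean              # per-frame, per-dimension error
    return np.mean(np.sum(fw * err ** 2, axis=1))
```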
In addition, in this embodiment, one RDNN model may be trained for each pre-selected noise type using the above training method.
It should be noted that the RDNN model used in this embodiment is not limited to three hidden layers; the number of hidden layers can be increased or decreased as needed.
203. Quantize the pure estimate to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replace the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame with the quantization index.
In this embodiment, the method may further include the following steps:
204. Jointly adjust the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame, and quantize the jointly adjusted adaptive codebook gain and algebraic codebook gain separately, to obtain quantization indices of the jointly adjusted adaptive codebook gain and algebraic codebook gain of the to-be-enhanced frame.
The adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame may be obtained by performing a decoding operation on the to-be-enhanced frame. For example, step 201 may include:
decoding the bit stream of the input audio signal to obtain the spectral envelope parameter, the adaptive codebook gain, and the algebraic codebook gain of the to-be-enhanced frame of the audio signal.
That is, step 201 decodes the bits corresponding to the spectral envelope parameter, the adaptive codebook gain, and the algebraic codebook gain of the to-be-enhanced frame.
In this embodiment, the joint adjustment of the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame may follow an energy conservation criterion. For example, the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame may be defined as the first adaptive codebook gain and the first algebraic codebook gain, and the jointly adjusted adaptive codebook gain and algebraic codebook gain of the to-be-enhanced frame as the second adaptive codebook gain and the second algebraic codebook gain. The adjustment may then proceed as follows:
adjusting the first algebraic codebook gain to obtain the second algebraic codebook gain;
determining the second adaptive codebook gain from the first adaptive codebook gain and the second algebraic codebook gain.
The step of adjusting the first algebraic codebook gain to obtain the second algebraic codebook gain may include:
determining an algebraic codebook gain of the noise from the first algebraic codebook gain;
determining a noise excitation energy estimate from the algebraic codebook gain of the noise and the first algebraic codebook vector;
determining a first algebraic codebook excitation energy from the first algebraic codebook gain and the first algebraic codebook vector;
determining a first a posteriori signal-to-noise ratio estimate of the current to-be-processed speech subframe from the noise excitation energy estimate and the first algebraic codebook excitation energy;
determining a second a posteriori signal-to-noise ratio estimate of the current to-be-processed speech subframe from the energy of the current to-be-processed speech subframe and the minimum of that energy;
determining an a priori signal-to-noise ratio estimate of the current to-be-processed speech subframe from the first and second a posteriori signal-to-noise ratio estimates;
determining a first adjustment factor of the current to-be-processed speech subframe using the a priori signal-to-noise ratio estimate;
adjusting the first algebraic codebook gain according to the first adjustment factor, to determine the second algebraic codebook gain.
In addition, when the parameters decoded in step 201 also include a first adaptive codebook vector, the step of determining the second adaptive codebook gain from the first adaptive codebook gain and the second algebraic codebook gain may include the following (a simplified sketch follows the list):
if the to-be-enhanced frame is determined to be a subframe of the first type, obtaining a second algebraic codebook vector and a second adaptive codebook vector of the to-be-enhanced frame;
determining a first total excitation energy from the first adaptive codebook gain, the first adaptive codebook vector, the first algebraic codebook gain, and the first algebraic codebook vector;
determining a second total excitation energy from the first total excitation energy and an energy adjustment factor;
determining the second adaptive codebook gain from the second total excitation energy, the second algebraic codebook gain, the second algebraic codebook vector, and the second adaptive codebook vector.
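The text gives the adjustment flow but no closed-form expressions, so the following is only a heavily simplified, hypothetical sketch of an energy-conservation-style joint adjustment: it scales the algebraic codebook gain by a first adjustment factor assumed to be already derived from the a priori SNR estimate, then solves for the adaptive codebook gain that keeps the total excitation energy unchanged, ignoring the cross term between the two excitations for simplicity:

```python
import numpy as np

def jointly_adjust_gains(gp1, gc1, d1, c1, g_factor):
    """Sketch: energy-conserving joint adjustment of codebook gains.

    gp1, gc1 : decoded (first) adaptive / algebraic codebook gains
    d1, c1   : decoded adaptive / algebraic codebook vectors (same length)
    g_factor : first adjustment factor, assumed derived from the a priori SNR
    """
    gc2 = g_factor * gc1                           # second algebraic codebook gain
    # First total excitation energy from the decoded parameters
    e_total = np.sum((gp1 * d1 + gc1 * c1) ** 2)
    # Solve for the second adaptive codebook gain so that the cross-term-free
    # total excitation energy is preserved: gp2^2*|d1|^2 + gc2^2*|c1|^2 = e_total
    residual = max(e_total - gc2 ** 2 * np.sum(c1 ** 2), 0.0)
    gp2 = np.sqrt(residual / np.sum(d1 ** 2))
    return gp2, gc2
```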
205. Replace the bits corresponding to the adaptive codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted adaptive codebook gain, and replace the bits corresponding to the algebraic codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted algebraic codebook gain.
In this way, the spectral envelope parameter, the adaptive codebook gain, and the algebraic codebook gain of the to-be-enhanced frame are all enhanced.
It should be noted that this embodiment does not restrict the order in which steps 204 and 205 are performed. For example, step 205 and step 203 may be performed together or separately, or step 204 may be performed before step 203.
In this embodiment, the method may further include the following steps:
decoding the bit stream of the input audio signal to obtain a spectral envelope parameter of an audio signal frame of the audio signal;
performing noise classification on the audio signal frame using the spectral envelope parameter, to obtain the noise type of the audio signal frame;
counting, within N frames of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal (see the sketch after this passage), where N is an integer greater than or equal to 1.
The audio signal frame may be understood as any frame of the audio signal, or as the current frame; alternatively, the partial decoding operation may be understood as being performed for every frame of the audio signal.
The noise classification may be performed on the spectral envelope parameter, and the noise type of that spectral envelope parameter then taken as the noise type contained in the audio signal frame.
In addition, since the N frames may contain frames of different noise types, the above step counts the number of frames for each noise type and selects the noise type with the largest frame count as the noise type of the audio signal. It should be noted that the N frames may be a subset of the frames of the audio signal, for example, the starting segment of the audio signal, or frames T through N+T of the audio signal, where frame T may be set by the user.
Furthermore, in this implementation, decoding may be performed for every audio signal frame, while noise classification may be performed for every frame or only for some frames. The step of selecting the noise type of the audio signal may be performed only once, or periodically in time, and so on. For example, once the noise type of the audio signal has been selected, that noise type may be assumed throughout the processing of the audio signal; or the selected noise type may be used for a specific period during the processing of the audio signal; or, after the noise type has been selected, the noise type of each frame may continue to be identified, and when the noise types of several consecutive frames are found to differ from the previously selected noise type, the audio signal may be classified again.
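The frame-count vote described above reduces to a few lines. This sketch assumes the per-frame noise labels have already been produced by the classifier:

```python
from collections import Counter

def select_signal_noise_type(frame_noise_types):
    """Return the noise type with the largest frame count among the N frames."""
    return Counter(frame_noise_types).most_common(1)[0][0]

# e.g. select_signal_noise_type(["car", "car", "street", "car"]) -> "car"
```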
The step of performing noise classification on the audio signal frame using the spectral envelope parameter to obtain the noise type of the audio signal frame may include:
obtaining, from the bit stream of the input audio signal, codebook gain parameters corresponding to the audio signal frame; computing, using the codebook gain parameters and the spectral envelope parameter, the posterior probability of the audio signal frame with respect to each of M preset noise models; and selecting the noise model with the largest posterior probability among the M noise models as the noise type of the audio signal frame, where M is an integer greater than or equal to 1.
The noise model may be a Gaussian Mixture Model (GMM). In this embodiment, once GMM-based noise classification is introduced, the RDNN model corresponding to the current noise environment can be selected when the spectral envelope parameters are adjusted, which helps the algorithm adapt to complex noise environments.
In addition, the codebook gain parameters may include the long-term average of the adaptive codebook gain and the variance of the algebraic codebook gain. The long-term average of the adaptive codebook gain may be computed from the adaptive codebook gains of the current frame and the L−1 preceding frames as
ḡp(m) = (1/L) · Σ_{i=0..L−1} gp(m−i)
where ḡp(m) is the average adaptive codebook gain of the m-th (current) frame, gp(m−i) is the adaptive codebook gain of frame m−i, and L is an integer greater than 1.
The variance of the algebraic codebook gain may be computed from the algebraic codebook gains of the current frame and the L−1 preceding frames as
σ²gc(m) = (1/L) · Σ_{i=0..L−1} (gc(m−i) − ḡc(m))²
where σ²gc(m) is the variance of the algebraic codebook gain of the m-th (current) frame, gc(m−i) is the algebraic codebook gain of frame m−i, and ḡc(m) is the average algebraic codebook gain over the L frames.
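Both statistics are plain sliding-window moments over the last L frames, as the following sketch shows (the gain histories are assumed to be kept by the partial decoder):

```python
import numpy as np

def codebook_gain_features(gp_history, gc_history, L):
    """Long-term average of the adaptive codebook gain and variance of the
    algebraic codebook gain over the current frame and the L-1 prior frames."""
    gp_avg = np.mean(gp_history[-L:])   # (1/L) * sum of gp(m-i), i = 0..L-1
    gc_var = np.var(gc_history[-L:])    # (1/L) * sum of (gc(m-i) - mean)^2
    return gp_avg, gc_var
```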
In addition, in this embodiment, GMMs for the various noise types in a noise library may be obtained in advance, for example, received in advance from user input or from another device; alternatively, one GMM may be trained in advance for each noise type.
For example, taking ISF parameters as an example, the feature vector used in GMM training consists of the ISF parameters, the long-term average of the adaptive codebook gain, and the variance of the algebraic codebook gain; the feature dimension may be 18, as shown in FIG. 5. In training, the Expectation Maximization (EM) algorithm may be used to train a separate GMM model for each noise type in the noise database (let the number of noise types be M).
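As an illustrative sketch of this train-then-classify flow using scikit-learn's `GaussianMixture` (the component count and covariance type are assumptions; the patent does not fix them), with equal priors over the M noise models so that picking the largest likelihood equals picking the largest posterior:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_noise_models(features_by_type, n_components=8):
    """Train one GMM per noise type with EM on its 18-dimensional features
    (ISF parameters + gp long-term average + gc variance)."""
    models = {}
    for noise_type, feats in features_by_type.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(np.asarray(feats))
        models[noise_type] = gmm
    return models

def classify_frame(feature, models):
    """Pick the model with the largest log-likelihood for this frame's feature."""
    x = np.asarray(feature).reshape(1, -1)
    return max(models, key=lambda t: models[t].score_samples(x)[0])
```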
In this embodiment, the step of counting, within N frames of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal, may include:
counting, within N frames of the starting segment of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal.
This implementation determines the noise type of the audio signal from the frames of its starting segment, so that subsequent frames can be enhanced directly with the neural network corresponding to that noise type.
In this embodiment, the step of counting, within N frames of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal, may alternatively include:
counting, within N frames of the audio signal that include the audio signal frame and contain no speech signal, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal.
This implementation determines the noise type of the audio signal from N frames that contain no speech. Because audio signal frames without speech reflect the noise type more readily than frames containing noisy speech, using N speechless frames makes it easier to identify the noise type of the audio signal.
In addition, this implementation may use Voice Activity Detection (VAD) to determine whether speech is present in the current frame, so that the statistics are gathered over frames that VAD judges to contain no speech. Alternatively, when the encoder has Discontinuous Transmission (DTX) mode enabled, the VAD information in the bit stream may be used to determine whether speech is present; if the encoder does not have DTX enabled, features such as the ISF parameters and the codebook gain parameters may be used to decide whether speech is present.
In this embodiment, the method may further include the following steps:
when it is detected that the noise type of multiple consecutive frames of the audio signal differs from the previously determined type of noise contained in the audio signal, counting, within those consecutive frames, the number of frames of each noise type they contain, and selecting the noise type with the largest number of frames as the current noise type of the audio signal;
the enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the noise type of the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame, then includes:
enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the current noise type of the audio signal, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
This implementation adjusts the noise type of the audio signal in time. An audio signal usually comprises many audio signal frames, and those frames may carry different noise types; the above steps ensure that enhancement promptly uses the neural network corresponding to the currently correct noise type, improving audio signal quality.
In this embodiment, several optional implementations are added on top of the embodiment shown in FIG. 1, and all of them can reduce the computational complexity and additional delay of the audio signal enhancement process.
Referring to FIG. 6, FIG. 6 is a schematic diagram of another audio signal enhancement method according to an embodiment of the present invention. This embodiment uses ISF parameters as an example. As shown in FIG. 6, the method includes the following steps:
601. Use a partial decoder to extract the coding parameters of the noisy speech from the input bit stream, where the coding parameters include the ISF parameters, the adaptive codebook gain gp(m), the algebraic codebook gain gc(m), the adaptive codebook vector dm(n), and the algebraic codebook vector cm(n);
602. Use the adaptive codebook gain, algebraic codebook gain, adaptive codebook vector, and algebraic codebook vector obtained by the partial decoder to jointly adjust the adaptive codebook gain and the algebraic codebook gain, obtaining the adjusted adaptive codebook gain and algebraic codebook gain.
603. Using the ISF and codebook-gain-related parameters as features, classify the background noise with a Gaussian Mixture Model (GMM).
The codebook-gain-related parameters may include the average of the adaptive codebook gain and the variance of the algebraic codebook gain.
604. According to the noise classification result, select the corresponding Recurrent Deep Neural Network (RDNN) model to process the ISF parameters of the noisy speech obtained by the partial decoder, obtaining an estimate of the clean-speech ISF parameters.
605. Re-quantize the adjusted adaptive codebook gain and algebraic codebook gain parameters and the adjusted ISF parameters, and replace the corresponding positions in the bit stream.
In this embodiment, an RDNN model is introduced to adjust the spectral envelope parameters (such as the ISF parameters) of noisy speech. The recurrent connections in the time domain within the model effectively improve the temporal smoothness of the adjusted spectral envelope parameters and thus the speech quality. In addition, the RDNN-based spectral envelope parameter adjustment avoids the unstable adjusted LPC filters of existing methods, improving the robustness of the algorithm. Once GMM-based noise classification is introduced, the RDNN model corresponding to the current noise environment can be selected during spectral envelope adjustment, which helps the algorithm adapt to complex noise environments. Moreover, compared with existing solutions, the RDNN-based spectral envelope estimation method has lower computational complexity, effectively increasing running speed.
The following are apparatus embodiments of the present invention, which are used to perform the methods implemented in method embodiments 1 and 2 of the present invention. For ease of description, only the parts related to the embodiments of the present invention are shown; for specific technical details not disclosed, refer to embodiment 1 and embodiment 2 of the present invention.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of an audio signal enhancement apparatus according to an embodiment of the present invention. As shown in FIG. 7, the apparatus includes a decoding unit 71, an enhancement unit 72, and a replacement unit 73, where:
the decoding unit 71 is configured to decode a bit stream of an input audio signal to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal.
In this embodiment, the to-be-enhanced frame may be understood as the current frame of the audio signal, that is, the audio signal frame of the audio signal that is currently input. The input may be understood as the input of the apparatus.
The decoding unit 71 may also be understood as decoding only the bits corresponding to the spectral envelope parameter in the to-be-enhanced frame, where those bits may be the bits of the spectral envelope parameter in the bit stream of the audio signal frame. The spectral envelope parameter may include Line Spectral Frequencies (LSF), Immittance Spectral Frequencies (ISF), Linear Prediction Coefficients (LPC), or other equivalent parameters.
In this embodiment, the audio signal may be any audio signal whose bit stream contains spectral envelope parameters, such as a speech signal or a music signal.
The enhancement unit 72 is configured to enhance the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
In this embodiment, multiple neural networks may be set in advance, each corresponding to one noise type; once the noise type of the audio signal is determined, the neural network corresponding to that noise type can be selected for the enhancement processing.
In addition, in this embodiment, the type of noise contained in the audio signal may be obtained before the to-be-enhanced frame is decoded, for example: from statistics over the noise types of several frames in the starting segment of the audio signal; or from statistics over the noise types of several frames of the audio signal that contain no speech; or from statistics over the noise types of several frames adjacent to the to-be-enhanced frame. The type of noise may also be determined from the source of the audio signal. For example, for the speech signal of a phone call, the noise type may be determined from information such as the geographic locations of the two parties, the time of the call, or the noise types of historical speech signals: if the geographic location shows that one party is at a construction site, the noise type of the current speech signal can be determined to be that of a construction site; or, if nine out of ten of a user's previous calls carried noise of type A, it can be determined from this history that the speech signal of the user's next call will contain noise of type A.
The replacement unit 73 is configured to quantize the pure estimate to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replace the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame with the quantization index.
Since only the spectral envelope parameter of the to-be-enhanced frame is obtained when the to-be-enhanced frame is decoded, and the other parameters of the to-be-enhanced frame need not be decoded, once the bits corresponding to the spectral envelope parameter have been replaced with the quantization index of the pure estimate, the bit stream of the enhanced frame is obtained.
In addition, in this embodiment, the apparatus can be applied to any smart device with decoding and computing capabilities, such as a server, a network-side device, a personal computer (PC), a laptop, a mobile phone, or a tablet.
In this embodiment, a bit stream of an input audio signal is decoded to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal; the spectral envelope parameter of the to-be-enhanced frame is enhanced using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame; the pure estimate is quantized to obtain a quantization index, and the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame are replaced with the quantization index. In this way, only the bits corresponding to the spectral envelope parameter of an audio signal frame need to be decoded, that is, only partial decoding is performed, which reduces the computational complexity and additional delay of the audio signal enhancement process.
Referring to FIG. 8, FIG. 8 is a schematic structural diagram of another audio signal enhancement apparatus according to an embodiment of the present invention. As shown in FIG. 8, the apparatus includes a decoding unit 81, an enhancement unit 82, and a replacement unit 83, where:
the decoding unit 81 is configured to decode a bit stream of an input audio signal to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal.
The enhancement unit 82 is configured to enhance the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
In this embodiment, the enhancement unit 82 may include:
a first computing unit 821, configured to compute the mean of the spectral envelope parameters of the to-be-enhanced frame of the audio signal and several frames, where the several frames are frames of the audio signal preceding the to-be-enhanced frame;
a second computing unit 822, configured to compute a de-meaned spectral envelope parameter of the to-be-enhanced frame, where the de-meaned spectral envelope parameter is the difference between the spectral envelope parameter of the to-be-enhanced frame and the mean;
a third computing unit 823, configured to enhance the de-meaned spectral envelope parameter using the neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the de-meaned spectral envelope parameter;
a fourth computing unit 824, configured to add the pure estimate of the de-meaned spectral envelope parameter to a pre-obtained mean of clean audio spectral envelope parameters, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
In this embodiment, the neural network may be a recurrent deep neural network or another neural network. When a Recurrent Deep Neural Network (RDNN) is used, the recurrent connections in the time domain within the RDNN effectively improve the smoothness of the adjusted spectral envelope and thus the audio signal quality. In addition, the RDNN-based spectral envelope parameter adjustment avoids the unstable adjusted LPC filters that existing methods can produce, which improves the robustness of the algorithm. Moreover, the RDNN-based spectral envelope estimation method has relatively low computational complexity, which effectively increases processing speed.
The replacement unit 83 is configured to quantize the pure estimate to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replace the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame with the quantization index.
In this embodiment, as shown in FIG. 9, the apparatus may further include:
an adjustment unit 84, configured to jointly adjust the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame, and quantize the jointly adjusted adaptive codebook gain and algebraic codebook gain separately, to obtain quantization indices of the jointly adjusted adaptive codebook gain and algebraic codebook gain of the to-be-enhanced frame, where the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame are obtained by performing a decoding operation on the to-be-enhanced frame;
the replacement unit 83 is further configured to replace the bits corresponding to the adaptive codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted adaptive codebook gain, and replace the bits corresponding to the algebraic codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted algebraic codebook gain.
The adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame may be obtained by performing a decoding operation on the to-be-enhanced frame; for example, the decoding unit 81 may be configured to decode the bit stream of the input audio signal to obtain the spectral envelope parameter, the adaptive codebook gain, and the algebraic codebook gain of the to-be-enhanced frame of the audio signal.
That is, the decoding unit 81 decodes the bits corresponding to the spectral envelope parameter, the adaptive codebook gain, and the algebraic codebook gain of the to-be-enhanced frame.
In this embodiment, the joint adjustment of the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame may follow an energy conservation criterion. For example, the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame may be defined as the first adaptive codebook gain and the first algebraic codebook gain, and the jointly adjusted adaptive codebook gain and algebraic codebook gain of the to-be-enhanced frame as the second adaptive codebook gain and the second algebraic codebook gain. The adjustment may then proceed as follows:
adjusting the first algebraic codebook gain to obtain the second algebraic codebook gain;
determining the second adaptive codebook gain from the first adaptive codebook gain and the second algebraic codebook gain.
This implementation enhances the spectral envelope parameter, the adaptive codebook gain, and the algebraic codebook gain of the to-be-enhanced frame.
In this embodiment, the decoding unit 81 may be further configured to decode the bit stream of the input audio signal to obtain a spectral envelope parameter of an audio signal frame of the audio signal;
as shown in FIG. 10, the apparatus may further include:
a classification unit 85, configured to perform noise classification on the audio signal frame using the spectral envelope parameter, to obtain the noise type of the audio signal frame;
a statistics unit 86, configured to count, within N frames of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and select the noise type with the largest number of frames as the type of noise contained in the audio signal, where N is an integer greater than or equal to 1;
The audio signal frame may be understood as any frame of the audio signal, or as the current frame; alternatively, the partial decoding operation may be understood as being performed for every frame of the audio signal.
The noise classification may be performed on the spectral envelope parameter, and the noise type of that spectral envelope parameter then taken as the noise type contained in the audio signal frame.
In addition, since the N frames may contain frames of different noise types, the above step counts the number of frames for each noise type and selects the noise type with the largest frame count as the noise type of the audio signal. It should be noted that the N frames may be a subset of the frames of the audio signal, for example, the starting segment of the audio signal, or frames T through N+T of the audio signal, where frame T may be set by the user.
Furthermore, in this implementation, decoding may be performed for every audio signal frame, while noise classification may be performed for every frame or only for some frames. The step of selecting the noise type of the audio signal may be performed only once, or periodically in time, and so on. For example, once the noise type of the audio signal has been selected, that noise type may be assumed throughout the processing of the audio signal; or the selected noise type may be used for a specific period during the processing; or, after the noise type has been selected, the noise type of each frame may continue to be identified, and when the noise types of several consecutive frames differ from the previously selected noise type, the audio signal may be classified again.
In this implementation, the classification unit 85 may be configured to obtain, from the bit stream of the input audio signal, codebook gain parameters corresponding to the audio signal frame, compute, using the codebook gain parameters and the spectral envelope parameter, the posterior probability of the audio signal frame with respect to each of M preset noise models, and select the noise model with the largest posterior probability among the M noise models as the noise type of the audio signal frame.
The noise model may be a Gaussian Mixture Model (GMM). In this embodiment, once GMM-based noise classification is introduced, the RDNN model corresponding to the current noise environment can be selected when the spectral envelope parameters are adjusted, which helps the algorithm adapt to complex noise environments.
In addition, the codebook gain parameters may include the long-term average of the adaptive codebook gain and the variance of the algebraic codebook gain. The long-term average of the adaptive codebook gain may be computed from the adaptive codebook gains of the current frame and the L−1 preceding frames as
ḡp(m) = (1/L) · Σ_{i=0..L−1} gp(m−i)
where ḡp(m) is the average adaptive codebook gain of the m-th (current) frame, gp(m−i) is the adaptive codebook gain of frame m−i, and L is an integer greater than 1.
The variance of the algebraic codebook gain may be computed from the algebraic codebook gains of the current frame and the L−1 preceding frames as
σ²gc(m) = (1/L) · Σ_{i=0..L−1} (gc(m−i) − ḡc(m))²
where σ²gc(m) is the variance of the algebraic codebook gain of the m-th (current) frame, gc(m−i) is the algebraic codebook gain of frame m−i, and ḡc(m) is the average algebraic codebook gain over the L frames.
In addition, in this embodiment, GMMs for the various noise types in a noise library may be obtained in advance, for example, received in advance from user input or from another device; alternatively, one GMM may be trained in advance for each noise type.
For example, taking ISF parameters as an example, the feature vector used in GMM training consists of the ISF parameters, the long-term average of the adaptive codebook gain, and the variance of the algebraic codebook gain; the feature dimension is 18, as shown in FIG. 5. In training, the Expectation Maximization (EM) algorithm may be used to train a separate GMM model for each noise type in the noise database (let the number of noise types be M).
In this embodiment, the statistics unit 86 may be configured to count, within N frames of the starting segment of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and select the noise type with the largest number of frames as the type of noise contained in the audio signal.
This implementation determines the noise type of the audio signal from the frames of its starting segment, so that subsequent frames can be enhanced directly with the neural network corresponding to that noise type.
In this embodiment, the statistics unit 86 may alternatively be configured to count, within N frames of the audio signal that include the audio signal frame and contain no speech signal, the number of frames of each noise type contained in the N frames, and select the noise type with the largest number of frames as the type of noise contained in the audio signal.
This implementation determines the noise type of the audio signal from N frames that contain no speech. Because audio signal frames without speech reflect the noise type more readily than frames containing noisy speech, using N speechless frames makes it easier to identify the noise type of the audio signal.
In addition, this implementation may use Voice Activity Detection (VAD) to determine whether speech is present in the current frame, so that the statistics are gathered over frames that VAD judges to contain no speech. When the encoder has Discontinuous Transmission (DTX) mode enabled, the VAD information in the bit stream may be used to determine whether speech is present; if the encoder does not have DTX enabled, features such as the ISF parameters and the codebook gain parameters may be used to decide whether speech is present.
In this embodiment, the statistics unit 86 may be further configured to: when it is detected that the noise type of multiple consecutive frames of the audio signal differs from the previously determined type of noise contained in the audio signal, count, within those consecutive frames, the number of frames of each noise type they contain, and select the noise type with the largest number of frames as the current noise type of the audio signal;
the enhancement unit 82 may be configured to enhance the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the current noise type of the audio signal, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
This implementation adjusts the noise type of the audio signal in time. An audio signal usually comprises many audio signal frames, and those frames may carry different noise types; the above steps ensure that enhancement promptly uses the neural network corresponding to the currently correct noise type, improving audio signal quality.
In this embodiment, several optional implementations are added on top of the embodiment shown in FIG. 7, and all of them can reduce the computational complexity and additional delay of the audio signal enhancement process.
Referring to FIG. 11, FIG. 11 is a schematic structural diagram of another audio signal enhancement apparatus according to an embodiment of the present invention. As shown in FIG. 11, the apparatus includes a processor 111, a network interface 112, a memory 113, and a communication bus 114, where the communication bus 114 is configured to implement connection and communication among the processor 111, the network interface 112, and the memory 113, and the processor 111 executes a program stored in the memory to implement the following method:
decoding a bit stream of an input audio signal to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal;
enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame;
quantizing the pure estimate to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replacing the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame with the quantization index.
In this embodiment, the steps performed by the processor 111 may further include:
decoding the bit stream of the input audio signal to obtain a spectral envelope parameter of an audio signal frame of the audio signal;
performing noise classification on the audio signal frame using the spectral envelope parameter, to obtain the noise type of the audio signal frame;
counting, within N frames of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal, where N is an integer greater than or equal to 1.
In this embodiment, the step, performed by the processor 111, of performing noise classification on the audio signal frame using the spectral envelope parameter to obtain the noise type of the audio signal frame may include:
obtaining, from the bit stream of the input audio signal, codebook gain parameters corresponding to the audio signal frame; computing, using the codebook gain parameters and the spectral envelope parameter, the posterior probability of the audio signal frame with respect to each of M preset noise models; and selecting the noise model with the largest posterior probability among the M noise models as the noise type of the audio signal frame, where M is an integer greater than or equal to 1.
In this embodiment, the steps performed by the processor 111 may further include:
jointly adjusting the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame, and quantizing the jointly adjusted adaptive codebook gain and algebraic codebook gain separately, to obtain quantization indices of the jointly adjusted adaptive codebook gain and algebraic codebook gain of the to-be-enhanced frame, where the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame are obtained by performing a decoding operation on the to-be-enhanced frame;
replacing the bits corresponding to the adaptive codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted adaptive codebook gain, and replacing the bits corresponding to the algebraic codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted algebraic codebook gain.
In this embodiment, the step, performed by the processor 111, of enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame, may include:
computing the mean of the spectral envelope parameters of the to-be-enhanced frame of the audio signal and several frames, where the several frames are frames of the audio signal preceding the to-be-enhanced frame;
computing a de-meaned spectral envelope parameter of the to-be-enhanced frame, where the de-meaned spectral envelope parameter is the difference between the spectral envelope parameter of the to-be-enhanced frame and the mean;
enhancing the de-meaned spectral envelope parameter using the neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the de-meaned spectral envelope parameter;
adding the pure estimate of the de-meaned spectral envelope parameter to a pre-obtained mean of clean audio spectral envelope parameters, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
In this embodiment, the step, performed by the processor 111, of counting, within N frames of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal, may include:
counting, within N frames of the starting segment of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal; or
counting, within N frames of the audio signal that include the audio signal frame and contain no speech signal, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal.
In this embodiment, the steps performed by the processor 111 may further include:
when it is detected that the noise type of multiple consecutive frames of the audio signal differs from the previously determined type of noise contained in the audio signal, counting, within those consecutive frames, the number of frames of each noise type they contain, and selecting the noise type with the largest number of frames as the current noise type of the audio signal.
In this embodiment, the step, performed by the processor 111, of enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the noise type of the audio signal, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, may include:
enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the current noise type of the audio signal, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
In this embodiment, the neural network may include:
a recurrent deep neural network.
In this embodiment, a bit stream of an input audio signal is decoded to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal; the spectral envelope parameter of the to-be-enhanced frame is enhanced using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame; the pure estimate is quantized to obtain a quantization index, and the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame are replaced with the quantization index. In this way, only the bits corresponding to the spectral envelope parameter of an audio signal frame need to be decoded, that is, only partial decoding is performed, which reduces the computational complexity and additional delay of the audio signal enhancement process.
A person of ordinary skill in the art may understand that all or some of the processes of the methods in the above embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).
What is disclosed above is merely preferred embodiments of the present invention, which certainly cannot be used to limit the scope of the rights of the present invention. Therefore, equivalent changes made according to the claims of the present invention still fall within the scope covered by the present invention.

Claims (16)

  1. An audio signal enhancement method, comprising:
    decoding a bit stream of an input audio signal to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal;
    enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame;
    quantizing the pure estimate to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replacing the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame with the quantization index.
  2. The method according to claim 1, wherein the method further comprises:
    decoding the bit stream of the input audio signal to obtain a spectral envelope parameter of an audio signal frame of the audio signal;
    performing noise classification on the audio signal frame using the spectral envelope parameter, to obtain a noise type of the audio signal frame;
    counting, within N frames of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal, wherein N is an integer greater than or equal to 1.
  3. The method according to claim 2, wherein the performing noise classification on the audio signal frame using the spectral envelope parameter to obtain a noise type of the audio signal frame comprises:
    obtaining, from the bit stream of the input audio signal, codebook gain parameters corresponding to the audio signal frame; computing, using the codebook gain parameters and the spectral envelope parameter, the posterior probability of the audio signal frame with respect to each of M preset noise models; and selecting the noise model with the largest posterior probability among the M noise models as the noise type of the audio signal frame, wherein M is an integer greater than or equal to 1.
  4. The method according to any one of claims 1 to 3, wherein the method further comprises:
    jointly adjusting an adaptive codebook gain and an algebraic codebook gain of the to-be-enhanced frame, and quantizing the jointly adjusted adaptive codebook gain and algebraic codebook gain separately, to obtain a quantization index of the jointly adjusted adaptive codebook gain and a quantization index of the jointly adjusted algebraic codebook gain of the to-be-enhanced frame, wherein the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame are obtained by performing a decoding operation on the to-be-enhanced frame;
    replacing the bits corresponding to the adaptive codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted adaptive codebook gain of the to-be-enhanced frame, and replacing the bits corresponding to the algebraic codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted algebraic codebook gain of the to-be-enhanced frame.
  5. The method according to any one of claims 1 to 3, wherein the enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame, comprises:
    computing the mean of the spectral envelope parameters of the to-be-enhanced frame of the audio signal and several frames, wherein the several frames are frames of the audio signal preceding the to-be-enhanced frame;
    computing a de-meaned spectral envelope parameter of the to-be-enhanced frame, wherein the de-meaned spectral envelope parameter is the difference between the spectral envelope parameter of the to-be-enhanced frame and the mean;
    enhancing the de-meaned spectral envelope parameter using the neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the de-meaned spectral envelope parameter;
    adding the pure estimate of the de-meaned spectral envelope parameter to a pre-obtained mean of clean audio spectral envelope parameters, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
  6. The method according to claim 2, wherein the counting, within N frames of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal comprises:
    counting, within N frames of the starting segment of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal; or
    counting, within N frames of the audio signal that include the audio signal frame and contain no speech signal, the number of frames of each noise type contained in the N frames, and selecting the noise type with the largest number of frames as the type of noise contained in the audio signal.
  7. The method according to any one of claims 1 to 3, wherein the method further comprises:
    when it is detected that the noise type of multiple consecutive frames of the audio signal differs from the previously determined type of noise contained in the audio signal, counting, within the consecutive frames, the number of frames of each noise type contained in them, and selecting the noise type with the largest number of frames as the current noise type of the audio signal;
    the enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the noise type of the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame, comprises:
    enhancing the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the current noise type of the audio signal, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
  8. The method according to any one of claims 1 to 7, wherein the neural network comprises:
    a recurrent deep neural network.
  9. An audio signal enhancement apparatus, comprising a decoding unit, an enhancement unit, and a replacement unit, wherein:
    the decoding unit is configured to decode a bit stream of an input audio signal to obtain a spectral envelope parameter of a to-be-enhanced frame of the audio signal;
    the enhancement unit is configured to enhance the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the type of noise contained in the audio signal, to obtain a pure estimate of the spectral envelope parameter of the to-be-enhanced frame;
    the replacement unit is configured to quantize the pure estimate to obtain a quantization index of the pure estimate of the spectral envelope parameter of the to-be-enhanced frame, and replace the bits corresponding to the spectral envelope parameter of the to-be-enhanced frame with the quantization index.
  10. The apparatus according to claim 9, wherein the decoding unit is further configured to decode the bit stream of the input audio signal to obtain a spectral envelope parameter of an audio signal frame of the audio signal;
    the apparatus further comprises:
    a classification unit, configured to perform noise classification on the audio signal frame using the spectral envelope parameter, to obtain a noise type of the audio signal frame;
    a statistics unit, configured to count, within N frames of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and select the noise type with the largest number of frames as the type of noise contained in the audio signal, wherein N is an integer greater than or equal to 1.
  11. The apparatus according to claim 10, wherein the classification unit is configured to obtain, from the bit stream of the input audio signal, codebook gain parameters corresponding to the audio signal frame, compute, using the codebook gain parameters and the spectral envelope parameter, the posterior probability of the audio signal frame with respect to each of M preset noise models, and select the noise model with the largest posterior probability among the M noise models as the noise type of the audio signal frame.
  12. The apparatus according to any one of claims 9 to 11, wherein the apparatus further comprises:
    an adjustment unit, configured to jointly adjust an adaptive codebook gain and an algebraic codebook gain of the to-be-enhanced frame, and quantize the jointly adjusted adaptive codebook gain and algebraic codebook gain separately, to obtain a quantization index of the jointly adjusted adaptive codebook gain and a quantization index of the jointly adjusted algebraic codebook gain of the to-be-enhanced frame, wherein the adaptive codebook gain and the algebraic codebook gain of the to-be-enhanced frame are obtained by performing a decoding operation on the to-be-enhanced frame;
    the replacement unit is further configured to replace the bits corresponding to the adaptive codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted adaptive codebook gain of the to-be-enhanced frame, and replace the bits corresponding to the algebraic codebook gain of the to-be-enhanced frame with the quantization index of the jointly adjusted algebraic codebook gain of the to-be-enhanced frame.
  13. The apparatus according to any one of claims 9 to 11, wherein the enhancement unit comprises:
    a first computing unit, configured to compute the mean of the spectral envelope parameters of the to-be-enhanced frame of the audio signal and several frames, wherein the several frames are frames of the audio signal preceding the to-be-enhanced frame;
    a second computing unit, configured to compute a de-meaned spectral envelope parameter of the to-be-enhanced frame, wherein the de-meaned spectral envelope parameter is the difference between the spectral envelope parameter of the to-be-enhanced frame and the mean;
    a third computing unit, configured to enhance the de-meaned spectral envelope parameter using the neural network set in advance for the noise type of the audio signal, to obtain a pure estimate of the de-meaned spectral envelope parameter;
    a fourth computing unit, configured to add the pure estimate of the de-meaned spectral envelope parameter to a pre-obtained mean of clean audio spectral envelope parameters, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
  14. The apparatus according to claim 10, wherein the statistics unit is configured to count, within N frames of the starting segment of the audio signal that include the audio signal frame, the number of frames of each noise type contained in the N frames, and select the noise type with the largest number of frames as the type of noise contained in the audio signal; or
    the statistics unit is configured to count, within N frames of the audio signal that include the audio signal frame and contain no speech signal, the number of frames of each noise type contained in the N frames, and select the noise type with the largest number of frames as the type of noise contained in the audio signal.
  15. The apparatus according to claim 10, wherein the statistics unit is further configured to: when it is detected that the noise type of multiple consecutive frames of the audio signal differs from the previously determined type of noise contained in the audio signal, count, within the consecutive frames, the number of frames of each noise type contained in them, and select the noise type with the largest number of frames as the current noise type of the audio signal;
    the enhancement unit is configured to enhance the spectral envelope parameter of the to-be-enhanced frame of the audio signal using a neural network set in advance for the current noise type of the audio signal, to obtain the pure estimate of the spectral envelope parameter of the to-be-enhanced frame.
  16. The apparatus according to any one of claims 9 to 15, wherein the neural network comprises:
    a recurrent deep neural network.
PCT/CN2016/073792 2015-06-02 2016-02-15 Audio signal enhancement method and apparatus WO2016192410A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510295355.2A CN104966517B (zh) 2015-06-02 2015-06-02 Audio signal enhancement method and apparatus
CN201510295355.2 2015-06-02

Publications (1)

Publication Number Publication Date
WO2016192410A1 true WO2016192410A1 (zh) 2016-12-08

Family

ID=54220545

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/073792 WO2016192410A1 (zh) 2015-06-02 2016-02-15 一种音频信号增强方法和装置

Country Status (2)

Country Link
CN (1) CN104966517B (zh)
WO (1) WO2016192410A1 (zh)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104966517B (zh) 2015-06-02 2019-02-01 华为技术有限公司 Audio signal enhancement method and apparatus
CN105657535B (zh) 2015-12-29 2018-10-30 北京搜狗科技发展有限公司 Audio recognition method and apparatus
CN106328150B (zh) 2016-08-18 2019-08-02 北京易迈医疗科技有限公司 Bowel sound detection method, apparatus and system in a noisy environment
CN109427340A (zh) 2017-08-22 2019-03-05 杭州海康威视数字技术股份有限公司 Speech enhancement method and apparatus, and electronic device
CN107564538A (zh) 2017-09-18 2018-01-09 武汉大学 Intelligibility enhancement method and system for real-time speech communication
CN110085216A (zh) 2018-01-23 2019-08-02 中国科学院声学研究所 Infant cry detection method and apparatus
CN108335702A (zh) 2018-02-01 2018-07-27 福州大学 Audio noise reduction method based on deep neural networks
US10991379B2 (en) * 2018-06-22 2021-04-27 Babblelabs Llc Data driven audio enhancement
CN109087659A (zh) 2018-08-03 2018-12-25 三星电子(中国)研发中心 Audio optimization method and device
CN108806711A (zh) 2018-08-07 2018-11-13 吴思 Extraction method and apparatus
CN110147788B (zh) 2019-05-27 2021-09-21 东北大学 Label text recognition method for metal plate and strip products based on feature-enhanced CRNN
CN112133299B (zh) 2019-06-25 2021-08-27 大众问问(北京)信息科技有限公司 Sound signal processing method, apparatus and device
CN110491406B (zh) 2019-09-25 2020-07-31 电子科技大学 Dual-noise speech enhancement method using multiple modules to suppress different types of noise
CN110942779A (zh) 2019-11-13 2020-03-31 苏宁云计算有限公司 Noise processing method, apparatus and system
CN110970050B (zh) 2019-12-20 2022-07-15 北京声智科技有限公司 Speech noise reduction method, apparatus, device and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6347297B1 (en) * 1998-10-05 2002-02-12 Legerity, Inc. Matrix quantization with vector quantization error compensation and neural network postprocessing for robust speech recognition
CN102124518A (zh) 2008-08-05 2011-07-13 弗朗霍夫应用科学研究促进协会 Method and apparatus for processing an audio signal for speech enhancement using feature extraction
CN104021796A (zh) 2013-02-28 2014-09-03 华为技术有限公司 Speech enhancement processing method and apparatus
CN104318927A (zh) 2014-11-04 2015-01-28 东莞市北斗时空通信科技有限公司 Noise-resistant low-bit-rate speech encoding method and decoding method
CN104966517A (zh) 2015-06-02 2015-10-07 华为技术有限公司 Audio signal enhancement method and apparatus

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5737485A (en) * 1995-03-07 1998-04-07 Rutgers The State University Of New Jersey Method and apparatus including microphone arrays and neural networks for speech/speaker recognition systems
US5742694A (en) * 1996-07-12 1998-04-21 Eatwell; Graham P. Noise reduction filter
CN1169117C (zh) 1996-11-07 2004-09-29 松下电器产业株式会社 Sound source vector generating apparatus, and speech coding apparatus and speech decoding apparatus
CN101796579B (zh) 2007-07-06 2014-12-10 法国电信公司 Hierarchical coding of digital audio signals
KR101173980B1 (ko) 2010-10-18 2012-08-16 (주)트란소노 System and method for noise removal based on voice communication
RU2464649C1 (ru) 2011-06-01 2012-10-20 Корпорация "САМСУНГ ЭЛЕКТРОНИКС Ко., Лтд." Audio signal processing method
CN104157293B (zh) 2014-08-28 2017-04-05 福建师范大学福清分校 Signal processing method for enhancing pickup of a target speech signal in a noisy acoustic environment
CN104575509A (zh) 2014-12-29 2015-04-29 乐视致新电子科技(天津)有限公司 Speech enhancement processing method and apparatus
CN104637489B (zh) 2015-01-21 2018-08-21 华为技术有限公司 Sound signal processing method and apparatus


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wei, Xiaodong, et al.: "Real-time Speech Enhancement Approach Using Neural Networks", Journal of Shanghai Jiaotong University, vol. 32, no. 4, 30 April 1998 (1998-04-30) *

Also Published As

Publication number Publication date
CN104966517B (zh) 2019-02-01
CN104966517A (zh) 2015-10-07

Similar Documents

Publication Publication Date Title
WO2016192410A1 (zh) 2016-12-08 Audio signal enhancement method and apparatus
US10950249B2 (en) Audio watermark encoding/decoding
JP5089772B2 (ja) Apparatus and method for detecting voice activity
RU2417456C2 (ru) Systems, methods and devices for detecting signal changes
US8554550B2 (en) Systems, methods, and apparatus for context processing using multi resolution analysis
US20090168673A1 (en) Method and apparatus for detecting and suppressing echo in packet networks
WO2021103778A1 (zh) Speech processing method and apparatus, computer-readable storage medium, and computer device
CN112489665B (zh) Speech processing method and apparatus, and electronic device
JP6174266B2 (ja) System and method for blind bandwidth extension
US9373342B2 (en) System and method for speech enhancement on compressed speech
US20140278418A1 (en) Speaker-identification-assisted downlink speech processing systems and methods
US20200098380A1 (en) Audio watermark encoding/decoding
CN112334980A (zh) Adaptive comfort noise parameter determination
JP2019204097A (ja) Speech coding method and related apparatus
CN101069231A (zh) Comfort noise generation method for voice communication
Mohamed et al. On deep speech packet loss concealment: A mini-survey
CN112751820B (zh) Digital speech packet loss concealment using deep learning
Zhou et al. Speech Enhancement via Residual Dense Generative Adversarial Network.
CN115101088A (zh) Audio signal recovery method and apparatus, electronic device, and medium
Liu et al. PLCNet: Real-time Packet Loss Concealment with Semi-supervised Generative Adversarial Network.
Zhao et al. Enhancement of G.711-Coded Speech Providing Quality Higher Than Uncoded
US20240127848A1 (en) Quality estimation model for packet loss concealment
Xiang et al. An improved packet loss concealment method for mobile audio coding
CN116705040A (zh) Audio signal recovery method and apparatus, electronic device, and readable storage medium
JP2023517973A (ja) Speech coding method and apparatus, computer device, and computer program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 16802333
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 16802333
    Country of ref document: EP
    Kind code of ref document: A1