CN112242147A - Voice gain control method and computer storage medium - Google Patents
- Publication number
- CN112242147A (application CN202011098089.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- network model
- neural network
- speech
- signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
Abstract
The invention relates to the technical field of voice communication and discloses a voice gain control method and a computer storage medium, comprising the following steps: performing framing and Fourier transformation on the voice signal to obtain the original amplitude spectrum and original phase spectrum of the frequency-domain signal; preprocessing the original amplitude spectrum through a neural network model, suppressing the transient-noise amplitude-spectrum components in the original amplitude spectrum to obtain a speech-enhanced amplitude spectrum; performing AGC processing and correction processing on the restored time-domain signal, including framing the time-domain signal, extracting its envelope, and performing AGC on the envelope to obtain a gain coefficient; and finally applying the gain coefficient to the time-domain signal to complete gain control of the speech signal. After preprocessing by the deep-learning neural network model, the transient-noise amplitude is greatly reduced, far below the gain-amplification threshold in AGC; the gain-correction processing reduces the risk of transient noise being mistakenly amplified by the AGC algorithm and improves voice call quality.
Description
Technical Field
The present invention relates to the field of voice call technology, and in particular, to a method for controlling voice gain during voice call and a storage medium.
Background
Transient noise during a conference call distracts listeners, and handling it properly is an urgent technical problem in conference calling. The well-known open-source project WebRTC uses an automatic gain control (AGC) algorithm in conference calls. For example, the patent application No. CN201010181204.1, entitled "speech noise reduction device", discloses using AGC for noise control: when the target speech is absent, AGC further suppresses the noise intensity by reducing the gain, with the probability of target-speech presence supplied by a speech detection unit. Likewise, the invention with application No. CN201910860097.6, entitled "microphone automatic gain control method, apparatus and storage medium", discloses controlling the microphone volume through AGC to achieve noise suppression.
The main principle of the existing AGC algorithm is to take the time-domain amplitude envelope of the input signal, determine the difference between that amplitude and a given target amplitude, and compute a corresponding gain, so that signals above the target amplitude are attenuated and signals below it are boosted. The existing algorithm has a drawback: when transient noise occurs in speech, it easily interferes with the amplitude decision, so that a speech signal which has not reached the target amplitude is judged to exceed it because of the transient noise, and is therefore attenuated. Moreover, the smoothing in the algorithm suppresses a short stretch of the speech signal after the transient noise, making the speech volume sound discontinuous and degrading call quality.
Disclosure of Invention
Therefore, it is necessary to provide a speech gain control method to solve the above technical problem of poor speech-noise handling in the prior art.
In order to achieve the above object, the present invention provides a speech gain control method, comprising the steps of:
performing framing and Fourier transformation on the voice signal, and then converting the voice signal into a polar coordinate form to obtain an original amplitude spectrum and an original phase spectrum of the frequency domain signal;
preprocessing the original amplitude spectrum through a neural network model, wherein the preprocessing comprises suppressing instantaneous noise amplitude spectrum components in the original amplitude spectrum to obtain a voice enhanced amplitude spectrum; restoring the preprocessed voice enhanced amplitude spectrum and the original phase spectrum into a time domain signal;
calculating the energy ratio of the voice signal before preprocessing and the time domain signal after preprocessing;
performing AGC processing and correction processing on the time domain signal, including: framing the time domain signal to obtain an envelope, and performing AGC processing on the envelope to obtain a gain coefficient; if the energy ratio is greater than a first preset threshold, the gain coefficient is not corrected; if the energy ratio is smaller than the first preset threshold and greater than a second preset threshold, the gain coefficient is corrected to be not more than half of the gain coefficient; and if the energy ratio is smaller than the second preset threshold, the gain coefficient is corrected to the product of the gain coefficient and the energy ratio;
and finally, applying the gain coefficient to the time-domain signal to complete the gain control of the speech signal.
Further, the neural network model structure includes: an input layer, an output layer, a first fully connected layer, a second fully connected layer, a first LSTM layer and a second LSTM layer, trained on open-source and in-house data sets.
Further, the input layer has 128 neurons, the 128 neurons corresponding to 128 amplitude spectrum values; the output layer has 128 neurons, the 128 neurons corresponding to 128 speech enhancement amplitude spectral values, a first fully connected layer having 64 neurons, a second fully connected layer having 128 neurons, a first LSTM layer having 64 neurons, a second LSTM layer having 128 neurons.
Further, the energy ratio used in the correction processing is obtained by taking the logarithm of the ratio of the signal energy before preprocessing to that after preprocessing and comparing it with the preset thresholds. After preprocessing by the neural network model, the energy ratio falls below the second preset threshold of the correction processing; in that case the energy ratio must first be converted back to a linear scale before being multiplied by the gain coefficient.
Further, the step of preprocessing the frequency domain signal through a neural network model to suppress a noise signal in the frequency domain signal includes the steps of:
dividing the frequency domain signal into voiced segments containing human voice and unvoiced segments containing no human voice, wherein the suppression strength applied to the noise signal is at least 12 dB in the voiced segments and at least 24 dB in the unvoiced segments.
Further, after obtaining the speech enhancement magnitude spectrum, the method further comprises the step of smoothing the speech enhancement magnitude spectrum between frames with a smoothing coefficient lower than 0.1.
Further, the smoothing coefficient is 0.
Further, the neural network model is an LSTM neural network model that is deeply trained, and in different training stages, different data sets are used, and the deep training of the LSTM neural network model includes:
using the features of speech without transient noise as both the input and the output of the LSTM neural network model for training, and obtaining a parameter set A after the LSTM neural network model converges;
and then using the features of noisy speech as the input of the LSTM neural network model and the features of speech without transient noise as its output, continuing to train the model from parameter set A, obtaining a parameter set B after the model converges, and taking parameter set B as the final parameters of the LSTM neural network model.
In order to solve the above technical problem, the invention also provides the following technical scheme:
a computer storage medium, the storage medium having a program stored therein, the program, when executed by a processor, performing the speech gain control method of any of the preceding claims.
Different from the prior art, the above technical scheme provides a speech gain control method based on a neural network model. In this scheme, the speech signal is framed, Fourier-transformed and converted into polar form to obtain the original amplitude spectrum and original phase spectrum of the corresponding frequency-domain signal. Before AGC processing, transient noise is detected by the neural network model, and when detected, the transient-noise components in the original amplitude spectrum are suppressed, so that the transient-noise amplitude is reduced without disturbing normal AGC processing of the speech signal. Furthermore, after processing by the deep-learning neural network model, the transient-noise amplitude is greatly reduced, far below the gain-amplification threshold in AGC (i.e., the second preset threshold in the AGC processing); the gain-correction processing reduces the risk of transient noise being mistakenly amplified by the AGC algorithm and improves voice call quality.
Drawings
FIG. 1 is a flow chart of a speech gain control method according to an embodiment;
FIG. 2 is a diagram illustrating a neural network model according to an embodiment;
FIG. 3 is a schematic diagram of a computer storage medium according to an embodiment.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
The existing AGC algorithm mainly computes the gain of the current frame from the amplitude characteristics of the speech signal combined with voice activity detection (VAD) of the current frame; if transient noise occurs in a signal segment containing no human voice, it is attenuated to prevent it from being amplified by mistake. In a segment containing human voice, however, transient noise is often difficult for VAD to detect, and even when detected it is difficult for AGC to suppress. The reason is that transient noise usually lasts only 5–50 ms; the short-time Fourier transform frame is likewise only 10–20 ms, the overlap between the transient-noise spectrum and the speech spectrum is high, and the noise is therefore difficult to separate from the spectrum.
In addition, AGC normally adjusts its response according to the amplitude variation of the input signal. When transient noise occurs in the speech, the gain coefficient is reduced to bring the perceived volume back to the target level, so the speech segment in which the transient noise occurs is suppressed; with the smoothing strategy, the suppression persists for a short period afterwards. Audibly, the speech becomes quieter after the transient noise, harming the actual conversation experience.
To counter the interference of transient noise with AGC, the present embodiment provides a speech gain control method that adds a preprocessing stage before the AGC algorithm. Referring to fig. 1 and fig. 2, the voice gain control method can be used in voice communication such as conference calls, performing noise suppression and automatic gain control on a voice signal to improve voice call quality. As shown in fig. 1, the voice gain control method includes the steps of:
S101, framing and Fourier transformation are performed on the voice signal, which is then converted into polar coordinate form to obtain the original amplitude spectrum (S102) and the original phase spectrum (S103) of the frequency domain signal;
S104, preprocessing the original amplitude spectrum through a neural network model, suppressing transient-noise amplitude-spectrum components in the original amplitude spectrum to obtain a speech enhancement amplitude spectrum;
S105, restoring the preprocessed speech enhancement amplitude spectrum and the original phase spectrum to a time domain signal, and calculating the energy ratio of the voice signal before and after preprocessing;
S106, performing AGC processing and correction processing on the time domain signal: framing the time domain signal to obtain an envelope, and performing AGC processing on the envelope to obtain a gain coefficient; if the energy ratio is greater than a first preset threshold, the gain coefficient is not corrected; if the energy ratio is smaller than the first preset threshold and greater than a second preset threshold, the gain coefficient is corrected to be not more than half of the gain coefficient; and if the energy ratio is smaller than the second preset threshold, the gain coefficient is corrected to the product of the gain coefficient and the energy ratio;
and finally, applying the gain coefficient to the time domain signal to complete gain control of the speech signal.
In this embodiment, before the AGC algorithm runs, a deep-learning neural network model judges whether the current speech frame contains transient noise; if so, the noise is suppressed, otherwise no processing is applied. This avoids interference of transient noise with the call and improves the call experience.
In step S101, an input speech signal is framed and fourier transformed to obtain a corresponding frequency domain signal, which is converted into a polar coordinate form to obtain an original amplitude spectrum and original phase spectrum information, and only the original amplitude spectrum is modified while the original phase spectrum remains unchanged. Step S104 is performed after step S101.
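As an illustrative sketch only (not the patent's implementation — the frame length, hop size and Hann window are assumed values), step S101 and the later time-domain restoration can be expressed with NumPy as follows:

```python
import numpy as np

def stft_mag_phase(signal, frame_len=256, hop=128):
    """Frame the signal, apply a Hann window and an FFT, and return the
    magnitude and phase spectra (polar form) of each frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)       # one-sided spectrum per frame
    return np.abs(spectrum), np.angle(spectrum)  # magnitude, phase

def istft(mag, phase, frame_len=256, hop=128):
    """Recombine a (possibly modified) magnitude spectrum with the original
    phase and overlap-add back to a time-domain signal."""
    frames = np.fft.irfft(mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop: i * hop + frame_len] += frame
    return out
```

Note that `rfft` of a 256-point frame yields 129 one-sided bins; the patent's count of 128 presumably drops one boundary bin.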
In step S104, the original amplitude spectrum is preprocessed by the neural network model, that is, AGC preprocessing is performed to suppress an instantaneous noise amplitude spectrum component in the original amplitude spectrum, so as to obtain a speech enhancement amplitude spectrum.
Preferably, as shown in fig. 2, the neural network model is a deep-learning LSTM neural network model (though it is not limited to LSTM); LSTM is also called a long short-term memory network. The input gate controls how much of the newly computed state is written into the memory cell; the forget gate controls how much of the previous step's memory-cell content is discarded; the output gate controls how much the current output depends on the current memory cell.
In this embodiment, the LSTM neural network model includes: an input layer, an output layer, a first fully-connected layer, a second fully-connected layer, a first LSTM layer, and a second LSTM layer. The input layer has 128 neurons, the 128 neurons corresponding to 128 amplitude spectral values; the output layer has 128 neurons, the 128 neurons of the output layer corresponding to 128 speech enhancement amplitude spectral values, the first fully connected layer having 64 neurons, the second fully connected layer having 128 neurons, the first LSTM layer having 64 neurons, the second LSTM layer having 128 neurons.
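The layer sizes above can be sketched in plain NumPy. This is a hedged illustration, not the patent's trained network: the ordering of the layers, the activations, and the masking output are assumptions, and the weights are random.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell with input, forget and output gates."""
    def __init__(self, in_dim, hidden_dim, rng):
        s = 1.0 / np.sqrt(in_dim + hidden_dim)
        self.W = rng.uniform(-s, s, (4 * hidden_dim, in_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.h = np.zeros(hidden_dim)
        self.c = np.zeros(hidden_dim)

    def step(self, x):
        z = self.W @ np.concatenate([x, self.h]) + self.b
        i, f, o, g = np.split(z, 4)
        self.c = sigmoid(f) * self.c + sigmoid(i) * np.tanh(g)  # forget old, admit new
        self.h = sigmoid(o) * np.tanh(self.c)                   # expose the memory cell
        return self.h

class DenoiserSketch:
    """128-in/128-out model with two fully connected and two LSTM layers,
    sized as in the text; the layer ordering here is an assumption."""
    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.fc1 = rng.standard_normal((64, 128)) * 0.05   # first FC layer: 64 neurons
        self.lstm1 = LSTMCell(64, 64, rng)                 # first LSTM layer: 64 neurons
        self.lstm2 = LSTMCell(64, 128, rng)                # second LSTM layer: 128 neurons
        self.fc2 = rng.standard_normal((128, 128)) * 0.05  # second FC layer: 128 neurons

    def step(self, mag_frame):
        h = np.tanh(self.fc1 @ mag_frame)
        h = self.lstm2.step(self.lstm1.step(h))
        mask = sigmoid(self.fc2 @ h)   # per-bin gain in (0, 1)
        return mask * mag_frame        # enhanced magnitude spectrum
```

Feeding one 128-value magnitude frame per step yields one 128-value enhanced frame, matching the input/output layer sizes in the text.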
The deep-learning LSTM neural network model involves two phases: training and inference. Before use, the model must be trained in advance on data sets; through deep-learning training, the LSTM neural network model can preprocess the frequency domain signal and suppress the noise signal in it. The training of the LSTM neural network model comprises:
1) collecting clean speech and extracting its features (the signal spectrum), using them as both the input and the output of the neural network model, and obtaining parameter set A after convergence;
2) on the basis of parameter set A, switching data sets: collecting noisy speech signals and extracting their features (the signal spectrum) as the input of the network model, while clean speech and its corresponding features serve as the output; training the neural network model further from parameter set A until it converges, yielding parameter set B.
Parameter set B is the final parameter set. This training scheme accelerates convergence and reduces the time spent on training.
Training of the LSTM neural network model comprises forward propagation and backward propagation; the weight parameters are extracted once the loss function of the LSTM neural network model converges. At inference time, the weights are used directly in a forward pass to obtain the desired features; no backward pass is needed.
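The two-stage scheme (warm-start on clean speech, then fine-tune on noisy input) can be illustrated with a toy linear model. This is only a sketch of the training schedule: plain gradient descent stands in for backpropagation, the data are random stand-ins for spectral features, and all sizes and rates are arbitrary.

```python
import numpy as np

def train(W, X, Y, lr=0.1, epochs=200):
    """Plain batch gradient descent on a linear model: minimise ||XW - Y||^2."""
    for _ in range(epochs):
        W = W - lr * (X.T @ (X @ W - Y)) / len(X)
    return W

rng = np.random.default_rng(0)
clean = rng.normal(size=(64, 8))                    # stand-in for clean-speech features
noisy = clean + 0.3 * rng.normal(size=clean.shape)  # same features plus noise

# Stage 1: clean features as both input and target -> parameter set A
W_a = train(np.zeros((8, 8)), clean, clean)
# Stage 2: continue from A with noisy input and clean target -> parameter set B
W_b = train(W_a, noisy, clean)
```

Starting stage 2 from parameter set A rather than from scratch is what gives the faster convergence the text claims.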
In the above embodiment, the voice signal may be a full-band voice signal. Existing deep-learning audio algorithms often use Mel-frequency cepstral coefficients (MFCC) as one of the feature values: the whole speech band is divided into several sub-bands, and the logarithmic energy of each sub-band characterizes it. This is a coarse-resolution analysis that works well for semantic analysis and speech recognition, but it cannot suppress transient noise finely, because the spectral overlap between transient noise and the speech signal is relatively high. In practice this shows as: considerable transient noise remains in voice-bearing signals, and its onset and offset are clearly perceptible to the human ear. The above embodiment therefore adopts the complete spectrum as the voice feature, which effectively strengthens transient-noise suppression and weakens the human ear's perception of the transient noise.
Also in the above embodiment, the full-band amplitude spectrum can be adopted as the signal feature. If a full-band complex spectrum were used with 256 Fourier transform points, there would be 128 effective complex spectral values, i.e., 256 features in total (real and imaginary parts). Since the human ear is relatively insensitive to the phase of noise, this embodiment uses the amplitude spectrum as the signal feature instead, i.e., 128 amplitude values, halving the number of input features; the original phase is still used as the phase information.
In the training process, specifically, the spectrum of the noisy speech signal is used as the input feature and the spectrum of the clean speech signal as the target feature. The feature is the signal spectrum; concretely, the Fourier transform result is converted into polar representation, and because the human ear is insensitive to phase information, the amplitude values are used as input features and the phase is discarded.
Steps S105 and S106 follow step S104: the preprocessed speech enhancement amplitude spectrum and the original phase spectrum are restored to a time-domain signal, the time-domain signal is framed to obtain an envelope, AGC processing is performed on the envelope to obtain a gain coefficient, and the gain coefficient is then corrected: if the energy ratio is greater than the first preset threshold, the gain coefficient is not corrected; if the energy ratio is smaller than the first preset threshold and greater than the second preset threshold, the gain coefficient is corrected to be not more than half of the gain coefficient; and if the energy ratio is smaller than the second preset threshold, the gain coefficient is corrected to the product of the gain coefficient and the energy ratio. AGC processing (i.e., the automatic gain control algorithm) makes the volume of the voice signal more uniform and coherent, improving call quality. Because preprocessing runs before AGC in this embodiment, the effective speech signal is prevented from being suppressed when transient noise occurs, the energy of the transient noise is reduced, and the correction processing prevents the transient noise from being mistakenly amplified during AGC.
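The three-way correction rule can be sketched as follows. This is an illustrative reading of the text, not the patent's code: the threshold values are placeholders, and the energy ratio is assumed to be expressed on a log (dB) scale, consistent with the logarithm/linear-scale conversion described earlier.

```python
def correct_gain(gain, energy_ratio_db, thr1_db, thr2_db):
    """Correct an AGC gain coefficient using the pre/post-preprocessing
    energy ratio (log scale). thr1_db > thr2_db; the actual threshold
    values are not given in the patent and are placeholders here."""
    if energy_ratio_db > thr1_db:
        return gain                   # little energy removed: keep the gain
    if energy_ratio_db > thr2_db:
        return 0.5 * gain             # cap at half the gain coefficient
    # strong suppression occurred: convert the log ratio back to a
    # linear scale, then multiply it by the gain coefficient
    return gain * 10.0 ** (energy_ratio_db / 10.0)
```

For example, with placeholder thresholds of 0 dB and -10 dB, a frame whose preprocessing removed 20 dB of energy would have its gain of 2.0 corrected down to 0.02, preventing the suppressed transient from being re-amplified.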
Preferably, in the above embodiment, to suppress transient-noise interference effectively, after the neural network model preprocessing of step S104 the energy ratio falls below the second preset threshold used in the correction processing.
In one embodiment, the step of pre-processing the frequency domain signal by a neural network model to suppress a noise signal in the frequency domain signal includes the steps of:
and dividing the frequency domain signal into a voice section containing the voice and a voiceless section not containing the voice, wherein the suppression strength of the noise signal in the voice section is greater than or equal to 12dB, and the suppression strength of the noise signal in the voiceless section is greater than or equal to 24 dB.
In the present embodiment, transient-noise suppression is performed before AGC processing: in unvoiced segments containing no human voice, suppression exceeds 24 dB, preventing the noise from being mistakenly amplified by AGC; in speech segments containing human voice, suppression exceeds 12 dB, preventing the noise from being treated as speech, which would distort the amplitude decision in AGC and suppress the speech signal carrying the transient noise, causing audible discontinuity.
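On an amplitude spectrum, a suppression of D dB corresponds to multiplying the magnitudes by 10^(-D/20). A minimal sketch (applying exactly the minimum required attenuation; the voiced/unvoiced decision itself is assumed to come from elsewhere):

```python
import numpy as np

def suppress(mag, voiced, speech_db=12.0, silence_db=24.0):
    """Attenuate transient-noise bins of a magnitude spectrum by at least
    12 dB in voiced segments and 24 dB in unvoiced segments. On an
    amplitude scale, a drop of D dB multiplies the magnitude by 10**(-D/20)."""
    drop_db = speech_db if voiced else silence_db
    return mag * 10.0 ** (-drop_db / 20.0)
```

A 24 dB drop scales magnitudes by about 0.063, which is what keeps residual transients well below the AGC amplification threshold.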
Further, in the above embodiment, when AGC processing is applied to the preprocessed signal, no smoothing is applied to the gain suppression coefficient or the gain amplification coefficient. For steady-state noise, gain smoothing makes the processed speech stable and natural and avoids artifacts caused by large inter-frame gain differences. Transient noise, however, occurs far less frequently than steady-state noise, and smoothing weakens its suppression; in particular, in scenes with repeated transients (such as knocking on a desk), the first few knocks are suppressed only weakly, because smoothing delays the suppression of transient noise.
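The delay effect can be shown with a one-pole gain smoother; the gain track and the smoothing coefficients below are illustrative values, not from the patent.

```python
def apply_gains(gains, alpha):
    """One-pole (exponential) smoothing of a per-frame gain track.
    alpha = 0 applies each frame's target gain immediately; a large alpha
    makes a sudden gain drop take effect only gradually."""
    out, g = [], gains[0]
    for target in gains:
        g = alpha * g + (1.0 - alpha) * target
        out.append(g)
    return out

# A transient at frame 2 calls for a gain drop from 1.0 to 0.1:
track = [1.0, 1.0, 0.1, 0.1, 0.1]
unsmoothed = apply_gains(track, alpha=0.0)  # reacts instantly at frame 2
smoothed = apply_gains(track, alpha=0.9)    # the drop lags by several frames
```

With alpha = 0.9 the gain at the transient frame is still above 0.9, so the first knock passes through almost unattenuated — which is exactly why this embodiment skips smoothing of the gain coefficients.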
As shown in fig. 3, in another embodiment, a computer storage medium 300 is provided, in which a program is stored, which when executed by a processor performs the speech gain control method described in any of the above embodiments.
It should be noted that, although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concepts of the present invention, changes and modifications to the embodiments described herein, or equivalent structures or processes derived from the content of this specification and the accompanying drawings, applied directly or indirectly in other related technical fields, all fall within the scope of protection of the present invention.
Claims (9)
1. A method for speech gain control, comprising the steps of:
performing framing and Fourier transformation on the voice signal, and then converting the voice signal into a polar coordinate form to obtain an original amplitude spectrum and an original phase spectrum of the frequency domain signal;
preprocessing the original amplitude spectrum through a neural network model, wherein the preprocessing comprises suppressing instantaneous noise amplitude spectrum components in the original amplitude spectrum to obtain a voice enhanced amplitude spectrum; restoring the preprocessed voice enhanced amplitude spectrum and the original phase spectrum into a time domain signal;
calculating the energy ratio of the voice signal before preprocessing and the time domain signal after preprocessing;
performing AGC processing and modification processing on the time domain signal, including: framing the time domain signal to obtain an envelope, and carrying out AGC processing on the envelope to obtain a gain coefficient; if the energy ratio is greater than a first preset threshold, the gain coefficient is not corrected, if the energy ratio is less than the first preset threshold and greater than a second preset threshold, the gain coefficient is corrected to be not more than half of the gain coefficient, and if the energy ratio is less than the second preset threshold, the gain coefficient is corrected to be a product result of the gain coefficient and the energy ratio;
and finally, applying the gain coefficient to the time-domain signal to complete the gain control of the speech signal.
2. The speech gain control method of claim 1, wherein the neural network model structure comprises: an input layer, an output layer, a first fully connected layer, a second fully connected layer, a first LSTM layer and a second LSTM layer, trained on open-source and in-house data sets.
3. The speech gain control method of claim 2, wherein the input layer has 128 neurons, the 128 neurons corresponding to 128 amplitude spectral values; the output layer has 128 neurons, the 128 neurons corresponding to 128 speech enhancement amplitude spectral values; the first fully connected layer has 64 neurons, the second fully connected layer has 128 neurons, the first LSTM layer has 64 neurons, and the second LSTM layer has 128 neurons.
4. The speech gain control method according to claim 1, wherein the energy ratio used in the correction processing is obtained by taking the logarithm of the ratio of the signal energy before preprocessing to that after preprocessing and comparing it with the preset thresholds; after preprocessing by the neural network model, the energy ratio is smaller than the second preset threshold of the correction processing, in which case the energy ratio must first be converted to a linear scale before being multiplied by the gain coefficient.
5. The speech gain control method according to claim 1, wherein preprocessing the frequency-domain signal through the neural network model to suppress the noise signal comprises:
dividing the frequency-domain signal into speech segments containing speech and non-speech segments containing no speech, wherein the noise suppression depth is at least 12 dB in the speech segments and at least 24 dB in the non-speech segments.
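The segment-dependent suppression depths of claim 5 can be sketched as below; subtracting a scaled noise estimate from the magnitude spectrum is an assumption, since the patent does not specify the suppression mechanism:

```python
def suppress_frame(magnitude, noise_estimate, is_speech):
    """Attenuate the estimated noise in one magnitude-spectrum frame
    by at least 12 dB in speech segments and at least 24 dB in
    non-speech segments (claim 5). The spectral-subtraction form
    here is illustrative only."""
    depth_db = 12.0 if is_speech else 24.0
    residual = 10.0 ** (-depth_db / 20.0)   # fraction of noise kept
    return [max(m - (1.0 - residual) * n, 0.0)
            for m, n in zip(magnitude, noise_estimate)]
```

The gentler depth in speech segments trades residual noise for less speech distortion, while silence can be suppressed aggressively.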
6. The speech gain control method according to claim 1, further comprising, after obtaining the speech-enhancement magnitude spectrum: smoothing the speech-enhancement magnitude spectrum between frames with a smoothing coefficient lower than 0.1.
7. The speech gain control method of claim 6, wherein the smoothing coefficient is 0.
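Claims 6-7 describe inter-frame smoothing with a small coefficient; a first-order recursion is one natural reading (the exact recursion form is assumed, not stated in the claims):

```python
def smooth_frames(prev_spectrum, cur_spectrum, alpha=0.05):
    """First-order recursive smoothing between consecutive
    speech-enhancement magnitude spectra. Per claims 6-7, alpha
    stays below 0.1, and alpha = 0 disables smoothing so each
    frame passes through unchanged."""
    return [alpha * p + (1.0 - alpha) * c
            for p, c in zip(prev_spectrum, cur_spectrum)]
```

Keeping alpha small (or zero) preserves the fast spectral transitions of speech, which heavier smoothing would blur.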
8. The speech gain control method of claim 1, wherein the neural network model is an LSTM neural network model trained in stages on different data sets, comprising:
using the features of the noisy speech without transient noise as the input and the corresponding speech without transient noise as the output of the LSTM neural network model for training, and obtaining a parameter A after the LSTM neural network model converges;
and then using the features of the noisy speech as the input and the speech features without transient noise as the output of the LSTM neural network model, continuing training from parameter A, obtaining a parameter B after the LSTM neural network model converges, and taking parameter B as the parameter of the LSTM neural network model.
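The warm-started, two-stage training of claim 8 can be illustrated with a toy stand-in: a one-parameter linear fit replaces the LSTM, and the training pairs are invented; only the staging pattern (converge to parameter A, then continue from A to parameter B) mirrors the claim:

```python
def train_stage(pairs, w, lr=0.1, epochs=200):
    """One training stage: fit y = w * x by gradient descent,
    starting from the parameter handed in (the warm start that
    claim 8 calls parameter A)."""
    for _ in range(epochs):
        for x, y in pairs:
            w -= lr * (w * x - y) * x
    return w

# Stage 1: data without transient noise -> parameter A.
param_a = train_stage([(1.0, 2.0), (2.0, 4.0)], w=0.0)
# Stage 2: continue from A on data that includes transient noise -> B.
param_b = train_stage([(1.0, 3.0), (2.0, 6.0)], w=param_a)
```

Starting stage 2 from the converged stage-1 parameters lets the model learn transient-noise removal without relearning stationary denoising from scratch.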
9. A computer storage medium having stored therein a program which, when executed by a processor, performs the speech gain control method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011098089.1A CN112242147B (en) | 2020-10-14 | 2020-10-14 | Voice gain control method and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112242147A (en) | 2021-01-19 |
CN112242147B (en) | 2023-12-19 |
Family
ID=74169185
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011098089.1A Active CN112242147B (en) | 2020-10-14 | 2020-10-14 | Voice gain control method and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112242147B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140126745A1 (en) * | 2012-02-08 | 2014-05-08 | Dolby Laboratories Licensing Corporation | Combined suppression of noise, echo, and out-of-location signals |
CN105515597A (en) * | 2015-12-02 | 2016-04-20 | 中国电子科技集团公司第四十一研究所 | Automatic gain control circuit for receivers |
US20160261951A1 (en) * | 2013-10-30 | 2016-09-08 | Nuance Communications, Inc. | Methods And Apparatus For Selective Microphone Signal Combining |
CN108877775A (en) * | 2018-06-04 | 2018-11-23 | 平安科技(深圳)有限公司 | Voice data processing method, device, computer equipment and storage medium |
US20190206420A1 (en) * | 2017-12-29 | 2019-07-04 | Harman Becker Automotive Systems Gmbh | Dynamic noise suppression and operations for noisy speech signals |
CN110036440A (en) * | 2016-10-18 | 2019-07-19 | 弗劳恩霍夫应用研究促进协会 | Device and method for handling audio signal |
Non-Patent Citations (1)
Title |
---|
ZHANG Tianqi et al., "Phase spectrum compensation speech enhancement algorithm based on sparsity", Journal of Signal Processing, vol. 36, no. 11, pages 1867 - 1876 *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113823312A (en) * | 2021-02-19 | 2021-12-21 | 北京沃东天骏信息技术有限公司 | Speech enhancement model generation method and device and speech enhancement method and device |
CN113823312B (en) * | 2021-02-19 | 2023-11-07 | 北京沃东天骏信息技术有限公司 | Speech enhancement model generation method and device, and speech enhancement method and device |
CN113436640A (en) * | 2021-06-28 | 2021-09-24 | 歌尔科技有限公司 | Audio noise reduction method, device and system and computer readable storage medium |
CN113436640B (en) * | 2021-06-28 | 2022-11-25 | 歌尔科技有限公司 | Audio noise reduction method, device and system and computer readable storage medium |
CN113470691A (en) * | 2021-07-08 | 2021-10-01 | 浙江大华技术股份有限公司 | Automatic gain control method of voice signal and related device thereof |
CN113823309A (en) * | 2021-11-22 | 2021-12-21 | 成都启英泰伦科技有限公司 | Noise reduction model construction and noise reduction processing method |
CN113921030A (en) * | 2021-12-07 | 2022-01-11 | 江苏清微智能科技有限公司 | Speech enhancement neural network training method and device based on weighted speech loss |
CN113921030B (en) * | 2021-12-07 | 2022-06-07 | 江苏清微智能科技有限公司 | Speech enhancement neural network training method and device based on weighted speech loss |
CN114566152A (en) * | 2022-04-27 | 2022-05-31 | 成都启英泰伦科技有限公司 | Voice endpoint detection method based on deep learning |
CN114566152B (en) * | 2022-04-27 | 2022-07-08 | 成都启英泰伦科技有限公司 | Voice endpoint detection method based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112242147B (en) | Voice gain control method and computer storage medium | |
EP2737479B1 (en) | Adaptive voice intelligibility enhancement | |
US10614788B2 (en) | Two channel headset-based own voice enhancement | |
EP2860730B1 (en) | Speech processing | |
JP6169849B2 (en) | Sound processor | |
US20120123769A1 (en) | Gain control apparatus and gain control method, and voice output apparatus | |
US9536536B2 (en) | Adaptive equalization system | |
CN103238183A (en) | Noise suppression device | |
CN110177317B (en) | Echo cancellation method, echo cancellation device, computer-readable storage medium and computer equipment | |
JPH0566795A (en) | Noise suppressing device and its adjustment device | |
CN104067339A (en) | Noise suppression device | |
JP5752324B2 (en) | Single channel suppression of impulsive interference in noisy speech signals. | |
US11128954B2 (en) | Method and electronic device for managing loudness of audio signal | |
US20240062770A1 (en) | Enhanced de-esser for in-car communications systems | |
CN117321681A (en) | Speech optimization in noisy environments | |
GB2536727B (en) | A speech processing device | |
EP2660814B1 (en) | Adaptive equalization system | |
US9614486B1 (en) | Adaptive gain control | |
CN111508512A (en) | Fricative detection in speech signals | |
KR101993003B1 (en) | Apparatus and method for noise reduction | |
CN113409812B (en) | Processing method and device of voice noise reduction training data and training method | |
CN116887160B (en) | Digital hearing aid howling suppression method and system based on neural network | |
Verteletskaya et al. | Enhanced spectral subtraction method for noise reduction with minimal speech distortion | |
Xia et al. | A modified spectral subtraction method for speech enhancement based on masking property of human auditory system | |
Qu et al. | A modified a priori SNR estimation for spectral subtraction speech enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||