CN112242147B - Voice gain control method and computer storage medium - Google Patents
- Publication number
- CN112242147B (application number CN202011098089.1A / CN202011098089A)
- Authority
- CN
- China
- Prior art keywords
- voice
- neural network
- network model
- signal
- domain signal
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
Abstract
The invention relates to the technical field of voice communication and discloses a voice gain control method and a computer storage medium. The method comprises the following steps: framing the voice signal and applying a Fourier transform to obtain the original amplitude spectrum and original phase spectrum of the frequency-domain signal; preprocessing the original amplitude spectrum with a neural network model, suppressing the transient-noise amplitude-spectrum components to obtain a speech-enhanced amplitude spectrum; performing AGC processing and correction processing on the time-domain signal, where the AGC processing frames the time-domain signal, extracts its envelope, and derives a gain coefficient from the envelope; and finally applying the gain coefficient to the time-domain signal to complete gain control of the voice signal. After preprocessing by the deep-learning neural network model, the amplitude of transient noise is greatly reduced, falling far below the gain-amplification threshold in the AGC, and the gain-correction processing reduces the risk that transient noise is mistakenly amplified by the AGC algorithm, thereby improving voice-call quality.
Description
Technical Field
The present invention relates to the field of voice communication technologies, and in particular to a voice gain control method for voice communication and a computer storage medium.
Background
In conference calls, the occurrence of transient noise distracts listeners, and handling transient noise properly is a technical problem that conference-call systems need to solve. The well-known open-source project WebRTC applies an automatic gain control (AGC) algorithm to conference calls. For example, patent application CN201010181204.1, entitled "speech noise reduction device," discloses noise control using AGC, which further suppresses noise intensity by reducing the gain when target speech is absent; the probability that target speech is present is supplied by a speech detection unit. Likewise, the invention with application number CN201910860097.6, entitled "microphone automatic gain control method, device and storage medium," discloses controlling microphone volume through AGC to achieve noise suppression.
The existing AGC algorithm mainly takes the time-domain amplitude envelope of the input signal, compares the envelope with a given target amplitude, and computes the corresponding gain, so that signals above the target amplitude are attenuated and signals below it are boosted. The conventional algorithm has a drawback: when transient noise occurs during speech, the amplitude judgment is easily disturbed, so a voice signal that has not reached the target amplitude is judged to exceed it because of the transient noise and is therefore attenuated. Moreover, the smoothing in the algorithm suppresses a short segment of speech after the transient noise occurs, so the voice volume sounds uneven and call quality degrades.
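The envelope-versus-target rule described above can be sketched as a small function. The target level, clamp limits, and function names below are illustrative assumptions, not values taken from the patent.

```python
def agc_gain(envelope_db, target_db, max_gain_db=12.0, max_atten_db=-12.0):
    """Classic AGC rule: compare the frame envelope with the target level,
    clamp the corrective step, and convert the dB difference to a linear gain."""
    diff_db = target_db - envelope_db              # positive: boost, negative: attenuate
    diff_db = max(max_atten_db, min(max_gain_db, diff_db))
    return 10.0 ** (diff_db / 20.0)

# Frames above the target are attenuated (gain < 1); quiet frames are boosted.
loud_gain = agc_gain(envelope_db=-6.0, target_db=-18.0)
quiet_gain = agc_gain(envelope_db=-30.0, target_db=-18.0)
```

This also illustrates the failure mode the patent targets: a transient-noise burst inflates `envelope_db`, pushing the gain below 1 for a frame that is actually quiet speech.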
Disclosure of Invention
Therefore, it is necessary to provide a voice gain control method to solve the above technical problem of poor transient-noise handling in existing voice processing.
To achieve the above object, the present invention provides a voice gain control method comprising the following steps:
framing the voice signal, applying a Fourier transform, and converting the result to polar form to obtain the original amplitude spectrum and original phase spectrum of the frequency-domain signal;
preprocessing the original amplitude spectrum through a neural network model, wherein the preprocessing comprises the step of suppressing transient noise amplitude spectrum components in the original amplitude spectrum to obtain a voice enhancement amplitude spectrum; restoring the preprocessed voice enhancement amplitude spectrum into a time domain signal by combining the original phase spectrum;
calculating the energy ratio of the voice signal before preprocessing and the time domain signal after preprocessing;
performing AGC processing and correction processing on the time-domain signal: framing the time-domain signal, extracting its envelope, and performing AGC processing on the envelope to obtain a gain coefficient; if the energy ratio is greater than a first preset threshold, the gain coefficient is not corrected; if the energy ratio is less than the first preset threshold and greater than a second preset threshold, the gain coefficient is corrected to at most half of its value; and if the energy ratio is less than the second preset threshold, the gain coefficient is corrected to the product of the gain coefficient and the energy ratio;
and finally, applying the gain coefficient to the time domain signal to complete gain control of the voice signal.
Further, the neural network model structure includes an input layer, an output layer, a first fully connected layer, a second fully connected layer, a first LSTM layer, and a second LSTM layer, and is trained on open-source and in-house data sets.
Further, the input layer has 128 neurons, the 128 neurons corresponding to 128 amplitude spectrum values; the output layer has 128 neurons, the 128 neurons corresponding to 128 speech enhancement magnitude spectrum values, the first fully connected layer has 64 neurons, the second fully connected layer has 128 neurons, the first LSTM layer has 64 neurons, and the second LSTM layer has 128 neurons.
Further, the energy ratio used in the correction processing is obtained by taking the logarithm of the ratio of the voice signal's energy before and after preprocessing and comparing it with the preset thresholds. After preprocessing by the neural network model, the energy ratio falls below the second preset threshold in the correction processing; when the energy ratio is below the second preset threshold, it must first be converted from the logarithmic to the linear scale before being multiplied by the gain coefficient.
Further, the step of preprocessing the frequency domain signal through a neural network model to suppress a noise signal in the frequency domain signal includes the steps of:
dividing the frequency domain signal into a voice section containing human voice and a voice-free section not containing human voice, wherein the suppression intensity of the noise signal in the voice section is greater than or equal to 12dB, and the suppression intensity of the noise signal in the voice-free section is greater than or equal to 24dB.
Further, after obtaining the speech-enhanced amplitude spectrum, the method further comprises: smoothing the speech-enhanced amplitude spectrum across frames with a smoothing coefficient below 0.1.
Further, the smoothing coefficient is 0.
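The inter-frame smoothing just described can be sketched as a first-order recursion. The patent does not give the exact recursion, so this form and the placement of the coefficient are assumptions.

```python
def smooth_spectrum(prev_frame, cur_frame, alpha=0.05):
    """Blend the previous frame's enhanced magnitude spectrum into the current
    one. alpha < 0.1 keeps the smoothing light, and alpha = 0 (the preferred
    value above) disables smoothing entirely, passing the current frame through."""
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(prev_frame, cur_frame)]

no_smooth = smooth_spectrum([1.0, 1.0], [3.0, 5.0], alpha=0.0)  # current frame unchanged
```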
Further, the neural network model is an LSTM neural network model through deep training, different data sets are adopted in different training stages, and the LSTM neural network model deep training comprises the following steps:
training with speech free of transient noise, using its features simultaneously as the input and the output of an LSTM neural network model, and obtaining parameter set A after the model converges;
and then using noisy speech features as the input of the LSTM neural network model and transient-noise-free speech features as its output, continuing to train the model from parameter set A, obtaining parameter set B after the model converges, and adopting parameter set B as the final parameters of the LSTM neural network model.
In order to solve the technical problems, the invention also provides a technical scheme:
a computer storage medium having a program stored therein, which when executed by a processor performs the speech gain control method described in any one of the above claims.
Compared with the prior art, this scheme provides a voice gain control method based on a neural network model. The voice signal is framed, Fourier-transformed, and converted to polar form, yielding the original amplitude spectrum and original phase spectrum of the corresponding frequency-domain signal. Before AGC processing, transient noise is detected by the neural network model; when it is detected, its components in the original amplitude spectrum are suppressed, so the transient-noise amplitude is reduced without affecting normal AGC processing of the voice signal. Furthermore, after processing by the deep-learning neural network model, the transient-noise amplitude is far below the gain-amplification threshold in the AGC (i.e., the second preset threshold in the AGC processing), and the gain-correction processing reduces the risk that transient noise is mistakenly amplified by the AGC algorithm, thereby improving voice-call quality.
Drawings
FIG. 1 is a flow chart of a method of controlling speech gain according to an embodiment;
FIG. 2 is a schematic diagram of a neural network model according to an embodiment;
fig. 3 is a schematic diagram of a computer storage medium according to an embodiment.
Detailed Description
To describe the technical content, structural features, objects, and effects of the technical solution in detail, the following description is given with reference to specific embodiments and the accompanying drawings.
The existing AGC algorithm mainly computes the gain of the current frame by combining the amplitude characteristics of the voice signal with the voice activity detection (VAD) probability of the current frame. In segments without human voice, transient noise, if present, is attenuated to avoid its mistaken amplification. In segments containing human voice, however, transient noise is often difficult for the VAD to detect, and even when detected it is difficult for the AGC to suppress. The reason is that transient noise usually lasts 5-50 ms, a very short duration, while the short-time Fourier transform segment is also 10-20 ms, so the transient-noise spectrum overlaps the speech spectrum heavily and is difficult to strip from it.
In addition, the AGC usually adapts its response to the amplitude variation of the input signal: when transient noise occurs during speech, the gain coefficient is reduced to bring the volume back to the target level, which suppresses the speech segment where the transient noise occurred. With the smoothing strategy added, this suppression persists for a short period, so the voice audibly becomes quieter after the transient noise, degrading the actual call experience.
To address the interference of transient noise with AGC, this embodiment provides a voice gain control method that preprocesses the signal before the AGC algorithm. Referring to figs. 1 and 2, the method can be used in voice communication such as conference calls for noise suppression and automatic gain control of voice signals, improving voice-call quality. As shown in fig. 1, the voice gain control method includes the following steps:
S101, framing and Fourier-transforming the voice signal, then converting to polar form to obtain the original amplitude spectrum (S102) and original phase spectrum (S103) of the frequency-domain signal;
S104, preprocessing the original amplitude spectrum through a neural network model and suppressing the transient-noise amplitude-spectrum components to obtain a speech-enhanced amplitude spectrum;
S105, restoring the preprocessed speech-enhanced amplitude spectrum to a time-domain signal using the original phase spectrum, and calculating the ratio of the voice signal's energy before and after preprocessing;
S106, performing AGC processing and correction processing on the time-domain signal: framing the time-domain signal, extracting its envelope, and performing AGC processing on the envelope to obtain a gain coefficient; if the energy ratio is greater than a first preset threshold, the gain coefficient is not corrected; if the energy ratio is less than the first preset threshold and greater than a second preset threshold, the gain coefficient is corrected to at most half of its value; and if the energy ratio is less than the second preset threshold, the gain coefficient is corrected to the product of the gain coefficient and the energy ratio;
and finally, applying a gain coefficient to the time domain signal to complete gain control of the voice signal.
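The three-way correction rule of step S106 maps directly to a small piecewise function. The threshold values below are placeholders, since the patent does not specify concrete values; only the structure (thr1 > thr2, and the three branches) comes from the description.

```python
def correct_gain(gain, energy_ratio, thr1, thr2):
    """Gain correction from step S106 (thr1 > thr2 assumed).
    energy_ratio compares the signal energy after preprocessing with the energy
    before it: a small ratio means the frame was mostly transient noise."""
    if energy_ratio > thr1:        # little energy removed: trust the AGC gain
        return gain
    if energy_ratio > thr2:        # moderate removal: cap the gain at half
        return gain * 0.5
    return gain * energy_ratio     # heavy removal: scale the gain right down

uncorrected = correct_gain(2.0, 0.9, thr1=0.8, thr2=0.3)
halved = correct_gain(2.0, 0.5, thr1=0.8, thr2=0.3)
scaled = correct_gain(2.0, 0.1, thr1=0.8, thr2=0.3)
```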
In this embodiment, before the AGC algorithm runs, the deep-learning neural network model determines whether the current speech frame contains transient noise; if so, the noise is suppressed, otherwise no processing is applied, avoiding interference of transient noise with the call and improving the call experience.
In step S101, the input speech signal is first framed and Fourier-transformed to obtain the corresponding frequency-domain signal, which is converted to polar form to obtain the original amplitude spectrum and original phase spectrum. Only the original amplitude spectrum is modified; the original phase spectrum remains unchanged. Step S104 follows step S101.
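Step S101 can be sketched with a naive DFT on a single frame (standard library only; a real implementation would use an FFT). Converting each complex bin to polar form separates the magnitude, which the network will modify, from the phase, which is kept unchanged.

```python
import cmath
import math

def frame_to_polar(frame):
    """Naive DFT of one frame, then polar conversion: returns the magnitude
    spectrum and the phase spectrum as two lists."""
    n = len(frame)
    spectrum = [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)) for k in range(n)]
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]

# One full sine cycle per frame concentrates the energy in bins 1 and n-1.
n = 16
mags, phases = frame_to_polar([math.sin(2 * math.pi * t / n) for t in range(n)])
```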
In step S104, the original amplitude spectrum is preprocessed by the neural network model, that is, AGC preprocessing is performed, and an instantaneous noise amplitude spectrum component in the original amplitude spectrum is suppressed, so as to obtain a voice enhancement amplitude spectrum.
Preferably, as shown in fig. 2, the neural network model is a deep-learning LSTM neural network model (but is not limited to an LSTM), also called a long short-term memory network. Compared with a conventional recurrent neural network, the LSTM has a more elaborate internal structure, adding three gates (an input gate i_t, a forget gate f_t, and an output gate o_t) and an internal memory cell c_t. The input gate controls how much of the newly computed state is written into the memory cell; the forget gate controls how much of the previous memory cell's content is forgotten; and the output gate controls how much of the current output depends on the current memory cell.
In this embodiment, the LSTM neural network model includes: the input layer, the output layer, the first full-connection layer, the second full-connection layer, the first LSTM layer and the second LSTM layer. The input layer has 128 neurons, the 128 neurons corresponding to 128 amplitude spectrum values; the output layer has 128 neurons, the 128 neurons of the output layer correspond to 128 speech enhancement magnitude spectrum values, the first fully connected layer has 64 neurons, the second fully connected layer has 128 neurons, the first LSTM layer has 64 neurons, and the second LSTM layer has 128 neurons.
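A quick way to sanity-check such a topology is to count its parameters. The layer ordering below (input 128, fully connected 64, LSTM 64, LSTM 128, fully connected 128 as output) is an assumption, since the patent lists the layers and their widths but not their exact order.

```python
def fc_params(n_in, n_out):
    # weight matrix plus bias vector
    return n_in * n_out + n_out

def lstm_params(n_in, n_hidden):
    # four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((n_in + n_hidden) * n_hidden + n_hidden)

total = (fc_params(128, 64)       # input 128 -> first fully connected layer (64)
         + lstm_params(64, 64)    # first LSTM layer (64 units)
         + lstm_params(64, 128)   # second LSTM layer (128 units)
         + fc_params(128, 128))   # second fully connected layer -> 128 outputs
```

Under this assumed ordering the model stays small (on the order of 150k weights), which fits the patent's real-time call-processing setting.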
The deep-learning LSTM neural network model has two phases: training and inference. Before use, the model must be trained in advance on a data set; after deep-learning training, it can preprocess the frequency-domain signal and suppress the noise components in it. Training of the LSTM neural network model includes:
1) Collecting clean speech and extracting its features (the signal spectrum), using the speech features as both the input and the output of the neural network model, and obtaining parameter set A after convergence;
2) On the basis of parameter set A, switching data sets: collecting noisy speech signals and extracting their features (the signal spectrum), using the noisy speech features as the input of the model and the clean speech features as its output, continuing to train the model from parameter set A, and obtaining parameter set B after the model converges.
Parameter set B is the final parameter set; this training regime accelerates convergence and shortens training time.
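The two-stage schedule can be expressed framework-agnostically. Here `train_step` is a hypothetical stand-in for one optimisation pass of any LSTM training loop; only the control flow (stage 1 on clean pairs, stage 2 warm-started from A on noisy pairs) comes from the description.

```python
def two_stage_training(train_step, clean_pairs, noisy_pairs):
    """Stage 1: fit (clean, clean) pairs, yielding parameters A.
    Stage 2: continue from A on (noisy, clean) pairs, yielding parameters B."""
    params = None
    for x, y in clean_pairs:
        params = train_step(params, x, y)
    params_a = params                      # snapshot after stage 1
    for x, y in noisy_pairs:
        params = train_step(params, x, y)
    return params_a, params                # (A, B)

# Dummy train_step that just counts updates, to show the control flow.
count_step = lambda p, x, y: (p or 0) + 1
a, b = two_stage_training(count_step, [(0, 0)] * 3, [(1, 0)] * 2)
```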
Training the LSTM neural network model involves forward propagation and backpropagation, and the weight parameters are extracted when the model's loss function converges; at inference time the weights are used directly in a forward pass to obtain the desired features, with no backpropagation needed.
In the above embodiment, the voice signal may be a full-band voice signal. Existing deep-learning audio algorithms typically take Mel-frequency cepstral coefficients (MFCC) as one of the feature values, dividing the full voice band into several sub-bands and representing each sub-band by its log energy. This coarse-resolution analysis works well for semantic analysis and speech recognition, but it cannot suppress transient noise finely, because the transient-noise spectrum overlaps the speech spectrum heavily. The symptom is that considerable transient noise remains in the speech-bearing signal, and the noise is clearly perceived as it comes and goes. This embodiment therefore adopts the complete spectrum as the voice feature, which effectively strengthens transient-noise suppression and weakens the human ear's perception of the transient noise.
Also in the above embodiments, the full-band amplitude spectrum may be used as the signal feature. If the full-band complex spectrum were used with, say, 256 Fourier-transform points, the effective spectrum would be 128 complex bins, totalling 256 features across real and imaginary parts. Since the human ear perceives noise phase only weakly, this embodiment departs from complex-spectrum features and adopts the amplitude spectrum as the signal feature, i.e., 128 amplitude values, halving the number of input features, while the original phase information is retained unchanged.
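The feature-count arithmetic is easy to verify: with 256 FFT points a real-valued signal has 128 useful complex bins, and keeping only each bin's magnitude halves the feature vector from 256 real values to 128.

```python
import math

n_fft = 256
n_bins = n_fft // 2                  # effective spectrum of a real-valued signal
complex_features = 2 * n_bins        # real + imaginary parts per bin: 256 values
bins = [complex(1.0, 1.0)] * n_bins  # toy complex spectrum
magnitudes = [abs(c) for c in bins]  # one magnitude per bin: 128 values
```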
Specifically, during training the spectrum of the noisy speech signal serves as the input feature and the spectrum of the clean speech signal as the target feature. The feature is the signal spectrum: the Fourier-transform result is converted to a polar-coordinate representation, the phase information is discarded because the human ear is insensitive to it, and the amplitude values are used as the input features.
Step S105 is performed after step S104, and AGC processing and correction processing are applied to the time-domain signal. The preprocessed speech-enhanced amplitude spectrum is combined with the original phase spectrum and converted back to a time-domain signal; the time-domain signal is framed, its envelope extracted, and AGC processing applied to the envelope to obtain a gain coefficient. If the energy ratio is greater than the first preset threshold, the gain coefficient is not corrected; if it is less than the first preset threshold and greater than the second preset threshold, the gain coefficient is corrected to at most half of its value; and if it is less than the second preset threshold, the gain coefficient is corrected to the product of the gain coefficient and the energy ratio. Through AGC processing (the automatic gain control algorithm), the volume of the voice signal becomes more uniform and consistent, improving call quality. In this embodiment, because preprocessing precedes the AGC processing, the effective voice signal is prevented from being suppressed when transient noise occurs, and the energy of the transient noise is reduced, so the correction processing prevents the transient noise from being mistakenly amplified during AGC.
Preferably, in the above embodiment, to effectively suppress transient-noise interference, after the preprocessing by the neural network model in step S104 the energy ratio falls below the second preset threshold used in the correction processing.
In one embodiment, the step of preprocessing the frequency domain signal through a neural network model to suppress a noise signal in the frequency domain signal includes the steps of:
dividing the frequency domain signal into a voice section containing human voice and a voice-free section not containing human voice, wherein the suppression intensity of the noise signal in the voice section is greater than or equal to 12dB, and the suppression intensity of the noise signal in the voice-free section is greater than or equal to 24dB.
In this embodiment, transient-noise suppression is applied before AGC processing: in voice-free segments the noise is suppressed by more than 24 dB so that it is not mistakenly amplified by the AGC, and in voiced segments it is suppressed by 12 dB or more so that the noise is not treated as a voice signal, which would distort the AGC's amplitude judgment, suppress the speech segment carrying the transient noise, and produce an audible interruption.
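To see what those suppression depths mean as linear factors (interpreting the dB figures as amplitude attenuation, which is an assumption here):

```python
def db_to_linear(att_db):
    """Attenuation in dB converted to a linear amplitude factor."""
    return 10.0 ** (-att_db / 20.0)

voiced_factor = db_to_linear(12.0)     # >= 12 dB suppression inside voiced segments
voiceless_factor = db_to_linear(24.0)  # >= 24 dB suppression in voice-free segments
```

So 12 dB leaves roughly a quarter of the noise amplitude, and 24 dB, being twice the depth in dB, squares that factor, leaving about 6 percent.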
Further, in the above embodiment, when AGC processing is applied to the preprocessed signal, the gain suppression coefficient and the gain amplification coefficient are not smoothed. For steady-state noise suppression, gain smoothing makes the processed voice stable and natural and avoids artifacts caused by large inter-frame gain differences. However, transient noise occurs far less frequently than steady-state noise, and smoothing would weaken its suppression; in particular, under repeatedly occurring transient noise (such as knocking on a table), smoothing delays the suppression, so the first few knocks are suppressed only weakly.
As shown in fig. 3, in another embodiment, a computer storage medium 300 is provided, in which a program is stored, which when executed by a processor, performs the speech gain control method described in any of the above embodiments.
It should be noted that although the foregoing embodiments have been described herein, the scope of the present invention is not limited thereby. Any alterations and modifications of the embodiments described herein based on the innovative concepts of the present invention, or equivalent structures or equivalent process transformations made using the contents of the description and drawings, applied directly or indirectly in other related technical fields, fall within the scope of the invention.
Claims (9)
1. A method for controlling speech gain, comprising the steps of:
carrying out framing and Fourier transformation on the voice signal, and then converting the voice signal into a polar coordinate form to obtain an original amplitude spectrum and an original phase spectrum of the frequency domain signal;
preprocessing the original amplitude spectrum through a neural network model, wherein the preprocessing comprises the step of suppressing transient noise amplitude spectrum components in the original amplitude spectrum to obtain a voice enhancement amplitude spectrum; restoring the preprocessed voice enhancement amplitude spectrum into a time domain signal by combining the original phase spectrum;
calculating the energy ratio of the voice signal before preprocessing and the time domain signal after preprocessing;
performing AGC processing and correction processing on the time domain signal, including: framing the time domain signal, solving an envelope, and performing AGC (automatic gain control) processing on the envelope to obtain a gain coefficient; if the energy ratio is larger than a first preset threshold, the gain coefficient is not corrected, if the energy ratio is smaller than the first preset threshold and larger than a second preset threshold, the gain coefficient is corrected to be not more than half of the gain coefficient, and if the energy ratio is smaller than the second preset threshold, the gain coefficient is corrected to be the product of the gain coefficient and the energy ratio;
and finally, applying the gain coefficient to the time domain signal to complete gain control of the voice signal.
2. The voice gain control method of claim 1, wherein the neural network model structure comprises an input layer, an output layer, a first fully connected layer, a second fully connected layer, a first LSTM layer, and a second LSTM layer, and is trained on open-source and in-house data sets.
3. The speech gain control method of claim 2, wherein the input layer has 128 neurons, the 128 neurons corresponding to 128 magnitude spectrum values; the output layer has 128 neurons, the 128 neurons corresponding to 128 speech enhancement magnitude spectrum values; the first fully connected layer has 64 neurons, the second fully connected layer has 128 neurons, the first LSTM layer has 64 neurons, and the second LSTM layer has 128 neurons.
4. The method according to claim 1, wherein the energy ratio used in the correction process is obtained by taking the logarithm of the ratio of the signal energies before and after preprocessing, and this logarithmic value is compared with the second preset threshold; if, after preprocessing by the neural network model, the energy ratio is smaller than the second preset threshold, the energy ratio is first converted back to linear scale and then multiplied by the gain coefficient.
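The log-domain comparison and linear-scale conversion in this claim can be sketched as follows; the threshold value in dB is a hypothetical example, not taken from the patent.

```python
import math

def correct_gain(gain, e_pre, e_post, thr2_db=-20.0):
    """Sketch of claim 4: compare the log energy ratio against the
    second preset threshold (here a hypothetical -20 dB); if it is
    below the threshold, convert the ratio back to linear scale and
    multiply it with the gain coefficient."""
    ratio_db = 10.0 * math.log10(max(e_post, 1e-12) / max(e_pre, 1e-12))
    if ratio_db < thr2_db:
        ratio_lin = 10.0 ** (ratio_db / 10.0)  # back to linear scale
        gain = gain * ratio_lin
    return gain
```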
5. The method according to claim 1, wherein the step of preprocessing the frequency domain signal by a neural network model to suppress a noise signal in the frequency domain signal comprises:
dividing the frequency domain signal into voice segments containing human voice and non-voice segments not containing human voice, wherein the suppression intensity applied to the noise signal in the voice segments is greater than or equal to 12 dB, and the suppression intensity applied to the noise signal in the non-voice segments is greater than or equal to 24 dB.
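The segment-dependent suppression depths could be applied as in the sketch below. The voice-activity decision and the noise magnitude estimate are assumed to come from upstream stages; only the per-segment attenuation itself follows the claim.

```python
import numpy as np

def apply_segment_suppression(noise_mag, is_speech):
    """Attenuate the estimated noise magnitude per frame:
    >= 12 dB in voice segments, >= 24 dB in non-voice segments
    (claim 5). `is_speech` is a per-frame boolean VAD decision."""
    supp_db = np.where(is_speech, 12.0, 24.0)
    return noise_mag * 10.0 ** (-supp_db / 20.0)
```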
6. The voice gain control method of claim 1, further comprising the step of, after obtaining the voice enhancement amplitude spectrum: smoothing the voice enhancement amplitude spectrum between adjacent frames with a smoothing coefficient below 0.1.
7. The speech gain control method of claim 6, wherein the smoothing coefficient is 0.
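The inter-frame smoothing of claims 6 and 7 can be read as a first-order recursive filter, sketched below; with a coefficient of 0, as in claim 7, the operation reduces to the identity.

```python
import numpy as np

def smooth_frames(mag_frames, alpha=0.05):
    """First-order recursive smoothing across frames.
    alpha is the smoothing coefficient (below 0.1 per claim 6);
    alpha = 0 (claim 7) returns the input unchanged."""
    out = np.empty_like(mag_frames)
    out[0] = mag_frames[0]
    for t in range(1, len(mag_frames)):
        out[t] = alpha * out[t - 1] + (1.0 - alpha) * mag_frames[t]
    return out
```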
8. The method of claim 1, wherein the neural network model is a deep-trained LSTM neural network model, and deep training the LSTM neural network model with different data sets in different training stages comprises:
first, training with the features of speech free of transient noise as both the input and the output of the LSTM neural network model, and obtaining parameter set A after the LSTM neural network model converges;
then, using the features of noisy speech as the input of the LSTM neural network model and the features of the speech free of transient noise as the output, continuing to train the neural network model from parameter set A, obtaining parameter set B after the LSTM neural network model converges, and determining parameter set B as the parameters of the LSTM neural network model.
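The two training stages above could be sketched as the following PyTorch loop. The loss function, optimizer, epoch counts, and convergence criterion are all assumptions not stated in the claim; `clean_feats` and `noisy_feats` are illustrative names for the transient-noise-free and noisy feature tensors.

```python
import torch
import torch.nn as nn

def two_stage_training(model, clean_feats, noisy_feats, epochs=5, lr=1e-3):
    """Sketch of the claimed two-stage deep training."""
    loss_fn = nn.MSELoss()
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    # Stage 1: clean features as both input and target -> parameter set A.
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(clean_feats), clean_feats)
        loss.backward()
        opt.step()

    # Stage 2: noisy input, clean target, continuing from A -> parameter set B.
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(noisy_feats), clean_feats)
        loss.backward()
        opt.step()
    return model
```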
9. A computer storage medium, characterized in that the computer storage medium has stored therein a program which, when executed by a processor, performs the speech gain control method according to any of the preceding claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011098089.1A CN112242147B (en) | 2020-10-14 | 2020-10-14 | Voice gain control method and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112242147A CN112242147A (en) | 2021-01-19 |
CN112242147B true CN112242147B (en) | 2023-12-19 |
Family
ID=74169185
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011098089.1A Active CN112242147B (en) | 2020-10-14 | 2020-10-14 | Voice gain control method and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112242147B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113823312B (en) * | 2021-02-19 | 2023-11-07 | 北京沃东天骏信息技术有限公司 | Speech enhancement model generation method and device, and speech enhancement method and device |
CN113436640B (en) * | 2021-06-28 | 2022-11-25 | 歌尔科技有限公司 | Audio noise reduction method, device and system and computer readable storage medium |
CN113470691A (en) * | 2021-07-08 | 2021-10-01 | 浙江大华技术股份有限公司 | Automatic gain control method of voice signal and related device thereof |
CN113823309B (en) * | 2021-11-22 | 2022-02-08 | 成都启英泰伦科技有限公司 | Noise reduction model construction and noise reduction processing method |
CN113921030B (en) * | 2021-12-07 | 2022-06-07 | 江苏清微智能科技有限公司 | Speech enhancement neural network training method and device based on weighted speech loss |
CN114566152B (en) * | 2022-04-27 | 2022-07-08 | 成都启英泰伦科技有限公司 | Voice endpoint detection method based on deep learning |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105515597A (en) * | 2015-12-02 | 2016-04-20 | 中国电子科技集团公司第四十一研究所 | Automatic gain control circuit for receivers |
CN108877775A (en) * | 2018-06-04 | 2018-11-23 | 平安科技(深圳)有限公司 | Voice data processing method, device, computer equipment and storage medium |
CN110036440A (en) * | 2016-10-18 | 2019-07-19 | 弗劳恩霍夫应用研究促进协会 | Device and method for handling audio signal |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9173025B2 (en) * | 2012-02-08 | 2015-10-27 | Dolby Laboratories Licensing Corporation | Combined suppression of noise, echo, and out-of-location signals |
EP3053356B8 (en) * | 2013-10-30 | 2020-06-17 | Cerence Operating Company | Methods and apparatus for selective microphone signal combining |
US11017798B2 (en) * | 2017-12-29 | 2021-05-25 | Harman Becker Automotive Systems Gmbh | Dynamic noise suppression and operations for noisy speech signals |
Non-Patent Citations (1)
Title |
---|
Sparsity-based phase spectrum compensation speech enhancement algorithm; Zhang Tianqi et al.; Journal of Signal Processing (《信号处理》); Vol. 36, No. 11; pp. 1867-1876 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||