CN110933235A

CN110933235A - Noise removing method in intelligent calling system based on machine learning

Info

Publication number: CN110933235A
Application number: CN201911077584.1A
Authority: CN
Inventors: 伍林; 尹朝阳
Original assignee: Hangzhou Zhexin Information Technology Co Ltd
Current assignee: Hangzhou Zhexin Information Technology Co Ltd
Priority date: 2019-11-06
Filing date: 2019-11-06
Publication date: 2020-03-27
Anticipated expiration: 2039-11-06
Also published as: CN110933235B

Abstract

The invention discloses a noise removing method in an intelligent calling system based on machine learning, which comprises the following steps: slicing the telephone signal and performing normalization and framing preprocessing; extracting MFCC features from the framed slice signals and averaging them; inputting the averaged MFCC features into a machine learning classifier for model training, and taking the trained classification model as a noise classification model; slicing a newly added telephone signal; performing normalization and framing preprocessing on the slice signals; performing a preliminary screening of the framed slice signals based on spectral flatness; extracting MFCC features and averaging them; and inputting the averaged MFCC features of each segment into the noise classification model for identification. The invention has the beneficial effects that: by identifying each signal as human voice or noise with a machine-learned classification model, a large number of noise signals in the telephone signal can be removed, thereby reducing the error rate when the signal is sent to ASR for conversion into text.

Description

Noise removing method in intelligent calling system based on machine learning

Technical Field

The invention relates to the technical field of audio processing, in particular to a noise removal method in an intelligent calling system based on machine learning.

Background

In existing intelligent calling systems, the telephone signal is segmented by voice activity detection (VAD) and sent to automatic speech recognition (ASR) for conversion into text. Because the acoustic background is complex, many of the segmented pieces are noise. The usual approach is to apply noise suppression before segmentation, estimating the noise mainly from the frequency distribution of the signal; commonly used algorithms include adaptive filtering, spectral subtraction, and Wiener filtering. An adaptive filter automatically adjusts its current parameters using the parameters obtained at the previous moment, so as to adapt to the randomly varying statistical characteristics of the signal and the noise, thereby filtering the noise. Spectral subtraction removes the noise spectrum in the frequency domain and then restores the frequency-domain signal to a time-domain signal through the inverse Fourier transform. Wiener filtering removes noise by designing a digital filter. These noise suppression methods filter only part of the noise and cannot completely remove the segmented noise pieces; moreover, as the signal-to-noise ratio of the telephone signal decreases, the noise reduction effect worsens, and audio distortion caused by excessive attenuation occurs in some time intervals.

Disclosure of Invention

In order to solve the above problems, an object of the present invention is to provide a noise removing method in an intelligent calling system based on machine learning, which uses a machine-learning classification model to identify whether a signal is human voice or noise, so that a large number of noise signals in the telephone signal can be removed, thereby reducing the error rate when the signal is sent to ASR for conversion into text.

The invention provides a noise removing method in an intelligent calling system based on machine learning, which comprises the following steps:

step 1, taking the sampled telephone signals as training data, and establishing a noise classification model based on machine learning:

step 101, slicing the telephone signal, and performing normalization and framing preprocessing on the sliced signal;

step 102, extracting MFCC features from the framed slice signals, and averaging the extracted MFCC features;

step 103, inputting the averaged MFCC features into a machine learning classifier for model training, and taking the trained classification model as a noise classification model;

step 2, inputting a newly added telephone signal into the established noise classification model to obtain a noise identification result:

step 201, slicing the newly added telephone signal;

step 202, performing normalization and framing preprocessing on the slice signals;

step 203, performing a preliminary screening of the framed slice signals based on spectral flatness;

step 204, after the preliminary spectral flatness screening, dividing the framed signal into an odd number of segments, extracting the MFCC features of each segment, and averaging them;

step 205, inputting the averaged MFCC features of each segment into the noise classification model for identification, and identifying the noise in the slice signal.

As a further improvement of the invention, during preprocessing, normalization is performed according to formula (1). The slice signals are uniformly quantized with 16 bits, with a value range between -65535 and 65535, and each signal is normalized to between -1 and 1 by dividing it by the maximum of its absolute value:

$$\hat{x} = \frac{x}{\max(|x|)} \qquad (1)$$

where x is the slice signal to be processed, |x| is the absolute value of the slice signal, and $\hat{x}$ is the normalized slice signal.

As a further improvement of the invention, when the slice signal is processed by framing, the frame length is 30ms, and the frame shift is 10 ms.

As a further improvement of the present invention, step 203 specifically includes: extracting the spectral flatness feature of each frame of the slice signal and averaging the extracted values to obtain the average flatness; setting a flatness threshold for the average flatness; if the average flatness of the slice signal is higher than the flatness threshold, judging the slice signal to be noise and discarding it directly; and if the average flatness of the slice signal is lower than the flatness threshold, passing the slice signal on to the next processing step.

As a further improvement of the present invention, the flatness threshold is 0.13.

As a further improvement of the present invention, when averaging the extracted MFCC features, each dimension is averaged over all frames according to formula (2):

$$y_m = \frac{1}{N}\sum_{n=1}^{N} c_{n,m}, \quad m = 1, \dots, M \qquad (2)$$

where y_m is the average value of the MFCC features in dimension m, c_{n,m} is the MFCC feature of frame n in dimension m, M is the dimension of the MFCC features, and N is the number of frames of the slice signal after framing.

As a further improvement of the present invention, in step 205, the mode of the recognition results of the segments of each slice signal is taken: if noise accounts for the larger proportion of the results, the input slice signal is determined to be noise; otherwise, the input slice signal is determined to be human voice.

As a further improvement of the present invention, in step 204, the duration of each segment is 0.5 s, and the segment shift is 0.25 s.

As a further improvement of the present invention, the slice signals are divided into human voice signals and noise signals, the threshold for the human voice signal is set to 0.2, and in step 205, when the human voice probability output by the classification model for the slice signal to be identified is greater than the threshold, the slice signal is determined to be a human voice signal.

As a further improvement of the invention, the machine learning classifier is one of a random forest classifier, an SVM classifier and an XGBoost classifier.

The invention has the beneficial effects that:

1. the noise removing method of the invention models a large number of telephone signals in the intelligent outbound calling system and identifies the noise, thereby removing a large number of noise signals from the telephone signals and reducing the error rate when the signals are sent to ASR for conversion into text;

2. in the noise identification process, the noise removal method screens obvious noise signals based on the spectrum flatness, so that the workload of subsequent identification is reduced;

3. in the noise identification process, the noise removal method adopts a method of testing the signals by odd sections and taking the mode of the identification result, so that the identification accuracy of the slice signals can be effectively improved, and the voice is prevented from being deleted by mistake.

Drawings

Fig. 1 is a flowchart illustrating a noise removing method in an intelligent calling system based on machine learning according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail below with reference to specific embodiments and with reference to the attached drawings.

As shown in fig. 1, a method for removing noise in an intelligent calling system based on machine learning according to an embodiment of the present invention includes:

step 1, taking the sampled telephone signals as training data, and establishing a noise classification model based on machine learning. The step 1 specifically comprises:

step 101, slicing the telephone signal, namely VAD slicing, and performing normalization and framing preprocessing on the sliced signal.

Because the slice signals differ in volume, some being louder and some quieter, normalizing the telephone signals helps improve the recognition rate. During preprocessing, normalization is performed according to formula (1). The slice signals are uniformly quantized with 16 bits, with a value range between -65535 and 65535, and each signal is normalized to between -1 and 1 by dividing it by the maximum of its absolute value:

$$\hat{x} = \frac{x}{\max(|x|)} \qquad (1)$$

where x is the slice signal to be processed, |x| is its absolute value, and $\hat{x}$ is the normalized slice signal.
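The peak normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `normalize_slice` and the handling of an all-zero slice are assumptions that the text does not specify.

```python
def normalize_slice(samples):
    """Divide every sample by the largest absolute sample value."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        # Assumption: an empty or all-zero slice is returned as zeros,
        # since dividing by a zero peak is undefined.
        return [0.0 for _ in samples]
    return [s / peak for s in samples]


normalized = normalize_slice([1000, -4000, 2000])
print(normalized)  # [0.25, -1.0, 0.5]
```

After this step, loud and quiet slices share the same [-1, 1] range, which is the property the later feature extraction relies on.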

After the slice signal is normalized, because its frequency content varies over time, the slice signal needs to be divided into frames, so that each resulting frame can be treated as a stationary signal for extracting frequency-domain features. When framing the slice signal, the frame length is 30 ms and the frame shift is 10 ms.
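As an illustration of the framing with a 30 ms frame length and a 10 ms frame shift, the following sketch cuts a sample list into overlapping frames. The function name `frame_signal`, the 8000 Hz sampling rate used in the example, and the choice to drop a trailing partial frame are assumptions; the text does not state them.

```python
def frame_signal(samples, sample_rate, frame_ms=30, shift_ms=10):
    """Split samples into overlapping frames of frame_ms, advancing by shift_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    # Assumption: a trailing partial frame is dropped rather than zero-padded.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# 0.1 s of signal at an assumed 8000 Hz sampling rate
frames = frame_signal(list(range(800)), sample_rate=8000)
print(len(frames), len(frames[0]))  # 8 240
```

With these values, consecutive frames overlap by 20 ms, so each sample (away from the edges) contributes to three frames.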

Step 102, extracting MFCC features from the framed slice signals, and averaging the extracted MFCC features.

Since MFCC features are relatively consistent with the auditory characteristics of the human ear and can serve as representative features for a machine learning classifier, MFCC features are extracted from the preprocessed slice signals. However, since the slice signals differ in length and therefore in the number of frames obtained, the extracted MFCC features also need to be averaged. When averaging the extracted MFCC features, each dimension is averaged over all frames according to formula (2):

$$y_m = \frac{1}{N}\sum_{n=1}^{N} c_{n,m}, \quad m = 1, \dots, M \qquad (2)$$

where y_m is the average value of the MFCC features in dimension m, c_{n,m} is the MFCC feature of frame n in dimension m, M is the dimension of the MFCC features, and N is the number of frames of the slice signal after framing.

In the present invention, M is 39.
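The per-dimension averaging over frames can be sketched as follows. The function name `average_features` is hypothetical, and the example uses toy 3-dimensional vectors instead of real 39-dimensional MFCC vectors; MFCC extraction itself is not shown.

```python
def average_features(frame_features):
    """Average an N x M list of per-frame feature vectors into one M-vector."""
    if not frame_features:
        raise ValueError("need at least one frame")
    n_frames = len(frame_features)
    n_dims = len(frame_features[0])
    # For each dimension m, take the mean of that dimension across all frames.
    return [
        sum(frame[m] for frame in frame_features) / n_frames
        for m in range(n_dims)
    ]


# Two frames with three toy feature dimensions each
print(average_features([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]))  # [2.0, 3.0, 4.0]
```

The result has a fixed length regardless of how many frames the slice produced, which is what allows slices of different durations to share one classifier input size.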

Of course, besides extracting MFCC features, other acoustic features, such as short-time energy, zero-crossing rate, pitch, etc., may also be extracted, or a series of features may be combined for use by the classification model.

Step 103, inputting the averaged MFCC features into a machine learning classifier for model training, and taking the trained classification model as a noise classification model.

The machine learning classifier is one of a random forest classifier, an SVM classifier and an XGBoost classifier. Of course, the present invention is not limited to the above examples, and other classifiers can also be applied to the present invention.

Step 2, inputting a newly added telephone signal into the established noise classification model to obtain a noise identification result. The step 2 specifically comprises:

step 201, slicing the newly added telephone signal.

Step 202, performing normalization and framing preprocessing on the slice signals.

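The normalization is the same as in step 101: the slice signal x is divided by the maximum of its absolute value, $\hat{x} = x / \max(|x|)$, which gives the normalized slice signal.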

Step 203, performing a preliminary screening of the framed slice signals based on spectral flatness.

After framing, because the speech spectrum tends to have peaks at the fundamental frequency and its harmonics while the noise spectrum is relatively flat, the spectral flatness of the signal can be used to distinguish human voice from noise. Step 203 specifically includes: extracting the spectral flatness feature of each frame of the slice signal and averaging the extracted values to obtain the average flatness; setting a threshold for the average flatness; if the average flatness of the slice signal is higher than the threshold, judging the slice signal to be noise and discarding it directly; and if the average flatness is lower than the threshold, passing the preprocessed slice signal on to the next processing step.

The flatness threshold value of the present invention is set to 0.13.
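The flatness screening can be illustrated with the sketch below. The text does not give a formula for spectral flatness, so the common definition (geometric mean of the power spectrum divided by its arithmetic mean) is assumed here. The function names, the naive DFT, the small `eps` added to avoid taking the logarithm of zero, and treating an average flatness exactly equal to the threshold as noise are all assumptions made for illustration.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum over non-negative frequency bins (naive DFT, for clarity)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_flatness(frame, eps=1e-12):
    """Assumed definition: geometric mean / arithmetic mean of the power spectrum."""
    power = [p + eps for p in power_spectrum(frame)]
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))


def passes_flatness_screen(frames, threshold=0.13):
    """Return True if the slice is kept, False if it is discarded as noise."""
    mean_flatness = sum(spectral_flatness(f) for f in frames) / len(frames)
    return mean_flatness < threshold


tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]  # peaked spectrum
impulse = [1.0] + [0.0] * 63                                    # flat spectrum
print(passes_flatness_screen([tone]))     # True
print(passes_flatness_screen([impulse]))  # False
```

A pure tone concentrates its energy in one bin and scores near 0, while an impulse spreads energy evenly and scores near 1, which mirrors the peaked-versus-flat distinction drawn in the text. A real system would use an FFT rather than this quadratic-time DFT.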

Step 204, after the preliminary spectral flatness screening, dividing the framed signal into an odd number of segments, extracting the MFCC features of each segment, and averaging them.

The invention divides a longer slice signal into an odd number of segments; each segment has a duration of 0.5 s, and the segment shift is 0.25 s.
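The segmentation into 0.5 s pieces with a 0.25 s shift can be sketched as below. The text does not say how an odd segment count is guaranteed, so this sketch simply drops the last segment when the count would be even; that rule, the function name `split_into_odd_segments`, the 8000 Hz sampling rate in the example, and keeping a short slice as a single segment are all assumptions. For simplicity the sketch works on raw samples rather than on frames.

```python
def split_into_odd_segments(samples, sample_rate, seg_s=0.5, shift_s=0.25):
    """Cut samples into overlapping segments and keep an odd number of them."""
    seg_len = int(sample_rate * seg_s)
    shift = int(sample_rate * shift_s)
    if len(samples) <= seg_len:
        # Assumption: a slice no longer than one segment is kept whole.
        return [samples]
    segments = [
        samples[start:start + seg_len]
        for start in range(0, len(samples) - seg_len + 1, shift)
    ]
    if len(segments) % 2 == 0:
        # Assumption: drop the final segment so a majority vote cannot tie.
        segments.pop()
    return segments


one_second = list(range(8000))  # 1 s at an assumed 8000 Hz
print(len(split_into_odd_segments(one_second, 8000)))  # 3
```

An odd count matters only because the later vote over two classes then always has a strict majority.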

Step 205, inputting the averaged MFCC features of each segment into the noise classification model for identification, and identifying the noise in the slice signal.

The mode of the identification results of the segments is taken: if noise accounts for the larger proportion of the results, the input slice signal is determined to be noise; otherwise, it is determined to be human voice. Because a slice signal may contain both human voice and noise, the processing of step 205 can effectively improve the accuracy of signal identification.
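Taking the mode of the per-segment results amounts to a majority vote, which can be sketched as follows. The label strings `"noise"` and `"voice"` and the function name `vote` are illustrative assumptions.

```python
from collections import Counter


def vote(segment_labels):
    """Label the whole slice by the most common per-segment label."""
    counts = Counter(segment_labels)
    # With an odd number of segments and two classes there is no tie;
    # anything that is not a noise majority is treated as human voice.
    return "noise" if counts["noise"] > counts["voice"] else "voice"


print(vote(["noise", "voice", "noise"]))  # noise
print(vote(["voice", "voice", "noise"]))  # voice
```

Because ties and non-majorities fall back to human voice, the sketch errs on the side of keeping speech, in line with the stated goal of not deleting human voice by mistake.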

Further, since the slice signals are divided into human voice signals and noise signals, the threshold for the human voice signal is set to 0.2. In step 205, when the human voice probability output by the noise classification model for the slice signal to be identified is greater than the threshold, the slice signal is determined to be a human voice signal; otherwise, the slice signal is discarded as noise. This can raise the human voice recall rate to 99 percent and avoid deleting human voice by mistake.
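The effect of the 0.2 threshold can be illustrated with a short sketch. It assumes that the classifier outputs a human voice probability between 0 and 1; the function name `label_from_probability` and the example probabilities are hypothetical.

```python
def label_from_probability(p_voice, threshold=0.2):
    """Keep a signal as human voice whenever its voice probability exceeds the threshold."""
    return "voice" if p_voice > threshold else "noise"


# Hypothetical classifier outputs for four signals
for p in (0.05, 0.2, 0.35, 0.9):
    print(p, label_from_probability(p))
# 0.05 noise
# 0.2 noise
# 0.35 voice
# 0.9 voice
```

A signal scored at 0.35 would be labelled noise under a conventional 0.5 cut-off but is kept as human voice here, which is how a low threshold trades some extra retained noise for higher voice recall.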

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for removing noise in an intelligent calling system based on machine learning, comprising:

step 201, slicing the newly added telephone signal;

step 205, inputting the MFCC features of each segment of signal averaging into a noise classification model for identification, and identifying the noise in the slice signal.

2. The noise removing method in the intelligent calling system based on machine learning of claim 1, wherein in the preprocessing, the normalization is performed according to formula (1), the slice signals are uniformly quantized with 16 bits, the value range is -65535 to 65535, and each signal is normalized to between -1 and 1 by dividing it by the maximum of its absolute value;

$$\hat{x} = \frac{x}{\max(|x|)} \qquad (1)$$

where x is the slice signal to be processed, |x| is the absolute value of the slice signal, and $\hat{x}$ is the normalized slice signal.

3. The noise removing method in a machine learning-based smart calling system according to claim 1, wherein the frame length of the slice signal is 30ms and the frame shift is 10ms in the framing process.

4. The method for removing noise in an intelligent calling system based on machine learning according to claim 1, wherein step 203 specifically comprises: extracting the spectral flatness feature of each frame of the slice signal and averaging the extracted values to obtain the average flatness; setting a flatness threshold for the average flatness; if the average flatness of the slice signal is higher than the flatness threshold, judging the slice signal to be noise and discarding it directly; and if the average flatness of the slice signal is lower than the flatness threshold, passing the slice signal on to the next processing step.

5. The method of claim 4, wherein the flatness threshold is 0.13.

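6. The noise removing method in a machine learning-based smart call system as claimed in claim 1, wherein when averaging the extracted MFCC features, each dimension is averaged over all frames according to formula (2);

$$y_m = \frac{1}{N}\sum_{n=1}^{N} c_{n,m}, \quad m = 1, \dots, M \qquad (2)$$

where y_m is the average value of the MFCC features in dimension m, c_{n,m} is the MFCC feature of frame n in dimension m, M is the dimension of the MFCC features, and N is the number of frames of the slice signal after framing.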

7. The method of claim 1, wherein in step 205, the mode of the recognition results of the segments of each slice signal is taken, and if noise accounts for the larger proportion of the results, the input slice signal is determined to be noise; otherwise, the input slice signal is determined to be human voice.

8. The noise removing method in a machine learning-based smart calling system according to claim 7, wherein in step 204, the duration of each segment is 0.5 s, and the segment shift is 0.25 s.

9. The method as claimed in claim 4, wherein the slice signals are divided into human voice signals and noise signals, the threshold for the human voice signal is set to 0.2, and in step 205, when the human voice probability output by the classification model for the slice signal to be identified is greater than the threshold, the slice signal is determined to be a human voice signal.

10. The method of claim 1, wherein the machine learning classifier is one of a random forest classifier, an SVM classifier, and an XGBoost classifier.