EP0655731B1

EP0655731B1 - Noise suppressor available in pre-processing and/or post-processing of a speech signal

Info

Publication number: EP0655731B1
Application number: EP19940118782
Authority: EP
Inventors: Kazunori Ozawa (c/o NEC Corporation)
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1993-11-29
Filing date: 1994-11-29
Publication date: 2000-03-29
Anticipated expiration: 2014-11-29
Also published as: JP2739811B2; JPH07152395A; EP0655731A3; DE69423703D1; EP0655731A2; DE69423703T2

Description

This invention relates to a noise suppressor for use in suppressing a noise signal from a speech signal.
As a rule, a speech signal is subjected to pre-processing before it is encoded into a sequence of encoded signals. For example, pre-processing that judges whether a speech duration or a non-speech duration is present is described in an article by J.F. Lynch, Jr. et al. entitled "SPEECH/SILENCE SEGMENTATION FOR REAL-TIME CODING VIA RULE BASED ADAPTIVE ENDPOINT DETECTION" (Proceedings ICASSP, pages 1348-1351, 1987). The article describes only the discrimination between the speech duration and the non-speech duration and does not describe suppressing a noise signal from the speech signal during the pre-processing. In other words, Lynch et al. never consider pre-processing which suppresses the noise signal from the speech signal. In practice, even if the pre-processing described in the article were used for noise suppression, it would be difficult to suppress the noise signal, namely, a non-speech signal, within the speech duration.
On the other hand, spectrum subtraction has been proposed in JP-A-2-278298 to remove a noise component from the speech signal. Thereafter, the speech signal is encoded into a sequence of encoded signals. With this method, only the noise spectrum which results from the noise component is subtracted from a spectrum which includes the noise spectrum, and the result is produced as a noise-subtracted speech signal. Thus, the noise-subtracted speech signal might be free from the noise component on the spectrum.
However, it is to be noted that speech encoding is usually carried out in connection not only with the spectrum but also with a phase component of the speech signal. This shows that a noise component which is included in the phase component cannot be removed by the above-mentioned method.
Therefore, the spectrum subtraction is disadvantageous in that the noise component cannot be completely suppressed from the speech signal.
Moreover, the spectrum subtraction cannot be applied to post-processing which is carried out after the encoded signal sequence is decoded into a sequence of decoded signals.
At any rate, no consideration is made at all about suppressing a noise component on post-processing, despite the fact that noise suppression is necessary after decoding.
EP-A-459364 discloses a noise signal prediction system according to the preamble of claim 1.
It is an object of this invention to provide a noise suppressor which is capable of completely suppressing a noise component or signal from a speech signal.
It is another object of this invention to provide a noise suppressor of the type described, which can be used either on pre-processing or on post-processing of the speech signal.
It is still another object of this invention to provide a noise suppressor of the type described, which can suppress the noise signal not only within a speech duration but also within a non-speech duration.
These objects are attained with the features of the claims.
Fig. 1 is a block diagram of a noise suppressor according to a first embodiment of this invention;
Fig. 2 is a block diagram for use in describing a part of the noise suppressor illustrated in Fig. 1;
Fig. 3 is a block diagram of a noise suppressor according to a second embodiment of this invention; and
Fig. 4 is a block diagram for use in describing a part of the noise suppressor illustrated in Fig. 3.
Description will be at first made as regards a principle of this invention so as to facilitate an understanding of this invention. Herein, it is assumed that a speech signal is given in the form of a sequence of digital speech signals to be subjected to pre-processing and post-processing to suppress a noise signal from the speech signal. In addition, the pre-processing is carried out in response to an input signal specified by the digital speech signal sequence which is not encoded yet while the post-processing is carried out in response to an input signal specified by the digital speech signal sequence which is already decoded. Therefore, it is noted that the terms "digital speech signal sequence" and "input signal" may be used in two different meanings hereinunder so as to include both the pre-processing and the post-processing.
At any rate, the input signal includes the speech signal (namely, the digital speech signal sequence) and the noise signal and may be therefore considered as a combination of the digital speech signal sequence and the noise signal.
According to this invention, feature parameters are extracted from the input signal and may be, for example, selected one or ones of spectrum parameters representative of features of a spectrum in the input signal, pitch prediction gains representative of periodicity of the input signal, and the like. The feature parameters are used to determine either a speech duration or a non-speech duration by comparing the feature parameters with a threshold level.
Briefly, a preliminary sound source signal which specifies a sound source is obtained by the use of the input signal and the feature parameters on the pre-processing and the post-processing. Specifically, the preliminary sound source signal appears in the form of an error signal which is produced on the pre-processing by allowing the input signal to pass through an inverse filter controlled by the feature parameters.
On the post-processing, on the other hand, the preliminary sound source signal appears in the form of a decoder output signal, namely, a sequence of decoded signals which is decoded by the use of the feature parameters.
Since the speech signal has an amplitude greater than the noise signal in the preliminary sound source signal, it is possible to suppress the noise signal alone by comparing an amplitude of the preliminary sound source signal with a predetermined threshold level and to therefore attain a noise-suppressed signal. The noise-suppressed signal is reproduced by the use of the feature parameters into a noise-free output signal on the pre-processing or is produced as a noise-free decoded signal on the post-processing. The noise-free output signal may be encoded by an encoder after the pre-processing while the noise-free decoded signal may be converted into an audio signal after the post-processing.
Noise suppression may be carried out within a selected one of the speech duration and the non-speech duration or within both the speech duration and the non-speech duration. Thus, this invention makes it possible to suppress the noise signal on the waveform by the use of the feature parameters and is applicable to both the pre-processing and the post-processing.
Referring to Fig. 1, a noise suppressor according to a first embodiment of this invention is applicable to the pre-processing and is therefore supplied through an input terminal 10 with an input signal IN which includes a speech signal and a noise signal superposed on the speech signal. As mentioned before, the speech signal is given in the form of a sequence of digital speech signals. The input signal IN is given to a frame division circuit 11 and is divided by the frame division circuit 11 into a plurality of frames each of which has a length of, for example, 40 milliseconds. Each frame is further subdivided by a subframe division circuit 12 into a plurality of subframes each of which has a length of, for example, eight milliseconds.
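The frame and subframe division described above can be sketched as follows. This is only an illustrative sketch: the 8 kHz sampling rate, the function name `split_frames`, and the choice to drop a trailing partial frame are assumptions that do not come from the specification; only the 40 millisecond and eight millisecond lengths are taken from the text.

```python
def split_frames(signal, sample_rate_hz=8000, frame_ms=40, subframe_ms=8):
    """Split a list of samples into frames, each a list of subframes.

    sample_rate_hz is an assumed value; the text gives only the durations.
    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = sample_rate_hz * frame_ms // 1000    # 320 samples at 8 kHz
    sub_len = sample_rate_hz * subframe_ms // 1000   # 64 samples at 8 kHz
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # Cut the frame into consecutive, non-overlapping subframes.
        frames.append([frame[i:i + sub_len] for i in range(0, frame_len, sub_len)])
    return frames
```

With these assumed values each 40 millisecond frame holds five subframes of 64 samples.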
The input signal IN is divided into the subframes, as mentioned above, and is sent in the form of a divided input signal sequence x(n) either at every frame or at every subframe to a feature parameter calculator 15 on one hand and to a noise suppression circuit 20 on the other hand. Herein, the divided input signal sequence x(n) may be referred to as an internal input signal.
In the illustrated example, the feature parameter calculator 15 is supplied with the internal input signal x(n) at every subframe. The feature parameter calculator 15 at first places a window to extract a piece of the internal input signal x(n) in relation to each subframe. The window is longer than each subframe length and may be, for example, 24 milliseconds.
Thereafter, the feature parameter calculator 15 calculates, as feature parameters, spectrum parameters indicative of features of a spectrum in the input signal, pitch prediction gains indicative of periodicity of the speech signal, and an average amplitude in each subframe. In this event, average power may be calculated in the feature parameter calculator 15 instead of the average amplitude. Such calculations of the feature parameters are known in the art and will not be described further. In any event, the feature parameters are produced as feature parameter signals from the feature parameter calculator 15.
Herein, it is to be noted that the feature parameter calculator 15 shown in Fig. 1 calculates the spectrum parameters of a predetermined order which may be, for example, a tenth order. In addition, the following description will be made on the assumption that linear prediction coefficients a_i are used as the spectrum parameters. Although such linear prediction coefficients are calculated by using a well-known LPC analysis, Burg analysis, or the like, it is assumed in connection with the illustrated example that the Burg analysis is used to calculate the linear prediction coefficients. The Burg analysis is described in detail in a book (pages 82 to 87) which is written by Nakamizo et al and which is titled "Signal Analysis and System Identification" published by Corona Company Ltd, Tokyo, in 1988. Accordingly, description will be omitted from the instant specification as regards the Burg analysis.
Alternatively, the linear prediction coefficients may be also calculated by the use of a covariance method or a correlation method.
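As an illustration of the correlation method mentioned above (not of the Burg analysis assumed for the illustrated example), the linear prediction coefficients can be obtained from the autocorrelation of a windowed segment by the Levinson-Durbin recursion. The following is a minimal sketch; the function names are hypothetical, no analysis window is applied, and the sign convention is chosen so that the predicted sample is the sum of a_i x(n - i).

```python
def autocorrelation(x, max_lag):
    """r(k) = sum over n of x(n) * x(n - k), for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a_1..a_order.

    Returns (coefficients, final prediction error energy).
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:               # degenerate input (e.g. all zeros)
            break
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err                # reflection coefficient of stage m
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


def lpc(x, order=10):
    """Linear prediction coefficients of segment x (tenth order by default)."""
    coeffs, _ = levinson_durbin(autocorrelation(x, order), order)
    return coeffs
```

For an autocorrelation sequence r(k) = 0.5^k, which corresponds to a first-order process, the recursion returns a_1 = 0.5 and a_2 = 0.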
As mentioned before, the pitch prediction gains are also calculated in the feature parameter calculator 15. The pitch prediction gains are represented by P_g and are given by:
$$P_g = \frac{\sum_{n=0}^{N-1} x^2(n)}{\sum_{n=0}^{N-1} x^2(n) - \dfrac{\left(\sum_{n=0}^{N-1} x(n)\,x(n-T)\right)^2}{\sum_{n=0}^{N-1} x^2(n-T)}} \tag{1}$$
where T is a delay time representative of a pitch period; n, a sample number; and N, the number of samples over which the sums are taken.
Instead of Equation (1), the pitch prediction gains P_g can be simply calculated by the use of the following equation:
$$P_g = \frac{\sum_{n=0}^{N-1} x(n)\,x(n-T)}{\sum_{n=0}^{N-1} x^2(n-T)} \tag{2}$$
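The two forms of the pitch prediction gain can be sketched as below. The sketch assumes that the samples preceding the analysed block are available in the same list, so a `start` index (an assumption of this sketch, not part of the specification) marks where n = 0 lies; the handling of a zero denominator is likewise an illustrative choice.

```python
def pitch_prediction_gain(x, T, start, N):
    """Ratio form of the pitch prediction gain over x[start:start + N].

    Requires start >= T so that the delayed samples x(n - T) exist.
    """
    idx = range(start, start + N)
    energy = sum(x[n] ** 2 for n in idx)
    cross = sum(x[n] * x[n - T] for n in idx)
    delayed = sum(x[n - T] ** 2 for n in idx)
    if delayed == 0:
        return 1.0                   # no past energy, so nothing is predicted
    residual = energy - cross * cross / delayed
    if residual <= 0:
        return float("inf")          # perfectly periodic block
    return energy / residual


def simple_pitch_gain(x, T, start, N):
    """Simplified form: normalised correlation at lag T."""
    idx = range(start, start + N)
    delayed = sum(x[n - T] ** 2 for n in idx)
    if delayed == 0:
        return 0.0
    return sum(x[n] * x[n - T] for n in idx) / delayed
```

In practice the delay T would be searched over a range of candidate pitch periods; that search is not shown here.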
The average amplitude is represented by R and is given by:
$$R = \frac{1}{N} \sum_{n=0}^{N-1} x^2(n) \tag{3}$$
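A minimal sketch of this calculation for one subframe follows. The first function computes the mean of the squared samples as the equation is written; the root-mean-square variant is an assumption added here for comparison only and is not taken from the specification.

```python
import math


def average_power(subframe):
    """Mean of the squared samples of one subframe."""
    return sum(s * s for s in subframe) / len(subframe)


def rms_amplitude(subframe):
    """Square root of the mean square (an assumed amplitude-scaled variant)."""
    return math.sqrt(average_power(subframe))
```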
Herein, it is readily possible to implement circuits for calculating the above-mentioned linear prediction coefficients, the pitch prediction gains P_g, and the average amplitude R by a combination of conventional circuit elements. Accordingly, specific circuits for calculating the linear prediction coefficients, the pitch prediction gains P_g, and the average amplitude R will not be described further.
Thus, the feature parameter calculator 15 supplies a speech detection circuit 25 and the noise suppression circuit 20 with the feature parameter signals representative of the feature parameters, as mentioned above. In the illustrated example, the speech detection circuit 25 detects or determines either the speech duration or the non-speech duration of the speech signal in response to at least one of the feature parameters. To this end, a wide variety of methods can be applied to determine the speech duration or the non-speech duration. For example, the illustrated speech detection circuit 25 at first smooths the pitch prediction gains P_g and the average amplitude R to obtain smoothed pitch prediction gains P_g' and a smoothed average amplitude R' and thereafter compares the smoothed pitch prediction gains P_g' and the smoothed average amplitude R' with first and second threshold values TH1 and TH2, respectively.
The above-mentioned smoothing operation of the pitch prediction gains P_g and the average amplitude R is carried out in accordance with the following equation:
$$P'_j = (1 - \delta)\,P'_{j-1} + \delta \cdot P \tag{4}$$
where P is representative of the pitch prediction gains or the average amplitude to be smoothed; δ is representative of a time constant for smoothing and takes a value between 0 and 1, both exclusive; and P'_j and P'_{j-1} are representative of smoothed values at time instants j and j-1.
As a result of comparison, when the smoothed pitch prediction gains P_g' and the smoothed average amplitude R' are lower than the first and the second threshold values TH1 and TH2, respectively, the speech detection circuit 25 judges that the non-speech duration lasts in the internal input signal x(n). Otherwise, the speech detection circuit 25 judges that the speech duration lasts in the internal input signal x(n). Thus, the non-speech and the speech durations are detected by the speech detection circuit 25. In the example, the first and the second threshold values TH1 and TH2 may be invariable or variable with time.
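The smoothing and the threshold comparison described above can be sketched as follows. The function names, the default value of the time constant, and the initialisation of the smoothed values with the first observation are assumptions of this sketch; the decision rule (non-speech only when both smoothed values fall below their thresholds) follows the text.

```python
def smooth(previous, current, delta):
    """One step of the first-order smoothing, with 0 < delta < 1."""
    return (1.0 - delta) * previous + delta * current


def detect_speech(gains, amplitudes, th1, th2, delta=0.25):
    """Return one flag per subframe: True for speech, False for non-speech."""
    flags = []
    pg_s, r_s = gains[0], amplitudes[0]   # assumed initial smoothed values
    for pg, r in zip(gains, amplitudes):
        pg_s = smooth(pg_s, pg, delta)
        r_s = smooth(r_s, r, delta)
        non_speech = pg_s < th1 and r_s < th2
        flags.append(not non_speech)
    return flags
```

A smaller delta makes the decision slower to follow sudden changes, which trades responsiveness at speech onsets against stability within noise.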
As mentioned before, the speech detection circuit 25 comprises a calculation circuit for calculating the smoothed values (namely, the smoothed pitch prediction gains P_g' and the smoothed average amplitude R') in accordance with Equation 4 and a comparator unit for comparing the smoothed values with the first and the second threshold values TH1 and TH2. As a result, the illustrated speech detection circuit 25 can produce the smoothed average amplitude R' at every frame or at every subframe and a detection signal DT representative of either the speech or the non-speech duration at every frame or at every subframe.
The smoothed average amplitude R' is delivered to a memory circuit 30 while the detection signal DT is sent to the noise suppression circuit 20.
Referring to Fig. 2 in addition to Fig. 1, the noise suppression circuit 20 is operable to suppress the noise signal within at least one of the speech and the non-speech durations. In Fig. 2, the noise suppression circuit 20 comprises an inverse filter 201 supplied with the internal input signal x(n) from the input terminal 10 through the frame and the subframe division circuits 11 and 12. The feature parameters a_i are also supplied from the feature parameter calculator 15 to the inverse filter 201. The inverse filter 201 carries out an inverse filtering operation to produce an inverse-filtered signal e(n) which may be called a preliminary sound source signal because the inverse-filtered signal e(n) specifies a sound source. Herein, the inverse-filtered signal e(n) is given by:
$$e(n) = x(n) - \sum_{i=1}^{P} a_i\, x(n-i) \tag{5}$$
where P represents an order of the inverse filter 201. Thus, the inverse-filtered signal e(n) is dependent on the feature parameters and specifies the sound source.
The inverse-filtered signal e(n) includes a speech signal component and a noise signal component superposed on the speech signal component and appears in the form of a continuous signal. The inverse filter 201 may be simply called a filter circuit.
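The inverse filtering operation can be sketched as below. The function name is hypothetical, and samples before the start of the list are taken as zero, which is an assumption of this sketch; a practical implementation would carry the filter state over from the preceding subframe.

```python
def inverse_filter(x, a):
    """e(n) = x(n) - sum_{i=1..P} a_i * x(n - i), with x(n) = 0 for n < 0."""
    order = len(a)
    e = []
    for n in range(len(x)):
        predicted = sum(a[i - 1] * x[n - i]
                        for i in range(1, order + 1) if n - i >= 0)
        e.append(x[n] - predicted)
    return e
```

For example, the decaying sequence 1, 0.5, 0.25, 0.125 filtered with the single coefficient a_1 = 0.5 leaves only the initial pulse in e(n).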
Now, it is to be noted that the inverse-filtered signal e(n) is specified by a comparatively large amplitude pulse within a portion of the speech signal component appearing in the speech duration because the speech signal has a pitch. On the other hand, the inverse-filtered signal e(n) exhibits a comparatively small amplitude within a portion of the noise signal.
Accordingly, it is possible to suppress the noise signal by comparing the inverse-filtered signal e(n) with a threshold level TH1.
More specifically, the noise suppression circuit 20 illustrated in Fig. 2 comprises a threshold value calculation circuit 202 supplied with the smoothed average amplitude R' which is calculated by the speech detection circuit 25 in accordance with Equation 4 and which is memorized in the memory circuit 30. The threshold value calculation circuit 202 calculates the threshold value TH1 given by:
$$TH1 = K_2 \cdot R' \tag{6}$$
to produce a threshold value signal representative of the threshold value TH1, where K_2 is greater than zero. Thus, the threshold value TH1 is determined by the smoothed average amplitude R' memorized in the memory circuit 30.
The inverse-filtered signal e(n) and the threshold value signal are sent to a suppressor unit 203 which is also given the detection signal DT from the speech detection circuit 25. The suppressor unit 203 is put into an active state or into an inactive state in response to the detection signal DT. In this event, the suppressor unit 203 may suppress the noise signal within at least one of the speech duration and the non-speech duration. In the illustrated example, it is assumed that the suppressor unit 203 is put into the active state within the non-speech duration in response to the detection signal DT, although the suppressor unit 203 may be put into the active state within the speech duration.
In addition, the suppressor unit 203 compares the inverse-filtered signal e(n) with the threshold value signal. The suppressor unit 203 attenuates the inverse-filtered signal e(n) by a predetermined amount or renders the inverse-filtered signal e(n) into zero when the amplitude of the inverse-filtered signal e(n) is smaller than the threshold value TH1. As a result, the suppressor unit 203 produces a noise-suppressed signal e'(n) specified by:
$$e'(n) = \begin{cases} K \cdot e(n), & |e(n)| < TH1 \\ e(n), & \text{otherwise} \end{cases} \tag{7}$$
where K is greater than zero and smaller than unity.
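The operation of the threshold value calculation circuit 202 and the suppressor unit 203 can be sketched as follows. This is a sketch under stated assumptions: the threshold is taken as a positive constant times the smoothed average amplitude, the comparison is made on the absolute value of each sample, and the default constants `k2` and `k` are illustrative values, not values given in the specification. The `active` argument stands in for the detection signal DT.

```python
def suppress(e, smoothed_amplitude, k2=1.0, k=0.5, active=True):
    """Attenuate samples of e whose magnitude is below the threshold.

    k2 > 0 scales the threshold; 0 < k < 1 is the attenuation factor.
    When active is False the signal passes through unchanged.
    """
    if not active:
        return list(e)
    threshold = k2 * smoothed_amplitude
    return [k * s if abs(s) < threshold else s for s in e]
```

Large pitch pulses pass unchanged while low-level samples are scaled by the attenuation factor, so the suppression acts on the waveform of the sound source signal rather than on its spectrum.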
At any rate, a combination of the threshold value calculation circuit 202 and the suppressor unit 203 serves to suppress the noise signal included in the inverse-filtered signal e(n) and to produce the noise-suppressed signal e'(n) and may be collectively called a noise suppression portion.
The noise-suppressed signal e'(n) is sent to a reproduction circuit 204 together with the feature parameters a_i. The reproduction circuit 204 reproduces the noise-suppressed signal e'(n) into a noise-suppressed speech signal x'(n) with reference to the feature parameters a_i. In this event, the noise-suppressed speech signal x'(n) is given by:
$$x'(n) = e'(n) + \sum_{i=1}^{P} a_i\, x'(n-i) \tag{8}$$
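The reproduction step is a recursive (all-pole) filter driven by the noise-suppressed signal. A minimal sketch follows, with a hypothetical function name and with output samples before the start of the list taken as zero, which is an assumption of this sketch.

```python
def synthesize(e_suppressed, a):
    """x'(n) = e'(n) + sum_{i=1..P} a_i * x'(n - i), with x'(n) = 0 for n < 0."""
    order = len(a)
    out = []
    for n in range(len(e_suppressed)):
        feedback = sum(a[i - 1] * out[n - i]
                       for i in range(1, order + 1) if n - i >= 0)
        out.append(e_suppressed[n] + feedback)
    return out
```

With the same coefficients and zero initial state, this recursion is the exact inverse of the inverse filtering operation, so an unmodified e(n) reproduces x(n) and only the attenuated samples change the output waveform.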
The noise-suppressed speech signal x'(n) is delivered through an output terminal 35 of the noise suppression circuit 20 to an encoder (not shown) to be encoded. Thus, the noise-suppressed speech signal x'(n) is produced during the pre-processing prior to the encoding. Since the noise suppression is carried out with reference to the feature parameters of the input signal IN, a phase component of the noise signal can also be suppressed in the above-mentioned example.
Referring to Fig. 3, a noise suppressor (depicted at 40) according to a second embodiment of this invention is operable to carry out post-processing after decoding. To this end, the illustrated noise suppressor 40 is connected to a decoder 45 which is supplied, as a decoder input signal or an input signal DIN, with feature parameters of a speech signal and an index signal related to a sound source. The decoder 45 itself may be similar to that known in the art and produces a sequence of decoded sound source signals v(n) representative of a sound source together with the feature parameters and the index signal, in a known manner. The decoded sound source signal sequence v(n), the feature parameters, and the index signal are sent to the noise suppressor 40.
In the noise suppressor 40, the decoded sound source signal sequence v(n) is given to a noise suppression circuit which is depicted at 50 and which is operable in a manner to be described later in detail. Furthermore, the illustrated noise suppressor 40 comprises a speech detection circuit 25' and a memory circuit 30' which may be similar to those illustrated in Fig. 1, respectively. From this fact, it is readily understood that the speech detection circuit 25' is operated in response to the feature parameters, such as the spectrum parameters, the pitch prediction gains P_g, and the average amplitude R, to detect either the speech duration or the non-speech duration. Thus, the speech detection circuit 25' supplies the noise suppression circuit 50 with a detection signal DT' indicative of either the speech duration or the non-speech duration. Like in Fig. 1, the speech detection circuit 25' calculates the smoothed average amplitude R' which is stored in the memory circuit 30'.
Referring to Fig. 4 together with Fig. 3, the noise suppression circuit 50 comprises a threshold calculator 501 supplied with the smoothed average amplitude R' to calculate a threshold value signal representative of a threshold value TH2, like in the threshold value calculation circuit 202. The threshold value signal is given to the suppressor unit 502 together with the detection signal DT'.
The suppressor unit 502 is put into an active state within at least one of the speech and the non-speech durations. Herein, it is assumed that the illustrated suppressor unit 502 becomes active only within the non-speech duration, like the suppressor unit 203. In any event, the suppressor unit 502 produces a sequence of noise-suppressed sound source signals v'(n) given by:
$$v'(n) = \begin{cases} K \cdot v(n), & |v(n)| < TH2 \\ v(n), & \text{otherwise} \end{cases} \tag{9}$$
where K is identical with K shown in Equation 7. The threshold value TH2 may be equal to that of Equation 7.
Turning back to Fig. 3, the noise-suppressed sound source signals v'(n) are sent to a speech reproducing circuit 52 which is supplied with the feature parameters from the decoder 45. The speech reproducing circuit 52 reproduces the noise-suppressed sound source signals v'(n) into a reproduced speech signal with reference to the feature parameters in a known manner. The reproduced speech signal is delivered to a loudspeaker or the like.
Thus, the noise suppressor according to this invention can be used in post-processing the decoded sound source signals v(n) in the above-mentioned manner.
While this invention has thus far been described in conjunction with a few embodiments thereof, it will readily be possible for those skilled in the art to put this invention into practice in various other manners. For example, the feature parameters need not always be restricted to the linear prediction coefficients but may be any other parameters known in the art. In addition, it is possible to use parameters other than the average amplitude and the pitch prediction gains. The speech detection circuit 25 or 25' may be operated in a manner different from that illustrated in Figs. 1 and 3.
Moreover, the post-processing can be carried out to suppress the noise signal even when the feature parameters are not transmitted from a transmitter and are not received by the decoder 45 (Fig. 3). In this case, the speech signal is once reproduced by a receiver to form a reproduced speech waveform and to thereafter calculate feature parameters from the reproduced speech waveform in the manner mentioned in conjunction with Fig. 1. Thus, the calculated feature parameters can be used to suppress the noise signal in the above-mentioned manner.
With this structure, the noise suppression is possible during both the pre-processing and the post-processing of the speech signal. Moreover, it is also possible to suppress not only the noise signal appearing within the non-speech duration but also a non-speech signal superposed on the speech signal appearing within the speech duration. Such suppression can be accomplished on the waveform.

Claims

A noise suppressor supplied with an internal input signal (IN) which includes both a speech signal and a noise signal to produce an output signal substantially free from said noise signal, said speech signal being specified by a sound source, said noise suppressor comprising feature parameter calculating means (15) supplied with said internal input signal for calculating a feature parameter specifying a feature of said speech signal to produce a feature parameter signal representative of said feature parameter, and noise suppressing means (20) coupled to said feature parameter calculating means (15) for suppressing said noise signal from said internal input signal to produce said output signal, wherein said noise suppressing means comprises:

a suppression unit (203) for suppressing the noise signal from a residual signal (e(n)) by estimating said noise signal to produce a noise-suppressed signal (e'(n)); and

output means (204) for producing said noise-suppressed signal as said output signal; and is characterized by

filter means (201) supplied with said feature parameter signal (a_i) and said internal input signal for filtering said internal input signal (x(n)) to produce a filtered signal which is dependent on said feature parameter (a_i) and which specifies said sound source in that said residual signal (e(n)) is calculated which represents the difference between said feature parameter signal representation and said internal input signal; wherein the suppression unit is coupled to said filter means (201).
A noise suppressor as claimed in Claim 1, said speech signal being divisible into a speech duration and a non-speech duration, wherein said noise suppressor (20) further comprises:

speech detection means (25) coupled to said feature parameter calculating means (15) for detecting said speech and said non-speech durations in response to the feature parameter signal to produce a detection signal representative of either one of said speech and said non-speech durations;

average calculation means (30) coupled to said speech detection means for calculating an average value of either power or an amplitude within said non-speech duration to produce an average signal representative of said average value;
said noise suppressing means (20) further comprising:

threshold level calculating means (202) for calculating a threshold level from said average signal to supply said suppression unit (203) with a threshold level signal (TH1) representative of said threshold level, to make said suppression unit (203) compare said filtered signal with said threshold level signal, and to make said suppression unit suppress said noise signal.
A noise suppressor as claimed in Claim 2,
wherein said suppression unit (203) is further supplied with said detection signal (DT) to be put into an active state within at least one of said speech and said non-speech durations.
A noise suppressor as claimed in Claim 1, 2 or 3,
wherein said feature parameter calculating means (15) calculates, as said feature parameter (a_i), spectrum parameters representative of a spectrum of said internal input signal, a pitch period of said internal input signal, and an average amplitude of said internal input signal.
A noise suppressor according to claims 1, 2, 3, or 4, wherein said internal input signal is divided into a sequence of frames each of which lasts for a predetermined interval of time, said speech signal is generated by a sound source and has a spectrum specified by at least one feature parameter and is divisible into a speech duration and a non-speech duration, said noise suppressor comprises feature parameter calculating means for calculating said at least one feature parameter to produce a feature parameter signal representative of said at least one feature parameter and speech detection means coupled to said feature parameter calculating means (15) for detecting said speech and said non-speech durations in response to the feature parameter signal to produce a detection signal representative of either one of said speech and said non-speech durations,

average memory means is coupled to said speech detection means for memorizing an average value of either one of power and an amplitude of said internal input signal within said non-speech duration to produce an average signal representative of said average value; and

said noise suppressing means (20) is coupled to said feature parameter calculating means (15), said speech detection means, and said average calculating means for suppressing said noise signal with reference to said feature parameter signal, said detection signal, said average signal, and said internal input signal to produce said output signal.
A noise suppressor operable in response to a feature parameter signal specifying a speech signal and to a sound source signal (v(n)) representative of a sound source of said speech signal to suppress a noise signal from the sound source signal and to produce an output signal (v'(n)) substantially free from said noise signal, said speech signal being divisible into a speech duration and a non-speech duration, said sound source signal appearing in the form of an error signal which is produced on the preprocessing by allowing an input signal to pass through an inverse filter controlled by said feature parameter signal,
said noise suppressor being characterized by:

a noise suppressing circuit (50) for suppressing said noise signal from said sound source signal with reference to said feature parameter signal to produce a noise-suppressed signal (v'(n));

means (52) for producing said noise-suppressed signal as said output signal.
A noise suppressor as claimed in Claim 6, characterized by:

speech detection means (25') supplied with said feature parameter signals for detecting said speech and said non-speech durations to produce a detection signal representative of either one of said speech and said non-speech durations; and

average memory means (30') coupled to said speech detection means for memorizing an average value of either one of power and an amplitude of said speech signal within said non-speech duration to produce an average signal representative of said average value;

said noise suppressing circuit (50) suppressing said noise signal with reference to said average signal also.