EP0226590A1 - Analyzer for speech in noise prone environments - Google Patents
- Publication number
- EP0226590A1 (application EP85906004A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- speech
- signals
- noise
- autocorrelation
- signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/06—Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/06—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being correlation coefficients
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/12—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
Definitions
- This invention relates to a method for analyzing speech in a noise prone environment.
- The problem is that the speech signal spectrum is no longer an all-pole spectrum, so the usual methods of estimating the pth order all-pole parameters from the first p+1 correlations of the signal are no longer valid. While such an analysis matches the first p+1 correlations, it does not guarantee the matching of the higher order correlations, and it degrades the degree to which the speech parameters obtained from the analysis correspond to the speech pattern applied to the analyzer.
- The problems are solved in accordance with this invention by a method which comprises the steps of: partitioning an input speech pattern into successive time frame intervals; forming signals representative of the autocorrelation of the input speech of the current time frame interval responsive to the input speech pattern; generating noise signals representative of the environment; forming first and second autocorrelation signals responsive to said input speech autocorrelation signals and said generated noise signals; generating a signal corresponding to the difference between said first and second autocorrelation signals; and producing a set of signals representative of the current time frame interval speech responsive to said difference-corresponding signal and the second autocorrelation signals.
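The claimed steps lend themselves to a compact sketch. The following Python fragment is illustrative only (the patent's implementation is in Fortran, per the Appendix; the frame length, model order, and the noise-intensity factor `alpha` used here are assumed values): it partitions a sampled pattern into successive time frame intervals, forms per-frame autocorrelation signals, and generates the difference-corresponding signal of the final steps.

```python
import numpy as np

def frame_autocorrelations(pattern, frame_len, p):
    """Partition a sampled pattern into successive frames and form the
    first p+1 autocorrelation lags of each frame."""
    n_frames = len(pattern) // frame_len
    frames = pattern[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.array([[np.dot(f[k:], f[:frame_len - k]) for k in range(p + 1)]
                     for f in frames])

# Illustrative data: one noisy-speech buffer and one noise-only buffer.
rng = np.random.default_rng(0)
speech = rng.standard_normal(320)                      # stand-in samples
noise = rng.standard_normal(320)
r_speech = frame_autocorrelations(speech, 160, 10)[0]  # first autocorrelation signal
r_noise = frame_autocorrelations(noise, 160, 10)[0]    # noise autocorrelation signal
alpha = 0.1                                            # assumed noise-intensity factor
r_corrected = r_speech - alpha * r_noise               # difference-corresponding signal
```

The corrected lags `r_corrected` then drive the production of the frame's speech parameter signals.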
- The invention is directed to an arrangement for analyzing noise-contaminated speech in which the speech pattern is partitioned into successive time frame intervals.
- A predictive error signal is generated for each interval responsive to the noise-contaminated speech of the interval and an estimate of the noise.
- Predictive parameter signals are selected to minimize the time frame predictive error signal so that accurate digital codes representative of the speech are generated.
- FIG. 1 depicts a flow chart showing the general operation of a speech analyzer illustrative of the invention
- FIG. 2 shows a block diagram of a circuit adapted to analyze speech patterns in accordance with the flow chart of FIG. 1 that is illustrative of the invention
- FIGS. 3-5 are detailed flow charts illustrating the speech analysis process of FIG. 1; and FIG. 6 shows waveforms illustrating the autocorrelations of speech patterns obtained through the analyzer of FIGS. 1 and 2.
Detailed Description
- LPC linear predictive analysis
- The parameters of a pth order all-pole filter are determined such that the sum of the autocorrelations based on the all-pole model and the noise autocorrelations matches the autocorrelations of speech contaminated with noise over a large number of lags beyond p.
- r_k is the autocorrelation function of the speech contaminated with noise at the kth sample lag
- r̂_k is the corresponding autocorrelation based on the all-pole model
- n_k is the autocorrelation function of the noise signal
- α is the unknown noise intensity
- Equation (4) requires solving a set of nonlinear equations in p+1 unknowns. Such equations are difficult to formulate and solve using data processing techniques.
- The first p+1 correlations of the speech signal are matched exactly by the all-pole filter and noise model, while the noise factor α is simultaneously selected to minimize the mismatch at the next correlations.
- The mismatch is then a function of α, which is determined by a one-dimensional search.
- The optimum linear predictor coefficients characterizing a time frame interval are obtained as solutions of p+1 linear equations.
- Equation (5) is solved for a number of values of α between zero and the highest expected value of the noise power expressed as a fraction of the speech power.
- The value of α that minimizes the extended sum squared error in Equation (4) is selected as the optimum value, and the linear prediction parameter signals a1, a2, ..., ap are formed.
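The matching-and-search procedure described above can be sketched as follows. This is an illustrative reconstruction, not the patent's Appendix code: the grid of candidate α values, the function names, and the use of a direct linear solve in place of a recursion are all assumptions made here.

```python
import numpy as np

def lpc_from_autocorr(r, p):
    """Solve the p x p Toeplitz normal equations R a = [r(1)..r(p)]."""
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])

def model_autocorr(a, r, p, lags):
    """Extend the autocorrelation beyond lag p with the all-pole recursion
    r_hat(k) = sum_i a_i * r_hat(k - i)."""
    r_hat = list(r[:p + 1])
    for k in range(p + 1, lags + 1):
        r_hat.append(sum(a[i] * r_hat[k - 1 - i] for i in range(p)))
    return np.array(r_hat)

def search_alpha(r_noisy, r_noise, p, lags, alphas):
    """One-dimensional search: for each candidate alpha, match the first p+1
    noise-corrected lags exactly, then score the residual mismatch
    r(k) - r_hat(k) - alpha*n(k) at the higher lags; keep the minimizer."""
    best_err, best_alpha, best_a = np.inf, None, None
    for alpha in alphas:
        r = r_noisy - alpha * r_noise              # noise-corrected lags
        try:
            a = lpc_from_autocorr(r, p)
        except np.linalg.LinAlgError:              # cf. the singularity flag of FIG. 4
            continue
        r_hat = model_autocorr(a, r, p, lags)
        err = np.sum((r_noisy[p + 1:lags + 1]
                      - r_hat[p + 1:]
                      - alpha * r_noise[p + 1:lags + 1]) ** 2)
        if err < best_err:
            best_err, best_alpha, best_a = err, alpha, a
    return best_err, best_alpha, best_a
```

For each candidate α the first p+1 noise-corrected correlations are matched exactly, the extended mismatch at the lags beyond p plays the role of the error of Equation (4), and the minimizing α is retained along with its prediction parameter signals.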
- FIG. 6 illustrates the effect of noise on the speech pattern analysis.
- Waveform 601 shows the true autocorrelation function for a 20 millisecond time frame interval of speech contaminated by 10 dB of additive white noise.
- Waveform 603 shows the autocorrelation function obtained from a pth order all-pole model in accordance with the prior art.
- Waveform 605 illustrates the autocorrelation function obtained from the modified autocorrelation analysis in accordance with the invention. It is readily seen that waveform 605 follows the true autocorrelation function of waveform 601 very closely, while waveform 603, obtained from an all-pole model analysis, deviates significantly from the true autocorrelation function.
- A general flow chart illustrating the noisy speech pattern analysis arrangement is shown in FIG. 1, and FIG. 2 depicts a block diagram of a microprocessor circuit adapted to carry out the operations of the flow chart of FIG. 1.
- A speech pattern and the environmental noise associated therewith are sampled and digitized as indicated in step 101 of FIG. 1. This is accomplished in the circuit of FIG. 2 by receiving the noisy speech pattern at microphone 201 and low-pass filtering the speech signal in filter 205.
- The bandlimited signal from the filter is sampled in analog-to-digital converter 210 at a prescribed rate, and each sample is converted into a digital code corresponding to the magnitude of the sample.
- Step 105 of FIG. 1 is performed, and the digitized speech samples from converter 210 are partitioned into time frame intervals in floating point array processor 220, hereinafter referred to as arithmetic processor 220, under control of control processor 215.
- Such partitioning may be done on a frame-by-frame basis as the speech signal is received from microphone 201.
- The time frame interval signal samples are processed successively, and a set of LPC speech parameter signals is produced in arithmetic processor 220 and transferred therefrom to utilization device 260.
- The utilization device may comprise speech processing equipment, such as a speech recognizer, synthesizer or coder, or general purpose data processing equipment, such as a mainframe computer or a personal computer.
- The programmed instructions in control memory 230 controlling the operation of the circuit of FIG. 2 in carrying out the operations of the flow chart of FIG. 1 are listed in Fortran language form in the Appendix hereto.
- The autocorrelation signals are formed in step 110 in arithmetic processor 220, according to instructions stored in program memory 230 of FIG. 2, using the frame speech sample signals of store 240 and the window signal in store 235.
- Windowed speech samples are stored in memory 245, and the r(i) correlation signals are placed in store 250 of FIG. 2.
- N successive digitized speech samples s(1), s(2), ..., s(n), ..., s(N) are windowed by combining the sample signals with a window function signal w(n) stored in memory 235 of FIG. 2, as is well known in the art.
- The windowed sample signals are
  x(n) = s(n)·w(n), 1 ≤ n ≤ N  (6)
- Step 110 in FIG. 1 is shown in greater detail in FIG. 3.
- The autocorrelation index signal k is initially set to zero (step 301).
- The autocorrelation signals are then iteratively formed in steps 305 and 310; each autocorrelation signal is
  r(k) = Σ_n x(n)·x(n−k)  (7)
- The r(k) signals are stored in autocorrelation signal store 250 as they are produced in step 305.
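Steps 301-310 amount to windowing per Equation (6) followed by the lag products of Equation (7). A minimal sketch follows; a Hamming window is assumed here, since the actual window signal held in store 235 is not specified in this excerpt.

```python
import numpy as np

def windowed_autocorr(s, p):
    """Window the frame samples (Eq. 6) and form lags r(0)..r(p) (Eq. 7)."""
    x = s * np.hamming(len(s))                       # x(n) = s(n) * w(n)
    return np.array([np.dot(x[k:], x[:len(x) - k])   # r(k) = sum_n x(n) x(n-k)
                     for k in range(p + 1)])
```

By the Cauchy-Schwarz inequality the lag-zero term r(0) bounds every other lag, which is why r(0) serves as the frame energy reference in the later normalization steps.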
- The noise contribution level signal is then initially set to a minimum prescribed value, as per index K in step 115 of FIG. 1.
- The noise contribution signal corresponds to the noise autocorrelation patterns expected during a time frame interval.
- Such noise pattern signals may be fixed as white or colored noise or may be obtained by sampling the particular speech analyzer environment in the absence of speech. The sampling may represent the average noise background or may be obtained in the first several milliseconds of each speech analysis operation prior to the application of speech to microphone 201.
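One way to realize the silence-sampling alternative described above is sketched below. This is an assumed estimator, since the patent does not give one in this excerpt, and the normalization to r_n(0) = 1 (so that the overall intensity is carried entirely by the factor α) is a design choice made here, not taken from the source.

```python
import numpy as np

def noise_autocorr_from_silence(samples, n_silence, p):
    """Estimate the expected noise autocorrelation pattern from the first
    few milliseconds of the recording, before speech reaches the microphone."""
    noise = samples[:n_silence]
    r_n = np.array([np.dot(noise[k:], noise[:n_silence - k])
                    for k in range(p + 1)])
    return r_n / r_n[0]   # normalized pattern; intensity is carried by alpha
```

For stationary background noise the same pattern can be reused across frames; for a slowly varying environment it would be refreshed at the start of each analysis operation, as the passage suggests.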
- The loop including steps 120 to 140, which is operative to form a set of modified autocorrelation signals, is then entered.
- This loop is adapted to form modified autocorrelation signals for the current time frame interval by subtracting the noise contribution signal α indexed by K (step 120), to form linear prediction parameter signals and the modified autocorrelation signals for the interval (step 125), and to form all-pole model autocorrelation signals and generate an error signal corresponding to the match between the all-pole model autocorrelations and the modified autocorrelation signals (step 130).
- Noise contribution index signal K is then incremented (step 135) and the loop is iterated for the predetermined set of noise contribution indices.
- FIG. 4 illustrates the operations of the modified autocorrelation signal formation loop in greater detail.
- A singularity flag IS is initially reset to zero in step 403. If unacceptable values for the predictor coefficients are obtained during the iterations through the loop from step 410 through 430, the singularity flag is set and no further modified autocorrelation signals are formed.
- Noise index K is then set to zero in step 405. Index K is incremented by a predetermined amount on each iteration through the loop to provide modified autocorrelation signals responsive to different values of noise contribution.
- The modified autocorrelation signal loop is then iterated for increasing values of noise index K until K exceeds Kmax, corresponding to the maximum noise contribution signal expected.
- Step 415 is then entered and the linear prediction coefficients resulting from the current modified autocorrelation signals of the time frame interval are generated from
- Step 415 includes generating signals
- Index K* corresponding to the minimum matching error signal is determined (step 145) by a search through the correlation signal matching errors obtained in the iterations of loop 450 in FIG. 4.
- The noise contribution signal for index K* is then used to form LPC speech parameter signals for the current frame (step 150 of FIG. 1).
- Step 501 is entered after the noise index K* has been determined in step 150 of FIG. 1.
- The linear prediction coefficients corresponding to the noise-corrected autocorrelation coefficients are then formed from the relationship
- Control is then passed to step 110 via step 155, and the circuit of FIG. 2 processes the next set of N digitized speech sample signals of store 240.
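The relationship referred to here is the standard set of p+1 Toeplitz normal equations in the noise-corrected autocorrelations. The Fortran Appendix is not reproduced in this excerpt, so the Levinson-Durbin recursion below is an assumed equivalent solver, not the patent's own listing.

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the Toeplitz normal equations for predictor coefficients
    a(1)..a(p) and the residual prediction error energy."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        # Reflection coefficient from the current prediction residual.
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):          # update interior coefficients
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)           # shrink the error energy
    return -a[1:], err                 # predictor coefficients a1..ap, energy
```

Feeding the noise-corrected lags of step 150 into such a solver yields the frame's LPC speech parameter signals directly.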
Abstract
A speech pattern captured in a noisy environment is analyzed to form a representative sequence of speech parameter signals. The noise-contaminated speech pattern is partitioned into successive time frame intervals. A predictive error signal is produced for each successive interval responsive to the noise-contaminated speech of the interval and an estimate of the ambient noise. The predictive parameter signals are selected to minimize the predictive error signal of each time frame, so as to produce accurate digital codes representing the speech.
Description
Claims
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US71488885A | 1985-03-22 | 1985-03-22 | |
US714888 | 1985-03-22 |
Publications (1)
Publication Number | Publication Date |
---|---|
EP0226590A1 true EP0226590A1 (en) | 1987-07-01 |
Family
ID=24871862
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP19850906004 Withdrawn EP0226590A1 (en) | 1985-03-22 | 1985-11-14 | Analyzer for speech in noise prone environments |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP0226590A1 (en) |
JP (1) | JPS62502288A (en) |
AU (1) | AU5202086A (en) |
ES (1) | ES8704658A1 (en) |
WO (1) | WO1986005619A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
NL8500843A (en) * | 1985-03-22 | 1986-10-16 | Koninkl Philips Electronics Nv | MULTIPULS EXCITATION LINEAR-PREDICTIVE VOICE CODER. |
-
1985
- 1985-11-14 JP JP50526685A patent/JPS62502288A/en active Pending
- 1985-11-14 WO PCT/US1985/002255 patent/WO1986005619A1/en not_active Application Discontinuation
- 1985-11-14 AU AU52020/86A patent/AU5202086A/en not_active Abandoned
- 1985-11-14 EP EP19850906004 patent/EP0226590A1/en not_active Withdrawn
- 1985-11-21 ES ES549155A patent/ES8704658A1/en not_active Expired
Non-Patent Citations (1)
Title |
---|
See references of WO8605619A1 * |
Also Published As
Publication number | Publication date |
---|---|
ES8704658A1 (en) | 1987-04-16 |
ES549155A0 (en) | 1987-04-16 |
WO1986005619A1 (en) | 1986-09-25 |
JPS62502288A (en) | 1987-09-03 |
AU5202086A (en) | 1986-10-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP1309964B1 (en) | Fast frequency-domain pitch estimation | |
Lim et al. | All-pole modeling of degraded speech | |
US5179626A (en) | Harmonic speech coding arrangement where a set of parameters for a continuous magnitude spectrum is determined by a speech analyzer and the parameters are used by a synthesizer to determine a spectrum which is used to determine senusoids for synthesis | |
US5450522A (en) | Auditory model for parametrization of speech | |
US7272551B2 (en) | Computational effectiveness enhancement of frequency domain pitch estimators | |
US5305421A (en) | Low bit rate speech coding system and compression | |
CA1123955A (en) | Speech analysis and synthesis apparatus | |
KR960002388B1 (en) | Speech encoding process system and voice synthesizing method | |
US5023910A (en) | Vector quantization in a harmonic speech coding arrangement | |
US4283601A (en) | Preprocessing method and device for speech recognition device | |
US5459815A (en) | Speech recognition method using time-frequency masking mechanism | |
GB1533337A (en) | Speech analysis and synthesis system | |
EP0235181A1 (en) | A parallel processing pitch detector. | |
EP0470245A1 (en) | Method for spectral estimation to improve noise robustness for speech recognition. | |
US4081605A (en) | Speech signal fundamental period extractor | |
US5884251A (en) | Voice coding and decoding method and device therefor | |
Atal et al. | Linear prediction analysis of speech based on a pole‐zero representation | |
US4922539A (en) | Method of encoding speech signals involving the extraction of speech formant candidates in real time | |
US4890328A (en) | Voice synthesis utilizing multi-level filter excitation | |
US5007094A (en) | Multipulse excited pole-zero filtering approach for noise reduction | |
US6912496B1 (en) | Preprocessing modules for quality enhancement of MBE coders and decoders for signals having transmission path characteristics | |
AU2394895A (en) | A multi-pulse analysis speech processing system and method | |
US7043424B2 (en) | Pitch mark determination using a fundamental frequency based adaptable filter | |
EP0226590A1 (en) | Analyzer for speech in noise prone environments | |
Srivastava | Fundamentals of linear prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 19870302 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): BE DE FR GB IT NL SE |
|
17Q | First examination report despatched |
Effective date: 19881031 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 19890311 |
|
RIN1 | Information on inventor provided before grant (corrected) |
Inventor name: JAIN, VIJAY, KUMAR
Inventor name: ATAL, BISHNU, SAROOP