CA1218458A - Apparatus and method for automatic speech activity detection - Google Patents

Apparatus and method for automatic speech activity detection

Info

Publication number
CA1218458A
CA1218458A (Application CA000458275A)
Authority
CA
Canada
Prior art keywords
speech
signals
frames
noise
scalar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
CA000458275A
Other languages
French (fr)
Inventor
Sandra E. Hutchins
Steven F. Boll
George Vensko
Lawrence Carlin
Allen R. Smith
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Standard Electric Corp
Original Assignee
International Standard Electric Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Standard Electric Corp filed Critical International Standard Electric Corp
Application granted granted Critical
Publication of CA1218458A publication Critical patent/CA1218458A/en
Expired legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Complex Calculations (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

Abstract of the Disclosure
An apparatus and method for automatic detection of speech signals in the presence of noise including noise events occurring when speech is not present and having signals whose signal strengths are substantially equal to or greater than the speech signals. Frames of data representing digitized output signals from a plurality of frequency filters are operated on by a linear feature vector to create a scalar feature for each frame which is indicative of whether the frame is to be associated with speech signals or noise event signals. The scalar features are compared with a detection threshold value which is created and updated from a plurality of previously stored scalar features. A plurality of the results of the comparison for a succession of frames is stored and the stored results combined in a predetermined way to obtain an indication of when speech signals are present. In automatic speech recognizers employing the above-described speech detection, when such an indication is given, frames are further preprocessed and then compared with stored templates in accordance with the dynamic programming algorithm in order to recognize which word was spoken.

Description


APPARATUS AND METHOD FOR
AUTOMATIC SPEECH ACTIVITY DETECTION
Background of the Invention
This invention relates to an apparatus and method for speaker independent speech activity detection in an environment of relatively high level noise, and to automatic speech recognizers which use such speaker independent speech activity detection.
Automatic speech recognition systems provide a means for man to interface with communication equipment, computers and other machines in a human's most natural and convenient mode of communication. Where required, this will enable operators of telephones, computers, etc. to call others, enter data, request information and control systems when their hands and eyes are busy, when they are in the dark, or when they are unable to be stationary at a terminal.
One known approach to automatic speech recognition involves the following: periodically sampling a bandpass filtered (BPF) audio speech input signal to create frames of data and then preprocessing the data to convert them to processed frames of parametric values which are more suitable for speech processing; storing a plurality of templates (each template is a plurality of previously created processed frames of parametric values representing a word, which when taken together form the reference vocabulary of the automatic speech recognizer); and comparing the processed frames of speech with the templates in accordance with a predetermined algorithm, such as the dynamic programming algorithm (DPA) described in an article by F. Itakura, entitled "Minimum prediction residual principle applied to speech recognition", IEEE Trans. Acoustics, Speech and Signal Processing, Vol. ASSP-23, pp. 67-72, February 1975, to find the best time alignment path or match between a given template and the spoken word.
Automatic Speech Recognizers depend on detecting the end points of speech based on measurements of energy. Prior art speech activity detectors discriminate between energy, assumed to be speech, and lack of energy, assumed to be silence. Therefore, prior art Automatic Speech Recognizers require a relatively quiet environment in which to operate; otherwise, performance in terms of recognition accuracy drops drastically. Requiring a quiet environment restricts the uses to which a Speech Recognizer can be put; for example, prior art recognizers would have difficulty operating on a noisy factory floor or in a cockpit of a tactical aircraft. Such noisy environments can be characterized as having background noise present whether or not speech is present and noise events occurring when speech is not present, the noise events sometimes having signal levels equal to or greater than the speech signal levels. It is desirable, therefore, to provide an apparatus and method for speaker independent speech activity detection, and for such speech activity detection for use in automatic speech recognizers which must operate in an environment wherein noise events with relatively high signal levels occur when speech is not present.
Summary of the Invention
The present invention relates to an apparatus and method for speech activity detection of speech signals in the presence of noise, including noise events occurring when speech is not present and whose signal strengths may be equal to or greater than the speech signals. The input signals are digitized and frames of digital signal values associated with said digitized signals are repeatedly formed. The speech signals and noise event signals are automatically separated. In the preferred embodiment, this is done with a speaker independent, predefined, fixed operation or transformation performed on the frames.
Also, in the preferred embodiment, the input signals are frequency filtered to provide a plurality of filter output signals which are then digitized. The frames are created from the digitized filter output signals. A linear transformation is applied to the frames of digital signal values to create a scalar feature for each frame whose magnitude will be larger for speech signals than for noise event signals.
A detection threshold value is created for the scalar feature magnitudes and repeatedly updated. Scalar features are compared with the detection threshold value, and the results of a plurality of successive comparisons are stored. The stored results are combined in a predetermined manner to obtain an indication of when speech signals are present.
When an indication that speech signals are present is given, frames are further preprocessed before being compared with stored templates representing the vocabulary of recognizable words. The comparison is based on the dynamic programming algorithm (DPA).
Brief Description of the Drawings
Objects, features and advantages of the present invention will become more fully apparent from the following detailed description of the preferred embodiment, the appended claims and the accompanying drawings, in which:

Fig. 1 is a preferred embodiment block diagram of the automatic speech recognition apparatus of the present invention.
Fig. 2 is a more detailed block diagram of the bandpass filter portion of the invention of Fig. 1.
Fig. 3 is a table giving the filter characteristics of the bandpass filter portion of Fig. 2.
Fig. 4 is a preferred embodiment block diagram of the operation of the speech recognition algorithm of the present invention.
Fig. 5 is a graph summarizing the time alignment and matching of the recognition portion of the speech recognition algorithm of Fig. 4.
Fig. 6 shows three graphs of amplitude vs. frequency for voice, jet noise and oxygen regulator noise.
Fig. 7 is a more detailed block diagram of the speech activity detector portion of the speech recognition algorithm of Fig. 4.
Detailed Description of the Drawings
Fig. 1 is a block diagram of an automatic speech recognizer apparatus designated generally 100. It comprises a microphone 102; a microphone preamplifier circuit 104; a bandpass filter bank circuit 108 for providing a digital spectrum sampling of the audio output of circuit 104; a pair of processors 110 and 112 interconnected by inter-processor communication circuits 114 and 116; and an external non-volatile memory device 118. In the preferred embodiment, processors 110 and 112 are Motorola MC68000 microprocessors and inter-processor communication circuits 114 and 116 are conventionally designed circuits for handling interrupts and data transfers between MC68000 microprocessors. Interrupt procedures for the MC68000 are adequately described in the MC68000 specification.
The speech recognition algorithm is stored in the EPROM memory portions 122 and 124 of the processors 110 and 112, respectively, while the predefined vocabulary is stored as previously created templates in the external non-volatile memory device 118 which in the preferred embodiment is an Intel bubble memory, Model No. 7110, capable of storing one million bits. In the preferred embodiment, there are only 36 words in the vocabulary, and, hence, 36 templates with 4000 bits required per template on the average. Hence, the bubble memory is capable of storing approximately 250 templates. When templates are needed for comparison with incoming frames of speech data from BPF circuit 108, they are brought from memory 118 into working memory 126 in processor 112.
Referring now to Fig. 2, a more detailed block diagram of the bandpass filter bank circuit 108 is shown. The output from preamp 104 on lead 130 from Fig. 1 is transmitted to an input amplifier stage 200 which has a 3 db bandwidth of 10 kHz. This is followed by a 6 db per octave preemphasis amplifier 202 having selectable cut-in frequencies of 500 or 5000 Hz. This is conventional practice to provide more gain at the higher frequencies than at the lower frequencies since the higher frequencies are generally lower in amplitude in speech data. At the output of amplifier 202 the signal splits and is provided to the inputs of anti-aliasing filters 204 (with a cutoff frequency of 1.4 kHz) and 206 (with a cutoff frequency of 10.5 kHz). These are provided to eliminate aliasing which may result because of subsequent sampling. The outputs of filters 204 and 206 are provided to bandpass filter circuits (BPF) 208 and 210, respectively.
BPF 208 includes channels 1-9 while BPF 210 includes channels 10-19. Each of channels 1-18 contains a one-third octave filter. Channel 19 contains a full octave filter. The channel filters are implemented in a conventional manner using Reticon Model Numbers R5604 and R5606 switched-capacitor devices. Fig. 3 gives the clock input frequency, center frequency and 3 db bandwidth of the 19 channels of the BPF circuits 208 and 210. The bandpass filter clock frequency inputs required for the BPF circuits 208 and 210 are generated in a conventional manner from a clock generator circuit 212 driven by a 1.632 MHz clock.
The outputs of BPF circuits 208 and 210 are rectified, low pass filtered (cutoff frequency = 30 Hz) and sampled simultaneously in 19 sample and hold circuits (National Semiconductor Model No. LF398) in sampling circuitry 214. The 19 channel samples are then multiplexed through multiplexers 216 and 218 (Siliconix Model No. DG506) and converted from analog to digital signals in log A/D converter 220, a Siliconix device, Model No. DF331. The converter 220 has an 8 bit serial output which is converted to a parallel format in serial to parallel register 222 (National Semiconductor Model No. DM86LS62) for input to processor 110 via bus 132.
A 2 MHz clock 224 generates various timing signals for the circuitry 214, multiplexers 216 and 218 and for A/D converter 220. A sample and hold command is sent to circuitry 214 once every 10 milliseconds over lead 215. Then each of the sample and hold circuits is multiplexed sequentially (one every 500 microseconds) in response to a five bit selection signal transmitted via bus 217 to circuits 216 and 218 from timing circuit 226. Four bits are used by each circuit while one bit is used to select which circuit. It therefore takes 10 milliseconds to A/D convert 19 sampled channels plus a ground reference sample. These 20 8-bit digital signals are called a frame of data and they are transmitted over bus 132 at appropriate times to microprocessor 110. Once every frame a status signal is generated from timing generator circuit 226 and provided to processor 110 via lead 228. This signal serves to sync the filter circuit 108 timing to the processor 110 input. Timing generator circuit 226 further provides a 2 kHz data ready strobe via lead 230 to processor 110. This provides 20 interrupt signals per frame to processor 110.
Referring now to Fig. 4, a block diagram of the automatic speech recognition algorithm 400 of the present invention is presented. It can be divided into four subtasks: bandpass filter data transformation 402; speech activity detection 404; variable frame rate encoding and normalized mel-cepstral transformation 406; and recognition 408. The speech activity detection subtask 404 has been implemented in C language for use on a VAX 11/780 and in assembly language for use on an MC68000. C language is a higher order language commonly used in the technical community and available from Western Electric. The C language version of subtask 404 can be found on pages 12 through 16 of the specification. It will be described in more detail in connection with a description of Fig. 7.
As discussed earlier, every 500 microseconds the microprocessor 110 is interrupted by the circuit 108 via lead 230. The software which handles that interrupt is the BPF transformation subtask 402. Usually, the new 8-bit filter value from bus 132 is stored into a buffer, but every 10 milliseconds (the 20th interrupt) a new frame signal is sent via lead 228. The BPF transformation subtask 402 takes the 19 8-bit filter values that were buffered, combines the first three values as the first coefficient and the next two values as the second coefficient, and discards the 19th value because it has been found to contain little if any useful information, especially in a noisy environment. The resulting 15 coefficients characterize one 10 ms frame of the input signal. The transformed frame of speech is passed to buffer 410 and then to the VFR encoding and mel-cepstral transformation subtask 406 if the speech activity detector subtask 404 has indicated that speech is present. The speech activity detector subtask 404 will be explained in more detail later. Assuming for the moment that subtask 404 indicates that speech is present, then in subtask 406, the Euclidean distance between a previously stored frame and the current frame in buffer 410 is determined. If the distance is small (large similarity) and not more than two frames of data have been skipped, the current frame is passed over; otherwise it is stored for future comparison and passed on to the next step of normalized mel-cepstral transformation. On the average one-half of the data frames from the circuit 108 are passed on (i.e. 50 frames per second).
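The 19-to-15 combine step described in the preceding paragraph can be written compactly; a minimal sketch follows. It assumes the same channel indexing and integer divisors as the listing reproduced later in the specification (index 0 is the ground-reference sample and channel 19 is the discarded full-octave channel); the function and variable names are illustrative only.

    #define NCOEF 15

    /* combine 19 buffered 8-bit filter values into 15 frame coefficients */
    void bpf_combine(const short ch[20], int coef[NCOEF])
    {
        int j;
        coef[0] = (ch[1] + ch[2] + ch[3]) / 4;   /* first three filters form the first coefficient */
        coef[1] = (ch[4] + ch[5]) / 2;           /* next two filters form the second coefficient */
        for (j = 2; j < NCOEF; j++)              /* filters 6 through 18 pass through unchanged */
            coef[j] = ch[j + 4];
        /* ch[19] (channel 19) is discarded; ch[0] is the ground-reference sample */
    }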
To reduce the data to be processed, the 15 filter coefficients are reduced to 5 coefficients by a linear transformation matrix. A commonly used matrix comprises a family of 5 "mel-cosine" vectors that transform the bandpass filter data into an approximation of "mel-cepstral" coefficients.
Mel-cosine linear transformations are discussed in (1) Davis, S. B. and Mermelstein, P., "Evaluation of Acoustic Parameters for Monosyllable Word Identification", Journal Acoust. Soc. Am., Vol. 64, Suppl. 1, pp. S180-181, Fall 1978 (Abstract) and (2) S. Davis and P. Mermelstein, "Comparison of Parameter Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences", IEEE Trans. Acoust., Speech, Signal Proc., Vol. ASSP-28, pp. 357-366, August 1980. However, in the preferred embodiment of the present invention, a variation on the "mel-cosine" linear transformation is used, called normalized mel-cepstral transformation, i.e., the raw BPF data is normalized to zero mean, normalized to zero net slope above 500 Hz and mel-cosine transformed in one step. The first mel-cepstral coefficient (which is very sensitive to spectral slope) is not used.
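As a sketch of how such a reduction is applied, the following routine multiplies a frame of 15 bandpass coefficients by a fixed 5 x 15 matrix; the matrix contents (the five "mel-cosine" rows, with any normalization folded in) are not listed in this patent, so they appear here only as a parameter, and all names are illustrative.

    #define NIN  15
    #define NOUT  5

    /* reduce 15 filter coefficients to 5 mel-cepstral-like coefficients */
    void mel_transform(const float melcos[NOUT][NIN], const float bpf[NIN], float cep[NOUT])
    {
        int i, j;
        for (i = 0; i < NOUT; i++) {
            cep[i] = 0.0f;
            for (j = 0; j < NIN; j++)
                cep[i] += melcos[i][j] * bpf[j];   /* one mel-cosine row per output coefficient */
        }
    }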
Each frame which has undergone mel-cepstral transformation is then compared with each of the templates representing the vocabulary which are now stored in the processor's working memory 126. The comparison is done in accordance with a recognition portion 408 of an algorithm based on the well-known dynamic programming algorithm (DPA) which is described in an article by F. Itakura entitled "Minimum Prediction Residual Principle Applied to Speech Recognition", IEEE Trans. Acoustics, Speech and Signal Processing, Vol. ASSP-23, pp. 67-72, February 1975. A modified version of the DPA
may be used, called a windowed DPA with path boundary control.
A summary of the DPA is provided in connection with a description of Fig. 5. A template is placed on the y-axis 502 and the input word to be recognized is placed on the x-axis 504 to form a DPA
matrix 500. Every cell in the matrix corresponds to a one-to-one mapping of a template frame with a word frame. Any time alignment between the frames of these patterns can be represented by a path through the matrix from the lower-left corner to the upper-right corner. A typical alignment path 506 is shown.
The DPA function finds the locally optimal path through the matrix by progressively finding the best path to each cell, D, in the matrix by extending the best path ending in the three adjacent cells labeled by variables, A, B, and C. The path that has the minimum score is selected to be extended to D
subject to the local path constraint: every horizontal or vertical step must be followed by a diagonal step. For example, if a vertical step was made into cell C, the path at cell C cannot be chosen as the best path to cell D. The path score at cell D is updated with the previous path score (from A, B, or C) plus the frame-to-frame distance at cell D. This distance is doubled before adding if a diagonal step was chosen to aid in path score normalization. The movement of the DPA function is along the template axis for each utterance frame. The function just described is repeated in the innermost loop of the recognition algorithm by resetting the B variable to cell D's score, the A variable to cell C's score and retrieving from storage a new value for C.
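A minimal sketch of that cell update follows. The patent does not state which of A, B and C corresponds to which neighbour, so the sketch simply assumes A is the diagonal predecessor, B the vertical one and C the horizontal one, and records whether each cell was entered by a diagonal step so the local path constraint can be enforced on the next step; the names and the large constant are illustrative.

    #define BIG 1000000L

    struct cell { long score; int entered_diagonally; };

    /* extend the best of the three predecessor paths into cell D */
    struct cell dpa_update(struct cell A, struct cell B, struct cell C, long dist)
    {
        struct cell D;
        long a = A.score;                                  /* diagonal step: always allowed          */
        long b = B.entered_diagonally ? B.score : BIG;     /* vertical step: only after a diagonal   */
        long c = C.entered_diagonally ? C.score : BIG;     /* horizontal step: only after a diagonal */

        if (a <= b && a <= c) {
            D.score = a + 2 * dist;                        /* distance doubled on a diagonal step    */
            D.entered_diagonally = 1;
        } else {
            D.score = (b <= c ? b : c) + dist;
            D.entered_diagonally = 0;
        }
        return D;
    }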

However, before the subtasks 406 and 408 can operate, the beginning and end of speech must be detected. Where speech recognition is taking place in a quiet environment with little or no noise present, endpoint detection based on energy measurement can be used. However, in the environment of tactical fighters, for example, there are present two types of noise which render traditional speech activity detectors useless. Background noise from engines and wind is added to the speech signal and results in the classical detection problem of separating signal and additive noise. See curve 602 in Fig. 6. The use of an oxygen regulator with a mask introduces noise from inhales and exhales which are not concurrent with speech but resemble speech in spectral shape and can cause spurious detection. See curves 604 and 606, respectively. The amplitudes of the signals associated with these noise events often exceed the speech signal amplitudes in many cockpit conditions.
Referring now to Fig. 7, a more detailed description of the speech activity detection subtask 404 is given. A large number of frames of data from subtask 402 representing both speech and noise event sounds from a variety of speakers and oxygen regulators were studied to determine a fixed transformation which, when applied to the frames, would provide a good separation between speech and noise events over a range of speakers. It was determined that a single 15 parameter feature vector 702 could be found whose inner product 703 with modified frames 704 derived from the bandpass filter frame 705 would provide a scalar feature 706 giving good separation of speech from noise events. The frames coming from the BPF transformation subtask 402 are logarithmically encoded frames due to the action of the log A/D converter 220. Better results are achieved, however, if frames proportional to the energy of the noise event signals and speech signals are formed. This is accomplished by modifying the BPF frames from 705 via the operation of squaring the inverse log of the frame components 707. This step enhances speech activity detection by increasing the dynamic range of the features, thus providing greater separation between the peaks of the speech spectra and the relatively broad band noise and non-speech spectra.
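A sketch of the scalar-feature computation follows: each log-encoded coefficient is turned back into a linear amplitude, squared to give an energy-like value, and the 15 energies are combined with the fixed feature vector by an inner product. The inverse-log table mirrors the one built in the listing later in the specification; the scaling constants and names are taken from one reading of that listing and should be treated as illustrative.

    #include <math.h>

    #define DIM 15

    static double imusq[128];                     /* inverse log, squared, for each 7-bit code */

    void build_imusq(void)
    {
        int i;
        double c = log(256.0) / 127.0;
        for (i = 0; i < 128; i++) {
            double imu = (exp(c * i) - 1.0) / 85.0;
            imusq[i] = 32767.0 * imu * imu / 9.0;
        }
    }

    /* inner product of the fixed feature vector with the energy-modified frame */
    double scalar_feature(const float coef[DIM], const int frame[DIM])
    {
        int j;
        double cd = 0.0;
        for (j = 0; j < DIM; j++)
            cd += coef[j] * imusq[frame[j] & 127];
        return cd;
    }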

To derive a good feature vector F, a collection of frames of BPF data from a plurality of speakers and noise events occurring when speech is not present are collected and modified as described above. The data is divided into sets of speech frames [S] and noise event frames [N]. By inspection, a good intuitive guess at F is made and then, in accordance with the equation below, the inner products of F with all of [S] and all of [N] are formed, and the statistical overlap of the resulting two classes of scalar features, [F·S] and [F·N], is measured to form a separation figure of merit. (· represents forming the inner product of the two vectors.)

    Separation = (Mean([F·S]) - Mean([F·N])) / (Std Dev([F·S]) + Std Dev([F·N]))

Small changes in each of the feature vector components fj are made; for example, the first component, f1, of F is made a little larger and then a little smaller, then the same is done for f2 and so on. For each small change, F·S and F·N are recomputed for all the frames [S] and [N] and the separation remeasured. This identifies the direction to take to change F for better separation. F is changed accordingly, obtaining a new vector for a starting point, and then the process is repeated. This approach is known as a gradient search.

When a feature vector F is formed which appears to be a significant improvement, it is tried in the recognizer algorithm to see how it works. If certain types of noise events are found to still trigger the detection, or if certain speech sounds are consistently missed, samples of them are taken and added to the data base [S] and [N]. Then a new feature vector is searched for that handles the new data as well as the old.
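The separation figure of merit can be evaluated directly from the two frame collections, as in the sketch below: it dots a candidate vector F with every speech frame and every noise frame and then forms the difference of means over the sum of standard deviations. The gradient search then nudges each component of F up and down and keeps whichever perturbation raises this score, which is what the agcstats.c listing later in the specification does from precomputed means and covariances. Array names and sizes here are illustrative.

    #include <math.h>

    #define DIM 15

    static double dot(const double f[DIM], const double frame[DIM])
    {
        int j; double s = 0.0;
        for (j = 0; j < DIM; j++) s += f[j] * frame[j];
        return s;
    }

    /* separation = (mean speech score - mean noise score) / (sum of standard deviations) */
    double separation(const double f[DIM],
                      const double speech[][DIM], int ns,
                      const double noise[][DIM],  int nn)
    {
        int i;
        double m1 = 0, m2 = 0, v1 = 0, v2 = 0, d;

        for (i = 0; i < ns; i++) m1 += dot(f, speech[i]);
        m1 /= ns;
        for (i = 0; i < nn; i++) m2 += dot(f, noise[i]);
        m2 /= nn;
        for (i = 0; i < ns; i++) { d = dot(f, speech[i]) - m1; v1 += d * d; }
        for (i = 0; i < nn; i++) { d = dot(f, noise[i])  - m2; v2 += d * d; }

        return (m1 - m2) / (sqrt(v1 / ns) + sqrt(v2 / nn));
    }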

To assist in carrying out all the inner product and separation computations required during the gradient search, a program was created in C language for a VAX computer. A listing of the program for a slightly modified gradient search from that described above is found on pages 21 to 23 of the specification.

The preferred embodiment 15 parameter feature vector found by the gradient search as substantially described above is:

    1    0.0
    2    13.9
    3    5.9
    4    1.2
    5    1.4
    6    1.4
    7    1.5
    8    1.6
    9    2.4
    10   1.3
    11   2.0
    12   1.2
    13   4.8
    14   -13.6
    15   0.0

Once the optimum feature vector is determined, the resultant scalar features formed by the inner product operation with the modified frames are collected and formed into a histogram designated generally 710 in Fig. 7. The x-axis 712 is the magnitude of the scalar feature while the y-axis 714 is the number of times a particular magnitude occurs. Jet noise 716 and regulator
sounds 718 occur below a threshold 720 while voice 722 occurs above the threshold 720.
When the speech recognizer is being used, e.g., in flight in an aircraft cockpit, the speech activity detection subtask 404 initially selects a detection threshold but thereafter continually gathers statistics and updates the histogram on the feature 726. Every 1000 frames, the detection threshold is adjusted based on the statistics in the histogram. For example, the peak 750 is located in the histogram 710, and a search is conducted forward from the peak 750 to locate the low point 720. The threshold is set to the low point value plus some bias such as one or two. Finally, each histogram entry is divided by two to keep the histogram values from growing too large.
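A sketch of that periodic update follows: find the histogram peak (taken to be the noise mode), walk forward until the counts fall to a fraction of that peak (the low point), set the threshold just above the low point, and halve every bin so the histogram keeps adapting. The bin-to-value offset and the fractional divisor follow one reading of the listing later in the specification; the names are illustrative.

    #define NBINS       100
    #define HIST_OFFSET  49     /* histogram bin = feature value + HIST_OFFSET */

    int update_threshold(int hist[NBINS], double dipdenom, int bias)
    {
        int x, peak_bin = 0, peak = 0, thr;

        for (x = 0; x < NBINS; x++)              /* locate the peak */
            if (hist[x] > peak) { peak = hist[x]; peak_bin = x; }

        peak = (int)(peak / dipdenom);           /* fraction of the peak that defines the valley */
        x = peak_bin;
        while (x < NBINS - 1 && hist[x] > peak)  /* search forward for the low point */
            x++;

        thr = x - HIST_OFFSET + bias;            /* threshold = low point plus a small bias */

        for (x = 0; x < NBINS; x++)              /* keep the counts from growing too large */
            hist[x] /= 2;

        return thr;
    }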
The magnitude of the detection threshold 708 is subtracted from the magnitude of the scalar feature 706 at block 730 for each frame. A weighting function 732 is applied to the output value of block 730 to smooth out the values before they are filtered and clamped at 734. The weighting function reduces large negative values from block 730 and reduces small positive values. Large positive values are left substantially unaffected. The weighting function cooperates with the integration process performed by the filter and clamp function 734 to provide sharp cutoff points between the beginning and end of speech detection. Large negative values provide no better indication of non-speech than smaller values, but will distort and delay the integration process from indicating when speech is present. Small positive values create uncertainty as to whether speech is present and are better left undetected. An example of the preferred embodiment weighting function and filter and clamping functions is provided in C language on page 19 of the specification.
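A sketch of that stage follows: the weight damps feature values that fall below the running centre of voice-like scores, a first-order filter integrates the weighted values over successive frames, and the clamps keep the integrator from drifting so far that word starts and ends are delayed. The constants follow one reading of the listing on page 19 of the specification and are not asserted to be exact; names are illustrative.

    /* weight, smooth and clamp one thresholded feature value; *vyeso holds the filter state */
    int weight_and_filter(int raw, int thr, int cdbar, int cdsig,
                          int *vyeso, int pclamp, int nclamp)
    {
        float ampscale = 1.0f, tf;
        int vyes  = raw - thr;                    /* subtract the adaptive detection threshold   */
        int below = raw - cdbar;                  /* distance from the centre of voice-like data */

        if (below < 0) {                          /* below the centre: damp the score */
            tf = (float)below / (float)cdsig;
            ampscale = 4.0f / (tf * tf + 4.0f);   /* roughly 0.8 at one standard deviation */
        }
        vyes = (int)(vyes * ampscale);

        *vyeso = 7 * (*vyeso) / 8 + vyes;         /* leaky integration over successive frames  */
        if (*vyeso > pclamp) *vyeso = pclamp;     /* positive clamp: endings drop off sharply  */
        if (*vyeso < nclamp) *vyeso = nclamp;     /* negative clamp: starts come up sharply    */
        return *vyeso;
    }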

Four values from filter and clamp 734 corresponding to four successive frames from subtask 402 are stored in buffers 736. Then multi-frame decision logic 738 is employed to make a decision whether speech is present.
For example, if no speech were present and if all four buffers provide a positive indication, then a decision is made that speech is present, and this is passed on to block 410 in Fig. 4; otherwise a decision is made that speech still is not present. On the other hand, if speech is currently present, a decision is made that speech is still present if any one of the buffers indicates that a speech signal is present. Only if all four buffers indicate no speech signals present will a decision be made that speech is now over. The above-described decoding is provided in C language at pages 19 and 20 of the specification. It should be noted that in the preferred embodiment, subtasks 402, 404 and 406 are performed in processor 110 while subtask 408 is performed in processor 112. However, there is no reason why the two processors could not be combined as one. Although the present invention relates to a 36 word vocabulary with isolated word recognition, there is no reason why the speech activity detector could not be used with larger vocabulary continuous speech recognition machines. Also, speech activity detection through the use of the inner product between a predefined feature vector and frames of speech can be performed on frames of speech provided directly from the bandpass filter transformation subtask 402 even though this frame is proportional to the log of the value of the digital signals. Similarly, the inner product could be performed using frames whose digital signals are proportional to the magnitude of the digital signals and not the magnitude squared.
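The four-frame decision described above reduces to a small state test; a minimal sketch, with illustrative names, is:

    /* vyeso[0..3] hold the filtered values for the last four frames;
       returns 1 while speech is judged present, 0 otherwise */
    int multiframe_decision(int speech_present, const int vyeso[4])
    {
        int i, positives = 0;
        for (i = 0; i < 4; i++)
            if (vyeso[i] > 0) positives++;

        if (!speech_present)
            return positives == 4;   /* speech starts only after four positive frames in a row */
        else
            return positives > 0;    /* speech ends only when all four frames are negative */
    }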
Results to date on the performance of the recognizer indicate recognition accuracy of 85 to 95% for worst cases of cockpit sound pressure level of 115 dB and acceleration forces of 5G. In fact, the system shows no degradation from low level ambient noise performance (95+% accuracy) to noise levels of approximately 106 dB. It should be pointed out, however, that the 115 dB sound levels at 5G acceleration forces are often simulated. The pilot is speaking into an oxygen regulator which partially seals off the ambient cockpit noise. However, the stress of the noise and acceleration forces causes the pilot to speak in a less than normal speaking manner. In addition, the noise events caused by the stressed breathing of the pilot into the oxygen regulator are present.

REL:jn June 28, 1983

epsqnew.c

/* program name: epsqnew.c
   classification of voiced speech vs noise using bpf energy in bins 0 through 19
   (0 & 19 are usually ignored)
   this is true energy (inverse mu squared) version
   adaptive voicing threshold and weighting derived from a rolling histogram
   needs 10 seconds of noise or voice and noise to adapt to environment
   function: reads in raw BPF file and dumps decisions to file of your choice
   load via: cc epsqnew.c -o epsqnew -lm

   Input format is of the form:
       ep /speech/csr/tac/nelson/wc95 0 33000 < data.ep
   "data.ep" is an input datafile with the following format:
       bias:     detection threshold bias
       thr:      adaptive threshold
       Pclamp:   positive clamp value
       Nclamp:   negative clamp value
       options:  threshold offset value
       dipdenom: determines % of peak that defines approximate valley
       coefs:    16 linear BPF coefficients
       scale:    scale factor for coefs

   Original by: S.E. Hutchins, various mods in 1982
   Changed to reflect 68000 arithmetic: S. Stewart and H. Koble, May, June 1983 */
#include <stdio.h>
#include <curses.h>

#define DIM 15                            /* number of features */

main(argc,argv)
int argc; char *argv[];
{
    /* voicing decision variables */
    short buf20[20];                      /* raw bpf data */
    int stat[100];                        /* rolling histogram */
    int vyes,vyeso,vyeso1,vyeso2,vyeso3;  /* voicing indicators */
    int thr,ncount;                       /* threshold, frame counter */
    int sphflg,glfflg,cnvcnt,ctpc;
    int cdbar,cdsig;                      /* stats on assumed voicing */
    int tmp,x,peak,decision;              /* temp variables and final decision */
    int Pclamp;                           /* positive vyeso clamp value */
    int Nclamp;                           /* negative vyeso clamp value */
    int options;                          /* threshold offset value */
    int bias;                             /* bias for histogram low point: +4 for chest reg, else 0 */
    /* logic for word and syllable lengths */
    int state,oldstate,tempframe,fromframe,toframe;
    int buf[20];                          /* new bpf data */
    long cd;                              /* raw feature */
    float ampscale,tf;                    /* used to weight decision */
    float scale;                          /* scales up coef array */
    float dipdenom;                       /* determines % of peak that defines relative valley */
    float coef[DIM];                      /* buffer of feature coefficients */
    float imu;
    float imusq[128];
    double c,log(),exp();
    /* housekeeping and plotting variables */
    int i,j,framecount,firstfr,lastfr;
    char filename[80];
    FILE *fp,*fptag;
    /* arguments */
    if(argc < 2) {
        printf("usage: %s BPF-file [firstframe] [lastframe]\n",argv[0]);
        exit();
    }
    if((fp = fopen(argv[1],"r")) == 0) exit(printf("Unable to open %s\n",argv[1]));
    if(argc > 2) firstfr = atoi(argv[2]); else firstfr = 0;
    if(argc > 3) lastfr = atoi(argv[3]); else lastfr = 30000;
    printf("What is the name of the output file?\n"); fflush(stdin);
    gets(filename);
    if((fptag = fopen(filename,"w")) == 0) {
        printf("unable to open or create file: %s\n",filename);
        exit();
    }

    printf("Enter bias\n");
    scanf(" %d",&bias);           /* detection bias (1 is good start) */
    printf("Enter threshold\n"); fflush(stdout);
    scanf(" %d",&thr);            /* adaptive threshold (3 is good start) */
    printf("Enter positive clamp\n");
    scanf(" %d",&Pclamp);         /* clamp value (12 is 68000 value) */
    printf("Enter negative clamp\n");
    scanf(" %d",&Nclamp);         /* clamp value (-4 is 68000 value) */
    printf("Enter options\n");
    scanf(" %d",&options);        /* threshold offset (3 is good start) */
    printf("Enter dipdenom\n");
    scanf(" %f",&dipdenom);       /* determines % of peak that defines relative valley */
    for(i=0;i<DIM;i++) {
        printf("Enter coef[%d]\n",i);
        scanf(" %f",&coef[i]);
    }
    cdbar = thr + options;        /* guesses for center of voicing */
    cdsig = 2;                    /* and average deviation */
    printf("Enter coef scale factor\n"); scanf(" %f",&scale);
    for(j=0;j<DIM;j++) {
        coef[j] *= scale;
        printf("coef[%d] = %f\n",j,coef[j]);
    }
    vyeso = 0; vyeso1 = 0;        /* initialize decision memories */
    vyeso2 = 0; vyeso3 = 0;
    state = 0; oldstate = 0;      /* initialize voicing logic */
    cnvcnt = 3;
    sphflg = 0; glfflg = 0;
    fromframe = -40; toframe = -40;
    framecount = 0;
    for(x=0;x<100;x++) stat[x] = 0;        /* clear histogram */
    fprintf(fptag,"file processed = %s\n\n",argv[1]);
    fprintf(fptag,"bias = %d options = %d dipdenom = %f\n\n",bias,options,dipdenom);
    fprintf(fptag,"cdsig = %d starting thr = %d\n\n",cdsig,thr);
    fprintf(fptag,"coef array\n");
    for(j=0;j<DIM;j++) fprintf(fptag,"%8.2f ",coef[j]);
    fprintf(fptag,"\n\n");
    fseek(fp,firstfr*sizeof(buf20),0);     /* skip to first frame */

    /* make imusq array */
    c = log(256.0)/127.0;
    for(i=0;i<128;i++) {
        imu = (exp(c*i) - 1.0)/85.0;
        imusq[i] = (int)( 32767.0*imu*imu/9.0 );
    }

    /* Start frame loop */
    for (i=firstfr;i<=lastfr;i++) {
        if(fread(buf20,sizeof buf20,1,fp) <= 0) break;
        /* find new threshold once every 1,000 frames */
        ncount = framecount % 1000;
        if(ncount == 999) {
            peak=0;                        /* note: histogram index = real value + 49 */
            for(x=0;x<100;x++)             /* find peak */
                if(stat[x] > peak) {
                    peak = stat[x];
                    thr = x;
                }
            x = thr;                       /* x = peak location */
            peak = peak/dipdenom;          /* take some % of real peak */
            while(stat[x] > peak) x++;     /* find corresponding x value */
            thr = x - 49 + bias;           /* new threshold = approximate valley - offset + bias */
            cdbar = thr + options;
            /* diagnostics */ printf("\n stats \n");
            for(x=0;x<25;x++) printf(" %d ",stat[x]);
            for(x=25;x<50;x++) printf(" %d ",stat[x]);
            printf("\n");
            for(x=50;x<75;x++) printf(" %d ",stat[x]);
            printf("\n");
            for(x=75;x<100;x++) printf(" %d ",stat[x]);
            printf("\n thr= %d \n",thr);
            /* end diagnostics */ printf("cdbar = %d cdsig = %d \n",cdbar,cdsig);
            for(x=0;x<100;x++)             /* shift histograms down */
                stat[x] = stat[x] >> 1;
        }
        /* end statistics update------begin frame-by-frame work */
        for(j=1;j<19;j++) {
            buf20[j] = buf20[j] >> 1;
            if(buf20[j] < 0) buf20[j] = 0;
            if(buf20[j] > 127) buf20[j] = 127;
        }

        buf[0]=(buf20[1]+buf20[2]+buf20[3])/4;
        buf[1]=(buf20[4]+buf20[5])/2;
        for(j=2;j<DIM;j++) buf[j]=buf20[j+4];
        cd = 0;
        for(j=0;j<DIM;j++) cd += coef[j]*imusq[ buf[j] ];
        /* if((framecount % 50) == 0) {
            printf("cd = %ld\n",cd);
            for(j=0;j<DIM;j++) printf("imusq[buf[%d]] = %f\n",j,imusq[ buf[j] ]);
        } */
        cd = cd >> 16;                     /* just keep most significant word of result */
        vyes = cd;                         /* range: -25 --> 25 for most data */
        if(vyes < -49) vyes = -49;         /* but some noise can go far negative */
        if(vyes > 50) vyes = 50;           /* vyes is most of the decision */
        cd = vyes;                         /* clamp vyes as well */
        stat[vyes + 49]++;                 /* accumulate in the histogram */
        vyes = (vyes - thr);               /* minus variable threshold here */
        /* weighting based on distance from center of voice-like data */
        tmp = cd - cdbar;
        tf = (float)(tmp)/(float)cdsig;    /* damp down score if below mean */
        ampscale = 1.0;                    /* for voice-like things */
        if(tmp < 0) ampscale = 4.0/(tf*tf + 4.0);   /* weight is 0.80 if off 1 std dev */
        vyes = vyes * ampscale;            /* raw weighted voicing decision */
        vyeso = 7 * vyeso / 8 + vyes;      /* smooth the decision */
        if(vyeso > Pclamp) {               /* clamp to help word endings drop off */
            vyeso = Pclamp;
            /* diagnostics */ printf("+");
        }
        if(vyeso < Nclamp) {               /* clamp to help word starts come up */
            vyeso = Nclamp;
            /* diagnostics */ printf("-");
        }

        oldstate = state;
        decision = 0;                      /* look at four frames to decide */
        if(vyeso  > 0) decision = decision + 1;
        if(vyeso1 > 0) decision = decision + 1;
        if(vyeso2 > 0) decision = decision + 1;
        if(vyeso3 > 0) decision = decision + 1;   /* decision=4 for 4 hits in a row */
        vyeso3 = vyeso2;
        vyeso2 = vyeso1;                   /* decision = 4 is the voicing start trigger */
        vyeso1 = vyeso;                    /* it means this is the 4th voiced frame */
        /* once voicing starts decision>0 continues it */
        state = 0;
        if(oldstate == 0 && decision == 4) state = 1;
        if(oldstate == 1 && decision > 0) state = 1;
        if(state == 1 && sphflg == 0) {
            sphflg = 1;
            ctpc = -6;
        }
        if(state == 1) {
            ctpc = ctpc + 1;
            cnvcnt = 3;
        }
        if(state == 0 && sphflg > 0 && ctpc < 0) glfflg = 255;
        if(state == 0 && sphflg > 0 && ctpc >= 0) cnvcnt = cnvcnt + 1;
        if(cnvcnt >= 32) glfflg = 1;
        if(glfflg == 0 && sphflg == 1) {
            sphflg = 255;
            fromframe = framecount - 2;
        }
        if(glfflg == 255) {
            sphflg = 0;
            glfflg = 0;
        }
        if(glfflg == 1) {
            toframe = framecount - 32;
            sphflg = 0;
            glfflg = 0;
            fprintf(fptag,"%d %d \n",fromframe,toframe);
        }
        framecount++;
    }
}

agcstats.c

/* Computes cost function for evaluating a set of bpf coefficient weights
   and performing a gradient search for neighboring vectors with better
   performance.
   Input: files of statistics on speech and noise and initial weight vectors
   Load via: cc agcstats.c -o agcstats -g -lm */
/* Written by A. Higgins, Oct. 1981 */
/* Modified for tactical recog. agc by H. Koble, Aug. 1982 */
/* Completely gutted by Steven Stewart, March 1983 */
/* comments added 6/22/83 by S.E. Hutchins */

#include "stdio.h"
#include "math.h"

#define DIM 15                            /* number of features, i.e. dimension of feature vector */

float cov1[DIM][DIM],cov2[DIM][DIM];      /* speech & noise covariances */
float mean1[DIM],mean2[DIM];              /* speech & noise means */
main(argc,argv)
int argc; char *argv[];
{
    FILE *fpin1,*fpin2,*fpwt;
    int fdout,i,j,k,n,ii,dim,flag;
    int buff[DIM];
    float stp,best;
    float temp[2];
    float weight[DIM],newwt[DIM],wt[DIM];
    double cost();
    char filename[80];

    /* open requested files of statistics */
    if(argc < 4) {
        printf("Usage %s stat_file1 stat_file2 weight_file\n",argv[0]);
        exit();
    }
    if((fpin1 = fopen(argv[1],"r")) == NULL) {
        printf("Unable to open input file: %s\n",argv[1]);
        exit();
    }
    if((fpin2 = fopen(argv[2],"r")) == NULL) {
        printf("Unable to open input file: %s\n",argv[2]);
        exit();
    }
    /* open file of starting vectors */
    if((fpwt = fopen(argv[3],"r")) == NULL) {
        printf("Unable to open weights file: %s\n",argv[3]);
        exit();
    }
    printf("What is the name of the output file?"); fflush(stdout);
    gets(filename);
    if((fdout = creat(filename,0644)) < 0) {
        printf("unable to open or create file: %s\n",filename);
        exit();
    }
    /* read speech statistics */
    if(fread(mean1,sizeof mean1,1,fpin1) <= 0) {
        printf("unable to read mean1 data\n");
        exit();
    }
    if(fread(cov1,sizeof cov1,1,fpin1) <= 0) {
        printf("unable to read cov1 data\n");
        exit();
    }

    /* read noise statistics */
    if(fread(mean2,sizeof mean2,1,fpin2) <= 0) {
        printf("unable to read mean2 data\n");
        exit();
    }
    if(fread(cov2,sizeof cov2,1,fpin2) <= 0) {
        printf("unable to read cov2 data\n");
        exit();
    }
    printf("Enter step size\n");          /* step size for vector perturbation */
    scanf(" %f",&stp);

    /* read a weight vector */
    for(j=0;j<DIM;j++) flag=fscanf(fpwt,"%f",&weight[j]);
    if(flag <= 0) goto endit;
    for(j=0;j<DIM;j++) newwt[j] = weight[j];
    best = cost(newwt);                   /* best = cost(starting vector), */
                                          /* i.e. measure of separation given by initial vector */

    /* iterate twenty times... each time perturb each vector element */
    for(n=0;n<20;n++)
    for(k=1;k<DIM;k++) {                  /* for each dimension */
        for(ii=0;ii<2;ii++) {             /* try +-step value */
            if(ii == 0) newwt[k] = weight[k] - stp;
            else newwt[k] = weight[k] + stp;
            temp[ii] = cost(newwt);
        }
        /* pick the change (or no change) */
        /* with the best separation */
        if(temp[0] > temp[1]) {
            if(temp[0] > best) {
                newwt[k] = weight[k] - stp;
                weight[k] = weight[k] - stp;
                best = temp[0];
            }
            else newwt[k] = weight[k];
        }
        else {
            if(temp[1] > best) {
                newwt[k] = weight[k] + stp;
                weight[k] = weight[k] + stp;
                best = temp[1];
            }
            else newwt[k] = weight[k];
        }
    }

    /* output stats to file */
endit:
    dim = DIM;
    write(fdout,&dim,sizeof(dim));
    write(fdout,weight,sizeof(weight));
    write(fdout,&best,sizeof(best));
    close(fdout);
    fclose(fpin1);
    fclose(fpin2);
    fclose(fpwt);
    printf("\nDone\n");
}
/* end main */
double cost(newwt)    /* compute the statistical separation of the two */
float newwt[];        /* classes of data given the current vector */
{
    int i,j;
    float alpha;
    float wt[DIM];
    float sqs,upper,x,y,sum;
    double fabs(),sqrt();

    sum = 0;
    for(j=0;j<DIM;j++) sum += newwt[j]*newwt[j];
    sqs = sqrt((double)sum);
    for(j=0;j<DIM;j++)
        wt[j] = newwt[j]/sqs;
    upper = 0;
    for(j=0;j<DIM;j++) upper += wt[j]*(mean1[j]-mean2[j]);
    x = 0;
    for(i=0;i<DIM;i++) {
        sum = 0;
        for(j=0;j<DIM;j++) sum += wt[j]*cov1[j][i];
        x += sum*wt[i];
    }
    x = sqrt((double)x);
    y = 0;
    for(i=0;i<DIM;i++) {
        sum = 0;
        for(j=0;j<DIM;j++) sum += wt[j]*cov2[j][i];
        y += sum*wt[i];
    }
    y = sqrt((double)y);
    alpha = upper/(x+y);
    printf("\n\nBPF Coefficient Weights:");
    for(j=0;j<DIM;j++) printf(" %f ",10.0*wt[j]);
    printf("\n\nCost Function:");
    printf("%9.4f",fabs(alpha));
    return(fabs(alpha));
}

While the present invention has been disclosed in connection with the preferred embodiment thereof, it should be understood that there may be other embodiments which fall within the spirit and scope of the invention as defined by the following claims.


Claims (18)

THE EMBODIMENTS OF THE INVENTION IN WHICH AN EXCLUSIVE
PROPERTY OR PRIVILEGE IS CLAIMED ARE DEFINED AS FOLLOWS:
1. A method of speech activity detection in the presence of noise including noise events occurring when speech is not present, comprising the step of: automatically separating signals associated with said speech from signals associated with said noise events; including the substeps of: frequency filtering said speech and noise event signals to provide a plurality of filter output signals; digitizing said filter output signals and repeatedly forming frames having a plurality of digital signal values associated with said filter output signals; and applying a speaker independent, predetermined, fixed transformation to said digital signal values of said frames whereby frames associated with said speech signals are separated from frames associated with said noise event signals.
2. The method of claim 1, wherein the magnitude of said noise event signals are equal to or greater than the magnitude of said speech signals.
3. The method of claim 1 wherein the substep of applying a predetermined, fixed transformation creates a scalar feature for most of said frames associated with said speech signals which has a magnitude greater than the magnitude of said scalar feature associated with frames associated with said noise event signals.
4. The method of claim 3 wherein said method further comprises the steps of: storing the magnitudes of said scalar features associated with said frames; repeatedly establishing a detection threshold value from said stored magnitudes; comparing said scalar features of each frame with said detection threshold value to separate said speech signals from said noise signals absent said speech signals.
5. The method of claim 4 wherein said step of storing said scalar feature magnitudes comprises: forming a histogram of scalar feature magnitudes from said stored magnitudes, and said step of repeatedly establishing a detection threshold value is performed once every N frames where N is approximately 1000.
6. The method of claim 4 wherein the step of comparing comprises subtracting said detection threshold value from said scalar feature magnitude to create a raw feature value and wherein said method further comprises the steps of: storing a plurality of said raw feature values associated with a plurality of successive frames; and decoding said plurality of raw feature values in a predetermined manner to indicate when said speech signals are present.
7. The method of claim 3 wherein said substep of applying a transformation to said digital signal values comprises: forming a fixed linear feature vector having a plurality of elements equal in number to said plurality of digital signal values in each frame; and forming an inner product between said linear feature vector and each of said frames of digital signal values.
8. The method of claim 1 wherein said plurality of digital signal values of said frames are related to the square of the magnitude of said speech and noise event signals.
9. An apparatus for speech activity detection of speech in the presence of noise including noise events occurring when speech is not present comprising: means for digitizing signals associated with said speech signals and signals associated with said noise events and for forming frames of digital signal values associated with said speech and noise event signals; and separation means coupled to said digitizing means for automatically separating said speech signals from said noise event signals, said separation means further comprising means for applying a speaker independent, predetermined, fixed transformation to said digital signal values of said frames whereby frames associated with said speech signals are separated from frames associated with said noise event signals.
10. The apparatus of claim 9 wherein said means for applying said speaker independent, predetermined, fixed transformation comprises: means for creating scalar features from said frames; and wherein said separation means further comprises:
means for establishing and updating a detection threshold value wherein frames associated with scalar features having a magnitude less than said detection threshold value are considered as associated with noise event signals while frames associated with scalar features having magnitudes greater than said detection threshold values are considered as associated with speech signals.
11. The apparatus of claim 10 wherein said apparatus further comprises: means for comparing said scalar features with said detection threshold value; means for storing the results of a plurality of said comparisons for a plurality of successive frames; and means for combining said stored results to obtain an indication of when speech signals are present.
12. The invention of claim 8 wherein the magnitude of said noise event signals are equal to or greater than the magnitude of said speech signals.
13. The invention of claim 9 wherein said digital signal values of said frames are related to the square of the magnitude of said speech and noise event signals.
14. An apparatus for automatic recognition of speech in the presence of noise including noise events occurring when speech is not present comprising: means for digitizing signals associated with said speech and signals associated with said noise events and for forming frames of digital signal values associated with said speech and noise event signals; speech activity means coupled to said digitizing means for automatically separating said speech signals from said noise event signals to determine when said speech signals are present; speech recognition means coupled to said digitizing means and said speech activity means for converting said frames into frames of parametric data more suitable for further recognition processing when said speech activity means determines that speech signals are present; and means coupled to said recognition means for comparing selected ones of said frames of parametric data with a plurality of templates which are representative of said speech to be recognized whereby said speech signals are recognized; wherein said speech activity means further comprises:
means for creating scalar features from said frames; means for establishing and updating a detection threshold value wherein frames associated with scalar features having a magnitude less than said detection threshold value are considered as associated with noise event signals while frames associated with scalar features having magnitudes greater than said detection threshold value are considered as associated with speech signals;
means for comparing scalar features with said detection threshold values; means for storing the results of a plurality of said comparisons for a plurality of successive frames; and means for combining said stored results to obtain an indication of when speech signals are present.
15. The means coupled to said recognition means of claim 14 wherein said comparison is done in accordance with a dynamic programming algorithm (DPA).
16. The invention of claim 14 wherein said speech activity means further comprises: means for creating scalar features from said frames; means for establishing and updating a detection threshold value wherein frames associated with scalar features having a magnitude less than said detection threshold value are considered as associated with noise event signals while frames associated with scalar features having magnitudes greater than said detection threshold value are considered as associated with speech signals; means for comparing said scalar features with said detection threshold values; means for storing the results of a plurality of said comparisons for a plurality of successive frames; and means for combining said stored results to obtain an indication of when speech signals are present.
17. The invention of claim 14 wherein the magnitude of said noise event signals is equal to or greater than the magnitude of said speech event signals.
18. The invention of claim 14 wherein said apparatus further comprises means for modifying said frames of digital signals coupled to said speech activity means to form modified frames of digital signals wherein said digital signal values are related to the square of the magnitude of said speech and noise event signals.
CA000458275A 1983-07-08 1984-07-06 Apparatus and method for automatic speech activity detection Expired CA1218458A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US51206883A 1983-07-08 1983-07-08
US512,068 1983-07-08

Publications (1)

Publication Number Publication Date
CA1218458A true CA1218458A (en) 1987-02-24

Family

ID=24037538

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000458275A Expired CA1218458A (en) 1983-07-08 1984-07-06 Apparatus and method for automatic speech activity detection

Country Status (3)

Country Link
EP (1) EP0143161A1 (en)
JP (1) JPS6039695A (en)
CA (1) CA1218458A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9959887B2 (en) 2016-03-08 2018-05-01 International Business Machines Corporation Multi-pass speech activity detection strategy to improve automatic speech recognition

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2278984A (en) * 1993-06-11 1994-12-14 Redifon Technology Limited Speech presence detector
GB2422279A (en) * 2004-09-29 2006-07-19 Fluency Voice Technology Ltd Determining Pattern End-Point in an Input Signal
KR102643501B1 (en) * 2016-12-26 2024-03-06 현대자동차주식회사 Dialogue processing apparatus, vehicle having the same and dialogue processing method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2433800A1 (en) * 1978-08-17 1980-03-14 Thomson Csf SPEECH DISCRIMINATOR AND RECEIVER HAVING SUCH A DISCRIMINATOR
JPS56135898A (en) * 1980-03-26 1981-10-23 Sanyo Electric Co Voice recognition device
JPS5797599A (en) * 1980-12-10 1982-06-17 Matsushita Electric Ind Co Ltd System of detecting final end of each voice section
JPS57177197A (en) * 1981-04-24 1982-10-30 Hitachi Ltd Pick-up system for sound section
JPS5876899A (en) * 1981-10-31 1983-05-10 株式会社東芝 Voice segment detector

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9959887B2 (en) 2016-03-08 2018-05-01 International Business Machines Corporation Multi-pass speech activity detection strategy to improve automatic speech recognition

Also Published As

Publication number Publication date
JPS6039695A (en) 1985-03-01
EP0143161A1 (en) 1985-06-05

Similar Documents

Publication Publication Date Title
US4624008A (en) Apparatus for automatic speech recognition
US4720863A (en) Method and apparatus for text-independent speaker recognition
US6859773B2 (en) Method and device for voice recognition in environments with fluctuating noise levels
US3946157A (en) Speech recognition device for controlling a machine
JP3002204B2 (en) Time-series signal recognition device
Christiansen et al. Detecting and locating key words in continuous speech using linear predictive coding
JPH08508107A (en) Method and apparatus for speaker recognition
US5144672A (en) Speech recognition apparatus including speaker-independent dictionary and speaker-dependent
Nwe et al. Detection of stress and emotion in speech using traditional and FFT based log energy features
US4589131A (en) Voiced/unvoiced decision using sequential decisions
US4665548A (en) Speech analysis syllabic segmenter
CN113823293A (en) Speaker recognition method and system based on voice enhancement
Parihar et al. Analysis of the Aurora large vocabulary evaluations.
CA1218458A (en) Apparatus and method for automatic speech activity detection
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
JP2992324B2 (en) Voice section detection method
JPS63502304A (en) Frame comparison method for language recognition in high noise environments
Gemello et al. Multi-source neural networks for speech recognition: a review of recent results
JP3100180B2 (en) Voice recognition method
Nijhawan et al. A comparative study of two different neural models for speaker recognition systems
Zebulum et al. A comparison of different spectral analysis models for speech recognition using neural networks
Shikano Acoustic processing in the conversational speech recognition system
Elghonemy et al. Speaker independent isolated Arabic word recognition system
Vysotsky A speaker-independent discrete utterance recognition system, combining deterministic and probabilistic strategies

Legal Events

Date Code Title Description
MKEX Expiry