CA1324833C - Method and apparatus for synthesizing speech without voicing or pitch information - Google Patents

Method and apparatus for synthesizing speech without voicing or pitch information

Info

Publication number
CA1324833C
CA1324833C (application CA000526482A)
Authority
CA
Canada
Prior art keywords
speech
word
channel
excitation signal
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CA000526482A
Other languages
French (fr)
Inventor
David Edward Borth
Ira Alan Gerson
Richard Joseph Vilmur
Brett Louis Lindsley
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Motorola Solutions Inc
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc filed Critical Motorola Inc
Application granted granted Critical
Publication of CA1324833C publication Critical patent/CA1324833C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

ABSTRACT OF THE DISCLOSURE

A channel bank speech synthesizer for reconstructing speech from externally-generated acoustic feature information without using externally-generated voicing or pitch information is disclosed. An N-channel pitch-excited channel bank synthesizer (340) is provided having a first low-frequency group of channel gain values (1 to M) and a second high-frequency group of channel gain values (M+1 to N). The first group controls a first group of amplitude modulators (950) excited by a periodic pitch pulse source (920), and the second group controls amplitude modulators excited by a noise source (930).
Both groups of modulated excitation signals are applied to the bandpass filters (960) to reconstruct the speech channels, and then combined at the summation network (970) to form a reconstructed synthesized speech signal.
Additionally, the pitch pulse source (920) varies the pitch pulse period such that the pitch pulse rate decreases over the length of the word.

Description

METHOD AND APPARATUS FOR SYNTHESIZING SPEECH
WITHOUT VOICING OR PITCH INFORMATION

Background of the Invention

The present invention relates generally to speech synthesis, and more particularly, to a channel bank speech synthesizer operating without externally-generated voicing or pitch information.
Speech synthesizer networks generally accept digital data and translate it into acoustic speech signals representative of human voice. Various techniques are known in the art for synthesizing speech from this acoustic feature data. For example, pulse code modulation, linear predictive coding, delta modulation, channel bank synthesizers, and formant synthesizers are known synthesizing techniques. The particular type of synthesizer technology is typically chosen by comparing the size, cost, reliability and voice quality requirements of the specific synthesis application.
The further development of present-day speech synthesis systems is hindered by the inherent problem that the complexity and storage requirements of the synthesizer system dramatically increase with the vocabulary size. Additionally, the words spoken by the typical synthesizer are often of poor fidelity and difficult to understand. Nevertheless, the trade-off between vocabulary and voice intelligibility has all too often been decided in terms of a larger vocabulary for enhanced user features. This determination generally results in a harsh, robot-like "buzziness" in the synthesized speech.

Recently, several approaches have been taken to solve the problem of unnatural sounding synthesized speech. Obviously, the reverse trade-off --to maximize voice quality at the expense of speech synthesis system complexity-- can be made. It is well known in the art that a high data rate digital computer, synthesizing speech from an infinite memory source, can create the ideal situation of unlimited vocabulary with negligible voice quality degradation. However, such devices tend to be much too bulky, very complicated, and prohibitively expensive for most modern applications.
Pitch-excited channel bank synthesizers have frequently been used as a simple, low cost means for synthesizing speech at a low data rate. The standard channel bank synthesizer consists of a number of gain-controlled bandpass filters, and a spectrally-flat excitation source made up of a pitch pulse generator for voiced excitation (buzz) and a noise generator for unvoiced excitation (hiss). The channel bank synthesizer utilizes externally-generated acoustic energy measurements (derived from human voice parameters) to adjust the gains of the individual filters. The excitation source is controlled by a known voiced/unvoiced control signal (prestored or provided from an external source) and a known pitch pulse rate.
A renewed interest in channel vocoders has led to a wide variety of proposals to improve the quality of low data rate synthesized speech. Fujimura, in an article entitled "An Approximation to Voice Aperiodicity", IEEE Transactions on Audio and Electroacoustics, vol. AU-16, no. 1, pp. 68-72 (March 1968), describes a technique called "partial devoicing" --partially replacing voiced excitation of the high-frequency ranges by random noise-- to make the synthesized sound less mechanically "buzzy".
On the other hand, Coulter, in U.S. Patent no. 3,903,666, purports to improve the performance of channel vocoders by connecting the pitch pulse source to the lowest channel of the vocoder synthesizer at all times.
Alternatively, the article entitled "The JSRU Channel Vocoder", IEE Proceedings, vol. 127, part F, no. 1, pp. 53-60 (February 1980), by J.N. Holmes describes a technique for reducing the "buzzy" quality of voiced sounds by varying the bandwidth of the high-order channel filter in response to the voiced/unvoiced decision.
Several other approaches were taken to the "buzziness" problem in the context of LPC vocoders. "A Mixed-Source Model for Speech Compression and Synthesis" by J. Makhoul, R. Viswanathan, R. Schwartz, and A.W.F. Huggins, 1978 International Conference on Acoustics, Speech, and Signal Processing, pp. 163-166 (April 10-12, 1978), describes an excitation source model which permits varying degrees of voicing by mixing voiced (pulse) and unvoiced (noise) excitations in a frequency-selective manner. Yet another approach was taken by M. Sambur, A. Rosenberg, L. Rabiner, and C. McGonegal, in an article entitled "On Reducing the Buzz in LPC Synthesis", 1977 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 401-404 (May 9-11, 1977). Sambur et al. reported a reduction in buzziness by changing the pulse width of the excitation source to be proportional to the pitch period during voiced excitation. Still another approach, that of modulating the amplitude of the excitation signal (from a substantially zero value to a constant value and then back to zero) was taken by Vogten et al. in U.S. Patent no. 4,374,302.
All of the above prior art techniques are directed toward improving the voice quality of a low data rate speech synthesizer through modification of the voicing and pitch parameters. Under normal circumstances, this voicing and pitch information is readily accessible. However, none of the known prior art techniques are viable for speech synthesis applications in which voicing or pitch parameters are not available.
For example, in the present application of synthesizing speech recognition templates, voicing and pitch parameters are not stored, since they are not required for speech recognition. Hence, to accomplish speech synthesis from recognition templates, the synthesis must be performed without prestored voicing or pitch information.
It is believed that most practitioners skilled in the art of speech synthesis would predict that any computer-generated voice, created without externally accessible voicing and pitch information, would sound extremely robot-like and highly objectionable. To the contrary, the present invention teaches a method and apparatus for synthesizing natural-sounding speech for applications in which voicing or pitch cannot be provided.
Summary of the Invention

Accordingly, it is the general object of the present invention to provide a method and apparatus for synthesizing speech without voicing or pitch information.
A more particular object of the present invention is to provide a method and apparatus for synthesizing speech from speech recognition templates which do not contain prestored voicing or pitch information.
Another object of the present invention is to reduce the storage requirements and increase the flexibility of a speech synthesis device employing a substantial vocabulary.
A particular, but not exclusive, application of the present invention is in a hands-free vehicular radiotelephone control and dialing system which synthesizes speech from speech recognition templates without prestored voicing or pitch information.


Accordingly, the present invention provides a speech synthesizer for reconstructing speech from externally-generated acoustic feature information without using external voicing or pitch information. The speech synthesizer of the present invention employs a technique of "split voicing" with a technique for varying the pitch pulse rate. The speech synthesizer comprises: a means for generating a first and second excitation signal, the first excitation signal being representative of random noise (hiss), the second excitation signal being representative of periodic pulses of a predetermined rate (buzz); a means for amplitude modulating the first excitation signal (hiss) in response to a first predetermined group of acoustic feature channel gain values, and for amplitude modulating the second excitation signal (buzz) in response to a second predetermined group of channel gain values, thereby producing corresponding first and second groups of channel outputs; a means for bandpass filtering these first and second groups of channel outputs to produce corresponding first and second groups of filtered channel outputs; and a means for combining each of the first and second groups of filtered channel outputs to form the reconstructed speech signal.
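The claimed arrangement can be pictured with a short signal-flow sketch. The following Python fragment is an illustrative sketch only, not the patented implementation: the sample rate, channel center frequencies, split point M, filter shapes, and per-frame gain hold are all assumptions chosen for clarity.

    import numpy as np
    from scipy.signal import butter, lfilter

    FS = 8000                      # sample rate (assumed)
    M = 7                          # split point: channels 1..M get buzz, M+1..N get hiss
    CENTERS = [200, 300, 425, 575, 750, 950, 1175, 1425,
               1700, 2000, 2300, 2650, 3000, 3400]   # hypothetical channel centers (Hz)

    def excitations(n, pitch_period):
        """Return (buzz, hiss): a periodic pulse train and white noise, both length n."""
        buzz = np.zeros(n)
        buzz[::pitch_period] = 1.0
        hiss = np.random.randn(n)
        return buzz, hiss

    def synthesize(gains, pitch_period=80):
        """gains: (frames, 14) array of linear channel gains, one row per 10 ms frame."""
        samples_per_frame = FS // 100
        n = gains.shape[0] * samples_per_frame
        buzz, hiss = excitations(n, pitch_period)
        # Hold each frame's gain for 10 ms (the patent low-pass filters the gains instead).
        g = np.repeat(gains, samples_per_frame, axis=0)
        out = np.zeros(n)
        for ch, fc in enumerate(CENTERS):
            source = buzz if ch < M else hiss          # split voicing
            modulated = g[:, ch] * source              # amplitude modulators (950)
            b, a = butter(2, [fc * 0.85, fc * 1.15], btype='bandpass', fs=FS)
            out += lfilter(b, a, modulated)            # bandpass (960) and summation (970)
        return out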
In an embodiment illustrative of the present invention, a 14-channel bank synthesizer is provided having a first low-frequency group of channel gain values and a second high-frequency group of channel gain values.
Both groups of channel gain values are first low-pass filtered to smooth the channel gains. Then the first low-frequency group of filtered channel gain values controls a first group of amplitude modulators excited by a periodic pitch pulse source. The second high-frequency group of filtered channel gain values is applied to a second group of amplitude modulators excited by a noise source. Both groups of modulated excitation signals --the low-frequency (buzz) group and the high-frequency (hiss) group-- are then bandpass filtered to reconstruct the speech channels. All the bandpass filter outputs are then combined to form a reconstructed synthesized speech signal. Furthermore, the pitch pulse source varies the pitch pulse period such that the pitch pulse rate decreases over the length of the word. This combination of split voicing and variable pitch pulse rate allows natural-sounding speech to be generated without external voicing or pitch information.
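The decreasing pitch pulse rate can likewise be sketched in a few lines. The linear fall and the 110 Hz to 90 Hz range below are illustrative assumptions; the text requires only that the pulse rate decrease over the length of the word.

    import numpy as np

    def pitch_pulse_train(n_samples, fs=8000, f_start=110.0, f_end=90.0):
        """Pulse train whose rate falls over the utterance, approximating natural
        pitch declination. f_start and f_end are illustrative values only."""
        pulses = np.zeros(n_samples)
        t = 0.0
        while t < n_samples:
            pulses[int(t)] = 1.0
            frac = t / n_samples                       # 0 at word start, 1 at end
            f = f_start + (f_end - f_start) * frac     # linearly decreasing rate
            t += fs / f                                # next pitch period in samples
        return pulses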

Brief Description of the Drawings

Additional objects, features, and advantages in accordance with the present invention will be more clearly understood by reference to the following description taken in connection with the accompanying drawings, in the several figures of which like reference numerals identify like elements, and in which:
Figure 1 is a general block diagram illustrating the technique of synthesizing speech from speech recognition templates according to the present invention;
Figure 2 is a block diagram of a speech communications device having a user-interactive control system employing speech recognition and speech synthesis in accordance with the present invention;
Figure 3 is a detailed block diagram of the preferred embodiment of the present invention illustrating a radio transceiver having a hands-free speech recognition/speech synthesis control system;
Figure 4a is a detailed block diagram of the data reducer block 322 of Figure 3;
Figure 4b is a flowchart showing the sequence of steps performed by the energy normalization block 410 of Figure 4a;
Figure 4c is a detailed block diagram of the particular hardware configuration of the segmentation/compression block 420 of Figure 4a;
Figure 5a is a graphical representation of a spoken word segmented into frames for forming a cluster according to the present invention;
Figure 5b is a diagram exemplifying output clusters being formed for a particular word template according to the present invention;
Figure 5c is a table showing the possible formations of an arbitrary partial cluster path according to the present invention;
Figures 5d and 5e show a flowchart illustrating a basic implementation of the data reduction process performed by the segmentation/compression block 420 of Figure 4a;
Figure 5f is a detailed flowchart of the traceback and output clusters block 582 of Figure 5e, showing the formation of a data reduced word template from previously determined clusters;
Figure 5g is a traceback pointer table illustrating a clustering path for 24 frames, according to the present invention, applicable to partial traceback;
Figure 5h is a graphical representation of the traceback pointer table of Figure 5g illustrated in the form of a frame connection tree;
Figure 5i is a graphical representation of Figure 5h showing the frame connection tree after three clusters have been output by tracing back to common frames in the tree;
Figures 6a and 6b comprise a flowchart showing the sequence of steps performed by the differential encoding block 430 of Figure 4a;
Figure 6c is a generalized memory map showing the particular data format of one frame of the template memory 160 of Figure 3;
Figure 7a is a graphical representation of frames clustered into average frames, each average frame represented by a state in a word model, in accordance with the present invention;
Figure 7b is a detailed block diagram of the recognition processor 120 of Figure 3, illustrating its relationship with the template memory 160;
Figure 7c is a flowchart illustrating one embodiment of the sequence of steps required for word decoding according to the present invention;
Figures 7d and 7e comprise a flowchart illustrating one embodiment of the steps required for state decoding according to the present invention;
Figure 8a is a detailed block diagram of the data expander block 346 of Figure 3;
Figure 8b is a flowchart showing the sequence of steps performed by the differential decoding block 802 of Figure 8a;
Figure 8c is a flowchart showing the sequence of steps performed by the energy denormalization block 804 of Figure 8a;
Figure 8d is a flowchart showing the sequence of steps performed by the frame repeating block 806 of Figure 8a;
Figure 9a is a detailed block diagram of the channel bank speech synthesizer 340 of Figure 3;
Figure 9b is an alternate embodiment of the modulator/bandpass filter configuration 980 of Figure 9a;
Figure 9c is a detailed block diagram of the preferred embodiment of the pitch pulse source 920 of Figure 9a;
Figure 9d is a graphic representation illustrating various waveforms of Figures 9a and 9c.


Description of the Preferred Embodiment

1. System Configuration

Referring now to the accompanying drawings, Figure 1 shows a general block diagram of user-interactive control system 100 of the present invention. Electronic device 150 may include any electronic apparatus that is sophisticated enough to warrant the incorporation of a speech recognition/speech synthesis control system. In the preferred embodiment, electronic device 150 represents a speech communications device such as a mobile radiotelephone.
User-spoken input speech is applied to microphone 105, which acts as an acoustic coupler providing an electrical input speech signal for the control system.
Acoustic processor 110 performs acoustic feature extraction upon the input speech signal. Word features, defined as the amplitude/frequency parameters of each user-spoken input word, are thereby provided to speech recognition processor 120 and to training processor 170.
Acoustic processor 110 may also include a signal conditioner, such as an analog-to-digital converter, to interface the input speech signal to the speech recognition control system. Acoustic processor 110 will be further described in conjunction with Figure 3.
Training processor 170 manipulates this word feature information from acoustic processor 110 to provide word recognition templates to be stored in template memory 160. During the training procedure, the incoming word features are arranged into individual words by locating their endpoints. If the training procedure is designed to accommodate multiple training utterances for word feature consistency, then the multiple utterances may be averaged to form a single word template. Furthermore, since most speech recognition systems do not require all of the speech information to be stored as a template, some type of data reduction is often performed by training processor 170 to reduce the template memory requirements. The word templates are stored in template memory 160 for use by speech recognition processor 120 as well as by speech synthesis processor 140. The exact training procedure utilized by the preferred embodiment of the present invention may be found in the description accompanying Figure 2.
In the recognition mode, speech recognition processor 120 compares the word feature information provided by acoustic processor 110 to the word recognition templates provided by template memory 160.
If the acoustic features of the present word feature information derived from the user-spoken input speech sufficiently match the acoustic features of a particular prestored word template derived from the template memory, then recognition processor 120 provides device control data to device controller 130 indicative of the particular word recognized. A further discussion of an appropriate speech recognition apparatus, and how the preferred embodiment incorporates data reduction into the training process, may be found in the description accompanying Figures 3 through 5.
Device controller 130 interfaces the entire control system to electronic device 150. Device controller 130 translates the device control data provided by recognition processor 120 into control signals adaptable for use by the particular electronic device. These control signals direct the device to perform specific operating functions as instructed by the user. (Device controller 130 may also perform additional supervisory functions related to other elements shown in Figure 1.) An example of a device controller known in the art and suitable for use with the present invention is a microcomputer. Refer to Figure 3 for further details of the hardware implementation.
Device controller 130 also provides device status data representing the operating status of electronic device 150. This data is applied to speech synthesis processor 140, along with word recognition templates from template memory 160. Synthesis processor 140 utilizes the status data to determine which word recognition template is to be synthesized into user-recognizable reply speech. Synthesis processor 140 may also include an internal reply memory, also controlled by the status data, to provide "canned" reply words to the user. In either case, the user is informed of the electronic device operating status when the speech reply signal is output via speaker 145.
Thus, Figure 1 illustrates how the present invention provides a user-interactive control system utilizing speech recognition to control the operating parameters of an electronic device, and how a speech recognition template may be utilized to generate reply speech to the user indicative of the operating status of the device.
Figure 2 illustrates in more detail the application of the user-interactive control system to a speech communications device comprising a part of any radio or landline voice communications system, such as, for example, a two-way radio system, a telephone system, an intercom system, etc. Acoustic processor 110, recognition processor 120, template memory 160, and device controller 130 are the same in structure and in operation as the corresponding blocks of Figure 1.
However, control system 200 illustrates the internal structure of speech communications device 210.
Speech communications terminal 225 represents the main electronic network of device 210, such as, for example, a telephone terminal or a communications console. In this embodiment, microphone 205 and speaker 245 are incorporated into the speech communications device itself. A typical example of this microphone/speaker arrangement would be a telephone handset. Speech communications terminal 225 interfaces operating status information of the speech communications device to device controller 130. This operating status information may comprise functional status data of the terminal itself (e.g., channel data, service information, operating mode messages, etc.), user-feedback information of the speech recognition control system (e.g., directory contents, word recognition verification, operating mode status, etc.), or may include system status data pertaining to the communications link (e.g., loss-of-line, system busy, invalid access code, etc.).
In either the training mode or the recognition mode, the features of user-spoken input speech are extracted by acoustic processor 110. In the training mode, which is represented in Figure 2 by position "A" of switch 215, the word feature information is applied to word averager 220 of training processor 170. As previously mentioned, if the system is designed to average multiple utterances together to form a single word template, the averaging is performed by word averager 220. Through the use of word averaging, the training processor can take into account the minor variances between two or more utterances of the same word, thereby producing a more reliable word template.
Numerous word averaging techniques may be used. For example, one method would be to combine only the similar word features of all training utterances to produce a "best" set of features for the word template. Another technique may be to simply compare all training utterances to determine which one provides the "best" template. Still another word averaging technique is described by L.R. Rabiner and J.G. Wilpon in "A Simplified Robust Training Procedure for Speaker Trained, Isolated Word Recognition Systems", Journal of the Acoustical Society of America, vol. 68 (November 1980), pp. 1271-76.
The reduced word feature data provided by training processor 170 is stored as word recognition templates in template ~emory 160~ In the recognition mode, which is illustrated by position "B" of switch 215, recognition procQssor 120 compares the incoming word feature signals to the word recognition templates. Upon recognition of a valid command word, recognition processor 120 may instruct device controller 130 to cause a corresponding speech communications device control function to be executed by speech communications terminal 225. Terminal 225 may respond to device controller 130 by sending operating status information back to controller 130 in the form of terminal status data. This data can be used by the control system to synthesize the appropriate speech reply signal to inform the user of the present device operating status. This sequence of events will be more clearly understood by referring to the subsequent example.

;, ` - . -''.
- " :
,. ~ , .. ,~

- . . ~

:``

13 2 ~ ~ ~ 3 CMo0249G
- Synthesis processor 140 is comprised of speech .
synthesizer 240, data expander 250, and reply memory 260.
A synthesis processor of this configuration is capable of generating ~canned" replies to the user from a prestored vocabulary (stored in reply memory 260), as well as generating "template" responses from a user-generated vocabulary (stored in template memory 160). Speech syn~hesizar 240 and reply memory 260 are further described in conjunction with Figure 3, and data expander 250 is fully described in the text accompanying Figure 8a. In combination, the bloc~s of synthesis processor 140 generate a speech reply signal to speaker 245. Accordingly, Figure 2 illustrates the technique of using a single template memory for both speech recognition and speech synthesis.
The simplified example of a "smart" telephone ~erminal employing voice-controlled dialing from a stored telephone number directory is now used to describe the operation of the control system of Figure 2. Initially, an untrained speaker-dependent speech recognition system cannot recognize command words. Therefore, the user must manually prompt the device to begin the training procedure, perhaps by entering a particular code into the telephone keypad. Device controller 130 then directs switch 21~ to enter the training mode (position "A").
Device controller 130 then instructs speech synthesizer 240 to respond with the predefined phrase TRAINING
VOCABULARY ONE, which is a "canned" response obtained from reply memory 260. The user then begins to build a command word vocabulary by uttering command words, such as STORE or RECALL, into microphone 205. The features of the utterance are first extracted by acoustic processor 110, and then applied to either word averager 220 or data reducer 230. If thle particular speech recognition system is designed to accept multiple utterances of the same word, word averager 220 produces a set of averaged word . .
.:

. ~ .
`_ ' !

15 13 2 ~ 8 ~ 3 CMoo249G

features representing the best rspresentation of that particular word. If the system does not have word averaging capabilities, the single utterance word 05 features (rather than the multiple utterance averaged word features) are applied to data reducer 230. The data ~` reduction process removes unnecessary or duplicate feature data, compresses the remaining data, and provides template memory 160 with "reduced" word recognition x 10 templates. A similar procedure is followed for training the sys~em to recognize digits.
once the system is trained with the command word vocabulary, the user must continue the training procedure ~- by entering telephone directory names and numbers. To ; 15 accomplish this task, the user utters the previously-;~ trained command word ENTER. Upon recognition of this utterance as a valid user command, device controller 130 instructs speech synthesizer 240 to reply with the "canned" phrase DIGITS ~LEASE? stored in reply memory ` 20 260. Upon entering the appropriate telephone number digits (e.g., 555-1234), the user says TERNINATE and the system replys NANE PLEASE? to prompt user-entry of the corresponding directory name (e.g., SMITH). This - user-interactive process continues until the telephone ` 25 number directory is completely filled with the - appropriate telephone names and digits.
-~ To place a phone call, the user simply utters the command word RECALL. When the utterance is recognized as a valid user command by recognition processor 120, device controller 130 directs speech synthesizer 240 to generate ` the verbal reply NAME? via synthesizing information - provided by reply memory 260. The user then responds by `~ speaking the name in the directory index corresponding to * the telephone number that he desires to dial (e.g.
~` 35 JONES). The word will be recognized as a valid directory entry if it corresponds to a predetermined name index ^. controller 130 directs data expander 250 to obtain the ..

.
. ............................... .
~ ~ .

-- 16 -- 1324~33 CM00249G

`` appropriate reduced word recognition template from template memory 160 and perform the data expansion process for synthesis. Data expander 250 "unpacks" the 05 reduced word feature data and restores the proper energy contour for an intelligible reply word. The expanded - word template data is then fed to speech synthesizer 240.
Using both the template data and the reply memory data, speech synthesizer 240 generates the phrase JONES (from template memory 160 through data expander 250) ...FIVE-FIVE-FIVE, SIX-SEVEN-EIGHT-NINE (from reply memory 260).
The user then says the command word SEND which, when recognized by the control system, instructs device controller 130 to send telephone number dialing information to speech communications terminal 225.
Terminal 225 outputs this dialing information via an appropriate communications link. When the telephone connection is made, speech communications terminal 225 interfaces microphone audio from microphone 205 to the appropriate transmit path, and receive audio from the appropriate receive audio path to speaker 245. If a proper telephone connection cannot be made, terminal controller 225 provides the appropriate communications link status information to device controller 130.
Accordingly, device controller 130 instructs speech synthesizer 240 to generate the appropriate reply word corresponding to the status information provided, such as the reply word SYSTEM BUSY. In this manner, the user is informed of the communications link status, and user-interactive voice-controlled directory dialing is achieved.
The above operational description is merely one application of synthesizing speech from speech recognition templates according to the present invention.
Numerous other applications of this novel technique to a speech communications device are contemplated, such as, for example, a communications console, a two-way radio, etc. In the preferred embodiment, the control system of the present invention is used with a mobile radiotelephone.
Although speech recognition and speech synthesis allow a vehicle operator to keep both eyes on the road, the conventional handset or hand-held microphone prohibits him from keeping both hands on the steering wheel or from executing proper manual (or automatic) transmission shifting. For this reason, the control system of the preferred embodiment incorporates a speakerphone to provide hands-free control of the speech communications device. The speakerphone performs the transmit/receive audio switching function, as well as the received/reply audio multiplexing function.
Referring now to Figure 3, control system 300 utilizes the same acoustic processor block 110, training processor block 170, recognition processor block 120, template memory block 160, device controller block 130, and synthesis processor block 140 as the corresponding blocks of Figure 2. However, microphone 302 and speaker 375 are not an integral part of the speech communications terminal. Instead, the input speech signal from microphone 302 is directed to radiotelephone 350 via speakerphone 360. Similarly, speakerphone 360 also controls the multiplexing of the synthesized audio from the control system and the receive audio from the communications link. A more detailed analysis of the switching/multiplexing configuration of the speakerphone will be described later. Additionally, the speech communications terminal is now illustrated in Figure 3 as a radiotelephone having a transmitter and a receiver to provide the appropriate communications link via radio frequency (RF) channels. A detailed description of the radio blocks is also provided later.
Microphone 302, which is typically remotely-mounted at a distance from the user's mouth (e.g., on the automobile sun visor), acoustically couples the user's voice to control system 300. This speech signal is usually amplified by preamplifier 304 to provide input speech signal 305. This audio input is directly applied to acoustic processor 110, and is switched by speakerphone 360 before being applied to radiotelephone 350 via switched microphone audio line 315.
As previously mentioned, acoustic processor 110 extracts the features of the user-spoken input speech to provide word feature information to both training processor 170 and recognition processor 120. Acoustic processor 110 first converts the analog input speech into digital form by analog-to-digital (A/D) converter 310.
This digital data is then applied to feature extractor 312, which digitally performs the feature extraction function. Any feature extraction implementation may be utilized in block 312, but the present embodiment utilizes a particular form of "channel bank" feature extraction. Under the channel bank approach, the audio input signal frequency spectrum is divided into individual spectral bands by a bank of bandpass filters, and the appropriate word feature data is generated according to an estimate of the amount of energy present in each band. A feature extractor of this type is described in the article: "The Effects of Selected Signal Processing Techniques on the Performance of a Filter Bank Based Isolated Word Recognizer", B.A. Dautrich, L.R. Rabiner, and T.B. Martin, Bell System Technical Journal, vol. 62, no. 5 (May-June 1983), pp. 1311-1335. An appropriate digital filter algorithm is described in Chapter 4 of L.R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing (Prentice Hall, Englewood Cliffs, N.J., 1975).
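As a rough illustration of the channel bank approach just described, the sketch below estimates per-band log energies for successive 10 ms frames. It is a simplified reading of the cited technique, not the circuit of block 312: the band edges, the mean-square energy estimate, and the frame length are assumptions chosen for clarity.

    import numpy as np
    from scipy.signal import butter, lfilter

    FS = 8000
    FRAME = FS // 100                  # 10 ms frames
    # Hypothetical 14 band edges (Hz); the patent does not publish its exact bands.
    EDGES = np.geomspace(150, 3600, 15)

    def channel_bank_features(speech):
        """Return a (n_frames, 14) array of log channel energies for a waveform."""
        n_frames = len(speech) // FRAME
        feats = np.empty((n_frames, 14))
        for ch in range(14):
            b, a = butter(2, [EDGES[ch], EDGES[ch + 1]], btype='bandpass', fs=FS)
            band = lfilter(b, a, speech)
            # Energy estimate: mean squared value over each 10 ms frame, in log form.
            frames = band[:n_frames * FRAME].reshape(n_frames, FRAME)
            feats[:, ch] = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
        return feats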


Training processor 170 utilizes this word feature data to generate word recognition templates to be stored in template memory 160. First of all, endpoint detector 318 locates the appropriate beginning and end locations of the user's words. These endpoints are based upon the time-varying overall energy estimate of the input word feature data. An endpoint detector of this type is described by L.R. Rabiner and M.R. Sambur in "An Algorithm for Determining the Endpoints of Isolated Utterances", Bell System Technical Journal, vol. 54, no. 2 (February 1975), pp. 297-315.
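The core of an energy-based endpoint detector can be suggested in a few lines. The sketch below is only the central idea under assumed values (a percentile noise floor and a fixed margin); the cited Rabiner-Sambur algorithm additionally uses zero-crossing rates and adaptive thresholds.

    import numpy as np

    def find_endpoints(frame_energy_db, margin_db=10.0):
        """Locate a word as the span of frames whose overall energy rises more
        than margin_db above an assumed background level."""
        background = np.percentile(frame_energy_db, 10)   # assumed noise floor
        active = np.where(frame_energy_db > background + margin_db)[0]
        if active.size == 0:
            return None
        return active[0], active[-1]                      # (start, end) frame indices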
Word averager 320 then combines the several utterances of the same word spoken by the user to provide a more reliable template. As previously described in Figure 2, any appropriate word averaging scheme may be utilized, or the word averaging function may be entirely omitted.
Data reducer 322 utilizes the "raw" word feature data from word averager 320 to generate "reduced" word feature data for storage in template memory 160 as reduced word recognition templates. The data reduction process basically consists of normalizing the energy data, segmenting the word feature data, and combining the data in each segment. After the combined segments have been generated, the storage requirements are further reduced by differential encoding of the filter data. The actual normalization, segmentation, and differential encoding steps of data reducer 322 are described in detail in conjunction with Figures 4 and 5. For a general memory map illustrating the reduced data format of template memory 160, refer to Figure 6c.
Endpoint detector 318, word averager 320, and data reducer 322 comprise training processor 170. In the training mode, training control signal 325, from device controller 130, instructs these three blocks to generate new word templates for storage in template memory 160.
However, in the recognition mode, training control signal 325 directs these blocks to suspend the process of generating new word templates, since this function is not desired during speech recognition. Hence, training processor 170 is only used in the training mode.
Template memory 160 stores word recognition templates to be matched to the incoming speech in recognition processor 120. Template memory 160 is typically comprised of a standard Random Access Memory (RAM), which may be organized in any desired address configuration. A general purpose RAM which may be used with a speech recognition system is the Toshiba TC5565 8k x 8 static RAM. However, a non-volatile RAM is preferred such that word templates are retained when the system is turned off. In the present embodiment, an EEPROM (electrically-erasable, programmable read-only memory) functions as template memory 160.
Word recognition templates, stored in template memory 160, are provided to speech recognition processor 120 and speech synthesis processor 140. In the recognition mode, recognition processor 120 compares these previously stored word templates against the input word features provided by acoustic processor 110. In the present embodiment, recognition processor 120 may be thought of as being comprised of two distinct blocks --template decoder 328 and speech recognizer 326. Template decoder 328 interprets the reduced feature data provided by the template memory, such that speech recognizer 326 can perform its comparison function. Briefly described, template decoder 328 implements an efficient "nibble-mode access technique" of obtaining the reduced data from template storage, and performs differential decoding on the reduced data such that speech recognizer 326 can utilize the information. Template decoder 328 is described in detail in the text accompanying Figure 7b.
Hence, the technique of implementing data reducer 322 to compress the feature data into a reduced data format for storage in template memory 160, and the use of template decoder 328 to decode the reduced word template information, allows the present invention to minimize template storage requirements.
Speech recognizer 326, which performs the actual speech recognition comparison process, may use one of several speech recognition algorithms. The recognition algorithm of the present embodiment incorporates near-continuous speech recognition, dynamic time warping, energy normalization, and a Chebyshev distance metric to determine a template match. Refer to Figure 7a et seq. for a detailed description. Prior art recognition algorithms, such as described in J.S. Bridle, M.D. Brown, and R.M. Chamberlain, "An Algorithm for Connected Word Recognition," IEEE International Conference on Acoustics, Speech, and Signal Processing (May 3-5, 1982), vol. 2, pp. 899-902, may also be used.
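For readers unfamiliar with the distance metric named above, the Chebyshev distance between two feature frames is simply the largest absolute channel difference. The sketch below shows this frame-level metric only; the surrounding dynamic time warping search is assumed and not reproduced here.

    import numpy as np

    def chebyshev_distance(frame_a, frame_b):
        """Chebyshev (L-infinity) distance between two 14-channel feature frames:
        the single worst channel mismatch dominates the score."""
        return np.max(np.abs(np.asarray(frame_a) - np.asarray(frame_b)))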
In the present embodiment, an 8-bit microcomputer performs the function of speech recognizer 326.
Moreover, several other control system blocks of Figure 3 are implemented in part by the same microcomputer with the aid of a CODEC/FILTER and a DSP (Digital Signal Processor). An alternate hardware configuration for speech recognizer 326, which may be used in the present invention, is described in an article by J. Peckham, J. Green, J. Canning, and P. Stevens, entitled "A Real-Time Hardware Continuous Speech Recognition System," IEEE International Conference on Acoustics, Speech, and Signal Processing (May 3-5, 1982), vol. 2, pp. 863-866, and the references contained therein.
Hence, the present invention is not limited to any specific hardware or any specific type of speech recognition. More particularly, the present invention contemplates the use of: isolated or continuous word recognition; and a software-based or hardware-based implementation.
Device controller 130, consisting of control unit 334 and directory memory 332, serves to interface speech recognition processor 120 and speech synthesis processor 140 to radiotelephone 350 via two-way interface busses.
Control unit 334 is typically a controlling microprocessor which is capable of interfacing data from radio logic 352 to the other blocks of the control system. Control unit 334 also performs operational control of radiotelephone 350, such as: unlocking the control head; placing a telephone call; ending a telephone call; etc. Depending on the particular hardware interface structure to the radio, control unit 334 may incorporate other sub-blocks to perform specific control functions such as DTMF dialing, interface bus multiplexing, and control-function decision-making.
Moreover, the data-interfacing function of control unit 334 can be incorporated into the existing hardware of radio logic 352. Hence, a hardware-specific control program would typically be provided for each type of radio or for each kind of electronic device application.
Directory memory 332, an EEPROM, stores the plurality of telephone numbers, thereby permitting directory dialing. Stored telephone number directory information is sent from control unit 334 to directory memory 332 during the training process of entering telephone numbers, while this directory information is provided to control unit 334 in response to the recognition of a valid directory dialing command.
Depending on the particular device used, it may be more economical to incorporate directory memory 332 into the telephone device itself. In general, however, controller block 130 performs the telephone directory storage function, the telephone number dialing function, and the radio operational control function.

Controller block 130 also provides different types of status information, representing the operating status of the radiotelephone, to speech synthesis processor 140. This status information may include information as to the telephone numbers stored in directory memory 332 ("555-1234", etc.), directory names stored in template memory 160 ("Smith", "Jones", etc.), directory status information ("Directory Full", "Name?", etc.), speech recognition status information ("Ready", "User Number?", etc.), or radiotelephone status information ("Call Dropped", "System Busy", etc.).
Hence, controller block 130 is the heart of the user-interactive speech recognition/speech synthesis control system.
Speech synthesis processor block 140 performs the voice reply function. Word recognition templates, stored in template memory 160, are provided to data expander 346 whenever speech synthesis from a template is required.
As previously mentioned, data expander 346 "unpacks" the reduced word feature data from template memory 160 and provides "template" voice response data for channel bank speech synthesizer 340. Refer to Figure 8a et seq. for a detailed explanation of data expander 346.
If the system controller determines that a "canned" reply word is desired, reply memory 344 supplies voice reply data to channel bank speech synthesizer 340.
Reply memory 344 typically comprises a ROM or an EPROM. In the preferred embodiment, an Intel 27256 EPROM is used as reply memory 344.
Using either the "canned" or "template" voice reply data, channel bank speech synthesizer 340 synthesizes these reply words, and outputs them to digital-to-analog (D/A) converter 342. The voice reply is then routed to the user. In the present embodiment, channel bank speech synthesizer 340 is the speech synthesis portion of a 14-channel vocoder. An example of such a vocoder may be found in J.N. Holmes, "The JSRU Channel Vocoder", IEE Proc., vol. 127, pt. F, no. 1 (February 1980), pp. 53-60. The information provided to a channel bank synthesizer normally includes whether the input speech should be voiced or unvoiced, the pitch rate if any, and the gain of each of the 14 filters. However, as will be obvious to those skilled in the art, any type of speech synthesizer may be utilized to perform the basic speech synthesis function. The particular configuration of channel bank speech synthesizer 340 is fully described in conjunction with Figure 9a et seq.
As we have seen, the present invention teaches the implementation of speech synthesis from a speech recognition template to provide a user-interactive control system for a speech communications device. In the present embodiment, the speech communications device is a radio transceiver, such as a cellular mobile radiotelephone. However, any speech communications device warranting hands-free user-interactive operation may be used. For example, any simplex radio transceiver requiring hands-free control may also take advantage of the improved control system of the present invention.
Referring now to radiotelephone block 350 of Figure 3, radio logic 352 performs the actual radio operational control function. Specifically, it directs frequency synthesizer 356 to provide channel information to transmitter 353 and receiver 357. The function of frequency synthesizer 356 may also be performed by crystal-controlled channel oscillators. Duplexer 354 interfaces transmitter 353 and receiver 357 to a radio frequency (RF) channel via antenna 359. In the case of a simplex radio transceiver, the function of duplexer 354 may be performed by an RF switch. For a more detailed explanation of representative radiotelephone circuitry, refer to Motorola Instruction Manual 68P81066E40 entitled "DYNA T.A.C. Cellular Mobile Telephone."

Speakerphone 360, also termed a vehicular speakerphone in the present application, provides hands-free acoustic coupling of: the user-spoken audio to the control system and to the radiotelephone transmitter audio; the synthesized speech reply signal to the user; and the received audio from the radiotelephone to the user. As previously noted, preamplifier 304 may perform amplification upon the audio signal provided by microphone 302 to produce input speech signal 305 to acoustic processor 110. This input speech signal is also applied to transmit audio switch 362, which routes input signal 305 to radio transmitter 353 via transmit audio 315. Transmit switch 362 is controlled by signal detector 364. Signal detector 364 compares input signal 305 amplitude against that of receive audio 355 to perform the speakerphone switching function. When the mobile radio user is talking, signal detector 364 provides a positive control signal via detector output 361 to close transmit audio switch 362, and a negative control signal via detector output 363 to open receive audio switch 368. Conversely, when the landline party is talking, signal detector 364 provides the opposite polarity signals to close receive audio switch 368, while opening transmit audio switch 362. When the receive audio switch is closed, receiver audio 355 from radiotelephone receiver 357 is routed through receive audio switch 368 to multiplexer 370 via switched receive audio output 367. In some communications systems, it may prove advantageous to replace audio switches 362 and 368 with variable gain devices that provide equal but opposite attenuations in response to the control signals from the signal detector.
Multiplexer 370 switches between voice reply audio 345 and switched receive audio 367 in response to multiplex signal 335 from control unit 334. Whenever the control unit sends status information to the speech synthesizer, multiplex signal 335 directs multiplexer 370 to route the voice reply audio to the speaker. Speakerphone audio 365 is usually amplified by audio amplifier 372 before being applied to speaker 375. It is to be noted that the vehicle speakerphone embodiment described herein is only one of numerous possible configurations which can be used in the present invention.
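The voice-switching behavior just described can be summarized in a few lines of logic. The sketch below is a schematic rendering under assumed conditions (digitized audio blocks and a simple short-term amplitude comparison); the actual signal detector 364 is an analog comparator arrangement, not software.

    import numpy as np

    def speakerphone_switch(mic_block, rx_block):
        """Compare short-term amplitudes of microphone and receive audio and
        return which switch to close, mimicking signal detector 364."""
        mic_level = np.mean(np.abs(mic_block))
        rx_level = np.mean(np.abs(rx_block))
        if mic_level > rx_level:
            return {"transmit_switch": True, "receive_switch": False}   # user talking
        return {"transmit_switch": False, "receive_switch": True}       # far end talking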
In summary, Figure 3 illustrates a radiotelephone having a hands-free user-interactive speech-recognizing control system for controlling radiotelephone operating parameters upon a user-spoken command. The control system provides audible feedback to the user via speech synthesis from speech recognition template memory or a "canned" response reply memory. The vehicle speakerphone provides hands-free acoustic coupling of the user-spoken input speech to the control system and to the radio transmitter; the speech reply signal from the control system to the user; and the receiver audio to the user.
The implementation of speech synthesis from recognition templates significantly improves the performance and versatility of the radiotelephone's speech recognition control system.
2. Data Reduction and Template Storage

Referring to Figure 4a, an expanded block diagram of data reducer 322 is shown. As previously stated, data reducer block 322 utilizes raw word feature data from word averager 320 to generate reduced word feature data for storage in template memory 160. The data reduction function is performed in three steps: (1) energy normalization block 410 reduces the range of stored values for channel energies by subtracting the average value of the channel energies; (2) segmentation/compression block 420 segments the word feature data and combines acoustically similar frames to form "clusters"; and (3) differential encoding block 430 generates the differences between adjacent channels for storage, rather than the actual channel energy data, to further reduce storage requirements. When all three processes have been performed, the reduced data format for each frame is stored in only nine bytes as shown in Figure 6c. In short, data reducer 322 "packs" the raw word data into a reduced data format to minimize storage requirements.
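To make step (3) concrete, here is a minimal sketch of adjacent-channel differential encoding, under assumptions this passage does not spell out (integer dB-domain values and a 4-bit signed delta range): the first channel is kept as a reference and each remaining channel is stored as its difference from the previous one.

    def diff_encode(channels):
        """Store channel 1 plus clamped differences between adjacent channels."""
        deltas = [channels[0]]
        for prev, cur in zip(channels, channels[1:]):
            deltas.append(max(-8, min(7, cur - prev)))   # assumed 4-bit signed range
        return deltas

    def diff_decode(deltas):
        """Invert diff_encode (exactly, when no clamping occurred)."""
        channels = [deltas[0]]
        for d in deltas[1:]:
            channels.append(channels[-1] + d)
        return channels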
The flowchart of Figure 4b illustrates the sequence of steps performed by energy normalization block 410 of the previous figure. Upon starting at block 440, block 441 initializes the variables which will be used in later calculations. Frame count FC is initialized to one to correspond to the first frame of the word to be data reduced. Channel total CT is initialized to the total number of channels corresponding to those of the channel bank feature extractor 312. In the preferred embodiment, a 14-channel feature extractor is used.
Next, the frame total FT is calculated in block 442. Frame total FT is the total number of frames per word to be stored in the template memory. This frame total information is available from training processor 170. To illustrate, say that the acoustic features of a 500 millisecond duration input word are (digitally) sampled every 10 milliseconds. Each 10 millisecond time segment is called a frame. The 500 millisecond word then comprises 50 frames. Thus, FT would equal 50.
Block 443 tests to see if all the frames of the word have been processed. If the present frame count FC is greater than the frame total FT, no frames of the word would be left to normalize, so the energy normalization process for that word will end at block 444. If, however, FC is not greater than FT, the energy normalization process continues with the next frame of the word. Continuing with the above example of a 50-frame word, each frame of the word is energy normalized in blocks 445 through 452, the frame count FC is incremented in block 453, and FC is tested in block 443. After the 50th frame of the word has been energy normalized, FC will be incremented to 51 in block 453. When a frame count FC of 51 is compared to the frame total FT of 50, block 443 will terminate the energy normalization process at block 444.
The actual energy normalization procedure is accomplished by subtracting the average value of all of the channels from each individual channel to reduce the range of values stored in the template memory. In block 445, the average frame energy (AVGENG) is calculated according to the formula:

    AVGENG = [ CH(1) + CH(2) + ... + CH(CT) ] / CT

where CH(i) is the individual channel energy, and where CT equals the total number of channels. It should be noted that in the present embodiment, energies are stored as log energies and the energy normalization process actually subtracts the average log energy from the log energy of each channel.
The average frame energy AVGENG is output in block 446 to be stored at the end location of the channel data for each frame. (See Figure 6c, byte 9.) In order to efficiently store the average frame energy in four bits, AVGENG is normalized to the peak energy value of the entire template, and then quantized to 3 dB steps. When the peak energy is assigned a value of 15 (the four-bit maximum), the total energy variation within a template would be 16 steps x 3 dB/step = 48 dB. In the preferred embodiment, this average energy normalization/quantization is performed after the differential encoding of channel 14 (Figure 6a) to permit higher precision calculations during the segmentation/compression process (block 420).
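The four-bit energy quantization works out as follows. This is a brief sketch under the stated 3 dB step size; the clamping behavior for energies more than 48 dB below the peak is an assumption.

    def quantize_avgeng(avgeng_db, peak_db):
        """Map a frame's average energy to a 4-bit code: 15 at the template peak,
        one code per 3 dB below it, clamped at 0 (48 dB below peak)."""
        steps_below_peak = round((peak_db - avgeng_db) / 3.0)
        return max(0, 15 - steps_below_peak)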
Block 447 sets the channel count CC to one.
Block 448 reads the channel energy addressed by the channel counter CC into an accumulator. Block 449 subtracts the average energy calculated in block 445 from the channel energy read in block 448. This step generates normalized channel energy data, which is then output (to segmentation/compression block 420) in block 450. Block 451 increments the channel counter, and block 452 tests to see if all channels have been normalized. If the new channel count is not greater than the channel total, then the process returns to block 448 where the next channel energy is read. If, however, all channels of the frame have been normalized, the frame count is incremented in block 453 to obtain the next frame of data. When all frames have been normalized, the energy normalization process of data reducer 322 ends at block 444.
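By way of illustration, the per-frame procedure of blocks 445 through 452 can be sketched in C as follows. This is a minimal sketch only; the function and variable names are illustrative rather than taken from the patent, and the log channel energies are assumed to be held as doubles.

    #define CT 14  /* channel total: 14-channel feature extractor */

    /* Normalize one frame of log channel energies in place and return
       the average frame energy (AVGENG) that is stored with the frame. */
    double normalize_frame(double ch[CT])
    {
        double avgeng = 0.0;
        int i;
        for (i = 0; i < CT; i++)      /* block 445: AVGENG = sum / CT  */
            avgeng += ch[i];
        avgeng /= CT;
        for (i = 0; i < CT; i++)      /* blocks 448-452: subtract the  */
            ch[i] -= avgeng;          /* average log energy            */
        return avgeng;                /* block 446: stored with frame  */
    }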
Referring now to Figure 4c, shown is a block diagram illustrating an implementation of the data reducer, block 420. The input feature data is stored in frames in initial frame storage, block 502. The memory used for storage is preferred to be RAM. A segmentation controller, block 504, is used to control and to designate which frames will be considered for clustering.
A number of microprocessors can be used for this purpose, such as the Motorola type 6805 microprocessor.
The present invention requires that incoming frames be considered for averaging by first calculating a distortion measure associated with the frames to determine the similarity between the frames before averaging. The calculation is preferably made by a microprocessor, similar to, or the same as, that used in block 504. Details of the calculation are subsequently discussed.
Once it has been determined which frames will be combined, the frame averager, block 508, combines the frames into a representative average frame. Again, similar processing means, as in block 504, can be used for combining the specified frames for averaging.
To effectively reduce the data, the resulting word templates should occupy as little template storage as possible without being distorted to the point that the recognition process is degraded. In other words, the amount of information representing the word templates should be minimized while, at the same time, maximizing the recognition accuracy. Although the two objectives are contradictory, the word template data can be minimized if a minimal level of distortion is allowed for each cluster.
Figure 5a illustrates a method for clustering frames for a given level of distortion. Speech is depicted as feature data grouped in frames 510. The five center frames 510 form a cluster 512. The cluster 512 is combined into a representative average frame 514. The average frame 514 can be generated by any number of known averaging methods according to the particular type of feature data used in the system. To determine whether a cluster meets the allowable distortion level, a prior art distortion test can be used. However, it is preferred that the average frame 514 be compared to each of the frames 510 in the cluster 512 for a measure of similarity. The distance between the average frame 514 and each frame 510 in the cluster 512 is indicated by distances D1-D5. If one of these distances exceeds the allowable distortion level, the threshold distance, the cluster 512 is not considered for the resulting word template. If the threshold distance is not exceeded, the cluster 512 is considered as a possible cluster represented by the average frame 514.
This technique for determining a valid cluster is referred to as a peak distortion measure. The present embodiment uses two types of peak distortion criteria: peak energy distortion and peak spectral distortion. Mathematically, this is stated as follows:

    D = max [D1, D2, D3, D4, D5],

where D1-D5, as discussed above, represent each distance.

These distortion measures are used as local constraints for restricting which frames may be combined into an average frame. If D exceeds a predetermined distortion threshold for either energy or spectral distortion, the cluster is rejected. By maintaining the same constraints for all clusters, a relative quality of the resulting word template is realized.
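As a sketch, the peak distortion test can be expressed in C as below. The names are illustrative, and dist() stands in for whichever frame-to-frame distance measure, energy or spectral, is being constrained.

    /* Return nonzero if no frame in the cluster lies farther than
       `threshold` from the cluster's average frame (D <= threshold). */
    int cluster_is_valid(const double frames[][14], int first, int last,
                         const double avg[14], double threshold,
                         double (*dist)(const double *, const double *))
    {
        double d_peak = 0.0;              /* D = max [D1, ..., D5] */
        int i;
        for (i = first; i <= last; i++) {
            double d = dist(frames[i], avg);
            if (d > d_peak)
                d_peak = d;
        }
        return d_peak <= threshold;
    }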
This clustering technique is used with dynamic programming to optimally reduce the data representing the word template. The principle of dynamic programming can be mathematically stated as follows:

    Y0 = 0 and
    Yj = min [Yi + Cij] over all i,

where Yj is the cost of the least-cost path from node 0 to node j, and Cij is the cost incurred in moving from node i to node j. The integer values of i and j range over the possible number of nodes.
To apply this principle to the reduction of word templates in accordance with the present invention, several assumptions are made. They are:
The information in the templates is in the form of a series of frames, spaced equally in time;
A suitable method of combining frames into an average frame exists;
A meaningful distortion measure exists for comparing an average frame to an original frame; and
Frames may be combined only with adjacent frames.

The end objective of the present invention is to find the minimal set of clusters representing the template, subject to the constraint that no cluster exceeds a predetermined distortion threshold.
The following definitions allow the principle of dynamic programming to be applied to data reduction according to the present invention.

Yj is the combination of clusters for the first j frames;
Y0 is the null path, meaning there are no clusters at this point;
Cij = 1 if the cluster of frames i + 1 through j meets the distortion criteria; Cij = infinity otherwise.

The clustering method generates optimal cluster paths starting at the first frame of the word template.
The cluster paths assigned at each frame within the template are referred to as partial paths, since they do not completely define the clustering for the entire word.
The method begins by initializing the null path, associated with 'frame 0', to 0, i.e., Y0 = 0. This indicates that a template with zero frames has zero clusters associated with it. A total path distortion is assigned to each path to describe its relative quality. Although any total distortion measure can be used, the implementation described herein uses the maximum of the peak spectral distortions from all the clusters defining the current path. Accordingly, the null path, Y0, is assigned zero total path distortion, TPD.
To find the first partial path or combination of clusters, partial path Y1 is defined as follows:

    Y1 (partial path at frame one) = Y0 + C0,1

This states that the allowable clusters of one frame can be formed by taking the null path, Y0, and appending all frames up to frame 1. Hence, the total cost for partial path Y1 is 1 cluster and the total path distortion is zero, since the average frame is identical to the actual frame.
The formation of the second partial path, Y2, requires that two possibilities be considered. They are:

    Y2 = min [Y0 + C0,2; Y1 + C1,2].

The first possibility is the null path, Y0, with frames 1 and 2 combined into one cluster. The second possibility is the first frame as a cluster, partial path Y1, plus the second frame as the second cluster.
To form partial path Y3, three possibilities exist:

    Y3 = min [Y0 + C0,3; Y1 + C1,3; Y2 + C2,3].

The formation of partial path Y3 depends upon which path was chosen during the formation of partial path Y2. One of the first two possibilities is not considered, since partial path Y2 was optimally formed. Hence, the path that was not chosen at partial path Y2 need not be considered for partial path Y3. In carrying out this technique for large numbers of frames, a globally optimal solution is realized without searching paths that will never become optimum. Accordingly, the computation time required for data reduction is substantially reduced.
Figure 5b illustrates an example of forming the optimal partial path in a four-frame word template. Each partial path, Y1 through Y4, is shown in a separate row. The frames to be considered for clustering are underlined. The first partial path, defined as Y0 + C0,1, has only one choice, 520. The single frame is clustered by itself.
For partial path Y2, the optimal formation includes a cluster with the first two frames, choice 522. In this example, assume the local distortion threshold is exceeded; therefore, the second choice 524 is taken. The X over these two combined frames 522 indicates that combining these two frames will no longer be held as a consideration for a viable average frame. Hereinafter, this is referred to as an invalidated choice. The optimal cluster formation up to frame 2 comprises two clusters, each with one frame 524.
For partial path Y3, there are three sets of choices. The first choice 526 is the most desirable, but it would typically be rejected since combining the first two frames 522 of partial path Y2 exceeds the threshold. It should be noted that this is not always the case. A truly optimal algorithm would not immediately reject this combination based solely on the invalidated choice 522 of partial path Y2. The inclusion of additional frames into a cluster which already exceeds the distortion threshold occasionally causes the local distortion to decrease; however, this is rare. In this example, such an inclusion is not considered. Larger combinations of an invalidated combination will also be invalidated. Choice 530 is invalidated because choice 522 was rejected. Accordingly, an X is depicted over the first and third choices 526 and 530, indicating an invalidation of each. Hence, the third partial path, Y3, has only two choices, the second 528 and the fourth 532. The second choice 528 is more optimal (fewer clusters) and, in this example, is found not to exceed the local distortion threshold. Accordingly, the fourth choice 532 is invalidated since it is not optimal. This invalidation is indicated by the XX over the fourth choice 532. The optimal cluster formation up to frame 3 comprises two clusters 528. The first cluster contains only the first frame. The second cluster contains frames 2 and 3.
The fourth partial path, Y4, has four conceptual sets from which to choose. The X indicates that choices 534, 538, 542 and 548 are invalidated as a consequence of choice 522, from the second partial path, Y2, being invalidated. This results in consideration of only choices 536, 540, 544 and 546. Since choice 546 is known to be a non-optimal choice, because the optimal clustering up to Y3 is 528 rather than 532, it is invalidated, as indicated by XX. Choice 536, of the remaining three choices, is selected next, since it minimizes the number of representative clusters. In this example, choice 536 is found not to exceed the local distortion threshold. Therefore, the optimal cluster formation for the entire word template comprises only two clusters. The first cluster contains only the first frame. The second cluster contains frames 2 through 4. Partial path Y4 represents the optimally reduced word template. Mathematically, this optimal partial path is defined as:

    Y4 = Y1 + C1,4.
The above path-forming procedure can be improved upon by selectively ordering the cluster formations for each partial path. The frames can be clustered from the last frame of the partial path toward the first frame of the partial path. For example, in forming a partial path Y10, the order of clustering is: Y9 + C9,10; Y8 + C8,10; Y7 + C7,10; etc. The cluster consisting of frame 10 alone is considered first. Information defining this cluster is saved, and frame 9 is then added to form the cluster C8,10. If clustering frames 9 and 10 exceeds the local distortion threshold, then cluster C8,10 is not considered for appending to partial path Y8, and the search stops with the saved cluster C9,10. If clustering frames 9 and 10 does not exceed the local distortion threshold, then cluster C8,10 is considered. Frames are added to the cluster until the threshold is exceeded, at which time the search for partial paths at Y10 is completed. Then, the optimal partial path, the path with the least clusters, is chosen from all the preceding partial paths for Y10. This selective order of clustering limits the testing of potential cluster combinations, thereby reducing computation time.
In general, at an arbitrary partial path Yj, a maximum of j cluster combinations are tested. Figure 5c illustrates the selective ordering for such a path. The optimal partial path is mathematically defined as:

    Yj = min [Yj-1 + Cj-1,j; . . . ; Y1 + C1,j; Y0 + C0,j],

where min selects the minimum number of clusters in a cluster path that satisfies the distortion criteria. Marks are placed on the horizontal axis of Figure 5c, depicting each frame. The rows shown vertically are cluster formation possibilities for partial path Yj. The lowest set of brackets, cluster possibility number 1, determines the first potential cluster formation. This formation includes the single frame, j, clustered by itself and the optimal partial path Yj-1. To determine if a path exists with a lower cost, possibility two is tested. Since partial path Yj-2 is optimal up to frame j-2, clustering frames j and j-1 determines if another formation exists up to frame j. Frame j is clustered with additional adjacent frames until the distortion threshold is exceeded. When the distortion threshold is exceeded, the search for partial path Yj is completed and the path with the fewest clusters is taken as Yj.
Ordering the clustering in this manner forces only frames immediately adjacent to frame j to be clustered. An additional benefit is that invalidated choices are not used in determining which frames should be clustered. Hence, for any single partial path, a minimum number of frames are tested for clustering, and only information defining one clustering per partial path is stored in memory.
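A compact C sketch of this selectively ordered search for one partial path is given below. It is a sketch under stated assumptions, not the patented implementation: peak_distortion() is a placeholder for the cluster test described above, the array names are illustrative, and the roughly 20% margin rule discussed later with block 580 is omitted for brevity.

    #include <limits.h>

    #define MAXFRAMES 64

    extern double peak_distortion(int first, int last);  /* placeholder */

    int    cost[MAXFRAMES + 1];  /* total path cost (clusters in Yj) */
    int    tbp [MAXFRAMES + 1];  /* trace-back pointer               */
    double tpd [MAXFRAMES + 1];  /* total path distortion            */

    /* Form partial path Yj: try cluster {j}, then {j-1, j}, ..., and
       stop at the first cluster whose local distortion exceeds the
       threshold; keep the candidate with the fewest clusters,
       breaking ties on total path distortion. */
    void form_partial_path(int j, double threshold)
    {
        cost[j] = INT_MAX;
        for (int k = j - 1; k >= 0; k--) {        /* frames k+1 .. j */
            double d = peak_distortion(k + 1, j);
            if (d > threshold)
                break;                    /* stop growing the cluster */
            double cand_tpd  = (tpd[k] > d) ? tpd[k] : d;
            int    cand_cost = cost[k] + 1;       /* path Yk + Ck,j  */
            if (cand_cost < cost[j] ||
               (cand_cost == cost[j] && cand_tpd < tpd[j])) {
                cost[j] = cand_cost;
                tbp[j]  = k;
                tpd[j]  = cand_tpd;
            }
        }
    }

Initializing cost[0] = 0 and tpd[0] = 0 for the null path Y0 and calling form_partial_path() for j = 1 through N yields the optimal clustering, recoverable through the tbp[] array.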

The information defining each partial path includes three parameters:
, 1~ The total path cost, i.e., the number of clusters in the path.
2~ A trace-back pointer indicating the previous path formed. For example, if partial path Y6 is defined as ~Y3 +
C3,6), then the trace-back pointer for Y6 points to partial path Y3.
3) The total path distortion (TPD) for the current path, reflecting the overall distortion of the path.
The traceback pointers define the clusters within the path.
The total path distortion reflects the quality of the path. It is used to determine which of two possible path formations, each having equal minimal cost (number of clusters), is the most desirable.
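In C terms, the record kept for each partial path might look like this (an illustrative data structure, not a format prescribed by the patent):

    struct partial_path {
        int    cost;  /* total path cost: number of clusters in path  */
        int    tbp;   /* trace-back pointer to the previous path      */
        double tpd;   /* total path distortion: worst cluster on path */
    };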
The following example illustrates an application of these parameters.
Let th~ following combinations exist for partial path Y8:

Y8 = Y3 + C3,8 or Y5 + C5,8.
~ .
Let the cost of partial path Y3 and 3 partial path Y5 be egual and let clusters C3,8 and C5,8 both pass the local distortion constraints.

The desired optimal formation is that which has the least TPD. Using the peak distortion test, the optimal formation for partial path Y8 is determined as:

c . '.

- 1324833 CMo0249G
    min [ max [Y3's TPD; peak distortion of cluster 4-8]; max [Y5's TPD; peak distortion of cluster 6-8] ].

The trace-back pointer would be set to either Y3 or Y5, depending on which formation has the least TPD.

Now referring to Figure 5d, shown is a flowchart illustrating the formation of partial paths for a sequence of N frames. Discussion of this flowchart pertains to a word template having 4 frames, i.e., N = 4. The resulting data reduced template is the same as in the example from Figure 5b, where Y4 = Y1 + C1,4.
The null path, partial path Y0, is initialized along with the cost, the trace-back pointers and the TPD, block 550. It should be noted that each partial path has its own set of values for TPD, cost and TBP. A frame pointer, j, is initialized to 1, indicating the first partial path, Y1, block 552. Continuing on to the second part of the flowchart, at Figure 5e, a second frame pointer, k, is initialized to 0, block 554. The second frame pointer is used to specify how far back frames are considered for clustering in the partial path. Hence, the frames to be considered for clustering are specified from k+1 to j.
These frames are averaged, block 556, and a cluster distortion is generated, block 558. A test is performed to determine if the first cluster of the partial path is being formed, block 562. In this instance, the first partial path is being formed. Therefore, the cluster is defined in memory by setting the necessary parameters, block 564. Since this is the first cluster in the first partial path, the trace-back pointer (TBP) is set to the null word, the cost is set to 1 and the TPD remains at 0.
The cost for the path ending at frame j is set as the cost of the path ending at k (the number of clusters in path Yk) plus one for the new cluster being added. Testing for a larger cluster formation begins by decrementing the second frame pointer, k, depicted in block 566. At this point, since k is decremented to -1, a test is performed to prevent invalid frame clusters, block 568. A positive result from the test performed at block 568 indicates that all formations for the current partial path have been formed and tested for optimality. The first partial path is mathematically defined as Y1 = Y0 + C0,1. It is comprised of one cluster containing the first frame. The test illustrated in block 570 determines whether all frames have been clustered. There are three frames yet to cluster. The next partial path is initialized by incrementing the first frame pointer j, block 572. The second frame pointer is initialized to one frame before j, block 554. Accordingly, j points to frame 2 and k points to frame 1.
Frame 2 is averaged by itself at block 556.
The test performed at block 562 determines that j is equal to k+1, and flow proceeds to block 564 to define the first path for partial path Y2. The pointer k is decremented at block 566 for the next cluster consideration.
Frames 1 and 2 are averaged to form Y0 + C0,2, block 556, and a distortion measure is generated, block 558. Since this is not the first path being formed, block 562, flow proceeds to block 560. The distortion measure is compared to the threshold, block 560. In this example, combining frames 1 and 2 exceeds the threshold. Thus, the previously saved partial path, i.e., Y1 + C1,2, is kept for partial path Y2 and the flowchart branches to block 580.
The step depicted in block 580 performs a test to determine whether any additional frames should be clustered with these frames that have exceeded the threshold. Typically, due to the nature of most data, adding additional frames at this point will also result in an exceeded distortion threshold.
However, it has been found that if the generated distortion measure does not exceed the threshold by more than about 20%, additional frames may cluster without exceeding the distortion threshold. If further clustering is desired, the second frame pointer is decremented to specify the new cluster, block 566. Otherwise, the test is performed to indicate whether all frames have been clustered, block 570.
The next partial path is initialized with j set equal to 3, block 572. The second frame pointer is initialized to 2. Frame 3 is averaged by itself, block 556, and a distortion measure is generated, block 558. Since this is the first path formed for Y3, this new path is defined and saved in memory, block 564. The second frame pointer is decremented, block 566, to specify a larger cluster. The larger cluster comprises frames 2 and 3.
These frames are averaged, block 556, and a distortion is generated, block 558. Since this is not the first path formed, block 562, flow proceeds to block 560. In this example, the threshold is not exceeded, block 560. Since this path, Y1 + C1,3, is more optimal, with two clusters, than path Y2 + C2,3, with three clusters, path Y1 + C1,3 replaces the previously saved path Y2 + C2,3 as partial path Y3. A larger cluster is specified as k is decremented to 0, block 566.
Frames 1 through 3 are averaged, block 556, and another distortion measure is generated, block 558. In this example, the threshold is exceeded, block 560. No additional frames are clustered, block 580, and the test is again performed to determine whether all the frames have been clustered, block 570. Since frame 4 is still not yet clustered, j is incremented for the next partial path, Y4; the second frame pointer is set at frame 3 and the clustering process repeats.
Frame 4 is averaged by itself, block 556. Again, this is the first path formed, in block 562, and the path is defined for Y4, block 564. This partial path, Y3 + C3,4, has a cost of 3 clusters. A larger cluster is specified, block 566, and frames 3 and 4 are clustered. Frames 3 and 4 are averaged, block 556. In this example their distortion measure does not exceed the threshold, block 560. This partial path, Y2 + C2,4, has a cost of 3 clusters. Since this has the same cost as the previous path (Y3 + C3,4), flow proceeds through blocks 574 and 576 to block 578, and the TPD is examined to determine which path has the least distortion. If the current path (Y2 + C2,4) has a lower TPD, block 578, than the previous path (Y3 + C3,4), then it will replace the previous path, block 564; otherwise flow proceeds to block 566. A larger cluster is specified, block 566, and frames 2 through 4 are clustered.
Frames 2 through 4 are averaged, block 556. In this example, their distortion measure again does not exceed the threshold. This partial path, Y1 + C1,4, has a cost of 2 clusters. Since this is a more optimal path for partial path Y4, block 574, than the previous, the path is defined in place of the previous, block 564. A larger cluster is specified, block 566, and frames 1 through 4 are clustered. Averaging frames 1 through 4, in this example, exceeds the distortion threshold, block 560. Clustering is stopped, block 580. Since all the frames have been clustered, block 570, the stored information defining each cluster defines the optimal path for this 4-frame data reduced word template, block 582, mathematically defined as Y4 = Y1 + C1,4.
This example illustrates the formation of the optimal data reduced word template from Figure 3. The flowchart illustrates clustering tests for each partial path in the following order:

    Y1: (1) 2 3 4
    Y2: 1 (2) 3 4    *(1 2) 3 4
    Y3: 1 2 (3) 4    1 (2 3) 4    *(1 2 3) 4
    Y4: 1 2 3 (4)    1 2 (3 4)    1 (2 3 4)    *(1 2 3 4)

The parenthesized numbers indicate the frames being tested for clustering in each cluster test. Those clusters that exceed the threshold are indicated by a preceding '*'.
In this example, 10 cluster paths are searched. In general, using this procedure requires searching at most [N(N + 1)]/2 cluster paths to find the optimal cluster formation, where N is the number of frames in the word template. For a 15-frame word template, this procedure would require searching at most 120 paths, compared to 16,384 paths for a search attempting to try all possible combinations. Consequently, by using such a procedure in accordance with the present invention, an enormous reduction in computation time is realized.
Even further reduction in computation time can be realized by modifying blocks 552, 568, 554, 562, and 580 of Figures 5d and 5e. Block 568 illustrates a limit being placed on the second frame pointer, k. In the example, k is limited only by the null path, partial path Y0, at frame 0. Since k is used to define the length of each cluster, the number of frames clustered can be constrained by constraining k. For any given distortion threshold, there will almost always be a number of frames that, when clustered, will cause a distortion that exceeds the distortion threshold. At the other extreme, there is always a minimal cluster formation that will never cause a distortion that exceeds the distortion threshold. Therefore, by defining a maximum cluster size, MAXCS, and a minimum cluster size, MINCS, the second frame pointer, k, can be constrained.
MINCS would be employed in blocks 552, 554 and 562. For block 552, j would be initialized to MINCS. For block 554, rather than subtract one from k in this step, MINCS would be subtracted. This forces k back a certain number of frames for each new partial path. Consequently, clusters with fewer frames than MINCS will not be averaged. It should also be noted that to accommodate MINCS, block 562 should depict the test of j = k + MINCS rather than j = k + 1.
MAXCS would be employed in block 568. The limit becomes either frames before 0 (k < 0) or frames before the bound designated by MAXCS (k < j - MAXCS). This prevents testing clusters that are known to exceed MAXCS. According to the notation used with Figure 5e, these constraints can be mathematically expressed as follows:

    k >= j - MAXCS and k >= 0; and
    k <= j - MINCS and j >= MINCS.
"` For example, let MAXCS = 5 and and NINCS = 2 for a partial path Y15. Then the first cluster consists of frames 15 and 14. The last cluster consists of frames 15 ` 25 through 11. The constraint that j has to be greater or equal to MINCS prevents clusters from forming within the first MINCS frames.
Notice (block 562) that clusters of size MINCS are not tested against the distortion threshold (block 560). This ensures that a valid partial path will exist for all Yj, j >= MINCS.
Now referring to Figure 5f, block 582 from Figure 5e is shown in further detail. Figure 5f illustrates a .

. . : ' , . ,: .
~' '`. ' .,~, `~ ', .-'`

.
.

method to generate output clusters after data reduction by using the trace-back pointer (TBP in block 564 of Figure 5e) from each cluster in reverse direction. Two frame pointers, TB and CF, are initialized, block 590. TB is initialized to the trace-back pointer of the last frame. CF, the current end frame pointer, is initialized to the last frame of the word template. In the example from Figures 5d and 5e, TB would point at frame 1 and CF would point at frame 4. Frames TB+1 through CF are averaged to form an output frame for the resulting word template, block 592. A variable for each averaged frame, or cluster, stores the number of frames combined. It is referred to as the "repeat count" and can be calculated as CF - TB. (See Figure 6c, infra.) A test is then performed to determine whether all clusters have been output, block 594. If not, the next cluster is pointed at by setting CF equal to TB and setting TB to the trace-back pointer of the new frame CF. This procedure continues until all clusters are averaged and output to form the resultant word template.
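A minimal C sketch of this trace-back loop follows. Here emit_average_frame() is a hypothetical helper that averages frames TB+1 through CF and writes one output frame with the given repeat count; note that the clusters emerge from last to first.

    extern void emit_average_frame(int first, int last, int repeat);

    void output_clusters(int last_frame, const int tbp[])
    {
        int cf = last_frame;          /* current end frame pointer  */
        int tb = tbp[cf];             /* trace-back pointer         */
        for (;;) {
            emit_average_frame(tb + 1, cf, cf - tb);  /* block 592  */
            if (tb == 0)              /* all clusters output (594)  */
                break;
            cf = tb;                  /* step to previous cluster   */
            tb = tbp[cf];
        }
    }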
Figures 5g, 5h and 5i illustrate a unique application of the trace-back pointers. The trace-back pointers are used in a partial trace-back mode for outputting clusters from data with an indefinite number of frames, generally referred to as infinite length data. This is different from the examples illustrated in Figures 3 and 5, since those examples used a word template with a finite number of frames, 4.
Figure 5g illustrates a series of 24 frames, each assigned a trace-back pointer defining the partial paths. In this example MINCS has been set to 2 and MAXCS has been set at 5. Applying partial trace-back to infinite length data requires that clustered frames be output continuously to define portions of the input data. Hence, by employing the trace-back pointers in a scheme of partial trace-back, continuous data can be reduced.

Figure 5h illustrates all partial paths, ending at frames 21-24, converging at frame 10. Frames 1-4, 5-7 and 8-10 were found to be optimal clusters, and since the convergence point is frame 10, they can be output. Figure 5i shows the remaining tree after frames 1-4, 5-7 and 8-10 have been output. Figures 5g and 5h show the null pointer at frame 0. After the formation of Figure 5i, the convergence point of frame 10 designates the location of the new null pointer. By tracing back through to the convergence point and outputting frames through that point, infinite length data can be accommodated. In general, at frame n, the points to start trace-back are n, n-1, n-2, ..., n-MAXCS, since these paths are still active and can be combined with more incoming data.
The flowchart of Figures 6a and 6b illustrates the sequence of steps performed by differential encoding block 430 of Figure 4a. Starting with block 660, the differential encoding process reduces template storage requirements by generating the differences between adjacent channels for storage, rather than each channel's actual energy data. The differential encoding process operates on a frame-by-frame basis as described in Figure 4b. Hence, initialization block 661 sets the frame count FC to one and the channel total CT to 14. Block 662 calculates the frame total FT as before. Block 663 tests to see if all frames of the word have been encoded. If all frames have been processed, the differential encoding ends with block 664.
Block 665 begins the actual differential encoding procedure by setting the channel count CC equal to 1. The energy normalized data for channel one is read into the accumulator in block 666. Block 667 quantizes the channel one data into 1.5 dB steps for reduced storage. The channel data from feature extractor 312 is initially represented as 0.375 dB per step utilizing 8 bits per byte. When quantized into 1.5 dB increments, only 6 bits are required to represent a 96 dB energy range (2^6 x 1.5 dB). The first channel is not differentially encoded, so as to form a basis for determining adjacent channel differences.
The channel counter is incremented in block 670, and the next channel data is read into the accumulator at block 671. Block 672 quantizes the energy of this channel data at 1.5 dB per step. Since differential encoding stores the differences between channels rather `~ than the actual channel values, block 673 determines the ad~acent channel differences according to the equation:
Channel~CC)differential ~ CH(CC)data - CH(CC-l)RQV

where CHtCC-l)RQV is the reconstructed quantized value of the previous channel formed in block 675 of the previous loop, or in block 668 for CC=~.
Block 674 limits this channel differential value to the range -8 to +7. By restricting the bit value and quantizing the energy value, the range of adjacent channel differences becomes -12 dB to +10.5 dB.
Although different applications may require different quantization values or bit limits, our results indicate these values are sufficient for our application. Furthermore, since the limited channel difference is a four-bit signed number, two values per byte may be stored. Hence, the limiting and quantization procedures described here substantially reduce the amount of required data storage.
However, if the limited and quantized values of each differential were not used to form the next channel differential, a significant reconstruction error could result. Block 675 takes this error into account by reconstructing each channel value from quantized and limited data before forming the next channel differential. The internal variable RQV is formed for each channel by the equation:

    Channel(CC)RQV = CH(CC-1)RQV + CH(CC)differential

where CH(CC-1)RQV is the reconstructed quantized value of the previous channel. Hence, the use of the RQV variable inside the differential encoding loop prevents quantization errors from propagating to subsequent channels.
Block 676 outputs the quantized/limited channel differential to the template memory such that the differences are stored as two values per byte (see Figure 6c). Block 677 tests to see if all the channels have been encoded. If channels remain, the procedure repeats with block 670. If the channel count CC equals the channel total CT, the frame count FC is incremented in block 678 and tested in block 663 as before.
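The per-frame encoding loop of blocks 665 through 677 can be sketched in C as follows. The names are illustrative, the channel energies are assumed to be already quantized to 1.5 dB steps, and for simplicity the sketch returns all values in one signed-byte array.

    #include <stdint.h>

    #define CT 14

    /* Differentially encode one frame: channel 1 is stored directly;
       each remaining channel is stored as a 4-bit signed difference
       against the reconstructed quantized value (RQV) of the previous
       channel, so quantization error cannot propagate. */
    void encode_frame(const int q[CT], int8_t out[CT])
    {
        int rqv = q[0];              /* block 668: channel 1 RQV     */
        out[0] = (int8_t)q[0];       /* block 669: stored as-is      */
        for (int cc = 1; cc < CT; cc++) {
            int d = q[cc] - rqv;     /* block 673                    */
            if (d >  7) d =  7;      /* block 674: limit to -8..+7   */
            if (d < -8) d = -8;
            out[cc] = (int8_t)d;
            rqv += d;                /* block 675: RQV for next loop */
        }
    }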
The following calculations illustrate the reduced data rate that can be achieved with the present invention. Feature extractor 312 generates an 8-bit logarithmic channel energy value for each of the 14 channels, wherein the least significant bit represents three-eighths of a dB. Hence, one frame of raw word data applied to data reducer block 322 comprises 14 bytes of data, at 8 bits per byte, at 100 frames per second, which equals 11,200 bits per second.
After the energy normalization and segmentation/compression procedures have been performed, 16 bytes of data per frame are required (one byte for each of the 14 channels, one byte for the average frame energy AVGENG, and one byte for the repeat count). Thus, the data rate can be calculated as 16 bytes of data, at 8 bits per byte, at 100 frames per second, which, assuming an average of 4 frames per repeat count, gives 3200 bits per second.
After the differential encoding process of block 430 is completed, each frame of template memory 160 appears as shown in the reduced data format of Figure 6c.
The repeat count is stored in byte 1. The quantized, energy-normalized channel one data is stored in byte 2.
Bytes 3 through 9 have been divided such that two channel differences are stored in each byte. In other words, the differentially encoded channel 2 data is stored in the upper nibble of byte 3, and that of channel 3 is stored in the lower nibble of the same byte. The channel 14 differential is stored in the upper nibble of byte 9, and the average frame energy, AVGENG, is stored in the lower nibble of byte 9. At 9 bytes per frame of data, at 8 bits per byte, at 100 frames per second, and assuming an average repeat count of 4, the data rate now equals 1800 bits per second.
Hence, differential encoding block 430 has reduced 16 bytes of data into 9. If the repeat count values lie between 2 and 15, then the repeat count may also be stored in a four-bit nibble. One may then rearrange the repeat count data format to further reduce storage requirements to 8.5 bytes per frame. Moreover, the data reduction process has also reduced the data rate by at least a factor of six (11,200 to 1800 bits per second).
Consequently, the complexity and storage requirements of the speech recognition system are dramatically reduced, thereby allowing for an increase in speech recognition vocabulary.
3. Decoding Algorithm
Referring to Figure 7a, shown is an improved word model having frames 720 combined into 3 average frames 722, as discussed with block 420 in Figure 4a. Each average frame 722 is depicted as a state in a word model. Each state contains one or more substates. The number of substates is dependent on the number of frames combined to form the state. Each substate has an associated distance accumulator for accumulating similarity measures, or distance scores, between input frames and the average frames. Implementation of this improved word model is subsequently discussed with Figure 7b.
Figure 7b shows block 120 from Figure 3 expanded to show specific detail, including its relationship with template memory 160. The speech recognizer 326 is expanded to include a recognizer control block 730, a word model decoder 732, a distance RAM 734, a distance calculator 736 and a state decoder 738. The template decoder 328 and template memory are discussed immediately following discussion of the speech recognizer 326.
The recognizer control block 730 is used to coordinate the recognition process. Coordination includes endpoint detection (for isolated word recognition), tracking best accumulated distance scores of the word models, maintenance of link tables used to link words (for connected or continuous word recognition), special distance calculations which may be required by a specific recognition process, and initializing the distance RAM 734. The recognizer control may also buffer data from the acoustic processor.
For each frame of input speech, the recognizer updates all active word templates in the template memory. Specific requirements of the recognizer control 730 are discussed by Bridle, Brown and Chamberlain in a paper entitled "An Algorithm for Connected Word Recognition", Proceedings of the 1982 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 899-902. A corresponding control processor used by the recognizer control block is described by Peckham, Green, Canning and Stephens in a paper entitled "A Real-Time Hardware Continuous Speech Recognition System", Proceedings of the 1982 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 863-866.
The distance RAM 734 contains accumulated distances used for all substates current to the decoding process. If beam decoding is used, as described by B. Lowerre in "The Harpy Speech Recognition System", Ph.D. Dissertation, Computer Science Dept., Carnegie-Mellon University, 1977, then the distance RAM 734 would also contain flags to identify which substates are currently active. If a connected word recognition process is used, as described in "An Algorithm for Connected Word Recognition", supra, then the distance RAM 734 would also contain a linking pointer for each substate.
The distance calculator 736 calculates the distance between the current input frame and the state being processed. Distances are usually calculated according to the type of feature data used by the system to represent the speech. Bandpass filtered data may use Euclidean or Chebychev distance calculations, as described in "The Effects of Selected Signal Processing Techniques on the Performance of a Filter-Bank-Based Isolated Word Recognizer", B.A. Dautrich, L.R. Rabiner, T.B. Martin, Bell System Technical Journal, Vol. 62, No. 5, May-June 1983, pp. 1311-1336. LPC data may use a log-likelihood ratio distance calculation, as described by F. Itakura in "Minimum Prediction Residual Principle Applied to Speech Recognition", IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-23, pp. 67-72, Feb. 1975. The present embodiment uses filtered data, also referred to as channel bank information; hence either Chebychev or Euclidean calculations would be appropriate.
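For channel-bank feature frames, the two candidate measures can be sketched in C as below, taking "Chebychev" in its usual sense of the maximum absolute channel difference and giving the Euclidean measure in squared form (names are illustrative):

    double chebychev_distance(const double a[14], const double b[14])
    {
        double d = 0.0;
        for (int i = 0; i < 14; i++) {
            double e = (a[i] > b[i]) ? a[i] - b[i] : b[i] - a[i];
            if (e > d)
                d = e;               /* largest single-channel gap */
        }
        return d;
    }

    double euclidean_distance_sq(const double a[14], const double b[14])
    {
        double d = 0.0;
        for (int i = 0; i < 14; i++)
            d += (a[i] - b[i]) * (a[i] - b[i]);
        return d;
    }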
The state decoder 738 updates the distance RAM for each currently active state during the processing of the input frame. In other words, for each word model processed by the word model decoder 732, the state decoder 738 updates the required accumulated distances in the distance RAM 734. The state decoder also makes use of the distance between the input frame and the current state, determined by the distance calculator 736, and, of course, the template memory data representing the current state.
In Figure 7c, steps performed by the word model decoder 732 for processing each input frame are shown in flowchart form. A number of word searching techniques can be used to coordinate the decoding process, including a truncated searching technique, such as beam decoding, described by B. Lowerre in "The Harpy Speech Recognition System", Ph.D. Dissertation, Computer Science Dept., Carnegie-Mellon University, 1977. It should be noted that implementing a truncated search technique requires the speech recognizer control 730 to keep track of threshold levels and best accumulated distances.
At block 740 of Figure 7c, three variables are extracted from the recognizer control (block 730 of Figure 7b). The three variables are PCAD, PAD and Template PTR. Template PTR is used to direct the word model decoder to the correct word template. PCAD represents the accumulated distance from the previous state. This is the distance which is accumulated, exiting from the previous state of the word model, in sequence.
PAD represents the previous accumulated distance, although not necessarily from the previous contiguous state. PAD may differ from PCAD when the previous state has a minimum dwell time of 0, i.e., when the previous state may be skipped altogether. In an isolated word recognition system, PAD and PCAD would typically be initialized to 0 by the recognizer control. In a connected or continuous word recognition system, the initial values of PAD and PCAD may be determined from outputs of other word models.
In block 742 of Figure 7c, the state decoder performs the decoding function for the first state of a particular word model. The data representing the state is identified by the Template PTR provided from the recognizer control. The state decoder block is discussed in detail with Figure 7d.
A test is performed in block 744 to determine if all states of the word model have been decoded. If not, flow returns to the state decoder, block 742, with an updated Template PTR. If all states of the word model have been decoded, then the accumulated distances, PCAD and PAD, are returned to the recognizer control at block 748. At this point, the recognizer control would typically specify a new word model to decode. Once all word models have been processed, it should start processing the next frame of data from the acoustic processor. For an isolated word recognition system, when the last frame of input is decoded, the PCAD returned by the word model decoder for each word model would represent the total accumulated distance for matching the input utterance to that word model. Typically, the word model with the lowest total accumulated distance would be chosen as the one represented by the utterance which was recognized. Once a template match has been determined, this information is passed to control unit 334.
Now referring to Figure 7d, shown is a flowchart for performing the actual state decoding for each state of each word model, i.e., block 742 of Figure 7c expanded. The accumulated distances, PCAD and PAD, are passed along to block 750. At block 750, the distance from the word model state to the input frame is computed and stored as a variable called IFD, for input frame distance.
The maxdwell for the state is transferred from template memory, block 751. The maxdwell is determined from the number of frames which are combined in each average frame of the word template and is equivalent to the number of substates in the state. In fact, this system defines the maxdwell as the number of frames which are combined. This is because during word training, the feature extractor (block 310 of Figure 3) samples the incoming speech at twice the rate it does during the recognition process. Setting maxdwell equal to the number of frames averaged allows a spoken word to be matched to a word model when the word spoken during recognition is up to twice the time length of the word represented by the template.
The mindwell for each state is determined during the state decoding process. Since only the state's maxdwell is passed to the state decoder algorithm, mindwell is calculated as the integer part of maxdwell divided by 4 (block 752). This allows a spoken word to be matched to a word model when the word spoken during recognition is as short as half the time length of the word represented by the template.
A dwell counter, or substate pointer, i, is initialized in block 754 to indicate the current dwell count being processed. Each dwell count is referred to as a substate. The maximum number of substates for each state is defined according to maxdwell, as previously discussed. In this embodiment, the substates are processed in reverse order to facilitate the decoding process. Accordingly, since maxdwell is defined as the total number of substates in the state, "i" is initially set equal to maxdwell.
In block 756, a temporary accumulated distance, TAD, is set equal to substate i's accumulated distance, referred to as IFAD(i), plus the current input frame distance, IFD. The accumulated distance is presumed to have been updated from the previously processed input frame, and stored in the distance RAM, block 734 of Figure 7b. IFAD is set to 0 prior to the initial input frame of the recognition process for all substates of all word models.
The substate pointer is decremented at block 758.
If the pointer has not reached 0, block 760, the substate's new accumulated distance, IFAD(i+1), is set equal to the accumulated distance for the previous substate, IFAD(i), plus the current input frame distance, IFD, block 762. Otherwise, flow proceeds to block 768 of Figure 7e.
A test is performed in block 764 to determine whether the state can be exited from the current substate, i.e., if "i" is greater than or equal to mindwell. Until "i" is less than mindwell, the temporary accumulated distance, TAD, is updated to the minimum of either the previous TAD or IFAD(i+1), block 766. In other words, TAD is defined as the best accumulated distance leaving the current state.
Continuing on to block 768 of Figure 7e, the accumulated distance for the first substate is set to the best accumulated distance entering the state, which is PAD.
A test is then performed to determine if mindwell for the current state is 0, block 770. A mindwell of zero indicates that the current state may be skipped over to yield a more accurate match in the decoding of this word template. If mindwell for the state is not zero, PAD is set equal to the temporary accumulated distance, TAD, since TAD contains the best accumulated distance out of this state, block 772. If mindwell is zero, PAD is set as the minimum of either the previous state's accumulated distance out, PCAD, or the best accumulated distance out of this state, TAD, block 774. PAD represents the best accumulated distance allowed to enter the next state.
In block 776, the previous contiguous accumulated distance, PCAD, is set equal to the best accumulated distance leaving the current state, TAD. This variable is needed to compute PAD for the following state if that state has a mindwell of zero. Note, the minimum allowed maxdwell is 2, so that two adjacent states can never both be skipped.
Finally, the distance RAM pointer for the current state is updated to point to the next state in the word model, block 778. This step is required since the substates are decoded from end to beginning for a more efficient algorithm.
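Gathering the steps of Figures 7d and 7e, one state update can be sketched in C as follows. This is a sketch mirroring the flowchart, with illustrative names; ifad[] is indexed 1 through maxdwell and holds the per-substate accumulated distances from the distance RAM.

    /* Update one word-model state for the current input frame.
       ifd  = distance from input frame to this state (block 750)
       pad  = best accumulated distance entering/leaving states
       pcad = accumulated distance from previous contiguous state */
    void decode_state(double ifad[], int maxdwell, double ifd,
                      double *pad, double *pcad)
    {
        int mindwell = maxdwell / 4;             /* block 752 */
        int i = maxdwell;                        /* block 754 */
        double tad = ifad[i] + ifd;              /* block 756 */
        for (i = maxdwell - 1; i > 0; i--) {     /* blocks 758-760 */
            ifad[i + 1] = ifad[i] + ifd;         /* block 762 */
            if (i >= mindwell && ifad[i + 1] < tad)
                tad = ifad[i + 1];               /* blocks 764-766 */
        }
        ifad[1] = *pad;                          /* block 768 */
        if (mindwell == 0)                       /* blocks 770, 774 */
            *pad = (*pcad < tad) ? *pcad : tad;  /* state skippable */
        else
            *pad = tad;                          /* block 772 */
        *pcad = tad;                             /* block 776 */
    }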
The table shown in Appendix A illustrates the flowchart of Figures 7c, 7d and 7e applied in an example where an input frame is processed through a word model (similar to Fig. 7a) with 3 states, A, B and C. In the example, it is presumed that previous frames have already been processed. Hence, the table includes a column showing "old accumulated distances (IFAD)" for each substate in states A, B and C. Above the table, information is provided which will be referenced as the example develops. The 3 states have maxdwells of 3, 8 and 4, respectively, for A, B and C.
The mindwells for each state are shown in the table as 0, 2 and 1, respectively. It should be noted that these have been calculated, according to block 752 of Figure 7d, as the integer part of maxdwell/4. Also provided at the top of the table is the input frame distance (IFD) for each state according to block 750 of Figure 7d. This information could as well have been shown in the table, but it has been excluded to shorten the table and simplify the example. Only pertinent blocks are shown at the left side of the table.
The example begins at block 740 of Figure 7c.
The previous accumulated distances, PCAD and PAD, and the template pointer, which points to the first state of the word template being decoded, are received from the recognizer control. Accordingly, in the first row of the table, state A is recorded along with PCAD and PAD. Moving on to Figure 7d, the distance (IFD) is calculated, maxdwell is retrieved from template memory, mindwell is calculated and the substate pointer, "i", is initialized. Only the initialization of the pointer needs to be shown in the table, since the maxdwell, mindwell and IFD information is already provided above the table.
The second line shows i set equal to 3, the last substate, and the previous accumulated distance is retrieved from the distance RAM.
At block 756, the temporary accumulated distance, TAD, is calculated and recorded on the third line of the table. The test performed at block 760 is not recorded in the table, but the fourth line of the table shows flow moving to block 762, since all substates have not been processed. The fourth line of the table shows both the decrement of the substate pointer, block 758, and the calculation of the new accumulated distance, block 762.
Hence, recorded are i = 2, the corresponding old IFAD, and the new accumulated distance set at 14, i.e., the previous accumulated distance for the current substate plus the input frame distance for the state.

The test performed at block 764 results in the affirmative. The fifth line of the table shows the temporary accumulated distance, TAD, updated as the minimum of either the current TAD or IFAD(3). In this case, it is the latter, TAD = 14. Flow returns to block 758. The pointer is decremented and the accumulated distance for the second substate is calculated. This is shown on line six.
The first substate is processed similarly, at which point i is detected as equal to 0, and flow proceeds from block 760 to block 768. At block 768, IFAD
is set for the first substate according to PAD, the accumulated distance into the current state.
At block 770, the mindwell is tested against zero. If it equals zero, flow proceeds to block 774, where PAD is determined as the minimum of the temporary accumulated distance, TAD, or the previous accumulated distance, PCAD, since the current state can be skipped due to the zero mindwell. Since mindwell = 0 for state A, PAD is set to the minimum of 9 (TAD) and 5 (PCAD), which is 5.
PCAD is subsequently set equal to TAD, block 776.
Finally, the first state is completely processed, with the distance RAM pointer updated to the next state in the word model, block 778.
Flow returns to the flowchart in Figure 7c to update the template pointer, and back to Figure 7d, block 750, for the next state of the word model. This state is processed in a similar manner as the former, with the exceptions that PAD and PCAD, 5 and 9 respectively, are passed from the former state, mindwell for this state is not equal to zero, and block 766 will not be executed for all substates. Hence, block 772 is processed rather than block 774.
The third state of the word model is processed along the same lines as the first and second. After completing the third state, the flowchart of Figure 7c is returned to with the new PAD and PCAD variables for the recognizer control.
In summary, each state of the word model is updated one substate at a time in reverse order. Two variables are used to carry the most optimal distance from one state to the next. The first, PCAD, carries the minimum accumulated distance from the previous contiguous state. The second variable, PAD, carries the minimum accumulated distance into the current state and is either the minimum accumulated distance out of the previous state (same as PCAD) or, if the previous state has a mindwell of 0, the minimum of the minimum accumulated distance out of the previous state and the minimum accumulated distance out of the second previous state. To determine how many substates to process, mindwell and maxdwell are calculated according to the number of frames which have been combined in each state.
The flowcharts of Figures 7c, 7d and 7e allow for an optimal decoding of each data reduced word template. By decoding the designated substates in reverse order, processing time is minimized. However, since real time processing requires that each word template be accessed quickly, a special arrangement is required to readily extract the data reduced word templates. The template decoder 328 of Figure 7b is used to extract the specially formatted word templates from the template memory 160 in a high speed fashion. Since each frame is stored in template memory in the differential form of Figure 6c, the template decoder 328 utilizes a special accessing technique to allow the word model decoder 732 to access the encoded data without excessive overhead.
The word model decoder 732 addresses the template memory 160 to specify the appropriate template to decode. The same information is provided to the template decoder 328, since the address bus is shared by each. The address specifically points to an average frame in the template. Each frame represents a state in the word model. For every state requiring decoding, the address typically changes.
Referring again to the reduced data format of Figure 6c, once the address of a word template frame is sent out, the template decoder 328 accesses bytes 3 through 9 in a nibble access. Each byte is read as 8 bits and then separated. The lower four bits are placed in a temporary register with sign extension. The upper four bits are shifted to the lower four bits with sign extension and are stored in another temporary register. Each of the differential bytes is retrieved in this manner. The repeat count and the channel one data are retrieved in a normal 8-bit data bus access and temporarily stored in the template decoder 328. The repeat count (maxdwell) is passed directly to the state decoder, while the channel one data and channel 2-14 differential data (separated and expanded to 8 bits as just described) are differentially decoded according to the flowchart in Figure 8b, infra, before being passed to distance calculator 736.
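The nibble separation with sign extension can be sketched in C as follows (illustrative names; a portable sign extension is used rather than relying on arithmetic shifts):

    #include <stdint.h>

    static int8_t sign_extend_4(uint8_t nib)  /* 4-bit two's complement */
    {
        int v = nib & 0x0F;                   /* keep the low nibble    */
        return (int8_t)(v >= 8 ? v - 16 : v);
    }

    /* Split one of bytes 3-9 into two signed channel differentials. */
    void unpack_differentials(uint8_t byte, int8_t *upper, int8_t *lower)
    {
        *upper = sign_extend_4((uint8_t)(byte >> 4));  /* upper nibble */
        *lower = sign_extend_4(byte);                  /* lower nibble */
    }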
4. Data Expansion and Speech Synthesis

Referring now to Figure 8a, a detailed block diagram of data expander 346 of Figure 3 is illustrated. As will be shown below, data expansion block 346 performs the reciprocal function of data reduction block 322 of Figure 3. Reduced word data, from template memory 160, is applied to differential decoding block 802. The decoding function performed by block 802 is essentially the inverse of the algorithm performed by differential encoding block 430 of Figure 4a. Briefly stated, the differential decoding algorithm of block 802 "unpacks" the reduced word feature data stored in template memory 160 by adding the present channel difference to the previous channel data. This algorithm is fully described in the flowchart of Figure 8b.
Next, energy denormalization block 804 restores ~! the proper energy contour to the channel data by effecting the inverse algorithm performed in energy normalization block 410 of Figure 4a. The denormalization procedure adds the average energy value of all channels to each energy-normalized channel value stored in the template. The energy denormal~zation algorithm of ~lock 804 is fully described in the detailed flowchart of Figure 8c.
Finally, frame repeating block 806 determines the number of frames compressed into a single frame by segmentation/compression block 420 of Figure 4a, and performs a frame-repeat function to compensate accordingly. As the flowchart of Figure 8d illustrates, frame repeating block 806 outputs the same frame data "R" number of times, where R is the prestored repeat count obtained from template memory 160. Hence, reduced word data from the template memory is expanded to form "unpacked" word data which can be interpreted by the speech synthesizer.
The flowchart of Figure 8b illustrates the steps performed by differential decoding block 802 of data expander 346. Following start block 810, block 811 initializes the variables to be used in later steps. Frame count FC is initialized to one to correspond to the first frame of the word to be synthesized, and channel total CT is initialized to the total number of channels in the channel-bank synthesizer (14 in the present embodiment).
Next, the frame total FT is calculated in block 812. Frame total FT is the total number of frames in the word obtained from the template memory. Block 813 tests whether all frames of the word have been differentially decoded. If the present frame count FC is greater than the frame total FT, no frames of the word would be left to decode, so the decoding process for that word will end at block 814. If, however, FC is not greater than FT, the differential decoding process continues with the next frame of the word. The test of block 813 may alternatively be performed by checking a data flag (sentinel) stored in the template memory to indicate the end of all channel data.
The actual differential decoding process of each frame begins with block 815. First, the channel count CC is set equal to one in block 815, to determine the channel data to be read first from template memory 160.
Next, a full byte of data corresponding to the normalized energy of channel 1 is read from the template in block 816. Since channel 1 data is not differentially encoded, this single channel data may be output (to energy denormalization block 804) immediately via block 817.
The channel counter CC is then incremented in block 818 to point to the location of the next channel data. Block 819 reads the differentially encoded channel data (differential) for channel CC into an accumulator. Block 820 then performs the differential decoding function of forming channel CC data by adding channel CC-1 data to the channel CC differential. For example, if CC=2, then the equation of block 820 is:
Channel 2 data = Channel 1 data + Channel 2 Differential.

Block 821 then outputs this channel CC data to energy denormalization block 804 for further processing.
Block 822 tests to see whether the present channel count CC is equal to the channel total CT, which would indicate the end of a frame of data. If CC is not equal to CT, then the channel count is incremented in block 818 and the differential decoding process is performed upon the next channel. If all channels have been decoded (when CC equals CT), then the frame count FC is incremented in block 823 and compared in block 813 to perform an end-of-data test. When all frames have been decoded, the differential decoding process of data expander 346 ends at block 814.
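The loop of blocks 815 through 823 can be restated, for a single frame, in the following illustrative C sketch; the array layout (channel 1 stored whole, followed by the already-expanded differentials) and all names are assumptions of this sketch:

    #include <stdio.h>

    #define CT 14  /* channel total, as in the present embodiment */

    /* Illustrative sketch of Figure 8b for one frame: channel 1 is stored
     * whole (block 816); every later channel is the previous channel's data
     * plus the stored differential (block 820). */
    void decode_frame(const signed char in[CT], int out[CT])
    {
        out[0] = in[0];
        for (int cc = 1; cc < CT; cc++)
            out[cc] = out[cc - 1] + in[cc];  /* channel CC = channel CC-1 + diff */
    }

    int main(void)
    {
        const signed char frame[CT] = { 40, -3, 2, 1, 0, -1, 4, -2, 0, 1, -5, 3, 0, -2 };
        int channels[CT];
        decode_frame(frame, channels);
        for (int cc = 0; cc < CT; cc++)
            printf("channel %2d = %d\n", cc + 1, channels[cc]);
        return 0;
    }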
Figure 8c illustrates the sequence of steps performed by energy denormalization block 804. After starting at block 825, initialization of the variables takes place in block 826. Again, the frame count FC is initialized to one to correspond to the first frame of the word to be synthesized, and the channel total CT is initialized to the total number of channels in the channel bank synthesizer (14 in this case). The frame total FT is calculated in block 827 and the frame count is tested in block 828, as previously done in blocks 812 and 813. If all frames of the word have been processed (FC greater than FT), the sequence of steps ends at block 829. If, however, frames still need to be processed (FC not greater than FT), then the energy denormalization function is performed.
In block 830, the average frame energy AVGENG is obtained from the template for frame FC. Block 831 then sets the channel count CC equal to one. The channel data, formed from the channel differential in differential decoding block 802 (block 820 of Figure 8b), is now read in block 832. Since the frame is normalized by subtracting the average energy from each channel in energy normalization block 410 (Figure 4), it is similarly restored (denormalized) by adding the average energy back to each channel. Hence, the channel is denormalized in block 833 according to the formula shown. If, for example, CC=1, then the equation of block 833 is:

Channel 1 energy = Channel 1 data + average energy.

This denormalized channel energy is then output (to frame repeating block 806) via block 834. The next channel is obtained by incrementing the channel count in block 835, and testing the channel count in block 836 to see if all channels have been denormalized. If all channels have not yet been processed (CC not greater than CT), then the denormalization procedure repeats starting with block 832. If all channels of the frame have been processed (CC greater than CT), then the frame count is incremented in block 837, and tested in block 828 as before. In review, Figure 8c illustrates how the channel energies are denormalized by adding the average energy back to each channel.
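The denormalization formula of block 833 amounts to one addition per channel, as in this illustrative C sketch (all names are ours, not the patent's):

    #include <stdio.h>

    #define CT 14  /* channel total */

    /* Illustrative sketch of blocks 830-837: the average frame energy AVGENG
     * removed during normalization is added back to every channel. */
    void denormalize_frame(const int data[CT], int avg_energy, int energy[CT])
    {
        for (int cc = 0; cc < CT; cc++)
            energy[cc] = data[cc] + avg_energy;  /* channel energy = data + average */
    }

    int main(void)
    {
        int data[CT] = { 12, -4 }, energy[CT];  /* remaining channels zero */
        denormalize_frame(data, 37, energy);    /* AVGENG = 37 for this frame */
        printf("channel 1 = %d, channel 2 = %d\n", energy[0], energy[1]);
        return 0;
    }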
Referring now to Figure 8d, the sequence of steps performed by frame repeating block 806 of Figure 8a is illustrated in the flowchart. Again, the process starts at block 840 by first initializing the frame count FC to one and the channel total CT to 14 at block 841. In block 842, the frame total, FT, representing the number of frames in the word, is calculated as before.
Unlike the previous two flowcharts, all channel energies of the frame are simultaneously obtained in block 843, since the individual channel processing has now been completed. Next, the repeat count RC of frame FC is read from the template data in block 844. This repeat count RC corresponds to the number of frames combined into a single frame by the data compression algorithm performed in segmentation/compression block 420 of Figure 4. In other words, the RC is the "maxdwell" of each frame. The repeat count is now utilized to output the particular frame "RC" number of times.
Block 845 outputs all the channel energies CH(1-14)ENG of frame FC to the speech synthesizer. This represents the first time the "unpacked" channel energy data is output. The repeat count RC is then decremented by one in block 846. For example, if frame FC was not previously combined, the stored value of RC would equal one, and the decremented value of RC would equal zero. Block 847 then tests the repeat count. If RC is not equal to zero, then the particular frame of channel energies is again output in block 845. RC would again be decremented in block 846, and again tested in block 847. When RC is decremented to zero, the next frame of channel data is obtained. Thus, the repeat count RC represents the number of times the same frame is output to the synthesizer.
To obtain the next frame, the frame count FC is incremented in block 848, and tested in block 849. If all the frames of the word have been processed, the sequence of steps corresponding to frame repeating block 806 ends at block 850. If more frames need to be processed, the frame repeating function continues with block 843.
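An illustrative C sketch of the repeat loop of blocks 843 through 849 follows; emit_frame merely stands in for the path to the synthesizer, and all names are assumptions of the sketch:

    #include <stdio.h>

    #define CT 14

    static void emit_frame(const int energy[CT])
    {
        printf("frame out, channel 1 energy = %d\n", energy[0]);
    }

    /* Illustrative sketch of Figure 8d: each stored frame is output RC times,
     * where RC is the prestored repeat count ("maxdwell") of that frame. */
    void repeat_frames(const int frames[][CT], const int repeat[], int frame_total)
    {
        for (int fc = 0; fc < frame_total; fc++) {  /* blocks 842, 848, 849 */
            int rc = repeat[fc];                    /* block 844 */
            while (rc-- > 0)                        /* blocks 845, 846, 847 */
                emit_frame(frames[fc]);
        }
    }

    int main(void)
    {
        int frames[2][CT] = { { 10 }, { 20 } };
        int repeat[2] = { 3, 1 };  /* the first stored frame was 3 input frames */
        repeat_frames(frames, repeat, 2);
        return 0;
    }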
As we have seen, data expander block 346 essentially performs the inverse function, "unpacking" the stored template data which has been "packed" by data reduction block 322. It is to be noted that the separate functions of blocks 802, 804, and 806 may also be performed on a frame-by-frame basis, instead of the word-by-word basis illustrated in the flowcharts of Figures 8b, 8c, and 8d. In either case, it is the combination of data reduction, reduced template format, and data expansion techniques which allows the present invention to synthesize intelligible speech from speech recognition templates at a low data rate.
As illustrated in Figure 3, both the "template" word voice reply data, provided by data expander block 346, and the "canned" word voice reply data, provided by reply memory 344, are applied to channel bank speech synthesizer 340. Speech synthesizer 340 selects one of these data sources in response to a command signal from control unit 334. Both data sources 344 and 346 contain prestored acoustic feature information corresponding to the word to be synthesized.
This acoustic feature information comprises a plurality of channel gain values (channel energies), each representative of the acoustic energy in a specified frequency bandwidth, corresponding to the bandwidths of feature extractor 312. There is, however, no provision in the reduced template memory format to store other speech synthesizer parameters such as voicing or pitch information. This is because voicing and pitch information is not normally provided to speech recognition processor 120; this information is therefore not retained, primarily to reduce template memory requirements. Depending on the particular hardware configuration, reply memory 344 may or may not provide voicing and pitch information. The following channel bank synthesizer description assumes that voicing and pitch information are not stored in either memory.
Hence, channel bank speech synthesizer 340 must synthesize words from a data source which lacks voicing and pitch information. One important aspect of the present invention directly addresses this problem.
Figure 9a illustrates a detailed block diagram of channel bank speech synthesizer 340 having N channels.
Channel data inputs 912 and 914 represent the channel data outputs of reply memory 344 and data expander 346, respectively. Accordingly, switch array 910 represents the "data source decision" provided by device controller unit 334. For example, if a "canned" word is to be synthesized, channel data inputs 912 from reply memory 344 are selected as channel gain values 915. If a template word is to be synthesized, channel data inputs 914 from data expander 346 are selected. In either case, channel gain values 915 are routed to low-pass filters 940.
Low-pass filters 940 function to smooth the step discontinuities in frame-to-frame channel gain changes before feeding them to the modulators. These gain smoothing filters are typically configured as second-order Butterworth lowpass filters. In the present embodiment, lowpass filters 940 have a -3 dB cutoff frequency of approximately 28 Hz.
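One way to realize such a gain-smoothing filter digitally is a bilinear-transform biquad, sketched below in C. The 100 Hz gain-update rate is an assumption of this sketch, since the patent does not state the rate at which the gain contour is sampled:

    #include <math.h>
    #include <stdio.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Illustrative sketch: a second-order Butterworth lowpass (about 28 Hz
     * at -3 dB), one instance per channel gain. */
    typedef struct { double b0, b1, b2, a1, a2, x1, x2, y1, y2; } biquad_t;

    void butterworth_lp_init(biquad_t *f, double fc, double fs)
    {
        double k = tan(M_PI * fc / fs), q = 1.0 / sqrt(2.0);
        double norm = 1.0 / (1.0 + k / q + k * k);
        f->b0 = k * k * norm;  f->b1 = 2.0 * f->b0;  f->b2 = f->b0;
        f->a1 = 2.0 * (k * k - 1.0) * norm;
        f->a2 = (1.0 - k / q + k * k) * norm;
        f->x1 = f->x2 = f->y1 = f->y2 = 0.0;
    }

    double biquad_step(biquad_t *f, double x)   /* direct form I */
    {
        double y = f->b0 * x + f->b1 * f->x1 + f->b2 * f->x2
                 - f->a1 * f->y1 - f->a2 * f->y2;
        f->x2 = f->x1;  f->x1 = x;
        f->y2 = f->y1;  f->y1 = y;
        return y;
    }

    int main(void)
    {
        biquad_t f;
        butterworth_lp_init(&f, 28.0, 100.0);  /* 28 Hz cutoff, assumed 100 Hz rate */
        for (int n = 0; n < 5; n++)            /* smooth a gain step from 0 to 1 */
            printf("%f\n", biquad_step(&f, 1.0));
        return 0;
    }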
Smoothed channel gain values 945 are then applied to channel gain modulators 950. The modulators serve to adjust the gain of an excitation signal in response to the appropriate channel gain value. In the present embodiment, modulators 950 are divided into two predetermined groups: a first predetermined group of modulators (numbered 1 through M) having a first excitation signal input; and a second group of modulators (numbered M+1 through N) having a second excitation signal input. As can be seen from Figure 9a, the first excitation signal 925 is output from pitch pulse source 920, and the second excitation signal 935 is output from noise source 930.
These excitation sources will be described in further detail in the following figures.
Speech synthesizer 340 employs the technique called "split voicing" in accordance with the present invention. This technique allows the speech synthesizer to reconstruct speech from externally-generated acoustic feature information, such as channel gain values 915, without using external voicing information. The preferred embodiment does not utilize a voicing switch to distinguish between the pitch pulse source (voiced excitation) and the noise source (unvoiced excitation) to generate a single voiced/unvoiced excitation signal to the modulators. In contrast, the present invention "splits" the acoustic feature information provided by the channel gain values into two predetermined groups. The first predetermined group, usually corresponding to the low frequency channels, modulates the voiced excitation signal 925. A second predetermined group of channel gain values, normally corresponding to the high frequency channels, modulates the unvoiced excitation signal 935. Together, the low frequency and high frequency channel gain values are individually bandpass filtered and combined to generate a high quality speech signal.
It has been found that a "9/5 split" (M = 9) for a 14-channel synthesizer (N = 14) has provided excellent results for improving the quality of speech. However, it will be apparent to those skilled in the art that the voiced/unvoiced channel "split" can be varied to maximize the voice quality characteristics in particular synthesizer applications.
Modulators 1 through N serve to amplitude modulate the appropriate excitation signal in response to the acoustic feature information of that particular channel. In other words, the pitch pulse (buzz) or noise (hiss) excitation signal for channel M is multiplied by the channel gain value for channel M. The amplitude modification performed by modulators 950 can readily be implemented in software using digital signal processing (DSP) techniques. Similarly, modulators 950 may be implemented by analog linear multipliers as known in the art.
Both groups of modulated excitation signals 955 (1 through M, and M+1 through N) are then applied to bandpass filters 960 to reconstruct the N speech channels. As previously noted, the present embodiment utilizes 14 channels covering the frequency range 250 Hz to 3400 Hz. Additionally, the preferred embodiment utilizes DSP techniques to digitally implement in software the function of bandpass filters 960.
Appropriate DSP algorithms are described in chapter 11 of L.R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing (Prentice-Hall, Englewood Cliffs, N.J., 1975).

The filtered channel outputs 965 are then combined at summation circuit 970. Again, the summing function of the channel combiner may be implemented either in software, using DSP techniques, or in hardware, utilizing a summation circuit, to combine the N channels into a single reconstructed speech signal 975.
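Taken together, the modulate/filter/sum path for one output sample might be sketched in C as below. The bandpass() function is only a placeholder for the per-channel filters 960, and all names are assumptions of this sketch:

    #include <stdio.h>

    #define N 14  /* total channels */
    #define M 9   /* channels 1..M take the pitch (buzz) excitation */

    /* Placeholder for the per-channel bandpass filter of block 960; a real
     * implementation would keep distinct filter state for each channel. */
    static double bandpass(int ch, double x) { (void)ch; return x; }

    /* Illustrative sketch of split voicing: low channels modulate the buzz,
     * high channels modulate the hiss, and the filtered products are summed. */
    double synthesize_sample(const double gain[N], double buzz, double hiss)
    {
        double out = 0.0;
        for (int ch = 0; ch < N; ch++) {
            double excitation = (ch < M) ? buzz : hiss;  /* the "9/5 split" */
            out += bandpass(ch, gain[ch] * excitation);  /* modulate, filter, sum */
        }
        return out;
    }

    int main(void)
    {
        double gain[N];
        for (int ch = 0; ch < N; ch++) gain[ch] = 1.0;  /* flat gains, for example */
        printf("%f\n", synthesize_sample(gain, 0.5, -0.2));
        return 0;
    }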
An alternate embodiment of the modulator/bandpass filter configuration 980 is shown in Figure 9b. This figure illustrates that it is functionally equivalent to first apply excitation signal 935 (or 925) to bandpass filter 960, and then amplitude modulate the filtered excitation signal by channel gain value 945 in modulator 950. This alternate configuration 980' produces the equivalent channel output 965, since the function of reconstructing the channels is still achieved.
Noise source 930 produces unvoiced excitation signal 935, called "hiss". The noise source output is typically a series of random amplitude pulses of a constant average power, as illustrated by waveform 935 of Figure 9d. Conversely, pitch pulse source 920 generates a pulse train of voiced excitation pitch pulses, also of a constant average power, called "buzz". A typical pitch pulse source would have its pitch pulse rate determined by an external pitch period f0. This pitch period information, determined from an acoustic analysis of the desired synthesizer speech signal, is normally transmitted along with the channel gain information in a vocoder application, or would be stored, along with the voiced/unvoiced decision and channel gain information, in a "canned" word memory. However, as noted above, there is no provision in the reduced template memory format of the preferred embodiment to store all of these speech synthesizer parameters, since they are not all required for speech recognition. Hence, another aspect of the present invention is directed toward providing a high quality synthesized speech signal without prestored pitch information.

Pitch pulse source 920 of the preferred embodiment is shown in greater detail in Figure 9c. It has been found that a significant improvement in synthesized voice quality can be achieved by varying the pitch pulse period such that the pitch pulse rate decreases over the length of the word synthesized.
Therefore, excitation signal 925 is preferably comprised of pitch pulses of a constant average power and of a predetermined variable rate. This variable rate is determined as a function of the length of the word to be synthesized, and as a function of empirically-determined constant pitch rate changes. In the present embodiment, the pitch pulse rate linearly decreases on a frame-by-frame basis over the length of the word.
However, in other applications, a different variable rate may be desired to produce other speech sound characteristics.
Referring now to Figure 9c, pitch pulse source 920 is comprised of pitch rate control unit 940, pitch rate generator 942, and pitch pulse generator 944. Pitch rate control unit 940 determines the variable rate at which the pitch period is changed. In the preferred embodiment, the pitch rate decrease is determined from a pitch change constant, initialized from a pitch start constant, to provide pitch period information 922. The function of pitch rate control unit 940 may be performed in hardware by a programmable ramp generator, or in software by the controlling microcomputer. The operation of control unit 940 is fully described in conjunction with the next figure.
Pitch rate generator 942 utilizes this pitch period information to generate pitch rate signal 923 at regularly spaced intervals. This signal may be impulses, rising edges, or any other type of pitch pulse period conveying signal. Pitch rate generator 942 may be a timer, a counter, or a crystal clock oscillator which provides a pulse train equal to pitch period information 922. Again, in the present embodiment, the function of pitch rate generator 942 is performed in software.
Pitch rate signal 923 is used by pitch pulse generator 944 to create the desired waveform for pitch pulse excitation signal 925. Pitch pulse generator 944 may be a hardware waveshaping circuit, a monoshot clocked by pitch rate signal 923, or, as in the present embodiment, a ROM look-up table having the desired waveform information. Excitation signal 925 may exhibit the waveform of impulses, a chirp (frequency swept sine wave), or any other broadband waveform. Hence, the nature of the pulse is dependent upon the particular excitation signal desired.
Since excitation signal 925 must be of a constant average power, pitch pulse generator 944 also utilizes the pitch rate signal 923, or the pitch period 922, as an amplitude control signal. The amplitude of the pitch pulses is scaled by a factor proportional to the square root of the pitch period to obtain a constant average power. Again, the actual amplitude of each pulse is dependent upon the nature of the desired excitation signal.
The following discussion of Figure 9d, as applied to pitch pulse source 920 of Figure 9c, describes the sequence of steps taken in the preferred embodiment to produce the variable pitch pulse rate. First, the word length WL for the particular word to be synthesized is read from the template memory. This word length is the total number of frames of the word to be synthesized. In the preferred embodiment, WL is the sum of all repeat counts for all frames of the word template. Second, the pitch start constant PSC and pitch change constant PCC are read from a predetermined memory location in the synthesizer controller. Third, the number of word divisions is calculated by dividing the word length WL
by the pitch change constant PCC. The word division WD indicates how many consecutive frames will have the same pitch value. For example, waveform 921 illustrates a word length of 3 frames, a pitch start constant of 59, and a pitch change constant of 3. Thus, the word division, in this simple example, is calculated by dividing the word length (3) by the pitch change constant (3), to set the number of frames between pitch changes equal to one. A more complicated example: if WL=24 and PCC=4, then the word divisions would occur every 6 frames.
The pitch start constant of 59 represents the number of sample times between pitch pulses. For example, at an 8 kHz sampling rate, there would be 59 sample times (each 125 microseconds in duration) between pitch pulses. Therefore, the pitch period would be 59 x 125 microseconds = 7.375 milliseconds, or 135.6 Hz. After each word division, the pitch start constant is incremented by one (i.e., 60 = 133.3 Hz, 61 = 131.1 Hz) such that the pitch rate decreases over the length of the word. If the word length was longer, or the pitch change constant was shorter, several consecutive frames would have the same pitch value. This pitch period information is represented in Figure 9d by waveform 922. As waveform 922 illustrates, the pitch period information may be represented in a hardware sense by changing voltage levels, or in software by different pitch period values.
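The schedule just worked through can be sketched in C as follows; the integer-division handling of partial word divisions is an assumption of this sketch, and all names are ours:

    #include <math.h>
    #include <stdio.h>

    /* Illustrative sketch of the variable pitch schedule: starting from the
     * pitch start constant PSC, the period (in samples) grows by one after
     * each word division of WD = WL / PCC frames; pulse amplitude is scaled
     * by the square root of the period for constant average power. */
    int pitch_period_samples(int fc, int wl, int psc, int pcc)
    {
        int wd = wl / pcc;        /* frames per word division */
        if (wd < 1) wd = 1;       /* guard for very short words (our assumption) */
        return psc + fc / wd;     /* one increment per completed division */
    }

    int main(void)
    {
        const int wl = 3, psc = 59, pcc = 3;  /* the example of waveform 921 */
        const double fs = 8000.0;             /* 8 kHz sampling rate */
        for (int fc = 0; fc < wl; fc++) {
            int period = pitch_period_samples(fc, wl, psc, pcc);
            printf("frame %d: %d samples, %.1f Hz, amplitude scale %.3f\n",
                   fc, period, fs / period, sqrt((double)period));
        }
        return 0;  /* prints periods 59, 60, 61 -> 135.6, 133.3, 131.1 Hz */
    }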
When pitch period information 922 is applied to pitch rate generator 942, pitch rate signal waveform 923 is produced. Waveform 923 generally illustrates, in a simplified manner, that the pitch rate is decreasing at a rate determined by the variable pitch period. When the pitch rate signal 923 is applied to pitch pulse generator 944, excitation waveform 925 is produced. Waveform 925 is simply a waveshaped variation of waveform 923 having a constant average power. Waveform 935, representing the output of noise source 930 (hiss), illustrates the difference between periodic voiced and random unvoiced excitation signals.
As we have seen, the present invention provides a method and apparatus for synthesizing speech without voicing or pitch information. The speech synthesizer of the present invention employs the technique of "split voicing" and the technique of varying the pitch pulse period such that the pitch pulse rate decreases over the length of the word. Although either technique may be used by itself, the combination of split voicing and variable pitch pulse rate allows natural-sounding speech to be generated without external voicing or pitch information.

While specific embodiments of the present invention have been shown and described herein, further modifications and improvements may be made by those skilled in the art. All such modifications which retain the basic underlying principles disclosed and claimed herein are within the scope of this invention.
What is claimed is:

APPENDIX A

Processing of an input frame for three states of a word model (states A, B and C).

State A: Maxdwell = 3, Mindwell = 0 (752-Fig. 7(d)), IFD = 7 (750-Fig. 7(d))
State B: Maxdwell = 8, Mindwell = 2 (752-Fig. 7(d)), IFD = 3 (750-Fig. 7(d))
State C: Maxdwell = 4, Mindwell = 1 (752-Fig. 7(d)), IFD = 5 (750-Fig. 7(d))

[The remainder of Appendix A in the original is a numeric trace table (per-substate PAD and PCAD values for states A, B and C, keyed to blocks of Figures 7(c) and 7(d)) that is illegible in this copy and is not reproduced here.]

Claims (43)

1. A speech synthesizer for generating reconstructed speech signals from external acoustic feature information sets without using external specific voicing or pitch information, each said acoustic feature information set comprising a plurality of modification signals, said speech synthesizer comprising:
means for generating a first and second excitation signal from an external acoustic information set, including a plurality of channel gain values, for each reconstructed speech signal using substantially common voicing or pitch information, said first excitation signal having an identifiable periodicity;
means for changing the periodicity of said first excitation signal from a predetermined initial first excitation signal period at a rate related to the length of said external acoustic feature information set; and means for modifying an operating parameter of said first excitation signal in response to a first group of said modification signals, and for modifying an operating parameter of said second excitation signal in response to a second group of said modification signals, thereby producing corresponding first and second groups of modified outputs.
2. The speech synthesizer according to claim 1 wherein each of said plurality of gain values represents the acoustic energy in a specified frequency bandwidth of the desired speech signal to be synthesized.
3. The speech synthesizer according to claim 1, wherein said operating parameters of said first and second excitation signals are the amplitudes of said signals.
4. The speech synthesizer according to claim 1, wherein said first excitation signal is representative of periodic pulses of a predetermined variable rate.
5. The speech synthesizer according to claim 1, wherein said second excitation signal is representative of random noise.
6. The speech synthesizer according to claim 1, wherein said first group of modification signals is comprised of low frequency modification signals relative to said second group of modification signals which is comprised of high frequency modification signals.
7. The speech synthesizer according to claim 1, further comprising means for filtering said first and second groups of said modified outputs to produce a plurality of filtered outputs.
8. The speech synthesizer according to claim 7, further comprising means for combining each of said plurality of filtered outputs to form said reconstructed speech signal.
9. A channel bank speech synthesizer for generating reconstructed speech words from external acoustic feature information sets without using external specific voicing information, each said acoustic feature information set comprising a plurality of channel gain values, each representative of the acoustic energy in a specified frequency bandwidth, said acoustic feature information set further comprising pitch information, said speech synthesizer comprising:
means for generating a first and second excitation signal for each reconstructed speech word using substantially common voicing information, said first excitation signal representative of periodic pulses of a rate determined by said pitch information, said second excitation signal representative of random noise;
means for changing the periodicity of said first excitation signal of a reconstructed speech word from a predetermined first excitation signal period at a rate related to the length of an external acoustic information set;
means for amplitude modulating said first excitation signal of a reconstructed speech word in response to a first group of said plurality of channel gain values, and for amplitude modulating said second excitation signal of said reconstructed speech word in response to a second group of said plurality of channel gain values, thereby producing corresponding first and second groups of channel outputs for said reconstructed speech word;
means for filtering said first and second groups of channel outputs to produce a plurality of filtered channel outputs; and means for combining each of said plurality of filtered channel outputs to form said reconstructed speech word.
10. The speech synthesizer according to claim 9, wherein said speech synthesizer has fourteen channels.
11. The speech synthesizer according to claim 9, wherein said first group of channel gain values represent low frequency channels relative to said second group of channel gain values which represent high frequency channels.
12. The speech synthesizer according to claim 11, wherein the ratio of the number of channels in said first group to said second group is approximately 9/5.
13. The speech synthesizer according to claim 9, wherein said filtering means includes a plurality of bandpass filters covering the voice frequency range.
14. A channel bank speech synthesizer for generating reconstructed speech words from external acoustic feature information sets without using external specific pitch information, each said acoustic feature information set comprising a plurality of channel gain values, each representative of the acoustic energy in a specified frequency bandwidth, each said acoustic feature information set further comprising voicing information, said speech synthesizer comprising:
means for generating at least one excitation signal for each reconstructed speech word in response to said voicing information using substantially common pitch information, said excitation signal representative of periodic pulses having a variable rate related to the length of an external acoustic information set for voiced sounds, said excitation signal representative of random noise for unvoiced sounds;
means for amplitude modulating said excitation signal of a reconstructed speech word in response to a plurality of channel gain values, thereby producing a corresponding plurality of channel outputs for said reconstructed speech word;
means for filtering said plurality of channel outputs to produce a plurality of filtered channel outputs; and means for combining each of said plurality of filtered channel outputs to form said reconstructed speech word.
15. The speech synthesizer according to claim 14, wherein said variable rate changes in a predetermined manner over the length of the word to be synthesized.
16. The speech synthesizer according to claim 14, wherein said variable rate decreases linearly frame-by-frame of the word to be synthesized.
17. The speech synthesizer according to claim 14, wherein said excitation signal is of a constant average power.
18. The speech synthesizer according to claim 14, wherein said filtering means includes a plurality of bandpass filters covering the voice frequency range.
19. A channel bank speech synthesizer for generating reconstructed speech words from external acoustic feature information sets without using external specific voicing or pitch information, each said acoustic feature information set comprising a plurality of channel gain values, each channel gain value representative of the acoustic energy in a specified frequency bandwidth, said speech synthesizer comprising:
means for generating a first and second excitation signal for each reconstructed speech word using substantially common voicing or pitch information, said first excitation signal representative of periodic pulses of a variable rate, related to the length of an acoustic information set, said second excitation signal representative of random noise;
means for amplitude modulating said first excitation signal of a reconstructed speech word in response to a first group of said plurality of channel gain values, and for amplitude modulating said second excitation signal of said reconstructed speech word in response to a second group of said plurality of channel gain values, thereby producing corresponding first and second groups of channel outputs for said reconstructed speech word;
means for bandpass filtering said first and second groups of channel outputs to produce a plurality of filtered channel outputs; and means for combining each of said plurality of filtered channel outputs to form said reconstructed speech word.
20. The speech synthesizer according to claim 19, wherein said speech synthesizer has fourteen channels.
21. The speech synthesizer according to claim 19, wherein said first group of channel gain values represent low frequency channels relative to said second group of channel gain values which represent high frequency channels.
22. The speech synthesizer according to claim 19, wherein the ratio of the number of channels in said first group to said second group is approximately 9/5.
23. The speech synthesizer according to claim 19, wherein said predetermined variable rate decreases linearly frame-by-frame of the word to be synthesized.
24. The speech synthesizer according to claim 19, wherein said periodic pulses of said first excitation signal are of a constant average power.
25. The speech synthesizer according to claim 19, wherein said second excitation signal is a series of random pulses of a constant average power.
26. The speech synthesizer according to claim 19, wherein said bandpass filtering means is comprised of a bank of approximately 14 bandpass filters covering the frequency range from approximately 250 Hz. to 3400 Hz.
27. The speech synthesizer according to claim 19, wherein said combining means includes means for summing said plurality of filtered channel outputs to form a single reconstructed speech signal.
28. A method of synthesizing speech signals from external acoustic feature information sets without using external specific voicing or pitch information, each said acoustic feature information set comprising a plurality of modification signals, said speech synthesis method comprising the steps of:
generating a first and second excitation signal from an external acoustic feature information set, including a plurality of channel gain values, for each synthesized speech signal, using substantially common voicing or pitch information, said first excitation signal having an identifiable periodicity;
changing the periodicity of said first excitation signal from a predetermined initial first excitation signal period at a rate related to the length of said external acoustic feature information set;
modifying an operating parameter of said first excitation signal of a reconstructed speech word in response to a first group of said modification signals, and modifying an operating parameter of said second excitation signal of said reconstructed speech word in response to a second group of said modification signals, thereby producing corresponding first and second groups of modified outputs for said synthesized speech signal;
filtering said first and second groups of modified outputs to produce a plurality of filtered outputs; and combining each of said plurality of filtered outputs to form said synthesized speech signal.
29. The method according to claim 28, wherein each of said plurality of modification signals is comprised of a predetermined gain value.
30. The method according to claim 29, wherein each predetermined gain value represents the acoustic energy in a specified frequency bandwidth of the desired speech signal to be synthesized.
31. The method according to claim 28, wherein said operating parameters of said first and second excitation signals are the amplitudes of said signals.
32. The method according to claim 28, wherein said first excitation signal is representative of periodic pulses of a predetermined variable rate.
33. The method according to claim 28, wherein said second excitation signal is representative of random noise.
34. The method according to claim 28, wherein said first group of modification signals is comprised of low frequency modification signals relative to said second group of modification signals which is comprised of high frequency modification signals.
35. A method of synthesizing speech words from external acoustic feature information sets without using external specific voicing or pitch information, each said acoustic feature information set comprising a plurality of channel gain values, each gain value representative of the acoustic energy in a specified frequency bandwidth, said speech synthesis method comprising the steps of:
generating a first and second excitation signal for each synthesized speech word using substantially common voicing or pitch information, said first excitation signal representative of periodic pulses of a variable rate related to the length of an external acoustic information set, said second excitation signal representative of random noise;
amplitude modulating said first excitation signal of a synthesized speech word in response to a first group of said plurality of channel gain values, and amplitude modulating said second excitation signal of said synthesized speech word in response to a second group of said plurality of channel gain values, thereby producing corresponding first and second groups of channel outputs for said synthesized speech word;
bandpass filtering said first and second groups of channel outputs to produce a plurality of filtered channel outputs; and combining each of said plurality of filtered channel outputs to form said synthesized speech word.
36. The method according to claim 35, wherein said acoustic feature information is representative of fourteen channels.
37. The method according to claim 35, wherein said first group of channel gain values represent low frequency channels relative to said second group of channel gain values which represent high frequency channels.
38. The method according to claim 37, wherein the ratio of the number of channels in said first group to said second group is approximately 9/5.
39. The method according to claim 38, wherein said predetermined variable rate decreases linearly frame-by-frame of the word to be synthesized.
40. The method according to claim 35, wherein said periodic pulses of said first excitation signal are of a constant average power.
41. The method according to claim 35, wherein said second excitation signal is a series of random pulses of a constant average power.
42. The method according to claim 35, wherein said bandpass filtering step produces approximately 14 contiguous channels covering the frequency range from approximately 250 Hz. to 3400 Hz.
43. The method according to claim 35, wherein said combining step sums said plurality of filtered channel outputs to form a single reconstructed speech signal.