GB2231700A - Speech recognition - Google Patents

Speech recognition

Info

Publication number
GB2231700A
GB2231700A (application GB9010577A)
Authority
GB
United Kingdom
Prior art keywords
speech
speaker
recognition apparatus
speech recognition
information signals
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB9010577A
Other versions
GB2231700B (en)
GB9010577D0 (en)
Inventor
Michael Robinson Taylor
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Smiths Group PLC
Original Assignee
Smiths Group PLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smiths Group PLC
Publication of GB9010577D0
Publication of GB2231700A
Application granted
Publication of GB2231700B
Anticipated expiration
Expired - Lifetime (current status)


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Description

SPEECH RECOGNITION APPARATUS AND METHODS

This invention relates to speech recognition apparatus and methods.
Speech recognition apparatus operates by comparing speech information from the speaker with information in a store representing a reference vocabulary. If the words spoken by the speaker are closely similar to the spectral-temporal or acoustic-phonetic information in the store, this yields a high rate of matching. The reference vocabulary may be established from information derived from many different speakers in different circumstances, and it can be modified to characterise it more closely to the speech patterns of one particular speaker. This can result in accurate and reliable speech recognition when the speaker's voice is similar to that used to produce the reference vocabulary.
However, under some environmental conditions, the speaker's voice can be modified sufficiently for recognition to be made unreliable. In particular, if the speaker is influenced by linear acceleration forces, such as high g-forces in an aircraft, or by vibration or stress, this can alter his speech patterns sufficiently to reduce the ability of the speech recognition apparatus to identify the words spoken. Attempts have been made to overcome this problem, such as described in GB 2186726. This proposes measuring the acceleration or other environmental influences and modifying the stored reference templates or word models in the vocabulary by dynamic adaptation, to anticipate the way in which speech will be influenced by the environmental influences. In this way, the stored information after adaptation will bear a closer resemblance to the actual speech, such as speech influenced by acceleration. This arrangement, however, requires considerable processing capacity and can lead to delay in recognition.
It is an object of the present invention to provide speech recognition apparatus and methods that can be used to improve recognition when the speech is subject to environmental influences.
According to one aspect of the present invention there is provided speech recognition apparatus including means for deriving speech information signals from speech made by a speaker, means for sensing an environmental influence on the speaker of the kind that modifies speech sound made by the speaker, means for determining when voicing of speech occurs, means for reducing the spectral tilt of the speech information signals during voicing and when the sensed environmental influence is sufficient to cause the speaker to increase the mean fundamental excitation frequency of his speech, the reduction in speech spectral tilt being such as to compensate at least in part for this increase in mean fundamental excitation frequency, and means for comparing the speech information signals after any such reduction in speech spectral tilt with stored speech information signals.
The means for sensing an environmental influence may include an acceleration sensor, a vibration sensor and/or a noise sensor. The means for determining when voicing occurs may include a device responsive to movement of the vocal folds. The device responsive to movement of the vocal folds may be a laryngograph.
The means for reducing spectral tilt may be located intermediate the means for deriving the speech information signals and a spectral analysis unit which is arranged to produce output signals representative of the frequency bands within which the sound falls. The means for reducing the spectral tilt is preferably arranged to increase the reduction in speech spectral tilt when the environmental influence on the speaker increases. The apparatus may include means to perform sub-set selection on the stored speech information signals in accordance with words previously recognised. The apparatus may include means to perform active word selection on the stored speech information signals in accordance with mode data. The means for deriving speech information signals preferably includes a microphone.
The apparatus may be arranged to provide an output in accordance with identified words to control operation of aircraft equipment.
According to another aspect of the present invention there is provided a method of speech recognition including the steps of deriving speech information signals in accordance with speech made by a speaker, sensing environmental influences on the speaker of the kind that modify speech sounds made by the speaker, determining when the speech sounds are voiced, reducing spectral tilt of the speech information signals when both voicing is sensed and when the sensed environmental influences are sufficient to cause the speaker to increase the mean excitation frequency of his speech, the reduction in speech spectral tilt being such as to compensate at least in part for this increase in excitation frequency, and comparing the speech information signals after any such reduction in spectral tilt with stored speech information signals.
- The reduction in speech spectral tilt is preferably greater for increasing sensed environmental influences.
According to a further aspect of the present invention, there is provided apparatus for performing a method according to the other aspect of the invention.
Speech recognition apparatus for an aircraft, and its method of operation, in accordance with the present invention, will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 shows the apparatus schematically; Figure 2 illustrates operation of a part of previous apparatus; and Figure 3 illustrates operation of a part of the apparatus of the invention.
The speech recognition apparatus includes a processing unit 10 that receives input signals from a microphone 1, a laryngograph 2, environmental sensors 3 and a databus 4.
The microphone is located close to the speaker's mouth so as to detect his speech sounds. The laryngograph, which may be of a kind described in GB 2193024, is secured to the speaker's neck to sense movement of the vocal folds and thereby provide an output signal indicative of voiced speech sounds. The environmental sensors 3 are located where they will respond to substantially the same environmental influences as those to which the speaker is subjected. More particularly, the sensors 3 may include an acceleration sensor responsive to g-forces on the speaker, a vibration sensor and a noise sensor.
Signals from the microphone 1 are first supplied to a filter unit 11 in the processing unit 10; the filter unit also receives inputs from the laryngograph 2 and the sensors 3, and its operation will be described later. The output of the filter unit is supplied to a spectral analysis unit 12 which produces output signals in accordance with the frequency bands within which the sound falls. These output signals are supplied to an optional spectral correction and noise adaptation unit 13 which improves the signal-to-noise ratio or eliminates, or marks, those signals that can only have arisen from noise rather than speech. Output signals from the unit 13 are supplied to one input of a comparator or pattern matching unit 14. The other input to the pattern matching unit 14 is taken from a vocabulary store 30, which is described in greater detail below. The pattern matching unit 14 compares the spectral-temporal (frequency-time) patterns derived from the microphone 1 with the stored vocabulary and produces an output on line 15 in accordance with the word which is the best fit or has the highest probability of being the sound received by the microphone 1.
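The chain just described — spectral analysis of the filtered signal, then pattern matching against a stored vocabulary — can be sketched in miniature. This is an illustrative toy, not the patent's implementation: the band-energy analysis and nearest-template matcher merely stand in for units 12 and 14, and all names, signal lengths and the pure-tone "words" are invented for the example.

```python
import numpy as np

SAMPLE_RATE = 8000

def tone(freq, n=256):
    """Synthetic test 'speech': a pure tone (illustration only)."""
    return np.sin(2 * np.pi * freq * np.arange(n) / SAMPLE_RATE)

def band_energies(frame, n_bands=8):
    """Crude stand-in for the spectral analysis unit: split the FFT
    magnitude spectrum into n_bands bands and sum each band's energy."""
    spectrum = np.abs(np.fft.rfft(frame))
    return np.array([b.sum() for b in np.array_split(spectrum, n_bands)])

def best_match(pattern, vocabulary):
    """Stand-in for the pattern matching unit: return the stored word
    whose template is nearest (Euclidean distance) to the input pattern."""
    return min(vocabulary, key=lambda w: np.linalg.norm(pattern - vocabulary[w]))

# Reference vocabulary: templates 'recorded' under quiet, 1g conditions.
vocabulary = {"climb": band_energies(tone(500)),
              "descend": band_energies(tone(2000))}

word = best_match(band_energies(tone(500)), vocabulary)
```

A real system would compare whole frame sequences with dynamic time warping or word models rather than single band-energy vectors, but the data flow is the same.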
The output on line 15 is supplied to the input of a post recognition processing unit 16 which performs various tasks on the string of word outputs from the pattern matching unit 14 as discussed in greater detail later. The post recognition processing unit 16 has three outputs. One output is provided on line 18 as a feedback channel to an indicator 21. This may be an audible or visual indicator perceivable by the speaker which either confirms his spoken commands, as recognised by the units 14 and 16, or requests repetition of all or part of the command, where an unsatisfactory recognition is achieved. The second output is provided on line 19 to a word sub-set selection unit 32 forming a part of the vocabulary store 30, the operation of which is described in detail below.
The third output is provided on line 20 as the system command signal to the remote terminal 17. The system command signal is produced when the unit 10 identifies a spoken command with sufficient probability and may, for example, be used to effect operation of external equipment via the databus 4.
The store 30 includes a reference vocabulary 31 in the form of pattern templates or word models of the spectral-temporal pattern or state descriptions of different words. This vocabulary is established by the speaker speaking a list of words under normal environmental circumstances of no vibration, no noise and 1g. The sounds made are entered in the vocabulary 31 and labelled with the associated word. The total vocabulary 31 may be further reduced by an optional sub-set selection at 32, under control of signals on line 19, in accordance with words previously spoken and recognised.
Following sub-set selection, the vocabulary is further subjected to active word selection at 33 in response to mode data on line 34 from the remote terminal 17 which is derived from information supplied to the remote terminal on the databus 4. For example, in an aircraft, the mode data may indicate whether the aircraft is landing or taking off, or is in mid flight.
Alternatively, for example, if a radio channel had already been selected via a spoken command, the probability of reselection will be small so the words associated with selection of that radio channel can be excluded from the vocabulary at 33. Poor correlation with selected, active templates could be used to invoke re-processing of the speech on a wider syntax.
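The two vocabulary-narrowing stages above — sub-set selection at 32 from previously recognised words, and active word selection at 33 from mode data — amount to filtering the template store before matching. A minimal sketch, in which the templates, follow table and mode table are all hypothetical placeholders:

```python
def subset_select(vocabulary, last_word, follow_table):
    """Sub-set selection (unit 32): keep only words that the command
    syntax allows to follow the previously recognised word."""
    if last_word is None:
        return dict(vocabulary)
    allowed = follow_table.get(last_word, set(vocabulary))
    return {w: t for w, t in vocabulary.items() if w in allowed}

def active_select(vocabulary, mode, mode_table):
    """Active word selection (unit 33): keep only words valid for the
    current flight mode supplied over the databus."""
    allowed = mode_table.get(mode, set(vocabulary))
    return {w: t for w, t in vocabulary.items() if w in allowed}

# Hypothetical data: integer 'templates' and invented tables.
vocabulary = {"select": 0, "channel": 1, "gear": 2, "down": 3}
follow_table = {"select": {"channel", "gear"}, "gear": {"down"}}
mode_table = {"landing": {"gear", "down", "select"}}

active = active_select(subset_select(vocabulary, "select", follow_table),
                       "landing", mode_table)
```

After "select" has been recognised in landing mode, only "gear" survives both filters, so the pattern matcher has far fewer templates to score.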
The tasks performed by the unit 16 are as follows:
1. Grammar parsing and word spotting techniques are used to detect errors and recover words that have not been identified;

2. Identification of the template string or word model sequence of words which best fits the contextual information at the time. Since particular strings of words are more likely than others to be spoken during particular environmental circumstances, this can be used to improve the identification of the particular command spoken by the user; and

3. Following the final identification, the processing unit 16 may generate signals for use in bringing up to date the vocabulary sub-set selection performed by 32. These signals are supplied to the vocabulary store 30 via line 19.
It is well known that speech can be influenced by the environmental conditions to which the user is subjected. For example, high acceleration can subject a speaker's thorax and throat to high pressures, making speech difficult and unintelligible to conventional speech recognition apparatus. Similarly, high vibration also alters the ability of the speech articulators and airstream mechanism to function normally and thereby corrupts the speech. This is described in 'Effects of Low Frequency Whole-Body Sinusoidal Vibration on Speech', Michael R. Taylor, Proc. I.O.A. Vol 11 Part 5 (1989), pages 151 to 158, and in 'Studies in Automatic Speech Recognition and its Application in Aerospace', Chapter 5, PhD thesis by Michael R. Taylor. It has been found that in high noise environments a speaker will automatically alter his speech in a way that is not simply an amplitude increase. Conditions of high stress, such as caused by tiredness, high work load or impending danger, also influence the speech patterns of the speaker. The alterations produced in speech by these different environmental conditions are complex and, to provide compensation in a speech recognition apparatus, would require large processing capacity. It has, however, been found that these environmental conditions produce a universal effect on speech of a certain kind. More particularly, all these environmental conditions lead to an increase in the mean fundamental excitation frequency of voiced speech, that is, speech produced by movement of the vocal folds, which in turn causes an upward tilt in the voiced speech spectrum.
In conventional speech recognition apparatus, it is common practice to employ a pre-emphasis filter which acts to boost the upper frequencies of the speech signal prior to supply to any pattern matching functions. The performance of such a filter is illustrated in Figure 2. By contrast, the filter unit 11 of the present invention operates in the opposite sense, so as to reduce the mean frequency of the speech input, that is, its spectral tilt, under certain circumstances, as illustrated in Figure 3. This is achieved by attenuating higher frequencies by progressively greater amounts. Figure 3 illustrates a family of three curves A to C, although in practice a considerably larger number of curves would be used. The performance curve is selected according to the amount and nature of the environmental influences on the speaker. For example, under high g-force acceleration and with high noise present, the filter unit 11 might have the performance characteristic illustrated by curve A, whereas for a lower acceleration and with less noise, curve C would be used.
The spectral tilt correction function is only employed when the environmental influences are sufficiently great to affect the speech and when this speech is voiced. In normal circumstances of low environmental influences, the filter unit 11 takes a neutral (flat) characteristic, or a conventional characteristic as shown in Figure 2, where the speech spectrum is modified for both voiced and unvoiced speech.
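One simple way to realise a tilt-reducing filter of this general kind is a first-order de-emphasis (one-pole low-pass) whose coefficient plays the role of selecting among curves A to C: the greater the environmental severity, the stronger the high-frequency attenuation. This is an illustrative stand-in, not the patent's actual filter characteristic; the severity scale in [0, 1] and the coefficient mapping are assumptions.

```python
import numpy as np

def tilt_correction(frame, severity, voiced):
    """Reduce spectral tilt of a voiced frame under environmental
    influence using y[n] = (1 - a)*x[n] + a*y[n-1], a one-pole
    low-pass with unity DC gain. A larger `a` (chosen from
    `severity`, an assumed 0..1 scale) attenuates high frequencies
    more, mimicking the move from curve C towards curve A."""
    frame = np.asarray(frame, dtype=float)
    if not voiced or severity <= 0.0:
        return frame.copy()              # neutral (flat) characteristic
    a = min(0.9, float(severity))        # crude 'curve selection'
    out = np.empty(len(frame))
    prev = 0.0
    for i, x in enumerate(frame):
        prev = (1.0 - a) * x + a * prev  # one-pole low-pass recursion
        out[i] = prev
    return out
```

Passing a low-frequency tone through the filter leaves its energy nearly intact, while a high-frequency tone is strongly attenuated — the downward tilt the description calls for; unvoiced frames pass through untouched.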
Voiced speech is detected in the above example by means of a laryngograph, but other devices responsive to vocal fold movement could be used instead. Alternatively, voiced speech could be identified by analysis of the speech signals from the microphone. A suitable analysis technique is described in 'Theory and Applications of Digital Signal Processing', L. R. Rabiner and B. Gold, Prentice-Hall Inc., 1975, pages 681 to 687. Modification of the mean frequency of the speech input signal can be achieved relatively simply, without considerable processing capability, yet has been found to lead to a significant increase in the recognition rate of voiced speech under adverse environmental conditions.
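A microphone-only voicing decision of the kind alluded to can be sketched with two classic cues: frame energy and zero-crossing rate (voiced speech tends to show high energy and a low zero-crossing rate). This is a simplistic stand-in for the analysis techniques in the cited literature, and the thresholds are arbitrary assumptions.

```python
import numpy as np

def is_voiced(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Crude voiced/unvoiced decision in lieu of a laryngograph.
    Silence fails the energy test; fricatives and noise, which
    change sign often, fail the zero-crossing-rate test."""
    frame = np.asarray(frame, dtype=float)
    energy = float(np.mean(frame ** 2))
    signs = np.signbit(frame).astype(np.int8)
    zcr = np.count_nonzero(np.diff(signs)) / max(len(frame) - 1, 1)
    return energy > energy_thresh and zcr < zcr_thresh
```

A strong low-frequency tone is classed as voiced, while white noise (high ZCR) and silence (low energy) are not; a practical detector would smooth the decision across frames.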
Although the above system has been described for producing command signals, such as for controlling equipment, a similar system may be used in a speech communication system. In such an alternative arrangement, the line 20, instead of carrying command signals, would carry speech signals in respect of the identified words and phrases. It will be appreciated that the various steps in the operation of the recognition system need not be carried out by discrete units, but could be performed by steps in the programming of one or more computers or processing units.

Claims (18)

  1. Speech recognition apparatus including means for deriving speech information signals from speech made by a speaker, means for sensing an environmental influence on the speaker of the kind that modifies speech sound made by the speaker, means for determining when voicing of speech occurs, means for reducing the spectral tilt of the speech information signals during voicing and when the sensed environmental influence is sufficient to cause the speaker to increase the mean fundamental excitation frequency of his speech, the said reduction in speech spectral tilt being such as to compensate at least in part for this increase in mean fundamental excitation frequency, and means for comparing the speech information signals after any such reduction in speech spectral tilt with stored speech information signals.
  2. Speech recognition apparatus according to Claim 1, wherein the means for sensing an environmental influence includes an acceleration sensor.

  3. Speech recognition apparatus according to Claim 1 or 2, wherein the means for sensing an environmental influence includes a vibration sensor.

  4. Speech recognition apparatus according to any one of the preceding claims, wherein the means for sensing an environmental influence includes a noise sensor.

  5. Speech recognition apparatus according to any one of the preceding claims, wherein the means for determining when voicing occurs includes a device responsive to movement of the vocal folds.

  6. Speech recognition apparatus according to Claim 5, wherein the device responsive to movement of the vocal folds is a laryngograph.

  7. Speech recognition apparatus according to any one of the preceding claims, wherein the means for reducing spectral tilt is located intermediate the means for deriving speech information signals and a spectral analysis unit which is arranged to produce output signals representative of the frequency bands within which the sound falls.

  8. Speech recognition apparatus according to any one of the preceding claims, wherein the means for reducing the spectral tilt is arranged to increase the reduction in speech spectral tilt when the environmental influence on the speaker increases.

  9. Speech recognition apparatus according to any one of the preceding claims, including means to perform sub-set selection on the stored speech information signals in accordance with words previously recognised.
  10. Speech recognition apparatus according to any one of the preceding claims, including means to perform active word selection on the stored speech information signals in accordance with mode data.

  11. Speech recognition apparatus according to any one of the preceding claims, wherein the means for deriving speech information signals includes a microphone.
  12. Speech recognition apparatus according to any one of the preceding claims, wherein the apparatus is arranged to provide an output in accordance with identified words to control operation of aircraft equipment.

  13. Speech recognition apparatus substantially as hereinbefore described with reference to the accompanying drawings.

  14. A method of speech recognition including the steps of deriving speech information signals in accordance with speech made by a speaker, sensing environmental influences on the speaker of the kind that modify speech sounds made by the speaker, determining when the speech sounds are voiced, reducing spectral tilt of the speech information signals when both voicing is sensed and when the sensed environmental influences are sufficient to cause the speaker to increase the mean excitation frequency of his speech, the said reduction in speech spectral tilt being such as to compensate at least in part for this increase in excitation frequency, and comparing the speech information signals after any such reduction in spectral tilt with stored speech information signals.
  15. A method according to Claim 14, wherein the reduction in speech spectral tilt is greater for increasing sensed environmental influences.
  16. A method of speech recognition substantially as hereinbefore described with reference to the accompanying drawings.

  17. Apparatus for performing a method according to any one of Claims 14 to 16.
  18. Any novel feature or combination of features as hereinbefore described.
    Published 1990 at The Patent Office, State House, 66/71 High Holborn, London WC1R 4TP. Further copies may be obtained from The Patent Office, Sales Branch, St Mary Cray, Orpington, Kent BR5 3RD. Printed by Multiplex techniques ltd, St Mary Cray, Kent, Con. 1/87
GB9010577A 1989-05-16 1990-05-11 Speech recognition apparatus and methods Expired - Lifetime GB2231700B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB898911153A GB8911153D0 (en) 1989-05-16 1989-05-16 Speech recognition apparatus and methods

Publications (3)

Publication Number Publication Date
GB9010577D0 GB9010577D0 (en) 1990-07-04
GB2231700A true GB2231700A (en) 1990-11-21
GB2231700B GB2231700B (en) 1993-07-07

Family

ID=10656773

Family Applications (2)

Application Number Title Priority Date Filing Date
GB898911153A Pending GB8911153D0 (en) 1989-05-16 1989-05-16 Speech recognition apparatus and methods
GB9010577A Expired - Lifetime GB2231700B (en) 1989-05-16 1990-05-11 Speech recognition apparatus and methods

Family Applications Before (1)

Application Number Title Priority Date Filing Date
GB898911153A Pending GB8911153D0 (en) 1989-05-16 1989-05-16 Speech recognition apparatus and methods

Country Status (4)

Country Link
JP (1) JPH03208099A (en)
DE (1) DE4015381A1 (en)
FR (1) FR2647248A1 (en)
GB (2) GB8911153D0 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5400409A (en) * 1992-12-23 1995-03-21 Daimler-Benz Ag Noise-reduction method for noise-affected voice channels
DE4307688A1 (en) * 1993-03-11 1994-09-15 Daimler Benz Ag Method of noise reduction for disturbed voice channels
DE19712632A1 (en) * 1997-03-26 1998-10-01 Thomson Brandt Gmbh Method and device for remote voice control of devices
US6983242B1 (en) * 2000-08-21 2006-01-03 Mindspeed Technologies, Inc. Method for robust classification in speech coding

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE3766124D1 (en) * 1986-02-15 1990-12-20 Smiths Industries Plc METHOD AND DEVICE FOR VOICE PROCESSING.

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0421341A2 (en) * 1989-10-04 1991-04-10 Matsushita Electric Industrial Co., Ltd. Speech recognizer
EP0421341A3 (en) * 1989-10-04 1992-07-29 Matsushita Electric Industrial Co., Ltd. Speech recognizer
US5361324A (en) * 1989-10-04 1994-11-01 Matsushita Electric Industrial Co., Ltd. Lombard effect compensation using a frequency shift
EP0833304A2 (en) * 1996-09-30 1998-04-01 Microsoft Corporation Prosodic databases holding fundamental frequency templates for use in speech synthesis
EP0833304A3 (en) * 1996-09-30 1999-03-24 Microsoft Corporation Prosodic databases holding fundamental frequency templates for use in speech synthesis

Also Published As

Publication number Publication date
DE4015381A1 (en) 1990-11-22
GB2231700B (en) 1993-07-07
JPH03208099A (en) 1991-09-11
GB9010577D0 (en) 1990-07-04
FR2647248A1 (en) 1990-11-23
GB8911153D0 (en) 1989-09-20

Similar Documents

Publication Publication Date Title
US6553342B1 (en) Tone based speech recognition
US5131043A (en) Method of and apparatus for speech recognition wherein decisions are made based on phonemes
US6912499B1 (en) Method and apparatus for training a multilingual speech model set
US7593849B2 (en) Normalization of speech accent
EP0799471B1 (en) Information processing system
JPH0422276B2 (en)
JP2001503154A (en) Hidden Markov Speech Model Fitting Method in Speech Recognition System
WO2000003386A1 (en) Language independent speech recognition
US5142585A (en) Speech processing apparatus and methods
JPS6247320B2 (en)
JPH0876785A (en) Voice recognition device
JPH09230885A (en) Pattern position decision method and device therefor
JPH09230888A (en) Method and device for pattern matching
AU672696B2 (en) Method of speech recognition
GB2231700A (en) Speech recognition
Hansen et al. Robust speech recognition training via duration and spectral-based stress token generation
US5751898A (en) Speech recognition method and apparatus for use therein
US5732393A (en) Voice recognition device using linear predictive coding
US20020095282A1 (en) Method for online adaptation of pronunciation dictionaries
Holmes et al. Why have HMMs been so successful for automatic speech recognition and how might they be improved
Sharma et al. Speech recognition of Punjabi numerals using synergic HMM and DTW approach
Gao et al. A real-time Chinese speech recognition system with unlimited vocabulary
EP0269233A1 (en) Speech recognition apparatus and methods
JP3100180B2 (en) Voice recognition method
JPH0736477A (en) Pattern matching system

Legal Events

Date Code Title Description
732E Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977)
PE20 Patent expired after termination of 20 years

Expiry date: 20100510