GB2412997A - Method and apparatus for hands-free speech recognition using a microphone array - Google Patents
- Publication number
- GB2412997A GB0407900A
- Authority
- GB
- United Kingdom
- Prior art keywords
- speech recognition
- output signals
- multiple output
- level
- confidence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Abstract
A method is provided for improving hands-free speech recognition accuracy using a microphone array. The microphone array generates and streams multiple output signals for application to a speech recognition engine. Speech recognition is performed on the multiple output signals, and multiple level of confidence values are generated, indicative of the likelihood of correctly recognized speech in each of the multiple output signals. The level of confidence values are compared and, in response, the speech recognition result with the highest level of confidence value is selected. The method can use a beamformer algorithm to combine the outputs from the microphones, select a subset of output signals based on spatial localization, or combine the level of confidence values with spatial likelihood estimators related to the localization functionality.
Description
METHOD AND APPARATUS FOR IMPROVING HANDS-FREE SPEECH RECOGNITION
USING BEAMFORMING TECHNOLOGY
[0001] The present invention relates generally to hands-free speech recognition systems, and in particular to an end-point method and apparatus for improving hands-free speech recognition accuracy.
[0002] It is well known in the art that the performance of speech recognition engines degrades significantly in hands-free scenarios (see, for example, "Environmental robustness in automatic speech recognition", A. Acero and R. Stern, Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing), pp. 849-852, 1990). This performance degradation results from the presence of reverberation as well as background noise mixed with the original speech signal.
[0003] Accordingly, spatially directive sound pickup systems for speech recognition engines have been used in hands-free scenarios to reduce both reverberation and background noise via spatial filtering. One variety of spatially directive sound pickup system is the microphone array (see "A microphone array system for speech recognition", K. Kiyohara et al., Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing), pp. 215-218, 1997, and "Microphone array based speech recognition with different talker-array positions", M. Omologo et al., Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing), pp. 227-230, 1997). Another variety of spatially directive sound pickup system is the gradient, directional microphone.
[0004] Traditionally, spatially directive sound pickup systems either impose a requirement that the talker speak from a certain fixed position, or use one of a number of beamforming techniques to localize the talker in the room and synthesize a spatially filtered signal towards that direction in order to reduce the effects of reverberation and background noise. For example, "Acoustic talker localization", M. Amiri, D. Schulz and M. Tetelbaum, US patent application 20030051532, discloses localizing the talker in a room and streaming speech signals detected by a fixed beam pointing in the direction of the talker to a speech recognition engine. According to U.S. Patent No. 4,956,867 entitled "Adaptive Beamforming for Noise Reduction", the talker is localized in the room and adaptive beamforming techniques are used to form a beam for picking up the speech signal coming from the talker while reducing other spatial components of the sound signals at the microphones.
[0005] Whichever technique is used to perform talker localization, it has to present a very short reaction time so that proper spatial filtering can be performed as quickly as possible after the start of speech activity. This is very important since the first 50ms to 150ms of an utterance contain critical information from the speech recognition perspective, for instance differentiating "Bee" from "Pee" or "Dee". U.S. Patent Application No. 10/421,316 entitled "Method of compensating for beamformer steering delay during hands-free speech recognition" by M. Amiri and G. Thompson discloses a method for minimizing the impact of the latency of the localization system on the signal streamed to the speech recognition engine. The method of U.S. Patent Application No. 10/421,316 introduces a delay corresponding to this latency to ensure that compensation is applied at the start of the signal.
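The latency-compensation idea referenced above — buffering the onset of the utterance while the localizer catches up, so the first 50-150ms are not lost — can be sketched as a simple look-back buffer. The class name and API below are illustrative assumptions, not taken from U.S. Patent Application No. 10/421,316:

```python
from collections import deque

class LookbackBuffer:
    """Hedged sketch of steering-delay compensation: keep the most recent
    `latency` samples in a ring buffer so that, when the localizer (which
    runs `latency` samples behind real time) finally selects a beam,
    streaming can begin from the buffered past and the critical onset of
    the utterance is preserved."""

    def __init__(self, latency):
        # deque with maxlen silently discards the oldest sample on overflow
        self.buf = deque(maxlen=latency)

    def push(self, sample):
        self.buf.append(sample)

    def flush(self):
        """Return the buffered onset, oldest first, then clear the buffer."""
        out = list(self.buf)
        self.buf.clear()
        return out
```

Once flushed, subsequent samples would be streamed directly to the recognition engine, so the recognizer sees a gapless signal starting before the localizer's decision point.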
[0006] It is interesting to note, however, that in either of US patent application 20030051532 or US patent No. 4,956,867, the sound pickup system chooses or synthesizes the signal to be sent to the speech recognition engine based on merit characteristics that are mostly of a spatial nature. In these two patents, and in the area of talker localization generally, the talker location in the room is chosen based on such merit characteristics as the power of the signal at the output of a beamformer pointed in several directions, or the time delay between the microphone signals.
These characteristics attempt to localize the talker, which is a relevant objective for some applications, such as video conferencing, where the microphone array localizes the talker so that a camera can point to the talker. However, such characteristics do not necessarily reflect the specific needs of a speech recognition engine, which are generally not known to the end-point sound pickup scheme. For instance, power and SNR, although they are likely to be positive factors in increasing the likelihood of correct speech recognition, normally are not the most vital considerations for speech recognition, depending on the specific speech recognition engine.
[0007] Even with the improvements resulting from U.S. Patent Application No. 10/421,316, these characteristics do not necessarily offer the best performance in terms of speech recognition. It is contemplated that there are better ways of using all of the information available to the sound pickup scheme to assist operation of the speech recognition engine.
[0008] Accordingly, it is an object of the present invention to best use the real-time information received from spatially directive sound pickup schemes to increase the performance of speech recognition engines.
[0009] The present invention is based on the realization that most speech recognition engines provide a "level of confidence" value together with the response output. This level of confidence is a good indicator of the likelihood of the response being correct. Thus, according to the present invention, several spatially filtered voice signals are synthesized and provided to the speech recognition engine, and the engine's response with the highest level of confidence is selected. This offers the system the ability to make decisions in ways that are more relevant for speech recognition. Indeed, it is clear that the level of confidence provided by the speech recognition engine is a better indicator of speech recognition success than the level of noise or reverberation on the signal. The method of the present invention also solves the problem discussed above relating to the first 50ms to 150ms of an utterance, since all beam outputs are streamed from the very beginning of speech activity.
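The selection principle described above can be sketched as follows. The `recognize` callable is a hypothetical stand-in for any engine that returns a (text, confidence) pair for a given input signal:

```python
def select_best_hypothesis(beam_signals, recognize):
    """Run recognition on every beam output and keep the hypothesis whose
    level-of-confidence value is highest, as described in the method.
    `recognize` maps a signal to a (text, confidence) tuple."""
    best_text, best_conf = None, float("-inf")
    for signal in beam_signals:
        text, confidence = recognize(signal)
        if confidence > best_conf:
            best_text, best_conf = text, confidence
    return best_text, best_conf
```

In practice each beam would be streamed to the engine concurrently rather than in a sequential loop, but the decision rule — a simple argmax over confidence values — is the same.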
[0010] The streaming of multiple signals to the speech recognition engine is relatively simple to implement over digital networks (particularly over IP networks). Also, the method of this invention generally increases the reaction time of the speech recognition engine, but by an acceptable amount given the relatively 'loose' real-time requirements of this functionality.
[0011] Figure 1 shows a microphone array of K microphones resulting in N beams; and
[0012] Figure 2 is a block diagram of an apparatus for implementing the method according to the present invention.
[0014] With reference to Figure 1, a microphone array 1 comprises a plurality (K) of microphones arranged to pick up voice in a hands-free environment. The K signal outputs of array 1 are processed by a beamformer 3 using weighting coefficients to generate a plurality (N) of beamformer signal outputs B1(t), B2(t), ..., BN(t). The beamforming algorithm combines the signals from the various microphones of the array 1 to enhance the audio signal originating from each of the N desired locations and attenuate the audio signals originating from all other respective locations.
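A minimal sketch of beamformer 3, assuming a simple delay-and-sum scheme with integer sample delays and uniform weights (the document does not commit to a particular beamforming algorithm, so this is only one possible realization):

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """mic_signals: (K, T) array of microphone channels.
    delays: (N, K) integer sample delays, one row per steering direction.
    Returns an (N, T) array of beam outputs B1(t)..BN(t): each beam
    time-aligns the K channels for one direction and averages them,
    reinforcing sound from that direction relative to other locations."""
    K, T = mic_signals.shape
    beams = np.zeros((len(delays), T))
    for n, d in enumerate(delays):
        for k in range(K):
            # shift channel k so a wavefront from direction n lines up
            beams[n] += np.roll(mic_signals[k], d[k])
        beams[n] /= K  # uniform weighting across microphones
    return beams
```

With two microphones and an impulse arriving one sample later at the second microphone, the beam whose delays align the channels preserves full amplitude while a misaligned beam smears the impulse across two samples at half amplitude.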
[0015] Turning to Figure 2, the output of beamformer 3 is shown connected to an optional filter and transmit block 5 for selecting a subset of M beams (M ≤ N) Bj1(t), Bj2(t), ..., BjM(t) based on characteristics such as power, SNR, or other characteristics that can be extracted using signal processing techniques. The filter and transmit block 5 may be omitted or replaced by a simple streaming block that simply streams the output signals of all beams over the network 7 to speech recognition engine 9. However, to the extent that the filter and transmit block performs filtering, then the delay compensation technique set forth in U.S. Patent Application No. 10/421,316 should be used to compensate for delay latency at the start of each speech utterance, prior to the signal being sent to the speech recognition engine 9.
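The filter and transmit block's subset selection might, under a power-based criterion, look like the following sketch (power is one of the characteristics the text names; SNR or any other extracted feature could be substituted for the ranking key):

```python
import numpy as np

def select_beams_by_power(beams, M):
    """beams: (N, T) array of beam outputs from the beamformer.
    Returns the indices of the M beams with the highest mean power,
    i.e. the subset Bj1..BjM the filter and transmit block would
    forward to the speech recognition engine."""
    power = np.mean(np.asarray(beams) ** 2, axis=1)
    # argsort ascending, take the last M indices, report them in order
    return sorted(np.argsort(power)[-M:].tolist())
```

The choice of M trades network bandwidth and recognition-engine load against the chance of discarding the beam that would have produced the highest-confidence result.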
[0016] Speech recognition engines normally provide a confidence level output along with the detected speech. This output is sometimes referred to as a "confidence score" or "confidence measure", and is an area of significant research in the art. For example, Delaney, D. W., "Voice User Interface for Wireless Internetworking", Qualifying Examination Report, Georgia Institute of Technology, School of Electrical and Computer Engineering, Atlanta, Ga., Jan. 30, 2001, discussed confidence measures as part of the general subject of speech recognition. Articles dealing with confidence levels in speech recognition engines include: 1) Thomas Kemp and Thomas Schaaf, "Confidence measures for spontaneous speech recognition", International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1997; 2) Timothy Hazen, Stephanie Seneff and Joseph Polifroni, "Recognition confidence scoring and its use in speech understanding systems", Computer Speech and Language (2002) #16, pp. 49-67; and 3) Daniel Willett, Andreas Worm, Christoph Neukirchen, Gerhard Rigoll, "Confidence Measures For HMM-Based Speech Recognition", 5th International Conference on Spoken Language Processing (ICSLP), 1998.
[0017] Patents dealing with confidence levels in speech recognition engines include: US patent 6,539,353 "Confidence measures using sub-word-dependent weighting of sub-word confidence scores for robust speech recognition", Jiang et al.; US patent 6,571,210 "Confidence measure system using a near-miss pattern", Hon et al.; US patent 6,006,183 "Speech recognition confidence level display", Lai et al.; and US patent 6,421,640 "Speech recognition method using confidence measure evaluation", Dolfing et al. Thus, according to the present invention, an analysis block 11 receives the response of the speech recognition engine 9 for each of the M beamformer outputs, together with a level of confidence generated in the usual course by the speech recognition engine 9.
The analysis block 11 compares the levels of confidence given by the engine and selects the speech recognition response with the highest level of confidence. Alternatively, the filter and transmit block 5 can transmit other likelihood estimators to the analysis block 11, which can then be used in conjunction with the levels of confidence to make the final decision. For instance, the filter and transmit block 5 may transmit the beam output power or SNR for each of the M candidate beams to the analysis block 11. The analysis block 11 then uses a form of multiple-input state machine (e.g. logic) to combine the information and make a decision on which response from the speech recognition engine should be streamed to the end point.
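One possible form of the analysis block's combination logic is a weighted linear blend of the recognizer's confidence with a normalized spatial likelihood (e.g. normalized beam power). The weight `alpha` and the linear form are illustrative assumptions; the text leaves the combining logic open:

```python
def combine_scores(candidates, alpha=0.7):
    """candidates: list of dicts with 'text' and 'confidence' (from the
    recognition engine) plus 'spatial' (a 0..1 spatial likelihood such
    as normalized beam power or SNR from the filter and transmit block).
    Returns the text of the candidate with the best combined score."""
    best = max(
        candidates,
        key=lambda c: alpha * c["confidence"] + (1 - alpha) * c["spatial"],
    )
    return best["text"]
```

With `alpha` near 1 the decision reduces to the pure highest-confidence rule of claim 1; lowering it lets strong spatial evidence override a marginally higher confidence value, as contemplated in claim 4.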
[0018] The many features and advantages of the invention are apparent from the detailed specification and, thus, it is intended by the appended claims to cover all such features and advantages of the invention that fall within the sphere and scope of the invention. For example, a person of ordinary skill in the art will appreciate that directional microphones may be used instead of beamforming, or the output signals of several directional microphones may be combined to effectively perform beamforming on the directional microphones. The optional filter and transmit block 5 may employ any of a number of well-known spatial localization techniques for limiting the output to a subset of the beam outputs (e.g. only the beams with the highest power). Also, the levels of confidence generated by speech recognition engine 9 for several beam output signals may be combined with spatial likelihood estimators related to the localization functionality in order to compute revised success estimators. The exact manner by which the multiple streams are transmitted to the speech recognition engine 9 does not form part of the present invention but would be well known to a person of skill in the art. In particular, the exact implementation of this communication protocol may depend on the type of network being used (e.g. TDM or IP network). Since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation illustrated and described, and accordingly all suitable modifications and equivalents may be resorted to, falling within the scope of the invention.
Claims (4)
1. For use with a speech recognition engine and a microphone array for generating multiple output signals, a method of improving hands-free speech recognition accuracy comprising: streaming and transmitting said multiple output signals to said speech recognition engine; performing speech recognition on said multiple output signals in connection with which multiple level of confidence values are generated indicative of the likelihood of correctly recognized speech in each of said multiple output signals; and comparing said level of confidence values and in response selecting as recognized speech the speech recognition with highest level of confidence value.
2. The method of claim 1, further comprising beamforming said multiple output signals before transmission to said speech recognition engine.
3. The method of claim 1, further comprising spatial localization of said multiple output signals for selecting a subset of active ones of said multiple output signals for transmission to said speech recognition engine.
4. The method of claim 1, further comprising combining said level of confidence values with spatial likelihood estimators related to localization functionality in order to generate revised level of confidence values.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0407900A GB2412997A (en) | 2004-04-07 | 2004-04-07 | Method and apparatus for hands-free speech recognition using a microphone array |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0407900A GB2412997A (en) | 2004-04-07 | 2004-04-07 | Method and apparatus for hands-free speech recognition using a microphone array |
Publications (2)
Publication Number | Publication Date |
---|---|
GB0407900D0 GB0407900D0 (en) | 2004-05-12 |
GB2412997A true GB2412997A (en) | 2005-10-12 |
Family
ID=32320507
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB0407900A Withdrawn GB2412997A (en) | 2004-04-07 | 2004-04-07 | Method and apparatus for hands-free speech recognition using a microphone array |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2412997A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010114935A1 (en) * | 2009-04-02 | 2010-10-07 | Qualcomm Incorporated | Beamforming options with partial channel knowledge |
JP2016080750A (en) * | 2014-10-10 | 2016-05-16 | 株式会社Nttドコモ | Voice recognition device, voice recognition method, and voice recognition program |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110299136A (en) * | 2018-03-22 | 2019-10-01 | 上海擎感智能科技有限公司 | A kind of processing method and its system for speech recognition |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5353376A (en) * | 1992-03-20 | 1994-10-04 | Texas Instruments Incorporated | System and method for improved speech acquisition for hands-free voice telecommunication in a noisy environment |
WO1996041214A1 (en) * | 1995-06-07 | 1996-12-19 | Sensimetrics Corporation | Apparatus and method for speech recognition using spatial information |
JP2000101598A (en) * | 1998-09-25 | 2000-04-07 | Matsushita Electric Works Ltd | Voice communication system |
EP1160772A2 (en) * | 2000-06-02 | 2001-12-05 | Canon Kabushiki Kaisha | Multisensor based acoustic signal processing |
WO2003058604A1 (en) * | 2001-12-29 | 2003-07-17 | Motorola Inc., A Corporation Of The State Of Delaware | Method and apparatus for multi-level distributed speech recognition |
US20030204397A1 (en) * | 2002-04-26 | 2003-10-30 | Mitel Knowledge Corporation | Method of compensating for beamformer steering delay during handsfree speech recognition |
US20040044516A1 (en) * | 2002-06-03 | 2004-03-04 | Kennewick Robert A. | Systems and methods for responding to natural language speech utterance |
-
2004
- 2004-04-07 GB GB0407900A patent/GB2412997A/en not_active Withdrawn
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010114935A1 (en) * | 2009-04-02 | 2010-10-07 | Qualcomm Incorporated | Beamforming options with partial channel knowledge |
US8463191B2 (en) | 2009-04-02 | 2013-06-11 | Qualcomm Incorporated | Beamforming options with partial channel knowledge |
JP2016080750A (en) * | 2014-10-10 | 2016-05-16 | 株式会社Nttドコモ | Voice recognition device, voice recognition method, and voice recognition program |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
Also Published As
Publication number | Publication date |
---|---|
GB0407900D0 (en) | 2004-05-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4734070B2 (en) | Multi-channel adaptive audio signal processing with noise reduction | |
US8233352B2 (en) | Audio source localization system and method | |
JP4929685B2 (en) | Remote conference equipment | |
US9837099B1 (en) | Method and system for beam selection in microphone array beamformers | |
US9338549B2 (en) | Acoustic localization of a speaker | |
US9171551B2 (en) | Unified microphone pre-processing system and method | |
KR101619578B1 (en) | Apparatus and method for geometry-based spatial audio coding | |
JP3521914B2 (en) | Super directional microphone array | |
US7366310B2 (en) | Microphone array diffracting structure | |
WO2019089486A1 (en) | Multi-channel speech separation | |
US8204247B2 (en) | Position-independent microphone system | |
EP2146519A1 (en) | Beamforming pre-processing for speaker localization | |
US20100008518A1 (en) | Methods for processing audio input received at an input device | |
US20160165338A1 (en) | Directional audio recording system | |
Taherian et al. | Deep learning based multi-channel speaker recognition in noisy and reverberant environments | |
Ince et al. | Assessment of general applicability of ego noise estimation | |
CN106303870B (en) | Method for the signal processing in binaural listening equipment | |
Ryan et al. | Application of near-field optimum microphone arrays to hands-free mobile telephony | |
CN115457971A (en) | Noise reduction method, electronic device and storage medium | |
Zheng et al. | A microphone array system for multimedia applications with near-field signal targets | |
GB2412997A (en) | Method and apparatus for hands-free speech recognition using a microphone array | |
CA2485728C (en) | Detecting acoustic echoes using microphone arrays | |
US11889261B2 (en) | Adaptive beamformer for enhanced far-field sound pickup | |
Adcock et al. | Practical issues in the use of a frequency‐domain delay estimator for microphone‐array applications | |
US11937047B1 (en) | Ear-worn device with neural network for noise reduction and/or spatial focusing using multiple input audio signals |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |