GB2412997A - Method and apparatus for hands-free speech recognition using a microphone array - Google Patents

Method and apparatus for hands-free speech recognition using a microphone array

Info

Publication number
GB2412997A
GB2412997A GB0407900A
Authority
GB
United Kingdom
Prior art keywords
speech recognition
output signals
multiple output
level
confidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0407900A
Other versions
GB0407900D0 (en)
Inventor
Stephane Dedieu
Franck Beaucoup
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mitel Networks Corp
Original Assignee
Mitel Networks Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitel Networks Corp filed Critical Mitel Networks Corp
Priority to GB0407900A priority Critical patent/GB2412997A/en
Publication of GB0407900D0 publication Critical patent/GB0407900D0/en
Publication of GB2412997A publication Critical patent/GB2412997A/en
Withdrawn legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A method is provided for improving hands-free speech recognition accuracy using a microphone array. The microphone array generates and streams multiple output signals for application to a speech recognition engine. Speech recognition is performed on the multiple output signals, in connection with which multiple level-of-confidence values are generated, indicative of the likelihood of correctly recognized speech in each of the multiple output signals. The level-of-confidence values are compared and, in response, the speech recognition with the highest level-of-confidence value is selected. The method can use a beamformer algorithm to combine the outputs from the microphones, select a subset of output signals based on spatial localization, or combine the level-of-confidence values with spatial likelihood estimators related to localization functionality.

Description

METHOD AND APPARATUS FOR IMPROVING HANDS-FREE SPEECH RECOGNITION
USING BEAMFORMING TECHNOLOGY
[0001] The present invention relates generally to hands-free speech recognition systems and in particular to an end-point method and apparatus for improving hands-free speech recognition accuracy.
[0002] It is well known in the art that the performance of speech recognition engines degrades significantly in hands-free scenarios (see, for example, "Environmental robustness in automatic speech recognition", A. Acero and R. Stern, Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing), pp. 849-852, 1990). This performance degradation results from the presence of reverberation as well as background noise mixed with the original speech signal.
[0003] Accordingly, spatially directive sound pickup systems for speech recognition engines have been used in hands-free scenarios to reduce both reverberation and background noise via spatial filtering. One variety of spatially directive sound pickup systems is microphone arrays (see "A microphone array system for speech recognition", K. Kiyohara et al., Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing), pp. 215-218, 1997, and "Microphone array based speech recognition with different talker-array positions", M. Omologo et al., Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing), pp. 227-230, 1997). Another variety of spatially directive sound pickup systems is gradient, directional microphones.
[0004] Traditionally, spatially directive sound pickup systems either impose a requirement that the talker speak from a certain fixed position, or use one of a number of beamforming techniques to localize the talker in the room and synthesize a spatially filtered signal towards that direction in order to reduce the effects of reverberation and background noise. For example, "Acoustic talker localization", M. Amiri, D. Schulz and M. Tetelbaum, US patent application 20030051532, discloses localizing the talker in a room and streaming speech signals detected by fixed beams pointing in the direction of the talker to a speech recognition engine. According to U.S. Patent No. 4,956,867 entitled "Adaptive Beamforming for Noise Reduction", the talker is localized in the room and adaptive beamforming techniques are used to form a beam for picking up the speech signal coming from the talker while reducing other spatial components of the sound signals at the microphones.
[0005] Whichever technique is used to perform talker localization, it has to present a very short reaction time so that proper spatial filtering can be performed as quickly as possible after the start of speech activity. This is very important since the first 50ms to 150ms of an utterance contain critical information from the speech recognition perspective, for instance differentiating "Bee" from "Pee" or "Dee". U.S. Patent Application No. 10/421,316 entitled "Method of compensating for beamformer steering delay during hands-free speech recognition" by M. Amiri and G. Thompson discloses a method for minimizing the impact of the latency of the localization system on the signal streamed to the speech recognition engine. The method of U.S. Patent Application No. 10/421,316 introduces a delay corresponding to this latency to ensure that compensation is applied at the start of the signal.
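The latency-compensation idea above can be pictured as a short ring buffer: the endpoint retains the most recent samples so that, once the localizer settles on a beam, streaming can begin slightly in the past and the utterance onset is preserved. The class and names below are an illustrative sketch only, not the implementation of U.S. Patent Application No. 10/421,316:

```python
from collections import deque

class OnsetBuffer:
    """Retain the last `latency` samples so that streaming can start
    `latency` samples in the past once a beam has been selected.
    (Illustrative sketch; names and structure are assumptions.)"""

    def __init__(self, latency: int):
        self.buf = deque(maxlen=latency)  # oldest samples fall off automatically

    def push(self, sample: float) -> None:
        self.buf.append(sample)

    def drain(self) -> list:
        """Return the buffered onset, oldest sample first."""
        return list(self.buf)

buf = OnsetBuffer(latency=4)
for s in range(10):      # samples arriving while localization is still settling
    buf.push(s)
print(buf.drain())       # -> [6, 7, 8, 9]
```

In a real endpoint the latency would be sized to cover the localizer's worst-case reaction time, so the critical first 50ms to 150ms of the utterance always reach the recognizer.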
[0006] It is interesting to note, however, that in either of US patent application 20030051532 or US patent No. 4,956,867, the sound pickup system chooses or synthesizes the signal to be sent to the speech recognition engine based on merit characteristics that are mostly of a spatial nature. In these two patents and in the area of talker localization, the talker location in the room is chosen based on such merit characteristics as the power of the signal at the output of a beamformer pointed in several directions, or the time delay between the microphone signals.
These characteristics attempt to localize the talker, which is a relevant objective for some applications, such as video conferencing, where the microphone array localizes the talker so that a camera can point to the talker. However, such characteristics do not necessarily reflect the specific needs of a speech recognition engine, which are generally not known to the end-point sound pickup scheme. For instance, power and SNR, although they are likely to be positive factors in increasing the likelihood of correct speech recognition, normally are not the most vital considerations for speech recognition, depending on the specific speech recognition engine.
[0007] Even with the improvements resulting from U.S. Patent Application No. 10/421,316, these characteristics do not necessarily offer the best performance in terms of speech recognition. It is contemplated that there are better ways of using all of the information available to the sound pickup scheme to assist operation of the speech recognition engine.
[0008] Accordingly, it is an object of the present invention to best use the real-time information received from spatially directive sound pickup schemes to increase the performance of speech recognition engines.
[0009] The present invention is based on the realization that most speech recognition engines provide a "level of confidence" value together with the response output. This level of confidence is a good indicator of the likelihood of the response being correct. Thus, according to the present invention several spatially filtered voice signals are synthesized and provided to the speech recognition engine, and the engine's response with the highest level of confidence is selected. This offers the system the ability to make decisions in ways that are more relevant for speech recognition. Indeed, it is clear that the level of confidence provided by the speech recognition engine is a better indicator of speech recognition success than the level of noise or reverberation on the signal. The method of the present invention also solves the problem discussed above relating to the first 50ms to 150ms of an utterance, since all beam outputs are streamed from the very beginning of speech activity.
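The selection rule at the heart of this approach, running recognition on each spatially filtered signal and keeping the answer the engine is most confident about, reduces to taking a maximum over confidence values. The response type and field names below are assumptions for illustration; real engines expose confidence through their own APIs:

```python
from dataclasses import dataclass

@dataclass
class EngineResponse:
    text: str          # recognized utterance
    confidence: float  # engine's level-of-confidence value, e.g. in [0, 1]

def select_response(responses):
    """Return the engine response with the highest level of confidence."""
    return max(responses, key=lambda r: r.confidence)

# One hypothetical response per beamformer output streamed to the engine
responses = [
    EngineResponse("pee", 0.31),  # beam 1
    EngineResponse("bee", 0.88),  # beam 2 (toward the talker)
    EngineResponse("dee", 0.45),  # beam 3
]
print(select_response(responses).text)  # -> bee
```

Note the decision uses only the engine's own quality signal, rather than spatial proxies such as beam power, which is precisely the point of the invention.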
[0010] The streaming of multiple signals to the speech recognition engine is relatively simple to implement over digital networks (particularly over IP networks). Also, the method of this invention generally increases the reaction time of the speech recognition engine, but by an acceptable amount given the relatively 'loose' real-time requirements of this functionality.
[0011] Figure 1 shows a microphone array of K microphones resulting in N beams; and
[0012] Figure 2 is a block diagram of an apparatus for implementing the method according to the present invention.
[0014] With reference to Figure 1, a microphone array 1 comprises a plurality (K) of microphones arranged to pick up voice in a hands-free environment. The K signal outputs of array 1 are processed by a beamformer 3 using weighting coefficients to generate a plurality (N) of beamformer signal outputs B1(t), B2(t), ..., BN(t). The beamforming algorithm combines the signals from the various microphones of the array 1 to enhance the audio signal originating from each of the N desired locations and attenuate the audio signals originating from all other respective locations.
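A minimal delay-and-sum beamformer illustrates how weighting (here, uniform weights with per-microphone sample delays) combines the K microphone signals into one beam; N beams are obtained by applying N different delay sets. This is a generic textbook sketch, not the specific weighting algorithm of the patent:

```python
import numpy as np

def delay_and_sum(mics: np.ndarray, delays: list) -> np.ndarray:
    """Steer one beam: advance each microphone signal by its delay
    (in samples) so the target direction adds coherently, then average.
    mics: (K, T) array of K microphone signals of length T."""
    k_mics, _ = mics.shape
    aligned = [np.roll(mics[k], -delays[k]) for k in range(k_mics)]
    return np.sum(aligned, axis=0) / k_mics

# A pulse reaching microphone k one sample later per mic;
# delays [0, 1, 2] realign the copies so they add coherently.
pulse = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
mics = np.stack([np.roll(pulse, k) for k in range(3)])
beam = delay_and_sum(mics, [0, 1, 2])
# beam equals the original pulse; a mis-steered delay set would smear it
```

Sound from other directions arrives with delays that do not match the steering set, so its copies add incoherently and are attenuated, which is the spatial filtering the description relies on.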
[0015] Turning to Figure 2, the output of beamformer 3 is shown connected to an optional filter and transmit block 5 for selecting a subset of M beams (M ≤ N) Bj1(t), Bj2(t), ..., BjM(t) based on characteristics such as power, SNR, or other characteristics that can be extracted using signal processing techniques. The filter and transmit block 5 may be omitted or replaced by a simple streaming block that simply streams the output signals of all beams over the network 7 to speech recognition engine 9. However, to the extent that filter and transmit block 5 performs filtering, then the delay compensation technique set forth in U.S. Patent Application No. 10/421,316 should be used to compensate for delay latency at the start of each speech utterance, prior to the signal being sent to the speech recognition engine 9.
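The beam pruning performed by the filter and transmit block can be sketched as ranking beams by short-term power and keeping the top M. The function name and the use of mean-square power as the ranking metric are assumptions; the patent leaves the exact characteristic open (power, SNR, or others):

```python
import numpy as np

def top_m_by_power(beams, m: int):
    """Return indices of the M beam outputs with the highest
    mean-square (short-term) power, in original beam order."""
    powers = [float(np.mean(np.asarray(b) ** 2)) for b in beams]
    ranked = sorted(range(len(beams)), key=lambda i: powers[i], reverse=True)
    return sorted(ranked[:m])  # preserve original beam ordering

beams = [
    [0.1, -0.1, 0.1],   # weak beam
    [1.0, -1.0, 1.0],   # strong beam
    [0.5, -0.5, 0.5],   # medium beam
]
print(top_m_by_power(beams, 2))  # -> [1, 2]
```

Only the selected subset is then streamed over the network, trading a little selection risk for reduced bandwidth and recognition load.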
[0016] Speech recognition engines normally provide a confidence level output along with the detected speech. This output is sometimes referred to as a "confidence score" or "confidence measure", and is an area of significant research in the art. For example, Delaney, D. W., "Voice User Interface for Wireless Internetworking", Qualifying Examination Report, Georgia Institute of Technology, School of Electrical and Computer Engineering, Atlanta, Ga., Jan. 30, 2001, discussed confidence measures as part of the general subject of speech recognition. Articles dealing with confidence levels in speech recognition engines include: 1) Thomas Kemp and Thomas Schaaf, "Confidence measures for spontaneous speech recognition", International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1997; 2) Timothy Hazen, Stephanie Seneff and Joseph Polifroni, "Recognition confidence scoring and its use in speech understanding systems", Computer Speech and Language (2002) #16, pp. 49-67; and 3) Daniel Willett, Andreas Worm, Christoph Neukirchen, Gerhard Rigoll, "Confidence Measures For HMM-Based Speech Recognition", 5th International Conference on Spoken Language Processing (ICSLP), 1998.
[0017] Patents dealing with confidence levels in speech recognition engines include: US patent 6,539,353 "Confidence measures using sub-word-dependent weighting of sub-word confidence scores for robust speech recognition", Jiang et al.; US patent 6,571,210 "Confidence measure system using a near-miss pattern", Hon et al.; US patent 6,006,183 "Speech recognition confidence level display", Lai et al.; and US patent 6,421,640 "Speech recognition method using confidence measure evaluation", Dolfing et al. Thus, according to the present invention an analysis block 11 receives the response of the speech recognition engine 9 for each of the M beamformer outputs, together with a level of confidence generated in the usual course by the speech recognition engine 9.
The analysis block 11 compares the levels of confidence given by the engine and selects the speech recognition response with the highest level of confidence. Alternatively, the filter and transmit block 5 can transmit other likelihood estimators to the analysis block 11 that can then be used in conjunction with the levels of confidence to make the final decision. For instance, the filter and transmit block 5 may transmit the beam output power or SNR for each of the M candidate beams to the analysis block 11. The analysis block 11 then uses a form of multiple-input state machine (e.g. logic) to combine the information and make a decision on which response from the speech recognition engine 9 should be streamed to the end point.
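One simple way the analysis block 11 might combine the engine's confidence with a spatial likelihood estimator is a weighted score. The linear form and the weight `alpha` are assumptions for illustration; the patent leaves the combination logic open:

```python
def combined_score(confidence: float, spatial_likelihood: float,
                   alpha: float = 0.7) -> float:
    """Blend engine confidence with a spatial estimator (e.g. normalized
    beam power or SNR); alpha is an assumed tuning weight, not a value
    specified by the patent."""
    return alpha * confidence + (1.0 - alpha) * spatial_likelihood

# (hypothesis, engine confidence, spatial likelihood) for two candidate beams
candidates = [
    ("call home", 0.80, 0.20),   # confident engine, poor spatial evidence
    ("hall gnome", 0.60, 0.90),  # less confident engine, strong spatial evidence
]
best = max(candidates, key=lambda c: combined_score(c[1], c[2]))
print(best[0])  # -> hall gnome  (score 0.69 beats 0.62)
```

Raising `alpha` pushes the decision back toward confidence alone, recovering the basic method of paragraph [0009].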
[0018] The many features and advantages of the invention are apparent from the detailed specification and, thus, it is intended by the appended claims to cover all such features and advantages of the invention that fall within the sphere and scope of the invention. For example, a person of ordinary skill in the art will appreciate that directional microphones may be used instead of beamforming, or the output signals of several directional microphones may be combined to effectively perform beamforming on the directional microphones. The optional filter and transmit block 5 may employ any of a number of well-known spatial localization techniques for limiting the output to a subset of the beam outputs (e.g. only the beams with the highest power). Also, the levels of confidence generated by speech recognition engine 9 for several beam output signals may be combined with spatial likelihood estimators related to the localization functionality in order to compute revised success estimators. The exact manner by which the multiple streams are transmitted to the speech recognition engine 9 does not form part of the present invention but would be well known to a person of skill in the art. In particular, the exact implementation of this communication protocol may depend on the type of network being used (e.g. TDM or IP network). Since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation illustrated and described, and accordingly all suitable modifications and equivalents may be resorted to, falling within the scope of the invention.

Claims (4)

CLAIMS
1. For use with a speech recognition engine and a microphone array for generating multiple output signals, a method of improving hands-free speech recognition accuracy comprising: streaming and transmitting said multiple output signals to said speech recognition engine; performing speech recognition on said multiple output signals in connection with which multiple level of confidence values are generated indicative of the likelihood of correctly recognized speech in each of said multiple output signals; and comparing said level of confidence values and in response selecting as recognized speech the speech recognition with the highest level of confidence value.
2. The method of claim 1, further comprising beamforming said multiple output signals before transmission to said speech recognition engine.
3. The method of claim 1, further comprising spatial localization of said multiple output signals for selecting a subset of active ones of said multiple output signals for transmission to said speech recognition engine.
4. The method of claim 1, further comprising combining said level of confidence values with spatial likelihood estimators related to localization functionality in order to generate revised level of confidence values.
GB0407900A 2004-04-07 2004-04-07 Method and apparatus for hands-free speech recognition using a microphone array Withdrawn GB2412997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0407900A GB2412997A (en) 2004-04-07 2004-04-07 Method and apparatus for hands-free speech recognition using a microphone array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0407900A GB2412997A (en) 2004-04-07 2004-04-07 Method and apparatus for hands-free speech recognition using a microphone array

Publications (2)

Publication Number Publication Date
GB0407900D0 GB0407900D0 (en) 2004-05-12
GB2412997A true GB2412997A (en) 2005-10-12

Family

ID=32320507

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0407900A Withdrawn GB2412997A (en) 2004-04-07 2004-04-07 Method and apparatus for hands-free speech recognition using a microphone array

Country Status (1)

Country Link
GB (1) GB2412997A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010114935A1 (en) * 2009-04-02 2010-10-07 Qualcomm Incorporated Beamforming options with partial channel knowledge
JP2016080750A (en) * 2014-10-10 2016-05-16 株式会社Nttドコモ Voice recognition device, voice recognition method, and voice recognition program
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110299136A (en) * 2018-03-22 2019-10-01 上海擎感智能科技有限公司 A kind of processing method and its system for speech recognition

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5353376A (en) * 1992-03-20 1994-10-04 Texas Instruments Incorporated System and method for improved speech acquisition for hands-free voice telecommunication in a noisy environment
WO1996041214A1 (en) * 1995-06-07 1996-12-19 Sensimetrics Corporation Apparatus and method for speech recognition using spatial information
JP2000101598A (en) * 1998-09-25 2000-04-07 Matsushita Electric Works Ltd Voice communication system
EP1160772A2 (en) * 2000-06-02 2001-12-05 Canon Kabushiki Kaisha Multisensor based acoustic signal processing
WO2003058604A1 (en) * 2001-12-29 2003-07-17 Motorola Inc., A Corporation Of The State Of Delaware Method and apparatus for multi-level distributed speech recognition
US20030204397A1 (en) * 2002-04-26 2003-10-30 Mitel Knowledge Corporation Method of compensating for beamformer steering delay during handsfree speech recognition
US20040044516A1 (en) * 2002-06-03 2004-03-04 Kennewick Robert A. Systems and methods for responding to natural language speech utterance

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5353376A (en) * 1992-03-20 1994-10-04 Texas Instruments Incorporated System and method for improved speech acquisition for hands-free voice telecommunication in a noisy environment
WO1996041214A1 (en) * 1995-06-07 1996-12-19 Sensimetrics Corporation Apparatus and method for speech recognition using spatial information
JP2000101598A (en) * 1998-09-25 2000-04-07 Matsushita Electric Works Ltd Voice communication system
EP1160772A2 (en) * 2000-06-02 2001-12-05 Canon Kabushiki Kaisha Multisensor based acoustic signal processing
WO2003058604A1 (en) * 2001-12-29 2003-07-17 Motorola Inc., A Corporation Of The State Of Delaware Method and apparatus for multi-level distributed speech recognition
US20030204397A1 (en) * 2002-04-26 2003-10-30 Mitel Knowledge Corporation Method of compensating for beamformer steering delay during handsfree speech recognition
US20040044516A1 (en) * 2002-06-03 2004-03-04 Kennewick Robert A. Systems and methods for responding to natural language speech utterance

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010114935A1 (en) * 2009-04-02 2010-10-07 Qualcomm Incorporated Beamforming options with partial channel knowledge
US8463191B2 (en) 2009-04-02 2013-06-11 Qualcomm Incorporated Beamforming options with partial channel knowledge
JP2016080750A (en) * 2014-10-10 2016-05-16 株式会社Nttドコモ Voice recognition device, voice recognition method, and voice recognition program
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant

Also Published As

Publication number Publication date
GB0407900D0 (en) 2004-05-12

Similar Documents

Publication Publication Date Title
JP4734070B2 (en) Multi-channel adaptive audio signal processing with noise reduction
US8233352B2 (en) Audio source localization system and method
JP4929685B2 (en) Remote conference equipment
US9837099B1 (en) Method and system for beam selection in microphone array beamformers
US9338549B2 (en) Acoustic localization of a speaker
US9171551B2 (en) Unified microphone pre-processing system and method
KR101619578B1 (en) Apparatus and method for geometry-based spatial audio coding
JP3521914B2 (en) Super directional microphone array
US7366310B2 (en) Microphone array diffracting structure
WO2019089486A1 (en) Multi-channel speech separation
US8204247B2 (en) Position-independent microphone system
EP2146519A1 (en) Beamforming pre-processing for speaker localization
US20100008518A1 (en) Methods for processing audio input received at an input device
US20160165338A1 (en) Directional audio recording system
Taherian et al. Deep learning based multi-channel speaker recognition in noisy and reverberant environments
Ince et al. Assessment of general applicability of ego noise estimation
CN106303870B (en) Method for the signal processing in binaural listening equipment
Ryan et al. Application of near-field optimum microphone arrays to hands-free mobile telephony
CN115457971A (en) Noise reduction method, electronic device and storage medium
Zheng et al. A microphone array system for multimedia applications with near-field signal targets
GB2412997A (en) Method and apparatus for hands-free speech recognition using a microphone array
CA2485728C (en) Detecting acoustic echoes using microphone arrays
US11889261B2 (en) Adaptive beamformer for enhanced far-field sound pickup
Adcock et al. Practical issues in the use of a frequency‐domain delay estimator for microphone‐array applications
US11937047B1 (en) Ear-worn device with neural network for noise reduction and/or spatial focusing using multiple input audio signals

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)