GB2412997A - Method and apparatus for hands-free speech recognition using a microphone array - Google Patents
- Publication number
- GB2412997A GB0407900A
- Authority
- GB
- United Kingdom
- Prior art keywords
- speech recognition
- output signals
- multiple output
- level
- confidence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Abstract
A method is provided for improving hands-free speech recognition accuracy using a microphone array. The microphone array generates and streams multiple output signals for application to a speech recognition engine. Speech recognition is performed on the multiple output signals, and multiple level of confidence values are generated, indicative of the likelihood of correctly recognized speech in each of the multiple output signals. The level of confidence values are compared and, in response, the speech recognition result with the highest level of confidence value is selected. The method can use a beamformer algorithm to combine the outputs from the microphones, select a subset of output signals based on spatial localization, or combine the level of confidence values with spatial likelihood estimators related to the localization functionality.
Description
METHOD AND APPARATUS FOR IMPROVING HANDS-FREE SPEECH RECOGNITION
USING BEAMFORMING TECHNOLOGY
[0001] The present invention relates generally to hands-free speech recognition systems, and in particular to an end-point method and apparatus for improving hands-free speech recognition accuracy.
[0002] It is well known in the art that the performance of speech recognition engines degrades significantly in hands-free scenarios (see, for example, "Environmental robustness in automatic speech recognition", A. Acero and R. Stern, Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing), pp. 849-852, 1990). This performance degradation results from the presence of reverberation as well as background noise mixed with the original speech signal.
[0003] Accordingly, spatially directive sound pickup systems for speech recognition engines have been used in hands-free scenarios to reduce both reverberation and background noise via spatial filtering. One variety of spatially directive sound pickup system is the microphone array (see "A microphone array system for speech recognition", K. Kiyohara et al., Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing), pp. 215-218, 1997, and "Microphone array based speech recognition with different talker-array positions", M. Omologo et al., Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing), pp. 227-230, 1997). Another variety of spatially directive sound pickup system is the gradient, directional microphone.
[0004] Traditionally, spatially directive sound pickup systems either impose a requirement that the talker speak from a certain fixed position, or use one of a number of beamforming techniques to localize the talker in the room and synthesize a spatially filtered signal towards that direction in order to reduce the effects of reverberation and background noise. For example, "Acoustic talker localization", M. Amiri, D. Schulz and M. Tetelbaum, US patent application 20030051532, discloses localizing the talker in a room and streaming speech signals detected by a fixed beam pointing in the direction of the talker to a speech recognition engine. According to U.S. Patent No. 4,956,867 entitled "Adaptive Beamforming for Noise Reduction", the talker is localized in the room and adaptive beamforming techniques are used to form a beam for picking up the speech signal coming from the talker while reducing other spatial components of the sound signals at the microphones.
[0005] Whichever technique is used to perform talker localization, it has to present a very short reaction time so that proper spatial filtering can be performed as quickly as possible after the start of speech activity. This is very important since the first 50ms to 150ms of an utterance contain critical information from the speech recognition perspective, for instance differentiating "Bee" from "Pee" or "Dee". U.S. Patent Application No. 10/421,316 entitled "Method of compensating for beamformer steering delay during hands-free speech recognition" by M. Amiri and G. Thompson discloses a method for minimizing the impact of the latency of the localization system on the signal streamed to the speech recognition engine. The method of U.S. Patent Application No. 10/421,316 introduces a delay corresponding to this latency to ensure that compensation is applied at the start of the signal.
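The latency-compensation idea referenced above — buffering the onset of the utterance while the localizer catches up, so the first 50-150ms are not lost — can be sketched as a simple look-back buffer. The class name and API below are illustrative assumptions, not taken from U.S. Patent Application No. 10/421,316:

```python
from collections import deque

class LookbackBuffer:
    """Hedged sketch of steering-delay compensation: keep the most recent
    `latency` samples in a ring buffer so that, when the localizer (which
    runs `latency` samples behind real time) finally selects a beam,
    streaming can begin from the buffered past and the critical onset of
    the utterance is preserved."""

    def __init__(self, latency):
        # deque with maxlen silently discards the oldest sample on overflow
        self.buf = deque(maxlen=latency)

    def push(self, sample):
        self.buf.append(sample)

    def flush(self):
        """Return the buffered onset, oldest first, then clear the buffer."""
        out = list(self.buf)
        self.buf.clear()
        return out
```

Once flushed, subsequent samples would be streamed directly to the recognition engine, so the recognizer sees a gapless signal starting before the localizer's decision point.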
[0006] It is interesting to note, however, that in either of US patent application 20030051532 or US patent No. 4,956,867, the sound pickup system chooses or synthesizes the signal to be sent to the speech recognition engine based on merit characteristics that are mostly of a spatial nature. In these two patents, and in the area of talker localization generally, the talker location in the room is chosen based on such merit characteristics as the power of the signal at the output of a beamformer pointed in several directions, or the time delay between the microphone signals.
These characteristics attempt to localize the talker, which is a relevant objective for some applications, such as video conferencing, where the microphone array localizes the talker so that a camera can point to the talker. However, such characteristics do not necessarily reflect the specific needs of a speech recognition engine, which are generally not known to the end-point sound pickup scheme. For instance, power and SNR, although they are likely to be positive factors in increasing the likelihood of correct speech recognition, normally are not the most vital considerations for speech recognition, depending on the specific speech recognition engine.
[0007] Even with the improvements resulting from U.S. Patent Application No. 10/421,316, these characteristics do not necessarily offer the best performance in terms of speech recognition. It is contemplated that there are better ways of using all of the information available to the sound pickup scheme to assist operation of the speech recognition engine.
[0008] Accordingly, it is an object of the present invention to best use the real-time information received from spatially directive sound pickup schemes to increase the performance of speech recognition engines.
[0009] The present invention is based on the realization that most speech recognition engines provide a "level of confidence" value together with the response output. This level of confidence is a good indicator of the likelihood of the response being correct. Thus, according to the present invention, several spatially filtered voice signals are synthesized and provided to the speech recognition engine, and the engine's response with the highest level of confidence is selected. This offers the system the ability to make decisions in ways that are more relevant for speech recognition. Indeed, it is clear that the level of confidence provided by the speech recognition engine is a better indicator of speech recognition success than the level of noise or reverberation on the signal. The method of the present invention also solves the problem discussed above relating to the first 50ms to 150ms of an utterance, since all beam outputs are streamed from the very beginning of speech activity.
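The selection principle described above can be sketched as follows. The `recognize` callable is a hypothetical stand-in for any engine that returns a (text, confidence) pair for a given input signal:

```python
def select_best_hypothesis(beam_signals, recognize):
    """Run recognition on every beam output and keep the hypothesis whose
    level-of-confidence value is highest, as described in the method.
    `recognize` maps a signal to a (text, confidence) tuple."""
    best_text, best_conf = None, float("-inf")
    for signal in beam_signals:
        text, confidence = recognize(signal)
        if confidence > best_conf:
            best_text, best_conf = text, confidence
    return best_text, best_conf
```

In practice each beam would be streamed to the engine concurrently rather than in a sequential loop, but the decision rule — a simple argmax over confidence values — is the same.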
[0010] The streaming of multiple signals to the speech recognition engine is relatively simple to implement over digital networks (particularly over IP networks). Also, the method of this invention generally increases the reaction time of the speech recognition engine, but by an acceptable amount given the relatively 'loose' real-time requirements of this functionality.
[0011] Figure 1 shows a microphone array of K microphones resulting in N beams; and
[0012] Figure 2 is a block diagram of an apparatus for implementing the method according to the present invention.
[0014] With reference to Figure 1, a microphone array 1 comprises a plurality (K) of microphones arranged to pick up voice in a hands-free environment. The K signal outputs of array 1 are processed by a beamformer 3 using weighting coefficients to generate a plurality (N) of beamformer signal outputs B1(t), B2(t), ..., BN(t). The beamforming algorithm combines the signals from the various microphones of the array 1 to enhance the audio signal originating from each of the N desired locations and attenuate the audio signals originating from all other respective locations.
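A minimal sketch of beamformer 3, assuming a simple delay-and-sum scheme with integer sample delays and uniform weights (the document does not commit to a particular beamforming algorithm, so this is only one possible realization):

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """mic_signals: (K, T) array of microphone channels.
    delays: (N, K) integer sample delays, one row per steering direction.
    Returns an (N, T) array of beam outputs B1(t)..BN(t): each beam
    time-aligns the K channels for one direction and averages them,
    reinforcing sound from that direction relative to other locations."""
    K, T = mic_signals.shape
    beams = np.zeros((len(delays), T))
    for n, d in enumerate(delays):
        for k in range(K):
            # shift channel k so a wavefront from direction n lines up
            beams[n] += np.roll(mic_signals[k], d[k])
        beams[n] /= K  # uniform weighting across microphones
    return beams
```

With two microphones and an impulse arriving one sample later at the second microphone, the beam whose delays align the channels preserves full amplitude while a misaligned beam smears the impulse across two samples at half amplitude.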
[0015] Turning to Figure 2, the output of beamformer 3 is shown connected to an optional filter and transmit block 5 for selecting a subset of M beams (M ≤ N) Bj1(t), Bj2(t), ..., BjM(t) based on characteristics such as power, SNR, or other characteristics that can be extracted using signal processing techniques. The filter and transmit block 5 may be omitted or replaced by a simple streaming block that simply streams the output signals of all beams over the network 7 to speech recognition engine 9. However, to the extent that the filter and transmit block performs filtering, then the delay compensation technique set forth in U.S. Patent Application No. 10/421,316 should be used to compensate for delay latency at the start of each speech utterance, prior to the signal being sent to the speech recognition engine 9.
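The filter and transmit block's subset selection might, under a power-based criterion, look like the following sketch (power is one of the characteristics the text names; SNR or any other extracted feature could be substituted for the ranking key):

```python
import numpy as np

def select_beams_by_power(beams, M):
    """beams: (N, T) array of beam outputs from the beamformer.
    Returns the indices of the M beams with the highest mean power,
    i.e. the subset Bj1..BjM the filter and transmit block would
    forward to the speech recognition engine."""
    power = np.mean(np.asarray(beams) ** 2, axis=1)
    # argsort ascending, take the last M indices, report them in order
    return sorted(np.argsort(power)[-M:].tolist())
```

The choice of M trades network bandwidth and recognition-engine load against the chance of discarding the beam that would have produced the highest-confidence result.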
[0016] Speech recognition engines normally provide a confidence level output along with the detected speech. This output is sometimes referred to as a "confidence score" or "confidence measure", and is an area of significant research in the art. For example, Delaney, D. W., "Voice User Interface for Wireless Internetworking", Qualifying Examination Report, Georgia Institute of Technology, School of Electrical and Computer Engineering, Atlanta, Ga., Jan. 30, 2001, discussed confidence measures as part of the general subject of speech recognition. Articles dealing with confidence levels in speech recognition engines include: 1) Thomas Kemp and Thomas Schaaf, "Confidence measures for spontaneous speech recognition", International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1997; 2) Timothy Hazen, Stephanie Seneff and Joseph Polifroni, "Recognition confidence scoring and its use in speech understanding systems", Computer Speech and Language (2002) #16, pp. 49-67; and 3) Daniel Willett, Andreas Worm, Christoph Neukirchen, Gerhard Rigoll, "Confidence Measures For HMM-Based Speech Recognition", 5th International Conference on Spoken Language Processing (ICSLP), 1998.
[0017] Patents dealing with confidence levels in speech recognition engines include: US patent 6,539,353 "Confidence measures using sub-word-dependent weighting of sub-word confidence scores for robust speech recognition", Jiang et al.; US patent 6,571,210 "Confidence measure system using a near-miss pattern", Hon et al.; US patent 6,006,183 "Speech recognition confidence level display", Lai et al.; and US patent 6,421,640 "Speech recognition method using confidence measure evaluation", Dolfing et al. Thus, according to the present invention, an analysis block 11 receives the response of the speech recognition engine 9 for each of the M beamformer outputs, together with a level of confidence generated in the usual course by the speech recognition engine 9.
The analysis block 11 compares the levels of confidence given by the engine and selects the speech recognition response with the highest level of confidence. Alternatively, the filter and transmit block 5 can transmit other likelihood estimators to the analysis block 11, which can then be used in conjunction with the levels of confidence to make the final decision. For instance, the filter and transmit block 5 may transmit the beam output power or SNR for each of the M candidate beams to the analysis block 11. The analysis block 11 then uses a form of multiple-input state machine (e.g. logic) to combine the information and make a decision on which response from the speech recognition engine should be streamed to the end point.
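One possible form of the analysis block's combination logic is a weighted linear blend of the recognizer's confidence with a normalized spatial likelihood (e.g. normalized beam power). The weight `alpha` and the linear form are illustrative assumptions; the text leaves the combining logic open:

```python
def combine_scores(candidates, alpha=0.7):
    """candidates: list of dicts with 'text' and 'confidence' (from the
    recognition engine) plus 'spatial' (a 0..1 spatial likelihood such
    as normalized beam power or SNR from the filter and transmit block).
    Returns the text of the candidate with the best combined score."""
    best = max(
        candidates,
        key=lambda c: alpha * c["confidence"] + (1 - alpha) * c["spatial"],
    )
    return best["text"]
```

With `alpha` near 1 the decision reduces to the pure highest-confidence rule of claim 1; lowering it lets strong spatial evidence override a marginally higher confidence value, as contemplated in claim 4.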
[0018] The many features and advantages of the invention are apparent from the detailed specification and, thus, it is intended by the appended claims to cover all such features and advantages of the invention that fall within the sphere and scope of the invention. For example, a person of ordinary skill in the art will appreciate that directional microphones may be used instead of beamforming, or the output signals of several directional microphones may be combined to effectively perform beamforming on the directional microphones. The optional filter and transmit block 5 may employ any of a number of well-known spatial localization techniques for limiting the output to a subset of the beam outputs (e.g. only the beams with the highest power). Also, the levels of confidence generated by speech recognition engine 9 for several beam output signals may be combined with spatial likelihood estimators related to the localization functionality in order to compute revised success estimators. The exact manner by which the multiple streams are transmitted to the speech recognition engine 9 does not form part of the present invention but would be well known to a person of skill in the art. In particular, the exact implementation of this communication protocol may depend on the type of network being used (e.g. TDM or IP network). Since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation illustrated and described, and accordingly all suitable modifications and equivalents may be resorted to, falling within the scope of the invention.
Claims (4)
1. For use with a speech recognition engine and a microphone array for generating multiple output signals, a method of improving hands-free speech recognition accuracy comprising: streaming and transmitting said multiple output signals to said speech recognition engine; performing speech recognition on said multiple output signals in connection with which multiple level of confidence values are generated indicative of the likelihood of correctly recognized speech in each of said multiple output signals; and comparing said level of confidence values and in response selecting as recognized speech the speech recognition with highest level of confidence value.
2. The method of claim 1, further comprising beamforming said multiple output signals before transmission to said speech recognition engine.
3. The method of claim 1, further comprising spatial localization of said multiple output signals for selecting a subset of active ones of said multiple output signals for transmission to said speech recognition engine.
4. The method of claim 1, further comprising combining said level of confidence values with spatial likelihood estimators related to localization functionality in order to generate revised level of confidence values.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0407900A GB2412997A (en) | 2004-04-07 | 2004-04-07 | Method and apparatus for hands-free speech recognition using a microphone array |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0407900A GB2412997A (en) | 2004-04-07 | 2004-04-07 | Method and apparatus for hands-free speech recognition using a microphone array |
Publications (2)
Publication Number | Publication Date |
---|---|
GB0407900D0 GB0407900D0 (en) | 2004-05-12 |
GB2412997A true GB2412997A (en) | 2005-10-12 |
Family
ID=32320507
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB0407900A Withdrawn GB2412997A (en) | 2004-04-07 | 2004-04-07 | Method and apparatus for hands-free speech recognition using a microphone array |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2412997A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010114935A1 (en) * | 2009-04-02 | 2010-10-07 | Qualcomm Incorporated | Beamforming options with partial channel knowledge |
JP2016080750A (en) * | 2014-10-10 | 2016-05-16 | 株式会社Nttドコモ | Voice recognition device, voice recognition method, and voice recognition program |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110299136A (en) * | 2018-03-22 | 2019-10-01 | 上海擎感智能科技有限公司 | A kind of processing method and its system for speech recognition |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5353376A (en) * | 1992-03-20 | 1994-10-04 | Texas Instruments Incorporated | System and method for improved speech acquisition for hands-free voice telecommunication in a noisy environment |
WO1996041214A1 (en) * | 1995-06-07 | 1996-12-19 | Sensimetrics Corporation | Apparatus and method for speech recognition using spatial information |
JP2000101598A (en) * | 1998-09-25 | 2000-04-07 | Matsushita Electric Works Ltd | Voice communication system |
EP1160772A2 (en) * | 2000-06-02 | 2001-12-05 | Canon Kabushiki Kaisha | Multisensor based acoustic signal processing |
WO2003058604A1 (en) * | 2001-12-29 | 2003-07-17 | Motorola Inc., A Corporation Of The State Of Delaware | Method and apparatus for multi-level distributed speech recognition |
US20030204397A1 (en) * | 2002-04-26 | 2003-10-30 | Mitel Knowledge Corporation | Method of compensating for beamformer steering delay during handsfree speech recognition |
US20040044516A1 (en) * | 2002-06-03 | 2004-03-04 | Kennewick Robert A. | Systems and methods for responding to natural language speech utterance |
-
2004
- 2004-04-07 GB GB0407900A patent/GB2412997A/en not_active Withdrawn
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010114935A1 (en) * | 2009-04-02 | 2010-10-07 | Qualcomm Incorporated | Beamforming options with partial channel knowledge |
US8463191B2 (en) | 2009-04-02 | 2013-06-11 | Qualcomm Incorporated | Beamforming options with partial channel knowledge |
JP2016080750A (en) * | 2014-10-10 | 2016-05-16 | 株式会社Nttドコモ | Voice recognition device, voice recognition method, and voice recognition program |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
Also Published As
Publication number | Publication date |
---|---|
GB0407900D0 (en) | 2004-05-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4734070B2 (en) | Multi-channel adaptive audio signal processing with noise reduction | |
US8233352B2 (en) | Audio source localization system and method | |
JP4929685B2 (en) | Remote conference equipment | |
US9837099B1 (en) | Method and system for beam selection in microphone array beamformers | |
US9338549B2 (en) | Acoustic localization of a speaker | |
US9171551B2 (en) | Unified microphone pre-processing system and method | |
KR101619578B1 (en) | Apparatus and method for geometry-based spatial audio coding | |
JP3521914B2 (en) | Super directional microphone array | |
US7366310B2 (en) | Microphone array diffracting structure | |
WO2019089486A1 (en) | Multi-channel speech separation | |
US8204247B2 (en) | Position-independent microphone system | |
EP2146519A1 (en) | Beamforming pre-processing for speaker localization | |
US20100008518A1 (en) | Methods for processing audio input received at an input device | |
US20160165338A1 (en) | Directional audio recording system | |
Taherian et al. | Deep learning based multi-channel speaker recognition in noisy and reverberant environments | |
Ince et al. | Assessment of general applicability of ego noise estimation | |
CN106303870B (en) | Method for the signal processing in binaural listening equipment | |
Ryan et al. | Application of near-field optimum microphone arrays to hands-free mobile telephony | |
CN115457971A (en) | Noise reduction method, electronic device and storage medium | |
Zheng et al. | A microphone array system for multimedia applications with near-field signal targets | |
GB2412997A (en) | Method and apparatus for hands-free speech recognition using a microphone array | |
CA2485728C (en) | Detecting acoustic echoes using microphone arrays | |
US11889261B2 (en) | Adaptive beamformer for enhanced far-field sound pickup | |
Adcock et al. | Practical issues in the use of a frequency‐domain delay estimator for microphone‐array applications | |
US11937047B1 (en) | Ear-worn device with neural network for noise reduction and/or spatial focusing using multiple input audio signals |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |