WO2007138503A1 - Method of driving a speech recognition system

Info

Publication number
WO2007138503A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech recognition
microphone
speech
source
speech signal
Application number
PCT/IB2007/051742
Other languages
French (fr)
Inventor
Christian Benien
Thomas Portele
Original Assignee
Philips Intellectual Property & Standards GmbH
Koninklijke Philips Electronics N.V.
Application filed by Philips Intellectual Property & Standards GmbH, Koninklijke Philips Electronics N.V.
Publication of WO2007138503A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165 Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal


Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)
  • Machine Translation (AREA)

Abstract

A method of driving a speech recognition system (1), which method comprises detecting a speech signal (S, S1, S2) using a microphone arrangement (M), performing speech recognition on the speech signal (S, S1, S2) detected by a first microphone (M1) of the microphone arrangement (M) to obtain a speech recognition output (10), using the first microphone (M1) of the microphone arrangement (M) and at least a second microphone (M2) of the microphone arrangement (M) to determine a relative direction (D, D', D1, D2) between the source (2, 21, 22) of the speech signal (S, S1, S2) and the microphone arrangement (M), and processing the speech recognition output (10) according to the determined relative direction (D, D', D1, D2). The invention also relates to a speech recognition system (1) and to an interactive system (3).

Description

Method of driving a speech recognition system
The invention relates to a method of driving a speech recognition system, to a speech recognition system, and to an interactive system.
The use of speech recognition systems is becoming more and more widespread, as developments in this field lead to better speech recognition algorithms and more compact systems. Speech recognition can be made use of in basic systems - to carry out specific functions in response to a spoken command - or can be incorporated into more complex systems which allow an interaction or 'dialogue' to be carried out. Developments are being made in the field of home dialogue systems, which allow interaction between a user and, for example, a humanoid or robot-like device. Basically, words or phrases spoken by a user are processed in a speech recogniser, and the results of the speech recognition are interpreted in a dialogue manager or dialogue engine in order to determine what the user has said, and how to react to what he has said. Such dialogues or interactions on the basis of speech alone are particularly useful in "hands free" environments, i.e. environments in which a user cannot interact with a system using, say, a keyboard or similar user interface, or where such a physical interaction is undesirable. For example, it is simple and intuitive to give a spoken command such as "Turn on the lights" or "Play some music", and more convenient for a user than pressing a switch or pushing buttons on a remote control.
However, such an interactive system must be able to distinguish between valid spoken input from the user with which the system is interacting, and any other "unwanted" or invalid input, i.e. a 'false alarm'. There are a number of approaches to dealing with this problem. For example, WO2004/038697A1 suggests using a 'beam-former' to cancel out any unwanted acoustic signals detected by the microphones of the system. A beam-former uses a microphone setup with several microphones for collecting acoustic input, and determines which acoustic signals originate from outside of a certain direction. These unwanted or invalid signals are then negated, for example by overlaying them with the appropriate inverse signals to cancel them out. However, this approach has the disadvantage that the desired or valid signal, i.e. the speech signal originating from the user interacting with the system, becomes distorted. The distortion in the speech signal could result in erroneous speech recognition results, so the speech recognition system must be trained specifically for the beam-former environment, meaning that extensive speech data collection has to be carried out with that specific microphone setup. This training makes such a system comparatively expensive. Furthermore, should the speech recognition system be moved for use in a different environment, it would have to be re-trained for the new environment, adding to the overall cost. Because of the necessity to re-train, such systems are generally not portable and, in consequence, limited in their usefulness. Another disadvantage common to some state of the art speech recognition systems is that they require a first speech recogniser solely for the purpose of reacting to an initiation or 'wake-up' phrase. As long as such a speech recognition system is waiting for input in an idle state, this dedicated speech recogniser is continually 'listening' for a wake-up or initiation phrase. Upon recognition of the wake-up phrase, a second speech recogniser commences processing the speech input according to an interaction grammar. The need for two speech recognisers makes such systems comparatively expensive and more complex.
Therefore, it is an object of the invention to provide a simpler and more economical way of identifying valid speech input to a speech recognition system.
To this end, the present invention provides a method of driving a speech recognition system, which method comprises detecting a speech signal using a microphone arrangement, performing speech recognition on the speech signal detected by a first microphone of the microphone arrangement to obtain a speech recognition output, using the first microphone of the microphone arrangement and at least a second microphone of the microphone arrangement to determine a relative direction between the source of the speech signal and the microphone arrangement, and processing the speech recognition output according to the determined relative direction.
The source of the speech signal can be a human user of the speech recognition system, but might equally well be an artificially generated source of speech. In the following, however, for the sake of simplicity, it will be assumed that the source of the speech signal is a user, without restricting the invention in any way.
A user might be positioned at any location relative to the speech recognition system. For example, in a home dialogue system, the user might be close to or at a distance from the microphones of the speech recognition system, which microphones can be arranged, for example, as the "ears" on either side of a "head" of the home dialogue system. The user's relative position with respect to the microphone arrangement can easily be determined, for example by simply evaluating the temporal difference in the acoustic input detected by the first and second microphones, i.e. the delay between the sound signal impinging at the first microphone and at the second microphone. If there is a significant delay, for example, this would indicate that the user is positioned towards the side of the microphone arrangement with that microphone that first detected the signal. The length or size of the measured delay between the two microphones is an indication of how far the user is positioned to one side of the microphone arrangement. If, on the other hand, little or no delay is determined, it can be assumed that the user is positioned essentially 'between' the microphones, for example more or less directly in front of or behind the microphone arrangement. To discern whether the user is in front of or behind the microphone arrangement, it is advantageous to make use of an additional third microphone. Use of such an additional microphone also allows determination of the relative position of the user 'above' or 'below' the microphone arrangement.
Evaluating a sound signal detected by more than one microphone in this way to determine the source of the sound is known as 'acoustic source localisation'.
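To make the delay evaluation concrete, here is a minimal sketch in Python (not taken from the patent; the function and variable names are illustrative assumptions). It estimates the delay between two digitised microphone signals from the peak of their cross-correlation and converts it into an angle under a far-field assumption:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second in air at room temperature

def estimate_direction(sig_a, sig_b, mic_distance, sample_rate):
    """Estimate the source angle, in degrees, relative to the broadside of
    a two-microphone pair, from the inter-microphone time delay.

    0 degrees means the source is equidistant from both microphones, i.e.
    more or less directly in front of or behind the pair; the sign of the
    angle tells which microphone the sound reached first.
    """
    # The lag of the cross-correlation peak is the delay, in samples,
    # of sig_a relative to sig_b.
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)
    delay = lag / sample_rate  # seconds

    # Far-field geometry: the extra path length is c * delay, so
    # sin(angle) = c * delay / mic_distance. Clip to guard against
    # delays slightly longer than physically possible (noise, sampling).
    sin_angle = np.clip(SPEED_OF_SOUND * delay / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_angle)))
```

As the text notes, a pair of microphones cannot distinguish front from back; the angle returned here is ambiguous between the two, and a third microphone would be needed to resolve it.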
Once the relative position of the user with respect to the microphone arrangement has been established, this relative direction is used, in the method according to the invention, to process the speech recognition output. In other words, the speech recogniser can continually work on any acoustic or speech input signal, regardless of its source. With the method according to the invention, every acoustic signal, regardless of its source relative to the microphone arrangement, is 'admitted' and subject to the speech recognition process. Whether a speech recognition output is deemed to be valid or not is later decided on the basis of the determined relative direction. The speech recognition and relative direction determination processes can take place simultaneously, or might be delayed with respect to each other.
An obvious advantage of the method according to the invention is that the position of the user relative to the microphone arrangement, and therefore to the speech recognition system, can be simply and easily determined, and using this direction to process the speech recognition output means that there is no need for computationally expensive beam-forming to cancel unwanted input. A distortion of the valid input, i.e. the speech signal coming from the user, will therefore not arise with the method according to the invention. Furthermore, the simplicity of the approach - speech recognition on a single microphone and direction estimation with at least one additional microphone - allows for an advantageously economical realisation.
A suitable speech recognition system comprises a microphone arrangement with a first microphone and at least a second microphone for detecting a speech signal, an acoustic source localisation unit for determining a relative direction between the source of the speech signal and the microphone arrangement using the first microphone and at least a second microphone of the microphone arrangement, a speech recognition unit for performing speech recognition on the speech signal detected by the first microphone of the microphone arrangement and a processing unit for processing the speech recognition output according to the determined relative direction. Such a speech recognition system can be used in a simple realisation, for example to cause a specific action to be taken in response to a predefined command spoken by a user, or in a more complex application such as an interactive system allowing a dialogue with the user.
An interactive system according to the invention comprises the units or modules of the speech recognition system described above, and a dialogue engine for interpreting the speech recognition output of the speech recognition system, whereby the processing unit can be realised as part of the dialogue engine.
The dependent claims and the subsequent description disclose particularly advantageous embodiments and features of the invention. As described above, evaluating the temporal difference between detection of a sound signal at two or more separate microphones yields information about the direction of the source of the signal relative to the microphones. Knowing the direction from which a speech signal originates can be used to decide whether the speech signal is 'valid' or 'invalid'. Therefore, in a particularly preferred embodiment of the invention, processing the speech recognition output according to the determined relative direction comprises filtering the speech recognition output to disregard any speech recognition output corresponding to a speech signal originating from a direction other than a specific, or given, direction. In other words, an interaction is only allowed in the specific direction, so that any speech recognition output originating from a different direction, i.e. unwanted false alarm input, is simply disregarded. The filtering might involve, for example, accepting a speech recognition output corresponding to a speech input whose origin deviates by no more than a pre-defined degree from the specific direction. In this way, only speech recognition output originating from essentially the specific direction is then processed. Other speech input coming from a different direction, such as other people talking, or speech coming from a television or radio, will be ignored.
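As a sketch of such a direction filter (the function name is an illustrative assumption; the 5-degree tolerance anticipates the figure used in the embodiment of Fig. 1 below), the acceptance test reduces to a bound on angular deviation:

```python
def is_valid_direction(source_angle, interaction_angle, tolerance_deg=5.0):
    """Accept a speech recognition output only if its source direction
    deviates from the specific interaction direction by no more than
    `tolerance_deg` degrees. The modular arithmetic handles wrap-around,
    so 359 and 1 degrees differ by 2 degrees, not 358."""
    deviation = abs((source_angle - interaction_angle + 180.0) % 360.0 - 180.0)
    return deviation <= tolerance_deg
```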
Filtering out any invalid speech recognition results according to direction can take place at any suitable stage. In a further preferred embodiment of the invention, a speech recognition output is associated or linked in some appropriate manner with higher-level information about the source or direction of the corresponding speech signal by simply annotating the speech recognition output with the determined relative direction. Here, the term 'annotation' simply means marking or tagging a speech recognition output - for example the output corresponding to a word or a phrase - with information specifying the relative direction from which the corresponding speech signal originated.
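One plausible way to realise such an annotation, sketched below with illustrative field names, is a small record carrying the recognised text together with its direction tag:

```python
from dataclasses import dataclass

@dataclass
class AnnotatedOutput:
    """A speech recognition output tagged with the relative direction,
    in degrees, determined by acoustic source localisation."""
    text: str         # recognised word or phrase
    direction: float  # estimated direction of the corresponding speech signal
    timestamp: float  # when the utterance was detected, in seconds
```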
A dialogue engine for controlling the dialogue in an interactive system can avail of speech understanding algorithms to determine what speech could belong to an interaction between a user and the interactive system, and which speech input is to be ignored. To this end, a dialogue engine generally has a predefined 'interaction grammar', such as a set of possible commands or queries to which the interactive system can respond. For example, a simple home dialogue system might have an interaction grammar consisting of basic commands such as "Turn on the lights", "Play music", "Any new e-mail?", etc. To determine whether a speech input is valid or invalid, the corresponding speech recognition result must be examined to determine whether it corresponds to a part of the interaction grammar. Therefore, in a preferred embodiment of the invention, the step of filtering the speech recognition output to disregard any results originating from a direction other than a specific direction is carried out in a dialogue engine. The dialogue engine preferably avails of a language understanding unit to carry out the necessary evaluation of the speech recognition results. Since even a basic system using speech recognition needs language understanding and interpreting units to deal with the spoken input commands, it is advantageous to perform the filtering at this point.
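A toy version of this validity check could look as follows; a real language understanding unit would parse the utterance rather than compare normalised strings, so this only illustrates where the decision is made:

```python
# Illustrative interaction grammar: commands the system can respond to.
INTERACTION_GRAMMAR = {"turn on the lights", "play music", "any new e-mail?"}

def matches_grammar(recognised_text):
    """Return True if the recognised utterance belongs to the interaction
    grammar, after normalising case and surrounding whitespace."""
    return recognised_text.strip().lower() in INTERACTION_GRAMMAR
```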
Furthermore, on the basis of its interpretation of the words or phrases recognised, the dialogue engine can establish or define the specific direction for the current interaction. For example, if the dialogue engine has several speech inputs from different directions to evaluate, and it determines that only the speech input from a certain direction matches a part of the interaction grammar, i.e., is 'valid', the dialogue engine can then decide that the direction corresponding to the valid speech input is to be the specific direction from which to accept speech recognition results, and any speech recognition results originating from other directions are to be disregarded. An interaction between a user and the interactive system can be of any duration. A user can use the interactive system in a short interaction to have a command executed, for example to turn on or off the lights. A longer interaction might be started by the user to have the interactive system respond to a series of commands or queries over a longer period of time. In either case, the start of an interaction can be announced or made known to the interactive system by the user uttering a certain pre-defined phrase, referred to in the following as a 'wake-up' phrase. Once the dialogue engine has recognised this phrase, it can assume that an interaction is likely to follow from the relative direction from which the wake-up phrase was spoken. For example, if a user has uttered a wake-up phrase from a certain relative direction, the dialogue engine can define that direction as the specific direction for the following interaction. The predefined speech signal or wake-up phrase can be accepted or considered from any direction relative to the microphone arrangement of the speech recognition system, so that the user of the interactive system can initiate an interaction from any position. Because the speech recognition results are accepted or discarded on the basis of the direction from which they originate, an obvious advantage of the speech recognition system according to the invention is that a second speech recogniser is not required for the purpose of identifying a wake-up phrase, so that a speech recognition system according to the invention, with a single speech recogniser, is more economical and easier to realise compared to state of the art speech recognition systems. By expecting input from a single speaker positioned in a certain specific direction with respect to the interactive system, computationally expensive beam-formers are not needed.
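Pulling these pieces together, the idle/interaction control flow described above might be sketched as follows, building on the illustrative helpers from the previous examples (the wake-up phrases are those suggested further below; the state handling and return values are assumptions):

```python
WAKE_UP_PHRASES = {"look at me", "over here"}

class DialogueEngineSketch:
    """While idle, a wake-up phrase is accepted from any direction and its
    direction becomes the specific interaction direction; thereafter,
    recognition results from other directions are disregarded."""

    def __init__(self, tolerance_deg=5.0):
        self.interaction_direction = None  # None means idle
        self.tolerance_deg = tolerance_deg

    def handle(self, result):
        """Decide what to do with one AnnotatedOutput."""
        text = result.text.strip().lower()
        if text in WAKE_UP_PHRASES:
            # A wake-up phrase is honoured from anywhere and (re)defines
            # the specific direction for the following interaction.
            self.interaction_direction = result.direction
            return "focus"
        if self.interaction_direction is None:
            return "ignore"  # idle: only wake-up phrases are of interest
        if not is_valid_direction(result.direction,
                                  self.interaction_direction,
                                  self.tolerance_deg):
            return "ignore"  # recognised, but from the wrong direction
        return "execute" if matches_grammar(text) else "ignore"
```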
A user might like to have some kind of feedback that the interactive system has identified him as the interacting user. Therefore, the interactive system might avail of some kind of feedback to indicate that it is focusing on the user. For example, a moveable head of an interactive system might swivel to 'face' the user when he has spoken the wake-up phrase, and the 'eyes' in the head might briefly light up. The user then knows that the interactive system has identified him as the interacting user. Some other type of feedback could be used if the interactive system is not equipped with a moveable head. For example, one or more LEDs (light-emitting diodes), positioned on that part of the interactive system facing the user, might light up to indicate to the user that the interactive system is 'listening' to him.
In the method according to the invention, the interaction between the user and the interactive system is allowed in the determined relative direction, i.e. speech coming from any other direction is ignored. However, it may be desirable for the user to be able to move around while interacting with the system. In doing so, the position of the user with respect to the interactive system will change. To have the interactive system re-determine the relative or specific direction of the interaction, the user can simply utter the wake-up phrase whenever he changes position. For example, the wake-up phrase might be "Look at me" or "Over here". Whenever the dialogue engine 'hears' this phrase, it can re-adjust to the direction from where the phrase came, and can show this by causing the head of the interactive system to swivel or rotate so that it is facing in the direction from which the wake-up phrase was spoken, i.e. in the direction of the user. However, it may be inconvenient or irritating for the user to have to say the wake-up phrase whenever he moves. Also, if he forgets to say the wake-up phrase after moving to a different position, he might be annoyed to find out that the interaction has been terminated, or that his commands are being ignored. Therefore, in a preferred method according to the invention, in particular for interactive systems, source characterising information, i.e. some high-level information describing the user of the interactive system, is obtained to characterise the user speaking from the specific direction. For example, speaker characteristic information about the speech signal corresponding to a wake-up phrase can be determined and later used to decide, on the basis of speaker recognition, whether a speech input signal has been spoken by that same user, in which case the speech input is accepted, or by another user, in which case the speech input will be discarded.
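Under strong simplifying assumptions, the speaker-based acceptance decision could be sketched as a similarity test between voice feature vectors; how such features are extracted, and the 0.8 threshold, are assumptions rather than anything specified in the text:

```python
import numpy as np

def same_speaker(ref_features, new_features, threshold=0.8):
    """Crude speaker check: compare a stored feature vector, taken from
    the wake-up phrase, against the features of a new utterance using
    cosine similarity. Real speaker recognition is considerably more
    involved; this only illustrates the accept/discard decision."""
    cos = float(np.dot(ref_features, new_features) /
                (np.linalg.norm(ref_features) * np.linalg.norm(new_features)))
    return cos >= threshold
```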
Another way to characterise the interacting user might be to perform image processing. For example, when the user says the wake-up phrase, a camera of the interactive system can be aimed at the direction from which the user has spoken, and an image can be captured of the user's face. In the following interaction, the interactive system can use the camera to track the user when he changes position. In subsequent speech analysis, the images captured by the camera can be analysed to determine whether the person who has spoken is the user that initiated the interaction. In this way, the relative direction between the user and the microphone arrangement is adjusted or updated on the basis of the source characterising information. Obtaining source characterising information such as speaker descriptive information or image recognition information can be carried out in a suitable unit. An interactive system according to the invention preferably comprises a feature extraction unit for carrying out such calculations. The feature extraction unit can be supplied with suitable data from, for example, a microphone or a digital camera of the interactive system. The processing steps described above might all be carried out using appropriate software algorithms, many of which are already well-known, such as those for comparing digitised speech signals, for performing speech recognition, and for interpreting speech recognition results in a dialogue on the basis of an interaction grammar. The software can run as a program, or a number of programs, on one or more microprocessors of a speech recognition system, or an interactive system.
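In the terms of the dialogue engine sketch above, the camera-driven update can then be as simple as overwriting the stored interaction direction whenever the tracking reports that the identified user has moved; the tracking itself is assumed to happen elsewhere:

```python
def update_interaction_direction(engine, tracked_direction):
    """Follow the interacting user as he moves: replace the specific
    interaction direction with the direction reported by camera-based
    tracking, so the user need not repeat the wake-up phrase."""
    if engine.interaction_direction is not None:
        engine.interaction_direction = tracked_direction
```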
Other objects and features of the present invention will become apparent from the following detailed descriptions considered in conjunction with the accompanying drawing. It is to be understood, however, that the drawings are designed solely for the purposes of illustration and not as a definition of the limits of the invention.
Fig. 1 shows a block diagram of an interactive system comprising a speech recognition system according to a first embodiment of the invention;
Fig. 2 shows an interactive system according to an embodiment of the invention, and a number of sources of speech input;
Fig. 3a shows the interactive system according to Fig. 2, and a user at a first position relative to the interactive system;
Fig. 3b shows the interactive system according to Fig. 3a, and the user at a second position relative to the interactive system;
Fig. 4 shows a block diagram of a speech recognition system according to a second embodiment of the invention.
In the diagrams, like numbers refer to like objects throughout. Objects are not necessarily drawn to scale.
Fig. 1 shows a block diagram of an interactive system 3 comprising a speech recognition system 1 according to an embodiment of the invention. To detect acoustic input, a microphone arrangement M avails of a first microphone M1 and a second microphone M2, separated from each other by a certain distance. The distance between the two microphones M1, M2 can be governed by the ultimate realisation of the device in which the speech recognition system is to be incorporated, or by desired accuracy levels. The input from the first microphone M1 is forwarded to a speech recognition unit 12, which performs speech recognition on the input signal. The input from the second microphone M2 is forwarded to an acoustic source localisation unit 11, which also receives the input from the first microphone M1. The direction D of the source 2 of the detected signal S is estimated by evaluating the temporal delta between the signals detected at the two microphones M1, M2. In the diagram, the source 2, or user 2, is shown to be closer to the second microphone M2, so that sound S emanating from this source 2 will impinge on the second microphone M2 first, and then, after a brief delay or temporal delta, the sound S will impinge on the first microphone M1. The magnitude of the temporal delta, and the known distance between the two microphones M1, M2, can be used by the acoustic source localisation unit 11 to estimate the relative direction D from which the sound S has originated.
As long as there is no active interaction taking place, the speech recognition output 10 passes unchanged through an annotation unit 17 to a dialogue engine 15, where it further passes, unchanged, through a filtering unit 18 before being subsequently analysed in a language understanding unit 19 of the dialogue engine 15 to see if a wake-up phrase, described in a wake-up grammar supplied by a database 9, has been uttered. Once the wake-up phrase has been detected by the language understanding unit 19, an appropriate signal 44 is forwarded to the annotation unit 17, informing the annotation unit 17 that the direction information 13 is now to be defined as the specific interaction direction D. Thereafter, the speech recognition output 10 is annotated in the annotation unit 17 with the direction information 13 from the acoustic source localisation unit 11, so that any acoustic input signals are associated with the direction from where they originated. The annotation unit 17 can compare the direction information 13 to the specific interaction direction D and annotate any subsequent speech recognition output 10 with a marker, tag, or other descriptive label describing the degree of closeness of the speech recognition output 10 to the specific interaction direction D. The annotated speech recognition output 14 is forwarded to the filtering unit 18.
An interaction is now assumed to be taking place, since the wake-up phrase was detected from the direction D, and any speech recognition results originating from a different direction will be discarded. The decision to discard a speech recognition output 10 is made on the basis of the annotation information. For example, any speech recognition output 10 deemed to have originated from a direction within 5° of the specific direction D is to be accepted. The 'valid' annotated speech recognition output 14 survives the filtering process and is forwarded to the language understanding unit 19, which proceeds to interpret the current interaction taking place between the user 2 and the interactive system 3. The output 16 of the dialogue engine 15, in response to the interaction taking place, can be a suitable signal 16, for example a signal 16 to a controller, not shown in the diagram, to cause some event to take place in response to a command spoken by the user 2 and interpreted by the dialogue engine 15.
In Fig. 2, an interactive system 3 is shown in an environment with two potential users 2, 21 and a television 22. The interactive system 3, shown as a 'robot' in a human-like realisation, is equipped with a pair of microphones M1, M2 positioned as the 'ears' on a 'head' 20 of the interactive system 3. To make the interactive system appear realistic, the head 20 can rotate on a 'neck' 23 attached to a 'body' 24 of the interactive system 3. The two potential users 2, 21 and the television 22 can all issue speech signals S, S1, S2. The microphones M1, M2 detect any incoming acoustic signals S, S1, S2, which are processed in a speech recognition unit, not shown in the diagram, to determine their validity. As long as the interactive system 3 is in an idle state, the output of the speech recognition is analysed for a wake-up signal. When a first user 2 speaks a wake-up phrase, this is recognised by the interactive system 3, and the direction D from which the speech signal S originated is specified to be the interaction direction. For the duration of the interaction, the interactive system 3 only responds to speech S coming from this direction D. Other speech signals S1, S2, for example when the person 21 says something, or when speech emanates from the television 22, will be recognised but simply ignored. At some later stage, when the interaction is concluded, the second user 21 might say the wake-up phrase, upon which the interactive system 3 determines the relative direction D1 for this current interaction on the basis of the speech signals S1 originating from that user 21. For this interaction, any speech signals S, S2 coming from the first user 2 or the TV 22 will be recognised but ignored.
Naturally, the user 2 is free to move about. Figs. 3a and 3b show a sequence of positions of the user 2 with respect to the interactive system 3. In Fig. 3a, the user 2 is positioned to the right of the interactive system 3 when he speaks the wake-up phrase. This is detected by the interactive system 3, and the head 20 is rotated so that the user 2 has the impression that the robot 3 is 'attentive' and 'listening'. An initial specific interaction direction D is established by acoustic source localisation using the input of the two microphones M1, M2. A camera incorporated into one of the 'eyes' 30, 31 of the robot 3 captures one or more images of the user 2. These images can be processed in a feature extraction unit to identify characterising features of the user 2. During the subsequent interaction, the user 2 can move around. Images captured by the camera can be processed to track the position of the user 2, for example by using face recognition or pattern recognition algorithms, so that the direction of interaction is updated to the altered direction D'. Now, only speech input S from this altered direction D' will be considered in the continued interaction.
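How the direction of interaction might be updated from camera images can be indicated with another sketch. Everything below is hypothetical scaffolding: the face_tracker object, its locate() method, the detection fields, and the 60° field of view are invented for illustration; the application itself only requires that some face or pattern recognition algorithm reports where the user has moved.

    CAMERA_FOV_DEG = 60.0  # assumed horizontal field of view of the 'eye' camera

    def update_interaction_direction(current_direction_deg, frame, face_tracker):
        """Map the tracked face position in the image to an updated
        interaction direction D' (hypothetical tracker interface)."""
        detection = face_tracker.locate(frame)
        if detection is None:
            # User not visible in this frame: keep the previous estimate.
            return current_direction_deg
        # Horizontal offset of the face centre from the image centre,
        # normalised to the range [-0.5, 0.5].
        offset = (detection.centre_x / frame.width) - 0.5
        return current_direction_deg + offset * CAMERA_FOV_DEG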
Fig. 4 shows an interactive system 3 which is similar to that described above under Fig. 1, but with a different approach to annotating and filtering the acoustic input detected by the microphone arrangement M. Furthermore, the interactive system 3 of this embodiment is equipped with an additional input modality 40 in the form of a camera 40. The camera 40 can be incorporated in an 'eye' of the interactive system 3 as described above under Figs. 2 to 3b. Here, once the language understanding unit 19 has identified a wake-up phrase in an input signal S to the speech recognition system 1, an initial interaction direction D is determined. The input modality 40 is directed at the source 2 (the user 2) of the speech signal S, and can continually or intermittently capture images 41 of the user 2, which are then evaluated or processed in a feature extraction unit 42. For example, analysis of the images 41 can show that the user 2 is moving out of the field of vision of the camera 40, so that the interaction direction will need to be re-determined. Information 43 obtained from the feature extraction unit 42 is forwarded to the dialogue engine 15. Such information 43 can also be used to trigger events in a dialogue on the basis of gesture recognition, etc.
In this embodiment, the information 43 from the feature extraction unit 42 is interpreted by the dialogue engine 15 to adapt prior knowledge about the original interaction direction D to reflect the new position (not shown in the diagram) of the user 2. A corresponding signal 44 is supplied to the filtering unit 18 for the decision to accept or discard an annotated speech recognition output 14. For example, the user 2 might have initiated the interaction from the direction D and then, without speaking, moved to a new position, while someone else moved to the original position of the user 2. Because the camera 40 has tracked the movements of the user 2, if the other person speaks from the initial direction D, the filtering unit 18, using the updated information 44 obtained with the help of the feature extraction unit 42, can decide that the speech input from the other person is to be discarded, and that only the annotated speech input 14 corresponding to the user 2 at his new position is to be accepted as valid.
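In sketch form, this 'swapped positions' scenario reduces to comparing each annotated direction against the tracked direction supplied by signal 44, rather than against the original wake-up direction. The angles below are invented solely to replay the scenario:

    def accept(annotated_direction_deg, tracked_direction_deg, tolerance_deg=5.0):
        # Accept only speech annotated with a direction close to where the
        # camera currently places the user.
        diff = (annotated_direction_deg - tracked_direction_deg
                + 180.0) % 360.0 - 180.0
        return abs(diff) <= tolerance_deg

    # The user woke the system from 30 degrees, then silently moved; the
    # camera (signal 44) now places him at -10 degrees, while a bystander
    # occupies the original position.
    tracked = -10.0
    assert accept(-10.0, tracked)       # user at the new position: valid
    assert not accept(30.0, tracked)    # bystander at the old direction D: discarded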
Although the present invention has been disclosed in the form of preferred embodiments and variations thereon, it will be understood that numerous additional modifications and variations could be made thereto without departing from the scope of the invention. Naturally, any number of microphones can be used for determining the relative direction of the user with respect to the speech recognition system, two being simply the minimum required for a straightforward way of performing acoustic source localisation. Furthermore, the units or modules used to perform the various functions such as speech recognition, annotation, filtering, language understanding, etc., can be realised in any suitable configuration or constellation in an interactive system. The speech recognition system according to the invention can be incorporated in any suitable interactive system, including those intended for use in noisy environments, or in environments where several users may sometimes speak at the same time. The method according to the invention can be used for situations in which only a single user is speaking, while an alternative method, for example using beam-forming, can be applied in other, noisier situations. For the sake of clarity, it is to be understood that the use of "a" or "an" throughout this application does not exclude a plurality, and "comprising" does not exclude other steps or elements. A "unit" or "module" can comprise a number of units or modules, unless otherwise stated.

Claims

CLAIMS:
1. A method of driving a speech recognition system (1), which method comprises:
- detecting a speech signal (S, S1, S2) using a microphone arrangement (M);
- performing speech recognition on the speech signal (S, S1, S2) detected by a first microphone (M1) of the microphone arrangement (M) to obtain a speech recognition output (10);
- using the first microphone (M1) of the microphone arrangement (M) and at least a second microphone (M2) of the microphone arrangement (M) to determine a relative direction (D, D', D1, D2) between the source (2, 21, 22) of the speech signal (S, S1, S2) and the microphone arrangement (M);
- processing the speech recognition output (10) according to the determined relative direction (D, D', D1, D2).
2. A method according to claim 1, wherein processing the speech recognition output (10) according to the determined relative direction (D, D', D1, D2) comprises filtering the speech recognition output (10) to disregard any speech recognition output (10) corresponding to a speech signal (S1, S2) originating from a direction (D1, D2) other than a specific direction (D, D').
3. A method according to claim 1 or claim 2, wherein the speech recognition output (10) is annotated, for further processing, with the determined relative direction (D, D', D1, D2).
4. A method according to claim 2 or claim 3, wherein the step of filtering the speech recognition output (10) to disregard any speech recognition output (10) corresponding to a speech signal (S1, S2) originating from a direction (D1, D2) other than a specific direction (D, D') is carried out in a dialogue engine (15), which dialogue engine (15) establishes the specific direction (D, D').
5. A method according to any of claims 2 to 4, wherein an initial specific direction (D) between the source (2) of a predefined speech signal (S) and the microphone arrangement (M) is determined upon recognition of the predefined speech signal (S), which predefined speech signal (S) can be accepted from any direction relative to the microphone arrangement (M).
6. A method according to any of claims 2 to 5, wherein source characterising information (43) is obtained to characterise the source (2) of the speech signal (S) originating from the specific direction (D, D').
7. A method according to claim 6, wherein the specific direction (D, D') between the source (2) of the speech signal (S) and the microphone arrangement (M) is re-determined on the basis of the source characterising information (43).
8. A speech recognition system (1) comprising:
- a microphone arrangement (M) with a first microphone (M1) and at least a second microphone (M2) for detecting a speech signal (S, S1, S2);
- an acoustic source localisation unit (11) for determining a relative direction (D, D', D1, D2) between the source (2, 21, 22) of the speech signal (S, S1, S2) and the microphone arrangement (M) using the first microphone (M1) and at least a second microphone (M2) of the microphone arrangement (M);
- a speech recognition unit (12) for performing speech recognition on the speech signal (S, S1, S2) detected by the first microphone (M1) of the microphone arrangement (M);
- a processing unit (18) for processing the speech recognition output (10) according to the determined relative direction (D, D', D1, D2).
9. An interactive system (3) comprising:
- a microphone arrangement (M) with a first microphone (M1) and at least a second microphone (M2) for detecting a speech signal (S, S1, S2);
- an acoustic source localisation unit (11) for determining a relative direction (D, D', D'', D1, D2) between the source (2, 21, 22) of the speech signal (S, S1, S2) and the microphone arrangement (M) using the first microphone (M1) and at least a second microphone (M2) of the microphone arrangement (M);
- a speech recognition unit (12) for performing speech recognition on the speech signal (S, S1, S2) detected by the first microphone (M1) of the microphone arrangement (M);
- a processing unit (18) for processing the speech recognition output (10) according to the determined relative direction (D, D', D'', D1, D2);
- and a dialogue engine (15) for interpreting the speech recognition output (10) of the speech recognition system (1).
10. An interactive system (3) according to claim 9, comprising a feature extraction unit (42) for obtaining source characterising information (43) pertaining to the source (2) of a speech signal (S).
11. A computer program product directly loadable into the memory of a programmable speech recognition system (1) for use in an interactive system (3), comprising software code portions for performing the steps of a method according to any one of claims 1 to 7, when said product is run on the speech recognition system (1).
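To tie the claimed steps together, the following end-to-end sketch strings the earlier fragments into one processing pass: recognition on M1, localisation from M1 and M2, and direction-dependent processing of the output. The recogniser argument stands in for any speech recognition engine; all names and parameters are illustrative assumptions, not part of the claims.

    import numpy as np

    def drive_speech_recognition(frame_m1, frame_m2, mic_distance_m,
                                 sample_rate_hz, recogniser,
                                 interaction_direction_deg, tolerance_deg=5.0):
        # Step 1: speech recognition on the first microphone's signal.
        text = recogniser(frame_m1)

        # Step 2: relative direction from the temporal delta (cf. Fig. 1).
        corr = np.correlate(frame_m1, frame_m2, mode="full")
        delta_s = (np.argmax(corr) - (len(frame_m2) - 1)) / sample_rate_hz
        path_diff = np.clip(delta_s * 343.0, -mic_distance_m, mic_distance_m)
        direction_deg = float(np.degrees(np.arcsin(path_diff / mic_distance_m)))

        # Step 3: process the output according to the determined direction.
        diff = (direction_deg - interaction_direction_deg + 180.0) % 360.0 - 180.0
        return text if abs(diff) <= tolerance_deg else None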
PCT/IB2007/051742 2006-05-31 2007-05-09 Method of driving a speech recognition system WO2007138503A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP06114774 2006-05-31
EP06114774.0 2006-05-31

Publications (1)

Publication Number Publication Date
WO2007138503A1 (en)

Family

ID=38441523

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2007/051742 WO2007138503A1 (en) 2006-05-31 2007-05-09 Method of driving a speech recognition system

Country Status (2)

Country Link
TW (1) TW200809768A (en)
WO (1) WO2007138503A1 (en)

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN105845135A (en) * 2015-01-12 2016-08-10 芋头科技(杭州)有限公司 Sound recognition system and method for robot system
CN106328165A (en) * 2015-06-30 2017-01-11 芋头科技(杭州)有限公司 Robot autologous sound source elimination system
CN108231073B (en) 2016-12-16 2021-02-05 深圳富泰宏精密工业有限公司 Voice control device, system and control method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004038697A1 (en) * 2002-10-23 2004-05-06 Koninklijke Philips Electronics N.V. Controlling an apparatus based on speech

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LLEIDA E ET AL: "Robust continuous speech recognition system based on a microphone array", ACOUSTICS, SPEECH AND SIGNAL PROCESSING, 1998. PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON SEATTLE, WA, USA 12-15 MAY 1998, NEW YORK, NY, USA,IEEE, US, vol. 1, 12 May 1998 (1998-05-12), pages 241 - 244, XP010279154, ISBN: 0-7803-4428-6 *
YAMADA T ET AL: "Robust speech recognition with speaker localization by a microphone array", SPOKEN LANGUAGE, 1996. ICSLP 96. PROCEEDINGS., FOURTH INTERNATIONAL CONFERENCE ON PHILADELPHIA, PA, USA 3-6 OCT. 1996, NEW YORK, NY, USA,IEEE, US, vol. 3, 3 October 1996 (1996-10-03), pages 1317 - 1320, XP010237923, ISBN: 0-7803-3555-4 *

Cited By (22)

Publication number Priority date Publication date Assignee Title
EP2266113A4 (en) * 2008-04-25 2015-12-16 Nokia Technologies Oy Method and apparatus for voice activity determination
US8244528B2 (en) 2008-04-25 2012-08-14 Nokia Corporation Method and apparatus for voice activity determination
US8275136B2 (en) 2008-04-25 2012-09-25 Nokia Corporation Electronic device speech enhancement
US8611556B2 (en) 2008-04-25 2013-12-17 Nokia Corporation Calibrating multiple microphones
US8682662B2 (en) 2008-04-25 2014-03-25 Nokia Corporation Method and apparatus for voice activity determination
WO2009130591A1 (en) 2008-04-25 2009-10-29 Nokia Corporation Method and apparatus for voice activity determination
EP2211337A1 (en) * 2009-01-23 2010-07-28 Victor Company Of Japan, Ltd. Electronic apparatus operable by external sound
US8189430B2 (en) 2009-01-23 2012-05-29 Victor Company Of Japan, Ltd. Electronic apparatus operable by external sound
CN105556592A (en) * 2013-06-27 2016-05-04 亚马逊技术股份有限公司 Detecting self-generated wake expressions
US10720155B2 (en) 2013-06-27 2020-07-21 Amazon Technologies, Inc. Detecting self-generated wake expressions
WO2014210392A2 (en) 2013-06-27 2014-12-31 Rawles Llc Detecting self-generated wake expressions
JP2016524193A (en) * 2013-06-27 2016-08-12 ロウルズ リミテッド ライアビリティ カンパニー Detection of self-generated wake expressions
EP3014607A4 (en) * 2013-06-27 2016-11-30 Rawles Llc Detecting self-generated wake expressions
US9747899B2 (en) 2013-06-27 2017-08-29 Amazon Technologies, Inc. Detecting self-generated wake expressions
US11600271B2 (en) 2013-06-27 2023-03-07 Amazon Technologies, Inc. Detecting self-generated wake expressions
US11568867B2 (en) 2013-06-27 2023-01-31 Amazon Technologies, Inc. Detecting self-generated wake expressions
US10679648B2 (en) 2014-04-17 2020-06-09 Microsoft Technology Licensing, Llc Conversation, presence and context detection for hologram suppression
US10529359B2 (en) 2014-04-17 2020-01-07 Microsoft Technology Licensing, Llc Conversation detection
WO2015160561A1 (en) * 2014-04-17 2015-10-22 Microsoft Technology Licensing, Llc Conversation detection
RU2685970C2 (en) * 2014-04-17 2019-04-23 МАЙКРОСОФТ ТЕКНОЛОДЖИ ЛАЙСЕНСИНГ, ЭлЭлСи Conversation detection
US9922667B2 (en) 2014-04-17 2018-03-20 Microsoft Technology Licensing, Llc Conversation, presence and context detection for hologram suppression
EP3318888A4 (en) * 2015-06-30 2019-03-27 Yutou Technology (Hangzhou) Co., Ltd. Robot voice direction-seeking turning system and method

Also Published As

Publication number Publication date
TW200809768A (en) 2008-02-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07735822

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07735822

Country of ref document: EP

Kind code of ref document: A1