WO2007138503A1 - Method of driving a speech recognition system - Google Patents
Method of driving a speech recognition system
- Publication number
- WO2007138503A1 (PCT/IB2007/051742)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech recognition
- microphone
- speech
- source
- speech signal
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02165—Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
Definitions
- the invention relates to a method of driving a speech recognition system, to a speech recognition system, and to an interactive system.
- Speech recognition can be made use of in basic systems - to carry out specific functions in response to a spoken command - or can be incorporated into more complex systems which allow an interaction or 'dialogue' to be carried out.
- Developments are being made in the field of home dialogue systems, which allow interaction between a user and, for example, a humanoid or robot-like device. Basically, words or phrases spoken by a user are processed in a speech recogniser, and the results of the speech recognition are interpreted in a dialogue manager or dialogue engine in order to determine what the user has said, and how to react to what he has said.
- Such dialogues or interactions on the basis of speech alone are particularly useful in "hands free” environments, i.e. environments in which a user cannot interact with a system using, say, a keyboard or similar user interface, or where such a physical interaction is undesirable.
- For example, it is simple and intuitive to give a spoken command such as "Turn on the lights" or "Play some music", and more convenient for a user than actually pressing a switch or pushing buttons on a remote control.
- WO2004/038697A1 suggests using a 'beam-former' to cancel out any unwanted acoustic signals detected by the microphones of the system.
- A beam-former uses a microphone setup with several microphones for collecting acoustic input, and determines which acoustic signals originate from outside of a certain direction. These unwanted or invalid signals are then negated, for example by overlaying them with the appropriate inverse signals to cancel them out.
- Another disadvantage common to some state of the art speech recognition systems is that they require a first speech recogniser solely for the purpose of reacting to an initiation or 'wake-up' phrase. As long as such a speech recognition system is waiting for input in an idle state, this dedicated speech recogniser is continually 'listening' for a wake-up or initiation phrase. Upon recognition of the wake-up phrase, a second speech recogniser commences processing the speech input according to an interaction grammar. The need for two speech recognisers makes such systems comparatively expensive and more complex.
- the present invention provides a method of driving a speech recognition system, which method comprises detecting a speech signal using a microphone arrangement, performing speech recognition on the speech signal detected by a first microphone of the microphone arrangement to obtain a speech recognition output, using the first microphone of the microphone arrangement and at least a second microphone of the microphone arrangement to determine a relative direction between the source of the speech signal and the microphone arrangement, and processing the speech recognition output according to the determined relative direction.
- the source of the speech signal can be a human user of the speech recognition system, but might equally well be an artificially generated source of speech. In the following, however, for the sake of simplicity, it will be assumed that the source of the speech signal is a user, without restricting the invention in any way.
- a user might be positioned at any location relative to the speech recognition system.
- in a home dialogue system, the user might be close to or at a distance from the microphones of the speech recognition system, which microphones can be arranged, for example, as the "ears" on either side of a "head" of the home dialogue system.
- the user's relative position with respect to the microphone arrangement can easily be determined, for example by simply evaluating the temporal difference in the acoustic input detected by the first and second microphones, i.e. the delay between the sound signal impinging at the first microphone and at the second microphone. If there is a significant delay, for example, this would indicate that the user is positioned towards the side of the microphone arrangement with that microphone that first detected the signal.
- the length or size of the measured delay between the two microphones is an indication of how far the user is positioned to one side of the microphone arrangement. If, on the other hand, little or no delay is determined, it can be assumed that the user is positioned essentially 'between' the microphones, for example more or less directly in front of or behind the microphone arrangement. To discern whether the user is in front of or behind the microphone arrangement, it is advantageous to make use of an additional third microphone. Use of such an additional microphone also allows determination of the relative position of the user 'above' or 'below' the microphone arrangement.
- Evaluating a sound signal detected by more than one microphone in this way to determine the source of the sound is known as 'acoustic source localisation'.
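The delay evaluation described above can be illustrated with a brute-force cross-correlation over candidate lags; the following pure-Python sketch is illustrative only (the function name, signal representation, and search strategy are assumptions, not taken from the patent):

```python
# Hypothetical sketch of delay-based acoustic source localisation.
# Signals are plain lists of float samples from two microphones.

def estimate_delay(sig1, sig2, max_lag):
    """Return the lag (in samples) at which sig2 best matches sig1.

    A positive lag means the sound reached microphone 1 first and
    arrived at microphone 2 `lag` samples later.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(sig1)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += sig1[i] * sig2[j]  # correlation at this lag
        if score > best_score:
            best_score, best_lag = score, lag
    return best_lag
```

In practice a real system would use an FFT-based cross-correlation for efficiency, but the brute-force search shows the principle: the lag with the highest correlation is the estimated inter-microphone delay.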
- this relative direction is used, in the method according to the invention, to process the speech recognition output.
- the speech recogniser can continually work on any acoustic or speech input signal, regardless of its source.
- every acoustic signal, regardless of its source relative to the microphone arrangement is 'admitted' and subject to the speech recognition process. Whether a speech recognition output is deemed to be valid or not is later decided on the basis of the determined relative direction.
- the speech recognition and relative direction determination processes can take place simultaneously, or might be delayed with respect to each other.
- An obvious advantage of the method according to the invention is that the position of the user relative to the microphone arrangement, and therefore to the speech recognition system, can be simply and easily determined, and using this direction to process the speech recognition output means that there is no need for computationally expensive beam-forming to cancel unwanted input. A distortion of the valid input, i.e. the speech signal coming from the user, will therefore not arise with the method according to the invention. Furthermore, the simplicity of the approach (speech recognition on a single microphone and direction estimation with at least one additional microphone) allows for an advantageously economical realisation.
- a suitable speech recognition system comprises a microphone arrangement with a first microphone and at least a second microphone for detecting a speech signal, an acoustic source localisation unit for determining a relative direction between the source of the speech signal and the microphone arrangement using the first microphone and at least a second microphone of the microphone arrangement, a speech recognition unit for performing speech recognition on the speech signal detected by the first microphone of the microphone arrangement and a processing unit for processing the speech recognition output according to the determined relative direction.
- a speech recognition system can be used in a simple realisation, for example to cause a specific action to be taken in response to a predefined command spoken by a user, or in a more complex application such as an interactive system allowing a dialogue with the user.
- An interactive system comprises the units or modules of the speech recognition system described above, and a dialogue engine for interpreting the speech recognition output of the speech recognition system, whereby the processing unit can be realised as part of the dialogue engine.
- processing the speech recognition output according to the determined relative direction comprises filtering the speech recognition output to disregard any speech recognition output corresponding to a speech signal originating from a direction other than a specific, or given, direction. In other words, an interaction is only allowed in the specific direction, so that any speech recognition output originating from a different direction, i.e. from outside the specific direction, is disregarded.
- the filtering might involve, for example, accepting a speech recognition output corresponding to a speech input whose origin deviates by no more than a pre-defined degree from the specific direction. In this way, only speech recognition output originating from essentially the specific direction is then processed. Other speech input coming from a different direction, such as other people talking, or speech coming from a television or radio, will be ignored.
- a speech recognition output is associated or linked in some appropriate manner with higher-level information about the source or direction of the corresponding speech signal by simply annotating the speech recognition output with the determined relative direction.
- the term 'annotation' simply means marking or tagging a speech recognition output - for example the output corresponding to a word or a phrase - with information specifying the relative direction from which the corresponding speech signal originated.
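A minimal sketch of such an annotation, tagging each recognised word or phrase with the estimated direction of its source (the class and field names are invented for this illustration):

```python
from dataclasses import dataclass

@dataclass
class AnnotatedOutput:
    text: str         # the recognised word or phrase
    direction: float  # estimated bearing of the source, in degrees

def annotate(recognised_text, estimated_direction):
    """Tag a speech recognition output with the relative direction
    from which the corresponding speech signal originated."""
    return AnnotatedOutput(recognised_text, estimated_direction)
```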
- a dialogue engine for controlling the dialogue in an interactive system can avail of speech understanding algorithms to determine what speech could belong to an interaction between a user and the interactive system, and which speech input is to be ignored.
- a dialogue engine generally has a predefined 'interaction grammar', such as a set of possible commands or queries to which the interactive system can respond.
- a simple home dialogue system might have an interaction grammar consisting of basic commands such as "Turn on the lights", “Play music”, "Any new e-mail?", etc.
- the corresponding speech recognition result must be examined to determine whether it corresponds to a part of the interaction grammar.
- the step of filtering the speech recognition output to disregard any results originating from a direction other than a specific direction is carried out in a dialogue engine.
- the dialogue engine preferably avails of a language understanding unit to carry out the necessary evaluation of the speech recognition results. Since even a basic system using speech recognition needs language understanding and interpreting units to deal with the spoken input commands, it is advantageous to perform the filtering at this point.
- the dialogue engine can establish or define the specific direction for the current interaction. For example, if the dialogue engine has several speech inputs from different directions to evaluate, and it determines that only the speech input from a certain direction matches a part of the interaction grammar, i.e., is 'valid', the dialogue engine can then decide that the direction corresponding to the valid speech input is to be the specific direction from which to accept speech recognition results, and any speech recognition results originating from other directions are to be disregarded.
- An interaction between a user and the interactive system can be of any duration. A user can use the interactive system in a short interaction to have a command executed, for example to turn on or off the lights.
- a longer interaction might be started by the user to have the interactive system respond to a series of commands or queries over a longer period of time.
- the start of an interaction can be announced or made known to the interactive system by the user uttering a certain pre-defined phrase, referred to in the following as a 'wake-up' phrase.
- Once the dialogue engine has recognised this phrase, it can assume that an interaction is likely to follow from the relative direction from which the wake-up phrase was spoken. For example, if a user has uttered a wake-up phrase from a certain relative direction, the dialogue engine can decide that this direction is to be the specific direction for the following interaction.
- the predefined speech signal or wake-up phrase can be accepted or considered from any direction relative to the microphone arrangement of the speech recognition system, so that the user of the interactive system can initiate an interaction from any position. Because the speech recognition results are accepted or discarded on the basis of the direction from which they originate, an obvious advantage of the speech recognition system according to the invention is that a second speech recogniser is not required for the purpose of identifying a wake-up phrase, so that a speech recognition system according to the invention, with a single speech recogniser, is more economical and easier to realise compared to state of the art speech recognition systems. By expecting input from a single speaker positioned at a certain specific direction with respect to the interactive system, computationally expensive beam-formers are not needed.
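Putting the wake-up behaviour together, a single-recogniser flow along these lines could be sketched as follows (the phrases, return values, and 15-degree tolerance are invented for this sketch):

```python
# Hypothetical single-recogniser dialogue flow: the wake-up phrase is
# accepted from any direction and fixes the interaction direction;
# everything else is only accepted from (near) that direction.

WAKE_UP_PHRASES = {"look at me", "over here"}

class DialogueEngine:
    def __init__(self, tolerance=15.0):
        self.interaction_dir = None  # no interaction in progress yet
        self.tolerance = tolerance   # angular tolerance in degrees

    def handle(self, text, direction):
        if text in WAKE_UP_PHRASES:
            self.interaction_dir = direction  # focus on this speaker
            return "focused"
        if self.interaction_dir is None:
            return "ignored"  # idle: only a wake-up phrase counts
        if abs(direction - self.interaction_dir) <= self.tolerance:
            return "accepted"
        return "ignored"  # speech from another direction is discarded
```

Note that one recogniser handles both the wake-up phrase and the interaction: the decision to accept or discard is made afterwards, on the basis of direction.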
- a user might like to have some kind of feedback that the interactive system has identified him as the interacting user. Therefore, the interactive system might avail of some kind of feedback to indicate that it is focusing on the user. For example, a moveable head of an interactive system might swivel to 'face' the user when he has spoken the wake-up phrase, and the 'eyes' in the head might briefly light up. The user then knows that the interactive system has identified him as the interacting user. Some other type of feedback could be used if the interactive system is not equipped with a moveable head. For example, one or more LEDs (light-emitting diodes), positioned on that part of the interactive system facing the user, might light up to indicate to the user that the interactive system is 'listening' to him.
- the interaction between the user and the interactive system is allowed in the determined relative direction, i.e. speech coming from any other direction is ignored.
- the user can simply utter the wake-up phrase whenever he changes position. For example, the wake-up phrase might be "Look at me” or "Over here".
- Whenever the dialogue engine 'hears' this phrase, it can re-adjust to the direction from which the phrase came, and can show this by causing the head of the interactive system to swivel or rotate so that it is facing in the direction from which the wake-up phrase was spoken, i.e. in the direction of the user.
- If he forgets to say the wake-up phrase after moving to a different position, he might be annoyed to find out that the interaction has been terminated, or that his commands are being ignored. Therefore, in a preferred method according to the invention, in particular for interactive systems, source characterising information, i.e. some high-level information describing the user of the interactive system, is obtained to characterise the user speaking from the specific direction. For example, speaker characteristic information about the speech signal corresponding to a wake-up phrase can be determined and later used to decide, on the basis of speaker recognition, whether a speech input signal has been spoken by that same user, in which case the speech input is accepted, or by another user, in which case the speech input will be discarded.
- Another way to characterise the interacting user might be to perform image processing.
- a camera of the interactive system can be aimed at the direction from which the user has spoken, and an image can be captured of the user's face.
- the interactive system can use the camera to track the user when he changes position.
- the images captured by the camera can be analysed to determine whether the person who has spoken is the user that initiated the interaction.
- the relative direction between the user and the microphone arrangement is adjusted or updated on the basis of the source characterising information.
- Obtaining source characterising information such as speaker descriptive information or image recognition information can be carried out in a suitable unit.
- An interactive system according to the invention preferably comprises a feature extraction unit for carrying out such calculations.
- the feature extraction unit can be supplied with suitable data from, for example, a microphone or a digital camera of the interactive system.
- the processing steps described above might all be carried out using appropriate software algorithms, many of which are already well-known, such as those for comparing digitised speech signals, for performing speech recognition, and for interpreting speech recognition results in a dialogue on the basis of an interaction grammar.
- the software can run as a program, or a number of programs, on one or more microprocessors of a speech recognition system, or an interactive system.
- Fig. 1 shows a block diagram of an interactive system comprising a speech recognition system according to a first embodiment of the invention
- Fig. 2 shows an interactive system according to an embodiment of the invention, and a number of sources of speech input
- Fig. 3a shows the interactive system according to Fig. 2, and a user at a first position relative to the interactive system
- Fig. 3b shows the interactive system according to Fig. 3a, and the user at a second position relative to the interactive system
- Fig. 4 shows a block diagram of a speech recognition system according to a second embodiment of the invention.
- Fig. 1 shows a block diagram of an interactive system 3 comprising a speech recognition system 1 according to an embodiment of the invention.
- a microphone arrangement M avails of a first microphone M1 and a second microphone M2, separated from each other by a certain distance.
- the distance between the two microphones M1, M2 can be governed by the ultimate realisation of the device in which the speech recognition system is to be incorporated, or by desired accuracy levels.
- the input from the first microphone M1 is forwarded to a speech recognition unit 12, which performs speech recognition on the input signal.
- the input from the second microphone M2 is forwarded to an acoustic source localisation unit 11, which also receives the input from the first microphone M1.
- the direction D of the source 2 of the detected signal S is estimated by evaluating the temporal delta between the signals detected at the two microphones M1, M2.
- the source 2, or user 2, is shown to be closer to the second microphone M2, so that sound S emanating from this source 2 will impinge on the second microphone M2 first, and then, after a brief delay or temporal delta, the sound S will impinge on the first microphone M1.
- the magnitude of the temporal delta and the known distance between the two microphones M1, M2 can be used by the acoustic source localisation unit 11 to estimate the relative direction D from which the sound S has originated.
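Under a far-field assumption, the bearing follows from the temporal delta and the microphone spacing as sin(theta) = c * dt / d; the following is a hedged sketch (the function name and sign convention are assumptions for illustration):

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature

def direction_from_delay(delay_s, mic_distance_m):
    """Estimate the bearing (in degrees) of a sound source from the
    inter-microphone delay, using the far-field approximation
    sin(theta) = c * dt / d.  0 degrees means the source is straight
    ahead of the microphone pair; the sign indicates the side."""
    ratio = SPEED_OF_SOUND * delay_s / mic_distance_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp numerical overshoot
    return math.degrees(math.asin(ratio))
```

For example, with microphones 20 cm apart, zero delay yields a bearing of 0 degrees (source directly ahead), while the maximum possible delay (d / c) yields 90 degrees (source fully to one side).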
- the speech recognition output 10 passes unchanged through an annotation unit 17 to a dialogue engine 15, where it further passes, unchanged, through a filtering unit 18 before being subsequently analysed in a language understanding unit 19 of the dialogue engine 15 to see if a wake-up phrase, described in a wake-up grammar supplied by a database 9, has been uttered.
- an appropriate signal 44 is forwarded to the annotation unit 17, informing the annotation unit 17 that the direction information 13 is now to be defined as the specific interaction direction D.
- the speech recognition output 10 is annotated in the annotation unit 17 with the direction information 13 from the acoustic source localisation unit 11, so that any acoustic input signals are associated with the direction from where they originated.
- the annotation unit 17 can compare the direction information 13 to the specific interaction direction D and annotate any subsequent speech recognition output 10 with a marker, tag, or other descriptive label describing the degree of closeness of the speech recognition output 10 to the specific interaction direction D.
- the annotated speech recognition output 14 is forwarded to the filtering unit 18.
- the output 16 of the dialogue engine 15, in response to the interaction taking place, can be a suitable signal 16, for example a signal 16 to a controller, not shown in the diagram, to cause some event to take place in response to a command spoken by the user 2 and interpreted by the dialogue engine 15.
- in Fig. 2, an interactive system 3 is shown in an environment with two potential users 2, 21 and a television 22.
- the interactive system 3, shown as a 'robot' in a human-like realisation, is equipped with a pair of microphones M1, M2 positioned as the 'ears' on a 'head' 20 of the interactive system 3.
- the head 20 can rotate on a 'neck' 23 attached to a 'body' 24 of the interactive system 3.
- the two potential users 2, 21 and the television 22 can all issue speech signals S, S1, S2.
- the microphones M1, M2 detect any incoming acoustic signals S, S1, S2, which are processed in a speech recognition unit, not shown in the diagram, to determine their validity.
- the output of the speech recognition is analysed for a wake-up signal.
- when a first user 2 speaks a wake-up phrase, this is recognised by the interactive system 3, and the direction D from which the speech signal S originated is specified to be the interaction direction.
- the interactive system 3 only responds to speech S coming from this direction D.
- Other speech signals S1, S2, for example when the person 21 says something, or when speech emanates from the television 22, will be recognised but simply ignored.
- at a later point, the second user 21 might say the wake-up phrase, upon which the interactive system 3 determines the relative direction D1 for this current interaction on the basis of the speech signals S1 originating from that user 21.
- any speech signals S, S2 coming from the first user 2 or the TV 22 will then be recognised but ignored.
- Figs. 3a and 3b show a sequence of positions of the user 2 with respect to the interactive system 3.
- the user 2 is positioned to the right of the interactive system 3 when he speaks the wake-up phrase. This is detected by the interactive system 3, and the head 20 is rotated so that the user 2 has the impression that the robot 3 is 'attentive' and 'listening'.
- an initial specific interaction direction D is established by acoustic source localisation using the input of the two microphones M1, M2.
- a camera incorporated into one of the 'eyes' 30, 31 of the robot 3 captures one or more images of the user 2. These images can be processed in a feature extraction unit to identify characterising features of the user 2.
- the user 2 can move around. Images captured by the camera can be processed to track the position of the user 2, for example by using face recognition or pattern recognition algorithms, so that the direction of interaction is updated to the altered direction D'. Now, only speech input S from this altered direction D' will be considered in the continued interaction.
- Fig. 4 shows an interactive system 3 which is similar to that described above in Fig. 1, but with a different approach to annotating and filtering acoustic input detected by the microphone arrangement M. Furthermore, the interactive system 3 of this embodiment avails of an additional input modality 40 in the form of a camera 40.
- the camera 40 can be incorporated in an 'eye' of the interactive system 3 as described above under Figs. 2 - 3b.
- an initial interaction direction D is determined.
- the input modality 40 is directed at the source 2 (the user 2) of the speech signal S, and can continually or intermittently capture images 41 of the user, which are then evaluated or processed in a feature extraction unit 42.
- analysis of the images 41 can show that the user 2 is moving out of the field of vision of the camera 40, so that the interaction direction will need to be re-determined.
- Information 43 obtained from the feature extraction unit 42 is forwarded to the dialogue engine 15. Such information 43 can be used to trigger events in a dialogue on the basis of gesture recognition, etc.
- the information 43 from the feature extraction engine 42 is interpreted by the dialogue engine 15 to adapt prior knowledge about the original interaction direction D to reflect the new position (not shown in the diagram) of the user 2.
- a corresponding signal 44 is supplied to the filtering unit 18 for the decision to accept or discard an annotated speech recognition output 14.
- the user 2 might have initiated interaction from the direction D. Then, without speaking, he moves to a new position. In the meantime, someone else moves to the original position of the user 2.
- the camera 40 has tracked the movements of the user 2, so that if the other person speaks from the initial direction D, the filtering unit 18, using the updated information 44 obtained with the help of the feature extraction unit 42, will be able to decide that speech input from the other person is to be discarded, and only the annotated speech input 14 corresponding to the user 2 at his new position is to be recognised as valid.
- any number of microphones can be used for determining the relative direction of the user with respect to the speech recognition system, two being simply the minimum for a fairly simple way of performing acoustic source localisation.
- the units or modules used to perform the various functions can be realised in any suitable configuration or constellation in an interactive system.
- the speech recognition system according to the invention can be incorporated in any suitable interactive system intended for use also in noisy environments, or in environments where it may be that several users sometimes speak at the same time.
- the method according to the invention can be used for situations in which only a single user is speaking, and an alternative method, for example using beam-forming, can be applied for other, noisier, situations.
- a "unit" or "module" can comprise a number of units or modules, unless otherwise stated.
Abstract
A method of driving a speech recognition system (1), which method comprises detecting a speech signal (S, S1, S2) using a microphone arrangement (M), performing speech recognition on the speech signal (S, S1, S2) detected by a first microphone (M1) of the microphone arrangement (M) to obtain a speech recognition output (10), using the first microphone (M1) of the microphone arrangement (M) and at least a second microphone (M2) of the microphone arrangement (M) to determine a relative direction (D, D', D1, D2) between the source (2, 21, 22) of the speech signal (S, S1, S2) and the microphone arrangement (M), and processing the speech recognition output (10) according to the determined relative direction (D, D', D1, D2). The invention also relates to a speech recognition system (1) and to an interactive system (3).
Description
Method of driving a speech recognition system
The invention relates to a method of driving a speech recognition system, to a speech recognition system, and to an interactive system.
The use of speech recognition systems is becoming more and more widespread, as developments in this field lead to better speech recognition algorithms and more compact systems. Speech recognition can be made use of in basic systems - to carry out specific functions in response to a spoken command - or can be incorporated into more complex systems which allow an interaction or 'dialogue' to be carried out. Developments are being made in the field of home dialogue systems, which allow interaction between a user and, for example, a humanoid or robot-like device. Basically, words or phrases spoken by a user are processed in a speech recogniser, and the results of the speech recognition are interpreted in a dialogue manager or dialogue engine in order to determine what the user has said, and how to react to what he has said. Such dialogues or interactions on the basis of speech alone are particularly useful in "hands free" environments, i.e. environments in which a user cannot interact with a system using, say, a keyboard or similar user interface, or where such a physical interaction is undesirable. For example, it is simple and intuitive to give a spoken command such as "Turn on the lights" or "Play some music", and more convenient for a user than for him to actually press a switch or push buttons on a remote control.
However, such an interactive system must be able to distinguish between valid spoken input from the user with which the system is interacting, and any other "unwanted" or invalid input, i.e. a 'false alarm'. There are a number of approaches to dealing with this problem. For example, WO2004/038697A1 suggests using a 'beam-former' to cancel out any unwanted acoustic signals detected by the microphones of the system. A beam-former uses a microphone setup with several microphones for collecting acoustic input, and determines which acoustic signals originate from outside of a certain direction. These unwanted or invalid signals are then negated, for example by overlaying them with the appropriate inverse signals to cancel them out. However, this approach has the disadvantage that the desired or
valid signal, i.e. the speech signal originating from the user interacting with the system, becomes distorted. The distortion in the speech signal could result in erroneous speech recognition results, so the speech recognition system must be trained specifically for the beam-former environment, meaning that extensive speech data collection has to be carried out with that specific microphone setup. This training makes such a system comparatively expensive. Furthermore, should the speech recognition system be moved for use in a different environment, it would have to be re-trained for the new environment, adding to the overall cost. Because of the necessity to re-train, such systems are generally not portable and, in consequence, limited in their usefulness. Another disadvantage common to some state of the art speech recognition systems is that they require a first speech recogniser solely for the purpose of reacting to an initiation or 'wake-up' phrase. As long as such a speech recognition system is waiting for input in an idle state, this dedicated speech recogniser is continually 'listening' for a wake-up or initiation phrase. Upon recognition of the wake-up phrase, a second speech recogniser commences processing the speech input according to an interaction grammar. The need for two speech recognisers makes such systems comparatively expensive and more complex.
Therefore, it is an object of the invention to provide a simpler and more economical way of identifying valid speech input to a speech recognition system.
To this end, the present invention provides a method of driving a speech recognition system, which method comprises detecting a speech signal using a microphone arrangement, performing speech recognition on the speech signal detected by a first microphone of the microphone arrangement to obtain a speech recognition output, using the first microphone of the microphone arrangement and at least a second microphone of the microphone arrangement to determine a relative direction between the source of the speech signal and the microphone arrangement, and processing the speech recognition output according to the determined relative direction.
The source of the speech signal can be a human user of the speech recognition system, but might equally well be an artificially generated source of speech. In the following, however, for the sake of simplicity, it will be assumed that the source of the speech signal is a user, without restricting the invention in any way.
A user might be positioned at any location relative to the speech recognition system. For example, in a home dialogue system, the user might be close to or at a distance
from the microphones of the speech recognition system, which microphones can be arranged, for example, as the "ears" on either side of a "head" of the home dialogue system. The user's relative position with respect to the microphone arrangement can easily be determined, for example by simply evaluating the temporal difference in the acoustic input detected by the first and second microphones, i.e. the delay between the sound signal impinging at the first microphone and at the second microphone. If there is a significant delay, for example, this would indicate that the user is positioned towards that side of the microphone arrangement nearer the microphone that first detected the signal. The length or size of the measured delay between the two microphones is an indication of how far the user is positioned to one side of the microphone arrangement. If, on the other hand, little or no delay is determined, it can be assumed that the user is positioned essentially 'between' the microphones, for example more or less directly in front of or behind the microphone arrangement. To discern whether the user is in front of or behind the microphone arrangement, it is advantageous to make use of an additional third microphone. Use of such an additional microphone also allows determination of the relative position of the user 'above' or 'below' the microphone arrangement.
Evaluating a sound signal detected by more than one microphone in this way to determine the source of the sound is known as 'acoustic source localisation'.
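By way of illustration only, the delay-based direction estimate described above can be sketched as follows. The function names, the sample-based delay search, and the free-field geometry are assumptions made for the sketch, not details of the disclosed embodiment; the bearing follows from the far-field relation sin(angle) = speed of sound × delay / microphone spacing.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, at room temperature

def estimate_delay(sig1, sig2, sample_rate):
    """Return the arrival-time difference of sig2 relative to sig1, in
    seconds, found as the lag that maximises the cross-correlation of the
    two microphone signals (positive: the sound reached microphone 1 first)."""
    n = len(sig1)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-n + 1, n):
        score = sum(sig2[i + lag] * sig1[i]
                    for i in range(n) if 0 <= i + lag < n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate

def direction_from_delay(delay, mic_distance):
    """Map the delay to a bearing (radians) relative to the broadside of the
    microphone pair: 0.0 means 'straight ahead', +/- pi/2 fully to one side."""
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * delay / mic_distance))
    return math.asin(ratio)
```

In a practical system the brute-force correlation would be replaced by a faster estimator, but the principle - delay between two microphones plus known spacing yields a direction - is the one described above.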
Once the relative position of the user with respect to the microphone arrangement has been established, this relative direction is used, in the method according to the invention, to process the speech recognition output. In other words, the speech recogniser can continually work on any acoustic or speech input signal, regardless of its source. With the method according to the invention, every acoustic signal, regardless of its source relative to the microphone arrangement, is 'admitted' and subject to the speech recognition process. Whether a speech recognition output is deemed to be valid or not is later decided on the basis of the determined relative direction. The speech recognition and relative direction determination processes can take place simultaneously, or might be delayed with respect to each other.
An obvious advantage of the method according to the invention is that the position of the user relative to the microphone arrangement, and therefore to the speech recognition system, can be simply and easily determined, and using this direction to process the speech recognition output means that there is no need for computationally expensive beam-forming to cancel unwanted input. A distortion of the valid input, i.e. the speech signal coming from the user, will therefore not arise with the method according to the invention. Furthermore, the simplicity of the approach - speech recognition on a single microphone and direction estimation with at least one additional microphone - allows for an advantageously economical realisation.
A suitable speech recognition system comprises a microphone arrangement with a first microphone and at least a second microphone for detecting a speech signal, an acoustic source localisation unit for determining a relative direction between the source of the speech signal and the microphone arrangement using the first microphone and at least a second microphone of the microphone arrangement, a speech recognition unit for performing speech recognition on the speech signal detected by the first microphone of the microphone arrangement and a processing unit for processing the speech recognition output according to the determined relative direction. Such a speech recognition system can be used in a simple realisation, for example to cause a specific action to be taken in response to a predefined command spoken by a user, or in a more complex application such as an interactive system allowing a dialogue with the user.
An interactive system according to the invention comprises the units or modules of the speech recognition system described above, and a dialogue engine for interpreting the speech recognition output of the speech recognition system, whereby the processing unit can be realised as part of the dialogue engine.
The dependent claims and the subsequent description disclose particularly advantageous embodiments and features of the invention. As described above, evaluating the temporal difference between detection of a sound signal at two or more separate microphones yields information about the direction of the source of the signal relative to the microphones. Knowing the direction from which a speech signal originates can be used to decide whether the speech signal is 'valid' or 'invalid'. Therefore, in a particularly preferred embodiment of the invention, processing the speech recognition output according to the determined relative direction comprises filtering the speech recognition output to disregard any speech recognition output corresponding to a speech signal originating from a direction other than a specific, or given, direction. In other words, an interaction is only allowed in the specific direction, so that any speech recognition output originating from a different direction, i.e. unwanted false alarm input, is simply disregarded. The filtering might involve, for example, accepting a speech recognition output corresponding to a speech input whose origin deviates by no more than a pre-defined degree from the specific direction. In this way, only speech recognition output originating from essentially the specific direction is then processed. Other speech input coming from a
different direction, such as other people talking, or speech coming from a television or radio, will be ignored.
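By way of illustration only, such direction-based filtering might be sketched as follows; the tolerance value and the (text, direction) data layout are assumptions for the sketch, not part of the disclosed embodiment.

```python
def filter_by_direction(annotated_outputs, specific_direction, tolerance=5.0):
    """Keep only those speech recognition results whose annotated source
    direction (in degrees) deviates from the current interaction direction
    by no more than the given tolerance; all other results are disregarded."""
    return [text for text, direction in annotated_outputs
            if abs(direction - specific_direction) <= tolerance]
```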
Filtering out any invalid speech recognition results according to direction can take place at any suitable stage. In a further preferred embodiment of the invention, a speech recognition output is associated or linked in some appropriate manner with higher-level information about the source or direction of the corresponding speech signal by simply annotating the speech recognition output with the determined relative direction. Here, the term 'annotation' simply means marking or tagging a speech recognition output - for example the output corresponding to a word or a phrase - with information specifying the relative direction from which the corresponding speech signal originated.
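By way of illustration only, such an annotation might be sketched as a simple tagging of each recognition result with its direction estimate; the names and types here are assumptions for the sketch.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedOutput:
    """A recognition result tagged with the relative direction (in degrees)
    from which the corresponding speech signal was determined to originate."""
    text: str
    direction: float

def annotate(recognition_output, determined_direction):
    # Attach the direction estimate to the recogniser's output so that a
    # later processing stage can decide whether the input is valid.
    return AnnotatedOutput(recognition_output, determined_direction)
```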
A dialogue engine for controlling the dialogue in an interactive system can avail of speech understanding algorithms to determine what speech could belong to an interaction between a user and the interactive system, and which speech input is to be ignored. To this end, a dialogue engine generally has a predefined 'interaction grammar', such as a set of possible commands or queries to which the interactive system can respond. For example, a simple home dialogue system might have an interaction grammar consisting of basic commands such as "Turn on the lights", "Play music", "Any new e-mail?", etc. To determine whether a speech input is valid or invalid, the corresponding speech recognition result must be examined to determine whether it corresponds to a part of the interaction grammar. Therefore, in a preferred embodiment of the invention, the step of filtering the speech recognition output to disregard any results originating from a direction other than a specific direction is carried out in a dialogue engine. The dialogue engine preferably avails of a language understanding unit to carry out the necessary evaluation of the speech recognition results. Since even a basic system using speech recognition needs language understanding and interpreting units to deal with the spoken input commands, it is advantageous to perform the filtering at this point.
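By way of illustration only, the grammar check performed by a language understanding unit might, in its simplest form, be a membership test against a set of known phrases; the example commands mirror those named above, and the normalisation step is an assumption made for the sketch.

```python
# Illustrative interaction grammar: the commands the system responds to.
INTERACTION_GRAMMAR = {
    "turn on the lights",
    "play music",
    "any new e-mail?",
}

def matches_grammar(recognised_text):
    """Return True if the recognition result corresponds to a part of the
    interaction grammar, i.e. is a candidate for valid interaction input."""
    return recognised_text.strip().lower() in INTERACTION_GRAMMAR
```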
Furthermore, on the basis of its interpretation of the words or phrases recognised, the dialogue engine can establish or define the specific direction for the current interaction. For example, if the dialogue engine has several speech inputs from different directions to evaluate, and it determines that only the speech input from a certain direction matches a part of the interaction grammar, i.e., is 'valid', the dialogue engine can then decide that the direction corresponding to the valid speech input is to be the specific direction from which to accept speech recognition results, and any speech recognition results originating from other directions are to be disregarded.
An interaction between a user and the interactive system can be of any duration. A user can use the interactive system in a short interaction to have a command executed, for example to turn on or off the lights. A longer interaction might be started by the user to have the interactive system respond to a series of commands or queries over a longer period of time. In either case, the start of an interaction can be announced or made known to the interactive system by the user uttering a certain pre-defined phrase, referred to in the following as a 'wake-up' phrase. Once the dialogue engine has recognised this phrase, it can assume that an interaction is likely to follow from the relative direction from which the wake-up phrase was spoken. For example, if a user has uttered a wake-up phrase from a certain relative direction, the dialogue engine can take that direction to be the specific direction for the following interaction. The predefined speech signal or wake-up phrase can be accepted or considered from any direction relative to the microphone arrangement of the speech recognition system, so that the user of the interactive system can initiate an interaction from any position. Because the speech recognition results are accepted or discarded on the basis of the direction from which they originate, an obvious advantage of the speech recognition system according to the invention is that a second speech recogniser is not required for the purpose of identifying a wake-up phrase, so that a speech recognition system according to the invention, with a single speech recogniser, is more economical and easier to realise compared to state of the art speech recognition systems. By expecting input from a single speaker positioned at a certain specific direction with respect to the interactive system, computationally expensive beam-formers are not needed.
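By way of illustration only, the wake-up behaviour described above might be sketched as a small state machine: a wake-up phrase is accepted from any direction and fixes the specific direction, after which only input from (approximately) that direction is accepted. The class name, the example wake-up phrases, and the tolerance are assumptions made for the sketch.

```python
class DialogueEngine:
    """Minimal sketch of wake-up handling: idle until a wake-up phrase is
    heard, then locked onto the direction the phrase came from."""

    WAKE_UP_PHRASES = {"look at me", "over here"}

    def __init__(self, tolerance=5.0):
        self.specific_direction = None  # None: idle, no interaction active
        self.tolerance = tolerance      # accepted deviation, in degrees

    def handle(self, text, direction):
        if text.strip().lower() in self.WAKE_UP_PHRASES:
            # Wake-up phrases are accepted from any direction and
            # (re-)establish the specific interaction direction.
            self.specific_direction = direction
            return "awake"
        if self.specific_direction is None:
            return "ignored"  # idle: non-wake-up speech is discarded
        if abs(direction - self.specific_direction) <= self.tolerance:
            return "accepted"
        return "ignored"      # speech from another direction
```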
A user might like to have some kind of feedback that the interactive system has identified him as the interacting user. Therefore, the interactive system might give feedback to indicate that it is focusing on the user. For example, a moveable head of an interactive system might swivel to 'face' the user when he has spoken the wake-up phrase, and the 'eyes' in the head might briefly light up. The user then knows that the interactive system has identified him as the interacting user. Some other type of feedback could be used if the interactive system is not equipped with a moveable head. For example, one or more LEDs (light-emitting diodes), positioned on that part of the interactive system facing the user, might light up to indicate to the user that the interactive system is 'listening' to him.
In the method according to the invention, the interaction between the user and the interactive system is allowed in the determined relative direction, i.e. speech coming from
any other direction is ignored. However, it may be desirable for the user to be able to move around while interacting with the system. In doing so, the position of the user with respect to the interactive system will change. To have the interactive system re-determine the relative or specific direction of the interaction, the user can simply utter the wake-up phrase whenever he changes position. For example, the wake-up phrase might be "Look at me" or "Over here". Whenever the dialogue engine 'hears' this phrase, it can re-adjust to the direction from where the phrase came, and can show this by causing the head of the interactive system to swivel or rotate so that it is facing in the direction from which the wake-up phrase was spoken, i.e. in the direction of the user. However, it may be inconvenient or irritating for the user to have to say the wake-up phrase whenever he moves. Also, if he forgets to say the wake-up phrase after moving to a different position, he might be annoyed to find out that the interaction has been terminated, or that his commands are being ignored. Therefore, in a preferred method according to the invention, in particular for interactive systems, source characterising information, i.e. some high-level information describing the user of the interactive system, is obtained to characterise the user speaking from the specific direction. For example, speaker characteristic information about the speech signal corresponding to a wake-up phrase can be determined and later used to decide, on the basis of speaker recognition, whether a speech input signal has been spoken by that same user, in which case the speech input is accepted, or by another user, in which case the speech input will be discarded.
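By way of illustration only, deciding on the basis of such speaker characterising information whether a later utterance comes from the same user might be sketched as a simple distance test between feature vectors captured at wake-up time and at utterance time. Real speaker recognition uses far richer features; the class name and threshold here are assumed tuning parameters, not details of the disclosed embodiment.

```python
import math

def feature_distance(vec_a, vec_b):
    """Euclidean distance between two speaker feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

class SpeakerGate:
    """Accept speech only from the speaker whose characteristics were
    captured when the wake-up phrase was spoken."""

    def __init__(self, reference_features, threshold=1.0):
        self.reference = reference_features  # enrolled at wake-up time
        self.threshold = threshold           # assumed tuning parameter

    def accept(self, features):
        # Speech whose features lie close to the enrolled reference is
        # attributed to the interacting user; all other speech is discarded.
        return feature_distance(self.reference, features) <= self.threshold
```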
Another way to characterise the interacting user might be to perform image processing. For example, when the user says the wake-up phrase, a camera of the interactive system can be aimed at the direction from which the user has spoken, and an image can be captured of the user's face. In the following interaction, the interactive system can use the camera to track the user when he changes position. In subsequent speech analysis, the images captured by the camera can be analysed to determine whether the person who has spoken is the user that initiated the interaction. In this way, the relative direction between the user and the microphone arrangement is adjusted or updated on the basis of the source characterising information. Obtaining source characterising information such as speaker descriptive information or image recognition information can be carried out in a suitable unit. An interactive system according to the invention preferably comprises a feature extraction unit for carrying out such calculations. The feature extraction unit can be supplied with suitable data from, for example, a microphone or a digital camera of the interactive system.
The processing steps described above might all be carried out using appropriate software algorithms, many of which are already well-known, such as those for comparing digitised speech signals, for performing speech recognition, and for interpreting speech recognition results in a dialogue on the basis of an interaction grammar. The software can run as a program, or a number of programs, on one or more microprocessors of a speech recognition system, or an interactive system.
Other objects and features of the present invention will become apparent from the following detailed description considered in conjunction with the accompanying drawings. It is to be understood, however, that the drawings are designed solely for the purposes of illustration and not as a definition of the limits of the invention.
Fig. 1 shows a block diagram of an interactive system comprising a speech recognition system according to a first embodiment of the invention;
Fig. 2 shows an interactive system according to an embodiment of the invention, and a number of sources of speech input;
Fig. 3a shows the interactive system according to Fig. 2, and a user at a first position relative to the interactive system;
Fig. 3b shows the interactive system according to Fig. 3a, and the user at a second position relative to the interactive system;
Fig. 4 shows a block diagram of a speech recognition system according to a second embodiment of the invention.
In the diagrams, like numbers refer to like objects throughout. Objects are not necessarily drawn to scale.
Fig. 1 shows a block diagram of an interactive system 3 comprising a speech recognition system 1 according to an embodiment of the invention. To detect acoustic input, a microphone arrangement M avails of a first microphone M1 and a second microphone M2, separated from each other by a certain distance. The distance between the two microphones M1, M2 can be governed by the ultimate realisation of the device in which the speech recognition system is to be incorporated, or by desired accuracy levels. The input from the first microphone M1 is forwarded to a speech recognition unit 12, which performs speech recognition on the input signal. The input from the second microphone M2 is forwarded to an
acoustic source localisation unit 11, which also receives the input from the first microphone M1. The direction D of the source 2 of the detected signal S is estimated by evaluating the temporal delta between the signals detected at the two microphones M1, M2. In the diagram, the source 2, or user 2, is shown to be closer to the second microphone M2, so that sound S emanating from this source 2 will impinge on the second microphone M2 first, and then, after a brief delay or temporal delta, the sound S will impinge on the first microphone M1. The magnitude of the temporal delta and the known distance between the two microphones M1, M2 can be used by the acoustic source localisation unit 11 to estimate the relative direction D from which the sound S has originated. As long as there is no active interaction taking place, the speech recognition output 10 passes unchanged through an annotation unit 17 to a dialogue engine 15, where it further passes, unchanged, through a filtering unit 18 before being subsequently analysed in a language understanding unit 19 of the dialogue engine 15 to see if a wake-up phrase, described in a wake-up grammar supplied by a database 9, has been uttered. Once the wake-up phrase has been detected by the language understanding unit 19, an appropriate signal 44 is forwarded to the annotation unit 17, informing the annotation unit 17 that the direction information 13 is now to be defined as the specific interaction direction D. Thereafter, the speech recognition output 10 is annotated in the annotation unit 17 with the direction information 13 from the acoustic source localisation unit 11, so that any acoustic input signals are associated with the direction from where they originated.
The annotation unit 17 can compare the direction information 13 to the specific interaction direction D and annotate any subsequent speech recognition output 10 with a marker, tag, or other descriptive label describing the degree of closeness of the speech recognition output 10 to the specific interaction direction D. The annotated speech recognition output 14 is forwarded to the filtering unit 18. An interaction is now assumed to be taking place, since the wake-up phrase was detected from the direction D, and any speech recognition results originating from a different direction will be discarded. The decision to discard a speech recognition output 10 is made on the basis of the annotation information. For example, any speech recognition output 10 deemed to have originated from a direction within 5° of the specific direction D is to be accepted. The 'valid' annotated speech recognition output 14 survives the filtering process and is forwarded to the language understanding unit 19, which proceeds to interpret the current interaction taking place between the user 2 and the interactive system 3.
The output 16 of the dialogue engine 15, in response to the interaction taking place, can be a suitable signal 16, for example a signal 16 to a controller, not shown in the diagram, to cause some event to take place in response to a command spoken by the user 2 and interpreted by the dialogue engine 15. In Fig. 2, an interactive system 3 is shown in an environment with two potential users 2, 21 and a television 22. The interactive system 3, shown as a 'robot' in a human-like realisation, is equipped with a pair of microphones M1, M2 positioned as the 'ears' on a 'head' 20 of the interactive system 3. To make the interactive system appear realistic, the head 20 can rotate on a 'neck' 23 attached to a 'body' 24 of the interactive system 3. The two potential users 2, 21 and the television 22 can all issue speech signals
S, S1, S2. The microphones M1, M2 detect any incoming acoustic signals S, S1, S2, which are processed in a speech recognition unit, not shown in the diagram, to determine their validity. As long as the interactive system 3 is in an idle state, the output of the speech recognition is analysed for a wake-up signal. When a first user 2 speaks a wake-up phrase, this is recognised by the interactive system 3, and the direction D from which the speech signal S originated, is specified to be the interaction direction. For the duration of the interaction, the interactive system 3 only responds to speech S coming from this direction D. Other speech signals S1, S2, for example when the person 21 says something, or when speech emanates from the television 22, will be recognised but simply ignored. At some later stage, when the interaction is concluded, the second user 21 might say the wake-up phrase, upon which the interactive system 3 determines the relative direction D1 for this current interaction on the basis of the speech signals S1 originating from that user 21. For this interaction, any speech signals S, S2 coming from the first user 2 or the TV 22, will be recognised but ignored.
Naturally, the user 2 is free to move about. Figs. 3a and 3b show a sequence of positions of the user 2 with respect to the interactive system 3. In Fig. 3a, the user 2 is positioned to the right of the interactive system 3 when he speaks the wake-up phrase. This is detected by the interactive system 3, and the head 20 is rotated so that the user 2 has the impression that the robot 3 is 'attentive' and 'listening'. An initial specific interaction direction D is established by acoustic source localisation using the input of the two microphones M1, M2. A camera incorporated into one of the 'eyes' 30, 31 of the robot 3 captures one or more images of the user 2. These images can be processed in a feature extraction unit to identify characterising features of the user 2. During the subsequent interaction, the user 2 can move around. Images captured by the camera can be processed to track the position of the user 2, for example by using face recognition or pattern recognition algorithms, so that the direction
of interaction is updated to the altered direction D'. Now, only speech input S from this altered direction D' will be considered in the continued interaction.
Fig. 4 shows an interactive system 3 which is similar to that described above in Fig. 1, but with a different approach to annotating and filtering acoustic input detected by the microphone arrangement M. Furthermore, the interactive system 3 of this embodiment avails of an additional input modality 40 in the form of a camera 40. The camera 40 can be incorporated in an 'eye' of the interactive system 3 as described above under Figs. 2 - 3b. Here, once the language understanding unit 19 has identified a wake-up phrase in an input signal S to the speech recognition system 1, an initial interaction direction D is determined. The input modality 40 is directed at the source 2 (the user 2) of the speech signal S, and can continually or intermittently capture images 41 of the user, which are then evaluated or processed in a feature extraction unit 42. For example, analysis of the images 41 can show that the user 2 is moving out of the field of vision of the camera 40, so that the interaction direction will need to be re-determined. Information 43 obtained from the feature extraction unit 42 is forwarded to the dialogue engine 15. Such information 43 can be used to trigger events in a dialogue on the basis of gesture recognition, etc.
In this embodiment, the information 43 from the feature extraction unit 42 is interpreted by the dialogue engine 15 to adapt prior knowledge about the original interaction direction D to reflect the new position (not shown in the diagram) of the user 2. A corresponding signal 44 is supplied to the filtering unit 18 for the decision to accept or discard an annotated speech recognition output 14. For example, the user 2 might have initiated interaction from the direction D. Then, without speaking, he moves to a new position. In the meantime, someone else moves to the original position of the user 2. The camera 40 has tracked the movements of the user 2, so that if the other person speaks from the initial direction D, the filtering unit 18, using the updated information 44 obtained with the help of the feature extraction unit 42, will be able to decide that speech input from the other person is to be discarded, and only the annotated speech input 14 corresponding to the user 2 at his new position is to be recognised as valid.
Although the present invention has been disclosed in the form of preferred embodiments and variations thereon, it will be understood that numerous additional modifications and variations could be made thereto without departing from the scope of the invention. Naturally, any number of microphones can be used for determining the relative direction of the user with respect to the speech recognition system, two being simply the minimum for a fairly simple way of performing acoustic source localisation. Furthermore, the
units or modules used to perform the various functions such as speech recognition, annotation, filtering, language understanding, etc., can be realised in any suitable configuration or constellation in an interactive system. The speech recognition system according to the invention can be incorporated in any suitable interactive system intended for use in noisy environments, or in environments where several users may sometimes speak at the same time. The method according to the invention can be used for situations in which only a single user is speaking, and an alternative method, for example using beam-forming, can be applied for other, noisier, situations. For the sake of clarity, it is to be understood that the use of "a" or "an" throughout this application does not exclude a plurality, and "comprising" does not exclude other steps or elements. A "unit" or "module" can comprise a number of units or modules, unless otherwise stated.
Claims
1. A method of driving a speech recognition system (1), which method comprises detecting a speech signal (S, S1, S2) using a microphone arrangement (M); performing speech recognition on the speech signal (S, S1, S2) detected by a first microphone (M1) of the microphone arrangement (M) to obtain a speech recognition output (10); using the first microphone (M1) of the microphone arrangement (M) and at least a second microphone (M2) of the microphone arrangement (M) to determine a relative direction (D, D', D1, D2) between the source (2, 21, 22) of the speech signal (S, S1, S2) and the microphone arrangement (M); - processing the speech recognition output (10) according to the determined relative direction (D, D', D1, D2).
2. A method according to claim 1, wherein processing the speech recognition output (10) according to the determined relative direction (D, D', D1, D2) comprises filtering the speech recognition output (10) to disregard any speech recognition output (10) corresponding to a speech signal (S1, S2) originating from a direction (D1, D2) other than a specific direction (D, D').
3. A method according to claim 1 or claim 2, wherein the speech recognition output (10) is annotated, for further processing, with the determined relative direction (D, D', D1, D2).
4. A method according to claim 2 or claim 3, wherein the step of filtering the speech recognition output (10) to disregard any speech recognition output (10) corresponding to a speech signal (S1, S2) originating from a direction (D1, D2) other than a specific direction (D, D') is carried out in a dialogue engine (15), which dialogue engine (15) establishes the specific direction (D, D').
5. A method according to any of claims 2 to 4, wherein an initial specific direction (D) between the source (2) of a predefined speech signal (S) and the microphone arrangement (M) is determined upon recognition of the predefined speech signal (S), which predefined speech signal (S) can be accepted from any direction relative to the microphone arrangement (M).
6. A method according to any of claims 2 to 5, wherein source characterising information (43) is obtained to characterise the source (2) of the speech signal (S) originating from the specific direction (D, D').
7. A method according to claim 6, wherein the specific direction (D, D') between the source (2) of the speech signal (S) and the microphone arrangement (M) is re-determined on the basis of the source characterising information (43).
8. A speech recognition system (1) comprising: a microphone arrangement (M) with a first microphone (M1) and at least a second microphone (M2) for detecting a speech signal (S, S1, S2); an acoustic source localisation unit (11) for determining a relative direction (D, D', D1, D2) between the source (2, 21, 22) of the speech signal (S, S1, S2) and the microphone arrangement (M) using the first microphone (M1) and at least a second microphone (M2) of the microphone arrangement (M); a speech recognition unit (12) for performing speech recognition on the speech signal (S, S1, S2) detected by the first microphone (M1) of the microphone arrangement (M); and a processing unit (18) for processing the speech recognition output (10) according to the determined relative direction (D, D', D1, D2).
9. An interactive system (3) comprising: a microphone arrangement (M) with a first microphone (M1) and at least a second microphone (M2) for detecting a speech signal (S, S1, S2); an acoustic source localisation unit (11) for determining a relative direction (D, D', D", D1, D2) between the source (2, 21, 22) of the speech signal (S, S1, S2) and the microphone arrangement (M) using the first microphone (M1) and at least a second microphone (M2) of the microphone arrangement (M); a speech recognition unit (12) for performing speech recognition on the speech signal (S, S1, S2) detected by the first microphone (M1) of the microphone arrangement (M); a processing unit () for processing the speech recognition output (10) according to the determined relative direction (D, D', D", D1, D2); and a dialogue engine (15) for interpreting the speech recognition output (10) of the speech recognition system (1).
10. An interactive system (3) according to claim 9, comprising a feature extraction unit (42) for obtaining source characterising information () pertaining to the source (2) of a speech signal (S).
11. A computer program product directly loadable into the memory of a programmable speech recognition system (1) for use in an interactive system (3), comprising software code portions for performing the steps of a method according to claims 1 to 7, when said product is run on the speech recognition system (1).
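The claims above define a method, not an implementation. Purely as an illustrative sketch, the code below shows one way the claimed steps could be realised: a two-microphone time-difference-of-arrival (TDOA) estimate for the relative direction (claims 1 and 8) and filtering of recognition output by direction (claim 2). All function names, the 0.2 m microphone spacing, and the 15° tolerance are hypothetical choices for the example and are not taken from the patent.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # speed of sound in air, m/s


def estimate_direction(sig1, sig2, mic_distance, sample_rate):
    """Estimate the angle of arrival (degrees) of a sound source from the
    time difference of arrival between two microphone signals, using the
    peak of their cross-correlation (a basic TDOA approach)."""
    corr = np.correlate(sig1, sig2, mode="full")
    lag = np.argmax(corr) - (len(sig2) - 1)  # sample offset of the peak
    tdoa = lag / sample_rate                 # time difference in seconds
    # sin(angle) = tdoa * c / d; clamp to the physically valid range
    sin_angle = np.clip(tdoa * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_angle)))


def filter_by_direction(outputs, specific_direction, tolerance=15.0):
    """Disregard recognition outputs whose annotated direction deviates
    from the specific direction by more than `tolerance` degrees."""
    return [text for text, direction in outputs
            if abs(direction - specific_direction) <= tolerance]


# Simulate a source whose sound reaches the second microphone 5 samples late:
rng = np.random.default_rng(0)
sig1 = rng.standard_normal(1024)
sig2 = np.concatenate([np.zeros(5), sig1])[:1024]

angle = estimate_direction(sig1, sig2, mic_distance=0.2, sample_rate=16000)
kept = filter_by_direction(
    [("turn on the light", angle), ("off-axis chatter", 60.0)],
    specific_direction=angle,
)
```

In terms of the claims, `specific_direction` would be established by the dialogue engine (claim 4), for instance as the direction from which a predefined keyword was first recognised (claim 5); here it is simply set to the estimated angle so that the on-axis utterance is kept and the off-axis one is discarded.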
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP06114774 | 2006-05-31 | ||
EP06114774.0 | 2006-05-31 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2007138503A1 (en) | 2007-12-06 |
Family
ID=38441523
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2007/051742 WO2007138503A1 (en) | 2006-05-31 | 2007-05-09 | Method of driving a speech recognition system |
Country Status (2)
Country | Link |
---|---|
TW (1) | TW200809768A (en) |
WO (1) | WO2007138503A1 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009130591A1 (en) | 2008-04-25 | 2009-10-29 | Nokia Corporation | Method and apparatus for voice activity determination |
EP2211337A1 (en) * | 2009-01-23 | 2010-07-28 | Victor Company Of Japan, Ltd. | Electronic apparatus operable by external sound |
US8275136B2 (en) | 2008-04-25 | 2012-09-25 | Nokia Corporation | Electronic device speech enhancement |
US8611556B2 (en) | 2008-04-25 | 2013-12-17 | Nokia Corporation | Calibrating multiple microphones |
WO2014210392A2 (en) | 2013-06-27 | 2014-12-31 | Rawles Llc | Detecting self-generated wake expressions |
WO2015160561A1 (en) * | 2014-04-17 | 2015-10-22 | Microsoft Technology Licensing, Llc | Conversation detection |
US9922667B2 (en) | 2014-04-17 | 2018-03-20 | Microsoft Technology Licensing, Llc | Conversation, presence and context detection for hologram suppression |
EP3318888A4 (en) * | 2015-06-30 | 2019-03-27 | Yutou Technology (Hangzhou) Co., Ltd. | Robot voice direction-seeking turning system and method |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105845135A (en) * | 2015-01-12 | 2016-08-10 | 芋头科技(杭州)有限公司 | Sound recognition system and method for robot system |
CN106328165A (en) * | 2015-06-30 | 2017-01-11 | 芋头科技(杭州)有限公司 | Robot autologous sound source elimination system |
CN108231073B (en) | 2016-12-16 | 2021-02-05 | 深圳富泰宏精密工业有限公司 | Voice control device, system and control method |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004038697A1 (en) * | 2002-10-23 | 2004-05-06 | Koninklijke Philips Electronics N.V. | Controlling an apparatus based on speech |
2007
- 2007-05-09 WO PCT/IB2007/051742 patent/WO2007138503A1/en active Application Filing
- 2007-05-28 TW TW96119021A patent/TW200809768A/en unknown
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004038697A1 (en) * | 2002-10-23 | 2004-05-06 | Koninklijke Philips Electronics N.V. | Controlling an apparatus based on speech |
Non-Patent Citations (2)
Title |
---|
LLEIDA E ET AL: "Robust continuous speech recognition system based on a microphone array", ACOUSTICS, SPEECH AND SIGNAL PROCESSING, 1998. PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON SEATTLE, WA, USA 12-15 MAY 1998, NEW YORK, NY, USA,IEEE, US, vol. 1, 12 May 1998 (1998-05-12), pages 241 - 244, XP010279154, ISBN: 0-7803-4428-6 * |
YAMADA T ET AL: "Robust speech recognition with speaker localization by a microphone array", SPOKEN LANGUAGE, 1996. ICSLP 96. PROCEEDINGS., FOURTH INTERNATIONAL CONFERENCE ON PHILADELPHIA, PA, USA 3-6 OCT. 1996, NEW YORK, NY, USA,IEEE, US, vol. 3, 3 October 1996 (1996-10-03), pages 1317 - 1320, XP010237923, ISBN: 0-7803-3555-4 * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2266113A4 (en) * | 2008-04-25 | 2015-12-16 | Nokia Technologies Oy | Method and apparatus for voice activity determination |
US8244528B2 (en) | 2008-04-25 | 2012-08-14 | Nokia Corporation | Method and apparatus for voice activity determination |
US8275136B2 (en) | 2008-04-25 | 2012-09-25 | Nokia Corporation | Electronic device speech enhancement |
US8611556B2 (en) | 2008-04-25 | 2013-12-17 | Nokia Corporation | Calibrating multiple microphones |
US8682662B2 (en) | 2008-04-25 | 2014-03-25 | Nokia Corporation | Method and apparatus for voice activity determination |
WO2009130591A1 (en) | 2008-04-25 | 2009-10-29 | Nokia Corporation | Method and apparatus for voice activity determination |
EP2211337A1 (en) * | 2009-01-23 | 2010-07-28 | Victor Company Of Japan, Ltd. | Electronic apparatus operable by external sound |
US8189430B2 (en) | 2009-01-23 | 2012-05-29 | Victor Company Of Japan, Ltd. | Electronic apparatus operable by external sound |
CN105556592A (en) * | 2013-06-27 | 2016-05-04 | 亚马逊技术股份有限公司 | Detecting self-generated wake expressions |
US10720155B2 (en) | 2013-06-27 | 2020-07-21 | Amazon Technologies, Inc. | Detecting self-generated wake expressions |
WO2014210392A2 (en) | 2013-06-27 | 2014-12-31 | Rawles Llc | Detecting self-generated wake expressions |
JP2016524193A (en) * | 2013-06-27 | 2016-08-12 | ロウルズ リミテッド ライアビリティ カンパニー | Detection of self-generated wake expressions |
EP3014607A4 (en) * | 2013-06-27 | 2016-11-30 | Rawles Llc | Detecting self-generated wake expressions |
US9747899B2 (en) | 2013-06-27 | 2017-08-29 | Amazon Technologies, Inc. | Detecting self-generated wake expressions |
US11600271B2 (en) | 2013-06-27 | 2023-03-07 | Amazon Technologies, Inc. | Detecting self-generated wake expressions |
US11568867B2 (en) | 2013-06-27 | 2023-01-31 | Amazon Technologies, Inc. | Detecting self-generated wake expressions |
US10679648B2 (en) | 2014-04-17 | 2020-06-09 | Microsoft Technology Licensing, Llc | Conversation, presence and context detection for hologram suppression |
US10529359B2 (en) | 2014-04-17 | 2020-01-07 | Microsoft Technology Licensing, Llc | Conversation detection |
WO2015160561A1 (en) * | 2014-04-17 | 2015-10-22 | Microsoft Technology Licensing, Llc | Conversation detection |
RU2685970C2 (en) * | 2014-04-17 | 2019-04-23 | МАЙКРОСОФТ ТЕКНОЛОДЖИ ЛАЙСЕНСИНГ, ЭлЭлСи | Conversation detection |
US9922667B2 (en) | 2014-04-17 | 2018-03-20 | Microsoft Technology Licensing, Llc | Conversation, presence and context detection for hologram suppression |
EP3318888A4 (en) * | 2015-06-30 | 2019-03-27 | Yutou Technology (Hangzhou) Co., Ltd. | Robot voice direction-seeking turning system and method |
Also Published As
Publication number | Publication date |
---|---|
TW200809768A (en) | 2008-02-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2007138503A1 (en) | Method of driving a speech recognition system | |
CN109410957B (en) | Front human-computer interaction voice recognition method and system based on computer vision assistance | |
EP3414759B1 (en) | Techniques for spatially selective wake-up word recognition and related systems and methods | |
CN111370014B (en) | System and method for multi-stream target-voice detection and channel fusion | |
US7707035B2 (en) | Autonomous integrated headset and sound processing system for tactical applications | |
KR100754384B1 (en) | Method and apparatus for robust speaker localization and camera control system employing the same | |
JP2006251266A (en) | Audio-visual coordinated recognition method and device | |
KR100822880B1 (en) | User identification system through sound localization based audio-visual under robot environments and method thereof | |
KR20100119250A (en) | Appratus for detecting voice using motion information and method thereof | |
EP3002753A1 (en) | Speech enhancement method and apparatus for same | |
JP4825552B2 (en) | Speech recognition device, frequency spectrum acquisition device, and speech recognition method | |
US11790900B2 (en) | System and method for audio-visual multi-speaker speech separation with location-based selection | |
KR101889465B1 (en) | voice recognition device and lighting device therewith and lighting system therewith | |
WO2018155981A1 (en) | Method and system for providing voice recognition trigger and non-transitory computer-readable recording medium | |
EP1494208A1 (en) | Method for controlling a speech dialog system and speech dialog system | |
Yamamoto et al. | Improvement of robot audition by interfacing sound source separation and automatic speech recognition with missing feature theory | |
CN108665907B (en) | Voice recognition device, voice recognition method, recording medium, and robot | |
JP2008052178A (en) | Voice recognition device and voice recognition method | |
JP3838159B2 (en) | Speech recognition dialogue apparatus and program | |
WO2013091677A1 (en) | Speech recognition method and system | |
KR20190059381A (en) | Method for Device Control and Media Editing Based on Automatic Speech/Gesture Recognition | |
JP2004318026A (en) | Security pet robot and signal processing method related to the device | |
WO2020240789A1 (en) | Speech interaction control device and speech interaction control method | |
JP7172120B2 (en) | Speech recognition device and speech recognition method | |
Gomez et al. | Utilizing visual cues in robot audition for sound source discrimination in speech-based human-robot communication |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 07735822; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 07735822; Country of ref document: EP; Kind code of ref document: A1 |