US20130085757A1 - Apparatus and method for speech recognition

Apparatus and method for speech recognition

Info

Publication number: US20130085757A1
Authority: US (United States)
Application number: US 13/537,740
Inventors: Masanobu Nakamura, Akinori Kawamura
Assignee: Toshiba Corp (Kabushiki Kaisha Toshiba)
Prior art keywords: detection unit, trigger detection, trigger, user, gesture
Legal status: Abandoned
Assignment: assigned to Kabushiki Kaisha Toshiba by Masanobu Nakamura (reel 028470, frame 0868); a corrective assignment later added the omitted second assignor, Akinori Kawamura.

Classifications

    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F 3/16: Sound input; sound output
    • G06F 3/167: Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/088: Word spotting
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use


Abstract

An embodiment of an apparatus for speech recognition includes a plurality of trigger detection units, each of which is configured to detect a start trigger for recognizing a command utterance for controlling a device, a selection unit, utilizing a signal from one or more sensors embedded on the device, configured to select a selected trigger detection unit among the trigger detection units, the selected trigger detection unit being appropriate to a usage environment of the device, and a recognition unit configured to recognize the command utterance when the start trigger is detected by the selected trigger detection unit.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2011-218679 filed on Sep. 30, 2011, the entire contents of which are incorporated herein by reference.
  • FIELD
  • Embodiments described herein relate generally to an apparatus and a method for speech recognition.
  • BACKGROUND
  • Recently, speech recognition apparatuses that recognize a command utterance from a user and control a device have been commercially realized. In order to activate the recognition process of such an apparatus, various start triggers, such as a keyword utterance, a gesture and handclaps, have been proposed. The speech recognition apparatus starts to recognize the command utterance after detecting the start trigger.
  • Each start trigger has both merits and demerits depending on the usage environment of the device. Detection performance deteriorates when the chosen start trigger is not appropriate to that environment. For example, it is hard to detect a start trigger given by gesture (gesture-trigger) in a dark environment because image recognition performance deteriorates there. Moreover, it is hard for the user to select an appropriate start trigger for the usage environment even when the speech recognition apparatus supports multiple start triggers.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • A more complete appreciation of the invention and many of the attendant advantages thereof will be readily obtained as the same become better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:
  • FIG. 1 is a block diagram of an apparatus for speech recognition according to a first embodiment.
  • FIG. 2 is a diagram of the hardware components of the apparatus.
  • FIG. 3 is a flow chart illustrating processing of a handclap-trigger detection unit.
  • FIG. 4 is a diagram illustrating handclaps detected by the handclap-trigger detection unit.
  • FIG. 5 is a flow chart illustrating processing of the apparatus for speech recognition.
  • FIG. 6 is a flow chart illustrating processing of a selection unit according to the first embodiment.
  • FIG. 7 is a flow chart illustrating processing of a selection unit according to a first variation.
  • FIG. 8 is an image on a television screen.
  • FIG. 9 is an image on a television screen.
  • DETAILED DESCRIPTION
  • According to one embodiment, an apparatus for speech recognition comprises a voice-trigger detection unit, a gesture-trigger detection unit, a handclap-trigger detection unit, a selection unit and a recognition unit. The voice-trigger detection unit detects a voice-trigger from a sound obtained by a microphone. The gesture-trigger detection unit detects a gesture-trigger from an image obtained by a camera. The handclap-trigger detection unit detects a handclap-trigger from the sound obtained by the microphone. The selection unit selects and activates a selected trigger detection unit. The selected trigger detection unit is an appropriate trigger detection unit for the usage environment of the television. The trigger detection unit is selected from among the voice-trigger detection unit, the gesture-trigger detection unit and the handclap-trigger detection unit. The selection unit selects the selected trigger detection unit based on signals from a sound sensor which measures a sound volume of the usage environment, a distance sensor which measures a distance from the television to the user and a light sensor which measures an amount of light in the usage environment. The recognition unit starts to recognize the command utterance by the user when the start trigger is detected by the selected trigger detection unit.
  • Various embodiments will be described hereinafter with reference to the accompanying drawings, wherein the same reference numeral designations represent the same or corresponding parts throughout the several views.
  • The First Embodiment
  • In the first embodiment, an apparatus for speech recognition recognizes a command utterance from a user and controls a device. The apparatus is embedded in a television. The user can control the television by command utterance, for example switching channels or searching TV program listings.
  • The apparatus according to this embodiment does not need an operation such as a button push when the user gives a start trigger of speech recognition to the television. The apparatus selects a start trigger which is appropriate to the usage environment of the television from among a gesture-trigger, a voice-trigger and a handclap-trigger. Here, the gesture-trigger is a start trigger given by a predefined gesture by the user, the voice-trigger is a start trigger given by a predefined keyword utterance by the user, and the handclap-trigger is a start trigger given by one or more handclaps by the user.
  • FIG. 1 is a block diagram of an apparatus 100 for speech recognition. The apparatus 100 of FIG. 1 comprises a voice-trigger detection unit 101, a gesture-trigger detection unit 102, a handclap-trigger detection unit 103, a selection unit 104 and a recognition unit 105.
  • The voice-trigger detection unit 101 detects a voice-trigger from a sound obtained by a microphone 208. The gesture-trigger detection unit 102 detects a gesture-trigger from an image obtained by a camera 209. The handclap-trigger detection unit 103 detects a handclap-trigger from the sound obtained by the microphone 208. The selection unit 104 selects and activates a selected trigger detection unit. The selected trigger detection unit is an appropriate trigger detection unit for the usage environment of the television. The appropriate unit is selected from among the voice-trigger detection unit 101, the gesture-trigger detection unit 102 and the handclap-trigger detection unit 103. The selection unit 104 selects the selected trigger detection unit based on signals from a sound sensor 210 which measures a sound volume of the usage environment, a distance sensor 211 which measures a distance from the television to the user, and a light sensor 212 which measures an amount of light in the usage environment. The recognition unit 105 starts to recognize the command utterance by the user when the start trigger is detected by the selected trigger detection unit.
  • In this way, the apparatus according to this embodiment selects an appropriate trigger detection unit for the usage environment of the television by utilizing a signal from one or more sensors embedded on the television. Accordingly, the apparatus can detect a start trigger with high accuracy, and results in improving recognition performance of the command utterance by the user.
  • (Hardware Component)
  • The apparatus 100 is composed of hardware using a regular computer shown in FIG. 2. This hardware comprises: a control unit 201 such as a CPU (Central Processing Unit) to control the entire apparatus; a storage unit 202 such as a ROM (Read Only Memory) or a RAM (Random Access Memory) to store various kinds of data and programs; an external storage unit 203 such as an HDD (Hard Disk Drive) or a CD (Compact Disc) drive to store various kinds of data and programs; an operation unit 204 such as a keyboard, a mouse or a touch screen to accept a user's indication; a communication unit 205 to control communication with an external apparatus; the microphone 208 to input a sound; the camera 209 to take an image; the sound sensor 210 to measure a sound volume; the distance sensor 211 to measure a distance from the television to the user; the light sensor 212 to measure an amount of light; and a bus 206 to connect these hardware elements.
  • In such hardware, the control unit 201 executes various programs stored in the storage unit 202 (such as the ROM) or the external storage unit 203. As a result, the following functions are realized.
  • (The Selection Unit)
  • The selection unit 104 selects and activates a selected trigger detection unit. The selected trigger detection unit is an appropriate trigger detection unit for the usage environment of the television. The appropriate unit is selected from among the voice-trigger detection unit 101, the gesture-trigger detection unit 102 and the handclap-trigger detection unit 103. The selection unit 104 selects the selected trigger detection unit based on signals from the sound sensor 210, the distance sensor 211 and the light sensor 212. The selection unit 104 can select more than one trigger detection unit as the selected trigger detection units.
  • Here, the sound sensor 210 measures a sound volume of the usage environment of the television. It can measure the sound volume of both the sound obtained by the microphone 208 and the sound outputted through a loudspeaker of the television. The sound sensor 210 can obtain the sound as a digital signal, and the selection unit 104 can calculate sound volume (such as power) of the digital signal instead of the sound sensor 210. In this case, the sound sensor 210 can be replaced by the microphone 208.
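  • As a minimal sketch of this power computation, assuming PCM samples normalized to [-1, 1], the selection unit could measure the volume of each frame as follows; the dB conversion and the epsilon are illustrative choices, not details from the patent.

```python
import numpy as np

def frame_power_db(samples: np.ndarray) -> float:
    """Mean power of one audio frame, in dB relative to full scale.

    `samples` is assumed to be a 1-D array of PCM samples scaled to [-1, 1].
    The selection unit 104 could compare this value against a noise threshold.
    """
    mean_square = float(np.mean(samples ** 2))   # mean-square power of the frame
    return 10.0 * np.log10(mean_square + 1e-12)  # epsilon avoids log(0) on silence
```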
  • The distance sensor 211 measures a distance from the television to the user. It can be replaced by a human detection sensor such as an infrared light sensor, which is able to detect whether the user exists within a predefined distance.
  • The light sensor 212 measures an amount of light in the usage environment of the television.
  • (The Voice-Trigger Detection Unit)
  • The voice-trigger detection unit 101 detects a voice-trigger from the sound obtained by the microphone 208.
  • A speech recognition apparatus with a voice-trigger detects a predefined keyword utterance by a user as a start trigger, and starts to recognize the command utterance following the keyword utterance. For example, in the case that the predefined keyword is “hello”, the speech recognition apparatus detects the user utterance of “hello”, and outputs a bleep to notify the user that it is in a state to be able to recognize the command utterance. The speech recognition apparatus then recognizes a command utterance such as “channel eight” following the bleep.
  • The voice-trigger detection unit 101 continuously recognizes the sound obtained by the microphone 208 by utilizing a recognition vocabulary including the predefined keyword utterance. It judges that the voice-trigger is detected when a recognition score obtained by the recognition process exceeds a threshold L. The threshold L is set to a value which separates the distribution of recognition scores of the predefined keyword utterance from the distribution of recognition scores of other utterances.
  • The voice-trigger detection unit 101 can decrease recognition errors caused by environmental noises by narrowing down the recognition vocabulary only to the predefined keyword utterance.
  • However, detection performance of the voice-trigger detection unit 101 deteriorates in environments where environmental noise or the sound of the television is so loud that the SNR (signal-to-noise ratio) of the keyword utterance becomes low.
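  • As a rough illustration of this score-thresholding rule, the fragment below gates a keyword spotter's output on a fixed threshold. The recognizer interface (`score_keyword`) and the threshold value are hypothetical stand-ins for whatever engine and experimentally tuned value an implementation would use.

```python
KEYWORD = "hello"
SCORE_THRESHOLD_L = 0.8  # assumed value; tuned to separate keyword from non-keyword scores

def voice_trigger_detected(audio, recognizer) -> bool:
    """Return True when the predefined keyword is recognized confidently enough.

    `recognizer.score_keyword` is a hypothetical API that scores the audio
    against the predefined keyword utterance.
    """
    score = recognizer.score_keyword(audio, KEYWORD)
    return score > SCORE_THRESHOLD_L
```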
  • (The Gesture-Trigger Detection Unit)
  • The gesture-trigger detection unit 102 detects a gesture-trigger from the image obtained by the camera 209.
  • A speech recognition apparatus with a gesture-trigger detects a predefined gesture by a user as a start trigger, and starts to recognize the command utterance following the gesture. For example, in the case that the predefined gesture is the action of waving a hand from side to side, the speech recognition apparatus detects the user's waving action by utilizing an image recognition technique, and outputs a bleep to notify the user that it is in a state to be able to recognize the command utterance. The speech recognition apparatus then recognizes a command utterance such as “channel eight” following the bleep.
  • The gesture-trigger detection unit 102 detects the gesture-trigger by utilizing an image recognition technique. Therefore, the user must gesture within the region that the camera 209 can capture. Although the detection performance of the gesture-trigger detection unit 102 is not affected by environmental noises at all, it is affected by the lighting condition of the usage environment. Moreover, because of the image processing, it requires much more electric power than the other trigger detection units.
  • (The Handclap-Trigger Detection Unit)
  • The handclap-trigger detection unit 103 detects a handclap-trigger from the sound obtained by the microphone 208. Here, the handclaps detected by the handclap-trigger detection unit 103 are defined as two claps in a row, such as “clap, clap”.
  • A speech recognition apparatus with the handclap-trigger detects the handclaps as a start trigger, and outputs a bleep to notify the user that it is in a state to be able to recognize the command utterance. The speech recognition apparatus then recognizes the command utterance following the bleep.
  • FIG. 3 is a flow chart of processing of the handclap-trigger detection unit 103. The handclap-trigger detection unit 103 detects a sound waveform whose power exceeds a predefined threshold S two times in a row during a predefined interval T0, as shown in FIG. 4.
  • Here, the threshold T0 is set to a value which covers the distribution of intervals between handclaps. The threshold S is set to a value which separates the distributions of power with and without handclaps.
  • At S1 in FIG. 3, the microphone 208 starts to obtain a sound and a time parameter t is set to zero. The sound obtained by the microphone 208 is divided into frames, each of which has a 25 msec length and an 8 msec interval; t represents the frame number. At S2, t is incremented by one. At S3, the power of the sound at frame t is calculated and compared to the threshold S. If the power exceeds the threshold S, the process goes to S4; otherwise, it returns to S2. At S4, a parameter T is set to zero. At S5, T is incremented by one, and t is incremented by one. At S6, T is compared to the threshold T0. If T is less than T0, the process goes to S7; otherwise, it returns to S2. At S7, the power of the sound at frame t is calculated and compared to the threshold S. If the power exceeds the threshold S, the process goes to S8, and the handclap-trigger detection unit 103 judges that it has detected a start trigger by the handclaps. Otherwise, it returns to S5 and continues to process the flow.
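  • A Python sketch of this flow follows, with an illustrative sampling rate and threshold values (the embodiment tunes S and T0 from measured distributions). Note that a practical detector would also require the power to fall back below S between the two peaks, so that a single sustained clap spanning several overlapping frames is not counted as two.

```python
import numpy as np

SAMPLE_RATE = 16000                    # assumed sampling rate
FRAME_LEN = int(0.025 * SAMPLE_RATE)   # 25 msec frame length
HOP_LEN = int(0.008 * SAMPLE_RATE)     # 8 msec frame interval
POWER_THRESHOLD_S = 0.01               # separates frames with and without a clap (tuned)
INTERVAL_T0 = 40                       # max frames between the two claps (tuned)

def detect_handclap_trigger(signal: np.ndarray) -> bool:
    """Follow the S1-S8 flow: a loud frame (S3), then a second loud frame
    within INTERVAL_T0 hops (S5-S7), fires the handclap-trigger (S8)."""
    def power(t: int) -> float:
        frame = signal[t * HOP_LEN : t * HOP_LEN + FRAME_LEN]
        return float(np.mean(frame ** 2))

    n_frames = (len(signal) - FRAME_LEN) // HOP_LEN + 1
    t = 0                                            # S1: frame counter
    while t + 1 < n_frames:
        t += 1                                       # S2: advance one frame
        if power(t) > POWER_THRESHOLD_S:             # S3: first clap found
            T = 0                                    # S4: open the search window
            while T < INTERVAL_T0 and t + 1 < n_frames:
                T += 1                               # S5: count window frames
                t += 1
                if power(t) > POWER_THRESHOLD_S:     # S7: second clap found
                    return True                      # S8: trigger detected
    return False
```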
  • The handclap-trigger detection unit 103 is robust against environmental noises because handclaps have sound features distinct from typical environmental noises.
  • (The Recognition Unit)
  • The recognition unit 105 starts to recognize the command utterance by the user when the start trigger is detected by the selected trigger detection unit. Specifically, the sound obtained by the microphone 208 is input to the recognition unit 105, which recognizes the command utterance included in the sound after the selected trigger detection unit detects the start trigger.
  • In addition, the recognition unit 105 can continually input and recognize the sound regardless of whether the start trigger has been detected, and output only the recognition results obtained after the detection of the start trigger.
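  • A minimal sketch of that gating is given below; the (timestamp, text) result format and the single trigger timestamp are assumptions made for illustration only.

```python
def gated_results(results, trigger_time):
    """Keep only recognition results produced after the start trigger.

    `results` is an iterable of (timestamp, text) pairs emitted by a
    continuously running recognizer; `trigger_time` is when the selected
    trigger detection unit fired. Both interfaces are assumed.
    """
    return [(t, text) for (t, text) in results if t >= trigger_time]

# Example: only "channel eight", uttered after the trigger at t = 12.0, survives.
hits = gated_results([(5.0, "hello"), (12.5, "channel eight")], trigger_time=12.0)
```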
  • (Flow Chart)
  • FIG. 5 is a flow chart of processing of the apparatus 100 for speech recognition according to this embodiment.
  • At S11, the selection unit 104 selects and activates a selected trigger detection unit. The selected trigger detection unit is selected from among the voice-trigger detection unit 101, the gesture-trigger detection unit 102 and the handclap-trigger detection unit 103. The selection unit 104 selects the selected trigger detection unit based on signals from the sound sensor 210, the distance sensor 211 and the light sensor 212.
  • FIG. 6 is a flow chart of processing of S11 in FIG. 5. At S21, the selection unit 104 deactivates all of the trigger detection units (the voice-trigger detection unit 101, the gesture-trigger detection unit 102 and the handclap-trigger detection unit 103).
  • At S22, the selection unit 104 judges whether the distance from the television to the user measured by the distance sensor 211 exceeds a predefined threshold D. If the distance exceeds the threshold D, there is a possibility that image recognition performance of the gesture-trigger detection unit 102 deteriorates because the user is too far from the camera. In this case, the selection unit 104 determines that the gesture-trigger detection unit 102 is not appropriate, and the process moves to S25. Otherwise, the process moves to S23.
  • The threshold D is experimentally determined based on the relationship between image recognition performance and the distance measured by the distance sensor 211.
  • At S23, the selection unit 104 judges whether the amount of light in the usage environment measured by the light sensor 212 exceeds a predefined threshold L. If the amount of light does not exceed the threshold L, there is a possibility that image recognition performance by the gesture-trigger detection unit 102 is deteriorated because the usage environment is too dark. In this case, the selection unit 104 determines that the gesture-trigger detection unit 102 is not appropriate to the usage environment, and the process moves to S25.
  • Otherwise, the process moves to S24, and the selection unit 104 activates the gesture-trigger detection unit 102, because both the distance and the lighting conditions are appropriate for recognizing the predefined gesture.
  • The threshold L is experimentally determined based on the relationship between image recognition performance and the amount of light measured by the light sensor 212.
  • At S25, the selection unit 104 judges whether the sound volume in the usage environment measured by the sound sensor 210 exceeds a predefined threshold N. If the sound volume exceeds the threshold N, there is a possibility that detection performance of the keyword utterance by the voice-trigger detection unit 101 is deteriorated because the usage environment is noisy. In this case, the selection unit 104 determines that the voice-trigger detection unit 101 is not appropriate to the usage environment, and the process moves to S27.
  • Otherwise, the process moves to S26, and the selection unit 104 activates the voice-trigger detection unit 101, because the usage environment is quiet enough for recognizing the keyword utterance.
  • The threshold N is experimentally determined based on the relationship between detection performance of the keyword utterance and the sound volume measured by the sound sensor 210.
  • At S27, the selection unit 104 activates the handclap-trigger detection unit 103. In this embodiment, the selection unit 104 always activates the handclap-trigger detection unit 103, because it can detect the handclap-trigger with high accuracy even when environmental noises are loud or the user is distant from the television.
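  • Taken together, S21 through S27 reduce to a few threshold comparisons. The sketch below mirrors that flow; the threshold values are invented placeholders, since the embodiment determines D, L and N experimentally.

```python
# Placeholder thresholds; the embodiment determines D, L and N experimentally
# from recognition-performance measurements.
DIST_THRESHOLD_D = 3.0     # meters
LIGHT_THRESHOLD_L = 50.0   # lux
NOISE_THRESHOLD_N = 60.0   # dB

def select_triggers(distance: float, light: float, noise: float) -> set:
    """Return the set of trigger detection units to activate (S21-S27)."""
    active = set()                       # S21: all units start deactivated
    if distance <= DIST_THRESHOLD_D and light > LIGHT_THRESHOLD_L:
        active.add("gesture")            # S22-S24: near enough and bright enough
    if noise <= NOISE_THRESHOLD_N:
        active.add("voice")              # S25-S26: quiet enough for keyword spotting
    active.add("handclap")               # S27: robust, so always activated
    return active
```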
  • The flow chart in FIG. 5 will now be explained. At S12, the apparatus 100 for speech recognition starts the operation of the selected trigger detection unit activated at S11.
  • At S13, the apparatus 100 judges whether the start trigger is detected by the selected trigger detection unit. If the start trigger is detected, the process moves to S14. Otherwise, the process waits until the selected trigger detection unit detects the start trigger.
  • At S14, the recognition unit 105 starts to recognize the command utterance by the user.
  • (Effect)
  • In this way, the apparatus according to this embodiment selects a trigger detection unit appropriate to the usage environment of the television by utilizing signals from one or more sensors embedded in the television. Accordingly, the apparatus can detect a start trigger with high accuracy, which results in improved recognition performance of the command utterance by the user.
  • (Variation 1)
  • The selection unit 104 can select one or more selected trigger detection units by utilizing only one of the sound sensor 210, the distance sensor 211 and the light sensor 212. For example, the selection unit 104 can determine whether to activate or deactivate the voice-trigger detection unit 101 by utilizing only the sound sensor 210 as shown in S25 of FIG. 6.
  • In addition, the selection unit 104 can determine whether to activate or deactivate the voice-trigger detection unit 101 by utilizing the distance sensor 211. In this case, the selection unit 104 activates the voice-trigger detection unit 101 when the distance measured by the distance sensor 211 becomes equal to or less than the threshold D. This is because the user's utterance reaches the microphone at a sufficient volume when the distance is small, so the detection performance of the voice-trigger detection unit 101 becomes high enough.
  • In addition, the selection unit 104 can determine whether to activate or deactivate each trigger detection unit based on a control signal other than the signals from the sound sensor 210, the distance sensor 211 and the light sensor 212. For example, an electric power mode of the apparatus 100 can act as the control signal. If the user selects the power-saving mode, the selection unit 104 can deactivate the gesture-trigger detection unit 102, which requires much more electric power than the other trigger detection units.
  • FIG. 7 is a flow chart of processing of the selection unit 104 when it utilizes the electric power mode. At S31, the selection unit 104 determines the electric power mode specified by the user. If the electric power mode is the normal mode, the process moves to S22, and the selection unit 104 determines whether to activate or deactivate each trigger detection unit, including the gesture-trigger detection unit 102. If the electric power mode is the power-saving mode, the process moves to S25, and the selection unit 104 deactivates the gesture-trigger detection unit 102, which requires much more electric power because of its image processing.
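  • Under this variation the power mode simply short-circuits the gesture check; extending the earlier `select_triggers` sketch (the mode flag and unit names remain illustrative):

```python
def select_triggers_with_power_mode(distance, light, noise, power_saving):
    """FIG. 7 variation: in power-saving mode the flow jumps straight to S25,
    so the power-hungry gesture-trigger detection unit is never activated."""
    active = set()
    if (not power_saving
            and distance <= DIST_THRESHOLD_D
            and light > LIGHT_THRESHOLD_L):
        active.add("gesture")            # S22-S24, reached only in normal mode
    if noise <= NOISE_THRESHOLD_N:
        active.add("voice")              # S25-S26
    active.add("handclap")               # S27
    return active
```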
  • (Variation 2)
  • The apparatus 100 for speech recognition can display the selected trigger detection unit to the user via the television screen.
  • FIGS. 8 and 9 illustrate images on the television screen 400. For example, mark 401 in FIG. 8 indicates that the voice-trigger detection unit 101 is activated by the selection unit 104. Marks 402 and 403 indicate that the handclap-trigger detection unit 103 and the gesture-trigger detection unit 102 are activated, respectively. In FIG. 8, all of the trigger detection units are activated; therefore, the user can give a start trigger to the television by keyword utterance, gesture or handclaps.
  • In FIG. 9, only the marks 401 and 402 are displayed on the television screen 400. Therefore, the user is not able to give a start trigger to the television by gesture.
  • In this way, the apparatus 100 according to this variation displays the selected trigger detection unit to the user. Accordingly, it helps the user select the appropriate action for giving a start trigger to the television.
  • Alternatively, the apparatus 100 can mount three LEDs and notify the user of the selected trigger detection units by turning on the LED corresponding to each activated trigger detection unit.
  • (Variation 3)
  • The command utterance may include a phrase such as “search sports programs”. The recognition unit 105 can be implemented by utilizing an external server connected via the communication unit 205.
  • The trigger detection units are not limited to the voice-trigger detection unit 101, the gesture-trigger detection unit 102 and the handclap-trigger detection unit 103. The apparatus for speech recognition can utilize another trigger detection unit which detects another kind of start trigger.
  • Alternatively, the apparatus for speech recognition can always activate all of the trigger detection units and start to recognize the command utterance only when the trigger detection unit selected by the selection unit 104 detects the start trigger.
  • In the disclosed embodiments, the processing can be performed by a computer program stored in a computer-readable medium.
  • In the embodiments, the computer-readable medium may be, for example, a magnetic disk, a flexible disk, a hard disk, an optical disk (e.g., CD-ROM, CD-R, DVD) or a magneto-optical disk (e.g., MD). However, any computer-readable medium which is configured to store a computer program for causing a computer to perform the processing described above may be used.
  • Furthermore, based on instructions in the program installed from the memory device onto the computer, the OS (operating system) operating on the computer, or MW (middleware) such as database management software or network software, may execute a part of each process to realize the embodiments.
  • Furthermore, the memory device is not limited to a device independent of the computer; it also includes a memory device storing a program downloaded through a LAN or the Internet. Nor is the memory device limited to a single device: the processing of the embodiments may be executed using a plurality of memory devices.
  • A computer may execute each processing stage of the embodiments according to the program stored in the memory device. The computer may be a single apparatus, such as a personal computer, or a system in which a plurality of processing apparatuses are connected through a network. Furthermore, the computer is not limited to a personal computer; those skilled in the art will appreciate that a computer includes a processing unit in an information processor, a microcomputer, and so on. In short, equipment and apparatuses that can execute the functions of the embodiments using the program are generally called computers here.
  • While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.

Claims (20)

What is claimed is:
1. An apparatus for speech recognition, comprising:
a plurality of trigger detection units, each of which is configured to detect a start trigger for recognizing a command utterance for controlling a device;
a selection unit, utilizing a signal from one or more sensors embedded on the device, configured to select a selected trigger detection unit among the trigger detection units, the selected trigger detection unit being appropriate to a usage environment of the device; and
a recognition unit configured to recognize the command utterance when the start trigger is detected by the selected trigger detection unit.
2. The apparatus according to claim 1, wherein
at least one of the sensors is a sound sensor that measures sound volume in the usage environment, at least one of the trigger detection units is a voice-trigger detection unit that detects a start trigger corresponding to a predefined keyword utterance by the user, and
the selection unit selects the voice-trigger detection unit as the selected trigger detection unit when the sound volume measured by the sound sensor is less than or equal to a predefined threshold.
3. The apparatus according to claim 1, wherein
at least one of the sensors is a light sensor that measures an amount of light in the usage environment, at least one of the trigger detection units is a gesture-trigger detection unit that detects a start trigger corresponding to a predefined gesture by the user, and
the selection unit selects the gesture-trigger detection unit as the selected trigger detection unit when the amount of light measured by the light sensor is more than a predefined threshold.
4. The apparatus according to claim 1, wherein
at least one of the sensors is a distance sensor that measures a distance from the device to a user, at least one of the trigger detection units is a gesture-trigger detection unit that detects a start trigger corresponding to a predefined gesture by the user, and
the selection unit selects the gesture-trigger detection unit as the selected trigger detection unit when the distance measured by the distance sensor is less than or equal to a predefined threshold.
5. The apparatus according to claim 1, wherein
at least one of the sensors is a distance sensor that measures a distance from the device to a user, at least one of the trigger detection units is a voice-trigger detection unit that detects a start trigger corresponding to a predefined keyword utterance by the user, and
the selection unit selects the voice-trigger detection unit as the selected trigger detection unit when the distance measured by the distance sensor is less than or equal to a predefined threshold.
6. The apparatus according to claim 1, wherein
the selection unit selects the selected trigger detection unit based on a control signal other than the signal from the one or more sensors.
7. The apparatus according to claim 1, wherein the device is connected to a television and is configured to display, on a screen of the television, information corresponding to at least one selected trigger detection unit.
8. A method for speech recognition, comprising:
selecting a selected trigger detection unit from among a plurality of trigger detection units, each of which is configured to detect a start trigger for recognizing a command utterance for controlling a device, the selected trigger detection unit being appropriate to a usage environment of the device; and
recognizing the command utterance when the start trigger is detected by the selected trigger detection unit.
9. The method according to claim 8, comprising:
measuring, by a sound sensor, a sound volume in the usage environment;
detecting a start trigger corresponding to a predefined keyword utterance by a user; and
selecting a voice-trigger detection unit as the selected trigger detection unit when the sound volume measured by the sound sensor is less than or equal to a predefined threshold.
10. The method according to claim 8, comprising:
measuring, by a light sensor, an amount of light in the usage environment;
detecting a start trigger corresponding to a predefined gesture by a user; and
selecting a gesture-trigger detection unit as the selected trigger detection unit when the amount of light measured by the light sensor is more than a predefined threshold.
11. The method according to claim 8, comprising:
measuring, by a distance sensor, a distance from the device to a user;
detecting a start trigger corresponding to a predefined gesture by the user; and
selecting a gesture-trigger detection unit as the selected trigger detection unit when the distance measured by the distance sensor is less than or equal to a predefined threshold.
12. The method according to claim 8, comprising:
measuring, by a distance sensor, a distance from the device to a user;
detecting a start trigger corresponding to a predefined keyword utterance by the user; and
selecting a voice-trigger detection unit as the selected trigger detection unit when the distance measured by the distance sensor is less than or equal to a predefined threshold.
13. The method according to claim 8, wherein the device includes one or more sensors for detecting a signal corresponding to a condition of the usage environment, the method comprising:
selecting the selected trigger detection unit based on a control signal other than the signal from the one or more sensors.
14. The method according to claim 8, wherein the device is connected to a television, the method comprising:
displaying, on a screen of the television, information corresponding to at least one selected trigger detection unit.
15. A non-transitory computer-readable medium having a program stored therein that, when executed by a computer, causes the computer to perform a method comprising:
selecting a selected trigger detection unit from among a plurality of trigger detection units, each of which is configured to detect a start trigger for recognizing a command utterance for controlling a device, the selected trigger detection unit being appropriate to a usage environment of the device; and
recognizing the command utterance when the start trigger is detected by the selected trigger detection unit.
16. The medium according to claim 15, wherein executing the program causes the computer to perform a method comprising:
receiving information indicating a sound volume in the usage environment;
detecting a start trigger corresponding to a predefined keyword utterance by a user; and
selecting a voice-trigger detection unit as the selected trigger detection unit when the received sound volume is less than or equal to a predefined threshold.
17. The medium according to claim 15, wherein executing the program causes the computer to perform a method comprising:
receiving information indicating an amount of light in the usage environment;
detecting a start trigger corresponding to a predefined gesture by a user; and
selecting a gesture-trigger detection unit as the selected trigger detection unit when the received amount of light is more than a predefined threshold.
18. The medium according to claim 15, wherein executing the program causes the computer to perform a method comprising:
receiving information indicating a distance from the device to a user;
detecting a start trigger corresponding to a predefined gesture by the user; and
selecting a gesture-trigger detection unit as the selected trigger detection unit when the received distance is less than or equal to a predefined threshold.
19. The medium according to claim 15, wherein executing the program causes the computer to perform a method comprising:
receiving information indicating a distance from the device to a user;
detecting a start trigger corresponding to a predefined keyword utterance by the user; and
selecting a voice-trigger detection unit as the selected trigger detection unit when the received distance is less than or equal to a predefined threshold.
20. The medium according to claim 15, wherein executing the program causes the computer to perform a method comprising: receiving information from one or more sensors for detecting a signal corresponding to a condition of the usage environment; and
selecting the selected trigger detection unit based on a control signal other than the signal from the one or more sensors.
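
To make the claimed behavior easier to follow, the sketch below shows one way the selection unit of claims 1-6 and the recognition gate of claim 1 could fit together. It is a minimal illustration only: the language (Python), the function and field names, the concrete threshold values, and the fallback rule for dark, noisy rooms are all assumptions made for this sketch, not details disclosed in the specification or recited in the claims.

# Hypothetical sketch of the claimed trigger-selection logic; every name,
# threshold, and sensor interface below is an illustrative assumption,
# not part of the patent disclosure.

from dataclasses import dataclass
from typing import Optional


@dataclass
class SensorReadings:
    """Signals from sensors embedded in the device (claim 1)."""
    sound_volume_db: float   # sound sensor (claim 2)
    light_level_lux: float   # light sensor (claim 3)
    user_distance_m: float   # distance sensor (claims 4 and 5)


# Illustrative values; the claims only require "predefined" thresholds.
SOUND_THRESHOLD_DB = 50.0
LIGHT_THRESHOLD_LUX = 100.0
DISTANCE_THRESHOLD_M = 2.0


def select_trigger_detector(readings: SensorReadings,
                            control_signal: Optional[str] = None) -> str:
    """Select the trigger detection unit appropriate to the usage
    environment, mirroring claims 2-6."""
    # Claim 6: a control signal other than the sensor signals
    # (e.g. a remote-control override) takes precedence.
    if control_signal is not None:
        return control_signal

    # Claim 2: in a quiet environment, a spoken keyword is reliable.
    if readings.sound_volume_db <= SOUND_THRESHOLD_DB:
        return "voice-trigger"

    # Claims 3 and 4: with enough light and a nearby user, a camera
    # can detect a predefined gesture instead.
    if (readings.light_level_lux > LIGHT_THRESHOLD_LUX
            and readings.user_distance_m <= DISTANCE_THRESHOLD_M):
        return "gesture-trigger"

    # Fallback for dark, noisy environments; the claims leave this open.
    return "voice-trigger"


def recognize_if_triggered(detector: str, trigger_fired: bool,
                           utterance: str) -> Optional[str]:
    """Claim 1's recognition unit: recognize the command utterance only
    after the selected detector reports its start trigger."""
    if not trigger_fired:
        return None
    # A real implementation would invoke a speech recognizer here.
    return f"command recognized via {detector}: {utterance!r}"


if __name__ == "__main__":
    readings = SensorReadings(sound_volume_db=42.0,
                              light_level_lux=300.0,
                              user_distance_m=1.5)
    detector = select_trigger_detector(readings)  # "voice-trigger"
    print(recognize_if_triggered(detector, True, "turn up the volume"))

Note that claims 2 through 5 each state an independent sufficient condition for selecting a detector; the ordering of the checks above (control signal first, then sound, then light and distance) is one arbitrary way to resolve cases in which several conditions hold at once.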
US13/537,740 2011-09-30 2012-06-29 Apparatus and method for speech recognition Abandoned US20130085757A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2011-218679 2011-09-30
JP2011218679A JP2013080015A (en) 2011-09-30 2011-09-30 Speech recognition device and speech recognition method

Publications (1)

Publication Number Publication Date
US20130085757A1 true US20130085757A1 (en) 2013-04-04

Family

ID=47993413

Family Applications (1)

Application Number Title Priority Date Filing Date
US13/537,740 Abandoned US20130085757A1 (en) 2011-09-30 2012-06-29 Apparatus and method for speech recognition

Country Status (2)

Country Link
US (1) US20130085757A1 (en)
JP (1) JP2013080015A (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6329833B2 (en) * 2013-10-04 2018-05-23 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America Wearable terminal and method for controlling wearable terminal
JP6359935B2 (en) * 2014-09-30 2018-07-18 株式会社Nttドコモ Dialogue device and dialogue method
JP6227209B2 (en) 2015-09-09 2017-11-08 三菱電機株式会社 In-vehicle voice recognition device and in-vehicle device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3764302B2 (en) * 1999-08-04 2006-04-05 株式会社東芝 Voice recognition device
JP3581881B2 (en) * 2000-07-13 2004-10-27 独立行政法人産業技術総合研究所 Voice complement method, apparatus and recording medium
JP2003345390A (en) * 2002-05-23 2003-12-03 Matsushita Electric Ind Co Ltd Voice processor and remote controller
JP2004354722A (en) * 2003-05-29 2004-12-16 Nissan Motor Co Ltd Speech recognition device
JP2006133939A (en) * 2004-11-04 2006-05-25 Matsushita Electric Ind Co Ltd Content data retrieval device
JP2006337659A (en) * 2005-06-01 2006-12-14 Nissan Motor Co Ltd Speech input device and speech recognition device
JP2007121579A (en) * 2005-10-26 2007-05-17 Matsushita Electric Works Ltd Operation device
JP5473520B2 (en) * 2009-10-06 2014-04-16 キヤノン株式会社 Input device and control method thereof
JP5771002B2 (en) * 2010-12-22 2015-08-26 株式会社東芝 Speech recognition apparatus, speech recognition method, and television receiver equipped with speech recognition apparatus

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4972490A (en) * 1981-04-03 1990-11-20 At&T Bell Laboratories Distance measurement control of a multiple detector system
US6157403A (en) * 1996-08-05 2000-12-05 Kabushiki Kaisha Toshiba Apparatus for detecting position of object capable of simultaneously detecting plural objects and detection method therefor
US20080120113A1 (en) * 2000-11-03 2008-05-22 Zoesis, Inc., A Delaware Corporation Interactive character system
US20080221883A1 (en) * 2005-10-27 2008-09-11 International Business Machines Corporation Hands free contact database information entry at a communication device
US20100103106A1 (en) * 2007-07-11 2010-04-29 Hsien-Hsiang Chui Intelligent robotic interface input device
US8552983B2 (en) * 2007-07-11 2013-10-08 Hsien-Hsiang Chiu Intelligent robotic interface input device
US20090292528A1 (en) * 2008-05-21 2009-11-26 Denso Corporation Apparatus for providing information for vehicle
US20090326954A1 (en) * 2008-06-25 2009-12-31 Canon Kabushiki Kaisha Imaging apparatus, method of controlling same and computer program therefor
US20100305807A1 (en) * 2009-05-28 2010-12-02 Basir Otman A Communication system with personal information management and remote vehicle monitoring and control features
US20100312547A1 (en) * 2009-06-05 2010-12-09 Apple Inc. Contextual voice commands
US20100315329A1 (en) * 2009-06-12 2010-12-16 Southwest Research Institute Wearable workspace
US20120229411A1 (en) * 2009-12-04 2012-09-13 Sony Corporation Information processing device, display method, and program
US20120072944A1 (en) * 2010-09-16 2012-03-22 Verizon New Jersey Method and apparatus for providing seamless viewing
US20120221334A1 (en) * 2011-02-25 2012-08-30 Hon Hai Precision Industry Co., Ltd. Security system and method

Cited By (58)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150206535A1 (en) * 2012-08-10 2015-07-23 Honda Access Corp. Speech recognition method and speech recognition device
US9704484B2 (en) * 2012-08-10 2017-07-11 Honda Access Corp. Speech recognition method and speech recognition device
US20140050354A1 (en) * 2012-08-16 2014-02-20 Microchip Technology Incorporated Automatic Gesture Recognition For A Sensor System
US9323985B2 (en) * 2012-08-16 2016-04-26 Microchip Technology Incorporated Automatic gesture recognition for a sensor system
US10354649B2 (en) 2012-09-26 2019-07-16 Amazon Technologies, Inc. Altering audio to improve automatic speech recognition
US9251787B1 (en) * 2012-09-26 2016-02-02 Amazon Technologies, Inc. Altering audio to improve automatic speech recognition
US9916830B1 (en) 2012-09-26 2018-03-13 Amazon Technologies, Inc. Altering audio to improve automatic speech recognition
US11488591B1 (en) 2012-09-26 2022-11-01 Amazon Technologies, Inc. Altering audio to improve automatic speech recognition
US10438058B2 (en) * 2012-11-08 2019-10-08 Sony Corporation Information processing apparatus, information processing method, and program
US20150345065A1 (en) * 2012-12-05 2015-12-03 Lg Electronics Inc. Washing machine and control method thereof
US11862186B2 (en) 2013-02-07 2024-01-02 Apple Inc. Voice trigger for a digital assistant
US9703350B2 (en) * 2013-03-15 2017-07-11 Maxim Integrated Products, Inc. Always-on low-power keyword spotting
US20140281628A1 (en) * 2013-03-15 2014-09-18 Maxim Integrated Products, Inc. Always-On Low-Power Keyword spotting
US11798547B2 (en) * 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US9747899B2 (en) 2013-06-27 2017-08-29 Amazon Technologies, Inc. Detecting self-generated wake expressions
US10720155B2 (en) 2013-06-27 2020-07-21 Amazon Technologies, Inc. Detecting self-generated wake expressions
WO2014210392A3 (en) * 2013-06-27 2015-07-16 Rawles Llc Detecting self-generated wake expressions
US11600271B2 (en) 2013-06-27 2023-03-07 Amazon Technologies, Inc. Detecting self-generated wake expressions
US11568867B2 (en) 2013-06-27 2023-01-31 Amazon Technologies, Inc. Detecting self-generated wake expressions
US10269377B2 (en) * 2013-12-03 2019-04-23 Lenovo (Singapore) Pte. Ltd. Detecting pause in audible input to device
US10163455B2 (en) * 2013-12-03 2018-12-25 Lenovo (Singapore) Pte. Ltd. Detecting pause in audible input to device
US20150154983A1 (en) * 2013-12-03 2015-06-04 Lenovo (Singapore) Pted. Ltd. Detecting pause in audible input to device
US11838579B2 (en) 2014-06-30 2023-12-05 Apple Inc. Intelligent automated assistant for TV user interactions
EP3192072B1 (en) * 2014-09-12 2020-09-23 Apple Inc. Dynamic thresholds for always listening speech trigger
EP3246807A4 (en) * 2015-01-15 2018-08-29 Xiaomi Inc. Method and apparatus for triggering execution of operation instruction
US20180033430A1 (en) * 2015-02-23 2018-02-01 Sony Corporation Information processing system and information processing method
US10522140B2 (en) * 2015-02-23 2019-12-31 Sony Corporation Information processing system and information processing method
US10699718B2 (en) 2015-03-13 2020-06-30 Samsung Electronics Co., Ltd. Speech recognition system and speech recognition method thereof
US9825773B2 (en) 2015-06-18 2017-11-21 Panasonic Intellectual Property Corporation Of America Device control by speech commands with microphone and camera to acquire line-of-sight information
US11954405B2 (en) 2015-09-08 2024-04-09 Apple Inc. Zero latency digital assistant
US11809886B2 (en) 2015-11-06 2023-11-07 Apple Inc. Intelligent automated assistant in a messaging environment
WO2017171357A1 (en) * 2016-03-28 2017-10-05 Samsung Electronics Co., Ltd. Multi-dimensional remote control device and operation controlling method thereof
US20180018965A1 (en) * 2016-07-12 2018-01-18 Bose Corporation Combining Gesture and Voice User Interfaces
US10621992B2 (en) * 2016-07-22 2020-04-14 Lenovo (Singapore) Pte. Ltd. Activating voice assistant based on at least one of user proximity and context
US20180173494A1 (en) * 2016-12-15 2018-06-21 Samsung Electronics Co., Ltd. Speech recognition method and apparatus
US11003417B2 (en) * 2016-12-15 2021-05-11 Samsung Electronics Co., Ltd. Speech recognition method and apparatus with activation word based on operating environment of the apparatus
US11687319B2 (en) 2016-12-15 2023-06-27 Samsung Electronics Co., Ltd. Speech recognition method and apparatus with activation word based on operating environment of the apparatus
US11862151B2 (en) 2017-05-12 2024-01-02 Apple Inc. Low-latency intelligent automated assistant
US11837237B2 (en) 2017-05-12 2023-12-05 Apple Inc. User-specific acoustic models
US10664533B2 (en) 2017-05-24 2020-05-26 Lenovo (Singapore) Pte. Ltd. Systems and methods to determine response cue for digital assistant based on context
CN107195304A (en) * 2017-06-30 2017-09-22 珠海格力电器股份有限公司 The voice control circuit and method of a kind of electric equipment
US11302328B2 (en) 2017-11-02 2022-04-12 Hisense Visual Technology Co., Ltd. Voice interactive device and method for controlling voice interactive device
US10726837B2 (en) 2017-11-02 2020-07-28 Hisense Visual Technology Co., Ltd. Voice interactive device and method for controlling voice interactive device
US10861463B2 (en) * 2018-01-09 2020-12-08 Sennheiser Electronic Gmbh & Co. Kg Method for speech processing and speech processing device
WO2020015473A1 (en) * 2018-01-30 2020-01-23 钉钉控股(开曼)有限公司 Interaction method and device
US11907436B2 (en) 2018-05-07 2024-02-20 Apple Inc. Raise to speak
US11630525B2 (en) 2018-06-01 2023-04-18 Apple Inc. Attention aware virtual assistant dismissal
US11893992B2 (en) 2018-09-28 2024-02-06 Apple Inc. Multi-modal inputs for voice commands
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
CN110097875A (en) * 2019-06-03 2019-08-06 清华大学 Interactive voice based on microphone signal wakes up electronic equipment, method and medium
WO2021021970A1 (en) * 2019-07-30 2021-02-04 Qualcomm Incorporated Activating speech recognition
US11437031B2 (en) 2019-07-30 2022-09-06 Qualcomm Incorporated Activating speech recognition based on hand patterns detected using plurality of filters
US11145315B2 (en) * 2019-10-16 2021-10-12 Motorola Mobility Llc Electronic device with trigger phrase bypass and corresponding systems and methods
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11750962B2 (en) 2020-07-21 2023-09-05 Apple Inc. User identification using headphones
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
US20220179617A1 (en) * 2020-12-04 2022-06-09 Wistron Corp. Video device and operation method thereof

Also Published As

Publication number Publication date
JP2013080015A (en) 2013-05-02

Similar Documents

Publication Publication Date Title
US20130085757A1 (en) Apparatus and method for speech recognition
JP6325626B2 (en) Hybrid performance scaling or speech recognition
US11062705B2 (en) Information processing apparatus, information processing method, and computer program product
US11756563B1 (en) Multi-path calculations for device energy levels
US11355104B2 (en) Post-speech recognition request surplus detection and prevention
US11189273B2 (en) Hands free always on near field wakeword solution
US9720644B2 (en) Information processing apparatus, information processing method, and computer program
EP2639793B1 (en) Electronic device and method for controlling power using voice recognition
US8421932B2 (en) Apparatus and method for speech recognition, and television equipped with apparatus for speech recognition
US9837068B2 (en) Sound sample verification for generating sound detection model
US20140304606A1 (en) Information processing apparatus, information processing method and computer program
US20130289992A1 (en) Voice recognition method and voice recognition apparatus
US20140303975A1 (en) Information processing apparatus, information processing method and computer program
KR20180127065A (en) Speech-controlled apparatus for preventing false detections of keyword and method of operating the same
KR20180132011A (en) Electronic device and Method for controlling power using voice recognition thereof
JP2015194766A (en) speech recognition device and speech recognition method
KR20190062369A (en) Speech-controlled apparatus for preventing false detections of keyword and method of operating the same
KR20210063698A (en) Electronic device and method for controlling the same, and storage medium
US11600275B2 (en) Electronic device and control method thereof
JP2006163285A (en) Device, method and program for speech recognition, and recording medium

Legal Events

Date Code Title Description
AS Assignment

Owner name: KABUSHIKI KAISHA TOSHIBA, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NAKAMURA, MASANOBU;REEL/FRAME:028470/0868

Effective date: 20120606

AS Assignment

Owner name: KABUSHIKI KAISHA TOSHIBA, JAPAN

Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE OMISSION OF THE 2ND ASSIGNOR PREVIOUSLY RECORDED ON REEL 028470 FRAME 0868. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNORS:NAKAMURA, MASANOBU;KAWAMURA, AKINORI;REEL/FRAME:028583/0369

Effective date: 20120606

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION