EP4004908B1 - Activation of speech recognition - Google Patents
Activation of speech recognition
- Publication number
- EP4004908B1 (application EP20757126.6A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- hand
- indication
- detector
- speech recognition
- sensors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/26—Power supply means, e.g. regulation thereof
- G06F1/32—Means for saving power
- G06F1/3203—Power management, i.e. event-based initiation of a power-saving mode
- G06F1/3206—Monitoring of events, devices or parameters that trigger a change in power modality
- G06F1/3231—Monitoring the presence, absence or movement of users
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/10—Image acquisition
- G06V10/12—Details of acquisition arrangements; Constructional details thereof
- G06V10/14—Optical characteristics of the device performing the acquisition or on the illumination arrangements
- G06V10/143—Sensing or illuminating at different wavelengths
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/20—Scenes; Scene-specific elements in augmented reality scenes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
- G10L2015/227—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology
Definitions
- the present disclosure is generally related to speech recognition and more specifically, to activating a speech activation system.
- Speech recognition is often used to enable an electronic device to interpret spoken questions or commands from users. Such spoken questions or commands can be identified by analyzing an audio signal, such as a microphone input, at an automatic speech recognition (ASR) engine that generates a textual output of the spoken questions or commands.
- An "always-on" ASR system enables the electronic device to continually scan audio input to detect user commands or questions in the audio input.
- continual operation of the ASR system results in relatively high power consumption, which reduces battery life when implemented in a mobile device.
- a spoken voice command will not be recognized unless it is preceded by a spoken activation keyword.
- Recognition of the activation keyword enables such devices to activate the ASR engine to process the voice command.
- speaking an activation keyword before every command uses additional time and requires the speaker to use correct pronunciation and proper intonation.
- a dedicated button is provided for the user to press to initiate speech recognition.
- locating and precisely pressing the button can result in a diversion of the user's attention from other tasks.
- EP 2109295 describes a mobile terminal including an input unit configured to receive an input to activate a voice recognition function on the mobile terminal and a memory configured to store multiple domains related to menus and operations of the mobile terminal.
- the voice recognition function may be activated by interpreting body movements of the user. It further includes a controller configured to access a specific domain among the multiple domains included in the memory based on the received input to activate the voice recognition function, to recognize user speech based on a language model of the accessed domain, and to determine at least one menu and operation of mobile terminal based on the accessed specific domain and the recognized user speech.
- US 2013/085757 describes an apparatus for speech recognition including a plurality of trigger detection units, each of which being configured to detect a start trigger for recognizing a command utterance for controlling the device.
- the trigger detection units consist of a gesture-trigger detection unit, a handclap-trigger detection unit, and a voice-trigger detection unit.
- EP 2743799 describes a control apparatus to be connected to a route guidance apparatus, comprising a hand information detection part for detecting information on a hand of a user from a taken image.
- CN 207758675 describes a vehicle rear-view mirror with an infrared sensor that may be used to activate a voice recognition function module.
- Devices and methods to activate a speech recognition system are disclosed. Because an always-on ASR system that continually scans audio input to detect user commands or questions in the audio input results in relatively high power consumption, battery life is reduced when the ASR engine is implemented in a mobile device. In an attempt to reduce power consumption, some systems may use a reduced-capacity speech recognition processor that consumes less power than a full-power ASR engine to perform keyword detection on the audio input. When an activation keyword is detected, the full-power ASR engine can be activated to process a voice command that follows the activation keyword. However, requiring a user to speak an activation keyword before every command is time consuming and requires the speaker to use correct pronunciation and proper intonation. Devices that require the user to press a dedicated button to initiate speech recognition can result in an unsafe diversion of the user's attention, such as when operating a vehicle.
- speech recognition is activated in response to detecting a hand over a portion of a device, such as a user's hand hovering over a screen of the device.
- the user can activate speech recognition for a voice command by positioning the user's hand over the device and without having to speak an activation keyword or having to precisely locate and press a dedicated button. Removal of the user's hand from over the device can signal that the user has finished speaking the voice command.
- speech recognition can be activated conveniently and safely, such as when the user is operating a vehicle.
- the term “producing” is used to indicate any of its ordinary meanings, such as calculating, generating, and/or providing.
- the term “providing” is used to indicate any of its ordinary meanings, such as calculating, generating, and/or producing.
- the term “coupled” is used to indicate a direct or indirect electrical or physical connection. If the connection is indirect, there may be other blocks or components between the structures being “coupled”.
- a loudspeaker may be acoustically coupled to a nearby wall via an intervening medium (e.g., air) that enables propagation of waves (e.g., sound) from the loudspeaker to the wall (or vice-versa).
- configuration may be used in reference to a method, apparatus, device, system, or any combination thereof, as indicated by its particular context. Where the term “comprising” is used in the present description and claims, it does not exclude other elements or operations.
- the term “based on” (as in “A is based on B”) is used to indicate any of its ordinary meanings, including the cases (i) “based on at least” (e.g., “A is based on at least B”) and, if appropriate in the particular context, (ii) “equal to” (e.g., “A is equal to B”). In case (i), where “A is based on B” includes “A is based on at least B”, this may include the configuration where A is coupled to B.
- the term "in response to” is used to indicate any of its ordinary meanings, including “in response to at least.”
- the term “at least one” is used to indicate any of its ordinary meanings, including “one or more”.
- the term “at least two” is used to indicate any of its ordinary meanings, including “two or more.”
- any disclosure of an operation of an apparatus having a particular feature is also expressly intended to disclose a method having an analogous feature (and vice versa), and any disclosure of an operation of an apparatus according to a particular configuration is also expressly intended to disclose a method according to an analogous configuration (and vice versa).
- the terms “method,” “process,” “procedure,” and “technique” are used generically and interchangeably unless otherwise indicated by the particular context.
- the terms “element” and “module” may be used to indicate a portion of a greater configuration.
- the term “packet” may correspond to a unit of data that includes a header portion and a payload portion.
- communication device refers to an electronic device that may be used for voice and/or data communication over a wireless communication network.
- Examples of communication devices include smart speakers, speaker bars, cellular phones, personal digital assistants (PDAs), handheld devices, headsets, wireless modems, laptop computers, personal computers, etc.
- FIG. 1 depicts a system 100 that includes a device 102 that is configured to activate an ASR system 140 to process an input sound 106, such as a voice command, when at least a portion of a hand 190 is positioned over the device 102.
- the device 102 includes one or more microphones, represented as a microphone 112, a screen 110, one or more sensors 120, a hand detector 130, and the ASR system 140.
- a perspective view 180 illustrates the hand 190 positioned over the device 102, and a block diagram 182 illustrates components of the device 102.
- the device 102 can include a portable communication device (e.g., a "smart phone"), a vehicle system (e.g., a speech interface for an automobile entertainment system, navigation system, or self-driving control system), a virtual reality or augmented reality headset, or a wireless speaker and voice command device with an integrated assistant application (e.g., a "smart speaker” device), as illustrative, non-limiting examples.
- the microphone 112 is configured to generate an audio signal 114 responsive to the input sound 106.
- the microphone 112 is configured to be activated, responsive to an indication 132, to generate the audio signal 114, as described further with reference to FIG. 3 .
- the one or more sensors 120 are coupled to the hand detector 130 and configured to provide sensor data 122 to the hand detector 130.
- the sensor(s) 120 can include one or more cameras, such as a low-power ambient light sensor or a main camera, an infrared sensor, an ultrasound sensor, one or more other sensors, or any combination thereof, such as described further with reference to FIG. 2 .
- the hand detector 130 is configured to generate the indication 132 responsive to detection of at least a portion of a hand being positioned within a range of 10 cm to 30 cm from the one or more sensors 120, such as over the screen 110.
- "at least a portion of a hand” can correspond to any part of a hand (e.g., one or more fingers, a thumb, a palm or a back of the hand, or any portion thereof, or any combination thereof) or can correspond to an entire hand, as illustrative, non-limiting examples.
- detecting a hand is equivalent to “detecting at least a portion of a hand” and can include detecting two or more fingers, detecting at least one finger connected to a portion of a palm, detecting a thumb and at least one finger, detecting a thumb connected to at least a portion of a palm, or detecting an entire hand (e.g., four fingers, a thumb, and a palm), as illustrative, non-limiting examples.
- the hand 190 is described as being detected “over” the device 102, “over” the device 102 refers to being located at a specified relative position (or within a specified range of positions) relative to the position and orientation of the one or more sensors 120. In an example in which the device 102 is oriented so that the sensor(s) 120 face upward, such as illustrated in FIG. 1 , detecting the hand 190 over the device 102 indicates that the hand 190 is above the device 102. In an example in which the device 102 is oriented so that the sensor(s) 120 face downward, detecting the hand 190 over the device 102 indicates that the hand 190 is below the device 102.
- the hand detector 130 is configured to process the sensor data 122 to determine whether the hand 190 is detected over the device 102. For example, as described further with reference to FIG. 2 , in some implementations the hand detector 130 processes image data to determine whether a hand shape has been captured by a camera, processes infrared data to determine whether a detected temperature of the hand 190 corresponds to a hand temperature, processes ultrasound data to determine whether a distance between the hand 190 and the device 102 is within a specified range, or a combination thereof.
- the device 102 is configured to generate a notification for a user of the device 102 to indicate, responsive to detecting the hand 190 being positioned within a range of 10 cm to 30 cm from the one or more sensors 120, that speech recognition has been activated, and is further configured to generate a second notification to indicate, responsive to no longer detecting the hand 190 over the device 102, that voice input for speech recognition is deactivated.
- the device 102 may be configured to generate an audio signal such as a chime or a voice message such as "ready,” a visual signal such as an illuminated or blinking light, a digital signal to be played out by another device, such as by a car entertainment system in communication with the device, or any combination thereof.
- Generating the notification(s) enables the user to confirm that the device 102 is ready to receive a voice command and may further enable the user to detect and prevent false activations (e.g., caused by another object that may be misidentified as the hand 190) and missed activations due to improper positioning of the hand 190. Because each activation of the ASR system 140 consumes power and uses processing resources, reducing false activations results in reduced power consumption and processing resource usage.
- the ASR system 140 is configured to be activated, responsive to the indication 132, to process the audio signal 114.
- a specific bit of a control register represents the presence or absence of the indication 132 and a control circuit within or coupled to the ASR system 140 is configured to read the specific bit.
- a "1" value of the bit corresponds to the indication 132 and causes the ASR system 140 to activate.
- the indication 132 is instead implemented as a digital or analog signal on a bus or a control line, an interrupt flag at an interrupt controller, or an optical or mechanical signal, as illustrative, non-limiting examples.
- When activated, the ASR system 140 is configured to process one or more portions (e.g., frames) of the audio signal 114 that include the input sound 106. For example, the device 102 can buffer a series of frames of the audio signal 114 while the sensor data 122 is being processed by the hand detector 130 so that, upon the indication 132 being generated, the ASR system 140 can process the buffered series of frames and generate an output indicative of the user's speech.
- the ASR system 140 can provide recognized speech 142 as a text output of the speech content of the input sound 106 to another component of the device 102, such as a "virtual assistant" application or other application as described with reference to FIG. 3 , to initiate an action based on the speech content.
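The frame buffering described above can be sketched as a bounded ring buffer that retains the most recent frames of the audio signal and hands them to the ASR engine once the indication is generated. This is an illustrative sketch only; the class and method names are assumptions, not elements of the patent.

```python
from collections import deque

class FrameBuffer:
    """Bounded buffer of audio frames; the oldest frame is dropped
    once the capacity is reached (illustrative of buffering the
    audio signal 114 while the hand detector 130 runs)."""

    def __init__(self, max_frames: int):
        self._frames = deque(maxlen=max_frames)

    def push(self, frame: bytes) -> None:
        self._frames.append(frame)

    def drain(self) -> list:
        """Return all buffered frames for ASR processing and clear the buffer."""
        frames = list(self._frames)
        self._frames.clear()
        return frames

buf = FrameBuffer(max_frames=3)
for f in [b"f1", b"f2", b"f3", b"f4"]:
    buf.push(f)
captured = buf.drain()  # [b"f2", b"f3", b"f4"]: the oldest frame was dropped
```

A bounded buffer keeps memory use constant during the always-on hand-detection phase while still preserving the start of an utterance spoken just before the indication fires.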
- deactivation of the ASR system 140 can include gating an input circuit of the ASR system 140 to prevent the audio signal 114 from being input to the ASR system 140, gating a clock signal to prevent circuit switching within the ASR system 140, or both, to reduce dynamic power consumption.
- deactivation of the ASR system 140 can include reducing a power supply to the ASR system 140 to reduce static power consumption without losing the state of the circuit elements, removing power from at least a portion of the ASR system 140, or a combination thereof.
- the hand detector 130, the ASR system 140, or any combination thereof are implemented using dedicated circuitry or hardware. In some implementations, the hand detector 130, the ASR system 140, or any combination thereof, are implemented via execution of firmware or software.
- the device 102 can include a memory configured to store instructions and one or more processors configured to execute the instructions to implement the hand detector 130 and the ASR system 140, such as described further with reference to FIG. 9 .
- a user can position the user's hand 190 over the device 102 prior to speaking a voice command.
- the hand detector 130 processes the sensor data 122 to determine that the hand 190 is positioned within a range of 10 cm to 30 cm from the one or more sensors 120.
- In response to detecting the hand 190 being positioned within a range of 10 cm to 30 cm from the one or more sensors 120, the hand detector 130 generates the indication 132, which causes activation of the ASR system 140.
- After receiving the voice command at the microphone 112, the ASR system 140 processes the corresponding portion(s) of the audio signal 114 to generate the recognized speech 142 indicating the voice command.
- Activation of the ASR system 140 when a hand is detected over the device 102 enables a user of the device 102 to activate speech recognition for a voice command by positioning the user's hand 190 over the device, without the user having to speak an activation keyword or having to precisely locate and press a dedicated button.
- speech recognition can be activated conveniently and safely, such as when the user is operating a vehicle.
- Because positioning the user's hand over the device signals the device to initiate speech recognition, improper activation of speech recognition can be reduced as compared to a system that instead uses keyword detection to activate speech recognition.
- FIG. 2 depicts an example 200 showing further aspects of components that can be implemented in the device 102 of FIG. 1 .
- the sensors 120 include one or more cameras 202 configured to provide image data 212 to the hand detector 130, an infrared (IR) sensor 208 configured to provide infrared sensor data 218 to the hand detector 130, and an ultrasound sensor 210 configured to provide ultrasound sensor data 220 to the hand detector.
- the image data 212, the infrared sensor data 218, and the ultrasound sensor data 220 are included in the sensor data 122.
- the cameras 202 include a low-power ambient light sensor 204 configured to generate at least part of the image data 212, a main camera 206 configured to generate at least part of the image data 212, or both. Although the main camera 206 can capture image data having a higher resolution than the ambient light sensor 204, the ambient light sensor 204 can generate image data having sufficient resolution to perform hand detection and operates using less power than the main camera 206.
- the hand detector 130 includes a hand pattern detector 230, a hand temperature detector 234, a hand distance detector 236, and an activation signal unit 240.
- the hand pattern detector 230 is configured to process the image data 212 to determine whether the image data 212 includes a hand pattern 232.
- the hand pattern detector 230 processes the image data 212 using a neural network trained to recognize the hand pattern 232.
- the hand pattern detector 230 applies one or more filters to the image data 212 to identify the hand pattern 232.
- the hand pattern detector 230 is configured to send a first signal 231 to the activation signal unit 240 that indicates whether the hand pattern 232 is detected.
- Although a single hand pattern 232 is depicted, in other implementations multiple hand patterns may be included that represent differing aspects of a hand, such as a fingers-together pattern, a fingers-spread pattern, a partial hand pattern, etc.
- the hand temperature detector 234 is configured to process the infrared sensor data 218 from the infrared sensor 208 and to send a second signal 235 to the activation signal unit 240 that indicates whether the infrared sensor data 218 indicates a temperature source having a temperature indicative of a human hand. In some implementations, the hand temperature detector 234 is configured to determine whether at least a portion of a field of view of the infrared sensor 208 has a temperature source in a temperature range indicative of a human hand. In some implementations, the hand temperature detector 234 is configured to receive data indicating a location of a hand from the hand pattern detector 230 to determine whether a temperature source at the hand location matches the temperature range of a human hand.
- the hand distance detector 236 is configured to determine a distance 250 between the hand 190 and at least a portion of the device 102.
- the hand distance detector 236 processes the ultrasound sensor data 220 and generates a third signal 237 that indicates whether the hand 190 is within a specified range 238 of distances.
- the hand distance detector 236 receives data from the hand pattern detector 230, from the hand temperature detector 234, or both, that indicates a location of the hand 190 and uses the hand location data to determine a region in the field of view of the ultrasound sensor 210 that corresponds to the hand 190.
- the hand distance detector 236 identifies the hand 190 by locating a nearest object to the screen 110 that exceeds a specified portion (e.g., 25%) of the field of view of the ultrasound sensor 210.
- the range 238 has a lower bound of 10 centimeters (cm) and an upper bound of 30 cm (i.e., the range 238 includes distances that are greater than or equal to 10 cm and less than or equal to 30 cm).
- the range 238 is adjustable.
- the device 102 may be configured to perform an update operation in which the user positions the hand 190 in a preferred position relative to the device 102 so that the distance 250 can be detected and used to generate the range 238 (e.g., by applying a lower offset from the detected distance 250 to set a lower bound and applying an upper offset from the detected distance 250 to set an upper bound).
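The update operation above can be sketched as deriving range bounds from the user's preferred distance by applying offsets. The offset values and the function name are hypothetical; the patent describes lower and upper offsets without fixing their amounts.

```python
def calibrate_range(detected_cm: float,
                    lower_offset: float = 10.0,
                    upper_offset: float = 10.0) -> tuple:
    """Derive an adjustable detection range (range 238) from the distance
    250 measured while the user holds a preferred hand position.
    Offsets shown are illustrative defaults, not values from the patent."""
    lower = max(0.0, detected_cm - lower_offset)  # clamp so the bound stays physical
    upper = detected_cm + upper_offset
    return (lower, upper)

bounds = calibrate_range(20.0)  # a 20 cm preferred distance yields (10.0, 30.0)
```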
- the activation signal unit 240 is configured to generate the indication 132 responsive to the first signal 231 indicating detection of the hand pattern 232 in the image data 212, the second signal 235 indicating detection of a hand temperature within a human hand temperature range, and the third signal 237 indicating detection that the hand 190 is within the range 238 (e.g., the hand 190 is at a distance 250 of 10 centimeters to 30 centimeters from the screen 110).
- the activation signal unit 240 can generate the indication 132 as a logical AND of the signals 231, 235, and 237 (e.g., the indication 132 has a 1 value in response to all three signals 231, 235, 237 having a 1 value).
- In other implementations, the activation signal unit 240 is configured to generate the indication 132 having a 1 value in response to any two of the signals 231, 235, 237 having a 1 value.
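The two single-bit combination rules can be sketched as follows; the function names are illustrative labels for the activation signal unit's logic, not terms from the patent.

```python
def indication_all(pattern: bool, temp: bool, dist: bool) -> bool:
    """Logical AND of signals 231, 235, and 237: the indication 132
    is asserted only when all three criteria are met."""
    return pattern and temp and dist

def indication_two_of_three(pattern: bool, temp: bool, dist: bool) -> bool:
    """Relaxed rule: any two of the three signals suffice
    (Python bools sum as 0/1)."""
    return (pattern + temp + dist) >= 2
```

The 2-of-3 variant trades some robustness against false positives for tolerance of a single failing or omitted sensor.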
- one or more of the signals 231, 235, and 237 has a multi-bit value indicating a likelihood that the corresponding hand detection criterion is satisfied.
- the first signal 231 may have a multi-bit value that indicates a confidence that a hand pattern is detected
- the second signal 235 may have a multi-bit value that indicates a confidence that a hand temperature is detected
- the third signal 237 may have a multi-bit value that indicates a confidence that the distance of the hand 190 from the device 102 is within the range 238.
- the activation signal unit 240 can combine the signals 231, 235, and 237 and compare the combined result to a threshold to generate the indication 132.
- the activation signal unit 240 may apply a set of weights to determine a weighted sum of the signals 231, 235, and 237.
- the activation signal unit 240 may output the indication 132 having a value indicating hand detection responsive to the weighted sum exceeding the threshold.
- Values of weights and thresholds can be hardcoded or, alternatively, can be dynamically or periodically adjusted based on user feedback regarding false positives and false negatives, as described further below.
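The weighted-sum combination of multi-bit confidence signals can be sketched as below. The particular weights and threshold are illustrative assumptions; as noted above, real values could be hardcoded or adapted from user feedback.

```python
def weighted_indication(confidences, weights, threshold) -> bool:
    """Combine the multi-bit confidence values of signals 231, 235, and 237
    into the activation indication: assert when the weighted sum of the
    per-criterion confidences exceeds the threshold."""
    score = sum(w * c for w, c in zip(weights, confidences))
    return score > threshold

# e.g. weight the hand-shape confidence most heavily (values are illustrative)
fired = weighted_indication(confidences=[0.9, 0.6, 0.8],
                            weights=[0.5, 0.2, 0.3],
                            threshold=0.7)
```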
- the hand detector 130 is further configured to generate a second indication 242 in response to detection that the hand 190 is no longer over the device 102.
- the hand detector 130 may output the second indication 242 as having a 0 value (indicating that hand removal is not detected) responsive to detecting the hand 190, and may update the second indication 242 to have a 1 value in response to determining that the hand is no longer detected (e.g., to indicate a transition from a "hand detected" state to a "hand not detected" state).
- the second indication 242 can correspond to an end-of-utterance signal for the ASR system 140, as explained further with reference to FIG. 3 .
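The removal-transition behavior of the second indication 242 can be sketched as a small state tracker that fires only on the "hand detected" to "hand not detected" edge. The class name is a hypothetical label for this logic.

```python
class HandStateTracker:
    """Track the hand-detected state across frames and emit a pulse
    (the second indication 242 / end-of-utterance) only on the
    transition from detected to not detected."""

    def __init__(self):
        self._hand_present = False

    def update(self, hand_detected: bool) -> bool:
        # True only when the hand was present last frame and is now gone
        removed = self._hand_present and not hand_detected
        self._hand_present = hand_detected
        return removed

tracker = HandStateTracker()
events = [tracker.update(s) for s in [False, True, True, False, False]]
# events pulses once, at the frame where the hand disappears
```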
- Although FIG. 2 depicts multiple sensors including the ambient light sensor 204, the main camera 206, the infrared sensor 208, and the ultrasound sensor 210, in other implementations one or more of the ambient light sensor 204, the main camera 206, the infrared sensor 208, or the ultrasound sensor 210 is omitted.
- Although the ambient light sensor 204 enables generation of at least part of the image data 212 to detect hand shapes using reduced power as compared to using the main camera 206, in some implementations the ambient light sensor 204 is omitted and the main camera 206 is used to generate the image data 212.
- the main camera 206 can be operated according to an on/off duty cycle, such as at quarter-second intervals, for hand detection.
- Although the main camera 206 enables generation of at least part of the image data 212 to detect hand shapes with higher resolution, and therefore higher accuracy, as compared to using the ambient light sensor 204, in some implementations the main camera 206 is omitted and the ambient light sensor 204 is used to generate the image data 212.
- Although the infrared sensor 208 enables generation of the infrared sensor data 218 to detect whether an object has a temperature matching a human hand temperature, in other implementations the infrared sensor 208 is omitted and the device 102 performs hand detection without regard to temperature.
- Although the ultrasound sensor 210 enables generation of the ultrasound sensor data 220 to detect whether a distance to an object is within the range 238, in other implementations the ultrasound sensor 210 is omitted and the device 102 performs hand detection without regard to distance from the device 102.
- one or more other mechanisms can be implemented for distance detection, such as by comparing object locations in image data from multiple cameras of the device 102 (e.g., parallax) or multiple cameras of a different device (e.g., a vehicle in which the device 102 is located) to estimate the distance 250, by using a size of a detected hand in the image data 212 or in the infrared sensor data 218 to estimate the distance 250, or by projecting structured light or other electromagnetic signals to estimate object distance, as illustrative, non-limiting examples.
- the only sensor data used for hand detection is the image data 212 from the ambient light sensor 204.
- Although in some implementations the sensors 120 are concurrently active, in other implementations one or more of the sensors 120 are controlled according to a "cascade" operation in which power is conserved by having one or more of the sensors 120 remain inactive until a hand detection criterion is satisfied based on sensor data from another of the sensors 120.
- the main camera 206, the infrared sensor 208, and the ultrasound sensor 210 may remain inactive until the hand pattern detector 230 detects the hand pattern 232 in the image data 212 generated by the ambient light sensor 204, in response to which one or more of the main camera 206, the infrared sensor 208, and the ultrasound sensor 210 is activated to provide additional sensor data for enhanced accuracy of hand detection.
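The cascade operation described above can be sketched as a small state holder; the sensor names and the single-stage trigger below are illustrative simplifications rather than the patent's implementation.

```python
class CascadedHandSensing:
    """Keep higher-power sensors dormant until the low-power ambient light
    sensor reports a candidate hand pattern (hypothetical sketch)."""

    def __init__(self) -> None:
        self.active = {"ambient_light"}   # always-on, low-power first stage
        self.dormant = {"main_camera", "infrared", "ultrasound"}

    def on_ambient_light_frame(self, hand_pattern_detected: bool) -> set:
        # Wake the remaining sensors only once the first-stage criterion is met.
        if hand_pattern_detected and self.dormant:
            self.active |= self.dormant
            self.dormant = set()
        return self.active
```

Until the first-stage detection fires, only the ambient light sensor draws power; afterwards all sensors contribute data for higher-accuracy detection.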
- FIG. 3 depicts an example 300 showing further aspects of components that can be implemented in the device 102.
- activation circuitry 302 is coupled to the hand detector 130 and to the ASR system 140, and the ASR system includes a buffer 320 that is accessible to an ASR engine 330.
- the device 102 also includes a virtual assistant application 340 and a speaker 350 (e.g., the device 102 implemented as a wireless speaker and voice command device).
- the activation circuitry 302 is configured to activate the automatic speech recognition system 140 in response to receiving the indication 132.
- the activation circuitry 302 is configured to generate an activation signal 310 in response to the indication 132 transitioning to a state that indicates hand detection (e.g., the indication 132 transitions from a 0 value indicating no hand detection to a 1 value indicating hand detection).
- the activation signal 310 is provided to the ASR system 140 via a signal 306 to activate the ASR system 140.
- Activating the ASR system 140 includes initiating buffering of the audio signal 114 at the buffer 320 to generate buffered audio data 322.
- the activation signal 310 is also provided to the microphone 112 via a signal 304 that activates the microphone 112, enabling the microphone to generate the audio signal 114.
- the activation circuitry 302 is also configured to generate an end-of-utterance signal 312.
- the activation circuitry 302 is configured to generate the end-of-utterance signal 312 in response to the second indication 242 transitioning to a state that indicates an end of hand detection (e.g., the second indication 242 transitions from a 0 value (indicating no change in hand detection) to a 1 value (indicating that a detected hand is no longer detected)).
- the end-of-utterance signal 312 is provided to the ASR system 140 via a signal 308 to cause the ASR engine 330 to begin processing of the buffered audio data 322.
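A minimal sketch of this activation flow might look like the following; as a simplification, the second indication 242 (end of hand detection) is modeled here as the falling edge of the same hand indication, which is an assumption for illustration only.

```python
class ActivationCircuitry:
    """Edge-triggered sketch: a rising hand indication starts buffering,
    the falling edge releases the buffered audio to the ASR engine."""

    def __init__(self, asr_engine) -> None:
        self.asr_engine = asr_engine   # callable standing in for ASR engine 330
        self.buffer = []               # stands in for buffer 320
        self.buffering = False
        self._prev_hand = 0

    def on_hand_indication(self, hand: int) -> None:
        if hand == 1 and self._prev_hand == 0:      # rising edge: activate buffering
            self.buffer.clear()
            self.buffering = True
        elif hand == 0 and self._prev_hand == 1:    # falling edge: end of utterance
            self.buffering = False
            self.asr_engine(self.buffer)            # begin processing buffered audio
        self._prev_hand = hand

    def on_audio_samples(self, samples) -> None:
        if self.buffering:
            self.buffer.extend(samples)
```

Audio that arrives before the hand is detected is dropped; only the audio between the two indications reaches the engine.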
- the activation circuitry 302 is configured to selectively activate one or more components of the ASR system 140.
- the activation circuitry 302 may include or be coupled to power management circuitry, clock circuitry, head switch or foot switch circuitry, buffer control circuitry, or any combination thereof.
- the activation circuitry 302 may be configured to initiate powering-on of the buffer 320, the ASR engine 330, or both, such as by selectively applying or raising a voltage of a power supply of the buffer 320, the ASR engine 330, or both.
- the activation circuitry 302 may be configured to selectively gate or un-gate a clock signal to the buffer 320, the ASR engine 330, or both, such as to prevent circuit operation without removing a power supply.
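As a toy model of the selective power and clock gating described above (an abstraction for illustration, not the actual circuitry): a component performs work only when its power rail is up and its clock is un-gated, and gating the clock halts operation without losing powered state.

```python
class GatedComponent:
    """Abstract model of a power- and clock-gated component such as the
    buffer or the ASR engine (hypothetical sketch)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.powered = False
        self.clocked = False

    def power_on(self) -> None:
        self.powered = True

    def ungate_clock(self) -> None:
        self.clocked = True

    def gate_clock(self) -> None:
        # Halt circuit operation without removing the power supply.
        self.clocked = False

    def step(self, work):
        # Work is performed only when both power and clock are available.
        return work() if (self.powered and self.clocked) else None
```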
- the recognized speech 142 output by the ASR system 140 is provided to the virtual assistant application 340.
- the virtual assistant application 340 may be implemented by one or more processors executing instructions, such as described in further detail with reference to FIG. 9 .
- the virtual assistant application 340 may be configured to perform one or more search queries, such as via wireless connection to an internet gateway, search server, or other resource, searching a local storage of the device 102, or a combination thereof.
- the audio signal 114 may represent the spoken question "what's the weather like today?"
- the virtual assistant application 340 may generate a query to access an Internet-based weather service to obtain a weather forecast for a geographic region in which the device 102 is located.
- the virtual assistant application 340 is configured to generate an output, such as an output audio signal 342 that causes the speaker 350 to generate an auditory output, such as in a voice interface implementation.
- the virtual assistant application 340 generates another mode of output, such as a visual output signal that may be displayed by a screen or display that is integrated in the device 102 or coupled to the device 102.
- values of parameters, such as weights and thresholds, used by the device 102 can be set by a manufacturer or provider of the device 102.
- the device 102 is configured to adjust one or more such values during the life of the device 102 based on detected false negatives, false activations, or a combination thereof, associated with the ASR system 140. For example, a history of false activations can be maintained by the device 102 so that the characteristics of sensor data 122 that triggered the false activations can be periodically used to automatically adjust one or more weights or thresholds, such as to emphasize the relative reliability of one sensor over another for use in hand detection, to reduce a likelihood of future false activations.
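One hypothetical way to realize the weight adjustment from a false-activation history is sketched below; the log record format, the adjustment step, and the renormalization are assumptions, not details from the patent.

```python
def adjust_sensor_weights(weights: dict, false_activation_log: list,
                          step: float = 0.05) -> dict:
    """De-emphasize sensors whose data triggered false activations.
    Each log entry names the triggering sensors (hypothetical format)."""
    adjusted = dict(weights)
    for entry in false_activation_log:
        for sensor in entry["triggering_sensors"]:
            adjusted[sensor] = max(0.0, adjusted[sensor] - step)
    # Renormalize so the weights still sum to one.
    total = sum(adjusted.values())
    return {s: w / total for s, w in adjusted.items()} if total else adjusted
```

After an infrared-triggered false activation, the infrared weight drops relative to the ambient light sensor, emphasizing the more reliable sensor in future detections.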
- in some implementations, absence of hand detection is indicated by a "0" value of the indication 132.
- in some implementations a "1" value of the first signal 231 indicates a high likelihood that the hand pattern 232 is in the image data 212 and a "0" value indicates a low likelihood, while other implementations may use the opposite convention.
- in some implementations a "1" value of the second signal 235, the third signal 237, or both indicates a high likelihood that a hand detection criterion is satisfied, while in other implementations the "1" value indicates a high likelihood that the criterion is not satisfied.
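A weighted vote is one plausible way to combine the per-detector signals 231, 235, and 237 into the single indication 132; the weights and threshold below are illustrative assumptions only.

```python
def hand_indication(pattern_signal: int, temperature_signal: int,
                    distance_signal: int,
                    weights=(0.5, 0.25, 0.25), threshold: float = 0.75) -> int:
    """Combine per-detector signals into a single hand-detection indication
    via a weighted vote (illustrative weights, not from the patent)."""
    score = (weights[0] * pattern_signal
             + weights[1] * temperature_signal
             + weights[2] * distance_signal)
    return 1 if score >= threshold else 0
```

Under these weights, the hand pattern signal alone is not sufficient; at least one corroborating signal (temperature or distance) is needed to assert the indication.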
- FIG. 4 depicts an implementation 400 of a device 402 that includes the hand detector 130 and the ASR system 140 integrated in a discrete component, such as a semiconductor chip or package as described further with reference to FIG. 9 .
- the device 402 includes an audio signal input 410, such as a first bus interface, to enable the audio signal 114 to be received from a microphone external to the device 402.
- the device 402 also includes a sensor data input 412, such as a second bus interface, to enable the sensor data 122 to be received from one or more sensors external to the device 402.
- the device 402 may further include one or more outputs to provide processing results (e.g., the recognized speech 142 or the output audio signal 342) to one or more external components (e.g., the speaker 350).
- the device 402 enables implementation of hand detection and voice recognition activation as a component in a system that includes a microphone and other sensors, such as in a vehicle as depicted in FIG. 7 , a virtual reality or augmented reality headset as depicted in FIG. 8 , or a wireless communication device as depicted in FIG. 9 .
- a particular implementation of a method 500 of processing an audio signal representing input sound is depicted that may be performed by the device 102 or the device 402.
- the method begins at 502 and includes determining whether a hand is over a screen of the device, at 504, such as by the hand detector 130 processing the sensor data 122.
- a microphone and buffer are activated, at 506.
- the microphone 112 and the buffer 320 of FIG. 3 are activated by the activation circuitry 302 via the signals 304 and 306.
- the method 500 includes activating an ASR engine to process the buffered data, at 510.
- the ASR engine 330 is activated by the signal 308 generated by the activation circuitry 302 to process the buffered audio data 322.
- Activating ASR when a hand is detected over the screen enables a user to activate speech recognition for a voice command by positioning of the user's hand without having to speak an activation keyword or locate and press a dedicated button.
- speech recognition can be activated conveniently and safely, such as when the user is operating a vehicle.
- because positioning the user's hand over the screen initiates activation of components to receive a voice command for speech recognition and removing the user's hand from over the screen initiates processing of the received voice command, improper activation and improper deactivation of speech recognition can both be reduced as compared to a system that instead uses keyword detection to activate speech recognition.
- a particular implementation of a method 600 of processing an audio signal representing input sound is depicted that may be performed by the device 102 or the device 402, as illustrative, non-limiting examples.
- the method 600 starts at 602 and includes detecting, at a device, at least a portion of a hand over at least a portion of the device, at 604.
- the hand detector 130 detects the hand 190 via processing the sensor data 122 received from the one or more sensors 120.
- detecting the portion of the hand over the portion of the device includes processing image data (e.g., the image data 212) to determine whether the image data includes a hand pattern (e.g., the hand pattern 232).
- the image data is generated at a low-power ambient light sensor of the device, such as the ambient light sensor 204.
- Detecting the portion of the hand over the portion of the device may further include processing infrared sensor data from an infrared sensor of the device, such as the infrared sensor data 218. Detecting the portion of the hand over the portion of the device may also include processing ultrasound sensor data from an ultrasound sensor of the device, such as the ultrasound sensor data 220.
- the method 600 includes, responsive to detecting the portion of the hand over the portion of the device, activating an automatic speech recognition system to process the audio signal, at 606.
- the device 102 activates the ASR system 140 in response to the indication 132.
- activating the automatic speech recognition system includes initiating buffering of the audio signal, such as the device 102 (e.g., the activation circuitry 302) activating the buffer 320 via the signal 306.
- the method 600 further includes activating a microphone to generate the audio signal based on the input sound, such as the device 102 (e.g., the activation circuitry 302) activating the microphone 112 via the signal 304.
- the method 600 includes detecting that the portion of the hand is no longer over the portion of the device, at 608, and responsive to detecting that the portion of the hand is no longer over the portion of the device, providing an end-of-utterance signal to the automatic speech recognition system, at 610.
- the hand detector 130 detects that the hand is no longer over the portion of the device, and the activation circuitry 302 provides the end-of-utterance signal 312 to the ASR engine 330 responsive to the second indication 242.
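The steps 604-610 of the method 600 can be sketched as a simple loop over a sequence of frames; the `(hand_present, audio_chunk)` frame format is a hypothetical simplification for illustration.

```python
def process_utterance(frames):
    """Walk the method-600 steps over (hand_present, audio_chunk) frames:
    start buffering when the hand appears (604/606), treat the hand's
    disappearance as end of utterance (608/610), and return the buffered
    audio that would be handed to the ASR engine."""
    buffered, capturing = [], False
    for hand_present, audio_chunk in frames:
        if hand_present and not capturing:        # 604: hand detected -> 606: activate
            capturing, buffered = True, []
        elif not hand_present and capturing:      # 608: hand gone -> 610: end of utterance
            return buffered
        if capturing and audio_chunk is not None:
            buffered.append(audio_chunk)
    return buffered if capturing else []
```

Only audio arriving while the hand is held over the device is buffered and returned for recognition; audio before activation is discarded.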
- by activating the ASR system responsive to detecting a hand over a portion of the device, the method 600 enables a user to activate speech recognition for a voice command without having to speak an activation keyword or locate and press a dedicated button. As a result, speech recognition can be activated conveniently and safely, such as when the user is operating a vehicle. In addition, false activation of the ASR system can be reduced as compared to a system that instead uses keyword detection to activate speech recognition.
- the method 500 of FIG. 5 , the method 600 of FIG. 6 , or both, may be implemented by a field-programmable gate array (FPGA) device, an application-specific integrated circuit (ASIC), a processing unit such as a central processing unit (CPU), a DSP, a controller, another hardware device, firmware device, or any combination thereof.
- the method 500 of FIG. 5 , the method 600 of FIG. 6 , or both may be performed by a processor that executes instructions, such as described with reference to FIG. 9 .
- FIG. 7 depicts an example of an implementation 700 of the hand detector 130 and the ASR system 140 integrated into a vehicle dashboard device, such as a car dashboard device 702.
- a visual interface device such as the screen 110 (e.g., a touchscreen display), is mounted within the car dashboard device 702 to be visible to a driver of the car.
- the microphone 112 and one or more sensors 120 are also mounted in the car dashboard device 702, although in other implementations one or more of the microphone 112 and the sensor(s) 120 can be located elsewhere in the vehicle, such as the microphone 112 in a steering wheel or near the driver's head.
- the hand detector 130 and the ASR system 140 are illustrated with dashed borders to indicate that the hand detector 130 and the ASR system 140 are not visible to occupants of the vehicle.
- the hand detector 130 and the ASR system 140 may be implemented in a device that also includes the microphone 112 and sensor(s) 120 such as in the device 102 of FIGS. 1-3 , or may be separate from and coupled to the microphone 112 and sensor(s) 120, such as in the device 402 of FIG. 4 .
- multiple microphones 112 and sets of sensors 120 are integrated into the vehicle.
- a microphone and set of sensors can be positioned at each passenger seat, such as at an armrest control panel or seat-back display device, to enable each passenger to enter voice commands using hand-over-device detection.
- each passenger's voice command may be routed to a common ASR system 140; in other implementations, the vehicle includes multiple ASR systems 140 to enable concurrent processing of voice commands from multiple occupants of the vehicle.
- FIG. 8 depicts an example of an implementation 800 of the hand detector 130 and the ASR system 140 integrated into a headset 802, such as a virtual reality or augmented reality headset.
- the screen 110 is positioned in front of the user's eyes to enable display of augmented reality or virtual reality images or scenes to the user while the headset 802 is worn, and the sensor(s) 120 are positioned to detect when the user's hand is over (e.g., in front of) the screen 110 to initiate ASR recognition.
- the microphone 112 is located to receive the user's voice while the headset 802 is worn. While wearing the headset 802, the user can lift a hand in front of the screen 110 to indicate to the headset 802 that the user is about to speak a voice command to activate ASR, and can lower the hand to indicate that the user has finished speaking the voice command.
- FIG. 9 depicts a block diagram of a particular illustrative implementation of a device 900 that includes the hand detector 130 and the ASR engine 330, such as in a wireless communication device implementation (e.g., a smartphone).
- the device 900 may have more or fewer components than illustrated in FIG. 9 .
- the device 900 may correspond to the device 102.
- the device 900 may perform one or more operations described with reference to FIGS. 1-8 .
- the device 900 includes a processor 906 (e.g., a central processing unit (CPU)).
- the device 900 may include one or more additional processors 910 (e.g., one or more DSPs).
- the processors 910 may include a speech and music coder-decoder (CODEC) 908 and the hand detector 130.
- the speech and music codec 908 may include a voice coder ("vocoder") encoder 936, a vocoder decoder 938, or both.
- the device 900 may include a memory 986 and a CODEC 934.
- the memory 986 may include instructions 956 that are executable by the one or more additional processors 910 (or the processor 906) to implement the functionality described with reference to the hand detector 130, the ASR engine 330, the ASR system 140 of FIG. 1 , the activation circuitry 302, or any combination thereof.
- the device 900 may include a wireless controller 940 coupled, via a transceiver 950, to an antenna 952.
- the device 900 may include a display 928 (e.g., the screen 110) coupled to a display controller 926.
- the speaker 350 and the microphone 112 may be coupled to the CODEC 934.
- the CODEC 934 may include a digital-to-analog converter 902 and an analog-to-digital converter 904.
- the CODEC 934 may receive analog signals from the microphone 112, convert the analog signals to digital signals using the analog-to-digital converter 904, and provide the digital signals to the speech and music codec 908.
- the speech and music codec 908 may process the digital signals, and the digital signals may further be processed by the ASR engine 330.
- the speech and music codec 908 may provide digital signals to the CODEC 934.
- the CODEC 934 may convert the digital signals to analog signals using the digital-to-analog converter 902 and may provide the analog signals to the speaker 350.
- the device 900 may be included in a system-in-package or system-on-chip device 922.
- the memory 986, the processor 906, the processors 910, the display controller 926, the CODEC 934, and the wireless controller 940 are included in a system-in-package or system-on-chip device 922.
- an input device 930 (e.g., one or more of the sensor(s) 120) and a power supply 944 are coupled to the system-on-chip device 922.
- each of the display 928, the input device 930, the speaker 350, the microphone 112, the antenna 952, and the power supply 944 is external to the system-on-chip device 922.
- each of the display 928, the input device 930, the speaker 350, the microphone 112, the antenna 952, and the power supply 944 may be coupled to a component of the system-on-chip device 922, such as an interface or a controller.
- the device 900 may include a smart speaker (e.g., the processor 906 may execute the instructions 956 to run the voice-controlled digital assistant application 340), a speaker bar, a mobile communication device, a smart phone, a cellular phone, a laptop computer, a computer, a tablet, a personal digital assistant, a display device, a television, a gaming console, a music player, a radio, a digital video player, a digital video disc (DVD) or Blu-ray disc player, a tuner, a camera, a navigation device, a virtual reality or augmented reality headset, a vehicle console device, or any combination thereof.
- an apparatus to process an audio signal representing input sound includes means for detecting at least a portion of a hand over at least a portion of a device.
- the means for detecting the portion of the hand can correspond to the hand detector 130, the hand pattern detector 230, the hand temperature detector 234, the hand distance detector 236, one or more other circuits or components configured to detect at least a portion of a hand over at least a portion of a device, or any combination thereof.
- the apparatus also includes means for processing the audio signal.
- the means for processing is configured to be activated responsive to detection of the portion of the hand over the portion of the device.
- the means for processing the audio signal can correspond to the ASR system 140, the ASR engine 330, the microphone 112, the CODEC 934, the speech and music codec 908, one or more other circuits or components configured to process the audio signal and activated responsive to detection of the portion of the hand over the portion of the device, or any combination thereof.
- the apparatus includes means for displaying information, and the means for detecting is configured to detect the portion of the hand over the means for displaying information.
- the means for displaying information can include the screen 110, the display 928, the display controller 926, one or more other circuits or components configured to display information, or any combination thereof.
- the apparatus may also include means for generating the audio signal based on the input sound, the means for generating configured to be activated responsive to detection of the portion of the hand over the means for displaying information.
- the means for generating the audio signal can correspond to the microphone 112, a microphone array, the CODEC 934, the speech and music codec 908, one or more other circuits or components configured to generate the audio signal based on the input sound and to be activated responsive to the first indication, or any combination thereof.
- the apparatus includes means for generating image data, and the means for detecting is configured to determine whether the image data includes a hand pattern, such as the hand pattern detector 230.
- the apparatus includes at least one of: means for detecting a temperature associated with the portion of the hand (e.g., the hand temperature detector 234, the infrared sensor 208, or a combination thereof), and means for detecting a distance of the portion of the hand from the device (e.g., the hand distance detector 236, the ultrasound sensor 210, a camera array, a structured light projector, one or more other mechanisms for detecting a distance of the portion of the hand from the device, or any combination thereof).
- a non-transitory computer-readable medium (e.g., the memory 986) includes instructions (e.g., the instructions 956) that, when executed by one or more processors of a device (e.g., the processor 906, the processor(s) 910, or any combination thereof), cause the one or more processors to perform operations for processing an audio signal representing input sound.
- the operations include detecting at least a portion of a hand over at least a portion of the device (e.g., at the hand detector 130).
- detecting the portion of the hand over the portion of the device can include receiving the sensor data 122, processing the sensor data 122 using one or more detectors (e.g., the hand pattern detector 230, the hand temperature detector 234, or the hand distance detector 236) to determine whether one or more detection criteria are met, and generating the indication 132 at least partially in response to detection that the one or more criteria are met (e.g., as described with reference to the activation signal unit 240).
- processing the sensor data 122 to determine whether a detection criterion is met includes applying a neural network classifier (e.g., as described with reference to the hand pattern detector 230) that is trained to recognize the hand pattern 232 to process the image data 212 or applying one or more filters to the image data 212 to detect the hand pattern 232.
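As an illustrative stand-in for the filter-based option above, the sketch below slides a binary hand template over a low-resolution luminance image and reports a match when normalized cross-correlation exceeds a threshold; it is a crude approximation for illustration, not the patent's detector.

```python
def correlate(patch, template):
    """Normalized cross-correlation of two equal-size 2D lists."""
    flat_p = [v for row in patch for v in row]
    flat_t = [v for row in template for v in row]
    mp = sum(flat_p) / len(flat_p)
    mt = sum(flat_t) / len(flat_t)
    num = sum((p - mp) * (t - mt) for p, t in zip(flat_p, flat_t))
    dp = sum((p - mp) ** 2 for p in flat_p) ** 0.5
    dt = sum((t - mt) ** 2 for t in flat_t) ** 0.5
    return num / (dp * dt) if dp and dt else 0.0

def detect_hand_pattern(image, template, threshold=0.8):
    """Slide the template over every position and report whether any
    window correlates above the threshold."""
    th, tw = len(template), len(template[0])
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            patch = [row[x:x + tw] for row in image[y:y + th]]
            if correlate(patch, template) >= threshold:
                return True
    return False
```

A neural network classifier trained on hand images would replace the fixed template with learned features, at the cost of training data and compute.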
- the operations also include, responsive to detecting the portion of the hand over the portion of the device, activating an automatic speech recognition system to process the audio signal.
- activating the automatic speech recognition can include detecting the indication 132 at an input to the ASR system 140 and, in response to detecting the indication 132, performing at least one of a power-up or clock activation for at least one component (e.g., the buffer 320, the ASR engine 330) of the ASR system 140.
- a software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of non-transient storage medium known in the art.
- An exemplary storage medium is coupled to the processor such that the processor may read information from, and write information to, the storage medium.
- the storage medium may be integral to the processor.
- the processor and the storage medium may reside in an application-specific integrated circuit (ASIC).
- the ASIC may reside in a computing device or a user terminal.
- the processor and the storage medium may reside as discrete components in a computing device or user terminal.
Claims (12)
- Device (102) for processing an audio signal representing input sound (106), the device (102) comprising: one or more sensors (120) coupled to a hand detector and configured to provide sensor data to the hand detector; a hand detector (130); and an automatic speech recognition system (140) that includes a buffer and an automatic speech recognition engine; wherein the hand detector (130) is configured to: generate a first indication in response to detecting, via the one or more sensors (120), that at least a portion of a hand is within a range of 10 cm to 30 cm of the one or more sensors (120); and generate a second indication in response to detecting that the portion of the hand is no longer within a range of 10 cm to 30 cm of the one or more sensors (120), wherein the second indication corresponds to an end-of-utterance signal that causes the automatic speech recognition engine (330) to begin processing audio data from the buffer (320); and wherein the automatic speech recognition system (140) is configured to initiate buffering of the audio signal in response to the first indication and to process the audio signal, the audio signal comprising the input sound (106) received between generation of the first indication and generation of the second indication.
- Device (102) according to claim 1, further comprising: a screen (110), wherein the hand detector (130) is configured to generate the first indication in response to detecting at least a portion of the hand over the screen (110); and a microphone (112) configured to be activated in response to the first indication to generate the audio signal based on the input sound (106).
- Device (102) according to claim 2, wherein the hand detector (130) is configured to generate the first indication in response to detecting that the portion of the hand is at a distance of 10 centimeters to 30 centimeters from the screen (110).
- Device (102) according to claim 1, wherein the one or more sensors (120) comprise a camera (202) configured to provide image data to the hand detector (130), the camera (202) preferably comprising a low-power ambient light sensor (204) configured to generate the image data.
- Device (102) according to claim 1, wherein the hand detector (130) includes a hand pattern detector (230) configured to process the image data to determine whether the image data includes a hand pattern.
- Device (102) according to claim 5, wherein the one or more sensors (120) further comprise an infrared sensor (208).
- Device (102) according to claim 6, wherein the hand detector (130) further includes a hand temperature detector (234) configured to process infrared sensor data from the infrared sensor (208).
- Device (102) according to claim 1, further comprising activation circuitry (302) coupled to the hand detector (130) and configured to activate the automatic speech recognition system (140) in response to receiving the first indication.
- Device (102) according to claim 1, wherein the hand detector (130) and the automatic speech recognition system (140) are integrated into a vehicle.
- Device (102) according to claim 1, wherein the hand detector (130) and the automatic speech recognition system (140) are integrated into a portable communication device or into a virtual reality or augmented reality headset.
- Method for processing an audio signal representing input sound, the method comprising: generating a first indication in response to detecting (604), via one or more sensors of a device, that at least a portion of a hand is within a range of 10 cm to 30 cm of the one or more sensors; activating (606), in response to the generated first indication, an automatic speech recognition system to initiate buffering of the audio signal; generating a second indication in response to detecting (608), at the device, that the portion of the hand is no longer within a range of 10 cm to 30 cm of the one or more sensors, the generated second indication corresponding to an end-of-utterance signal; and activating, in response to the generated second indication, an automatic speech recognition engine to begin processing the buffered audio signal; wherein the buffered audio signal comprises the input sound received between generation of the first indication and generation of the second indication.
- Non-transitory computer-readable medium comprising instructions that, when executed by one or more processors of a device, the device comprising an automatic speech recognition system with a buffer and an automatic speech recognition engine, a hand detector, and one or more sensors coupled to the hand detector and configured to provide sensor data to the hand detector, cause the one or more processors to perform the method according to claim 11.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/526,608 US11437031B2 (en) | 2019-07-30 | 2019-07-30 | Activating speech recognition based on hand patterns detected using plurality of filters |
| PCT/US2020/044127 WO2021021970A1 (en) | 2019-07-30 | 2020-07-30 | Activating speech recognition |
Publications (3)
| Publication Number | Publication Date |
|---|---|
| EP4004908A1 EP4004908A1 (de) | 2022-06-01 |
| EP4004908B1 true EP4004908B1 (de) | 2024-10-09 |
| EP4004908C0 EP4004908C0 (de) | 2024-10-09 |
Family
ID=72087256
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| EP20757126.6A Active EP4004908B1 (de) | 2019-07-30 | 2020-07-30 | Aktivierung von spracherkennung |
Country Status (9)
| Country | Link |
|---|---|
| US (1) | US11437031B2 (de) |
| EP (1) | EP4004908B1 (de) |
| JP (1) | JP7645230B2 (de) |
| KR (1) | KR102926603B1 (de) |
| CN (1) | CN114144831B (de) |
| BR (1) | BR112022000922A2 (de) |
| PH (1) | PH12021553299A1 (de) |
| TW (1) | TWI871343B (de) |
| WO (1) | WO2021021970A1 (de) |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR20210015234A (ko) * | 2019-08-01 | 2021-02-10 | 삼성전자주식회사 | Electronic device and method of controlling execution of a function according to a voice command |
| US11682391B2 (en) * | 2020-03-30 | 2023-06-20 | Motorola Solutions, Inc. | Electronic communications device having a user interface including a single input interface for electronic digital assistant and voice control access |
| US11862189B2 (en) * | 2020-04-01 | 2024-01-02 | Qualcomm Incorporated | Method and apparatus for target sound detection |
| US11590929B2 (en) * | 2020-05-05 | 2023-02-28 | Nvidia Corporation | Systems and methods for performing commands in a vehicle using speech and image recognition |
| CN116803110A (zh) * | 2020-12-22 | 2023-09-22 | 塞伦妮经营公司 | Platform for integrating heterogeneous ecosystems within a vehicle |
| KR20230092180A (ko) * | 2021-12-17 | 2023-06-26 | 현대자동차주식회사 | Vehicle and control method thereof |
| US12412565B2 (en) * | 2022-01-28 | 2025-09-09 | Syntiant Corp. | Prediction based wake-word detection and methods for use therewith |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6532447B1 (en) * | 1999-06-07 | 2003-03-11 | Telefonaktiebolaget Lm Ericsson (Publ) | Apparatus and method of controlling a voice controlled operation |
| EP2144140A2 (de) * | 2008-07-08 | 2010-01-13 | LG Electronics Inc. | Mobiles Endgerät und Texteingabeverfahren dafür |
| KR20150059955A (ko) * | 2013-11-25 | 2015-06-03 | 현대자동차주식회사 | Speech recognition apparatus, vehicle having the same, and method thereof |
| US20160180846A1 (en) * | 2014-12-17 | 2016-06-23 | Hyundai Motor Company | Speech recognition apparatus, vehicle including the same, and method of controlling the same |
Family Cites Families (18)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020077830A1 (en) * | 2000-12-19 | 2002-06-20 | Nokia Corporation | Method for activating context sensitive speech recognition in a terminal |
| KR20090107364A (ko) * | 2008-04-08 | 2009-10-13 | 엘지전자 주식회사 | Mobile terminal and menu control method thereof |
| US8958848B2 (en) | 2008-04-08 | 2015-02-17 | Lg Electronics Inc. | Mobile terminal and menu control method thereof |
| KR20100007625A (ko) * | 2008-07-14 | 2010-01-22 | 엘지전자 주식회사 | Mobile terminal and menu display method thereof |
| KR101537693B1 (ko) * | 2008-11-24 | 2015-07-20 | 엘지전자 주식회사 | Terminal and control method thereof |
| JP5229083B2 (ja) * | 2009-04-14 | 2013-07-03 | ソニー株式会社 | Information processing apparatus, information processing method, and program |
| US9551590B2 (en) * | 2009-08-28 | 2017-01-24 | Robert Bosch Gmbh | Gesture-based information and command entry for motor vehicle |
| KR101795574B1 (ko) * | 2011-01-06 | 2017-11-13 | 삼성전자주식회사 | Electronic device controlled by motion and control method thereof |
| JP2013080015A (ja) | 2011-09-30 | 2013-05-02 | Toshiba Corp | Speech recognition apparatus and speech recognition method |
| JP6211256B2 (ja) | 2012-09-26 | 2017-10-11 | 株式会社ナビタイムジャパン | Information processing apparatus, information processing method, and information processing program |
| JP6030430B2 (ja) * | 2012-12-14 | 2016-11-24 | クラリオン株式会社 | Control device, vehicle, and portable terminal |
| US10272920B2 (en) * | 2013-10-11 | 2019-04-30 | Panasonic Intellectual Property Corporation Of America | Processing method, program, processing apparatus, and detection system |
| CN105373227B (zh) * | 2015-10-29 | 2019-03-26 | 小米科技有限责任公司 | Method and apparatus for intelligently switching off an electronic device |
| CN105869637B (zh) * | 2016-05-26 | 2019-10-15 | 百度在线网络技术(北京)有限公司 | Voice wake-up method and apparatus |
| CN107197090B (zh) * | 2017-05-18 | 2020-07-14 | 维沃移动通信有限公司 | Voice signal receiving method and mobile terminal |
| CN207758675U (zh) | 2017-12-29 | 2018-08-24 | 广州视声光电有限公司 | Trigger-type vehicle-mounted rearview mirror |
| JP7091983B2 (ja) * | 2018-10-01 | 2022-06-28 | トヨタ自動車株式会社 | Device control apparatus |
| CN209571226U (zh) * | 2018-12-20 | 2019-11-01 | 深圳市朗强科技有限公司 | Speech recognition apparatus and system |
- 2019
- 2019-07-30 US US16/526,608 patent/US11437031B2/en active Active
- 2020
- 2020-07-30 TW TW109125736A patent/TWI871343B/zh active
- 2020-07-30 CN CN202080052825.1A patent/CN114144831B/zh active Active
- 2020-07-30 KR KR1020227002030A patent/KR102926603B1/ko active Active
- 2020-07-30 JP JP2022504699A patent/JP7645230B2/ja active Active
- 2020-07-30 BR BR112022000922A patent/BR112022000922A2/pt unknown
- 2020-07-30 WO PCT/US2020/044127 patent/WO2021021970A1/en not_active Ceased
- 2020-07-30 EP EP20757126.6A patent/EP4004908B1/de active Active
- 2020-07-30 PH PH1/2021/553299A patent/PH12021553299A1/en unknown
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6532447B1 (en) * | 1999-06-07 | 2003-03-11 | Telefonaktiebolaget Lm Ericsson (Publ) | Apparatus and method of controlling a voice controlled operation |
| EP2144140A2 (de) * | 2008-07-08 | 2010-01-13 | LG Electronics Inc. | Mobiles Endgerät und Texteingabeverfahren dafür |
| KR20150059955A (ko) * | 2013-11-25 | 2015-06-03 | 현대자동차주식회사 | Speech recognition apparatus, vehicle having the same, and method thereof |
| US20160180846A1 (en) * | 2014-12-17 | 2016-06-23 | Hyundai Motor Company | Speech recognition apparatus, vehicle including the same, and method of controlling the same |
Also Published As
| Publication number | Publication date |
|---|---|
| US11437031B2 (en) | 2022-09-06 |
| US20210035571A1 (en) | 2021-02-04 |
| JP2022543201A (ja) | 2022-10-11 |
| CN114144831A (zh) | 2022-03-04 |
| PH12021553299A1 (en) | 2022-08-01 |
| KR20220041831A (ko) | 2022-04-01 |
| JP7645230B2 (ja) | 2025-03-13 |
| WO2021021970A1 (en) | 2021-02-04 |
| TW202121115A (zh) | 2021-06-01 |
| BR112022000922A2 (pt) | 2022-03-08 |
| TWI871343B (zh) | 2025-02-01 |
| CN114144831B (zh) | 2025-07-25 |
| KR102926603B1 (ko) | 2026-02-11 |
| EP4004908C0 (de) | 2024-10-09 |
| EP4004908A1 (de) | 2022-06-01 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP4004908B1 (de) | Activation of speech recognition | |
| JP7646063B2 (ja) | Voice trigger for a digital assistant | |
| KR102216048B1 (ko) | Apparatus and method for recognizing voice commands | |
| EP3179474B1 (de) | User-focus-activated speech recognition | |
| CN111833872B (zh) | Voice control method, apparatus, device, system, and medium for an elevator | |
| CN113380275B (zh) | Speech processing method, apparatus, smart device, and storage medium | |
| WO2014130463A2 (en) | Hybrid performance scaling or speech recognition | |
| CN114220420A (zh) | Multimodal voice wake-up method, apparatus, and computer-readable storage medium | |
| US20200319841A1 (en) | Agent apparatus, agent apparatus control method, and storage medium | |
| CN113160802B (zh) | Speech processing method, apparatus, device, and storage medium | |
| CN111681654A (zh) | Voice control method, apparatus, electronic device, and storage medium | |
| CN115331672B (zh) | Device control method, apparatus, electronic device, and storage medium | |
| US11562741B2 (en) | Electronic device and controlling method using non-speech audio signal in the electronic device | |
| CN116189718B (zh) | Voice activity detection method, apparatus, device, and storage medium | |
| KR20240099616A (ko) | Speech recognition apparatus and method with barge-in function | |
| JP2026016007A (ja) | Information processing apparatus, system, information processing method, and program |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: UNKNOWN |
|
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
|
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
| 17P | Request for examination filed |
Effective date: 20220112 |
|
| AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
| DAV | Request for validation of the european patent (deleted) | ||
| DAX | Request for extension of the european patent (deleted) | ||
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
| 17Q | First examination report despatched |
Effective date: 20230601 |
|
| GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: GRANT OF PATENT IS INTENDED |
|
| RIC1 | Information provided on ipc code assigned before grant |
Ipc: G06F 3/01 20060101ALN20240410BHEP Ipc: G06V 40/10 20220101ALI20240410BHEP Ipc: G06V 20/20 20220101ALI20240410BHEP Ipc: G06V 10/143 20220101ALI20240410BHEP Ipc: G06F 3/16 20060101ALI20240410BHEP Ipc: G10L 15/22 20060101AFI20240410BHEP |
|
| INTG | Intention to grant announced |
Effective date: 20240423 |
|
| GRAJ | Information related to disapproval of communication of intention to grant by the applicant or resumption of examination proceedings by the epo deleted |
Free format text: ORIGINAL CODE: EPIDOSDIGR1 |
|
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
| GRAS | Grant fee paid |
Free format text: ORIGINAL CODE: EPIDOSNIGR3 |
|
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: GRANT OF PATENT IS INTENDED |
|
| GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
| GRAA | (expected) grant |
Free format text: ORIGINAL CODE: 0009210 |
|
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE PATENT HAS BEEN GRANTED |
|
| INTG | Intention to grant announced |
Effective date: 20240826 |
|
| RIC1 | Information provided on ipc code assigned before grant |
Ipc: G06F 3/01 20060101ALN20240820BHEP Ipc: G06V 40/10 20220101ALI20240820BHEP Ipc: G06V 20/20 20220101ALI20240820BHEP Ipc: G06V 10/143 20220101ALI20240820BHEP Ipc: G06F 3/16 20060101ALI20240820BHEP Ipc: G10L 15/22 20060101AFI20240820BHEP |
|
| AK | Designated contracting states |
Kind code of ref document: B1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
| REG | Reference to a national code |
Ref country code: CH Ref legal event code: EP |
|
| REG | Reference to a national code |
Ref country code: DE Ref legal event code: R096 Ref document number: 602020039126 Country of ref document: DE |
|
| REG | Reference to a national code |
Ref country code: IE Ref legal event code: FG4D |
|
| U01 | Request for unitary effect filed |
Effective date: 20241022 |
|
| U07 | Unitary effect registered |
Designated state(s): AT BE BG DE DK EE FI FR IT LT LU LV MT NL PT RO SE SI Effective date: 20241105 |
|
| PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: IS Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20250209 Ref country code: HR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20241009 |
|
| PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: ES Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20241009 |
|
| PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: NO Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20250109 |
|
| PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: GR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20250110 |
|
| PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: PL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20241009 |
|
| PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: RS Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20250109 |
|
| PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SM Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20241009 |
|
| PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: GB Payment date: 20250612 Year of fee payment: 6 |
|
| U20 | Renewal fee for the european patent with unitary effect paid |
Year of fee payment: 6 Effective date: 20250617 |
|
| PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20241009 |
|
| PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: CZ Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20241009 |
|
| PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: IE Payment date: 20250612 Year of fee payment: 6 |
|
| PLBE | No opposition filed within time limit |
Free format text: ORIGINAL CODE: 0009261 |
|
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
|
| 26N | No opposition filed |
Effective date: 20250710 |
|
| REG | Reference to a national code |
Ref country code: CH Ref legal event code: H13 Free format text: ST27 STATUS EVENT CODE: U-0-0-H10-H13 (AS PROVIDED BY THE NATIONAL OFFICE) Effective date: 20260224 |