EP4226371A1 - User voice activity detection using dynamic classifier - Google Patents

User voice activity detection using dynamic classifier

Info

Publication number
EP4226371A1
EP4226371A1 EP21790049.7A EP21790049A EP4226371A1 EP 4226371 A1 EP4226371 A1 EP 4226371A1 EP 21790049 A EP21790049 A EP 21790049A EP 4226371 A1 EP4226371 A1 EP 4226371A1
Authority
EP
European Patent Office
Prior art keywords
audio data
microphone
processors
dynamic classifier
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP21790049.7A
Other languages
German (de)
English (en)
Inventor
Taher Shahbazi Mirzahasanloo
Rogerio Guedes Alves
Erik Visser
Lae-Hoon Kim
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qualcomm Inc
Original Assignee
Qualcomm Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qualcomm Inc

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/12 Speech classification or search using dynamic programming techniques, e.g. dynamic time warping [DTW]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3206 Monitoring of events, devices or parameters that trigger a change in power modality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234 Power saving characterised by the action undertaken
    • G06F1/3287 Power saving characterised by the action undertaken by switching off individual functional units in the computer system
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/02 Casings; Cabinets; Supports therefor; Mountings therein
    • H04R1/028 Casings; Cabinets; Supports therefor; Mountings therein associated with devices performing functions other than acoustics, e.g. electric candles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision
    • G10L2025/786 Adaptive threshold
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2420/00 Details of connection covered by H04R, not provided for in its groups
    • H04R2420/07 Applications of wireless loudspeakers or wireless microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure is generally related to self-voice activity detection.
  • Such computing devices often incorporate functionality to receive an audio signal from one or more microphones.
  • the audio signal may represent user speech captured by the microphones, external sounds captured by the microphones, or a combination thereof.
  • a headset device may include self-voice activity detection in an effort to distinguish between the user’s speech (e.g., speech spoken by the person wearing the headset) and speech originating from other sources.
  • self-voice activity detection can reduce “false alarms” in which activation of one or more components or operations is initiated based on speech originating from nearby people (referred to as “non-user speech”). Reducing such false alarms improves power consumption efficiency of the device.
  • performing audio signal processing to distinguish between user speech and non-user speech also consumes power, and conventional techniques to improve the accuracy of the device in distinguishing between user speech and non-user speech also tend to increase the power consumption and processing resource requirements of the device.
  • a device includes a memory configured to store instructions and one or more processors configured to execute the instructions.
  • the one or more processors are configured to execute the instructions to receive audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone.
  • the one or more processors are also configured to execute the instructions to provide the audio data to a dynamic classifier.
  • the dynamic classifier is configured to generate a classification output corresponding to the audio data.
  • the one or more processors are further configured to execute the instructions to determine, at least partially based on the classification output, whether the audio data corresponds to user voice activity.
  • a method includes receiving, at one or more processors, audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone. The method further includes providing, at the one or more processors, the audio data to a dynamic classifier to generate a classification output corresponding to the audio data. The method also includes determining, at the one or more processors and at least partially based on the classification output, whether the audio data corresponds to user voice activity.
  • a non-transitory computer-readable medium includes instructions that, when executed by one or more processors, cause the one or more processors to receive audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone.
  • the instructions when executed by the one or more processors, further cause the one or more processors to provide the audio data to a dynamic classifier to generate a classification output corresponding to the audio data.
  • the instructions when executed by the one or more processors, also cause the one or more processors to determine, at least partially based on the classification output, whether the audio data corresponds to user voice activity.
  • an apparatus includes means for receiving audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone.
  • the apparatus further includes means for generating, at a dynamic classifier, a classification output corresponding to the audio data.
  • the apparatus also includes means for determining, at least partially based on the classification output, whether the audio data corresponds to user voice activity.
  • FIG. 1 is a block diagram of a particular illustrative aspect of a system operable to perform self-voice activity detection, in accordance with some examples of the present disclosure.
  • FIG. 2 is a diagram of an illustrative aspect of operations associated with self-voice activity detection, in accordance with some examples of the present disclosure.
  • FIG. 3 is a block diagram of an illustrative aspect of a system operable to perform self-voice activity detection, in accordance with some examples of the present disclosure.
  • FIG. 4 is a diagram of an illustrative aspect of operation of components of the system of FIG. 1, in accordance with some examples of the present disclosure.
  • FIG. 5 illustrates an example of an integrated circuit that includes a dynamic classifier to detect user voice activity, in accordance with some examples of the present disclosure.
  • FIG. 6 is a diagram of a mobile device that includes a dynamic classifier to detect user voice activity, in accordance with some examples of the present disclosure.
  • FIG. 7 is a diagram of a headset that includes a dynamic classifier to detect user voice activity, in accordance with some examples of the present disclosure.
  • FIG. 8 is a diagram of a wearable electronic device that includes a dynamic classifier to detect user voice activity, in accordance with some examples of the present disclosure.
  • FIG. 9 is a diagram of a voice-controlled speaker system that includes a dynamic classifier to detect user voice activity, in accordance with some examples of the present disclosure.
  • FIG. 10 is a diagram of a camera that includes a dynamic classifier to detect user voice activity, in accordance with some examples of the present disclosure.
  • FIG. 11 is a diagram of a headset, such as a virtual reality or augmented reality headset, that includes a dynamic classifier to detect user voice activity, in accordance with some examples of the present disclosure.
  • FIG. 12 is a diagram of a first example of a vehicle that includes a dynamic classifier to detect user voice activity, in accordance with some examples of the present disclosure.
  • FIG. 13 is a diagram of a second example of a vehicle that includes a dynamic classifier to detect user voice activity, in accordance with some examples of the present disclosure.
  • FIG. 14A is a diagram of a particular implementation of a method of self-voice activity detection that may be performed by the device of FIG. 1, in accordance with some examples of the present disclosure.
  • FIG. 14B is a diagram of another particular implementation of a method of self-voice activity detection that may be performed by the device of FIG. 1, in accordance with some examples of the present disclosure.
  • FIG. 15 is a block diagram of a particular illustrative example of a device that is operable to perform self-voice activity detection, in accordance with some examples of the present disclosure.
  • Self-voice activity detection (SVAD)
  • conventional audio signal processing techniques to improve SVAD accuracy also increase power consumption and processing resources of the device while performing the improved-accuracy techniques. Since SVAD processing is typically continually operating, even while the device is in a low-power or sleep mode, the reduction in power consumption due to reducing false alarms using conventional SVAD techniques can be partially or fully offset by increased power consumption associated with the SVAD processing itself.
  • audio signals may be received from a first microphone that is positioned to capture the user’s voice and from a second microphone that is positioned to capture external sounds, such as to perform noise reduction and echo cancellation.
  • the audio signals may be processed to extract frequency domain feature sets including interaural phase differences (“IPDs”) and interaural intensity differences (“IIDs”).
  • the dynamic classifier processes the extracted frequency domain feature sets and generates an output indicating classification of the feature sets.
  • the dynamic classifier may perform adaptive clustering of the feature data and adjustment of a decision boundary between the two most discriminative categories of the feature data space to distinguish between feature sets corresponding to user voice activity and feature sets corresponding to other audio activity.
  • the dynamic classifier is implemented using self-organizing maps.
  • the dynamic classifier enables discrimination using the extracted feature sets to actively respond and adapt to various conditions, such as: environmental conditions in highly nonstationary situations; mismatched microphones; changes in user headset fitting; different user head-related transfer functions (“HRTFs”); direction-of-arrival (“DOA”) tracking of non-user signals; noise floor, bias, and sensitivities of microphones across the frequency spectrum; or a combination thereof.
  • the dynamic classifier enables adaptive feature mapping capable of responding to such variations and reducing or minimizing a number of thresholding parameters used and an amount of headset tuning by customers.
  • the dynamic classifier enables effective discrimination between user voice activity and other audio activity with high accuracy under varying conditions and with relatively low power consumption as compared to conventional SVAD systems that provide comparable accuracy.
  • FIG. 1 depicts a device 102 including one or more processors (“processor(s)” 190 of FIG. 1), which indicates that in some implementations the device 102 includes a single processor 190 and in other implementations the device 102 includes multiple processors 190.
  • an ordinal term e.g., "first,” “second,” “third,” etc.
  • an element such as a structure, a component, an operation, etc.
  • the term “set” refers to one or more of a particular element
  • the term “plurality” refers to multiple (e.g., two or more) of a particular element.
  • “coupled” may include “communicatively coupled,” “electrically coupled,” or “physically coupled,” and may also (or alternatively) include any combinations thereof.
  • Two devices (or components) may be coupled (e.g., communicatively coupled, electrically coupled, or physically coupled) directly or indirectly via one or more other devices, components, wires, buses, networks (e.g., a wired network, a wireless network, or a combination thereof), etc.
  • Two devices (or components) that are electrically coupled may be included in the same device or in different devices and may be connected via electronics, one or more connectors, or inductive coupling, as illustrative, non-limiting examples.
  • two devices may send and receive signals (e.g., digital signals or analog signals) directly or indirectly, via one or more wires, buses, networks, etc.
  • directly coupled may include two devices that are coupled (e.g., communicatively coupled, electrically coupled, or physically coupled) without intervening components.
  • determining may be used to describe how one or more operations are performed. It should be noted that such terms are not to be construed as limiting and other techniques may be utilized to perform similar operations. Additionally, as referred to herein, “generating,” “calculating,” “estimating,” “using,” “selecting,” “accessing,” and “determining” may be used interchangeably. For example, “generating,” “calculating,” “estimating,” or “determining” a parameter (or a signal) may refer to actively generating, estimating, calculating, or determining the parameter (or the signal) or may refer to using, selecting, or accessing the parameter (or signal) that is already generated, such as by another component or device.
  • the system 100 includes a device 102 that is coupled to a first microphone 110, a second microphone 120, and a second device 160.
  • the device 102 is configured to perform self-voice activity detection of sounds captured by the microphones 110, 120 using a dynamic classifier 140.
  • the first microphone 110 e.g., a “primary” microphone
  • the second microphone 120 e.g., a “secondary” microphone
  • the device 102 corresponds to a standalone voice assistant (e.g., including a loudspeaker with microphones, as described further with reference to FIG.
  • the device 102 may be configured to detect speech from the person closest to the primary microphone as self-voice activity, even though the person may be relatively remote from the primary microphone as compared to in a headset implementation.
  • self-voice activity detection is used interchangeably with “user voice activity detection” to indicate distinguishing between speech (e.g., voice or utterance) of a user of the device 102 (e.g., “user voice activity”) as compared to sounds that do not originate from a user of the device (e.g., “other audio activity”).
  • the device 102 includes a first input interface 114, a second input interface 124, one or more processors 190, and a modem 170.
  • the first input interface 114 is coupled to the processor 190 and configured to be coupled to the first microphone 110.
  • the first input interface 114 is configured to receive a first microphone output 112 from the first microphone 110 and to provide the first microphone output 112 to the processor 190 as first audio data 116.
  • the second input interface 124 is coupled to the processor 190 and configured to be coupled to the second microphone 120.
  • the second input interface 124 is configured to receive a second microphone output 122 from the second microphone 120 and to provide the second microphone output 122 to the processor 190 as second audio data 126.
  • the processor 190 is coupled to the modem 170 and includes a feature extractor 130 and the dynamic classifier 140.
  • the processor 190 is configured to receive audio data 128 including the first audio data 116 corresponding to the first output 112 of the first microphone 110 and the second audio data 126 corresponding to the second output 122 of the second microphone 120.
  • the processor 190 is configured to process the audio data 128 at the feature extractor 130 to generate feature data 132.
  • the processor 190 is configured to process the first audio data 116 and the second audio data 126 prior to generating feature data 132.
  • the processor 190 is configured to perform echo-cancellation, noise suppression, or both, on the first audio data 116 and the second audio data 126.
  • the processor 190 is configured to transform the first audio data 116 and the second audio data 126 (e.g., a Fourier transform) to a transform domain prior to generating the feature data 132.
  • the processor 190 is configured to generate feature data 132 based on the first audio data 116 and the second audio data 126.
  • the feature data 132 includes at least one interaural phase difference 134 between the first audio data 116 and the second audio data 126 and at least one interaural intensity difference 136 between the first audio data 116 and the second audio data 126.
  • the feature data 132 includes interaural phase differences (IPDs) 134 for multiple frequencies and interaural intensity differences (IIDs) 136 for multiple frequencies.
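The feature extraction described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `extract_ipd_iid`, the Hann window, the 512-sample frame, and the 16 kHz sample rate are all assumptions chosen for the example.

```python
import numpy as np

def extract_ipd_iid(frame1, frame2, eps=1e-12):
    """Return per-frequency IPD (radians) and IID (dB) for one pair of frames.

    frame1 is assumed to come from the primary microphone and frame2 from
    the secondary microphone (hypothetical naming for this sketch).
    """
    window = np.hanning(len(frame1))
    spec1 = np.fft.rfft(frame1 * window)
    spec2 = np.fft.rfft(frame2 * window)
    # The phase of the cross-spectrum equals phase(spec1) - phase(spec2),
    # already wrapped to (-pi, pi].
    ipd = np.angle(spec1 * np.conj(spec2))
    # Level difference in dB; eps guards against log of zero.
    iid = 20.0 * np.log10((np.abs(spec1) + eps) / (np.abs(spec2) + eps))
    return ipd, iid

# Synthetic check: a 1 kHz tone that reaches the secondary microphone
# 0.3 rad later and at half the amplitude.
fs = 16000
t = np.arange(512) / fs
primary = np.sin(2 * np.pi * 1000 * t)
secondary = 0.5 * np.sin(2 * np.pi * 1000 * t - 0.3)
ipd, iid = extract_ipd_iid(primary, secondary)
k = 32  # FFT bin of 1 kHz: 1000 * 512 / 16000
print(ipd[k], iid[k])
```

At the tone's bin, the IPD is close to 0.3 rad and the IID is close to 6 dB, which matches the "earlier and louder at the primary microphone" pattern the text associates with user speech.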
  • the processor 190 is configured to process the feature data 132 at the dynamic classifier 140 to generate a classification output 142 of the feature data 132.
  • the dynamic classifier 140 is configured to adaptively cluster sets (e.g., samples) of the feature data 132 based on whether a sound represented in the audio data 128 originates from a source that is closer to the first microphone 110 than to the second microphone 120.
  • the dynamic classifier 140 may be configured to receive a sequence of samples of the feature data 132 and adaptively cluster the samples in a feature space containing IID and IPD frequency values.
  • the dynamic classifier 140 may also be configured to adjust a decision boundary between the two most discriminative categories of the feature space to distinguish between sets of feature data corresponding to user voice activity (e.g., an utterance 182 of a user 180) and sets of feature data corresponding to other audio activity.
  • the dynamic classifier 140 may be configured to classify incoming feature data into one of two classes (e.g., class 0 or class 1), where one of the two classes corresponds to user voice activity, and the other of the two classes corresponds to other audio activity.
  • the classification output 142 may include a single bit or flag that has one of two values: a first value (e.g., “0”) to indicate that the feature data 132 corresponds to one of the two classes; or a second value (e.g., “1”) to indicate that the feature data 132 corresponds to the other of the two classes.
  • the dynamic classifier 140 performs clustering and vector quantization.
  • clustering includes reducing (e.g., minimizing) the within-cluster sum of squares, defined as $\min \sum_{i} p_i \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2$, where $C_i$ represents cluster $i$, $p_i$ represents a weight assigned to cluster $i$, $x_j$ represents a node $j$ in the feature space, and $\mu_i$ represents the centroid of cluster $i$.
  • the cluster weight $p_i$ may be probabilistic, such as a prior cluster distribution; possibilistic, such as a confidence measure assigned to possibility of each cluster; or determined by any other factor that would enforce some form of non-uniform bias towards different clusters.
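As a rough illustration of the clustering objective, the sketch below evaluates a weighted within-cluster sum of squares for a fixed assignment of samples to clusters. The function name `weighted_wcss`, the hard cluster labels, and the equal weights are assumptions made for the example; the text does not prescribe how the weights are chosen.

```python
import numpy as np

def weighted_wcss(samples, labels, weights):
    """Sum over clusters of weight * squared distance of members to the centroid."""
    total = 0.0
    for i, p_i in enumerate(weights):
        members = samples[labels == i]
        if len(members) == 0:
            continue  # an empty cluster contributes nothing
        centroid = members.mean(axis=0)
        total += p_i * np.sum((members - centroid) ** 2)
    return total

samples = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
cost = weighted_wcss(samples, labels, weights=[0.5, 0.5])
print(cost)
```

A clustering procedure would search over assignments (and hence centroids) to reduce this cost; the sketch only scores one candidate assignment.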
  • Vector quantization includes reducing (e.g., minimizing) error by quantizing an input vector into a quantization weight vector defined by 1
  • the dynamic classifier 140 is configured to perform competitive learning in which units of quantization compete to absorb new samples of the feature data 132.
  • the winning unit is then adjusted in the direction of the new sample.
  • each unit’s weight vector may be initialized for separation or randomly.
  • a determination is made as to which weight vector is closest to the new sample, such as based on Euclidean distance or inner product similarity, as non-limiting examples.
  • the weight vector closest to the new sample (the “winner” or best matching unit) may then be moved in the direction of the new sample.
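One competitive learning step, as described above, might look like the following sketch. The Euclidean distance, the learning rate `lr`, and the function name `competitive_step` are assumptions for illustration; an inner-product similarity could be substituted for the distance.

```python
import numpy as np

def competitive_step(weights, sample, lr=0.1):
    """Move the closest weight vector toward the sample; return the winner index."""
    distances = np.linalg.norm(weights - sample, axis=1)
    winner = int(np.argmin(distances))
    # Only the winning unit is adjusted, in the direction of the new sample.
    weights[winner] += lr * (sample - weights[winner])
    return winner

units = np.array([[0.0, 0.0], [1.0, 1.0]])
winner = competitive_step(units, np.array([0.9, 0.8]), lr=0.5)
print(winner, units)
```

Here the second unit wins and moves halfway toward the sample, while the first unit is left unchanged.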
  • the winners strengthen their correlations with the input, such as by adjusting the weights between two nodes in proportion to the product of the inputs to the two nodes.
  • the dynamic classifier 140 includes local clusters in a presynaptic sheet that are connected to local clusters in a postsynaptic sheet, and interconnections among neighboring neurons are reinforced through Hebbian learning to strengthen connections between correlating stimulations.
  • the dynamic classifier 140 may include a Kohonen self-organizing map in which the input is connected to every neuron in the postsynaptic sheet or the map. Learning causes the map to be localized in that different fields of absorption respond to different regions of input space (e.g., the feature data space).
  • the dynamic classifier 140 includes a self-organizing map 148.
  • the self-organizing map 148 may operate by initializing weight vectors, and then for each input $t$ (e.g., each received set of the feature data 132), determining the winning unit (or cell or neuron) according to $v(t) = \arg\min_i \lVert x(t) - w_i(t) \rVert$.
  • the weights of the winning unit and its neighbors are updated, such as according to $\Delta w_i(t) = \alpha(t)\, l(v, i, t)\,[x(t) - w_v(t)]$, where $\Delta w_i(t)$ represents the change for unit $i$, $\alpha(t)$ represents a learning parameter, and $l(v, i, t)$ represents a neighborhood function around the winning unit, such as a Gaussian radial basis function.
  • inner products or another metric can be used as the similarity measure instead of Euclidean distance.
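A single self-organizing map update can be sketched as below. This example uses the common Kohonen form in which every unit moves toward the sample from its own weight vector, scaled by a Gaussian neighborhood around the winner; that form, the 3x3 grid, the four-dimensional feature vector, and the fixed `lr` and `sigma` values are assumptions for illustration, not details taken from the disclosure.

```python
import numpy as np

def som_step(weights, grid, sample, lr, sigma):
    """One SOM update: find the winner, then pull it and its grid neighbors toward the sample."""
    winner = int(np.argmin(np.linalg.norm(weights - sample, axis=1)))
    # Gaussian neighborhood measured on the map grid, not in feature space.
    grid_dist_sq = np.sum((grid - grid[winner]) ** 2, axis=1)
    neighborhood = np.exp(-grid_dist_sq / (2.0 * sigma ** 2))
    weights += lr * neighborhood[:, None] * (sample - weights)
    return winner

rng = np.random.default_rng(0)
grid = np.array([[r, c] for r in range(3) for c in range(3)], dtype=float)
weights = rng.uniform(-1.0, 1.0, size=(9, 4))  # e.g., a few IPD and IID values per unit
sample = np.array([0.2, -0.1, 0.4, 0.0])
before = np.linalg.norm(weights - sample, axis=1)
winner = som_step(weights, grid, sample, lr=0.5, sigma=1.0)
after = np.linalg.norm(weights - sample, axis=1)
print(winner, before[winner], after[winner])
```

In practice the learning parameter and neighborhood width usually shrink over time, which is why the text writes them as functions of $t$; the sketch keeps them constant for brevity.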
  • the dynamic classifier 140 includes a variant of a Kohonen self-organizing map to accommodate sequences of speech samples, such as described further with reference to FIG. 4.
  • the dynamic classifier 140 may implement temporal sequence processing, such as according to a temporal Kohonen map in which an activation function with a time constant modeling decay (“D”) is defined for each unit and updated as Ui (t,
  • the dynamic classifier 140 may implement a recurrent network, such as according to a recurrent self-organizing map which uses a difference vector $y$ instead of a squared norm, where $\gamma$ represents a forgetting factor having a value between 0 and 1, and the winning unit is determined as the unit with the smallest difference vector, $v(t) = \arg\min_i \lVert y_i(t) \rVert$.
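The recurrent winner selection can be sketched as follows. The leaky-integration recurrence used here, $y_i(t) = (1 - \gamma)\, y_i(t-1) + \gamma\,[x(t) - w_i]$, is the form commonly given for recurrent self-organizing maps and is an assumption of this example, as are the function name `rsom_winner` and the fixed weight vectors.

```python
import numpy as np

def rsom_winner(diff_vectors, weights, sample, gamma):
    """Update each unit's leaky difference vector in place and return the winning unit."""
    # Assumed recurrence: y_i(t) = (1 - gamma) * y_i(t-1) + gamma * (x(t) - w_i)
    diff_vectors *= (1.0 - gamma)
    diff_vectors += gamma * (sample - weights)
    # The winner is the unit whose difference vector has the smallest norm.
    return int(np.argmin(np.linalg.norm(diff_vectors, axis=1)))

weights = np.array([[0.0, 0.0], [1.0, 1.0]])
diffs = np.zeros_like(weights)
sequence = [np.array([0.9, 1.1]), np.array([1.1, 0.9]), np.array([1.0, 1.0])]
winners = [rsom_winner(diffs, weights, x, gamma=0.5) for x in sequence]
print(winners)
```

Because the difference vectors accumulate signed errors, deviations that alternate around a unit's weight vector tend to cancel, so the unit that best matches the recent sequence wins rather than the unit closest to only the latest sample.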
  • the processor 190 is configured to update a clustering operation 144 of the dynamic classifier 140 based on the feature data 132 and to update a classification decision criterion 146 of the dynamic classifier 140.
  • the processor 190 is configured to adapt the clustering and the decision boundary between user voice activity and other audio activity based on incoming samples of the audio data 128, enabling the dynamic classifier 140 to adjust operation based on changing conditions of the user 180, the environment, other conditions (e.g., microphone placement or adjustment), or any combination thereof.
  • the dynamic classifier 140 may incorporate one or more other techniques to generate the classification output 142 instead of, or in addition to, the self-organizing map 148.
  • the dynamic classifier 140 may include a restricted Boltzmann machine having an unsupervised configuration, an unsupervised autoencoder, an online variation of Hopfield networks, online clustering, or a combination thereof.
  • the dynamic classifier 140 may be configured to perform a principal component analysis (e.g., sequentially fitting a set of orthogonal direction vectors to the feature vector samples in the feature space, where each direction vector is selected as maximizing the variance of the feature vector samples projected onto the direction vector in feature space).
  • the dynamic classifier 140 may be configured to perform an independent component analysis (e.g., determining a set of additive subcomponents of the feature vector samples in the feature space, with the assumption that the subcomponents are non-Gaussian signals that are statistically independent from each other).
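For the principal component analysis option mentioned above, a batch sketch using a singular value decomposition is shown below. The batch formulation, the function name `principal_directions`, and the synthetic three-dimensional features are assumptions for illustration; a dynamic classifier would more likely use an online variant.

```python
import numpy as np

def principal_directions(features, n_components):
    """Return orthonormal directions of maximal variance and the variance along each."""
    centered = features - features.mean(axis=0)
    # Rows of vt are orthonormal directions ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (len(features) - 1)
    return vt[:n_components], explained_variance[:n_components]

rng = np.random.default_rng(1)
# Synthetic feature vectors with most of the variance along the first axis.
features = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])
directions, variances = principal_directions(features, 2)
print(directions, variances)
```

The first returned direction lies almost entirely along the high-variance axis, which is the behavior the text describes as selecting each direction to maximize projected variance.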
  • the processor 190 is configured to determine, at least partially based on the classification output 142, whether the audio data 128 corresponds to user voice activity, and to generate a user voice activity indicator 150 that indicates whether user voice activity is detected.
  • although the classification output 142 may indicate whether the feature data 132 is classified as one of two classes (e.g., class “0” or class “1”), the classification output 142 may not indicate which class corresponds to user voice activity and which class corresponds to other audio activity.
  • in some cases, the classification output 142 having the value “0” indicates user voice activity, while in other cases the classification output 142 having the value “0” indicates other audio activity.
  • the processor 190 may determine which of the two classes indicates user voice activity and which of the two classes indicates other audio activity, further based on at least one of a sign or a magnitude of at least one value of the feature data 132, as described further with reference to FIG. 2.
  • sound propagation of the utterance 182 from the mouth of the user 180 to the first microphone 110 and to the second microphone 120 results in a phase difference (due to the utterance 182 arriving at the first microphone 110 before the second microphone 120) and a signal strength difference that may be detected in the feature data 132 and that may be distinguishable from phase and signal strength differences of sound from other audio sources.
  • the phase and signal strength differences may be determined from the IPDs 134 and the IIDs 136 in the feature data 132 and used to map the classification output 142 to user voice activity or other audio activity.
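One hypothetical way to map the two unlabeled classes onto "user voice" and "other audio" is sketched below: the class whose samples show the larger mean IID (louder at the primary microphone) is treated as user voice. This rule, the function name `user_voice_class`, and the two-column feature layout (mean IPD, mean IID in dB) are assumptions for illustration; the text only states that the sign or magnitude of at least one feature value is used.

```python
import numpy as np

def user_voice_class(feature_sets, class_labels):
    """Return the class label (0 or 1) whose mean IID is larger.

    Assumes column 1 of feature_sets holds an IID summary in dB and that
    both classes have at least one sample.
    """
    mean_iid = [feature_sets[class_labels == c, 1].mean() for c in (0, 1)]
    return int(np.argmax(mean_iid))

# Columns: [mean IPD in radians, mean IID in dB] (hypothetical layout)
features = np.array([[0.05, 0.5], [0.40, 6.0], [0.35, 5.5], [-0.10, -1.0]])
labels = np.array([0, 1, 1, 0])  # raw two-class output of the classifier
voice_class = user_voice_class(features, labels)
indicator = labels == voice_class  # per-sample user voice activity indicator
print(voice_class, indicator)
```

In this example class 1 is mapped to user voice activity, so the indicator is set only for the second and third samples.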
  • the processor 190 may generate a user voice activity indicator 150 that indicates whether the audio data 128 corresponds to user voice activity.
  • the processor 190 is configured to initiate a voice command processing operation 152 in response to a determination that the audio data 128 corresponds to user voice activity.
  • the voice command processing operation 152 includes a voice activation operation, such as keyword or key phrase detection, voice print authentication, natural language processing, one or more other operations, or any combination thereof.
  • the processor 190 may process the audio data 128 to perform a first stage of keyword detection and may use the user voice activity indicator 150 to confirm that a detected keyword was spoken by the user 180 of the device 102, rather than by a nearby person, prior to initiating further processing of the audio data 128 via the voice command processing operation 152 (e.g., at a second stage of detection that includes more powerful voice activity recognition and speech recognition operations).
  • the modem 170 is coupled to the processor 190 and is configured to enable communication with the second device 160, such as via wireless transmission.
  • the modem 170 is configured to transmit the audio data 128 to the second device 160 in response to a determination that the audio data 128 corresponds to user voice activity based on the dynamic classifier 140.
  • the device 102 may send the audio data 128 to the second device 160 to perform the voice command processing operation 152 at a voice activation system 162 of the second device 160.
  • the device 102 offloads more computationally expensive processing (e.g., the voice command processing operation 152) to be performed using the greater processing resources and power resources of the second device 160.
  • the device 102 corresponds to or is included in one or more of various types of devices.
  • the processor 190 is integrated in a headset device that includes the first microphone 110 and the second microphone 120.
  • the headset device is configured, when worn by the user 180, to position the first microphone 110 closer than the second microphone 120 to the user’s mouth to capture utterances 182 of the user 180 at the first microphone 110 with greater intensity and less delay as compared to at the second microphone 120, such as described further with reference to FIG. 7.
  • the processor 190 is integrated in at least one of a mobile phone or a tablet computer device, as described with reference to FIG. 6, a wearable electronic device, as described with reference to FIG.
  • the processor 190 is integrated into a vehicle that also includes the first microphone 110 and the second microphone 120, such as described further with reference to FIG. 12 and FIG. 13.
  • the first microphone 110 is configured to capture utterances 182 of a user 180.
  • the second microphone 120 is configured to capture ambient sound 186.
  • an utterance 182 from a user 180 of the device 102 is captured by the first microphone 110 and by the second microphone 120. Because the first microphone 110 is nearer the mouth of the user 180, the speech of the user 180 is captured by the first microphone 110 with higher signal strength and less delay as compared to the second microphone 120.
  • ambient sound 186 from one or more sound sources 184 may be captured by the first microphone 110 and by the second microphone 120.
  • a signal strength difference and relative delay between capturing the ambient sound 186 at the first microphone 110 and the second microphone 120 will vary from that for the utterance 182 from the user 180.
  • the first audio data 116 and the second audio data 126 are processed at the processor 190, such as by performing echo cancellation, noise suppression, frequency domain transform, etc.
  • the resulting audio data is processed at the feature extractor 130 to generate the feature data 132 including the IPDs 134 and the IIDs 136.
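As a rough illustration of the feature extraction step, the sketch below computes one interaural phase difference (IPD) and one interaural intensity difference (IID) per selected frequency bin from a pair of time-aligned frames. The function names, the direct DFT, the choice of bins, and the decibel scaling of the IID are assumptions made for the example; the disclosure does not specify them.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def dft_bin(frame: Sequence[float], k: int) -> complex:
    """Return the k-th DFT coefficient of a real-valued frame."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))


def ipd_iid_features(
    first: Sequence[float], second: Sequence[float], bins: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Compute per-bin IPDs (radians) and IIDs (dB) between two frames."""
    eps = 1e-12  # avoids division by zero and log of zero for silent bins
    ipds: List[float] = []
    iids: List[float] = []
    for k in bins:
        a = dft_bin(first, k)
        b = dft_bin(second, k)
        # Phase of the first channel relative to the second, in (-pi, pi].
        # Positive when the sound reaches the first microphone earlier.
        ipds.append(cmath.phase(a * b.conjugate()))
        # Level of the first channel relative to the second, in decibels.
        iids.append(20.0 * math.log10((abs(a) + eps) / (abs(b) + eps)))
    return ipds, iids
```

With this sign convention, a sound that arrives at the first microphone earlier and with greater strength yields a positive IPD and a positive IID, which matches the behavior described for the utterance 182.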
  • the feature data 132 is input to the dynamic classifier 140 to generate the classification output 142, which is interpreted by the processor 190 as either user voice activity or other sound activity.
  • the processor 190 generates the user voice activity indicator 150, such as a “0” value to indicate the audio data 128 corresponds to user voice activity, or a “1” value to indicate the audio data 128 corresponds to other audio activity (or vice versa).
  • the user voice activity indicator 150 can be used to determine whether to initiate the voice command processing operation 152 at the device 102. Alternatively, or in addition, the user voice activity indicator 150 can be used to determine whether to initiate generation of an output signal 135 (e.g., the audio data 128) to the second device 160 for further processing at the voice activation system 162.
  • the dynamic classifier 140 is updated based on the feature data 132, such as by adjusting weights of the winning unit and its neighbors to be more similar to the feature data 132, updating the clustering operation 144, the classification criterion 146, or a combination thereof. In this manner, the dynamic classifier 140 automatically adapts to changes in the user speech, changes in the environment, changes in the characteristics of the device 102 or the microphones 110, 120, or a combination thereof.
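The update of the winning unit and its neighbors can be sketched as one step of a one-dimensional self-organizing map. This is a minimal sketch under stated assumptions: the Gaussian neighborhood, the `learning_rate` and `radius` parameters, and the function names are illustrative choices and are not taken from the disclosure.

```python
import math
from typing import List, Sequence


def winning_unit(weights: Sequence[Sequence[float]], x: Sequence[float]) -> int:
    """Return the index of the unit whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)),
    )


def som_update(
    weights: List[List[float]],
    x: Sequence[float],
    learning_rate: float = 0.1,
    radius: float = 1.0,
) -> int:
    """Move the winner and its grid neighbors toward x, in place."""
    winner = winning_unit(weights, x)
    for i, unit in enumerate(weights):
        # Gaussian neighborhood over the grid distance to the winner:
        # 1.0 for the winner itself, smaller for units further away.
        h = math.exp(-((i - winner) ** 2) / (2.0 * radius ** 2))
        for d in range(len(unit)):
            unit[d] += learning_rate * h * (x[d] - unit[d])
    return winner
```

Calling `som_update` once per set of feature data lets the unit weights drift with the incoming data, which is one way the adaptation to changes in speech, environment, or microphones could be realized.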
  • a dynamic classifier 208 operates on the feature data 206 to generate a classification output 210.
  • the dynamic classifier 208 corresponds to the dynamic classifier 140 and is configured to perform unsupervised real-time clustering based on the feature data 206 with highly dynamic decision boundaries for “self” vs. “other” labeling for voice activation classes in a classification output 210.
  • the dynamic classifier 208 may divide the feature space into two classes, one class associated with user voice activity and the other class associated with other sound activity.
  • the classification output 210 may include a binary indicator of which class is associated with the feature data 206.
  • the classification output 210 corresponds to the classification output 142.
  • the self/other association may determine that a classification output 210 value of “0” corresponds to feature data 206 exhibiting a negative sign 230 in one or more pertinent frequency ranges, or exhibiting a magnitude 232 less than a threshold amount in one or more pertinent frequency ranges, or both, and as a result may populate a table such that “0” corresponds to “other” and “1” corresponds to “self.”
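The self/other association can be illustrated with a small table-building helper. The sketch assumes that each class is summarized by a few representative feature values from the pertinent frequency ranges and that the mean of those values is compared against zero and against a threshold; `build_label_table`, `class_features`, and `magnitude_threshold` are hypothetical names introduced for the example.

```python
from typing import Dict, Sequence


def build_label_table(
    class_features: Dict[int, Sequence[float]], magnitude_threshold: float
) -> Dict[int, str]:
    """Map each classification output value to "self" or "other"."""
    table: Dict[int, str] = {}
    for class_id, values in class_features.items():
        mean = sum(values) / len(values)
        # A negative sign, a magnitude below the threshold, or both,
        # associates the class with "other"; otherwise it is "self".
        looks_other = mean < 0 or abs(mean) < magnitude_threshold
        table[class_id] = "other" if looks_other else "self"
    return table
```

For instance, under these assumptions `build_label_table({0: [-0.4, -0.2], 1: [0.6, 0.8]}, 0.3)` associates output value 0 with "other" and output value 1 with "self".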
  • Dynamic classification, such as described with reference to the dynamic classifier 140 of FIG. 1 and the dynamic classifier 208 of FIG. 2, assists with improving SVAD accuracy, with the objectives of responding only when the user speaks, always suppressing responses when other interference (e.g., external speech) arrives, and maximizing the self-keyword acceptance rate (“SKAR”) and the other keyword rejection rate (“OKRR”).
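The disclosure names SKAR and OKRR without giving formulas. The sketch below assumes the natural reading of the names: SKAR as the fraction of keywords spoken by the user that are accepted, and OKRR as the fraction of keywords spoken by others that are rejected. The function name and the trial representation are likewise assumptions.

```python
from typing import Iterable, Tuple


def skar_okrr(trials: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Compute (SKAR, OKRR) from (spoken_by_user, accepted) keyword trials."""
    self_total = self_accepted = other_total = other_rejected = 0
    for spoken_by_user, accepted in trials:
        if spoken_by_user:
            self_total += 1
            self_accepted += int(accepted)
        else:
            other_total += 1
            other_rejected += int(not accepted)
    if self_total == 0 or other_total == 0:
        raise ValueError("need at least one self trial and one other trial")
    # Assumed definitions: accepted self keywords over all self keywords,
    # and rejected other keywords over all other keywords.
    return self_accepted / self_total, other_rejected / other_total
```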
  • the application 354 may be configured to perform one or more operations based on detected user speech.
  • the application 354 may correspond to a voice interface application, an integrated assistant application, a vehicle navigation and entertainment application, or a home automation system, as illustrative, non-limiting examples.
  • FIG. 4 is a diagram of an illustrative aspect of operation of components of the system of FIG. 1, in accordance with some examples of the present disclosure.
  • the feature extractor 130 is configured to receive a sequence 410 of audio data samples, such as a sequence of successively captured frames of the audio data 128, illustrated as a first frame (F1) 412, a second frame (F2) 414, and one or more additional frames including an Nth frame (FN) 416 (where N is an integer greater than two).
  • the feature extractor 130 is configured to output a sequence 420 of sets of feature data including a first set 422, a second set 424, and one or more additional sets including an Nth set 426.
  • the dynamic classifier 140 is configured to receive the sequence 420 of sets of feature data and to adaptively cluster a set (e.g., the second set 424) of the sequence 420 at least partially based on a prior set (e.g., the first set 422) of feature data in the sequence 420.
  • the dynamic classifier 140 may be implemented as a temporal Kohonen map or a recurrent self-organizing map.
  • the feature extractor 130 processes the first frame 412 to generate the first set 422 of feature data, and the dynamic classifier 140 processes the first set 422 of feature data to generate a first classification output (C1) 432 of a sequence 430 of classification outputs.
  • the feature extractor 130 processes the second frame 414 to generate the second set 424 of feature data, and the dynamic classifier 140 processes the second set 424 of feature data to generate a second classification output (C2) 434 based on the second set 424 of feature data and at least partially based on the first set 422 of feature data.
  • Such processing continues, including the feature extractor 130 processing the Nth frame 416 to generate the Nth set 426 of feature data, and the dynamic classifier 140 processing the Nth set 426 of feature data to generate an Nth classification output (CN) 436.
  • the Nth classification output 436 is based on the Nth set 426 of feature data and at least partially based on one or more of the previous sets of feature data of the sequence 420.
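One way a classification output can depend on prior sets of feature data is the leaky-integrator activity used in a temporal Kohonen map. The sketch below is a simplified illustration, not the disclosed implementation: it keeps the unit weights fixed (no adaptation step), and `tkm_classify` and the `decay` parameter are names chosen for the example.

```python
from typing import List, Sequence


def tkm_classify(
    sequence: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    decay: float = 0.5,
) -> List[int]:
    """Return one winning-unit index per set of feature data in the sequence."""
    activity = [0.0] * len(weights)
    outputs: List[int] = []
    for x in sequence:
        for i, w in enumerate(weights):
            dist2 = sum((a - b) ** 2 for a, b in zip(x, w))
            # Leaky integration: part of the previous activity is carried over,
            # so earlier sets of feature data influence the current decision.
            activity[i] = decay * activity[i] - 0.5 * dist2
        # The winner is the unit with the highest accumulated activity.
        outputs.append(max(range(len(weights)), key=lambda i: activity[i]))
    return outputs
```

With `decay` set to zero the sketch reduces to a memoryless nearest-unit decision, while a larger `decay` makes each output depend more strongly on earlier sets in the sequence.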
  • FIG. 5 depicts an implementation 500 of the device 102 as an integrated circuit 502 that includes the one or more processors 190.
  • the integrated circuit 502 also includes an audio input 504, such as one or more bus interfaces, to enable the audio data 128 to be received for processing.
  • the integrated circuit 502 also includes a signal output 512, such as a bus interface, to enable sending of an output signal, such as the user voice activity indicator 150.
  • the integrated circuit 502 enables implementation of self-voice activity detection as a component in a system that includes microphones, such as a mobile phone or tablet as depicted in FIG. 6, a headset as depicted in FIG. 7, a wearable electronic device as depicted in FIG. 8, a voice-controlled speaker system as depicted in FIG. 9, a camera as depicted in FIG. 10, a virtual reality headset, mixed reality headset, or an augmented reality headset as depicted in FIG. 11, or a vehicle as depicted in FIG. 12 or FIG. 13.
  • FIG. 6 depicts an implementation 600 in which the device 102 is a mobile device 602, such as a phone or tablet, as illustrative, non-limiting examples.
  • the mobile device 602 includes the first microphone 110 positioned to primarily capture speech of a user, multiple second microphones 120 positioned to primarily capture environmental sounds, and a display screen 604.
  • Components of the processor 190, including the feature extractor 130 and the dynamic classifier 140, are integrated in the mobile device 602 and are illustrated using dashed lines to indicate internal components that are not generally visible to a user of the mobile device 602.
  • FIG. 7 depicts an implementation 700 in which the device 102 is a headset device 702.
  • the headset device 702 includes the first microphone 110 positioned to primarily capture speech of a user and the second microphone 120 positioned to primarily capture environmental sounds.
  • Components of the processor 190, including the feature extractor 130 and the dynamic classifier 140, are integrated in the headset device 702.
  • the dynamic classifier 140 operates to detect user voice activity, which may cause the headset device 702 to perform one or more operations at the headset device 702, to transmit audio data corresponding to the user voice activity to a second device (not shown), such as the second device 160 of FIG. 1, for further processing, or a combination thereof.
  • Although the processor 190 is illustrated as including the feature extractor 130, in other implementations the feature extractor 130 is omitted, such as when the dynamic classifier 140 is configured to extract feature data during processing of the first audio data 116 and the second audio data 126, as described with reference to FIG. 1.
  • FIG. 9 is an implementation 900 in which the device 102 is a wireless speaker and voice activated device 902.
  • the wireless speaker and voice activated device 902 can have wireless network connectivity and is configured to execute an assistant operation.
  • the processor 190 including the feature extractor 130 and the dynamic classifier 140, the first microphone 110, the second microphone 120, or a combination thereof, are included in the wireless speaker and voice activated device 902.
  • Although the processor 190 is illustrated as including the feature extractor 130, in other implementations the feature extractor 130 is omitted, such as when the dynamic classifier 140 is configured to extract feature data during processing of the first audio data 116 and the second audio data 126, as described with reference to FIG. 1.
  • the wireless speaker and voice activated device 902 also includes a speaker 904.
  • Although the camera device 1002 is illustrated as including the feature extractor 130, in other implementations the feature extractor 130 is omitted, such as when the dynamic classifier 140 is configured to extract feature data during processing of the first audio data 116 and the second audio data 126, as described with reference to FIG. 1.
  • Although the vehicle 1202 is illustrated as including the feature extractor 130, in other implementations the feature extractor 130 is omitted, such as when the dynamic classifier 140 is configured to extract feature data during processing of the first audio data 116 and the second audio data 126, as described with reference to FIG. 1.
  • FIG. 13 depicts another implementation 1300 in which the device 102 corresponds to, or is integrated within, a vehicle 1302, illustrated as a car.
  • the vehicle 1302 includes the processor 190 including the feature extractor 130 and the dynamic classifier 140.
  • Although the vehicle 1302 is illustrated as including the feature extractor 130, in other implementations the feature extractor 130 is omitted, such as when the dynamic classifier 140 is configured to extract feature data during processing of the first audio data 116 and the second audio data 126, as described with reference to FIG. 1.
  • the vehicle 1302 also includes the first microphone 110 and the second microphone 120.
  • the first microphone 110 is positioned to capture utterances of an operator of the vehicle 1302.
  • user voice activity detection can be performed based on an audio signal received from external microphones (e.g., the first microphone 110 and the second microphone 120), such as from an authorized user of the vehicle.
  • in response to receiving a verbal command identified as user speech via operation of the dynamic classifier 140, the voice activation system 162 initiates one or more operations of the vehicle 1302 based on one or more keywords (e.g., “unlock”, “start engine”, “play music”, “display weather forecast”, or another voice command) detected in the output signal 135, such as by providing feedback or information via a display 1320 or one or more speakers (e.g., a speaker 1310).
  • Referring to FIG. 14A, a particular implementation of a method 1400 of user voice activity detection is shown.
  • one or more operations of the method 1400 are performed by at least one of the feature extractor 130, the dynamic classifier 140, the processor 190, the device 102, the system 100 of FIG. 1, or a combination thereof.
  • the method 1400 includes receiving, at one or more processors, audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone, at 1402.
  • the feature extractor 130 of FIG. 1 receives the audio data 128 including the first audio data 116 corresponding to a first output of the first microphone 110 and the second audio data 126 corresponding to a second output of the second microphone 120, as described with reference to FIG. 1.
  • the method 1400 includes determining, at the one or more processors and at least partially based on the classification output, whether the audio data corresponds to user voice activity, at 1408.
  • the processor 190 of FIG. 1 determines, at least partially based on the classification output 142, whether the audio data 128 corresponds to user voice activity, as described with reference to FIG. 1.
  • the method 1400 improves performance of self-voice activity detection by using the dynamic classifier 140 to discriminate between user voice activity and other audio activity with relatively low complexity, low power consumption, and high accuracy as compared to conventional self-voice activity detection techniques. Automatically adapting to user and environment changes provides improved benefit by reducing or eliminating calibration to be performed by the user and enhancing the user’s experience.
  • Referring to FIG. 14B, a particular implementation of a method 1450 of user voice activity detection is shown. In a particular aspect, one or more operations of the method 1450 are performed by at least one of the dynamic classifier 140, the processor 190, the device 102, the system 100 of FIG. 1, or a combination thereof.
  • the method 1450 includes providing, at the one or more processors, the audio data to a dynamic classifier to generate a classification output corresponding to the audio data, at 1454.
  • the feature extractor 130 of FIG. 1 generates the feature data 132 based on the first audio data 116 and the second audio data 126, and the feature data 132 is processed by the dynamic classifier 140 to generate the classification output 142, such as described in FIG. 1 and in accordance with the method 1400 of FIG. 14A.
  • the processor 190 provides the first audio data 116 and the second audio data 126 to the dynamic classifier 140, and the dynamic classifier 140 processes the first audio data 116 and the second audio data 126 to generate the classification output 142.
  • the dynamic classifier 140 processes the first audio data 116 and the second audio data 126 to extract the feature data 132, and determines the classification output 142 based on the feature data 132.
  • the method 1450 improves performance of self-voice activity detection by using the dynamic classifier 140 to discriminate between user voice activity and other audio activity with relatively low complexity, low power consumption, and high accuracy as compared to conventional self-voice activity detection techniques. Automatically adapting to user and environment changes provides improved benefit by reducing or eliminating calibration to be performed by the user and enhancing the user’s experience.
  • the method 1400 of FIG. 14A, the method 1450 of FIG. 14B, or a combination thereof may be implemented by a field-programmable gate array (FPGA) device, an application-specific integrated circuit (ASIC), a processing unit such as a central processing unit (CPU), a DSP, a controller, another hardware device, firmware device, or any combination thereof.
  • the method 1400 of FIG. 14A, the method 1450 of FIG. 14B, or a combination thereof may be performed by a processor that executes instructions, such as described with reference to FIG. 15.
  • Referring to FIG. 15, a block diagram of a particular illustrative implementation of a device is depicted and generally designated 1500.
  • the device 1500 may have more or fewer components than illustrated in FIG. 15.
  • the device 1500 may correspond to the device 102.
  • the device 1500 may perform one or more operations described with reference to FIGS. 1-14B.
  • the device 1500 includes a processor 1506 (e.g., a central processing unit (CPU)).
  • the device 1500 may include one or more additional processors 1510 (e.g., one or more DSPs).
  • the processor 190 of FIG. 1 corresponds to the processor 1506, the processors 1510, or a combination thereof.
  • the processors 1510 may include a speech and music coder-decoder (CODEC) 1508 that includes a voice coder (“vocoder”) encoder 1536, a vocoder decoder 1538, the feature extractor 130, the dynamic classifier 140, or a combination thereof.
  • the CODEC 1534 may include a digital-to-analog converter (DAC) 1502, an analog-to-digital converter (ADC) 1504, or both.
  • the CODEC 1534 may receive analog signals from the first microphone 110 and the second microphone 120, convert the analog signals to digital signals using the analog-to-digital converter 1504, and provide the digital signals to the speech and music codec 1508.
  • the speech and music codec 1508 may process the digital signals, and the digital signals may further be processed by the feature extractor 130 and the dynamic classifier 140.
  • the speech and music codec 1508 may provide digital signals to the CODEC 1534.
  • the CODEC 1534 may convert the digital signals to analog signals using the digital-to-analog converter 1502 and may provide the analog signals to the speaker 1592.
  • each of the display 1528, the input device 1530, the speaker 1592, the first microphone 110, the second microphone 120, the antenna 1552, and the power supply 1544 may be coupled to a component of the system-on-chip device 1522, such as an interface (e.g., the first input interface 114 or the second input interface 124) or a controller.
  • the device 1500 may include a smart speaker, a speaker bar, a mobile communication device, a smart phone, a cellular phone, a laptop computer, a computer, a tablet, a personal digital assistant, a display device, a television, a gaming console, a music player, a radio, a digital video player, a digital video disc (DVD) player, a tuner, a camera, a navigation device, a vehicle, a headset, an augmented reality headset, a virtual reality headset, an aerial vehicle, a home automation system, a voice-activated device, a wireless speaker and voice activated device, a portable electronic device, a car, a vehicle, a computing device, a communication device, an internet-of-things (IoT) device, a virtual reality (VR) device, a base station, a mobile device, or any combination thereof.
  • the apparatus also includes means for determining, at least partially based on the classification output, whether the audio data corresponds to user voice activity.
  • the means for determining can correspond to the dynamic classifier 140, the processor 190, the one or more processors 1510, one or more other circuits or components configured to determine, at least partially based on the classification output, whether the audio data corresponds to user voice activity, or any combination thereof.
  • an apparatus includes means for receiving audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone.
  • the means for receiving can correspond to the first input interface 114, the second input interface 124, the feature extractor 130, the dynamic classifier 140, the processor 190, the one or more processors 1510, one or more other circuits or components configured to receive audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone, or any combination thereof.
  • the apparatus further includes means for generating, at a dynamic classifier, a classification output corresponding to the audio data.
  • the means for generating the classification output can correspond to the feature extractor 130, the dynamic classifier 140, the processor 190, the one or more processors 1510, one or more other circuits or components configured to generate classification output at a dynamic classifier, or any combination thereof.
  • the apparatus also includes means for determining, at least partially based on the classification output, whether the audio data corresponds to user voice activity.
  • the means for determining can correspond to the dynamic classifier 140, the processor 190, the one or more processors 1510, one or more other circuits or components configured to determine, at least partially based on the classification output, whether the audio data corresponds to user voice activity, or any combination thereof.
  • a non-transitory computer-readable medium (e.g., a computer-readable storage device, such as the memory 1586) includes instructions (e.g., the instructions 1556) that, when executed by one or more processors (e.g., the one or more processors 1510 or the processor 1506), cause the one or more processors to receive audio data (e.g., the audio data 128) including first audio data (e.g., the first audio data 116) corresponding to a first output of a first microphone (e.g., the first microphone 110) and second audio data (e.g., the second audio data 126) corresponding to a second output of a second microphone (e.g., the second microphone 120).
  • Example 1 A device comprising: one or more processors configured to: receive audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone; generate feature data based on the first audio data and the second audio data; process the feature data at a dynamic classifier to generate a classification output of the feature data; and determine, at least partially based on the classification output, whether the audio data corresponds to user voice activity.
  • Example 3 The device of example 1, wherein the feature data includes: at least one interaural phase difference between the first audio data and the second audio data; and at least one interaural intensity difference between the first audio data and the second audio data.
  • Example 4 The device of example 3, wherein the one or more processors are further configured to transform the first audio data and the second audio data to a transform domain prior to generating the feature data, and wherein the feature data includes interaural phase differences for multiple frequencies and interaural intensity differences for multiple frequencies.
  • Example 6 The device of example 1, wherein the one or more processors are further configured to update a clustering operation of the dynamic classifier based on the feature data.
  • Example 7 The device of example 1, wherein the one or more processors are further configured to update a classification decision criterion of the dynamic classifier.
  • Example 8 The device of example 1, wherein the dynamic classifier includes a self-organizing map.
  • Example 9 The device of example 1, wherein the dynamic classifier is further configured to receive a sequence of sets of feature data and to adaptively cluster a set of the sequence at least partially based on a prior set of feature data in the sequence.
  • Example 10 The device of example 1, wherein the one or more processors are configured to determine whether the audio data corresponds to the user voice activity further based on at least one of a sign or a magnitude of at least one value of the feature data.
  • Example 11 The device of example 1, wherein the one or more processors are further configured to initiate a voice command processing operation in response to a determination that the audio data corresponds to the user voice activity.
  • Example 12 The device of example 11, wherein the one or more processors are configured to generate at least one of a wakeup signal or an interrupt to initiate the voice command processing operation.
  • Example 13 The device of example 12, wherein the one or more processors further include: an always-on power domain that includes the dynamic classifier; and a second power domain that includes a voice command processing unit, and wherein the wakeup signal is configured to transition the second power domain from a low-power mode to activate the voice command processing unit.
  • Example 14 The device of example 1, further comprising a modem coupled to the one or more processors, the modem configured to transmit the audio data to a second device in response to a determination that the audio data corresponds to the user voice activity based on the dynamic classifier.
  • Example 15 The device of example 1, wherein the one or more processors are integrated in a headset device that includes the first microphone and the second microphone, and wherein the headset device is configured, when worn by a user, to position the first microphone closer than the second microphone to the user’s mouth to capture utterances of the user at the first microphone with greater intensity and less delay as compared to at the second microphone.
  • Example 16 The device of example 1, wherein the one or more processors are integrated in at least one of a mobile phone, a tablet computer device, a wearable electronic device, a camera device, a virtual reality headset, or an augmented reality headset.
  • Example 17 The device of example 1, wherein the one or more processors are integrated in a vehicle, the vehicle further including the first microphone and the second microphone, and wherein the first microphone is positioned to capture utterances of an operator of the vehicle.
  • Example 18 A method of voice activity detection comprising: receiving, at one or more processors, audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone; generating, at the one or more processors, feature data based on the first audio data and the second audio data; generating, at a dynamic classifier of the one or more processors, a classification output of the feature data; and determining, at the one or more processors and at least partially based on the classification output, whether the audio data corresponds to user voice activity.
  • Example 19 The method of example 18, wherein the first microphone is configured to capture utterances of a user, and wherein the second microphone is configured to capture ambient sound.
  • Example 20 The method of example 18, wherein the feature data includes: at least one interaural phase difference between the first audio data and the second audio data; and at least one interaural intensity difference between the first audio data and the second audio data.
  • Example 21 The method of example 20, further comprising transforming the first audio data and the second audio data to a transform domain prior to generating the feature data, wherein the feature data includes interaural phase differences for multiple frequencies and interaural intensity differences for multiple frequencies.
  • Example 24 The method of example 18, further comprising updating a classification decision criterion of the dynamic classifier.
  • Example 25 The method of example 18, wherein the dynamic classifier includes a self-organizing map.
  • Example 26 The method of example 18, further comprising receiving, at the dynamic classifier, a sequence of sets of feature data and adaptively clustering a set of the sequence at least partially based on a prior set of feature data in the sequence.
  • Example 27 The method of example 18, wherein determining whether the audio data corresponds to the user voice activity is further based on at least one of a sign or a magnitude of at least one value of the feature data.
  • Example 28 The method of example 18, further comprising initiating a voice command processing operation in response to a determination that the audio data corresponds to the user voice activity.
  • Example 31 The method of example 18, further comprising transmitting the audio data to a second device in response to a determination that the audio data corresponds to the user voice activity based on the dynamic classifier.
  • Example 36 The non-transitory computer-readable medium of example 34, wherein the feature data includes: at least one interaural phase difference between the first audio data and the second audio data; and at least one interaural intensity difference between the first audio data and the second audio data.
  • Example 37 The non-transitory computer-readable medium of example 34, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to transform the first audio data and the second audio data to a transform domain prior to generating the feature data, and wherein the feature data includes interaural phase differences for multiple frequencies and interaural intensity differences for multiple frequencies.
  • Example 39 The non-transitory computer-readable medium of example 34, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to update a clustering operation of the dynamic classifier based on the feature data.
  • Example 43 The non-transitory computer-readable medium of example 34, wherein determining whether the audio data corresponds to the user voice activity is further based on at least one of a sign or a magnitude of at least one value of the feature data.
  • Example 44 The non-transitory computer-readable medium of example 34, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to initiate a voice command processing operation in response to a determination that the audio data corresponds to the user voice activity.
  • Example 45 The non-transitory computer-readable medium of example 34, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to generate at least one of a wakeup signal or an interrupt to initiate a voice command processing operation.
  • Example 49 The non-transitory computer-readable medium of example 34, wherein the one or more processors are integrated in at least one of a mobile phone, a tablet computer device, a wearable electronic device, a camera device, a virtual reality headset, an augmented reality headset, or a vehicle.
  • Example 50 An apparatus comprising: means for receiving audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone; means for generating feature data based on the first audio data and the second audio data; means for generating, at a dynamic classifier, a classification output of the feature data; and means for determining, at least partially based on the classification output, whether the audio data corresponds to user voice activity.
  • Example 51 The apparatus of example 50, wherein the feature data includes: at least one interaural phase difference between the first audio data and the second audio data; and at least one interaural intensity difference between the first audio data and the second audio data.
  • Example 53 The apparatus of example 50, further comprising means for adaptively clustering sets of feature data based on whether a sound represented in the audio data originates from a source that is closer to the first microphone than to the second microphone.
  • Example 56 The apparatus of example 50, wherein the dynamic classifier includes a self-organizing map.
  • Example 58 The apparatus of example 50, further comprising means for generating at least one of a wakeup signal or an interrupt to initiate a voice command processing operation.
  • Example 59 The apparatus of example 50, further comprising means for transmitting the audio data to a second device in response to a determination that the audio data corresponds to the user voice activity based on the dynamic classifier.
  • Example 60 The apparatus of example 50, wherein the means for receiving the audio data, the means for generating the feature data, the means for generating the classification output, and the means for determining whether the audio data corresponds to the user voice activity are integrated in a headset device that includes the first microphone and the second microphone, and wherein the headset device, when worn by a user, positions the first microphone closer than the second microphone to the user’s mouth to capture utterances of the user at the first microphone with greater intensity and less delay as compared to at the second microphone.
  • Example 61 The apparatus of example 50, wherein the means for receiving the audio data, the means for generating the feature data, the means for generating the classification output, and the means for determining whether the audio data corresponds to user voice activity are integrated in at least one of a mobile phone, a tablet computer device, a wearable electronic device, a camera device, a virtual reality headset, an augmented reality headset, or a vehicle.
  • Example 62 A device including: a memory configured to store instructions; and one or more processors configured to execute the instructions to: receive audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone; provide the audio data to a dynamic classifier, the dynamic classifier configured to generate a classification output corresponding to the audio data; and determine, at least partially based on the classification output, whether the audio data corresponds to user voice activity.
  • Example 63 The device of example 62, further including the first microphone and the second microphone, wherein the first microphone is coupled to the one or more processors and configured to capture utterances of a user, and wherein the second microphone is coupled to the one or more processors and configured to capture ambient sound.
  • Example 64 The device of example 62 or 63, wherein the classification output is based on a gain difference between the first audio data and the second audio data, a phase difference between the first audio data and the second audio data, or a combination thereof.
  • Example 65 The device of any of examples 62 to 64, wherein the one or more processors are further configured to: generate feature data based on the first audio data and the second audio data; and provide the feature data to the dynamic classifier, wherein the classification output is based on the feature data.
  • Example 66 The device of example 65, wherein the feature data includes: at least one interaural phase difference between the first audio data and the second audio data; and at least one interaural intensity difference between the first audio data and the second audio data.
  • Example 67 The device of example 65 or 66, wherein the one or more processors are configured to determine whether the audio data corresponds to the user voice activity further based on at least one of a sign or a magnitude of at least one value of the feature data.
  • Example 68 The device of any of examples 65 to 67, wherein the one or more processors are further configured to transform the first audio data and the second audio data to a transform domain prior to generating the feature data, and wherein the feature data includes interaural phase differences for multiple frequencies and interaural intensity differences for multiple frequencies.
  • Example 69 The device of any of examples 65 to 68, wherein the dynamic classifier is configured to adaptively cluster sets of feature data based on whether a sound represented in the audio data originates from a source that is closer to the first microphone than to the second microphone.
  • Example 70 The device of any of examples 62 to 69, wherein the one or more processors are further configured to update a clustering operation of the dynamic classifier based on the audio data.
  • Example 71 The device of any of examples 62 to 70, wherein the one or more processors are further configured to update a classification decision criterion of the dynamic classifier.
  • Example 72 The device of any of examples 62 to 71, wherein the dynamic classifier includes a self-organizing map.
  • Example 73 The device of any of examples 62 to 72, wherein the dynamic classifier is further configured to receive a sequence of sets of audio data and to adaptively cluster a set of the sequence at least partially based on a prior set of audio data in the sequence.
  • Example 74 The device of any of examples 62 to 73, wherein the one or more processors are further configured to initiate a voice command processing operation in response to a determination that the audio data corresponds to the user voice activity.
  • Example 75 The device of example 74, wherein the one or more processors are configured to generate at least one of a wakeup signal or an interrupt to initiate the voice command processing operation.
  • Example 76 The device of example 75, wherein the one or more processors further include: an always-on power domain that includes the dynamic classifier; and a second power domain that includes a voice command processing unit, and wherein the wakeup signal is configured to transition the second power domain from a low-power mode to activate the voice command processing unit.
  • Example 77 The device of any of examples 62 to 76, further comprising a modem coupled to the one or more processors, the modem configured to transmit the audio data to a second device in response to a determination that the audio data corresponds to the user voice activity based on the dynamic classifier.
  • Example 78 The device of any of examples 62 to 77, wherein the one or more processors are integrated in a headset device that includes the first microphone and the second microphone, and wherein the headset device is configured, when worn by a user, to position the first microphone closer than the second microphone to the user’s mouth to capture utterances of the user at the first microphone with greater intensity and less delay as compared to at the second microphone.
  • Example 79 The device of any of examples 62 to 77, wherein the one or more processors are integrated in at least one of a mobile phone, a tablet computer device, a wearable electronic device, a camera device, a virtual reality headset, or an augmented reality headset.
  • Example 80 The device of any of examples 62 to 77, wherein the one or more processors are integrated in a vehicle, the vehicle further including the first microphone and the second microphone, and wherein the first microphone is positioned to capture utterances of an operator of the vehicle.
  • Example 81 A method of voice activity detection including: receiving, at one or more processors, audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone; providing, at the one or more processors, the audio data to a dynamic classifier to generate a classification output corresponding to the audio data; and determining, at the one or more processors and at least partially based on the classification output, whether the audio data corresponds to user voice activity.
  • Example 82 The method of example 81, wherein the classification output is based on a gain difference between the first audio data and the second audio data, a phase difference between the first audio data and the second audio data, or a combination thereof.
  • Example 83 The method of example 81 or 82, wherein the dynamic classifier includes a self-organizing map.
  • Example 84 The method of any of examples 81 to 83, wherein determining whether the audio data corresponds to the user voice activity is further based on at least one of a sign or a magnitude of at least one value of feature data corresponding to the audio data.
  • Example 85 The method of any of examples 81 to 84, further including initiating a voice command processing operation in response to a determination that the audio data corresponds to the user voice activity.
  • Example 86 The method of example 85, further including generating at least one of a wakeup signal or an interrupt to initiate the voice command processing operation.
  • Example 87 The method of any of examples 81 to 86, further including transmitting the audio data to a second device in response to a determination that the audio data corresponds to the user voice activity based on the dynamic classifier.
  • Example 88 A non-transitory computer-readable medium including instructions that, when executed by one or more processors, cause the one or more processors to: receive audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone; provide the audio data to a dynamic classifier to generate a classification output corresponding to the audio data; and determine, at least partially based on the classification output, whether the audio data corresponds to user voice activity.
  • Example 89 The non-transitory computer-readable medium of example 88, wherein the classification output is based on a gain difference between the first audio data and the second audio data, a phase difference between the first audio data and the second audio data, or a combination thereof.
  • Example 90 An apparatus including: means for receiving audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone; means for generating, at a dynamic classifier, a classification output corresponding to the audio data; and means for determining, at least partially based on the classification output, whether the audio data corresponds to user voice activity.
  • Example 91 The apparatus of example 90, wherein the classification output is based on a gain difference between the first audio data and the second audio data, a phase difference between the first audio data and the second audio data, or a combination thereof.
  • A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of non-transient storage medium known in the art.
  • An exemplary storage medium is coupled to the processor such that the processor may read information from, and write information to, the storage medium.
  • The storage medium may be integral to the processor.
  • The processor and the storage medium may reside in an application-specific integrated circuit (ASIC).
  • The ASIC may reside in a computing device or a user terminal.
  • The processor and the storage medium may reside as discrete components in a computing device or user terminal.
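The pipeline of Example 18, using the features of Examples 20 and 21, the self-organizing map of Example 25, and the sign test of Example 27, might be sketched as follows. This is a minimal illustration in Python, not the implementation disclosed in the application. The naive DFT, the `TinySOM` class, its learning rate and neighborhood width, the initial weights, the synthetic test tones, and the rule that a cluster counts as user voice when its learned phase and intensity differences are both positive are all assumptions introduced here.

```python
import cmath
import math


def dft(frame):
    """Naive DFT of a real frame; returns bins 0..N/2 (the transform domain)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def interaural_features(first_frame, second_frame, eps=1e-12):
    """Per-bin phase difference (radians) and intensity difference (dB).

    Sign convention (an assumption): positive values mean the sound reaches
    the first microphone earlier / louder than the second microphone.
    """
    spec1, spec2 = dft(first_frame), dft(second_frame)
    ipd = [cmath.phase(a * b.conjugate()) for a, b in zip(spec1, spec2)]
    iid = [20.0 * math.log10((abs(a) + eps) / (abs(b) + eps))
           for a, b in zip(spec1, spec2)]
    return ipd, iid


class TinySOM:
    """Hypothetical 1-D self-organizing map, updated online per feature vector."""

    def __init__(self, initial_weights, learning_rate=0.2, sigma=0.5):
        self.weights = [list(w) for w in initial_weights]
        self.learning_rate = learning_rate
        self.sigma = sigma

    def best_matching_unit(self, features):
        dists = [sum((w - f) ** 2 for w, f in zip(unit, features))
                 for unit in self.weights]
        return dists.index(min(dists))

    def classify_and_update(self, features):
        """Return the winning unit index, then pull it (and, more weakly,
        its neighbors) toward the new feature vector."""
        bmu = self.best_matching_unit(features)
        for i, unit in enumerate(self.weights):
            influence = math.exp(-((i - bmu) ** 2) / (2 * self.sigma ** 2))
            step = self.learning_rate * influence
            for d in range(len(unit)):
                unit[d] += step * (features[d] - unit[d])
        return bmu


# Synthetic two-microphone frames: a tone centered on DFT bin K.
N, K = 64, 4


def tone(gain, delay):
    return [gain * math.sin(2 * math.pi * K * (t - delay) / N) for t in range(N)]


user_pair = (tone(1.0, 0), tone(0.5, 1))     # louder and earlier at first mic
ambient_pair = (tone(0.5, 1), tone(1.0, 0))  # louder and earlier at second mic


def feature_vector(pair):
    ipd, iid = interaural_features(*pair)
    return [ipd[K], iid[K]]  # one frequency only, to keep the sketch short


som = TinySOM(initial_weights=[[0.0, -1.0], [0.0, 1.0]])
for _ in range(20):  # the map adapts as the sequence of feature sets arrives
    som.classify_and_update(feature_vector(user_pair))
    som.classify_and_update(feature_vector(ambient_pair))


def is_user_voice(pair):
    """Assumed decision rule: the winning cluster has positive IPD and IID."""
    unit = som.weights[som.classify_and_update(feature_vector(pair))]
    return unit[0] > 0 and unit[1] > 0
```

In this toy run, a tone that arrives louder and one sample earlier at the first microphone settles into a different map unit than the mirrored tone. Because the map keeps updating on every frame, the cluster centers can drift with the acoustic conditions, which is one possible reading of the adaptive clustering in Example 26.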

Landscapes

  • Engineering & Computer Science
  • Physics & Mathematics
  • Acoustics & Sound
  • Health & Medical Sciences
  • Theoretical Computer Science
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Multimedia
  • Computational Linguistics
  • Signal Processing
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Otolaryngology
  • General Health & Medical Sciences
  • Computer Hardware Design
  • Computing Systems
  • Circuit For Audible Band Transducer
  • User Interface Of Digital Computer
  • Power Sources
  • Telephone Function
  • Emergency Alarm Devices
  • Traffic Control Systems

Abstract

A device includes a memory configured to store instructions and one or more processors configured to execute the instructions. The one or more processors are configured to execute the instructions to receive audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone. The one or more processors are also configured to execute the instructions to provide the audio data to a dynamic classifier. The dynamic classifier is configured to generate a classification output corresponding to the audio data. The one or more processors are further configured to execute the instructions to determine, at least partially based on the classification output, whether the audio data corresponds to user voice activity.
EP21790049.7A 2020-10-08 2021-09-17 User voice activity detection using dynamic classifier Pending EP4226371A1 (fr)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202063089507P 2020-10-08 2020-10-08
US17/308,593 US11783809B2 (en) 2020-10-08 2021-05-05 User voice activity detection using dynamic classifier
PCT/US2021/071503 WO2022076963A1 (fr) 2020-10-08 2021-09-17 User voice activity detection using dynamic classifier

Publications (1)

Publication Number Publication Date
EP4226371A1 (fr) 2023-08-16

Family

ID=81079407

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21790049.7A Pending EP4226371A1 (fr) 2020-10-08 2021-09-17 User voice activity detection using dynamic classifier

Country Status (7)

Country Link
US (1) US11783809B2 (fr)
EP (1) EP4226371A1 (fr)
JP (1) JP2023545981A (fr)
KR (1) KR20230084154A (fr)
CN (1) CN116249952A (fr)
BR (1) BR112023005828A2 (fr)
WO (1) WO2022076963A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11776550B2 (en) * 2021-03-09 2023-10-03 Qualcomm Incorporated Device operation based on dynamic classifier

Family Cites Families (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8194882B2 (en) * 2008-02-29 2012-06-05 Audience, Inc. System and method for providing single microphone noise suppression fallback
WO2009151578A2 (fr) * 2008-06-09 2009-12-17 The Board Of Trustees Of The University Of Illinois Method and apparatus for blind signal recovery in noisy, reverberant environments
US9210503B2 (en) * 2009-12-02 2015-12-08 Audience, Inc. Audio zoom
US9165567B2 (en) * 2010-04-22 2015-10-20 Qualcomm Incorporated Systems, methods, and apparatus for speech feature detection
US20110288860A1 (en) 2010-05-20 2011-11-24 Qualcomm Incorporated Systems, methods, apparatus, and computer-readable media for processing of speech signals using head-mounted microphone pair
US8898058B2 (en) * 2010-10-25 2014-11-25 Qualcomm Incorporated Systems, methods, and apparatus for voice activity detection
US9147397B2 (en) * 2013-10-29 2015-09-29 Knowles Electronics, Llc VAD detection apparatus and method of operating the same
KR20150105847A (ko) * 2014-03-10 2015-09-18 삼성전기주식회사 Method and apparatus for detecting voice sections
CN107293287B (zh) * 2014-03-12 2021-10-26 华为技术有限公司 Method and apparatus for detecting audio signal
US10149074B2 (en) * 2015-01-22 2018-12-04 Sonova Ag Hearing assistance system
US9685156B2 (en) 2015-03-12 2017-06-20 Sony Mobile Communications Inc. Low-power voice command detector
KR102367660B1 (ko) * 2015-03-19 2022-02-24 인텔 코포레이션 Microphone array speech enhancement technique
US10242696B2 (en) * 2016-10-11 2019-03-26 Cirrus Logic, Inc. Detection of acoustic impulse events in voice applications
US10475471B2 (en) * 2016-10-11 2019-11-12 Cirrus Logic, Inc. Detection of acoustic impulse events in voice applications using a neural network
US9843861B1 (en) * 2016-11-09 2017-12-12 Bose Corporation Controlling wind noise in a bilateral microphone array
US9930447B1 (en) * 2016-11-09 2018-03-27 Bose Corporation Dual-use bilateral microphone array
US10499139B2 (en) * 2017-03-20 2019-12-03 Bose Corporation Audio signal processing for noise reduction
US10249323B2 (en) 2017-05-31 2019-04-02 Bose Corporation Voice activity detection for communication headset
US10264354B1 (en) * 2017-09-25 2019-04-16 Cirrus Logic, Inc. Spatial cues from broadside detection
US10096328B1 (en) * 2017-10-06 2018-10-09 Intel Corporation Beamformer system for tracking of speech and noise in a dynamic environment
KR20240033108A (ko) * 2017-12-07 2024-03-12 헤드 테크놀로지 에스아에르엘 Voice-recognition audio system and method
US10885907B2 (en) * 2018-02-14 2021-01-05 Cirrus Logic, Inc. Noise reduction system and method for audio device with multiple microphones
US11062727B2 (en) * 2018-06-13 2021-07-13 Ceva D.S.P Ltd. System and method for voice activity detection
EP3675517B1 (fr) * 2018-12-31 2021-10-20 GN Audio A/S Microphone device and headset
US10964314B2 (en) * 2019-03-22 2021-03-30 Cirrus Logic, Inc. System and method for optimized noise reduction in the presence of speech distortion using adaptive microphone array
US11328740B2 (en) * 2019-08-07 2022-05-10 Magic Leap, Inc. Voice onset detection
US11917384B2 (en) * 2020-03-27 2024-02-27 Magic Leap, Inc. Method of waking a device using spoken voice commands

Also Published As

Publication number Publication date
WO2022076963A1 (fr) 2022-04-14
JP2023545981A (ja) 2023-11-01
US20220115007A1 (en) 2022-04-14
CN116249952A (zh) 2023-06-09
KR20230084154A (ko) 2023-06-12
US11783809B2 (en) 2023-10-10
BR112023005828A2 (pt) 2023-05-02

Similar Documents

Publication Publication Date Title
CN111699528B (zh) Electronic device and method of performing functions of the electronic device
US11694710B2 (en) Multi-stream target-speech detection and channel fusion
US9940949B1 (en) Dynamic adjustment of expression detection criteria
CN108922553B (zh) Direction-of-arrival estimation method and system for a speaker device
WO2016160123A1 (fr) Controlling an electronic device based on direction of speech
US11776550B2 (en) Device operation based on dynamic classifier
CN111415686A (zh) Adaptive spatial VAD and time-frequency mask estimation for highly non-stationary noise sources
US11626104B2 (en) User speech profile management
EP4374367A1 (fr) Noise suppression using tandem networks
WO2022206602A1 (fr) Voice wake-up method and apparatus, and storage medium and system
US11783809B2 (en) User voice activity detection using dynamic classifier
US11727926B1 (en) Systems and methods for noise reduction
US20210201928A1 (en) Integrated speech enhancement for voice trigger application
US20210110838A1 (en) Acoustic aware voice user interface
US11205433B2 (en) Method and apparatus for activating speech recognition
EP4383253A2 (fr) Relevance-based source selection for far-field voice systems
CN115331672B (zh) Device control method and apparatus, electronic device, and storage medium
US20220261218A1 (en) Electronic device including speaker and microphone and method for operating the same
CN116153291A (zh) Speech recognition method and device
CN115424628A (zh) Speech processing method and electronic device

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20230206

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)