WO2025188612A1 - Processing of audio signals from earphones for interactivity with voice-activated applications - Google Patents
- Publication number
- WO2025188612A1 (PCT/US2025/018123)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio signal
- user
- processors
- speaking
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
- G10L2015/227—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0324—Details of processing therefor
- G10L21/0332—Details of processing therefor involving modification of waveforms
Definitions
- a computing device may be communicatively coupled with one or more input/output (I/O) devices to accept inputs and provide outputs.
- I/O input/output
- At least one aspect of the present disclosure is directed to systems and methods of identifying users from audio signals acquired via speaker transducers.
- One or more processors may receive a first audio signal corresponding to a first acoustic waveform acquired via a speaker transducer positioned relative to an ear of a first user.
- the first acoustic waveform may have (i) a first portion traveling through the first user and (ii) a second portion traveling outside the first user.
- the one or more processors may filter the second portion of the first acoustic waveform within the first audio signal to generate a second audio signal corresponding to the first portion of the first acoustic waveform.
- the one or more processors may apply the second audio signal to a machine learning (ML) model.
- ML machine learning
- the ML model may be trained using a plurality of examples. Each example of the plurality of examples may identify: (i) a respective third audio signal corresponding to a portion of a respective second acoustic waveform traveling through a respective second user and (ii) an identification of whether the second user is speaking.
- the one or more processors may identify, based on applying the second audio signal to the ML model, that the first user of the speaker transducer is speaking.
- the one or more processors may provide an output based on an identification that the first user of the speaker transducer is speaking.
- the one or more processors may identify that a third user of a second speaker transducer is not speaking, based on applying a fourth audio signal corresponding to a portion of a third acoustic waveform traveling through the third user to the ML model. In some embodiments, the one or more processors may provide a second output based on an identification that the third user of the second speaker transducer is not speaking.
- the one or more processors may receive the first audio signal corresponding to the first acoustic waveform comprising a plurality of formants of the first user. In some embodiments, the one or more processors may filter the first audio signal below a threshold frequency to pass through at least a first formant (F0) of the plurality of formants as the second audio signal.
- F0 first formant
- the one or more processors may apply the second audio signal to the ML model to determine a likelihood that the first user of the speaker transducer is speaking. In some embodiments, the one or more processors may identify that the first user of the speaker transducer is speaking, responsive to the likelihood satisfying a threshold. In some embodiments, the one or more processors may identify that the first audio signal is originating from the first user on which the speaker transducer is positioned.
- the one or more processors may apply a filter to suppress an air channel corresponding to the second portion of the first acoustic waveform and to pass a body channel corresponding to the first portion of the first acoustic waveform.
- the one or more processors may initiate a process to enhance a voice command corresponding to the second audio signal to invoke a function of an application.
- At least one aspect of the present disclosure is directed to systems and methods for enhancing voice commands from audio signals acquired via speaker transducers.
- One or more processors may identify that a user of a speaker transducer is speaking using a first audio signal corresponding to an acoustic waveform traveling through the user.
- the one or more processors may select, responsive to identifying that the user is speaking, a second audio signal corresponding to a voice command comprising one or more keywords for an application.
- the one or more processors may generate a third audio signal to include (i) the first audio signal and (ii) the second audio signal.
- the one or more processors may provide, to the application, the third audio signal to invoke a function of the application corresponding to the one or more keywords of the voice command.
- the one or more processors may determine that the function of the application was not successfully invoked in response to providing the third audio signal to the application. In some embodiments, the one or more processors may select, responsive to determining that the function of the application was not successfully invoked, a fourth audio signal corresponding to a second voice command comprising one or more second keywords for at least one of (i) a second function of the application or (ii) a second application. In some embodiments, the one or more processors may generate a fifth audio signal to include (i) the first audio signal and (ii) the fourth audio signal. In some embodiments, the one or more processors may provide the fifth audio signal corresponding to the second one or more keywords of the voice command to invoke at least one of the second function of the application or the second application.
- the one or more processors may determine that the function of the application was successfully invoked in response to providing the third audio signal to the application. In some embodiments, the one or more processors may refrain from selecting another audio signal for another voice command, responsive to determining that the function of the application was successfully invoked.
- the one or more processors may maintain, on memory, a plurality of audio signals each corresponding to one or more respective keywords to invoke a respective function of at least one of a plurality of applications. In some embodiments, the one or more processors may select the second audio signal from the plurality of audio signals. In some embodiments, the one or more processors may generate the third audio signal to include, in a frequency domain, a first portion corresponding to the first audio signal and a second portion corresponding to the second audio signal. In some embodiments, the one or more processors may modify the second audio signal based on one or more characteristics of the first audio signal.
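- As an illustration of how such a third audio signal could be generated, the sketch below (function names and the 1000Hz split point are assumptions for illustration, not specified by the disclosure) low-passes the user’s recording, high-passes a stored keyword template, and sums the two bands, approximating the frequency-domain combination described above. In practice the template would first be time-, formant-, and energy-aligned as detailed later in this disclosure.

```python
# Hedged sketch: combine the user's low-band recording with a template's
# high band to form the "third audio signal" provided to the application.
from scipy.signal import butter, sosfiltfilt

def combine_signals(recording, template, fs, split_hz=1000.0):
    # Low band: the user's own (body-channel) speech.
    lo = sosfiltfilt(butter(2, split_hz, btype="lowpass", fs=fs, output="sos"), recording)
    # High band: the stored keyword audio supplying the missing high frequencies.
    hi = sosfiltfilt(butter(2, split_hz, btype="highpass", fs=fs, output="sos"), template)
    n = min(len(lo), len(hi))
    return lo[:n] + hi[:n]
```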
- FIG. 1 A few representative examples of the mobile voice activation service.
- Left: the mobile voice activation service allows mobile users to activate their voice assistant without hand intervention.
- Right: the mobile voice activation service can automatically detect the primary speaker, avoiding false alarms.
- FIGs. 2A and 2B (A): human speech production. (B): two human speech transmission channels. (1) air channel, (2) in-body bone-conduction audio pathway.
- FIG. 3 Spectrogram (left) and spectral envelope (right) of the vowel sound /i/.
- the first three formants are denoted as F1, F2, and F3. This audio signal is recorded by a MEMS microphone.
- FIGs. 4A and 4B Feasibility study: speech measurement from (A): a primary speaker; and (B): a nearby speaker.
- Table 1 wakeup word recognition accuracy on five mainstream voice interfaces. Ten volunteers were invited to articulate three wakeup words 10 times each.
- FIG. 5 The spectrogram and formants of the vowel sound /i/ captured by the earphone speaker.
- FIGs. 6A-D Two distinct wakeup words “Hey Siri” and “OK Google” were recorded using the pseudo-microphone and a MEMS microphone, and the spectrogram of each audio recording is plotted. Pseudo-microphone recordings of (A) “Hey Siri” and (C) “OK Google”. MEMS microphone recordings of (B) “Hey Siri” and (D) “OK Google”.
- FIG. 7 Measurement setup (left) and frequency response curves of six pairs of earphones (right). A probing signal spanning the frequency band may be played to the earphone by a loudspeaker in an anechoic chamber.
- FIG. 8 An illustration of the enhancement of the joint speech detection and primary user identification.
- FIGs. 9A and 9B (A) Reconstructed F1-F3 formants through harmonic reconstruction. Google API cannot recognize this keyword. (B) The ground truth F1-F3 formants recorded by a MEMS microphone. Google API can successfully recognize it as “Hey Siri”.
- FIGs. 10A-D Spectrogram and recognized word of each audio clip.
- A The combined signal can be successfully recognized by Google API.
- B The speech recording with high-frequency deafness was falsely recognized as “hi babe” by Google API.
- C The high-frequency component from a template cannot be recognized by Google API.
- D The combination of a non-wakeup word and the high-frequency template cannot be recognized by Google API.
- FIG. 11 (a) syllable alignment and (b) formant alignment.
- FIG. 12 The mobile voice activation service supports wireless (left) and wired (right) connection.
- FIG. 13 Earphones.
- FIG. 14 (a) FRR and (b) FAR across 15 subjects.
- P1 refers to joint speech and primary speaker detection (§4.1.1);
- P2 refers to pitch detection-based enhancement (§4.1.2).
- FIG. 15 Wakeup word recognition accuracy across 15 subjects.
- FIG. 16 Success rate of the mobile voice activation service in seven scenarios.
- FIGs. 17A-G Four stationary and three mobility scenarios for the in-the-wild study: (A) home; (B) cafe; (C) park; (D) train; (E) driving a car; (F) lifting weights in the gym; (G) walking at a busy intersection.
- FIGs. 18A and 18B Benchmark study.
- A the impact of earphone types.
- B the impact of voice loudness.
- FIG. 19 depicts a block diagram of a system for processing audio signals from earphones for interactivity with applications, in accordance with an illustrative embodiment.
- FIG. 20 depicts a block diagram of a process for identifying users from audio signals acquired via earphones in the system for processing audio signals, in accordance with an illustrative embodiment.
- FIG. 21 depicts a block diagram of a process for enhancing user voice commands in audio signals acquired via earphones in the system for processing audio signals, in accordance with an illustrative embodiment.
- FIG. 22 depicts a flow diagram of a method of identifying users from audio signals acquired via earphones, in accordance with an illustrative embodiment.
- FIG. 23 depicts a flow diagram of a method of enhancing voice commands from audio signals acquired via earphones, in accordance with an illustrative embodiment.
- FIG. 24 is a block diagram of a computing environment according to an example implementation of the present disclosure.
- a mobile voice activation service (also referred to herein as “EarVoice”) was implemented: a lightweight mobile service that enables hands-free voice assistant activation on commodity earphones.
- the mobile voice activation service comprises two design modules: one for joint speech detection and primary user identification, exploring the attributes of the air channel and in-body audio pathway to differentiate between the primary user and others nearby; and another for accurate wakeup word enhancement, which employs a “copy, paste, and adapt” approach to reconstruct the missing high-frequency component in speech recordings.
- the mobile voice activation service was deployed on a dongle where the proposed signal processing algorithms are streamlined with a gating mechanism to permit only the primary user’s speech to enter the pairing device (e.g., a smartphone) for wakeup word recognition, preventing unintended disclosure of ambient conversations.
- the dongle was implemented on a 4-layer PCB, and extensive experiments were conducted with 15 participants in both controlled and uncontrolled scenarios. The experiment results show that the mobile voice activation service achieves around 90% wakeup word recognition accuracy in stationary scenarios, which is on par with the high-end, multi-sensor fusion-based AirPods Pro earbuds.
- VA Voice assistant
- while the voice assistant offers flexibility to mobile users, the process of activating it remains inconvenient due to its heavy dependence on hand intervention, particularly on earphones.
- the user has to press and hold the talk/answer button on earphones for a few seconds until hearing the Siri beep.
- wireless earbuds, including Google Pixel Buds, Apple AirPods, and Bose’s QC35, all require users to activate the voice assistant by tapping a touch sensor or holding an action button. This precaution is taken to avoid unintended activation of Siri by someone else nearby. Yet it diverts the user’s attention from their current focus, negatively impacting the user experience. This is especially notable in situations where the user’s hands are occupied, as illustrated in FIG. 1A.
- a hands-free voice activation service stays in idle listening mode continuously, responding whenever a voice command is initiated. To achieve a good user experience, this service should minimize false positives, ensuring that it doesn’t get triggered by ambient voice activities.
- the proposed service should respond to human speech agilely, with minimal or unnoticeable latency. Moreover, as an always-on service running on power-constrained mobile devices, the proposed system design should be low-power, incurring minimal power consumption.
- Voice data should be handled and stored securely, and users should have control over their data. Besides the necessary voice commands for awakening corresponding services, other audio data should avoid being recorded and saved on the smartphone to minimize the risk of privacy leaks.
- a mobile voice activation service, also referred to herein as “EarVoice,” that explores the distinction between the acoustic air channel and the in-body bone-conduction pathway formed in human speech to enable accurate, agile, and low-power hands-free voice activation, all in a privacy-preserving way.
- the system works with everyday earphones (e.g., earphones costing a few US dollars) without breaking their structures, and requires neither in-ear microphones nor dedicated IMU sensors, which are only available on pricey ANC earphones.
- This mobile voice activation service repurposes the earphone speaker into a microphone for wakeup word (e.g., “Hey Siri”) detection. This allows mobile users to wake up their voice assistant using earphones even without a microphone.
- the mobile voice activation service explores an observation that the speech of the primary user reaches the earphone’s speaker transducer through not only the conventional air channel but also via the human body channel, whereas the nearby speaker’s speech solely propagates through the air channel to the earphone speaker transducer, with significant attenuation.
- the mobile voice activation service functions as a hybrid signal-processing pipeline with primary functions running on a low-power dongle while the wakeup word recognition runs on the smartphone.
- the dongle transforms the earphone speaker into a microphone, detects the human voice, distinguishes whether it originates from the primary user, and further enhances the speech quality.
- a prototype of the mobile voice activation service’s dongle was implemented on a 4-layer printed circuit board (PCB). It includes a low power ESP32 MCU, an audio codec chip, and other peripherals to enable the functionality.
- PCB printed circuit board
- the close contact between the earphone speaker transducer and the human skin was identified to offer a unique opportunity to sense the vocal cord vibrations of the speaking user, to tell whether the voice is coming from the primary user or others in the vicinity. Consequently, a lightweight signal processing algorithm was proposed that explores this opportunity to enable hands-free voice assistant activation.
- a gated signal-processing pipeline was designed that can accurately detect, differentiate, and further enhance the incomplete voice command captured by the earphone speaker transducer, all in a low-power and privacy-preserving way. This design holds the potential to be deployed on different types of earphones.
- the mobile voice activation service was implemented on a PCB, and extensive experiments were conducted in both controlled and uncontrolled environments. Experiment results demonstrated that the mobile voice activation service achieves an overall wakeup recognition accuracy of 90% across different real-world scenarios, which is on par with the high-end, multi-sensor fusion-based AirPods Pro earbuds.
- the production of human speech involves intricate coordination between multiple articulatory organs in the vocal system, including lungs, vocal cords (a.k.a. vocal folds), and vocal tract.
- The vocal tract is the area from the nose and the nasal cavity down to the vocal cords, including the throat, mouth (e.g., tongue, teeth, lips), nasal cavity, and facial movement.
- the lungs provide the essential air source required for vocalization. This air subsequently passes through the vocal folds to generate a voice source and is then modulated by the vocal tract to produce output speech.
- Vocal folds generate voiced speech signals by dynamically controlling the airflow originating from the lungs, alternately blocking and permitting it.
- airflow from the lungs may be manipulated directly by the vocal tract to produce unvoiced signals, such as consonant sounds like /f/ and /r/.
- the voiced signals may include two components: (i) vowels and some consonants that carry high-energy pulses in the frequency domain; and (ii) the fundamental pitch F0 and its harmonics.
- the frequency components that determine the intelligibility of speech words are called formants (spectral resonances).
- the first formants in a sentence are usually within the 300-2800Hz frequency band, forming the pronunciation of vowels.
- the follow-up formants stay in a higher frequency band above 3000Hz, as shown in FIG. 3.
- the primary user’s voice reaches the earphone via both an air channel and an in-body channel, while a nearby user’s voice only travels through the air channel. Due to the earphone’s obstruction, only a small fraction of the voice energy from the nearby user reaches the earphone’s speaker. In contrast, the primary user’s voice arrives at the earphone speaker with less attenuation through the in-body channel, providing an opportunity to distinguish the speaker (§3.1).
- Voice fingerprint is proposed to identify the registered primary user and might help determine whether the primary user is interacting with Siri or if someone else nearby is speaking.
- Such a mechanism is prone to various security threats in real life, including impersonation, voice synthesis, and replay attacks.
- the distinct speech propagation channels between the primary speaker and nearby speakers were found to offer another opportunity to distinguish speakers using earphones.
- the speech of the primary user reaches the earphone’s speaker transducer through not only the conventional air channel but also via the human body channel, as depicted in FIG. 2B.
- human speech from a nearby non-primary speaker solely propagates through the air channel to the earphone speaker transducer.
- Air channel for voice propagation: for both the primary speaker and nearby speakers, the voice signal emanating from their mouths can propagate through the air channel.
- the earphone’s speaker transducer captures this signal when the sound reaches the earphone, as denoted by 1 in FIG. 2B.
- when the primary user speaks, the vibrations from her articulatory organs, such as the vocal cords and vocal tract, would travel through the human body and ultimately reach the ear canal.
- because the earphone transducer maintains close contact with the human ear, the speaker transducer is highly likely to detect these vibrations through bone conduction.
- the speaker transducer is able to capture the low-frequency signals stemming from the primary speaker’s vocal tract vibrations, but not from a nearby speaker. This is reasonable as both the vocal cord and vocal tract activity travel through the body channel (in the form of bone conduction) to the earphone diaphragm, a path that suffers less attenuation compared with the air channel.
- the preceding section highlights the potential for distinguishing the primary speaker with dumb earphones.
- when these captured wakeup words were tested with five mainstream voice assistant systems, it was discovered that all of them achieved very low word recognition accuracy, ranging from 1% to 31%.
- the speech recorded by a commercial MEMS microphone achieves a recognition accuracy between 58% and 93%, as shown in Table 1.
- FIG. 7 shows the frequency response of six pairs of earphones across over-ear, on-ear, and in-ear types.
- the frequency response of all six pairs of earphones was observed to decline as the frequency increases.
- in the low-frequency band, the speaker maintains a high frequency response, which facilitates the accurate capture of vocal cord vibrations.
- as the frequency increases, the speaker’s frequency response declines, with an average attenuation of 30 dB. Consequently, the speech in this frequency range experiences substantial attenuation, leading to reduced speech recognition accuracy.
- the mobile voice activation service was proposed to harvest the opportunities aforementioned and tackle the technical challenges identified in the preceding section.
- the mobile voice activation service includes two primary functionalities, namely, joint speech detection and primary user identification (§4.1), and wakeup word enhancement (§4.2).
- This design component strives to promptly detect the presence of human speech from the audio recordings and determine whether it is its own user speaking (i.e., the primary speaker) or someone else nearby.
- Existing speech detectors such as webrtc-vad work in two steps: the audio recording is first sent to an energy detector to locate potential human speech, and these high-energy pitches are then fed to a GMM model to tell whether they are human speech or ambient noise.
- although the energy detector is low-power, it analyzes energy levels of audio recordings across a wide frequency range spanning from 80Hz to 4000Hz, in which ambient noise frequently manifests and in which the pseudo-microphone (i.e., using the earphone speaker as a microphone) is weak (§3.2). This can result in frequent false triggering of the succeeding GMM-based speech detector and lead to an increase in system power consumption.
- existing speech detectors also lack the capability to identify whether it is their own user talking; instead, they transmit all detected speech to the subsequent speech recognition module, which leads to energy wastage.
- the mobile voice activation service instead leverages the unique in-body signal propagation channel to simultaneously identify human speech and the primary speaker using only the energy detector. It does so by detecting energy peaks specifically within the lower 1000Hz frequency band. This particular frequency range is primarily associated with the articulatory organs, making a strong energy peak within this band a reliable indicator of human speech presence. Furthermore, since speech from a nearby speaker propagates through the air channel, resulting in significant attenuation within this lower frequency band (as discussed in §3.1), whether the detected speech belongs to the primary speaker or someone else speaking nearby can be distinguished by analyzing the energy peaks within the frequency range of 0 to 1000Hz.
- the low-frequency energy detector proceeds in two steps: pre-processing and energy profiling.
- let x(t) denote the audio signal recorded by the earphone’s speaker transducer.
- x(t) was filtered with a second-order Butterworth low pass filter (LPF) with a cutoff frequency of 1000Hz to eliminate the out-band noises, which are likely to be polluted by ambient environment noises.
- LPF Butterworth low pass filter
- a cutoff frequency of 50Hz was adopted to remove human motion artifacts in that frequency band.
- the signal normalization would not affect the relative amplitude and frequency distribution of the speech signal.
- Per-frame energy profiling: possible voice activity in the time domain was located by dividing speech signals into time frames. Because speech signals are quasi-stationary within a short time (2-50ms), x(t) was divided into 20ms frames, and the energy of each frame i was calculated as S_i = Σ_n x(n)^2, where x(n) are the data samples within frame i.
- the mobile voice activation service monitors the fluctuations in energy between consecutive frames and sends the audio frame(s) to the primary user identification module if their energy surpasses 1.2 times the average energy, denoted as S_i > 1.2 S_avg.
- the value of S_avg is regularly updated by incorporating new frames while excluding those that have been identified as containing speech.
- the hyper-parameter 1.2 is obtained through the benchmark studies in various noise level settings.
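- To make the per-frame energy profiling concrete, a minimal Python sketch follows (illustrative, not from the disclosure); it assumes the input has already been band-limited to 50-1000Hz and normalized as described above, and the smoothing factor used to update S_avg is an assumption.

```python
import numpy as np

def detect_speech_frames(y, fs, frame_ms=20, ratio=1.2):
    n = int(fs * frame_ms / 1000)                 # 20 ms frames (speech is quasi-stationary)
    frames = y[: len(y) // n * n].reshape(-1, n)
    energies = (frames ** 2).sum(axis=1)          # S_i = sum of x(n)^2 within frame i
    s_avg = energies[0]                           # seed the running average
    speech = []
    for i, s in enumerate(energies):
        if s > ratio * s_avg:                     # S_i > 1.2 * S_avg: candidate speech frame
            speech.append(i)
        else:                                     # update S_avg with non-speech frames only
            s_avg = 0.9 * s_avg + 0.1 * s         # smoothing weights are an assumption
    return speech
```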
- F0 pitch was chosen as the focus for several reasons. Firstly, the F0 pitch is the essential articulation frequency determined by the rate at which the vocal cords vibrate and is controlled by the tension and length of the vocal cords. As these vibrations emanate from the articulatory organs and travel through to the ear canal, the F0 pitch carries the most potent reference of audible energy. Secondly, the frequency of the F0 pitch is less susceptible to certain types of interference compared with other vocal frequencies. For instance, low-frequency vocal tract resonances may be confounded by motion artifacts, and high-frequency harmonics can be masked by ambient noise.
- the spectrogram of the audio signal was obtained using the Short Time Fourier Transform (STFT), and the F0 pitch was then detected on the spectrogram by measuring the maximum coincidence of harmonics.
- STFT Short Time Fourier Transform
- the key insight is that the spectrogram of a speech signal may exhibit prominent peaks at frequencies that are integer multiples of the F0 pitch, stemming from the harmonics present in the speech signal.
- the frequency band of candidate pitches may be set to [90Hz, 250Hz] for running the F0 estimator.
- the power associated with each of these candidate pitches and its corresponding harmonics may be aggregated within the 1000Hz frequency range. In each time frame, the pitch with the highest cumulative power may be identified as the estimated F0 pitch.
- FIG. 8 illustrates this process.
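- A sketch of the harmonic-coincidence F0 estimator described above is given below; the 1Hz candidate grid and STFT parameters are assumptions chosen for illustration.

```python
import numpy as np
from scipy.signal import stft

def estimate_f0(x, fs, nperseg=1024):
    f, t, Z = stft(x, fs=fs, nperseg=nperseg)
    power = np.abs(Z) ** 2
    bin_hz = fs / nperseg                              # frequency resolution per STFT bin
    candidates = np.arange(90.0, 250.0, 1.0)           # candidate pitches in [90Hz, 250Hz]
    f0_track = []
    for frame in power.T:
        scores = []
        for f0 in candidates:
            harmonics = np.arange(f0, 1000.0, f0)      # f0, 2*f0, ... below 1000Hz
            bins = np.round(harmonics / bin_hz).astype(int)
            scores.append(frame[bins].sum())           # cumulative harmonic power
        f0_track.append(candidates[int(np.argmax(scores))])
    return np.array(f0_track)                          # estimated F0 per time frame
```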
- this enhancement module is not in a constant state of activation. Instead, its activation is determined by per-frame energy profiling (§4.1), which calculates the ambient environmental energy level of each time frame. The enhancement module is activated only when the ambient energy level exceeds a predefined threshold, established based on a computation over five frames. This strategic approach allows the mobile voice activation service to activate the enhancement module in noisy environments to bolster accuracy, while also deactivating it under quieter conditions to conserve power.
- §4.1 per-frame energy profiling
- speech recognition systems are primarily designed to interpret content-dependent elements of human speech, such as vowels and consonants, which are characterized by these crucial formants. These systems are tuned to focus less on human speaker-dependent features like tones, prosody, and intonation, aiming to enhance the scalability of speech recognition performance.
- Step 1: Syllable alignment in the time domain.
- a syllable is a fundamental unit in organizing speech sounds for pronunciation in linguistics. Variations in speech pace among different users can lead to discrepancies in voice duration and the number of syllables.
- the mobile voice activation service first aligns captured speech signals with the template by stretching/squeezing the template audio on a syllable basis.
- the primary challenge in this process lies in accurately detecting the boundaries of syllables in the speech recording and adjusting the template’s voice speed to match that of the user, especially in the presence of background noise.
- the energy of the ambient background noise in the speaker’s audio recording was first calculated, and this noise was then subtracted to enhance the speech signal SNR, making the boundary more distinct.
- a pitch identification algorithm was applied to the speech recording to pinpoint the F0 fundamental pitch. This F0 pitch information is used to determine the number and location of syllables and the stretch ratio. The voice stretch is applied on a per-syllable basis.
- when the mobile voice activation service detects discrepancies in the number of syllables between the speech recording and the template (due to variations in speech pace and pronunciation habits), the mobile voice activation service merges syncopal syllables (e.g., /si-ri/) into a single syllable for alignment, as depicted in FIG. 11, panel (a).
- syncopal syllables e.g., /si-ri/
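- A hedged sketch of the per-syllable stretching step follows, assuming syllable boundaries (as sample indices) have already been located via the F0-based procedure above; librosa’s phase-vocoder time stretch stands in for whatever stretching method an implementation might actually use.

```python
import numpy as np
import librosa

def stretch_template(template, tmpl_bounds, user_bounds):
    # tmpl_bounds / user_bounds: lists of (start, end) sample indices, one pair
    # per syllable, for the template and the user's recording respectively.
    out = []
    for (a0, a1), (b0, b1) in zip(tmpl_bounds, user_bounds):
        seg = template[a0:a1]
        rate = (a1 - a0) / max(b1 - b0, 1)   # rate > 1 compresses, rate < 1 stretches
        out.append(librosa.effects.time_stretch(seg, rate=rate))
    return np.concatenate(out)               # template re-timed to the user's pace
```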
- Step 2: Formant alignment in the frequency domain. The audible band signal was divided into a 2D time-frequency matrix. Each time frame in the matrix spans 20 ms, as the audio sound is quasi-stationary over a 2-50 ms period.
- the spectral envelope of each time frame was extracted. As shown in FIG. 3, the spectral envelope is an important cue for the identification of voice sounds and the characterization of formants (spectral resonances).
- the location of the F1 formant (<2kHz) may be aligned in the spectral envelope by determining a shift factor. This shift factor is then adapted to the higher F2-F3 formants in the template signal. Subsequently, the adapted formant signal is copied onto the speech recording for replacement.
- the mobile voice activation service adopts the linear prediction spectral envelope in the implementation.
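- A sketch of extracting the linear prediction spectral envelope and deriving the F1 shift factor is shown below; the LPC order, envelope resolution, and simple peak-picking are assumptions, as the disclosure only specifies that a linear prediction spectral envelope is used.

```python
import numpy as np
import librosa
from scipy.signal import freqz

def spectral_envelope(frame, fs, order=12, n_points=512):
    a = librosa.lpc(frame.astype(float), order=order)  # LPC coefficients A(z)
    w, h = freqz([1.0], a, worN=n_points, fs=fs)       # envelope = |1 / A(e^jw)|
    return w, np.abs(h)

def formant_shift_factor(env_user, env_tmpl, freqs, f_max=2000.0):
    # Locate the dominant sub-2kHz peak (the F1 region) in each envelope;
    # their ratio is the shift factor later adapted to the F2-F3 formants.
    band = freqs < f_max
    peak_user = freqs[band][np.argmax(env_user[band])]
    peak_tmpl = freqs[band][np.argmax(env_tmpl[band])]
    return peak_user / peak_tmpl
```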
- Step 3: Energy alignment.
- the last step is to align the energy between the template and the speech recording.
- speech loudness may vary across individuals; combining the template and the speech recording at different loudness levels would inevitably harm the wakeup word recognition accuracy.
- the average energy levels of the high-frequency component, denoted as P_high, and of the low-frequency component, denoted as P_low, within the template audio were first calculated.
- the energy level of the filtered speech recording in the low-frequency band, denoted as P'_low, was then computed.
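- The alignment itself reduces to a gain computation. Under the assumption that the combined signal should preserve the template’s high-to-low energy ratio relative to the measured P'_low, the amplitude gain applied to the template’s high-frequency component is sqrt(P'_low / P_low), as in this minimal sketch:

```python
import numpy as np

def align_energy(template_high, p_low_template, p_low_recording):
    # Choose gain g so that (g^2 * P_high) / P'_low = P_high / P_low,
    # i.e., g = sqrt(P'_low / P_low); this keeps the template's high band
    # at a loudness consistent with the user's recording.
    gain = np.sqrt(p_low_recording / p_low_template)
    return gain * template_high
```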
- the mobile voice activation service’s signal processing includes a lightweight hardware circuit that transforms the earphone speaker into a microphone, an energy-efficient algorithm that detects human speech and distinguishes whether it is the primary user speaking, as well as a signal enhancement algorithm that improves the quality of the wakeup word. All these signal modules run on a dongle.
- FIG. 12 shows the mobile voice activation service prototype, which supports both wireless connection (through Bluetooth) and wired connection (through a 3.5mm TRRS audio cable).
- This implementation possesses two advantages. First, because the voice detection and primary user identification features are implemented in the plug-in dongle, the earphone transducer doesn’t send all captured audio streams directly to the pairing device (such as a smartphone or laptop) for further processing. Instead, the audio data is processed locally on the dongle, and only legitimate voice commands from the primary user are forwarded to the backend for further processing. Second, this gating approach not only helps prevent unintended disclosure of ambient conversations but also avoids unnecessary acoustic signal processing on smartphones, and thus reduces power consumption.
- the mobile voice activation service dongle comprises two 3.5 mm audio jacks, resistors in the form of a Wheatstone bridge, a power amplifier (INA126), an audio codec chip (ES8388), an onboard computation MCU (ESP32-WROVER-E), and a BLE radio.
- the size of the prototype is 6cm x 4.5cm. It costs approximately 8.3 USD. Its form factor can be further reduced by adopting a stretchable PCB. It is anticipated that this design can be seamlessly incorporated into mainstream True Wireless Stereo (TWS) earbuds by placing the miniaturized circuitry between the transducer and the audio chip.
- 6. Evaluation
- Earphone configurations: voice data is collected using 13 pairs of earphones with different types (e.g., over-ear, on-ear, and in-ear) and transducer sizes, as shown in FIG. 13.
- the mobile voice activation service was evaluated against the AirPods Pro to assess its usability.
- the AirPods Pro takes a leading position among commodity earbuds, particularly excelling in speaking sound quality. This superiority is achieved through the utilization of advanced sensor modalities, including the voice accelerometer and multi-microphone-based beamforming.
- the mobile voice activation service only adopts the speaker transducer as the basic signal receiver.
- FAR False Acceptance Rate
- False Rejection Rate (FRR): this metric evaluates the frequency at which the mobile voice activation service does not activate the voice assistant when the primary user intends to invoke it, over the total number of attempts. A high FRR suggests the mobile voice activation service may encounter difficulties in freely accessing the voice assistant service.
- FIG. 15 shows the recognition success rate for each individual.
- the error bars in the figure indicate performance variations across three different wakeup words.
- the mobile voice activation service achieves a success rate (SR) of 91% on average.
- subject 14 achieves the lowest SR at 61% due to his lower voice volume.
- Such reduced volume adversely affects pitch detection accuracy, subsequently impacting the precision of the alignment processes.
- The mobile voice activation service’s end-to-end performance across various real-world scenarios was next assessed. As shown in FIGs. 17A-G, the evaluation encompasses four stationary and three mobility scenarios to represent typical indoor and outdoor settings. In each scenario, 100 utterances were collected for each wakeup word. The overall success rate of wakeup word recognition was then examined. AirPods are adopted for comparison. FIG. 16 shows the results corresponding to the seven scenarios.
- the mobile voice activation service’s performance is notably lower with in-ear earphones, with a success rate of 62% on average.
- One reason for the better performance of over-ear and on-ear earphones can be attributed to their larger speaker transducers and inherently larger surface contact with the skull, allowing for more efficient transfer of vocal cord vibration energy.
- the smaller transducers of in-ear earphones exhibit reduced sensitivity to voice commands (FIG. 7).
- a potential solution is to adjust the speaker volume or incorporate a power amplifier into the dongle to enhance the signal strength of the speech recording.
- Table 4 summarizes the power consumption of each component. Given a supply voltage of 5V, the sensing module, audio codec, and MCU consume 0.2mW, 60mW, and 152mW, respectively. The total power consumption of the mobile voice activation service is approximately 212 mW in the active mode. An 820 mAh lithium battery can be used to provide up to 19.3 hours of continuous running of the mobile voice activation service. The battery life could be further optimized with duty cycling.
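- As a quick arithmetic check on these figures: 0.2mW + 60mW + 152mW ≈ 212mW in total; at the stated 5V supply this corresponds to a current draw of about 42.4mA, and 820mAh / 42.4mA ≈ 19.3 hours, consistent with the battery life reported above.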
- Voice Assistant Activation Technologies.
- Existing general-purpose voice activity detection (VAD) modules, e.g., Google’s webrtc-vad, GPVAD, and Kaldi-VAD, have been well-studied and integrated into many mobile applications. Nevertheless, applying these designs to earphones faces challenges, as voice communication on earphones can be plagued by environmental noise and, more severely, speech commands from nearby individuals.
- VAD general purpose voice activity detection
- the mobile voice activation service takes advantage of an opportunity hidden in the earphone transducer and develops a hands-free voice activation system while guaranteeing low false positives towards environmental noise and false triggering by voice commands from nearby people.
- the proposed signal-processing algorithm could run efficiently on mobile and embedded devices without complex computation requirements.
- Bone Conduction Microphones have been explored for speech enhancement and voice activation.
- bone conduction sensors such as IMU, voice pickup sensor (VPU), non-audible murmur (NAM) and throat microphone
- VPU voice pickup sensor
- NAM non-audible murmur
- WhisperMask designs a new interface that captures the user’s whispering speech with an embedded condenser microphone hidden in a non-woven mask to reduce the noise interference from the environment.
- In-Ear-Voice developed a low-power personalized VAD system for hearables by exploring the bone conduction sensor. VibVoice utilized the bone conduction response from IMU sensors to enhance speech quality in a noisy environment.
- HeadFi explores the reciprocal principle of earphones and demonstrates the capability of using the earphone transducer for user identification, physiological sensing, touch gesture recognition, etc.
- the hardware dongle builds upon HeadFi but extends it to a software- hardware system that explores two different voice channels to enable hands-free voice activation.
- the mobile voice activation service contributes a novel signal processing pipeline to thoroughly improve the activation accuracy and enhance speech quality.
- Silent speech interface technologies may be explored for enriching speech recognition interfaces.
- LipLearner proposes a customizable silent speech interface on mobile phones by building up the relationship between voice commands and corresponding non-verbal lip movements through a neural network model. It allows users to activate the speech service with lip motions.
- HP-Speech creates a silent speech interface on earphones by emitting inaudible acoustic signals to detect the movement of the temporomandibular joint (TMJ) for silent voice command recognition. MuteIt tracks the user’s jaw motion with a dual-IMU setup to infer word articulation around the ear.
- TMJ temporomandibular joint
- EarCommand emits an ultrasonic signal in the ear canal and builds the relationship between the deformation of the ear canal and the movements of the articulator to infer the corresponding silent speech commands while speaking.
- the mobile voice activation service adheres to the current speech recognition (SR) service, focusing on enhancing their reliability.
- SR speech recognition
- the design, implementation, and evaluation of the mobile voice activation service were presented: a software-hardware solution that enables mobile users to activate their voice assistant on earphones without hand gesture intervention.
- the mobile voice activation service contributes a plethora of low-power signal processing algorithms that take advantage of the two speech signal propagation channels to detect the human speech, differentiate the primary speaker, and further enhance the quality of the wakeup word for accurate wakeup word recognition.
- the experiment in different real-world scenarios demonstrated the efficacy and effectiveness of the mobile voice activation service.
- the system 100 can include at least one computing device 105 and at least one speaker transducer 110, among others.
- the computing device 105 can include at least one voice interaction service 115 and at least one application 120.
- the voice interaction service 115 can include at least one audio processor 125, at least one activity detector 130, at least one voice enhancer 135, and at least one machine learning (ML) model 140, among others.
- the computing device 105 and the speaker transducer 110 can be associated with at least one user 145 (also referred to herein as a primary user).
- Each of the components of system 100 may be implemented using hardware or a combination of hardware and software, such as those of system 600 as detailed herein in conjunction with FIG. 24.
- Each of the components in the system 100 may implement or execute the functionalities detailed herein, such as those detailed herein in Sections 1-8.
- the computing device 105 can be any computing device comprising one or more processors coupled with memory and software and capable of performing the various processes and tasks described herein.
- the computing device 105 can be operated or associated with the user 145.
- the computing device 105 can be a smartphone, other mobile phone, tablet computer, wearable computing device (e.g., smart watch, eyeglasses), or laptop computer, among others.
- the computing device 105 can be in communication with the speaker transducer 110, among other devices (e.g., via wireless communications or wired communications).
- the computing device 105 can be in communication with other devices, such as remote servers, computing devices, or other hardware devices, among others.
- the voice interaction service 115 can process, manage, or otherwise handle the exchange of data from the speaker transducer 110 and the application 120.
- the audio processor 125 can receive and process audio signals from the speaker transducer 110.
- the activity detector 130 can apply the ML model 140 on the processed audio signals from the audio processor 125 to identify that the user 145 is speaking.
- the voice enhancer 135 can add audio recordings of voice commands to the processed audio signal to provide to invoke at least one function of the application 120.
- the ML model 140 can be used to identify whether the user 145 is speaking.
- the ML model 140 may include, for example, a deep learning artificial neural network (ANN), Naive Bayesian classifier, a relevance vector machine (RVM), a support vector machine (SVM), a regression model (e.g., linear or logistic regression), a clustering model (e.g., k-NN clustering or density-based clustering), or a decision tree (e.g., a random tree forest), among others.
- ANN deep learning artificial neural network
- RVM relevance vector machine
- SVM support vector machine
- regression model e.g., linear or logistic regression
- a clustering model e.g., k-NN clustering or density-based clustering
- a decision tree e.g., a random tree forest
- the voice interaction service 115 can be executed on the computing device 105 (e.g., as depicted).
- the voice interaction service 115 can be a process or application separate from the application 120.
- the voice interaction service 115 can be part of the application 120 running on the computing device 105.
- the functionalities ascribed to the voice interaction service 115 can be executed by the application 120.
- the voice interaction service 115 can be executed on a device separate from the computing device 105.
- the voice interaction service 115 can be executed on one or more processors and memory of an external device (e.g., a dongle or other portable device) that is in communication with the computing device 105.
- an external device e.g., a dongle or other portable device
- the application 120 can include any software program executing on the computing device 105.
- the application 120 can be a voice assistant application (sometimes herein referred to as a digital assistant application) to interact with the user 145 via audio input (e.g., spoken queries or commands) and audio output (e.g., audio replies to queries or commands).
- the application 120 can use natural language processing (NLP) and artificial intelligence (Al) techniques to process natural language in the form of audio or text from the user 145.
- the application 120 can include a set of functions corresponding to a set of voice commands from the user 145. For example, the voice command of “Open application X” from the user 145 can invoke the launching or opening of the application 120.
- the application 120 can be, for example, Amazon Alexa™, Apple Siri™, Google Assistant™, or Microsoft Cortana™.
- the application 120 can interface with other processes and applications, such as the voice interaction service 115 to communicate or exchange data.
- the speaker transducer 110 (sometimes herein referred to as an electroacoustic transducer, earphone, or headphone) can produce, output, or otherwise generate acoustic sound waves. To generate the acoustic soundwaves, the speaker transducer 110 can transform or convert electrical signals (also referred to herein as audio signals) from the computing device 105 into the acoustic sound wave. Conversely, the speaker transducer 110 can also produce, output, or otherwise generate electrical signals from acoustic soundwaves. To generate the electrical signals, the speaker transducer 110 can transform or convert acoustic soundwaves arriving at the speaker transducer 110.
- the speaker transducer 110 can be optimized to function as a loudspeaker converting electrical signals into acoustic sound waves, rather than operating as a microphone converting acoustic waveforms into electrical signals. As such, the electrical signals converted from acoustic waveforms by the speaker transducer 110 can be of low quality, with high noise, low amplitude, and high interference, among others.
- the speaker transducer 110 can be a loudspeaker, such as a dynamic speaker, a cone speaker, a horn speaker, a planar magnetic speaker, an electrostatic speaker, or a ribbon speaker, among others.
- the speaker transducer 110 can be part of an earphone (e.g., fitted within the ear of the user 145) or a headphone (e.g., fitted atop the ear of the user 145), among others.
- the earphone or the headphone which the speaker transducer 110 is a part of can lack a separate, specific transducer for a microphone.
- the speaker transducer 110 can be arranged, situated, or otherwise positioned relative to a corresponding ear of the user 145.
- the speaker transducer 110 can be an earphone fitted within an ear canal of the ear of the user 145 or can be a headphone situated about the auricle of the ear of the user 145.
- one speaker transducer 110 can be positioned on the left ear of the user 145 and another speaker transducer 110 can be positioned on the right ear of the user 145.
- FIG. 20 depicted is a block diagram of a process 200 for identifying users from audio signals acquired via earphones in the system 100 for processing audio signals.
- the process 200 can correspond to or include operations performed in the system 100 to identify whether the user 145 is speaking.
- the audio processor 125 of the voice interaction service 115 can obtain, identify, or otherwise receive at least one audio signal 205 corresponding to at least one acoustic waveform 210 via the speaker transducer 110.
- the audio processor 125 can receive the audio signal 205 via a wireless communication (e.g., Bluetooth or near field communications (NFC)) with the speaker transducer 110. In some embodiments, the audio processor 125 can receive the audio signal 205 via a wire connection with the speaker transducer 110.
- the acoustic waveform 210 can include a propagation of energy through a medium, such as the air about the user 145 and the body of the user 145.
- the audio signal 205 can be an electrical (e.g., digitized or quantized) representation of the acoustic waveform 210, and can be sampled at any sampling rate, for instance, ranging from 8-200kHz.
- the acoustic waveform 210 can include at least one portion corresponding to an air channel 215A and at least one portion corresponding to a body channel 215B.
- the air channel 215A can correspond to the portion of the acoustic waveform 210 traveling through the air about the user 145.
- the air channel 215A can include acoustic waveforms originating from other sources (e g., bystanders, vehicles, or background noise) besides the user 145 and reaching the speaker transducer 110.
- the body channel 215B can correspond to the portion of the acoustic waveform 210 traveling through the body of the user 145.
- the body channel 215B can correspond to acoustic waveforms originating from the vocal cords and lungs of the user 145 and arriving at the speaker transducer 110 through the body of the user 145.
- the acoustic waveform 210 can include a set of formants (e.g., F0, F1, F2, and so forth from lowest to highest frequency) from the user 145.
- Each formant can correspond to an acoustic resonance of a vocal tract of the user 145 and can be characterized as a maximum within a frequency domain representation of the human speech from the user 145.
- the speaker transducer 110 Upon arrival at the speaker transducer 110, the speaker transducer 110 can transform or convert the acoustic waveform 210 into the audio signal 205.
- the audio processor 125 can process or filter out the portion of the acoustic waveform 210 corresponding to the air channel 215A in the audio signal 205 to output, produce, or otherwise generate at least audio signal 205’.
- the audio signal 205’ can include or correspond to the portion of the acoustic waveform 210 corresponding to the body channel 215B.
- the audio signal 205’ can include at least the lower formants (e.g., F0 and F1) of the speech sound from the user 145.
- the audio signal 205’ can substantially (e.g., at least 75%) lack the portion of the acoustic waveform 210 corresponding to the air channel 215A.
- the audio processor 125 can filter out noise in the portion of the acoustic waveform 210 corresponding to the body channel 215B to generate the audio signal 205’.
- the audio signal 205’ can also substantially (e.g., at least 75%) lack noise within the portion of the acoustic waveform 210 corresponding to the body channel 215B.
- the audio processor 125 can apply at least one filter.
- the filter can include, for example, a low-pass filter (LPF), a band-pass filter (BPF), a band-stop filter (BSF), or a high-pass filter (HPF), among others, or any combination thereof.
- LPF low-pass filter
- BPF band-pass filter
- BSF band-stop filter
- HPF high-pass filter
- the filter can be implemented using any type of architecture, such as a resistor-capacitor (RC) filter, a resistor-inductor (RL) filter, an RLC filter, an active filter, a Butterworth filter, a Chebyshev filter, or a Bessel filter, among others.
- the audio processor 125 can apply an LPF with a cutoff frequency to attenuate or suppress the portion of the acoustic waveform 210 corresponding to the air channel 215A in the audio signal 205.
- the cutoff frequency of the LPF can be set at a frequency value to pass the portion of the acoustic waveform 210 corresponding to the body channel 215B and can range around 800-1200Hz.
- the cutoff frequency can be set to pass through at least a first formant (F0) of the set of formants within the speech of the user 145.
- the audio processor 125 can also apply an HPF with another cutoff frequency to the filtered audio signal 205 to attenuate or suppress noise within the portion of the acoustic waveform 210 corresponding to the body channel 215B in the audio signal 205.
- the cutoff frequency for the HPF can be set at a frequency value to remove lower frequency components below the range of frequencies for a human vocal tract and can range between 25-100Hz.
- the audio processor 125 can apply a BPF with the cutoff frequencies to suppress the portion of the acoustic waveform 210 corresponding to the air channel 215A and the noise within the portion of the acoustic waveform 210 corresponding to the body channel 215B in the audio signal 205. From applying the BPF, the audio processor 125 can generate the audio signal 205’.
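- A minimal sketch of this LPF + HPF cascade is shown below; the exact cutoffs (chosen here within the stated 800-1200Hz and 25-100Hz ranges), filter order, and zero-phase filtering are implementation assumptions rather than requirements of the disclosure.

```python
from scipy.signal import butter, sosfiltfilt

def isolate_body_channel(audio_205, fs, lpf_hz=1000.0, hpf_hz=50.0, order=2):
    lpf = butter(order, lpf_hz, btype="lowpass", fs=fs, output="sos")   # suppress air channel 215A
    hpf = butter(order, hpf_hz, btype="highpass", fs=fs, output="sos")  # suppress sub-vocal noise
    return sosfiltfilt(hpf, sosfiltfilt(lpf, audio_205))                # approximates audio signal 205'
```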
- the activity detector 130 can determine or identify whether the user 145, on which the speaker transducer 110 is situated, is speaking. To identify, the activity detector 130 can apply the audio signal 205’ to the ML model 140.
- the ML model 140 can be implemented using any model architecture and can have at least one input corresponding to an audio signal, at least one output indicating whether a user is speaking, and a set of weights relating the input to the output, among others.
- the ML model 140 can be a lightweight model architecture, with minimal resource consumption specifications that a portable or mobile device (e.g., the computing device 105 or external hardware device) can satisfy.
- the activity detector 130 can feed or input the audio signal 205’ into the ML model 140. Upon feeding, the activity detector 130 can process the input audio signal 205’ in accordance with the set of weights of the ML model 140.
- the ML model 140 may have been initialized, trained, or established (e.g., by the voice interaction service 115 or another computing device) using a training dataset.
- the ML model 140 can be trained in accordance with any learning techniques, such as supervised learning, unsupervised learning, Q-learning or weakly supervised learning, among others.
- the training dataset can include a set of examples. Each example can include or identify a sample audio signal including a portion of an acoustic waveform corresponding to a body channel of a respective sample user (e.g., on which the speaker transducer for the sample audio signal is situated). Each example can also include or identify a label indicating whether the sample user is speaking for the corresponding sample audio signal. The label can also indicate whether someone else besides the sample user is speaking or whether the sample audio is of ambient noise (e.g., in the background).
- the sample audio signal can be applied to the ML model 140 to produce or generate an output indicating whether the sample user in the sample audio signal is speaking.
- the output of the ML model 140 can be compared with the indication as identified by the label. Based on the comparison, a loss metric can be calculated in accordance with a loss function (e.g., a hinge loss, a mean squared error (MSE), a mean absolute error (MAE), a cross-entropy loss, a Huber loss, or a log loss).
- the loss metric can be used to modify or update one or more of the set of weights of the ML model 140.
- the updating of the weights of the ML model 140 may be in accordance with an optimization function (e.g., stochastic gradient descent with a predefined learning rate). This process can be iteratively repeated until the ML model 140 reaches a convergence condition to stop or cease the training process.
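A minimal PyTorch sketch of this training loop is shown below. The tiny fully connected network, the fixed 400-sample input frame, the binary cross-entropy loss, and the 0.01 learning rate are all illustrative assumptions standing in for the model architecture, loss function, and optimizer settings left open above.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the lightweight ML model 140: frames of the
# body-channel signal in, a single "wearer is speaking" logit out.
model = nn.Sequential(nn.Linear(400, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # predefined rate
loss_fn = nn.BCEWithLogitsLoss()  # one of the loss functions named above

def train_step(sample_audio, label):
    """One iteration: compare the model output to the label, update weights.

    sample_audio: float tensor of shape (batch, 400); label: float tensor of
    shape (batch, 1) with 1.0 = sample user speaking, 0.0 = otherwise.
    """
    optimizer.zero_grad()
    loss = loss_fn(model(sample_audio), label)  # loss metric per the label
    loss.backward()
    optimizer.step()                            # update the set of weights
    return loss.item()                          # repeat until convergence
```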
- the training of the ML model 140 can be performed on another computing system, separate from the computing device 105, and then loaded on the computing device 105 (e.g., when the voice interaction service 115 is installed).
- the activity detector 130 can determine or identify whether the user 145 of the speaker transducer 110 is speaking. In some embodiments, the activity detector 130 can identify whether the audio signal 205 (and by extension, the acoustic waveform 210) is primarily (e.g., at least 75%) originating from the user 145. From processing the audio signal 205’ using the weights of the ML model 140, the activity detector 130 can produce or generate a classification indicating whether the user 145 of the speaker transducer 110 is speaking. When the classification from the ML model 140 indicates that the user is speaking, the activity detector 130 can identify the user 145 as speaking.
- the activity detector 130 can also determine or identify that the audio signal 205 is primarily originating from the user 145. Conversely, when the classification from the ML model 140 indicates that the user is not speaking, the activity detector 130 can identify the user 145 as not speaking. The activity detector 130 can also determine or identify that the audio signal 205 is not primarily originating from the user 145.
- in some embodiments, the output of the ML model 140 can include a likelihood (e.g., a probability) that the user 145 is speaking, and the activity detector 130 can compare the likelihood with a threshold.
- the threshold can delineate, define, or otherwise identify a value (e.g., 80-95%) for the likelihood at which to identify the user 145 as speaking. If the likelihood satisfies (e.g., is greater than or equal to) the threshold, the activity detector 130 can identify the user 145 as speaking. Conversely, if the likelihood does not satisfy (e.g., is less than) the threshold, the activity detector 130 can identify the user 145 as not speaking.
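At inference time, the likelihood-versus-threshold decision reduces to a few lines. In this sketch the 0.9 threshold is one value inside the 80-95% range given above, and the model is assumed to emit a single logit per fixed-length frame, as in the training sketch.

```python
import torch

SPEAKING_THRESHOLD = 0.9  # illustrative value within the 80-95% range above

def is_wearer_speaking(model, filtered_audio):
    """Classify the body-channel signal 205': is the wearer speaking?

    filtered_audio: float tensor of shape (400,), one fixed-length frame.
    """
    with torch.no_grad():
        likelihood = torch.sigmoid(model(filtered_audio)).item()
    return likelihood >= SPEAKING_THRESHOLD  # satisfies threshold -> speaking
```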
- the activity detector 130 can generate or provide at least one output 220 based on the identification of whether the user 145 is speaking.
- the output 220 can include or identify information based in part on the identification of whether the user 145 is speaking.
- the activity detector 130 can generate the output 220 to indicate that the user 145 is speaking.
- the activity detector 130 can generate the output 220 to indicate that the user 145 is not speaking.
- the information of the output 220 can also include the audio signal 205’ (or the audio signal 205) or an identifier of the user 145, among others. With the generation of the output 220, the activity detector 130 can send, relay, or otherwise provide the output 220 to the application 120.
- the activity detector 130 can store and maintain an association between the output 220 and an identifier of the audio signal 205 in memory.
- the association can be maintained using one or more data structures, such as a table, a linked list, an array, a matrix, a tree, a heap, a queue, or a stack, among others.
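Maintaining this association can be as simple as a keyed in-memory table, one of the data structures named above; a minimal sketch:

```python
# Minimal sketch: map each audio-signal identifier to its output 220.
output_by_signal_id: dict[str, dict] = {}

def record_output(signal_id: str, output: dict) -> None:
    """Store the association between the output and the signal identifier."""
    output_by_signal_id[signal_id] = output
```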
- Referring now to FIG. 21, depicted is a block diagram of a process 300 for enhancing user voice commands in audio signals acquired via earphones in the system 100 for processing audio signals.
- the process 300 can correspond to or include operations performed in the system 100 to add pre-recorded audio to audio signals acquired via the speaker transducer 110.
- the process 300 can be initiated or performed, when the activity detector 130 identifies that the user 145 of the speaker transducer 110 is speaking using the audio signal 205’ corresponding to the portion of the acoustic waveform 210 associated with the body channel 215B.
- the voice enhancer 135 of the voice interaction service 115 can access data storage 305 to retrieve, obtain, or otherwise identify at least one of a set of recorded audio signals 310A-N (hereinafter generally referred to as audio signals 310).
- the voice enhancer 135 can retrieve, identify, or otherwise receive the audio signal 205’ (or the original audio signal 205) and the output 220, among others.
- Each recorded audio signal 310 can correspond to a respective voice command including one or more keywords to invoke at least one respective function of a corresponding application 120.
- the function can include, for example: a wake up command (e.g., “Open application X”) to launch or open the application 120; a volume control command (e.g., “Increase volume” or “Decrease volume”) to adjust the volume of the audio output from the application 120; a command to control household appliances (e.g., “Turn on lights” or “Turn off stove”) through the application 120, or any built-in functionality of the application 120, among others.
- each recorded audio signal 310 can be associated with at least one application 120.
- the data storage 305 can be maintained on the voice interaction service 115 (e.g., as shown), a memory of the computing device 105, or a remote service accessible to the voice interaction service 115 via one or more networks, among others.
- the data storage 305 can store, maintain, or otherwise include an association between each recorded audio signal 310 and a respective application 120 (or a function of the respective application 120).
- the recorded audio signal 310 can include a portion of the acoustic waveform for the voice command corresponding to an air channel (e.g., similar to the air channel 215A).
- the audio signal corresponding to the acoustic waveform for the voice command may have been acquired from another user (e.g., different from the user 145), and may have been filtered to pass the portion corresponding to the air channel to generate the recorded audio signal 310.
- the recorded audio signal 310 as a result can include or contain higher frequency components of the acoustic waveform for the voice command.
- the recorded audio signal 310 can include higher formants (e.g., F1-F3, or F2 and F3) of the speaker uttering the voice command.
- the recorded audio signal 310 can include the entirety of the acoustic waveform for the voice command corresponding to an air channel (e.g., similar to the air channel 215A) and a body channel (e.g., similar to the body channel 215B).
- the recorded audio signal 310 can include the entirety of the frequency components of the acoustic waveform for the voice command.
- the voice enhancer 135 can identify or select at least one recorded audio signal 310’ of the set of recorded audio signals 310 from the data storage 305. In some embodiments, the voice enhancer 135 can select the recorded audio signal 310’ based on an identification of the application 120. For example, the voice enhancer 135 can identify the application 120 as a voice assistant application and can select the recorded audio signal 310’ associated with the application 120 from the data storage 305. In some embodiments, the voice enhancer 135 can select the recorded audio signal 310’ from the set of recorded audio signals 310 at random. In contrast, when the user 145 is identified as not speaking as indicated by the output 220, the voice enhancer 135 can refrain from identifying or selecting any of the recorded audio signals 310.
- the voice enhancer 135 can produce, create, or otherwise generate at least one audio signal 205” to include the audio signal 205’ and the recorded audio signal 310’.
- the voice enhancer 135 can insert, join, or otherwise add the recorded audio signal 310 to the audio signal 205’ (or the audio signal 205).
- the voice enhancer 135 can generate the audio signal 205” to include one portion (e.g., a higher frequency band above 800-1200Hz) corresponding to the recorded audio signal 310’ and another portion (e.g., a lower frequency band below 800-1200Hz) corresponding to the audio signal 205’.
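This band-splicing step can be sketched with the same scipy filters as before. The 1 kHz crossover is an assumed value inside the 800-1200 Hz range given above, and both inputs are assumed to be already time-aligned and of equal length (alignment is covered next).

```python
from scipy.signal import butter, sosfiltfilt

def splice_bands(body_signal, template, fs=16000, crossover_hz=1000.0, order=4):
    """Sketch of forming audio signal 205'': the low band of the wearer's
    body-channel signal 205' plus the high band of the recorded template 310'."""
    lo = butter(order, crossover_hz, btype="lowpass", fs=fs, output="sos")
    hi = butter(order, crossover_hz, btype="highpass", fs=fs, output="sos")
    return sosfiltfilt(lo, body_signal) + sosfiltfilt(hi, template)
```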
- the voice enhancer 135 can process or parse the audio signal 205’ (e.g., using a short-time frequency transform (STFT) representation) to extract, determine, or otherwise identify the one or more characteristics.
- the voice enhancer 135 can apply one or more machine learning (ML) models (e.g., artificial neural networks (ANNs)), statistical models (e.g., autocorrelation), or other functions (e.g., linear predictive coding (LPC) or cepstral analysis), among others, to identify the characteristics.
- the characteristics can include or identify, for example, a time length of the audio signal 205’, a time point of syllables (e.g., an onset of consonants or vowels) in the audio signal 205’, a time point of each formant (e.g., an onset of F0, F1, or F2) of the speech from the user 145 in the audio signal 205’, or a distribution of energy across time in the audio signal 205’, among others.
- the voice enhancer 135 can likewise process or parse the audio signal 310’ (e.g., using an STFT representation) to extract, determine, or otherwise identify the one or more characteristics.
- the characteristics can include or identify, for example, a time length of the audio signal 310’, a time point of syllables (e.g., an onset of consonants or vowels) in the audio signal 310’, a time point of each formant (e.g., an onset of F0, F1, or F2) of the speech of the recorded speaker in the audio signal 310’, or a distribution of energy across time in the audio signal 310’, among others.
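Two of the simpler characteristics above, the time length and the distribution of energy across time, can be read off an STFT directly, as in the sketch below (scipy's stft with an assumed 512-sample window; formant and syllable-onset tracking would require a more elaborate analysis, such as the LPC or cepstral methods named above).

```python
import numpy as np
from scipy.signal import stft

def signal_characteristics(audio, fs=16000):
    """Sketch: extract time/energy characteristics from an STFT of a signal."""
    _, t, Z = stft(audio, fs=fs, nperseg=512)
    energy = (np.abs(Z) ** 2).sum(axis=0)        # energy per time frame
    return {
        "time_length_s": len(audio) / fs,        # duration of the signal
        "energy_over_time": energy,              # energy distribution in time
        "peak_energy_time_s": float(t[energy.argmax()]),
    }
```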
- the voice enhancer 135 can alter, change, or otherwise modify the recorded audio signal 310 based on one or more characteristics of the audio signal 205’ (or the audio signal 205). Using the characteristics of the audio signal 205’, the voice enhancer 135 can alter, adjust, or otherwise modify corresponding characteristics of the recorded audio signal 310’. In modifying, the voice enhancer 135 can alter the characteristics of the audio signal 310’ to match, align, or otherwise correspond with the characteristics of the audio signal 205’, or vice-versa. In some embodiments, the voice enhancer 135 can modify the time length of the recorded audio signal 310’ to substantially (e.g., at least 80-95%) match or correspond to the time length of the audio signal 205’.
- the voice enhancer 135 can lengthen or shorten the time length of the audio signal 310’ to match the time length of the audio signal 205’.
- the voice enhancer 135 can modify the time point of syllables in the recorded audio signal 310’ to substantially (e.g., at least 80-95%) align, match, or correspond to the time point of syllables in the audio signal 205’.
- the voice enhancer 135 can adjust the timing of onset of at least one consonant or vowel in the recorded audio signal 310’ to align with the timing of onset of at least one consonant or vowel in the audio signal 205’.
- the voice enhancer 135 can modify the time point of the formants (e.g., F2-F3) in the recorded audio signal 310’ to substantially (e.g., at least 80-95%) match or correspond to the time point of the formants (e.g., F0, F1, or F2) of the audio signal 205’.
- the voice enhancer 135 can move the time point of at least one formant (e.g., higher formants F2 or F3) in the recorded signal 310’ to align with the time point of a lower formant (e.g., F0 or F1) in the audio signal 205’.
- the frequency information of the recorded signal 310’ can reside above the frequency information of the audio signal 205’.
- the voice enhancer 135 can modify the distribution of energy across time in the recorded audio signal 310’ to substantially (e.g., at least 80-95%) match or correspond to the distribution of energy of the audio signal 205’.
- the voice enhancer 135 can concentrate or disperse at least a portion (e.g., 60-90%) of the distribution of energy across time in the recorded audio signal 310’ to align with a portion of the distribution of energy across time in the audio signal 205’.
- the voice enhancer 135 can perform any number of modifications to the recorded audio signal 310’ to enhance the audio signal 205’.
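The coarsest of these modifications, matching the time length of the template 310' to that of the signal 205', can be sketched with plain resampling. Note this is an assumed simplification: resampling also shifts frequencies, so a production implementation would more likely use a pitch-preserving time stretch plus the syllable- and formant-level alignment described above.

```python
from scipy.signal import resample

def match_time_length(template, target_len):
    """Stretch or compress the recorded template 310' to target_len samples
    so its duration matches the wearer's signal 205'."""
    return resample(template, target_len)
```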
- the voice enhancer 135 can join or add the modified, recorded audio signal 310’ to the audio signal 205’ to form, produce, or otherwise generate the audio signal 205”.
- the recorded audio signal 310 can be added to the higher frequency bands to increase or enhance the likelihood that the function of the application 120 is successfully invoked.
- the voice enhancer 135 can send, relay, or otherwise provide the audio signal 205” to the application 120.
- the voice enhancer 135 can provide the audio signal 205” via an application programming interface (API) for the application 120.
- the application 120 executing on the computing device 105 can process the audio signal 205” provided by the voice interaction service 115.
- the application 120 can include natural language processing (NLP) and artificial intelligence (AI) functionalities to process and extract information from the audio signal 205”.
- the application 120 can carry out, perform, or otherwise execute the specified function. For example, when the audio signal 205” is for the wake-up command, the application 120 can cease sleep mode and launch as a foreground process on the computing device 105.
- the application 120 can produce, output, or generate an error indication. In some embodiments, the application 120 can refrain from responding to the audio signal 205”.
- the application 120 can output, produce, or otherwise generate at least one response 315 to indicate whether the invocation of the function of the application 120 is successful or failed.
- the application 120 can generate the response 315 to indicate that the invocation of the function is successful.
- the application 120 can generate the response 315 to indicate that the invocation of any function of the application 120 has failed.
- the application 120 can refrain from generation of the response 315 when the audio signal 205” is identified as not corresponding to any function of the application 120. With the generation, the application 120 can return, send, or otherwise provide the response 315 to the voice interaction service 115.
- the voice enhancer 135 can identify or determine whether the invocation of the function of the application 120 was successful or a failure based on the response 315 from the application 120. When the response 315 indicates that the invocation of the function of the application 120 was successful, the voice enhancer 135 can determine that the invocation of the function of the application 120 was successful. The voice enhancer 135 can also refrain from additional selections of recorded audio signals 310’ for another voice command or for another application to add to the audio signal 205’. Conversely, when the response 315 indicates that the invocation of the function of the application 120 was a failure, the voice enhancer 135 can determine that the invocation of the function of the application 120 failed.
- the voice enhancer 135 can determine that the invocation of the function of the application 120 failed, based on lack of any response 315 from the application 120. For instance, the voice enhancer 135 can maintain a timer to keep track of time, subsequent to providing the audio signal 205” to the application 120. The voice enhancer 135 can wait for the response 315 from the application 120. When the elapsed time is above a threshold (e.g., 5-30 seconds) and no response 315 is received, the voice enhancer 135 can determine that the invocation of the function of the application 120 failed.
- the voice enhancer 135 can repeat the selection of another recorded audio signal 310’ for another voice command for the same application 120 or for another application 120.
- the process for selection of the subsequent recorded audio signal 310’ and the determination of whether the invocation of the function of the application 120 is successful can be similar as described herein.
- the voice enhancer 135 can select another recorded audio signal 310’ from the set of recorded audio signals 310.
- the voice enhancer 135 can generate another audio signal 205” using the audio signal 205’ and the next selected recorded audio signal 310’ to provide to invoke another function of the same application 120 or another application 120.
- the voice enhancer 135 can determine whether the function of the application 120 is successfully invoked. The processes of the voice enhancer 135 can be repeated any number of times until one of the applications 120 is invoked or a threshold number of attempts is exceeded. If the number of attempts has not exceeded the threshold (e.g., 5-20 attempts), the voice enhancer 135 can continue selection of another recorded audio signal 310’ for the voice command for the application 120. If the number of attempts has exceeded the threshold, the voice enhancer 135 can terminate selection of other recorded audio signals 310’.
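The failure handling and retry behavior described above can be summarized as follows. The 10-second timeout and 10-attempt cap are assumed values inside the 5-30 second and 5-20 attempt ranges given above, and build_enhanced_signal, send_to_application, and app.poll_response are hypothetical helpers standing in for the steps described earlier.

```python
import time

RESPONSE_TIMEOUT_S = 10   # assumed value within the 5-30 s range above
MAX_ATTEMPTS = 10         # assumed value within the 5-20 attempt range above

def invoke_with_retries(body_signal, recordings, app):
    """Sketch: splice in successive recorded commands until one succeeds."""
    for recording in recordings[:MAX_ATTEMPTS]:
        enhanced = build_enhanced_signal(body_signal, recording)  # hypothetical
        send_to_application(app, enhanced)                        # hypothetical
        deadline = time.monotonic() + RESPONSE_TIMEOUT_S
        while time.monotonic() < deadline:
            response = app.poll_response()                        # hypothetical
            if response is not None:
                if response.success:
                    return True   # success: refrain from further selections
                break             # explicit failure: try the next recording
            time.sleep(0.1)       # no response by the deadline -> failure
    return False                  # attempt cap reached: terminate selection
```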
- the voice interaction service 115 can provide for voice command activation on speaker transducers 110 on earphones without the use of additional hardware modifications. By repurposing the speaker transducer 110 of earphones into a microphone, the voice interaction service 115 may reduce or eliminate reliance on in-ear microphones or dedicated sensors.
- the voice interaction service 115 can use the ML model 140, which is a lightweight model, to distinguish between the speech from the primary user 145 as opposed to other speakers or ambient noise.
- the ML model 140 can leverage the propagation characteristics of human speech through the air channel 215A versus the body channel 215B as sensed by the speaker transducer 110. The ability to accurately distinguish can lower the instances of, or prevent, unintended activation of the application 120 and thereby reduce processor and power consumption, making the computing device 105 more efficient in terms of processing and electrical power.
- the voice interaction service 115 can offer robust wake-up word recognition by compensating for the loss of higher-frequency energies of the audio from propagating through the body channel 215B in the body of the primary user.
- the voice interaction service 115 can take a copy, paste, and adapt approach using a bank of recorded audio signals 310.
- the recorded audio signals 310 can be used to add back high-frequency components to the audio signal 205’. The addition of these higher-frequency components can increase the likelihood that the voice command is recognized by the application 120, thereby improving the accuracy of wake-up word recognition even in noisy environments.
- the voice interaction service 115 can also function as a gating mechanism by forwarding only legitimate voice commands to the application 120, thereby minimizing the risk of privacy leaks and conserving power from processing otherwise unintelligible audio. This can also enhance the quality of human-computer interactions (HCI) between the user 145 and the application 120. In certain environments, the voice interaction service 115 can permit the user 145 to use voice commands with the speaker transducer 110 in a hands-free manner, thereby allowing the user to use the application 120 even in situations where it would be otherwise impractical. Overall, the voice interaction service 115 can provide for high accuracy in wakeup word recognition through a lightweight signal processing algorithm that distinguishes between the primary user's speech and ambient noise, ensuring low false positive rates and efficient power consumption.
- a computing system can receive an audio signal from an earphone (405).
- the computing system can filter out an air channel portion from the received audio signal (410).
- the computing system can apply the filtered audio signal to a machine learning (ML) model (415).
- the computing system can identify whether the user of the earphone is speaking based on the application of the filtered audio signal to the ML model (420). If the user is identified as speaking, the computing system can identify the user as speaking (425). Otherwise, if the user is identified as not speaking, the computing system can identify the user as not speaking (430).
- the computing system can provide an output based on the identification (435).
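Composed end to end, this method corresponds to the short pipeline below, reusing isolate_body_channel and is_wearer_speaking from the earlier sketches; receive_earphone_audio and emit_output are hypothetical I/O helpers, and the 400-sample frame length is the same assumption as before.

```python
import torch

def identify_speaker(model, fs=16000):
    """Sketch of steps (405)-(435): receive, filter, classify, output."""
    audio = receive_earphone_audio()                 # (405), hypothetical I/O
    body = isolate_body_channel(audio, fs=fs)        # (410) filter air channel
    frame = torch.as_tensor(body[:400], dtype=torch.float32)  # fixed frame
    speaking = is_wearer_speaking(model, frame)      # (415)-(430) ML decision
    emit_output(speaking)                            # (435), hypothetical I/O
    return speaking
```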
- a computing system can identify the user as speaking from an audio signal received via an earphone (505).
- the computing system can select an audio recording for a voice command to add to the audio signal (510).
- the computing system can provide an enhanced audio signal with the voice command to invoke a function of an application (515).
- the computing system can receive an indication from the application (520).
- the computing system can determine whether the invocation of the function is successful (525). If the function is not successfully invoked, the computing system can repeat the functionality from step (510). If the function is determined to have been successfully invoked, the computing system can refrain from additional selection (530).
- FIG. 24 shows a block diagram of a representative computing system 614 usable to implement the present disclosure.
- the methods 400 and 500 may be implemented by the computing system 614.
- Computing system 614 can be implemented, for example, as a consumer device such as a smartphone, other mobile phone, tablet computer, wearable computing device (e.g., smart watch, eyeglasses, head mounted display), desktop computer, laptop computer, cloud computing service or implemented with distributed computing devices.
- the computing system 614 can include computer components such as processors 616, storage device 618, network interface 620, user input device 622, and user output device 624.
- Network interface 620 can provide a connection to a wide area network (e.g., the Internet) to which the WAN interface of a remote server system is also connected.
- Network interface 620 can include a wired interface (e.g., Ethernet) and/or a wireless interface implementing various RF data communication standards such as Wi-Fi, Bluetooth, or cellular data network standards (e.g., 4G, 5G, 60 GHz, LTE, etc.).
- User input device 622 can include any device (or devices) via which a user can provide signals to computing system 614; computing system 614 can interpret the signals as indicative of particular user requests or information.
- User input device 622 can include any or all of a keyboard, a controller (e.g., joystick), touch pad, touch screen, mouse or other pointing device, scroll wheel, click wheel, dial, button, switch, keypad, microphone, sensors (e.g., a motion sensor, an eye tracking sensor, etc.), and so on.
- User output device 624 can include any device via which computing system 614 can provide information to a user.
- user output device 624 can include a display to display images generated by or delivered to computing system 614.
- the display can incorporate various image generation technologies, e.g., a liquid crystal display (LCD), light-emitting diode (LED) including organic light-emitting diodes (OLED), projection system, cathode ray tube (CRT), or the like, together with supporting electronics (e.g., digital-to-analog or analog-to-digital converters, signal processors, or the like).
- a device such as a touchscreen that functions as both an input and output device can be used.
- Output devices 624 can be provided in addition to or instead of a display. Examples include indicator lights, speakers, tactile “display” devices, printers, and so on.
- Some implementations include electronic components, such as microprocessors, storage, and memory that store computer program instructions in a non-transitory computer readable storage medium. Many of the features described in this specification can be implemented as processes that are specified as a set of program instructions encoded on a computer readable storage medium. When these program instructions are executed by one or more processors, they cause the processors to perform various operations indicated in the program instructions. Examples of program instructions or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter. Through suitable programming, processor 616 can provide various functionalities for computing system 614, including any of the functionalities described herein as being performed by a server or client, or other functionalities associated with message management services.
- it is to be understood that computing system 614 is illustrative and that variations and modifications are possible. Computer systems used in connection with the present disclosure can have other capabilities not specifically described here. Further, while computing system 614 is described with reference to particular blocks, it is to be understood that these blocks are defined for convenience of description and are not intended to imply a particular physical arrangement of component parts. For instance, different blocks can be located in the same facility, in the same server rack, or on the same motherboard. Further, the blocks need not correspond to physically distinct components. Blocks can be configured to perform various operations, e.g., by programming a processor or providing appropriate control circuitry, and various blocks might or might not be reconfigurable depending on how the initial configuration is obtained.
- Implementations of the present disclosure can be realized in a variety of apparatus including electronic devices implemented using any combination of circuitry and software.
- the hardware and data processing components used to implement the various processes, operations, illustrative logics, logical blocks, modules and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
- a general-purpose processor may be a microprocessor, or any conventional processor, controller, microcontroller, or state machine.
- a processor also may be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
- particular processes and methods may be performed by circuitry that is specific to a given function.
- the memory may include one or more devices (e.g., RAM, ROM, flash memory, hard disk storage, etc.) for storing data and/or computer code for completing or facilitating the various processes, layers and modules described in the present disclosure.
- the memory may be or include volatile memory or nonvolatile memory, and may include database components, object code components, script components, or any other type of information structure for supporting the various activities and information structures described in the present disclosure.
- the memory is communicably connected to the processor via a processing circuit and includes computer code for executing (e.g., by the processing circuit and/or the processor) the one or more processes described herein.
- the present disclosure contemplates methods, systems and program products on any machine-readable media for accomplishing various operations.
- the embodiments of the present disclosure may be implemented using existing computer processors, or by a special purpose computer processor for an appropriate system, incorporated for this or another purpose, or by a hardwired system.
- Embodiments within the scope of the present disclosure include program products comprising machine-readable media for carrying or having machineexecutable instructions or data structures stored thereon.
- Such machine-readable media can be any available media that can be accessed by a general purpose or special purpose computer or other machine with a processor.
- machine-readable media can comprise RAM, ROM, EPROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code in the form of machine-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer or other machine with a processor. Combinations of the above are also included within the scope of machine-readable media.
- Machine-executable instructions include, for example, instructions and data which cause a general-purpose computer, special purpose computer, or special purpose processing machines to perform a certain function or group of functions.
- references to implementations or elements or acts of the systems and methods herein referred to in the singular can also embrace implementations including a plurality of these elements, and any references in plural to any implementation or element or act herein can also embrace implementations including only a single element.
- References in the singular or plural form are not intended to limit the presently disclosed systems or methods, their components, acts, or elements to single or plural configurations.
- References to any act or element being based on any information, act, or element can include implementations where the act or element is based at least in part on any information, act, or element.
- Coupled elements can be electrically, mechanically, or physically coupled with one another directly or with intervening elements.
- Coupled and variations thereof include the joining of two members directly or indirectly to one another. Such joining may be stationary (e.g., permanent or fixed) or moveable (e.g., removable or releasable). Such joining may be achieved with the two members coupled directly with or to each other, with the two members coupled with each other using a separate intervening member and any additional intermediate members coupled with one another, or with the two members coupled with each other using an intervening member that is integrally formed as a single unitary body with one of the two members.
- where “coupled” or variations thereof are modified by an additional term (e.g., “directly coupled”), the generic definition of “coupled” provided above is modified by the plain language meaning of the additional term (e.g., “directly coupled” means the joining of two members without any separate intervening member), resulting in a narrower definition than the generic definition of “coupled” provided above.
- Such coupling may be mechanical, electrical, or fluidic.
- references to “or” can be construed as inclusive so that any terms described using “or” can indicate any of a single, more than one, and all of the described terms.
- a reference to “at least one of ‘A’ and ‘B’” can include only ‘A’, only ‘B’, as well as both ‘A’ and ‘B’.
- Such references used in conjunction with “comprising” or other open terminology can include additional items.
- references herein to the positions of elements are merely used to describe the orientation of various elements in the FIGURES.
- the orientation of various elements may differ according to other exemplary embodiments, and that such variations are intended to be encompassed by the present disclosure.
- a subject can be a mammal, such as a non-primate (e.g., cows, pigs, horses, cats, dogs, rats, etc.) or a primate (e.g., monkey and human).
- Mammals include, without limitation, humans, non-human primates, wild animals, feral animals, farm animals, sport animals, and pets.
- a subject is a human.
- a cell includes a plurality of cells, including mixtures thereof.
- the term “about” is used to indicate that a value includes the standard deviation of error for the device or method being employed to determine the value.
- Ranges: throughout this disclosure, various aspects of the invention can be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges, as well as individual numerical values within that range. For example, description of a range, such as from 1 to 6, should be considered to have specifically disclosed subranges, such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6, etc., as well as individual numbers within that range, for example, 1, 2, 2.7, 3, 4, 5, 5.3, and 6. This applies regardless of the breadth of the range.
Abstract
Presented herein are systems and methods of identifying users from audio signals acquired via speaker transducers. A computing system may receive a first audio signal corresponding to a first acoustic waveform acquired via a speaker transducer positioned relative to an ear of a first user. The first acoustic waveform may have (i) a first portion traveling through the first user and (ii) a second portion traveling outside the first user. The computing system may filter the second portion of the first acoustic waveform within the first audio signal to generate a second audio signal corresponding to the first portion of the first acoustic waveform. The computing system may apply the second audio signal to a machine learning (ML) model. The computing system may identify, based on applying the second audio signal to the ML model, that the first user of the speaker transducer is speaking.
Description
PROCESSING OF AUDIO SIGNALS FROM EARPHONES FOR
INTERACTIVITY WITH VOICE-ACTIVATED APPLICATIONS
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims priority to U.S. Provisional Patent Application No. 63/561,191, titled “Processing of Audio Signals From Earphones for Interactivity with Voice-Activated Applications,” filed March 4, 2024, which is incorporated by reference in its entirety.
BACKGROUND
[0002] A computing device may be communicatively coupled with one or more input/output (I/O) devices to accept inputs and provide outputs.
SUMMARY
[0003] At least one aspect of the present disclosure is directed to systems and methods of identifying users from audio signals acquired via speaker transducers. One or more processors may receive a first audio signal corresponding to a first acoustic waveform acquired via a speaker transducer positioned relative to an ear of a first user. The first acoustic waveform may have (i) a first portion traveling through the first user and (ii) a second portion traveling outside the first user. The one or more processors may filter the second portion of the first acoustic waveform within the first audio signal to generate a second audio signal corresponding to the first portion of the first acoustic waveform. The one or more processors may apply the second audio signal to a machine learning (ML) model. The ML model may be trained using a plurality of examples. Each example of the plurality of examples may identify: (i) a respective third audio signal corresponding to a portion of a respective second acoustic waveform traveling through a respective second user and (ii) an identification of whether the second user is speaking. The one or more processors may identify, based on applying the second audio signal to the ML model, that the first user of the speaker transducer is speaking. The one or more processors may
provide an output based on an identification that the first user of the speaker transducer is speaking.
[0004] In some embodiments, the one or more processors may identify that a third user of a second speaker transducer is not speaking, based on applying a fourth audio signal corresponding to a portion of a third acoustic waveform traveling through the third user to the ML model. In some embodiments, the one or more processors may provide a second output based on an identification that the third user of the second speaker transducer is not speaking.
[0005] In some embodiments, the one or more processors may receive the first audio signal corresponding to the first acoustic waveform comprising a plurality of formants of the first user. In some embodiments, the one or more processors may filter the first audio signal below a threshold frequency to pass through at least a first formant (F0) of the plurality of formants as the second audio signal.
[0006] In some embodiments, the one or more processors may apply the second audio signal to the ML model to determine a likelihood that the first user of the speaker transducer is speaking. In some embodiments, the one or more processors may identify that the first user of the speaker transducer is speaking, responsive to the likelihood satisfying a threshold. In some embodiments, the one or more processors may identify that the first audio signal is originating from the first user on which the speaker transducer is positioned.
[0007] In some embodiments, the one or more processors may apply a filter to suppress an air channel corresponding to the second portion of the first acoustic waveform and to pass a body channel corresponding to the first portion of the first acoustic waveform. In some embodiments, the one or more processors may initiate a process to enhance a voice command corresponding to the second audio signal to invoke a function of an application.
[0008] At least one aspect of the present disclosure is directed to systems and methods for enhancing voice commands from audio signals acquired via speaker transducers. One or more processors may identify that a user of a speaker transducer is speaking using a first audio
signal corresponding to an acoustic waveform travelling through the user. The one or more processors may select responsive to identifying that the user is speaking, a second audio signal corresponding to a voice command comprising one or more keywords for an application. The one or more processors may generate a third audio signal to include (i) the first audio signal and (ii) the second audio signal. The one or more processors may provide, to the application, the third audio signal to invoke a function of the application corresponding to the one or more keywords of the voice command.
[0009] In some embodiments, the one or more processors may determine that the function of the application was not successfully invoked in response to providing the third audio signal to the application. In some embodiments, the one or more processors may select, responsive to determining that the function of the application was not successfully invoked, a fourth audio signal corresponding to a second voice command comprising one or more second keywords for at least one of (i) a second function of the application or (ii) a second application. In some embodiments, the one or more processors may generate a fifth audio signal to include (i) the first audio signal and (ii) the fourth audio signal. In some embodiments, the one or more processors may provide the fifth audio signal corresponding to the second one or more keywords of the voice command to invoke at least one of the second function of the application or the second application.
[0010] In some embodiments, the one or more processors may determine that the function of the application was successfully invoked in response to providing the third audio signal to the application. In some embodiments, the one or more processors may refrain from selecting another audio signal for another voice command, responsive to determining that the function of the application was successfully invoked.
[0011 ] In some embodiments, the one or more processors may maintain, on memory, a plurality of audio signals each corresponding to one or more respective keywords to invoke a respective function of at least one of a plurality of applications. In some embodiments, the one or more processors may select the second audio signal from the plurality of audio signals. In
some embodiments, the one or more processors may generate the third audio signal to include, in a frequency domain, a first portion corresponding to the first audio signal and a second portion corresponding to the second audio signal. In some embodiments, the one or more processors may modify the second audio signal based on one or more characteristics of the first audio signal.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] FIG. 1: A few representative examples of the mobile voice activation service. (Left): the mobile voice activation service allows mobile users to activate their voice assistant without hand intervention. (Right): the mobile voice activation service can automatically detect the primary speaker, avoiding false alarms.
[0013] FIGs. 2A and 2B: (A): human speech production. (B): two human speech transmission channels. (1) air channel, (2) in-body bone-conduction audio pathway.
[0014] FIG. 3: Spectrogram (left) and spectral envelope (right) of the vowel sound /i/. The first three formants are denoted as F1, F2, and F3. This audio signal is recorded by a MEMS microphone.
[0015] FIGs. 4A and 4B: Feasibility study: speech measurement from (A): a primary speaker; and (B): a nearby speaker. Table 1: wakeup word recognition accuracy on five mainstream voice interfaces. Ten volunteers are invited to articulate three wakeup words 10 times each.
[0016] FIG. 5: The spectrogram and formants of the vowel sound /i/ captured by the earphone speaker.
[0017] FIGs. 6A-D: Two distinct wakeup words “Hey Siri” and “OK Google” were recorded using the pseudo-microphone and a MEMS microphone, plotting the spectrogram of the audio recordings. Pseudo-microphone recordings of (A) “Hey Siri” and (C) “OK Google”. MEMS microphone recordings of (B) “Hey Siri” and (D) “OK Google”.
[0018] FIG. 7: Measurement setup (left) and Frequency response curve of six pairs of earphones (right). A probing signal may be played across the frequency band to the earphone with a loudspeaker in an anechoic chamber.
[0019] FIG. 8: An illustration of the enhancement of the joint speech detection and primary user identification.
[0020] FIGs. 9A and 9B: (A) Reconstructed F1-F3 formants through harmonic reconstruction. Google API cannot recognize this keyword. (B) The ground truth F1-F3 formants recorded by a MEMS microphone. Google API can successfully recognize it as “Hey Siri”.
[0021] FIGs. 10A-D: Spectrogram and recognized word of each audio clip. (A) The combined signal can be successfully recognized by Google API. (B) The speech recording with high-frequency deafness was falsely recognized as “hi babe” by Google API. (C) The high-frequency component from a template cannot be recognized by Google API. (D) The combination of a non-wakeup word and the high-frequency template cannot be recognized by Google API.
[0022] FIG. 11: (a) syllables and (b) formants alignment.
[0023] FIG. 12: The mobile voice activation service supports wireless (left) and wired (right) connection.
[0024] FIG. 13: Earphones.
[0025] FIG. 14: (a) FRR and (b) FAR across 15 subjects. P1 refers to joint speech and primary speaker detection (§4.1.1); P2 refers to pitch detection-based enhancement (§4.1.2).
[0026] FIG. 15: Wakeup word recognition accuracy across 15 subjects.
[0027] FIG. 16: Success rate of the mobile voice activation service in seven scenarios.
[0028] FIGs. 17A-G: Four stationary and three mobility scenarios for the in-wild study: (A) home; (B) cafe; (C) park; (D) train; (E) driving car; (F) lifting in the gym; (G) walking on a busy intersection.
[0029] FIGs. 18A and 18B: Benchmark study. (A): the impact of earphone types. (B): the impact of voice loudness.
[0030] FIG. 19 depicts a block diagram of a system for processing audio signals from earphones for interactivity with applications, in accordance with an illustrative embodiment.
[0031] FIG. 20 depicts a block diagram of a process for identifying users from audio signals acquired via earphones in the system for processing audio signals, in accordance with an illustrative embodiment.
[0032] FIG. 21 depicts a block diagram of a process for enhancing user voice commands in audio signals acquired via earphones in the system for processing audio signals, in accordance with an illustrative embodiment.
[0033] FIG. 22 depicts a flow diagram of a method of identifying users from audio signals acquired via earphones, in accordance with an illustrative embodiment.
[0034] FIG. 23 depicts a flow diagram of a method of enhancing voice commands from audio signals acquired via earphones, in accordance with an illustrative embodiment.
[0035] FIG. 24 is a block diagram of a computing environment according to an example implementation of the present disclosure.
DETAILED DESCRIPTION
[0036] Following below are more detailed descriptions of various concepts related to, and embodiments of, systems and methods for processing audio signals from earphones for interactivity with applications. It should be appreciated that various concepts introduced above and discussed in greater detail below may be implemented in any of numerous ways, as the
disclosed concepts are not limited to any particular manner of implementation. Examples of specific implementations and applications are provided primarily for illustrative purposes.
1. Introduction
[0037] Presented herein is a mobile voice activation service (also referred to herein as “EarVoice”), a lightweight mobile service that enables hands-free voice assistant activation on commodity earphones. The mobile voice activation service comprises two design modules: one for joint speech detection and primary user identification, exploring the attributes of the air channel and in-body audio pathway to differentiate between the primary user and others nearby; and another for accurate wakeup word enhancement, which employs a “copy, paste, and adapt” approach to reconstruct the missing high-frequency component in speech recordings. To minimize false positives, enhance agility, and preserve privacy, the mobile voice activation service was deployed on a dongle where the proposed signal processing algorithms are streamlined with a gating mechanism to permit only the primary user’s speech to enter the pairing device (e.g., a smartphone) for wakeup word recognition, preventing unintended disclosure of ambient conversations. The dongle was implemented on a 4-layer PCB, and extensive experiments were conducted with 15 participants in both controlled and uncontrolled scenarios. The experiment results show that the mobile voice activation service achieves around 90% wakeup word recognition accuracy in stationary scenarios, which is on par with the high-end, multi-sensor fusion-based AirPods Pro earbuds. The mobile voice activation service’s performance drops to 84% in mobile scenarios, slightly worse than AirPods (around 90%).
[0038] Voice assistant (VA) has become an indispensable part of mobile systems. It serves as a natural means of communication that transcends language barriers, making mobile applications more accessible and inclusive for a diverse range of users. The rapid growth of generative AI, fueled by the sheer size of computation resources in the cloud, has been transforming the voice assistant into a more seamless, efficient, and user-friendly user interface.
[0039] While the voice assistant offers flexibility to mobile users, the process of activating it remains inconvenient due to its heavy dependence on hand interventions,
particularly on earphones. Taking Siri as an example, the user has to press and hold the talk/answer button on earphones for a few seconds until hearing the Siri beep. Similarly, wireless earbuds, including Google Pixel Buds, Apple AirPods, and Bose’s QC35, all require users to activate the voice assistant by tapping a touch sensor or holding an action button. This precaution is taken to avoid unintended activation of Siri by someone else nearby. Yet, this would divert the user’s attention from their current focus, negatively impacting the user experience. This is especially notable in situations where the user’s hands are occupied, as illustrated in FIG. 1 (left).
[0040] In the present disclosure, a simple question was asked: is it possible to enable hands-free VA activation on earphones? An affirmative answer would enhance the accessibility of voice assistants by enabling individuals occupied with other tasks to interact with their devices conveniently. In addition, it can improve safety by reducing the need for hands-on device manipulation, particularly in situations where manual interaction may be risky, such as driving or cycling.
[0041] Nevertheless, to harvest the aforementioned benefits, the following system objectives should be attained:
• Low False Positive Rate. A hands-free voice activation service stays in idle listening mode continuously, responding whenever a voice command is initiated. To achieve a good user experience, this service should minimize false positives, ensuring that it doesn’t get triggered by ambient voice activities.
• Agile and Low-Power. The proposed service should respond to human speech agilely, with minimum or unnoticeable latency. Moreover, as an always-on service running on power-constrained mobile devices, the proposed system design should be low-power, incurring a minimum power consumption.
• Privacy-Preserving. Voice data should be handled and stored securely, and users should have control over their data. Besides the necessary voice commands for awakening
corresponding services, other audio data should avoid being recorded and saved on the smartphone to minimize the risk of privacy leaks.
[0042] Presented herein is a mobile voice activation service (also referred to herein as “EarVoice”) that explores the distinction between the acoustic air channel and the in-body boneconduction pathway formed in human speech to enable accurate, agile, and low-power handsfree voice activation, all in a privacy-preserving way. The system works with everyday earphones, (e.g., those earphones cost a few US dollars), without breaking their structures and requires neither in-ear microphones nor dedicated IMU sensors that are only available on those pricey ANC earphones.
[0043] This mobile voice activation service repurposes the earphone speaker into a microphone for the detection of wakeup words (e.g., “Hey Siri”). This allows mobile users to wake up their voice assistant using earphones even without a microphone. To detect whether the recorded sounds are human speech or ambient noise, and furthermore, to distinguish if the detected speech originates from the primary user (i.e., the user who wears the earphone), the mobile voice activation service explores an observation that the speech of the primary user reaches the earphone’s speaker transducer through not only the conventional air channel but also via the human body channel, whereas the nearby speaker’s speech solely propagates through the air channel to the earphone speaker transducer, with significant attenuation. This discrepancy in the audio pathways is reflected in the recorded audio spectrum, with low-frequency signals originating from the primary speaker’s vocal cord vibrations being present, while the low-frequency voice components of a nearby speaker are not. The mobile voice activation service takes advantage of this unique frequency disparity to detect whether it is the primary user or someone else speaking nearby.
[0044] However, the distinct in-body bone-conduction pathway, coupled with the suboptimal frequency response of speaker transducers functioning as microphones, leads to a significant power loss in the higher-frequency speech components. The occurrence of such high- frequency deafness distorts spoken wakeup words severely, consequently diminishing the
accuracy of wakeup word recognition. To address this challenge, a wakeup word enhancement design was proposed to compensate for the high-frequency energy loss in the speech recording. This approach takes a MEMS microphone recording of the wakeup word (e.g., “Hey Siri”) as the template, extracting its high-frequency components ranging from 2 to 8 kHz, and pasting it to the voice recording. As wakeup recognition systems are primarily designed to interpret content-dependent elements of human speech, such as vowels and consonants, as opposed to speaker-dependent features like tones, prosody, and intonation, the combined signal can be successfully recognized even though its frequency components come from different individuals.
[0045] Nevertheless, as different individuals speak the wakeup word at different speeds, frequencies, and loudness, blindly copying and pasting without considering the discrepancy between the speech recording and the template can lead to the misalignment of critical formants in the combined audio signal and further undermine the wakeup word recognition. To address this issue, an efficient signal processing algorithm was proposed to align these two signal components along the time, frequency, and amplitude domains, ensuring the two frequency components are aligned in their combined form.
[0046] The mobile voice activation service functions as a hybrid signal-processing pipeline with primary functions running on a low-power dongle while the wakeup word recognition runs on the smartphone. The dongle transforms the earphone speaker into a microphone, detects the human voice, distinguishes whether it originates from the primary user, and further enhances the speech quality. By exclusively forwarding only the legitimate voice commands from the dongle to the smartphone, this gating approach not only prevents inadvertent disclosure of ambient conversations but also minimizes unnecessary wakeup word recognition on the pairing device, thereby conserving power.
[0047] A prototype of the mobile voice activation service's dongle was implemented on a 4-layer printed circuit board (PCB). It includes a low-power ESP32 MCU, an audio codec chip, and other peripherals to enable the functionality.
[0048] The close contact between the earphone speaker transducer and the human skin was identified to offer a unique opportunity to sense the vocal cord vibrations of the speaking user and tell whether the voice is coming from the primary user or others in the vicinity. Consequently, a lightweight signal processing algorithm was proposed that explores this opportunity to enable hands-free voice assistant activation.
[0049] A gated signal-processing pipeline was designed that can accurately detect, differentiate, and further enhance the incomplete voice command captured by the earphone speaker transducer, all in a low-power and privacy-preserving way. This design holds the potential to be deployed on different types of earphones.
[0050] The mobile voice activation service was implemented on a PCB, and extensive experiments were conducted in both controlled and uncontrolled environments. Experiment results demonstrated that the mobile voice activation service achieves an overall wakeup recognition accuracy of 90% across different real-world scenarios, which is on par with the high-end, multi-sensor fusion-based AirPods Pro earbuds.
2. Speech Production
[0051] Before describing the potential of the earphone's speaker transducer for hands-free voice assistant activation, how human speech production works is first explained.
[0052] As illustrated in FIG. 2A, the production of human speech involves intricate coordination between multiple articulatory organs in the vocal system, including the lungs, vocal cords (a.k.a. vocal folds), and vocal tract. The vocal tract is the area from the nose and the nasal cavity down to the vocal cords, including the throat, mouth (e.g., tongue, teeth, lips), nasal cavity, and facial movement. Specifically, the lungs provide the essential air source required for vocalization. This air subsequently passes through the vocal folds to generate a voice source and is then modulated by the vocal tract to produce output speech. The vocal folds generate voiced speech signals by dynamically controlling the airflow originating from the lungs, alternately blocking and permitting it. On the contrary, if the vocal folds do not vibrate, airflow from the lungs may be manipulated directly by the vocal tract to produce unvoiced signals, such as consonant sounds like /f/, /r/, etc.
[0053] The voiced signals may include two components: (i) vowels and some consonants that exhibit high-energy pulses in the frequency domain; and (ii) the fundamental pitch F0 and its harmonics. The frequency components that determine the intelligibility of spoken words are called formants (spectral resonances). The first formants in a sentence usually fall within the 300-2800Hz frequency band, forming the pronunciation of vowels. The follow-up formants lie in a higher frequency band above 3000Hz, as shown in FIG. 3.
3. Challenges
[0054] Facilitating hands-free voice assistant activation on earphones requires the agile detection of human voice, precise identification of the primary speaker, and robust recognition of the wakeup word thereafter. Two observations contribute to achieving these goals: the first concerns reusing the speaker transducer as a receiver, the second concerns the unique voice propagation channels:
[0055] Other approaches have demonstrated that the speaker transducer on commodity earphones can be used as a microphone for acoustic signal reception. This leaves an opportunity to capture spoken words on all types of earphones without requiring a microphone.
[0056] The primary user's voice reaches the earphone via both an air channel and an in-body channel, while a nearby user's voice only travels through the air channel. Due to the earphone's obstruction, only a small fraction of the voice energy from the nearby user reaches the earphone's speaker. In contrast, the primary user's voice arrives at the earphone speaker with less attenuation through the in-body channel, providing an opportunity to distinguish the speaker (§3.1).
[0057] In the following sections, the practicality of these opportunities is assessed and the potential challenges are identified.
3.1 Identifying the Primary Speaker: An Opportunity
[0058] Voice fingerprinting has been proposed to identify the registered primary user and might help determine whether the primary user is interacting with Siri or whether someone else nearby is speaking. However, such a mechanism is prone to various security threats in real life, including impersonation, voice synthesis, and replay attacks.
[0059] Instead of applying fingerprint technology, the distinct speech propagation channels between the primary speaker and nearby speakers were found to offer another opportunity to distinguish speakers using earphones. Specifically, the speech of the primary user reaches the earphone's speaker transducer not only through the conventional air channel but also via the human body channel, as depicted in FIG. 2B. In contrast, human speech from a nearby non-primary speaker propagates solely through the air channel to the earphone speaker transducer. These two channels are elaborated below:
[0060] (1) Air channel for voice propagation. For both the primary speaker and nearby speakers, the voice signal emanating from their mouth can propagate through the air channel. The earphone’s speaker transducer captures this signal when the sound reaches the earphone, as denoted by 1 in FIG. 2B.
[0061] (2) Body channel for the propagation of articulatory organ vibrations.
For the primary speaker, the vibrations from her articulatory organs, such as the vocal cords and vocal tract, travel through the human body and ultimately reach the ear canal. Given that the earphone transducer maintains close contact with the human ear, the speaker transducer is highly likely to detect these vibrations through bone conduction.
Table 1: Word recognition accuracy of audio captured by the earphone speaker transducer versus a MEMS microphone across five ASR systems.

ASR           Earphone speaker transducer    MEMS microphone
Google API    9%                             82%
DeepSpeech    1%                             58%
iFLYTEK       18%                            76%
SpeechBrain   1%                             66%
Whisper       31%                            93%
[0062] Benchmark studies were conducted in a controlled environment to assess whether human speakers are differentiable based on these two channels' propagation characteristics.
[0063] Setups. Two volunteers were invited, Alice and Bob, to conduct the experiment. As shown in FIG. 4A, Alice wears the earphones and acts as the primary user to activate the voice service by uttering “Hey Siri” at her preferred pace and intensity. The frequency spectrogram of the signal recorded by the earphone speaker transducer was plotted in the range between 0 and 8kHz. In FIG. 4B, Bob takes on the role of the primary user, wearing the earphones and remaining stationary, while Alice acts as a nearby speaker, uttering “Hey Siri” at the same pace and intensity. The earphone captures the voice from Alice via only the air channel.
[0064] Results. Upon comparing these two spectrograms, distinct energy gaps (around 20dB) were observed, especially when zooming into the 0-1000Hz frequency range. This frequency range is where vibrations originating from the articulatory organs are prominent. More specifically, these articulatory organ vibrations stem primarily from the vocal cords and vocal tract. Vibrations related to the vocal tract, such as movements of the lips, tongue, and facial features, typically fall within the 0 to 100Hz range. In contrast, vocal cord vibrations span the frequency range of 100 to 1000Hz, with variations depending on gender: around 90-500Hz for males and 150-1000Hz for females.
[0065] The result indicates that the speaker transducer is able to capture the low-frequency signals stemming from the primary speaker's vocal tract vibrations, but not those from a nearby speaker. This is reasonable, as both vocal cord and vocal tract activity travel through the body channel (in the form of bone conduction) to the earphone diaphragm, which suffers less attenuation compared with the air channel.
3.2 Wakeup Word Recognition: Challenges
[0066] The preceding section highlights the potential for distinguishing the primary speaker with dumb earphones. However, when these captured wakeup words were tested with five mainstream voice assistant systems, it was discovered that all of them achieved very low word recognition accuracy, ranging from 1% to 31%. The word-level recognition accuracy ACC may be calculated according to the equation ACC = 1 - (D + S + I) / (D + S + C), where D, S, I, and C represent the number of deletions, substitutions, insertions, and correct words, respectively. In contrast, the speech recorded by a commercial MEMS microphone achieves a recognition accuracy between 58% and 93%, as shown in Table 1.
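As a worked example of this metric (a minimal Python sketch; the function name is illustrative, and the denominator D + S + C follows from the reconstruction above, since it equals the number of words in the reference transcript):

```python
def word_accuracy(deletions: int, substitutions: int,
                  insertions: int, correct: int) -> float:
    """Word-level accuracy: ACC = 1 - (D + S + I) / (D + S + C).

    D + S + C equals the number of words in the reference transcript.
    """
    reference_words = deletions + substitutions + correct
    errors = deletions + substitutions + insertions
    return 1.0 - errors / reference_words

# Example: 1 deletion, 0 substitutions, 1 insertion, and 8 correct words
# against a 9-word reference gives ACC = 1 - 2/9, roughly 0.78.
print(word_accuracy(1, 0, 1, 8))
```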
[0067] To understand the performance gap, the waveform and spectrogram of these voice recordings were examined. As shown in FIG. 6, the high-frequency components beyond 2000Hz are largely absent from the speaker recordings, whereas a MEMS microphone preserves the high-frequency components of the signal well. The absence of high-frequency components was found to significantly impact the perception of formants in the wakeup words. For example, in the case of the vowel sound /i/ shown in FIG. 5, due to the high-frequency deafness, only the first formant below 2kHz is observed in the earphone speaker recording, while the subsequent formants above 2kHz are absent (§2). Compared with the MEMS microphone recording in FIG. 3, the absence of these critical formants in the earphone recording leads to confusion in the input features for speech recognition, ultimately causing wakeup word recognition failures.
[0068] A follow-up question arises: what is the reason behind the absence of high-frequency components beyond 2000Hz in the speaker recordings? The speaker hardware imperfection was suspected to be the root cause of this high-frequency deafness. Hence, the frequency response of the earphone speaker when used as a microphone was measured in an anechoic chamber, as shown in FIG. 7 (left).
[0069] FIG. 7 (right) shows the frequency response of six pairs of earphones across over-ear, on-ear, and in-ear types. The frequency response of all six pairs of earphones was observed to decline as the frequency increases. Within the 0-2000Hz frequency range, the speaker maintains a high frequency response, which facilitates the accurate capture of vocal cord vibrations. However, as the frequency continues to rise, the speaker's frequency response decreases significantly, with an average attenuation of 30 dB. Consequently, the speech in this frequency range experiences substantial attenuation, leading to reduced speech recognition accuracy.
4. Design
[0070] The mobile voice activation service was proposed to harvest the aforementioned opportunities and tackle the technical challenges identified in the preceding section. The mobile voice activation service includes two primary functionalities, namely, a speech detector with primary user identification (§4.1), and wakeup word enhancement (§4.2).
4.1 A Lightweight Speech Detector
[0071] This design component strives to promptly detect the presence of human speech from the audio recordings and determine whether it is its own user speaking (i.e., the primary speaker) or someone else nearby.
[0072] Existing speech detectors such as webrtc-vad work in two steps: they first send the audio recording to an energy detector to locate potential human speech, and then feed these high-energy segments to a GMM model to tell whether they are human speech or ambient noise. Although the energy detector is low-power, it analyzes energy levels of audio recordings across a wide frequency range spanning from 80Hz to 4000Hz, in which ambient noise frequently manifests and much of which the pseudo-microphone (i.e., the earphone speaker used as a microphone) fails to capture (§3.2). This can result in frequent false-triggering of the succeeding GMM-based speech detector and lead to an increase in system power consumption. Furthermore, existing speech detectors lack the capability to identify whether it is their own user talking; instead, they transmit all detected speech to the subsequent speech recognition module, which leads to energy wastage.
[0073] 4.1.1 Joint speech detection and primary user identification. The mobile voice activation service instead leverages the unique in-body signal propagation channel to simultaneously identify human speech and the primary speaker through the use of only the energy detector. It achieves this by detecting energy peaks specifically within the lower 1000Hz frequency band. This particular frequency range is primarily associated with the articulatory organs, making a strong energy peak within this band a reliable indicator of human speech presence. Furthermore, since speech from a nearby speaker propagates through the in-air channel, resulting in significant attenuation within this lower frequency band (as discussed in §3.1), whether the detected speech belongs to the primary speaker or someone else speaking nearby is distinguished by analyzing the energy peaks within the frequency range of 0 to 1000Hz.
[0074] The low-frequency energy detector proceeds in two steps: pre-processing and energy profiling.
[0075] Pre-processing. Let x(t) be the audio signal recorded by the earphone's speaker transducer. x(t) was filtered with a second-order Butterworth low pass filter (LPF) with a cutoff frequency of 1000Hz to eliminate the out-of-band noise, which is likely to be polluted by ambient environmental noise. As the user's motion noise (primarily below 50Hz) may still be preserved in the filtered signal, another Butterworth high pass filter with a cutoff frequency of 50Hz was adopted to remove human motion artifacts in that frequency band. Furthermore, because the recorded speech energy varies across different earphones and users, the energy of the filtered x(t) was normalized by scaling it to the range of [-4000, 4000] (dtype=int16). The signal normalization does not affect the relative amplitude and frequency distribution of the speech signal.
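The following is a minimal Python sketch of this pre-processing stage, assuming the numpy and scipy libraries are available; the function name and the 16kHz sampling rate (taken from the data collection setup in §6) are illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 16_000  # sampling rate used for data collection (per §6); illustrative here

def preprocess(x: np.ndarray, fs: int = FS) -> np.ndarray:
    """Band-limit the speaker-transducer recording to the body-channel band.

    A second-order Butterworth low-pass at 1000 Hz removes out-of-band
    ambient noise, a high-pass at 50 Hz removes motion artifacts, and the
    result is normalized to [-4000, 4000] so energy thresholds behave
    consistently across earphones and users.
    """
    lpf = butter(2, 1000, btype="low", fs=fs, output="sos")
    hpf = butter(2, 50, btype="high", fs=fs, output="sos")
    y = sosfilt(hpf, sosfilt(lpf, x.astype(np.float64)))
    peak = np.max(np.abs(y)) or 1.0   # avoid division by zero on silence
    return (y / peak * 4000).astype(np.int16)
```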
[0076] Per-frame energy profiling. Possible voice activity in the time domain was located by dividing the speech signal into time frames. Because speech signals are quasi-stationary within a short time (2-50ms), x(t) was divided into 20ms frames and the energy Si of each frame i was calculated as Si = Σn x(n)^2, where x(n) are the data samples within frame i. The mobile voice activation service monitors the fluctuations in energy between consecutive frames and sends the audio frame(s) to the primary user identification module if their energy surpasses 1.2 times the average energy, denoted as Si > 1.2 Savg. The value of Savg is regularly updated by incorporating new frames while excluding those that have been identified as containing speech. The hyper-parameter 1.2 was obtained through benchmark studies in various noise-level settings.
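A minimal sketch of this per-frame profiling, assuming a 16kHz sampling rate; the exponential update of the running average Savg is an assumption, since the document only states that the average is updated with frames not identified as speech.

```python
import numpy as np

FRAME_MS = 20        # speech is quasi-stationary over 2-50 ms
TRIGGER_RATIO = 1.2  # empirically chosen threshold multiplier

def frame_energies(x: np.ndarray, fs: int = 16_000) -> np.ndarray:
    """Split the signal into 20 ms frames and return S_i = sum of x(n)^2."""
    n = fs * FRAME_MS // 1000
    frames = x[: len(x) // n * n].reshape(-1, n).astype(np.float64)
    return (frames ** 2).sum(axis=1)

def flag_speech_frames(energies: np.ndarray) -> list[int]:
    """Flag frames whose energy exceeds 1.2x the running non-speech average."""
    flagged: list[int] = []
    s_avg = None
    for i, s in enumerate(energies):
        if s_avg is None:
            s_avg = s
        elif s > TRIGGER_RATIO * s_avg:
            flagged.append(i)                 # candidate speech frame
        else:
            s_avg = 0.9 * s_avg + 0.1 * s     # update with non-speech frames only
    return flagged
```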
[0077] 4.1.2 Enhancement. The aforementioned procedure can detect the primary user's speech with high accuracy because, in most cases, only speech from the primary user can cause high energy peaks in the frequencies below 1000Hz. However, cases were noticed where strong ambient noises occupying a wide frequency band (e.g., engine, wind, and road noises while driving) can fool this energy detection module, leading to false triggers of the succeeding wakeup word recognition module, which is usually power hungry.
[0078] To minimize the occurrence of false activations of the wakeup word recognition module and reduce the associated power consumption, extracting articulatory features from the audio recording was proposed to validate whether the detected signal represents human speech rather than mere background noise. More precisely, the audio is segmented into discrete frames, the F0 pitch (i.e., the fundamental pitch) frequency is detected within each frame, and the consistency of the F0 pitch across successive frames is assessed. If the signal corresponds to human speech, the F0 pitch should exhibit relatively stable continuity across these frames.
[0079] The F0 pitch was chosen as the focus for several reasons. Firstly, the F0 pitch is the essential articulation frequency, determined by the rate at which the vocal cords vibrate and controlled by the tension and length of the vocal cords. As these vibrations emanate from the articulatory organs and travel through to the ear canal, the F0 pitch carries the most robust reference of audible energy. Secondly, the frequency of the F0 pitch is less susceptible to certain types of interference compared with other vocal frequencies. For instance, low-frequency vocal tract resonances may be confounded by motion artifacts, and high-frequency harmonics can be masked by ambient noise.
[0080] F0 pitch detection. The spectrogram of the audio signal is obtained using the Short-Time Fourier Transform (STFT), and the F0 pitch is then detected on the spectrogram by measuring the maximum coincidence of harmonics. The key insight is that the spectrogram of speech may exhibit prominent peaks at frequencies that are integer multiples of the F0 pitch, stemming from the harmonics present in the speech signal. Building on this, a range of potential F0 pitches is established, from 90Hz to 250Hz; studies show that the F0 frequency is around 90-180Hz for males and 165-255Hz for females. The frequency band of candidate pitches may be set to [90Hz, 250Hz] for running the F0 estimator. The power associated with each of these candidate pitches and its corresponding harmonics may be aggregated within the 1000Hz frequency range. In each time frame, the pitch with the highest cumulative power may be identified as the estimated F0 pitch. FIG. 8 illustrates this process.
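A minimal sketch of this harmonic-summation F0 estimator over an STFT spectrogram, assuming scipy; the 5Hz candidate grid and the 4096-point FFT (used here only to obtain a finer frequency grid over 20ms frames) are illustrative choices.

```python
import numpy as np
from scipy.signal import stft

def estimate_f0(x: np.ndarray, fs: int = 16_000) -> np.ndarray:
    """Per-frame F0 estimation by harmonic summation on the STFT power.

    For each candidate pitch in [90, 250] Hz, the spectral power at the
    candidate and its integer harmonics below 1000 Hz is aggregated; the
    candidate with the largest total per frame is the estimated F0.
    """
    # 20 ms frames (speech is quasi-stationary); nfft > nperseg only
    # refines the frequency grid for bin lookup.
    freqs, _, Z = stft(x, fs=fs, nperseg=320, nfft=4096)
    power = np.abs(Z) ** 2
    df = freqs[1] - freqs[0]
    candidates = np.arange(90, 251, 5)          # 5 Hz grid (illustrative)
    f0 = np.empty(power.shape[1])
    for t in range(power.shape[1]):
        scores = []
        for c in candidates:
            harmonics = np.arange(c, 1000, c)   # c, 2c, 3c, ... below 1 kHz
            bins = np.round(harmonics / df).astype(int)
            scores.append(power[bins, t].sum())
        f0[t] = candidates[int(np.argmax(scores))]
    return f0
```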
[0081] Finally, noise at other frequencies is removed to improve the SNR of the primary articulatory feature (the F0 pitch), and the nullified spectrogram is fed to a Support Vector Machine (SVM) for classification. Because the classifier focuses on detecting the continuity of the F0 pitch, a feature that does not vary significantly between users, there is no necessity to amass a diverse set of training data from a large population. Moreover, the SVM's lightweight design ensures that it is computationally efficient.
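A minimal sketch of this classification stage, assuming scikit-learn; the feature dimensions and the placeholder training data are purely illustrative, standing in for flattened, noise-nullified spectrogram patches labeled as speech or noise.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder training set: each row stands in for a flattened,
# noise-nullified low-frequency spectrogram patch; label 1 = speech,
# label 0 = ambient noise. Real features would come from the F0 pipeline.
rng = np.random.default_rng(0)
X_train = rng.random((200, 64))
y_train = rng.integers(0, 2, 200)

clf = SVC(kernel="rbf")          # lightweight binary classifier
clf.fit(X_train, y_train)

def is_speech(nullified_spectrogram: np.ndarray) -> bool:
    """Classify a nullified spectrogram patch as speech vs. noise."""
    return bool(clf.predict(nullified_spectrogram.reshape(1, -1))[0])
```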
[0982] It’s important to note that this enhancement module is not in a constant state of activation. Instead, its activation is determined by per-frame energy profiling (§4.1), which calculates the ambient environmental energy level of each time frame. The enhancement module is activated only when the ambient energy level exceeds a predefined threshold, established based on a computation over five frames. This strategic approach allows the mobile voice activation service to activate the enhancement module in noisy environments to bolster accuracy, while also deactivating it under quieter conditions to conserve power.
4.2 Accurate Wakeup Word Enhancement
[0083] Once the audio speech is detected as coming from the primary user, the audio signal may be sent to the wakeup word recognition module. However, as demonstrated in §3.2, directly sending the voice recording to the wakeup word recognition module associated with existing voice assistants encounters significant errors due to the absence of critical high-frequency components. A lightweight wakeup word enhancement algorithm was proposed to address this issue.
[0084] 4.2.1 The failure of harmonics reconstruction. The initial attempt was to reconstruct the audio's high-frequency spectrogram (2-8 kHz) using the low-frequency (0-2 kHz) components that are available in the audio recordings. The opportunity here is that the fundamental frequencies (e.g., the F0 pitch) in human speech manifest in higher frequency bands as harmonics (e.g., 2*F0, 4*F0, ...). Harmonics in the 2-8kHz band were synthesized using the fundamental frequency components, with the energy further decayed across frequencies to ensure smoothness. However, when the reconstructed audio was sent to the Google API for recognition, the wakeup word recognition accuracy did not improve, remaining at around 7%. The reconstructed audio clips were also fed to other recognition systems and achieved similarly low accuracy.
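A rough sketch of this (ultimately unsuccessful) harmonic-synthesis attempt, illustrating why it cannot recover formants; the decay rate and bin arithmetic are assumptions, not the document's exact procedure.

```python
import numpy as np

def synthesize_harmonics(spectrum: np.ndarray, freqs: np.ndarray,
                         f0: float, decay_db_per_octave: float = 6.0) -> np.ndarray:
    """Naively extend a 0-2 kHz magnitude spectrum into 2-8 kHz by placing
    decayed energy at integer multiples of F0. This recovers harmonics but
    not formants, which is why the attempt failed."""
    out = spectrum.copy()
    base = spectrum[np.argmin(np.abs(freqs - f0))]   # energy at the F0 bin
    for k in range(2, int(8000 // f0) + 1):
        f = k * f0
        if f < 2000:
            continue                                  # low band already present
        decay = 10 ** (-decay_db_per_octave * np.log2(f / f0) / 20)
        out[np.argmin(np.abs(freqs - f))] += base * decay
    return out
```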
[0085] After carefully comparing the reconstructed signal spectrogram shown in FIG. 9A with the ground truth shown in FIG. 9B, it was found that harmonics reconstruction struggles to reconstruct the formants within the higher frequency band of 2-8 kHz. This is because the formants are not solely determined by the fundamental frequency or its harmonics; they are also closely related to the physical shape and size of the user's vocal tract (§2). Accurate reconstruction of formants would require detailed information about the vocal tract's shape and size, which is typically obtained through complex acoustic modeling or data-driven approaches that are computationally intensive.
[0086] 4.2.2 The solution: copy, paste, and adapt. To mitigate the high-frequency deafness observed in the speech recording, it was proposed to use a MEMS microphone's pre-recording of the wakeup word (e.g., "Hey Siri") as the template, extract its high-frequency components ranging from 2 to 8kHz, and paste them onto the speech recording, as shown in FIG. 10A. This is based on the observation that when the speech recording is a wakeup word, the combined speech signal can trigger the voice assistant even though its low-frequency and high-frequency components originate from different human speakers.
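A minimal sketch of the copy-and-paste step in the STFT domain, assuming scipy and that the template has already been aligned to the recording (Steps 1-3 below); the 320-sample (20ms) frame length is illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

FS = 16_000
SPLIT_HZ = 2000   # below: keep the user's recording; above: use the template

def copy_paste(recording: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Combine the recording's 0-2 kHz band with the template's 2-8 kHz band
    in the STFT domain, then invert back to a waveform."""
    n = min(len(recording), len(template))
    f, _, R = stft(recording[:n], fs=FS, nperseg=320)
    _, _, T = stft(template[:n], fs=FS, nperseg=320)
    combined = np.where(f[:, None] < SPLIT_HZ, R, T)
    _, y = istft(combined, fs=FS, nperseg=320)
    return y
```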
[0087] The rationale is that speech recognition systems are primarily designed to interpret content-dependent elements of human speech, such as vowels and consonants, which are characterized by these crucial formants. These systems are tuned to focus less on human speaker-dependent features like tones, prosody, and intonation, aiming to enhance the scalability of speech recognition performance.
[0088] Conversely, due to the lack of fundamental pitches and frequency components below 2kHz, the high-frequency component from the MEMS microphone's recording alone, as shown in FIG. 10C, cannot be successfully recognized by the wakeup word recognition module. Similarly, due to the mismatch between the low-frequency and high-frequency components, the combination of a non-wakeup-word speech recording and a wakeup word template also fails to trigger the voice assistant, as shown in FIG. 10D.
[0089] Yet, implementing the copy-and-paste approach poses a considerable challenge because of the diverse nature of human speech, including variations in pace, pitch, intensity, and vocal patterns. Additionally, a single user might pronounce the same wake-up word very differently on different occasions. Blindly pasting the high-frequency component of the template keyword onto the speaker's speech recording can disrupt the alignment of critical formants in the combined audio signal, lead to a mismatch of the energy components in the low-frequency and high-frequency bands, and further undermine the wakeup word recognition.
[0090] To address this challenge, it was proposed to align the speech recording and the keyword template across three distinct dimensions: time, frequency, and energy. This alignment ensures that the harmonics as well as the formants in the high-frequency band are well aligned with the audio components in the low-frequency band. This alignment is detailed next.
[0091] Step 1. Syllable alignment in the time domain. A syllable is a fundamental unit for organizing speech sounds for pronunciation in linguistics. Variations in speech pace among different users can lead to discrepancies in voice duration and the number of syllables. The mobile voice activation service first aligns captured speech signals with the template by stretching/squeezing the template audio on a syllable basis. The primary challenge in this process lies in accurately detecting the boundaries of syllables in the speech recording and adjusting the template's voice speed to match that of the user, especially in the presence of background noise.
[0092] To overcome this challenge, the energy of the ambient background noise in the speaker's audio recording was first calculated and then subtracted to enhance the speech signal's SNR, making the boundaries more distinct. After that, a pitch identification algorithm was applied to the speech recording to pinpoint the F0 fundamental pitch. This F0 pitch information is used to determine the number and location of syllables and the stretch ratio. The voice stretch is applied on a per-syllable basis. If the mobile voice activation service detects discrepancies in the number of syllables between the speech recording and the template (due to variations in speech pace and pronunciation habits), it merges syncopal syllables (e.g., /si-ri/) into a single syllable for alignment, as depicted in FIG. 11, panel (a).
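A minimal sketch of this per-syllable time alignment, assuming the librosa library and that syllable boundaries (sample index pairs) have already been obtained from the F0 pitch analysis.

```python
import numpy as np
import librosa

def align_syllables(template: np.ndarray,
                    template_syllables: list[tuple[int, int]],
                    recording_syllables: list[tuple[int, int]]) -> np.ndarray:
    """Stretch/squeeze each template syllable so its duration matches the
    corresponding syllable in the speech recording."""
    pieces = []
    for (ts, te), (rs, re) in zip(template_syllables, recording_syllables):
        # rate > 1 shortens the syllable, rate < 1 lengthens it
        rate = max(te - ts, 1) / max(re - rs, 1)
        pieces.append(librosa.effects.time_stretch(
            template[ts:te].astype(np.float32), rate=rate))
    return np.concatenate(pieces)
```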
[0093] Step 2. Formant alignment across the audible band. After syllable alignment in the time dimension, the formant components on the spectrogram are aligned. Users differ in their vocal cord and vocal tract structures, and this discrepancy can result in distinct formant location relationships in the spectrogram. For example, females typically possess a higher F0 pitch compared to males, causing their F1, F2, and F3 formants to be noticeably higher. Directly pasting the F2-F3 formant template from a female onto the speech recording of a male can result in frequency misalignment, disrupt the inherent relationships among the formants, and ultimately result in errors in wakeup word recognition.
[0094] Table 2: Comparison of word recognition accuracy. (a): without copy-paste-adapt; (b): with copy-paste, no adapt; (c): with copy-paste-adapt; (d): with copy-paste-adapt on non-wakeup word.

Setup         (a)    (b)    (c)    (d)
Recog. Acc.   11%    15%    89%    20%
[0095] Aligning the frequency formants on an STFT basis was proposed. As illustrated in FIG. 11, panel (b), the audible band signal is divided into a 2D time-frequency matrix. Each time frame in the matrix spans 20 ms, as the audio sound is quasi-stationary over a 2-50 ms period. Following the segmentation, the spectral envelope of each time frame is extracted. As shown in FIG. 3, the spectral envelope is an important cue for the identification of voice sounds and the characterization of formants (spectral resonances). The location of the F1 formant (< 2kHz) may be aligned in the spectral envelope by determining a shift factor. This shift factor is then applied to the higher F2-F3 formants in the template signal. Subsequently, the adapted formant signal is copied onto the speech recording for replacement. The mobile voice activation service adopts the linear prediction spectral envelope in the implementation.
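A simplified per-frame sketch of the shift-factor idea: here the strongest bin below 2kHz stands in for the F1 peak of the linear-prediction spectral envelope, and a plain circular shift approximates the frequency adaptation; both are illustrative simplifications of the described alignment.

```python
import numpy as np

def align_formants(template_frame: np.ndarray, recording_frame: np.ndarray,
                   freqs: np.ndarray) -> np.ndarray:
    """Shift a template STFT frame so its F1 location matches the recording's
    F1, then reuse the shifted frame's F2-F3 region for pasting."""
    low = freqs < 2000
    # Strongest low-band bin stands in for the spectral-envelope F1 peak.
    f1_recording = freqs[low][np.argmax(np.abs(recording_frame[low]))]
    f1_template = freqs[low][np.argmax(np.abs(template_frame[low]))]
    shift_bins = int(round((f1_recording - f1_template) / (freqs[1] - freqs[0])))
    return np.roll(template_frame, shift_bins)   # same shift applied to F2-F3
```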
[0096] Step 3. Energy alignment. The last step is to align the energy between the template and the speech recording. Speech loudness varies across individuals; combining the template and the speech recording at different loudness levels would inevitably harm the wakeup word recognition accuracy. To solve this issue, the average energy level of the high-frequency component, denoted as Phigh, and the low-frequency component, denoted as Plow, within the template audio are first calculated. The energy level of the filtered speech recording in the low-frequency band, P'low, is then computed. Finally, the high-frequency component of the combined signal is adapted using the following equation: P'high = Phigh * (P'low / Plow).
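A minimal sketch of this energy alignment, assuming time-domain band components as inputs; treating the "average energy level" as mean-square power is an assumption.

```python
import numpy as np

def align_energy(template_high: np.ndarray, template_low: np.ndarray,
                 recording_low: np.ndarray) -> np.ndarray:
    """Scale the template's high band so that P'_high = P_high * (P'_low / P_low)."""
    p_high = np.mean(template_high.astype(np.float64) ** 2)     # template 2-8 kHz power
    p_low = np.mean(template_low.astype(np.float64) ** 2)       # template 0-2 kHz power
    p_low_rec = np.mean(recording_low.astype(np.float64) ** 2)  # recording 0-2 kHz power
    target = p_high * (p_low_rec / max(p_low, 1e-12))           # P'_high
    return template_high * np.sqrt(target / max(p_high, 1e-12))
```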
[0097] Result. A volunteer was invited to evaluate the effectiveness of this algorithm. The volunteer was instructed to speak the wakeup word "Alexa" 100 times and random non-wakeup words 100 times at her normal communication loudness. The word recognition accuracy is shown in Table 2. We observe that the algorithm, denoted as (c), can effectively activate voice assistants with an 89% success rate. In contrast, the success rate drops to only 11% without applying the algorithm, denoted as (a). For comparison, direct copy-and-paste has a relatively low recognition success rate (15%), as directly applying the template at high frequencies introduces misalignment, as shown in (b). Experiments were also conducted on applying the template to other non-wakeup words, denoted as (d). It was found that these non-wakeup words cannot efficiently activate the speech recognizer, which demonstrates the effectiveness of the algorithm.
5. Implementation
[0098] The mobile voice activation service's signal processing includes a lightweight hardware circuit that transforms the earphone speaker into a microphone, an energy-efficient algorithm that detects human speech and distinguishes whether it is the primary user speaking, as well as a signal enhancement algorithm that improves the quality of the wakeup word. All these signal modules run on a dongle. FIG. 12 shows the mobile voice activation service prototype, which supports both a wireless connection (through Bluetooth) and a wired connection (through a 3.5mm TRRS audio cable).
[0099] This implementation possesses two advantages. First, because the voice detection and primary user identification features are implemented in the plug-in dongle, the earphone transducer does not send all captured audio streams directly to the pairing device (such as a smartphone or laptop) for further processing. Instead, the audio data is processed locally on the dongle, and only legitimate voice commands from the primary user are forwarded to the backend for further processing. Second, this gating approach not only helps prevent unintended disclosure of ambient conversations but also avoids unnecessary acoustic signal processing on smartphones, and thus reduces power consumption.
[0100] Hardware integration. The mobile voice activation service dongle comprises two 3.5 mm audio jacks, resistors in the form of a Wheatstone bridge, a power amplifier (INA126), an audio codec chip (ES8388), an onboard computation MCU (ESP32-WROVER-E), and a BLE radio. The size of the prototype is 6cm x 4.5cm, and it costs approximately 8.3 USD. Its form factor can be further reduced by adopting a stretchable PCB. It is anticipated that this design can be seamlessly incorporated into mainstream True Wireless Stereo (TWS) earbuds by placing the miniaturized circuitry between the transducer and the audio chip.
6. Evaluation
[0101] Data collection. 15 volunteers were recruited (12 males and three females, with an average age of 26 years) for the experiment under the approval of the university's Internal Review Board (IRB) protocol. The volunteers include three native speakers and 12 international students with professional working proficiency in English. Each volunteer wears the mobile voice activation service and speaks three types of wakeup words: "Alexa", "Ok Google", and "Hey Siri". The audio sampling rate is set to 16kHz.
[0102] Earphone configurations. Voice data is collected using 13 pairs of earphones with different types (e.g., over-ear, on-ear, and in-ear), and transducer sizes, as shown in FIG. 13.
[0103] Baseline. The mobile voice activation service was evaluated against the AirPods Pro to assess its usability. The AirPods Pro takes a leading position among commodity earbuds, particularly excelling in speech sound quality. This superiority is achieved through the utilization of advanced sensor modalities, including a voice accelerometer and multi-microphone-based beamforming. In contrast, the mobile voice activation service adopts only the speaker transducer as the basic signal receiver.
[0104] Metrics. Three metrics were adopted to evaluate the mobile voice activation service:
• False Acceptance Rate (FAR). This metric quantifies the frequency with which the mobile voice activation service erroneously activates the voice assistant, over the total number of attempts. A high FAR can lead to an unsatisfactory user experience and inadequate privacy preservation.
• False Rejection Rate (FRR). This metric evaluates the frequency with which the mobile voice activation service does not activate the voice assistant when the primary user intends to invoke it, over the total number of attempts. A high FRR suggests the mobile voice activation service may encounter difficulties in freely accessing the voice assistant service.
• Success Rate (SR). This metric quantifies the rate of successful execution over all attempts. One successful execution is counted only when the corresponding wakeup word is successfully recognized by the ASR.
6.1 In-lab Study
[0105] The effectiveness of the mobile voice activation service's front-end and back-end design was examined in a controlled environment.
[0106] Experimental procedure. The study is divided into two sessions. In the first session, the primary subject (who wears the earphone) is instructed to utter the wake-up words at her preferred pace and intensity. Each command was uttered 20 times per user with different earphones. The false rejection rate (FRR) was then computed. In the second session, the primary subject stays silent while another volunteer speaks the same wake-up word near the primary subject, playing the role of the nearby individual shown in FIG. 4B. The false acceptance rate (FAR) was calculated. Each session takes around 30 minutes. All experiments are conducted in a quiet lab environment with an ambient noise level of 45 dBSPL on average.
[0107] Primary speaker identification. The overall accuracy of primary speaker identification in the mobile voice activation service was examined. The evaluation is conducted in two phases. In the first phase (P1), only the time-framing identification method (§4.1.1) was applied, and the FRR and FAR results were examined. As shown in FIG. 14, a consistently low average FRR (1%) but a higher average FAR (12.2%) is observed across the 15 subjects. This outcome is expected since time framing primarily detects energy presence, not the specific user's identity. Afterward, the pitch detection (§4.1.2) was incorporated, and significant improvements were observed. The FAR drops to 1.4%, while the FRR slightly increases to 1.3%. These findings demonstrate the effectiveness of the pitch detection algorithm.
[0108] On further scrutiny of these results, it was found that subjects 7, 8, 9, and 10 exhibit relatively higher FRR and FAR (e.g., >2%). This discrepancy can be attributed to the inadequate contact of the earphones with the subjects' skin, impacting the propagation of vocal cord vibrations through bone conduction and resulting in an increased FRR. Simultaneously, this lack of close contact allows the speaker transducer to capture speech from nearby users, contributing to a higher FAR. Additionally, subject 14 exhibits a higher FRR but maintains a lower FAR in comparison to others. Further investigation into the raw audio recordings of subject 14 reveals that their voice volume is lower than that of other subjects, consequently leading to more frequent rejections by the mobile voice activation service.
[0109] Wakeup word recognition. The effectiveness of wakeup word recognition using the copy, paste, and adapt design was evaluated next. FIG. 15 shows the recognition success rate for each individual. The error bars in the figure indicate performance variations across the three different wakeup words. Overall, the mobile voice activation service achieves a success rate of 91% on average. Notably, subject 14 achieves the lowest SR at 61% due to his lower voice volume. Such reduced volume adversely affects pitch detection accuracy, subsequently impacting the precision of the alignment processes.
6.2 Field Study
[0110] The mobile voice activation service's end-to-end performance across various real-world scenarios was assessed next. As shown in FIGs. 17A-D, the evaluation encompasses four stationary and three mobility scenarios to represent typical indoor and outdoor settings. In each scenario, 100 utterances were collected for each wakeup word. The overall success rate of wakeup word recognition was then examined. AirPods are adopted for comparison. FIG. 16 shows the results corresponding to the seven scenarios.
[0111] Stationary scenarios (a)-(d). The mobile voice activation service achieves a success rate of 95%, 92%, 89%, and 82% for these four static scenarios, respectively. The overall accuracy is around 90%, which is slightly worse than that of AirPods (92%). A relatively bigger gap between the mobile voice activation service and AirPods is observed in scenario (d). This suggests that severe noise artifacts, as encountered in (d), can still impact the accuracy of template matching, consequently affecting the recognition of wakeup words.
[0112] Mobile scenarios (e)-(g). The investigation was further extended to include three types of mobility. The results of these activities are shown in FIG. 16 (e)-(g). Notably, during (e) driving and (f) lifting, the mobile voice activation service achieves an average success rate of 85% and 84%, respectively. The success rate is slightly lower than in stationary environments with comparable noise levels. This decline in performance is primarily attributed to the head and upper body movements during driving and lifting, which adversely affect the signal input quality. In contrast, AirPods maintain a higher average success rate of 93%. The success rate further drops to 71% while walking at a busy intersection, influenced by noise from moving vehicles nearby and motion artifacts from the individual. The success rate of AirPods falls to 72% in these conditions.
[0113] Results discussion. In contrast to AirPods, which leverages advanced sensors and beamforming technologies to improve the voice quality, the mobile voice activation service relies solely on the earphone’s speaker transducer for voice activation and a lightweight signal processing algorithm for wakeup word enhancement. The manufacturing cost of the mobile voice activation service is markedly lower than the AirPods’ retail price, while striving to approach a comparable performance.
6.3 Micro-Benchmarks
[0114] Benchmark studies were further conducted to understand the effect of various factors on the mobile voice activation service’s performance.
[0115] Impact of different earphones. One participant was invited to conduct the speech activation experiment by wearing six pairs of earphones (out of 13) in the lab and speaking three types of wakeup words, with each wakeup word repeated 100 times. AirPods are adopted for comparison. The result is shown in FIG. 18A. Overall, the mobile voice activation service was observed to achieve an average success rate of 87% over all types of earphones. Notably, over-ear and on-ear earphones achieve the highest success rates, with an average of 92% and 91% SR, respectively. These results are on par with AirPods (with an average success rate of 92%), demonstrating the mobile voice activation service's effectiveness across over-ear and on-ear earphones.
[0116] However, the mobile voice activation service's performance is notably lower with in-ear earphones, with a success rate of 62% on average. The better performance of over-ear and on-ear earphones can be attributed to their larger speaker transducers and inherently larger surface contact with the skull, allowing for more efficient transfer of vocal cord vibration energy. In contrast, the smaller transducers of in-ear earphones exhibit reduced sensitivity to voice commands (FIG. 7). A potential solution is to adjust the speaker volume or incorporate a power amplifier into the dongle to enhance the signal strength of the speech recording.
[0117] Impact of different voice loudness. The impact of voice loudness on the mobile voice activation service's success rate was evaluated next. Similarly, one participant was invited to utter the three types of wakeup words at four different loudness levels, spanning from 45 to 75 dBSPL. The range is selected based on the CDC's guidance, which designates approximately 40 dBSPL for a whisper, 60-70 dBSPL for a normal voice level, and 75-85 dBSPL for a loud voice conversation. As shown in FIG. 18B, as the voice loudness increases, the success rate of the mobile voice activation service grows by 2.5x, from 39% to 99%. A similar trend can be found with AirPods as well, whose success rate grows from 40% to 100%. Notably, the success rate of the mobile voice activation service is relatively stable (i.e., 93%-99%) when the voice loudness level surpasses 55 dBSPL. This result demonstrates the mobile voice activation service's resilience in handling normal voice conversations.
[0118] System overhead and latency. System overhead and processing latency were also evaluated. Table 3 details the processing delay of the front-end design (§4.1.1 & §4.1.2) and the copy, paste, and adapt design (§4.2), respectively. The measurement is conducted on a 2-second audio sample extracted from the audio stream. It was observed that joint speech and primary speaker detection (§4.1.1) takes around 3ms to process the 2s audio sample. The pitch detection-based enhancement (§4.1.2) takes 159ms. The copy, paste, and adapt design (§4.2) takes around 25ms to process a 2-second audio sample. The overall signal processing delay is around 200ms, demonstrating the capability of real-time operation. It is anticipated that the delay may drop further through multi-threaded processing.
[0119] Table 4 summarizes the power consumption of each component. Given a supply voltage of 5V, the sensing module, audio codec, and MCU consume 0.2mW, 60mW, and 152mW, respectively. The total power consumption of the mobile voice activation service is approximately 212 mW in the active mode. An 820 mAh lithium battery can provide up to 19.3 hours of continuous operation of the mobile voice activation service. The battery life could be further optimized with duty-cycling.
[0120] Table 4: Power consumption breakdown.

Component        Power (5V supply)
Sensing module   0.2 mW
Audio codec      60 mW
MCU              152 mW
Total            ~212 mW
7. Comparisons with Other Approaches
[0121] Voice Assistant Activation Technologies. Existing general-purpose voice activity detection (VAD) modules, e.g., Google's webrtc-vad, GPVAD, and Kaldi-VAD, have been well studied and integrated into many mobile applications. Nevertheless, applying these designs to earphones faces challenges, as voice communication on earphones can be plagued by environmental noise and, more severely, by speech commands from nearby individuals.
[0122] To solve this issue, personalized VAD that identifies the target user's voice fingerprint has been proposed. However, these personalized solutions are generally power-intensive and struggle to counteract spoofing attacks. Hence, they are not widely adopted by consumer devices.
[0123] Besides, various approaches have also been developed to simplify voice activation by involving hand gestures. For example, Raise to Speak enables the Apple Watch to activate the voice assistant by detecting a hand-raising gesture. ProxiMic explores close-to-mic voice characteristics (e.g., pop noise) and enables voice activation by placing the microphone close to the user's mouth. PrivateTalk activates voice input with user-defined hand-on-mouth gestures for earphone devices. Although these approaches guarantee low false positives, they inevitably require the involvement of hand gestures and thus place an extra burden on the users.
[0124] Different from these other approaches, the mobile voice activation service takes advantage of an opportunity hidden in the earphone transducer and develops a hands-free voice activation system while guaranteeing low false positives against environmental noise and falsely triggering voice commands from nearby people. The proposed signal-processing algorithm can run efficiently on mobile and embedded devices without complex computation requirements.
[0125] Bone Conduction Microphones. Recently, bone conduction sensors, such as IMUs, voice pickup sensors (VPU), non-audible murmur (NAM) microphones, and throat microphones, have been explored for speech enhancement and voice activation. For example, WhisperMask designs a new interface that captures the user's whispered speech with an embedded condenser microphone hidden in a non-woven mask to reduce noise interference from the environment. In-Ear-Voice developed a low-power personalized VAD system for hearables by exploring a bone conduction sensor. VibVoice utilized the bone conduction response from IMU sensors to enhance speech quality in noisy environments. These approaches demonstrate promising results, but they cannot be deployed on existing earphones due to the lack of such onboard sensors.
[0126] HeadFi explores the reciprocal principle of earphones and demonstrates the capability of using the earphone transducer for user identification, physiological sensing, touch gesture recognition, etc. The hardware dongle builds upon HeadFi but extends it to a software-hardware system that explores two different voice channels to enable hands-free voice activation. Moreover, the mobile voice activation service contributes a novel signal processing pipeline to thoroughly improve the activation accuracy and enhance speech quality.
[0127] Whisper or Silent Speech Interfaces. Silent speech interface technologies may be explored to enrich speech recognition interfaces. For example, LipLearner proposes a customizable silent speech interface on mobile phones by building the relationship between voice commands and the corresponding non-verbal lip movements through a neural network model. It allows users to activate the speech service with lip motions. HP-Speech creates a silent speech interface on earphones by emitting inaudible acoustic signals to detect the movement of the temporomandibular joint (TMJ) for silent voice command recognition. MuteIt tracks the user's jaw motion with a dual-IMU setup to infer word articulation around the ear. EarCommand emits an ultrasonic signal in the ear canal and builds the relationship between the deformation of the ear canal and the movements of the articulators to infer the corresponding silent speech commands while speaking. Unlike these approaches, which aim to establish new paradigms for speech interaction, the mobile voice activation service adheres to the current speech recognition (SR) services, focusing on enhancing their reliability.
8. Conclusion
[0128] The design, implementation, and evaluation of the mobile voice activation service were presented: a software-hardware solution that enables mobile users to activate their voice assistant on earphones without hand gesture intervention. The mobile voice activation service contributes a plethora of low-power signal processing algorithms that take advantage of the two speech signal propagation channels to detect human speech, differentiate the primary speaker, and further enhance the quality of the wakeup word for accurate wakeup word recognition. The experiments in different real-world scenarios demonstrated the efficacy and effectiveness of the mobile voice activation service.
9. Systems and Methods of Processing Audio Signals from Earphones for Interactivity with
Applications
[0129] Referring now to FIG. 19, depicted is a block diagram of a system 100 for processing audio signals from earphones for interactivity with applications. In overview, the system 100 can include at least one computing device 105 and at least one speaker transducer 110, among others. The computing device 105 can include at least one voice interaction service 115 and at least one application 120. The voice interaction service 115 can include at least one audio processor 125, at least one activity detector 130, at least one voice enhancer 135, and at least one machine learning (ML) model 140, among others. The computing device 105 and the speaker transducer 110 can be associated with at least one user 145 (also referred to herein as a primary user). Each of the components of system 100 (e.g., the computing device 105) may be implemented using hardware or a combination of hardware and software, such as those of system 600 as detailed herein in conjunction with FIG. 24. Each of the components in the system 100 may implement or execute the functionalities detailed herein, such as those detailed herein in Sections 1-8.
[0130] In further overview, the computing device 105 can be any computing device comprising one or more processors coupled with memory and software and capable of performing the various processes and tasks described herein. The computing device 105 can be operated or associated with the user 145. The computing device 105 can be a smartphone, other mobile phone, tablet computer, wearable computing device (e.g., smart watch, eyeglasses), or laptop computer, among others. The computing device 105 can be in communication with the speaker transducer 110, among other devices (e.g., via wireless communications or wired communications). The computing device 105 can be in communication with other devices, such as remote servers, computing devices, or other hardware devices, among others.
[0131] The voice interaction service 115 can process, manage, or otherwise handle the exchange of data from the speaker transducer 110 and the application 120. In the voice interaction service 115, the audio processor 125 can receive and process audio signals from the speaker transducer 110. The activity detector 130 can apply the ML model 140 on the processed audio signals from the audio processor 125 to identify that the user 145 is speaking. When the user 145 is identified as speaking, the voice enhancer 135 can add audio recordings of voice
commands to the processed audio signal to invoke at least one function of the application 120. The ML model 140 can be used to identify whether the user 145 is speaking. The ML model 140 may include, for example, a deep learning artificial neural network (ANN), a Naive Bayesian classifier, a relevance vector machine (RVM), a support vector machine (SVM), a regression model (e.g., linear or logistic regression), a clustering model (e.g., k-NN clustering or density-based clustering), or a decision tree (e.g., a random tree forest), among others.
[0132] In some embodiments, the voice interaction service 115 can be executed on the computing device 105 (e.g., as depicted). For example, the voice interaction service 115 can be a process or application separate from the application 120. In some embodiments, the voice interaction service 115 can be part of the application 120 running on the computing device 105. For instance, the functionalities ascribed to the voice interaction service 115 can be executed by the application 120. In some embodiments, the voice interaction service 115 can be executed on a device separate from the computing device 105. For example, the voice interaction service 115 can be executed on one or more processors and memory of an external device (e.g., a dongle or other portable device) that is in communication with the computing device 105.
[0133] The application 120 can include any software program executing on the computing device 105. The application 120 can be a voice assistant application (sometimes herein referred to as a digital assistant application) to interact with the user 145 via audio input (e.g., spoken queries or commands) and audio output (e.g., audio replies to queries or commands). The application 120 can use natural language processing (NLP) and artificial intelligence (AI) techniques to process natural language in the form of audio or text from the user 145. The application 120 can include a set of functions corresponding to a set of voice commands from the user 145. For example, the voice command of "Open application X" from the user 145 can invoke the launching or opening of the application 120. The application 120 can be, for example, Amazon Alexa™, Apple Siri™, Google Assistant™, or Microsoft Cortana™. The application 120 can interface with other processes and applications, such as the voice interaction service 115, to communicate or exchange data.
[0134] The speaker transducer 110 (sometimes herein referred to as an electroacoustic transducer, earphone, or headphone) can produce, output, or otherwise generate acoustic sound waves. To generate the acoustic sound waves, the speaker transducer 110 can transform or convert electrical signals (also referred herein as audio signals) from the computing device 105 into the acoustic sound wave. Conversely, the speaker transducer 110 can also produce, output, or otherwise generate electrical signals from acoustic sound waves. To generate the electrical signals, the speaker transducer 110 can transform or convert acoustic sound waves arriving at the speaker transducer 110.
[0135] The speaker transducer 110 can be optimized to function as a loudspeaker converting electrical signals into acoustic sound waves, rather than operating as a microphone converting acoustic waveforms into electrical signals. As such, the electrical signals converted from acoustic waveforms by the speaker transducer 110 can be of low quality, with high noise, low amplitude, and high interference, among others. The speaker transducer 110 can be a loudspeaker, such as a dynamic speaker, a cone speaker, a horn speaker, a planar magnetic speaker, an electrostatic speaker, or a ribbon speaker, among others. The speaker transducer 110 can be part of an earphone (e.g., fitted within the ear of the user 145) or a headphone (e.g., fitted atop the ear of the user 145), among others. The earphone or the headphone which the speaker transducer 110 is a part of can lack a separate, specific transducer for a microphone.
[0136] The speaker transducer 110 can be arranged, situated, or otherwise positioned relative to a corresponding ear of the user 145. For example, the speaker transducer 110 can be an earphone fitted within an ear canal of the ear of the user 145 or can be a headphone situated about the auricle of the ear of the user 145. There can be at least a pair of speaker transducers 110 about the corresponding ears of the user 145. For example, as depicted, one speaker transducer 110 can be positioned on the left ear of the user 145 and another speaker transducer 110 can be positioned on the right ear of the user 145. While primarily described in terms of one or two speaker transducers 110, the system 100 can include any number of speaker transducers 110 on the user 145.
[0137] Referring now to FIG. 20, depicted is a block diagram of a process 200 for identifying users from audio signals acquired via earphones in the system 100 for processing audio signals. The process 200 can correspond to or include operations performed in the system 100 to identify whether the user 145 is speaking. Under the process 200, the audio processor 125 of the voice interaction service 115 can obtain, identify, or otherwise receive at least one audio signal 205 corresponding to at least one acoustic waveform 210 via the speaker transducer 110. In some embodiments, the audio processor 125 can receive the audio signal 205 via a wireless communication (e.g., Bluetooth or near field communications (NFC)) with the speaker transducer 110. In some embodiments, the audio processor 125 can receive the audio signal 205 via a wired connection with the speaker transducer 110. The acoustic waveform 210 can include a propagation of energy through a medium, such as the air about the user 145 and the body of the user 145. The audio signal 205 can be an electrical (e.g., digitized or quantized) representation of the acoustic waveform 210, and can be sampled at any sampling rate, for instance, ranging from 8-200kHz.
[0138] The acoustic waveform 210 can include at least one portion corresponding to an air channel 215A and at least one portion corresponding to a body channel 215B. The air channel 215A can correspond to the portion of the acoustic waveform 210 traveling through the air about the user 145. For example, the air channel 215A can include acoustic waveforms originating from other sources (e.g., bystanders, vehicles, or background noise) besides the user 145 and reaching the speaker transducer 110. The body channel 215B can correspond to the portion of the acoustic waveform 210 traveling through the body of the user 145. For instance, the body channel 215B can correspond to acoustic waveforms originating from the vocal cords and lungs of the user 145 and arriving at the speaker transducer 110 through the body of the user 145. The acoustic waveform 210 can include a set of formants (e.g., F0, F1, F2, and so forth from lowest to highest frequency) from the user 145. Each formant can correspond to an acoustic resonance of a vocal tract of the user 145 and can be characterized as a maximum within a frequency domain representation of the human speech from the user 145. Upon arrival at the
speaker transducer 110, the speaker transducer 110 can transform or convert the acoustic waveform 210 into the audio signal 205.
[0139] With receipt of the audio signal 205, the audio processor 125 can process or filter out the portion of the acoustic waveform 210 corresponding to the air channel 215A in the audio signal 205 to output, produce, or otherwise generate at least one audio signal 205’. The audio signal 205’ can include or correspond to the portion of the acoustic waveform 210 corresponding to the body channel 215B. For example, the audio signal 205’ can include at least the lower formants (e.g., F0 and F1) of the speech sound from the user 145. The audio signal 205’ can substantially (e.g., at least 75%) lack the portion of the acoustic waveform 210 corresponding to the air channel 215A. In some embodiments, the audio processor 125 can filter out noise in the portion of the acoustic waveform 210 corresponding to the body channel 215B to generate the audio signal 205’. The audio signal 205’ can also substantially (e.g., at least 75%) lack noise within the portion of the acoustic waveform 210 corresponding to the body channel 215B.
[0140] In processing the audio signal 205, the audio processor 125 can apply at least one filter. The filter can include, for example, a low-pass filter (LPF), a band-pass filter (BPF), a band-stop filter (BSF), or a high-pass filter (HPF), among others, or any combination thereof. The filter can be implemented using any type of architecture, such as a resistor-capacitor (RC) filter, a resistor-inductor (RL) filter, an RLC filter, an active filter, a Butterworth filter, a Chebyshev filter, or a Bessel filter, among others. In some embodiments, the audio processor 125 can apply an LPF with a cutoff frequency to attenuate or suppress the portion of the acoustic waveform 210 corresponding to the air channel 215A in the audio signal 205. The cutoff frequency of the LPF can be set at a frequency value to pass the portion of the acoustic waveform 210 corresponding to the body channel 215B and can range around 800-1200 Hz. The cutoff frequency can be set to pass through at least a first formant (F0) of the set of formants within the speech of the user 145.
[0141] In addition, the audio processor 125 can also apply an HPF with another cutoff frequency to the filtered audio signal 205 to attenuate or suppress noise within the portion of the acoustic waveform 210 corresponding to the body channel 215B in the audio signal 205. The cutoff frequency for the HPF can be set at a frequency value to remove lower frequency components below the range of frequencies for a human vocal tract and can range between 25-100 Hz. In some embodiments, the audio processor 125 can apply a BPF with the cutoff frequencies to suppress the portion of the acoustic waveform 210 corresponding to the air channel 215A and the noise within the portion of the acoustic waveform 210 corresponding to the body channel 215B in the audio signal 205. From applying the BPF, the audio processor 125 can generate the audio signal 205’.
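By way of illustration only, the filtering of paragraphs [0140]-[0141] can be sketched in Python with SciPy. The function below applies a Butterworth band-pass; the 50 Hz and 1000 Hz cutoffs are example picks from the 25-100 Hz and 800-1200 Hz ranges recited above, not values fixed by the disclosure.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def isolate_body_channel(audio: np.ndarray, fs: int,
                         low_cut: float = 50.0, high_cut: float = 1000.0,
                         order: int = 4) -> np.ndarray:
    """Band-pass the raw transducer signal to keep the body-channel band.

    low_cut plays the role of the 25-100 Hz HPF described above;
    high_cut plays the role of the 800-1200 Hz LPF. Both defaults are
    illustrative picks from those ranges.
    """
    sos = butter(order, [low_cut, high_cut], btype="bandpass",
                 fs=fs, output="sos")
    # Zero-phase filtering avoids shifting formant onsets in time,
    # which matters for the alignment steps described later.
    return sosfiltfilt(sos, audio)

# Example: a 1-second synthetic signal sampled at 16 kHz with one
# body-band component (120 Hz) and one air-band component (2.5 kHz).
fs = 16_000
t = np.arange(fs) / fs
raw = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 2500 * t)
body_only = isolate_body_channel(raw, fs)
```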
[0142] Using the audio signal 205’, the activity detector 130 can determine or identify whether the user 145, on which the speaker transducer 110 is situated, is speaking. To identify, the activity detector 130 can apply the audio signal 205’ to the ML model 140. The ML model 140 can be implemented using any model architecture and can have at least one input corresponding to an audio signal, at least one output indicating whether a user is speaking, and a set of weights relating the input to the output, among others. The ML model 140 can be a lightweight model architecture, with minimal resource consumption specifications that a portable or mobile device (e.g., the computing device 105 or external hardware device) can satisfy. In applying, the activity detector 130 can feed or input the audio signal 205’ into the ML model 140. Upon feeding, the activity detector 130 can process the input audio signal 205’ in accordance with the set of weights of the ML model 140.
[0143] The ML model 140 may have been initialized, trained, or established (e.g., by the voice interaction service 115 or another computing device) using a training dataset. The ML model 140 can be trained in accordance with any learning techniques, such as supervised learning, unsupervised learning, Q-learning, or weakly supervised learning, among others. The training dataset can include a set of examples. Each example can include or identify a sample audio signal including a portion of an acoustic waveform corresponding to a body channel of a respective sample user (e.g., on which the speaker transducer for the sample audio signal is situated). Each example can also include or identify a label indicating whether the sample user is speaking for the corresponding sample audio signal. The label can also indicate whether
someone else besides the sample user is speaking or whether the sample audio is of ambient noise (e.g., in the background).
[0144] For each example, the sample audio signal can be applied to the ML model 140 to produce or generate an output indicating whether the sample user in the sample audio signal is speaking. The output of the ML model 140 can be compared with the indication as identified by the label. Based on the comparison, a loss metric can be calculated in accordance with a loss function (e.g., a hinge loss, a mean squared error (MSE), a mean absolute error (MAE), a cross-entropy loss, a Huber loss, or a log loss). The loss metric can be used to modify or update one or more of the set of weights of the ML model 140. The updating of the weights of the ML model 140 may be in accordance with an optimization function (e.g., stochastic gradient descent with a predefined learning rate). This process can be iteratively repeated until the ML model 140 reaches a convergence condition to stop or cease the training process. In some embodiments, the training of the ML model 140 can be performed on another computing system, separate from the computing device 105, and then loaded on the computing device 105 (e.g., when the voice interaction service 115 is installed).
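As a non-limiting sketch of this training procedure, the loop below pairs one of the recited losses (a cross-entropy variant) with stochastic gradient descent. The network shape, feature dimension, learning rate, and convergence tolerance are illustrative assumptions, and the random tensors stand in for real training examples.

```python
import torch
from torch import nn

# Lightweight stand-in for the ML model 140: a small feed-forward net.
model = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy, one of the listed losses
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # predefined learning rate

features = torch.randn(256, 64)                  # stand-in for sample audio features
labels = torch.randint(0, 2, (256, 1)).float()   # 1 = sample user speaking

prev_loss = float("inf")
for epoch in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)      # compare output with label
    loss.backward()
    optimizer.step()                             # update weights via SGD
    if abs(prev_loss - loss.item()) < 1e-6:      # simple convergence condition
        break
    prev_loss = loss.item()
```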
[0145] Based on applying the ML model 140 to the audio signal 205’, the activity detector 130 can determine or identify whether the user 145 of the speaker transducer 110 is speaking. In some embodiments, the activity detector 130 can identify whether the audio signal 205 (and by extension, the acoustic waveform 210) is primarily (e.g., at least 75%) originating from the user 145. From processing the audio signal 205’ using the weights of the ML model 140, the activity detector 130 can produce or generate a classification indicating whether the user 145 of the speaker transducer 110 is speaking. When the classification from the ML model 140 indicates that the user is speaking, the activity detector 130 can identify the user 145 as speaking. The activity detector 130 can also determine or identify that the audio signal 205 is primarily originating from the user 145. Conversely, when the classification from the ML model 140 indicates that the user is not speaking, the activity detector 130 can identify the user 145 as not speaking. The activity detector 130 can also determine or identify that the audio signal 205 is not primarily originating from the user 145.
[0146] In some embodiments, the activity detector 130 can calculate, generate, or determine a likelihood that the user 145 is speaking to identify whether the user 145 is speaking. The likelihood can identify or indicate a degree of probability that the user 145 on whom the speaker transducer 110 is positioned is speaking. From applying the ML model 140 to the audio signal 205’, the activity detector 130 can produce, output, or determine the likelihood. With the determination of the likelihood, the activity detector 130 can compare the likelihood with a threshold. The threshold can delineate, define, or otherwise identify a value (e.g., 80-95%) for the likelihood at which to identify the user 145 as speaking. If the likelihood satisfies (e.g., is greater than or equal to) the threshold, the activity detector 130 can identify the user 145 as speaking. Conversely, if the likelihood does not satisfy (e.g., is less than) the threshold, the activity detector 130 can identify the user 145 as not speaking.
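A minimal sketch of this likelihood check, assuming the ML model 140 emits a single logit; the 0.9 default is one example value from the 80-95% range recited above.

```python
import torch

def user_is_speaking(model: torch.nn.Module,
                     features: torch.Tensor,
                     threshold: float = 0.9) -> bool:
    """Return True when the speaking likelihood satisfies the threshold."""
    with torch.no_grad():
        # Sigmoid maps the model's logit to a probability in [0, 1].
        likelihood = torch.sigmoid(model(features)).item()
    return likelihood >= threshold  # "satisfies" = greater than or equal
```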
[0147] The activity detector 130 can generate or provide at least one output 220 based on the identification of whether the user 145 is speaking. The output 220 can include or identify information based in part on the identification of whether the user 145 is speaking. When the identification is that the user 145 is speaking, the activity detector 130 can generate the output 220 to indicate that the user 145 is speaking. In contrast, when the identification is that the user 145 is not speaking, the activity detector 130 can generate the output 220 to indicate that the user 145 is not speaking. The information of the output 220 can also include the audio signal 205’ (or the audio signal 205) or an identifier of the user 145, among others. With the generation of the output 220, the activity detector 130 can send, relay, or otherwise provide the output 220 to the application 120. In some embodiments, the activity detector 130 can store and maintain an association between the output 220 and an identifier of the audio signal 205 in memory. The association can be maintained using one or more data structures, such as a table, a linked list, an array, a matrix, a tree, a heap, a queue, or a stack, among others.
[0148] Referring now to FIG. 21, depicted is a block diagram of a process 300 for enhancing user voice commands in audio signals acquired via earphones in the system 100 for processing audio signals. The process 300 can correspond to or include operations performed in the system 100 to add pre-recorded audio to audio signals acquired via the speaker transducer 110. The process 300 can be initiated or performed when the activity detector 130 identifies that the user 145 of the speaker transducer 110 is speaking using the audio signal 205’ corresponding to the portion of the acoustic waveform 210 associated with the body channel 215B. Under the process 300, the voice enhancer 135 of the voice interaction service 115 can access data storage 305 to retrieve, obtain, or otherwise identify at least one of a set of recorded audio signals 310A-N (hereinafter generally referred to as audio signals 310). In conjunction, the voice enhancer 135 can retrieve, identify, or otherwise receive the audio signal 205’ (or the original audio signal 205) and the output 220, among others.
[0149] Each recorded audio signal 310 can correspond to a respective voice command including one or more keywords to invoke at least one respective function of a corresponding application 120. The function can include, for example: a wake-up command (e.g., “Open application X”) to launch or open the application 120; a volume control command (e.g., “Increase volume” or “Decrease volume”) to adjust the volume of the audio output from the application 120; a command to control household appliances (e.g., “Turn on lights” or “Turn off stove”) through the application 120; or any built-in functionality of the application 120, among others. In some embodiments, each recorded audio signal 310 can be associated with at least one application 120. The data storage 305 can be maintained on the voice interaction service 115 (e.g., as shown), a memory of the computing device 105, or a remote service accessible to the voice interaction service 115 via one or more networks, among others. The data storage 305 can store, maintain, or otherwise include an association between each recorded audio signal 310 and a respective application 120 (or a function of the respective application 120).
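One way to picture the data storage 305 is as a nested mapping from applications to functions to recorded clips. The application names, function names, and zero-filled placeholder arrays below are hypothetical stand-ins, not structures required by the disclosure.

```python
import numpy as np

FS = 16_000  # assumed sample rate for the placeholder clips

# Sketch of data storage 305: application -> function -> recorded clip.
data_storage: dict[str, dict[str, np.ndarray]] = {
    "voice_assistant": {
        "wake_up": np.zeros(FS),    # placeholder for "Open application X"
        "volume_up": np.zeros(FS),  # placeholder for "Increase volume"
    },
    "smart_home": {
        "lights_on": np.zeros(FS),  # placeholder for "Turn on lights"
    },
}

def select_recording(application: str, function: str) -> np.ndarray:
    """Return the recorded audio signal 310' associated with a function."""
    return data_storage[application][function]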
[0150] In some embodiments, the recorded audio signal 310 can include a portion of the acoustic waveform for the voice command corresponding to an air channel (e.g., similar to the air channel 215A). For example, the audio signal corresponding to the acoustic waveform for the voice command may have been acquired from another user (e.g., different from the user 145), and may have been filtered to pass the portion corresponding to the air channel to generate the recorded audio signal 310. The recorded audio signal 310 as a result can include or contain higher frequency components of the acoustic waveform for the voice command. For example,
the recorded audio signal 310 can include higher formants (e.g., F1-F3, or F2 and F3) of the speaker uttering the voice command. In some embodiments, the recorded audio signal 310 can include the entirety of the acoustic waveform for the voice command corresponding to an air channel (e.g., similar to the air channel 215A) and a body channel (e.g., similar to the body channel 215B). For example, the recorded audio signal 310 can include the entirety of the frequency components of the acoustic waveform for the voice command.
[0151] When the user 145 is identified as speaking as indicated by the output 220, the voice enhancer 135 can identify or select at least one recorded audio signal 310’ of the set of recorded audio signals 310 from the data storage 305. In some embodiments, the voice enhancer 135 can select the recorded audio signal 310’ based on an identification of the application 120. For example, the voice enhancer 135 can identify the application 120 as a voice assistant application and can select the recorded audio signal 310’ associated with the application 120 from the data storage 305. In some embodiments, the voice enhancer 135 can select the recorded audio signal 310’ from the set of recorded audio signals 310 at random. In contrast, when the user 145 is identified as not speaking as indicated by the output 220, the voice enhancer 135 can refrain from identifying or selecting any of the recorded audio signals 310.
[0152] With the selection of the recorded audio signal 310’, the voice enhancer 135 can produce, create, or otherwise generate at least one audio signal 205” to include the audio signal 205’ and the recorded audio signal 310’. To generate, the voice enhancer 135 can insert, join, or otherwise add the recorded audio signal 310’ to the audio signal 205’ (or the audio signal 205). In some embodiments, the voice enhancer 135 can generate the audio signal 205” to include one portion (e.g., a higher frequency band above 800-1200 Hz) corresponding to the recorded audio signal 310’ and another portion (e.g., a lower frequency band below 800-1200 Hz) corresponding to the audio signal 205’.
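As an illustrative sketch of this band-split combination, the helper below low-passes the body-channel signal 205’ and high-passes the recorded signal 310’ about a single crossover frequency before summing them. The 1000 Hz crossover is an example pick from the 800-1200 Hz range recited above, and both inputs are assumed to share a sample rate.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def combine_signals(body_low: np.ndarray, recorded: np.ndarray,
                    fs: int, split_hz: float = 1000.0) -> np.ndarray:
    """Form audio signal 205'' from the low band of 205' and the high
    band of the recorded signal 310'."""
    low_sos = butter(4, split_hz, btype="lowpass", fs=fs, output="sos")
    high_sos = butter(4, split_hz, btype="highpass", fs=fs, output="sos")
    # Trim to a common length before summing the two bands.
    n = min(len(body_low), len(recorded))
    return (sosfiltfilt(low_sos, body_low[:n])
            + sosfiltfilt(high_sos, recorded[:n]))
```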
[0153] The voice enhancer 135 can process or parse the audio signal 205’ (e.g., using a short-time Fourier transform (STFT) representation) to extract, determine, or otherwise identify one or more characteristics. In some embodiments, the voice enhancer 135 can apply one or more machine learning (ML) models (e.g., artificial neural networks (ANNs)), statistical models (e.g., autocorrelation), or other functions (e.g., linear predictive coding (LPC) or cepstral analysis), among others, to identify the characteristics. The characteristics can include or identify, for example, a time length of the audio signal 205’, a time point of syllables (e.g., an onset of consonants or vowels) in the audio signal 205’, a time point of each formant (e.g., an onset of F0, F1, or F2) of the speech from the user 145 in the audio signal 205’, or a distribution of energy across time in the audio signal 205’, among others. In some embodiments, the voice enhancer 135 can process or parse the audio signal 310’ (e.g., using an STFT representation) to extract, determine, or otherwise identify one or more characteristics. The characteristics can include or identify, for example, a time length of the audio signal 310’, a time point of syllables (e.g., an onset of consonants or vowels) in the audio signal 310’, a time point of each formant (e.g., an onset of F0, F1, or F2) of the speech in the audio signal 310’, or a distribution of energy across time in the audio signal 310’, among others.
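A minimal sketch of this characteristic extraction, using SciPy's STFT to recover the time length and the energy-over-time distribution; formant tracking and syllable-onset detection are richer analyses that this sketch deliberately elides.

```python
import numpy as np
from scipy.signal import stft

def signal_characteristics(audio: np.ndarray, fs: int) -> dict:
    """Extract basic characteristics of a signal from its STFT."""
    freqs, times, spec = stft(audio, fs=fs, nperseg=512)
    energy = np.abs(spec) ** 2
    return {
        "length_s": len(audio) / fs,
        # Energy per STFT frame, normalized into a distribution over time.
        "energy_over_time": energy.sum(axis=0) / energy.sum(),
        "frame_times_s": times,
    }
```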
[0154] In some embodiments, in adding the recorded audio signal 310’ to the audio signal 205’, the voice enhancer 135 can alter, change, or otherwise modify the recorded audio signal 310’ based on one or more characteristics of the audio signal 205’ (or the audio signal 205). Using the characteristics of the audio signal 205’, the voice enhancer 135 can alter, adjust, or otherwise modify corresponding characteristics of the recorded audio signal 310’. In modifying, the voice enhancer 135 can alter the characteristics of the audio signal 310’ to match, align, or otherwise correspond with the characteristics of the audio signal 205’, or vice versa. In some embodiments, the voice enhancer 135 can modify the time length of the recorded audio signal 310’ to substantially (e.g., at least 80-95%) match or correspond to the time length of the audio signal 205’. For instance, the voice enhancer 135 can lengthen or shorten the time length of the audio signal 310’ to match the time length of the audio signal 205’. In some embodiments, the voice enhancer 135 can modify the time point of syllables in the recorded audio signal 310’ to substantially (e.g., at least 80-95%) align, match, or correspond to the time point of syllables in the audio signal 205’. For example, the voice enhancer 135 can adjust the timing of onset of at
least one consonant or vowel in the recorded audio signal 310’ to align with the timing of onset of at least one consonant or vowel in the audio signal 205’.
[0155] In some embodiments, the voice enhancer 135 can modify the time point of the formants (e.g., F2-F3) in the recorded audio signal 310’ to substantially (e.g., at least 80-95%) match or correspond to the time point of the formants (e.g., F0, F1, or F2) of the audio signal 205’. For instance, since the audio signal 205’ may lack higher formants (e.g., F2 or F3), the voice enhancer 135 can move the time point of at least one formant (e.g., higher formants F2 or F3) in the recorded signal 310’ to align with the time point of a lower formant (e.g., F0 or F1) in the audio signal 205’. Within the frequency domain, the frequency information of the recorded signal 310’ can reside above the frequency information of the audio signal 205’. In some embodiments, the voice enhancer 135 can modify the distribution of energy across time in the recorded audio signal 310’ to substantially (e.g., at least 80-95%) match or correspond to the distribution of energy of the audio signal 205’. For example, the voice enhancer 135 can concentrate or disperse at least a portion (e.g., 60-90%) of the distribution of energy across time in the recorded audio signal 310’ to align with a portion of the distribution of energy across time in the audio signal 205’. The voice enhancer 135 can perform any number of modifications to the recorded audio signal 310’ to enhance the audio signal 205’.
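By way of example, time-length matching alone can be sketched with plain resampling, as below. Plain resampling also shifts pitch, so a pitch-preserving time-stretch (e.g., a phase vocoder) would be a more faithful choice; that preference is an assumption on our part, as the disclosure does not name a specific method.

```python
import numpy as np
from scipy.signal import resample

def match_time_length(recorded: np.ndarray,
                      target: np.ndarray) -> np.ndarray:
    """Stretch or compress the recorded signal 310' so its duration
    matches that of the body-channel signal 205' (the target)."""
    return resample(recorded, len(target))
```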
[0156] With the modifications of the recorded audio signal 310’, the voice enhancer 135 can join or add the modified recorded audio signal 310’ to the audio signal 205’ to form, produce, or otherwise generate the audio signal 205”. As the audio signal 205’ corresponds to acoustic waveform components in the lower frequency bands, the recorded audio signal 310’ can be added to the higher frequency bands to increase or enhance the likelihood that the function of the application 120 is successfully invoked. With the generation of the audio signal 205”, the voice enhancer 135 can send, relay, or otherwise provide the audio signal 205” to the application 120. In some embodiments, the voice enhancer 135 can provide the audio signal 205” via an application programming interface (API) for the application 120.
[0157] The application 120 executing on the computing device 105 can process the audio signal 205” provided by the voice interaction service 115. The application 120 can include natural language processing (NLP) and artificial intelligence (AI) functionalities to process and extract information from the audio signal 205”. When the audio signal 205” is identified as corresponding to a particular function, the application 120 can carry out, perform, or otherwise execute the specified function. For example, when the audio signal 205” is for the wake-up command, the application 120 can cease sleep mode and launch as a foreground process on the computing device 105. On the other hand, when the audio signal 205” is identified as not corresponding to any of the functions of the application 120, the application 120 can produce, output, or generate an error indication. In some embodiments, the application 120 can refrain from responding to the audio signal 205”.
[0158] The application 120 can output, produce, or otherwise generate at least one response 315 to indicate whether the invocation of the function of the application 120 is successful or failed. When the function corresponding to the voice command of the audio signal 205” is executed, the application 120 can generate the response 315 to indicate that the invocation of the function is successful. Conversely, when the audio signal 205” is identified as not corresponding to any of the functions of the application 120, the application 120 can generate the response 315 to indicate that the invocation of any function of the application 120 has failed. In some embodiments, the application 120 can refrain from generation of the response 315 when the audio signal 205” is identified as not corresponding to any function of the application 120. With the generation, the application 120 can return, send, or otherwise provide the response 315 to the voice interaction service 115.
[0159] The voice enhancer 135 can identify or determine whether the invocation of the function of the application 120 was successful or a failure based on the response 315 from the application 120. When the response 315 indicates that the invocation of the function of the application 120 was successful, the voice enhancer 135 can determine that the invocation of the function of the application 120 was successful. The voice enhancer 135 can also refrain from additional selections of recorded audio signals 310’ for another voice command or for another
application to add to the audio signal 205’. Conversely, when the response 315 indicates that the invocation of the function of the application 120 was a failure, the voice enhancer 135 can determine that the invocation of the function of the application 120 failed. In some embodiments, the voice enhancer 135 can determine that the invocation of the function of the application 120 failed, based on lack of any response 315 from the application 120. For instance, the voice enhancer 135 can maintain a timer to keep track of time, subsequent to providing the audio signal 205” to the application 120. The voice enhancer 135 can wait for the response 315 from the application 120. When the elapsed time is above a threshold (e.g., 5-30 seconds) and no response 315 is received, the voice enhancer 135 can determine that the invocation of the function of the application 120 failed.
[0160] When the invocation of the function of the application 120 is determined to have failed, the voice enhancer 135 can repeat the selection of another recorded audio signal 310’ for another voice command for the same application 120 or for another application 120. The process for selection of the subsequent recorded audio signal 310’ and the determination of whether the invocation of the function of the application 120 is successful can be similar to that described herein. For example, the voice enhancer 135 can select another recorded audio signal 310’ from the set of recorded audio signals 310. The voice enhancer 135 can generate another audio signal 205” using the audio signal 205’ and the next selected recorded audio signal 310’ to provide to invoke another function of the same application 120 or another application 120. Upon providing the audio signal 205” to the application 120, the voice enhancer 135 can determine whether the function of the application 120 is successfully invoked. The processes of the voice enhancer 135 can be repeated any number of times until one of the applications 120 is invoked or a threshold number of attempts is exceeded. If the number of attempts has not exceeded the threshold (e.g., 5-20 attempts), the voice enhancer 135 can continue selection of another recorded audio signal 310’ for the voice command for the application 120. If the number of attempts has exceeded the threshold, the voice enhancer 135 can terminate selection of other recorded audio signals 310’.
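The retry-and-timeout behavior of paragraphs [0159]-[0160] can be sketched as a bounded loop. The 5-attempt cap and 10-second timeout are example picks from the 5-20 attempt and 5-30 second ranges recited above, and send_to_app and wait_for_response are hypothetical callables standing in for the API hand-off and the response 315.

```python
import time

MAX_ATTEMPTS = 5   # illustrative pick from the 5-20 attempt range above
TIMEOUT_S = 10.0   # illustrative pick from the 5-30 second range above

def invoke_with_retries(candidates, send_to_app, wait_for_response) -> bool:
    """Try successive recorded commands until one invokes a function."""
    for attempt, recording in enumerate(candidates):
        if attempt >= MAX_ATTEMPTS:
            break                        # threshold of attempts exceeded
        send_to_app(recording)           # provide audio signal 205'' to app
        deadline = time.monotonic() + TIMEOUT_S
        while time.monotonic() < deadline:
            response = wait_for_response()   # poll for response 315
            if response is not None:
                if response.get("success"):
                    return True          # invocation succeeded; stop selecting
                break                    # explicit failure; try next recording
            time.sleep(0.1)
        # Falling out of the while loop means the timer elapsed with no
        # response, which is treated as a failed invocation.
    return False
```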
[0161] In this manner, the voice interaction service 115 can provide for voice command activation on speaker transducers 110 on earphones without the use of additional hardware modifications. By repurposing the speaker transducer 110 of earphones into a microphone, the voice interaction service 115 may reduce or eliminate reliance on in-ear microphones or dedicated sensors. In addition, the voice interaction service 115 can use the ML model 140, which is a lightweight model, to distinguish between the speech from the primary user 145 as opposed to other speakers or ambient noise. The ML model 140 can leverage the differing propagation characteristics of human speech through the air channel 215A versus the body channel 215B as sensed by the speaker transducer 110. The ability to accurately distinguish can lower the instances of or prevent unintended activation of the application 120 and thereby reduce processor and power consumption, making the computing device 105 more efficient in terms of processing and electrical power.
[0162] Furthermore, the voice interaction service 115 can offer robust wake-up word recognition by compensating for the loss of higher frequency energies of the audio from propagating through the body channel 215B in the body of the primary user. To enhance the remaining information of the audio signal 205’, the voice interaction service 115 can take a “copy, paste, and adapt” approach using a bank of recorded audio signals 310. The recorded audio signals 310 can be used to add back high-frequency components to the audio signal 205’. The addition of these higher-frequency components can increase the likelihood that the voice command is recognized by the application 120, thereby improving the accuracy of wake-up word recognition even in noisy environments.
[0163] The voice interaction service 115 can also function as a gating mechanism by selectively forwarding audio with voice commands to the application 120, thereby minimizing the risk of privacy leaks and conserving power from processing otherwise unintelligible audio. This can also enhance the quality of human-computer interactions (HCI) between the user 145 and the application 120. In certain environments, the voice interaction service 115 can permit the user 145 to use voice commands with the speaker transducer 110 in a hands-free manner, thereby allowing the user to use the application 120 even in situations where it would be otherwise impractical. Overall, the voice interaction service 115 can provide for high accuracy in wake-up word recognition through a lightweight signal processing algorithm that distinguishes between
the primary user's speech and ambient noise, ensuring low false positive rates and efficient power consumption.
[0164] Referring now to FIG. 22, depicted is a flow diagram of a method 400 of identifying users from audio signals acquired via earphones. The method 400 can be implemented or performed using any one or more of the components detailed herein, such as the system 100 or 614, among others. Under the method 400, a computing system can receive an audio signal from an earphone (405). The computing system can filter out an air channel portion from the received audio signal (410). The computing system can apply the filtered audio signal to a machine learning (ML) model (415). The computing system can identify whether the user of the earphone is speaking based on the application of the filtered audio signal to the ML model (420). If the user is identified as speaking, the computing system can identify the user as speaking (425). Otherwise, if the user is identified as not speaking, the computing system can identify the user as not speaking (430). The computing system can provide an output based on the identification (435).
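Composing the earlier sketches, method 400 can be summarized end to end as below. The extract_features helper is hypothetical, since the disclosure does not prescribe a feature representation for the ML model 140; the earlier isolate_body_channel and user_is_speaking sketches are reused.

```python
import numpy as np
import torch

def extract_features(body: np.ndarray, dim: int = 64) -> torch.Tensor:
    """Hypothetical featurization: a fixed-length log-magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(body, n=2 * dim - 1))  # length == dim
    return torch.log1p(torch.from_numpy(spectrum[:dim]).float())

def identify_speaker(raw_audio, fs, model, threshold=0.9):
    """End-to-end sketch of method 400 using the helpers sketched above."""
    body = isolate_body_channel(raw_audio, fs)               # step 410
    features = extract_features(body)                        # hypothetical helper
    speaking = user_is_speaking(model, features, threshold)  # steps 415-430
    return {"speaking": speaking, "signal": body}            # step 435: output 220
```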
[0165] Referring now to FIG. 23, depicted is a flow diagram of a method 500 of enhancing voice commands from audio signals acquired via earphones. The method 500 can be implemented or performed using any one or more of the components detailed herein, such as the system 100 or 614, among others. Under the method 500, a computing system can identify the user as speaking from an audio signal received via an earphone (505). The computing system can select an audio recording for a voice command to add to the audio signal (510). The computing system can provide an enhanced audio signal with the voice command to invoke a function of an application (515). The computing system can receive an indication from the application (520). The computing system can determine whether the invocation of the function is successful (525). If the function is not successfully invoked, the computing system can repeat the functionality from step (510). If the function is determined to have been successfully invoked, the computing system can refrain from additional selection (530).
10. Computer Environment
[0166] Various operations described herein can be implemented on one or more computer systems. FIG. 24 shows a block diagram of a representative computing system 614 usable to implement the present disclosure. In some embodiments, the methods 400 and 500 may be implemented by the computing system 614. Computing system 614 can be implemented, for example, as a consumer device such as a smartphone, other mobile phone, tablet computer, wearable computing device (e.g., smart watch, eyeglasses, head mounted display), desktop computer, laptop computer, cloud computing service or implemented with distributed computing devices. In some embodiments, the computing system 614 can include computer components such as processors 616, storage device 618, network interface 620, user input device 622, and user output device 624.
[0167] Network interface 620 can provide a connection to a wide area network (e.g., the Internet) to which the WAN interface of a remote server system is also connected. Network interface 620 can include a wired interface (e.g., Ethernet) and/or a wireless interface implementing various RF data communication standards such as Wi-Fi, Bluetooth, or cellular data network standards (e.g., 4G, 5G, 60 GHz, LTE, etc.).
[0168] User input device 622 can include any device (or devices) via which a user can provide signals to computing system 614; computing system 614 can interpret the signals as indicative of particular user requests or information. User input device 622 can include any or all of a keyboard, a controller (e.g., joystick), touch pad, touch screen, mouse or other pointing device, scroll wheel, click wheel, dial, button, switch, keypad, microphone, sensors (e.g., a motion sensor, an eye tracking sensor, etc.), and so on.
[0169] User output device 624 can include any device via which computing system 614 can provide information to a user. For example, user output device 624 can include a display to display images generated by or delivered to computing system 614. The display can incorporate various image generation technologies, e.g., a liquid crystal display (LCD), light-emitting diode (LED) including organic light-emitting diodes (OLED), projection system, cathode ray tube (CRT), or the like, together with supporting electronics (e.g., digital-to-analog or analog-to-
digital converters, signal processors, or the like). A device such as a touchscreen that functions as both an input and output device can be used. Output devices 624 can be provided in addition to or instead of a display. Examples include indicator lights, speakers, tactile “display” devices, printers, and so on.
[0170] Some implementations include electronic components, such as microprocessors, storage, and memory that store computer program instructions in a non-transitory computer readable storage medium. Many of the features described in this specification can be implemented as processes that are specified as a set of program instructions encoded on a computer readable storage medium. When these program instructions are executed by one or more processors, they cause the processors to perform various operations indicated in the program instructions. Examples of program instructions or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter. Through suitable programming, processor 616 can provide various functionalities for computing system 614, including any of the functionalities described herein as being performed by a server or client, or other functionalities associated with message management services.
[0171] It will be appreciated that computing system 614 is illustrative and that variations and modifications are possible. Computer systems used in connection with the present disclosure can have other capabilities not specifically described here. Further, while computing system 614 is described with reference to particular blocks, it is to be understood that these blocks are defined for convenience of description and are not intended to imply a particular physical arrangement of component parts. For instance, different blocks can be located in the same facility, in the same server rack, or on the same motherboard. Further, the blocks need not correspond to physically distinct components. Blocks can be configured to perform various operations, e.g., by programming a processor or providing appropriate control circuitry, and various blocks might or might not be reconfigurable depending on how the initial configuration is obtained. Implementations of the present disclosure can be realized in a variety of apparatus including electronic devices implemented using any combination of circuitry and software.
[0172] Having now described some illustrative implementations, it is apparent that the foregoing is illustrative and not limiting, having been presented by way of example. In particular, although many of the examples presented herein involve specific combinations of method acts or system elements, those acts and those elements can be combined in other ways to accomplish the same objectives. Acts, elements and features discussed in connection with one implementation are not intended to be excluded from a similar role in other implementations.
[0173] The hardware and data processing components used to implement the various processes, operations, illustrative logics, logical blocks, modules and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, or any conventional processor, controller, microcontroller, or state machine. A processor also may be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. In some embodiments, particular processes and methods may be performed by circuitry that is specific to a given function. The memory (e.g., memory, memory unit, storage device, etc.) may include one or more devices (e.g., RAM, ROM, flash memory, hard disk storage, etc.) for storing data and/or computer code for completing or facilitating the various processes, layers and modules described in the present disclosure. The memory may be or include volatile memory or nonvolatile memory, and may include database components, object code components, script components, or any other type of information structure for supporting the various activities and information structures described in the present disclosure. According to an embodiment, the memory is communicably connected to the processor via a processing circuit and includes computer code for executing (e.g., by the processing circuit and/or the processor) the one or more processes described herein.
[0174] The present disclosure contemplates methods, systems and program products on any machine-readable media for accomplishing various operations. The embodiments of the present disclosure may be implemented using existing computer processors, or by a special purpose computer processor for an appropriate system, incorporated for this or another purpose, or by a hardwired system. Embodiments within the scope of the present disclosure include program products comprising machine-readable media for carrying or having machine-executable instructions or data structures stored thereon. Such machine-readable media can be any available media that can be accessed by a general purpose or special purpose computer or other machine with a processor. By way of example, such machine-readable media can comprise RAM, ROM, EPROM, EEPROM, or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code in the form of machine-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer or other machine with a processor. Combinations of the above are also included within the scope of machine-readable media. Machine-executable instructions include, for example, instructions and data which cause a general-purpose computer, special purpose computer, or special purpose processing machines to perform a certain function or group of functions.
[0175] The phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” “having,” “containing,” “involving,” “characterized by,” “characterized in that” and variations thereof herein, is meant to encompass the items listed thereafter, equivalents thereof, and additional items, as well as alternate implementations consisting of the items listed thereafter exclusively. In one implementation, the systems and methods described herein consist of one, each combination of more than one, or all of the described elements, acts, or components.
[0176] Any references to implementations or elements or acts of the systems and methods herein referred to in the singular can also embrace implementations including a plurality of these elements, and any references in plural to any implementation or element or act herein can also embrace implementations including only a single element. References in the singular or
plural form are not intended to limit the presently disclosed systems or methods, their components, acts, or elements to single or plural configurations. References to any act or element being based on any information, act, or element can include implementations where the act or element is based at least in part on any information, act, or element.
[0177] Any implementation disclosed herein can be combined with any other implementation or embodiment, and references to “an implementation,” “some implementations,” “one implementation” or the like are not necessarily mutually exclusive and are intended to indicate that a particular feature, structure, or characteristic described in connection with the implementation can be included in at least one implementation or embodiment. Such terms as used herein are not necessarily all referring to the same implementation. Any implementation can be combined with any other implementation, inclusively or exclusively, in any manner consistent with the aspects and implementations disclosed herein.
[0178] Where technical features in the drawings, detailed description or any claim are followed by reference signs, the reference signs have been included to increase the intelligibility of the drawings, detailed description, and claims. Accordingly, neither the reference signs nor their absence have any limiting effect on the scope of any claim elements. Technical and scientific terms used herein have the meanings commonly understood by one of ordinary skill in the art, unless otherwise defined. Any suitable materials and/or methodologies known to those of ordinary skill in the art can be utilized in carrying out the methods described herein.
[0179] Systems and methods described herein may be embodied in other specific forms without departing from the characteristics thereof. As used herein, “approximately,” “about,” “substantially” or other terms of degree will be understood by persons of ordinary skill in the art and will vary to some extent on the context in which it is used. If there are uses of the term which are not clear to persons of ordinary skill in the art given the context in which it is used, references to “approximately,” “about,” “substantially” or other terms of degree shall include
variations of +/- 10% from the given measurement, unit, or range unless explicitly indicated otherwise.
[0180] Coupled elements can be electrically, mechanically, or physically coupled with one another directly or with intervening elements. The scope of the systems and methods described herein is thus indicated by the appended claims, rather than the foregoing description, and changes that come within the meaning and range of equivalency of the claims are embraced therein.
[0181] The term “coupled” and variations thereof include the joining of two members directly or indirectly to one another. Such joining may be stationary (e.g., permanent or fixed) or moveable (e.g., removable or releasable). Such joining may be achieved with the two members coupled directly with or to each other, with the two members coupled with each other using a separate intervening member and any additional intermediate members coupled with one another, or with the two members coupled with each other using an intervening member that is integrally formed as a single unitary body with one of the two members. If “coupled” or variations thereof are modified by an additional term (e.g., directly coupled), the generic definition of “coupled” provided above is modified by the plain language meaning of the additional term (e.g., “directly coupled” means the joining of two members without any separate intervening member), resulting in a narrower definition than the generic definition of “coupled” provided above. Such coupling may be mechanical, electrical, or fluidic.
[0182] References to “or” can be construed as inclusive so that any terms described using “or” can indicate any of a single, more than one, and all of the described terms. A reference to “at least one of ‘A’ and ‘B’” can include only ‘A’, only ‘B’, as well as both ‘A’ and ‘B’. Such references used in conjunction with “comprising” or other open terminology can include additional items.
[0183] Modifications of described elements and acts, such as variations in sizes, dimensions, structures, shapes and proportions of the various elements, values of parameters, mounting arrangements, use of materials, colors, orientations can occur without materially
departing from the teachings and advantages of the subject matter disclosed herein. For example, elements shown as integrally formed can be constructed of multiple parts or elements, the position of elements can be reversed or otherwise varied, and the nature or number of discrete elements or positions can be altered or varied. Other substitutions, modifications, changes and omissions can also be made in the design, operating conditions and arrangement of the disclosed elements and operations without departing from the scope of the present disclosure.
[0184] References herein to the positions of elements (e.g., “top,” “bottom,” “above,” “below”) are merely used to describe the orientation of various elements in the FIGURES. The orientation of various elements may differ according to other exemplary embodiments, and that such variations are intended to be encompassed by the present disclosure.
[0185] As used herein, a subject can be a mammal, such as a non-primate (e.g., cows, pigs, horses, cats, dogs, rats, etc.) or a primate (e.g., monkey and human). In certain embodiments, the term “subject,” as used herein, refers to a vertebrate, such as a mammal.
Mammals include, without limitation, humans, non-human primates, wild animals, feral animals, farm animals, sport animals, and pets. In certain exemplary embodiments, a subject is a human.
[0186] As used herein, the terms “subject” and “user” are used interchangeably.
[0187] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present disclosure, suitable methods and materials are described herein.
[0188] As used herein, the singular forms “a”, “an,” and “the” include plural referents, unless the context clearly indicates otherwise. For example, the term “a cell” includes a plurality of cells, including mixtures thereof.
[0189] As used herein, the term “about” is used to indicate that a value includes the standard deviation of error for the device or method being employed to determine the value. The term “about” when used before a numerical designation, e.g., temperature, time, amount, and concentration, including range, indicates approximations, which may vary by (+) or (-) 15%, 10%, 5%, 3%, 2%, or 1%.
[0190] Ranges: throughout this disclosure, various aspects of the invention can be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges, as well as individual numerical values within that range. For example, description of a range, such as from 1 to 6, should be considered to have specifically disclosed subranges, such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 2.7, 3, 4, 5, 5.3, and 6. This applies regardless of the breadth of the range.
Claims
1. A method of identifying users from audio signals acquired via electroacoustic transducers, comprising: receiving, by one or more processors, a first audio signal corresponding to a first acoustic waveform acquired via a speaker transducer positioned relative to an ear of a first user, the first acoustic waveform having (i) a first portion traveling through the first user and (ii) a second portion traveling outside the first user; filtering, by the one or more processors, the second portion of the first acoustic waveform within the first audio signal to generate a second audio signal corresponding to the first portion of the first acoustic waveform; applying, by the one or more processors, the second audio signal to a machine learning (ML) model, wherein the ML model is trained using a plurality of examples, each example of the plurality of examples identifying: (i) a respective third audio signal corresponding to a portion of a respective second acoustic waveform traveling through a respective second user and (ii) an identification of whether the second user is speaking; identifying, by the one or more processors, based on applying the second audio signal to the ML model, that the first user of the speaker transducer is speaking; and providing, by the one or more processors, an output based on an identification that the first user of the speaker transducer is speaking.
2. The method of claim 1, further comprising: identifying, by the one or more processors, that a third user of a second speaker transducer is not speaking, based on applying a fourth audio signal corresponding to a portion of a third acoustic waveform traveling through the third user to the ML model; and providing, by the one or more processors, a second output based on an identification that the third user of the second speaker transducer is not speaking.
3. The method of claim 1, wherein receiving the first audio signal further comprises receiving the first audio signal corresponding to the first acoustic waveform comprising a plurality of formants of the first user, and wherein filtering the first audio signal further comprises filtering the first audio signal below a threshold frequency to pass through at least a first formant (F0) of the plurality of formants as the second audio signal.
4. The method of claim 1, wherein applying the second audio signal further comprises applying the second audio signal to the ML model to determine a likelihood that the first user of the speaker transducer is speaking, and wherein identifying further comprises identifying that the first user of the speaker transducer is speaking, responsive to the likelihood satisfying a threshold.
5. The method of claim 1, wherein identifying that the first user is speaking further comprises identifying that the first audio signal is originating from the first user on which the speaker transducer is positioned.
6. The method of claim 1, wherein filtering the second portion of the first acoustic waveform further comprises applying a filter to suppress an air channel corresponding to the second portion of the first acoustic waveform and to pass a body channel corresponding to the first portion of the first acoustic waveform.
7. The method of claim 1, wherein providing the output further comprises initiating a process to enhance a voice command corresponding to the second audio signal to invoke a function of an application.
8. A method of enhancing voice commands from audio signals acquired via speaker transducers, comprising:
identifying, by one or more processors, that a user of a speaker transducer is speaking using a first audio signal corresponding to an acoustic waveform traveling through the user; selecting, by the one or more processors, responsive to identifying that the user is speaking, a second audio signal corresponding to a voice command comprising one or more keywords for an application; generating, by the one or more processors, a third audio signal to include (i) the first audio signal and (ii) the second audio signal; and providing, by the one or more processors, to the application, the third audio signal to invoke a function of the application corresponding to the one or more keywords of the voice command.
9. The method of claim 8, further comprising: determining, by the one or more processors, that the function of the application was not successfully invoked in response to providing the third audio signal to the application; selecting, by the one or more processors, responsive to determining that the function of the application was not successfully invoked, a fourth audio signal corresponding to a second voice command comprising one or more second keywords for at least one of (i) a second function of the application or (ii) a second application; generating, by the one or more processors, a fifth audio signal to include (i) the first audio signal and (ii) the fourth audio signal; and providing, by the one or more processors, the fifth audio signal corresponding to the second one or more keywords of the voice command to invoke at least one of the second function of the application or the second application.
10. The method of claim 8, further comprising: determining, by the one or more processors, that the function of the application was successfully invoked in response to providing the third audio signal to the application; and
refraining, by the one or more processors, from selecting another audio signal for another voice command, responsive to determining that the function of the application was successfully invoked.
11. The method of claim 8, further comprising maintaining, by the one or more processors, on memory, a plurality of audio signals each corresponding to one or more respective keywords to invoke a respective function of at least one of a plurality of applications, and wherein selecting the second audio signal further comprises selecting the second audio signal from the plurality of audio signals.
12. The method of claim 8, wherein generating the third audio signal further comprises generating the third audio signal to include, in a frequency domain, a first portion corresponding to the first audio signal and a second portion corresponding to the second audio signal.
13. The method of claim 8, wherein generating the third audio signal further comprises modifying the second audio signal based on one or more characteristics of the first audio signal, wherein the one or more characteristics comprise at least one of a time length, a time point of syllables, a time point of a formant, or a distribution of energy.
14. The method of claim 8, wherein identifying that the user of the speaker transducer is speaking further comprises applying the first audio signal to a machine learning (ML) model, wherein the ML model is trained using a plurality of examples, each example of the plurality of examples identifying: (i) a respective third audio signal corresponding to a portion of a respective second acoustic waveform traveling through a respective second user and (ii) an identification of whether the second user is speaking.
15. A system for processing audio signals from earphones for interactivity with applications, comprising: one or more processors coupled with memory, configured to:
receive a first audio signal corresponding to a first acoustic waveform acquired via a speaker transducer positioned relative to an ear of a first user, the first acoustic waveform having (i) a first portion traveling through the first user and (ii) a second portion traveling outside the first user; filter the second portion of the first acoustic waveform within the first audio signal to generate a second audio signal corresponding to the first portion of the first acoustic waveform; apply the second audio signal to a machine learning (ML) model, wherein the ML model is trained using a plurality of examples, each example of the plurality of examples identifying: (i) a respective third audio signal corresponding to a portion of a respective second acoustic waveform traveling through a respective second user and (ii) an identification of whether the second user is speaking; identify, based on applying the second audio signal to the ML model, that the first user of the speaker transducer is speaking; and provide an output based on an identification that the first user of the speaker transducer is speaking.
16. The system of claim 15, wherein the one or more processors are further configured to: identify that a third user of a second speaker transducer is not speaking, based on applying a fourth audio signal corresponding to a portion of a third acoustic waveform traveling through the third user to the ML model; and provide a second output based on an identification that the third user of the second speaker transducer is not speaking.
17. The system of claim 15, wherein the one or more processors are further configured to receive the first audio signal corresponding to the first acoustic waveform comprising a plurality of formants of the first user; and filter the first audio signal below a threshold frequency to pass through at least a first formant (F0) of the plurality of formants as the second audio signal.
18. The system of claim 15, wherein the one or more processors are further configured to apply the second audio signal to the ML model to determine a likelihood that the first user of the speaker transducer is speaking; and identify that the first user of the speaker transducer is speaking, responsive to the likelihood satisfying a threshold.
19. The system of claim 15, wherein the one or more processors are further configured to apply a filter to suppress an air channel corresponding to the second portion of the first acoustic waveform and to pass a body channel corresponding to the first portion of the first acoustic waveform.
20. The system of claim 15, wherein the one or more processors are further configured to initiate a process to enhance a voice command corresponding to the second audio signal to invoke a function of an application.
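The following Python sketches illustrate, for a technical reader, the processing steps recited in claims 13-20. They are illustrative only and not part of the claims; all function names, cutoff frequencies, feature choices, and library selections are assumptions. First, claim 13's modification of the second audio signal based on characteristics of the first audio signal, shown here for one such characteristic, the distribution of energy:

```python
import numpy as np

def frame_energy(x, frame=256, hop=128):
    """Short-time energy: sum of squares per frame."""
    n = 1 + max(0, (len(x) - frame) // hop)
    return np.array([np.sum(x[i * hop:i * hop + frame] ** 2) for i in range(n)])

def modulate_by_energy(first, second, frame=256, hop=128):
    """Weight `second` by the normalized energy distribution of `first`.

    Syllable or formant time points could drive the modification in the
    same way, using an onset or pitch tracker instead of raw energy.
    """
    env = frame_energy(first, frame, hop)
    env = env / (env.max() + 1e-12)              # normalize to [0, 1]
    # Stretch the frame-rate envelope to the sample rate of `second`.
    src = np.linspace(0, len(second) - 1, num=len(env))
    gain = np.interp(np.arange(len(second)), src, env)
    return second * gain
```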
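Claim 14's training data pairs body-conducted audio with a speaking/not-speaking label. A minimal sketch of that setup, assuming simple log-energy statistics as features and scikit-learn's logistic regression as the classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(clip, frame=256):
    """Per-clip features: statistics of short-time log-energy.
    Assumes `clip` is a 1-D float array longer than one frame."""
    frames = clip[: len(clip) // frame * frame].reshape(-1, frame)
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-12)
    return [log_e.mean(), log_e.max(), log_e.std()]

def train_speaking_detector(clips, labels):
    """`clips`: body-conducted audio arrays; `labels`: 1 = speaking, 0 = not."""
    X = np.array([featurize(c) for c in clips])
    return LogisticRegression().fit(X, np.array(labels))
```

In practice the claimed ML model could be any trainable classifier; the example structure (signal plus binary label) is what the claim specifies, not the model family.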
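Claim 15 recites the end-to-end flow: receive, filter, score with the model, identify, and output. A minimal sketch, assuming a low-pass filter separates the body-conducted portion and a caller-supplied `score` function wraps the trained model:

```python
from scipy.signal import butter, filtfilt

def body_channel(audio, fs, cutoff_hz=300.0):
    """Stand-in for the claimed filtering step: attenuate the air-borne
    portion, pass the low-frequency body-conducted portion.
    The 300 Hz cutoff is an assumed value, not from the claims."""
    b, a = butter(4, cutoff_hz, btype="low", fs=fs)
    return filtfilt(b, a, audio)

def detect_speaking(audio, fs, score, threshold=0.5):
    """Filter -> model -> likelihood -> threshold -> output.
    `score` maps a filtered signal to a likelihood in [0, 1], e.g. a
    model trained as in the previous sketch."""
    second = body_channel(audio, fs)      # the "second audio signal"
    likelihood = score(second)
    speaking = likelihood >= threshold    # claim 18's threshold test
    if speaking:
        print("wearer is speaking")       # stand-in for the provided output
    return speaking
```

The `likelihood >= threshold` comparison is also the decision step of claim 18; a production system might add hysteresis or smoothing across frames, which the claims do not require.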
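Claim 17 filters below a threshold frequency so that the first formant (F0) passes through. A sketch with an assumed 250 Hz threshold, plus an autocorrelation pitch estimator to confirm that F0 survives the filter:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def pass_f0(audio, fs, threshold_hz=250.0):
    """Low-pass below the (assumed) threshold frequency."""
    b, a = butter(4, threshold_hz, btype="low", fs=fs)
    return filtfilt(b, a, audio)

def estimate_f0(audio, fs, fmin=60, fmax=250):
    """Autocorrelation F0 estimate over a plausible voice pitch range."""
    x = audio - np.mean(audio)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return fs / (lo + int(np.argmax(ac[lo:hi])))
```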
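Claim 19's air-channel suppression and body-channel pass can be viewed as a complementary band split. A sketch assuming a fixed crossover frequency; a real device might adapt it per user or per fit:

```python
from scipy.signal import butter, sosfilt

def split_channels(audio, fs, crossover_hz=300.0):
    """Complementary filters: the low band approximates the body
    channel (passed), the high band the air channel (suppressed).
    The crossover frequency is an assumption."""
    lo = butter(4, crossover_hz, btype="low", fs=fs, output="sos")
    hi = butter(4, crossover_hz, btype="high", fs=fs, output="sos")
    return sosfilt(lo, audio), sosfilt(hi, audio)
```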
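Finally, claim 20 initiates a process to enhance a voice command and invoke an application function. A sketch where the "enhancement" is crude gain normalization and `recognize` stands in for any ASR backend; both names and the command table are hypothetical:

```python
import numpy as np

def handle_command(second_audio, fs, recognize, functions):
    """Enhance the command audio, recognize it, and invoke the
    matching application function."""
    x = second_audio / (np.abs(second_audio).max() + 1e-12)  # crude enhancement
    text = recognize(x, fs)                  # hypothetical ASR call
    fn = functions.get(text.strip().lower())
    if fn is not None:
        fn()                                 # invoke the application function

# Example: handle_command(sig, 16000, my_asr, {"next track": player.next})
```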
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463561191P | 2024-03-04 | 2024-03-04 | |
| US63/561,191 | 2024-03-04 | 2024-03-04 | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2025188612A1 (en) | 2025-09-12 |
| WO2025188612A8 (en) | 2025-10-02 |
Family
ID: 96991502
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2025/018123 (WO2025188612A1, pending) | 2024-03-04 | 2025-03-03 | Processing of audio signals from earphones for interactivity with voice-activated applications |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025188612A1 (en) |
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20200153646A1 (en) * | 2017-12-13 | 2020-05-14 | Amazon Technologies, Inc. | Network conference management and arbitration via voice-capturing devices |
| US20220358926A1 (en) * | 2018-05-09 | 2022-11-10 | Staton Techiya Llc | Methods and systems for processing, storing, and publishing data collected by an in-ear device |
| US20220351729A1 (en) * | 2018-12-28 | 2022-11-03 | Ringcentral, Inc. | Systems and methods for recognizing a speech of a speaker |
| US20220345820A1 (en) * | 2019-07-30 | 2022-10-27 | Dolby Laboratories Licensing Corporation | Coordination of audio devices |
| US20210409860A1 (en) * | 2020-06-25 | 2021-12-30 | Qualcomm Incorporated | Systems, apparatus, and methods for acoustic transparency |
| US20230223042A1 (en) * | 2022-01-10 | 2023-07-13 | Synaptics Incorporated | Sensitivity mode for an audio spotting system |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2025188612A8 (en) | 2025-10-02 |
Similar Documents
| Publication | Title |
|---|---|
| JP7689644B1 | Voice Triggers for Digital Assistants |
| US10433075B2 | Low latency audio enhancement |
| US9165567B2 | Systems, methods, and apparatus for speech feature detection |
| Jin et al. | EarCommand: "Hearing" Your Silent Speech Commands In Ear |
| Gao et al. | EchoWhisper: Exploring an acoustic-based silent speech interface for smartphone users |
| US8898058B2 | Systems, methods, and apparatus for voice activity detection |
| US8589167B2 | Speaker liveness detection |
| Zhang et al. | Sensing to hear: Speech enhancement for mobile devices using acoustic signals |
| Roy et al. | Listening through a vibration motor |
| Maruri et al. | V-Speech: Noise-robust speech capturing glasses using vibration sensors |
| US11290802B1 | Voice detection using hearable devices |
| US11842736B2 | Subvocalized speech recognition and command execution by machine learning |
| CN113949955B | Noise reduction processing method, device, electronic equipment, earphone and storage medium |
| US11895474B2 | Activity detection on devices with multi-modal sensing |
| CN108922525A | Speech processing method, device, storage medium and electronic equipment |
| Chen et al. | Enabling hands-free voice assistant activation on earphones |
| Zeng et al. | mSilent: Towards general corpus silent speech recognition using COTS mmWave radar |
| CN118354237A | Wake-up method, device and equipment for MEMS earphone, and storage medium |
| Dekens et al. | Body conducted speech enhancement by equalization and signal fusion |
| Duan et al. | EarSE: Bringing robust speech enhancement to COTS headphones |
| US20150039314A1 | Speech recognition method and apparatus based on sound mapping |
| Jaroslavceva et al. | Robot Ego‐Noise Suppression with Labanotation‐Template Subtraction |
| WO2025188612A1 | Processing of audio signals from earphones for interactivity with voice-activated applications |
| Han et al. | EAROE: Enabling Body-Channel Voice Interaction Interface on Earphones via Occlusion Effect |
| Wang et al. | AccCall: Enhancing Real-time Phone Call Quality with Smartphone's Built-in Accelerometer |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 25768309; Country of ref document: EP; Kind code of ref document: A1 |