WO2025188615A1 - Detecting hand gestures in mobile acoustic fields around bone conduction headphones - Google Patents
Detecting hand gestures in mobile acoustic fields around bone conduction headphones
- Publication number
- WO2025188615A1 (PCT/US2025/018129)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- user
- gesture
- audio signal
- gestures
- bone conduction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/10—Earpieces; Attachments therefor ; Earphones; Monophonic headphones
- H04R1/1041—Mechanical or electronic switches, or control elements
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2460/00—Details of hearing devices, i.e. of ear- or headphones covered by H04R1/10 or H04R5/033 but not provided for in any of their subgroups, or of hearing aids covered by H04R25/00 but not provided for in any of its subgroups
- H04R2460/13—Hearing devices using bone conduction transducers
Definitions
- a computing device may be communicatively coupled with one or more input/output (I/O) devices to accept inputs and provide outputs.
- I/O input/output
- One or more processors may provide, via a speaker of a bone conduction headphone positioned at least partially about an ear of a user, a first probing signal in a first frequency band to produce a first acoustic field at least partially about the ear of the user.
- the one or more processors may receive, from a microphone of the bone conduction headphone, a first audio signal corresponding to an acoustic waveform through the acoustic field.
- the first audio signal may include (i) a first portion corresponding to the first frequency band including the probing signal and (ii) a second portion corresponding to a second frequency band.
- the one or more processors may filter the first audio signal to pass the first portion corresponding to the first probing signal in the first frequency band to generate a second audio signal.
- the one or more processors may apply the second audio signal to a machine learning (ML) model.
- the ML model may be trained using a plurality of examples. Each of the plurality of examples may identify (i) a respective third audio signal corresponding to a respective second probing signal for a second acoustic field at least partially about a respective second user and (ii) an identification of whether a respective hand gesture is performed by the respective second user.
- the one or more processors may detect, based on applying the second audio signal to the ML model, a hand gesture performed by the user.
- the one or more processors may generate an output based on a detection of the hand gesture performed by the user.
- the one or more processors may detect a lack of any hand gesture performed by the user, based on applying a fourth audio signal corresponding to the first probing signal producing the first acoustic field in the first frequency band to the ML model.
- the one or more processors may generate a second output based on a detection of the lack of any hand gesture performed by the user.
- the one or more processors may provide the output to an application to invoke at least one of a plurality of functions corresponding to the hand gesture detected as performed by the user of the bone conduction headphone.
- the application may control the plurality of functions associated with operations of the bone conduction headphone.
- the plurality of functions may include at least one of (i) a volume control, (ii) a playback, or (iii) muting sound.
- the application may provide the plurality of functions associated with a virtual reality (VR) headset.
- the plurality of functions may include at least one of (i) a control of virtual object presented via the VR headset or (ii) a communication with another user through the VR headset.
- the application may present an alert for the user to prevent contact with a face of the user to suppress virus transmission.
- the one or more processors may suppress noise within the first frequency band of the first audio signal to boost a relative amplitude of the first probe signal.
- the ML model may be trained using the plurality of examples each identifying one of a plurality of hand gestures performed by the respective second user.
- the one or more processors may identify, from the plurality of hand gestures, the hand gesture performed by the user.
- the one or more processors may provide the first probing signal to radiate about the ear of the user from the speaker. The first acoustic field produced may respond to hand gestures performed by the user within an effective range from the speaker.
- the first frequency band can include an inaudible frequency range between 18 kHz and 22 kHz.
- the second frequency band can include an audible frequency range between 20 Hz and 20 kHz.
- the first acoustic field may have an effective range between 2 cm and 10 cm from the speaker.
- FIG. 1 An illustration of Mobile Acoustic Field (MAF).
- MAF Mobile Acoustic Field
- A An illustration of Mobile Acoustic Field (MAF).
- MAF is based on a key observation - when audio is transmitted from the bone conduction earphone, it may not only propagate along the surface of the human face but also dissipate into the air, creating an acoustic field that envelops the individual’s head. This acoustic field empowers the mobile user to define their own on-face and over-the-face hand gestures for human-computer interactions.
- B Three potential applications of MAF, out of many.
- (Left) software-defined headphones that allow users to play and stop music with in-air gestures.
- (Middle) enhanced gaming experience - the military game can automatically recognize user gestures in the air with MAF.
- FIGs. 2A-C Illustration of surface acoustic wave (SAW) and leaky surface acoustic wave (LSAW);
- SAW surface acoustic wave
- LSAW leaky surface acoustic wave
- B Touching the cheek twice results in two peaks in the SAW signals;
- C Approaching the face leads to weak yet observable variations in LSAW signals.
- FIGs. 3A-C (A): Sealing the earphone with a Plasticine. (B): The signal wave produced by an approaching gesture without the plasticine sealing. (C): The signal wave produced by an approaching gesture with the plasticine sealing.
- FIGs. 4A-C Detecting approaching gestures in different volume levels or distance settings.
- A experiment setups;
- B the gesture detection success rate in different volume settings (dB SPL); and
- C the gesture detection success rate in different distance settings (cm).
- FIG. 5 The effective coverage area of SAW signals and LSAW signals.
- the detection success rate of touching (top) and approaching (bottom) gestures grows with the darkness of the heatmap.
- FIGs. 6A-D The set of histograms illustrates the success rates of gesture recognition across different facial regions (left, middle, and right) under various conditions. Histogram (A) focuses on the face-touching gesture under different hydration conditions, while histogram (B) showcases the face-approaching gesture under the same hydration conditions. Histograms (C) and (D) evaluate the detection success rate of the face-touching gesture and the face-approaching gesture, respectively, under varied motion conditions.
- the symbols A, B, and C denote the stationary state, walking state, and jogging state, respectively.
- FIG. 7. Gauging the initial feedback from 22 participants on the MAF system.
- FIGs. 8A-D (A): the raw spectrogram of on-face and over-the-face gestures. (B): the spectrogram after applying narrow bandpass and bandstop filters. (C): the spectrogram after the signal enhancement. (D): the two signal segments after applying KL divergence-based signal segmentation.
- FIG. 9 Model structure of MAF, which consists of a 5-layer CNN model, a bidirectional LSTM layer, and a classic MLP classifier layer.
- FIG. 11 Training loss curve and validation loss curve over different training epochs.
- FIG. 12 The average gesture recognition accuracy across 22 participants.
- FIGs. 13A and 13B Examine the gesture recognition accuracy across ten different gestures.
- A the confusion matrix that displays the classification accuracy of each type of gesture.
- B the feature distribution of ten different gestures.
- FIGs. 14A and 14B (A): The gesture recognition accuracy across four age groups. (B): The gesture recognition accuracy across different genders.
- FIG. 15 Feedback from 22 users was gathered through a Likert scale questionnaire after using MAF.
- FIGs. 16A-E Examine the gesture recognition accuracy in different environments and human factor settings.
- A the gesture recognition accuracy in different ambient noise environment settings.
- B the gesture recognition accuracy in the presence of human speech.
- C the gesture recognition accuracy in different levels of human activities.
- D the gesture recognition accuracy in different skin hydration settings.
- E the gesture recognition accuracy in the absence and presence of music playback.
- FIG. 17 depicts a block diagram of a system for detecting hand gestures in mobile acoustic fields around bone conduction headphones in accordance with an illustrative embodiment.
- FIG. 18 depicts a block diagram of a process to acquire audio signals in the system for detecting hand gestures in accordance with an illustrative embodiment.
- FIG. 19 depicts a block diagram of a process to identify hand gestures in acoustic fields in the system for detecting hand gestures in accordance with an illustrative embodiment.
- FIG. 20 depicts a flow diagram of a method of detecting hand gestures in mobile acoustic fields around bone conduction headphones in accordance with an illustrative embodiment.
- FIG. 21 is a block diagram of a computing environment according to an example implementation of the present disclosure.
- MAF is a novel acoustic sensing approach that leverages commodity bone conduction earphones for hand-to-face gesture interactions. Briefly, when audio signals are played through bone conduction headphones, it is observed that these signals not only propagate along the surface of the human face but also dissipate into the air, creating an acoustic field that envelops the individual’s head. Benchmark studies were conducted to understand how various hand-to-face gestures and human factors influence this acoustic field. Building on the insights gained from these initial studies, a lightweight deep neural network combined with signal preprocessing techniques is proposed.
- Hand-to-face gestures are a natural and intuitive way to control devices or interfaces. They improve the user experience across a wide spectrum of applications from virtual reality to smart home devices. At present, most hand-to-face gesture detection systems rely on dedicated sensing technologies like IMUs and capacitive sensors to identify gestures that involve direct contact with the user’s face.
- FIG. 2A illustrates this principle.
- user gestures performed on or in the vicinity of the face can perturb the channel of SAW or LSAW signals, enabling mobile users to detect these gestures by analyzing variations in the signal received by the microphone.
- Compared with existing hand-to-face gesture detection and recognition approaches, MAF offers several distinct advantages. Firstly, its wearable nature guarantees that users can move freely without any inconvenience or hindrance, always interacting with the device seamlessly. Secondly, MAF does not depend on specialized sensors or require any modifications to standard bone conduction earphones. This means that mobile users can effortlessly enjoy hand-to-face gesture interactions without any additional equipment or alterations. Notably, this mobile acoustic field proves particularly advantageous in the context of the pandemic. For instance, it can effectively aid in evaluating the risks associated with face-touching gestures, thereby serving as a preventive barrier against the entry of bacteria into sensitive areas such as the mouth, nose, and eyes.
- a lightweight signal-processing pipeline is built for detecting and recognizing hand-to-face gestures to showcase the potential of the mobile acoustic field.
- a set of 12 hand gestures was initially created, comprising six performed on the face and six in close proximity to the face (i.e., over the face). From this set, a selection of ten gestures was curated, consisting of four on-face gestures and six over-the-face gestures, based on the preferences of 22 participants.
- Section 2 discusses other approaches in this domain.
- Section 3 starts with a detailed introduction to SAW and LSAW, briefly touches on the feasibility of the MAF systems, and then delves into their real-world application scenarios.
- Section 4 examines the practical possibility of integrating SAW and LSAW signals, evaluating them from three different perspectives.
- Section 5 describes the development and validation of signal processing and machine learning frameworks for MAF.
- Section 6 assesses user perceptions of the MAF system and examines the performance benchmarks of the MAF system in a variety of challenging situations.
- Section 7 discusses the system limitations and potential improvement. Section 8 concludes.
- sensor-based solutions are reviewed for hand-to-face gesture interactions, with a particular focus on acoustic-based and earable-based solutions that are closely related to the design.
- mmWave radar has also been employed to detect face contact. For instance, one approach demonstrated the use of microwave radar systems for hand gesture recognition. Another approach employed sonar-inspired techniques to measure the distance between the hand and the face and trigger an alarm if the user approaches too closely. Besides, there may be various types of wearable sensors for hand-to-face gesture recognition.
- FaceTouch utilized a vibration sensor placed on the wrist or finger to track hand movements towards the facial region, and the followups have also leveraged accelerometer and inertial measurement unit (IMU) sensors on commodity smartphones to detect facial touch.
- IMU inertial measurement unit
- FaceOri leverages an ultrasonic chirp to track head position and orientation on earphones.
- EchoSpeech leverages the inaudible sound emitted from an eye-wear device to detect and further recognize silent speech.
- SoundWave employs inaudible band tones generated by the PC speaker to detect gestures in the surrounding space, exploiting the concept that various gestures induce distinct frequency shifts due to Doppler effects.
- Sonicoperator devised a recursive neural network and implemented it on mobile devices to recognize mid-air human gestures. Additionally, Dolphin, Strata, and AudioGest have also explored similar techniques for recognizing human gestures.
- CAT employs frequency-modulated continuous wave (FMCW) signals to estimate the relative displacement between smartphone speakers and microphones, subsequently integrating Doppler shift data obtained through FMCW with IMU measurements to enhance gesture tracking precision.
- FMCW frequency-modulated continuous wave
- FingerIO has embraced orthogonal frequency-division multiplexing (OFDM) modulation technology to monitor subtle finger movements in the proximity of the phone.
- OFDM orthogonal frequency-division multiplexing
- Earable computing is a rapidly growing research field, with an increasing amount of attention given to technologies surrounding ear-based or headset applications for acoustic sensing.
- the possible ways of interacting around the ear are as follows.
- the majority of desired ear-based interactive gestures involve mid-air hand interactions.
- HeadFi transforms everyday headphones into smart devices, making earable sensing easily accessible.
- HeadFi requires additional hardware support and only allows for conventional interaction with the existing headset.
- FreeDigiter integrated proximity sensors into earbuds enabling near-ear noncontact input from finger gestures.
- FaceSense designed an earbud with impedance sensing and thermal sensing for gesture recognition.
- Earbuddy leverages the feed-forward microphone on ANC earphones to detect the sound of touching gestures in the facial and ear areas for gesture recognition. However, it cannot detect non-contact, over-the-face gestures.
- SonicASL adopted a speaker and front microphone combo to recognize the sign language gestures of deaf individuals.
- EarEcho leverages in-ear microphones to identify different users based on unique ear canal structures.
- Another approach used facial muscle movements to alter the transfer function of the user’s ear canal for facial expression recognition.
- Brainy Hand employs a mini-projector and color camera within an earbud to relay input feedback to the user’s ear.
- PrivateTalk employed audio signals reaching the left and right ears to interpret the user’s intent to interact.
- VAHF voice-accompanying hand-to-face
- an active probing-based approach explores the surface acoustic waves and leaky surface acoustic waves produced by commodity bone conduction earphones for both on-face and over-the-face gesture recognition, without requiring any dedicated sensors.
- This approach leverages the natural properties of bone-conduction technology to provide a more immersive and interactive user experience.
- MAF relies solely on a pair of off-the-shelf bone conduction earphones, eliminating the need for specialized sensors, such as piezoelectric sensors.
- MAF possesses the capability to monitor both on-face and over-the-face gestures without any hardware modifications to the earphones. Consequently, it holds significant potential to enhance a wide range of facial gesture applications.
- SAWs are a type of mechanical wave that propagates along the interface between a solid material and its adjacent medium, exhibiting longitudinal and vertical shear components along the surface. Furthermore, these surface acoustic waves also disperse into the air as they travel through the user’s facial region, creating another type of signal known as leaky surface acoustic waves (LSAW).
- LSAW leaky surface acoustic waves
- Both SAW and LSAW waves persist as long as the mobile user continues to play audio, offering opportunities for interactions in close proximity to the user’s face.
- the combination of these two signals generates an acoustic field that envelops the user’s head, as shown in FIG. 2A. It is termed the Mobile Acoustic Field because it is produced by the headphones and moves with the user.
- the earbud is sealed with Plasticine, as shown in FIG. 3A.
- FIGs. 3B and 3C show the signal wave produced by an approaching gesture in the presence and absence of Plasticine sealing, respectively. A clear signal pattern was observed when the earphone was sealed, which demonstrates that the LSAW signal is due to the mobile acoustic field, not the motion of the rear part of the earpiece.
- the mobile acoustic field can be leveraged to detect and recognize different types of gestures that are performed both in contact with the human face and in the vicinity of the human face (i.e., over the face), without requiring any dedicated sensors.
- a volunteer is invited to wear a pair of bone conduction earphones.
- the earphone emits a single tone at the ultrasound frequency band.
- this single-tone probing signal produces surface acoustic waves and leaky surface acoustic waves that propagate along the human head and get picked up by a microphone attached to the bottom part of the left face.
- the volunteer is first asked to gently touch her cheek.
- the VR applications allow users to manipulate and control virtual objects and perform various actions, without the need for physical controllers. Moreover, it opens possibilities for enhanced social interactions, particularly in scenarios depicted in the middle figure in FIG. 1 (marked “(b)”), like the VR team-based shooter game Larcenauts. Users can communicate non-verbally through their over-the-face hand gestures, fostering a more engaging shooting game experience.
- the left speaker of the headphone emits a single-tone probing signal at the ultrasound frequency of 18 kHz.
- participant B proceeds to perform approaching gestures by moving her palm closer to participant A’s face, with 1 cm spacing in-between.
- the involvement of two participants ensures the absence of body motion artifacts from participants that could interfere with gesture detection. It also guarantees consistent alignment of the palm each time they approach the face.
- FIG. 4C shows the success rate of gesture detection (referred to as the success rate in the figure) across the left-, middle-, and right-facial regions of the human face. Consistently high success rates (>95%) are observed in the left-facial region across all five volume settings. However, in the middle-facial and right-facial regions, approaching gestures are difficult to detect at low sound volumes (43 dB SPL).
- FIG. 4B illustrates that the approaching gesture can consistently be detected when the hand is within 5 cm of participant A’s left face.
- the success rate declines to below 80% as the spacing between the hand and the left face increases.
- the success rate drops below 50%.
- the success rate decreases significantly in both the middle-facial and right- facial regions compared to the left-facial region, primarily due to the greater distance from the signal source (headphone speaker).
- the success rate in the middle-facial and right-facial regions decreases to 15%, representing a 60% reduction compared to the left-facial region.
- a targeting distance of 5 cm is set in order to minimize the false positives.
- the heatmap is used to represent the gesture detection success rate. As shown in FIG. 5, touching gestures generally achieve a broader coverage area, with a significantly higher success rate, for the following two reasons. Firstly, the touching gestures involve body contact that generates a new signal while simultaneously influencing the propagation path between the original audio source and the capturing microphone. This dual effect of touching face gestures substantially modifies the received signal detected by the microphone, leading to a notable increase in amplitude.
- the touching face gestures occur at a closer distance to the microphone sensor than approaching face gestures.
- the experimental design follows the setup used in the facial hydration state experiment, wherein the left channel of a bone conduction earphone transmitted an 18 kHz ultrasonic wave with an energy level of 45 dB SPL.
- the effective detection distance for approaching gestures is also maintained at 5 cm ± 1 cm.
- FIGs. 6A and 6B show the success rate of on-face touching gestures and over-the-face approaching gestures in dehydrated and hydrated facial states, respectively.
- the state of dehydration is characterized by the participant’s facial skin appearing oily and taut. Conversely, with hydration, the participant’s skin condition is smooth and delicate.
- the success rate of both touching gestures and approaching gestures remains consistently high in the two facial states, with slight differences when the user performs gestures in the right-facial regions. This shows that facial skin conditions do not remarkably impact the sensitivity of the microphone or the headset’s audio output, thus not leading to a significant alteration in the signal path.
- FIGs. 6C and 6D show the success rate of on-face and over-the-face gesture detection at different regions and in different motion state settings, respectively. It is observed that in the stationary state, the success rate of both on-face and over-the-face gestures is consistent with the heatmap results shown in FIG. 5. This essentially reveals that the resting participant using their own hands did not significantly affect the MAF’s performance. Under the two motion states at different speeds, a decrease in the success rate is observed compared to the stationary state.
- the success rate of the right-facial gesture drops from an original 85% to 25% during walking and 20% while jogging as shown in FIG. 6C. This decrease can be attributed to a shift in the headphone’s speaker caused by touch and body movement, leading to a corresponding shift in the transmission ultrasound signal.
- the approaching face gesture in FIG. 6D also shows a downward trend.
- the left-facial gestures consistently exhibit a robust success rate (>95%), thereby assuring good gesture detection under any tested motion state. This result shows that the left-facial region can be relied upon for over-the-face interactions under all tested motion states.
- the Probing Signal leverages the acoustic signal emitted from the bone conduction earphones to generate the mobile acoustic field.
- although music signals can produce both surface acoustic waves and leaky surface acoustic waves, their frequency and amplitude both change abruptly over time, introducing variations to both the SAW and LSAW. It is thus challenging to disentangle the signal variation caused by human gestures from the raw received signal.
- a probing signal on the ultrasound band is proactively sent out to produce stable surface acoustic waves and leaky surface acoustic waves.
- the probing signal works on the ultrasound band for three key reasons. Firstly, it allows mobile users to perform gesture control while listening to music without the two interfering with each other. Secondly, it is imperceptible to human beings and thus may not negatively affect the user experience. Thirdly, compared to audible band signals, ultrasound at a higher frequency band attenuates more rapidly and thus is less prone to false alarms triggered by other users nearby. Moreover, it suffers less from ambient noise since most environmental noise is below 18 kHz.
- the frequency response of three different pairs of bone conduction earphones was measured, and the central frequency of the probing signal was empirically set to 18 kHz.
- the user is free to use a higher frequency within the range of 18 kHz to 22 kHz to transmit probing signals if they can hear the probing signal at a lower frequency.
- a single tone is chosen instead of the chirp signal (FMCW) as the probing signal for two reasons. Firstly, it was found that the frequency response of most earphone speaker transducers in the ultrasound band varies significantly. This implies that the power of a chirp signal is not uniform across the frequency band. Given that the power fluctuation of the received signal is a crucial feature of the gesture recognition model, the inconsistency in chirp signal power has the potential to impact the performance of the model. Secondly, it was found that sending continuous chirps can lead to audible noises. This is because continuous chirp signals can trigger impulsive responses in the system, leading to the generation of transient signals that manifest as audible noise.
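- The single-tone probe and the power feature mentioned above can be illustrated with a minimal sketch. It assumes the 18 kHz tone and 48,000 Hz sampling rate quoted elsewhere in the text; the function names and the use of the Goertzel recurrence to read the tone level are illustrative choices, not the method stated in the source.

```python
import math

SAMPLE_RATE = 48_000  # Hz, sampling rate quoted in the text
F_PROBE = 18_000      # Hz, probing frequency quoted in the text


def make_probe(duration_s, amplitude=0.5):
    """Return a single-tone probing signal as a list of float samples."""
    n = int(duration_s * SAMPLE_RATE)
    step = 2.0 * math.pi * F_PROBE / SAMPLE_RATE
    return [amplitude * math.sin(step * i) for i in range(n)]


def tone_amplitude(samples, freq_hz):
    """Estimate the amplitude of one frequency with the Goertzel recurrence.

    The estimate is exact only when freq_hz completes a whole number of
    cycles within the sample window.
    """
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / SAMPLE_RATE)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    power = s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2
    # |X(k)| equals A * N / 2 for a sine of amplitude A, so rescale.
    return 2.0 * math.sqrt(max(power, 0.0)) / len(samples)


if __name__ == "__main__":
    probe = make_probe(0.01)  # 10 ms of the 18 kHz tone
    print(tone_amplitude(probe, F_PROBE))
```

Tracking this amplitude over consecutive short windows gives one simple view of the power fluctuation that a gesture would introduce; a chirp would spread its energy over many such frequencies instead.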
- FIGs. 8A-D show the proposed signal pre-processing pipeline.
- the raw signal received by the microphone first passes through a series of filters to extract the gesture-induced signals from the noise receptions. These processed signals are then fed into a signal enhancement module to improve their SNR.
- Step One Filtering.
- the received signal is first fed into a Butterworth bandpass filter with cutoff frequencies of $f_{prob} \pm 50$ Hz in order to remove the out-of-band noises.
- $f_{prob}$ is the frequency of the probing signal on the ultrasound band.
- the filtered signal is passed through a Butterworth band-stop filter with the central frequency of $f_{prob}$. This allows for removal of the probing signal from the receptions while preserving the frequency variation caused by hand-to-face gestures, thereby enhancing its SNR.
- FIG. 8B shows the received signal after passing the filters. Evidently, the signal variation due to facial gestures becomes more prominent after the filtering step.
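- The narrow band-pass stage of Step One can be sketched as follows. This is a stand-in, not the Butterworth design named in the text: it uses a second-order (biquad) band-pass with unity gain at the centre frequency, written in pure Python so that it runs without dependencies. The band-stop stage is not shown. The 18 kHz centre, 100 Hz bandwidth (that is, $f_{prob} \pm 50$ Hz), and 48,000 Hz sampling rate come from the text; all function names are hypothetical. In practice a library Butterworth design such as `scipy.signal.butter` would be the natural choice.

```python
import math


def bandpass_biquad(fs, f0, bandwidth):
    """Coefficients of a second-order band-pass with unity gain at f0."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * (f0 / bandwidth))  # Q = f0 / bandwidth
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def apply_filter(b, a, samples):
    """Run the direct-form I difference equation over the samples."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def tone(fs, freq, n):
    """Unit-amplitude sine of n samples, used here as a test input."""
    return [math.sin(2.0 * math.pi * freq * i / fs) for i in range(n)]


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


if __name__ == "__main__":
    fs, f_prob = 48_000, 18_000
    b, a = bandpass_biquad(fs, f_prob, 100.0)
    kept = apply_filter(b, a, tone(fs, f_prob, 9600))
    rejected = apply_filter(b, a, tone(fs, 1_000, 9600))
    # Skip the first half so the filter's start-up transient has decayed.
    print(rms(kept[4800:]), rms(rejected[4800:]))
```

A filter this narrow rings for several milliseconds after start-up, which is why the example measures the level only after the transient has decayed.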
- Step Two Signal Enhancement. To attain precise segmentations in MAF, it is essential to mitigate the effects of the in-band noise artifacts as well.
- One significant contributor to these artifacts is the probe signal, which generates multipath components as it traverses different channels on the face, such as bones and fats, subsequently affecting the accurate detection of SAW and LSAW when the gesture commences (as illustrated in FIG. 8B).
- BPF band-pass filter
- the initial step in this process entails collecting a brief segment of noise samples, typically lasting 0.3 seconds, prior to initiating the filtering. This step facilitates the analysis of the noise’s frequency characteristics, thereby assisting in the accurate determination of the filter’s parameters. Subsequently, these parameters are employed to filter out successive time frames sharing identical frequency characteristics.
- the Wiener filter predominantly gathers the time frames that encompass the multipath occurring between the speaker and microphone pair when the probing signal initiates. As depicted in FIG. 8C, the application of the Wiener filter substantially reduces these multipaths when a gesture commences, thereby enhancing the discernibility of each gesture’s initiation point and duration.
- the received audio wave is divided into a sequence of audio segments, and the segments containing human gestures are fed into the classifier for gesture recognition.
- an intuitive solution is to apply a predefined threshold to the audio wave to detect the energy variations caused by human gestures.
- this method is not scalable as it does not consider the fluctuations in signal energies resulting from diverse human behaviors, such as varying user strengths during gesture execution.
- a KL divergence-based method is employed to detect the presence of a gesture within each segment. Specifically, given two consecutive audio segments, the energy probability distribution of these two segments is computed, denoted as P and Q, respectively.
- $D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log\left(\frac{P(i)}{Q(i)}\right)$.
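A minimal sketch of the KL divergence test between two consecutive segments is shown here. How the energy probability distribution is built is not specified in the text, so this sketch assumes each segment is split into `n_bins` sub-frames whose energies are normalized to sum to one. The function names, `n_bins`, `eps`, and the decision `threshold` are hypothetical.

```python
import math
from typing import List, Sequence


def energy_distribution(segment: Sequence[float], n_bins: int = 8,
                        eps: float = 1e-12) -> List[float]:
    """Split a segment into n_bins sub-frames and normalize their energies."""
    size = max(1, len(segment) // n_bins)
    energies = [sum(x * x for x in segment[i * size:(i + 1) * size]) + eps
                for i in range(n_bins)]
    total = sum(energies)
    return [e / total for e in energies]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D(P || Q) = sum_i P(i) * log(P(i) / Q(i))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def gesture_present(prev_seg: Sequence[float], cur_seg: Sequence[float],
                    threshold: float = 0.1) -> bool:
    """Flag a gesture when the energy distribution shifts between segments."""
    p = energy_distribution(prev_seg)
    q = energy_distribution(cur_seg)
    return kl_divergence(p, q) > threshold


# Toy signals: a steady tone-like segment and one with an energy burst.
steady = [1.0, -1.0] * 400
burst = [x * 5 if 300 <= i < 500 else x for i, x in enumerate(steady)]
```

Because the distributions are normalized, a uniformly louder or quieter segment yields the same distribution, which is one way such a test can tolerate differences in gesture strength between users.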
- FIG. 8D shows the on-face and over-the-face gestures after segmentation.
- the aim was to differentiate the specified gestures within MAF.
- a data-driven framework is introduced to identify these on-face and over-the-face gestures in MAF.
- the overall frame consists of two parts: feature extraction, and model training.
- MAF processes audio data for gesture classification using a short-time Fourier transform (STFT) spectrogram directly.
- the 2D STFT spectrogram is considered to generally provide richer information on the feature representations and has better temporal and frequency localization properties than a one- dimension waveform in the time-domain, making it a unique fit to classify the nonstationary human gestures.
- the STFT spectrogram is applied directly in MAF. The reason is that the nonlinear Mel-scale compresses the fine-grained spectral structure that is often less important to speech recognition but critically important to gesture detection on the high-frequency band (>18 kHz).
- the frame length of the spectrogram input is set to 2048 samples, corresponding to about 43 ms (2048/48,000 s) at the sampling rate of 48,000 Hz.
- the hop length is set to 1024 samples. Accordingly, the frequency resolution is around 23 Hz per frequency bin.
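The arithmetic behind these STFT settings can be checked with a few lines. The constants come from the text; the variable names and the lookup of the bin holding the 18 kHz probe tone are illustrative additions.

```python
SAMPLE_RATE = 48_000   # Hz
N_FFT = 2048           # frame length in samples
HOP = 1024             # hop length in samples

frame_ms = 1000 * N_FFT / SAMPLE_RATE   # duration of one analysis frame
hop_ms = 1000 * HOP / SAMPLE_RATE       # time step between frames
bin_hz = SAMPLE_RATE / N_FFT            # width of one frequency bin
n_bins = N_FFT // 2 + 1                 # bins of a one-sided spectrum
probe_bin = round(18_000 / bin_hz)      # bin index of the 18 kHz probe

print(f"frame {frame_ms:.1f} ms, hop {hop_ms:.1f} ms, "
      f"bin width {bin_hz:.2f} Hz, probe in bin {probe_bin} of {n_bins}")
```

A 2048-sample frame at 48 kHz spans about 42.7 ms, consecutive frames overlap by half, and each bin covers about 23.4 Hz.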
- MAF adopts a hybrid neural network architecture to enhance the classification performance, as depicted in FIG. 9. Specifically, a combination of Convolutional Neural Networks (CNNs) and a Recurrent Neural Network (RNN) layer is employed to facilitate superior feature extraction before inputting the spectrum feature representations into the multilayer perceptron (MLP) for classification.
- the architecture encompasses five CNN encoder layers, a bi-directional LSTM layer, and a classic multilayer perceptron (MLP) structure. Each CNN layer is configured with a 2D convolution, a batch norm, a ReLU function, and a dropout regularization.
- the stride is set to 2.
- the kernel size of the initial two convolution layers is set to 7x7. This choice ensures that the receptive field is large enough to encapsulate a complete gesture component within the spectrogram, thus enhancing the feature extraction efficacy.
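To see how the 7x7 kernels and the stride of 2 enlarge the receptive field, the standard recurrence for stacked convolutions can be applied. This is a rough sketch only: it assumes that every layer uses stride 2, and the 3x3 kernels used for the last three layers are hypothetical because the text specifies only the first two.

```python
from typing import Iterable, Tuple


def receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Receptive field (along one axis) of stacked (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k-1)*jump
        jump *= stride              # stride multiplies the spacing of outputs
    return rf


first_two = [(7, 2), (7, 2)]             # from the text
full_stack = first_two + [(3, 2)] * 3    # last three kernels are assumed

rf_two = receptive_field(first_two)
rf_full = receptive_field(full_stack)
print(rf_two, rf_full)
```

Under these assumptions, one output cell after the first two layers already covers 19 spectrogram cells along each axis.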
- the high-dimensional features extracted are forwarded to the LSTM layer, which enhances the temporal connections between individual time frames.
- This LSTM layer acts as a bridge, conveying the refined feature set to the MLP.
- the MLP processes the features received from the LSTM and outputs the prediction results.
- the cross-entropy loss function is utilized.
6. Evaluation
- FIG. 10 illustrates these 12 gestures. Subsequently, participants were invited to rate each gesture, indicating their personal preferences. The objective was to assess whether the gesture set being crafted aligned with user preferences and intuitive behavior.
- a supervisor connected the earphones to a laptop and emitted an 18 kHz frequency signal at a consistent volume (45 dB SPL), ensuring that the ultrasound signal remained inaudible to the participants.
- the microphone recordings from the earphones were captured using a MATLAB program. Simultaneously, the supervisor recorded video footage of the participants performing gestures to establish the ground truth. Each participant was instructed to execute each type of gesture 20 times, resulting in a total time commitment of approximately 15 minutes. In total, 4,400 gesture recordings were collected.
- the CRNN model is implemented in PyTorch and trained on an NVIDIA A100 GPU for 150 epochs, using a batch size of 32.
- the Adam optimizer is employed with a learning rate set to 0.001.
- an early stopping mechanism during the training phase is applied.
- a leave-one-out approach, specifically 5-fold cross-validation, is employed to evaluate the CRNN model.
- the data from 22 participants is divided into 5 groups, with each group containing data from 4 or 5 participants.
- the combined data from the remaining groups are used as the training set. This ensures that each group has the opportunity to serve as an independent test set, and also accounts for potential correlations between participants’ data, providing a more comprehensive assessment of the model’s generalization capabilities.
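The participant-level split can be sketched as below. How participants were assigned to groups is not described, so the round-robin assignment, the participant identifiers, and the function names are hypothetical. The point illustrated is that no participant contributes to both the training set and the test set of a fold.

```python
from typing import Iterator, List, Tuple


def split_into_groups(participants: List[str],
                      n_groups: int = 5) -> List[List[str]]:
    """Deal participants round-robin so group sizes differ by at most one."""
    return [participants[i::n_groups] for i in range(n_groups)]


def folds(groups: List[List[str]]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, test) participant lists, one fold per held-out group."""
    for i, test in enumerate(groups):
        train = [p for j, g in enumerate(groups) if j != i for p in g]
        yield train, test


participants = [f"P{i:02d}" for i in range(1, 23)]   # 22 placeholder IDs
groups = split_into_groups(participants)

for train, test in folds(groups):
    print(f"train on {len(train)} participants, test on {len(test)}")
```

With 22 participants and 5 groups this gives two groups of 5 and three groups of 4, matching the group sizes stated above.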
- Prevent Overfitting. The following actions were taken to prevent model overfitting. Firstly, leave-one-out cross-validation was adopted to ensure that testing is performed on unseen data (i.e., collected from other users). Secondly, the model adopted L2 regularization to penalize large weights, which helps prevent the model from fitting the training data too closely. Thirdly, early stopping was implemented and dropout layers were introduced. The training and validation loss curves both trend downwards consistently. FIG. 11 shows the training and testing loss curves. The training loss decreases as training examples are added and gradually flattens. The validation loss likewise decreases as training examples are added and gradually flattens.
- the features of the “fist open and close” gesture (g) overlap with the features of the “fist single click” gesture (/). This confusion stems from the similarity in the foundational movements involved in both gestures, as the gesture (g) encompasses elements commonly found in other fist-related actions, thereby increasing the complexity of accurate classification.
- FIG. 14B shows the accuracy of gesture recognition for both on-face and over-the-face gestures. Across both genders, high performance was observed consistently for gestures made away from the face (over-the-face), averaging 98%. However, when it comes to on-face gestures, a disparity was noted, with females achieving a recognition accuracy of 88% and males achieving 84%.
- the impact of body motions on gesture recognition accuracy is evaluated. Specifically, a participant was invited to perform ten gestures in a walking state and in a jogging state. Each gesture was repeated 20 times. The same pretrained model was used to recognize gestures collected in these two states.
- FIG. 16C shows the results. Under different body movements, the ten gestures can still be recognized, but the median recognition accuracy drops below 90% for walking and is lower still for jogging. This also reaffirms the results of Section 4.2, indicating that the motion state affects the gesture recognition rate. The reason is that the movement of the head and the tightness of the headphone fit affect the formation of the MAF. However, the recognition accuracy of above 80% is still acceptable, showcasing significant accuracy despite the challenges presented by physical movements. Future improvement of MAF might focus on implementing advanced algorithms capable of compensating for the disturbances generated by these physical activities.
- the SAW and LSAW waves only show up when the user wears a pair of bone conduction earphones. It is suspected that insufficient skin contact or an insufficiently sized contact surface of on-ear, over-ear, and in-ear headphones is the primary cause. Additionally, on-ear and over-ear headphones are usually equipped with soft earcups or ear pads that can absorb acoustic energy, further diminishing the generation of these waves.
- MAF mobile acoustic field
- the system 100 can include at least one computing device 105 and at least one bone conduction headphone 110 (sometimes herein referred to as a bone conduction earphone or headset), among others.
- the computing device 105 can include at least one field monitoring service 115 and at least one application 120.
- the field monitoring service 115 can include at least one probe generator 125, at least one audio processor 130, at least one gesture detector 135, and at least one machine learning (ML) model 140, among others.
- the bone conduction headphone 110 can include at least one speaker 145 and at least one microphone 150, among others.
- the computing device 105 and the bone conduction headphone 110 can be associated with at least one user 155.
- Each of the components of system 100 (e.g., the computing device 105) may be implemented using hardware or a combination of hardware and software, such as those of system 514 as detailed herein in conjunction with FIG. 21.
- Each of the components in the system 100 may implement or execute the functionalities detailed herein, such as those detailed herein in Sections 1-8.
- the computing device 105 can be any computing device comprising one or more processors coupled with memory and software and capable of performing the various processes and tasks described herein.
- the computing device 105 can be operated or associated with the user 155.
- the computing device 105 can be a smartphone, other mobile phone, tablet computer, wearable computing device (e.g., smart watch, eyeglasses), or laptop computer, among others.
- the computing device 105 can be in communication with the bone conduction headphone 110, among other devices (e.g., via wireless communications or wired communications).
- the computing device 105 can be in communication with other devices, such as remote servers, computing devices, or other hardware devices, among others.
- the field monitoring service 115 can process, manage, or otherwise handle exchange of data from the bone conduction headphone 110 and the application 120.
- the probe generator 125 can generate a probe signal to be emitted via the speaker 145 of the bone conduction headphone 110 to form an acoustic field.
- the audio processor 130 can receive and process audio signals from the microphone 150 of the bone conduction headphone 110.
- the gesture detector 135 can apply the ML model 140 on the processed audio signals from the audio processor 130 to detect and identify any hand gestures within the acoustic field.
- the ML model 140 can be used to identify hand gestures within the acoustic field about the head of the user 155 or the bone conduction headphone 110.
- the ML model 140 may include, for example, a deep learning artificial neural network (ANN), a Naive Bayesian classifier, a relevance vector machine (RVM), a support vector machine (SVM), a regression model (e.g., linear or logistic regression), a clustering model (e.g., k-NN clustering or density-based clustering), or a decision tree (e.g., a random tree forest), among others.
- the field monitoring service 115 can be executed on the computing device 105 (e.g., as depicted).
- the field monitoring service 115 can be a process or application separate from the application 120.
- the field monitoring service 115 can be part of the application 120 running on the computing device 105.
- the functionalities ascribed to the field monitoring service 115 can be executed by the application 120.
- the field monitoring service 115 can be executed on a device separate from the computing device 105.
- the field monitoring service 115 can be executed on one or more processors and memory of an external device (e.g., a dongle or other portable device) that is in communication with the computing device 105.
- the application 120 can include any software program executing on the computing device 105.
- the application 120 can be any type of application to interact with the user 155 via input (e.g., mapped from hand gestures by the field monitoring service 115) and output (e.g., replies to user commands or queries).
- the application 120 can interface (e.g., via an application programming interface (API)) with the field monitoring service 115.
- the application 120 can be any type of software program, such as a word processor, a spreadsheet program, a presentation program, an electronic mail client, a web browser, a graphic design application, a video editor, a project management application, a database management software, a messenger, or a multimedia player, among others.
- the application 120 can have any number of functions to be invoked via user input.
- the application 120 can be associated with the various operations of the bone conduction headphone 110.
- the application 120 can include various functions related to the bone conduction headphone 110, such as volume control, playback, and mute, among others.
- the application 120 can be associated with a virtual reality (VR) headset.
- the application 120 can control various functions of the VR headset from user input, such as a control of a virtual object presented through the headset or communication with another user, among others.
- the application 120 can be used to control or suppress disease or virus transmission.
- the application 120 can present an alert for the user 155 to prevent contact with a face of the user 155, among others.
- the bone conduction headphone 110 (sometimes herein referred to as a bone conduction earphone) can be an audio device to emit sound through bones on the face of the user 155 to bypass at least a portion of the ear (e.g., middle ear) of the user 155.
- the bone conduction headphone 110 can be communicatively coupled with the computing device 105 or another device executing the field monitoring service 115 (e.g., via wired or wireless communications).
- the bone conduction headphone 110 can be arranged, situated, or otherwise positioned relative to a corresponding ear of the user 155. For example, the bone conduction headphone 110 can be fitted about the back region of the ear of the user 155.
- there can be a pair of bone conduction headphones 110 on the user 155.
- one bone conduction headphone 110 can be positioned on the left ear of the user 155 and another bone conduction headphone 110 can be positioned on the right ear of the user 155.
- the system 100 can include any number of bone conduction headphones 110 on the user 155.
- the speaker 145 of the bone conduction headphone 110 can produce, emit, or otherwise radiate acoustic sound waves against the face of the user 155.
- the speaker 145 can transform or convert electrical signals (also referred herein as audio signals) from the computing device 105 into the acoustic sound wave.
- the speaker 145 can be any type of transducer, such as a loudspeaker, a piezoelectric transducer, or an electromagnetic transducer, among others.
- the speaker 145 can be situated, positioned, or otherwise disposed on the face of the user 155. For instance, the speaker 145 can be fitted or secured against the cheekbone, jawbone, or temporal bone of the user 155 by the remaining portion of the bone conduction headphone 110.
- the microphone 150 of the bone conduction headphone 110 can also produce, output, or otherwise generate electrical signals from acoustic soundwaves.
- the acoustic soundwaves can be traveling through the air about the user 155 and can reach the microphone 150.
- the microphone 150 can transform or convert acoustic soundwaves arriving at the microphone 150 into electrical signals.
- the microphone 150 can be any type of transducer to receive acoustic waves from the air around the user 155, such as a dynamic microphone, a condenser microphone, a ribbon microphone, a carbon microphone, a piezoelectric microphone, a microelectromechanical system (MEMS) microphone, a Lavalier microphone, or a pressure zone microphone, among others.
- the microphone 150 can be any type of transducer to receive acoustic waves or vibrations about the face of the user 155, such as a piezoelectric transducer or an electromagnetic transducer, among others.
- FIG. 18 depicts a block diagram of a process 200 to acquire audio signals in the system 100 for detecting hand gestures.
- the process 200 can correspond to or include operations performed in the system 100 to form an acoustic field about the head of the user 155 and detect hand gestures performed by the user 155 affecting the acoustic field.
- the probe generator 125 of the field monitoring service 115 can transmit, send, or otherwise provide at least one probing signal 205 to the speaker 145 of the bone conduction headphone 110.
- the probing signal 205 may be used to form, create, or otherwise produce at least one acoustic field 210 about the ear of the user 155, the head of the user 155, or about the bone conduction headphone 110 on the user 155.
- the probe generator 125 can produce, create, or generate the probing signal 205 to be within a specified frequency range.
- the probing signal 205 may be an electrical (e.g., digitized or quantized) representation of an acoustic waveform (or vibration) within a specified frequency band to be radiated from the speaker 145.
- the frequency range may correspond to a range of frequencies inaudible to humans, such as between 18 kHz-40 kHz.
- the probing signal 205 can have a center frequency at which the amplitude is maximum. The center frequency can reside within the frequency band of 18 kHz-40 kHz, for example, at 18 kHz.
- the probing signal 205 can be sampled at any sampling rate, for instance, ranging from 8-200 kHz.
- the probing signal 205 can be of any duration in time, ranging between 2 ms to 1 minute, and can be played repeatedly any number of times.
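A probing signal of this kind can be sketched as a digitized sine tone. The 18 kHz frequency and 48 kHz sampling rate follow values mentioned in the text. The function name, the amplitude, and the 10 ms duration are hypothetical choices for illustration.

```python
import math
from typing import List


def make_probe(freq_hz: float = 18_000.0, sample_rate: int = 48_000,
               duration_s: float = 0.01, amplitude: float = 0.5) -> List[float]:
    """Return a sampled sine tone to be sent to the speaker."""
    if freq_hz >= sample_rate / 2:
        # The tone must stay below the Nyquist frequency to be representable.
        raise ValueError("probe frequency must be below half the sampling rate")
    n = int(round(duration_s * sample_rate))
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


probe = make_probe()   # 10 ms of an 18 kHz tone at 48 kHz
```

The Nyquist check matters at the low end of the sampling range given above: an 18 kHz tone needs a sampling rate above 36 kHz.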
- the probe generator 125 can transmit, provide, or otherwise send the probing signal 205 to the bone conduction headphone 110.
- the speaker 145 of the bone conduction headphone 110 can create, form, or otherwise produce the acoustic field 210.
- the speaker 145 can convert or transform the probing signal 205 into the acoustic waveform and can output, emit, or otherwise radiate the acoustic waveform (or vibrations) to produce the acoustic field 210.
- the acoustic field 210 can at least partially envelop, encase, or otherwise surround the head of the user 155 (e.g., as depicted), the ear of the user 155, or the bone conduction headphone 110 on the user 155, among others.
- the speaker 145 can also output, radiate, or otherwise produce other acoustic waveforms using audio signals from other sources. For instance, the speaker 145 can produce sound using audio signals of audiovisual content played on a multimedia application running on the computing device 105.
- the acoustic field 210 can have an effective range between 2-10 cm about the surface of the head, the surface of the outer ear, or the speaker 145 of the bone conduction headphone 110.
- the acoustic field 210 can reside within the frequency range of the probing signal 205.
- the acoustic field 210 can have a frequency range between 18 kHz-40 kHz that is also inaudible to humans.
- the acoustic field 210 can be affected by or can respond to at least one hand gesture 215 performed by the user 155 within the effective range of the acoustic field 210.
- the hand gesture 215 can include or can correspond to a movement of at least one hand or at least one finger by the user 155.
- the hand gesture 215 can correspond to a combination of movement of the hand or fingers of the user 155 to signal a command to invoke a function of the application 120.
- the audio processor 130 of the field monitoring service 115 can retrieve, obtain, or otherwise receive at least one audio signal 220 from the microphone 150 of the bone conduction headphone 110. While the probing signal 205 is provided to the speaker 145 to produce the acoustic field 210, the audio processor 130 can continuously listen or monitor the audio signal 220.
- the audio signal 220 can be generated by the microphone 150 by transforming or converting acoustic waveforms reaching the microphone 150 and can be provided to the field monitoring service 115 (e.g., via wired or wireless communications).
- the audio signal 220 can correspond to at least one acoustic waveform through or from the acoustic field 210 reaching the microphone 150.
- the audio signal 220 can correspond to the acoustic waveform altered or produced in response to at least one hand gesture 215 within the acoustic field 210.
- the audio signal 220 may be of any duration in time, such as between 2 ms to 1 minute.
- the audio processor 130 can listen to or monitor for the audio signal 220 over a sliding time window.
- the time window may range between 2 ms to 1 minute, with a sliding interval that can be a fraction of the time window (e.g., 0.5 ms to 15 seconds).
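Monitoring over a sliding time window can be sketched with a small generator. The window and step lengths below (1 s and 0.25 s at 48 kHz) are illustrative values within the stated ranges, and the function name is hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def sliding_windows(samples: Sequence[float], window: int,
                    step: int) -> Iterator[Tuple[int, Sequence[float]]]:
    """Yield (start_index, window_of_samples) for each full window."""
    for start in range(0, len(samples) - window + 1, step):
        yield start, samples[start:start + window]


SAMPLE_RATE = 48_000
window_len = 1 * SAMPLE_RATE    # 1 s window
step_len = SAMPLE_RATE // 4     # slide by a quarter of the window

# Tiny demonstration on ten samples with a window of 4 and a step of 2.
starts: List[int] = [s for s, _ in sliding_windows(list(range(10)), 4, 2)]
```

Overlapping windows mean that a gesture straddling one window boundary still falls entirely inside a neighboring window.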
- the audio processor 130 can process or filter at least a portion of the audio signal 220.
- the audio signal 220 can include at least one probe portion 225A and at least one non-probe portion 225B.
- the probe portion 225A can correspond to a portion of the acoustic waveform associated with a frequency band in which the probing signal 205 resides.
- the probe portion 225A for example, can correspond to an inaudible frequency range for humans, such as between 18 kHz-40 kHz.
- the non-probe portion 225B can correspond to a remaining portion of the acoustic waveform.
- the non-probe portion 225B can correspond to a portion of the acoustic waveform associated with a frequency band exclusive of the probing signal 205.
- the non-probe portion 225B can correspond to an audible frequency range for humans, such as between 20 Hz to 18 kHz or 20 Hz to 20 kHz.
- the audio processor 130 can pass the probe portion 225A to produce, output, or otherwise generate at least one audio signal 220’.
- the audio signal 220’ can include at least the probe portion 225A corresponding to the frequency band which the probing signal 205 is in.
- the audio processor 130 can apply at least one filter.
- the filter can include, for example, a low-pass filter (LPF), a band-pass filter (BPF), a band-stop filter (BSF), or a high-pass filter (HPF), among others, or any combination thereof.
- the filter can be implemented using any type of architecture, such as a resistor-capacitor (RC) filter, a resistor-inductor (RL) filter, an RLC filter, an active filter, a Butterworth filter, a Chebyshev filter, or a Bessel filter, among others.
- the audio processor 130 can apply a BPF to pass through the probe portion 225A of the audio signal 220 to output the audio signal 220’.
- the BPF can have a cutoff frequency relative to the frequency of the probing signal 205 used to generate the acoustic field 210.
- the cutoff frequency for the BPF can be 25-50 Hz about the frequency of the probing signal 205.
- the audio signal 220’ can include a subsection of the originally acquired audio signal 220, with frequency components focused about the center frequency of the probing signal 205 (e.g., about 25-50 Hz of the center).
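One way to pass only the band around the probing frequency is a second-order digital band-pass section. The sketch below is not taken from the disclosure: it swaps in the widely used "Audio EQ Cookbook" biquad design with a nominal 100 Hz bandwidth (about 50 Hz on either side of 18 kHz), and the function names are hypothetical. The realized bandwidth is only approximately the nominal value.

```python
import math
from typing import List, Sequence, Tuple

Coeffs = Tuple[float, float, float]


def bandpass_coefficients(center_hz: float, bandwidth_hz: float,
                          sample_rate: float) -> Tuple[Coeffs, Coeffs]:
    """Biquad band-pass (unity gain at the center), normalized so a[0] = 1."""
    w0 = 2 * math.pi * center_hz / sample_rate
    q = center_hz / bandwidth_hz
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (1.0, -2 * math.cos(w0) / a0, (1 - alpha) / a0)
    return b, a


def apply_biquad(x: Sequence[float], b: Coeffs, a: Coeffs) -> List[float]:
    """Run the direct-form difference equation over the input samples."""
    y: List[float] = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


fs = 48_000
b, a = bandpass_coefficients(18_000, 100, fs)


def tone(freq: float, n: int = 9600) -> List[float]:
    return [math.sin(2 * math.pi * freq * i / fs) for i in range(n)]


def tail_rms(y: Sequence[float], n: int = 2400) -> float:
    tail = y[-n:]
    return math.sqrt(sum(v * v for v in tail) / len(tail))


probe_rms = tail_rms(apply_biquad(tone(18_000), b, a))    # in-band tone
audible_rms = tail_rms(apply_biquad(tone(1_000), b, a))   # out-of-band tone
```

After the filter settles, the 18 kHz tone passes at close to full level while the 1 kHz tone is strongly attenuated.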
- the audio processor 130 can process or filter the audio signal 220 to remove, attenuate, or otherwise suppress noise at least within the probe portion 225A.
- the audio processor 130 can apply a BSF to the audio signal 220 to identify a portion of the audio signal 220 outside the probing signal 205.
- the BSF can have a cutoff frequency relative to the frequency of the probing signal 205 used to generate the acoustic field 210.
- the cutoff frequency for the BSF can be 25-50 Hz about the frequency of the probing signal 205.
- the audio processor 130 can subtract the portion outside the probing signal 205 from the overall audio signal 220 (or the audio signal 220’) to suppress the noise. By suppressing the noise, the audio processor 130 can increase, amplify, or otherwise boost a relative amplitude of the probing signal 205 within the audio signal 220’ in comparison to the remaining portions of the audio signal 220.
- FIG. 19 depicts a block diagram of a process 300 to identify hand gestures in acoustic fields in the system 100 for detecting hand gestures.
- the process 300 can correspond to or include operations performed in the system 100 to detect and classify hand gestures performed by the user in the acoustic field.
- the gesture detector 135 of the field monitoring service 115 can process or apply the audio signal 220’ to the ML model 140.
- the ML model 140 can be implemented using any model architecture and can have at least one input corresponding to the filtered audio signal 220’, at least one output classifying a type of hand gesture 215 performed by the user 155, and a set of weights relating the input to the output, among others.
- the ML model 140 can be a lightweight model architecture, with minimal resource consumption requirements that a portable or mobile device (e.g., the computing device 105 or external hardware device) can satisfy.
- the gesture detector 135 can feed or input the audio signal 220’ into the ML model 140.
- the gesture detector 135 can process the input audio signal 220’ in accordance with the set of weights of the ML model 140.
- the ML model 140 may have been initialized, trained, or established (e.g., by the field monitoring service 115 or another computing device) using a training dataset.
- the ML model 140 can be trained in accordance with any learning techniques, such as supervised learning, unsupervised learning, Q-learning or weakly supervised learning, among others.
- the training dataset can include a set of examples. Each example can include or identify a sample audio signal (e.g., similar to the audio signal 220’) corresponding to a probe signal in an acoustic field at least partially about an ear, a head, or a speaker on a bone conduction headphone of another user.
- Each example can also include or identify a label indicating whether (e.g., a presence or absence) a hand gesture is performed by the user associated with the sample audio signal.
- each example can also include or identify a label identifying which type of hand gesture is being performed by the user.
- the type of hand gesture of the label can include, for instance: no gesture; one finger tapping cheek; two fingers hovering over ear; three fingers touching forehead; at least one finger touching mouth, ear, or nose; a motion of hand spinning about the bone conduction headphone; or a fist forming near the bone conduction headphone, among others.
- the sample audio signal may be of any duration in time, such as between 2 ms to 1 minute.
- the sample audio signal 220’ can be applied to the ML model 140 to produce or generate an output indicating whether the hand gesture is performed by the user.
- the application of the sample audio signal 220’ to the ML model 140 can produce or generate an output indicating which type of hand gesture is performed.
- the output of the ML model 140 can be compared with the corresponding indication identified by the label. Based on the comparison, a loss metric can be calculated in accordance with a loss function (e.g., a hinge loss, a mean squared error (MSE), a mean absolute error (MAE), a cross-entropy loss, a Huber loss, or a log loss).
- the loss metric can be used to modify or update one or more of the set of weights of the ML model 140.
- the updating of the weights of the ML model 140 may be in accordance with an optimization function (e.g., stochastic gradient descent with a predefined learning rate). This process can be iteratively repeated until the ML model 140 reaches a convergence condition to stop or cease the training process.
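The compare, compute loss, and update cycle described above can be illustrated with a toy example. As a simplifying assumption, the "weights" here are just a vector of class scores (logits) that is updated directly. A real model would backpropagate the same gradient through its layers. The function names and the learning rate are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], label: int) -> float:
    """Negative log-probability assigned to the labeled class."""
    return -math.log(softmax(logits)[label])


def sgd_step(logits: List[float], label: int, lr: float = 0.1) -> List[float]:
    """One gradient step; d(loss)/d(logit_k) = softmax_k - one_hot_k."""
    probs = softmax(logits)
    return [z - lr * (p - (1.0 if k == label else 0.0))
            for k, (z, p) in enumerate(zip(logits, probs))]


scores = [0.2, 1.0, -0.5]     # toy scores for three gesture classes
label = 0                     # index of the labeled gesture type
loss_before = cross_entropy(scores, label)
scores_after = sgd_step(scores, label)
loss_after = cross_entropy(scores_after, label)
```

Repeating the step drives the loss down until a stopping condition, such as a loss change below a tolerance, is met.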
- the training of the ML model 140 can be performed on another computing system, separate from the computing device 105, and then loaded on the computing device 105 (e.g., when the field monitoring service 115 is installed thereon).
- the gesture detector 135 can identify, determine, or otherwise detect a presence (or occurrence) or an absence (or lack) of the hand gesture 215 performed by the user 155. From the application of the ML model 140, the gesture detector 135 can produce, output, or otherwise generate at least one gesture classification 305 for the input audio signal 220’.
- the gesture classification 305 can indicate or identify whether the hand gesture 215 is being performed by the user 155.
- the gesture detector 135 can detect or identify a type of hand gesture 215 performed by the user 155 based on the application of the ML model 140 to the audio signal 220’. In identifying, the gesture detector 135 can generate the gesture classification 305 to indicate or identify a type of the hand gesture 215 performed by the user 155.
- the gesture detector 135 can calculate, generate, or determine a likelihood of a presence (or absence) of the hand gesture 215 performed by user 155.
- the likelihood can identify or indicate a degree of probability that the user 155 is performing the hand gesture 215.
- the gesture detector 135 can determine the likelihood for each type of hand gesture 215. From applying the ML model 140 to the audio signal 220’, the gesture detector 135 can produce, output, or determine the likelihood. With the determination of the likelihood, the gesture detector 135 can compare the likelihood with a threshold.
- the threshold can delineate, define, or otherwise identify a value (e.g., 80-95%) for the likelihood at which to identify the presence (or absence) of the hand gesture.
- the threshold can identify a value for the likelihood at which to identify a type of hand gesture. If the likelihood satisfies (e.g., greater than or equal to) the threshold, the gesture detector 135 can detect the presence of the hand gesture 215. In some embodiments, the gesture detector 135 can identify the type of the hand gesture 215 based on the likelihood satisfying the threshold for the type of hand gesture. Conversely, if the likelihood does not satisfy (e.g., less than) the threshold, the gesture detector 135 can identify the absence of the hand gesture 215.
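The threshold comparison can be sketched as follows. The gesture labels, the `no_gesture` class name, and the 0.9 threshold (inside the 80-95% range mentioned above) are hypothetical.

```python
from typing import Dict, Optional


def classify(likelihoods: Dict[str, float],
             threshold: float = 0.9) -> Optional[str]:
    """Return the detected gesture type, or None when no gesture is detected."""
    if not likelihoods:
        return None
    gesture, score = max(likelihoods.items(), key=lambda kv: kv[1])
    # A gesture is reported only if its likelihood satisfies the threshold.
    if gesture != "no_gesture" and score >= threshold:
        return gesture
    return None


confident = classify({"no_gesture": 0.03, "cheek_tap": 0.95, "fist": 0.02})
uncertain = classify({"no_gesture": 0.30, "cheek_tap": 0.60, "fist": 0.10})
```

Returning `None` for the uncertain case corresponds to identifying the absence of a hand gesture.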
- the gesture detector 135 can generate or provide at least one output 310 based on the gesture classification 305.
- the output 310 can include or identify information based on part of detecting the presence or the absence of the hand gesture 215 performed by the user 155.
- the gesture detector 135 can generate the output 310 to indicate the detection of the presence of the hand gesture 215.
- the gesture detector 135 can generate the output 310 to indicate the detection of the absence of the hand gesture 215.
- the gesture detector 135 can generate the output 310 to identify the type of hand gesture 215.
- Other information can be included, such as a timestamp of detection and a user identifier corresponding to the user 155.
- the gesture detector 135 can identify or determine a command 315 to invoke a corresponding function of the application 120 based on the gesture classification 305.
- the gesture detector 135 can transform or convert the gesture classification 305 to a command 315 to invoke a corresponding function of the application 120.
- the gesture detector 135 can use a list specifying a mapping between the type of hand gesture 215 and the corresponding command 315 to invoke the function for a given application 120.
- the list can include a set of mappings for a corresponding set of applications 120. For example, for a web browser application, the list can specify that a hand gesture 215 corresponding to two fingers waved forward corresponds to an increase in zoom in a web page presented in the web browser application. For an interactive panorama of a street within a map application, the list can identify that the same hand gesture 215 corresponding to two fingers waved forward corresponds to a go forward function from the position along the street depicted in the interactive panorama.
- the gesture detector 135 can identify the command 315 to be invoked based on the type of hand gesture 215 detected as performed by the user 155.
- the gesture detector 135 can identify or select the mapping based on the application 120. For instance, the gesture detector 135 can select the list of mappings between the types of hand gestures 215 for a given application 120, based on identifying the application 120 in focus (e.g., as a foreground process) on the computing device 105. From the list, the gesture detector 135 can identify the mapping for the type of hand gesture 215 detected with the corresponding command 315. With the identification, the gesture detector 135 can generate the output 310 to include or identify the command 315.
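The per-application mapping from gesture type to command can be sketched as a nested lookup table. The application keys, gesture labels, command strings, and the `Output` structure are hypothetical and loosely follow the web browser and street panorama examples above.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# One mapping list per application: gesture type -> command.
COMMAND_MAP: Dict[str, Dict[str, str]] = {
    "web_browser": {"two_fingers_forward": "zoom_in"},
    "street_panorama": {"two_fingers_forward": "go_forward"},
}


@dataclass
class Output:
    gesture: str                    # the gesture classification
    command: Optional[str] = None   # omitted when no mapping applies


def build_output(gesture: str, foreground_app: str) -> Output:
    """Select the mapping for the application in focus and look up the command."""
    command = COMMAND_MAP.get(foreground_app, {}).get(gesture)
    return Output(gesture=gesture, command=command)


in_browser = build_output("two_fingers_forward", "web_browser")
in_map = build_output("two_fingers_forward", "street_panorama")
unmapped = build_output("two_fingers_forward", "spreadsheet")
```

The same gesture resolves to different commands depending on which application is in focus, and an unmapped application yields an output that carries only the classification.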
- the output 310 can omit or lack the command 315.
- the gesture detector 135 can send, convey, or provide the output 310 with the command 315 to the application 120.
- the application 120 executing on the computing device 105 can process or parse the output 310 to extract or identify the gesture classification 305. Based on the gesture classification 305, the application 120 can identify or determine the corresponding command 315. The determination of the command 315 can be independent of the conversion of the gesture classification 305 by the gesture detector 135.
- the application 120 can transform or convert the gesture classification 305 to a command 315 to invoke a corresponding function of the application 120.
- the application 120 can use a list specifying a mapping between the type of hand gesture 215 and the corresponding command 315 to invoke the function. In some embodiments, the application 120 can process or parse the output 310 to extract or identify the command 315 from the output 310.
- the application 120 can invoke the corresponding function.
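One way to picture the two paths above (the output 310 either carries the command 315 or omits it) is an application that prefers a supplied command and otherwise falls back to its own mapping. The `GestureOutput` fields, the `Application` class, and all gesture and command names below are assumptions made for illustration, not structures defined in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class GestureOutput:
    """Hypothetical shape of the output 310."""
    gesture_type: str
    timestamp: float
    user_id: str
    command: Optional[str] = None  # may be omitted by the gesture detector


class Application:
    """Hypothetical application that maps commands to callable functions."""

    def __init__(self, local_mapping: Dict[str, str],
                 functions: Dict[str, Callable[[], str]]):
        self.local_mapping = local_mapping
        self.functions = functions

    def handle(self, output: GestureOutput) -> Optional[str]:
        # Prefer the command supplied by the detector, else map locally.
        command = output.command or self.local_mapping.get(output.gesture_type)
        function = self.functions.get(command) if command else None
        # Invoke the function only when a known command was resolved.
        return function() if function else None


app = Application(
    local_mapping={"rising_hand": "volume_up"},
    functions={"volume_up": lambda: "volume increased"},
)
mapped_locally = app.handle(GestureOutput("rising_hand", 0.0, "user-155"))
```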
- the application 120 can be a multimedia player or can be communicatively coupled with a smartphone to handle telephone calls.
- the application 120 can include various functions related to the bone conduction headphone 110, such as volume control, playback, and mute, among others.
- the application 120 can invoke the corresponding function to control the bone conduction headphone 110 based on the command 315.
- the application 120 (or the gesture detector 135) can convert the hand gesture 215 corresponding to a rising hand toward the right-side bone conduction headphone 110 to a command to increase volume on the speaker 145.
- the application 120 can invoke the function to increase volume on the speaker 145 in the bone conduction headphone 110.
- the application 120 can translate the hand gesture 215 corresponding to a lowering hand toward the right-side bone conduction headphone 110 as a command to decrease volume on the speaker 145. Based on this translation, the application 120 can invoke the function to decrease the volume on the speaker 145.
- the gesture detector 135 can also convert the hand gesture 215 corresponding to a spinning finger motion to a command to perform a playback of an audio recording.
- the application 120 can use the command to invoke and carry out the function to initiate playing back of the recording.
- the application 120 can be a health awareness program to control or suppress transmission (e.g., via contact transmission) of germs, bacteria, or viruses.
- the application 120 can include a function to present an alert for the user 155 to notify the user 155 to cease contact with the face of the user 155.
- the gesture detector 135 can determine or detect that the hand gesture 215 corresponds to at least one finger or the hand being in contact with the face (e.g., along the eyes, nose, or mouth) or head of the user 155.
- the application 120 can invoke the function to present the alert, based on the detection of the hand gesture 215 corresponding to at least one finger being in contact with the face or head of the user 155.
- the alert can be audible and played through the speaker 145 of the bone conduction headphone 110.
- the alert can be a visual element presented via a graphical user interface of the application 120 or on the computing device 105.
- the application 120 can halt presentation of the alert, when the gesture detector 135 subsequently detects no finger or hand in contact with the face or head of the user 155.
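The start and halt behavior of the alert can be sketched as a small state tracker that reacts only when the contact state changes. The `FaceTouchAlert` class and its `"start"` / `"stop"` return values are hypothetical names used for illustration; how the alert is actually rendered (audible or visual) is left out.

```python
from typing import Optional


class FaceTouchAlert:
    """Tracks whether the alert is active across successive detections."""

    def __init__(self) -> None:
        self.active = False

    def update(self, face_contact_detected: bool) -> Optional[str]:
        """Return "start", "stop", or None when nothing changes."""
        if face_contact_detected and not self.active:
            self.active = True
            return "start"  # contact newly detected: present the alert
        if not face_contact_detected and self.active:
            self.active = False
            return "stop"   # contact no longer detected: halt the alert
        return None


alert = FaceTouchAlert()
# Successive detections: no contact, contact, contact, no contact.
events = [alert.update(contact) for contact in (False, True, True, False)]
```

Reporting only transitions keeps the alert from being restarted on every detection while the hand stays on the face.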
- the application 120 can control various functions associated with a virtual reality (VR) headset, such as a control of a virtual object presented through the headset or communication with another user.
- the bone conduction headphone 110 can be part of or can be in communication with the VR headset.
- the functions for the application 120 can include, for example, a manipulation (e.g., moving, rotating, or sizing) of a virtual object presented through the headset, an interaction with a user interface element, or communication with another user, among others.
- the application 120 can invoke the function to carry out the enlargement of the virtual object.
- the application 120 can initiate establishment of a communication session with another instance of the application associated with the specified user.
- the field monitoring service 115 can continue any one or more of the processes 200 and 300 detailed herein. For instance, upon execution of the function identified in the command 315 specified in the output 310, the application 120 can return, send, or otherwise provide an indication of execution of the function corresponding to command 315.
- the probe generator 125 can in turn generate another probing signal 205 to provide via the speaker 145 of the bone conduction headphone 110 to form the acoustic field 210 about the head or face of the user 155.
- the audio processor 130 can obtain another audio signal 220 from the microphone 150 of the bone conduction headphone 110, capturing any hand gestures 215 performed by the user 155 if any.
- the audio processor 130 can filter the audio signal 220 to extract the probe portion 225A and to generate the audio signal 220’.
- the gesture detector 135 can apply the audio signal 220’ to the ML model 140 to determine the gesture classification 305 as well as the command 315.
- the command 315 may be for a function accessible upon the execution of the previous function as identified by the hand gesture 215.
- the interactivity of the application 120 can be expanded and widened beyond input/output (I/O) devices such as keyboards, touchscreens, and mice to cover hand gestures 215 performed within the acoustic field 210 formed by the bone conduction headphone 110.
- This functionality can allow the user 155 to move freely without any inconvenience or hindrance, able to access the functions of the application 120 while wearing the bone conduction headphone 110 coupled with the field monitoring service 115.
- the field monitoring service 115 can re-adapt existing bone conduction headphones 110 for the purpose of detecting hand gestures 215.
- the field monitoring service 115 can leverage the speaker 145 of the bone conduction headphone 110 to use a probing signal 205 to generate the acoustic field 210 at any time to pick up any number of different hand gestures 215.
- the field monitoring service 115 can enable new types of functionality.
- the field monitoring service 115 can aid in evaluating the risks associated with face-touching gestures, serving as a preventive barrier against the entry of bacteria into sensitive areas such as the mouth, nose, and eyes. This capability may be valuable in promoting hygienic practices and preventing virus transmission.
- the field monitoring service 115 can employ a combination of signal processing techniques and a lightweight model in the form of the ML model 140 to recognize hand gestures 215 with a high degree of accuracy, allowing for computing devices 105 (e.g., smart phones) to carry out the operations in a quick manner. Relative to other techniques that use specialized hardware and computationally complex algorithms (e.g., as detailed in Section 2), the field monitoring service 115 can use less processing power, memory, and electric power on the part of the computing device 105. The field monitoring service 115 can maintain this high degree of accuracy, even in the face of a wide range of environmental settings, including varying ambient noise levels, human speech, body motion, and skin hydration conditions, among others. This adaptability can allow the field monitoring service 115 to accurately detect the hand gestures 215 and invoke the function in the application 120 intended by the user 155 in a variety of daily life settings.
- a computing system can provide a probe signal via a speaker of a bone conduction headphone to form an acoustic field (405).
- the computing system can receive an audio signal from a microphone of the bone conduction headphone (410).
- the computing system can filter the audio signal to pass a frequency band corresponding to the probe signal (415).
- the computing system can apply the filtered audio signal to a machine learning (ML) model (420).
- the computing system can detect whether a hand gesture is present in the acoustic field (425). If there is no hand gesture, the computing system can repeat the functionalities from step (405). In contrast, if there is a hand gesture detected, the computing system can provide an output to an application (430).
10. Computer Environment
- FIG. 21 shows a block diagram of a representative computing system 514 usable to implement the present disclosure.
- the method 400 may be implemented by the computing system 514.
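The control flow of the method 400 (steps 405 through 430) can be sketched as a loop over injected callables. Every parameter name below is a hypothetical stand-in for the corresponding hardware or model interface, and `max_iterations` is added only so that the sketch terminates.

```python
from typing import Callable, Optional, Sequence

Signal = Sequence[float]


def run_detection_loop(
    emit_probe: Callable[[], None],
    read_microphone: Callable[[], Signal],
    bandpass: Callable[[Signal], Signal],
    classify: Callable[[Signal], Optional[str]],
    notify_application: Callable[[str], None],
    max_iterations: int,
) -> int:
    """Run steps 405-430 repeatedly; return how many gestures were reported."""
    reported = 0
    for _ in range(max_iterations):
        emit_probe()                      # step 405: form the acoustic field
        audio = read_microphone()         # step 410: receive the audio signal
        filtered = bandpass(audio)        # step 415: pass the probe band
        gesture = classify(filtered)      # steps 420 and 425: apply the model
        if gesture is None:
            continue                      # no gesture: repeat from step 405
        notify_application(gesture)       # step 430: provide the output
        reported += 1
    return reported


# Stub interfaces: three captured frames, only the second contains a gesture.
frames = iter([[0.0], [1.0], [0.0]])
received = []
count = run_detection_loop(
    emit_probe=lambda: None,
    read_microphone=lambda: next(frames),
    bandpass=lambda signal: signal,
    classify=lambda signal: "tap" if signal[0] > 0.5 else None,
    notify_application=received.append,
    max_iterations=3,
)
```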
- Computing system 514 can be implemented, for example, as a consumer device such as a smartphone, other mobile phone, tablet computer, wearable computing device (e.g., smart watch, eyeglasses, head mounted display), desktop computer, laptop computer, cloud computing service or implemented with distributed computing devices.
- the computing system 514 can include computer components such as processors 515, storage device 518, network interface 520, user input device 522, and user output device 524.
- Network interface 520 can provide a connection to a wide area network (e.g., the Internet) to which a WAN interface of a remote server system is also connected.
- Network interface 520 can include a wired interface (e.g., Ethernet) and/or a wireless interface implementing various RF data communication standards such as Wi-Fi, Bluetooth, or cellular data network standards (e.g., 3G, 4G, 5G, 50 GHz, LTE, etc.).
- User input device 522 can include any device (or devices) via which a user can provide signals to computing system 514; computing system 514 can interpret the signals as indicative of particular user requests or information.
- User input device 522 can include any or all of a keyboard, a controller (e.g., joystick), touch pad, touch screen, mouse or other pointing device, scroll wheel, click wheel, dial, button, switch, keypad, microphone, sensors (e.g., a motion sensor, an eye tracking sensor, etc.), and so on.
- User output device 524 can include any device via which computing system 514 can provide information to a user.
- user output device 524 can include a display to display images generated by or delivered to computing system 514.
- the display can incorporate various image generation technologies, e.g., a liquid crystal display (LCD), light-emitting diode (LED) including organic light-emitting diodes (OLED), projection system, cathode ray tube (CRT), or the like, together with supporting electronics (e.g., digital-to-analog or analog-to-digital converters, signal processors, or the like).
- a device such as a touchscreen that functions as both an input and output device can be used.
- User output devices 524 can be provided in addition to or instead of a display. Examples include indicator lights, speakers, tactile “display” devices, printers, and so on.
- Some implementations include electronic components, such as microprocessors, storage and memory that store computer program instructions in a non-transitory computer readable storage medium. Many of the features described in this specification can be implemented as processes that are specified as a set of program instructions encoded on a computer readable storage medium. When these program instructions are executed by one or more processors, they cause the processors to perform various operations indicated in the program instructions. Examples of program instructions or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter. Through suitable programming, processor 515 can provide various functionality for computing system 514, including any of the functionality described herein as being performed by a server or client, or other functionality associated with message management services.
- computing system 514 is illustrative and that variations and modifications are possible. Computer systems used in connection with the present disclosure can have other capabilities not specifically described here. Further, while computing system 514 is described with reference to particular blocks, it is to be understood that these blocks are defined for convenience of description and are not intended to imply a particular physical arrangement of component parts. For instance, different blocks can be located in the same facility, in the same server rack, or on the same motherboard. Further, the blocks need not correspond to physically distinct components. Blocks can be configured to perform various operations, e.g., by programming a processor or providing appropriate control circuitry, and various blocks might or might not be reconfigurable depending on how the initial configuration is obtained. Implementations of the present disclosure can be realized in a variety of apparatus including electronic devices implemented using any combination of circuitry and software.
- the hardware and data processing components used to implement the various processes, operations, illustrative logics, logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
- a general-purpose processor may be a microprocessor, or any conventional processor, controller, microcontroller, or state machine.
- a processor also may be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
- particular processes and methods may be performed by circuitry that is specific to a given function.
- the memory (e.g., memory, memory unit, storage device, etc.) may include one or more devices (e.g., RAM, ROM, Flash memory, hard disk storage, etc.) for storing data and/or computer code for completing or facilitating the various processes, layers, and modules described in the present disclosure.
- the memory may include volatile memory or non-volatile memory, and may include database components, object code components, script components, or any other type of information structure for supporting the various activities and information structures described in the present disclosure.
- the memory is communicably connected to the processor via a processing circuit and includes computer code for executing (e.g., by the processing circuit and/or the processor) the one or more processes described herein.
- the present disclosure contemplates methods, systems, and program products on any machine-readable media for accomplishing various operations.
- the embodiments of the present disclosure may be implemented using existing computer processors, or by a special purpose computer processor for an appropriate system, incorporated for this or another purpose, or by a hardwired system.
- Embodiments within the scope of the present disclosure include program products comprising machine-readable media for carrying or having machine-executable instructions or data structures stored thereon.
- Such machine-readable media can be any available media that can be accessed by a general purpose or special purpose computer or other machine with a processor.
- machine-readable media can comprise RAM, ROM, EPROM, EEPROM, optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code in the form of machine-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer or other machine with a processor. Combinations of the above are also included within the scope of machine-readable media.
- Machine-executable instructions include, for example, instructions and data which cause a general-purpose computer, special purpose computer, or special purpose processing machines to perform a certain function or group of functions.
- references to implementations or elements or acts of the systems and methods herein referred to in the singular can also embrace implementations including a plurality of these elements, and any references in plural to any implementation or element or act herein can also embrace implementations including only a single element.
- References in the singular or plural form are not intended to limit the presently disclosed systems or methods, their components, acts, or elements to single or plural configurations.
- References to any act or element being based on any information, act or element can include implementations where the act or element is based at least in part on any information, act, or element.
- Coupled elements can be electrically, mechanically, or physically coupled with one another directly or with intervening elements. Scope of the systems and methods described herein is thus indicated by the appended claims, rather than the foregoing description, and changes that come within the meaning and range of equivalency of the claims are embraced therein.
- The term “coupled” includes the joining of two members directly or indirectly to one another. Such joining may be stationary (e.g., permanent or fixed) or moveable (e.g., removable or releasable). Such joining may be achieved with the two members coupled directly with or to each other, with the two members coupled with each other using a separate intervening member and any additional intermediate members coupled with one another, or with the two members coupled with each other using an intervening member that is integrally formed as a single unitary body with one of the two members.
- If “coupled” or variations thereof are modified by an additional term (e.g., directly coupled), the generic definition of “coupled” provided above is modified by the plain language meaning of the additional term (e.g., “directly coupled” means the joining of two members without any separate intervening member), resulting in a narrower definition than the generic definition of “coupled” provided above.
- Such coupling may be mechanical, electrical, or fluidic.
- references to “or” can be construed as inclusive so that any terms described using “or” can indicate any of a single, more than one, and all of the described terms.
- a reference to “at least one of ‘A’ and ‘B’” can include only ‘A’, only ‘B’, as well as both ‘A’ and ‘B’.
- Such references used in conjunction with “comprising” or other open terminology can include additional items.
- elements shown as integrally formed can be constructed of multiple parts or elements, the position of elements can be reversed or otherwise varied, and the nature or number of discrete elements or positions can be altered or varied.
- Other substitutions, modifications, changes and omissions can also be made in the design, operating conditions and arrangement of the disclosed elements and operations without departing from the scope of the present disclosure.
- references herein to the positions of elements are merely used to describe the orientation of various elements in the FIGURES.
- the orientation of various elements may differ according to other exemplary embodiments, and such variations are intended to be encompassed by the present disclosure.
- a subject can be a mammal, such as a non-primate (e.g., cows, pigs, horses, cats, dogs, rats, etc.) or a primate (e.g., monkey and human).
- the term “subject,” as used herein, refers to a vertebrate, such as a mammal. Mammals include, without limitation, humans, non-human primates, wild animals, feral animals, farm animals, sport animals, and pets.
- a subject is a human.
- the term “about” is used to indicate that a value includes the standard deviation of error for the device or method being employed to determine the value.
- Ranges: throughout this disclosure, various aspects of the invention can be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 5 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 5, from 3 to 5, etc., as well as individual numbers within that range, for example, 1, 2, 2.7, 3, 4, and 5. This applies regardless of the breadth of the range.
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- General Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
Presented herein are systems and methods of detecting hand gestures in mobile acoustic fields around bone conduction headphones. A computing system may provide, via a speaker of a bone conduction headphone positioned about an ear of a user, a first probing signal in a first frequency band to produce an acoustic field. The computing system may receive, from a microphone of the bone conduction headphone, a first audio signal corresponding to an acoustic waveform through the acoustic field. The first audio signal may include (i) a first portion corresponding to the first frequency band and (ii) a second portion corresponding to a second frequency band. The computing system may filter the first audio signal to pass the first portion to generate a second audio signal. The computing system may apply the second audio signal to a model to detect a hand gesture performed by the user.
Description
DETECTING HAND GESTURES IN MOBILE ACOUSTIC FIELDS
AROUND BONE CONDUCTION HEADPHONES
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims priority to U.S. Provisional Patent Application No. 63/561,179, titled “Detecting Hand Gestures in Mobile Acoustic Fields Around Bone Conduction Headphones,” filed March 4, 2024, which is incorporated by reference in its entirety.
BACKGROUND
[0002] A computing device may be communicatively coupled with one or more input/output (I/O) devices to accept inputs and provide outputs.
SUMMARY
[0003] Aspects of the present disclosure are directed to systems and methods of detecting hand gestures in mobile acoustic fields around bone conduction headphones. One or more processors may provide, via a speaker of a bone conduction headphone positioned at least partially about an ear of a user, a first probing signal in a first frequency band to produce a first acoustic field at least partially about the ear of the user. The one or more processors may receive, from a microphone of the bone conduction headphone, a first audio signal corresponding to an acoustic waveform through the acoustic field. The first audio signal may include (i) a first portion corresponding to the first frequency band including the probing signal and (ii) a second portion corresponding to a second frequency band. The one or more processors may filter the first audio signal to pass the first portion corresponding to the first probing signal in the first frequency band to generate a second audio signal. The one or more processors may apply the second audio signal to a machine learning (ML) model. The ML model may be trained using a plurality of examples. Each of the plurality of examples may identify (i) a respective third audio signal corresponding to a respective second probing signal for a second acoustic field at least partially about a respective second user and (ii) an identification of whether a respective hand gesture is performed by the respective second user. The one or more processors may detect, based on applying the second audio signal to the ML model, a hand gesture performed by the user. The one or more processors may generate an output based on a detection of the hand gesture performed by the user.
[0004] In some embodiments, the one or more processors may detect a lack of any hand gesture performed by the user, based on applying a fourth audio signal corresponding to the first probing signal producing the first acoustic field in the first frequency band to the ML model. The one or more processors may generate a second output based on a detection of the lack of any hand gesture performed by the user. In some embodiments, the one or more processors may provide the output to an application to invoke at least one of a plurality of functions corresponding to the hand gesture detected as performed by the user of the bone conduction headphone.
[0005] In some embodiments, the application may control the plurality of functions associated with operations of the bone conduction headphone. The plurality of functions may include at least one of (i) a volume control, (ii) a playback, or (iii) muting sound. In some embodiments, the application may provide the plurality of functions associated with a virtual reality (VR) headset. The plurality of functions may include at least one of (i) a control of a virtual object presented via the VR headset or (ii) a communication with another user through the VR headset. In some embodiments, the application may present an alert for the user to prevent contact with a face of the user to suppress virus transmission.
[0006] In some embodiments, the one or more processors may suppress noise within the first frequency band of the first audio signal to boost a relative amplitude of the first probing signal. In some embodiments, the ML model may be trained using the plurality of examples each identifying one of a plurality of hand gestures performed by the respective second user. In some embodiments, the one or more processors may identify, from the plurality of hand gestures, the hand gesture performed by the user.
[0007] In some embodiments, the one or more processors may provide the first probing signal to radiate about the ear of the user from the speaker. The first acoustic field produced may respond to hand gestures performed by the user within an effective range from the speaker. In some embodiments, the first frequency band can include an inaudible frequency range between 18 kHz and 22 kHz. The second frequency band can include an audible frequency range between 20 Hz and 20 kHz. The first acoustic field may have an effective range between 2 cm and 10 cm from the speaker.
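As a rough illustration of filtering the first audio signal to pass the probe band, the sketch below applies a second-order band-pass biquad (standard bilinear-transform coefficients with unity peak gain) centered on the 18 kHz to 22 kHz band. The disclosure does not specify the filter design, so the biquad, the 48 kHz sample rate, and the helper names `bandpass_biquad`, `tone`, and `rms` are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def bandpass_biquad(samples: Sequence[float], sample_rate: float,
                    low_hz: float, high_hz: float) -> List[float]:
    """Second-order band-pass filter with approximately unity gain at band center."""
    center = math.sqrt(low_hz * high_hz)        # geometric center of the band
    q = center / (high_hz - low_hz)             # quality factor from bandwidth
    w0 = 2.0 * math.pi * center / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0            # b1 is zero for this design
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        # Direct form I difference equation.
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def tone(freq_hz: float, sample_rate: float, count: int) -> List[float]:
    """Generate a unit-amplitude sine tone."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(count)]


def rms(values: Sequence[float]) -> float:
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(v * v for v in values) / len(values))


SAMPLE_RATE = 48000.0  # assumed; must exceed twice the highest probe frequency
probe_rms = rms(bandpass_biquad(tone(20000.0, SAMPLE_RATE, 4800), SAMPLE_RATE, 18000.0, 22000.0))
audible_rms = rms(bandpass_biquad(tone(1000.0, SAMPLE_RATE, 4800), SAMPLE_RATE, 18000.0, 22000.0))
```

In this sketch a 20 kHz tone inside the probe band passes nearly unchanged, while a 1 kHz tone from the audible band is strongly attenuated. A practical implementation would likely use a steeper, higher-order filter.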
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] FIG. 1. (A): An illustration of Mobile Acoustic Field (MAF). MAF is based on a key observation - when audio is transmitted from the bone conduction earphone, it may not only propagate along the surface of the human face but also dissipate into the air, creating an acoustic field that envelops the individual’s head. This acoustic field empowers the mobile user to define their own on-face and over-the-face hand gestures for human-computer interactions. (B): Three potential applications of MAF, out of many. (Left): software-defined headphones that allow users to play and stop music with in-air gestures. (Middle): enhanced gaming experience - the military game can automatically recognize user gestures in the air with MAF. (Right): face-touching awareness and prevention.
[0009] FIGs. 2A-C. (A): Illustration of surface acoustic wave (SAW) and leaky surface acoustic wave (LSAW); (B): Touching the cheek twice results in two peaks in the SAW signals; (C): Approaching the face leads to weak yet observable variations in LSAW signals.
[0010] FIGs. 3A-C. (A): Sealing the earphone with plasticine. (B): The signal wave produced by an approaching gesture without the plasticine sealing. (C): The signal wave produced by an approaching gesture with the plasticine sealing.
[0011] FIGs. 4A-C. Detecting approaching gestures in different volume levels or distance settings. (A): experiment setups; (B): the gesture detection success rate in different volume settings (dB SPL); and (C): the gesture detection success rate in different distance settings (cm).
[0012] FIG. 5. The effective coverage area of SAW signals and LSAW signals. The detection success rate of touching (top) and approaching (bottom) gestures grows with the darkness of the heatmap.
[0013] FIGs. 6A-D. The set of histograms illustrates the success rates of gesture recognition across different facial regions (left, middle, and right) under various conditions. Histogram (A) focuses on the face-touching gesture under different hydration conditions, while histogram (B) showcases the face-approaching gesture under the same hydration conditions. Histograms (C) and (D) evaluate the detection success rate of the face-touching gesture and the face-approaching gesture, respectively, under varied motion conditions. The symbols A, B, and C denote the stationary state, walking state, and jogging state, respectively.
[0014] FIG. 7. Gauging the initial feedback from 22 participants on the MAF system.
[0015] FIGs. 8A-D. (A): the raw spectrogram of on-face and over-the-face gestures. (B): the spectrogram after applying narrow bandpass and bandstop filters. (C): the spectrogram after the signal enhancement. (D): the two signal segments after applying KL divergence-based signal segmentation.
[0016] FIG. 9. Model structure of MAF, which consists of a 5-layer CNN model, a bidirectional LSTM layer, and a classic MLP classifier layer.
[0017] FIG. 10. 12 gesture candidates for user evaluation. Among them, the first six (a, b, c, d, e, f) are on-face gestures whereas the last six (g, h, i, j, k, l) are above-face gestures. The average score indicating user preference was put at the bottom of each gesture, with a higher score indicating greater user satisfaction with that particular gesture. The eyes are shadowed for double-blind review.
[0018] FIG. 11. Training loss curve and validation loss curve over different training epochs.
[0019] FIG. 12. The average gesture recognition accuracy across 22 participants.
[0020] FIGs. 13A and 13B. Examine the gesture recognition accuracy across ten different gestures. (A): the confusion matrix that displays the classification accuracy of each type of gesture. (B): the feature distribution of ten different gestures.
[0021] FIGs. 14A and 14B. (A): The gesture recognition accuracy across four age groups. (B): The gesture recognition accuracy across different genders.
[0022] FIG. 15. Feedback from 22 users was gathered through a Likert scale questionnaire after using MAF.
[0023] FIGs. 16A-E. Examine the gesture recognition accuracy in different environments and human factor settings. (A): the gesture recognition accuracy in different ambient noise environment settings. (B): the gesture recognition accuracy in the presence of human speech. (C): the gesture recognition accuracy in different levels of human activities. (D): the gesture recognition accuracy in different skin hydration settings. (E): the gesture recognition accuracy in the absence and presence of music playback.
[0024] FIG. 17 depicts a block diagram of a system for detecting hand gestures in mobile acoustic fields around bone conduction headphones in accordance with an illustrative embodiment.
[0025] FIG. 18 depicts a block diagram of a process to acquire audio signals in the system for detecting hand gestures in accordance with an illustrative embodiment.
[0026] FIG. 19 depicts a block diagram of a process to identify hand gestures in acoustic fields in the system for detecting hand gestures in accordance with an illustrative embodiment.
[0027] FIG. 20 depicts a flow diagram of a method of detecting hand gestures in mobile acoustic fields around bone conduction headphones in accordance with an illustrative embodiment.
[0028] FIG. 21 is a block diagram of a computing environment according to an example implementation of the present disclosure.
DETAILED DESCRIPTION
[0029] Following below are more detailed descriptions of various concepts related to, and embodiments of, systems and methods for detecting hand gestures in mobile acoustic fields around bone conduction headphones. It should be appreciated that various concepts introduced above and discussed in greater detail below may be implemented in any of numerous ways, as the disclosed concepts are not limited to any particular manner of implementation. Examples of specific implementations and applications are provided primarily for illustrative purposes.
1. Introduction
[0030] Presented herein is MAF, a novel acoustic sensing approach that leverages commodity bone conduction earphones for hand-to-face gesture interactions. Briefly, when audio signals are emitted through bone conduction headphones, it is observed that these signals not only propagate along the surface of the human face but also dissipate into the air, creating an acoustic field that envelops the individual’s head. Benchmark studies were conducted to understand how various hand-to-face gestures and human factors influence this acoustic field. Building on the insights gained from these initial studies, a lightweight deep neural network combined with signal preprocessing techniques is proposed. This combination empowers MAF to effectively detect, segment, and subsequently recognize a variety of hand-to-face gestures, whether in close contact with the face or above it. The comprehensive evaluation, involving 22 participants, demonstrates that MAF achieves an average gesture recognition accuracy of 92% across ten different gestures tailored to users’ preferences.
[0031] Hand-to-face gestures are a natural and intuitive way to control devices or interfaces. They improve the user experience across a wide spectrum of applications, from virtual reality to smart home devices. At present, most hand-to-face gesture detection systems rely on dedicated sensing technologies like IMUs and capacitive sensors to identify gestures that make direct contact with the user’s face. However, these systems face limitations in capturing contactless gestures (i.e., those gestures performed over the face) because such gestures do not generate tangible signals that are detectable by the aforementioned sensors. While camera-based solutions can facilitate contactless hand gesture detection, their effectiveness is vulnerable to varying lighting conditions and potential obstructions, and they often raise privacy concerns. Similarly, radar-based solutions can be influenced by distance variations and occasional obstructions due to arm movement. In addition, they often come with significant costs and power consumption, which limits their adoption.
[0032] In view of the shortcomings and limits of the existing approaches, in this paper, a simple question is asked: “Can a system be designed that is able to detect hand-to-face gestures, whether in contact with the face or above it, using widely accessible mobile devices?” A positive response to this question would enable mobile users to experience the advantages of gesture-based interactions in their everyday activities, moving this exciting technology one significant stride closer to widespread acceptance. Furthermore, it is anticipated that such pervasive interactions, when integrated with emerging Extended Reality (XR) technologies, could offer users extraordinary and unprecedented experiences.
[0033] An affirmative answer is given by presenting the mobile acoustic field (MAF), a novel acoustic sensing approach that leverages commodity bone conduction earphones for hand-to-face gesture interactions. MAF draws inspiration from the principles of surface acoustic waves (SAW) and leaky surface acoustic waves (LSAW), which are well-studied in seismology, adapting their underlying physics to establish a sound field enveloping the user’s head. In particular, when the speaker of the bone conduction earphones is in contact with the user’s skin, the emitted acoustic signal can generate acoustic radiation around the user’s facial structure in the form of surface acoustic waves. In the meantime, part of the sound waves can dissipate into the surrounding air, forming leaky surface acoustic waves. The combination of these two signals effectively creates an acoustic field surrounding the user’s head, which is referred to as the Mobile Acoustic Field, as it is generated by widely available earphones and moves with the user. FIG. 2A illustrates this principle. As a result, user gestures performed on or in the vicinity of the face can perturb the channel of SAW or LSAW signals, enabling these gestures to be detected by analyzing variations in the signal received by the microphone.
[0034] Compared with existing hand-to-face gesture detection and recognition approaches, MAF offers several distinct advantages. Firstly, its wearable nature guarantees that users can move freely without any inconvenience or hindrance, always interacting with the device seamlessly. Secondly, MAF does not depend on specialized sensors or require any modifications to standard bone conduction earphones. This means that mobile users can effortlessly enjoy hand-to-face gesture interactions without any additional equipment or alterations. Notably, this mobile acoustic field proves particularly advantageous in the context of the pandemic. For instance, it can effectively aid in evaluating the risks associated with face-touching gestures, thereby serving as a preventive barrier against the entry of bacteria into sensitive areas such as the mouth, nose, and eyes.
[0035] To harvest the aforementioned benefits, a series of benchmark studies is first designed to systematically study the capacity and capability of the mobile acoustic field, answering fundamental questions such as “How does the sound volume affect the gesture detection accuracy?”, “What is the effective size of the mobile acoustic field?”, and “Would the spacing between the hand and the face matter?” A set of user studies is then crafted to i) examine the impact of various human factors on the mobile acoustic field; and ii) assess the social acceptance of this mobile acoustic field-based gesture interaction by interviewing 22 participants. The preliminary results are promising, and the user feedback is generally positive.
[0036] Based on these promising preliminary results, a lightweight signal-processing pipeline is built for detecting and recognizing hand-to-face gestures to showcase the potential of the mobile acoustic field. To begin, a set of 12 hand gestures was initially created, comprising six performed on the face and six in close proximity to the face (i.e., over the face). From this set, a selection of ten gestures was curated, consisting of four on-face gestures and six over-the-face gestures, based on the preferences of 22 participants. Given that over-the-face gestures tend to produce relatively weak channel distortions, a series of signal-processing techniques combined with a lightweight Convolutional Recurrent Neural Network (CRNN) model was further designed, with the spectrogram of each segmented signal as the input. These designs are effective in preprocessing, filtering, segmenting, and subsequently recognizing each type of hand-to-face gesture.
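Since the model input is described as the spectrogram of each segmented signal, a minimal sketch of how such a spectrogram could be computed is given here. The source does not state the window type, FFT size, or hop length, so `n_fft`, `hop`, the Hann window, and the function name `spectrogram` are illustrative assumptions; the 48 kHz default reflects the sampling rate mentioned elsewhere in this description.

```python
import numpy as np

def spectrogram(signal, fs=48_000, n_fft=1024, hop=256):
    """Magnitude spectrogram in dB via a short-time Fourier transform.

    n_fft, hop and the Hann window are assumed values, not taken from the source.
    Returns (freqs, spec_db) where spec_db has shape (freq_bins, time_frames).
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < n_fft:
        raise ValueError("signal is shorter than one analysis window")
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Index matrix: one row of n_fft consecutive samples per frame.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * window
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    spec_db = 20.0 * np.log10(magnitude + 1e-12)  # small offset avoids log(0)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return freqs, spec_db.T
```

A segmented gesture signal would be passed through such a function before being handed to the classifier; any filtering or enhancement steps would be applied beforehand and are not shown.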
[0037] Comprehensive field studies were conducted based on 22 volunteers. The experiment results show that MAF achieves an average gesture recognition accuracy of above 92% over these ten testing gestures. Benchmark experiments were further conducted in various environmental settings to scrutinize the influences of human speech, body movement, music playback, and skin moisture on MAF’s performance. It was observed that despite fluctuations amongst these variables, MAF could adequately accommodate the demands of most daily life settings.
[0038] The contributions presented are summarized as follows. First, a new opportunity was identified for hand-to-face gesture interaction based on a novel scheme termed the mobile acoustic field. The capacity, robustness, and user acceptance of this mobile acoustic field technology were systematically studied by designing a series of user studies. Second, the potential of such a mobile acoustic field was demonstrated for hand-to-face gesture interaction by building an end-to-end, data-driven signal processing pipeline. The proposed approach can effectively detect and further recognize ten user-selected gestures at high accuracy (>92%). Third, the entire system was implemented and its performance was evaluated with 22 participants. In addition, a comprehensive UX study was also conducted to gauge the users’ attitudes toward this new technology.
[0039] The remainder of this article unfolds as follows: Section 2 discusses other approaches in this domain. Section 3 starts with a detailed introduction to SAW and LSAW, briefly touches on the feasibility of the MAF systems, and then delves into their real-world application scenarios. Section 4 examines the practical possibility of integrating SAW and LSAW signals, evaluating them from three different perspectives. Section 5 describes the development and validation of signal processing and machine learning frameworks for MAF. Section 6 assesses user perceptions of the MAF system and examines the performance benchmarks of the MAF system in a variety of challenging situations. Section 7 discusses the system limitations and potential improvements. Section 8 concludes.
2. Other Approaches
[0040] In this section, sensor-based solutions are reviewed for hand-to-face gesture interactions, with a particular focus on acoustic-based and earable-based solutions that are closely related to the design.
2.1 Vision-, RADAR-, and IMU-based solutions
[0041] There exists a wide range of designs that employ various sensors to detect facial gestures. The most extensively researched approach is using computer vision for non-contact detection. However, such an approach usually consumes significant computational resources, which becomes a critical issue for mobile and wearable platforms. Moreover, this approach also suffers under low-light conditions and occlusions, and it raises privacy concerns. mmWave radar, on the other hand, has also been employed to detect face contact. For instance, one approach demonstrated the use of microwave radar systems for hand gesture recognition. Another approach employed sonar-inspired techniques to measure the distance between the hand and the face and trigger an alarm if the user approaches too closely. Besides, there may be various types of wearable sensors for hand-to-face gesture recognition. FaceTouch utilized a vibration sensor placed on the wrist or finger to track hand movements towards the facial region, and follow-up works have also leveraged accelerometer and inertial measurement unit (IMU) sensors on commodity smartphones to detect facial touch.
[0042] Unlike these previous studies, a speaker-microphone pair commonly found in everyday earphones is utilized for facial gesture interaction. The approach overcomes issues related to lighting conditions and ensures privacy protection. In contrast to vibration or IMU sensing, which operates at low frequencies, the system achieves greater precision in gesture recognition thanks to its capability to achieve a high sampling rate of up to 48 kHz. When compared to mmWave sensing, the solution is both more cost-effective and power-efficient.
2.2 Acoustic-based gesture tracking and recognition
[0043] There is also plenty of research on using speaker-microphone pairs for acoustic sensing.
[0044] For instance, FaceOri leverages an ultrasonic chirp to track head position and orientation on earphones. EchoSpeech leverages the inaudible sound emitted from an eye-wear device to detect and further recognize silent speech. SoundWave employs inaudible band tones generated by the PC speaker to detect gestures in the surrounding space, exploiting the concept that various gestures induce distinct frequency shifts due to Doppler effects. Sonicoperator devised a recursive neural network and implemented it on mobile devices to recognize mid-air human gestures. Additionally, Dolphin, Strata, and AudioGest have also explored similar techniques for recognizing human gestures. In another approach, CAT employs frequency-modulated continuous wave (FMCW) signals to estimate the relative displacement between smartphone speakers and microphones, subsequently integrating Doppler shift data obtained through FMCW with IMU measurements to enhance gesture tracking precision. Furthermore, FingerIO has embraced orthogonal frequency-division multiplexing (OFDM) modulation technology to monitor subtle finger movements in the proximity of the phone.
[0045] While these studies show promise, they often require explicit user involvement or are primarily effective in static settings. For instance, Sonicoperator mandates that users hold the phone and align the microphone to face forward for gesture detection, restricting its usability to relatively stationary scenarios. Similarly, SoundWave functions effectively when the user is seated in front of a computer. In contrast, the system presented herein capitalizes on the distinctive potential of the mobile acoustic field generated by readily available bone conduction earphones. This enables hand-to-face gestures in both stationary and mobile environments to be detected without necessitating explicit user intervention.
2.3 Earable-based gesture recognition
[0046] Earable computing is a rapidly growing research field, with an increasing amount of attention given to technologies surrounding ear-based or headset applications for acoustic sensing. The possible ways of interacting around the ear are as follows. The majority of desired ear-based interactive gestures involve mid-air hand interactions. HeadFi transforms everyday headphones into smart devices, making earable sensing easily accessible. However, HeadFi requires additional hardware support and only allows for conventional interaction with the existing headset. FreeDigiter integrated proximity sensors into earbuds, enabling near-ear non-contact input from finger gestures. FaceSense designed an earbud with impedance sensing and thermal sensing for gesture recognition. Earbuddy leverages the feed-forward microphone on ANC earphones to detect the sound of touching gestures in the facial and ear areas for gesture recognition. However, it cannot detect non-contact, over-face gestures.
[0047] In a different approach, SonicASL adopted a speaker and front microphone combo to recognize the sign language gestures of deaf individuals. Further, EarEcho leverages in-ear microphones to identify different users based on unique ear canal structures. Another approach used facial muscle movements to alter the transfer function of the user’s ear canal for facial expression recognition. Meanwhile, Brainy Hand employs a mini-projector and color camera within an earbud to relay input feedback to the user’s ear. PrivateTalk employed audio signals reaching the left and right ears to interpret the user’s intent to interact. Similarly, yet another approach designed voice-accompanying hand-to-face (VAHF) gestures for voice interaction.
[0048] Different from the above approaches, an active probing-based approach is proposed that explores the surface acoustic waves and leaky surface acoustic waves produced by commodity bone conduction earphones for both on-face and over-the-face gesture recognition, without requiring any dedicated sensors. This approach leverages the natural properties of bone-conduction technology to provide a more immersive and interactive user experience.
2.4 Surface acoustic waves and Leaky surface acoustic waves
[0049] This is not the first attempt to explore surface acoustic waves for ubiquitous sensing. Various types of sensors may be explored to generate surface acoustic waves. These sensors include microphones, IMU sensors, geophones, and piezoelectric devices. SAWSense explored a newly emerging sensor known as a voice pick-up unit (VPU) for on-desk gesturing. In a different vein, Leaky Surface Acoustic Waves (LSAW) have more recently found utility in collision avoidance for robotics. This concept is realized by deploying a pair of piezoelectric sensors on a robotic arm to generate LSAW, enabling the monitoring of obstacles encountered by the robotic arm.
[0050] The present approach distinguishes itself in two ways. Firstly, MAF relies solely on a pair of off-the-shelf bone conduction earphones, eliminating the need for specialized sensors, such as piezoelectric sensors. Secondly, MAF possesses the capability to monitor both on-face and over-the-face gestures without any hardware modifications to the earphones. Consequently, it holds significant potential to enhance a wide range of facial gesture applications.
3. Preliminary, Feasibility, and Potential Applications
[0051] In this section, the concept of the mobile acoustic field (§3.1) is introduced, highlighting its potential for hand-to-face gesture interaction through a feasibility study (§3.2). Subsequently, three representative mobile applications are described that can directly benefit from the mobile acoustic field (§3.3).
3.1 Preliminary
[0052] When a mobile user uses bone conduction earphones to listen to an audio clip, the electrical audio signals get transformed into mechanical waves by the diaphragm of the earphone’s speaker. Because the earphone is in direct contact with the user’s head, these mechanical waves transfer their energy into the tissues of the human head, forming Surface Acoustic Waves (SAW). SAWs are a type of mechanical wave that propagates along the interface between a solid material and its adjacent medium, exhibiting a longitudinal and a vertical shear component along the surface. Furthermore, these surface acoustic waves also disperse into the air as they travel through the user’s facial region, creating another type of signal known as Leaky Surface Acoustic Waves (LSAW).
[0053] Both SAW and LSAW waves persist as long as the mobile user continues to play audio, offering opportunities for interactions in close proximity to the user’s face. Essentially, the combination of these two signals generates an acoustic field that envelops the user’s head, as shown in FIG. 2A. It is termed the Mobile Acoustic Field as it is produced by the headphones and moves with the user. One may wonder whether these LSAW signals result from the motion of the rear part of the earpiece or from the mobile acoustic field. To answer this question, the earbud is sealed with Plasticine, as shown in FIG. 3A. FIGs. 3B and 3C show the signal wave produced by an approaching gesture in the presence and absence of Plasticine sealing, respectively. A clear signal pattern was observed when the earphone was sealed, which demonstrates that the LSAW signal is due to the mobile acoustic field, not the motion of the rear part of the earpiece.
3.2 Mobile Acoustic Field for Hand-to-Face Gesture Interaction
[0054] It is envisioned that the mobile acoustic field can be leveraged to detect and recognize different types of gestures that are performed both in contact with the human face and in the vicinity of the human face (i.e., over the face), without requiring any dedicated sensors. To validate this feasibility, a volunteer is invited to wear a pair of bone conduction earphones. The earphone emits a single tone at the ultrasound frequency band. As the bone conduction earphone is in close contact with the human face, this single-tone probing signal produces surface acoustic waves and leaky surface acoustic waves that propagate along the human head and get picked up by a microphone attached to the bottom part of the left face. The volunteer is first asked to gently touch her cheek. This physical contact not only generates a new signal but also has an impact on how surface acoustic waves propagate. Consequently, significant deviations from the original signal can clearly be observed in FIG. 2B when two cheek-touching gestures are performed. Leaky Surface Acoustic Waves (LSAWs) create an acoustic field above the face due to the energy leakage from the surface acoustic waves. As illustrated in FIG. 2C, when the volunteer moves her hand closer to or further away from her face (around 2 cm apart), a subtle change in the received signals is also detected. This occurs because the approaching hand gesture alters the way the leaky waves propagate through the air, thereby modifying the received signal.
3.3 Applications of Mobile Acoustic Field
[0055] With the all-encompassing mobile acoustic field, the human head becomes an independent space for interaction. Listed below are a few of the many potential applications of MAF.
[0056] Software-defined intelligent headphones. Intelligent headphones like Bose QC35 and Microsoft Surface headphones may use built-in accelerometers to detect touch gestures, enabling gesture control. However, these existing intelligent headphones are expensive and bulky, limiting their widespread adoption. As the left panel of Figure 1(b) shows, it is envisioned that the mobile acoustic field can enable mobile users to define personalized gestures for controlling volume, playback, and muting without the need for dedicated sensors. This approach eliminates the cost and bulk associated with intelligent headphones, offering a more accessible and customizable gesture control solution for mobile users worldwide.
[0057] Enhanced VR gaming experience. The on-face and over-the-face human gesture interaction can be leveraged to enhance the virtual reality (VR) experience. By accurately detecting and interpreting the gestures of the user’s hands, VR applications allow users to manipulate and control virtual objects and perform various actions, without the need for physical controllers. Moreover, it opens possibilities for enhanced social interactions, particularly in scenarios depicted in the middle figure in FIG. 1 (marked “(b)”), like the VR team-based shooter game Larcenauts. Users can communicate non-verbally through their over-the-face hand gestures, fostering a more engaging shooting game experience.
[0058] Face-touching awareness and prevention. The identification and monitoring of face-touching behavior are crucial in preventing virus transmission and promoting hygienic practices, particularly during the COVID-19 pandemic when the virus can spread through contaminated surfaces and close contact. However, existing solutions, such as those based on earables and smart rings, can only detect face-touching after it has already occurred, lacking the ability to proactively prevent virus transmission. In contrast, the implementation of over-the-face detection utilizing the mobile acoustic field allows for the detection of face-touching intentions by capturing hand movements approaching the face, as the right figure in FIG. 1 (marked “(b)”) shows. This capability enables a proactive approach to prevent face-touching behavior and mitigate the risk of virus transmission.
4. Mobile Acoustic Field
[0059] To gain a comprehensive understanding of the mobile acoustic field (MAF) for gesture-based human-computer interaction, a series of benchmarks and user studies is designed to effectively assess the capacity (§4.1), robustness (§4.2), and user acceptance (§4.3) of MAF in real-world scenarios. All human evaluations in this section are conducted in full accordance with the internal Institutional Review Board (IRB) protocol. The maximum signal transmission power is set to 60 dB SPL, which is 10 dB lower than the CDC’s regulation.
4.1 Capacity of MAF
[0060] Q1: How does the sound volume affect the gesture detection accuracy? The initial investigation examines whether the intensity of the acoustic field generated by leaky surface acoustic waves is sufficient for detecting gestures. Two participants, named A and B, are invited to conduct this experiment in a controlled laboratory environment. As shown in Figure 4(a), participant A wears a pair of GZCRDZ bone conduction wired headphones and sits on a chair. The headphone speakers are in close contact with the face so that the acoustic signal may not leak into the air. The microphone on the bone-conduction headphones faces toward participant A’s face but has no direct contact with the skin. This allows the microphone to detect both SAW and LSAW signals. The left speaker of the headphone emits a single-tone probing signal at the ultrasound frequency of 18 kHz. At the beginning of the experiment, participant B proceeds to perform approaching gestures by moving her palm closer to participant A’s face, with 1 cm spacing in-between. The involvement of two participants ensures the absence of body motion artifacts that could interfere with gesture detection. It also guarantees consistent alignment of the palm each time it approaches the face.
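To make the probing setup concrete, the following is a minimal sketch of generating an 18 kHz single-tone signal and tracking the received amplitude at that frequency over time. It assumes a 48 kHz sampling rate (the rate mentioned earlier in this description) and a 10 ms analysis frame; the frame length, the playback amplitude, and the function names `probing_tone` and `tone_amplitude` are illustrative assumptions rather than details from the source.

```python
import numpy as np

FS = 48_000  # sampling rate in Hz (assumed)
F0 = 18_000  # probing tone frequency in Hz, as stated in the text

def probing_tone(duration_s, fs=FS, f0=F0, amplitude=0.5):
    """Single-tone probing signal; the amplitude value is an arbitrary assumption."""
    t = np.arange(int(round(duration_s * fs))) / fs
    return amplitude * np.sin(2.0 * np.pi * f0 * t)

def tone_amplitude(signal, fs=FS, f0=F0, frame=480):
    """Per-frame amplitude of the f0 component using a single-bin DFT.

    frame=480 samples is 10 ms at 48 kHz and holds a whole number of
    18 kHz cycles, so the estimate is exact for a clean tone.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame
    frames = signal[: n_frames * frame].reshape(n_frames, frame)
    reference = np.exp(-2j * np.pi * f0 * np.arange(frame) / fs)
    return 2.0 * np.abs(frames @ reference) / frame
```

A hand touching or approaching the face would be expected to show up as a change in this amplitude trace relative to the undisturbed signal; how large that change is in practice depends on the hardware and is not modeled here.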
[0061] Experiment Results. The human face is divided into the left-facial region, middle-facial region, and right-facial region. As depicted in Figure 4(b), within each region, the sound volume is varied from a minimum of 43 dB SPL to 60 dB SPL to assess the success rate of gesture detection at different volume levels. More specifically, consider a template x(t) and a received data segment y(t), where t = 0, 1, 2, ..., N-1. The cross-correlation r is defined as follows:
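$$ r = \frac{\sum_{t=0}^{N-1} \left(x(t) - \bar{x}\right)\left(y(t) - \bar{y}\right)}{\sqrt{\sum_{t=0}^{N-1} \left(x(t) - \bar{x}\right)^{2} \, \sum_{t=0}^{N-1} \left(y(t) - \bar{y}\right)^{2}}} $$
where $\bar{x}$ and $\bar{y}$ are the means of the template and the received data, respectively. A gesture is considered to be successfully detected as long as the cross-correlation coefficient r is higher than a pre-defined threshold. A correlation coefficient close to 1 suggests a strong positive relationship. In many cases, a value above 0.6 or 0.7 is considered indicative of a strong positive correlation. Accordingly, the threshold is set to 0.7 in the experiment.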
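The template-matching rule described above can be sketched as follows. This sketch assumes that the coefficient r is the standard mean-removed, normalized (Pearson) correlation between a template and an equally long received segment, which is consistent with the stated use of the two means and the 0.7 threshold; the function names are illustrative, and returning 0 for a constant input is a design choice made here, not something stated in the source.

```python
import numpy as np

def correlation_coefficient(template, segment):
    """Mean-removed, normalized correlation between two equal-length signals."""
    x = np.asarray(template, dtype=float)
    y = np.asarray(segment, dtype=float)
    if x.shape != y.shape:
        raise ValueError("template and segment must have the same length")
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    if denom == 0.0:
        # A constant signal carries no shape information (assumed convention).
        return 0.0
    return float(np.sum(xc * yc) / denom)

def gesture_detected(template, segment, threshold=0.7):
    """Detection rule from the text: r must exceed the pre-defined threshold."""
    return correlation_coefficient(template, segment) > threshold
```

Because the coefficient removes the mean and normalizes by the signal energy, a received segment that is a scaled or offset copy of the template still scores close to 1, whereas unrelated noise scores close to 0.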
[0062] To ensure reliability, the experiment is repeated 20 times for each volume setting. FIG. 4C shows the success rate of gesture detection (referred to as the success rate in the figure) across the left-, middle-, and right-facial regions of the human face. Consistently high success rates (>95%) are observed in the left-facial region across all five volume settings. However, in the middle-facial and right-facial regions, approaching gestures are difficult to detect at low sound volumes (43 dB SPL). This is expected because the sound source (the speaker) is positioned on the left side of the human head. Consequently, the sound signals (LSAW) attenuate significantly before reaching the middle-facial and right-facial regions, resulting in a low success rate for approaching gesture detection. As the sound volume is increased to 45 dB SPL and further to 50 dB SPL, a substantial improvement is observed in the success rate for the middle-facial region, reaching over 60% and eventually surpassing 95%. Similarly, in the right-facial region, the success rate increases to over 45% at 45 dB SPL and then reaches 95% at 50 dB SPL. These preliminary experiments demonstrate the sensitivity of gesture detection success rates to sound volume. Notably, a sound volume of 45 dB SPL proves to be sufficiently strong for successful gesture detection in the left-facial region. To ensure coverage of all three regions, a slight increase in sound volume to 50 dB SPL can be implemented without compromising safety requirements.
[0063] Q2: How does the success rate change with the spacing between hand and face? Next, the sound volume is set to 45 dB SPL, and the way the gesture detection success rate changes with the spacing between the hand and the face is examined. It is anticipated that the benchmark results could reveal the smallest spacing in which the commodity headphones can generate detectable waves around the user’s face. The experiment setup is similar to the previous experiment.
[0064] Experiment Results. Figure 4(b) illustrates that the approaching gesture can consistently be detected when the hand is within 5 cm of participant A’s left face. However, the success rate declines to below 80% as the spacing between the hand and the left face increases. Furthermore, when the hand is 20 cm away from participant A’s left face, the success rate drops below 50%. Notably, the success rate decreases significantly in both the middle-facial and right-facial regions compared to the left-facial region, primarily due to the greater distance from the signal source (headphone speaker). For instance, when the hand-to-face spacing is 10 cm, the success rate in the middle-facial and right-facial regions decreases to 15%, representing a 60% reduction compared to the left-facial region. In practice, a longer detection distance may cause false alarms, as a nearby user may unintentionally trigger a gesture. Hence, a targeting distance of 5 cm is set in order to minimize false positives.
[0065] Q3: What does the mobile acoustic field look like? Finally, the effective coverage area of the mobile acoustic field is measured. In this trial experiment, the region of interest is divided into 12 sub-regions: the left and right cheeks, forehead, chin, nose, top of the head, back of the head, left and right neck, back neck, and left and right shoulders. The coverage area of both SAW signals and LSAW signals is detected by performing touching and approaching gestures at these 12 sub-regions, respectively. The approaching gesture maintains an effective spacing of roughly 5 cm ±1 cm between the hand and the face. The sound volume is set to 45 dB SPL.
[0066] Experiment Results. A heatmap is used to represent the gesture detection success rate. As shown in Figure 5, touching gestures generally achieve a broader coverage area, with a significantly higher success rate, for the following two reasons. Firstly, the touching gestures involve body contact that generates a new signal while simultaneously influencing the propagation path between the original audio source and the capturing microphone. This dual effect of touching face gestures substantially modifies the received signal detected by the microphone, leading to a notable increase in amplitude. Secondly, the touching face gestures occur at a closer distance to the microphone sensor than approaching face gestures.
[0067] On closer scrutiny of the touching gestures (top figure in FIG. 5), the predominantly deep orange coloration around the entire head area (>80%) indicates a high gesture detection success rate. Interestingly, there is a notable exception at the user’s nose. This is due to the relatively small contact area between hand and nose, which results in a less pronounced signal path alteration. In addition, other positions away from the head, e.g., the neck and the front chest, can still achieve around a 70% success rate. The success rate declines significantly in areas such as the shoulders and the back of the head, dropping to less than 30%. This decline is due to the significant SAW signal attenuation over distance. Thus, from a practical perspective regarding touching face gestures, the focus is placed on the head area due to its exceptional stability and optimal recognition success rate.
[0068] Lastly, the focus shifts to approaching gestures, as shown in the bottom figure in FIG. 5. The left-side regions of the head, including the left face, left neck, and left chin, represented as triangular areas on the heatmap, exhibit a distinctly higher success rate (>75%) than the rest. This striking difference underscores the fact that the LSAW attenuates more severely than the SAW due to its air propagation path: it emanates from the left speaker on the earphone, permeates the left face, and extends to the head and shoulder areas. Hence, users could be instructed to focus their gestures on the areas with the strongest signal transmission for the highest recognition success rates. This would enhance the overall user experience by ensuring the system responds accurately and reliably to user inputs.
4.2 Robustness of MAF
[0069] Next, the robustness of the mobile acoustic field is examined under diverse conditions. As a new human-computer interface, the robustness of MAF is crucial for its real-world applications. A highly robust MAF should maintain its performance within an acceptable range in the presence of body movement and changes in skin conditions.
[0070] Q1: Can facial hydration condition affect the success rate of gesture detection? Firstly, the potential effects of daily changes in facial skin condition on system performance are evaluated. This is crucial since the circadian rhythm can significantly influence various skin conditions, including hydration levels. A volunteer (participant A) is invited to perform on-face touching and over-the-face approaching gestures, once after applying a moisturizing mask in the morning (hydrated facial state) and once following a typical workday in the evening (dehydrated facial state). The microphone’s position on the user’s face is labeled with a marker to ensure precise remounting at the same location. This experiment is designed to simulate common skin conditions and monitor their potential impact on system performance. The experimental design follows the setup used in the facial hydration state experiment, wherein the left channel of a bone conduction earphone transmitted an 18 kHz ultrasonic wave with an energy level of 45 dB SPL. The effective detection distance for approaching gestures is also maintained at 5 cm ±1 cm.
[0071] Experiment Results. FIGs. 6A and 6B show the success rate of on-face touching gestures and over-the-face approaching gestures in dehydrated and hydrated facial states, respectively. The state of dehydration is characterized by the participant’s facial skin appearing oily and taut. Conversely, with hydration, the participant’s skin condition is smooth and delicate. The success rate of both touching gestures and approaching gestures remains consistently high in the two facial states, with slight differences when the user performs gestures in the right-facial regions. This shows that facial skin conditions do not remarkably impact the sensitivity of the microphone or the headset’s audio output, thus not leading to a significant alteration in the signal path.
[0072] Q2: Can motion artifacts affect the success rate of gesture detection? Next, it is experimentally validated whether different body motions could disrupt the efficacy of the mobile acoustic field for on-face (touching) and over-the-face (approaching) gesture detection. In this controlled lab experiment, participants carry out touching and approaching gestures in three motion states: stationary, walking (~1.4 m/s), and jogging (~3 m/s). Similar to the previous experiment, the user plays a single-tone signal at 18 kHz through the left speaker of her bone conduction earphone. The sound volume is fixed at 45 dB SPL. However, in contrast to the previous experimental settings, participant A was asked to complete all experiments independently in this study, while participant B only played a supervisory role to ensure that the proximity gesture stayed within 5 cm ±1 cm each time.
[0073] Experiment Results. FIGs. 6C and 6D show the success rate of on-face and over-the-face gesture detection, respectively, at different regions and in different motion states. It is observed that in the stationary state, the success rate of both on-face and over-the-face gestures is consistent with the heatmap results shown in FIG. 5. This essentially reveals that the resting participant using their own hands did not significantly affect the MAF’s performance. In the two motion states at different speeds, a decrease in the success rate is observed compared to the stationary state. This is particularly apparent for touching gestures. For instance, the success rate of the right-facial gesture drops from an original 85% to 25% during walking and 20% while jogging, as shown in FIG. 6C. This decrease can be attributed to a shift in the headphone’s speaker caused by touch and body movement, leading to a corresponding shift in the transmitted ultrasound signal. Similarly, the approaching gestures in FIG. 6D also show a downward trend. However, despite a noticeable decrease in the success rate from the static state, the left-facial gestures consistently exhibit a robust success rate (>95%), thereby assuring good gesture detection under any motion state. This result shows that the left-facial region can be fully relied upon for over-the-face interactions under any of the tested motion states.
4.3 User Acceptance of MAF
[0074] Q3: Is MAF acceptable to mobile users? A Likert scale questionnaire and a simple gesture test were created to evaluate the user acceptance of the MAF system.
[0075] Participants & Apparatus. Tests involving N = 22 participants were conducted. The participants had diverse skin conditions, ranged from 16 to 54 years old, and included both males and females. The experimental setup aligned with the methodology above: all participants were directed to interact only with their left facial regions, specifically by touching them or positioning their hands in close proximity. Each participant executed 20 sets of trials of these two gestures under stable acoustic intensity and effective distance. Each person needed only five minutes to complete this set of experiments. To uphold experimental rigor, the procedure was conducted under the guidance of a designated supervisor. When each person completed these experiments, their gesture signal diagrams, which are similar to those in FIGs. 2B and 2C, were shown to them, and the results were explained to them.
[0076] Experiment Results. The users’ feedback is shown in FIG. 7. First of all, it can be seen clearly that users often use headphones in their daily lives for various entertainment activities, such as listening to music and watching videos, which indicates that headphones occupy an important position in their daily lives. Secondly, in the initial stage of the experiment, participants were queried about their prior experience with gesture recognition technology. A substantial 81.8% revealed they had never interacted with such technology, whereas a minor portion, 18.2%, had some experience, mainly through VR or AR platforms. Following this, users were invited to try face-touching and face-approaching gesture recognition activities, and their responses were later gathered through a final Likert scale questionnaire, focusing particularly on questions (d) and (e). An impressive 95% of participants described the gesture recognition method as highly intuitive and simple to use. Furthermore, 91% found the device still comfortable to wear, given only minimal attachment of the microphone’s mouthpiece. As a whole, users’ recognition and expectations of the MAF system are positive.
5. System Design
[0077] The Probing Signal. The system leverages the acoustic signal emitted from the bone conduction earphones to generate the mobile acoustic field. Although music signals can produce both surface acoustic waves and leaky surface acoustic waves, their frequency and amplitude both change abruptly over time, introducing variations to both the SAW and LSAW. It is thus challenging to disentangle the signal variation caused by human gestures from the raw signal receptions.
[0078] In the system, a probing signal on the ultrasound band is proactively sent out to produce stable surface acoustic waves and leaky surface acoustic waves. The probing signal works on the ultrasound band for three key reasons. Firstly, it allows mobile users to perform gesture control while listening to music, without the two interfering with each other. Secondly, it is imperceptible to human beings and thus may not negatively affect the user experience. Thirdly, compared to audible band signals, ultrasound at a higher frequency band attenuates more rapidly and thus is less prone to false alarms triggered by other users nearby. Moreover, it suffers less from ambient noise since most environmental noise is below 18 kHz. The frequency response of three different pairs of bone conduction earphones was measured, and the central frequency of the probing signal was empirically set to 18 kHz. Users are free to select a higher frequency within the range of 18 kHz to 22 kHz if they can hear the probing signal at a lower frequency.
[0079] Single Tone vs. FMCW. A single tone is chosen instead of the chirp signal (FMCW) as the probing signal for two reasons. Firstly, it was found that the frequency response of most earphone speaker transducers in the ultrasound band varies significantly. This implies that the power of a chirp signal is not uniform across the frequency band. Given that the power fluctuation of the received signal is a crucial feature of the gesture recognition model, the inconsistency in chirp signal power has the potential to impact the performance of the model. Secondly, it was found that sending continuous chirps can lead to audible noises. This is because continuous chirp signals can trigger impulsive responses in the system, leading to the generation of transient signals that manifest as audible noise.
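As a minimal sketch of the single-tone probe described above, the following Python snippet synthesizes an 18 kHz sine wave. The 48,000 Hz sampling rate is the one given for the spectrogram input; the function name, amplitude, and duration are illustrative assumptions rather than parameters of the described system.

```python
import math


def single_tone(freq_hz=18_000.0, sample_rate_hz=48_000, duration_s=0.01, amplitude=0.5):
    """Return samples of a constant-amplitude sine wave (hypothetical helper)."""
    if freq_hz >= sample_rate_hz / 2:
        raise ValueError("tone frequency must be below the Nyquist frequency")
    n_samples = int(round(duration_s * sample_rate_hz))
    # Phase advance per sample for the requested frequency.
    step = 2.0 * math.pi * freq_hz / sample_rate_hz
    return [amplitude * math.sin(step * n) for n in range(n_samples)]


probe = single_tone()
```

Because the tone has a single frequency and a fixed amplitude, any power fluctuation in the received signal can be attributed to the channel rather than to the probe itself, which is the property the paragraph above relies on.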
5.1 Signal Pre-processing
[0080] FIGs. 8A-D show the proposed signal pre-processing pipeline. The raw signal received by the microphone first passes through a series of filters to extract the gesture-induced signals from the noisy receptions. These processed signals are then fed into a signal enhancement module to improve their SNR.
[0081] Step One: Filtering. The received signal is first fed into a Butterworth bandpass filter with cutoff frequencies of f_prob ± 50 Hz in order to remove the out-of-band noise, where f_prob is the frequency of the probing signal on the ultrasound band. Subsequently, the filtered signal is passed through a Butterworth band-stop filter with a central frequency of f_prob. This allows for removal of the probing signal from the receptions while preserving the frequency variation caused by hand-to-face gestures, thereby enhancing its SNR. FIG. 8B shows the received signal after passing the filters. Evidently, the signal variation due to facial gestures becomes more prominent after the filtering step.
[0082] Step Two: Signal Enhancement. To attain precise segmentations in MAF, it is essential to mitigate the effects of the in-band noise artifacts as well. One significant contributor to these artifacts is the probe signal, which generates multipath components as it traverses different channels on the face, such as bones and fats, subsequently affecting the accurate detection of SAW and LSAW when the gesture commences (as illustrated in FIG. 8B). Given that these multipath artifacts and the probing signal exhibit overlapping frequency components, prior band-pass filter (BPF) strategies are incapable of isolating the noise effectively. Thus, Wiener filtering is leveraged to manage such frequency-overlapped noise.
[0083] The initial step in this process entails collecting a brief segment of noise samples, typically lasting 0.3 seconds, prior to initiating the filtering. This step facilitates the analysis of the noise’s frequency characteristics, thereby assisting in the accurate determination of the filter’s parameters. Subsequently, these parameters are employed to filter out successive time frames sharing identical frequency characteristics. In MAF, the Wiener filter predominantly gathers the time frames that encompass the multipath occurring between the speaker and microphone pair when the probing signal initiates. As depicted in FIG. 8C, the application of the Wiener filter substantially reduces these multipaths when a gesture commences, thereby enhancing the discernibility of each gesture’s initiation point and duration.
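The noise-profile idea above can be illustrated with one common frequency-domain form of the Wiener filter, in which each frequency bin is scaled by max(|Y|^2 - N, 0) / |Y|^2, where N is the noise power estimated from noise-only frames. This is only a sketch under stated assumptions: the exact Wiener formulation used in MAF is not specified, Gaussian noise stands in for the multipath component, a tone stands in for the wanted component, and the short frame length and all function names are hypothetical.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def estimate_noise_power(noise_frames):
    """Average power per frequency bin over frames assumed to hold only noise."""
    power = [0.0] * len(noise_frames[0])
    for frame in noise_frames:
        for k, value in enumerate(dft(frame)):
            power[k] += abs(value) ** 2 / len(noise_frames)
    return power


def wiener_filter(frame, noise_power):
    """Scale each bin by the Wiener gain max(|Y|^2 - N, 0) / |Y|^2."""
    filtered = []
    for value, n_pow in zip(dft(frame), noise_power):
        y_pow = abs(value) ** 2
        gain = max(y_pow - n_pow, 0.0) / y_pow if y_pow > 0.0 else 0.0
        filtered.append(gain * value)
    return idft(filtered)


def energy(frame):
    return sum(x * x for x in frame)


# Synthetic stand-ins (assumptions): Gaussian noise and a tone at bin 24 of 64.
rng = random.Random(0)
FRAME_LEN = 64


def noise_frame():
    return [rng.gauss(0.0, 0.05) for _ in range(FRAME_LEN)]


noise_power = estimate_noise_power([noise_frame() for _ in range(20)])
noise_only = noise_frame()
tone = [math.sin(2.0 * math.pi * 24 * t / FRAME_LEN) for t in range(FRAME_LEN)]
tone_plus_noise = [s + v for s, v in zip(tone, noise_frame())]

cleaned_noise = wiener_filter(noise_only, noise_power)
cleaned_tone = wiener_filter(tone_plus_noise, noise_power)
```

In this sketch, a frame that matches the noise profile loses energy, while a frame carrying a strong component absent from the profile passes through almost unchanged.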
5.2 Segmentation
[0084] Next, the received audio wave is divided into a sequence of audio segments, and the segments containing human gestures are fed into the classifier for gesture recognition. To detect the presence of a gesture in a segment, an intuitive solution is to apply a predefined threshold to the audio wave to detect the energy variations caused by human gestures. However, this method is not scalable as it does not consider the fluctuations in signal energy resulting from diverse human behaviors, such as varying user strength during gesture execution.
[0085] To tackle this challenge, a Kullback-Leibler (KL) divergence-based method is employed to detect the presence of a gesture within each segment. Specifically, given two consecutive audio segments, the energy probability distributions of these two segments are computed, denoted as P and Q, respectively. The KL divergence quantifies the information loss when Q approximates P, as given by the equation: $D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$. When no gesture is present, both audio segments contain only ambient noise. Accordingly, their energy distributions are similar and the KL divergence value is close to 0. Conversely, when a gesture is present, its energy distribution differs drastically from that of an audio segment containing purely ambient noise. Hence, a large KL divergence value is expected. To find a proper length for the audio segment, the duration of the gestures collected from 30 users across different ages (details can be found in §6) is assessed, and the longest gesture is found to last 2 s. So, a slightly larger segment size of 2.5 s with a 50% overlap is adopted to ensure the completeness of the gesture within each audio segment. FIG. 8D shows the on-face and over-the-face gestures after segmentation.
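The KL-divergence test and the overlapping segmentation can be sketched as follows. How the energy probability distribution is formed is not detailed above, so this example assumes a histogram of per-frame energies with a small smoothing constant (so the logarithm is always defined). The bin edges, frame length, shortened segment length, and synthetic signals are all illustrative assumptions.

```python
import math
import random


def frame_energies(samples, frame_len):
    """Mean energy of each non-overlapping frame."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def energy_distribution(samples, frame_len, bin_edges, eps=1e-6):
    """Histogram of frame energies, smoothed by eps and normalised to sum to 1."""
    counts = [0] * (len(bin_edges) + 1)
    for e in frame_energies(samples, frame_len):
        counts[sum(1 for edge in bin_edges if e >= edge)] += 1
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def segment_starts(n_samples, seg_len, overlap=0.5):
    """Start indices of fixed-length segments with the given fractional overlap."""
    hop = int(seg_len * (1.0 - overlap))
    return list(range(0, n_samples - seg_len + 1, hop))


# Synthetic data (assumption): two noise-only segments and one with a burst.
rng = random.Random(0)
SEG_LEN, FRAME_LEN = 8192, 256
EDGES = [0.02, 0.05, 0.1]


def noise_segment():
    return [rng.gauss(0.0, 0.1) for _ in range(SEG_LEN)]


quiet_a, quiet_b, gesture = noise_segment(), noise_segment(), noise_segment()
for n in range(2048, 4096):
    gesture[n] += 0.5 * math.sin(2.0 * math.pi * n / 8.0)

p_quiet_a = energy_distribution(quiet_a, FRAME_LEN, EDGES)
p_quiet_b = energy_distribution(quiet_b, FRAME_LEN, EDGES)
p_gesture = energy_distribution(gesture, FRAME_LEN, EDGES)

kl_quiet = kl_divergence(p_quiet_b, p_quiet_a)
kl_gesture = kl_divergence(p_gesture, p_quiet_a)

# 2.5 s segments with 50% overlap at 48 kHz over a 10 s recording.
starts = segment_starts(480_000, 120_000)
```

Comparing `kl_quiet` with `kl_gesture` shows the intended behaviour: consecutive noise-only segments give a value near zero, while a segment containing a burst gives a much larger one.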
5.3 Gesture Classification
[0086] In the final stage, the aim is to differentiate the specified gestures within MAF. Inspired by the success of deep neural networks in image and audio fingerprinting classification, a data-driven framework is introduced to identify these on-face and over-the-face gestures in MAF. The overall framework consists of two parts: feature extraction and model training.
[0087] Feature extraction. MAF processes audio data for gesture classification using a short-time Fourier transform (STFT) spectrogram directly. Compared with the 1D time series signal, the 2D STFT spectrogram generally provides richer feature representations and has better temporal and frequency localization properties than a one-dimensional waveform in the time domain, making it well suited to classifying nonstationary human gestures. Unlike other approaches that apply a Mel spectrogram for classification, MAF applies the STFT spectrogram directly. The reason is that the nonlinear Mel scale compresses the fine-grained spectral structure, which is often less important to speech recognition but critically important to gesture detection on the high-frequency band (>18 kHz). Because the acoustic signal is quasi-stationary within a short time (e.g., 2-50 ms), the frame length of the spectrogram input is set to 2048 samples, corresponding to approximately 43 ms at the sampling rate of 48,000 Hz. The hop length is set to 1024. Accordingly, the frequency resolution is around 23 Hz per frequency bin.
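The relationship between frame length, sampling rate, and resolution can be checked with a short calculation: 2048 samples at 48,000 Hz span about 42.7 ms and give a bin spacing of about 23.4 Hz. The frame-count helper below assumes no padding, which is an assumption, since STFT implementations differ on this point; all names are illustrative.

```python
SAMPLE_RATE_HZ = 48_000
FRAME_LEN = 2048
HOP_LEN = 1024

frame_ms = 1000.0 * FRAME_LEN / SAMPLE_RATE_HZ  # duration of one analysis frame
hop_ms = 1000.0 * HOP_LEN / SAMPLE_RATE_HZ      # time step between frames
bin_hz = SAMPLE_RATE_HZ / FRAME_LEN             # spacing between frequency bins
n_bins = FRAME_LEN // 2 + 1                     # bins in a one-sided spectrum


def n_frames(n_samples, frame_len=FRAME_LEN, hop_len=HOP_LEN):
    """Number of full frames when no padding is applied (an assumption)."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop_len


probe_bin = round(18_000 / bin_hz)               # bin index of an 18 kHz probe
frames_per_segment = n_frames(int(2.5 * SAMPLE_RATE_HZ))
```

Under these assumptions, a 2.5 s segment yields a spectrogram of 1025 frequency bins by 116 time frames, with the 18 kHz probe falling exactly on bin 768.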
[0088] Model structure. MAF adopts a hybrid neural network architecture to enhance the classification performance, as depicted in FIG. 9. Specifically, a combination of Convolutional Neural Network (CNN) layers and a Recurrent Neural Network (RNN) layer is employed to facilitate superior feature extraction before inputting the spectrum feature representations into the multilayer perceptron (MLP) for classification. The architecture encompasses five CNN encoder layers, a bi-directional LSTM layer, and a classic multilayer perceptron (MLP) structure. Each CNN layer is configured with a 2D convolution, a batch norm, a ReLU function, and a dropout regularization. The stride is set to 2.
[0089] Given that the gesture representation usually spans a frequency band of 150 Hz, the kernel size of the initial two convolution layers is set to 7x7. This decision ensures that the receptive field is adequately sized to encapsulate a complete gesture component within the spectrogram, thus enhancing the feature extraction efficacy. Subsequently, the extracted high-dimensional features are forwarded to the LSTM layer, which enhances the temporal connections between individual time frames. This LSTM layer acts as a bridge, conveying the refined feature set to the MLP. The MLP processes the features received from the LSTM and outputs the prediction results. To compute the loss, the cross-entropy loss function is utilized.
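The link between the 7x7 kernel and the 150 Hz gesture band can be checked with a short calculation. The sketch below assumes the bin spacing that follows from a 2048-sample frame at 48,000 Hz, no dilation, and the standard receptive-field recurrence for stacked convolutions; the function name is illustrative.

```python
BIN_HZ = 48_000 / 2048  # frequency spacing of one spectrogram bin


def receptive_field(kernel_sizes, strides):
    """Receptive field (in input bins) of stacked convolutions along one axis."""
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= s             # stride increases the spacing seen by later layers
    return rf


one_layer_bins = receptive_field([7], [2])
two_layer_bins = receptive_field([7, 7], [2, 2])

one_layer_hz = one_layer_bins * BIN_HZ
two_layer_hz = two_layer_bins * BIN_HZ
```

Under these assumptions, a single 7-bin kernel already covers roughly 164 Hz along the frequency axis, slightly more than the 150 Hz band, and two stacked layers with stride 2 cover 19 bins, or roughly 445 Hz.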
6. Evaluation
6.1 Study One: Gesture Selection
[0090] A set of 12 distinct gestures was devised. Among these, six were performed directly on the face, while the remaining six were performed in proximity to the face (a.k.a., over-the-face gestures). FIG. 10 illustrates these 12 gestures. Subsequently, participants were invited to rate each gesture, indicating their personal preferences. The objective was to assess whether the gesture set being crafted aligned with user preferences and intuitive behavior.
[0091] Participants & Apparatus. The same group of 22 participants who took part in the previous user study (§4.3) was enlisted for this experiment. Each participant was asked to wear the bone conduction earphones and execute the set of 12 gestures illustrated in FIG. 10. Upon completing the gestures, each participant was requested to complete a Likert scale questionnaire in which they rated these 12 gestures (on a scale from 1 to 5, the higher the better) based on their individual preferences. Note that volunteers #6, #10, and #20 have noticeable facial hair, so the microphone was taped on top of their facial hair.
[0092] Results. In FIG. 10, the average score associated with each gesture is marked at the bottom of its illustration. From the results, it is observed that the participants generally prefer over-the-face gestures to on-face gestures. In addition, comparing gestures h and i, as well as j and k, it can be seen that users prefer to perform a gesture only once, indicating that users prefer simple and efficient gestures. Notably, gestures c and f received the lowest scores, averaging around 1.41. Participants’ feedback highlighted that they find these two types of gestures uncomfortable as they involve pinching the face, potentially causing discomfort or even pain. Furthermore, they indicated that performing these gestures in public could be viewed as inappropriate or uncouth. Based on these user study results, it was finally decided to keep the palm single and double pressing, pinching the ear, and covering the face as on-face gestures (a, b, d, e), and to include all of the over-the-face gestures for gesture recognition: fist open and close, the palm sliding once or twice, the fist single and double click, and one palm approaching (g, h, i, j, k, l).
6.2 Study Two: Gesture Recognition
[0093] In this section, the experiment setups are first described, and then the experiment results are discussed.
[0094] Experiment Setups. The same group of 22 participants was recruited for the field study. Within this group, there were 12 male participants and ten female participants, with an average age of 26.8 years and a standard deviation of 9.0.
[0095] Data collection. All participants used the same pair of bone-conduction earphones during the data collection process. To ensure consistency, the microphone’s mouthpiece was taped approximately three fingers’ width away from the participant’s earlobe, on their cheek. The microphone is taped on the participant’s face because the bone conduction earphone being used in the experiment adopts an inline microphone. In contrast, numerous bone conduction earphones do not necessitate taping to the user’s face, as they feature built-in microphones within the earphone body, establishing natural contact with the user’s face. Following this, a supervisor connected the earphones to a laptop and emitted an 18 kHz frequency signal at a consistent volume (45 dB SPL), ensuring that the ultrasound signal remained inaudible to the participants. The microphone recordings from the earphones were captured using a MATLAB program. Simultaneously, the supervisor recorded video footage of the participants performing gestures to establish the ground truth. Each participant was instructed to execute each type of gesture 20 times, resulting in a total time commitment of approximately 15 minutes. In total, 4,400 gesture recordings were collected.
[0096] Model Training and Evaluation. The CRNN model is implemented in PyTorch and trained on an NVIDIA A100 GPU for 150 epochs, using a batch size of 32. The Adam optimizer is employed with a learning rate set to 0.001. To further mitigate the risk of overfitting, an early stopping mechanism is applied during the training phase. A leave-one-out strategy at the group level, specifically a 5-fold cross-validation approach, is employed to evaluate the CRNN model. In this setup, the data from the 22 participants is divided into 5 groups, with each group containing data from 4 or 5 participants. During each iteration of cross-validation, one group is left out as the test set, while the combined data from the remaining groups is used as the training set. This ensures that each group has the opportunity to serve as an independent test set, and also accounts for potential correlations between participants’ data, providing a more comprehensive assessment of the model’s generalization capabilities.
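The participant-level split described above can be sketched as follows. The random assignment of participants to groups and the fixed seed are assumptions, since the grouping rule is not specified; the point of the sketch is only that all recordings of a given participant fall on one side of each split.

```python
import random


def participant_folds(participant_ids, n_folds=5, seed=0):
    """Partition participant IDs into n_folds groups of near-equal size."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def leave_one_group_out(folds):
    """Yield (train_ids, test_ids), holding out one group per iteration."""
    for i, test_ids in enumerate(folds):
        train_ids = [pid for j, fold in enumerate(folds) if j != i for pid in fold]
        yield train_ids, test_ids


folds = participant_folds(range(1, 23))
splits = list(leave_one_group_out(folds))
```

With 22 participants and 5 folds, this produces two groups of 5 and three groups of 4, and no participant appears in both the training and the test set of any iteration.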
[0097] Evaluation Metrics. Gesture recognition accuracy is adopted as the metric to evaluate the performance of the proposed solution. The recognition accuracy is formally defined as:
$$\text{Recognition Accuracy} = \frac{\text{the number of correctly recognized gestures}}{\text{the total number of gestures being tested}} \quad (2)$$
[0098] Additionally, precision, recall, F1 score, and accuracy are also used to assess the model’s performance.
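These metrics can be computed from true and predicted gesture labels as sketched below. The support-weighted averaging of the per-class values is an assumption, and the labels in the example are made up purely for illustration.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 score."""
    n = len(y_true)
    support = Counter(y_true)                                    # true count per class
    predicted = Counter(y_pred)                                  # predicted count per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / support[label] if support[label] else 0.0
        f = 2.0 * p * r / (p + r) if (p + r) else 0.0
        weight = support[label] / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(correct.values()) / n
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up labels: one "b" gesture is misclassified as "g".
y_true = ["a", "a", "b", "b", "g", "g"]
y_pred = ["a", "a", "b", "g", "g", "g"]
metrics = classification_metrics(y_true, y_pred)
```

In this toy example, five of six gestures are recognized correctly, so the accuracy of equation (2) is 5/6, while the weighted precision is 8/9 because the extra "g" prediction lowers the precision of that class only.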
[0099] Prevent Overfitting. The following actions were taken to prevent model overfitting. Firstly, leave-one-out cross-validation was adopted to ensure the testing is performed on unseen data (i.e., collected from other users). Secondly, the model adopted L2 regularization to penalize large weights, which helps prevent the model from fitting the training data too closely. Thirdly, early stopping was implemented and dropout layers were introduced. The training and validation loss curves both trend downwards consistently. FIG. 11 shows the training and testing loss curves. The training loss gradually decreases as training examples are added and then flattens. The validation loss likewise decreases as training examples are added and then flattens. A gap was noticed between the training loss and the validation loss after 50 epochs, indicating that adding more training examples does not further improve the model performance on unseen data. Hence the model did not overfit.
[0100] Experiment Results. In this section, the evaluation results were reported based on the data collected.
[0101] 1. Overall Performance Across Different Users. The gesture recognition accuracy was first examined across all 22 participants. To do this, the mean gesture recognition accuracy was calculated for each individual based on their performance across the ten different gestures, each repeated 20 times to ensure the reliability of the results. The results are shown in FIG. 12. It was observed that MAF demonstrates consistently strong performance, exceeding 91% accuracy for nearly all of the 22 participants. Only participants #7 and #11 achieved a slightly lower accuracy of approximately 89%. On the whole, these results reaffirm the overall effectiveness of MAF.
[0102] 2. Gesture Recognition Accuracy Across Ten Types of Gestures. A detailed analysis of each individual gesture was conducted. FIG. 13A shows the confusion matrix for the ten tested gestures. Among them, gestures a, b, d, and e were detected using SAW signals, while the remaining six gestures (g, h, i, j, k, l) were recognized using LSAW signals. Additionally, a default gesture was included to serve as a reference, indicating no specific gesture being performed.
[0103] Three observations were made. Firstly, the recognition accuracy for most of these gestures surpasses 92%, demonstrating the efficacy of the proposed signal processing algorithms. Secondly, it is observed that LSAW signals outperformed SAW signals in accurately capturing facial gestures. This observation suggests that the MAF system demonstrates increased sensitivity when detecting expansive movements or gestures that span a wider spatial area. Thirdly, a notable challenge was observed in accurately classifying the “fist open and close” gesture (g), with a recognition accuracy of merely 83%. To understand the reason for this inferior performance, the feature distribution of these ten gestures was further visualized in FIG. 13B. As shown, the features of the “fist open and close” gesture (g) overlap with the features of the “fist single click” gesture (j). This confusion stems from the similarity in the foundational movements involved in both gestures, as gesture (g) encompasses elements commonly found in other fist-related actions, thereby increasing the complexity of accurate classification.
[0104] 3. The Impact of Different Ages. People of different ages tend to perform gestures differently. For instance, a senior may perform a gesture slightly slower than a junior. Moreover, the difference in palm size and finger length across different age groups can lead to distinct effects on the received signal. Furthermore, people of different age brackets may exhibit distinct levels of facial skin moisture, which could impact the propagation of both SAW and LSAW signals. Hence, experiments are conducted to examine the impact of different ages on the system. In this experiment, the 22 participants are classified into four age groups: Group 1 (under 22 years old), Group 2 (between 22 and 25 years old), Group 3 (between 26 and 30 years old), and Group 4 (above 30 years old), as illustrated in FIG. 14A. The results reveal a relatively consistent gesture recognition accuracy across these four groups, averaging 92%. However, it is worth noting that the variance in gesture recognition accuracy is slightly higher in Groups 1, 2, and 4 compared to Group 3. The superior accuracy observed in Group 3 may be attributed to the stability and consistency demonstrated by participants in this age bracket in terms of ease of gesture performance, physical coordination, and even skin moisture conditions.
[0105] 4. The Impact of Different Genders. Likewise, there may be inherent differences in how men and women naturally execute gestures, as people of different genders differ in their palm size, finger length, and the strength as well as the speed with which they perform a gesture. This could lead to variations in the received SAW and LSAW signals. Consequently, an analysis of gesture recognition performance is conducted across different genders. FIG. 14B shows the accuracy of gesture recognition for both on-face and over-the-face gestures. Across both genders, high performance was observed consistently for gestures made away from the face (over-the-face), averaging 98%. However, when it comes to on-face gestures, a disparity was noted, with females achieving a recognition accuracy of 88% and males achieving 84%. Moreover, there was a notably higher variance in the accuracy of on-face gesture recognition across both genders. While differences in gesture intensity might contribute to this variation, it is also suspected that the smaller amount of training data available for on-face gestures (in comparison to the six types of over-the-face gestures) could be another contributing factor.
[0106] 5. Comparison Between Different Models. The designed CRNN model was also compared against several other classification methods, such as random decision trees, K-Nearest Neighbors (kNN), and Support Vector Machine (SVM), as well as deep learning models consisting solely of CNN or RNN components. All methods underwent identical signal preprocessing procedures, encompassing filtering, enhancement, and segmentation. Furthermore, identical training and testing datasets were employed for all of these models to ensure a fair and consistent evaluation.
Model                    Precision  Recall  F1 score  Accuracy
SVM                      0.550      0.547   0.550     0.547
Random Decision Tree     0.555      0.541   0.543     0.541
K-Nearest Neighbors      0.565      0.559   0.545     0.559
RNN only                 0.613      0.613   0.613     0.565
CNN only                 0.823      0.793   0.808     0.817
CRNN (presented herein)  0.949      0.949   0.949     0.928
[0107] Table 1. Test results of different models: the final weighted values of precision, recall, F1 score, and accuracy.
[0108] The results are shown in Table 1. The CRNN model exhibited remarkable superiority over traditional classifiers, achieving an accuracy of 92.8% compared to 54.7% for SVM, 54.1% for Random Decision Trees, and 55.9% for K-Nearest Neighbors. The gesture recognition accuracy of the RNN-only model showed only a slight increase, to 56.5%. This limited improvement may be attributed to the RNN’s predominant focus on temporal data, potentially overlooking crucial spatial characteristics inherent in gestures. The CNN-only model produced a significant leap in gesture recognition accuracy, to 82%. This improvement can be attributed to the CNN’s robust capability to extract essential spatial features from the spectrogram data. When the strengths of both CNN and RNN were combined in the hybrid CRNN model, the gesture recognition accuracy peaked at 92.8%, surpassing all baseline models by a considerable margin.
[0109] 6. Subjective Evaluation. A second Likert scale survey was administered to each individual after using MAF. The goal is to evaluate their perspectives on the usability of MAF. The results are shown in FIG. 15. Encouragingly, a substantial 86.4% found the MAF approach engaging, with no one stating that it was not. Furthermore, 17 participants considered the difficulty of executing the ten gestures acceptable. In addition, according to the results of the first survey in FIG. 7, only eight users had previously encountered the conveniences afforded by gesture recognition technology, and six of them found the MAF technology to be more appealing. Moreover, all 22 interviewees believed that MAF could enhance the market allure of VR/AR headsets, with 21 indicating a willingness to purchase the service should it become available on the market. Ultimately, a positive sentiment towards the overall satisfaction with MAF was expressed by 21/22 participants.
6.3 Study Three: Micro-Benchmark Evaluations
[0110] Next, four micro-benchmark studies were conducted to investigate how different parameter configurations affect the performance of MAF, focusing specifically on the impacts of noise, human speech, body motion, and skin hydration.
[0111] 1. The Impact of Noise. To evaluate the robustness of the system in environments with varying noise levels, noise levels typically encountered in daily human activities were simulated. During these experiments, a participant was instructed to perform each of the ten gestures 20 times amidst white noise emitted from a speaker at different volumes: 40 dB SPL, 60 dB SPL, and 80 dB SPL, as illustrated in FIG. 16A. Using the pre-trained model, the impact of these noise levels on gesture recognition accuracy was assessed. The analysis indicated that increasing noise levels could compromise the system’s ability to accurately recognize gestures. This can be attributed to the interference created by noise reflections over the face, which affects the formation of the MAF and could obstruct precise gesture recognition. However, it was encouraging to note that the system maintained considerable stability at a noise level of 40 dB SPL, proving to be highly reliable for indoor activities such as VR gaming. This underscores that the MAF system could accommodate the majority of daily activities.
[0112] 2. The Impact of Human Speech. To assess the impact of verbal communication on the system’s performance, an experiment was devised in which participants were required to perform gestures while concurrently having a conversation. In this trial, each participant repeated each gesture 20 times, and the collected data was classified using the same pre-trained model. As illustrated in FIG. 16B, a noticeable discrepancy in on-face gesture recognition accuracy can be observed between loud speech and soft speech. This phenomenon can be attributed to the significant movement in the cheek area that occurs during loud speech, which interferes with the propagation of the gesture signal. Nevertheless, the median results indicate that the recognition accuracy remains approximately 90%, regardless of the volume of speech. This stability is largely due to the filtering phase of signal preprocessing, wherein the frequencies commonly associated with verbal communication and their harmonics are effectively eliminated, preserving the integrity of gesture recognition.
[0113] 3. The Impact of Body Motions. The impact of body motions on gesture recognition accuracy was evaluated. Specifically, a participant was invited to perform the ten gestures while walking and while jogging. Each gesture was repeated 20 times. The same pre-trained model was used to recognize gestures collected in these two states. FIG. 16C shows the results. Under different body movements, the ten gesture signals can still be recognized, but the median recognition accuracy drops below 90% for walking and is lower still for jogging. This also reaffirms the results of §4.2, indicating that the motion state affects the recognition rate. The reason is that the movement of the head and the tightness of the headphone fit can affect the formation of the MAF. However, a recognition accuracy above 80% is still acceptable despite the challenges presented by physical movements. Future improvement of MAF might focus on implementing advanced algorithms capable of compensating for the disturbances generated by these physical activities.
[0114] 4. The Impact of Skin Hydration. The circadian rhythm notably influences the skin’s water permeability, the hydration level of facial muscles, and the concentration and dispersion of oils in the stratum corneum in individuals. Since these factors can vary at different points in the day, an experiment was designed to assess whether these changes affect the accuracy of gesture recognition. The same participant performed each of the ten gestures 20 times in the morning and again in the evening. The microphone’s position was labeled on the user’s face with a marker to ensure precise remounting at the same location. From the results of FIG. 16D, the pre-trained model struggled to perform well when the participant had oily skin after a day’s activities. This is largely due to the natural oil buildup on the facial skin over the course of the day, which can disrupt the accuracy of the gesture recognition process.
[0115] 5. The Impact of Music Playback. An experiment was further conducted to assess whether music playback affects gesture recognition. The participant wore the headphones while a computer played a song mixed with the 18 kHz probing signal. The participant performed the ten gestures, and the recognition accuracy was measured. The gesture recognition accuracy in the absence of music playback is also plotted for comparison. As shown in FIG. 16E, the system achieves above 90% accuracy in the presence of music playback, which is consistent with that achieved in the absence of music playback. This is because the frequency of music signals is usually below 15 kHz, and thus may not interfere with the probing signals. These music signals can be further filtered out during the signal processing. These results affirm that the system robustly supports gesture recognition without compromising the headset’s music playback function.
7. Constraints
[0116] Restricted to bone-conduction headphones. Several types of headphones were experimented with, including on-ear, over-ear, in-ear, and bone conduction earphones. However, the SAW and LSAW waves only show up when the user wears a pair of bone conduction earphones. It is suspected that insufficient skin contact or an insufficiently sized contact surface in on-ear, over-ear, and in-ear headphones is the primary cause. Additionally, the on-ear and over-ear headphones are usually equipped with soft earcups or ear pads that can absorb acoustic energy, further diminishing the generation of these waves.
[0117] Extend to micro-gestures. The existing design effectively identifies and categorizes ten distinct gestures performed on the face and above it. Nevertheless, the proposed CRNN model encounters challenges in accurately recognizing micro-gestures involving fine-grained finger motions, such as scratching the face, pinching the ear, and sliding two fingers across the face, owing to its constrained model capacity. One potential remedy is to add more layers to the current CRNN model to augment its capacity. However, this comes with a tradeoff: escalating model inference latency, which could adversely impact the user experience. The tradeoff between model capacity and latency is worth further exploration.
8. Conclusion
[0118] The design, implementation, and evaluation of the mobile acoustic field (MAF) have been presented. MAF is a novel acoustic sensing approach that leverages commodity bone conduction earphones for hand-to-face gesture interactions. This new approach hinges on the principles of Surface Acoustic Waves (SAW) and Leaky Surface Acoustic Waves (LSAW) to create signals that not only traverse the user’s facial surface but also radiate into the surrounding air, forming an acoustic field surrounding the individual’s head. The evaluation involving 22 participants demonstrated that MAF can accurately recognize ten distinct gestures performed both on-face and above-face. The user feedback is also promising: an overwhelming majority of the 22 interviewees showed enthusiasm toward adopting MAF, anticipating its seamless integration into contemporary scenarios. It is envisioned that MAF stands as a significant approach in the field of hand-to-face gesture recognition technology.
9. Systems and Methods for Detecting Hand Gestures in Mobile Acoustic Fields Around Bone Conduction Headphones
[0119] Referring now to FIG. 17, depicted is a block diagram of a system 100 for detecting hand gestures in mobile acoustic fields around bone conduction headphones. In overview, the system 100 can include at least one computing device 105 and at least one bone conduction headphone 110 (sometimes herein referred to as a bone conduction earphone or headset), among others. The computing device 105 can include at least one field monitoring service 115 and at least one application 120. The field monitoring service 115 can include at least one probe generator 125, at least one audio processor 130, at least one gesture detector 135, and at least one machine learning (ML) model 140, among others. The bone conduction headphone 110 can include at least one speaker 145 and at least one microphone 150, among others. The computing device 105 and the bone conduction headphone 110 can be associated with at least one user 155. Each of the components of system 100 (e.g., the computing device 105) may be implemented using hardware or a combination of hardware and software, such as those of system 514 as detailed herein in conjunction with FIG. 21. Each of the components in the system 100 may implement or execute the functionalities detailed herein, such as those detailed herein in Sections 1-8.
[0120] In further overview, the computing device 105 can be any computing device comprising one or more processors coupled with memory and software and capable of performing the various processes and tasks described herein. The computing device 105 can be operated by or associated with the user 155. The computing device 105 can be a smartphone, other mobile phone, tablet computer, wearable computing device (e.g., smart watch, eyeglasses), or laptop computer, among others. The computing device 105 can be in communication with the bone conduction headphone 110, among other devices (e.g., via wireless communications or wired communications). The computing device 105 can be in communication with other devices, such as remote servers, computing devices, or other hardware devices, among others.
[0121] The field monitoring service 115 can process, manage, or otherwise handle exchange of data from the bone conduction headphone 110 and the application 120. In the field monitoring service 115, the probe generator 125 can generate a probe signal to be emitted via the speaker 145 of the bone conduction headphone 110 to form an acoustic field. The audio processor 130 can receive and process audio signals from the microphone 150 of the bone conduction headphone 110. The gesture detector 135 can apply the ML model 140 on the processed audio signals from the audio processor 130 to detect and identify any hand gestures within the acoustic field. The ML model 140 can be used to identify hand gestures within the acoustic field about the head of the user 155 or the bone conduction headphone 110. The ML model 140 may include, for example, a deep learning artificial neural network (ANN), a Naive Bayesian classifier, a relevance vector machine (RVM), a support vector machine (SVM), a regression model (e.g., linear or logistic regression), a clustering model (e.g., k-NN clustering or density-based clustering), or a decision tree (e.g., a random tree forest), among others.
[0122] In some embodiments, the field monitoring service 115 can be executed on the computing device 105 (e.g., as depicted). For example, the field monitoring service 115 can be a process or application separate from the application 120. In some embodiments, the field monitoring service 115 can be part of the application 120 running on the computing device 105. For instance, the functionalities ascribed to the field monitoring service 115 can be executed by the application 120. In some embodiments, the field monitoring service 115 can be executed on a device separate from the computing device 105. For example, the field monitoring service 115 can be executed on one or more processors and memory of an external device (e.g., a dongle or other portable device) that is in communication with the computing device 105.
[0123] The application 120 can include any software program executing on the computing device 105. The application 120 can be any type of application to interact with the user 155 via input (e.g., mapped from hand gestures by the field monitoring service 115) and output (e.g., replies to user commands or queries). The application 120 can interface (e.g., via an application programming interface (API)) with the field monitoring service 115. The application 120 can be any type of software program, such as a word processor, a spreadsheet program, a presentation program, an electronic mail client, a web browser, a graphic design application, a video editor, a project management application, a database management software, a messenger, or a multimedia player, among others.
[0124] The application 120 can have any number of functions to be invoked via user input. In some embodiments, the application 120 can be associated with the various operations of the bone conduction headphone 110. The application 120 can include various functions related to the bone conduction headphone 110, such as volume control, playback, and mute, among others. In some embodiments, the application 120 can be associated with a virtual reality (VR) headset. The application 120 can control various functions of the VR headset from user input, such as a control of a virtual object presented through the headset or communication with another user, among others. In some embodiments, the application 120 can be used to control or suppress disease or virus transmission. The application 120 can present an alert for the user 155 to prevent contact with a face of the user 155, among others.
[0125] The bone conduction headphone 110 (sometimes herein referred to as a bone conduction earphone) can be an audio device to emit sound through bones on the face of the user 155 to bypass at least a portion of the ear (e.g., middle ear) of the user 155. The bone conduction headphone 110 can be communicatively coupled with the computing device 105 or another device executing the field monitoring service 115 (e.g., via wired or wireless communications). The bone conduction headphone 110 can be arranged, situated, or otherwise positioned relative to a corresponding ear of the user 155. For example, the bone conduction headphone 110 can be fitted about the back region of the ear of the user 155. There can be a pair of bone conduction headphones 110 on the user 155. For example, as depicted, one bone conduction headphone 110 can be positioned on the left ear of the user 155 and another bone conduction headphone 110 can be positioned on the right ear of the user 155. While primarily described in terms of one or two bone conduction headphones 110, the system 100 can include any number of bone conduction headphones 110 on the user 155.
[0126] The speaker 145 of the bone conduction headphone 110 can produce, emit, or otherwise radiate acoustic sound waves against the face of the user 155. To generate the acoustic sound waves, the speaker 145 can transform or convert electrical signals (also referred to herein as audio signals) from the computing device 105 into the acoustic sound waves. The speaker 145 can be any type of transducer, such as a loudspeaker, a piezoelectric transducer, or an electromagnetic transducer, among others. The speaker 145 can be situated, positioned, or otherwise disposed on the face of the user 155. For instance, the speaker 145 can be fitted or secured against the cheekbone, jawbone, or temporal bone of the user 155 by the remaining portion of the bone conduction headphone 110.
[0127] The microphone 150 of the bone conduction headphone 110 can also produce, output, or otherwise generate electrical signals from acoustic soundwaves. The acoustic soundwaves can be traveling through the air about the user 155 and can reach the microphone 150. To generate the electrical signals, the microphone 150 can transform or convert acoustic soundwaves arriving at the microphone 150. The microphone 150 can be any type of transducer to receive acoustic waves from the air around the user 155, such as a dynamic microphone, a condenser microphone, a ribbon microphone, a carbon microphone, a piezoelectric microphone, a microelectromechanical system (MEMS) microphone, a Lavalier microphone, or a pressure zone microphone, among others. In some embodiments, the microphone 150 can be any type of transducer to receive acoustic waves or vibrations about the face of the user 155, such as a piezoelectric transducer or an electromagnetic transducer, among others.
[0128] Referring now to FIG. 18, depicted is a block diagram of a process 200 to acquire audio signals in the system 100 for detecting hand gestures. The process 200 can correspond to or include operations performed in the system 100 to form an acoustic field about the head of the user 155 and detect hand gestures performed by the user 155 affecting the acoustic field. Under the process 200, the probe generator 125 of the field monitoring service 115 can transmit, send, or otherwise provide at least one probing signal 205 to the speaker 145 of the bone conduction headphone 110. The probing signal 205 may be used to form, create, or otherwise produce at least one acoustic field 210 about the ear of the user 155, the head of the user 155, or about the bone conduction headphone 110 on the user 155.
[0129] The probe generator 125 can produce, create, or generate the probing signal 205 to be within a specified frequency range. The probing signal 205 may be an electrical (e.g., digitized or quantized) representation of an acoustic waveform (or vibration) within a specified frequency band to be radiated from the speaker 145. For instance, the frequency range may correspond to a range of frequencies inaudible to humans, such as between 18 kHz-40 kHz. The probing signal 205 can have a center frequency at which the amplitude is maximum. The center frequency can reside within the frequency band of 18 kHz-40 kHz, for example, at 18 kHz. The probing signal 205 can be sampled at any sampling rate, for instance, ranging from 8-200 kHz. The probing signal 205 can be of any duration in time, ranging between 2 ms to 1 minute, and be repeatedly played any number of times. Upon generation, the probe generator 125 can transmit, provide, or otherwise send the probing signal 205 to the bone conduction headphone 110.
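By way of a non-limiting illustration, the generation of a probing signal of this kind can be sketched as follows. The sketch assumes a pure sine tone, a 48 kHz sampling rate, a 10 ms burst, and an amplitude of 0.5; these values and the function name `make_probe` are hypothetical choices for illustration and are not taken from the described system. The one constraint the sketch enforces is that the sampling rate must exceed twice the probe frequency for the tone to be representable.

```python
import math


def make_probe(freq_hz=18_000.0, sample_rate_hz=48_000,
               duration_s=0.01, amplitude=0.5):
    """Return one sine-tone probe burst as a list of float samples.

    All default values are illustrative assumptions, not values mandated
    by the description.
    """
    # A tone at or above half the sampling rate cannot be represented.
    if freq_hz >= sample_rate_hz / 2:
        raise ValueError("probe frequency must be below the Nyquist frequency")
    n = int(round(duration_s * sample_rate_hz))
    step = 2.0 * math.pi * freq_hz / sample_rate_hz  # phase advance per sample
    return [amplitude * math.sin(step * i) for i in range(n)]


# One 10 ms burst of an 18 kHz tone; it could be repeated for continuous probing.
probe = make_probe()
```

A practical generator might additionally apply a fade-in and fade-out to each burst to avoid audible clicks, which this sketch omits.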
[0130] With receipt of the probing signal 205, the speaker 145 of the bone conduction headphone 110 can create, form, or otherwise produce the acoustic field 210. The speaker 145 can convert or transform the probing signal 205 into the acoustic waveform and can output, emit, or otherwise radiate the acoustic waveform (or vibrations) to produce the acoustic field 210. The acoustic field 210 can at least partially envelop, encase, or otherwise surround the head of the user 155 (e.g., as depicted), the ear of the user 155, or the bone conduction headphone 110 on the user 155, among others. In at least partial concurrence with the production of the acoustic field 210, the speaker 145 can also output, radiate, or otherwise produce other acoustic waveforms using audio signals from other sources. For instance, the speaker 145 can produce sound using audio signals of audiovisual content played on a multimedia application running on the computing device 105.
[0131] Upon emanation, the acoustic field 210 can have an effective range between 2-10 cm about the surface of the head, the surface of the outer ear, or the speaker 145 of the bone conduction headphone 110. The acoustic field 210 can reside within the frequency range of the probing signal 205. For example, similar to the probing signal 205, the acoustic field 210 can have a frequency range between 18 kHz-40 kHz that is also inaudible to humans. The acoustic field 210 can be affected by or can respond to at least one hand gesture 215 performed by the user 155 within the effective range of the acoustic field 210. The hand gesture 215 can include or can correspond to a movement of at least one hand or at least one finger by the user 155. The hand gesture 215 can correspond to a combination of movement of the hand or fingers of the user 155 to signal a command to invoke a function of the application 120.
[0132] In conjunction, the audio processor 130 of the field monitoring service 115 can retrieve, obtain, or otherwise receive at least one audio signal 220 from the microphone 150 of the bone conduction headphone 110. While the probing signal 205 is provided to the speaker 145 to produce the acoustic field 210, the audio processor 130 can continuously listen for or monitor the audio signal 220. The audio signal 220 can be generated by the microphone 150 by transforming or converting acoustic waveforms reaching the microphone 150 and can be provided to the field monitoring service 115 (e.g., via wired or wireless communications). The audio signal 220 can correspond to at least one acoustic waveform through or from the acoustic field 210 reaching the microphone 150. For example, the audio signal 220 can correspond to the acoustic waveform altered or produced in response to at least one hand gesture 215 within the acoustic field 210. The audio signal 220 may be of any duration in time, such as between 2 ms to 1 minute. In some embodiments, the audio processor 130 can listen to or monitor for the audio signal 220 over a sliding time window. The time window may range between 2 ms to 1 minute, with a sliding interval that can be a fraction of the time window (e.g., 0.5 ms to 15 seconds).
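As a non-limiting illustration of monitoring over a sliding time window, the following sketch splits a buffer of samples into fixed-length frames that advance by a hop equal to a fraction of the window. The function name `sliding_windows` and the toy buffer are hypothetical; in practice the window and hop lengths would be derived from the chosen durations and the sampling rate.

```python
def sliding_windows(samples, window_len, hop_len):
    """Yield (start_index, window) pairs of fixed-length, overlapping frames.

    Trailing samples that do not fill a whole window are left for the next
    call once more audio has arrived.
    """
    if window_len <= 0 or hop_len <= 0:
        raise ValueError("window_len and hop_len must be positive")
    for start in range(0, len(samples) - window_len + 1, hop_len):
        yield start, samples[start:start + window_len]


# Toy buffer of ten samples, window of four samples, hop of half a window.
frames = list(sliding_windows(list(range(10)), window_len=4, hop_len=2))
```

Each frame could then be filtered and passed to the classifier independently, so that a gesture is detected shortly after it begins rather than only at the end of a long recording.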
[0133] With receipt of the audio signal 220, the audio processor 130 can process or filter at least a portion of the audio signal 220. The audio signal 220 can include at least one probe portion 225A and at least one non-probe portion 225B. The probe portion 225A can correspond to a portion of the acoustic waveform associated with a frequency band in which the probing signal 205 resides. The probe portion 225A, for example, can correspond to an inaudible frequency range for humans, such as between 18 kHz-40 kHz. The non-probe portion 225B can correspond to a remaining portion of the acoustic waveform. In some embodiments, the non-probe portion 225B can correspond to a portion of the acoustic waveform associated with a frequency band exclusive of the probing signal 205. The non-probe portion 225B, for instance, can correspond to an audible frequency range for humans, such as between 20 Hz to 18 kHz or 20 Hz to 20 kHz.
[0134] By filtering the audio signal 220, the audio processor 130 can pass the probe portion 225A to produce, output, or otherwise generate at least one audio signal 220’. The audio signal 220’ can include at least the probe portion 225A corresponding to the frequency band in which the probing signal 205 resides. In processing the audio signal 220, the audio processor 130 can apply at least one filter. The filter can include, for example, a low-pass filter (LPF), a band-pass filter (BPF), a band-stop filter (BSF), or a high-pass filter (HPF), among others, or any combination thereof. The filter can be implemented using any type of architecture, such as a resistor-capacitor (RC) filter, a resistor-inductor (RL) filter, an RLC filter, an active filter, a Butterworth filter, a Chebyshev filter, or a Bessel filter, among others. In some embodiments, the audio processor 130 can apply a BPF to pass through the probe portion 225A of the audio signal 220 to output the audio signal 220’. The BPF can have a cutoff frequency relative to the frequency of the probing signal 205 used to generate the acoustic field 210. For example, the cutoff frequency for the BPF can be 25-50 Hz about the frequency of the probing signal 205. The audio signal 220’ can include a subsection of the originally acquired audio signal 220, with frequency components focused about the center frequency of the probing signal 205 (e.g., about 25-50 Hz of the center).
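As a non-limiting illustration of such a band-pass stage, the following sketch applies a single second-order (biquad) digital band-pass section centered on an 18 kHz probe. The coefficient formulas are the widely used "Audio EQ Cookbook" band-pass form with unity gain at the center frequency; the choice of this filter type, the 48 kHz sampling rate, the 100 Hz nominal bandwidth (about 50 Hz on either side of the center), and all function names are assumptions for illustration and are not stated to be the filter used in the described system. The realized bandwidth of this simple section only approximates the nominal value.

```python
import math


def bandpass_coeffs(center_hz, bandwidth_hz, sample_rate_hz):
    """Biquad band-pass coefficients (unity gain at center_hz), normalized."""
    w0 = 2.0 * math.pi * center_hz / sample_rate_hz
    q = center_hz / bandwidth_hz          # approximate quality factor
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def biquad_filter(samples, b, a):
    """Apply the biquad in direct form I and return the filtered samples."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# Toy input: an 18 kHz probe tone mixed with a 1 kHz tone standing in for
# audible content such as speech or music.
fs = 48_000
n = 9_600  # 0.2 s of audio
probe_tone = [math.sin(2.0 * math.pi * 18_000 * i / fs) for i in range(n)]
audible_tone = [math.sin(2.0 * math.pi * 1_000 * i / fs) for i in range(n)]
mixed = [p + s for p, s in zip(probe_tone, audible_tone)]

b, a = bandpass_coeffs(18_000.0, 100.0, fs)
filtered = biquad_filter(mixed, b, a)
```

After the filter's start-up transient has decayed, the output is dominated by the probe tone, while the 1 kHz component is attenuated by several orders of magnitude. A higher-order design (e.g., a Butterworth band-pass) would give steeper band edges at the cost of more computation.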
[0135] In some embodiments, the audio processor 130 can process or filter the audio signal 220 to remove, attenuate, or otherwise suppress noise at least within the probe portion 225A. In some embodiments, the audio processor 130 can apply a BSF to the audio signal 220 to identify a portion of the audio signal 220 outside the probing signal 205. The BSF can have a cutoff frequency relative to the frequency of the probing signal 205 used to generate the acoustic field 210. For example, the cutoff frequency for the BSF can be 25-50 Hz about the frequency of the probing signal 205. With the identification of the portion outside the probing signal 205, the audio processor 130 can subtract the portion outside the probing signal 205 from the overall audio signal 220 (or the audio signal 220’) to suppress the noise. By suppressing the noise, the audio processor 130 can increase, amplify, or otherwise boost a relative amplitude of the probing signal 205 within the audio signal 220’ in comparison to the remaining portions of the audio signal 220.
[0136] Referring now to FIG. 19, depicted is a block diagram of a process 300 to identify hand gestures in acoustic fields in the system 100 for detecting hand gestures. The process 300 can correspond to or include operations performed in the system 100 to detect and classify hand gestures performed by the user in the acoustic field. Under the process 300, the gesture detector 135 of the field monitoring service 115 can process or apply the audio signal 220’ to the ML model 140. The ML model 140 can be implemented using any model architecture and can have at least one input corresponding to the filtered audio signal 220’, at least one output classifying a type of hand gesture 215 performed by the user 155, and a set of weights relating the input to the output, among others. The ML model 140 can be a light-weight model architecture, with minimal resource consumption specifications that a portable or mobile device (e.g., the computing device 105 or external hardware device) can satisfy. In applying, the gesture detector 135 can feed or input the audio signal 220’ into the ML model 140. Upon feeding, the gesture detector 135 can process the input audio signal 220’ in accordance with the set of weights of the ML model 140.
[0137] The ML model 140 may have been initialized, trained, or established (e.g., by the field monitoring service 115 or another computing device) using a training dataset. The ML model 140 can be trained in accordance with any learning techniques, such as supervised learning, unsupervised learning, Q-learning, or weakly supervised learning, among others. The training dataset can include a set of examples. Each example can include or identify a sample audio signal (e.g., similar to the audio signal 220’) corresponding to a probe signal in an acoustic field at least partially about an ear, a head, or a speaker on a bone conduction headphone of another user. Each example can also include or identify a label indicating whether (e.g., a presence or absence) a hand gesture is performed by the user associated with the sample audio signal. In some embodiments, each example can also include or identify a label identifying which type of hand gesture is being performed by the user. The type of hand gesture of the label can include, for instance: no gesture; one finger tapping cheek; two fingers hovering over ear; three fingers touching forehead; at least one finger touching mouth, ear, or nose; a motion of hand spinning about the bone conduction headphone; or a fist forming near the bone conduction headphone, among others. The sample audio signal may be of any duration in time, such as between 2 ms to 1 minute.
[0138] For each example, the sample audio signal 220’ can be applied to the ML model 140 to produce or generate an output indicating whether the hand gesture is performed by the user. In some embodiments, the application of the sample audio signal 220’ to the ML model 140 can produce or generate an output indicating which type of hand gesture is performed. The output of the ML model 140 can be compared with the corresponding indication identified by the label. Based on the comparison, a loss metric can be calculated in accordance with a loss function (e.g., a hinge loss, a mean squared error (MSE), a mean absolute error (MAE), a cross-entropy loss, a Huber loss, or a log loss). The loss metric can be used to modify or update one or more of the set of weights of the ML model 140. The updating of the weights of the ML model 140 may be in accordance with an optimization function (e.g., stochastic gradient descent with a predefined learning rate). This process can be iteratively repeated until the ML model 140 reaches a convergence condition to stop or cease the training process. In some embodiments, the training of the ML model 140 can be performed on another computing system, separate from the computing device 105, and then loaded on the computing device 105 (e.g., when the field monitoring service 115 is installed thereon).
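As a non-limiting illustration of this training procedure, the following sketch trains a small softmax (multinomial logistic) regression classifier with a cross-entropy loss and stochastic gradient descent, stopping when the change in mean loss between epochs falls below a tolerance. The softmax regression model is a deliberately simple stand-in and is not the neural network described above; the two-dimensional feature vectors, the three labels, the learning rate, and all function names are hypothetical values chosen only to make the loop runnable.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, biases, features):
    """Return one probability per gesture class for a feature vector."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


def mean_loss(weights, biases, examples):
    """Mean cross-entropy loss over (features, label) examples."""
    return sum(-math.log(predict(weights, biases, x)[y])
               for x, y in examples) / len(examples)


def train(examples, num_classes, lr=0.1, max_epochs=200, tol=1e-4, seed=0):
    """SGD on cross-entropy; stops when the loss change drops below tol."""
    rng = random.Random(seed)
    dim = len(examples[0][0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    prev = mean_loss(weights, biases, examples)
    order = list(range(len(examples)))
    for _ in range(max_epochs):
        rng.shuffle(order)
        for i in order:
            x, y = examples[i]
            probs = predict(weights, biases, x)
            for k in range(num_classes):
                # Gradient of cross-entropy w.r.t. logit k: p_k - 1[k == y].
                grad = probs[k] - (1.0 if k == y else 0.0)
                biases[k] -= lr * grad
                for j in range(dim):
                    weights[k][j] -= lr * grad * x[j]
        loss = mean_loss(weights, biases, examples)
        if abs(prev - loss) < tol:  # convergence condition
            break
        prev = loss
    return weights, biases


# Hypothetical features; labels: 0 = no gesture, 1 and 2 = two gesture types.
examples = [([1.0, 0.0], 0), ([0.9, 0.1], 0),
            ([0.0, 1.0], 1), ([0.1, 0.9], 1),
            ([-1.0, -1.0], 2), ([-0.9, -0.8], 2)]
weights, biases = train(examples, num_classes=3)
```

The trained weights could then be serialized and loaded onto the device that performs inference, consistent with training on a separate computing system.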
[0139] Based on applying the ML model 140 to the audio signal 220’, the gesture detector 135 can identify, determine, or otherwise detect a presence (or occurrence) or an absence (or lack) of the hand gesture 215 performed by the user 155. From the application of the ML model 140, the gesture detector 135 can produce, output, or otherwise generate at least one gesture classification 305 for the input audio signal 220’. The gesture classification 305 can indicate or identify whether the hand gesture 215 is being performed by the user 155. In some embodiments, the gesture detector 135 can detect or identify a type of hand gesture 215 performed by the user 155 based on the application of the ML model 140 to the audio signal 220’. In identifying, the gesture detector 135 can generate the gesture classification 305 to indicate or identify a type of the hand gesture 215 performed by the user 155.
[0140] In some embodiments, the gesture detector 135 can calculate, generate, or determine a likelihood of a presence (or absence) of the hand gesture 215 performed by the user 155. The likelihood can identify or indicate a degree of probability that the user 155 is performing the hand gesture 215. In some embodiments, the gesture detector 135 can determine the likelihood for each type of hand gesture 215. From applying the ML model 140 to the audio signal 220’, the gesture detector 135 can produce, output, or determine the likelihood. With the determination of the likelihood, the gesture detector 135 can compare the likelihood with a threshold. The threshold can delineate, define, or otherwise identify a value (e.g., 80-95%) for the likelihood at which to identify the presence (or absence) of the hand gesture. In some embodiments, the threshold can identify a value for the likelihood at which to identify a type of hand gesture. If the likelihood satisfies (e.g., greater than or equal to) the threshold, the gesture detector 135 can detect the presence of the hand gesture 215. In some embodiments, the gesture detector 135 can identify the type of the hand gesture 215 based on the likelihood satisfying the threshold for the type of hand gesture. Conversely, if the likelihood does not satisfy (e.g., less than) the threshold, the gesture detector 135 can identify the absence of the hand gesture 215.
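As a non-limiting illustration of the threshold comparison, the following sketch takes a per-type likelihood for each gesture and reports the most likely gesture type only when its likelihood is greater than or equal to the threshold. The function name, the gesture labels, and the default threshold of 0.9 (within the 80-95% example range) are hypothetical.

```python
def classify_with_threshold(likelihoods, threshold=0.9, no_gesture="no_gesture"):
    """Return the detected gesture type, or None when no gesture is detected.

    `likelihoods` maps each gesture type (including the no-gesture class)
    to a probability produced by the model.
    """
    candidates = {g: p for g, p in likelihoods.items() if g != no_gesture}
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    # The likelihood satisfies the threshold when it is >= the threshold.
    return best if candidates[best] >= threshold else None


detected = classify_with_threshold(
    {"no_gesture": 0.03, "tap_cheek": 0.92, "hover_ear": 0.05})
```

Raising the threshold reduces false detections at the cost of missing more genuine gestures, so the value would typically be tuned per deployment.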
[0141] The gesture detector 135 can generate or provide at least one output 310 based on the gesture classification 305. The output 310 can include or identify information based in part on detecting the presence or the absence of the hand gesture 215 performed by the user 155. When the presence of the hand gesture 215 is detected, the gesture detector 135 can generate the output 310 to indicate the detection of the presence of the hand gesture 215. Conversely, when the absence of the hand gesture 215 is detected, the gesture detector 135 can generate the output 310 to indicate the detection of the absence of the hand gesture 215. In some embodiments, the gesture detector 135 can generate the output 310 to identify the type of hand gesture 215. Other information can be included, such as a timestamp of detection and a user identifier corresponding to the user 155. With the generation, the gesture detector 135 can send, convey, or otherwise provide the output 310 to the application 120 (e.g., via an application programming interface (API) for the field monitoring service 115 or the application 120). In some embodiments, when the absence of any hand gesture 215 is detected, the gesture detector 135 can forego or refrain from the generation of the output 310.
[0142] In some embodiments, the gesture detector 135 can identify or determine a command 315 to invoke a corresponding function of the application 120 based on the gesture classification 305. The gesture detector 135 can transform or convert the gesture classification 305 to a command 315 to invoke a corresponding function of the application 120. The gesture detector 135 can use a list specifying a mapping between the type of hand gesture 215 and the corresponding command 315 to invoke the function for a given application 120. The list can include a set of mappings for a corresponding set of applications 120. For example, for a web browser application, the list can specify that a hand gesture 215 corresponding to two fingers waved forward corresponds to an increase in zoom in a web page presented in the web browser application. For an interactive panorama of a street within a map application, the list can identify that the same hand gesture 215 corresponding to two fingers waved forward corresponds to a go forward function from the position along the street depicted in the interactive panorama.
[0143] In accordance with the mapping, the gesture detector 135 can identify the command 315 to be invoked based on the type of hand gesture 215 detected as performed by the user 155. In some embodiments, the gesture detector 135 can identify or select the mapping based on the application 120. For instance, the gesture detector 135 can select the list of mappings between the types of hand gestures 215 for a given application 120, based on identifying the application 120 in focus (e.g., as a foreground process) on the computing device 105. From the list, the gesture detector 135 can identify the mapping for the type of hand gesture 215 detected with the corresponding command 315. With the identification, the gesture detector 135 can generate the output 310 to include or identify the command 315. In some embodiments, the output 310 can omit or lack the command 315. The gesture detector 135 can send, convey, or provide the output 310 with the command 315 to the application 120.
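As a non-limiting sketch of the mapping described above, the list can be represented as a per-application lookup table. The application identifiers, gesture labels, command names, and the helper names `resolve_command` and `build_output` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional

# Hypothetical identifiers; one gesture-to-command mapping per application.
COMMAND_MAPPINGS: Dict[str, Dict[str, str]] = {
    "web_browser": {"two_fingers_forward": "zoom_in"},
    "street_panorama": {"two_fingers_forward": "go_forward"},
}


def resolve_command(app_in_focus: str, gesture_type: str) -> Optional[str]:
    """Look up the command for a gesture in the mapping of the focused app."""
    mapping = COMMAND_MAPPINGS.get(app_in_focus, {})
    return mapping.get(gesture_type)


def build_output(app_in_focus: str, gesture_type: str) -> Dict[str, str]:
    """Assemble an output that names the gesture and, if mapped, the command."""
    output = {"gesture": gesture_type}
    command = resolve_command(app_in_focus, gesture_type)
    if command is not None:
        # The command is omitted when no mapping exists, leaving the
        # application free to perform its own conversion.
        output["command"] = command
    return output
```

Under these assumptions, the same gesture resolves to different commands depending on which application is in focus, and an unmapped gesture yields an output without a command.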
[0144] Upon receipt of the output 310, the application 120 executing on the computing device 105 can process or parse the output 310 to extract or identify the gesture classification 305. Based on the gesture classification 305, the application 120 can identify or determine the corresponding command 315. The determination of the command 315 can be independent of the conversion of the gesture classification 305 by the gesture detector 135. The application 120 can transform or convert the gesture classification 305 to a command 315 to invoke a corresponding function of the application 120. The application 120 can use a list specifying a mapping between the type of hand gesture 215 and the corresponding command 315 to invoke the function. In some embodiments, the application 120 can process or parse the output 310 to extract or identify the command 315 from the output 310.
[0145] Using the command 315, the application 120 can invoke the corresponding function. In some embodiments, the application 120 can be a multimedia player or can be communicatively coupled with a smartphone to handle telephone calls. The application 120 can include various functions related to the bone conduction headphone 110, such as volume control, playback, and mute, among others. The application 120 can invoke the corresponding function to control the bone conduction headphone 110 based on the command 315. For example, the application 120 (or the gesture detector 135) can convert the hand gesture 215 corresponding to a rising hand toward the right-side bone conduction headphone 110 to a command to increase volume on the speaker 145. The application 120 can invoke the function to increase volume on the speaker 145 in the bone conduction headphone 110. Conversely, the application 120 (or the gesture detector 135) can translate the hand gesture 215 corresponding to a lowering hand toward the right-side bone conduction headphone 110 as a command to decrease volume on the speaker 145. Based on this translation, the application 120 can invoke the function to decrease the volume on the speaker 145. The gesture detector 135 can also convert the hand gesture 215 corresponding to a spinning finger motion to a command to perform a playback of an audio recording. The application 120 can use the command to invoke and carry out the function to initiate playing back of the recording.
[0146] In some embodiments, the application 120 can be a health awareness program to control or suppress transmission (e.g., via contact transmission) of germs, bacteria, or viruses. For example, the application 120 can include a function to present an alert for the user 155 to notify the user 155 to cease contact with the face of the user 155. The gesture detector 135 can determine or detect that the hand gesture 215 corresponding to at least one finger or the hand being in contact with the face (e.g., along the eyes, nose, or mouth) or head of the user 155. The application 120 can invoke the function to present the alert, based on the detection of the hand gesture 215 corresponding to at least one finger being in contact with the face or head of the user 155. In some embodiments, the alert can be audible and played through the speaker 145 of the bone conduction headphone 110. In some embodiments, the alert can be a visual element presented via a graphical user interface of the application 120 or on the computing device 105. The application 120 can halt presentation of the alert, when the gesture detector 135 subsequently detects no finger or hand in contact with the face or head of the user 155.
[0147] In some embodiments, the application 120 can control various functions associated with a virtual reality (VR) headset, such as a control of a virtual object presented through the headset or communication with another user. The bone conduction headphone 110 can be part of or can be in communication with the VR headset. The functions for the application 120 can include, for example, a manipulation (e.g., moving, rotating, or sizing) of a virtual object presented through the headset, an interaction with a user interface element, or communication with another user, among others. For instance, when the detected hand gesture 215 is determined to correspond to an enlargement of a virtual object in the screen of the headset, the application 120 can invoke the function to carry out the enlargement of the virtual object. When the detected hand gesture 215 is identified as corresponding to the function to initialize communication with a specific user, the application 120 can initiate establishment of a communication session with another instance of the application associated with the specified user.
[0148] Subsequently, the field monitoring service 115 can continue any one or more of the processes 200 and 300 detailed herein. For instance, upon execution of the function identified in the command 315 specified in the output 310, the application 120 can return, send, or otherwise provide an indication of execution of the function corresponding to the command 315. The probe generator 125 can in turn generate another probing signal 205 to provide via the speaker 145 of the bone conduction headphone 110 to form the acoustic field 210 about the head or face of the user 155. In conjunction, the audio processor 130 can obtain another audio signal 220 from the microphone 150 of the bone conduction headphone 110, capturing hand gestures 215, if any, performed by the user 155. The audio processor 130 can filter the audio signal 220 to extract the probe portion 225A and to generate the audio signal 220’. The gesture detector 135 can apply the audio signal 220’ to the ML model 140 to determine the gesture classification 305 as well as the command 315. The command 315 may be for a function accessible upon the execution of the previous function as identified by the hand gesture 215.
[0149] In this manner, the interactivity of the application 120 can be expanded and widened beyond input/output (I/O) devices such as keyboards, touchscreens, and mice to cover hand gestures 215 performed within the acoustic field 210 formed by the bone conduction headphone 110. This functionality can allow the user 155 to move freely without any inconvenience or hindrance, able to access the functions of the application 120 while wearing the bone conduction headphone 110 coupled with the field monitoring service 115. In comparison to other techniques that rely on specialized sensors and other alterations to hardware (e.g., as detailed in Section 2), the field monitoring service 115 can re-adapt existing bone conduction headphones 110 for the purpose of detecting hand gestures 215. The lack of reliance on specialized sensors or hardware may make the approach by the field monitoring service 115 widely adaptable. The field monitoring service 115 can leverage the speaker 145 of the bone conduction headphone 110 to use a probing signal 205 to generate the acoustic field 210 at any time to pick up any number of different hand gestures 215. With particular types of applications, the field monitoring service 115 can enable new types of functionality. For example, the field monitoring service 115 can aid in evaluating the risks associated with face-touching gestures, serving as a preventive barrier against the entry of bacteria into sensitive areas such as the mouth, nose, and eyes. This capability may be valuable in promoting hygienic practices and preventing virus transmission.
[0150] Furthermore, the field monitoring service 115 can employ a combination of signal processing techniques and a lightweight model in the form of the ML model 140 to recognize hand gestures 215 with a high degree of accuracy, allowing for computing devices 105 (e.g., smart phones) to carry out the operations in a quick manner. Relative to other techniques that use specialized hardware and computationally complex algorithms (e.g., as detailed in Section 2), the field monitoring service 115 can use less processing power, memory, and electric power on the part of the computing device 105. The field monitoring service 115 can maintain this high degree of accuracy, even in the face of a wide range of environmental settings, including varying ambient noise levels, human speech, body motion, and skin hydration conditions, among others. This adaptability can allow the field monitoring service 115 to accurately detect the hand gestures 215 and invoke the function in the application 120 intended by the user 155 in a variety of daily life settings.
[0151] Referring now to FIG. 20, depicted is a flow diagram of a method 400 of detecting hand gestures in mobile acoustic fields around bone conduction headphones. The method 400 can be implemented or performed using any one or more of the components detailed herein, such as the system 100 or 514, among others. Under the method 400, a computing system can provide a probe signal via a speaker of a bone conduction headphone to form an acoustic field (405). The computing system can receive an audio signal from an earphone of the bone conduction headphone (410). The computing system can filter the audio signal to pass a frequency band corresponding to the probe signal (415). The computing system can apply the filtered audio signal to a machine learning (ML) model (420). From applying the filtered audio signal to the ML model, the computing system can detect whether a hand gesture is present in the acoustic field (425). If there is no hand gesture, the computing system can repeat the functionalities from step (405). In contrast, if there is a hand gesture detected, the computing system can provide an output to an application (430).
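The control flow of the method 400 can be sketched, in a non-limiting manner, as a loop over steps (405) through (430). Each step is passed in as a callable so that the sketch stays independent of any particular hardware or model. The function name `run_detection_cycle`, its parameter names, and the `max_iterations` bound are hypothetical and are introduced only so the example terminates.

```python
from typing import Callable, Optional, Sequence

Signal = Sequence[float]


def run_detection_cycle(
    emit_probe: Callable[[], None],
    capture_audio: Callable[[], Signal],
    bandpass: Callable[[Signal], Signal],
    classify: Callable[[Signal], Optional[str]],
    deliver_output: Callable[[str], None],
    max_iterations: int,
) -> Optional[str]:
    """Repeat steps 405-425 until a gesture is detected, then perform step 430."""
    for _ in range(max_iterations):
        emit_probe()                  # (405) form the acoustic field
        raw = capture_audio()         # (410) receive the audio signal
        filtered = bandpass(raw)      # (415) pass the probe frequency band
        gesture = classify(filtered)  # (420)-(425) apply the model and detect
        if gesture is not None:
            deliver_output(gesture)   # (430) provide the output to the application
            return gesture
        # No gesture detected: repeat from step (405).
    return None
```

In a deployed system the loop would presumably run continuously rather than stop after a fixed number of iterations.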
10. Computer Environment
[0152] Various operations described herein can be implemented on one or more computer systems. FIG. 21 shows a block diagram of a representative computing system 514 usable to implement the present disclosure. In some embodiments, the method 400 may be implemented by the computing system 514. Computing system 514 can be implemented, for example, as a consumer device such as a smartphone, other mobile phone, tablet computer, wearable computing device (e.g., smart watch, eyeglasses, head mounted display), desktop computer, laptop computer, cloud computing service or implemented with distributed computing devices. In some embodiments, the computing system 514 can include computer components such as processors 515, storage device 518, network interface 520, user input device 522, and user output device 524.
[0153] Network interface 520 can provide a connection to a wide area network (e.g., the Internet) to which a WAN interface of a remote server system is also connected. Network interface 520 can include a wired interface (e.g., Ethernet) and/or a wireless interface implementing various RF data communication standards such as Wi-Fi, Bluetooth, or cellular data network standards (e.g., 3G, 4G, 5G, 50 GHz, LTE, etc.).
[0154] User input device 522 can include any device (or devices) via which a user can provide signals to computing system 514; computing system 514 can interpret the signals as indicative of particular user requests or information. User input device 522 can include any or all of a keyboard, a controller (e.g., joystick), touch pad, touch screen, mouse or other pointing device, scroll wheel, click wheel, dial, button, switch, keypad, microphone, sensors (e.g., a motion sensor, an eye tracking sensor, etc.), and so on.
[0155] User output device 524 can include any device via which computing system 514 can provide information to a user. For example, user output device 524 can include a display to display images generated by or delivered to computing system 514. The display can incorporate various image generation technologies, e.g., a liquid crystal display (LCD), light-emitting diode (LED) including organic light-emitting diodes (OLED), projection system, cathode ray tube (CRT), or the like, together with supporting electronics (e.g., digital-to-analog or analog-to-digital converters, signal processors, or the like). A device such as a touchscreen that functions as both input and output device can be used. Other user output devices 524 can be provided in addition to or instead of a display. Examples include indicator lights, speakers, tactile “display” devices, printers, and so on.
[0156] Some implementations include electronic components, such as microprocessors, storage and memory that store computer program instructions in a non-transitory computer readable storage medium. Many of the features described in this specification can be implemented as processes that are specified as a set of program instructions encoded on a computer readable storage medium. When these program instructions are executed by one or more processors, they cause the processors to perform various operations indicated in the program instructions. Examples of program instructions or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter. Through suitable programming, processor 515 can provide various functionality for computing system 514, including any of the functionality described herein as being performed by a server or client, or other functionality associated with message management services.
[0157] It will be appreciated that computing system 514 is illustrative and that variations and modifications are possible. Computer systems used in connection with the present disclosure can have other capabilities not specifically described here. Further, while computing system 514 is described with reference to particular blocks, it is to be understood that these blocks are defined for convenience of description and are not intended to imply a particular physical arrangement of component parts. For instance, different blocks can be located in the same facility, in the same server rack, or on the same motherboard. Further, the blocks need not correspond to physically distinct components. Blocks can be configured to perform various operations, e.g., by programming a processor or providing appropriate control circuitry, and various blocks might or might not be reconfigurable depending on how the initial configuration is obtained. Implementations of the present disclosure can be realized in a variety of apparatus including electronic devices implemented using any combination of circuitry and software.
[0158] Having now described some illustrative implementations, it is apparent that the foregoing is illustrative and not limiting, having been presented by way of example. In particular, although many of the examples presented herein involve specific combinations of method acts or system elements, those acts and those elements can be combined in other ways to accomplish the same objectives. Acts, elements, and features discussed in connection with one implementation are not intended to be excluded from a similar role in other implementations or embodiments.
[0159] The hardware and data processing components used to implement the various processes, operations, illustrative logics, logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, or any conventional processor, controller, microcontroller, or state machine. A processor also may be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. In some embodiments, particular processes and methods may be performed by circuitry that is specific to a given function. The memory (e.g., memory, memory unit, storage device, etc.) may include one or more devices (e.g., RAM, ROM, Flash memory, hard disk storage, etc.) for storing data and/or computer code for completing or facilitating the various processes, layers, and modules described in the present disclosure. The memory may include volatile memory or non-volatile memory, and may include database components, object code components, script components, or any other type of information structure for supporting the various activities and information structures described in the present disclosure. According to an embodiment, the memory is communicably connected to the processor via a processing circuit and includes computer code for executing (e.g., by the processing circuit and/or the processor) the one or more processes described herein.
[0160] The present disclosure contemplates methods, systems, and program products on any machine-readable media for accomplishing various operations. The embodiments of the present disclosure may be implemented using existing computer processors, or by a special purpose computer processor for an appropriate system, incorporated for this or another purpose, or by a hardwired system. Embodiments within the scope of the present disclosure include program products comprising machine-readable media for carrying or having machine-executable instructions or data structures stored thereon. Such machine-readable media can be any available media that can be accessed by a general purpose or special purpose computer or other machine with a processor. By way of example, such machine-readable media can comprise RAM, ROM, EPROM, EEPROM, or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code in the form of machine-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer or other machine with a processor. Combinations of the above are also included within the scope of machine-readable media. Machine-executable instructions include, for example, instructions and data which cause a general-purpose computer, special purpose computer, or special purpose processing machines to perform a certain function or group of functions.
[0161] The phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” “having,” “containing,” “involving,” “characterized by,” “characterized in that,” and variations thereof herein, is meant to encompass the items listed thereafter, equivalents thereof, and additional items, as well as alternate implementations consisting of the items listed thereafter exclusively. In one implementation, the systems and methods described herein consist of one, each combination of more than one, or all of the described elements, acts, or components.
[0162] Any references to implementations or elements or acts of the systems and methods herein referred to in the singular can also embrace implementations including a plurality of these elements, and any references in plural to any implementation or element or act herein can also embrace implementations including only a single element. References in the singular or plural form are not intended to limit the presently disclosed systems or methods, their components, acts, or elements to single or plural configurations. References to any act or element being based on any information, act or element can include implementations where the act or element is based at least in part on any information, act, or element.
[0163] Any implementation disclosed herein can be combined with any other implementation or embodiment, and references to “an implementation,” “some implementations,” “one implementation” or the like are not necessarily mutually exclusive and are intended to indicate that a particular feature, structure, or characteristic described in connection with the implementation can be included in at least one implementation or embodiment. Such terms as used herein are not necessarily all referring to the same implementation. Any implementation can be combined with any other implementation, inclusively or exclusively, in any manner consistent with the aspects and implementations disclosed herein.
[0164] Where technical features in the drawings, detailed description or any claim are followed by reference signs, the reference signs have been included to increase the intelligibility of the drawings, detailed description, and claims. Accordingly, neither the reference signs nor their absence has any limiting effect on the scope of any claim elements. Technical and scientific terms used herein have the meanings commonly understood by one of ordinary skill in the art, unless otherwise defined. Any suitable materials and/or methodologies known to those of ordinary skill in the art can be utilized in carrying out the methods described herein.
[0165] Systems and methods described herein may be embodied in other specific forms without departing from the characteristics thereof. As used herein, “approximately,” “about,” “substantially,” or other terms of degree will be understood by persons of ordinary skill in the art and will vary to some extent depending on the context in which they are used. If there are uses of a term which are not clear to persons of ordinary skill in the art given the context in which it is used, references to “approximately,” “about,” “substantially,” or other terms of degree shall include variations of +/-10% from the given measurement, unit, or range unless explicitly indicated otherwise.
[0166] Coupled elements can be electrically, mechanically, or physically coupled with one another directly or with intervening elements. Scope of the systems and methods described herein is thus indicated by the appended claims, rather than the foregoing description, and changes that come within the meaning and range of equivalency of the claims are embraced therein.
[0167] The term “coupled”, and variations thereof, includes the joining of two members directly or indirectly to one another. Such joining may be stationary (e.g., permanent or fixed) or moveable (e.g., removable or releasable). Such joining may be achieved with the two members coupled directly with or to each other, with the two members coupled with each other using a separate intervening member and any additional intermediate members coupled with one another, or with the two members coupled with each other using an intervening member that is integrally formed as a single unitary body with one of the two members. If “coupled” or variations thereof are modified by an additional term (e.g., directly coupled), the generic definition of “coupled” provided above is modified by the plain language meaning of the additional term (e.g., “directly coupled” means the joining of two members without any separate intervening member), resulting in a narrower definition than the generic definition of “coupled” provided above. Such coupling may be mechanical, electrical, or fluidic.
[0168] References to “or” can be construed as inclusive so that any terms described using “or” can indicate any of a single, more than one, and all of the described terms. A reference to “at least one of ‘A’ and ‘B’” can include only ‘A’, only ‘B’, as well as both ‘A’ and ‘B’. Such references used in conjunction with “comprising” or other open terminology can include additional items.
[0169] Modifications of described elements and acts such as variations in sizes, dimensions, structures, shapes and proportions of the various elements, values of parameters, mounting arrangements, use of materials, colors, orientations can occur without materially departing from the teachings and advantages of the subject matter disclosed herein. For example, elements shown as integrally formed can be constructed of multiple parts or elements, the position of elements can be reversed or otherwise varied, and the nature or number of discrete elements or positions can be altered or varied. Other substitutions, modifications, changes and omissions can also be made in the design, operating conditions and arrangement of the disclosed elements and operations without departing from the scope of the present disclosure.
[0170] References herein to the positions of elements (e.g., “top,” “bottom,” “above,” “below”) are merely used to describe the orientation of various elements in the FIGURES. The orientation of various elements may differ according to other exemplary embodiments, and that such variations are intended to be encompassed by the present disclosure.
[0171] As used herein, a subject can be a mammal, such as a non-primate (e.g., cows, pigs, horses, cats, dogs, rats, etc.) or a primate (e.g., monkey and human). In certain embodiments, the term “subject,” as used herein, refers to a vertebrate, such as a mammal. Mammals include, without limitation, humans, non-human primates, wild animals, feral animals, farm animals, sport animals, and pets. In certain exemplary embodiments, a subject is a human.
[0172] As used herein, the terms “subject” and “user” are used interchangeably.
[0173] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present disclosure, suitable methods and materials are described herein.
[0174] As used herein, the singular forms “a”, “an,” and “the” include plural referents unless the context clearly indicates otherwise. For example, the term “a cell” includes a plurality of cells, including mixtures thereof.
[0175] As used herein, the term “about” is used to indicate that a value includes the standard deviation of error for the device or method being employed to determine the value. The term “about” when used before a numerical designation, e.g., temperature, time, amount, and concentration, including range, indicates approximations which may vary by (+) or (-) 15%, 10%, 5%, 3%, 2%, or 1%.
[0176] Ranges: throughout this disclosure, various aspects of the invention can be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 5 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 5, from 3 to 5, etc., as well as individual numbers within that range, for example, 1, 2, 2.7, 3, 4, 5, 5.3, and 5. This applies regardless of the breadth of the range.
Claims
1. A method of detecting hand gestures in mobile acoustic fields around bone conduction headphones, comprising: providing, by one or more processors, via a speaker of a bone conduction headphone positioned at least partially about an ear of a user, a first probing signal in a first frequency band to produce a first acoustic field at least partially about the ear of the user; receiving, by the one or more processors, from a microphone of the bone conduction headphone, a first audio signal corresponding to an acoustic waveform through the acoustic field, the first audio signal comprising (i) a first portion corresponding to the first frequency band including the probing signal and (ii) a second portion corresponding to a second frequency band; filtering, by the one or more processors, the first audio signal to pass the first portion corresponding to the first probing signal in the first frequency band to generate a second audio signal; applying, by the one or more processors, the second audio signal to a machine learning (ML) model, wherein the ML model is trained using a plurality of examples, each of the plurality of examples identifying (i) a respective third audio signal corresponding to a respective second probing signal for a second acoustic field at least partially about a respective second user and (ii) an identification of whether a respective hand gesture is performed by the respective second user; detecting, by the one or more processors, based on applying the second audio signal to the ML model, a hand gesture performed by the user; and generating, by the one or more processors, an output based on a detection of the hand gesture performed by the user.
2. The method of claim 1, further comprising: detecting, by the one or more processors, a lack of any hand gesture performed by the user, based on applying a fourth audio signal corresponding to the first probing signal producing the first acoustic field in the first frequency band to the ML model; and
generating, by the one or more processors, a second output based on a detection of the lack of any hand gesture performed by the user.
3. The method of claim 1, further comprising providing, by the one or more processors, the output to an application to invoke at least one of a plurality of functions corresponding to the hand gesture detected as performed by the user of the bone conduction headphone.
4. The method of claim 3, wherein the application is further configured to control the plurality of functions associated with operations of the bone conduction headphone, the plurality of functions comprising at least one of (i) a volume control, (ii) a playback, or (iii) muting sound.
5. The method of claim 3, wherein the application is further configured to provide the plurality of functions associated with a virtual reality (VR) headset, the plurality of functions comprising at least one of (i) a control of a virtual object presented via the VR headset or (ii) a communication with another user through the VR headset.
6. The method of claim 3, wherein the application is further configured to present an alert for the user to prevent contact with a face of the user to suppress virus transmission.
7. The method of claim 1, wherein filtering the first audio signal further comprises suppressing noise within the first frequency band of the first audio signal to boost a relative amplitude of the first probing signal.
8. The method of claim 1, wherein the ML model is trained using the plurality of examples each identifying one of a plurality of hand gestures performed by the respective second user; and wherein detecting the hand gesture further comprises identifying, from the plurality of hand gestures, the hand gesture performed by the user.
9. The method of claim 1, wherein providing the first probing signal further comprises providing the first probing signal to radiate about the ear of the user from the speaker, the first acoustic field produced to respond to hand gestures performed by the user within an effective range from the speaker.
10. The method of claim 1, wherein the first frequency band comprises an inaudible frequency range between 18 kHz and 22 kHz, wherein the second frequency band comprises an audible frequency range between 20 Hz and 20 kHz, and wherein the first acoustic field has an effective range between 2 cm and 10 cm from the speaker.
11. A system for detecting hand gestures in mobile acoustic fields around bone conduction headphones, comprising:
one or more processors coupled with memory, configured to:
provide via a speaker of a bone conduction headphone positioned at least partially about an ear of a user, a first probing signal in a first frequency band to produce a first acoustic field at least partially about the ear of the user;
receive, from a microphone of the bone conduction headphone, a first audio signal corresponding to an acoustic waveform through the acoustic field, the first audio signal comprising (i) a first portion corresponding to the first frequency band including the probing signal and (ii) a second portion corresponding to a second frequency band;
filter the first audio signal to pass the first portion corresponding to the first probing signal in the first frequency band to generate a second audio signal;
apply the second audio signal to a machine learning (ML) model, wherein the ML model is trained using a plurality of examples, each of the plurality of examples identifying (i) a respective third audio signal corresponding to a respective second probing signal for a second acoustic field at least partially about a respective second user and (ii) an identification of whether a respective hand gesture is performed by the respective second user;
detect, based on applying the second audio signal to the ML model, a hand gesture performed by the user; and
generate an output based on a detection of the hand gesture performed by the user.
12. The system of claim 11, wherein the one or more processors are further configured to: detect a lack of any hand gesture performed by the user, based on applying a fourth audio signal corresponding to the first probing signal producing the first acoustic field in the first frequency band to the ML model; and generate a second output based on a detection of the lack of any hand gesture performed by the user.
13. The system of claim 11, wherein the one or more processors are further configured to provide the output to an application to invoke at least one of a plurality of functions corresponding to the hand gesture detected as performed by the user of the bone conduction headphone.
14. The system of claim 13, wherein the application is further configured to control the plurality of functions associated with operations of the bone conduction headphone, the plurality of functions comprising at least one of (i) a volume control, (ii) a playback, or (iii) muting sound.
15. The system of claim 13, wherein the application is further configured to provide the plurality of functions associated with a virtual reality (VR) headset, the plurality of functions comprising at least one of (i) a control of a virtual object presented via the VR headset or (ii) a communication with another user through the VR headset.
16. The system of claim 13, wherein the application is further configured to present an alert for the user to prevent contact with a face of the user to suppress virus transmission.
17. The system of claim 11, wherein the one or more processors are further configured to suppress noise within the first frequency band of the first audio signal to boost a relative amplitude of the first probing signal.
18. The system of claim 11, wherein the ML model is trained using the plurality of examples each identifying one of a plurality of hand gestures performed by the respective second user; and wherein the one or more processors are further configured to identify, from the plurality of hand gestures, the hand gesture performed by the user.
19. The system of claim 11, wherein the one or more processors are further configured to provide the first probing signal to radiate about the ear of the user from the speaker, the first acoustic field produced to respond to hand gestures performed by the user within an effective range from the speaker.
20. The system of claim 11, wherein the first frequency band comprises an inaudible frequency range between 18 kHz and 22 kHz, wherein the second frequency band comprises an audible frequency range between 20 Hz and 20 kHz, and wherein the first acoustic field has an effective range between 2 cm and 10 cm from the speaker.
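As an illustration of the probing and filtering steps recited in claims 1 and 10, the following Python sketch mixes an inaudible probe tone with audible sound and then band-passes the mixture so that only the 18 kHz to 22 kHz portion remains. It is not taken from the application: the 48 kHz sample rate, the single 20 kHz probe tone, the 101-tap Hamming-windowed FIR design, and every function name are assumptions chosen for the example, and the ML model of claim 1 is not modelled.

```python
import math

SAMPLE_RATE = 48_000            # Hz; assumed, must exceed twice the 22 kHz band edge
PROBE_BAND = (18_000, 22_000)   # Hz; inaudible band recited in claims 10 and 20


def tone(freq_hz, n_samples, amplitude=1.0):
    """Return a sine tone as a list of samples."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]


def bandpass_taps(low_hz, high_hz, n_taps=101):
    """Design a windowed-sinc FIR band-pass filter (Hamming window)."""
    mid = (n_taps - 1) / 2
    f1, f2 = low_hz / SAMPLE_RATE, high_hz / SAMPLE_RATE
    taps = []
    for n in range(n_taps):
        k = n - mid
        if k == 0:
            ideal = 2 * (f2 - f1)
        else:
            ideal = (math.sin(2 * math.pi * f2 * k)
                     - math.sin(2 * math.pi * f1 * k)) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (n_taps - 1))
        taps.append(ideal * window)
    return taps


def fir_filter(signal, taps):
    """Convolve signal with taps, keeping only fully overlapped outputs."""
    n_taps = len(taps)
    return [sum(taps[j] * signal[i + n_taps - 1 - j] for j in range(n_taps))
            for i in range(len(signal) - n_taps + 1)]


def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


# "First audio signal": probe tone (first portion) plus audible sound (second portion).
probe = tone(20_000, 2400, amplitude=0.5)
audible = tone(1_000, 2400, amplitude=1.0)
first_audio = [p + a for p, a in zip(probe, audible)]

# "Second audio signal": the first audio signal after band-pass filtering.
taps = bandpass_taps(*PROBE_BAND)
second_audio = fir_filter(first_audio, taps)

print(f"RMS before filtering: {rms(first_audio):.3f}")
print(f"RMS in probe band:    {rms(second_audio):.3f}")
```

In this sketch the filtered signal retains roughly the energy of the probe tone alone, while the audible component is strongly attenuated. A real device would then pass features of the filtered signal to a trained model rather than inspect the RMS value directly.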
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463561179P | 2024-03-04 | 2024-03-04 | |
| US63/561,179 | 2024-03-04 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2025188615A1 (en) | 2025-09-12 |
| WO2025188615A8 (en) | 2025-10-02 |
Family
ID=96991429
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2025/018129 Pending WO2025188615A1 (en) | 2024-03-04 | 2025-03-03 | Detecting hand gestures in mobile acoustic fields around bone conduction headphones |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025188615A1 (en) |
Non-Patent Citations (5)
| Title |
|---|
| "HOME/SUPPORT/USER GUIDES/SUUNTO WING USER GUIDE", 14 November 2023, SUUNTO OY, Finland, article SUUNTO CUSTOMER SUPPORT: "SUUNTO Wing User Guide", pages: 1 - 20, XP093355969 * |
| ALKIEK KHALED; HARRAS KHALED A.; YOUSSEF MOUSTAFA: "EarGest: Hand Gesture Recognition with Earables", 2022 19TH ANNUAL IEEE INTERNATIONAL CONFERENCE ON SENSING, COMMUNICATION, AND NETWORKING (SECON), IEEE, 20 September 2022 (2022-09-20), pages 91 - 99, XP034212981, DOI: 10.1109/SECON55815.2022.9918622 * |
| FAN ET AL.: "HeadFi: Bringing Intelligence to All Headphones", PROCEEDINGS OF THE 27TH ANNUAL INTERNATIONAL CONFERENCE ON MOBILE COMPUTING AND NETWORKING, 9 September 2021 (2021-09-09), pages 147 - 159, XP058910088, Retrieved from the Internet <URL:https://dl.acm.org/doi/10.1145/3447993.3448624> [retrieved on 20250421], DOI: 10.1145/3447993.3448624 * |
| WANG ET AL.: "Device-free gesture tracking using acoustic signals", PROCEEDINGS OF THE 22ND ANNUAL INTERNATIONAL CONFERENCE ON MOBILE COMPUTING AND NETWORKING, 3 October 2016 (2016-10-03), pages 82 - 94, XP058279595, Retrieved from the Internet <URL:https://dl.acm.org/doi/10.1145/2973750.2973764> [retrieved on 20250421], DOI: 10.1145/2973750.2973764 * |
| ZHANG XUEHAN; BAO ZHONGXU; YU XIAOJIE; YIN YUQING; YANG XU; NIU QIANG: "Device-Free and Training-Free Hand Gesture Recognition with Acoustic Signal", 2023 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), IEEE, 1 October 2023 (2023-10-01), pages 402 - 407, XP034529957, DOI: 10.1109/SMC53992.2023.10394116 * |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2025188615A8 (en) | 2025-10-02 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| Verma et al. | Expressear: Sensing fine-grained facial expressions with earables | |
| Liu et al. | Machine learning-assisted wearable sensing systems for speech recognition and interaction | |
| Jin et al. | EarCommand: "Hearing" Your Silent Speech Commands In Ear | |
| Ma et al. | Oesense: employing occlusion effect for in-ear human sensing | |
| Li et al. | Eario: A low-power acoustic sensing earable for continuously tracking detailed facial movements | |
| Yang et al. | MAF: Exploring mobile acoustic field for hand-to-face gesture interactions | |
| Qifan et al. | Dolphin: Ultrasonic-based gesture recognition on smartphone platform | |
| Min et al. | Exploring audio and kinetic sensing on earable devices | |
| Xie et al. | Acoustic-based upper facial action recognition for smart eyewear | |
| Li et al. | Eyeecho: Continuous and low-power facial expression tracking on glasses | |
| CN106992013A (en) | Speech emotional is changed | |
| WO2012138450A1 (en) | Tongue tracking interface apparatus and method for controlling a computer program | |
| CN106716440B (en) | Methods, apparatus and media for ultrasonic-based facial and pattern touch sensing | |
| US20230277130A1 (en) | In-ear microphones for ar/vr applications and devices | |
| US20230240611A1 (en) | In-ear sensors and methods of use thereof for ar/vr applications and devices | |
| Wang et al. | UFace: Your Smartphone Can "Hear" Your Facial Expression! | |
| Mahmud et al. | ActSonic: recognizing everyday activities from inaudible acoustic wave around the body | |
| Sun et al. | EyeGesener: Eye gesture listener for smart glasses interaction using acoustic sensing | |
| CN108683790A (en) | Method of speech processing and Related product | |
| Demirel et al. | Unobtrusive air leakage estimation for earables with in-ear microphones | |
| Chen et al. | A comprehensive survey of side-channel sound-sensing methods | |
| Guo et al. | EchoBreath: Continuous Respiratory Behavior Recognition in the Wild via Acoustic Sensing on Smart Glasses | |
| WO2025188615A1 (en) | Detecting hand gestures in mobile acoustic fields around bone conduction headphones | |
| Cao et al. | ipand: Accurate gesture input with smart acoustic sensing on hand | |
| Guo et al. | EchoExpress: Facial Expression Recognition in the Wild via Acoustic Sensing on Smart Glasses |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25768311; Country of ref document: EP; Kind code of ref document: A1 |