WO2024087050A1

WO2024087050A1 - Machine-learning-based diaphragm excursion prediction for speaker protection

Info

Publication number: WO2024087050A1
Application number: PCT/CN2022/127609
Authority: WO
Inventors: Yuwei REN; Matthew Zivney; Yin Huang; Eddie Choy; Chirag Sureshbhai Patel
Original assignee: Qualcomm Incorporated
Priority date: 2022-10-26
Filing date: 2022-10-26
Publication date: 2024-05-02

Abstract

Certain aspects of the present disclosure provide techniques and apparatus for machine-learning-based diaphragm excursion prediction for speaker protection. One example method generally includes receiving an indication of one or more parameters associated with driving a speaker, predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters, and taking one or more actions based on the predicted displacement.

Description

MACHINE-LEARNING-BASED DIAPHRAGM EXCURSION PREDICTION FOR SPEAKER PROTECTION

INTRODUCTION

Aspects of the present disclosure relate to speaker diaphragm protection using machine learning techniques.

A speaker is an electro-acoustic transducer that generates sound from an electric signal produced by a power amplifier. Generally, the voice coil of a speaker is attached to a diaphragm that is mounted on a fixed frame via a suspension. A permanent magnet generates a magnetic field that is conducted to the region of the coil gap. Due to the presence of the magnetic field, an electrical current passing through the voice coil generates a force $f_c$, which causes the membrane to move up and down. The displacement $x_d$ of the diaphragm is the excursion, which has a limit. If the excursion limit is exceeded, the speaker exhibits nonlinear behavior, which in turn manifests as distorted sound and degraded acoustic echo cancellation performance. Moreover, as current is pushed through the voice coil, some of the electrical energy is converted into heat instead of sound. Further, if the speaker is driven too hard, such excursions heat the diaphragm, which may distort the diaphragm and, in some cases, may manifest as plastic melt visible as bubbles on the edge of the diaphragm. This distortion may create an asymmetry in the diaphragm that prevents the diaphragm from vibrating as a piston. The issue generally becomes more acute as speakers become smaller and more portable (e.g., as used in micro-speakers, earbuds, etc.).

BRIEF SUMMARY

Certain aspects generally relate to machine-learning (ML)-based diaphragm excursion prediction for speaker protection.

Certain aspects provide a method. The method generally includes receiving an indication of one or more parameters associated with driving a speaker, predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters, and taking one or more actions based on the predicted displacement.

Other aspects provide processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and apparatus comprising means for performing the aforementioned methods as well as those further described herein.

The following description and the related drawings set forth in detail certain illustrative features of one or more embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

So that the manner in which the above-recited features of the present disclosure can be understood in detail, a more particular description, briefly summarized above, may be had by reference to aspects, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only certain typical aspects of this disclosure and are therefore not to be considered limiting of its scope, for the description may admit to other equally effective aspects.

FIG. 1 depicts an example speaker cross-section showing diaphragm displacement.

FIG. 2 illustrates an example laser excursion measurement.

FIG. 3A illustrates an example machine learning (ML) model to predict displacement.

FIG. 3B illustrates an example input matrix, ML model, and output vector.

FIG. 4A illustrates an example preprocessing procedure.

FIG. 4B illustrates an example plot illustrating a correlation between current I and displacement X.

FIG. 4C illustrates an example input, filter, and extracted frequency.

FIG. 5 illustrates an example Fourier neural network model.

FIG. 6 illustrates an example model, in accordance with certain aspects of the present disclosure.

FIG. 7 illustrates an example batch normalization (BN) re-estimation algorithm, in accordance with certain aspects of the present disclosure.

FIG. 8 illustrates an example plot illustrating an L1 loss (e.g., a Least Absolute Deviation loss) and a scaled fourth-power loss.

FIG. 9 illustrates an example plot illustrating a predicted value, a ground truth value, and a residual value.

FIG. 10 illustrates an example system implementation, in accordance with certain aspects of the present disclosure.

FIG. 11 illustrates an example method flow diagram, in accordance with certain aspects of the present disclosure.

FIG. 12 illustrates an example method flow diagram, in accordance with certain aspects of the present disclosure.

FIG. 13 illustrates an example device, in accordance with certain aspects of the present disclosure.

FIG. 14 illustrates an example device, in accordance with certain aspects of the present disclosure.

The APPENDIX describes various aspects of the present disclosure.

To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the drawings. It is contemplated that elements and features of one embodiment may be beneficially incorporated in other embodiments without further recitation.

DETAILED DESCRIPTION

Aspects of the present disclosure provide apparatuses, methods, processing systems, and computer-readable mediums for speaker diaphragm excursion prediction using machine learning.

Speaker protection generally leverages the playback signal to prevent over-excursion while maintaining maximum loudness, e.g., for speakerphone or gaming use in tiny loudspeakers, such as are found in smartphones, tablets, laptops, and other portable devices. One challenge is to model and predict over-excursion with highly nonlinear characteristics. To do so, aspects of the present disclosure utilize deep learning (DL) techniques to accurately predict nonlinear excursion of speaker diaphragms when driven. Feedback current and/or voltage may be sampled as input, and a laser may be used to measure diaphragm excursion. The sampled current and/or voltage are labeled (or otherwise correlated) with the measured diaphragm excursion. In some aspects, a convolutional neural network (ConvNet, or CNN) is designed as the baseline, and a fast Fourier transform network (FFTNet) may be used to explore the dominant low-frequency component and the unknown harmonic(s). In some aspects, batch normalization (BN) re-estimation is enabled to achieve online adaptation, and quantization (e.g., 8-bit integer (INT8) quantization) based on the artificial intelligence model efficiency toolkit (AIMET) is used to further reduce the computational complexity involved in predicting diaphragm excursion from current and voltage. In certain aspects of the present disclosure, more than 99% of the residual DC values correspond to a diaphragm excursion of less than 0.1 mm, which may exceed the performance of various digital signal processor (DSP) solutions, as verified on two speakers across three scenarios.

Some solutions for speaker protection involve building a speaker protection block, which first monitors the current and voltage, and then analyzes the buffered samples and predicts the excursion status. Once the predicted excursion is larger than a threshold, the speaker protection block is triggered to attenuate the input power or modify the source signal to decrease diaphragm excursion. However, it is hard to precisely predict diaphragm excursion based on the monitored current and/or voltage. Thus, one simple technique to control diaphragm excursion may use traditional equalization (EQ) filters to attenuate an input signal. These traditional EQ filters are generally designed conservatively due to the wide range of operating factors (e.g., speaker variations, various types of audio signals with large dynamic ranges, etc.) in which various speakers operate. For example, an EQ filter for nonlinear distortion in direct-radiator loudspeakers in a closed cabinet may be implemented as an exact inverse of an electro-mechanical model of the loudspeaker. Estimates generated by the digital loudspeaker model may be used to predict the excursion based on the input voltage, and the predicted excursion may be controlled using dynamic range compression in the excursion domain. These approaches, however, do not push the speaker to its true limit. For example, EQ filters still attenuate the output audio even when the audio-signal energy is low and diaphragm excursion is within a defined limit or threshold, thereby degrading the audio quality and the volume of the audio.

In another example, deep learning (DL) approaches can be used in modeling the behaviors of a voice coil actuator (VCA). These DL approaches generally incorporate a recurrent neural network (RNN) into a multi-physics simulation to enhance the computation efficiency of these DL approaches. DL solutions can solve differential equations (DEs), which can partially model a diverse non-linear system, such as excursion prediction and VCA modeling. Neural operators in the RNN can directly learn the mapping from any functional parametric dependence to the solution. One example uses physics-informed neural networks that directly solve the ordinary DEs. Another example formulates the neural operator in the Fourier space by parameterizing the integral kernel. However, DL solutions may be highly dependent on the training data set and may be subject to overfitting, especially with highly variable data sets, such as those associated with diaphragm excursion characteristics.

To allow for accurate prediction of diaphragm excursion, which may be used in speaker protection tasks, aspects of the present disclosure provide DL techniques to explore effective features for speaker protection. To do so, a diaphragm excursion measurement setting is established in which a laser tracks the corresponding excursion, and a comprehensive preprocessing pipeline prepares the dataset. A model, based on ConvNet and/or FFTNet, for example, may be trained and verified based on two typical speakers. BN re-estimation for online adaptation and quantization in AIMET may also be implemented.

Example Speaker Diaphragm Excursion Prediction

FIG. 1 depicts a cross-section of an example speaker 100 showing displacement X (also referred to as “excursion” ) of the speaker’s diaphragm.

The speaker 100 represents the transduction of electrical energy to mechanical energy. A continuous-time model for the electrical behavior is shown as:

$$v_c(t) = R_{eb}\, i_c(t) + \phi_0\, \dot{x}_d(t) \qquad (1)$$

where $v_c(t)$ is the voltage input across the terminals of the voice coil, $i_c(t)$ is the voice coil current, $R_{eb}$ is the blocked electrical resistance, $\dot{x}_d(t)$ is the diaphragm excursion velocity, $\phi_0$ is the transduction coefficient at the equilibrium state, and $x_d(t)$ is the diaphragm excursion.

The mechanical characteristics of the speaker may be mostly determined by the parameters $R_{eb}$ and $\phi_0$, which depend strongly on the speaker’s geometric construction and the materials used in the voice coil, the diaphragm, and the enclosure. It is hard to construct an accurate mathematical model that tracks the nonlinear distribution and variation of these parameters. In Equation (1), such nonlinear features can be represented by $[v_c(t), i_c(t), x_d(t)]$, which is further used in supervised training of models used in a DL solution to learn the characteristics of a speaker.

FIG. 2 illustrates an example laser excursion measurement environment 200. The current (I) and voltage (V) are the inputs. The measured excursion (X) from the laser is used/labeled as ground truth data used in supervised learning techniques to train the machine learning models described herein.

FIG. 3A illustrates an example machine learning (ML) model 300A trained to predict displacement (i.e., excursion) of a speaker diaphragm based on a training data set of V and/or I, labeled with a measured amount of displacement of a speaker diaphragm. After training, the ML model 300A may predict the displacement of a speaker diaphragm based on V and/or I.

FIG. 3B illustrates an example 300B of an input matrix, ML model, and output vector.

Example 300B may be subject to various constraints on model size and latency. Example constraints may include the size of a one-dimensional time sequence (e.g., sampled at a sampling rate of 48 kHz), timing constraints (e.g., 10 ms scheduling time), input length (e.g., 256 samples provided as input), and the like. The output of the ML model generally includes a prediction of diaphragm excursion given the inputs of V and/or I. In some aspects, the output may be generated using mixed quantization (8 and 16 bits) and may be performed by a central processing unit (CPU) or offloaded to other processors, such as a graphics processing unit (GPU) or a neural processing unit (NPU).
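As a quick check on the example constraints above, the following arithmetic sketch compares the duration of a 256-sample window at 48 kHz with the 10 ms scheduling time. The constant and function names are illustrative assumptions and do not come from the disclosure.

```python
# Example values taken from the text; the names themselves are illustrative.
SAMPLE_RATE_HZ = 48_000      # example sampling rate
INPUT_LENGTH = 256           # example number of input samples
SCHEDULING_TIME_MS = 10.0    # example scheduling time


def window_duration_ms(num_samples: int, sample_rate_hz: int) -> float:
    """Duration of num_samples at sample_rate_hz, in milliseconds."""
    return 1000.0 * num_samples / sample_rate_hz


duration_ms = window_duration_ms(INPUT_LENGTH, SAMPLE_RATE_HZ)
samples_per_schedule = int(SAMPLE_RATE_HZ * SCHEDULING_TIME_MS / 1000.0)

print(f"window duration: {duration_ms:.2f} ms")
print(f"samples per scheduling period: {samples_per_schedule}")
```

Under these example values, one input window covers about 5.33 ms of audio, while a 10 ms scheduling period spans 480 samples.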

FIG. 4A illustrates an example preprocessing procedure 400A, according to certain aspects of the present disclosure.

A laser (or other measurement/metrology device) may be pointed at the center of the speaker in order to track the displacement (i.e., excursion) of the diaphragm for any given V and/or I used to drive the speaker. The measured diaphragm displacement is logged as $x_d(t)$. Meanwhile, the corresponding real-time current $i_c(t)$ and/or voltage $v_c(t)$ are measured, as shown in FIG. 1. Next, Equation (1) is rewritten as

$$x_d(t) = f\big(v_c(t), i_c(t)\big) \qquad (2)$$

where $f(\cdot)$ is the function that represents the mechanical characteristics of the speaker. For example, in one voltage-controlled speaker, voltage is the source input, encoded by the voice content, which causes the diaphragm to vibrate and from which excursion from a base plane can be measured. The mechanical response is embedded in the feedback current. Once a model can be trained/optimized to represent $f(\cdot)$ (e.g., the mechanical characteristics of the speaker) stably, real-time logged current and/or voltage can be used to predict the corresponding diaphragm displacement based on the trained model, as shown in Equation (3):

$$\hat{x}_d(t) = \hat{f}\big(v_c(t), i_c(t)\big) \qquad (3)$$

where the motivation is to learn $\hat{f}(\cdot)$ based on the logged dataset, and to predict diaphragm displacement using the model, given the real-time $v_c(t)$ and/or $i_c(t)$.

The measured diaphragm excursion is generally impacted by noise and the location and environment in which a speaker is installed. As discussed, continuous long-term large excursion may cause progressively more serious damage to the diaphragm. To prevent, or at least reduce, damage to speaker diaphragms, aspects of the present disclosure may use direct current (DC) drift prediction, where a low-pass filter (e.g., a second-order Butterworth filter or other second-order filter) with a relatively low cutoff frequency (e.g., 10 Hz) may be used to extract the DC component of the measured diaphragm displacement, which, as discussed above, may be used as ground-truth data associated with a V and/or I measurement for training the models discussed herein. Moreover, because the current/voltage logging and the laser measurements utilize separate clocks, synchronization may be implemented to temporally align a sequence of V and/or I data with the corresponding measurements. Cross-correlation between current and measured excursion may be used to time shift the data for accurate training of the model.

FIG. 4B illustrates an example plot 400B illustrating a correlation between current I and displacement X. The plot illustrates synchronization between I and the measured excursion. In some cases, V/I monitoring and laser measurement may be deployed at two independent parts. Correlation between I and X is illustrated at 8 kHz.
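The cross-correlation synchronization described above can be sketched as follows. This is a minimal illustration assuming NumPy, a constant integer-sample offset between the two logs, and synthetic data; the function names `estimate_lag` and `align` are hypothetical and not part of the disclosure.

```python
import numpy as np


def estimate_lag(current: np.ndarray, excursion: np.ndarray) -> int:
    """Return the lag (in samples) by which `excursion` trails `current`."""
    i0 = current - current.mean()
    x0 = excursion - excursion.mean()
    xcorr = np.correlate(x0, i0, mode="full")
    # Index 0 of the "full" output corresponds to lag -(len(i0) - 1).
    return int(np.argmax(xcorr)) - (len(i0) - 1)


def align(current: np.ndarray, excursion: np.ndarray, lag: int):
    """Trim both sequences so that sample k of each refers to the same instant."""
    if lag >= 0:
        cur, exc = current[: len(current) - lag], excursion[lag:]
    else:
        cur, exc = current[-lag:], excursion[: len(excursion) + lag]
    n = min(len(cur), len(exc))
    return cur[:n], exc[:n]


# Synthetic demonstration: the excursion log is a delayed, scaled, noisy copy of the current.
rng = np.random.default_rng(0)
current = rng.standard_normal(2000)
true_lag = 37
excursion = 0.5 * np.concatenate([np.zeros(true_lag), current[:-true_lag]])
excursion = excursion + 0.01 * rng.standard_normal(len(excursion))

lag = estimate_lag(current, excursion)
cur_aligned, exc_aligned = align(current, excursion, lag)
print("estimated lag:", lag)
```

A real speaker response is not a pure delay, so the correlation peak would be broader than in this synthetic case, and clock drift between the two loggers would call for more than a single fixed shift.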

FIG. 4C illustrates an example correlation 400C for an input, filter, and extracted frequency. As illustrated, a raw measurement (e.g., sampled at 48 kHz) may be input to a filter, which may reduce noise while maintaining features of the measured excursion. In some cases, low frequency may have an impact on heating. In some cases, the filter may comprise a low-pass filter (e.g., a Butterworth filter), which may extract the DC value in the measured excursion.
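The DC extraction step can be illustrated with a second-order Butterworth low-pass filter at a 10 Hz cutoff. The sketch below assumes NumPy and realizes the filter as a standard bilinear-transform biquad with Q = 1/sqrt(2); the disclosure does not specify an implementation, and the test signal (a 0.05 mm offset plus a 1 kHz tone) is synthetic.

```python
import numpy as np


def butterworth2_lowpass_coeffs(cutoff_hz: float, sample_rate_hz: float):
    """Biquad coefficients for a second-order Butterworth low-pass (Q = 1/sqrt(2))."""
    w0 = 2.0 * np.pi * cutoff_hz / sample_rate_hz
    alpha = np.sin(w0) / (2.0 * (1.0 / np.sqrt(2.0)))
    cos_w0 = np.cos(w0)
    b = np.array([(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2])
    a = np.array([1 + alpha, -2 * cos_w0, 1 - alpha])
    return b / a[0], a / a[0]


def biquad_filter(x: np.ndarray, b: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Direct-form I filtering of x with normalized coefficients (a[0] == 1)."""
    y = np.zeros(len(x))
    x1 = x2 = y1 = y2 = 0.0
    for n, xn in enumerate(x):
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y[n] = yn
    return y


fs = 48_000.0
t = np.arange(int(0.5 * fs)) / fs
# Synthetic excursion: a 0.05 mm DC offset plus a 1 kHz tone of 0.3 mm amplitude.
raw_excursion_mm = 0.05 + 0.3 * np.sin(2 * np.pi * 1000.0 * t)

b, a = butterworth2_lowpass_coeffs(10.0, fs)
dc_mm = biquad_filter(raw_excursion_mm, b, a)
print(f"extracted DC after settling: {dc_mm[-1]:.4f} mm")
```

The filter has unity gain at 0 Hz, so after the initial transient the output settles near the DC offset while the 1 kHz component is strongly attenuated.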

FIG. 5 illustrates an example Fourier neural network model 500, according to certain aspects of the present disclosure.

Mathematical models of speaker diaphragm excursion may not allow for speakers to be driven to full potential; however, these models illustrate that DC drift can be associated with some unknown frequency and the corresponding harmonic components, which are closely tied to the mechanical design of the speaker. Aspects of the present disclosure leverage DL solutions to extract these frequencies and to predict DC drift.

In the training stage, for an input sequence with $N$ samples, the number of state variables is $2N$, including the voltage and current components, $s_n = \{i_n, v_n\}$, written as $s = (s_1, s_2, \ldots, s_N) \in \mathbb{R}^{2N}$. The output prediction $x_N$ is mapped to the time stamp $t$ of the last sample in the sequence. Each sample used to train the model is defined as $\{s, x_N\}$.
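The construction of training samples $\{s, x_N\}$ can be sketched as a sliding window over the logged sequences. The sketch assumes NumPy, a window stride of one sample, and an interleaved layout $(i_1, v_1, \ldots, i_N, v_N)$ for $s$; the stride, the layout, and the function name `make_samples` are assumptions for illustration.

```python
import numpy as np


def make_samples(i: np.ndarray, v: np.ndarray, x: np.ndarray, n: int):
    """Build (s, x_N) pairs: each s holds n (current, voltage) pairs and is
    labeled with the excursion at the last sample of its window."""
    num = len(i) - n + 1
    s = np.empty((num, 2 * n))
    labels = np.empty(num)
    for k in range(num):
        window = np.stack([i[k:k + n], v[k:k + n]], axis=1)  # rows are s_n = {i_n, v_n}
        s[k] = window.reshape(-1)                            # flatten to a vector in R^(2N)
        labels[k] = x[k + n - 1]                             # x_N at the window's last sample
    return s, labels


# Tiny demonstration with recognizable values.
i_log = np.arange(10, dtype=float)
v_log = 100.0 + np.arange(10)
x_log = 1000.0 + np.arange(10)
s, x_n = make_samples(i_log, v_log, x_log, n=4)
print(s.shape, x_n.shape)
```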

In some aspects, a Fourier Attention Operator Layer (FAOL) may be used to extract the effective frequency components for a given input sequence of voltage and/or current components. In the FAOL, multi-head attention is embedded, and each complex value is reorganized into two real parts, which are concatenated in the channel domain. After attention processing, the channels may be combined to restore a complex value. In some aspects, several FAOL blocks may be concatenated, which may aid in extracting the harmonic features for a given sequence of samples. Further, a skip path in the time domain may be used to restore frequency parts of the input sequence that may have been previously discarded. In some aspects, the overall structure of a fast Fourier transform network may include a 1-dimensional convolutional layer that increases the size of the channel feature space, an FAOL configured to extract features from an input sequence, and an average pooling layer to down-sample the extracted features into a smaller space.

Consider a scenario in which there are $J$ Fourier Attention Operator Layers (FAOLs) in the neural network. The output of each layer is $g_j$ for $j = 1, 2, 3, \ldots, J$. For the input of each layer, a discrete or fast Fourier transform $\mathcal{F}$ may be performed to convert the input time-domain samples into the frequency domain. A parameterized multi-head self-attention block may be used to learn in the frequency domain, and the time-domain sequences are then recovered based on an inverse Fourier transform $\mathcal{F}^{-1}$. This process may be referred to as a Fourier attention operator, represented by the following equation:

where the multi-head attention block learns the coefficients based on the given patches, and the weight tensor then performs a linear combination of the modes in the frequency domain. The output of the $j$-th layer adds the $\mathcal{F}^{-1}$ output to the initial time-domain sequence weighted by the conv1d operator $\psi(\cdot)$. Rectified linear unit (ReLU) activation is used along with one-dimensional convolutional operations.
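To make the data flow of one such layer concrete, the following NumPy sketch follows the steps described above: Fourier transform, attention over frequency modes with real and imaginary parts stacked as channels, per-mode weighting, inverse transform, a conv1d skip path, and ReLU. It is a simplified illustration under several assumptions that are not from the disclosure: a single attention head instead of multi-head attention, truncation to the lowest `modes` frequencies, attention applied before the weight tensor, a kernel-size-1 convolution for $\psi(\cdot)$, and hypothetical parameter names (`wq`, `wk`, `wv`, `r`, `psi`).

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def fourier_attention_layer(g: np.ndarray, params: dict, modes: int) -> np.ndarray:
    """g: (channels, length) real time-domain features; returns the same shape."""
    c, n = g.shape
    spec = np.fft.rfft(g, axis=1)[:, :modes]                    # (c, modes), complex
    # Reorganize complex values into two real parts stacked in the channel domain.
    tokens = np.concatenate([spec.real, spec.imag], axis=0).T   # (modes, 2c), real
    q, k, v = (tokens @ params[name] for name in ("wq", "wk", "wv"))
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))               # (modes, modes)
    mixed = (attn @ v).T                                        # (2c, modes)
    spec_att = mixed[:c] + 1j * mixed[c:]                       # restore complex values
    # Weight tensor: linear combination of channels for each frequency mode.
    spec_out = np.einsum("oim,im->om", params["r"], spec_att)
    full = np.zeros((c, n // 2 + 1), dtype=complex)
    full[:, :modes] = spec_out                                  # higher modes stay zero
    spectral = np.fft.irfft(full, n=n, axis=1)                  # back to the time domain
    skip = params["psi"] @ g                                    # kernel-size-1 conv1d skip path
    return np.maximum(skip + spectral, 0.0)                     # ReLU


rng = np.random.default_rng(0)
c, n, modes = 4, 256, 16
params = {
    "wq": 0.1 * rng.standard_normal((2 * c, 2 * c)),
    "wk": 0.1 * rng.standard_normal((2 * c, 2 * c)),
    "wv": 0.1 * rng.standard_normal((2 * c, 2 * c)),
    "r": 0.1 * (rng.standard_normal((c, c, modes)) + 1j * rng.standard_normal((c, c, modes))),
    "psi": 0.1 * rng.standard_normal((c, c)),
}
g = rng.standard_normal((c, n))
out = fourier_attention_layer(g, params, modes)
print(out.shape)
```

The time-domain skip path matters here because the spectral branch only carries the retained modes; the skip path lets information from the discarded frequencies reach the next layer.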

Further, as a comparison, in some aspects, a ResNet-based one-dimensional convolutional network (ConvNet) may be used, which may have a similar structure as the FFTNet discussed above (e.g., have the same input and output format, a single conv1d layer, several ResNet blocks, average pooling, and a fully connected (FC) layer to regress the predicted DC drift) .

In some cases, a fast Fourier transform (FFT) neural model complexity may include 333,184 floating-point operations and 1,725 parameters.

FIG. 6 illustrates an example model 600, in accordance with aspects of the present disclosure. In some cases, a ResNet 1D model complexity may include 3,244,096 floating-point operations and 19,073 parameters.

FIG. 7 illustrates an example batch normalization (BN) re-estimation algorithm 700, in accordance with aspects of the present disclosure.

As discussed, speakers (and the performance of these speakers) may be impacted by production and unknown mechanical characteristics. Further, different units of the same model of speaker may have varying characteristics, which may impose difficulties in accurately predicting diaphragm displacement for a speaker. Because of power and computation constraints in edge devices, such as smartphones or other devices in which speakers are included, batch normalization (BN) re-estimation may be used for online adaptation, without requiring any labeling or fine-tuning, to adapt to variations between different types of speakers and different units of a same speaker type. The BN layer is generally designed to alleviate the issue of internal covariate shift, a common problem while training a very deep neural network, and is defined as Equation (5):

$$\hat{x}_j = \frac{x_j - \mathbb{E}[X_{\cdot j}]}{\sqrt{\mathrm{Var}[X_{\cdot j}]}}, \qquad y_j = \gamma_j \hat{x}_j + \beta_j \qquad (5)$$

where $x_j$ and $y_j$ are the input/output scalars of one neuron response in one sample, $X_{\cdot j}$ denotes the $j$-th column of the input data in one BN layer, $X \in \mathbb{R}^{n \times p}$, and $j \in \{1, \ldots, p\}$. Here, $n$ denotes the batch size, and $p$ is the feature dimension. $\gamma_j$ and $\beta_j$ are parameters to be learned. Once the training is done, the parameters in BN are frozen for inference operations.

When the model is initialized on a new device or in a new scenario for inference with BN re-estimation, as shown in Algorithm 1, the input is buffered within a given window, and the mean and variance are calculated over that window. Further, to follow variation over time, a 1-tap infinite impulse response (IIR) filter is used to track the mean and variance, which is useful for the optimization in the whole space.
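The re-estimation idea described above can be sketched as follows: the learned scale and shift stay frozen, while the normalization statistics follow the buffered input through a 1-tap IIR filter. This is a minimal NumPy illustration, not a reproduction of Algorithm 1; the class name, the update form `(1 - alpha) * old + alpha * new`, and the synthetic data are assumptions.

```python
import numpy as np


class ReestimatedBatchNorm:
    """Batch-norm layer whose statistics track incoming data via a 1-tap IIR filter."""

    def __init__(self, mean, var, gamma, beta, alpha=0.1, eps=1e-5):
        self.mean = np.array(mean, dtype=float)    # running mean from training
        self.var = np.array(var, dtype=float)      # running variance from training
        self.gamma = np.array(gamma, dtype=float)  # learned scale, frozen at inference
        self.beta = np.array(beta, dtype=float)    # learned shift, frozen at inference
        self.alpha = alpha                         # IIR filter coefficient (assumed role)
        self.eps = eps

    def reestimate(self, window: np.ndarray) -> None:
        """window: (n, p) buffered inputs to this layer; no labels are needed."""
        a = self.alpha
        self.mean = (1 - a) * self.mean + a * window.mean(axis=0)
        self.var = (1 - a) * self.var + a * window.var(axis=0)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return self.gamma * (x - self.mean) / np.sqrt(self.var + self.eps) + self.beta


# Statistics from training (mean 0, variance 1) versus a new device (mean 2, variance 9).
rng = np.random.default_rng(0)
bn = ReestimatedBatchNorm(mean=np.zeros(3), var=np.ones(3), gamma=np.ones(3), beta=np.zeros(3))
for _ in range(200):
    window = 2.0 + 3.0 * rng.standard_normal((256, 3))
    bn.reestimate(window)

normalized = bn(window)
print("tracked mean:", bn.mean.round(2), "tracked variance:", bn.var.round(2))
```

A larger `alpha` follows new statistics faster but is noisier; a smaller one is smoother but adapts slowly, which is one reason a single best coefficient can be hard to pick.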

Pseudocode describing an algorithm for predicting diaphragm displacement is illustrated below, in accordance with aspects of the present disclosure:

FIG. 8 illustrates an example plot 800 illustrating an L1 loss and a scaled fourth-power loss.

In the experiments, three typical scenarios are considered: normal, heating, and DC injection. Fourteen units from two different speaker types (SBS2 and AAC) are used to collect the data. These are further split: 8 units for training, 4 units for validation, and 2 units for testing.

Three methods are verified: a first method utilizes an FFTNet with 3 FFT attention blocks, and involves 333k floating-point operations and 1.7k parameters; a second method utilizes a ConvNet with 4 ResNet blocks, involving 3244k floating-point operations and 19k parameters; a third method uses a digital signal processor (DSP) with limited operations and a negligible number of parameters in terms of memory cost. The DSP may still leverage the training and validation dataset to fine-tune the algorithm.

As large DC drift values generally lead to more serious damage to the speaker diaphragm, an L1 loss threshold can be used so that small loss values may be considered noise, while loss values larger than the threshold can be enhanced. A scaled fourth-power loss can be used in the training procedure to focus on large L1 loss values, as shown in the following expression:

where $S$ is the batch size, $x_{out,i}$ is the prediction of the $i$-th sample for $i = 1, 2, \ldots, S$, and $\delta$ adjusts the point at which the loss is identical to the L1 loss.
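The behavior described above can be illustrated with one possible form of a scaled fourth-power loss. The expression used here, the batch mean of $\delta \, (|e_i| / \delta)^4$ with $e_i$ the prediction error of sample $i$, is an assumption chosen so that the per-sample loss equals the L1 loss exactly at $|e_i| = \delta$; the exact expression in the disclosure may differ. The sketch assumes NumPy.

```python
import numpy as np


def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error over the batch."""
    return float(np.mean(np.abs(pred - target)))


def scaled_fourth_power_loss(pred: np.ndarray, target: np.ndarray, delta: float) -> float:
    """Assumed form: mean of delta * (|error| / delta)**4 over the batch.

    Errors below delta are suppressed relative to L1, errors above delta are amplified.
    """
    err = np.abs(pred - target)
    return float(np.mean(delta * (err / delta) ** 4))


delta = 0.1                      # illustrative value, in mm
target = np.zeros(1)
at_delta = scaled_fourth_power_loss(np.array([delta]), target, delta)
below = scaled_fourth_power_loss(np.array([0.5 * delta]), target, delta)
above = scaled_fourth_power_loss(np.array([2.0 * delta]), target, delta)
print(at_delta, below, above)
```

With this form, an error of half of $\delta$ contributes one sixteenth of $\delta$, while an error of twice $\delta$ contributes sixteen times $\delta$, so training gradients concentrate on the large-drift samples.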

As illustrated, predicted diaphragm displacement can track the peak DC jitter with small or no residual loss. For small DC jitter, which is impacted by random noise (e.g., from the circuit or mechanical diaphragm noise), the variation of the diaphragm displacement is hard to track. The maximum loss occurs when there is DC cliffing, where, in an ideal system, a zero-valued input (e.g., power off or no voice, $v_c(t) = i_c(t) = 0$) has a corresponding excursion of zero. However, due to mechanical constraints of a speaker diaphragm, diaphragm displacement reduces to zero only gradually and is not fully aligned with electrical signal control. Although such cliffing leads to large loss values, such cliffing would not damage the speaker, as the amount or magnitude of diaphragm excursion is decreasing (leading to less mechanical stress on the diaphragm).

Using a 32-bit floating point FFTNet and a testing sequence, the predicted DC drift sequence can track the ground truth DC variation, and the maximum L1 loss may be a value less than a target displacement value (e.g., a measured displacement of 0.0478 mm, less than the target displacement of 0.1 mm) .

In Table 1, compared to ConvNet, FFTNet shows more promising performance. However, FFT processing may be computationally complex and may be challenging to deploy in a real hardware device, e.g., accelerated by NPU. To allow for an FFTNet to be used in predicting speaker diaphragm excursion, high precision (e.g., 32-bit floating point) models may be reserved for complex scenarios, such as a DC injection scenario, which may be considered as a highly challenging corner-case scenario against which a speaker protection block is to protect.

Model	Mean (mm)	Max (mm)
FFTNet	0.0077	0.2169
ConvNet	0.0091	0.2314

Table 1: L1 Loss Comparison for DC Injection Scenario

Table 2 illustrates verification of BN re-estimation for online adaptation, where one model, trained based on a specific type of speaker (e.g., an SBS2 speaker), is used to predict diaphragm displacement for a different type of speaker (e.g., an AAC speaker). One AAC unit is used to verify the performance of the models discussed herein and also to further explore the impact of different filter coefficients α. As shown in Table 2, such adaptation methods can largely improve the performance of machine learning models used to predict speaker diaphragm displacement. For example, compared to the baseline, α=0.1 can achieve a 21% gain in the maximum loss. It should be noted, however, that it may be difficult to identify an optimal value of α, as the optimal value may be highly dependent on the data/model weight distribution. Although α=0.1 achieves the best maximum loss, the corresponding mean loss may be higher than the mean loss for other values of α.

α=	baseline	0.1	0.05	0.001	0.0001
Mean (mm)	0.0081	0.0035	0.0024	0.001	0.0028
Max (mm)	0.7466	0.4179	0.5849	0.6802	0.7055

Table 2: Comparison of ConvNet FP32 performance in online adaptation scenarios between different speakers

In some cases, model quantization may be used to reduce the complexity when deploying models in the edge devices. Here, AIMET can be used to perform quantization (e.g., to 8-bit integer (INT8) or some other level of quantization) . As FFT operations are not supported in the AIMET, the ConvNet described herein may be quantized, and FFTNet need not be used. In the experiment, two separate models are designed specifically for SBS2 and AAC.
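To illustrate what 8-bit quantization does to a set of weights, the sketch below applies generic symmetric per-tensor INT8 quantization in NumPy. It does not use or imitate the AIMET API, and the scheme (symmetric, per-tensor, round-to-nearest) and function names are assumptions chosen for simplicity.

```python
import numpy as np


def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the INT8 codes."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
weights = rng.standard_normal(1000).astype(np.float32)   # stand-in for one layer's weights
q_weights, scale = quantize_int8(weights)
max_error = float(np.max(np.abs(weights - dequantize(q_weights, scale))))
print(f"scale = {scale:.5f}, max round-trip error = {max_error:.5f}")
```

The round-trip error of each weight is bounded by about half the quantization step, which gives a rough sense of why INT8 accuracy can degrade slightly relative to FP32.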

As shown in Table 3, compared to the DSP, the FP32 ConvNet shows a large gain in the mean loss for the AAC speaker, and the maximum loss is close to, but larger than, the target (0.1 mm). Further, after 8-bit quantization, INT8 performance is slightly degraded compared to the FP32 baseline, with the maximum L1 loss increasing from 0.1121 mm to 0.1298 mm, but is still much better than the DSP solution. Inference performance for the SBS2 speaker leads to a similar conclusion as for the AAC speaker; however, the mean loss may be somewhat worse than that of the DSP.

Speaker	Model	Mean (mm)	Max (mm)
AAC	DSP	0.0140	0.2569
AAC	ConvNet FP32	0.0038	0.1121
AAC	ConvNet INT8	0.0076	0.1298
SBS2	DSP	0.0020	0.2711
SBS2	ConvNet FP32	0.0032	0.13171
SBS2	ConvNet INT8	0.0046	0.1408

Table 3: L1 loss results for different speakers and different prediction networks

FIG. 9 illustrates an example plot 900 illustrating a predicted value, a ground truth value, and a residual value.

FIG. 10 illustrates an example system implementation 1000, in accordance with aspects of the present disclosure.

Aspects of the present disclosure provide an end-to-end pipeline to predict DC drift which, as discussed, may be correlated with speaker diaphragm displacement (i.e., excursion) . An attention mechanism may be used to extract frequency features from an input of V and/or I, which shows better performance than the ConvNet and the DSP solutions. Further, BN re-estimation may be enabled for online adaptation when the model is applied to new scenarios (e.g., applied to speaker protection for different speaker types) .

Example Operations for ML-based Diaphragm Excursion Prediction for Speaker Protection

FIG. 11 shows an example of a method 1100 for ML-based diaphragm excursion prediction for speaker protection, in accordance with aspects of the present disclosure. In some examples, the method 1100 may be performed by a device, such as the device 1300 illustrated in FIG. 13.

As illustrated, method 1100 begins at block 1110 with receiving an indication of one or more parameters associated with driving a speaker. In some cases, the operations of this block refer to, or may be performed by, a component (e.g., 1324A-1324C) of a memory 1324 as described with reference to FIG. 13.

Method 1100 then proceeds to block 1120 with predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters. In some cases, the operations of this block refer to, or may be performed by, a component (e.g., 1324A-1324C) of a memory 1324 as described with reference to FIG. 13.

Method 1100 then proceeds to block 1130 with taking one or more actions based on the predicted displacement. In some cases, the operations of this block refer to, or may be performed by, one or more computer-executable components (e.g., 1324A-1324C) of a memory 1324 as described with reference to FIG. 13.
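The three blocks of method 1100 can be sketched end to end as follows. The action taken here (scaling the drive signal by `limit_mm / predicted`) is only one illustrative choice and assumes excursion scales roughly linearly with drive level; the function name `protection_gain`, the stub predictor, and the numeric values are hypothetical.

```python
from typing import Callable, Sequence


def protection_gain(
    samples: Sequence[float],
    predict: Callable[[Sequence[float]], float],
    limit_mm: float,
) -> float:
    """Receive drive parameters, predict displacement, and return a gain in (0, 1]."""
    predicted_mm = abs(predict(samples))   # block 1120: ML-based prediction
    if predicted_mm <= limit_mm:
        return 1.0                         # block 1130: no action needed
    return limit_mm / predicted_mm         # block 1130: attenuate the drive signal


def stub_model(samples: Sequence[float]) -> float:
    """Stand-in for a trained model: pretends excursion tracks the mean drive level."""
    return 0.02 * sum(samples) / len(samples)


quiet = [1.0] * 256    # block 1110: received V and/or I samples (synthetic)
loud = [10.0] * 256

gain_quiet = protection_gain(quiet, stub_model, limit_mm=0.1)
gain_loud = protection_gain(loud, stub_model, limit_mm=0.1)
print(gain_quiet, gain_loud)
```

Because the gain is only reduced when the predicted excursion exceeds the limit, quiet passages pass through unattenuated, in contrast to a conservative fixed EQ filter.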

FIG. 12 shows an example of a method 1200 for training a machine learning model, which may be used in speaker protection tasks, to predict speaker diaphragm displacement (excursion) , in accordance with aspects of the present disclosure. In some examples, the method 1200 may be performed by a device, such as the device 1400 illustrated in FIG. 14.

As illustrated, method 1200 begins at block 1210 with generating a training data set mapping an indication of one or more parameters associated with driving a speaker to an indication of a displacement of a diaphragm of the speaker as the diaphragm moves due to the speaker being driven based on the one or more parameters. In some cases, the operations of this block refer to, or may be performed by, a component (e.g., 1424A-1424B) of a memory 1424 as described with reference to FIG. 14.

Method 1200 then proceeds to block 1220 with training a machine learning model, based on the training data set, to predict the displacement of the diaphragm. In some cases, the operations of this block refer to, or may be performed by, a component (e.g., 1424A-1424B) of a memory 1424 as described with reference to FIG. 14.
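The two blocks of method 1200 can be illustrated with a deliberately small stand-in: a linear model trained by gradient descent on synthetic data. The linear model replaces the ConvNet or FFTNet purely for brevity, and the synthetic features, labels, learning rate, and iteration count are all assumptions; the sketch assumes NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Block 1210 (stand-in): a data set mapping drive parameters to measured displacement.
features = rng.standard_normal((512, 8))                      # e.g., windows of V and/or I samples
true_weights = rng.standard_normal(8)                         # unknown speaker response (synthetic)
labels = features @ true_weights + 0.01 * rng.standard_normal(512)

# Block 1220 (stand-in): fit the model by gradient descent on the mean squared error.
weights = np.zeros(8)
learning_rate = 0.1
for _ in range(500):
    residual = features @ weights - labels
    gradient = features.T @ residual / len(labels)
    weights -= learning_rate * gradient

mse = float(np.mean((features @ weights - labels) ** 2))
print(f"final training MSE: {mse:.6f}")
```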

Example Processing Systems for ML-based Diaphragm Excursion Prediction for Speaker Protection

FIG. 13 depicts an example processing system 1300 for ML-based diaphragm excursion prediction for speaker protection, such as described herein for example with respect to FIG. 11.

Processing system 1300 includes a central processing unit (CPU) 1302, which in some examples may be a multi-core CPU. Instructions executed at the CPU 1302 may be loaded, for example, from a program memory associated with the CPU 1302 or may be loaded from a memory 1324.

Processing system 1300 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 1304, a digital signal processor (DSP) 1306, a neural processing unit (NPU) 1308, a multimedia processing unit 1310, and a wireless connectivity component 1312.

An NPU, such as 1308, is generally a specialized circuit configured for implementing control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs), deep neural networks (DNNs), random forests (RFs), and the like. An NPU may sometimes alternatively be referred to as a neural signal processor (NSP), tensor processing unit (TPU), neural network processor (NNP), intelligence processing unit (IPU), vision processing unit (VPU), or graph processing unit.

NPUs, such as 1308, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models. In some examples, a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC), while in other examples the plurality of NPUs may be part of a dedicated neural-network accelerator.

NPUs may be optimized for training or inference, or in some cases configured to balance performance between both. For NPUs that are capable of performing both training and inference, the two tasks may still generally be performed independently.

NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged), iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance. Generally, optimizing based on a wrong prediction involves propagating back through the layers of the model and determining gradients to reduce the prediction error.
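As a non-limiting illustration of propagating a prediction error back through the layers of a model, the sketch below trains a very small network with one hidden tanh layer, computing the gradients by hand for each sample and adjusting the weights and biases accordingly. The network size, learning rate, random seed, and target function are hypothetical and chosen only to keep the example short.

```python
import math
import random
from typing import List, Sequence, Tuple


def train_tiny_mlp(
    samples: Sequence[Tuple[float, float]],
    hidden: int = 4,
    epochs: int = 200,
    lr: float = 0.05,
    seed: int = 0,
) -> Tuple[Tuple[List[float], List[float], List[float], float], List[float]]:
    """Train y = w2 . tanh(w1 * x + b1) + b2 and return parameters and per-epoch loss."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    losses: List[float] = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            # Forward pass through the hidden layer and the output layer.
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            y = sum(w2[j] * h[j] for j in range(hidden)) + b2
            err = y - target
            total += 0.5 * err * err
            # Backward pass: error reaching each hidden unit, using the
            # output weights as they were during the forward pass.
            grad_h = [err * w2[j] for j in range(hidden)]
            for j in range(hidden):
                # tanh'(z) = 1 - tanh(z)**2
                grad_pre = grad_h[j] * (1.0 - h[j] * h[j])
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * grad_pre * x
                b1[j] -= lr * grad_pre
            b2 -= lr * err
        losses.append(total / len(samples))
    return (w1, b1, w2, b2), losses


# Hypothetical target: a saturating curve sampled on [-1, 1].
samples = [(i / 10.0, 0.5 * math.tanh(2.0 * i / 10.0)) for i in range(-10, 11)]
params, losses = train_tiny_mlp(samples)
```

The per-epoch loss list makes it possible to check that the updates are reducing the prediction error over time.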

NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process this new piece through an already trained model to generate a model output (e.g., an inference).

In one implementation, NPU 1308 is a part of one or more of CPU 1302, GPU 1304, and/or DSP 1306.

In some examples, wireless connectivity component 1312 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE), fifth generation connectivity (e.g., 5G or NR), Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards. Wireless connectivity component 1312 is further connected to one or more antennas 1314.

Processing system 1300 may also include one or more input and/or output devices 1322, such as screens, touch-sensitive surfaces (including touch-sensitive displays), physical buttons, speakers, microphones, and the like.

In some examples, one or more of the processors of processing system 1300 may be based on an ARM or RISC-V instruction set.

Processing system 1300 also includes memory 1324, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like. In this example, memory 1324 includes computer-executable components, which may be executed by one or more of the aforementioned processors of processing system 1300.

In particular, in this example, memory 1324 includes receiving component 1324A, diaphragm displacement predicting component 1324B, and action taking component 1324C. The depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.

Generally, processing system 1300 and/or components thereof may be configured to perform the methods described herein.

Notably, in other aspects, elements of processing system 1300 may be omitted, such as where processing system 1300 is a server computer or the like. Further, aspects of processing system 1300 may be distributed, such as between training a model and using the model to generate inferences (e.g., diaphragm displacement predictions).

FIG. 14 depicts an example processing system 1400 for training a machine learning model, which may be used in speaker protection tasks, to predict speaker diaphragm displacement (excursion), such as described herein for example with respect to FIG. 12.

Processing system 1400 includes a central processing unit (CPU) 1402, which in some examples may be a multi-core CPU. Instructions executed at the CPU 1402 may be loaded, for example, from a program memory associated with the CPU 1402 or may be loaded from a memory 1424.

Processing system 1400 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 1404, a digital signal processor (DSP) 1406, a neural processing unit (NPU) 1408, a multimedia processing unit 1410, and a wireless connectivity component 1412.

An NPU, such as 1408, is generally a specialized circuit configured for implementing control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs), deep neural networks (DNNs), random forests (RFs), and the like. An NPU may sometimes alternatively be referred to as a neural signal processor (NSP), tensor processing unit (TPU), neural network processor (NNP), intelligence processing unit (IPU), vision processing unit (VPU), or graph processing unit.

NPUs, such as 1408, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models. In some examples, a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC), while in other examples the plurality of NPUs may be part of a dedicated neural-network accelerator.

In one implementation, NPU 1408 is a part of one or more of CPU 1402, GPU 1404, and/or DSP 1406.

In some examples, wireless connectivity component 1412 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE), fifth generation connectivity (e.g., 5G or NR), Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards. Wireless connectivity component 1412 is further connected to one or more antennas 1414.

Processing system 1400 may also include one or more input and/or output devices 1422, such as screens, touch-sensitive surfaces (including touch-sensitive displays), physical buttons, speakers, microphones, and the like.

In some examples, one or more of the processors of processing system 1400 may be based on an ARM or RISC-V instruction set.

Processing system 1400 also includes memory 1424, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like. In this example, memory 1424 includes computer-executable components, which may be executed by one or more of the aforementioned processors of processing system 1400.

In particular, in this example, memory 1424 includes generating component 1424A and training component 1424B. The depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.

Generally, processing system 1400 and/or components thereof may be configured to perform the methods described herein.

Notably, in other aspects, elements of processing system 1400 may be omitted, such as where processing system 1400 is a server computer or the like. Further, aspects of processing system 1400 may be distributed, such as between training a model and using the model to generate inferences (e.g., diaphragm displacement predictions).

Example Clauses

Clause 1: A processor-implemented method comprising: receiving an indication of one or more parameters associated with driving a speaker; predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters; and taking one or more actions based on the predicted displacement.

Clause 2: The method of Clause 1, wherein the taking comprises controlling an amplitude of at least one of a voltage signal or a current signal associated with driving the speaker, based on the predicted displacement.

Clause 3: The method of Clause 1 or 2, wherein the machine learning model comprises a one-dimensional residual neural network model.

Clause 4: The method of any of Clauses 1-3, wherein the machine learning model comprises a Fourier neural network model.

Clause 5: The method of Clause 4, wherein the predicting comprises generating a Fourier transform of the one or more parameters, wherein the machine learning model comprises an attention block configured to extract features in frequency domain from the Fourier transform of the one or more parameters.

Clause 6: The method of Clause 4, wherein the machine learning model comprises a Fourier attention operator layer including a skip path associated with an inverse Fourier transform output in the time domain.

Clause 7: The method of any of Clauses 1-6, wherein the machine learning model comprises a model trained to predict the displacement of the diaphragm of the speaker based on a training data set mapping the one or more parameters to a measured displacement.

Clause 8: The method of any of Clauses 1-7, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.

Clause 9: The method of any of Clauses 1-8, further comprising adapting the machine learning model from a first speaker to a second speaker based on batch normalization.

Clause 10: A processor-implemented method comprising: generating a training data set mapping an indication of one or more parameters associated with driving a speaker to an indication of a displacement of a diaphragm of the speaker as the diaphragm moves due to the speaker being driven based on the one or more parameters; and training a machine learning model, based on the training data set, to predict the displacement of the diaphragm.

Clause 11: The method of Clause 10, wherein generating the training data set comprises filtering the indication of the displacement of the diaphragm to generate a filtered indication.

Clause 12: The method of Clause 11, wherein the filtering comprises using a Butterworth filter.

Clause 13: The method of Clause 11 or 12, wherein generating the training data set comprises correlating the filtered indication of the displacement of the diaphragm with the indication of the one or more parameters to effectively synchronize the filtered indication of the displacement with the one or more parameters.

Clause 14: The method of Clause 13, wherein the training is based on the filtered indication of the displacement effectively synchronized to the indication of the one or more parameters.

Clause 15: The method of any of Clauses 10-14, wherein the machine learning model comprises a one-dimensional residual neural network model.

Clause 16: The method of any of Clauses 10-14, wherein the machine learning model comprises a Fourier neural network model.

Clause 17: The method of any of Clauses 10-16, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.

Clause 18: A system comprising: a memory having executable instructions stored thereon; and a processor configured to execute the executable instructions to cause the system to perform the operations of any of Clauses 1-17.

Clause 19: A system comprising means for performing the operations of any of Clauses 1-17.

Clause 20: A computer-readable medium having instructions stored thereon which, when executed by a processor, perform the operations of any of Clauses 1-17.
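As a non-limiting illustration of the frequency-domain processing with a time-domain skip path referred to in the clauses above, the sketch below transforms a short input sequence with a discrete Fourier transform, applies a per-frequency weight, transforms the result back to the time domain, and adds the original input through a skip path. The per-frequency weights are a simplified stand-in for a learned frequency-domain operation; the sketch does not implement an attention block. The function names and the direct (non-fast) transform are assumptions made for brevity.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[float]) -> List[complex]:
    """Direct discrete Fourier transform (O(n^2), for illustration only)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spec: Sequence[complex]) -> List[float]:
    """Inverse DFT returning the real part of the time-domain signal."""
    n = len(spec)
    return [
        sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def fourier_layer(x: Sequence[float], spectral_weights: Sequence[complex]) -> List[float]:
    """Weight each frequency bin, return to the time domain, and add a skip path."""
    spec = dft(x)
    weighted = [w * s for w, s in zip(spectral_weights, spec)]
    # Any imaginary residue (e.g., from non-symmetric weights) is discarded here.
    mixed = idft(weighted)
    # Skip path: the time-domain input is added to the inverse-transform output.
    return [xi + mi for xi, mi in zip(x, mixed)]


x = [1.0, 2.0, 3.0, 4.0]
out_identity = fourier_layer(x, [1.0] * len(x))  # approximately 2 * x
out_blocked = fourier_layer(x, [0.0] * len(x))   # skip path only, equals x
```

With all weights set to one, the frequency-domain branch reproduces the input, so the output is approximately twice the input; with all weights set to zero, only the skip path contributes.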

Additional Considerations

The preceding description is provided to enable any person skilled in the art to practice the various embodiments described herein. The examples discussed herein are not limiting of the scope, applicability, or embodiments set forth in the claims. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments. For example, changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For instance, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. Also, features described with respect to some examples may be combined in some other examples. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.

As used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.

As used herein, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c).

As used herein, the term “determining” encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining, and the like. Also, “determining” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory), and the like. Also, “determining” may include resolving, selecting, choosing, establishing, and the like.

The methods disclosed herein comprise one or more steps or actions for achieving the methods. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims. Further, the various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software component(s) and/or module(s), including, but not limited to, a circuit, an application specific integrated circuit (ASIC), or a processor. Generally, where there are operations illustrated in figures, those operations may have corresponding counterpart means-plus-function components with similar numbering.

The following claims are not intended to be limited to the embodiments shown herein, but are to be accorded the full scope consistent with the language of the claims. Within a claim, reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more. No claim element is to be construed under the provisions of 35 U.S.C. §112(f) unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.” All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims.

Claims

1. A processor-implemented method comprising:

receiving an indication of one or more parameters associated with driving a speaker;

predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters; and

taking one or more actions based on the predicted displacement.

2. The method of claim 1, wherein the taking comprises controlling an amplitude of at least one of a voltage signal or a current signal associated with driving the speaker, based on the predicted displacement.

3. The method of claim 1, wherein the machine learning model comprises a one-dimensional residual neural network model.

4. The method of claim 1, wherein the machine learning model comprises a Fourier neural network model.

5. The method of claim 4, wherein the predicting comprises generating a Fourier transform of the one or more parameters and wherein the Fourier neural network model comprises an attention block configured to extract features in the frequency domain from the Fourier transform of the one or more parameters.

6. The method of claim 4, wherein the machine learning model comprises a Fourier attention operator layer including a skip path associated with an inverse Fourier transform output in the time domain.

7. The method of claim 1, wherein the machine learning model comprises a model trained to predict the displacement of the diaphragm of the speaker based on a training data set mapping the one or more parameters to a measured displacement.

8. The method of claim 1, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.

9. The method of claim 1, further comprising adapting the machine learning model from a first speaker to a second speaker based on batch normalization.

10. A processor-implemented method comprising:

generating a training data set mapping an indication of one or more parameters associated with driving a speaker to an indication of a displacement of a diaphragm of the speaker as the diaphragm moves due to the speaker being driven based on the one or more parameters; and

training a machine learning model, based on the training data set, to predict the displacement of the diaphragm.

11. The method of claim 10, wherein generating the training data set comprises filtering the indication of the displacement of the diaphragm to generate a filtered indication.

12. The method of claim 11, wherein the filtering comprises using a second-order Butterworth filter.

13. The method of claim 11, wherein generating the training data set comprises correlating the filtered indication of the displacement of the diaphragm with the indication of the one or more parameters to effectively synchronize the filtered indication of the displacement with the one or more parameters.

14. The method of claim 13, wherein the training is based on the filtered indication of the displacement effectively synchronized to the indication of the one or more parameters.

15. The method of claim 10, wherein the machine learning model comprises a one-dimensional residual neural network model.

16. The method of claim 10, wherein the machine learning model comprises a Fourier neural network model.

17. The method of claim 10, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.

18. A processing system comprising:

a memory comprising computer-executable instructions; and

one or more processors configured to execute the computer-executable instructions and cause the processing system to perform a method in accordance with any of claims 1-17.

19. A processing system, comprising means for performing a method in accordance with any of claims 1-17.

20. A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by one or more processors of a processing system, cause the processing system to perform a method in accordance with any of claims 1-17.

21. A computer program product embodied on a computer-readable storage medium comprising code for performing a method in accordance with any of claims 1-17.