WO2024087050A1 - Machine-learning-based diaphragm excursion prediction for speaker protection - Google Patents

Machine-learning-based diaphragm excursion prediction for speaker protection

Info

Publication number
WO2024087050A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
displacement
diaphragm
parameters
indication
Prior art date
Application number
PCT/CN2022/127609
Other languages
French (fr)
Inventor
Yuwei REN
Matthew Zivney
Yin Huang
Eddie Choy
Chirag Sureshbhai Patel
Original Assignee
Qualcomm Incorporated
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qualcomm Incorporated
Priority to PCT/CN2022/127609
Publication of WO2024087050A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/007 Protection circuits for transducers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R9/00 Transducers of moving-coil, moving-strip, or moving-wire type
    • H04R9/06 Loudspeakers

Definitions

  • aspects of the present disclosure relate to speaker diaphragm protection using machine learning techniques.
  • a speaker is an electro-acoustic transducer, generating sound from an electric signal produced by a power amplifier.
  • the voice coil of a speaker is attached to a diaphragm that is mounted on a fixed frame via a suspension.
  • a permanent magnet generates a magnetic field that is conducted to the region of the coil gap. Due to the presence of the magnetic field, an electrical current passing through the voice coil generates a force f_c, which causes the membrane to move up and down.
  • the displacement x_d of the diaphragm is the excursion, which has a limit. If the excursion limit is exceeded, the speaker exhibits nonlinear behavior, which in turn manifests as distorted sound and degraded acoustic echo cancellation performance.
  • Certain aspects generally relate to machine-learning (ML)-based diaphragm excursion prediction for speaker protection.
  • the method generally includes receiving an indication of one or more parameters associated with driving a speaker, predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters, and taking one or more actions based on the predicted displacement.
  • processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and apparatus comprising means for performing the aforementioned methods as well as those further described herein.
  • FIG. 1 depicts an example speaker cross-section showing diaphragm displacement.
  • FIG. 2 illustrates an example laser excursion measurement environment.
  • FIG. 3A illustrates an example machine learning (ML) model to predict displacement.
  • FIG. 3B illustrates an example input matrix, ML model, and output vector.
  • FIG. 4A illustrates an example preprocessing procedure.
  • FIG. 4B illustrates an example plot illustrating a correlation between current I and displacement X.
  • FIG. 4C illustrates an example input, filter, and extracted frequency.
  • FIG. 5 illustrates an example Fourier neural network model.
  • FIG. 6 illustrates an example model, in accordance with certain aspects of the present disclosure.
  • FIG. 7 illustrates an example batch normalization (BN) re-estimation algorithm, in accordance with certain aspects of the present disclosure.
  • FIG. 8 illustrates an example plot illustrating an L1 loss (e.g., a Least Absolute Deviation loss) and a scaled fourth-power loss.
  • FIG. 9 illustrates an example plot illustrating a predicted value, a ground truth value, and a residual value.
  • FIG. 10 illustrates an example system implementation, in accordance with certain aspects of the present disclosure.
  • FIG. 11 illustrates an example method flow diagram, in accordance with certain aspects of the present disclosure.
  • FIG. 12 illustrates an example method flow diagram, in accordance with certain aspects of the present disclosure.
  • FIG. 13 illustrates an example device, in accordance with certain aspects of the present disclosure.
  • FIG. 14 illustrates an example device, in accordance with certain aspects of the present disclosure.
  • the APPENDIX describes various aspects of the present disclosure.
  • aspects of the present disclosure provide apparatuses, methods, processing systems, and computer-readable mediums for speaker diaphragm excursion prediction using machine learning.
  • Speaker protection generally leverages the playback signal to prevent over-excursion while maintaining maximum loudness, e.g. for speakerphone or gaming use in tiny loudspeakers, such as are found in smartphones, tablets, laptops, and other portable devices.
  • One challenge is to model and predict over-excursion with highly nonlinear characteristics.
  • aspects of the present disclosure utilize deep learning (DL) techniques to accurately predict nonlinear excursion of speaker diaphragms when driven.
  • Feedback current and/or voltage may be sampled as input, and a laser may be used to measure diaphragm excursion. The sampled current and/or voltage are labeled (or otherwise correlated) with the measured diaphragm excursion.
  • a convolutional neural network (ConvNet, or CNN) is designed as the baseline, and a fast Fourier transform network (FFTNet) may be used to explore the dominant low-frequency and the unknown harmonic(s).
  • batch normalization (BN) re-estimation is enabled to achieve online adaptation, and quantization (8-bit integer (INT8) quantization) based on the artificial intelligence model efficiency toolkit (AIMET) is used to further reduce the computational complexity involved in predicting diaphragm excursion from current and voltage.
  • Certain aspects of the present disclosure can keep more than 99% of the residual DC values below a diaphragm excursion of 0.1 mm, which may exceed the performance of various digital signal processor (DSP) solutions, as verified on two speakers across three scenarios.
  • Some solutions for speaker protection involve building a speaker protection block, which first monitors the current and voltage, and then analyzes the buffered samples and predicts the excursion status. Once the predicted excursion is larger than a threshold, the speaker protection block is triggered to attenuate the input power or modify the source signal to decrease diaphragm excursion. However, it is hard to precisely predict diaphragm excursion based on the monitored current and/or voltage. Thus, one simple technique to control diaphragm excursion may use traditional equalization (EQ) filters to attenuate an input signal. These traditional EQ filters are generally designed conservatively due to the wide range of operating factors (e.g., speaker variations, various types of audio signals with large dynamic ranges, etc.) in which various speakers operate.
  • an EQ filter for nonlinear distortion in direct-radiator loudspeakers in a closed cabinet may be implemented as an exact inverse of an electro-mechanical model of the loudspeaker. Estimates generated by the digital loudspeaker model may be used to predict the excursion based on the input voltage, and the predicted excursion may be controlled using dynamic range compression in the excursion domain. These approaches, however, do not push the speaker to its true limit. For example, EQ filters still attenuate the output audio even when the audio-signal energy is low and diaphragm excursion is within a defined limit or threshold, thereby degrading the audio quality and the volume of the audio.
  • DL approaches can be used in modeling the behaviors of a voice coil actuator (VCA).
  • DL approaches generally incorporate a recurrent neural network (RNN) into a multi-physics simulation to enhance the computation efficiency of these DL approaches.
  • DL solutions can solve differential equations (DEs), which can partially model a diverse non-linear system, such as excursion prediction and VCA modeling.
  • Neural operators in the RNN can directly learn the mapping from any functional parametric dependence to the solution.
  • One example uses physics-informed neural networks that directly solve the ordinary DEs.
  • Another example formulates the neural operator in Fourier space by parameterizing the integral kernel.
  • DL solutions may be highly dependent on the training data set and may be subject to overfitting, especially with highly variable data sets, such as those associated with diaphragm excursion characteristics.
  • aspects of the present disclosure provide DL techniques to explore effective features for speaker protection.
  • a diaphragm excursion measurement setting is established in which a laser tracks the corresponding excursion, and a comprehensive preprocessing pipeline prepares the dataset.
  • a model based on ConvNet and/or FFTNet, for example, may be trained and verified based on two typical speakers.
  • BN re-estimation for online adaptation and quantization in AIMET may also be implemented.
  • FIG. 1 depicts a cross-section of an example speaker 100 showing displacement X (also referred to as “excursion”) of the speaker’s diaphragm.
  • the speaker 100 represents the transduction of electrical energy to mechanical energy.
  • a continuous-time model for the electrical behavior is shown as:
  • v_c(t) is the voltage input across the terminals of the voice coil
  • i_c(t) is the voice coil current
  • R_eb is the blocked electrical resistance
  • φ_0 is the transduction coefficient at the equilibrium state x_d(t), which is the diaphragm excursion.
  • Such mechanical characteristics of the speaker may be mostly determined by the parameters R_eb and φ_0, which depend heavily on the speaker’s geometric construction and the materials used in the voice coil, the diaphragm, and the enclosure. It is hard to construct an accurate mathematical model that tracks the nonlinear distribution and variation.
  • Based on Equation (1), such nonlinear features can be represented by [v_c(t), i_c(t), x_d(t)], which is further used in supervised training of models used in a DL solution to learn the characteristics of a speaker.
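The symbol definitions above are consistent with a lumped voice-coil model that neglects coil inductance. Under that assumption, Equation (1) would plausibly take the following form, where \dot{x}_d(t) denotes the diaphragm velocity and \phi_0 is written for the transduction coefficient (the notation and the exact form are assumptions, not taken from the text):

```latex
v_c(t) = R_{eb}\, i_c(t) + \phi_0\, \dot{x}_d(t) \tag{1}
```

Read this way, the voltage and current observed at the terminals carry information about diaphragm motion through the velocity term, which is why the triple [v_c(t), i_c(t), x_d(t)] can serve as supervised training data.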
  • FIG. 2 illustrates an example laser excursion measurement environment 200.
  • the current (I) and voltage (V) are the inputs.
  • the measured excursion (X) from the laser is used/labeled as ground truth data used in supervised learning techniques to train the machine learning models described herein.
  • FIG. 3A illustrates an example machine learning (ML) model 300A trained to predict displacement (i.e., excursion) of a speaker diaphragm based on a training data set of V and/or I, labeled with a measured amount of displacement of a speaker diaphragm.
  • the ML model 300A may predict the displacement of a speaker diaphragm based on V and/or I.
  • FIG. 3B illustrates an example 300B of an input matrix, ML model, and output vector.
  • Example 300B may be constrained by various constraints on model size and latency.
  • Example constraints may include the size of a one-dimensional time sequence (e.g., sampled at a sampling rate of 48 kHz), timing constraints (e.g., 10 ms scheduling time), input length (e.g., 256 samples provided as input), and the like.
  • the output of the ML model generally includes a prediction of diaphragm excursion given the inputs of V and/or I. In some aspects, the output may be generated using mixed quantization (8 and 16 bits) and may be performed by a central processing unit (CPU) or offloaded to other processors, such as a graphics processing unit (GPU) or a neural processing unit (NPU).
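The example constraints can be checked with a little arithmetic. The sketch below uses the example figures from the text (48 kHz, 256 samples, 10 ms); the 2 x 256 layout of the input matrix and the function names are illustrative assumptions.

```python
SAMPLE_RATE_HZ = 48_000   # example sampling rate from the text
FRAME_LEN = 256           # example input length from the text
SCHEDULE_MS = 10.0        # example scheduling time from the text


def frame_duration_ms(n_samples: int, rate_hz: int) -> float:
    """Duration covered by n_samples at the given sampling rate, in ms."""
    return 1000.0 * n_samples / rate_hz


def build_input_matrix(voltage, current, frame_len=FRAME_LEN):
    """Stack the latest frame_len samples of V and I into a 2 x frame_len matrix.

    The row layout (V first, I second) is an assumption for illustration.
    """
    if len(voltage) < frame_len or len(current) < frame_len:
        raise ValueError("need at least frame_len samples of V and I")
    return [list(voltage[-frame_len:]), list(current[-frame_len:])]


window_ms = frame_duration_ms(FRAME_LEN, SAMPLE_RATE_HZ)
samples_per_schedule = int(SAMPLE_RATE_HZ * SCHEDULE_MS / 1000.0)
```

With these figures, one 256-sample input spans about 5.3 ms of signal, while 480 new samples arrive in each 10 ms scheduling interval.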
  • FIG. 4A illustrates an example preprocessing procedure 400A, according to certain aspects of the present disclosure.
  • a laser may be pointed at the center of the speaker in order to track the displacement (i.e., excursion) of the diaphragm for any given V and/or I used to drive the speaker.
  • the measured diaphragm displacement is logged as x_d(t).
  • the corresponding real-time current i_c(t) and/or voltage v_c(t) are measured, as shown in FIG. 1.
  • Equation (1) is rewritten as
  • f(·) is the function representing the mechanical characteristics of the speaker.
  • voltage is the source input, encoded by the voice content, which causes the diaphragm to vibrate and from which excursion from a base plane can be measured.
  • the mechanical response is embedded in the feedback current.
  • the motivation is to learn based on the logged dataset, and to predict diaphragm displacement using the model, given the real-time v c (t) and/or i c (t) .
  • aspects of the present disclosure may use direct current (DC) drift prediction, where a low-pass filter (e.g., a second-order Butterworth filter or other second-order filter) with a relatively low cutoff frequency (e.g., 10 Hz) may be used to extract the DC component of the measured diaphragm displacement, which, as discussed above, may be used as ground-truth data associated with a V and/or I measurement for training the models discussed herein.
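A minimal sketch of this DC extraction step follows, using only the Python standard library. It builds a second-order Butterworth low-pass section from the widely used bilinear-transform biquad formulas and runs it causally over the trace; the function names and the choice of a single causal pass are assumptions, and the 10 Hz cutoff and 48 kHz rate are the example values from the text.

```python
import math


def butter2_lowpass_coeffs(cutoff_hz: float, rate_hz: float):
    """Second-order Butterworth low-pass biquad via the bilinear transform."""
    w0 = 2.0 * math.pi * cutoff_hz / rate_hz
    q = 1.0 / math.sqrt(2.0)          # Q of a 2nd-order Butterworth section
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    b = [(1.0 - cos_w0) / 2.0 / a0, (1.0 - cos_w0) / a0, (1.0 - cos_w0) / 2.0 / a0]
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0]
    return b, a


def extract_dc_drift(excursion, cutoff_hz=10.0, rate_hz=48_000.0):
    """Low-pass filter a measured excursion trace to keep its slow (DC) part."""
    b, a = butter2_lowpass_coeffs(cutoff_hz, rate_hz)
    x1 = x2 = y1 = y2 = 0.0           # filter state (previous inputs/outputs)
    out = []
    for x0 in excursion:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out
```

A causal pass delays the extracted drift slightly; for offline label generation, a forward-backward (zero-phase) pass would be one alternative.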
  • synchronization may be implemented to temporally align a sequence of V and/or I data with the corresponding measurements.
  • Cross-correlation between current and measured excursion may be used to time shift the data for accurate training of the model.
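The time alignment can be sketched as a search for the lag that maximizes the cross-correlation between the two traces. The version below is a plain-Python illustration; it assumes the laser trace lags the current trace by a non-negative number of samples, and `max_lag` is a hypothetical search bound.

```python
def best_lag(reference, delayed, max_lag):
    """Return the lag in [0, max_lag] that maximizes
    sum(reference[n] * delayed[n + lag])."""
    best, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        n = min(len(reference), len(delayed) - lag)
        score = sum(reference[i] * delayed[i + lag] for i in range(n))
        if score > best_score:
            best, best_score = lag, score
    return best


def align(reference, delayed, max_lag):
    """Drop the leading samples of `delayed` so both sequences line up in time."""
    lag = best_lag(reference, delayed, max_lag)
    n = min(len(reference), len(delayed) - lag)
    return reference[:n], delayed[lag:lag + n]
```

Once the lag is known, each window of V and/or I samples can be paired with the excursion measured at the same instant.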
  • FIG. 4B illustrates an example plot 400B illustrating a correlation between current I and displacement X.
  • the plot illustrates synchronization between I and the measured excursion.
  • V/I monitoring and laser measurement may be deployed at two independent parts. Correlation between I and X is illustrated at 8 kHz.
  • FIG. 4C illustrates an example correlation 400C for an input, filter, and extracted frequency.
  • the input may be a raw measurement (e.g., sampled at 48 kHz).
  • the filter may comprise a low-pass filter (e.g., a Butterworth filter), which may extract the DC value in the measured excursion.
  • FIG. 5 illustrates an example Fourier neural network model 500, according to certain aspects of the present disclosure.
  • the output prediction x_N is mapped to the time stamp t of the last sample in the sequence.
  • Each sample used to train the model is defined as {s, x_N}.
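As a sketch of how such training samples might be assembled, the function below pairs each window s of V and I samples with the DC drift at the window's last sample. The window length of 256 is the example input length from the text; the `hop` parameter and the function name are hypothetical.

```python
def make_training_samples(voltage, current, dc_drift, frame_len=256, hop=256):
    """Pair each window s of (V, I) samples with the DC drift x_N at its last sample."""
    n_total = min(len(voltage), len(current), len(dc_drift))
    samples = []
    for end in range(frame_len, n_total + 1, hop):
        s = [voltage[end - frame_len:end], current[end - frame_len:end]]
        x_n = dc_drift[end - 1]       # label taken at the last sample of the window
        samples.append((s, x_n))
    return samples
```

A smaller hop yields overlapping windows and therefore more, but more strongly correlated, training samples.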
  • a Fourier Attention Operator Layer may be used to extract the effective frequency components for a given input sequence of voltage and/or current components.
  • the multi-head attention is embedded, and the complex value is re-organized into two real parts, simplified for concatenation in the channel domain. After attention processing, the channels may be combined to restore a complex value.
  • several FAOL blocks may be concatenated, which may aid in extracting the harmonic features for a given sequence of samples.
  • a skip path in the time domain may be used to restore frequency components of the input sequence that may have been discarded.
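The data flow through such a block can be illustrated with a heavily simplified stand-in. The sketch below transforms a sequence to the frequency domain, keeps only the lowest modes, treats the real and imaginary parts as two real channels, recombines them into complex values, and adds a time-domain skip path. The `channel_op` placeholder stands where the attention processing would go; it, the naive DFT, and all names are assumptions and do not reproduce the FAOL itself.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(n^2), adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Inverse DFT, returning the real part of each time-domain sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]


def fourier_block(x, keep_modes, channel_op=lambda re, im: (re, im), skip=True):
    """Keep low-frequency modes, process real/imag parts as two real channels,
    recombine them, and optionally add a time-domain skip path."""
    n = len(x)
    out_spec = []
    for k, c in enumerate(dft(x)):
        # Mode k and its mirror n - k together describe one real frequency.
        if min(k, n - k) < keep_modes:
            re, im = channel_op(c.real, c.imag)   # two real channels
            out_spec.append(complex(re, im))      # restore the complex value
        else:
            out_spec.append(0j)                   # discarded frequency part
    y = idft_real(out_spec)
    return [yi + xi for yi, xi in zip(y, x)] if skip else y
```

With `keep_modes=1` and no skip path, only the mean of the frame survives; the skip path adds the original samples back so that discarded content remains available to later layers.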
  • the overall structure of a fast Fourier transform network may include a 1-dimensional convolutional layer that increases the size of the channel feature space, an FAOL configured to extract features from an input sequence, and an average pooling layer to down-sample the extracted features into a smaller space.
  • a ResNet-based one-dimensional convolutional network (ConvNet) may be used, which may have a similar structure to the FFTNet discussed above (e.g., the same input and output format, a single conv1d layer, several ResNet blocks, average pooling, and a fully connected (FC) layer to regress the predicted DC drift).
  • the complexity of a fast Fourier transform (FFT) neural model may include 333,184 floating-point operations and 1,725 parameters.
  • FIG. 6 illustrates an example model 600, in accordance with aspects of the present disclosure.
  • the complexity of a ResNet 1D model may include 3,244,096 floating-point operations and 19,073 parameters.
  • FIG. 7 illustrates an example batch normalization (BN) re-estimation algorithm 700, in accordance with aspects of the present disclosure.
  • x_j and y_j are the input/output scalars of one neuron response in one sample
  • X_{·j} denotes the j-th column of the input data in one BN layer
  • n denotes the batch size
  • p is the feature dimension.
  • γ_j and β_j are parameters to be learned. Once the training is done, the parameters in BN are frozen for inference operations.
  • for re-estimation, the input is buffered within a given window to calculate the mean and variance. To track their variation over time, a 1-tap infinite impulse response (IIR) filter is applied to the mean and variance, which is useful for optimization over the whole space.
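The re-estimation step can be sketched for a single BN channel as follows. The statistics of each buffered window are blended into the running mean and variance with a 1-tap IIR filter, while the learned scale and shift stay fixed. The class name, the coefficient name `alpha`, and its default value are assumptions for illustration.

```python
import math


class BatchNormReEstimator:
    """Tracks BN statistics for one channel with a 1-tap IIR filter."""

    def __init__(self, mean, var, gamma=1.0, beta=0.0, alpha=0.1, eps=1e-5):
        self.mean, self.var = mean, var      # statistics frozen after training
        self.gamma, self.beta = gamma, beta  # learned scale and shift (kept fixed)
        self.alpha, self.eps = alpha, eps    # alpha: assumed IIR coefficient

    def update(self, window):
        """Blend the statistics of one buffered window into the running values."""
        n = len(window)
        w_mean = sum(window) / n
        w_var = sum((v - w_mean) ** 2 for v in window) / n
        self.mean = (1.0 - self.alpha) * self.mean + self.alpha * w_mean
        self.var = (1.0 - self.alpha) * self.var + self.alpha * w_var

    def normalize(self, x):
        """Apply batch normalization with the current running statistics."""
        return self.gamma * (x - self.mean) / math.sqrt(self.var + self.eps) + self.beta
```

A larger coefficient adapts faster to a new speaker but makes the statistics noisier, which is the trade-off a filter-coefficient sweep would explore.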
  • FIG. 8 illustrates an example plot 800 illustrating an L1 loss and a scaled fourth-power loss.
  • a first method utilizes an FFTNet with 3 FFT attention blocks, and involves 333k floating-point operations and 1.7k parameters; a second method utilizes a ConvNet with 4 ResNet blocks, involving 3244k floating-point operations and 19k parameters; a third method uses a digital signal processor (DSP) with limited operations and a negligible number of parameters in terms of memory cost.
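To put the two learned models side by side, the short calculation below divides the operation and parameter counts quoted in the text; the variable names are illustrative.

```python
fftnet = {"flops": 333_184, "params": 1_725}      # figures quoted in the text
convnet = {"flops": 3_244_096, "params": 19_073}  # figures quoted in the text

flops_ratio = convnet["flops"] / fftnet["flops"]
params_ratio = convnet["params"] / fftnet["params"]
print(f"ConvNet / FFTNet operations: {flops_ratio:.2f}x")
print(f"ConvNet / FFTNet parameters: {params_ratio:.2f}x")
```

On these figures the ConvNet needs roughly ten times the operations and roughly eleven times the parameters of the FFTNet.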
  • an L1 loss threshold can be used so that small loss values may be considered noise, while loss values larger than the threshold can be enhanced.
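One way to realize this behavior is a fourth-power loss scaled so that it equals the L1 loss exactly at the threshold. The exact form used in the disclosure is not given in the text, so the formula below, the function names, and the 0.1 default threshold are assumptions chosen only to show the shape of such a loss.

```python
def l1_loss(pred, target):
    """Least absolute deviation for a single prediction."""
    return abs(pred - target)


def scaled_fourth_power_loss(pred, target, threshold=0.1):
    """Equals the L1 loss at |error| == threshold, is smaller below it,
    and grows much faster above it (assumed form, for illustration)."""
    err = abs(pred - target)
    return threshold * (err / threshold) ** 4
```

Errors below the threshold contribute very little, so noise-level residuals are largely ignored, while errors above it are penalized far more heavily than under the L1 loss.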
  • predicted diaphragm displacement can track the peak DC jitter with small or no residual loss.
  • for small DC jitter, which is affected by random noise (e.g., from the circuit or mechanical diaphragm noise), the variation of the predicted diaphragm displacement is hard to track.
  • diaphragm displacement gradually reduces to zero, but is not fully aligned with electrical signal control.
  • although cliffing leads to large loss values, it would not damage the speaker, as the amount or magnitude of diaphragm excursion is decreasing (leading to less mechanical stress on the diaphragm).
  • the predicted DC drift sequence can track the ground truth DC variation, and the maximum L1 loss may be a value less than a target displacement value (e.g., a measured displacement of 0.0478 mm, less than the target displacement of 0.1 mm).
  • FFTNet shows more promising performance.
  • FFT processing may be computationally complex and may be challenging to deploy on real hardware, e.g., a device accelerated by an NPU.
  • high precision (e.g., 32-bit floating point) models may be reserved for complex scenarios, such as a DC injection scenario, which may be considered as a highly challenging corner-case scenario against which a speaker protection block is to protect.
  • Table 2 illustrates verification of BN re-estimation for online adaptation, where one model, trained based on a specific type of speaker (e.g., an SBS2 speaker), is used to predict diaphragm displacement for a different type of speaker (e.g., an AAC speaker).
  • One AAC unit is used to verify the performance of the models discussed herein and also to further explore the impact of different filter coefficients.
  • although a filter coefficient of 0.1 achieves the optimum maximum loss, the corresponding mean loss may be higher than the mean loss for other coefficient values.
  • model quantization may be used to reduce complexity when deploying models on edge devices.
  • AIMET can be used to perform quantization (e.g., to 8-bit integer (INT8) or some other level of quantization).
  • because FFT operations are not supported in AIMET, the ConvNet described herein may be quantized, and the FFTNet need not be used.
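The effect of INT8 quantization on a set of weights can be illustrated with a generic symmetric quantizer. This sketch is plain Python and does not use or reproduce the AIMET API; the per-tensor scaling scheme and the function names are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [qi * scale for qi in q]
```

Each value is reproduced to within half a quantization step, which is one source of the small accuracy loss that a quantized model shows relative to its floating-point counterpart.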
  • two separate models are designed specifically for SBS2 and AAC.
  • ConvNet32 shows a large gain in mean loss, and the maximum loss is close to, but larger than, the target (0.1 mm).
  • INT8 performance is slightly degraded, with the maximum L1 loss going from 0.1121 mm to 0.1298 mm, but it is still much better than the DSP solution.
  • Inference performance for the SBS2 speaker leads to a similar conclusion as the AAC; however, the mean loss may be somewhat worse than that in the DSP.
  • FIG. 9 illustrates an example plot 900 illustrating a predicted value, a ground truth value, and a residual value.
  • FIG. 10 illustrates an example system implementation 1000, in accordance with aspects of the present disclosure.
  • aspects of the present disclosure provide an end-to-end pipeline to predict DC drift which, as discussed, may be correlated with speaker diaphragm displacement (i.e., excursion).
  • An attention mechanism may be used to extract frequency features from an input of V and/or I, which shows better performance than the ConvNet and the DSP solutions.
  • BN re-estimation may be enabled for online adaptation when the model is applied to new scenarios (e.g., applied to speaker protection for different speaker types).
  • FIG. 11 shows an example of a method 1100 for ML-based diaphragm excursion prediction for speaker protection, in accordance with aspects of the present disclosure.
  • the method 1100 may be performed by a device, such as the device 1300 illustrated in FIG. 13.
  • method 1100 begins at block 1110 with receiving an indication of one or more parameters associated with driving a speaker.
  • the operations of this block refer to, or may be performed by, a component (e.g., 1324A-1324C) of a memory 1324 as described with reference to FIG. 13.
  • Method 1100 then proceeds to block 1120 with predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters.
  • the operations of this block refer to, or may be performed by, a component (e.g., 1324A-1324C) of a memory 1324 as described with reference to FIG. 13.
  • Method 1100 then proceeds to block 1130 with taking one or more actions based on the predicted displacement.
  • the operations of this block refer to, or may be performed by, one or more computer-executable components (e.g., 1324A-1324C) of a memory 1324 as described with reference to FIG. 13.
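The three blocks of method 1100 can be sketched as a single function. The predictor is passed in as a callable standing in for the trained model, and the action shown (scaling the drive signal down when the predicted displacement exceeds a limit) is only one possible action; the linear gain rule, the 0.1 mm default, and all names are assumptions.

```python
def protect_speaker(frame, predict_displacement, limit_mm=0.1):
    """One pass of the receive -> predict -> act flow for a frame of (V, I) samples.

    Returns the gain to apply to the drive signal and the predicted displacement.
    """
    predicted_mm = predict_displacement(frame)        # block 1120: predict
    if abs(predicted_mm) <= limit_mm:
        return 1.0, predicted_mm                      # within limit: no change
    gain = limit_mm / abs(predicted_mm)               # block 1130: attenuate
    return gain, predicted_mm
```

Because the excursion response is nonlinear, a proportional gain is only a first approximation; an implementation would likely re-check the prediction after attenuation.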
  • FIG. 12 shows an example of a method 1200 for training a machine learning model, which may be used in speaker protection tasks, to predict speaker diaphragm displacement (excursion), in accordance with aspects of the present disclosure.
  • the method 1200 may be performed by a device, such as the device 1400 illustrated in FIG. 14.
  • method 1200 begins at block 1210 with generating a training data set mapping an indication of one or more parameters associated with driving a speaker to an indication of a displacement of a diaphragm of the speaker as the diaphragm moves due to the speaker being driven based on the one or more parameters.
  • the operations of this block refer to, or may be performed by, a component (e.g., 1424A-1424B) of a memory 1424 as described with reference to FIG. 14.
  • Method 1200 then proceeds to block 1220 with training a machine learning model, based on the training data set, to predict the displacement of the diaphragm.
  • the operations of this block refer to, or may be performed by, a component (e.g., 1424A-1424B) of a memory 1424 as described with reference to FIG. 14.
  • FIG. 13 depicts an example processing system 1300 for ML-based diaphragm excursion prediction for speaker protection, such as described herein for example with respect to FIG. 11.
  • Processing system 1300 includes a central processing unit (CPU) 1302, which in some examples may be a multi-core CPU. Instructions executed at the CPU 1302 may be loaded, for example, from a program memory associated with the CPU 1302 or may be loaded from a memory 1324.
  • Processing system 1300 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 1304, a digital signal processor (DSP) 1306, a neural processing unit (NPU) 1308, a multimedia processing unit 1310, and a wireless connectivity component 1312.
  • An NPU, such as NPU 1308, is generally a specialized circuit configured for implementing control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs), deep neural networks (DNNs), random forests (RFs), and the like.
  • An NPU may sometimes alternatively be referred to as a neural signal processor (NSP), tensor processing unit (TPU), neural network processor (NNP), intelligence processing unit (IPU), vision processing unit (VPU), or graph processing unit.
  • NPUs, such as NPU 1308, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models.
  • a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC), while in other examples the plurality of NPUs may be part of a dedicated neural-network accelerator.
  • NPUs may be optimized for training or inference, or in some cases configured to balance performance between both.
  • the two tasks may still generally be performed independently.
  • NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged), iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance.
  • NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process this new piece through an already trained model to generate a model output (e.g., an inference).
  • NPU 1308 is a part of one or more of CPU 1302, GPU 1304, and/or DSP 1306.
  • wireless connectivity component 1312 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE), fifth generation connectivity (e.g., 5G or NR), Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards.
  • Wireless connectivity component 1312 is further connected to one or more antennas 1314.
  • Processing system 1300 may also include one or more input and/or output devices 1322, such as screens, touch-sensitive surfaces (including touch-sensitive displays), physical buttons, speakers, microphones, and the like.
  • one or more of the processors of processing system 1300 may be based on an ARM or RISC-V instruction set.
  • Processing system 1300 also includes memory 1324, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like.
  • memory 1324 includes computer-executable components, which may be executed by one or more of the aforementioned processors of processing system 1300.
  • memory 1324 includes receiving component 1324A, diaphragm displacement predicting component 1324B, and action taking component 1324C.
  • processing system 1300 and/or components thereof may be configured to perform the methods described herein.
  • aspects of processing system 1300 may be omitted, such as where processing system 1300 is a server computer or the like. Further, aspects of processing system 1300 may be distributed, such as training a model and using the model to generate inferences, such as user verification predictions.
  • FIG. 14 depicts an example processing system 1400 for training a machine learning model, which may be used in speaker protection tasks, to predict speaker diaphragm displacement (excursion), such as described herein for example with respect to FIG. 12.
  • Processing system 1400 includes a central processing unit (CPU) 1402, which in some examples may be a multi-core CPU. Instructions executed at the CPU 1402 may be loaded, for example, from a program memory associated with the CPU 1402 or may be loaded from a memory 1424.
  • Processing system 1400 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 1404, a digital signal processor (DSP) 1406, a neural processing unit (NPU) 1408, a multimedia processing unit 1410, and a wireless connectivity component 1412.
  • An NPU such as 1408, is generally a specialized circuit configured for implementing control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs) , deep neural networks (DNNs) , random forests (RFs) , and the like.
  • An NPU may sometimes alternatively be referred to as a neural signal processor (NSP) , tensor processing units (TPUs) , neural network processor (NNP) , intelligence processing unit (IPU) , vision processing unit (VPU) , or graph processing unit.
  • NSP neural signal processor
  • TPUs tensor processing units
  • NNP neural network processor
  • IPU intelligence processing unit
  • VPU vision processing unit
  • NPUs, such as NPU 1408, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models.
  • a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC) , while in other examples the plurality of NPUs may be part of a dedicated neural-network accelerator.
  • SoC system on a chip
  • NPUs may be optimized for training or inference, or in some cases configured to balance performance between both.
  • the two tasks may still generally be performed independently.
  • NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged) , iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance.
  • NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process this new piece through an already trained model to generate a model output (e.g., an inference) .
  • NPU 1408 is a part of one or more of CPU 1402, GPU 1404, and/or DSP 1406.
  • wireless connectivity component 1412 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE) , fifth generation connectivity (e.g., 5G or NR) , Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards.
  • Wireless connectivity component 1412 is further connected to one or more antennas 1414.
  • Processing system 1400 may also include one or more input and/or output devices 1422, such as screens, touch-sensitive surfaces (including touch-sensitive displays) , physical buttons, speakers, microphones, and the like.
  • one or more of the processors of processing system 1400 may be based on an ARM or RISC-V instruction set.
  • Processing system 1400 also includes memory 1424, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like.
  • memory 1424 includes computer-executable components, which may be executed by one or more of the aforementioned processors of processing system 1400.
  • memory 1424 includes generating component 1424A and training component 1424B.
  • the depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.
  • processing system 1400 and/or components thereof may be configured to perform the methods described herein.
  • aspects of processing system 1400 may be omitted, such as where processing system 1400 is a server computer or the like. Further, aspects of processing system 1400 may be distributed, such as training a model and using the model to generate inferences, such as user verification predictions.
  • Clause 1 A processor-implemented method comprising: receiving an indication of one or more parameters associated with driving a speaker; predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters; and taking one or more actions based on the predicted displacement.
  • Clause 2 The method of Clause 1, wherein the taking comprises controlling an amplitude of at least one of a voltage signal or a current signal associated with driving the speaker, based on the predicted displacement.
  • Clause 3 The method of Clause 1 or 2, wherein the machine learning model comprises a one-dimensional residual neural network model.
  • Clause 4 The method of any of Clauses 1-3, wherein the machine learning model comprises a Fourier neural network model.
  • Clause 5 The method of Clause 4, wherein the predicting comprises generating a Fourier transform of the one or more parameters, wherein the machine learning model comprises an attention block configured to extract features in frequency domain from the Fourier transform of the one or more parameters.
  • Clause 6 The method of Clause 4, wherein the machine learning model comprises a Fourier attention operator layer including a skip path associated with an inverse Fourier transform output in the time domain.
  • Clause 7 The method of any of Clauses 1-6, wherein the machine learning model comprises a model trained to predict the displacement of the diaphragm of the speaker based on a training data set mapping the one or more parameters to a measured displacement.
  • Clause 8 The method of any of Clauses 1-7, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.
  • Clause 9 The method of any of Clauses 1-8, further comprising adapting the machine learning model from a first speaker to a second speaker based on batch normalization.
  • Clause 10 A processor-implemented method comprising: generating a training data set mapping an indication of one or more parameters associated with driving a speaker to an indication of a displacement of a diaphragm of the speaker as the diaphragm moves due to the speaker being driven based on the one or more parameters; and training a machine learning model, based on the training data set, to predict the displacement of the diaphragm.
  • Clause 11 The method of Clause 10, wherein generating the training data set comprises filtering the indication of the displacement of the diaphragm to generate a filtered indication.
  • Clause 12 The method of Clause 11, wherein the filtering comprises using a Butterworth filter.
  • Clause 13 The method of Clause 11 or 12, wherein generating the training data set comprises correlating the filtered indication of the displacement of the diaphragm with the indication of the one or more parameters to effectively synchronize the filtered indication of the displacement with the one or more parameters.
  • Clause 14 The method of Clause 13, wherein the training is based on the filtered indication of the displacement effectively synchronized to the indication of the one or more parameters.
  • Clause 15 The method of any of Clauses 10-14, wherein the machine learning model comprises a one-dimensional residual neural network model.
  • Clause 16 The method of any of Clauses 10-14, wherein the machine learning model comprises a Fourier neural network model.
  • Clause 17 The method of any of Clauses 10-16, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.
  • Clause 18 A system comprising: a memory having executable instructions stored thereon; and a processor configured to execute the executable instructions to cause the system to perform the operations of any of Clauses 1-17.
  • Clause 19 A system comprising means for performing the operations of any of Clauses 1-17.
  • Clause 20 A computer-readable medium having instructions stored thereon which, when executed by a processor, cause the processor to perform the operations of any of Clauses 1-17.
  • an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein.
  • the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.
  • exemplary means “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
  • a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members.
  • “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c).
  • determining encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure) , ascertaining, and the like. Also, “determining” may include receiving (e.g., receiving information) , accessing (e.g., accessing data in a memory) , and the like. Also, “determining” may include resolving, selecting, choosing, establishing, and the like.
  • the methods disclosed herein comprise one or more steps or actions for achieving the methods.
  • the method steps and/or actions may be interchanged with one another without departing from the scope of the claims.
  • the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
  • the various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions.
  • the means may include various hardware and/or software component(s) and/or module(s), including, but not limited to, a circuit, an application specific integrated circuit (ASIC), or processor.
  • ASIC application specific integrated circuit

Abstract

Certain aspects of the present disclosure provide techniques and apparatus for machine-learning-based diaphragm excursion prediction for speaker protection. One example method generally includes receiving an indication of one or more parameters associated with driving a speaker, predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters, and taking one or more actions based on the predicted displacement.

Description

MACHINE-LEARNING-BASED DIAPHRAGM EXCURSION PREDICTION FOR SPEAKER PROTECTION
INTRODUCTION
Aspects of the present disclosure relate to speaker diaphragm protection using machine learning techniques.
A speaker is an electro-acoustic transducer, generating sound from an electric signal produced by a power amplifier. Generally, the voice coil of a speaker is attached to a diaphragm that is mounted on a fixed frame via a suspension. A magnetic field is generated by a permanent magnet that is conducted to the region of the coil gap. Due to the presence of the magnetic field, an electrical current passing through the voice coil generates a force $f_c$, which causes the membrane to move up and down. The displacement $x_d$ of the diaphragm is the excursion, which has a limit. If the excursion limit is exceeded, the speaker exhibits nonlinear behavior, which in turn manifests as distorted sound and degraded acoustic echo cancellation performance. Moreover, as current is pushed through the voice coil, some of the electrical energy is converted into heat instead of sound. Further, if the speaker is driven too hard, such excursions heat the diaphragm, which may distort the diaphragm and, in some cases, may manifest as plastic melt visible as bubbles on the edge of the diaphragm. This distortion may create an asymmetry in the diaphragm that causes the diaphragm to not vibrate as a piston. The issue generally becomes more acute as speakers become smaller and more portable (e.g., as used in micro-speakers, earbuds, etc.).
BRIEF SUMMARY
Certain aspects generally relate to machine-learning (ML)-based diaphragm excursion prediction for speaker protection.
Certain aspects provide a method. The method generally includes receiving an indication of one or more parameters associated with driving a speaker, predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters, and taking one or more actions based on the predicted displacement.
Other aspects provide processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and apparatus comprising means for performing the aforementioned methods as well as those further described herein.
The following description and the related drawings set forth in detail certain illustrative features of one or more embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
So that the manner in which the above-recited features of the present disclosure can be understood in detail, a more particular description, briefly summarized above, may be had by reference to aspects, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only certain typical aspects of this disclosure and are therefore not to be considered limiting of its scope, for the description may admit to other equally effective aspects.
FIG. 1 depicts an example speaker cross-section showing diaphragm displacement.
FIG. 2 illustrates an example laser excursion measurement.
FIG. 3A illustrates an example machine learning (ML) model to predict displacement.
FIG. 3B illustrates an example input matrix, ML model, and output vector.
FIG. 4A illustrates an example preprocessing procedure.
FIG. 4B illustrates an example plot illustrating a correlation between current I and displacement X.
FIG. 4C illustrates an example input, filter, and extracted frequency.
FIG. 5 illustrates an example Fourier neural network model.
FIG. 6 illustrates an example model, in accordance with certain aspects of the present disclosure.
FIG. 7 illustrates an example batch normalization (BN) re-estimation algorithm, in accordance with certain aspects of the present disclosure.
FIG. 8 illustrates an example plot illustrating an L1 loss (e.g., a Least Absolute Deviation loss) and a scaled fourth-power loss.
FIG. 9 illustrates an example plot illustrating a predicted value, a ground truth value, and a residual value.
FIG. 10 illustrates an example system implementation, in accordance with certain aspects of the present disclosure.
FIG. 11 illustrates an example method flow diagram, in accordance with certain aspects of the present disclosure.
FIG. 12 illustrates an example method flow diagram, in accordance with certain aspects of the present disclosure.
FIG. 13 illustrates an example device, in accordance with certain aspects of the present disclosure.
FIG. 14 illustrates an example device, in accordance with certain aspects of the present disclosure.
The APPENDIX describes various aspects of the present disclosure.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the drawings. It is contemplated that elements and features of one embodiment may be beneficially incorporated in other embodiments without further recitation.
DETAILED DESCRIPTION
Aspects of the present disclosure provide apparatuses, methods, processing systems, and computer-readable mediums for speaker diaphragm excursion prediction using machine learning.
Speaker protection generally leverages the playback signal to prevent over-excursion while maintaining maximum loudness, e.g., for speakerphone or gaming use in tiny loudspeakers, such as those found in smartphones, tablets, laptops, and other portable devices. One challenge is to model and predict over-excursion, which has highly nonlinear characteristics. To do so, aspects of the present disclosure utilize deep learning (DL) techniques to accurately predict nonlinear excursion of speaker diaphragms when driven. Feedback current and/or voltage may be sampled as input, and a laser may be used to measure diaphragm excursion. The sampled current and/or voltage are labeled (or otherwise correlated) with the measured diaphragm excursion. In some aspects, a convolutional neural network (ConvNet, or CNN) is designed as the baseline, and a fast Fourier transform network (FFTNet) may be used to explore the dominant low-frequency component and the unknown harmonic(s). In some aspects, batch normalization (BN) re-estimation is enabled to achieve online adaptation, and quantization (e.g., 8-bit integer (INT8) quantization) based on the artificial intelligence model efficiency toolkit (AIMET) is used to further reduce the computational complexity involved in predicting diaphragm excursion from current and voltage. Certain aspects of the present disclosure can keep greater than 99% of the residual DC values below 0.1 mm of diaphragm excursion, which may exceed the performance of various digital signal processor (DSP) solutions, as verified on two speakers across three scenarios.
Some solutions for speaker protection involve building a speaker protection block, which first monitors the current and voltage, and then analyzes the buffered signals and predicts the excursion status. Once the predicted excursion is larger than a threshold, the speaker protection block is triggered to attenuate the input power or modify the source signal to decrease diaphragm excursion. However, it is hard to precisely predict diaphragm excursion based on the monitored current and/or voltage. Thus, one simple technique to control diaphragm excursion may use traditional equalization (EQ) filters to attenuate an input signal. These traditional EQ filters are generally designed conservatively due to the wide range of operating factors (e.g., speaker variations, various types of audio signals with large dynamic ranges, etc.) in which various speakers operate. For example, an EQ filter for nonlinear distortion in direct-radiator loudspeakers in a closed cabinet may be implemented as an exact inverse of an electro-mechanical model of the loudspeaker. Estimates generated by the digital loudspeaker model may be used to predict the excursion based on the input voltage, and the predicted excursion may be controlled using dynamic range compression in the excursion domain. These approaches, however, do not push the speaker to its true limit. For example, EQ filters still attenuate the output audio even when the audio-signal energy is low and diaphragm excursion is within a defined limit or threshold, thereby degrading the audio quality and the volume of the audio.
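The threshold-triggered attenuation described above can be sketched as follows. This is a minimal illustration only: the function names, the 0.4 mm default limit, and the assumption that excursion scales roughly linearly with drive amplitude near the limit are all hypothetical and do not come from the disclosure.

```python
def protection_gain(predicted_excursion_mm, limit_mm):
    """Return a gain in (0, 1] to apply to the drive signal.

    Assumption for this sketch: excursion scales about linearly with
    drive amplitude, so scaling by limit/|excursion| brings it back to
    the limit.
    """
    if limit_mm <= 0.0:
        raise ValueError("limit_mm must be positive")
    magnitude = abs(predicted_excursion_mm)
    if magnitude <= limit_mm:
        return 1.0  # within the limit: leave the signal untouched
    return limit_mm / magnitude  # over the limit: attenuate


def protect_block(samples, predicted_excursion_mm, limit_mm=0.4):
    """Scale one block of drive samples by the protection gain."""
    gain = protection_gain(predicted_excursion_mm, limit_mm)
    return [gain * s for s in samples]
```

Unlike a fixed EQ filter, a gain of this kind leaves the signal unchanged whenever the predicted excursion stays within the limit.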
In another example, deep learning (DL) approaches can be used in modeling the behaviors of a voice coil actuator (VCA). These DL approaches generally incorporate a recurrent neural network (RNN) into a multi-physics simulation to enhance the computational efficiency of these DL approaches. DL solutions can solve differential equations (DEs), which can partially model a diverse non-linear system, such as excursion prediction and VCA modeling. Neural operators in the RNN can directly learn the mapping from any functional parametric dependence to the solution. One example uses physics-informed neural networks that directly solve the ordinary DEs. Another example formulates the neural operator in the Fourier space by parameterizing the integral kernel. However, DL solutions may be highly dependent on the training data set and may be subject to overfitting, especially with highly variable data sets, such as those associated with diaphragm excursion characteristics.
To allow for accurate prediction of diaphragm excursion, which may be used in speaker protection tasks, aspects of the present disclosure provide DL techniques to explore effective features for speaker protection. To do so, a diaphragm excursion measurement setting is established in which a laser tracks the corresponding excursion, and a comprehensive preprocessing pipeline prepares the dataset. A model, based on ConvNet and/or FFTNet, for example, may be trained and verified based on two typical speakers. BN re-estimation for online adaptation and quantization in AIMET may also be implemented.
Example Speaker Diaphragm Excursion Prediction
FIG. 1 depicts a cross-section of an example speaker 100 showing displacement X (also referred to as “excursion” ) of the speaker’s diaphragm.
The speaker 100 represents the transduction of electrical energy to mechanical energy. A continuous-time model for the electrical behavior is shown as:
$$v_c(t) = R_{eb}\, i_c(t) + \phi_0\, \dot{x}_d(t) \qquad (1)$$
where $v_c(t)$ is the voltage input across the terminals of the voice coil, $i_c(t)$ is the voice coil current, $R_{eb}$ is the blocked electrical resistance, $\dot{x}_d(t)$ is the diaphragm excursion velocity, $\phi_0$ is the transduction coefficient at the equilibrium state, and $x_d(t)$ is the diaphragm excursion.
Such mechanical characteristics of the speaker may be mostly determined by the parameters $R_{eb}$ and $\phi_0$, which depend heavily on the speaker’s geometric construction and the materials used in the voice coil, the diaphragm, and the enclosure. It is hard to accurately construct a mathematical model that tracks the nonlinear distribution and variation. In Equation (1), such nonlinear features can be represented by $[v_c(t), i_c(t), x_d(t)]$, which is further used in supervised training of models used in a DL solution to learn the characteristics of a speaker.
FIG. 2 illustrates an example laser excursion measurement environment 200. The current (I) and voltage (V) are the inputs. The measured excursion (X) from the laser is used/labeled as ground truth data used in supervised learning techniques to train the machine learning models described herein.
FIG. 3A illustrates an example machine learning (ML) model 300A trained to predict displacement (i.e., excursion) of a speaker diaphragm based on a training data set of V and/or I, labeled with a measured amount of displacement of a speaker diaphragm. After training, the ML model 300A may predict the displacement of a speaker diaphragm based on V and/or I.
FIG. 3B illustrates an example 300B of an input matrix, ML model, and output vector.
Example 300B may be constrained by various constraints on model size and latency. Example constraints may include the size of a one-dimensional time sequence (e.g., sampled at a sampling rate of 48 kHz) , timing constraints (e.g., 10 ms scheduling time) , input length (e.g., 256 samples provided as input) , and the like. The output of the ML model generally includes a prediction of diaphragm excursion given the inputs of V and/or I. In some aspects, the output may be generated using mixed quantization (8 and 16 bits) and may be performed by a central processing unit (CPU) or offloaded to other processors, such as a graphics processing unit (GPU) or a neural processing unit (NPU) .
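As a quick worked check of the example constraints above, the following sketch converts between sample counts and durations. The constant and function names are hypothetical, and how the input window relates to the scheduling interval is not stated in the text.

```python
SAMPLE_RATE_HZ = 48_000      # example sampling rate from the text
INPUT_LENGTH = 256           # example input length in samples
SCHEDULING_TIME_MS = 10.0    # example scheduling time


def window_duration_ms(num_samples, sample_rate_hz):
    """Duration covered by num_samples at the given sampling rate."""
    return 1000.0 * num_samples / sample_rate_hz


def samples_per_interval(interval_ms, sample_rate_hz):
    """Number of samples that arrive during interval_ms."""
    return round(interval_ms * sample_rate_hz / 1000.0)
```

With these example values, a 256-sample input spans about 5.33 ms, and 480 new samples arrive during each 10 ms scheduling interval.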
FIG. 4A illustrates an example preprocessing procedure 400A, according to certain aspects of the present disclosure.
A laser (or other measurement/metrology device) may be pointed at the center of the speaker in order to track the displacement (i.e., excursion) of the diaphragm for any given V and/or I used to drive the speaker. The measured diaphragm displacement is logged as $x_d(t)$. Meanwhile, the corresponding real-time current $i_c(t)$ and/or voltage $v_c(t)$ are measured, as shown in FIG. 1. Next, Equation (1) is rewritten as
Figure PCTCN2022127609-appb-000003
where $f(\cdot)$ is the function that represents the mechanical characteristics of the speaker. For example, in one voltage-controlled speaker, voltage is the source input, encoded by the voice content, which causes the diaphragm to vibrate and from which excursion from a base plane can be measured. The mechanical response is embedded in the feedback current. Once a model can be trained/optimized to represent $f(\cdot)$ (e.g., the mechanical characteristics of the speaker) stably, real-time logged current and/or voltage can be used to predict the corresponding diaphragm displacement based on the trained model, as shown in Equation (3):
Figure PCTCN2022127609-appb-000004
where the motivation is to learn
Figure PCTCN2022127609-appb-000005
based on the logged dataset, and to predict diaphragm displacement using the model, given the real-time $v_c(t)$ and/or $i_c(t)$.
The measured diaphragm excursion is generally impacted by noise and the location and environment in which a speaker is installed. As discussed, continuous long-term large excursion may cause progressively more serious damage to the diaphragm. To prevent, or at least reduce, damage to speaker diaphragms, aspects of the present disclosure may use direct current (DC) drift prediction, where a low-pass filter (e.g., a second-order Butterworth filter or other second-order filter) with a relatively low cutoff frequency (e.g., 10 Hz) may be used to extract the DC component of the measured diaphragm displacement, which, as discussed above, may be used as ground-truth data associated with a V and/or I measurement for training the models discussed herein. Moreover, because the current/voltage logging and the laser measurements utilize separate clocks, synchronization may be implemented to temporally align a sequence of V and/or I data with the corresponding measurements. Cross-correlation between current and measured excursion may be used to time shift the data for accurate training of the model.
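The two preprocessing steps above (low-pass DC extraction and cross-correlation alignment) can be sketched in plain Python as follows. This is an illustrative sketch, not the disclosed pipeline: the function names are hypothetical, the filter is a standard bilinear-transform biquad applied causally (the text does not say whether zero-phase filtering is used), and the lag search is a brute-force cross-correlation.

```python
import math


def butter2_lowpass_coeffs(cutoff_hz, sample_rate_hz):
    """Second-order Butterworth low-pass biquad via the bilinear transform."""
    k = math.tan(math.pi * cutoff_hz / sample_rate_hz)
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b0 = k * k * norm
    b = (b0, 2.0 * b0, b0)
    a = (1.0, 2.0 * (k * k - 1.0) * norm,
         (1.0 - math.sqrt(2.0) * k + k * k) * norm)
    return b, a


def extract_dc(excursion, cutoff_hz=10.0, sample_rate_hz=48_000.0):
    """Causally low-pass filter the measured excursion to keep its DC drift."""
    b, a = butter2_lowpass_coeffs(cutoff_hz, sample_rate_hz)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in excursion:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def best_lag(current, excursion, max_lag):
    """Lag (in samples) of excursion relative to current that maximizes
    their cross-correlation, searched over [-max_lag, max_lag]."""
    best, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, c in enumerate(current):
            m = n + lag
            if 0 <= m < len(excursion):
                score += c * excursion[m]
        if score > best_score:
            best, best_score = lag, score
    return best
```

Once a lag is found, the excursion sequence can be shifted by that many samples so that each V/I sample lines up with its label.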
FIG. 4B illustrates an example plot 400B illustrating a correlation between current I and displacement X. The plot illustrates synchronization between I and the measured excursion. In some cases, V/I monitoring and laser measurement may be deployed at two independent parts. Correlation between I and X is illustrated at 8 kHz.
FIG. 4C illustrates an example correlation 400C for an input, filter, and extracted frequency. As illustrated, a raw measurement (e.g., sampled at 48 kHz) may be input to a filter, which may reduce noise while maintaining features of the measured excursion. In some cases, low frequency may have an impact on heating. In some cases, the filter may comprise a low-pass filter (e.g., a Butterworth filter) , which may extract the DC value in the measured excursion.
FIG. 5 illustrates an example Fourier neural network model 500, according to certain aspects of the present disclosure.
Mathematical models of speaker diaphragm excursion may not allow for speakers to be driven to full potential; however, these models illustrate that DC drift can be associated with some unknown frequency and the corresponding harmonic components, which are closely associated with the mechanical design of the speaker. Aspects of the present disclosure leverage DL solutions to extract these frequencies and to predict DC drift.
In the training stage, for an input sequence with $N$ samples, the number of state variables is $2N$, including the voltage and current components, $s_n = \{i_n, v_n\}$, written as $s = (s_1, s_2, \ldots, s_N) \in \mathbb{R}^{2N}$. The output prediction $x_N$ is mapped to the time stamp $t$ as the last sample in the sequence. Each sample used to train the model is defined as $\{s, x_N\}$.
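A minimal sketch of how such training samples might be assembled from logged sequences is shown below. The function name, the stride of one sample between windows, and the interleaved ordering of current and voltage are assumptions made for illustration.

```python
def make_training_samples(current, voltage, excursion, n):
    """Build (s, x_N) pairs: s holds n (i, v) pairs flattened to length 2n,
    and x_N is the excursion at the last sample of the window."""
    if not (len(current) == len(voltage) == len(excursion)):
        raise ValueError("all sequences must have the same length")
    samples = []
    for end in range(n, len(current) + 1):
        s = []
        for i_n, v_n in zip(current[end - n:end], voltage[end - n:end]):
            s.extend((i_n, v_n))  # interleave current and voltage
        samples.append((s, excursion[end - 1]))
    return samples
```

For example, with three logged samples and n = 2, the function yields two training pairs, each with a four-element input vector.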
In some aspects, a Fourier Attention Operator Layer (FAOL) may be used to extract the effective frequency components for a given input sequence of voltage and/or current components. In the FAOL, multi-head attention is embedded, and each complex value is re-organized into two real parts, simplified for concatenation in the channel domain. After attention processing, the channels may be combined to restore a complex value. In some aspects, several FAOL blocks may be concatenated, which may aid in extracting the harmonic features for a given sequence of samples. Further, a skip path in the time domain may be used to restore frequency parts of the input sequence that may have been previously discarded. In some aspects, the overall structure of a fast Fourier transform network may include a 1-dimensional convolutional layer that increases the size of the channel feature space, an FAOL configured to extract features from an input sequence, and an average pooling layer to down-sample the extracted features into a smaller space.
Consider a scenario in which there are $J$ Fourier Attention Operator Layers (FAOLs) in the neural network. The output of each layer is $g_j$ for $j = 1, 2, 3, \ldots, J$. For the input of each layer, a discrete or fast Fourier transform $F$ may be performed to convert the input time-domain samples into the frequency domain. A multi-head self-attention block parameterized by
Figure PCTCN2022127609-appb-000006
may be used to learn in the frequency domain, and then recover the time-domain sequences based on an inverse Fourier transform $F^{-1}$. This process may be referred to as a Fourier attention operator
Figure PCTCN2022127609-appb-000007
represented by the following equation:
Figure PCTCN2022127609-appb-000008
where
Figure PCTCN2022127609-appb-000009
is the multi-head attention block, to learn the coefficients based on the given patches. Then, 
Figure PCTCN2022127609-appb-000010
is the weight tensor that performs a linear combination of the modes in the frequency domain. The output of the $j$-th layer sums the $F^{-1}$ output with the initial time-domain sequence weighted by the conv1d operator $\psi(\cdot)$. Rectified linear unit (ReLU) activation is used along with one-dimensional convolutional operations.
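The data flow of one such layer (transform, weight the retained modes, inverse transform, add a time-domain skip path, apply ReLU) can be sketched for a single channel as follows. This is a deliberately simplified stand-in: fixed per-mode weights replace the multi-head attention block, a single scalar replaces the conv1d skip operator, and all names are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def fourier_layer(x, mode_weights, skip_weight):
    """ReLU(Re(F^-1(R * F(x))) + skip_weight * x) for one real channel.

    mode_weights[k] scales mode k; modes without a weight are discarded.
    """
    n = len(x)
    if len(mode_weights) > n // 2 + 1:
        raise ValueError("at most n // 2 + 1 modes can be weighted")
    spectrum = dft(x)
    weighted = [0j] * n
    for k, w in enumerate(mode_weights):
        weighted[k] = w * spectrum[k]
        if 0 < k < n - k:
            # Mirror the conjugate so the inverse transform stays real.
            weighted[n - k] = weighted[k].conjugate()
    restored = idft(weighted)
    return [max(0.0, z.real + skip_weight * xt) for z, xt in zip(restored, x)]
```

Keeping only the first few entries of `mode_weights` retains the low-frequency modes, while the skip term passes the original time-domain sequence around the frequency-domain path.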
Further, as a comparison, in some aspects, a ResNet-based one-dimensional convolutional network (ConvNet) may be used, which may have a similar structure as the FFTNet discussed above (e.g., have the same input and output format, a single conv1d layer, several ResNet blocks, average pooling, and a fully connected (FC) layer to regress the predicted DC drift) .
In some cases, the complexity of a fast Fourier transform (FFT) neural model may include 333,184 floating-point operations and 1,725 parameters.
FIG. 6 illustrates an example model 600, in accordance with aspects of the present disclosure. In some cases, the complexity of a ResNet 1D model may include 3,244,096 floating-point operations and 19,073 parameters.
FIG. 7 illustrates an example batch normalization (BN) re-estimation algorithm 700, in accordance with aspects of the present disclosure.
As discussed, speakers (and the performance of these speakers) may be impacted by production and unknown mechanical characteristics. Further, different units of the same model of speaker may have varying characteristics, which may impose difficulties in accurately predicting diaphragm displacement for a speaker. Because of power and computation constraints in edge devices, such as smartphones or other devices in which speakers are included, batch normalization (BN) re-estimation may be used for online adaptation, without requiring any labeling or fine-tuning, to adapt to variations between different types of speakers and different units of a same speaker type. The BN layer is generally designed to alleviate the issue of internal covariate shift, a common problem while training a very deep neural network, and is defined as Equation (5):
$$y_j = \gamma_j\, \frac{x_j - \mathrm{E}[X_{\cdot j}]}{\sqrt{\mathrm{Var}[X_{\cdot j}]}} + \beta_j \qquad (5)$$
where $x_j$ and $y_j$ are the input/output scalars of one neuron response in one sample, $X_{\cdot j}$ denotes the $j$-th column of the input data in one BN layer, $X \in \mathbb{R}^{n \times p}$, and $j \in \{1, \ldots, p\}$. Here, $n$ denotes the batch size, and $p$ is the feature dimension. $\gamma_j$ and $\beta_j$ are parameters to be learned. Once the training is done, the parameters in BN are frozen for inference operations.
When the model is initialized in a new device or new scenario for inference with BN re-estimation, as shown in Algorithm 1, the input is buffered within a given window in order to calculate the mean and variance. Further, to track variation over time, a 1-tap infinite impulse response (IIR) filter is applied to the mean and variance, which is useful for optimization over the whole space.
Pseudocode describing an algorithm for predicting diaphragm displacement is illustrated below, in accordance with aspects of the present disclosure:
Figure PCTCN2022127609-appb-000012
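The BN re-estimation idea described above (buffer inputs over a window, compute the mean and variance, then smooth them with a 1-tap IIR filter while leaving the learned γ and β frozen) may be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the class name BNReestimator, the window length, the ε constant, and the exact update form (1 − α)·old + α·new are all hypothetical choices.

```python
from math import sqrt


class BNReestimator:
    """Re-estimates BN statistics from unlabeled inference inputs (hypothetical helper)."""

    def __init__(self, mean, var, gamma, beta, alpha=0.1, window=4, eps=1e-5):
        self.mean = list(mean)    # per-feature mean frozen at training time
        self.var = list(var)      # per-feature variance frozen at training time
        self.gamma = list(gamma)  # learned scale, never updated here
        self.beta = list(beta)    # learned shift, never updated here
        self.alpha = alpha        # 1-tap IIR coefficient (assumed update form)
        self.window = window      # number of samples buffered per update
        self.eps = eps            # numerical-stability constant (assumption)
        self.buffer = []

    def observe(self, x):
        """Buffer one input vector; update the statistics once the window is full."""
        self.buffer.append(list(x))
        if len(self.buffer) < self.window:
            return
        n = len(self.buffer)
        for j in range(len(self.mean)):
            col = [row[j] for row in self.buffer]
            m = sum(col) / n
            v = sum((c - m) ** 2 for c in col) / n
            # 1-tap IIR (exponential) tracking of the mean and variance.
            self.mean[j] = (1 - self.alpha) * self.mean[j] + self.alpha * m
            self.var[j] = (1 - self.alpha) * self.var[j] + self.alpha * v
        self.buffer.clear()

    def normalize(self, x):
        """Apply BN using the current statistics and the frozen gamma/beta."""
        return [
            g * (xi - m) / sqrt(v + self.eps) + b
            for xi, m, v, g, b in zip(x, self.mean, self.var, self.gamma, self.beta)
        ]
```

Because only the running statistics change, no labels and no gradient computation are needed, which is what makes this kind of adaptation attractive on power-constrained devices.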
FIG. 8 illustrates an example plot 800 illustrating an L1 loss and a scaled fourth-power loss.
In the experiments, three typical scenarios are considered: normal, heating, and DC injection. Fourteen units from two different speaker types (SBS2 and AAC) are used to collect the data. These are further split: 8 units for training, 4 units for validation, and 2 units for testing.
Three methods are evaluated: a first method utilizes an FFTNet with 3 FFT attention blocks, involving 333k floating-point operations and 1.7k parameters; a second method utilizes a ConvNet with 4 ResNet blocks, involving 3244k floating-point operations and 19k parameters; and a third method uses a digital signal processor (DSP) algorithm with limited operations and a negligible number of parameters in terms of memory cost. The DSP method may still leverage the training and validation datasets to fine-tune the algorithm.
As large DC drift values generally lead to more serious damage to the speaker diaphragm, an L1 loss threshold can be used so that small loss values may be considered noise, while loss values larger than the threshold can be emphasized. A scaled fourth-power loss can be used in the training procedure to focus on large L1 losses, as shown in:
Figure PCTCN2022127609-appb-000013
where S is the batch size, x_{out,i} is the prediction of the i-th sample for i = 1, 2, …, S, and δ adjusts the point at which the scaled fourth-power loss is identical to the L1 loss.
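The behavior described above can be illustrated with a short sketch. The exact expression appears only in the figure, so the form used here, the batch mean of δ·(|e_i|/δ)^4 with e_i the per-sample error, is an assumption chosen so that the scaled loss equals the L1 loss when |e_i| = δ; the function names and the default δ are hypothetical.

```python
def l1_loss(pred, target):
    """Mean absolute error over a batch of S samples."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def scaled_fourth_power_loss(pred, target, delta=0.1):
    """Assumed form: batch mean of delta * (|error| / delta) ** 4.

    Errors below delta are suppressed relative to L1 (treated as noise);
    errors above delta are amplified.
    """
    return sum(
        delta * (abs(p - t) / delta) ** 4 for p, t in zip(pred, target)
    ) / len(pred)
```

With this form, an error of half of δ contributes one eighth of its L1 value, while an error of twice δ contributes eight times its L1 value, so training is dominated by the large excursion errors.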
As illustrated, the predicted diaphragm displacement can track the peak DC jitter with little or no residual loss. For small DC jitter, which is affected by random noise (e.g., from the circuit or from mechanical diaphragm noise), the variation is hard for the predicted diaphragm displacement to track. The maximum loss occurs when there is DC cliffing. In an ideal system, a zero-valued input (e.g., power off or no voice, v_c(t) = i_c(t) = 0) has a corresponding excursion of zero. However, due to mechanical constraints of a speaker diaphragm, the diaphragm displacement reduces to zero gradually and is not fully aligned with the electrical control signal. Although such cliffing leads to large loss values, it does not damage the speaker, as the magnitude of diaphragm excursion is decreasing (leading to less mechanical stress on the diaphragm).
Using a 32-bit floating point FFTNet and a testing sequence, the predicted DC drift sequence can track the ground truth DC variation, and the maximum L1 loss may be a value less than a target displacement value (e.g., a measured displacement of 0.0478 mm, less than the target displacement of 0.1 mm) .
As shown in Table 1, FFTNet shows more promising performance than ConvNet. However, FFT processing may be computationally complex and may be challenging to deploy on a real hardware device (e.g., one accelerated by an NPU). To allow for an FFTNet to be used in predicting speaker diaphragm excursion, high precision (e.g., 32-bit floating point) models may be reserved for complex scenarios, such as a DC injection scenario, which may be considered as a highly challenging corner-case scenario against which a speaker protection block is to protect.
Model     Mean (mm)   Max (mm)
FFTNet    0.0077      0.2169
ConvNet   0.0091      0.2314
Table 1: L1 Loss Comparison for DC Injection Scenario
Table 2 illustrates verification of BN re-estimation for online adaptation, where one model, trained on a specific type of speaker (e.g., an SBS2 speaker), is used to predict diaphragm displacement for a different type of speaker (e.g., an AAC speaker). One AAC unit is used to verify the performance of the models discussed herein and to explore the impact of different filter coefficients α. As shown in Table 2, such adaptation methods can substantially improve the performance of machine learning models used to predict speaker diaphragm displacement. For example, compared to the baseline, α = 0.1 can achieve a 21% gain in the maximum loss. It should be noted, however, that it may be difficult to identify an optimal value of α, as the optimal value may be highly dependent on the data/model weight distribution. Although α = 0.1 achieves the lowest maximum loss, the corresponding mean loss may be higher than the mean loss for other values of α.
α           baseline   0.1      0.05     0.001    0.0001
Mean (mm)   0.0081     0.0035   0.0024   0.001    0.0028
Max (mm)    0.7466     0.4179   0.5849   0.6802   0.7055
Table 2: Comparison of ConvNet FP32 performance in online adaptation scenarios between different speakers
In some cases, model quantization may be used to reduce complexity when deploying models on edge devices. Here, the AI Model Efficiency Toolkit (AIMET) can be used to perform quantization (e.g., to 8-bit integer (INT8) or some other level of quantization). As FFT operations are not supported by AIMET, the ConvNet described herein may be quantized, and FFTNet need not be used. In the experiment, two separate models are designed specifically for SBS2 and AAC.
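To make the effect of INT8 quantization concrete, the following sketch applies symmetric per-tensor quantization to a list of weights in plain Python. It is a generic illustration of the technique, not AIMET's API or its quantization scheme; the function names and the choice of a symmetric scale derived from the largest absolute weight are assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: returns (integer codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric INT8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]
```

Each reconstructed weight differs from the original by at most half of one quantization step, which is the source of the small accuracy degradation typically observed after quantization.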
As shown in Table 3, compared to the DSP, the FP32 ConvNet shows a large gain in the mean loss for the AAC speaker, and the maximum loss is close to, but larger than, the target (0.1 mm). Further, after 8-bit quantization, INT8 performance is slightly degraded compared to the FP32 baseline, with the maximum L1 loss increasing from 0.1121 mm to 0.1298 mm, but it is still much better than the DSP solution. Inference performance for the SBS2 speaker leads to a similar conclusion as for the AAC speaker; however, the mean loss may be somewhat worse than that of the DSP.
Speaker   Method         Mean (mm)   Max (mm)
AAC       DSP            0.0140      0.2569
AAC       ConvNet FP32   0.0038      0.1121
AAC       ConvNet INT8   0.0076      0.1298
SBS2      DSP            0.0020      0.2711
SBS2      ConvNet FP32   0.0032      0.13171
SBS2      ConvNet INT8   0.0046      0.1408
Table 3: L1 loss results for different speakers and different prediction networks
FIG. 9 illustrates an example plot 900 illustrating a predicted value, a ground truth value, and a residual value.
FIG. 10 illustrates an example system implementation 1000, in accordance with aspects of the present disclosure.
Aspects of the present disclosure provide an end-to-end pipeline to predict DC drift which, as discussed, may be correlated with speaker diaphragm displacement (i.e., excursion). An attention mechanism may be used to extract frequency features from an input of V and/or I, an approach that shows better performance than the ConvNet and the DSP solutions. Further, BN re-estimation may be enabled for online adaptation when the model is applied to new scenarios (e.g., applied to speaker protection for different speaker types).
Example Operations for ML-based Diaphragm Excursion Prediction for Speaker Protection
FIG. 11 shows an example of a method 1100 for ML-based diaphragm excursion prediction for speaker protection, in accordance with aspects of the present disclosure. In some examples, the method 1100 may be performed by a device, such as the device 1300 illustrated in FIG. 13.
As illustrated, method 1100 begins at block 1110 with receiving an indication of one or more parameters associated with driving a speaker. In some cases, the operations of this block refer to, or may be performed by, a component (e.g., 1324A-1324C) of a memory 1324 as described with reference to FIG. 13.
Method 1100 then proceeds to block 1120 with predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters. In some cases, the operations of this block refer to, or may be performed by, a component (e.g., 1324A-1324C) of a memory 1324 as described with reference to FIG. 13.
Method 1100 then proceeds to block 1130 with taking one or more actions based on the predicted displacement. In some cases, the operations of this block refer to, or may be performed by, one or more computer-executable components (e.g., 1324A-1324C) of a memory 1324 as described with reference to FIG. 13.
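The three blocks of method 1100 can be sketched as a single per-frame function. This is a minimal sketch under stated assumptions: protect_frame and its parameters are hypothetical names, the predictor is passed in as an arbitrary callable standing in for the trained model, and the action taken (scaling the drive voltage so that the predicted excursion meets a limit) assumes excursion scales roughly linearly with drive amplitude, which the disclosure does not state.

```python
def protect_frame(voltage, current, predict_displacement, limit_mm):
    """Receive V/I samples, predict excursion, and attenuate the drive if needed.

    Returns (output voltage samples, predicted displacement in mm, applied gain).
    """
    # Block 1110: the voltage and current samples are the received parameters.
    # Block 1120: predict the diaphragm displacement with the supplied model.
    predicted_mm = predict_displacement(voltage, current)
    # Block 1130: take an action based on the prediction.
    if abs(predicted_mm) <= limit_mm:
        return list(voltage), predicted_mm, 1.0
    gain = limit_mm / abs(predicted_mm)  # assumed linear excursion-vs-amplitude relation
    return [gain * v for v in voltage], predicted_mm, gain
```

Other actions, such as limiting the current signal instead of the voltage signal, would fit the same structure.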
FIG. 12 shows an example of a method 1200 for training a machine learning model, which may be used in speaker protection tasks, to predict speaker diaphragm displacement (excursion) , in accordance with aspects of the present disclosure. In some examples, the method 1200 may be performed by a device, such as the device 1400 illustrated in FIG. 14.
As illustrated, method 1200 begins at block 1210 with generating a training data set mapping an indication of one or more parameters associated with driving a speaker to an indication of a displacement of a diaphragm of the speaker as the diaphragm moves due to the speaker being driven based on the one or more parameters. In some cases, the operations of this block refer to, or may be performed by, a component (e.g., 1424A-1424B) of a memory 1424 as described with reference to FIG. 14.
Method 1200 then proceeds to block 1220 with training a machine learning model, based on the training data set, to predict the displacement of the diaphragm. In some cases, the operations of this block refer to, or may be performed by, a component (e.g., 1424A-1424B) of a memory 1424 as described with reference to FIG. 14.
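Generating the training data set at block 1210 may involve filtering the measured displacement and synchronizing it with the drive parameters, as the clauses below describe. The following sketch illustrates one way to do this in plain Python. The function names, the choice of a low-pass response, the cutoff frequency, and the brute-force lag search are all assumptions made for illustration.

```python
from math import pi, sqrt, tan


def butter2_lowpass(x, cutoff_hz, fs_hz):
    """Second-order Butterworth low-pass (bilinear transform), direct form I."""
    k = tan(pi * cutoff_hz / fs_hz)
    norm = 1.0 / (1.0 + sqrt(2.0) * k + k * k)
    b0 = k * k * norm
    b1, b2 = 2.0 * b0, b0
    a1 = 2.0 * (k * k - 1.0) * norm
    a2 = (1.0 - sqrt(2.0) * k + k * k) * norm
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1, y2, y1 = x1, xn, y1, yn
        y.append(yn)
    return y


def best_lag(reference, measured, max_lag):
    """Lag (in samples) of `measured` relative to `reference` that maximizes correlation."""
    def score(lag):
        return sum(
            reference[i] * measured[i + lag]
            for i in range(len(reference))
            if 0 <= i + lag < len(measured)
        )
    return max(range(-max_lag, max_lag + 1), key=score)


def align(params, displacement, lag):
    """Pair params[i] with displacement[i + lag], dropping samples without a partner."""
    return [
        (params[i], displacement[i + lag])
        for i in range(len(params))
        if 0 <= i + lag < len(displacement)
    ]
```

A positive lag means the displacement measurement is delayed relative to the drive signal; the aligned pairs can then serve as input and label samples for training at block 1220.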
Example Processing Systems for ML-based Diaphragm Excursion Prediction for Speaker Protection
FIG. 13 depicts an example processing system 1300 for ML-based diaphragm excursion prediction for speaker protection, such as described herein for example with respect to FIG. 11.
Processing system 1300 includes a central processing unit (CPU) 1302, which in some examples may be a multi-core CPU. Instructions executed at the CPU 1302 may be loaded, for example, from a program memory associated with the CPU 1302 or may be loaded from a memory 1324.
Processing system 1300 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 1304, a digital signal processor (DSP) 1306, a neural processing unit (NPU) 1308, a multimedia processing unit 1310, and a wireless connectivity component 1312.
An NPU, such as 1308, is generally a specialized circuit configured for implementing control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs), deep neural networks (DNNs), random forests (RFs), and the like. An NPU may sometimes alternatively be referred to as a neural signal processor (NSP), tensor processing unit (TPU), neural network processor (NNP), intelligence processing unit (IPU), vision processing unit (VPU), or graph processing unit.
NPUs, such as 1308, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models. In some examples, a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC) , while in other examples the plurality of NPUs may be part of a dedicated neural-network accelerator.
NPUs may be optimized for training or inference, or in some cases configured to balance performance between both. For NPUs that are capable of performing both training and inference, the two tasks may still generally be performed independently.
NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged) , iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance. Generally, optimizing based on a wrong prediction involves propagating back through the layers of the model and determining gradients to reduce the prediction error.
NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process this new piece through an already trained model to generate a model output (e.g., an inference) .
In one implementation, NPU 1308 is a part of one or more of CPU 1302, GPU 1304, and/or DSP 1306.
In some examples, wireless connectivity component 1312 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE) , fifth generation connectivity (e.g., 5G or NR) , Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards. Wireless connectivity component 1312 is further connected to one or more antennas 1314.
Processing system 1300 may also include one or more input and/or output devices 1322, such as screens, touch-sensitive surfaces (including touch-sensitive displays) , physical buttons, speakers, microphones, and the like.
In some examples, one or more of the processors of processing system 1300 may be based on an ARM or RISC-V instruction set.
Processing system 1300 also includes memory 1324, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like. In this example, memory 1324 includes computer-executable components, which may be executed by one or more of the aforementioned processors of processing system 1300.
In particular, in this example, memory 1324 includes receiving component 1324A, diaphragm displacement predicting component 1324B, and action taking component 1324C. The depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.
Generally, processing system 1300 and/or components thereof may be configured to perform the methods described herein.
Notably, in other aspects, aspects of processing system 1300 may be omitted, such as where processing system 1300 is a server computer or the like. Further, aspects of processing system 1300 may be distributed, such as training a model and using the model to generate inferences, such as diaphragm displacement predictions.
FIG. 14 depicts an example processing system 1400 for training a machine learning model, which may be used in speaker protection tasks, to predict speaker diaphragm displacement (excursion), such as described herein for example with respect to FIG. 12.
Processing system 1400 includes a central processing unit (CPU) 1402, which in some examples may be a multi-core CPU. Instructions executed at the CPU 1402 may be loaded, for example, from a program memory associated with the CPU 1402 or may be loaded from a memory 1424.
Processing system 1400 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 1404, a digital signal processor (DSP) 1406, a neural processing unit (NPU) 1408, a multimedia processing unit 1410, and a wireless connectivity component 1412.
An NPU, such as 1408, is generally a specialized circuit configured for implementing control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs), deep neural networks (DNNs), random forests (RFs), and the like. An NPU may sometimes alternatively be referred to as a neural signal processor (NSP), tensor processing unit (TPU), neural network processor (NNP), intelligence processing unit (IPU), vision processing unit (VPU), or graph processing unit.
NPUs, such as 1408, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models. In some examples, a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC) , while in other examples the plurality of NPUs may be part of a dedicated neural-network accelerator.
NPUs may be optimized for training or inference, or in some cases configured to balance performance between both. For NPUs that are capable of performing both training and inference, the two tasks may still generally be performed independently.
NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged), iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance. Generally, optimizing based on a wrong prediction involves propagating back through the layers of the model and determining gradients to reduce the prediction error.
NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process this new piece through an already trained model to generate a model output (e.g., an inference) .
In one implementation, NPU 1408 is a part of one or more of CPU 1402, GPU 1404, and/or DSP 1406.
In some examples, wireless connectivity component 1412 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE) , fifth generation connectivity (e.g., 5G or NR) , Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards. Wireless connectivity component 1412 is further connected to one or more antennas 1414.
Processing system 1400 may also include one or more input and/or output devices 1422, such as screens, touch-sensitive surfaces (including touch-sensitive displays) , physical buttons, speakers, microphones, and the like.
In some examples, one or more of the processors of processing system 1400 may be based on an ARM or RISC-V instruction set.
Processing system 1400 also includes memory 1424, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like. In this example, memory 1424 includes computer-executable components, which may be executed by one or more of the aforementioned processors of processing system 1400.
In particular, in this example, memory 1424 includes generating component 1424A and training component 1424B. The depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.
Generally, processing system 1400 and/or components thereof may be configured to perform the methods described herein.
Notably, in other aspects, aspects of processing system 1400 may be omitted, such as where processing system 1400 is a server computer or the like. Further, aspects of processing system 1400 may be distributed, such as training a model and using the model to generate inferences, such as diaphragm displacement predictions.
Example Clauses
Clause 1: A processor-implemented method comprising: receiving an indication of one or more parameters associated with driving a speaker; predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters; and taking one or more actions based on the predicted displacement.
Clause 2: The method of Clause 1, wherein the taking comprises controlling an amplitude of at least one of a voltage signal or a current signal associated with driving the speaker, based on the predicted displacement.
Clause 3: The method of Clause 1 or 2, wherein the machine learning model comprises a one-dimensional residual neural network model.
Clause 4: The method of any of Clauses 1-3, wherein the machine learning model comprises a Fourier neural network model.
Clause 5: The method of Clause 4, wherein the predicting comprises generating a Fourier transform of the one or more parameters, wherein the machine learning model comprises an attention block configured to extract features in frequency domain from the Fourier transform of the one or more parameters.
Clause 6: The method of Clause 4, wherein the machine learning model comprises a Fourier attention operator layer including a skip path associated with an inverse Fourier transform output in the time domain.
Clause 7: The method of any of Clauses 1-6, wherein the machine learning model comprises a model trained to predict the displacement of the diaphragm of the speaker based on a training data set mapping the one or more parameters to a measured displacement.
Clause 8: The method of any of Clauses 1-7, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.
Clause 9: The method of any of Clauses 1-8, further comprising adapting the machine learning model from a first speaker to a second speaker based on batch normalization.
Clause 10: A processor-implemented method comprising: generating a training data set mapping an indication of one or more parameters associated with driving a speaker to an indication of a displacement of a diaphragm of the speaker as the diaphragm moves due to the speaker being driven based on the one or more parameters; and training a machine learning model, based on the training data set, to predict the displacement of the diaphragm.
Clause 11: The method of Clause 10, wherein generating the training data set comprises filtering the indication of the displacement of the diaphragm to generate a filtered indication.
Clause 12: The method of Clause 11, wherein the filtering comprises using a Butterworth filter.
Clause 13: The method of Clause 11 or 12, wherein generating the training data set comprises correlating the filtered indication of the displacement of the diaphragm with the indication of the one or more parameters to effectively synchronize the filtered indication of the displacement with the one or more parameters.
Clause 14: The method of Clause 13, wherein the training is based on the filtered indication of the displacement effectively synchronized to the indication of the one or more parameters.
Clause 15: The method of any of Clauses 10-14, wherein the machine learning model comprises a one-dimensional residual neural network model.
Clause 16: The method of any of Clauses 10-14, wherein the machine learning model comprises a Fourier neural network model.
Clause 17: The method of any of Clauses 10-16, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.
Clause 18: A system comprising: a memory having executable instructions stored thereon; and a processor configured to execute the executable instructions to cause the system to perform the operations of any of Clauses 1-17.
Clause 19: A system comprising means for performing the operations of any of Clauses 1-17.
Clause 20: A computer-readable medium having instructions stored thereon which, when executed by a processor, perform the operations of any of Clauses 1-17.
Additional Considerations
The preceding description is provided to enable any person skilled in the art to practice the various embodiments described herein. The examples discussed herein are not limiting of the scope, applicability, or embodiments set forth in the claims. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments. For example, changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For instance, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. Also, features described with respect to some examples may be combined in some other examples. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.
As used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
As used herein, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c) .
As used herein, the term “determining” encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining, and the like. Also, “determining” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory), and the like. Also, “determining” may include resolving, selecting, choosing, establishing, and the like.
The methods disclosed herein comprise one or more steps or actions for achieving the methods. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims. Further, the various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software component(s) and/or module(s), including, but not limited to a circuit, an application specific integrated circuit (ASIC), or processor. Generally, where there are operations illustrated in figures, those operations may have corresponding counterpart means-plus-function components with similar numbering.
The following claims are not intended to be limited to the embodiments shown herein, but are to be accorded the full scope consistent with the language of the claims. Within a claim, reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more. No claim element is to be construed under the provisions of 35 U.S.C. §112(f) unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.” All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims.

Claims (21)

  1. A processor-implemented method comprising:
    receiving an indication of one or more parameters associated with driving a speaker;
    predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters; and
    taking one or more actions based on the predicted displacement.
  2. The method of claim 1, wherein the taking comprises controlling an amplitude of at least one of a voltage signal or a current signal associated with driving the speaker, based on the predicted displacement.
  3. The method of claim 1, wherein the machine learning model comprises a one-dimensional residual neural network model.
  4. The method of claim 1, wherein the machine learning model comprises a Fourier neural network model.
  5. The method of claim 4, wherein the predicting comprises generating a Fourier transform of the one or more parameters and wherein the Fourier neural network model comprises an attention block configured to extract features in the frequency domain from the Fourier transform of the one or more parameters.
  6. The method of claim 4, wherein the machine learning model comprises a Fourier attention operator layer including a skip path associated with an inverse Fourier transform output in the time domain.
  7. The method of claim 1, wherein the machine learning model comprises a model trained to predict the displacement of the diaphragm of the speaker based on a training data set mapping the one or more parameters to a measured displacement.
  8. The method of claim 1, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.
  9. The method of claim 1, further comprising adapting the machine learning model from a first speaker to a second speaker based on batch normalization.
  10. A processor-implemented method comprising:
    generating a training data set mapping an indication of one or more parameters associated with driving a speaker to an indication of a displacement of a diaphragm of the speaker as the diaphragm moves due to the speaker being driven based on the one or more parameters; and
    training a machine learning model, based on the training data set, to predict the displacement of the diaphragm.
  11. The method of claim 10, wherein generating the training data set comprises filtering the indication of the displacement of the diaphragm to generate a filtered indication.
  12. The method of claim 11, wherein the filtering comprises using a second-order Butterworth filter.
  13. The method of claim 11, wherein generating the training data set comprises correlating the filtered indication of the displacement of the diaphragm with the indication of the one or more parameters to effectively synchronize the filtered indication of the displacement with the one or more parameters.
  14. The method of claim 13, wherein the training is based on the filtered indication of the displacement effectively synchronized to the indication of the one or more parameters.
  15. The method of claim 10, wherein the machine learning model comprises a one-dimensional residual neural network model.
  16. The method of claim 10, wherein the machine learning model comprises a Fourier neural network model.
  17. The method of claim 10, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.
  18. A processing system comprising:
    a memory comprising computer-executable instructions; and
    one or more processors configured to execute the computer-executable instructions and cause the processing system to perform a method in accordance with any of Claims 1-17.
  19. A processing system, comprising means for performing a method in accordance with any of Claims 1-17.
  20. A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by one or more processors of a processing system, cause the processing system to perform a method in accordance with any of Claims 1-17.
  21. A computer program product embodied on a computer-readable storage medium comprising code for performing a method in accordance with any of Claims 1-17.
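
Claims 11 through 14 recite filtering the measured displacement with a second-order Butterworth filter and correlating the filtered result with the drive parameters so that the two are effectively synchronized. The following is a minimal sketch of that data-preparation step, not the applicant's implementation. It assumes a low-pass response, a 48 kHz sample rate, an 8 kHz cutoff, and a single integer-sample delay found by brute-force cross-correlation; none of these specifics appear in the claims. The function names and the synthetic signals are hypothetical.

```python
import math
import random


def butterworth2_lowpass(cutoff_hz, sample_rate_hz):
    """Return (b, a) for a second-order Butterworth low-pass biquad.

    Uses the bilinear-transform design with Q = 1/sqrt(2). The low-pass
    type and the cutoff are assumptions made for this sketch.
    """
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate_hz
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)  # sin(w0) / (2 * Q) with Q = 1/sqrt(2)
    a0 = 1.0 + alpha
    b = [(1.0 - cos_w0) / (2.0 * a0), (1.0 - cos_w0) / a0, (1.0 - cos_w0) / (2.0 * a0)]
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0]
    return b, a


def apply_biquad(b, a, samples):
    """Filter samples with a direct-form I biquad (a[0] is assumed to be 1)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x0
        y2, y1 = y1, y0
        out.append(y0)
    return out


def estimate_lag(reference, delayed, max_lag):
    """Return the lag k in [-max_lag, max_lag] that maximizes
    sum(reference[i] * delayed[i + k]), i.e. the cross-correlation peak."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, ref in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                score += ref * delayed[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align_pairs(parameters, displacement, lag):
    """Pair parameters[i] with displacement[i + lag], dropping unmatched edges."""
    pairs = []
    for i, p in enumerate(parameters):
        j = i + lag
        if 0 <= j < len(displacement):
            pairs.append((p, displacement[j]))
    return pairs


# Synthetic stand-ins for real recordings (hypothetical data for illustration only).
rng = random.Random(0)
voltage = [rng.gauss(0.0, 1.0) for _ in range(400)]           # drive parameter
measured = [0.0] * 5 + [0.1 * v for v in voltage[:-5]]         # delayed "displacement"

b, a = butterworth2_lowpass(cutoff_hz=8000.0, sample_rate_hz=48000.0)
filtered = apply_biquad(b, a, measured)                        # filtered indication
lag = estimate_lag(voltage, filtered, max_lag=20)              # correlation step
training_pairs = align_pairs(voltage, filtered, lag)           # synchronized pairs
```

Each entry of `training_pairs` maps one drive-parameter sample to one filtered displacement sample, so the list could serve as rows of a training data set in the sense of claim 10. Note that the filter itself adds delay, so the estimated lag need not equal the raw measurement delay.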
Priority Applications (1)

Application Number: PCT/CN2022/127609
Publication: WO2024087050A1 (en)
Priority Date: 2022-10-26
Filing Date: 2022-10-26
Title: Machine-learning-based diaphragm excursion prediction for speaker protection

Publications (1)

Publication Number: WO2024087050A1 (en)
Publication Date: 2024-05-02

Family ID: 90829523

Country Status (1)

Country Link
WO (1) WO2024087050A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180167732A1 (en) * 2016-12-13 2018-06-14 Samsung Electronics Co., Ltd. Method of processing sound signal of electronic device and electronic device for same
CN112533115A (en) * 2019-09-18 2021-03-19 华为技术有限公司 Method and device for improving tone quality of loudspeaker
CN114390406A (en) * 2020-10-16 2022-04-22 华为技术有限公司 Method and device for controlling displacement of loudspeaker diaphragm
US20220141578A1 (en) * 2020-10-30 2022-05-05 Samsung Electronics Co., Ltd. Nonlinear control of a loudspeaker with a neural network

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22963035

Country of ref document: EP

Kind code of ref document: A1