WO2024087050A1 - Machine-learning-based diaphragm excursion prediction for speaker protection - Google Patents
Machine-learning-based diaphragm excursion prediction for speaker protection
- Publication number
- WO2024087050A1 (PCT/CN2022/127609)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speaker
- displacement
- diaphragm
- parameters
- indication
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
- H04R3/007—Protection circuits for transducers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R9/00—Transducers of moving-coil, moving-strip, or moving-wire type
- H04R9/06—Loudspeakers
Definitions
- aspects of the present disclosure relate to speaker diaphragm protection using machine learning techniques.
- a speaker is an electro-acoustic transducer, generating sound from an electric signal produced by a power amplifier.
- the voice coil of a speaker is attached to a diaphragm that is mounted on a fixed frame via a suspension.
- a magnetic field is generated by a permanent magnet that is conducted to the region of the coil gap. Due to the presence of the magnetic field, an electrical current passing through the voice-coil generates a force f c which causes the membrane to move up and down.
- the displacement x d of the diaphragm is the excursion, which has a limit. If the excursion limit is exceeded, the speaker exhibits nonlinear behavior, which in turn manifests as distorted sound and degraded acoustic echo cancellation performance.
- Certain aspects generally relate to machine-learning (ML) -based diaphragm excursion prediction for speaker protection.
- the method generally includes receiving an indication of one or more parameters associated with driving a speaker, predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters, and taking one or more actions based on the predicted displacement.
- processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer- readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and apparatus comprising means for performing the aforementioned methods as well as those further described herein.
- FIG. 1 depicts an example speaker cross-section showing diaphragm displacement.
- FIG. 2 illustrates an example laser excursion measurement
- FIG. 3A illustrates an example machine learning (ML) model to predict displacement.
- FIG. 3B illustrates an example input matrix, ML model, and output vector.
- FIG. 4A illustrates an example preprocessing procedure.
- FIG. 4B illustrates an example plot illustrating a correlation between current I and displacement X.
- FIG. 4C illustrates an example input, filter, and extracted frequency.
- FIG. 5 illustrates an example Fourier neural network model.
- FIG. 6 illustrates an example model, in accordance with certain aspects of the present disclosure.
- FIG. 7 illustrates an example batch normalization (BN) re-estimation algorithm, in accordance with certain aspects of the present disclosure.
- FIG. 8 illustrates an example plot illustrating an L1 loss (e.g., a Least Absolute Deviation loss) and a scaled fourth-power loss.
- FIG. 9 illustrates an example plot illustrating a predicted value, a ground truth value, and a residual value.
- FIG. 10 illustrates an example system implementation, in accordance with certain aspects of the present disclosure.
- FIG. 11 illustrates an example method flow diagram, in accordance with certain aspects of the present disclosure.
- FIG. 12 illustrates an example method flow diagram, in accordance with certain aspects of the present disclosure.
- FIG. 13 illustrates an example device, in accordance with certain aspects of the present disclosure.
- FIG. 14 illustrates an example device, in accordance with certain aspects of the present disclosure.
- the APPENDIX describes various aspects of the present disclosure.
- aspects of the present disclosure provide apparatuses, methods, processing systems, and computer-readable mediums for speaker diaphragm excursion prediction using machine learning.
- Speaker protection generally leverages the playback signal to prevent over-excursion while maintaining maximum loudness, e.g., for speakerphone or gaming use in tiny loudspeakers, such as those found in smartphones, tablets, laptops, and other portable devices.
- One challenge is to model and predict over-excursion with highly nonlinear characteristics.
- aspects of the present disclosure utilize deep learning (DL) techniques to accurately predict nonlinear excursion of speaker diaphragms when driven.
- Feedback current and/or voltage may be sampled as input, and a laser may be used to measure diaphragm excursion. The sampled current and/or voltage are labeled (or otherwise correlated) with the measured diaphragm excursion.
- a convolutional neural network (ConvNet, or CNN) is designed as the baseline, and a fast Fourier transform network (FFTNet) may be used to explore the dominant low-frequency and the unknown harmonic (s) .
- batch normalization (BN) re-estimation is enabled to achieve online adaptation, and quantization (8-bit integer (INT8) quantization) based on the artificial intelligence model efficiency toolkit (AIMET) is used to further reduce the computational complexity involved in predicting diaphragm excursion from current and voltage.
- Certain aspects of the present disclosure can achieve a residual DC corresponding to a diaphragm excursion of less than 0.1 mm for greater than 99% of samples, which may exceed the performance of various digital signal processor (DSP) solutions, as verified on two speakers across three scenarios.
- Some solutions for speaker protection involve building a speaker protection block, which first monitors the current and voltage, and then analyzes the buffered signal and predicts the excursion status. Once the predicted excursion is larger than a threshold, the speaker protection block is triggered to attenuate the input power or modify the source signal to decrease diaphragm excursion. However, it is hard to precisely predict diaphragm excursion based on the monitored current and/or voltage. Thus, one simple technique to control diaphragm excursion may use traditional equalization (EQ) filters to attenuate an input signal. These traditional EQ filters are generally designed conservatively due to the wide range of operating factors (e.g., speaker variations, various types of audio signals with large dynamic ranges, etc.) in which various speakers operate.
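- For illustration only, the following minimal sketch shows a threshold-and-attenuate gain step of the kind the protection block described above might apply; it assumes NumPy, and the 0.4 mm excursion limit and the attack/release smoothing factors are hypothetical values, not taken from the disclosure.

```python
import numpy as np

X_MAX_MM = 0.4               # hypothetical excursion limit for this speaker
ATTACK, RELEASE = 0.5, 0.05  # hypothetical gain-smoothing factors

def protect_frame(frame, predicted_excursion_mm, prev_gain=1.0):
    """Attenuate one playback frame when the predicted excursion exceeds the limit."""
    if predicted_excursion_mm > X_MAX_MM:
        target = X_MAX_MM / predicted_excursion_mm   # scale down proportionally
    else:
        target = 1.0                                 # within the limit: unity gain
    # Smooth gain changes to avoid audible artifacts (fast attack, slow release).
    rate = ATTACK if target < prev_gain else RELEASE
    gain = prev_gain + rate * (target - prev_gain)
    return gain * frame, gain

frame = np.random.randn(256).astype(np.float32)      # one 256-sample playback frame
out, g = protect_frame(frame, predicted_excursion_mm=0.55)
```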
- an EQ filter for nonlinear distortion in direct-radiator loudspeakers in a closed cabinet may be implemented as an exact inverse of an electro-mechanical model of the loudspeaker. Estimates generated by the digital loudspeaker model may be used to predict the excursion based on the input voltage, and the predicted excursion may be controlled using dynamic range compression in the excursion domain. These approaches, however, do not push the speaker to its true limit. For example, EQ filters still attenuate the output audio, even when audio-signal energy is low and diaphragm excursion is within a defined limit or threshold, thereby degrading the audio quality and the volume of the audio.
- DL approaches can be used in modeling the behaviors of a voice coil actuator (VCA) .
- DL approaches generally incorporate a recurrent neural network (RNN) into a multi-physics simulation to enhance computational efficiency.
- DL solutions can solve differential equations (DEs) , which can partially model a diverse non-linear system, such as excursion prediction and VCA modeling.
- Neural operators in the RNN can directly learn the mapping from any functional parametric dependence to the solution.
- One example uses physics-informed neural networks that directly solve the ordinary DEs.
- Another example formulates the neural operator in the Fourier space by parameterizing the integral kernel.
- DL solutions may be highly dependent on the training data set and may be subject to overfitting, especially with highly variable data sets, such as those associated with diaphragm excursion characteristics.
- aspects of the present disclosure provide DL techniques to explore effective features for speaker protection.
- a diaphragm excursion measurement setup is established in which a laser tracks the corresponding excursion, and a comprehensive preprocessing pipeline prepares the dataset.
- a model based on ConvNet and/or FFTNet, for example, may be trained and verified based on two typical speakers.
- BN re-estimation for online adaptation and quantization in AIMET may also be implemented.
- FIG. 1 depicts a cross-section of an example speaker 100 showing displacement X (also referred to as “excursion” ) of the speaker’s diaphragm.
- the speaker 100 represents the transduction of electrical energy to mechanical energy.
- a continuous-time model for the electrical behavior is shown as Equation (1): v_c(t) = R_eb·i_c(t) + φ_0·ẋ_d(t)
- v_c(t) is the voltage input across the terminals of the voice coil
- i_c(t) is the voice coil current
- R_eb is the blocked electrical resistance
- ẋ_d(t) is the diaphragm excursion velocity
- φ_0 is the transduction coefficient at the equilibrium state of the diaphragm excursion x_d(t)
- Such mechanical characteristics of the speaker may be mostly determined by the parameters R_eb and φ_0, which depend highly on the speaker's geometric construction and the materials used in the voice coil, the diaphragm, and the enclosure. It is hard to accurately construct a mathematical model that tracks this nonlinear distribution and variation.
- In Equation (1), such nonlinear features can be represented by [v_c(t), i_c(t), x_d(t)], which is further used in supervised training of models used in a DL solution to learn the characteristics of a speaker.
- FIG. 2 illustrates an example laser excursion measurement environment 200.
- the current (I) and voltage (V) are the inputs.
- the measured excursion (X) from the laser is used/labeled as ground truth data used in supervised learning techniques to train the machine learning models described herein.
- FIG. 3A illustrates an example machine learning (ML) model 300A trained to predict displacement (i.e., excursion) of a speaker diaphragm based on a training data set of V and/or I, labeled with a measured amount of displacement of a speaker diaphragm.
- the ML model 300A may predict the displacement of a speaker diaphragm based on V and/or I.
- FIG. 3B illustrates an example 300B of an input matrix, ML model, and output vector.
- Example 300B may be subject to various constraints on model size and latency.
- Example constraints may include the size of a one-dimensional time sequence (e.g., sampled at a sampling rate of 48 kHz) , timing constraints (e.g., 10 ms scheduling time) , input length (e.g., 256 samples provided as input) , and the like.
- the output of the ML model generally includes a prediction of diaphragm excursion given the inputs of V and/or I. In some aspects, inference may use mixed quantization (8-bit and 16-bit) and may be performed by a central processing unit (CPU) or offloaded to other processors, such as a graphics processing unit (GPU) or a neural processing unit (NPU) .
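- As a minimal sketch of how the V/I input matrix described above could be assembled (assuming NumPy, the 48 kHz sampling rate and 256-sample input length stated above, and an assumed 10 ms hop between windows; the two-channel [V, I] layout is an illustrative choice):

```python
import numpy as np

FS = 48_000          # sampling rate (Hz)
WIN = 256            # samples per model input
HOP = FS // 100      # 10 ms scheduling interval -> 480 samples (assumed hop)

def make_frames(voltage, current, win=WIN, hop=HOP):
    """Stack synchronized V and I windows into (num_frames, 2, win) model inputs."""
    n = min(len(voltage), len(current))
    starts = range(0, n - win + 1, hop)
    frames = [np.stack([voltage[s:s + win], current[s:s + win]]) for s in starts]
    return np.asarray(frames, dtype=np.float32)

v = np.random.randn(FS).astype(np.float32)   # 1 s of synthetic voltage samples
i = np.random.randn(FS).astype(np.float32)   # 1 s of synthetic current samples
x_in = make_frames(v, i)                      # shape: (num_frames, 2, 256)
```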
- FIG. 4A illustrates an example preprocessing procedure 400A, according to certain aspects of the present disclosure.
- a laser may be pointed at the center of the speaker in order to track the displacement (i.e., excursion) of the diaphragm for any given V and/or I used to drive the speaker.
- the measured diaphragm displacement is logged as x d (t) .
- the corresponding real-time current i c (t) and/or voltage v c (t) are measured, as shown in Fig. 1.
- Equation (1) is transformed as Equation (2): x_d(t) = f(v_c(t), i_c(t))
- f ( ⁇ ) is the function to represent the mechanical characteristics of the speaker.
- voltage is the source input, encoded by the voice content, which causes the diaphragm to vibrate and from which excursion from a base plane can be measured.
- the mechanical response is embedded into the feedback’s current.
- the motivation is to learn an approximation of f(·) based on the logged dataset, and to predict diaphragm displacement using the trained model, given the real-time v_c(t) and/or i_c(t).
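- As a conceptual sketch only of fitting such an approximation to the logged (v, i) → x_d dataset (assuming a PyTorch-style regression setup; the stand-in network, tensor shapes, and hyperparameters are illustrative and are not the models of FIGs. 5-6):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder for the logged, preprocessed dataset: (N, 2, 256) [v, i] windows and
# (N,) laser-measured DC-drift targets in mm.
inputs = torch.randn(1024, 2, 256)
targets = torch.randn(1024)

model = nn.Sequential(                   # stand-in regressor for the learned f(.)
    nn.Conv1d(2, 16, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()

for epoch in range(5):
    for x, y in DataLoader(TensorDataset(inputs, targets), batch_size=64, shuffle=True):
        opt.zero_grad()
        loss = loss_fn(model(x).squeeze(-1), y)   # predicted vs. measured displacement
        loss.backward()
        opt.step()
```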
- aspects of the present disclosure may use direct current (DC) drift prediction, where a low-pass filter (e.g., a second-order Butterworth filter or other second-order filter) with a relatively low cutoff frequency (e.g., 10 Hz) may be used to extract the DC component of the measured diaphragm displacement, which, as discussed above, may be used as ground-truth data associated with a V and/or I measurement for training the models discussed herein.
- synchronization may be implemented to temporally align a sequence of V and/or I data with the corresponding measurements.
- Cross-correlation between current and measured excursion may be used to time shift the data for accurate training of the model.
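- A minimal sketch of these two preprocessing steps, assuming SciPy/NumPy: a second-order Butterworth low-pass filter (10 Hz cutoff, as in the example values above) extracts the DC drift of the measured excursion, and cross-correlation estimates the lag used to time-align current and excursion; the synthetic signals and the use of np.roll for shifting are illustrative only.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 48_000  # Hz

def extract_dc(excursion, fs=FS, cutoff_hz=10.0, order=2):
    """Second-order Butterworth low-pass filter extracting the DC drift of the excursion."""
    b, a = butter(order, cutoff_hz, btype="low", fs=fs)
    return filtfilt(b, a, excursion)

def align_by_xcorr(current, excursion):
    """Estimate the lag (in samples) between current and measured excursion, then shift."""
    c = (current - current.mean()) / (current.std() + 1e-12)
    x = (excursion - excursion.mean()) / (excursion.std() + 1e-12)
    corr = np.correlate(c, x, mode="full")
    lag = corr.argmax() - (len(x) - 1)
    return lag, np.roll(excursion, lag)   # circular shift; illustrative alignment only

# Example on a synthetic excursion trace.
t = np.arange(FS) / FS
x_meas = 0.05 * np.sin(2 * np.pi * 3 * t) + 0.01 * np.random.randn(FS)
dc = extract_dc(x_meas)
lag, x_aligned = align_by_xcorr(np.roll(x_meas, 100), x_meas)
```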
- FIG. 4B illustrates an example plot 400B illustrating a correlation between current I and displacement X.
- the plot illustrates synchronization between I and the measured excursion.
- V/I monitoring and laser measurement may be deployed as two independent parts. The correlation between I and X is illustrated at 8 kHz.
- FIG. 4C illustrates an example correlation 400C for an input, filter, and extracted frequency.
- a raw measurement (e.g., sampled at 48 kHz) may be passed through a low-pass filter (e.g., a Butterworth filter), which may extract the DC value in the measured excursion.
- FIG. 5 illustrates an example Fourier neural network model 500, according to certain aspects of the present disclosure.
- the output prediction x_N is mapped to the time stamp t as the last sample in the sequence.
- Each sample used to train the model is defined as {s, x_N}.
- a Fourier attention operator layer (FAOL) may be used to extract the effective frequency components for a given input sequence of voltage and/or current components.
- multi-head attention is embedded, and each complex value is re-organized into two real parts, simplifying concatenation in the channel domain. After attention processing, the channels may be recombined to restore a complex value.
- several FAOL blocks may be concatenated, which may aid in extracting the harmonic features for a given sequence of samples.
- a skip path in the time domain may be used to restore frequency components of the input sequence that may have been previously discarded.
- the overall structure of a fast Fourier transform network may include a 1-dimensional convolutional layer that increases the size of the channel feature space, an FAOL configured to extract features from an input sequence, and an average pooling layer to down-sample the extracted features into a smaller space.
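- The following PyTorch sketch is one illustrative reading of the FAOL and FFTNet structure described above, not the disclosed architecture: an rFFT, attention over concatenated real/imaginary channels, an inverse rFFT, a time-domain skip path, and a conv1d/pooling/regression wrapper; all layer widths, head counts, and block counts are arbitrary placeholders.

```python
import torch
from torch import nn

class FAOL(nn.Module):
    """Sketch of a Fourier attention operator layer: rFFT -> attention over
    (real, imag) channels -> inverse rFFT, plus a time-domain skip path."""
    def __init__(self, channels, heads=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(2 * channels, heads, batch_first=True)

    def forward(self, x):                        # x: (batch, channels, time)
        spec = torch.fft.rfft(x, dim=-1)         # (batch, channels, freq), complex
        feat = torch.cat([spec.real, spec.imag], dim=1)   # two real parts in channel domain
        feat = feat.transpose(1, 2)              # (batch, freq, 2*channels) for attention
        feat, _ = self.attn(feat, feat, feat)
        feat = feat.transpose(1, 2)
        c = x.shape[1]
        spec = torch.complex(feat[:, :c], feat[:, c:])     # recombine into complex values
        y = torch.fft.irfft(spec, n=x.shape[-1], dim=-1)
        return y + x                             # time-domain skip path

class FFTNetSketch(nn.Module):
    def __init__(self, in_ch=2, width=8, blocks=3):
        super().__init__()
        self.expand = nn.Conv1d(in_ch, width, kernel_size=3, padding=1)  # expand channels
        self.faols = nn.Sequential(*[FAOL(width) for _ in range(blocks)])
        self.pool = nn.AdaptiveAvgPool1d(8)                              # down-sample features
        self.head = nn.Linear(width * 8, 1)                              # regress DC drift

    def forward(self, x):
        h = self.pool(self.faols(self.expand(x)))
        return self.head(h.flatten(1))

pred = FFTNetSketch()(torch.randn(4, 2, 256))    # (4, 1) predicted excursion values
```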
- a ResNet-based one-dimensional convolutional network may be used, which may have a similar structure as the FFTNet discussed above (e.g., have the same input and output format, a single conv1d layer, several ResNet blocks, average pooling, and a fully connected (FC) layer to regress the predicted DC drift) .
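- A corresponding sketch of the ResNet-based 1D ConvNet baseline described above (single conv1d stem, residual blocks, average pooling, and an FC regression head); the layer widths and block count are illustrative, not the disclosed configuration.

```python
import torch
from torch import nn

class ResBlock1D(nn.Module):
    """Basic 1D residual block: two conv layers with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(ch, ch, 3, padding=1), nn.BatchNorm1d(ch), nn.ReLU(),
            nn.Conv1d(ch, ch, 3, padding=1), nn.BatchNorm1d(ch))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.body(x) + x)

class ConvNet1DSketch(nn.Module):
    def __init__(self, in_ch=2, width=16, blocks=4):
        super().__init__()
        self.stem = nn.Conv1d(in_ch, width, 3, padding=1)          # single conv1d layer
        self.blocks = nn.Sequential(*[ResBlock1D(width) for _ in range(blocks)])
        self.pool = nn.AdaptiveAvgPool1d(1)                         # average pooling
        self.fc = nn.Linear(width, 1)                               # regress predicted DC drift

    def forward(self, x):                                           # x: (batch, 2, 256) of [v, i]
        h = self.pool(self.blocks(self.stem(x))).flatten(1)
        return self.fc(h)

pred = ConvNet1DSketch()(torch.randn(4, 2, 256))                    # (4, 1)
```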
- a fast Fourier transform (FFT) neural model complexity may be approximately 333,184 float operations and 1,725 parameters.
- FIG. 6 illustrates an example model 600, in accordance with aspects of the present disclosure.
- a ResNet 1D model complexity may be approximately 3,244,096 float operations and 19,073 parameters.
- FIG. 7 illustrates an example batch normalization (BN) re-estimation algorithm 700, in accordance with aspects of the present disclosure.
- x_j and y_j are the input/output scalars of one neuron response in one sample
- X_·j denotes the j-th column of the input data in one BN layer
- n denotes the batch size
- p is the feature dimension
- γ_j and β_j are parameters to be learned. Once the training is done, the parameters in BN are frozen for inference operations.
- the input is buffered within the given window to calculate the mean and variance. To track their variation, a 1-tap infinite impulse response (IIR) filter is used to track the mean and variance, which is useful for the optimization over the whole space.
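- A minimal sketch of this BN re-estimation idea in PyTorch: the learned γ/β stay frozen while the running mean and variance are re-estimated from buffered inference-time activations via a 1-tap IIR update; the coefficient value (0.1), buffer size, and layer width are placeholders.

```python
import torch
from torch import nn

@torch.no_grad()
def bn_reestimate(bn: nn.BatchNorm1d, buffered_inputs: torch.Tensor, alpha: float = 0.1):
    """Update a frozen BN layer's running statistics from buffered inference-time data.

    buffered_inputs: (window, features) activations observed at this BN layer.
    alpha: 1-tap IIR coefficient controlling how fast the statistics track new data.
    """
    batch_mean = buffered_inputs.mean(dim=0)
    batch_var = buffered_inputs.var(dim=0, unbiased=False)
    # 1-tap IIR tracking of mean/variance; learned gamma/beta are left untouched.
    bn.running_mean.mul_(1 - alpha).add_(alpha * batch_mean)
    bn.running_var.mul_(1 - alpha).add_(alpha * batch_var)

# Example: adapt one BN layer using a buffer of 256 feature vectors from a new speaker.
bn = nn.BatchNorm1d(16).eval()
bn_reestimate(bn, torch.randn(256, 16), alpha=0.1)
```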
- FIG. 8 illustrates an example plot 800 illustrating an L1 loss and a scaled fourth-power loss.
- a first method utilizes an FFTNet with 3 FFT attention blocks, and involves 333k float operations and 1.7k parameters; a second method utilizes a ConvNet with 4 ResNet blocks, involving 3244k float operations and 19k parameters; a third method uses a digital signal processor (DSP) with limited operations and negligible parameter memory cost.
- an L1 loss threshold can be used so that small loss values may be considered noise, while loss values larger than the threshold can be enhanced.
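- One plausible realization of such a thresholded loss, for illustration only (this specific combination of an L1 term and a scaled fourth-power penalty above a threshold is an assumption, not the exact loss of the disclosure; the threshold and scale values are placeholders):

```python
import torch

def thresholded_excursion_loss(pred, target, thresh=0.02, scale=100.0):
    """L1 loss plus a scaled fourth-power penalty on residuals above `thresh` (mm)."""
    resid = (pred - target).abs()
    l1 = resid.mean()
    big = torch.clamp(resid - thresh, min=0.0)   # residuals below the threshold act as noise
    fourth = scale * (big ** 4).mean()           # enhance losses larger than the threshold
    return l1 + fourth

loss = thresholded_excursion_loss(torch.tensor([0.05, 0.12]), torch.tensor([0.04, 0.02]))
```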
- predicted diaphragm displacement can track the peak DC jitter with small or no residual loss.
- for small DC jitter, which is impacted by random noise (e.g., from the circuit or mechanical diaphragm noise), the variation of the predicted diaphragm displacement is hard to track.
- diaphragm displacement gradually reduces to zero, but is not fully aligned with electrical signal control.
- while such cliffing leads to large loss values, it would not damage the speaker, as the amount or magnitude of diaphragm excursion is decreasing (leading to less mechanical stress on the diaphragm).
- the predicted DC drift sequence can track the ground truth DC variation, and the maximum L1 loss may be a value less than a target displacement value (e.g., a measured displacement of 0.0478 mm, less than the target displacement of 0.1 mm) .
- FFTNet shows more promising performance.
- FFT processing may be computationally complex and may be challenging to deploy in a real hardware device, e.g., when accelerated by an NPU.
- high precision (e.g., 32-bit floating point) models may be reserved for complex scenarios, such as a DC injection scenario, which may be considered a highly challenging corner case against which a speaker protection block is to protect.
- Table 2 illustrates verification of BN re-estimation for online adaptation, where one model, trained based on a specific type of speaker (e.g., an SBS2 speaker) , is used to predict diaphragm displacement for a different type of speaker (e.g., an AAC speaker) .
- One AAC unit is used to verify the performance of the models discussed herein and also to further explore the impact of different filter coefficient values.
- while a filter coefficient of 0.1 achieves the optimum maximum loss, the corresponding mean loss may be higher than the mean loss for other coefficient values.
- model quantization may be used to reduce the complexity when deploying models on edge devices.
- AIMET can be used to perform quantization (e.g., to 8-bit integer (INT8) or some other level of quantization) .
- because FFT operations are not supported in AIMET, the ConvNet described herein may be quantized, and the FFTNet need not be used.
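- As a rough sketch of how such a ConvNet might be quantized to INT8, assuming the aimet_torch QuantizationSimModel API; the stand-in model, the calibration data, and all arguments are illustrative rather than the disclosed configuration.

```python
import torch
from aimet_torch.quantsim import QuantizationSimModel

# Stand-in for the trained ConvNet regressor (2-channel V/I input, scalar DC-drift output).
model = torch.nn.Sequential(
    torch.nn.Conv1d(2, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool1d(1), torch.nn.Flatten(), torch.nn.Linear(8, 1)).eval()
dummy_input = torch.randn(1, 2, 256)

sim = QuantizationSimModel(model, dummy_input=dummy_input,
                           default_param_bw=8, default_output_bw=8)

def calibrate(quant_model, _):
    # Forward a small calibration set so activation encodings can be computed.
    with torch.no_grad():
        for _ in range(16):
            quant_model(torch.randn(4, 2, 256))

sim.compute_encodings(calibrate, None)
int8_pred = sim.model(dummy_input)   # simulated INT8 inference
```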
- two separate models are designed specifically for SBS2 and AAC.
- the ConvNet32 shows a large gain in the mean loss, and the maximum loss is close to, but larger than, the target (0.1 mm).
- INT8 performance is slightly degraded, with the maximum L1 loss increasing from 0.1121 mm to 0.1298 mm, but is still much better than the DSP solution.
- Inference performance for the SBS2 speaker leads to a similar conclusion as the AAC; however, the mean loss may be somewhat worse than that in the DSP.
- FIG. 9 illustrates an example plot 900 illustrating a predicted value, a ground truth value, and a residual value.
- FIG. 10 illustrates an example system implementation 1000, in accordance with aspects of the present disclosure.
- aspects of the present disclosure provide an end-to-end pipeline to predict DC drift which, as discussed, may be correlated with speaker diaphragm displacement (i.e., excursion) .
- An attention mechanism may be used to extract frequency features from an input of V and/or I, which shows better performance than the ConvNet and the DSP solutions.
- BN re-estimation may be enabled for online adaptation when the model is applied to new scenarios (e.g., applied to speaker protection for different speaker types) .
- FIG. 11 shows an example of a method 1100 for ML-based diaphragm excursion prediction for speaker protection, in accordance with aspects of the present disclosure.
- the method 1100 may be performed by a device, such as the device 1300 illustrated in FIG. 13.
- method 1100 begins at block 1110 with receiving an indication of one or more parameters associated with driving a speaker.
- the operations of this block refer to, or may be performed by, a component (e.g., 1324A-1324C) of a memory 1324 as described with reference to FIG. 13.
- Method 1100 then proceeds to block 1120 with predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters.
- the operations of this block refer to, or may be performed by, a component (e.g., 1324A-1324C) of a memory 1324 as described with reference to FIG. 13.
- Method 1100 then proceeds to block 1130 with taking one or more actions based on the predicted displacement.
- the operations of this block refer to, or may be performed by, one or more computer-executable components (e.g., 1324A-1324C) of a memory 1324 as described with reference to FIG. 13.
- FIG. 12 shows an example of a method 1200 for training a machine learning model, which may be used in speaker protection tasks, to predict speaker diaphragm displacement (excursion) , in accordance with aspects of the present disclosure.
- the method 1200 may be performed by a device, such as the device 1400 illustrated in FIG. 14.
- method 1200 begins at block 1210 with generating a training data set mapping an indication of one or more parameters associated with driving a speaker to an indication of a displacement of a diaphragm of the speaker as the diaphragm moves due to the speaker being driven based on the one or more parameters.
- the operations of this block refer to, or may be performed by, a component (e.g., 1424A-1424B) of a memory 1424 as described with reference to FIG. 14.
- Method 1200 then proceeds to block 1220 with training a machine learning model, based on the training data set, to predict the displacement of the diaphragm.
- the operations of this block refer to, or may be performed by, a component (e.g., 1424A-1424B) of a memory 1424 as described with reference to FIG. 14.
- FIG. 13 depicts an example processing system 1300 for ML-based diaphragm excursion prediction for speaker protection, such as described herein for example with respect to FIG. 11.
- Processing system 1300 includes a central processing unit (CPU) 1302, which in some examples may be a multi-core CPU. Instructions executed at the CPU 1302 may be loaded, for example, from a program memory associated with the CPU 1302 or may be loaded from a memory 1324.
- Processing system 1300 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 1304, a digital signal processor (DSP) 1306, a neural processing unit (NPU) 1308, a multimedia processing unit 1310, a wireless connectivity component 1312.
- An NPU such as 1308, is generally a specialized circuit configured for implementing control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs) , deep neural networks (DNNs) , random forests (RFs) , and the like.
- An NPU may sometimes alternatively be referred to as a neural signal processor (NSP) , tensor processing units (TPUs) , neural network processor (NNP) , intelligence processing unit (IPU) , vision processing unit (VPU) , or graph processing unit.
- NPUs such as 1308, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models.
- a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC) , while in other examples the plurality of NPUs may be part of a dedicated neural-network accelerator.
- NPUs may be optimized for training or inference, or in some cases configured to balance performance between both.
- the two tasks may still generally be performed independently.
- NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged) , iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance.
- NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process this new piece through an already trained model to generate a model output (e.g., an inference) .
- NPU 1308 is a part of one or more of CPU 1302, GPU 1304, and/or DSP 1306.
- wireless connectivity component 1312 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE) , fifth generation connectivity (e.g., 5G or NR) , Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards.
- Wireless connectivity component 1312 is further connected to one or more antennas 1314.
- Processing system 1300 may also include one or more input and/or output devices 1322, such as screens, touch-sensitive surfaces (including touch-sensitive displays) , physical buttons, speakers, microphones, and the like.
- one or more of the processors of processing system 1300 may be based on an ARM or RISC-V instruction set.
- Processing system 1300 also includes memory 1324, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like.
- memory 1324 includes computer-executable components, which may be executed by one or more of the aforementioned processors of processing system 1300.
- memory 1324 includes receiving component 1324A, diaphragm displacement predicting component 1324B, and action taking component 1324C.
- processing system 1300 and/or components thereof may be configured to perform the methods described herein.
- aspects of processing system 1300 may be omitted, such as where processing system 1300 is a server computer or the like. Further, aspects of processing system 1300 may be distributed, such as training a model and using the model to generate inferences, such as user verification predictions.
- FIG. 14 depicts an example processing system 1400 for training a machine learning model, which may be used in speaker protection tasks, to predict speaker diaphragm displacement (excursion) , such as described herein for example with respect to FIG. 12.
- Processing system 1400 includes a central processing unit (CPU) 1402, which in some examples may be a multi-core CPU. Instructions executed at the CPU 1402 may be loaded, for example, from a program memory associated with the CPU 1402 or may be loaded from a memory 1424.
- Processing system 1400 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 1404, a digital signal processor (DSP) 1406, a neural processing unit (NPU) 1408, a multimedia processing unit 1410, a wireless connectivity component 1412.
- An NPU such as 1408, is generally a specialized circuit configured for implementing control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs) , deep neural networks (DNNs) , random forests (RFs) , and the like.
- An NPU may sometimes alternatively be referred to as a neural signal processor (NSP) , tensor processing units (TPUs) , neural network processor (NNP) , intelligence processing unit (IPU) , vision processing unit (VPU) , or graph processing unit.
- NPUs such as 1408, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models.
- a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC) , while in other examples the plurality of NPUs may be part of a dedicated neural-network accelerator.
- NPUs may be optimized for training or inference, or in some cases configured to balance performance between both.
- the two tasks may still generally be performed independently.
- NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged) , iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance.
- NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process this new piece through an already trained model to generate a model output (e.g., an inference) .
- NPU 1408 is a part of one or more of CPU 1402, GPU 1404, and/or DSP 1406.
- wireless connectivity component 1412 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE) , fifth generation connectivity (e.g., 5G or NR) , Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards.
- Wireless connectivity component 1412 is further connected to one or more antennas 1414.
- Processing system 1400 may also include one or more input and/or output devices 1422, such as screens, touch-sensitive surfaces (including touch-sensitive displays) , physical buttons, speakers, microphones, and the like.
- one or more of the processors of processing system 1400 may be based on an ARM or RISC-V instruction set.
- Processing system 1400 also includes memory 1424, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like.
- memory 1424 includes computer-executable components, which may be executed by one or more of the aforementioned processors of processing system 1400.
- memory 1424 includes generating component 1424A and training component 1424B.
- the depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.
- processing system 1400 and/or components thereof may be configured to perform the methods described herein.
- aspects of processing system 1400 may be omitted, such as where processing system 1400 is a server computer or the like. Further, aspects of processing system 1400 may be distributed, such as training a model and using the model to generate inferences, such as user verification predictions.
- Clause 1 A processor-implemented method comprising: receiving an indication of one or more parameters associated with driving a speaker; predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters; and taking one or more actions based on the predicted displacement.
- Clause 2 The method of Clause 1, wherein the taking comprises controlling an amplitude of at least one of a voltage signal or a current signal associated with driving the speaker, based on the predicted displacement.
- Clause 3 The method of Clause 1 or 2, wherein the machine learning model comprises a one-dimensional residual neural network model.
- Clause 4 The method of any of Clauses 1-3, wherein the machine learning model comprises a Fourier neural network model.
- Clause 5 The method of Clause 4, wherein the predicting comprises generating a Fourier transform of the one or more parameters, wherein the machine learning model comprises an attention block configured to extract features in the frequency domain from the Fourier transform of the one or more parameters.
- Clause 6 The method of Clause 4, wherein the machine learning model comprises a Fourier attention operator layer including a skip path associated with an inverse Fourier transform output in the time domain.
- Clause 7 The method of any of Clauses 1-6, wherein the machine learning model comprises a model trained to predict the displacement of the diaphragm of the speaker based on a training data set mapping the one or more parameters to a measured displacement.
- Clause 8 The method of any of Clauses 1-7, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.
- Clause 9 The method of any of Clauses 1-8, further comprising adapting the machine learning model from a first speaker to a second speaker based on batch normalization.
- Clause 10 A processor-implemented method comprising: generating a training data set mapping an indication of one or more parameters associated with driving a speaker to an indication of a displacement of a diaphragm of the speaker as the diaphragm moves due to the speaker being driven based on the one or more parameters; and training a machine learning model, based on the training data set, to predict the displacement of the diaphragm.
- Clause 11 The method of Clause 10, wherein generating the training data set comprises filtering the indication of the displacement of the diaphragm to generate a filtered indication.
- Clause 12 The method of Clause 11, wherein the filtering comprises using a Butterworth filter.
- Clause 13 The method of Clause 11 or 12, wherein generating the training data set comprises correlating the filtered indication of the displacement of the diaphragm with the indication of the one or more parameters to effectively synchronize the filtered indication of the displacement with the one or more parameters.
- Clause 14 The method of Clause 13, wherein the training is based on the filtered indication of the displacement effectively synchronized to the indication of the one or more parameters.
- Clause 15 The method of any of Clauses 10-14, wherein the machine learning model comprises a one-dimensional residual neural network model.
- Clause 16 The method of any of Clauses 10-14, wherein the machine learning model comprises a Fourier neural network model.
- Clause 17 The method of any of Clauses 10-16, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.
- Clause 18 A system comprising: a memory having executable instructions stored thereon; and a processor configured to execute the executable instructions to cause the system to perform the operations of any of Clauses 1-17.
- Clause 19 A system comprising means for performing the operations of any of Clauses 1-17.
- Clause 20 A computer-readable medium having instructions stored thereon which, when executed by a processor, performs the operations of any of Clauses 1-17.
- an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein.
- the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.
- exemplary means “serving as an example, instance, or illustration. ” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
- a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members.
- “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c) .
- determining encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure) , ascertaining, and the like. Also, “determining” may include receiving (e.g., receiving information) , accessing (e.g., accessing data in a memory) , and the like. Also, “determining” may include resolving, selecting, choosing, establishing, and the like.
- the methods disclosed herein comprise one or more steps or actions for achieving the methods.
- the method steps and/or actions may be interchanged with one another without departing from the scope of the claims.
- the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
- the various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions.
- the means may include various hardware and/or software component (s) and/or module (s) , including, but not limited to a circuit, an application specific integrated circuit (ASIC) , or processor.
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
Certain aspects of the present disclosure provide techniques and apparatus for machine-learning-based diaphragm excursion prediction for speaker protection. One example method generally includes receiving an indication of one or more parameters associated with driving a speaker, predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters, and taking one or more actions based on the predicted displacement.
Description
INTRODUCTION
Aspects of the present disclosure relate to speaker diaphragm protection using machine learning techniques.
A speaker is an electro-acoustic transducer, generating sound from an electric signal produced by a power amplifier. Generally, the voice coil of a speaker is attached to a diaphragm that is mounted on a fixed frame via a suspension. A magnetic field is generated by a permanent magnet that is conducted to the region of the coil gap. Due to the presence of the magnetic field, an electrical current passing through the voice coil generates a force f_c which causes the membrane to move up and down. The displacement x_d of the diaphragm is the excursion, which has a limit. If the excursion limit is exceeded, the speaker exhibits nonlinear behavior, which in turn manifests as distorted sound and degraded acoustic echo cancellation performance. Moreover, as current is pushed through the voice coil, some of the electrical energy is converted into heat instead of sound. Further, if the speaker is driven too hard, such excursions heat the diaphragm, which may distort the diaphragm and, in some cases, may manifest as plastic melt visible as bubbles on the edge of the diaphragm. This distortion may create an asymmetry in the diaphragm that causes the diaphragm to not vibrate as a piston. The issue generally becomes more acute as speakers become smaller and more portable (e.g., as used in micro-speakers, earbuds, etc.).
BRIEF SUMMARY
Certain aspects generally relate to machine-learning (ML) -based diaphragm excursion prediction for speaker protection.
Certain aspects provide a method. The method generally includes receiving an indication of one or more parameters associated with driving a speaker, predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters, and taking one or more actions based on the predicted displacement.
Other aspects provide processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer- readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and apparatus comprising means for performing the aforementioned methods as well as those further described herein.
The following description and the related drawings set forth in detail certain illustrative features of one or more embodiments.
So that the manner in which the above-recited features of the present disclosure can be understood in detail, a more particular description, briefly summarized above, may be had by reference to aspects, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only certain typical aspects of this disclosure and are therefore not to be considered limiting of its scope, for the description may admit to other equally effective aspects.
FIG. 1 depicts an example speaker cross-section showing diaphragm displacement.
FIG. 2 illustrates an example laser excursion measurement.
FIG. 3A illustrates an example machine learning (ML) model to predict displacement.
FIG. 3B illustrates an example input matrix, ML model, and output vector.
FIG. 4A illustrates an example preprocessing procedure.
FIG. 4B illustrates an example plot illustrating a correlation between current I and displacement X.
FIG. 4C illustrates an example input, filter, and extracted frequency.
FIG. 5 illustrates an example Fourier neural network model.
FIG. 6 illustrates an example model, in accordance with certain aspects of the present disclosure.
FIG. 7 illustrates an example batch normalization (BN) re-estimation algorithm, in accordance with certain aspects of the present disclosure.
FIG. 8 illustrates an example plot illustrating an L1 loss (e.g., a Least Absolute Deviation loss) and a scaled fourth-power loss.
FIG. 9 illustrates an example plot illustrating a predicted value, a ground truth value, and a residual value.
FIG. 10 illustrates an example system implementation, in accordance with certain aspects of the present disclosure.
FIG. 11 illustrates an example method flow diagram, in accordance with certain aspects of the present disclosure.
FIG. 12 illustrates an example method flow diagram, in accordance with certain aspects of the present disclosure.
FIG. 13 illustrates an example device, in accordance with certain aspects of the present disclosure.
FIG. 14 illustrates an example device, in accordance with certain aspects of the present disclosure.
The APPENDIX describes various aspects of the present disclosure.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the drawings. It is contemplated that elements and features of one embodiment may be beneficially incorporated in other embodiments without further recitation.
Aspects of the present disclosure provide apparatuses, methods, processing systems, and computer-readable mediums for speaker diaphragm excursion prediction using machine learning.
Speaker protection generally leverages the playback signal to prevent over-excursion while maintaining maximum loudness, e.g., for speakerphone or gaming use in tiny loudspeakers, such as those found in smartphones, tablets, laptops, and other portable devices. One challenge is to model and predict over-excursion with highly nonlinear characteristics. To do so, aspects of the present disclosure utilize deep learning (DL) techniques to accurately predict nonlinear excursion of speaker diaphragms when driven. Feedback current and/or voltage may be sampled as input, and a laser may be used to measure diaphragm excursion. The sampled current and/or voltage are labeled (or otherwise correlated) with the measured diaphragm excursion. In some aspects, a convolutional neural network (ConvNet, or CNN) is designed as the baseline, and a fast Fourier transform network (FFTNet) may be used to explore the dominant low-frequency and the unknown harmonic(s). In some aspects, batch normalization (BN) re-estimation is enabled to achieve online adaptation, and quantization (8-bit integer (INT8) quantization) based on the artificial intelligence model efficiency toolkit (AIMET) is used to further reduce the computational complexity involved in predicting diaphragm excursion from current and voltage. Certain aspects of the present disclosure can achieve a residual DC corresponding to a diaphragm excursion of less than 0.1 mm for greater than 99% of samples, which may exceed the performance of various digital signal processor (DSP) solutions, as verified on two speakers across three scenarios.
Some solutions for speaker protection involve building a speaker protection block, which first monitors the current and voltage, and then analyzes the buffered signal and predicts the excursion status. Once the predicted excursion is larger than a threshold, the speaker protection block is triggered to attenuate the input power or modify the source signal to decrease diaphragm excursion. However, it is hard to precisely predict diaphragm excursion based on the monitored current and/or voltage. Thus, one simple technique to control diaphragm excursion may use traditional equalization (EQ) filters to attenuate an input signal. These traditional EQ filters are generally designed conservatively due to the wide range of operating factors (e.g., speaker variations, various types of audio signals with large dynamic ranges, etc.) in which various speakers operate. For example, an EQ filter for nonlinear distortion in direct-radiator loudspeakers in a closed cabinet may be implemented as an exact inverse of an electro-mechanical model of the loudspeaker. Estimates generated by the digital loudspeaker model may be used to predict the excursion based on the input voltage, and the predicted excursion may be controlled using dynamic range compression in the excursion domain. These approaches, however, do not push the speaker to its true limit. For example, EQ filters still attenuate the output audio, even when audio-signal energy is low and diaphragm excursion is within a defined limit or threshold, thereby degrading the audio quality and the volume of the audio.
In another example, deep learning (DL) approaches can be used in modeling the behaviors of a voice coil actuator (VCA). These DL approaches generally incorporate a recurrent neural network (RNN) into a multi-physics simulation to enhance computational efficiency. DL solutions can solve differential equations (DEs), which can partially model a diverse non-linear system, such as excursion prediction and VCA modeling. Neural operators in the RNN can directly learn the mapping from any functional parametric dependence to the solution. One example uses physics-informed neural networks that directly solve the ordinary DEs. Another example formulates the neural operator in the Fourier space by parameterizing the integral kernel. However, DL solutions may be highly dependent on the training data set and may be subject to overfitting, especially with highly variable data sets, such as those associated with diaphragm excursion characteristics.
To allow for accurate prediction of diaphragm excursion, which may be used in speaker protection tasks, aspects of the present disclosure provide DL techniques to explore effective features for speaker protection. To do so, a diaphragm excursion measurement setup is established in which a laser tracks the corresponding excursion, and a comprehensive preprocessing pipeline prepares the dataset. A model, based on ConvNet and/or FFTNet, for example, may be trained and verified based on two typical speakers. BN re-estimation for online adaptation and quantization in AIMET may also be implemented.
Example Speaker Diaphragm Excursion Prediction
FIG. 1 depicts a cross-section of an example speaker 100 showing displacement X (also referred to as “excursion” ) of the speaker’s diaphragm.
The speaker 100 represents the transduction of electrical energy to mechanical energy. A continuous-time model for the electrical behavior is shown as:
v_c(t) = R_eb · i_c(t) + φ_0 · dx_d(t)/dt     (1)

where v_c(t) is the voltage input across the terminals of the voice coil, i_c(t) is the voice coil current, R_eb is the blocked electrical resistance, dx_d(t)/dt is the diaphragm excursion velocity, φ_0 is the transduction coefficient at the equilibrium state, and x_d(t) is the diaphragm excursion.
Such mechanical characteristics of the speaker may be mostly determined by the parameters R_eb and φ_0, which highly depend on the speaker’s geometric construction and the materials used in the voice coil, the diaphragm, and the enclosure. It is hard to accurately construct a mathematical method to track the nonlinear distribution and variation. In Equation (1), such nonlinear features can be represented by [v_c(t), i_c(t), x_d(t)], which is further used in supervised training of models used in a DL solution to learn the characteristics of a speaker.
FIG. 2 illustrates an example laser excursion measurement environment 200. The current (I) and voltage (V) are the inputs. The measured excursion (X) from the laser is used/labeled as ground truth data used in supervised learning techniques to train the machine learning models described herein.
FIG. 3A illustrates an example machine learning (ML) model 300A trained to predict displacement (i.e., excursion) of a speaker diaphragm based on a training data set of V and/or I, labeled with a measured amount of displacement of a speaker diaphragm. After training, the ML model 300A may predict the displacement of the speaker diaphragm based on V and/or I.
FIG. 3B illustrates an example 300B of an input matrix, ML model, and output vector.
Example 300B may be subject to various constraints on model size and latency. Example constraints may include the size of a one-dimensional time sequence (e.g., sampled at a sampling rate of 48 kHz) , timing constraints (e.g., a 10 ms scheduling time) , input length (e.g., 256 samples provided as input) , and the like. The output of the ML model generally includes a prediction of diaphragm excursion given the inputs of V and/or I. In some aspects, the output may be generated using mixed quantization (8 and 16 bits) and may be produced by a central processing unit (CPU) or offloaded to other processors, such as a graphics processing unit (GPU) or a neural processing unit (NPU) .
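A minimal sketch of how such an input window might be assembled is shown below. The (2, N) layout, the helper name, and the framing details are assumptions consistent with the constraints described above (48 kHz sampling, 256-sample input).

```python
import numpy as np

FS = 48_000   # sampling rate (Hz)
N = 256       # samples per input window (~5.3 ms of audio)

def make_input(voltage, current, end_idx):
    """Stack the most recent N voltage and current samples into a (2, N) input.

    voltage, current: 1-D arrays of logged v_c(t) and i_c(t), already time-aligned.
    end_idx: index of the last sample in the window; the model predicts the
    diaphragm displacement at this time stamp.
    """
    v = voltage[end_idx - N + 1 : end_idx + 1]
    i = current[end_idx - N + 1 : end_idx + 1]
    return np.stack([v, i], axis=0).astype(np.float32)
```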
FIG. 4A illustrates an example preprocessing procedure 400A, according to certain aspects of the present disclosure.
A laser (or other measurement/metrology device) may be pointed at the center of the speaker in order to track the displacement (i.e., excursion) of the diaphragm for any given V and/or I used to drive the speaker. The measured diaphragm displacement is logged as x_d(t). Meanwhile, the corresponding real-time current i_c(t) and/or voltage v_c(t) are measured, as shown in FIG. 1. Next, Equation (1) is rewritten as

x_d(t) = f(v_c(t), i_c(t))     (2)

where f(·) is the function representing the mechanical characteristics of the speaker. For example, in one voltage-controlled speaker, voltage is the source input, encoded by the voice content, which causes the diaphragm to vibrate and from which excursion from a base plane can be measured. The mechanical response is embedded into the feedback current. Once a model f̂(·) can be trained/optimized to represent f(·) (e.g., the mechanical characteristics of the speaker) stably, real-time logged current and/or voltage can be used to predict the corresponding diaphragm displacement based on the trained model, as shown in Equation (3):

x̂_d(t) = f̂(v_c(t), i_c(t))     (3)

where the motivation is to learn f̂(·) based on the logged dataset, and to predict the diaphragm displacement using the model, given the real-time v_c(t) and/or i_c(t).
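A minimal PyTorch training-loop sketch for learning f̂ from the logged {(v_c, i_c), x_d} pairs is shown below. The data-loader format, optimizer, and hyperparameters are assumptions, and the model architecture is left abstract (e.g., the ConvNet or FFTNet described later).

```python
import torch
from torch import nn

def train_excursion_model(model, loader, epochs=10, lr=1e-3):
    """Fit a model mapping s = [v_c; i_c] windows to the measured DC drift.

    loader is assumed to yield (s, x) pairs, with s of shape (batch, 2, N)
    and x of shape (batch,) holding the laser-derived ground truth (mm).
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(epochs):
        for s, x in loader:
            opt.zero_grad()
            loss = loss_fn(model(s).squeeze(-1), x)
            loss.backward()
            opt.step()
    return model
```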
The measured diaphragm excursion is generally impacted by noise and by the location and environment in which a speaker is installed. As discussed, continuous long-term large excursion may cause progressively more serious damage to the diaphragm. To prevent, or at least reduce, damage to speaker diaphragms, aspects of the present disclosure may use direct current (DC) drift prediction, where a low-pass filter (e.g., a second-order Butterworth filter or other second-order filter) with a relatively low cutoff frequency (e.g., 10 Hz) may be used to extract the DC component of the measured diaphragm displacement, which, as discussed above, may be used as ground-truth data associated with a V and/or I measurement for training the models discussed herein. Moreover, because the current/voltage logging and the laser measurement utilize separate clocks, synchronization may be implemented to temporally align a sequence of V and/or I data with the corresponding measurements. Cross-correlation between the current and the measured excursion may be used to time-shift the data for accurate training of the model.
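A possible SciPy/NumPy sketch of these two preprocessing steps (DC extraction with a second-order Butterworth low-pass filter and coarse clock alignment via cross-correlation) is shown below. The zero-phase filtering choice and the alignment convention are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 48_000  # sampling rate of the logged signals (Hz)

def extract_dc_drift(excursion_mm, cutoff_hz=10.0):
    """Low-pass filter the laser-measured excursion to isolate its DC drift.

    A second-order Butterworth filter with a ~10 Hz cutoff is used; filtfilt
    applies it forward and backward so the extracted DC has no phase delay.
    """
    b, a = butter(2, cutoff_hz / (FS / 2), btype="low")
    return filtfilt(b, a, excursion_mm)

def estimate_lag(current, excursion_mm):
    """Estimate the sample offset between the logged current and the laser trace.

    Returns the lag (in samples) at which the cross-correlation peaks; the
    excursion sequence can then be shifted by this lag before labeling.
    """
    c = np.correlate(current - current.mean(),
                     excursion_mm - excursion_mm.mean(), mode="full")
    return int(np.argmax(np.abs(c))) - (len(excursion_mm) - 1)
```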
FIG. 4B depicts an example plot 400B illustrating a correlation between current I and displacement X. The plot illustrates synchronization between I and the measured excursion. In some cases, V/I monitoring and laser measurement may be deployed as two independent subsystems. Correlation between I and X is illustrated at 8 kHz.
FIG. 4C illustrates an example 400C of an input, a filter, and an extracted frequency. As illustrated, a raw measurement (e.g., sampled at 48 kHz) may be input to a filter, which may reduce noise while maintaining features of the measured excursion. In some cases, low-frequency content may have an impact on heating. In some cases, the filter may comprise a low-pass filter (e.g., a Butterworth filter) , which may extract the DC value in the measured excursion.
FIG. 5 illustrates an example Fourier neural network model 500, according to certain aspects of the present disclosure.
Mathematical models of speaker diaphragm excursion may not allow for speakers to be driven to their full potential; however, these models illustrate that DC drift can be associated with some unknown frequency and the corresponding harmonic components, which are highly associated with the mechanical design of the speaker. Aspects of the present disclosure leverage DL solutions to extract these frequencies and to predict DC drift.
In the training stage, for an input sequence with N samples, the number of state variables is 2N, including the voltage and current components s_n = {i_n, v_n}, written as s = (s_1, s_2, …, s_N) ∈ R^{2N}. The output prediction x_N is mapped to the time stamp t of the last sample in the sequence. Each sample used to train the model is defined as {s, x_N}.
In some aspects, a Fourier Attention Operator Layer (FAOL) may be used to extract the effective frequency components for a given input sequence of voltage and/or current components. In the FAOL, multi-head attention is embedded, and the complex value is re-organized into two real parts, simplified for concatenation in the channel domain. After attention processing, the channels may be combined to restore a complex value. In some aspects, several FAOL blocks may be concatenated, which may aid in extracting the harmonic features for a given sequence of samples. Further, a skip path in the time domain may be used to restore frequency components of the input sequence that may have been discarded. In some aspects, the overall structure of a fast Fourier transform network may include a one-dimensional convolutional layer that increases the size of the channel feature space, an FAOL configured to extract features from an input sequence, and an average pooling layer to down-sample the extracted features into a smaller space.
Consider a scenario in which there are J Fourier Attention Operator Layers (FAOLs) in the neural network. The output of each layer is g_j for j = 1, 2, 3, …, J. For the input of each layer, a discrete or fast Fourier transform F may be performed to convert the input time-domain samples into the frequency domain. A multi-head self-attention block, parameterized by θ_j, may be used to learn in the frequency domain, and the time-domain sequence is then recovered based on an inverse Fourier transform F^{-1}. This process may be referred to as a Fourier attention operator K_j, represented by the following equation:

g_j = ReLU( F^{-1}( R_j · Attn_{θ_j}( F(g_{j-1}) ) ) + ψ(g_{j-1}) )     (4)

where Attn_{θ_j}(·) is the multi-head attention block that learns the coefficients based on the given patches, and R_j is the weight tensor that conducts a linear combination of the modes in the frequency domain. The output of the j-th layer adds the F^{-1} output to the initial time-domain sequence weighted by the conv1d operator ψ(·). Rectified linear unit (ReLU) activation is used along with one-dimensional convolutional operations.
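The following PyTorch sketch illustrates one FAOL block consistent with this description (FFT, attention over the real and imaginary parts stacked along the channel axis, a learned per-mode re-weighting, inverse FFT, and a conv1d skip path with ReLU). The layer sizes, number of retained modes, and attention configuration are assumptions, not the disclosed implementation.

```python
import torch
from torch import nn

class FAOL(nn.Module):
    """One Fourier Attention Operator Layer (illustrative sketch only).

    Assumes n_modes <= N // 2 + 1 and that 2 * channels is divisible by n_heads.
    """

    def __init__(self, channels: int, n_modes: int, n_heads: int = 4):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis, so the
        # attention block sees 2 * channels features per retained mode.
        self.attn = nn.MultiheadAttention(2 * channels, n_heads, batch_first=True)
        self.mode_weight = nn.Parameter(torch.randn(n_modes, 2 * channels))
        self.skip = nn.Conv1d(channels, channels, kernel_size=1)
        self.n_modes = n_modes

    def forward(self, x):                                      # x: (batch, C, N)
        n, c = x.shape[-1], x.shape[1]
        z = torch.fft.rfft(x, dim=-1)[..., : self.n_modes]      # keep low modes
        z = torch.cat([z.real, z.imag], dim=1).transpose(1, 2)  # (b, modes, 2C)
        z, _ = self.attn(z, z, z)                    # learn coefficients per mode
        z = (z * self.mode_weight).transpose(1, 2)   # re-weight modes, (b, 2C, M)
        z = torch.complex(z[:, :c, :], z[:, c:, :])  # back to complex values
        y = torch.fft.irfft(z, n=n, dim=-1)          # restore the time domain
        return torch.relu(y + self.skip(x))          # conv1d skip path + ReLU
```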
Further, as a comparison, in some aspects, a ResNet-based one-dimensional convolutional network (ConvNet) may be used, which may have a structure similar to that of the FFTNet discussed above (e.g., the same input and output format, a single conv1d layer, several ResNet blocks, average pooling, and a fully connected (FC) layer to regress the predicted DC drift) .
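A corresponding sketch of the ResNet-based 1-D ConvNet baseline might look as follows; the channel width and block count are assumptions chosen for illustration.

```python
import torch
from torch import nn

class ResBlock1d(nn.Module):
    """Basic 1-D residual block: two convolutions plus an identity skip."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels))

    def forward(self, x):
        return torch.relu(self.body(x) + x)

class ConvNet1d(nn.Module):
    """conv1d stem -> ResNet blocks -> average pooling -> FC regression head."""

    def __init__(self, in_ch: int = 2, width: int = 32, blocks: int = 4):
        super().__init__()
        self.stem = nn.Conv1d(in_ch, width, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock1d(width) for _ in range(blocks)])
        self.head = nn.Linear(width, 1)

    def forward(self, x):                    # x: (batch, 2, N)
        h = self.blocks(self.stem(x))
        h = h.mean(dim=-1)                   # global average pooling over time
        return self.head(h)                  # predicted DC drift (mm)
```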
In some cases, a fast Fourier transform (FFT) neural model complexity may include approximately 333,184 floating-point operations and 1,725 parameters.
FIG. 6 illustrates an example model 600, in accordance with aspects of the present disclosure. In some cases, a ResNet 1D model complexity may include approximately 3,244,096 floating-point operations and 19,073 parameters.
FIG. 7 illustrates an example batch normalization (BN) re-estimation algorithm 700, in accordance with aspects of the present disclosure.
As discussed, speakers (and the performance of these speakers) may be impacted by production variation and unknown mechanical characteristics. Further, different units of the same model of speaker may have varying characteristics, which may impose difficulties in accurately predicting diaphragm displacement for a speaker. Because of power and computation constraints in edge devices, such as smartphones or other devices in which speakers are included, batch normalization (BN) re-estimation may be used for online adaptation, without any labeling or fine-tuning requirement, to adapt to variations between different types of speakers and different units of a same speaker type. The BN layer is generally designed to alleviate the issue of internal covariate shift, a common problem when training a very deep neural network, and is defined as Equation (5):

y_j = γ_j · (x_j − μ_j) / sqrt(σ_j² + ε) + β_j     (5)

where x_j and y_j are the input/output scalars of one neuron response in one sample, μ_j and σ_j² are the mean and variance of X_{.j}, which denotes the j-th column of the input data in one BN layer, X ∈ R^{n×p}, j ∈ {1, …, p}, n denotes the batch size, p is the feature dimension, and γ_j and β_j are parameters to be learned. Once the training is done, the parameters in BN are frozen for inference operations.
When the model is initialized in a new device or new scenario for inference with BN re-estimation, as shown in Algorithm 1, the input is buffered within a given window to calculate the mean and variance. Further, to track the variation, a 1-tap infinite impulse response (IIR) filter is used to track the mean and variance, which is useful for optimization over the whole space.
Pseudocode describing an algorithm for predicting diaphragm displacement is illustrated below, in accordance with aspects of the present disclosure:
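The original pseudocode is not reproduced in this text; the following Python sketch illustrates the described procedure under stated assumptions: buffer each input window, let the BN layers collect that window's statistics, blend them into the running statistics with a 1-tap IIR coefficient α, and then run inference with the adapted statistics. The helper name, the default α, and the use of PyTorch BatchNorm momentum for the IIR update are assumptions.

```python
import torch

@torch.no_grad()
def predict_with_bn_reestimation(model, frames, alpha=0.1):
    """Adapt BN running statistics online while predicting DC drift.

    model: trained network containing BatchNorm1d layers with frozen statistics.
    frames: iterable of input windows, each a tensor of shape (1, 2, N).
    alpha: 1-tap IIR coefficient; running <- (1 - alpha) * running + alpha * batch.
    """
    bn_layers = [m for m in model.modules()
                 if isinstance(m, torch.nn.BatchNorm1d)]
    model.eval()
    predictions = []
    for s in frames:
        for bn in bn_layers:
            bn.momentum = alpha
            bn.train()                 # collect statistics on this window only
        _ = model(s)                   # forward pass updates running mean/var
        for bn in bn_layers:
            bn.eval()
        predictions.append(model(s).item())   # inference with adapted statistics
    return predictions
```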
FIG. 8 illustrates an example plot 800 illustrating an L1 loss and a scaled fourth-power loss.
In the experiments, three typical scenarios are considered: normal, heating, and DC injection. Fourteen units from two different speaker types (SBS2 and AAC) are used to collect the data. These are further split: 8 units for training, 4 units for validation, and 2 units for testing.
Three methods are verified: a first method utilizes an FFTNet with 3 FFT attention blocks, involving 333k float operations and 1.7k parameters; a second method utilizes a ConvNet with 4 ResNet blocks, involving 3244k float operations and 19k parameters; and a third method uses a digital signal processor (DSP) approach with limited operations and a negligible parameter memory cost. The DSP may still leverage the training and validation datasets to fine-tune the algorithm.
As large DC drift values generally lead to more serious damage to the speaker diaphragm, an L1 loss threshold can be used so that small loss values may be treated as noise, while loss values larger than the threshold are emphasized. One scaled fourth-power loss can be used in the training procedure to focus on the large L1 loss, shown as

L = (1/S) · Σ_{i=1…S} |x_out,i − x_i|^4 / δ^3

where S is the batch size, x_out,i is the prediction for the i-th sample (i = 1, 2, …, S) with corresponding ground truth x_i, and δ adjusts the point at which this loss is identical to the L1 loss.
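A small PyTorch sketch of such a loss is shown below. The exact scaling (|e|^4 / δ^3, chosen so the loss matches the L1 loss at |e| = δ) and the default δ are assumptions inferred from the description above.

```python
import torch

def scaled_fourth_power_loss(x_pred, x_true, delta=0.05):
    """Scaled fourth-power loss: equals the L1 loss at |error| == delta,
    grows much faster above it, and de-emphasizes smaller (noise-level) errors."""
    err = torch.abs(x_pred - x_true)
    return torch.mean(err ** 4 / delta ** 3)
```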
As illustrated, the predicted diaphragm displacement can track the peak DC jitter with small or no residual loss. For small DC jitter, which is impacted by random noise (e.g., from the circuit or mechanical diaphragm noise) , the variation of the predicted diaphragm displacement is hard to track. The maximum loss occurs when there is DC cliffing, where, in an ideal system, a zero-valued input (e.g., power off or no voice, v_c(t) = i_c(t) = 0) has a corresponding excursion of zero. However, due to mechanical constraints of a speaker diaphragm, diaphragm displacement gradually reduces to zero, but is not fully aligned with the electrical signal control. Although such cliffing leads to large loss values, it would not damage the speaker, as the amount or magnitude of diaphragm excursion is decreasing (leading to less mechanical stress on the diaphragm) .
Using a 32-bit floating point FFTNet and a testing sequence, the predicted DC drift sequence can track the ground truth DC variation, and the maximum L1 loss may be a value less than a target displacement value (e.g., a measured displacement of 0.0478 mm, less than the target displacement of 0.1 mm) .
In Table 1, compared to ConvNet, FFTNet shows more promising performance. However, FFT processing may be computationally complex and may be challenging to deploy in a real hardware device, e.g., accelerated by NPU. To allow for an FFTNet to be used in predicting speaker diaphragm excursion, high precision (e.g., 32-bit floating point) models may be reserved for complex scenarios, such as a DC injection scenario, which may be considered as a highly challenging corner-case scenario against which a speaker protection block is to protect.
| Model | Mean (mm) | Max (mm) |
| --- | --- | --- |
| FFTNet | 0.0077 | 0.2169 |
| ConvNet | 0.0091 | 0.2314 |
Table 1: L1 Loss Comparison for DC Injection Scenario
Table 2 illustrates verification of BN re-estimation for online adaptation, where one model, trained on a specific type of speaker (e.g., an SBS2 speaker) , is used to predict diaphragm displacement for a different type of speaker (e.g., an AAC speaker) . One AAC unit is used to verify the performance of the models discussed herein and to further explore the impact of different filter coefficients α. As shown in Table 2, such adaptation methods can largely improve the performance of machine learning models used to predict speaker diaphragm displacement. For example, compared to the baseline, α = 0.1 can achieve a 21% gain in the maximum loss. It should be noted, however, that it may be difficult to identify an optimal coefficient α, as the optimum may be highly dependent on the data/model weight distribution. Although α = 0.1 achieves the optimum maximum loss, the corresponding mean loss may be higher than the mean loss for other values of α.
| α | baseline | 0.1 | 0.05 | 0.001 | 0.0001 |
| --- | --- | --- | --- | --- | --- |
| Mean (mm) | 0.0081 | 0.0035 | 0.0024 | 0.001 | 0.0028 |
| Max (mm) | 0.7466 | 0.4179 | 0.5849 | 0.6802 | 0.7055 |
Table 2: Comparison of ConvNet FP32 performance in online adaptation scenarios between different speakers
In some cases, model quantization may be used to reduce complexity when deploying models on edge devices. Here, AIMET can be used to perform quantization (e.g., to 8-bit integer (INT8) or some other level of quantization) . As FFT operations are not supported in AIMET, the ConvNet described herein may be quantized, and the FFTNet need not be used. In this experiment, two separate models are designed specifically for SBS2 and AAC.
As shown in Table 3, compared to the DSP, the ConvNet FP32 shows a large gain in the mean loss, and its maximum loss is close to, but larger than, the target (0.1 mm) . Further, after 8-bit quantization, performance is slightly degraded compared to the FP32 baseline, with the maximum L1 loss increasing from 0.1121 mm to 0.1298 mm, but it remains much better than the DSP solution. Inference performance for the SBS2 speaker leads to a similar conclusion as for the AAC; however, the mean loss may be somewhat worse than that of the DSP.
| Speaker | Model | Mean (mm) | Max (mm) |
| --- | --- | --- | --- |
| AAC | DSP | 0.0140 | 0.2569 |
| AAC | ConvNet FP32 | 0.0038 | 0.1121 |
| AAC | ConvNet INT8 | 0.0076 | 0.1298 |
| SBS2 | DSP | 0.0020 | 0.2711 |
| SBS2 | ConvNet FP32 | 0.0032 | 0.13171 |
| SBS2 | ConvNet INT8 | 0.0046 | 0.1408 |
Table 3: L1 loss results for different speakers and different prediction networks
FIG. 9 illustrates an example plot 900 illustrating a predicted value, a ground truth value, and a residual value.
FIG. 10 illustrates an example system implementation 1000, in accordance with aspects of the present disclosure.
Aspects of the present disclosure provide an end-to-end pipeline to predict DC drift which, as discussed, may be correlated with speaker diaphragm displacement (i.e., excursion) . An attention mechanism may be used to extract frequency features from an input of V and/or I, which shows better performance than the ConvNet and the DSP solutions. Further, BN re-estimation may be enabled for online adaptation when the model is applied to new scenarios (e.g., applied to speaker protection for different speaker types) .
Example Operations for ML-based Diaphragm Excursion Prediction for Speaker Protection
FIG. 11 shows an example of a method 1100 for ML-based diaphragm excursion prediction for speaker protection, in accordance with aspects of the present disclosure. In some examples, the method 1100 may be performed by a device, such as the device 1300 illustrated in FIG. 13.
As illustrated, method 1100 begins at block 1110 with receiving an indication of one or more parameters associated with driving a speaker. In some cases, the operations of this block refer to, or may be performed by, a component (e.g., 1324A-1324C) of a memory 1324 as described with reference to FIG. 13.
FIG. 12 shows an example of a method 1200 for training a machine learning model, which may be used in speaker protection tasks, to predict speaker diaphragm displacement (excursion) , in accordance with aspects of the present disclosure. In some examples, the method 1200 may be performed by a device, such as the device 1400 illustrated in FIG. 14.
As illustrated, method 1200 begins at block 1210 with generating a training data set mapping an indication of one or more parameters associated with driving a speaker to an indication of a displacement of a diaphragm of the speaker as the diaphragm moves due to the speaker being driven based on the one or more parameters. In some cases, the operations of this block refer to, or may be performed by, a component (e.g., 1424A-1424B) of a memory 1424 as described with reference to FIG. 14.
Example Processing Systems for ML-based Diaphragm Excursion Prediction for Speaker Protection
FIG. 13 depicts an example processing system 1300 for ML-based diaphragm excursion prediction for speaker protection, such as described herein for example with respect to FIG. 11.
An NPU, such as 1308, is generally a specialized circuit configured for implementing control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs) , deep neural networks (DNNs) , random forests (RFs) , and the like. An NPU may sometimes alternatively be referred to as a neural signal processor (NSP) , tensor processing units (TPUs) , neural network processor (NNP) , intelligence processing unit (IPU) , vision processing unit (VPU) , or graph processing unit.
NPUs, such as 1308, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models. In some examples, a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC) , while in other examples the plurality of NPUs may be part of a dedicated neural-network accelerator.
NPUs may be optimized for training or inference, or in some cases configured to balance performance between both. For NPUs that are capable of performing both training and inference, the two tasks may still generally be performed independently.
NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged) , iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance. Generally, optimizing based on a wrong prediction involves propagating back through the layers of the model and determining gradients to reduce the prediction error.
NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process this new piece through an already trained model to generate a model output (e.g., an inference) .
In one implementation, NPU 1308 is a part of one or more of CPU 1302, GPU 1304, and/or DSP 1306.
In some examples, wireless connectivity component 1312 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE) , fifth generation connectivity (e.g., 5G or NR) , Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards. Wireless connectivity component 1312 is further connected to one or more antennas 1314.
In some examples, one or more of the processors of processing system 1300 may be based on an ARM or RISC-V instruction set.
In particular, in this example, memory 1324 includes receiving component 1324A, diaphragm displacement predicting component 1324B, and action taking component 1324C. The depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.
Generally, processing system 1300 and/or components thereof may be configured to perform the methods described herein.
Notably, in other aspects, aspects of processing system 1300 may be omitted, such as where processing system 1300 is a server computer or the like. Further, aspects of processing system 1300 may be distributed, such as training a model and using the model to generate inferences, such as user verification predictions.
FIG. 14 depicts an example processing system 1400 for training a machine learning model, which may be used in speaker protection tasks, to predict speaker diaphragm displacement (excursion) , such as described herein for example with respect to FIG. 12.
An NPU, such as 1408, is generally a specialized circuit configured for implementing control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs) , deep neural networks (DNNs) , random forests (RFs) , and the like. An NPU may sometimes alternatively be referred to as a neural signal processor (NSP) , tensor processing units (TPUs) , neural network processor (NNP) , intelligence processing unit (IPU) , vision processing unit (VPU) , or graph processing unit.
NPUs, such as 1408, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models. In some examples, a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC) , while in other examples the plurality of NPUs may be part of a dedicated neural-network accelerator.
NPUs may be optimized for training or inference, or in some cases configured to balance performance between both. For NPUs that are capable of performing both training and inference, the two tasks may still generally be performed independently.
NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged) , iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance. Generally, optimizing based on a wrong prediction involves propagating back through the layers of the model and determining gradients to reduce the prediction error.
NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process this new piece through an already trained model to generate a model output (e.g., an inference) .
In one implementation, NPU 1408 is a part of one or more of CPU 1402, GPU 1404, and/or DSP 1406.
In some examples, wireless connectivity component 1412 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE) , fifth generation connectivity (e.g., 5G or NR) , Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards. Wireless connectivity component 1412 is further connected to one or more antennas 1414.
In some examples, one or more of the processors of processing system 1400 may be based on an ARM or RISC-V instruction set.
In particular, in this example, memory 1424 includes generating component 1424A and training component 1424B. The depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.
Generally, processing system 1400 and/or components thereof may be configured to perform the methods described herein.
Notably, in other aspects, aspects of processing system 1400 may be omitted, such as where processing system 1400 is a server computer or the like. Further, aspects of processing system 1400 may be distributed, such as training a model and using the model to generate inferences, such as user verification predictions.
Example Clauses
Clause 1: A processor-implemented method comprising: receiving an indication of one or more parameters associated with driving a speaker; predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters; and taking one or more actions based on the predicted displacement.
Clause 2: The method of Clause 1, wherein the taking comprises controlling an amplitude of at least one of a voltage signal or a current signal associated with driving the speaker, based on the predicted displacement.
Clause 3: The method of Clause 1 or 2, wherein the machine learning model comprises a one-dimensional residual neural network model.
Clause 4: The method of any of Clauses 1-3, wherein the machine learning model comprises a Fourier neural network model.
Clause 5: The method of Clause 4, wherein the predicting comprises generating a Fourier transform of the one or more parameters, wherein the machine learning model comprises an attention block configured to extract features in frequency domain from the Fourier transform of the one or more parameters.
Clause 6: The method of Clause 4, wherein the machine learning model comprises a Fourier attention operator layer including a skip path associated with an inverse Fourier transform output in the time domain.
Clause 7: The method of any of Clauses 1-6, wherein the machine learning model comprises a model trained to predict the displacement of the diaphragm of the speaker based on a training data set mapping the one or more parameters to a measured displacement.
Clause 8: The method of any of Clauses 1-7, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.
Clause 9: The method of any of Clauses 1-8, further comprising adapting the machine learning model from a first speaker to a second speaker based on batch normalization.
Clause 10: A processor-implemented method comprising: generating a training data set mapping an indication of one or more parameters associated with driving a speaker to an indication of a displacement of a diaphragm of the speaker as the diaphragm moves due to the speaker being driven based on the one or more parameters; and training a machine learning model, based on the training data set, to predict the displacement of the diaphragm.
Clause 11: The method of Clause 10, wherein generating the training data set comprises filtering the indication of the displacement of the diaphragm to generate a filtered indication.
Clause 12: The method of Clause 11, wherein the filtering comprises using a Butterworth filter.
Clause 13: The method of Clause 11 or 12, wherein generating the training data set comprises correlating the filtered indication of the displacement of the diaphragm with the indication of the one or more parameters to effectively synchronize the filtered indication of the displacement with the one or more parameters.
Clause 14: The method of Clause 13, wherein the training is based on the filtered indication of the displacement effectively synchronized to the indication of the one or more parameters.
Clause 15: The method of any of Clauses 10-14, wherein the machine learning model comprises a one-dimensional residual neural network model.
Clause 16: The method of any of Clauses 10-14, wherein the machine learning model comprises a Fourier neural network model.
Clause 17: The method of any of Clauses 10-16, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.
Clause 18: A system comprising: a memory having executable instructions stored thereon; and a processor configured to execute the executable instructions to cause the system to perform the operations of any of Clauses 1-17.
Clause 19: A system comprising means for performing the operations of any of Clauses 1-17.
Clause 20: A computer-readable medium having instructions stored thereon which, when executed by a processor, performs the operations of any of Clauses 1-17.
Additional Considerations
The preceding description is provided to enable any person skilled in the art to practice the various embodiments described herein. The examples discussed herein are not limiting of the scope, applicability, or embodiments set forth in the claims. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments. For example, changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For instance, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. Also, features described with respect to some examples may be combined in some other examples. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.
As used herein, the word “exemplary” means “serving as an example, instance, or illustration. ” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
As used herein, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c) .
As used herein, the term “determining” encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure) , ascertaining, and the like. Also, “determining” may include receiving (e.g., receiving information) , accessing (e.g., accessing data in a memory) , and the like. Also, “determining” may include resolving, selecting, choosing, establishing, and the like.
The methods disclosed herein comprise one or more steps or actions for achieving the methods. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims. Further, the various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software component (s) and/or module (s) , including, but not limited to a circuit, an application specific integrated circuit (ASIC) , or processor. Generally, where there are operations illustrated in figures, those operations may have corresponding counterpart means-plus-function components with similar numbering.
The following claims are not intended to be limited to the embodiments shown herein, but are to be accorded the full scope consistent with the language of the claims. Within a claim, reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more. ” Unless specifically stated otherwise, the term “some” refers to one or more. No claim element is to be construed under the provisions of 35 U.S.C. §112 (f) unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for. ” All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims.
Claims (21)
- A processor-implemented method comprising: receiving an indication of one or more parameters associated with driving a speaker; predicting, using a machine learning model, a displacement of a diaphragm of the speaker based on the indication of the one or more parameters; and taking one or more actions based on the predicted displacement.
- The method of claim 1, wherein the taking comprises controlling an amplitude of at least one of a voltage signal or a current signal associated with driving the speaker, based on the predicted displacement.
- The method of claim 1, wherein the machine learning model comprises a one-dimensional residual neural network model.
- The method of claim 1, wherein the machine learning model comprises a Fourier neural network model.
- The method of claim 4, wherein the predicting comprises generating a Fourier transform of the one or more parameters and wherein the Fourier neural network model comprises an attention block configured to extract features in the frequency domain from the Fourier transform of the one or more parameters.
- The method of claim 4, wherein the machine learning model comprises a Fourier attention operator layer including a skip path associated with an inverse Fourier transform output in the time domain.
- The method of claim 1, wherein the machine learning model comprises a model trained to predict the displacement of the diaphragm of the speaker based on a training data set mapping the one or more parameters to a measured displacement.
- The method of claim 1, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.
- The method of claim 1, further comprising adapting the machine learning model from a first speaker to a second speaker based on batch normalization.
- A processor-implemented method comprising: generating a training data set mapping an indication of one or more parameters associated with driving a speaker to an indication of a displacement of a diaphragm of the speaker as the diaphragm moves due to the speaker being driven based on the one or more parameters; and training a machine learning model, based on the training data set, to predict the displacement of the diaphragm.
- The method of claim 10, wherein generating the training data set comprises filtering the indication of the displacement of the diaphragm to generate a filtered indication.
- The method of claim 11, wherein the filtering comprises using a second-order Butterworth filter.
- The method of claim 11, wherein generating the training data set comprises correlating the filtered indication of the displacement of the diaphragm with the indication of the one or more parameters to effectively synchronize the filtered indication of the displacement with the one or more parameters.
- The method of claim 13, wherein the training is based on the filtered indication of the displacement effectively synchronized to the indication of the one or more parameters.
- The method of claim 10, wherein the machine learning model comprises a one-dimensional residual neural network model.
- The method of claim 10, wherein the machine learning model comprises a Fourier neural network model.
- The method of claim 10, wherein the one or more parameters comprise one or more of a voltage signal associated with driving the speaker or a current signal associated with driving the speaker.
- A processing system comprising: a memory comprising computer-executable instructions; and one or more processors configured to execute the computer-executable instructions and cause the processing system to perform a method in accordance with any of Claims 1-17.
- A processing system, comprising means for performing a method in accordance with any of Claims 1-17.
- A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by one or more processors of a processing system, cause the processing system to perform a method in accordance with any of Claims 1-17.
- A computer program product embodied on a computer-readable storage medium comprising code for performing a method in accordance with any of Claims 1-17.