CN116612760A - Audio signal processing method and device and electronic equipment - Google Patents

Audio signal processing method and device and electronic equipment

Info

Publication number
CN116612760A
Authority
CN
China
Prior art keywords
fft
audio
simulated
data
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310892837.0A
Other languages
Chinese (zh)
Other versions
CN116612760B (en)
Inventor
钟雨崎
艾国
杨作兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bianfeng Information Technology Co ltd
Original Assignee
Beijing Bianfeng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bianfeng Information Technology Co ltd filed Critical Beijing Bianfeng Information Technology Co ltd
Priority to CN202310892837.0A priority Critical patent/CN116612760B/en
Publication of CN116612760A publication Critical patent/CN116612760A/en
Application granted granted Critical
Publication of CN116612760B publication Critical patent/CN116612760B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • G10L15/285Memory allocation or algorithm optimisation to reduce hardware requirements
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L2021/02082Noise filtering the noise being echo, reverberation of the speech
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Abstract

The present disclosure relates to an audio signal processing method and apparatus and an electronic device. The method comprises: acquiring an audio signal; performing inference on the audio signal through a trained simulated FFT model to obtain simulated FFT data, wherein the inference process comprises only multiplication, addition, and comparison operations; and executing audio task processing on the simulated FFT data to obtain an audio output result. The method processes the audio signal with a neural-network-based simulated FFT model to realize a function similar to the FFT; because the computation of the simulated FFT model involves only multiplication and addition, the running power consumption of the apparatus is reduced, so that hardware devices adopting this technical scheme can extend their usage time several-fold when powered by a battery. In this technical scheme, all downstream tasks whose functions are already formed are used to drive the simulated FFT model to learn the features those tasks require, so that the features output by the simulated FFT model can be better than the FFT calculation result itself.

Description

Audio signal processing method and device and electronic equipment
Technical Field
The disclosure relates to the technical fields of computers, neural networks, and signal processing, and in particular to an audio signal processing method, an audio signal processing apparatus, and an electronic device.
Background
In the speech field, the FFT (Fast Fourier Transform) is an important signal transformation tool in existing digital signal processing (DSP), used to convert a time-domain speech signal into a frequency-domain signal. Downstream tasks then further process the frequency-domain signal to obtain a result.
However, the FFT has high computational complexity and requires substantial computing resources and power. In a miniaturized device, the FFT therefore generates significant power consumption, limiting the usage time of the device when battery capacity is limited. How to realize the FFT function with low power consumption in a miniaturized device, and thereby extend its usage time, is thus a problem to be solved.
Disclosure of Invention
In view of this, the present disclosure provides an audio signal processing method, an apparatus, and an electronic device that implement a low-power FFT-like function, thereby reducing the power consumption of an apparatus that needs FFT functionality and extending its usage time under limited power supply.
The technical scheme of the present disclosure is realized as follows:
an audio signal processing method, comprising:
acquiring an audio signal;
performing inference on the audio signal through a trained simulated fast Fourier transform (FFT) model to obtain simulated FFT data;
and executing audio task processing on the simulated FFT data to obtain an audio output result.
Further, the training process of the simulated FFT model comprises the following steps:
acquiring an audio sample signal;
inputting the audio sample signal into a pre-trained simulated FFT model, and obtaining simulated FFT prediction data through the pre-trained simulated FFT model;
executing the audio task processing on the simulated FFT prediction data to obtain audio sample task prediction data;
performing a fast Fourier transform (FFT) on the audio sample signal to obtain a frequency domain signal;
executing the audio task processing on the frequency domain signal to obtain audio sample task processing result data;
and adjusting parameters in the pre-trained simulated FFT model according to the difference between the audio sample task prediction data and the audio sample task processing result data to obtain the trained simulated FFT model.
Further, the adjusting parameters in the pre-trained simulated FFT model according to the difference between the audio sample task prediction data and the audio sample task processing result data includes:
establishing a mean square error MSE loss function according to the audio sample task prediction data and the audio sample task processing result data;
and adjusting parameters in the pre-trained simulated FFT model according to the MSE loss function until the MSE loss function converges to a desired value or the number of training iterations is reached.
Further, the audio task includes at least one of noise reduction, gain, echo cancellation, voice wakeup, voice recognition, voiceprint recognition.
Further, the simulated FFT model comprises at least one feature extractor in series;
in the case where there is more than one feature extractor, the feature extractors are connected in series: between any two adjacent feature extractors, the feature data output by the preceding feature extractor is the input data of the next feature extractor; the audio signal is input to the first feature extractor in the simulated FFT model, and the feature data output by the last feature extractor in the simulated FFT model is the simulated FFT data.
Further, each feature extractor comprises at least one convolutional neural network (CNN) unit and at least one rectified linear unit (ReLU) connected to the CNN unit; the feature data output by the feature extractor is obtained by processing the feature extractor's input data through the CNN unit and then through the ReLU unit.
Further, the number of feature extractors is 10 to 100.
Further, the audio signal processing method further includes:
and when the number of feature extractors reaches or exceeds a feature-extractor-number threshold and the number of audio tasks being processed reaches or exceeds an audio-task-number threshold, upon adding a new audio task, directly applying the simulated FFT data obtained by the trained simulated FFT model's inference on the audio signal to the new audio task, to obtain an audio output result corresponding to the new audio task.
An audio signal processing apparatus comprising:
a signal acquisition module configured to perform acquisition of an audio signal;
a simulated fast Fourier transform (FFT) processing module configured to perform inference on the audio signal through a trained simulated FFT model to obtain simulated FFT data, wherein the simulated FFT model's inference on the audio signal comprises only multiplication, addition, and comparison operations;
and an audio task processing module configured to execute audio task processing on the simulated FFT data to obtain an audio output result.
An audio signal processing apparatus comprising:
a neural network processor (NPU) for acquiring an audio signal and performing inference on the audio signal through a trained simulated fast Fourier transform (FFT) model to obtain simulated FFT data, wherein the simulated FFT model's inference on the audio signal comprises only multiplication, addition, and comparison operations;
and an audio task processing chip unit electrically connected to the NPU, for receiving the simulated FFT data from the NPU and executing audio task processing on it to obtain an audio output result.
An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the executable instructions to implement the audio signal processing method of any of the above.
As can be seen from the above solutions, the audio signal processing method, apparatus, and electronic device of the present disclosure process an audio signal with a neural-network-based simulated FFT model to realize a function similar to the FFT. Because the computation of the feature extractors in the simulated FFT model involves only multiplication and addition, the running power consumption of the apparatus is reduced relative to FFT processing, from about 300 mW for the FFT to about 0.1 mW. Hardware devices adopting this technical scheme can therefore extend their usage time several-fold, relative to an audio signal processing scheme that includes FFT processing, when powered by a battery. The technical scheme is not concerned with whether the features output by the simulated FFT model match the FFT result, but only with whether those features, when fed to the downstream tasks, let them run normally and produce results equivalent to the audio output obtained by feeding the FFT result into those tasks. All downstream tasks whose functions are already formed are used to drive the simulated FFT model to learn the features all of them require, so the features output by the simulated FFT model can be better than the FFT calculation result itself. The disclosed technical scheme is thus conducive to both lower power consumption and better results.
Drawings
FIG. 1 is a flow chart of an audio signal processing method according to an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating a topology employing an audio signal processing method, according to an illustrative embodiment;
FIG. 3 is a schematic diagram of a feature extractor shown in accordance with an illustrative embodiment;
FIG. 4 is a flowchart of a training process for a simulated FFT model, according to an illustrative embodiment;
FIG. 5 is a flowchart illustrating a process for adjusting parameters in a simulated FFT model based on differences during training, according to an illustrative embodiment;
fig. 6 is a schematic diagram of an audio signal processing apparatus according to an exemplary embodiment;
FIG. 7 is a schematic diagram of an application scenario of an audio signal processing method, apparatus, according to an exemplary embodiment;
FIG. 8 is a schematic diagram illustrating the topological relationship when training a simulated FFT module, according to an exemplary embodiment;
fig. 9 is a schematic diagram showing a logic structure of an audio signal processing apparatus according to an exemplary embodiment;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure will be further described in detail below with reference to the accompanying drawings and examples.
The calculation formula of the Fourier transform is as follows:

$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt$

where $X(f)$ is the complex representation of the signal in the frequency domain, $f$ is the frequency, $x(t)$ is the input signal in the time domain, $t$ is time, and $j$ is the imaginary unit.
For discrete-time signals, the calculation formula of the Fourier transform is:

$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}$

where $X(k)$ is the discrete complex value of the signal in the frequency domain, $k$ is the discrete frequency index, $x(n)$ is the sampled value of the input signal at discrete time $n$, and $N$ is the total number of samples in the discrete time sequence.
As can be seen from the above formulas, the Fourier transform involves computing powers of $e$ and complex arithmetic; for discrete signals, the factor $e^{-j 2\pi k n / N}$ must be evaluated. Because of these calculations, the hardware circuits of existing voice mobile products and voice chips must rely on a DSP (Digital Signal Processing) unit or CPU (Central Processing Unit). Moreover, the Fourier transform is executed very frequently, typically about 100 to 200 times per second, and about 500 times per second for products with stricter latency requirements, so its power consumption is significant, typically about 300 mW (milliwatts). For a miniaturized device powered by a low-capacity battery, such as a TWS (True Wireless Stereo) earphone or a hearing aid, this seriously shortens the usage time of the device.
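For reference, the discrete Fourier transform described above can be sketched directly in Python. This is a minimal naive implementation, not the optimized FFT; the function name is mine:

```python
import numpy as np

def dft(x):
    """Naive DFT per the formula above: X[k] = sum_n x[n] * e^(-j*2*pi*k*n/N).

    Every output bin needs N complex multiplications built from powers of e,
    which is exactly the kind of arithmetic the simulated FFT model avoids.
    """
    N = len(x)
    n = np.arange(N)
    k = n.reshape((N, 1))
    return np.exp(-2j * np.pi * k * n / N) @ x  # complex exponential matrix times signal

x = np.random.default_rng(0).standard_normal(64)
assert np.allclose(dft(x), np.fft.fft(x))  # agrees with the library FFT
```

The matrix form makes the cost visible: the naive transform is O(N^2) complex multiply-adds, and even the O(N log N) FFT still requires complex arithmetic and exponentials.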
The idea of the disclosed embodiments is to provide a method and apparatus that replace the existing FFT and, combined with the FFT's downstream tasks, produce audio output results equivalent to or better than those of the existing FFT plus its downstream tasks, at lower power consumption, thereby extending the usage time of devices that need FFT functionality under limited battery capacity.
Fig. 1 is a flowchart of an audio signal processing method according to an exemplary embodiment, and Fig. 2 is a schematic diagram of a topology using the audio signal processing method according to an exemplary embodiment. As shown in Figs. 1 and 2, the method mainly includes the following steps 101 to 103.
Step 101, acquiring an audio signal;
Step 102, performing inference on the audio signal through the trained simulated FFT model to obtain simulated FFT data, wherein the simulated FFT model's inference on the audio signal comprises only multiplication, addition, and comparison operations;
Step 103, executing audio task processing on the simulated FFT data to obtain an audio output result.
In an exemplary embodiment, the simulated FFT model is a neural network model that includes at least one feature extractor in series. When there is more than one feature extractor, the feature extractors are connected in series, meaning they are logically chained in the data processing flow: between any two adjacent feature extractors, the feature data output by the preceding extractor is the input data of the next one. The audio signal is input to the first feature extractor in the simulated FFT model, and the feature data output by the last feature extractor is the simulated FFT data.
Fig. 3 is a schematic diagram of a feature extractor according to an illustrative embodiment. As shown in Fig. 3, each feature extractor includes at least one CNN unit and at least one ReLU unit connected to the CNN unit. For example, each feature extractor may include one CNN unit and one ReLU unit connected to it; the feature extractor's output feature data is obtained by processing its input data through the CNN unit and then through the ReLU unit.
A CNN, or convolutional neural network, is a feedforward artificial neural network for processing data with a grid-like structure. It is widely used in image, audio, video, and natural language processing. A CNN uses convolutional and pooling layers to learn deep features of the input data layer by layer, improving processing efficiency and accuracy on data with complex structure, such as images.
ReLU, the rectified linear unit (also called a linear rectification function), is a nonlinear activation function commonly used in neural networks. For an input x, the ReLU outputs x when x is greater than 0 and outputs 0 otherwise. In a neural network, the ReLU introduces a nonlinear transformation, allowing complex nonlinear models to be handled better; it "folds" negative values to 0 while preserving positive values. This nonlinearity makes the network more flexible and powerful, improving its accuracy and generalization ability. In addition, the ReLU computes quickly and converges fast.
In the disclosed embodiments, a feature extractor consisting of a CNN unit and a ReLU unit can enhance the nonlinear expressive power of the simulated FFT model, improve its accuracy and generalization ability, and mitigate the vanishing-gradient problem, making gradients more stable during backpropagation. Training is therefore faster, the response to features is clearer, useless information can be filtered out, the robustness of the features for subsequent processing is improved, and the convergence of the simulated FFT model is accelerated, improving training efficiency. Moreover, the data processing of the CNN unit involves only multiplication and addition, and the data processing of the ReLU unit involves only comparison, which is faster and lower in power consumption than the FFT.
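As an illustrative sketch of one CNN-plus-ReLU feature extractor in pure NumPy (the kernel width, depth, and scaling below are my own toy choices; the patent does not specify them), note that the computation uses only multiplication, addition, and comparison:

```python
import numpy as np

def conv1d(x, kernel, bias):
    """'Valid' 1-D convolution: only multiplications and additions."""
    n, k = len(x), len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) + bias for i in range(n - k + 1)])

def relu(x):
    """Only comparisons against zero: max(x, 0)."""
    return np.maximum(x, 0.0)

def feature_extractor(x, kernel, bias=0.0):
    """One CNN unit followed by one ReLU unit, as in Fig. 3."""
    return relu(conv1d(x, kernel, bias))

rng = np.random.default_rng(0)
audio = rng.standard_normal(256)   # toy single-channel audio frame
feats = audio
for _ in range(3):                 # a few extractors in series (the patent suggests 10 to 100)
    feats = feature_extractor(feats, rng.standard_normal(5) * 0.1)
```

No exponentials or complex numbers appear anywhere in this pipeline, which is the basis of the power-consumption argument.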
In the disclosed embodiments, the feature extractors can enhance the nonlinear expressive power of the simulated FFT model, improve its accuracy and generalization ability, respond to features more clearly, filter out useless information, and make the features more robust for subsequent processing. Therefore, when the number of feature extractors is large (for example, reaching or exceeding a preset feature-extractor-number threshold) and the number of audio tasks is large (for example, reaching or exceeding a preset audio-task-number threshold), the simulated FFT data obtained by the simulated FFT model's inference can still yield ideal processing results on newly added audio tasks, without retraining or structural adjustment (such as increasing or decreasing the number of feature extractors). Based on this, in an exemplary embodiment, the audio signal processing method of the disclosed embodiments may further include:
when the number of the feature extractors reaches or exceeds the threshold of the number of the feature extractors and the number of the audio tasks for executing the audio task processing reaches or exceeds the threshold of the number of the audio tasks, when a new audio task is added, the simulated FFT data obtained by reasoning the audio signals by the trained simulated FFT model is directly applied to the new audio task, and an audio output result corresponding to the new audio task is obtained.
That is, the more downstream audio tasks there are, the more general the features extracted by the simulated FFT model become. On this basis, if a new audio task is added, the simulated FFT model can be used directly, without retraining or structural adjustment. For example, if the trained simulated FFT model was obtained in combination with 10 downstream audio tasks, then an eleventh task can be added, that is, the trained simulated FFT model can be used with 11 downstream audio tasks, and an ideal audio output result can still be obtained.
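The reuse described above can be sketched as follows. Everything here is hypothetical stand-in code (the extractor and both task heads are toy functions of my own invention); the point it illustrates is that the frozen extractor is computed once and shared, and a new task head is attached without retraining:

```python
import numpy as np

def simulated_fft(audio):
    """Stand-in for the trained, frozen simulated FFT model (hypothetical)."""
    smoothed = np.convolve(audio, np.ones(4) / 4.0, mode="valid")  # multiply-add only
    return np.maximum(smoothed, 0.0)                               # comparison only

def noise_gate(feats, thresh=0.5):
    """An existing downstream task head (hypothetical)."""
    return np.where(feats > thresh, feats, 0.0)

def signal_energy(feats):
    """A newly added task head, bolted on without touching the extractor."""
    return float(np.sum(feats ** 2))

audio = np.abs(np.random.default_rng(1).standard_normal(64))
feats = simulated_fft(audio)    # features computed once by the frozen model
gated = noise_gate(feats)       # existing task consumes them
energy = signal_energy(feats)   # new task reuses the very same features
```

The design choice mirrors the patent's claim: since the features were shaped by many tasks during training, they are general enough for a task the extractor never saw.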
Because the audio signal processing method of the disclosed embodiments uses a trained simulated FFT model, the model must be trained before the method is used. In the disclosed embodiments, the simulated FFT model does not directly learn to approximate the FFT's calculation result; instead, all downstream tasks after the FFT whose functions are already formed are used to drive the simulated FFT model to learn the features all of those tasks require, and these features can be better than the FFT calculation result itself. That is, the disclosed embodiments do not care what the features output by the simulated FFT model are, but only whether those features, when fed to the downstream tasks, let them run normally and produce results equivalent to or better than the audio output obtained by feeding the FFT calculation result to those tasks. The simulated FFT model is trained with this as the objective, and its function is ultimately formed.
Since the CNN is a purely linear computation involving only multiplication and addition, without powers of $e$, complex arithmetic, or evaluation of $e^{-j 2\pi k n / N}$, power consumption can be greatly reduced; experimental data show that it can drop from the FFT's roughly 300 mW to about 0.1 mW. The audio signal processing method of the disclosed embodiments can therefore solve the problem of high FFT power consumption in miniaturized devices, realize the FFT function at lower power, and extend the usage time of devices that need FFT functionality under limited battery capacity. For example, after a TWS earphone or hearing aid adopts the audio signal processing method of the disclosed embodiments, its usage time can be extended several-fold on the same battery charge, compared with the existing FFT-based scheme.
Fig. 4 is a flowchart of the training process of the simulated FFT model according to an exemplary embodiment. As shown in Fig. 4, the training process includes the following steps 401 to 406.
Step 401, obtaining an audio sample signal;
Step 402, inputting the audio sample signal into the pre-trained simulated FFT model, and obtaining simulated FFT prediction data through the pre-trained simulated FFT model;
Step 403, executing audio task processing on the simulated FFT prediction data to obtain audio sample task prediction data;
Step 404, performing an FFT on the audio sample signal to obtain a frequency domain signal;
Step 405, executing audio task processing on the frequency domain signal to obtain audio sample task processing result data;
Step 406, adjusting parameters in the pre-trained simulated FFT model according to the difference between the audio sample task prediction data and the audio sample task processing result data, to obtain the trained simulated FFT model.
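Steps 401 to 406 can be sketched as a distillation-style loop. This is a toy sketch under strong assumptions that are mine, not the patent's: the student model is a single linear layer, the downstream "audio task" is a fixed linear projection, and the MSE gradient is written out by hand for this linear case:

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 32, 16
A = rng.standard_normal((8, N)) * 0.1   # fixed, already-formed downstream audio task
W = rng.standard_normal((N, N)) * 0.01  # pre-trained simulated FFT model (step 402)

X = rng.standard_normal((N, B))         # batch of audio sample signals (step 401)
T = A @ np.abs(np.fft.fft(X, axis=0))   # task results via the real FFT (steps 404-405)

def loss(W):
    P = A @ (W @ X)                     # simulated FFT data fed to the task (402-403)
    return float(np.mean((P - T) ** 2))

loss0 = loss(W)
lr = 0.01
for _ in range(500):                    # step 406: shrink the task-level difference
    P = A @ (W @ X)
    grad = (2.0 / P.size) * (A.T @ (P - T)) @ X.T  # d(MSE)/dW for this linear toy
    W -= lr * grad
```

Note that the loss compares task outputs, not spectra: the student never sees the FFT result directly, only its effect through the downstream task, which is the training idea the patent describes.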
In an exemplary embodiment, an MSE loss function is used in the training process. Fig. 5 is a flowchart illustrating the process of adjusting parameters in the simulated FFT model based on differences during training, according to an exemplary embodiment. As shown in Fig. 5, adjusting the parameters in the pre-trained simulated FFT model according to the difference between the audio sample task prediction data and the audio sample task processing result data in step 406 mainly includes the following steps 501 to 502.
Step 501, an MSE loss function is established according to audio sample task prediction data and audio sample task processing result data;
Step 502, adjusting parameters in the pre-trained simulated FFT model according to the MSE loss function until the MSE loss function converges to an expected value or the number of training iterations is reached.
Here, MSE is the mean squared error (Mean Squared Error), a loss function that measures the difference between a model's predictions and the true values in regression problems; it is often used in linear regression, polynomial regression, and similar tasks. The MSE is computed as follows: for each sample in a given sample set, compute the difference between its predicted value and its true value, then sum the squares of these differences and take the average. That is,

$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

where $n$ is the total number of samples, $y_i$ is the true value of the $i$-th sample, and $\hat{y}_i$ is the model's prediction for the $i$-th sample. The smaller the MSE, the closer the model's predictions are to the true values.
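The MSE described above maps directly to a few lines of Python (a generic illustration, independent of any particular model):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared per-sample differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

assert mse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0
assert mse([0.0, 0.0], [1.0, 3.0]) == 5.0  # (1 + 9) / 2
```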
As can be seen from the training process, the disclosed embodiments train the simulated FFT model with a method similar to unsupervised training: the inference results of the simulated FFT model plus the downstream tasks are compared with the processing results of the FFT plus the downstream tasks, and the difference between them is minimized during training, so that after training, the inference results of the simulated FFT model plus the downstream tasks are equivalent to or better than the processing results of the FFT plus the downstream tasks.
In an exemplary embodiment, the audio tasks include at least one of noise reduction, gain, echo cancellation, voice wakeup, voice recognition, voiceprint recognition.
In an exemplary embodiment, the output of the simulated FFT model may be given to multiple audio tasks simultaneously. Because the function of the simulated FFT model is formed by training it jointly with multiple downstream audio tasks, its function is feature extraction, and with the training process of the disclosed embodiments, the simulated FFT data output by the trained model can include the features required by all of the downstream audio tasks. During training, the simulated FFT prediction data obtained by the model can be given to all downstream audio tasks at once, and the model's parameters are then adjusted according to the comparison between the audio sample task prediction data and the audio sample task processing result data obtained from all of the audio tasks.
In addition, the more downstream audio tasks there are, the more general the features extracted by the simulated FFT model become; on this basis, if a new audio task is added, the simulated FFT model can be used directly without retraining. For example, if the trained simulated FFT model was obtained in combination with 10 downstream audio tasks, then an additional task can be added, that is, the trained simulated FFT model can be used with 11 downstream audio tasks, and an ideal audio output result can still be obtained.
In the simulated FFT model, the number of feature extractors affects the features extracted by the model: too few feature extractors may yield unsatisfactory features, while too many reduce feature extraction efficiency. To balance the effectiveness and efficiency of feature extraction, in the exemplary embodiment the number of feature extractors in the simulated FFT model is 10 to 100.
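As an illustrative sketch only (not the patent's actual implementation; the kernel size, weight scale, and frame length below are assumptions), a series connection of CNN-plus-ReLU feature extractors of the kind described above can be written using nothing but real multiplications, additions, and the ReLU comparison:

```python
import numpy as np

def relu(x):
    # ReLU is a single comparison against zero; no complex arithmetic involved
    return np.maximum(x, 0.0)

class FeatureExtractor:
    """One extractor: a 1-D convolution (the CNN unit) followed by ReLU."""
    def __init__(self, kernel_size=9, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.kernel = rng.standard_normal(kernel_size) * 0.1

    def __call__(self, x):
        # 'same' convolution keeps the frame length; only real multiply-adds
        return relu(np.convolve(x, self.kernel, mode="same"))

class SimulatedFFTModel:
    """Series connection of feature extractors (10 to 100 per the text above)."""
    def __init__(self, num_extractors=10):
        rng = np.random.default_rng(42)
        self.extractors = [FeatureExtractor(rng=rng) for _ in range(num_extractors)]

    def __call__(self, audio):
        for fe in self.extractors:   # output of one is the input of the next
            audio = fe(audio)
        return audio                 # the simulated FFT data

model = SimulatedFFTModel(num_extractors=10)
frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)  # 440 Hz test tone
features = model(frame)
print(features.shape)  # (512,)
```

In a real system the kernels would come from the training procedure described below rather than random initialization; the sketch only shows the chained topology and the restricted operation set.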
The audio signal processing method of the embodiment of the disclosure can be applied to all mobile and miniaturized products involving voice functions, such as Bluetooth headsets, recording devices, cameras, and the like.
According to the audio signal processing method, the audio signal is processed by the neural-network-based simulated FFT model to achieve a function similar to the FFT. Since the calculation of the feature extractors in the simulated FFT model of the embodiment of the disclosure involves only multiplication and addition, the operating power consumption of the device can be reduced relative to FFT processing, for example from 300 mW for the FFT down to 0.1 mW, so that under battery power, hardware adopting the embodiment of the disclosure can extend its service time severalfold relative to an audio signal processing scheme containing FFT processing. The embodiment of the disclosure is not concerned with whether the features output by the simulated FFT model are consistent with the FFT result, but with whether those features, when fed into the subsequent tasks, allow the tasks to run normally and produce a result equivalent to the audio output obtained when the FFT result is fed into the same tasks. All the subsequent tasks, whose functions are already formed, force the simulated FFT model to learn the features required by all of them, so the features output by the simulated FFT model can serve the tasks better than the FFT's own calculation result. Therefore, the audio signal processing method of the embodiment of the disclosure helps achieve both lower power consumption and better results.
For example, consider a pipeline consisting of process A, process B, and process C executed in sequence, where process A is, for example, an FFT, and processes B and C are, for example, the audio task processing performed after the FFT; the hardware power consumption of executing process A is high.
The common approach today is to replace process A with a process A', and to replace processes B and C with processes B' and C' adapted to process A'.
In the technical scheme of the disclosure, process A is instead replaced by a process D with lower power consumption, for example inference performed by the trained simulated FFT model of the embodiment of the disclosure, while processes B and C remain unchanged. Process A involves exponentiation and complex-number calculations, resulting in high hardware power consumption, whereas the inference performed on the audio signal by the simulated FFT model used in process D involves only multiplication, addition, and comparison operations, with no exponentiation or complex-number calculation. This simplifies the circuit design (because circuits for exponentiation and complex-number calculation are more complex than those for multiplication, addition, and comparison), and hardware power consumption can therefore be significantly reduced.
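The circuit-level argument above can be made concrete with a rough operation count. The cost models below are back-of-envelope assumptions (a radix-2 FFT, a 9-tap convolution, 10 extractor layers); they illustrate that the saving argued for here comes from replacing complex-number and exponentiation datapaths with plain real multiply-add and compare hardware, not necessarily from a lower raw operation count:

```python
import numpy as np

# Rough cost models (assumed, for illustration only).
def fft_complex_multiplies(n):
    # radix-2 FFT: about (n/2) * log2(n) butterflies, one complex multiply each
    return (n // 2) * int(np.log2(n))

def fft_real_multiplies(n):
    # each complex multiply expands into 4 real multiplies
    return 4 * fft_complex_multiplies(n)

def sim_fft_real_multiplies(n, taps, layers):
    # one real multiply per tap per sample per layer; ReLU adds only a compare
    return n * taps * layers

n = 512
print(fft_real_multiplies(n))             # real multiplies hidden inside the FFT
print(sim_fft_real_multiplies(n, 9, 10))  # real multiplies in the simulated model
```

Even if the convolutional path's raw count is larger under these toy numbers, every one of its operations is a real multiply-accumulate or a comparison, which the text above argues maps to far simpler, lower-power circuitry than the complex-number and exponentiation units an FFT pipeline needs.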
In addition, compared with the audio signal processing method of the embodiment of the present disclosure, the typical artificial-intelligence approach is to focus only on the output of a CNN and to supervise its learning so that the CNN outputs specified features (for example, features identical to the FFT result), rather than pitting the CNN against subsequent tasks whose functions are already formed so that it learns the features all of those tasks require. In contrast, the audio signal processing method of the embodiment of the disclosure does not care whether the features output by the simulated FFT model match the FFT result; it cares whether those features, when fed into the subsequent tasks, allow the tasks to run normally and produce a result equivalent to the audio output obtained when the FFT result is fed into the same tasks. All the subsequent tasks with formed functions force the simulated FFT model to learn the features they all require, so the features output by the simulated FFT model can serve the tasks better than the FFT's own calculation result. The technical scheme of the disclosure therefore helps achieve both lower power consumption and better results.
Fig. 6 is a schematic structural diagram of an audio signal processing device according to an exemplary embodiment. The audio signal processing device of the present embodiment is a hardware device and mainly includes an NPU (neural network processor) 601 and an audio task processing chip unit 602, both of which are hardware modules, such as processor hardware and chip hardware. The NPU 601 is configured to obtain an audio signal and run inference on it with the trained simulated FFT model to obtain simulated FFT data, where the inference process of the simulated FFT model on the audio signal includes only multiplication, addition, and comparison operations. The audio task processing chip unit 602 is electrically connected to the NPU 601 and is configured to receive the simulated FFT data from the NPU 601 and perform audio task processing on it to obtain an audio output result.
The audio signal processing device of the present embodiment may be an integral part of an audio device or product, such as a TWS earphone, a hearing aid, and the like. The device can reduce the operating power consumption of such audio equipment, can extend service time severalfold when the equipment is battery-powered, and allows the equipment to obtain results similar to or better than those of the original FFT-based scheme.
Fig. 7 is a schematic flow chart of an application scenario of an audio signal processing method and apparatus according to an exemplary embodiment, and Fig. 8 is a schematic topological diagram of training the simulated FFT model according to an exemplary embodiment. As shown in Figs. 7 and 8, the application scenario of this embodiment mainly includes the following steps 701 to 712.
Step 701, a pre-trained simulated FFT model is built, after which step 702 is performed.
Step 702, an audio sample signal is obtained from the audio sample training set, after which steps 703 and 705 are performed.
Step 703, inputting the audio sample signal into a pre-trained simulated FFT model, obtaining simulated FFT prediction data through the pre-trained simulated FFT model, and then executing step 704.
Step 704, performing audio task processing on the simulated FFT prediction data to obtain audio sample task prediction data, and then performing step 707.
Step 705, the audio sample signal is input to an FFT module for FFT to obtain a frequency domain signal, and then step 706 is executed.
Wherein the FFT module is a calculation module for executing FFT.
Step 706, performing audio task processing on the frequency domain signal to obtain audio sample task processing result data, and then performing step 707.
Wherein the audio task processing in step 704 and step 706 is the same.
In an exemplary embodiment, the audio task processing in step 704 and step 706 each includes at least one of noise reduction, gain, echo cancellation, voice wakeup, voice recognition, and voiceprint recognition.
Step 707, obtaining a LOSS function (LOSS) according to the audio sample task prediction data and the audio sample task processing result data, and then executing step 708.
Wherein the loss function characterizes a difference between the audio sample task prediction data and the audio sample task processing result data.
Step 708, adjusting parameters in the pre-trained simulated FFT model according to the loss function, and then executing step 709.
Step 709, judging whether the loss function has converged to the expected value; if so, step 711 is executed, otherwise step 710 is executed.
Step 710, judging whether the number of training iterations has reached the preset number; if so, step 711 is executed, otherwise step 702 is executed.
When the loss function converges to the expected value or the number of training iterations reaches the preset number, training of the simulated FFT model is finished, and the trained simulated FFT model is obtained.
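Steps 702 to 710 can be sketched as a toy training loop. Everything below is an assumption for illustration: the "audio task" is a stand-in log-compression stage, the simulated FFT model is collapsed to a single ReLU layer, and the gradient is derived by hand for this toy only; the point is the two-path structure (model path vs. FFT path) and the loss between their task outputs.

```python
import numpy as np

def audio_task(spectrum):
    # hypothetical frozen subsequent task: simple log compression ("gain")
    return np.log1p(spectrum)

def fft_path(audio):
    # steps 705-706: true FFT, then the frozen audio task
    return audio_task(np.abs(np.fft.rfft(audio)))

def mse(a, b):
    # step 707: the loss characterizing the difference between the two paths
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
audio = rng.standard_normal(512)            # step 702: one audio sample signal
w = rng.standard_normal((257, 512)) * 0.01  # toy single-layer "model" weights
target = fft_path(audio)                    # audio sample task processing result

losses, lr = [], 1e-3
for step in range(200):                     # steps 709-710: fixed iteration cap
    hidden = np.maximum(w @ audio, 0.0)     # steps 703: mul/add + ReLU compare
    pred = audio_task(hidden)               # step 704: task prediction data
    losses.append(mse(pred, target))
    # step 708: hand-derived gradient update, valid for this toy model only
    g = 2.0 * (pred - target) / pred.size / (1.0 + hidden)
    g[hidden <= 0.0] = 0.0                  # ReLU gate blocks inactive units
    w -= lr * np.outer(g, audio)

print(losses[0], losses[-1])                # loss shrinks over training
```

A practical implementation would iterate over a whole audio sample training set and use an autodiff framework instead of a hand-written gradient; the loop structure, however, mirrors the flow of Fig. 7.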
Step 711, the trained simulated FFT model is reproduced at the hardware level using the NPU, and the NPU is electrically connected to the audio task processing chip unit to obtain the audio signal processing device, and then step 712 is executed.
The audio task processing chip unit is a hardware circuit unit configured to receive the simulated FFT data output by the simulated FFT model and perform audio task processing on it to obtain an audio output result.
The audio task processing performed by the audio task processing chip unit includes one or more of noise reduction, gain, echo cancellation, voice wakeup, voice recognition, and voiceprint recognition.
Step 712, an audio signal is obtained, the NPU is used to obtain the simulated FFT data output by the simulated FFT model, and the simulated FFT data is input into the audio task processing chip unit to obtain the audio output result.
Fig. 9 is a schematic diagram of the logical structure of an audio signal processing apparatus according to an exemplary embodiment. As shown in Fig. 9, the audio signal processing apparatus includes a signal acquisition module 901, a simulated FFT processing module 902, and an audio task processing module 903. The signal acquisition module 901 is configured to acquire an audio signal; the simulated FFT processing module 902 is configured to run inference on the audio signal through the trained simulated FFT model to obtain simulated FFT data, where the inference process of the simulated FFT model on the audio signal includes only multiplication, addition, and comparison operations; the audio task processing module 903 is configured to perform audio task processing on the simulated FFT data to obtain an audio output result.
In an exemplary embodiment, the audio signal processing apparatus further comprises a model training module configured to perform:
acquiring an audio sample signal;
inputting the audio sample signal into a pre-trained simulated FFT model, and obtaining simulated FFT prediction data through the pre-trained simulated FFT model;
performing audio task processing on the simulated FFT prediction data to obtain audio sample task prediction data;
performing the FFT on the audio sample signal to obtain a frequency domain signal;
performing audio task processing on the frequency domain signal to obtain audio sample task processing result data;
and adjusting parameters in the pre-trained simulated FFT model according to the difference between the audio sample task prediction data and the audio sample task processing result data to obtain the trained simulated FFT model.
In an exemplary embodiment, the model training module is further configured to perform:
establishing an MSE loss function according to the audio sample task prediction data and the audio sample task processing result data;
and adjusting parameters in the pre-trained simulated FFT model according to the MSE loss function until the MSE loss function converges to a desired value or the training iteration number is reached.
In an exemplary embodiment, the audio tasks include at least one of noise reduction, gain, echo cancellation, voice wakeup, voice recognition, voiceprint recognition.
In an exemplary embodiment, the simulated FFT model includes at least one feature extractor in series;
in the case of more than one feature extractor, the feature extractors are connected in series, i.e., logically chained in the data processing flow: between any two adjacent feature extractors, the feature data output by the previous one is the input data of the next one; the audio signal is input to the first feature extractor in the simulated FFT model, and the feature data output by the last feature extractor is the simulated FFT data.
In an exemplary embodiment, each feature extractor includes at least one CNN unit and at least one ReLU unit connected to the CNN unit, and the feature extractor outputs feature data obtained by processing input data of the feature extractor by the CNN unit and then by the ReLU unit.
In an exemplary embodiment, the number of feature extractors is 10 to 100.
The audio signal processing device of the embodiment of the disclosure processes the audio signal with the neural-network-based simulated FFT model to achieve a function similar to the FFT. Since the calculation of the feature extractors in the simulated FFT model involves only multiplication and addition, the operating power consumption of the device can be reduced relative to FFT processing, for example from 300 mW for the FFT down to 0.1 mW, so that under battery power, the hardware device of the embodiment of the disclosure can extend its service time severalfold relative to an audio signal processing scheme containing FFT processing. The embodiment of the disclosure is not concerned with whether the features output by the simulated FFT model are consistent with the FFT result, but with whether those features, when fed into the subsequent tasks, allow the tasks to run normally and produce a result equivalent to the audio output obtained when the FFT result is fed into the same tasks. All the subsequent tasks with formed functions force the simulated FFT model to learn the features they all require, so the features output by the simulated FFT model can serve the tasks better than the FFT's own calculation result. Therefore, the audio signal processing device of the embodiment of the disclosure helps achieve both lower power consumption and better results.
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
With respect to the audio signal processing apparatus in the above-described embodiments, the specific manner in which the respective units perform the operations has been described in detail in the embodiments concerning the audio signal processing method, and will not be described in detail here.
It should be noted that the above embodiments are described only with the functional module division given as an example; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. In some embodiments, the electronic device is a server. The electronic device 1000 may vary considerably in configuration and performance, and may include one or more processors (Central Processing Units, CPUs) 1001 and one or more memories 1002, where the memories 1002 store at least one program code that is loaded and executed by the processors 1001 to implement the audio signal processing method provided in the above embodiments. Of course, the electronic device 1000 may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described here.
In an exemplary embodiment, a computer readable storage medium is also provided, such as a memory, comprising at least one instruction executable by a processor in a computer device to perform the audio signal processing method of the above embodiment.
Alternatively, the above-described computer-readable storage medium may be a non-transitory computer-readable storage medium, which may include, for example, ROM (Read-Only Memory), RAM (Random-Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, optical data storage device, and the like.
The foregoing description of the preferred embodiments of the present disclosure is not intended to limit the disclosure, but rather to cover all modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present disclosure.

Claims (11)

1. An audio signal processing method, comprising:
acquiring an audio signal;
performing inference on the audio signal through a trained simulated fast Fourier transform (FFT) model to obtain simulated FFT data, wherein the inference process of the simulated FFT model on the audio signal comprises only multiplication, addition, and comparison operations;
and performing audio task processing on the simulated FFT data to obtain an audio output result.
2. The audio signal processing method according to claim 1, wherein the training process of the simulated FFT model comprises:
acquiring an audio sample signal;
inputting the audio sample signal into a pre-trained simulated FFT model, and obtaining simulated FFT prediction data through the pre-trained simulated FFT model;
performing the audio task processing on the simulated FFT prediction data to obtain audio sample task prediction data;
performing the FFT on the audio sample signal to obtain a frequency domain signal;
performing the audio task processing on the frequency domain signal to obtain audio sample task processing result data;
and adjusting parameters in the pre-trained simulated FFT model according to the difference between the audio sample task prediction data and the audio sample task processing result data to obtain the trained simulated FFT model.
3. The method according to claim 2, wherein said adjusting parameters in the pre-trained simulated FFT model based on differences between the audio sample task prediction data and the audio sample task processing result data comprises:
establishing a mean square error MSE loss function according to the audio sample task prediction data and the audio sample task processing result data;
and adjusting parameters in the pre-trained simulated FFT model according to the MSE loss function until the MSE loss function converges to a desired value or reaches the training iteration number.
4. The audio signal processing method according to claim 1, wherein:
the audio task includes at least one of noise reduction, gain, echo cancellation, voice wakeup, voice recognition, voiceprint recognition.
5. The audio signal processing method according to any one of claims 1 to 4, characterized in that:
the simulated FFT model comprises at least one feature extractor in series;
in the case where there is more than one feature extractor, the feature extractors are connected in series, wherein between any two adjacent feature extractors the feature data output by the previous feature extractor is the input data of the next feature extractor, the audio signal is input to the first feature extractor in the simulated FFT model, and the feature data output by the last feature extractor in the simulated FFT model is the simulated FFT data.
6. The audio signal processing method according to claim 5, wherein:
each feature extractor comprises at least one convolutional neural network (CNN) unit and at least one linear rectification function (ReLU) unit connected with the CNN unit, and the feature data output by the feature extractor is obtained by processing the input data of the feature extractor through the CNN unit and then through the ReLU unit.
7. The audio signal processing method according to claim 5, wherein:
the number of feature extractors is 10 to 100.
8. The audio signal processing method according to claim 5, characterized in that the audio signal processing method further comprises:
when a new audio task is added in a case where the number of feature extractors reaches or exceeds a feature extractor number threshold and the number of audio tasks in the audio task processing reaches or exceeds an audio task number threshold, directly applying the simulated FFT data obtained by inference of the trained simulated FFT model on the audio signal to the new audio task to obtain an audio output result corresponding to the new audio task.
9. An audio signal processing apparatus, comprising:
a signal acquisition module configured to perform acquisition of an audio signal;
the simulated fast Fourier transform (FFT) processing module is configured to perform inference on the audio signal through a trained simulated FFT model to obtain simulated FFT data, wherein the inference process of the simulated FFT model on the audio signal comprises only multiplication, addition, and comparison operations;
and the audio task processing module is configured to perform audio task processing on the simulated FFT data to obtain an audio output result.
10. An audio signal processing apparatus, comprising:
the neural network processor (NPU) is configured to acquire an audio signal and perform inference on the audio signal through a trained simulated fast Fourier transform (FFT) model to obtain simulated FFT data, wherein the inference process of the simulated FFT model on the audio signal comprises only multiplication, addition, and comparison operations;
the audio task processing chip unit is electrically connected with the NPU and configured to receive the simulated FFT data from the NPU and perform audio task processing on the simulated FFT data to obtain an audio output result.
11. An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the executable instructions to implement the audio signal processing method of any of claims 1 to 8.
CN202310892837.0A 2023-07-20 2023-07-20 Audio signal processing method and device and electronic equipment Active CN116612760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310892837.0A CN116612760B (en) 2023-07-20 2023-07-20 Audio signal processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN116612760A true CN116612760A (en) 2023-08-18
CN116612760B CN116612760B (en) 2023-11-03

Family

ID=87684002


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180061394A1 (en) * 2016-09-01 2018-03-01 Samsung Electronics Co., Ltd. Voice recognition apparatus and method
CN114283795A (en) * 2021-12-24 2022-04-05 思必驰科技股份有限公司 Training and recognition method of voice enhancement model, electronic equipment and storage medium
CN114882884A (en) * 2022-07-06 2022-08-09 深圳比特微电子科技有限公司 Multitask implementation method and device based on deep learning model
CN114882873A (en) * 2022-07-12 2022-08-09 深圳比特微电子科技有限公司 Speech recognition model training method and device and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant