CN113766405A - Method and device for detecting noise of loudspeaker, electronic equipment and storage medium - Google Patents

Method and device for detecting noise of loudspeaker, electronic equipment and storage medium

Info

Publication number
CN113766405A
Authority
CN
China
Prior art keywords
noise
audio signal
scales
scale
probability
Prior art date
Legal status
Pending
Application number
CN202110833182.0A
Other languages
Chinese (zh)
Inventor
宋广伟
Current Assignee
Shanghai Wingtech Information Technology Co Ltd
Shanghai Wentai Information Technology Co Ltd
Original Assignee
Shanghai Wingtech Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Wingtech Information Technology Co Ltd filed Critical Shanghai Wingtech Information Technology Co Ltd
Priority to CN202110833182.0A priority Critical patent/CN113766405A/en
Priority to PCT/CN2021/115791 priority patent/WO2023000444A1/en
Publication of CN113766405A publication Critical patent/CN113766405A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R29/00Monitoring arrangements; Testing arrangements
    • H04R29/001Monitoring arrangements; Testing arrangements for loudspeakers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination


Abstract

The present application relates to the technical field of artificial intelligence and provides a method and an apparatus for detecting noise of a loudspeaker, an electronic device, and a storage medium. The method includes the following steps: collecting an audio signal in a loudspeaker; convolving the audio signal on a plurality of scales respectively to generate a plurality of features corresponding to the scales; generating a fused feature from the features, and determining the probability of the fused feature according to a pre-trained classification model; if the probability is greater than or equal to a threshold, determining that the audio signal contains noise; if the probability is less than the threshold, determining that the audio signal does not contain noise. With this method, noise detection accuracy and processing efficiency can be improved.

Description

Method and device for detecting noise of loudspeaker, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for detecting noise of a speaker, an electronic device, and a storage medium.
Background
As a key audio output component of electronic devices, the micro-speaker is widely used in intelligent hardware such as smart speakers, tablet computers and mobile phones. During micro-speaker production, noise detection has become a key factor in determining production quality, and the requirements on the accuracy and efficiency of noise detection methods are increasingly strict.
In the related art, noise detection is performed by a hardware detection system: an audio signal generator excites the micro-speaker, an artificial ear picks up the sound pressure signal, the signal is transmitted to a computer through A/D conversion and a data acquisition card, the Rub & Buzz value at each frequency point is calculated, features are extracted, and detection and identification are performed by comparison against empirical thresholds. In this scheme, the tester manually selects the noise decision threshold at each frequency point on the basis of a number of tested signals; for noise detection scenarios with high precision requirements, such thresholds are difficult to set, and the detection accuracy needs to be improved.
Disclosure of Invention
In view of the above, it is desirable to provide a method, an apparatus, an electronic device, and a storage medium for detecting noise of a speaker, which can improve noise detection accuracy and processing efficiency.
The embodiment of the application provides a noise detection method for a loudspeaker, which comprises the following steps:
collecting audio signals in a loudspeaker;
convolving the audio signal on a plurality of scales respectively to generate a plurality of characteristics corresponding to the scales;
generating fusion characteristics according to the characteristics, and determining the probability of the fusion characteristics according to a pre-trained classification model;
if the probability is larger than or equal to a threshold value, determining that the audio signal contains noise;
if the probability is less than the threshold, determining that the audio signal does not contain noise.
In one embodiment, the method further comprises:
acquiring a sweep frequency signal containing noise and a sweep frequency signal not containing noise;
performing convolution on the sweep frequency signals on a plurality of scales respectively to generate a plurality of sample characteristics corresponding to the scales;
fusing the plurality of sample features to generate a sample fusion feature, and determining a prediction probability of the sample fusion feature;
and training a classification model according to the prediction probability and the labeled value of the sweep frequency signal.
In one embodiment, the plurality of scales includes a first scale, a second scale, and a third scale, and the first scale is smaller than the second scale, and the second scale is smaller than the third scale, the convolving the audio signal over the plurality of scales, respectively, generating a plurality of features corresponding to the plurality of scales includes:
convolving the audio signal over the first scale, the second scale, and the third scale, respectively, to extract first, second, and third features of the audio signal corresponding in time domain to the first, second, and third scales.
In one embodiment, the method further comprises:
normalizing the audio signal in the loudspeaker;
and determining an original value and a target value of the sampling rate of the audio signal, and reducing the sampling rate of the audio signal after normalization processing from the original value to the target value.
In one embodiment, the probability of the fused feature is calculated as follows:
$p_k = \frac{e^{z_k}}{e^{z_0} + e^{z_1}}$

where $z_k$ represents the k-th value of the fully connected layer, $z_1$ corresponds to the class of audio samples containing noise, and $z_0$ corresponds to the class of audio samples without noise.
In one embodiment, the convolution layer is calculated as follows:
$C_i = \delta(X * w_i + b_i)$

where $C_i$ denotes the output of the i-th convolutional layer, $\delta$ is the activation function, $X$ denotes the audio signal, $*$ denotes the convolution operation, $w_i$ denotes the convolutional layer weights, and $b_i$ denotes the convolutional layer bias.
In one embodiment, the method further comprises:
calculating each sweep frequency signal containing noise by adopting a nearest neighbor algorithm to obtain a plurality of nearest neighbors;
performing random linear interpolation on any two of the neighbors to generate a simulated swept-frequency signal containing noise;
and repeating the steps until the sum of the number of the sweep frequency signals containing the noise and the number of the simulated sweep frequency signals containing the noise is equal to the number of the sweep frequency signals not containing the noise.
The embodiment of the application provides a noise detection device of speaker, the device includes:
the acquisition module is used for acquiring audio signals in the loudspeaker;
the extraction module is used for convolving the audio signals on a plurality of scales respectively to generate a plurality of characteristics corresponding to the scales;
the generating module is used for generating fusion characteristics according to the characteristics and determining the probability of the fusion characteristics according to a pre-trained classification model;
a first determining module, configured to determine that the audio signal includes a noise if the probability is greater than or equal to a threshold;
a second determining module, configured to determine that the audio signal does not include a noise if the probability is smaller than the threshold.
An embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the noise detection method for a speaker provided in any embodiment of the present application when executing the computer program.
Embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the noise detection method for a speaker provided in any embodiment of the present application.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
the method comprises the steps of performing convolution on audio signals in a loudspeaker on a plurality of scales respectively to generate a plurality of features corresponding to the scales, further generating fusion features according to the features, determining the probability of the fusion features according to a pre-trained classification model, and determining whether the audio signals contain noise according to the probability, so that fusion feature information of the audio signals on different scales in a time domain can be mined, whether the audio signals contain the noise is judged by calculating the probability, the feature information detection accuracy is improved, the calculation complexity is reduced at a test end, and the noise detection processing efficiency and the noise detection accuracy are improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below; for those skilled in the art, other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flowchart illustrating a noise detection method for a speaker according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of model training provided in the embodiments of the present application;
fig. 3 is a flow chart of noise detection according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a noise detection apparatus for a speaker according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In an embodiment, as shown in fig. 1, a method for detecting noise of a speaker is provided. This embodiment is illustrated by applying the method to a terminal; it should be understood that the method may also be applied to a server, or to a system including a terminal and a server and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
step 102, collecting an audio signal in a loudspeaker.
The method provided by the embodiment of the present application can be used to detect whether the audio signal in a loudspeaker contains noise, and in particular for noise detection of micro-speakers. A micro-speaker is, for example, the audio output device of intelligent hardware such as a smart speaker, a tablet computer or a mobile phone, and typically consists of a frame (basket), magnet, pole piece, diaphragm, voice coil, front cover, terminal board, damping cloth, and the like.
The audio signal may be a standard 20 Hz-20 kHz frequency sweep signal, or speech, music, and the like.
In an embodiment of the present application, after the audio signal in the speaker is collected, the audio signal may be preprocessed. Specifically, the audio signal may be normalized and mapped into (0, 1), so that all test data lie on the same scale; this reduces the amount of computation and avoids abnormal test results caused by inconsistent scales. Furthermore, an original value and a target value of the audio signal sampling rate are determined, and the sampling rate of the normalized audio signal is reduced from the original value to the target value. For example, if the sampling rate of the audio signal is 48 kHz, the signal is down-sampled to 5 kHz. The noise of a micro-speaker is usually in the medium-to-low frequency range and falls within the 20 Hz to 2000 Hz band audible to the human ear, so down-sampling the audio signal to a rate matched to this band reduces the data volume and improves detection efficiency.
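As a rough illustration of this preprocessing step, the sketch below min-max normalizes a waveform and down-samples it from 48 kHz to 5 kHz. The use of NumPy, scipy.signal.resample_poly, and the small epsilon guarding against a constant signal are implementation assumptions, not details taken from the patent.

```python
import numpy as np
from scipy.signal import resample_poly


def preprocess(audio, orig_rate=48_000, target_rate=5_000):
    """Normalize a 1-D waveform into the unit range and reduce its sampling rate."""
    audio = np.asarray(audio, dtype=np.float64)

    # Min-max normalization: map the raw samples into the unit range so that
    # all test signals are on the same scale.
    span = audio.max() - audio.min()
    normalized = (audio - audio.min()) / (span + 1e-12)

    # Polyphase resampling from the original rate (e.g. 48 kHz) down to the
    # target rate (e.g. 5 kHz).
    return resample_poly(normalized, up=target_rate, down=orig_rate)
```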
Step 104, convolving the audio signal on a plurality of scales respectively to generate a plurality of features corresponding to the plurality of scales.
In the embodiment of the application, multiple scales with different sizes can be set, the audio signal is convolved on the multiple scales respectively, and multiple features are generated, so that the extraction of feature information of the audio signal on different scales on a time domain is realized.
Wherein the scale refers to a convolution kernel scale.
As an example, the plurality of scales includes a first scale, a second scale and a third scale, where the first scale is smaller than the second scale and the second scale is smaller than the third scale. In this example, convolving the audio signal on the plurality of scales to generate the plurality of features includes: convolving the audio signal on the first scale, the second scale and the third scale respectively, to extract a first feature, a second feature and a third feature of the audio signal corresponding, in the time domain, to the first scale, the second scale and the third scale.
Optionally, the convolution operation is implemented by a convolutional layer, and a calculation formula of the convolutional layer is as follows:
$C_i = \delta(X * w_i + b_i)$

where $C_i$ denotes the output of the i-th convolutional layer, $\delta$ is the activation function, $X$ denotes the audio signal, $*$ denotes the convolution operation, $w_i$ denotes the convolutional layer weights, and $b_i$ denotes the convolutional layer bias.
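For concreteness, a minimal PyTorch sketch of three parallel convolution branches over the same signal is given below. The kernel sizes (3, 15, 63), strides, channel count and the ReLU activation are illustrative assumptions; the text only requires that the first scale be smaller than the second and the second smaller than the third.

```python
import torch
import torch.nn as nn


class MultiScaleExtractor(nn.Module):
    """Three parallel 1-D convolutions of increasing kernel size over the same audio signal."""

    def __init__(self, channels=16):
        super().__init__()
        self.branch1 = nn.Conv1d(1, channels, kernel_size=3, stride=1)   # first (smallest) scale
        self.branch2 = nn.Conv1d(1, channels, kernel_size=15, stride=2)  # second scale
        self.branch3 = nn.Conv1d(1, channels, kernel_size=63, stride=4)  # third (largest) scale
        self.act = nn.ReLU()  # plays the role of the activation delta in the formula above

    def forward(self, x):
        # x: (batch, 1, samples) preprocessed audio signal X
        f1 = self.act(self.branch1(x))  # first feature
        f2 = self.act(self.branch2(x))  # second feature
        f3 = self.act(self.branch3(x))  # third feature
        return f1, f2, f3
```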
And 106, generating fusion characteristics according to the plurality of characteristics, and determining the probability of the fusion characteristics according to the pre-trained classification model.
In this embodiment, a plurality of features respectively corresponding to a plurality of scales are subjected to fusion processing to generate a fusion feature.
As an example, taking the multiple scales including the first scale, the second scale and the third scale as an example, the first feature, the second feature and the third feature are fused and combined into a one-dimensional feature, and the one-dimensional feature is taken as a fused feature.
The classification model is trained with noise-containing audio signals and noise-free audio signals as training samples. The input of the pre-trained classification model is the fused feature, and its output is the probability of the fused feature, which indicates the probability that the audio signal contains noise.
In one embodiment of the present application, the classification model includes a fully connected layer and a noise detection function, which may be a softmax function. Optionally, the probability of the fusion feature is calculated as follows:
$p_k = \frac{e^{z_k}}{e^{z_0} + e^{z_1}}$

where $z_k$ represents the k-th value of the fully connected layer, $z_1$ corresponds to the class of audio samples containing noise, and $z_0$ corresponds to the class of audio samples without noise.
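Continuing the sketch above, one possible way to fuse the three features into a one-dimensional vector and obtain the noise probability through a fully connected layer and softmax is shown below. The adaptive pooling used to equalize the branch lengths and the layer sizes are assumptions; the threshold comparison of step 108 appears as a usage note.

```python
class NoiseClassifier(nn.Module):
    """Fuse the multi-scale features and output the probability of noise."""

    def __init__(self, channels=16, pooled_len=8):
        super().__init__()
        self.extractor = MultiScaleExtractor(channels)
        self.pool = nn.AdaptiveAvgPool1d(pooled_len)       # equalize branch lengths (assumption)
        self.fc = nn.Linear(3 * channels * pooled_len, 2)  # two classes: no noise / noise

    def forward(self, x):
        f1, f2, f3 = self.extractor(x)
        # Concatenate the flattened branch outputs into one 1-D fused feature.
        fused = torch.cat([self.pool(f).flatten(1) for f in (f1, f2, f3)], dim=1)
        z = self.fc(fused)                    # z_k: k-th value of the fully connected layer
        return torch.softmax(z, dim=1)[:, 1]  # probability that the signal contains noise


# Usage (step 108): flag the signal as noisy once the probability reaches the threshold.
# model = NoiseClassifier()
# prob = model(torch.from_numpy(preprocess(waveform)).float()[None, None, :])
# contains_noise = bool(prob.item() >= 0.5)
```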
Step 108, if the probability is greater than or equal to the threshold, determining that the audio signal contains noise; if the probability is less than the threshold, determining that the audio signal does not contain noise.
In this embodiment, whether the audio signal contains noise may be determined according to the probability of the fused feature. As an example, with a threshold of 0.5, if the probability output by the classification model is greater than or equal to 0.5, the audio signal is determined to contain noise; otherwise it is determined not to contain noise.
It should be noted that determining that the audio signal contains noise when the probability is greater than or equal to the threshold is merely an example; the specific decision logic depends on how the model is trained and is not limited here.
According to the noise detection method for a loudspeaker, the audio signal in the loudspeaker is convolved on a plurality of scales to generate a plurality of features corresponding to the scales; a fused feature is then generated from the features, the probability of the fused feature is determined according to a pre-trained classification model, and whether the audio signal contains noise is determined according to the probability. In this way, fused feature information of the audio signal at different time-domain scales can be mined, and the noise decision is made by computing a probability, which improves the accuracy of feature detection and reduces the computational complexity at the test end. The hardware detection scheme in the related art places high requirements on the precision of the audio signal generator, its decision thresholds are difficult to set, and its detection time is long; compared with that scheme, the present method reduces decision errors caused by improperly chosen thresholds and improves both the processing efficiency and the accuracy of noise detection.
Based on the above embodiments, the training side is explained below.
Fig. 2 is a schematic flowchart of a model training process provided in an embodiment of the present application, as shown in fig. 2, including the following steps:
step 202, obtaining a sweep frequency signal containing noise and a sweep frequency signal not containing noise.
In this embodiment, noise-containing and noise-free sweep signals played by the micro-speaker may be collected and used as a training set, which is fed into the constructed convolutional neural network containing convolution kernels of different scales and strides for parallel training.
In an embodiment of the present application, after the noise-containing and noise-free sweep signals are obtained, they may be preprocessed. Specifically, each sweep signal may be normalized and mapped into (0, 1). Further, an original value and a target value of the sweep-signal sampling rate are determined, and the sampling rate of the normalized sweep signal is reduced from the original value to the target value, where the original value is larger than the target value.
And 204, respectively convolving the sweep frequency signals on a plurality of scales to generate a plurality of sample characteristics corresponding to the plurality of scales.
In the embodiment of the application, multiple scales with different sizes can be set, and the sweep frequency signals (including the sweep frequency signals containing noise and the sweep frequency signals without noise) are respectively convolved on the multiple scales to generate multiple features, so that the extraction of feature information of the audio signals on different scales on a time domain is realized.
Wherein the scale refers to a convolution kernel scale.
As an example, the plurality of scales includes a first scale $K_1$, a second scale $K_2$ and a third scale $K_3$, where $K_1 < K_2 < K_3$. The sweep signal is convolved on $K_1$, $K_2$ and $K_3$ respectively to extract a first sample feature, a second sample feature and a third sample feature corresponding, in the time domain, to $K_1$, $K_2$ and $K_3$.
A plurality of strides can also be set for the convolution kernels. For example, the strides include a first stride $S_1$, a second stride $S_2$ and a third stride $S_3$, where $S_1 < S_2 < S_3$. In this way, feature information of the sweep signal over time-domain extents from small to large can be extracted.
Optionally, the convolution operation is implemented by a convolutional layer, and a calculation formula of the convolutional layer is as follows:
$C_i = \delta(X * w_i + b_i)$

where $C_i$ denotes the output of the i-th convolutional layer, $\delta$ is the activation function, $X$ denotes the audio signal, $*$ denotes the convolution operation, $w_i$ denotes the convolutional layer weights, and $b_i$ denotes the convolutional layer bias.
And step 206, fusing the plurality of sample features to generate a sample fusion feature, and determining the prediction probability of the sample fusion feature.
In this embodiment, a plurality of sample features are subjected to fusion processing and combined into a one-dimensional feature to generate a sample fusion feature.
As an example, taking the multiple scales including the first scale, the second scale and the third scale as an example, the first sample feature, the second sample feature and the third sample feature are fused and combined into a one-dimensional feature, and the one-dimensional feature is taken as a sample fusion feature.
In one embodiment of the present application, the classification model includes a fully connected layer and a noise detection function, which may be a softmax function. Optionally, the prediction probability of the sample fusion feature is calculated as follows:
$p_k = \frac{e^{z_k}}{e^{z_0} + e^{z_1}}$

where $z_k$ represents the k-th value of the fully connected layer, $z_1$ corresponds to the class of audio samples containing noise, and $z_0$ corresponds to the class of audio samples without noise.
And 208, training a classification model according to the prediction probability and the labeled value of the sweep frequency signal.
In this embodiment, each sweep signal corresponds to a label value; for example, a noise-containing sweep signal is labeled 1 and a noise-free sweep signal is labeled 0. A loss value is calculated from a preset loss function, the prediction probability and the label value, and the model parameters are updated by back-propagation until the model converges and its accuracy exceeds a preset value, so that the classification model can accurately predict whether an audio signal contains noise.
Optionally, a multi-scale end-to-end convolutional neural network may be constructed, including convolutional layers, a fully connected layer and a noise detection function. The network is trained on the training set, and the trained model is saved once it has converged and its accuracy exceeds the preset value. The convolutional neural network model is used to determine whether an input audio signal contains noise.
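A minimal training loop consistent with this description, building on the sketches above, might look as follows. The binary cross-entropy loss, the Adam optimizer, the learning rate and the epoch count are assumptions; the text only requires a preset loss function, back-propagation, and training until convergence.

```python
def train(model, loader, epochs=50, lr=1e-3):
    """Train the classifier on (sweep, label) batches; label 1 = noise, 0 = no noise."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()

    for _ in range(epochs):
        for sweeps, labels in loader:          # sweeps: (B, 1, samples), labels: (B,)
            prob = model(sweeps)               # predicted probability of noise
            loss = loss_fn(prob, labels.float())
            optimizer.zero_grad()
            loss.backward()                    # back-propagate the loss
            optimizer.step()                   # update the model parameters
    return model
```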
It should be noted that the input of the model may also be other acoustic characteristics of the audio signal played by the micro-speaker, such as a frequency spectrum, a logarithmic mel-frequency spectrum, and the like.
In the embodiment of the application, fusion characteristic information on different scales in a time domain is extracted through the sweep frequency signal containing the noise and the sweep frequency signal not containing the noise, and the model can be trained through the prediction probability of the sample fusion characteristic and the labeled value of the sweep frequency signal, so that whether the audio signal contains the noise or not can be accurately judged by the model. Furthermore, the pre-trained model is applied to the noise detection of the audio signal in the loudspeaker, so that the noise detection processing efficiency and the detection accuracy are improved.
In an embodiment of the present application, since the collected sweep frequency signals with noise and the sweep frequency signals without noise played by the micro-speaker are usually severely unbalanced in number, the training set can be constructed as follows: calculating each sweep frequency signal containing noise by adopting a nearest neighbor algorithm to obtain a plurality of neighbors, and performing random linear interpolation on any two neighbors to generate simulation sweep frequency signals containing noise; and repeating the steps until the sum of the number of the sweep frequency signals containing the noise and the number of the simulated sweep frequency signals containing the noise is equal to the number of the sweep frequency signals not containing the noise.
As an example, the collected noise-containing and noise-free sweep signals played by the micro-speaker (i.e., the non-good and good products in production-line testing) number 90 and 3600 respectively. Because this proportion is severely unbalanced, the present application processes the 90 non-good samples to generate simulated non-good samples comparable in number to the good samples. The steps are as follows: using a nearest neighbor algorithm, the 5 nearest neighbors of each non-good sample are calculated, and 2 of these 5 neighbors are randomly selected for random linear interpolation; a new simulated non-good sample is thus constructed, and the new samples are combined with the original data to generate a new training set.
The new data set contains 7200 samples (3600 good and 3600 non-good), which is divided 4:1 into a training set and a validation set. Good and non-good samples are marked as "1" and "0" respectively using one-hot coding. The multi-scale end-to-end convolutional neural network is trained on the training set, the model parameters are iteratively updated until convergence and the trained model is output, and the model is then evaluated on the validation set to output the detection result. As an example, fig. 3 is a flowchart of a noise detection scenario provided in an embodiment of the present application.
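The oversampling of the minority (noise-containing) class can be sketched as below. This is a SMOTE-style scheme consistent with the description; the Euclidean distance, the Gram-matrix distance computation and the uniform interpolation factor are implementation assumptions.

```python
import numpy as np


def oversample_noise(noise_sweeps, target_count, k=5, seed=0):
    """Synthesize noise-containing sweeps by random linear interpolation between
    nearest neighbours until target_count samples exist in total.

    noise_sweeps: (n, length) array of noise-containing sweep signals.
    """
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(noise_sweeps ** 2, axis=1)
    dists = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * noise_sweeps @ noise_sweeps.T, 0.0))

    synthetic = []
    while len(noise_sweeps) + len(synthetic) < target_count:
        i = rng.integers(len(noise_sweeps))
        neighbours = np.argsort(dists[i])[1:k + 1]             # k nearest neighbours (self excluded)
        a, b = noise_sweeps[rng.choice(neighbours, size=2, replace=False)]
        t = rng.uniform()                                      # random interpolation factor
        synthetic.append(a + t * (b - a))                      # simulated noise-containing sweep

    return np.vstack([noise_sweeps] + synthetic)
```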
It should be understood that although the various steps in the flowcharts of fig. 1-3 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 1-3 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, there is provided a noise detection apparatus of a speaker, including: the system comprises an acquisition module 41, an extraction module 42, a generation module 43, a first determination module 44 and a second determination module 45.
The acquiring module 41 is configured to acquire an audio signal in a speaker.
And an extraction module 42, configured to convolve the audio signal on multiple scales respectively to generate multiple features corresponding to the multiple scales.
And a generating module 43, configured to generate a fusion feature according to the plurality of features, and determine a probability of the fusion feature according to a pre-trained classification model.
The first determining module 44 is configured to determine that the audio signal includes a noise if the probability is greater than or equal to a threshold.
A second determining module 45, configured to determine that the audio signal does not include a noise if the probability is smaller than the threshold.
In one embodiment, the apparatus further comprises: the training module is used for acquiring sweep frequency signals containing noise and sweep frequency signals without noise; performing convolution on the sweep frequency signals on a plurality of scales respectively to generate a plurality of sample characteristics corresponding to the scales; fusing a plurality of sample features to generate sample fusion features, and determining the prediction probability of the sample fusion features; and training a classification model according to the prediction probability and the labeled value of the sweep frequency signal.
In one embodiment, the plurality of scales includes a first scale, a second scale, and a third scale, and the first scale is smaller than the second scale, and the second scale is smaller than the third scale, and the extraction module 42 is specifically configured to: the audio signal is convolved on a first scale, a second scale and a third scale respectively to extract a first feature, a second feature and a third feature of the audio signal corresponding to the first scale, the second scale and the third scale in a time domain.
In one embodiment, the apparatus further comprises: the preprocessing module is used for carrying out normalization processing on the audio signals in the loudspeaker; and determining an original value and a target value of the sampling rate of the audio signal, and reducing the sampling rate of the audio signal after normalization processing from the original value to the target value.
In one embodiment, the probability of a fused feature is calculated as follows:
$p_k = \frac{e^{z_k}}{e^{z_0} + e^{z_1}}$

where $z_k$ represents the k-th value of the fully connected layer, $z_1$ corresponds to the class of audio samples containing noise, and $z_0$ corresponds to the class of audio samples without noise.
In one embodiment, the convolution layer is calculated as follows:
$C_i = \delta(X * w_i + b_i)$

where $C_i$ denotes the output of the i-th convolutional layer, $\delta$ is the activation function, $X$ denotes the audio signal, $*$ denotes the convolution operation, $w_i$ denotes the convolutional layer weights, and $b_i$ denotes the convolutional layer bias.
In one embodiment, the apparatus further comprises: the acquisition module is used for calculating each sweep frequency signal containing noise by adopting a nearest neighbor algorithm to obtain a plurality of neighbors; random linear interpolation is carried out on any two of the neighbors to generate simulation frequency sweeping signals containing noise; and repeating the steps until the sum of the number of the sweep frequency signals containing the noise and the number of the simulated sweep frequency signals containing the noise is equal to the number of the sweep frequency signals not containing the noise.
For specific limitations of the noise detection apparatus for a speaker, reference may be made to the above limitations of the noise detection method for a speaker; the apparatus has functional modules corresponding to the method and achieves the corresponding beneficial effects, which are not described here again. The modules in the noise detection apparatus of the speaker can be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in or independent of a processor of the electronic device in hardware form, or stored in a memory of the electronic device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, an electronic device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 5. The electronic device comprises a processor, a memory, a communication interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the electronic device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, Near Field Communication (NFC) or other technologies. The computer program is executed by a processor to implement a method of noise detection for a loudspeaker. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the electronic equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the configuration shown in fig. 5 is a block diagram of only a portion of the configuration associated with the present application, and does not constitute a limitation on the electronic device to which the present application is applied, and a particular electronic device may include more or less components than those shown in the drawings, or may combine certain components, or have a different arrangement of components.
In one embodiment, the noise detection apparatus for a speaker provided in the present application may be implemented in the form of a computer program, and the computer program may be run on an electronic device as shown in fig. 5. The memory of the electronic device may store various program modules of the noise detection apparatus of the speaker, such as the acquisition module 41, the extraction module 42, the generation module 43, the first determination module 44, and the second determination module 45 shown in fig. 4. The computer program constituted by the respective program modules causes the processor to execute the steps in the noise detection method for a speaker of the respective embodiments of the present application described in the present specification.
For example, the electronic device shown in fig. 5 may collect the audio signal in the speaker through the acquisition module 41 of the noise detection apparatus shown in fig. 4. The electronic device may convolve the audio signal on a plurality of scales through the extraction module 42 to generate a plurality of features corresponding to the scales. The electronic device may generate a fused feature from the features through the generation module 43 and determine the probability of the fused feature according to the pre-trained classification model. The electronic device may determine, through the first determination module 44, that the audio signal contains noise if the probability is greater than or equal to the threshold, and determine, through the second determination module 45, that the audio signal does not contain noise if the probability is smaller than the threshold.
In one embodiment, an electronic device is provided, comprising a memory storing a computer program and a processor implementing the following steps when the processor executes the computer program: collecting audio signals in a loudspeaker; convolving the audio signal on a plurality of scales respectively to generate a plurality of characteristics corresponding to the scales; generating fusion characteristics according to the characteristics, and determining the probability of the fusion characteristics according to a pre-trained classification model; if the probability is larger than or equal to the threshold value, determining that the audio signal contains noise; if the probability is less than the threshold, it is determined that the audio signal does not contain a noise.
In one embodiment, the processor when executing the computer program may further perform the steps of: acquiring a sweep frequency signal containing noise and a sweep frequency signal not containing noise; performing convolution on the sweep frequency signals on a plurality of scales respectively to generate a plurality of sample characteristics corresponding to the scales; fusing a plurality of sample features to generate sample fusion features, and determining the prediction probability of the sample fusion features; and training a classification model according to the prediction probability and the labeled value of the sweep frequency signal.
In one embodiment, the processor when executing the computer program may further perform the steps of: the audio signal is convolved on a first scale, a second scale and a third scale respectively to extract a first feature, a second feature and a third feature of the audio signal corresponding to the first scale, the second scale and the third scale in a time domain.
In one embodiment, the processor when executing the computer program may further perform the steps of: carrying out normalization processing on the audio signal in the loudspeaker; and determining an original value and a target value of the sampling rate of the audio signal, and reducing the sampling rate of the audio signal after normalization processing from the original value to the target value.
In one embodiment, the processor when executing the computer program may further perform the steps of: calculating each sweep frequency signal containing noise by adopting a nearest neighbor algorithm to obtain a plurality of nearest neighbors; random linear interpolation is carried out on any two of the neighbors to generate simulation frequency sweeping signals containing noise; and repeating the steps until the sum of the number of the sweep frequency signals containing the noise and the number of the simulated sweep frequency signals containing the noise is equal to the number of the sweep frequency signals not containing the noise.
According to the electronic device of the embodiment of the present application, when the processor executes the computer program, the following steps are implemented: the audio signal in the loudspeaker is convolved on a plurality of scales to generate a plurality of features corresponding to the scales, a fused feature is generated from the features, the probability of the fused feature is determined according to a pre-trained classification model, and whether the audio signal contains noise is determined according to the probability.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: collecting audio signals in a loudspeaker; convolving the audio signal on a plurality of scales respectively to generate a plurality of characteristics corresponding to the scales; generating fusion characteristics according to the characteristics, and determining the probability of the fusion characteristics according to a pre-trained classification model; if the probability is larger than or equal to the threshold value, determining that the audio signal contains noise; if the probability is less than the threshold, it is determined that the audio signal does not contain a noise.
In one embodiment, the computer program when executed by the processor may further implement the steps of: acquiring a sweep frequency signal containing noise and a sweep frequency signal not containing noise; performing convolution on the sweep frequency signals on a plurality of scales respectively to generate a plurality of sample characteristics corresponding to the scales; fusing a plurality of sample features to generate sample fusion features, and determining the prediction probability of the sample fusion features; and training a classification model according to the prediction probability and the labeled value of the sweep frequency signal.
In one embodiment, the computer program when executed by the processor may further implement the steps of: the audio signal is convolved on a first scale, a second scale and a third scale respectively to extract a first feature, a second feature and a third feature of the audio signal corresponding to the first scale, the second scale and the third scale in a time domain.
In one embodiment, the computer program when executed by the processor may further implement the steps of: carrying out normalization processing on the audio signal in the loudspeaker; and determining an original value and a target value of the sampling rate of the audio signal, and reducing the sampling rate of the audio signal after normalization processing from the original value to the target value.
In one embodiment, the computer program when executed by the processor may further implement the steps of: calculating each sweep frequency signal containing noise by adopting a nearest neighbor algorithm to obtain a plurality of nearest neighbors; random linear interpolation is carried out on any two of the neighbors to generate simulation frequency sweeping signals containing noise; and repeating the steps until the sum of the number of the sweep frequency signals containing the noise and the number of the simulated sweep frequency signals containing the noise is equal to the number of the sweep frequency signals not containing the noise.
According to the computer-readable storage medium of the embodiment of the present application, when the computer program stored on it is executed by a processor, the following steps are implemented: the audio signal in the loudspeaker is convolved on a plurality of scales to generate a plurality of features corresponding to the scales, a fused feature is generated from the features, the probability of the fused feature is determined according to a pre-trained classification model, and whether the audio signal contains noise is determined according to the probability.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), and the like.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for detecting noise of a speaker, comprising:
collecting audio signals in a loudspeaker;
convolving the audio signal on a plurality of scales respectively to generate a plurality of characteristics corresponding to the scales;
generating fusion characteristics according to the characteristics, and determining the probability of the fusion characteristics according to a pre-trained classification model;
if the probability is larger than or equal to a threshold value, determining that the audio signal contains noise;
if the probability is less than the threshold, determining that the audio signal does not contain noise.
2. The method of claim 1, further comprising:
acquiring a sweep frequency signal containing noise and a sweep frequency signal not containing noise;
performing convolution on the sweep frequency signals on a plurality of scales respectively to generate a plurality of sample characteristics corresponding to the scales;
fusing the plurality of sample features to generate a sample fusion feature, and determining a prediction probability of the sample fusion feature;
and training a classification model according to the prediction probability and the labeled value of the sweep frequency signal.
3. The method of claim 1 or 2, wherein the plurality of scales includes a first scale, a second scale, and a third scale, and wherein the first scale is smaller than the second scale, and wherein the second scale is smaller than the third scale, and wherein convolving the audio signal over the plurality of scales, respectively, to generate the plurality of features corresponding to the plurality of scales comprises:
convolving the audio signal over the first scale, the second scale, and the third scale, respectively, to extract first, second, and third features of the audio signal corresponding in time domain to the first, second, and third scales.
4. The method of claim 1 or 2, further comprising:
normalizing the audio signal in the loudspeaker;
and determining an original value and a target value of the sampling rate of the audio signal, and reducing the sampling rate of the audio signal after normalization processing from the original value to the target value.
5. The method of claim 1 or 2, wherein the probability of fusing features is calculated by:
$p_k = \frac{e^{z_k}}{e^{z_0} + e^{z_1}}$

wherein $z_k$ represents the k-th value of the fully connected layer, $z_1$ corresponds to the class of audio samples containing noise, and $z_0$ corresponds to the class of audio samples without noise.
6. The method of claim 3, wherein the convolutional layer is calculated as follows:
$C_i = \delta(X * w_i + b_i)$

wherein $C_i$ denotes the output of the i-th convolutional layer, $\delta$ is the activation function, $X$ denotes the audio signal, $*$ denotes the convolution operation, $w_i$ denotes the convolutional layer weights, and $b_i$ denotes the convolutional layer bias.
7. The method of claim 2, further comprising:
calculating each sweep frequency signal containing noise by adopting a nearest neighbor algorithm to obtain a plurality of nearest neighbors;
performing random linear interpolation on any two of the neighbors to generate a simulated swept-frequency signal containing noise;
and repeating the steps until the sum of the number of the sweep frequency signals containing the noise and the number of the simulated sweep frequency signals containing the noise is equal to the number of the sweep frequency signals not containing the noise.
8. A noise detection device for a speaker, comprising:
the acquisition module is used for acquiring audio signals in the loudspeaker;
the extraction module is used for convolving the audio signals on a plurality of scales respectively to generate a plurality of characteristics corresponding to the scales;
the generating module is used for generating fusion characteristics according to the characteristics and determining the probability of the fusion characteristics according to a pre-trained classification model;
a first determining module, configured to determine that the audio signal includes a noise if the probability is greater than or equal to a threshold;
a second determining module, configured to determine that the audio signal does not include a noise if the probability is smaller than the threshold.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110833182.0A 2021-07-22 2021-07-22 Method and device for detecting noise of loudspeaker, electronic equipment and storage medium Pending CN113766405A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110833182.0A CN113766405A (en) 2021-07-22 2021-07-22 Method and device for detecting noise of loudspeaker, electronic equipment and storage medium
PCT/CN2021/115791 WO2023000444A1 (en) 2021-07-22 2021-08-31 Method and apparatus for detecting noise of loudspeaker, and electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110833182.0A CN113766405A (en) 2021-07-22 2021-07-22 Method and device for detecting noise of loudspeaker, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113766405A true CN113766405A (en) 2021-12-07

Family

ID=78787853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110833182.0A Pending CN113766405A (en) 2021-07-22 2021-07-22 Method and device for detecting noise of loudspeaker, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113766405A (en)
WO (1) WO2023000444A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627891A (en) * 2022-05-16 2022-06-14 山东捷瑞信息技术产业研究院有限公司 Moving coil loudspeaker quality detection method and device
CN115334438A (en) * 2022-08-01 2022-11-11 厦门东声电子有限公司 Electroacoustic device sounding quality detection method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346102A (en) * 2018-09-18 2019-02-15 腾讯音乐娱乐科技(深圳)有限公司 Detection method, device and the storage medium of audio beginning sonic boom
CN110222218A (en) * 2019-04-18 2019-09-10 杭州电子科技大学 Image search method based on multiple dimensioned NetVLAD and depth Hash
WO2020248376A1 (en) * 2019-06-14 2020-12-17 平安科技(深圳)有限公司 Emotion detection method and apparatus, electronic device, and storage medium
CN112232258A (en) * 2020-10-27 2021-01-15 腾讯科技(深圳)有限公司 Information processing method and device and computer readable storage medium
CN112966778A (en) * 2021-03-29 2021-06-15 上海冰鉴信息科技有限公司 Data processing method and device for unbalanced sample data

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711281B (en) * 2018-12-10 2023-05-02 复旦大学 Pedestrian re-recognition and feature recognition fusion method based on deep learning
KR102650138B1 (en) * 2018-12-14 2024-03-22 삼성전자주식회사 Display apparatus, method for controlling thereof and recording media thereof
CN112199548B (en) * 2020-09-28 2024-07-19 华南理工大学 Music audio classification method based on convolutional cyclic neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346102A (en) * 2018-09-18 2019-02-15 腾讯音乐娱乐科技(深圳)有限公司 Detection method, device and the storage medium of audio beginning sonic boom
CN110222218A (en) * 2019-04-18 2019-09-10 杭州电子科技大学 Image search method based on multiple dimensioned NetVLAD and depth Hash
WO2020248376A1 (en) * 2019-06-14 2020-12-17 平安科技(深圳)有限公司 Emotion detection method and apparatus, electronic device, and storage medium
CN112232258A (en) * 2020-10-27 2021-01-15 腾讯科技(深圳)有限公司 Information processing method and device and computer readable storage medium
CN112966778A (en) * 2021-03-29 2021-06-15 上海冰鉴信息科技有限公司 Data processing method and device for unbalanced sample data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627891A (en) * 2022-05-16 2022-06-14 山东捷瑞信息技术产业研究院有限公司 Moving coil loudspeaker quality detection method and device
CN115334438A (en) * 2022-08-01 2022-11-11 厦门东声电子有限公司 Electroacoustic device sounding quality detection method and system

Also Published As

Publication number Publication date
WO2023000444A1 (en) 2023-01-26

Similar Documents

Publication Publication Date Title
CN112435684B (en) Voice separation method and device, computer equipment and storage medium
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN111312273A (en) Reverberation elimination method, apparatus, computer device and storage medium
CN113766405A (en) Method and device for detecting noise of loudspeaker, electronic equipment and storage medium
CN110321863A (en) Age recognition methods and device, storage medium
CN113470688B (en) Voice data separation method, device, equipment and storage medium
CN111868823B (en) Sound source separation method, device and equipment
CN110378727A (en) Product potential user determines method, apparatus, computer equipment and storage medium
CN111800720B (en) Digital hearing aid parameter adjusting method and device based on big data and cloud space
CN113823301A (en) Training method and device of voice enhancement model and voice enhancement method and device
CN113205820A (en) Method for generating voice coder for voice event detection
CN115223584A (en) Audio data processing method, device, equipment and storage medium
US20240321289A1 (en) Method and apparatus for extracting feature representation, device, medium, and program product
CN117542373A (en) Non-air conduction voice recovery system and method
CN114168787A (en) Music recommendation method and device, computer equipment and storage medium
CN114267363B (en) Voice countercheck sample generation method and device, electronic equipment and storage medium
Llombart et al. Speech enhancement with wide residual networks in reverberant environments
CN118380008B (en) Intelligent real-time identification and positioning monitoring system and method for environmental noise pollution
CN110895929B (en) Voice recognition method and device
CN114512141B (en) Method, apparatus, device, storage medium and program product for audio separation
CN114912539B (en) Environmental sound classification method and system based on reinforcement learning
CN113140222B (en) Voiceprint vector extraction method, voiceprint vector extraction device, voiceprint vector extraction equipment and storage medium
EP4350695A1 (en) Apparatus, methods and computer programs for audio signal enhancement using a dataset
KR20180119446A (en) Acoustic signal enhancement method
CN114220448A (en) Voice signal generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20211207