WO2023000444A1

WO2023000444A1 - Method and apparatus for detecting noise of loudspeaker, and electronic device and storage medium

Info

Publication number: WO2023000444A1
Application number: PCT/CN2021/115791
Authority: WO
Inventors: 宋广伟
Original assignee: 上海闻泰信息技术有限公司
Priority date: 2021-07-22
Filing date: 2021-08-31
Publication date: 2023-01-26
Also published as: CN113766405A

Abstract

A method and apparatus for detecting noise of a loudspeaker, and an electronic device and a storage medium, which relate to the technical field of artificial intelligence. The method comprises: collecting an audio signal in a loudspeaker (102); respectively convolving the audio signal in a plurality of scales, so as to generate a plurality of features corresponding to the plurality of scales (104); generating a fused feature according to the plurality of features, and determining a probability of the fused feature according to a pre-trained classification model (106); and if the probability is greater than or equal to a threshold value, determining that the audio signal includes noise, and if the probability is less than the threshold value, determining that the audio signal does not include noise (108). By using the method, the noise detection accuracy and processing efficiency can be increased.

Description

Loudspeaker noise detection method, device, electronic device and storage medium

This disclosure claims the priority of the Chinese patent application with the application number 202110833182.0 and the title of the invention "Noise detection method, device, electronic equipment and storage medium for loudspeakers" submitted to the China Patent Office on July 22, 2021, the entire content of which is passed References are incorporated in this disclosure.

technical field

The disclosure relates to a loudspeaker noise detection method, device, electronic equipment and storage medium.

Background technique

As a key audio output device in electronic equipment, micro speakers are widely used with the application of smart hardware such as smart speakers, tablet computers, and mobile phones. In the production process of micro-speakers, noise detection technology has become a key factor in determining production quality, and the accuracy and efficiency requirements of noise detection methods are becoming more and more stringent.

In the related technology, the noise detection is carried out through the hardware detection system, and the audio signal generator is used to excite the micro-speaker, and the sound pressure signal is obtained through the artificial ear. The Rub value on the point, feature extraction, and detection and recognition through empirical threshold adjudication. In this solution, the tester artificially selects the judgment threshold for the presence of noise at each frequency point based on testing multiple signals. For noise detection scenarios that require high precision, the judgment threshold is difficult to set, and the detection accuracy needs to be improved.

Contents of the invention

(1) Technical problems to be solved

When using the hardware detection system for noise detection, it is difficult to meet the noise detection scene with high precision requirements by artificially selecting the noise presence judgment threshold at each frequency point, and the detection accuracy is low.

(2) Technical solution

According to various embodiments of the present disclosure, a loudspeaker noise detection method, device, electronic device, and storage medium are provided.

A noise detection method for a loudspeaker, the method comprising:

Acquire the audio signal in the speaker;

Convolving the audio signal on a plurality of scales, respectively, to generate a plurality of features corresponding to the plurality of scales;

Generate fusion features according to the plurality of features, and determine the probability of the fusion features according to the pre-trained classification model;

If the probability is greater than or equal to a threshold, then determining that the audio signal contains noise;

If the probability is less than the threshold, it is determined that the audio signal does not contain noise.

In one embodiment, the method also includes:

Obtain a frequency sweep signal containing noise and a frequency sweep signal not containing noise;

Convolve the frequency sweep signal on multiple scales to generate multiple sample features corresponding to multiple scales;

Fusing the multiple sample features to generate a sample fusion feature, and determining the predicted probability of the sample fusion feature;

A classification model is trained according to the predicted probability and the labeled value of the frequency sweep signal.

In one embodiment, the plurality of scales includes a first scale, a second scale and a third scale, and the first scale is smaller than the second scale, and the second scale is smaller than the third scale, so Convolving the audio signal on multiple scales, generating multiple features corresponding to multiple scales includes:

Convolving the audio signal on the first scale, the second scale, and the third scale, respectively, to extract the time domain correlation between the audio signal and the first scale, the second scale The first feature, the second feature and the third feature corresponding to the scale and the third scale.

In one embodiment, the method also includes:

performing normalization processing on the audio signal in the loudspeaker;

An original value and a target value of the sampling rate of the audio signal are determined, and the sampling rate of the normalized audio signal is reduced from the original value to the target value.

In one embodiment, the probability of the fusion feature is calculated as follows:

Among them, z _k represents the kth value of the fully connected layer,

Represents a vector of audio samples containing noise,

Vector representing audio samples without noise.

In one embodiment, the calculation formula of the convolutional layer is as follows:

Among them, i represents the i-th convolutional layer, δ is the activation function, X represents the audio signal, w represents the weight of the convolutional layer, and b represents the bias of the convolutional layer.

In one embodiment, the method also includes:

Use the nearest neighbor algorithm to calculate multiple neighbors for each frequency sweep signal containing noise;

performing random linear interpolation on any two of the plurality of neighbors to generate a simulated frequency sweep signal containing noise;

The above steps are repeated until the sum of the number of noise-containing frequency sweep signals and the number of noise-containing simulated frequency sweep signals is equal to the number of noise-free frequency sweep signals.

A noise detection device for a loudspeaker, said device comprising:

A collection module configured to collect audio signals in the loudspeaker;

An extraction module configured to convolve the audio signal on multiple scales to generate multiple features corresponding to multiple scales;

A generating module configured to generate a fusion feature according to the plurality of features, and determine the probability of the fusion feature according to a pre-trained classification model;

A first determination module configured to determine that the audio signal contains noise if the probability is greater than or equal to a threshold;

The second determination module is configured to determine that the audio signal does not contain noise if the probability is less than the threshold.

Optionally, the device further includes: a training module configured to acquire a frequency sweep signal containing noise and a frequency sweep signal not containing noise; respectively perform convolution on the frequency sweep signal on multiple scales to generate Corresponding to multiple sample features; multiple sample features are fused to generate sample fusion features, and the prediction probability of the sample fusion features is determined; the classification model is trained according to the prediction probability and the labeled value of the frequency sweep signal.

Optionally, the multiple scales include a first scale, a second scale, and a third scale, and the first scale is smaller than the second scale, and the second scale is smaller than the third scale; Convolution is performed on the first scale, the second scale and the third scale to extract the first feature, the second feature and the third feature of the audio signal corresponding to the first scale, the second scale and the third scale in the time domain.

Optionally, the device further includes: a preprocessing module configured to normalize the audio signal in the speaker; determine the original value and the target value of the sampling rate of the audio signal, and normalize the processed audio signal The sampling rate of is reduced from the original value to the target value.

Optionally, the probability of fusion features is calculated as follows:

Among them, z _k represents the kth value of the fully connected layer,

Represents a vector of audio samples containing noise,

Vector representing audio samples without noise.

Optionally, the calculation formula of the convolutional layer is as follows:

Optionally, the device further includes: an acquisition module configured to use the nearest neighbor algorithm to calculate multiple neighbors for each frequency sweep signal containing noise; perform random linear interpolation on any two of the multiple neighbors to obtain Generate simulated frequency sweep signals containing noise; repeat the above steps until the sum of the number of frequency sweep signals containing noise and the number of simulated frequency sweep signals containing noise is equal to the number of frequency sweep signals not containing noise.

An electronic device, including a memory and one or more processors, the memory stores computer-readable instructions, and the one or more processors execute the computer-readable instructions to implement the method provided by any embodiment of the present disclosure. Steps of a noise detection method for a speaker.

One or more non-transitory computer-readable storage media having stored thereon computer-readable instructions that, when executed by one or more processors, implement any implementation of the present disclosure The steps of the loudspeaker noise detection method provided by the example.

Description of drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description serve to explain the principles of the disclosure.

In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure or the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, for those of ordinary skill in the art, In other words, other drawings can also be obtained from these drawings without paying creative labor.

FIG. 1 is a schematic flowchart of a noise detection method for a speaker provided by one or more embodiments of the present disclosure;

Fig. 2 is a schematic flow chart of model training provided by one or more embodiments of the present disclosure;

FIG. 3 is a flow chart of noise detection provided by one or more embodiments of the present disclosure;

FIG. 4 is a schematic structural diagram of a loudspeaker noise detection device provided by one or more embodiments of the present disclosure;

Fig. 5 is a schematic structural diagram of an electronic device provided by one or more embodiments of the present disclosure.

detailed description

In order to make the purpose, technical solutions and advantages of the present disclosure clearer, the present disclosure will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present disclosure, and are not configured to limit the present disclosure.

In the embodiments of the present disclosure, words such as "exemplary" or "for example" are used as examples, illustrations or illustrations. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present disclosure shall not be construed as being preferred or advantageous over other embodiments or designs. To be precise, the use of words such as "exemplary" or "for example" is intended to present related concepts in a specific manner. In addition, in the description of the embodiments of the present disclosure, unless otherwise specified, the meaning of "plurality" refers to two one or more.

In one embodiment, as shown in FIG. 1 , a noise detection method for a speaker is provided. This embodiment uses the method applied to a terminal as an example for illustration. It can be understood that this method can also be applied to a server, or It is applied to a system including a terminal and a server, and is realized through the interaction between the terminal and the server. In this embodiment, the method includes the following steps:

Step 102, collect the audio signal in the speaker.

The method of the embodiment of the present disclosure can be used to detect whether the audio signal in the speaker contains noise. Specifically, it can be used for noise detection of micro-speakers. For example, micro speakers are audio output devices of smart hardware such as smart speakers, tablet computers, and mobile phones. Micro speakers are composed of basin frames, magnetic steel, pole pieces, sound membranes, voice coils, front covers, wiring boards, and damping cloth.

Wherein, the audio signal can be a 20-20KHz standard frequency sweep signal, or voice, music, etc.

In one embodiment of the present application, after the audio signal in the speaker is collected, the audio signal can be preprocessed, specifically, the audio signal in the speaker can be normalized, and the audio signal can be normalized through the normalization process. Mapped to (0, 1), so that the test sets are all in the same dimension, reducing the amount of calculation and avoiding abnormal test results caused by inconsistent dimensions. Furthermore, an original value and a target value of the sampling rate of the audio signal are determined, and the sampling rate of the normalized audio signal is reduced from the original value to the target value. Among them, the original value is greater than the target value. For example, the original value of the sampling rate of the audio signal is 48KHz, and the audio signal is reduced from the original sampling rate to 5KHz by down-sampling, because the noise of the micro-speaker is usually low-frequency noise , and the normal hearing frequency range of the human ear is 20Hz to 2000Hz. By downsampling, the sampling rate of the audio signal is consistent. While the sampling rate is within the hearing range of the human ear, the amount of data is reduced and the detection efficiency is improved.

In step 104, the audio signal is convoluted on multiple scales to generate multiple features corresponding to the multiple scales.

In the embodiment of the present disclosure, multiple scales of different sizes can be set, and the audio signal is convoluted on multiple scales respectively to generate multiple features, so as to extract feature information of different scales of the audio signal in the time domain.

Among them, the scale refers to the convolution kernel scale.

As an example, the multiple scales include a first scale, a second scale and a third scale, and the first scale is smaller than the second scale, and the second scale is smaller than the third scale. In this example, the audio signal is convolved on multiple scales, and generating multiple features corresponding to multiple scales includes: respectively convolving the audio signal on the first scale, the second scale, and the third scale, The first feature, the second feature and the third feature respectively corresponding to the first scale, the second scale and the third scale of the audio signal in the time domain are extracted.

Optionally, the above convolution operation is implemented through a convolution layer, and the calculation formula of the convolution layer is as follows:

Step 106, generate fusion features according to multiple features, and determine the probability of fusion features according to the pre-trained classification model.

In this embodiment, multiple features respectively corresponding to multiple scales are fused to generate fused features.

As an example, taking multiple scales including the first scale, the second scale and the third scale as an example, the first feature, the second feature and the third feature are fused into a one-dimensional feature, and the one-dimensional feature as a fusion feature.

Among them, the classification model is obtained by training the audio signal containing noise and the audio signal not containing noise as training samples. The input of the pre-trained classification model is the fusion feature, and the output is the probability of the fusion feature. The probability of the fusion feature is used for Indicates the probability that the audio signal contains noise.

In one embodiment of the present disclosure, the classification model includes a fully connected layer and a noise detection function, and the noise detection function may be a softmax function. Optionally, the probability of fusion features is calculated as follows:

Among them, z _k represents the kth value of the fully connected layer,

Represents a vector of audio samples containing noise,

Vector representing audio samples without noise.

Step 108, if the probability is greater than or equal to the threshold, determine that the audio signal contains noise; if the probability is smaller than the threshold, determine that the audio signal does not contain noise.

In this embodiment, it may be determined whether the audio signal contains noise according to the probability of the fused features. As an example, the threshold is 0.5, and if the probability output by the classification model is greater than or equal to 0.5, it is determined that the audio signal contains noise, otherwise it is determined that the audio signal does not contain noise.

It should be noted that the above implementation of determining that the audio signal contains noise when the probability is greater than or equal to the threshold is just an example, and the specific judgment logic can be determined according to the training end, which is not limited here.

According to the loudspeaker noise detection method of the embodiment of the present disclosure, the audio signal in the loudspeaker is respectively convolved on multiple scales to generate multiple features corresponding to the multiple scales, and then generate fusion features based on the multiple features, And according to the pre-trained classification model to determine the probability of the fusion feature, according to the probability to determine whether the audio signal contains noise, thus, it is possible to mine the fusion feature information of the audio signal in different scales in the time domain, and judge whether it contains noise by calculating the probability. Improve the accuracy of feature information detection and reduce the complexity of calculation at the test end. The hardware detection scheme in the related art has high requirements for the equipment accuracy of the audio signal generator, and the judgment threshold is difficult to set and the detection takes a long time. This disclosure Compared with the solution in the related art, the situation of judgment error caused by the improper selection of the threshold value is reduced, and the noise detection processing efficiency and detection accuracy are improved.

Based on the above embodiments, the training terminal will be described below.

Fig. 2 is a schematic flow chart of a model training provided by an embodiment of the present disclosure, as shown in Fig. 2 , including the following steps:

Step 202, acquiring a frequency sweep signal containing noise and a frequency sweep signal not containing noise.

In this embodiment, the frequency sweep signal containing noise and the frequency sweep signal without noise played by the micro-speaker can be collected, so that the collected frequency sweep signal can be used as a training set and input to the constructed convolution with different scales and step sizes. Parallel training of kernels in convolutional neural networks.

In an embodiment of the present disclosure, after obtaining the frequency sweep signal containing noise and the frequency sweep signal not containing noise, preprocessing may be performed on the frequency sweep signal containing noise and the frequency sweep signal not containing noise, specifically, Normalization processing can be performed on the frequency sweep signal, and the frequency sweep signal is mapped to (0, 1) through the normalization processing. Furthermore, an original value and a target value of the sampling rate of the frequency sweeping signal are determined, and the sampling rate of the normalized frequency scanning signal is reduced from the original value to the target value, wherein the original value is greater than the target value.

In step 204, the frequency sweep signal is convoluted on multiple scales to generate multiple sample features corresponding to the multiple scales.

In the embodiment of the present disclosure, multiple scales of different sizes can be set, and the frequency sweep signal (including the frequency sweep signal containing noise and the frequency sweep signal without noise) is convoluted on multiple scales respectively to generate multiple features , to extract feature information of different scales of the audio signal in the time domain.

Among them, the scale refers to the convolution kernel scale.

As an example, the multiple scales include a first scale K ₁ , a second scale K ₂ and a third scale K ₃ , and the first scale K ₁ is smaller than the second scale K ₂ , and the second scale K ₂ is smaller than the third scale K ₃ . Convolve the frequency sweep signal three times on the first scale K ₁ , the second scale K ₂ and the third scale K ₃ respectively, to extract the time-domain correlation between the frequency sweep signal and the first scale K ₁ , the second scale K ₂ The first sample feature, the second sample feature and the third sample feature respectively correspond to the third scale K ₃ .

Among them, multiple step sizes can also be set for the convolution kernel. For example, the multiple step sizes include the first step size S ₁ , the second step size S ₂ and the third step size S ₃ , and the first step size S ₁ is smaller than the second step size S ₂ , and the second step size S ₂ is smaller than the third step size S ₃ . In this way, the characteristic information of the frequency sweep signal in the time domain from small to large in length can be extracted.

In step 206, a plurality of sample features are fused to generate a sample fusion feature, and a prediction probability of the sample fusion feature is determined.

In this embodiment, multiple sample features are fused and merged into one-dimensional features to generate sample fusion features.

As an example, taking multiple scales including the first scale, the second scale and the third scale as an example, the first sample feature, the second sample feature and the third sample feature are fused and merged into a one-dimensional feature, and the This one-dimensional feature is used as a sample fusion feature.

In one embodiment of the present disclosure, the classification model includes a fully connected layer and a noise detection function, and the noise detection function may be a softmax function. Optionally, the predicted probability of the sample fusion feature is calculated as follows:

Among them, z _k represents the kth value of the fully connected layer,

Represents a vector of audio samples containing noise,

Vector representing audio samples without noise.

Step 208, training a classification model according to the predicted probability and the labeled value of the frequency sweep signal.

In this embodiment, each frequency sweep signal corresponds to a label value, for example, a frequency sweep signal containing noise corresponds to a label value of 1, and a frequency sweep signal that does not contain noise corresponds to a label value of 0. Calculate the loss value according to the preset loss function, prediction probability and label value, and update the processing parameters of the model through back propagation until the model converges and the accuracy rate is greater than the preset value, so that the classification model can accurately predict the audio signal Whether to contain noise.

Optionally, a multi-scale end-to-end convolutional neural network can be constructed, wherein the convolutional neural network includes a convolutional layer, a fully connected layer, and a noise detection function, and the convolutional neural network is trained through a training set. When the model convergence and accuracy are greater than the preset value, save the trained convolutional neural network model. This convolutional neural network model is used to determine whether the input audio signal contains noise or not.

It should be noted that the input of the model can also be other acoustic features of the audio signal played by the micro-speaker, such as frequency spectrum, logarithmic mel spectrum, etc.

In the embodiment of the present disclosure, the fusion feature information on different scales in the time domain is extracted through the frequency sweep signal containing noise and the frequency sweep signal without noise, and the predicted probability of the sample fusion feature and the labeled value of the frequency sweep signal are used for training. A model that enables the model to accurately determine whether an audio signal contains noise. Further, the pre-trained model is applied to the audio signal noise detection in the loudspeaker, which improves the noise detection processing efficiency and detection accuracy.

In one embodiment of the present disclosure, since the frequency sweep signals containing noise and the frequency sweep signals without noise played by the collected micro-speakers are usually severely unbalanced in number, the training set can be constructed in the following manner: using the nearest neighbor algorithm , calculate multiple neighbors for each frequency sweep signal containing noise, perform random linear interpolation on any two of the multiple neighbors to generate a simulated frequency sweep signal containing noise; repeat the above steps until the frequency sweep containing noise The sum of the number of signals and the number of simulated frequency sweep signals containing noise is equal to the number of frequency sweep signals not containing noise.

As an example, the number of frequency sweep signals containing noise and frequency sweep signals without noise (that is, non-conforming products and good products in the production line test) played by the collected micro-speakers is 90 and 3600 respectively. Seriously unbalanced, so this disclosure processes 90 non-defective products to generate simulated non-defective products equal to the number of good products. Randomly select two non-defective samples from the nearest neighbors for random linear interpolation; construct a new simulated non-defective sample, and synthesize the new sample with the original data to generate a new training set.

Among them, the number of samples in the new data set is 7200 (including 3600 good products and 3600 non-good products). The data set is divided according to 4:1 to obtain the test set and verification set. Mark them as "1" and "0" respectively, and use the training set to train the above-mentioned multi-scale end-to-end convolutional neural network, update the model parameters through repeated iterations, so that after convergence, output the trained model, and use The validation set is evaluated and the detection results are output. As an example, FIG. 3 is a flowchart of a noise detection scenario provided by an embodiment of the present disclosure.

It should be understood that although the various steps in the flow charts of FIGS. 1-3 are displayed sequentially as indicated by the arrows, these steps are not necessarily executed sequentially in the order indicated by the arrows. Unless otherwise specified herein, there is no strict order restriction on the execution of these steps, and these steps can be executed in other orders. Moreover, at least some of the steps in Figures 1-3 may include a plurality of sub-steps or stages, these sub-steps or stages are not necessarily executed at the same time, but may be executed at different times, these sub-steps or stages The order of execution is not necessarily performed sequentially, but may be performed alternately or alternately with at least a part of other steps or sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 4 , a loudspeaker noise detection device is provided, including: an acquisition module 41 , an extraction module 42 , a generation module 43 , a first determination module 44 , and a second determination module 45 .

Wherein, the collecting module 41 is configured to collect the audio signal in the speaker.

The extraction module 42 is configured to convolve the audio signal on multiple scales respectively to generate multiple features corresponding to the multiple scales.

The generating module 43 is configured to generate fusion features according to multiple features, and determine the probability of fusion features according to the pre-trained classification model.

The first determination module 44 is configured to determine that the audio signal contains noise if the probability is greater than or equal to a threshold.

The second determination module 45 is configured to determine that the audio signal does not contain noise if the probability is less than the threshold.

In one embodiment, the device further includes: a training module configured to acquire a frequency sweep signal containing noise and a frequency sweep signal not containing noise; respectively perform convolution on the frequency sweep signal on multiple scales to generate Multiple sample features corresponding to the scale; multiple sample features are fused to generate sample fusion features, and the prediction probability of the sample fusion features is determined; the classification model is trained according to the prediction probability and the labeled value of the frequency sweep signal.

In one embodiment, the multiple scales include a first scale, a second scale, and a third scale, and the first scale is smaller than the second scale, and the second scale is smaller than the third scale, and the extraction module 42 is specifically configured to: separate the audio signals Perform convolution on the first scale, the second scale and the third scale to extract the first feature, the second feature and the third feature corresponding to the first scale, the second scale and the third scale of the audio signal in the time domain .

In one embodiment, the device further includes: a preprocessing module configured to perform normalization processing on the audio signal in the speaker; determine the original value and target value of the sampling rate of the audio signal, and normalize the processed audio The sample rate of the signal is reduced from the original value to the target value.

In one embodiment, the probability of fused features is calculated as follows:

Among them, z _k represents the kth value of the fully connected layer,

Represents a vector of audio samples containing noise,

Vector representing audio samples without noise.

In one embodiment, the device further includes: an acquisition module configured to use a nearest neighbor algorithm to calculate multiple neighbors for each frequency sweep signal containing noise; perform random linear interpolation on any two of the multiple neighbors, to generate a simulated frequency sweep signal containing noise; repeat the above steps until the sum of the number of frequency sweep signals containing noise and the number of simulated frequency sweep signals containing noise is equal to the number of frequency sweep signals not containing noise.

For the specific limitations of the loudspeaker noise detection device, please refer to the above definition of the loudspeaker noise detection method, which has the corresponding functional modules and beneficial effects of the implementation method, and will not be repeated here. Each module in the noise detection device for the above-mentioned loudspeaker can be fully or partially realized by software, hardware and a combination thereof. The above-mentioned modules can be embedded in hardware or independent of one or more processors in the electronic device, and can also be stored in the memory of the electronic device in the form of software, so that one or more processors can call and execute the above The operation corresponding to the module.

In one embodiment, an electronic device is provided. The electronic device may be a terminal, and its internal structure may be as shown in FIG. 5 . The electronic device includes one or more processors, memory, communication interface, display screen, and input device connected by a system bus. Wherein, one or more processors of the electronic device are used to provide calculation and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer readable instructions. The internal memory provides an environment for the execution of the operating system and computer readable instructions in the non-volatile storage medium. The communication interface of the electronic device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be realized through WIFI, an operator network, near field communication (NFC) or other technologies. When the computer-readable instructions are executed by one or more processors, a loudspeaker noise detection method is realized. The display screen of the electronic device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic device may be a touch layer covered on the display screen, or a button, a trackball or a touch pad provided on the housing of the electronic device , and can also be an external keyboard, touchpad, or mouse.

Those skilled in the art can understand that the structure shown in FIG. 5 is only a block diagram of a partial structure related to the disclosed solution, and does not constitute a limitation on the electronic device to which the disclosed solution is applied. The specific electronic device can be More or fewer components than shown in the figures may be included, or some components may be combined, or have a different arrangement of components.

In one embodiment, the device for detecting noise of a speaker provided by the present disclosure may be implemented in the form of a computer readable instruction, and the computer readable instruction may be run on the electronic device as shown in FIG. 5 . The memory of the electronic device can store each program module forming the noise detection device of the loudspeaker, such as the collection module 41 shown in FIG. 4, the extraction module 42, the generation module 43, the first determination module 44, and the second determination module 45. The computer-readable instructions constituted by various program modules enable one or more processors to execute the steps in the method for detecting noise of a loudspeaker according to various embodiments of the present disclosure described in this specification.

For example, the electronic device shown in FIG. 5 may collect audio signals in the speaker through the acquisition module 41 in the noise detection device for the speaker as shown in FIG. 4 . The electronic device may use the extraction module 42 to perform convolution on the audio signal on multiple scales respectively to generate multiple features corresponding to the multiple scales. The electronic device can generate fusion features according to multiple features through the generation module 43, and determine the probability of fusion features according to the pre-trained classification model. The electronic device may determine that the audio signal contains noise if the probability is greater than or equal to a threshold through the first determination module 44 . The electronic device can perform if the probability is less than the threshold through the second determination module 45, then determine that the audio signal does not contain noise

In one embodiment, an electronic device is provided, including a memory and one or more processors, the memory stores computer-readable instructions, and the one or more processors execute the computer-readable instructions to implement the following steps: collecting Audio signal in the speaker; Convolve the audio signal on multiple scales to generate multiple features corresponding to multiple scales; Generate fusion features based on multiple features, and determine the probability of fusion features according to the pre-trained classification model ; If the probability is greater than or equal to the threshold, it is determined that the audio signal contains noise; if the probability is less than the threshold, it is determined that the audio signal does not contain noise.

In one embodiment, when the one or more processors execute the computer-readable instructions, the following steps can also be implemented: acquiring a frequency sweep signal containing noise and a frequency sweep signal not containing noise; Convolve on multiple scales to generate multiple sample features corresponding to multiple scales; multiple sample features are fused to generate sample fusion features, and the prediction probability of sample fusion features is determined; the classification model is trained according to the prediction probability and the labeled value of the frequency sweep signal .

In one embodiment, when the one or more processors execute the computer-readable instructions, the following steps can also be implemented: respectively convolving the audio signal on the first scale, the second scale and the third scale to extract the audio signal The first feature, the second feature and the third feature corresponding to the first scale, the second scale and the third scale in time domain.

In one embodiment, when the one or more processors execute the computer-readable instructions, the following steps can also be implemented: performing normalization processing on the audio signal in the speaker; determining the original value and the target value of the sampling rate of the audio signal, and The sampling rate of the normalized audio signal is reduced from an original value to a target value.

In one embodiment, when the one or more processors execute the computer-readable instructions, the following steps can also be implemented: using the nearest neighbor algorithm to calculate a plurality of neighbors for each frequency sweep signal containing noise; Random linear interpolation of any two of the two to generate a simulated frequency sweep signal containing noise; repeat the above steps until the sum of the number of frequency sweep signals containing noise and the number of simulated frequency sweep signals containing noise is equal to that of the simulated frequency sweep signal not containing noise The number of sweep signals is equal.

According to the electronic device in the embodiment of the present disclosure, the following steps are implemented when the computer-readable instructions are executed by one or more processors, and the audio signals in the speakers are respectively convolved on multiple scales to generate multiple scales corresponding to multiple scales. Then, generate fusion features based on multiple features, determine the probability of fusion features according to the pre-trained classification model, and determine whether the audio signal contains noise according to the probability. Fuse feature information, and judge whether there is noise by calculating probability, improve the accuracy of feature information detection, reduce the complexity of calculation on the test side, and improve the efficiency and accuracy of noise detection.

In one embodiment, one or more non-transitory computer-readable storage media storing computer-readable instructions are provided, and the computer-readable instructions are executed by one or more processors to implement the following steps: collecting Audio signal; Convolve the audio signal on multiple scales to generate multiple features corresponding to multiple scales; generate fusion features based on multiple features, and determine the probability of fusion features according to the pre-trained classification model; if the probability If the probability is greater than or equal to the threshold, it is determined that the audio signal contains noise; if the probability is less than the threshold, it is determined that the audio signal does not contain noise.

In one embodiment, when the computer-readable instructions are executed by one or more processors, the following steps can also be implemented: acquiring a frequency sweep signal containing noise and a frequency sweep signal not containing noise; Convolve on multiple scales to generate multiple sample features corresponding to multiple scales; multiple sample features are fused to generate sample fusion features, and the prediction probability of sample fusion features is determined; the classification model is trained according to the prediction probability and the labeled value of the frequency sweep signal .

In one embodiment, when the computer-readable instructions are executed by one or more processors, the following steps can also be implemented: respectively convolving the audio signal on the first scale, the second scale and the third scale to extract the audio signal The first feature, the second feature and the third feature corresponding to the first scale, the second scale and the third scale in time domain.

In one embodiment, when the computer-readable instructions are executed by one or more processors, the following steps can also be implemented: performing normalization processing on the audio signal in the speaker; determining the original value and the target value of the sampling rate of the audio signal, and The sampling rate of the normalized audio signal is reduced from an original value to a target value.

In one embodiment, when the computer-readable instructions are executed by one or more processors, the following steps can also be implemented: using the nearest neighbor algorithm, calculating multiple neighbors for each frequency sweep signal containing noise; Random linear interpolation of any two of the two to generate a simulated frequency sweep signal containing noise; repeat the above steps until the sum of the number of frequency sweep signals containing noise and the number of simulated frequency sweep signals containing noise is equal to that of the simulated frequency sweep signal not containing noise The number of sweep signals is equal.

According to the computer-readable storage medium of the embodiment of the present disclosure, when the computer-readable instructions stored thereon are executed by one or more processors, the following steps are implemented, respectively convoluting the audio signals in the speakers on multiple scales, Generate multiple features corresponding to multiple scales, and then generate fusion features based on multiple features, and determine the probability of fusion features according to the pre-trained classification model, and determine whether the audio signal contains noise according to the probability, so that the audio signal can be mined Fusion feature information on different scales in the time domain, and judge whether there is noise by calculating the probability, improve the detection accuracy of feature information, reduce the complexity of calculation at the test end, and improve the noise detection processing efficiency and detection accuracy.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing related hardware through computer-readable instructions, and the computer-readable instructions can be stored in a non-volatile computer In the readable storage medium, the computer-readable instructions may include the processes of the embodiments of the above-mentioned methods when executed. Wherein, any reference to storage, database or other media used in the various embodiments provided by the present disclosure may include at least one of non-volatile and volatile storage. Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory or optical memory, etc. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms such as Static Random Access Memory (SRAM) and Dynamic Random Access Memory (DRAM), among others.

The technical features of the above embodiments can be combined arbitrarily. To make the description concise, all possible combinations of the technical features in the above embodiments are not described. However, as long as there is no contradiction in the combination of these technical features, they should be It is considered to be within the range described in this specification.

The above-mentioned embodiments only express several implementation modes of the present disclosure, and the descriptions thereof are relatively specific and detailed, but should not be construed as limiting the scope of the patent for the invention. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present disclosure, and these all belong to the protection scope of the present disclosure. Therefore, the scope of protection of the disclosed patent should be based on the appended claims.

Industrial Applicability

The loudspeaker detection method provided by the present disclosure generates multiple features corresponding to multiple scales by convoluting the audio signal in the speaker on multiple scales, and then determines the multiple features according to the pre-trained classification model The probability of the generated fusion features is used to determine whether the audio signal contains noise according to the probability. Therefore, the fusion feature information of the audio signal on different scales in the time domain can be mined, and the probability is calculated to determine whether it contains noise, and the accuracy of feature information detection can be improved. , the calculation complexity is reduced at the test end, the noise detection processing efficiency and detection accuracy are improved, and it has strong industrial applicability.

Claims

A noise detection method for a loudspeaker, comprising:

Collect the audio signal in the speaker;

Convolving the audio signal on a plurality of scales, respectively, to generate a plurality of features corresponding to the plurality of scales;

Generate fusion features according to the plurality of features, and determine the probability of the fusion features according to the pre-trained classification model;

Under the condition that the probability is greater than or equal to a threshold, determine that the audio signal contains noise;

On the condition that the probability is smaller than the threshold, it is determined that the audio signal does not contain noise.
The method of claim 1, further comprising:

Obtain a frequency sweep signal containing noise and a frequency sweep signal not containing noise;

Convolve the frequency sweep signal on multiple scales to generate multiple sample features corresponding to multiple scales;

Fusing the multiple sample features to generate a sample fusion feature, and determining the predicted probability of the sample fusion feature;

A classification model is trained according to the predicted probability and the labeled value of the frequency sweep signal.
The method according to claim 1 or 2, wherein the plurality of scales includes a first scale, a second scale and a third scale, and the first scale is smaller than the second scale, and the second scale is smaller than In the third scale, the step of convolving the audio signal on multiple scales to generate multiple features corresponding to multiple scales includes:

Convolving the audio signal on the first scale, the second scale, and the third scale, respectively, to extract the time domain correlation between the audio signal and the first scale, the second scale The first feature, the second feature and the third feature corresponding to the scale and the third scale.
The method according to claim 1 or 2, further comprising:

performing normalization processing on the audio signal in the loudspeaker;

Determine the original value and the target value of the sampling rate of the audio signal, and reduce the sampling rate of the normalized audio signal from the original value to the target value.
The method according to claim 1 or 2, wherein the probability of the fusion feature is calculated as follows:

Among them, z k represents the kth value of the fully connected layer,
Represents a vector of audio samples containing noise,
Vector representing audio samples without noise.
The method according to claim 3, wherein the calculation formula of the convolutional layer is as follows:

Among them, i represents the i-th convolutional layer, δ is the activation function, X represents the audio signal, w represents the weight of the convolutional layer, and b represents the bias of the convolutional layer.
The method of claim 2, further comprising:

Use the nearest neighbor algorithm to calculate multiple neighbors for each frequency sweep signal containing noise;

performing random linear interpolation on any two of the plurality of neighbors to generate a simulated frequency sweep signal containing noise;

The above steps are repeated until the sum of the number of noise-containing frequency sweep signals and the number of noise-containing simulated frequency sweep signals is equal to the number of noise-free frequency sweep signals.
A noise detection device for a loudspeaker, comprising:

A collection module configured to collect audio signals in the loudspeaker;

An extraction module configured to convolve the audio signal on multiple scales to generate multiple features corresponding to multiple scales;

A generating module configured to generate a fusion feature according to the plurality of features, and determine the probability of the fusion feature according to a pre-trained classification model;

A first determination module configured to determine that the audio signal contains noise if the probability is greater than or equal to a threshold;

The second determination module is configured to determine that the audio signal does not contain noise if the probability is less than the threshold.
The apparatus of claim 8, further comprising:

The training module is configured to obtain a frequency sweep signal containing noise and a frequency sweep signal not containing noise; the frequency sweep signal is respectively convoluted on multiple scales to generate multiple sample features corresponding to multiple scales; the described A plurality of sample features are fused to generate a sample fusion feature, and a prediction probability of the sample fusion feature is determined; a classification model is trained according to the prediction probability and the labeled value of the frequency sweep signal.
The apparatus according to claim 8 or 9, wherein the plurality of scales comprises a first scale, a second scale and a third scale, and the first scale is smaller than the second scale, and the second scale is smaller than said third dimension;

The extraction module is specifically configured to convolve the audio signal on the first scale, the second scale, and the third scale, respectively, so as to extract the audio signal in the time domain and the The first feature, the second feature, and the third feature corresponding to the first scale, the second scale, and the third scale.
The device according to claim 8 or 9, further comprising:

A preprocessing module configured to normalize the audio signal in the speaker; determine the original value and target value of the audio signal sampling rate, and reduce the normalized sampling rate of the audio signal from the original value to the target value .
The device according to claim 8 or 9, wherein the probability of the fusion feature is calculated as follows:

Among them, z k represents the kth value of the fully connected layer,
Represents a vector of audio samples containing noise,
Vector representing audio samples without noise.
The device of claim 9, further comprising:

The acquisition module is configured to use the nearest neighbor algorithm to calculate multiple neighbors for each frequency sweep signal containing noise; perform random linear interpolation on any two of the multiple neighbors to generate a simulated frequency sweep signal containing noise ; Repeat the above steps until the sum of the number of noise-containing frequency sweep signals and the number of noise-containing simulated frequency sweep signals is equal to the number of noise-free frequency sweep signals.
An electronic device comprising a memory and one or more processors, the memory storing computer readable instructions that, when executed by the one or more processors, cause the one or more The processor executes the steps of the loudspeaker noise detection method according to any one of claims 1 to 7.
One or more non-transitory computer-readable storage media storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the entitlement The steps of the method for detecting noise of a loudspeaker according to any one of requirements 1 to 7.