WO2023000444A1 - Method and apparatus for detecting noise of loudspeaker, and electronic device and storage medium - Google Patents

Method and apparatus for detecting noise of loudspeaker, and electronic device and storage medium

Info

Publication number
WO2023000444A1
WO2023000444A1 · PCT/CN2021/115791 · CN2021115791W
Authority
WO
WIPO (PCT)
Prior art keywords
scale
noise
audio signal
frequency sweep
probability
Application number
PCT/CN2021/115791
Other languages
French (fr)
Chinese (zh)
Inventor
宋广伟
Original Assignee
上海闻泰信息技术有限公司
Application filed by 上海闻泰信息技术有限公司 filed Critical 上海闻泰信息技术有限公司
Publication of WO2023000444A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 29/00 Monitoring arrangements; Testing arrangements
    • H04R 29/001 Monitoring arrangements; Testing arrangements for loudspeakers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • The present disclosure relates to a loudspeaker noise detection method, apparatus, electronic device, and storage medium.
  • Micro speakers, as key audio output components, are widely used in smart hardware such as smart speakers, tablet computers, and mobile phones.
  • In micro-speaker production, noise detection has become a key factor in determining product quality, and the accuracy and efficiency requirements for noise detection methods are becoming increasingly stringent.
  • In the related art, noise detection is performed by a hardware detection system: an audio signal generator excites the micro speaker, a sound pressure signal is acquired through an artificial ear and passed through A/D conversion and a data acquisition card to a computer, which then calculates the Rub value at each frequency point, extracts features, and performs detection and recognition by empirical threshold adjudication.
  • In this solution, the tester manually selects the noise-presence judgment threshold at each frequency point based on tests of multiple signals. For noise detection scenarios requiring high precision, this threshold is difficult to set and the detection accuracy needs improvement.
  • A loudspeaker noise detection method, apparatus, electronic device, and storage medium are provided.
  • A noise detection method for a loudspeaker, comprising: collecting an audio signal in the loudspeaker; convolving the audio signal on multiple scales to generate multiple features corresponding to the multiple scales; generating a fusion feature from the multiple features and determining the probability of the fusion feature with a pre-trained classification model; and determining that the audio signal contains noise if the probability is greater than or equal to a threshold, or that it does not contain noise otherwise.
  • In some embodiments, the method further includes training: a classification model is trained according to the predicted probability and the labeled value of the frequency sweep signal.
  • In some embodiments, the plurality of scales includes a first scale, a second scale, and a third scale, where the first scale is smaller than the second scale and the second scale is smaller than the third scale. Convolving the audio signal on the multiple scales to generate multiple features corresponding to the multiple scales includes: convolving the audio signal on the first scale, the second scale, and the third scale respectively, to extract the first, second, and third features of the audio signal in the time domain corresponding to those scales.
  • In some embodiments, the method further includes: normalizing the audio signal in the loudspeaker; determining an original value and a target value of the sampling rate of the audio signal; and reducing the sampling rate of the normalized audio signal from the original value to the target value.
  • The probability of the fusion feature is calculated as P_k = e^{z_k} / (e^{z_1} + e^{z_2}), where z_k represents the k-th value of the fully connected layer, z_1 represents the value corresponding to the audio-sample vector containing noise, and z_2 represents the value corresponding to the audio-sample vector without noise.
  • The calculation formula of the convolutional layer is C_i = δ(w_i * X + b_i), where i denotes the i-th convolutional layer, δ is the activation function, X represents the audio signal, * denotes convolution, w represents the convolutional layer weight, and b represents the convolutional layer bias.
  • In some embodiments, the method further includes: using a nearest-neighbor algorithm to compute multiple neighbors for each noise-containing frequency sweep signal; performing random linear interpolation on any two of the multiple neighbors to generate simulated noise-containing frequency sweep signals; and repeating these steps until the number of noise-containing frequency sweep signals plus the number of simulated noise-containing frequency sweep signals equals the number of noise-free frequency sweep signals.
  • a noise detection device for a loudspeaker comprising:
  • a collection module configured to collect audio signals in the loudspeaker
  • An extraction module configured to convolve the audio signal on multiple scales to generate multiple features corresponding to multiple scales
  • a generating module configured to generate a fusion feature according to the plurality of features, and determine the probability of the fusion feature according to a pre-trained classification model
  • a first determination module configured to determine that the audio signal contains noise if the probability is greater than or equal to a threshold
  • the second determination module is configured to determine that the audio signal does not contain noise if the probability is less than the threshold.
  • In some embodiments, the device further includes a training module configured to: acquire a frequency sweep signal containing noise and a frequency sweep signal not containing noise; convolve the frequency sweep signal on multiple scales respectively to generate multiple sample features corresponding to the multiple scales; fuse the multiple sample features to generate a sample fusion feature and determine the prediction probability of the sample fusion feature; and train the classification model according to the prediction probability and the labeled value of the frequency sweep signal.
  • In some embodiments, the multiple scales include a first scale, a second scale, and a third scale, where the first scale is smaller than the second scale and the second scale is smaller than the third scale; the audio signal is convolved on the first scale, the second scale, and the third scale respectively, to extract the first, second, and third features of the audio signal in the time domain corresponding to those scales.
  • In some embodiments, the device further includes a preprocessing module configured to: normalize the audio signal in the speaker; determine the original value and the target value of the sampling rate of the audio signal; and reduce the sampling rate of the normalized audio signal from the original value to the target value.
  • The probability of the fusion feature is calculated as P_k = e^{z_k} / (e^{z_1} + e^{z_2}), where z_k represents the k-th value of the fully connected layer, z_1 represents the value corresponding to the audio-sample vector containing noise, and z_2 represents the value corresponding to the audio-sample vector without noise.
  • The calculation formula of the convolutional layer is C_i = δ(w_i * X + b_i), where i denotes the i-th convolutional layer, δ is the activation function, X represents the audio signal, * denotes convolution, w represents the convolutional layer weight, and b represents the convolutional layer bias.
  • In some embodiments, the device further includes an acquisition module configured to: use the nearest-neighbor algorithm to compute multiple neighbors for each noise-containing frequency sweep signal; perform random linear interpolation on any two of the multiple neighbors to generate simulated noise-containing frequency sweep signals; and repeat these steps until the number of noise-containing frequency sweep signals plus the number of simulated noise-containing frequency sweep signals equals the number of noise-free frequency sweep signals.
  • An electronic device including a memory and one or more processors, where the memory stores computer-readable instructions and the one or more processors execute the computer-readable instructions to implement the steps of the loudspeaker noise detection method provided by any embodiment of the present disclosure.
  • One or more non-transitory computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, implement the steps of the loudspeaker noise detection method provided by any embodiment of the present disclosure.
  • FIG. 1 is a schematic flowchart of a noise detection method for a speaker provided by one or more embodiments of the present disclosure
  • Fig. 2 is a schematic flow chart of model training provided by one or more embodiments of the present disclosure
  • FIG. 3 is a flow chart of noise detection provided by one or more embodiments of the present disclosure.
  • FIG. 4 is a schematic structural diagram of a loudspeaker noise detection device provided by one or more embodiments of the present disclosure
  • Fig. 5 is a schematic structural diagram of an electronic device provided by one or more embodiments of the present disclosure.
  • Words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present disclosure shall not be construed as preferred or advantageous over other embodiments or designs; rather, the use of such words is intended to present related concepts in a concrete manner. In addition, in the description of the embodiments of the present disclosure, unless otherwise specified, "plurality" means two or more.
  • A noise detection method for a speaker is provided.
  • This embodiment is described using the method applied to a terminal as an example. It can be understood that the method can also be applied to a server, or to a system including a terminal and a server, and implemented through interaction between the terminal and the server.
  • In this embodiment, the method includes the following steps:
  • Step 102: collect the audio signal in the speaker.
  • Micro speakers are audio output devices of smart hardware such as smart speakers, tablet computers, and mobile phones.
  • A micro speaker typically consists of a frame (basket), magnet, pole piece, diaphragm, voice coil, front cover, terminal board, and damping cloth.
  • The audio signal can be a standard 20 Hz to 20 kHz frequency sweep signal, or speech, music, and the like.
  • After the audio signal in the speaker is collected, it can be preprocessed. Specifically, the audio signal is normalized and mapped into the range (0, 1) so that all test samples share the same dimension; this reduces the amount of calculation and avoids abnormal test results caused by inconsistent dimensions. Then the original value and a target value of the audio signal's sampling rate are determined, and the sampling rate of the normalized audio signal is reduced from the original value to the target value, where the original value is greater than the target value.
  • For example, the original sampling rate of the audio signal is 48 kHz, and the audio signal is downsampled to 5 kHz, because the noise of a micro speaker is usually low- and mid-frequency noise and the normal hearing frequency range of the human ear is 20 Hz to 2000 Hz.
  • Downsampling makes the sampling rate of the audio signals consistent; while remaining within the hearing range of the human ear, it reduces the amount of data and improves detection efficiency.
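  • As a hedged illustration of this preprocessing step, the following Python sketch performs min-max normalization into (0, 1) followed by downsampling from 48 kHz to 5 kHz; the exact normalization and resampling methods are not specified in the text, so NumPy min-max scaling and SciPy polyphase resampling are assumptions.

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def preprocess(audio, orig_sr=48_000, target_sr=5_000):
    """Normalize an audio signal into (0, 1) and reduce its sampling rate.

    Minimal sketch of the preprocessing described above; the min-max
    normalization and polyphase resampling are assumed choices.
    """
    audio = np.asarray(audio, dtype=np.float64)
    # Map the signal into (0, 1) so all test samples share the same dimension.
    span = audio.max() - audio.min()
    normalized = (audio - audio.min()) / (span + 1e-12)
    # Reduce the sampling rate from the original value (48 kHz) to the target (5 kHz).
    g = gcd(target_sr, orig_sr)
    return resample_poly(normalized, up=target_sr // g, down=orig_sr // g)
```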
  • Step 104: convolve the audio signal on multiple scales respectively, to generate multiple features corresponding to the multiple scales.
  • Multiple scales of different sizes can be set, and the audio signal is convolved on each of them to generate multiple features, so as to extract feature information of the audio signal at different scales in the time domain.
  • the scale refers to the convolution kernel scale.
  • the multiple scales include a first scale, a second scale and a third scale, and the first scale is smaller than the second scale, and the second scale is smaller than the third scale.
  • Convolving the audio signal on multiple scales to generate multiple features corresponding to the multiple scales includes: convolving the audio signal on the first scale, the second scale, and the third scale respectively, to extract the first feature, the second feature, and the third feature of the audio signal in the time domain corresponding to those scales.
  • The above convolution operation is implemented through a convolutional layer, whose calculation formula is C_i = δ(w_i * X + b_i), where i denotes the i-th convolutional layer, δ is the activation function, X represents the audio signal, * denotes convolution, w represents the convolutional layer weight, and b represents the convolutional layer bias.
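  • The multi-scale convolution can be sketched as three parallel 1-D convolution branches over the raw waveform. This is only an illustration under assumptions, written in PyTorch; the kernel sizes, strides, and channel counts are hypothetical values, not taken from the disclosure.

```python
import torch
import torch.nn as nn

class MultiScaleFeatures(nn.Module):
    """Three parallel 1-D convolution branches with kernel scales K1 < K2 < K3."""

    def __init__(self, channels: int = 16):
        super().__init__()
        # Each branch computes delta(w_i * X + b_i) at its own kernel scale and stride.
        self.branches = nn.ModuleList(
            nn.Conv1d(1, channels, kernel_size=k, stride=s, padding=k // 2)
            for k, s in [(16, 2), (64, 8), (256, 32)]   # illustrative (K, S) pairs
        )
        self.act = nn.ReLU()   # stands in for the activation function delta

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        # x: (batch, 1, samples); returns one time-domain feature map per scale.
        return [self.act(conv(x)) for conv in self.branches]
```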
  • Step 106: generate a fusion feature from the multiple features, and determine the probability of the fusion feature according to the pre-trained classification model.
  • Specifically, the first feature, the second feature, and the third feature are fused into a one-dimensional feature, and this one-dimensional feature is used as the fusion feature.
  • the classification model is obtained by training the audio signal containing noise and the audio signal not containing noise as training samples.
  • The input of the pre-trained classification model is the fusion feature, and the output is the probability of the fusion feature, which indicates the probability that the audio signal contains noise.
  • the classification model includes a fully connected layer and a noise detection function, and the noise detection function may be a softmax function.
  • The probability of the fusion feature is calculated as P_k = e^{z_k} / (e^{z_1} + e^{z_2}), where z_k represents the k-th value of the fully connected layer, z_1 represents the value corresponding to the audio-sample vector containing noise, and z_2 represents the value corresponding to the audio-sample vector without noise.
  • Step 108: if the probability is greater than or equal to the threshold, determine that the audio signal contains noise; if the probability is smaller than the threshold, determine that the audio signal does not contain noise.
  • That is, whether the audio signal contains noise is determined according to the probability of the fusion feature.
  • the threshold is 0.5, and if the probability output by the classification model is greater than or equal to 0.5, it is determined that the audio signal contains noise, otherwise it is determined that the audio signal does not contain noise.
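  • Continuing the sketch above, the fusion, softmax probability, and threshold decision of steps 106 and 108 could look as follows; the two-logit fully connected layer and the flatten-and-concatenate fusion are assumptions consistent with the description, and `fused_dim` depends on the input length and the branch configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseClassifier(nn.Module):
    """Fuse the multi-scale features into one 1-D vector and score it."""

    def __init__(self, fused_dim: int):
        super().__init__()
        # Two outputs: z_1 for "contains noise", z_2 for "no noise".
        self.fc = nn.Linear(fused_dim, 2)

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        # Flatten each scale's feature map and concatenate into the fusion feature.
        fused = torch.cat([f.flatten(start_dim=1) for f in features], dim=1)
        logits = self.fc(fused)
        # Softmax gives P_k = e^{z_k} / (e^{z_1} + e^{z_2}); keep the noise probability.
        return F.softmax(logits, dim=1)[:, 0]

def contains_noise(probability: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # Step 108: compare the probability of the fusion feature against the threshold.
    return probability >= threshold
```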
  • In this way, the audio signal in the loudspeaker is convolved on multiple scales to generate multiple features corresponding to the multiple scales, a fusion feature is generated from these features, its probability is determined by the pre-trained classification model, and whether the audio signal contains noise is judged from that probability. The fused feature information of the audio signal at different scales in the time domain can thus be mined, and the presence of noise is judged by computing a probability.
  • The hardware detection scheme in the related art places high demands on the accuracy of the audio signal generator, its judgment threshold is difficult to set, and detection takes a long time. Compared with that scheme, the present disclosure reduces judgment errors caused by improper threshold selection and improves noise detection efficiency and accuracy.
  • The training stage is described below.
  • FIG. 2 is a schematic flowchart of model training provided by an embodiment of the present disclosure. As shown in FIG. 2, it includes the following steps:
  • Step 202: acquire frequency sweep signals containing noise and frequency sweep signals not containing noise.
  • The frequency sweep signals containing noise and the frequency sweep signals without noise played by the micro speaker can be collected, so that the collected frequency sweep signals can be used as a training set and input to the constructed convolution kernels with different scales and step sizes.
  • The collected frequency sweep signals may be preprocessed. Specifically, the frequency sweep signals are normalized and mapped into (0, 1); then the original value and a target value of the sweep signal's sampling rate are determined, and the sampling rate of the normalized sweep signal is reduced from the original value to the target value, where the original value is greater than the target value.
  • Step 204: convolve the frequency sweep signal on multiple scales respectively, to generate multiple sample features corresponding to the multiple scales.
  • Multiple scales of different sizes can be set, and each frequency sweep signal (both the noise-containing and the noise-free sweep signals) is convolved on these scales to generate multiple features, so as to extract feature information of the signal at different scales in the time domain.
  • the scale refers to the convolution kernel scale.
  • The multiple scales include a first scale K1, a second scale K2, and a third scale K3, where K1 is smaller than K2 and K2 is smaller than K3.
  • The first sample feature, the second sample feature, and the third sample feature correspond to the first scale K1, the second scale K2, and the third scale K3, respectively.
  • Multiple step sizes can also be set for the convolution kernels.
  • The multiple step sizes include a first step size S1, a second step size S2, and a third step size S3, where S1 is smaller than S2 and S2 is smaller than S3.
  • In this way, time-domain feature information of the frequency sweep signal can be extracted at lengths from small to large.
  • The above convolution operation is implemented through a convolutional layer, whose calculation formula is C_i = δ(w_i * X + b_i), where i denotes the i-th convolutional layer, δ is the activation function, X represents the audio signal, * denotes convolution, w represents the convolutional layer weight, and b represents the convolutional layer bias.
  • Step 206: fuse the multiple sample features to generate a sample fusion feature, and determine the prediction probability of the sample fusion feature.
  • Specifically, the first sample feature, the second sample feature, and the third sample feature are fused and merged into a one-dimensional feature, and this one-dimensional feature is used as the sample fusion feature.
  • the classification model includes a fully connected layer and a noise detection function, and the noise detection function may be a softmax function.
  • The predicted probability of the sample fusion feature is calculated as P_k = e^{z_k} / (e^{z_1} + e^{z_2}), where z_k represents the k-th value of the fully connected layer, z_1 represents the value corresponding to the audio-sample vector containing noise, and z_2 represents the value corresponding to the audio-sample vector without noise.
  • Step 208: train the classification model according to the predicted probability and the labeled value of the frequency sweep signal.
  • each frequency sweep signal corresponds to a label value
  • a frequency sweep signal containing noise corresponds to a label value of 1
  • a frequency sweep signal that does not contain noise corresponds to a label value of 0.
  • A multi-scale end-to-end convolutional neural network can be constructed, which includes convolutional layers, a fully connected layer, and a noise detection function. The network is trained on the training set, and when the model has converged and its accuracy exceeds a preset value, the trained convolutional neural network model is saved. This model is then used to determine whether an input audio signal contains noise.
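  • A training loop for such a network might look like the sketch below. It assumes the model returns the two fully connected logits (before softmax) and that labels are 1 for noise-containing and 0 for noise-free sweep signals; the optimizer, learning rate, epoch count, and accuracy target are illustrative choices rather than values from the disclosure.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=50, lr=1e-3, target_acc=0.98):
    """Train the multi-scale CNN on labeled sweep signals (label 1 = noise, 0 = clean)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()   # softmax + negative log-likelihood over the two classes
    for _ in range(epochs):
        correct, total = 0, 0
        for signal, label in loader:          # signal: (batch, 1, samples), label: (batch,)
            logits = model(signal)            # the two fully connected outputs z_1, z_2
            loss = loss_fn(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            correct += (logits.argmax(dim=1) == label).sum().item()
            total += label.numel()
        if correct / total >= target_acc:     # save once the preset accuracy is reached
            torch.save(model.state_dict(), "noise_detector.pt")
            break
```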
  • the input of the model can also be other acoustic features of the audio signal played by the micro-speaker, such as frequency spectrum, logarithmic mel spectrum, etc.
  • the fusion feature information on different scales in the time domain is extracted through the frequency sweep signal containing noise and the frequency sweep signal without noise, and the predicted probability of the sample fusion feature and the labeled value of the frequency sweep signal are used for training.
  • the pre-trained model is applied to the audio signal noise detection in the loudspeaker, which improves the noise detection processing efficiency and detection accuracy.
  • The training set can be constructed as follows: using the nearest-neighbor algorithm, compute multiple neighbors for each noise-containing frequency sweep signal; perform random linear interpolation on any two of the multiple neighbors to generate a simulated noise-containing frequency sweep signal; and repeat these steps until the number of noise-containing frequency sweep signals plus the number of simulated noise-containing frequency sweep signals equals the number of noise-free frequency sweep signals.
  • For example, the numbers of collected frequency sweep signals containing noise and without noise (that is, defective products and good products in the production-line test) played by the micro speakers are 90 and 3600, respectively.
  • In this case, the present disclosure processes the 90 defective samples to generate simulated defective samples until their number matches the number of good samples: two samples are randomly selected from the nearest neighbors of a defective sample and randomly linearly interpolated to construct a new simulated defective sample, and the synthesized samples are combined with the original data to form a new training set.
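  • The nearest-neighbor oversampling can be sketched as follows; scikit-learn's NearestNeighbors and the choice of k = 5 neighbors are assumptions, while the random linear interpolation between two neighbors follows the description above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def oversample_noisy(noisy, n_clean, k=5, seed=0):
    """Generate simulated noise-containing sweeps until the classes are balanced.

    noisy:   array of shape (n_noisy, n_samples), the noise-containing sweeps.
    n_clean: number of noise-free sweeps to match.
    """
    rng = np.random.default_rng(seed)
    noisy = np.asarray(noisy, dtype=np.float64)
    # k + 1 because the nearest neighbor of each sample is the sample itself.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(noisy).kneighbors(noisy)
    synthetic = []
    while len(noisy) + len(synthetic) < n_clean:
        base = rng.integers(len(noisy))
        a, b = rng.choice(idx[base][1:], size=2, replace=False)
        lam = rng.random()                       # random linear interpolation factor
        synthetic.append(noisy[a] + lam * (noisy[b] - noisy[a]))
    return np.vstack([noisy] + synthetic) if synthetic else noisy
```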
  • FIG. 3 is a flowchart of a noise detection scenario provided by an embodiment of the present disclosure.
  • A loudspeaker noise detection device is provided, including: a collection module 41, an extraction module 42, a generation module 43, a first determination module 44, and a second determination module 45.
  • The collection module 41 is configured to collect the audio signal in the speaker.
  • the extraction module 42 is configured to convolve the audio signal on multiple scales respectively to generate multiple features corresponding to the multiple scales.
  • the generating module 43 is configured to generate fusion features according to multiple features, and determine the probability of fusion features according to the pre-trained classification model.
  • the first determination module 44 is configured to determine that the audio signal contains noise if the probability is greater than or equal to a threshold.
  • the second determination module 45 is configured to determine that the audio signal does not contain noise if the probability is less than the threshold.
  • In some embodiments, the device further includes a training module configured to: acquire a frequency sweep signal containing noise and a frequency sweep signal not containing noise; convolve the frequency sweep signal on multiple scales respectively to generate multiple sample features corresponding to the multiple scales; fuse the multiple sample features to generate a sample fusion feature and determine the prediction probability of the sample fusion feature; and train the classification model according to the prediction probability and the labeled value of the frequency sweep signal.
  • the multiple scales include a first scale, a second scale, and a third scale
  • the first scale is smaller than the second scale
  • the second scale is smaller than the third scale
  • The extraction module 42 is specifically configured to convolve the audio signal on the first scale, the second scale, and the third scale respectively, to extract the first, second, and third features of the audio signal in the time domain corresponding to those scales.
  • In some embodiments, the device further includes a preprocessing module configured to: normalize the audio signal in the speaker; determine the original value and the target value of the sampling rate of the audio signal; and reduce the sampling rate of the normalized audio signal from the original value to the target value.
  • The probability of the fusion feature is calculated as P_k = e^{z_k} / (e^{z_1} + e^{z_2}), where z_k represents the k-th value of the fully connected layer, z_1 represents the value corresponding to the audio-sample vector containing noise, and z_2 represents the value corresponding to the audio-sample vector without noise.
  • The calculation formula of the convolutional layer is C_i = δ(w_i * X + b_i), where i denotes the i-th convolutional layer, δ is the activation function, X represents the audio signal, * denotes convolution, w represents the convolutional layer weight, and b represents the convolutional layer bias.
  • the device further includes: an acquisition module configured to use a nearest neighbor algorithm to calculate multiple neighbors for each frequency sweep signal containing noise; perform random linear interpolation on any two of the multiple neighbors, to generate a simulated frequency sweep signal containing noise; repeat the above steps until the sum of the number of frequency sweep signals containing noise and the number of simulated frequency sweep signals containing noise is equal to the number of frequency sweep signals not containing noise.
  • each module in the noise detection device for the above-mentioned loudspeaker can be fully or partially realized by software, hardware and a combination thereof.
  • The above-mentioned modules can be embedded in, or independent of, one or more processors of the electronic device in the form of hardware, or stored in the memory of the electronic device in the form of software, so that the one or more processors can invoke and execute the operations corresponding to the modules.
  • an electronic device is provided.
  • the electronic device may be a terminal, and its internal structure may be as shown in FIG. 5 .
  • the electronic device includes one or more processors, memory, communication interface, display screen, and input device connected by a system bus. Wherein, one or more processors of the electronic device are used to provide calculation and control capabilities.
  • the memory of the electronic device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer readable instructions.
  • the internal memory provides an environment for the execution of the operating system and computer readable instructions in the non-volatile storage medium.
  • the communication interface of the electronic device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be realized through WIFI, an operator network, near field communication (NFC) or other technologies.
  • When the computer-readable instructions are executed by the one or more processors, a loudspeaker noise detection method is implemented.
  • the display screen of the electronic device may be a liquid crystal display screen or an electronic ink display screen
  • the input device of the electronic device may be a touch layer covered on the display screen, or a button, a trackball or a touch pad provided on the housing of the electronic device , and can also be an external keyboard, touchpad, or mouse.
  • FIG. 5 is only a block diagram of a partial structure related to the disclosed solution and does not limit the electronic device to which the disclosed solution is applied.
  • A specific electronic device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the device for detecting noise of a speaker may be implemented in the form of a computer readable instruction, and the computer readable instruction may be run on the electronic device as shown in FIG. 5 .
  • the memory of the electronic device can store each program module forming the noise detection device of the loudspeaker, such as the collection module 41 shown in FIG. 4, the extraction module 42, the generation module 43, the first determination module 44, and the second determination module 45.
  • the computer-readable instructions constituted by various program modules enable one or more processors to execute the steps in the method for detecting noise of a loudspeaker according to various embodiments of the present disclosure described in this specification.
  • The electronic device shown in FIG. 5 may collect the audio signal in the speaker through the collection module 41 in the loudspeaker noise detection device shown in FIG. 4.
  • the electronic device may use the extraction module 42 to perform convolution on the audio signal on multiple scales respectively to generate multiple features corresponding to the multiple scales.
  • the electronic device can generate fusion features according to multiple features through the generation module 43, and determine the probability of fusion features according to the pre-trained classification model.
  • Through the first determination module 44, the electronic device may determine that the audio signal contains noise if the probability is greater than or equal to the threshold.
  • Through the second determination module 45, the electronic device may determine that the audio signal does not contain noise if the probability is less than the threshold.
  • An electronic device is provided, including a memory and one or more processors, where the memory stores computer-readable instructions and the one or more processors execute the computer-readable instructions to implement the following steps: collecting the audio signal in the speaker; convolving the audio signal on multiple scales to generate multiple features corresponding to the multiple scales; generating a fusion feature from the multiple features and determining the probability of the fusion feature according to the pre-trained classification model; determining that the audio signal contains noise if the probability is greater than or equal to the threshold; and determining that the audio signal does not contain noise if the probability is less than the threshold.
  • When the one or more processors execute the computer-readable instructions, the following steps may also be implemented: acquiring frequency sweep signals containing noise and frequency sweep signals not containing noise; convolving the frequency sweep signals on multiple scales respectively to generate multiple sample features corresponding to the multiple scales; fusing the multiple sample features to generate a sample fusion feature and determining its prediction probability; and training the classification model according to the prediction probability and the labeled value of the frequency sweep signal.
  • The following steps may also be implemented: convolving the audio signal on the first scale, the second scale, and the third scale respectively, to extract the first feature, the second feature, and the third feature of the audio signal in the time domain corresponding to the first scale, the second scale, and the third scale.
  • The following steps may also be implemented: normalizing the audio signal in the speaker; determining the original value and the target value of the sampling rate of the audio signal; and reducing the sampling rate of the normalized audio signal from the original value to the target value.
  • The following steps may also be implemented: using the nearest-neighbor algorithm to compute multiple neighbors for each frequency sweep signal containing noise; performing random linear interpolation on any two of the multiple neighbors to generate a simulated frequency sweep signal containing noise; and repeating the above steps until the sum of the number of noise-containing frequency sweep signals and the number of noise-containing simulated frequency sweep signals equals the number of noise-free frequency sweep signals.
  • When the computer-readable instructions are executed by the one or more processors, the audio signal in the speaker is convolved on multiple scales to generate multiple features corresponding to the multiple scales, a fusion feature is generated from these features, its probability is determined by the pre-trained classification model, and whether the audio signal contains noise is determined from that probability. This mines the fused feature information of the audio signal at different scales in the time domain, judges the presence of noise by computing a probability, improves the accuracy of feature information detection, reduces computational complexity on the test side, and improves noise detection processing efficiency and detection accuracy.
  • One or more non-transitory computer-readable storage media storing computer-readable instructions are provided, and when the computer-readable instructions are executed by one or more processors, the following steps are implemented: collecting the audio signal in the speaker; convolving the audio signal on multiple scales to generate multiple features corresponding to the multiple scales; generating a fusion feature from the multiple features and determining the probability of the fusion feature according to the pre-trained classification model; determining that the audio signal contains noise if the probability is greater than or equal to the threshold; and determining that the audio signal does not contain noise if the probability is less than the threshold.
  • When the computer-readable instructions are executed, the following steps may also be implemented: acquiring frequency sweep signals containing noise and frequency sweep signals not containing noise; convolving the frequency sweep signals on multiple scales to generate multiple sample features corresponding to the multiple scales; fusing the multiple sample features to generate a sample fusion feature and determining its prediction probability; and training the classification model according to the prediction probability and the labeled value of the frequency sweep signal.
  • The following steps may also be implemented: convolving the audio signal on the first scale, the second scale, and the third scale respectively, to extract the first feature, the second feature, and the third feature of the audio signal in the time domain corresponding to the first scale, the second scale, and the third scale.
  • The following steps may also be implemented: normalizing the audio signal in the speaker; determining the original value and the target value of the sampling rate of the audio signal; and reducing the sampling rate of the normalized audio signal from the original value to the target value.
  • The following steps may also be implemented: using the nearest-neighbor algorithm to compute multiple neighbors for each frequency sweep signal containing noise; performing random linear interpolation on any two of the multiple neighbors to generate a simulated frequency sweep signal containing noise; and repeating the above steps until the sum of the number of noise-containing frequency sweep signals and the number of noise-containing simulated frequency sweep signals equals the number of noise-free frequency sweep signals.
  • When the computer-readable instructions are executed by the one or more processors, the audio signal in the speaker is convolved on multiple scales to generate multiple features corresponding to the multiple scales, a fusion feature is generated from these features, its probability is determined by the pre-trained classification model, and whether the audio signal contains noise is determined from that probability. This mines the fused feature information of the audio signal at different scales in the time domain, judges the presence of noise by computing a probability, improves the accuracy of feature information detection, reduces computational complexity on the test side, and improves noise detection processing efficiency and detection accuracy.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory or optical memory, etc.
  • Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM) and dynamic RAM (DRAM).
  • The loudspeaker noise detection method provided by the present disclosure convolves the audio signal in the speaker on multiple scales to generate multiple features corresponding to the multiple scales, generates a fusion feature from these features, determines the probability of the fusion feature according to the pre-trained classification model, and determines whether the audio signal contains noise according to that probability. The fused feature information of the audio signal at different scales in the time domain can thus be mined and the presence of noise judged by computing a probability, which improves the accuracy of feature information detection, reduces computational complexity on the test side, improves noise detection processing efficiency and detection accuracy, and has strong industrial applicability.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A method and apparatus for detecting noise of a loudspeaker, and an electronic device and a storage medium, which relate to the technical field of artificial intelligence. The method comprises: collecting an audio signal in a loudspeaker (102); respectively convolving the audio signal in a plurality of scales, so as to generate a plurality of features corresponding to the plurality of scales (104); generating a fused feature according to the plurality of features, and determining a probability of the fused feature according to a pre-trained classification model (106); and if the probability is greater than or equal to a threshold value, determining that the audio signal includes noise, and if the probability is less than the threshold value, determining that the audio signal does not include noise (108). By using the method, the noise detection accuracy and processing efficiency can be increased.

Description

Loudspeaker noise detection method, apparatus, electronic device and storage medium
This disclosure claims priority to the Chinese patent application No. 202110833182.0, entitled "Loudspeaker noise detection method, apparatus, electronic device and storage medium", filed with the China Patent Office on July 22, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to a loudspeaker noise detection method, apparatus, electronic device, and storage medium.
Background
As key audio output components of electronic devices, micro speakers are widely used in smart hardware such as smart speakers, tablet computers, and mobile phones. In the production of micro speakers, noise detection has become a key factor in determining production quality, and the accuracy and efficiency requirements for noise detection methods are increasingly stringent.
In the related art, noise detection is performed by a hardware detection system: an audio signal generator excites the micro speaker, and a sound pressure signal is acquired through an artificial ear; the sound pressure signal is passed through A/D conversion and a data acquisition card to a computer, which then calculates the Rub value at each frequency point, extracts features, and performs detection and recognition by empirical threshold adjudication. In this solution, the tester manually selects the noise-presence judgment threshold at each frequency point based on tests of multiple signals; for noise detection scenarios requiring high precision, this threshold is difficult to set and the detection accuracy needs improvement.
Summary
(1) Technical problem to be solved
When a hardware detection system is used for noise detection, manually selecting the noise-presence judgment threshold at each frequency point makes it difficult to satisfy noise detection scenarios with high precision requirements, and the detection accuracy is low.
(2) Technical solution
According to various embodiments of the present disclosure, a loudspeaker noise detection method, apparatus, electronic device, and storage medium are provided.
A loudspeaker noise detection method includes:
collecting an audio signal in the loudspeaker;
convolving the audio signal on multiple scales respectively, to generate multiple features corresponding to the multiple scales;
generating a fusion feature from the multiple features, and determining the probability of the fusion feature according to a pre-trained classification model;
if the probability is greater than or equal to a threshold, determining that the audio signal contains noise;
if the probability is less than the threshold, determining that the audio signal does not contain noise.
In one embodiment, the method further includes:
acquiring a frequency sweep signal containing noise and a frequency sweep signal not containing noise;
convolving the frequency sweep signal on multiple scales respectively, to generate multiple sample features corresponding to the multiple scales;
fusing the multiple sample features to generate a sample fusion feature, and determining the predicted probability of the sample fusion feature;
training a classification model according to the predicted probability and the labeled value of the frequency sweep signal.
In one embodiment, the multiple scales include a first scale, a second scale, and a third scale, where the first scale is smaller than the second scale and the second scale is smaller than the third scale, and convolving the audio signal on the multiple scales respectively to generate the multiple features corresponding to the multiple scales includes:
convolving the audio signal on the first scale, the second scale, and the third scale respectively, to extract the first feature, the second feature, and the third feature of the audio signal in the time domain corresponding to the first scale, the second scale, and the third scale.
In one embodiment, the method further includes:
normalizing the audio signal in the loudspeaker;
determining an original value and a target value of the sampling rate of the audio signal, and reducing the sampling rate of the normalized audio signal from the original value to the target value.
In one embodiment, the probability of the fusion feature is calculated as follows:
P_k = e^{z_k} / (e^{z_1} + e^{z_2})
where z_k represents the k-th value of the fully connected layer, z_1 represents the value corresponding to the audio-sample vector containing noise, and z_2 represents the value corresponding to the audio-sample vector without noise.
In one embodiment, the calculation formula of the convolutional layer is as follows:
C_i = δ(w_i * X + b_i)
where i denotes the i-th convolutional layer, δ is the activation function, X represents the audio signal, * denotes convolution, w represents the convolutional layer weight, and b represents the convolutional layer bias.
In one embodiment, the method further includes:
using a nearest-neighbor algorithm to compute multiple neighbors for each frequency sweep signal containing noise;
performing random linear interpolation on any two of the multiple neighbors to generate a simulated frequency sweep signal containing noise;
repeating the above steps until the sum of the number of noise-containing frequency sweep signals and the number of noise-containing simulated frequency sweep signals equals the number of noise-free frequency sweep signals.
A loudspeaker noise detection device includes:
a collection module configured to collect an audio signal in the loudspeaker;
an extraction module configured to convolve the audio signal on multiple scales respectively, to generate multiple features corresponding to the multiple scales;
a generation module configured to generate a fusion feature from the multiple features, and determine the probability of the fusion feature according to a pre-trained classification model;
a first determination module configured to determine that the audio signal contains noise if the probability is greater than or equal to a threshold;
a second determination module configured to determine that the audio signal does not contain noise if the probability is less than the threshold.
Optionally, the device further includes a training module configured to: acquire a frequency sweep signal containing noise and a frequency sweep signal not containing noise; convolve the frequency sweep signal on multiple scales respectively to generate multiple sample features corresponding to the multiple scales; fuse the multiple sample features to generate a sample fusion feature and determine the prediction probability of the sample fusion feature; and train the classification model according to the prediction probability and the labeled value of the frequency sweep signal.
Optionally, the multiple scales include a first scale, a second scale, and a third scale, where the first scale is smaller than the second scale and the second scale is smaller than the third scale; the extraction module is specifically configured to convolve the audio signal on the first scale, the second scale, and the third scale respectively, to extract the first feature, the second feature, and the third feature of the audio signal in the time domain corresponding to the first scale, the second scale, and the third scale.
Optionally, the device further includes a preprocessing module configured to: normalize the audio signal in the loudspeaker; determine the original value and the target value of the sampling rate of the audio signal; and reduce the sampling rate of the normalized audio signal from the original value to the target value.
Optionally, the probability of the fusion feature is calculated as follows:
P_k = e^{z_k} / (e^{z_1} + e^{z_2})
where z_k represents the k-th value of the fully connected layer, z_1 represents the value corresponding to the audio-sample vector containing noise, and z_2 represents the value corresponding to the audio-sample vector without noise.
Optionally, the calculation formula of the convolutional layer is as follows:
C_i = δ(w_i * X + b_i)
where i denotes the i-th convolutional layer, δ is the activation function, X represents the audio signal, * denotes convolution, w represents the convolutional layer weight, and b represents the convolutional layer bias.
Optionally, the device further includes an acquisition module configured to: use the nearest-neighbor algorithm to compute multiple neighbors for each frequency sweep signal containing noise; perform random linear interpolation on any two of the multiple neighbors to generate simulated frequency sweep signals containing noise; and repeat the above steps until the sum of the number of noise-containing frequency sweep signals and the number of noise-containing simulated frequency sweep signals equals the number of noise-free frequency sweep signals.
An electronic device includes a memory and one or more processors, where the memory stores computer-readable instructions and the one or more processors execute the computer-readable instructions to implement the steps of the loudspeaker noise detection method provided by any embodiment of the present disclosure.
One or more non-transitory computer-readable storage media store computer-readable instructions which, when executed by one or more processors, implement the steps of the loudspeaker noise detection method provided by any embodiment of the present disclosure.
Description of Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and, together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure or the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
图1为本公开一个或多个实施例提供的扬声器的杂音检测方法的流程示意图;FIG. 1 is a schematic flowchart of a noise detection method for a speaker provided by one or more embodiments of the present disclosure;
图2为本公开一个或多个实施例所提供的一种模型训练的流程示意图;Fig. 2 is a schematic flow chart of model training provided by one or more embodiments of the present disclosure;
图3为本公开一个或多个实施例所提供的一种杂音检测流程图;FIG. 3 is a flow chart of noise detection provided by one or more embodiments of the present disclosure;
图4为本公开一个或多个实施例所提供的扬声器的杂音检测装置的结构示意图;FIG. 4 is a schematic structural diagram of a loudspeaker noise detection device provided by one or more embodiments of the present disclosure;
图5为本公开一个或多个实施例所提供的一种电子设备的结构示意图。Fig. 5 is a schematic structural diagram of an electronic device provided by one or more embodiments of the present disclosure.
Detailed Description
为了使本公开的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本公开进行进一步详细说明。应当理解,此处描述的具体实施例仅仅用以解释本公开,并不配置成限定本公开。In order to make the purpose, technical solutions and advantages of the present disclosure clearer, the present disclosure will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present disclosure, and are not configured to limit the present disclosure.
In the embodiments of the present disclosure, words such as "exemplary" or "for example" are used to indicate an example, instance, or illustration. Any embodiment or design described as "exemplary" or "for example" should not be construed as preferred or more advantageous than other embodiments or designs; rather, such words are intended to present the related concepts in a concrete manner. In addition, in the description of the embodiments of the present disclosure, unless otherwise specified, "a plurality of" means two or more.
In one embodiment, as shown in FIG. 1, a loudspeaker noise detection method is provided. This embodiment is illustrated by applying the method to a terminal; it can be understood that the method can also be applied to a server, or to a system including a terminal and a server and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
步骤102,采集扬声器中的音频信号。 Step 102, collect the audio signal in the speaker.
本公开实施例的方法,可以用于检测扬声器中的音频信号是否包含杂音。具体的,可以用于微型扬声器的杂音检测。微型扬声器例如是智能音箱、平板电脑、手机等智能硬件的音频输出器件,微型扬声器由盆架、磁钢、极片、音膜、音圈、前盖、接线板、阻尼布等构成。The method of the embodiment of the present disclosure can be used to detect whether the audio signal in the speaker contains noise. Specifically, it can be used for noise detection of micro-speakers. For example, micro speakers are audio output devices of smart hardware such as smart speakers, tablet computers, and mobile phones. Micro speakers are composed of basin frames, magnetic steel, pole pieces, sound membranes, voice coils, front covers, wiring boards, and damping cloth.
其中,音频信号可以是20-20KHz标准扫频信号,也可以是语音、音乐等。Wherein, the audio signal can be a 20-20KHz standard frequency sweep signal, or voice, music, etc.
In one embodiment of the present application, after the audio signal from the loudspeaker is collected, the audio signal may be preprocessed. Specifically, the audio signal may be normalized so that it is mapped into (0, 1); this puts all test samples on the same scale, reducing the amount of computation and avoiding abnormal test results caused by inconsistent scales. Then, the original value and the target value of the sampling rate of the audio signal are determined, and the sampling rate of the normalized audio signal is reduced from the original value to the target value, where the original value is greater than the target value. For example, the original sampling rate of the audio signal is 48 kHz, and downsampling reduces it to 5 kHz. Since the noise of a micro-speaker is usually low-to-mid-frequency noise, and the normal hearing frequency range of the human ear is taken here as 20 Hz to 2000 Hz, downsampling makes the sampling rates consistent and, while keeping the signal within the hearing range, reduces the amount of data and improves detection efficiency.
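As a concrete illustration of this preprocessing step, the following is a minimal Python sketch that assumes min-max normalization into (0, 1) and polyphase resampling with SciPy; the function name, the exact normalization formula, and the choice of resampler are assumptions, since the disclosure does not specify them.

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def preprocess(audio: np.ndarray, orig_sr: int = 48000, target_sr: int = 5000) -> np.ndarray:
    """Min-max normalize a waveform into (0, 1) and downsample it to the target rate."""
    # Min-max normalization maps the waveform into the (0, 1) range.
    a_min, a_max = audio.min(), audio.max()
    normalized = (audio - a_min) / (a_max - a_min + 1e-12)
    # Polyphase resampling from the original rate (e.g. 48 kHz) to the target rate (e.g. 5 kHz).
    g = gcd(target_sr, orig_sr)
    return resample_poly(normalized, target_sr // g, orig_sr // g)
```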
步骤104,将音频信号分别在多个尺度上进行卷积,生成与多个尺度对应的多个特征。In step 104, the audio signal is convoluted on multiple scales to generate multiple features corresponding to the multiple scales.
本公开实施例中,可以设置不同大小的多个尺度,将音频信号分别在多个尺度上进行卷积,生成多个特征,以实现提取音频信号在时域上的不同尺度特征信息。In the embodiment of the present disclosure, multiple scales of different sizes can be set, and the audio signal is convoluted on multiple scales respectively to generate multiple features, so as to extract feature information of different scales of the audio signal in the time domain.
其中,尺度是指卷积核尺度。Among them, the scale refers to the convolution kernel scale.
As an example, the multiple scales include a first scale, a second scale, and a third scale, where the first scale is smaller than the second scale and the second scale is smaller than the third scale. In this example, convolving the audio signal on the multiple scales to generate the multiple features corresponding to the multiple scales includes: convolving the audio signal on the first scale, the second scale, and the third scale respectively, so as to extract a first feature, a second feature, and a third feature of the audio signal in the time domain corresponding to the first scale, the second scale, and the third scale, respectively.
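The following is a minimal PyTorch sketch of three parallel time-domain convolution branches with increasing kernel sizes; the kernel sizes (8, 32, 128), the channel count, the pooling, and the class name are illustrative assumptions and are not taken from the disclosure.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Three parallel 1-D convolution branches over the raw waveform, one per scale."""
    def __init__(self, k1: int = 8, k2: int = 32, k3: int = 128):
        super().__init__()
        # One branch per scale; the kernel sizes satisfy k1 < k2 < k3.
        self.branch1 = nn.Sequential(nn.Conv1d(1, 16, kernel_size=k1), nn.ReLU(), nn.AdaptiveAvgPool1d(64))
        self.branch2 = nn.Sequential(nn.Conv1d(1, 16, kernel_size=k2), nn.ReLU(), nn.AdaptiveAvgPool1d(64))
        self.branch3 = nn.Sequential(nn.Conv1d(1, 16, kernel_size=k3), nn.ReLU(), nn.AdaptiveAvgPool1d(64))

    def forward(self, x: torch.Tensor):
        # x: (batch, 1, num_samples) time-domain audio
        return self.branch1(x), self.branch2(x), self.branch3(x)
```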
Optionally, the above convolution operation is implemented by convolutional layers, computed as:

$X_i = \delta(w_i \ast X_{i-1} + b_i)$

where $i$ indexes the i-th convolutional layer, $\delta$ is the activation function, $X$ is the audio signal (the input to the first layer), $w_i$ is the weight of the convolutional layer, and $b_i$ is its bias.
步骤106,根据多个特征生成融合特征,并根据预训练的分类模型确定融合特征的概率。 Step 106, generate fusion features according to multiple features, and determine the probability of fusion features according to the pre-trained classification model.
本实施例中,将与多个尺度分别对应的多个特征进行融合处理,以生成融合特征。In this embodiment, multiple features respectively corresponding to multiple scales are fused to generate fused features.
As an example, taking the case where the multiple scales include a first scale, a second scale, and a third scale, the first feature, the second feature, and the third feature are fused and merged into a one-dimensional feature, and this one-dimensional feature is used as the fused feature.
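A minimal sketch of this fusion step, assuming each branch output is flattened per sample and the results are concatenated into one one-dimensional vector (the helper name is hypothetical):

```python
import torch

def fuse_features(f1: torch.Tensor, f2: torch.Tensor, f3: torch.Tensor) -> torch.Tensor:
    """Flatten each scale's feature map and concatenate into one 1-D fused feature per sample."""
    flat = [f.flatten(start_dim=1) for f in (f1, f2, f3)]   # each: (batch, channels * length)
    return torch.cat(flat, dim=1)                           # (batch, total_fused_dim)
```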
The classification model is trained using noise-containing audio signals and noise-free audio signals as training samples. The input of the pre-trained classification model is the fused feature, and its output is the probability of the fused feature, which indicates the probability that the audio signal contains noise.
In one embodiment of the present disclosure, the classification model includes a fully connected layer and a noise detection function, and the noise detection function may be a softmax function. Optionally, the probability of the fused feature is calculated as:

$P(z_k) = \dfrac{e^{z_k}}{\sum_{j} e^{z_j}}$

where $z_k$ denotes the k-th value of the fully connected layer, $z_1$ corresponds to the audio sample vector containing noise, and $z_0$ corresponds to the audio sample vector without noise.
步骤108,若概率大于等于阈值,则确定音频信号包含杂音;若概率小于阈值,则确定音频信号不包含杂音。 Step 108, if the probability is greater than or equal to the threshold, determine that the audio signal contains noise; if the probability is smaller than the threshold, determine that the audio signal does not contain noise.
本实施例中,可以根据融合特征的概率确定音频信号是否包含杂音。作为一种示例,阈值为0.5,若分类模型输出的概率大于等于0.5,则确定音频信号包含杂音,否则确定音频信号不包含杂音。In this embodiment, it may be determined whether the audio signal contains noise according to the probability of the fused features. As an example, the threshold is 0.5, and if the probability output by the classification model is greater than or equal to 0.5, it is determined that the audio signal contains noise, otherwise it is determined that the audio signal does not contain noise.
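Putting the fully connected layer, the softmax, and the 0.5 threshold together, a minimal inference-side sketch might look as follows; the class-index assignment (index 1 meaning "contains noise") and the module names are assumptions.

```python
import torch
import torch.nn as nn

class NoiseClassifier(nn.Module):
    """Fully connected layer followed by softmax over two classes (noise / no noise)."""
    def __init__(self, fused_dim: int):
        super().__init__()
        self.fc = nn.Linear(fused_dim, 2)   # assumed: index 1 = contains noise, index 0 = no noise

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # Softmax turns the fully connected outputs into class probabilities.
        return torch.softmax(self.fc(fused), dim=1)

def has_noise(prob: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # prob[:, 1] is the probability that the signal contains noise; compare against the threshold.
    return prob[:, 1] >= threshold
```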
需要说明的是,上述在概率大于等于阈值时确定音频信号包含杂音的实现方式仅为一种示例,具体判断逻辑可根据训练端确定,此处不作限制。It should be noted that the above implementation of determining that the audio signal contains noise when the probability is greater than or equal to the threshold is just an example, and the specific judgment logic can be determined according to the training end, which is not limited here.
According to the loudspeaker noise detection method of the embodiments of the present disclosure, the audio signal from the loudspeaker is convolved on multiple scales to generate multiple features corresponding to the multiple scales; a fused feature is then generated from the multiple features, the probability of the fused feature is determined with a pre-trained classification model, and whether the audio signal contains noise is determined from that probability. In this way, fused feature information of the audio signal at different time-domain scales can be mined, and the presence of noise is judged by computing a probability, which improves the accuracy of feature detection and reduces computational complexity at the test end. Hardware detection schemes in the related art place high accuracy requirements on the audio signal generator, their decision thresholds are difficult to set, and detection is time-consuming; compared with such schemes, the present disclosure reduces judgment errors caused by improperly chosen manual thresholds and improves both the efficiency and the accuracy of noise detection.
基于上述实施例,下面对训练端进行说明。Based on the above embodiments, the training terminal will be described below.
图2为本公开实施例所提供的一种模型训练的流程示意图,如图2所示,包括以下步骤:Fig. 2 is a schematic flow chart of a model training provided by an embodiment of the present disclosure, as shown in Fig. 2 , including the following steps:
步骤202,获取包含杂音的扫频信号和不包含杂音的扫频信号。 Step 202, acquiring a frequency sweep signal containing noise and a frequency sweep signal not containing noise.
In this embodiment, noise-containing and noise-free frequency sweep signals played by micro-speakers can be collected and used as a training set, which is fed into a convolutional neural network built with convolution kernels of different scales and strides and trained in parallel.
在本公开的一个实施例中,在获取包含杂音的扫频信号和不包含杂音的扫频信号后,可以对包含杂音的扫频信号和不包含杂音的扫频 信号进行预处理,具体的,可以对扫频信号进行归一化处理,通过归一化处理将扫频信号映射到(0,1)。进而,确定扫频信号采样率的原始值和目标值,并将归一化处理后的扫频信号的采样率由原始值降低至目标值,其中,原始值大于目标值。In an embodiment of the present disclosure, after obtaining the frequency sweep signal containing noise and the frequency sweep signal not containing noise, preprocessing may be performed on the frequency sweep signal containing noise and the frequency sweep signal not containing noise, specifically, Normalization processing can be performed on the frequency sweep signal, and the frequency sweep signal is mapped to (0, 1) through the normalization processing. Furthermore, an original value and a target value of the sampling rate of the frequency sweeping signal are determined, and the sampling rate of the normalized frequency scanning signal is reduced from the original value to the target value, wherein the original value is greater than the target value.
步骤204,将扫频信号分别在多个尺度上进行卷积,生成与多个尺度对应的多个样本特征。In step 204, the frequency sweep signal is convoluted on multiple scales to generate multiple sample features corresponding to the multiple scales.
本公开实施例中,可以设置不同大小的多个尺度,将扫频信号(包括含有杂音的扫频信号和不含有杂音的扫频信号)分别在多个尺度上进行卷积,生成多个特征,以实现提取音频信号在时域上的不同尺度特征信息。In the embodiment of the present disclosure, multiple scales of different sizes can be set, and the frequency sweep signal (including the frequency sweep signal containing noise and the frequency sweep signal without noise) is convoluted on multiple scales respectively to generate multiple features , to extract feature information of different scales of the audio signal in the time domain.
其中,尺度是指卷积核尺度。Among them, the scale refers to the convolution kernel scale.
As an example, the multiple scales include a first scale K1, a second scale K2, and a third scale K3, where K1 is smaller than K2 and K2 is smaller than K3. The frequency sweep signal is convolved on K1, K2, and K3 respectively, so as to extract a first sample feature, a second sample feature, and a third sample feature of the frequency sweep signal in the time domain corresponding to K1, K2, and K3, respectively.
Multiple strides can also be set for the convolution kernels. For example, the strides include a first stride S1, a second stride S2, and a third stride S3, where S1 is smaller than S2 and S2 is smaller than S3. In this way, feature information of the frequency sweep signal over time-domain spans from short to long can be extracted.
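One possible configuration of the three (kernel size, stride) pairs is sketched below; the numeric values are illustrative assumptions, chosen only to satisfy K1 < K2 < K3 and S1 < S2 < S3, and are not disclosed in the text.

```python
import torch.nn as nn

# Illustrative (kernel, stride) pairs with K1 < K2 < K3 and S1 < S2 < S3.
scales = [(8, 2), (32, 4), (128, 8)]
branches = nn.ModuleList(
    nn.Conv1d(in_channels=1, out_channels=16, kernel_size=k, stride=s) for k, s in scales
)
```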
Optionally, the above convolution operation is implemented by convolutional layers, computed as:

$X_i = \delta(w_i \ast X_{i-1} + b_i)$

where $i$ indexes the i-th convolutional layer, $\delta$ is the activation function, $X$ is the audio signal (the input to the first layer), $w_i$ is the weight of the convolutional layer, and $b_i$ is its bias.
步骤206,将多个样本特征融合生成样本融合特征,并确定样本融合特征的预测概率。In step 206, a plurality of sample features are fused to generate a sample fusion feature, and a prediction probability of the sample fusion feature is determined.
本实施例中,将多个样本特征进行融合处理,并合并为一维特征,以生成样本融合特征。In this embodiment, multiple sample features are fused and merged into one-dimensional features to generate sample fusion features.
As an example, taking the case where the multiple scales include a first scale, a second scale, and a third scale, the first sample feature, the second sample feature, and the third sample feature are fused and merged into a one-dimensional feature, and this one-dimensional feature is used as the sample fused feature.
In one embodiment of the present disclosure, the classification model includes a fully connected layer and a noise detection function, and the noise detection function may be a softmax function. Optionally, the predicted probability of the sample fused feature is calculated as:

$P(z_k) = \dfrac{e^{z_k}}{\sum_{j} e^{z_j}}$

where $z_k$ denotes the k-th value of the fully connected layer, $z_1$ corresponds to the audio sample vector containing noise, and $z_0$ corresponds to the audio sample vector without noise.
步骤208,根据预测概率和扫频信号的标注值训练分类模型。 Step 208, training a classification model according to the predicted probability and the labeled value of the frequency sweep signal.
In this embodiment, each frequency sweep signal corresponds to a label value; for example, a noise-containing frequency sweep signal has a label value of 1 and a noise-free frequency sweep signal has a label value of 0. A loss value is computed from a preset loss function, the predicted probability, and the label value, and the processing parameters of the model are updated by back-propagation until the model converges and its accuracy exceeds a preset value, so that the classification model can accurately predict whether an audio signal contains noise.
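A minimal PyTorch training-loop sketch under the following assumptions: cross-entropy stands in for the unspecified preset loss, Adam is the optimizer, the model returns raw two-class logits (softmax is applied only at inference), and all hyperparameter values are placeholders.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs: int = 50, lr: float = 1e-3, target_acc: float = 0.98):
    """Train until convergence or until accuracy exceeds a preset value."""
    criterion = nn.CrossEntropyLoss()        # expects raw logits, i.e. the fc output before softmax
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        correct, total = 0, 0
        for sweep, label in loader:          # label: 1 = contains noise, 0 = no noise
            optimizer.zero_grad()
            logits = model(sweep)
            loss = criterion(logits, label)
            loss.backward()                  # back-propagate the loss
            optimizer.step()                 # update the model parameters
            correct += (logits.argmax(dim=1) == label).sum().item()
            total += label.numel()
        if correct / total >= target_acc:    # stop once accuracy exceeds the preset value
            break
    return model
```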
Optionally, a multi-scale end-to-end convolutional neural network can be built, including convolutional layers, a fully connected layer, and the noise detection function, and trained on the training set; once the model has converged and its accuracy exceeds the preset value, the trained convolutional neural network model is saved. This model is used to determine whether an input audio signal contains noise.
需要说明的是,模型的输入也可以是微型扬声器播放的音频信号的其他声学特征,如频谱、对数梅尔谱等。It should be noted that the input of the model can also be other acoustic features of the audio signal played by the micro-speaker, such as frequency spectrum, logarithmic mel spectrum, etc.
In the embodiments of the present disclosure, fused feature information at different time-domain scales is extracted from noise-containing and noise-free frequency sweep signals, and the model is trained with the predicted probability of the sample fused feature and the label value of the frequency sweep signal, so that the model can accurately judge whether an audio signal contains noise. Further, applying the pre-trained model to noise detection of the audio signal from a loudspeaker improves both the processing efficiency and the accuracy of noise detection.
In one embodiment of the present disclosure, because the collected noise-containing and noise-free frequency sweep signals played by micro-speakers are usually severely imbalanced in number, the training set can be constructed as follows: using a nearest-neighbor algorithm, compute several nearest neighbors for each noise-containing frequency sweep signal; perform random linear interpolation between any two of those neighbors to generate a simulated noise-containing frequency sweep signal; repeat these steps until the number of noise-containing frequency sweep signals plus the number of simulated noise-containing frequency sweep signals equals the number of noise-free frequency sweep signals.
As an example, the collected noise-containing and noise-free frequency sweep signals played by micro-speakers (i.e., defective and good units in the production-line test) number 90 and 3600 respectively. Because this ratio is severely imbalanced, the present disclosure processes the 90 defective samples to generate simulated defective samples equal in number to the good ones, as follows: use the nearest-neighbor algorithm to compute the 5 nearest neighbors of each defective sample; randomly pick 2 defective samples from the 5 neighbors and perform random linear interpolation; construct new simulated defective samples and combine them with the original data to produce a new training set.
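A sketch of this oversampling step (a SMOTE-style procedure) using scikit-learn's nearest-neighbor search; the function name, the random-number handling, and the array layout (one sweep per row) are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def oversample_defective(defective: np.ndarray, n_needed: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Generate synthetic noise-containing sweeps by random linear interpolation between neighbors."""
    rng = np.random.default_rng(seed)
    nn_index = NearestNeighbors(n_neighbors=k + 1).fit(defective)
    _, idx = nn_index.kneighbors(defective)      # idx[:, 0] is the sample itself, the rest are its k neighbors
    synthetic = []
    while len(synthetic) < n_needed:
        i = rng.integers(len(defective))
        a, b = rng.choice(idx[i, 1:], size=2, replace=False)   # pick two of the k neighbors
        lam = rng.random()                                     # random interpolation weight in [0, 1)
        synthetic.append(defective[a] + lam * (defective[b] - defective[a]))
    return np.stack(synthetic)
```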
The new data set then contains 7200 samples (3600 good and 3600 defective). The data set is split 4:1 into a training set and a validation set, one-hot encoding is used to label good and defective samples as "1" and "0" respectively, and the training set is used to train the multi-scale end-to-end convolutional neural network described above. The model parameters are updated over repeated iterations until convergence, the trained model is output, and the validation set is used for evaluation and to output the detection results. As an example, FIG. 3 is a flowchart of a noise detection scenario provided by an embodiment of the present disclosure.
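The labeling and 4:1 split could be sketched as follows. This sketch follows the worked example above and labels good units "1" and defective (noise-containing) units "0"; note that the earlier training description uses the opposite assignment, and either works as long as it is applied consistently. The helper name and the shuffling are assumptions.

```python
import numpy as np

def build_dataset(good: np.ndarray, defective: np.ndarray, seed: int = 0):
    """Label good/defective sweeps, shuffle, and split 4:1 into training and validation sets."""
    X = np.concatenate([good, defective])
    # Good products labeled 1, defective products labeled 0, as in the worked example;
    # integer class indices are sufficient for a cross-entropy style loss.
    y = np.concatenate([np.ones(len(good), dtype=np.int64), np.zeros(len(defective), dtype=np.int64)])
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    X, y = X[order], y[order]
    split = int(0.8 * len(X))                 # 4:1 split into training and validation sets
    return (X[:split], y[:split]), (X[split:], y[split:])
```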
It should be understood that, although the steps in the flowcharts of FIGS. 1-3 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on their execution, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 1-3 may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
在一个实施例中,如图4所示,提供了一种扬声器的杂音检测装置,包括:采集模块41,提取模块42,生成模块43,第一确定模块44,第二确定模块45。In one embodiment, as shown in FIG. 4 , a loudspeaker noise detection device is provided, including: an acquisition module 41 , an extraction module 42 , a generation module 43 , a first determination module 44 , and a second determination module 45 .
其中,采集模块41,配置成采集扬声器中的音频信号。Wherein, the collecting module 41 is configured to collect the audio signal in the speaker.
提取模块42,配置成将音频信号分别在多个尺度上进行卷积,生成与多个尺度对应的多个特征。The extraction module 42 is configured to convolve the audio signal on multiple scales respectively to generate multiple features corresponding to the multiple scales.
生成模块43,配置成根据多个特征生成融合特征,并根据预训练的分类模型确定融合特征的概率。The generating module 43 is configured to generate fusion features according to multiple features, and determine the probability of fusion features according to the pre-trained classification model.
第一确定模块44,配置成若概率大于等于阈值,则确定音频信号包含杂音。The first determination module 44 is configured to determine that the audio signal contains noise if the probability is greater than or equal to a threshold.
第二确定模块45,配置成若概率小于所述阈值,则确定音频信号不包含杂音。The second determination module 45 is configured to determine that the audio signal does not contain noise if the probability is less than the threshold.
In one embodiment, the apparatus further includes a training module configured to: acquire noise-containing and noise-free frequency sweep signals; convolve the frequency sweep signals on multiple scales to generate multiple sample features corresponding to the multiple scales; fuse the multiple sample features into a sample fused feature and determine its predicted probability; and train the classification model according to the predicted probability and the label values of the frequency sweep signals.
In one embodiment, the multiple scales include a first scale, a second scale, and a third scale, where the first scale is smaller than the second scale and the second scale is smaller than the third scale. The extraction module 42 is specifically configured to convolve the audio signal on the first scale, the second scale, and the third scale respectively, so as to extract a first feature, a second feature, and a third feature of the audio signal in the time domain corresponding to the first scale, the second scale, and the third scale.
In one embodiment, the apparatus further includes a preprocessing module configured to: normalize the audio signal from the loudspeaker; determine the original value and the target value of the sampling rate of the audio signal; and reduce the sampling rate of the normalized audio signal from the original value to the target value.
In one embodiment, the probability of the fused feature is calculated as:

$P(z_k) = \dfrac{e^{z_k}}{\sum_{j} e^{z_j}}$

where $z_k$ denotes the k-th value of the fully connected layer, $z_1$ corresponds to the audio sample vector containing noise, and $z_0$ corresponds to the audio sample vector without noise.
In one embodiment, the convolutional layer is computed as:

$X_i = \delta(w_i \ast X_{i-1} + b_i)$

where $i$ indexes the i-th convolutional layer, $\delta$ is the activation function, $X$ is the audio signal, $w_i$ is the weight of the convolutional layer, and $b_i$ is its bias.
在一个实施例中,该装置还包括:获取模块,配置成采用最邻近算法,对每一包含杂音的扫频信号计算得到多个近邻;对多个近邻中的任意两个进行随机线性插值,以生成包含杂音的仿真扫频信号;重复上述步骤,直至包含杂音的扫频信号的数量与包含杂音的仿真扫频信号的数量之和,与不包含杂音的扫频信号的数量相等。In one embodiment, the device further includes: an acquisition module configured to use a nearest neighbor algorithm to calculate multiple neighbors for each frequency sweep signal containing noise; perform random linear interpolation on any two of the multiple neighbors, to generate a simulated frequency sweep signal containing noise; repeat the above steps until the sum of the number of frequency sweep signals containing noise and the number of simulated frequency sweep signals containing noise is equal to the number of frequency sweep signals not containing noise.
关于扬声器的杂音检测装置的具体限定可以参见上文中对于扬声器的杂音检测方法的限定,具备执行方法相应的功能模块和有益效果,在此不再赘述。上述扬声器的杂音检测装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于电子设备中的一个或多个处理器中,也可以以软件形式存储于电子设备中的存储器中,以便于一个或多个处理器调用执行以上各个模块对应的操作。For the specific limitations of the loudspeaker noise detection device, please refer to the above definition of the loudspeaker noise detection method, which has the corresponding functional modules and beneficial effects of the implementation method, and will not be repeated here. Each module in the noise detection device for the above-mentioned loudspeaker can be fully or partially realized by software, hardware and a combination thereof. The above-mentioned modules can be embedded in hardware or independent of one or more processors in the electronic device, and can also be stored in the memory of the electronic device in the form of software, so that one or more processors can call and execute the above The operation corresponding to the module.
在一个实施例中,提供了一种电子设备,该电子设备可以是终端,其内部结构图可以如图5所示。该电子设备包括通过系统总线连接的一个或多个处理器、存储器、通信接口、显示屏和输入装置。其中,该电子设备的一个或多个处理器用于提供计算和控制能力。该电子设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质 存储有操作系统和计算机可读指令。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该电子设备的通信接口用于与外部的终端进行有线或无线方式的通信,无线方式可通过WIFI、运营商网络、近场通信(NFC)或其他技术实现。该计算机可读指令被一个或多个处理器执行时以实现一种扬声器的杂音检测方法。该电子设备的显示屏可以是液晶显示屏或者电子墨水显示屏,该电子设备的输入装置可以是显示屏上覆盖的触摸层,也可以是电子设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。In one embodiment, an electronic device is provided. The electronic device may be a terminal, and its internal structure may be as shown in FIG. 5 . The electronic device includes one or more processors, memory, communication interface, display screen, and input device connected by a system bus. Wherein, one or more processors of the electronic device are used to provide calculation and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer readable instructions. The internal memory provides an environment for the execution of the operating system and computer readable instructions in the non-volatile storage medium. The communication interface of the electronic device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be realized through WIFI, an operator network, near field communication (NFC) or other technologies. When the computer-readable instructions are executed by one or more processors, a loudspeaker noise detection method is realized. The display screen of the electronic device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic device may be a touch layer covered on the display screen, or a button, a trackball or a touch pad provided on the housing of the electronic device , and can also be an external keyboard, touchpad, or mouse.
Those skilled in the art can understand that the structure shown in FIG. 5 is only a block diagram of part of the structure related to the solution of the present disclosure and does not limit the electronic device to which the solution is applied; a specific electronic device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
在一个实施例中,本公开提供的扬声器的杂音检测装置可以实现为一种计算机可读指令的形式,计算机可读指令可在如图5所示的电子设备上运行。电子设备的存储器中可存储组成该扬声器的杂音检测装置的各个程序模块,比如,图4所示的采集模块41,提取模块42,生成模块43,第一确定模块44,第二确定模块45。各个程序模块构成的计算机可读指令使得一个或多个处理器执行本说明书中描述的本公开各个实施例的扬声器的杂音检测方法中的步骤。In one embodiment, the device for detecting noise of a speaker provided by the present disclosure may be implemented in the form of a computer readable instruction, and the computer readable instruction may be run on the electronic device as shown in FIG. 5 . The memory of the electronic device can store each program module forming the noise detection device of the loudspeaker, such as the collection module 41 shown in FIG. 4, the extraction module 42, the generation module 43, the first determination module 44, and the second determination module 45. The computer-readable instructions constituted by various program modules enable one or more processors to execute the steps in the method for detecting noise of a loudspeaker according to various embodiments of the present disclosure described in this specification.
例如,图5所示的电子设备可以通过如图4所示的扬声器的杂音检测装置中的采集模块41执行采集扬声器中的音频信号。电子设备可以通过提取模块42执行将音频信号分别在多个尺度上进行卷积,生成与多个尺度对应的多个特征。电子设备可以通过生成模块43执行根据多个特征生成融合特征,并根据预训练的分类模型确定融合特征的概率。电子设备可以通过第一确定模块44执行若概率大于等于阈值,则确定音频信号包含杂音。电子设备可以通过第二确定模块45执行若概率小于阈值,则确定音频信号不包含杂音For example, the electronic device shown in FIG. 5 may collect audio signals in the speaker through the acquisition module 41 in the noise detection device for the speaker as shown in FIG. 4 . The electronic device may use the extraction module 42 to perform convolution on the audio signal on multiple scales respectively to generate multiple features corresponding to the multiple scales. The electronic device can generate fusion features according to multiple features through the generation module 43, and determine the probability of fusion features according to the pre-trained classification model. The electronic device may determine that the audio signal contains noise if the probability is greater than or equal to a threshold through the first determination module 44 . The electronic device can perform if the probability is less than the threshold through the second determination module 45, then determine that the audio signal does not contain noise
In one embodiment, an electronic device is provided, including a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, implement the following steps: collecting the audio signal from the loudspeaker; convolving the audio signal on multiple scales to generate multiple features corresponding to the multiple scales; generating a fused feature from the multiple features and determining the probability of the fused feature with a pre-trained classification model; determining that the audio signal contains noise if the probability is greater than or equal to a threshold; and determining that the audio signal does not contain noise if the probability is less than the threshold.
在一个实施例中,该一个或多个处理器执行计算机可读指令时还可以实现以下步骤:获取包含杂音的扫频信号和不包含杂音的扫频信号;将扫频信号分别在多个尺度上进行卷积,生成与多个尺度对应的多个样本特征;将多个样本特征融合生成样本融合特征,并确定样本融合特征的预测概率;根据预测概率和扫频信号的标注值训练分类模型。In one embodiment, when the one or more processors execute the computer-readable instructions, the following steps can also be implemented: acquiring a frequency sweep signal containing noise and a frequency sweep signal not containing noise; Convolve on multiple scales to generate multiple sample features corresponding to multiple scales; multiple sample features are fused to generate sample fusion features, and the prediction probability of sample fusion features is determined; the classification model is trained according to the prediction probability and the labeled value of the frequency sweep signal .
在一个实施例中,该一个或多个处理器执行计算机可读指令时还可以实现以下步骤:将音频信号分别在第一尺度、第二尺度和第三尺度上进行卷积,以提取音频信号在时域上与第一尺度、第二尺度和第三尺度对应的第一特征、第二特征和第三特征。In one embodiment, when the one or more processors execute the computer-readable instructions, the following steps can also be implemented: respectively convolving the audio signal on the first scale, the second scale and the third scale to extract the audio signal The first feature, the second feature and the third feature corresponding to the first scale, the second scale and the third scale in time domain.
在一个实施例中,该一个或多个处理器执行计算机可读指令时还可以实现以下步骤:对扬声器中的音频信号进行归一化处理;确定音频信号采样率的原始值和目标值,并将归一化处理后的音频信号的采样率由原始值降低至目标值。In one embodiment, when the one or more processors execute the computer-readable instructions, the following steps can also be implemented: performing normalization processing on the audio signal in the speaker; determining the original value and the target value of the sampling rate of the audio signal, and The sampling rate of the normalized audio signal is reduced from an original value to a target value.
In one embodiment, when executing the computer-readable instructions, the one or more processors further implement the following steps: using a nearest-neighbor algorithm, computing several nearest neighbors for each noise-containing frequency sweep signal; performing random linear interpolation between any two of those neighbors to generate a simulated noise-containing frequency sweep signal; and repeating these steps until the number of noise-containing frequency sweep signals plus the number of simulated noise-containing frequency sweep signals equals the number of noise-free frequency sweep signals.
According to the electronic device of the embodiments of the present disclosure, when the computer-readable instructions are executed by the one or more processors, the following steps are implemented: the audio signal from the loudspeaker is convolved on multiple scales to generate multiple features corresponding to the multiple scales; a fused feature is generated from the multiple features, its probability is determined with a pre-trained classification model, and whether the audio signal contains noise is determined from that probability. In this way, fused feature information of the audio signal at different time-domain scales can be mined, the presence of noise is judged by computing a probability, the accuracy of feature detection is improved, computational complexity at the test end is reduced, and both the efficiency and the accuracy of noise detection are improved.
In one embodiment, one or more non-volatile computer-readable storage media storing computer-readable instructions are provided; when executed by one or more processors, the computer-readable instructions implement the following steps: collecting the audio signal from the loudspeaker; convolving the audio signal on multiple scales to generate multiple features corresponding to the multiple scales; generating a fused feature from the multiple features and determining the probability of the fused feature with a pre-trained classification model; determining that the audio signal contains noise if the probability is greater than or equal to a threshold; and determining that the audio signal does not contain noise if the probability is less than the threshold.
在一个实施例中,计算机可读指令被一个或多个处理器执行时还可以实现以下步骤:获取包含杂音的扫频信号和不包含杂音的扫频信号;将扫频信号分别在多个尺度上进行卷积,生成与多个尺度对应的多个样本特征;将多个样本特征融合生成样本融合特征,并确定样本融合特征的预测概率;根据预测概率和扫频信号的标注值训练分类模型。In one embodiment, when the computer-readable instructions are executed by one or more processors, the following steps can also be implemented: acquiring a frequency sweep signal containing noise and a frequency sweep signal not containing noise; Convolve on multiple scales to generate multiple sample features corresponding to multiple scales; multiple sample features are fused to generate sample fusion features, and the prediction probability of sample fusion features is determined; the classification model is trained according to the prediction probability and the labeled value of the frequency sweep signal .
在一个实施例中,计算机可读指令被一个或多个处理器执行时还可以实现以下步骤:将音频信号分别在第一尺度、第二尺度和第三尺度上进行卷积,以提取音频信号在时域上与第一尺度、第二尺度和第三尺度对应的第一特征、第二特征和第三特征。In one embodiment, when the computer-readable instructions are executed by one or more processors, the following steps can also be implemented: respectively convolving the audio signal on the first scale, the second scale and the third scale to extract the audio signal The first feature, the second feature and the third feature corresponding to the first scale, the second scale and the third scale in time domain.
在一个实施例中,计算机可读指令被一个或多个处理器执行时还可以实现以下步骤:对扬声器中的音频信号进行归一化处理;确定音频信号采样率的原始值和目标值,并将归一化处理后的音频信号的采样率由原始值降低至目标值。In one embodiment, when the computer-readable instructions are executed by one or more processors, the following steps can also be implemented: performing normalization processing on the audio signal in the speaker; determining the original value and the target value of the sampling rate of the audio signal, and The sampling rate of the normalized audio signal is reduced from an original value to a target value.
In one embodiment, when executed by the one or more processors, the computer-readable instructions further implement the following steps: using a nearest-neighbor algorithm, computing several nearest neighbors for each noise-containing frequency sweep signal; performing random linear interpolation between any two of those neighbors to generate a simulated noise-containing frequency sweep signal; and repeating these steps until the number of noise-containing frequency sweep signals plus the number of simulated noise-containing frequency sweep signals equals the number of noise-free frequency sweep signals.
According to the computer-readable storage medium of the embodiments of the present disclosure, when the computer-readable instructions stored thereon are executed by one or more processors, the following steps are implemented: the audio signal from the loudspeaker is convolved on multiple scales to generate multiple features corresponding to the multiple scales; a fused feature is generated from the multiple features, its probability is determined with a pre-trained classification model, and whether the audio signal contains noise is determined from that probability. In this way, fused feature information of the audio signal at different time-domain scales can be mined, the presence of noise is judged by computing a probability, the accuracy of feature detection is improved, computational complexity at the test end is reduced, and both the efficiency and the accuracy of noise detection are improved.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本公开所提供的各实施例中所使用的对存储器、数据库或其它介质的任何引用,均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器(Read-Only Memory,ROM)、磁带、软盘、闪存或光存储器等。易失性存储器可包括随机存取存储器(Random Access Memory,RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,比如静态随机存取存储器(Static Random Access Memory,SRAM)和动态随机存取存储器(Dynamic Random Access Memory,DRAM)等。Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing related hardware through computer-readable instructions, and the computer-readable instructions can be stored in a non-volatile computer In the readable storage medium, the computer-readable instructions may include the processes of the embodiments of the above-mentioned methods when executed. Wherein, any reference to storage, database or other media used in the various embodiments provided by the present disclosure may include at least one of non-volatile and volatile storage. Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory or optical memory, etc. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms such as Static Random Access Memory (SRAM) and Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of these technical features are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope described in this specification.
以上所述实施例仅表达了本公开的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本公开构思的前提下,还可以做出若干变形和改进,这些都属于本公开的保护范围。因此,本公开专利的保护范围应以所附权利要求为准。The above-mentioned embodiments only express several implementation modes of the present disclosure, and the descriptions thereof are relatively specific and detailed, but should not be construed as limiting the scope of the patent for the invention. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present disclosure, and these all belong to the protection scope of the present disclosure. Therefore, the scope of protection of the disclosed patent should be based on the appended claims.
Industrial Applicability
In the loudspeaker detection method provided by the present disclosure, the audio signal from the loudspeaker is convolved on multiple scales to generate multiple features corresponding to the multiple scales; the probability of the fused feature generated from the multiple features is then determined with a pre-trained classification model, and whether the audio signal contains noise is determined from that probability. In this way, fused feature information of the audio signal at different time-domain scales can be mined, the presence of noise is judged by computing a probability, the accuracy of feature detection is improved, computational complexity at the test end is reduced, and both the efficiency and the accuracy of noise detection are improved, so the method has strong industrial applicability.

Claims (15)

  1. 一种扬声器的杂音检测方法,其特征在于,包括:A noise detection method for a loudspeaker, comprising:
    采集扬声器中的音频信号;Collect the audio signal in the speaker;
    将所述音频信号分别在多个尺度上进行卷积,生成与多个尺度对应的多个特征;Convolving the audio signal on a plurality of scales, respectively, to generate a plurality of features corresponding to the plurality of scales;
    根据所述多个特征生成融合特征,并根据预训练的分类模型确定所述融合特征的概率;Generate fusion features according to the plurality of features, and determine the probability of the fusion features according to the pre-trained classification model;
    在所述概率大于等于阈值的条件下,确定所述音频信号包含杂音;Under the condition that the probability is greater than or equal to a threshold, determine that the audio signal contains noise;
    在所述概率小于所述阈值的条件下,确定所述音频信号不包含杂音。On the condition that the probability is smaller than the threshold, it is determined that the audio signal does not contain noise.
  2. 如权利要求1所述的方法,其中,还包括:The method of claim 1, further comprising:
    获取包含杂音的扫频信号和不包含杂音的扫频信号;Obtain a frequency sweep signal containing noise and a frequency sweep signal not containing noise;
    将扫频信号分别在多个尺度上进行卷积,生成与多个尺度对应的多个样本特征;Convolve the frequency sweep signal on multiple scales to generate multiple sample features corresponding to multiple scales;
    将所述多个样本特征融合生成样本融合特征,并确定所述样本融合特征的预测概率;Fusing the multiple sample features to generate a sample fusion feature, and determining the predicted probability of the sample fusion feature;
    根据所述预测概率和扫频信号的标注值训练分类模型。A classification model is trained according to the predicted probability and the labeled value of the frequency sweep signal.
  3. 如权利要求1或2所述的方法,其中,所述多个尺度包括第一尺度、第二尺度和第三尺度,且所述第一尺度小于所述第二尺度,所述第二尺度小于所述第三尺度,所述将所述音频信号分别在多个尺度上进行卷积,生成与多个尺度对应的多个特征包括:The method according to claim 1 or 2, wherein the plurality of scales includes a first scale, a second scale and a third scale, and the first scale is smaller than the second scale, and the second scale is smaller than In the third scale, the step of convolving the audio signal on multiple scales to generate multiple features corresponding to multiple scales includes:
    Convolving the audio signal on the first scale, the second scale, and the third scale respectively, so as to extract a first feature, a second feature, and a third feature of the audio signal in the time domain corresponding to the first scale, the second scale, and the third scale.
  4. 如权利要求1或2所述的方法,其中,还包括:The method according to claim 1 or 2, further comprising:
    对所述扬声器中的音频信号进行归一化处理;performing normalization processing on the audio signal in the loudspeaker;
    确定音频信号采样率的原始值和目标值,并将归一化处理后的音 频信号的采样率由所述原始值降低至所述目标值。Determine the original value and the target value of the sampling rate of the audio signal, and reduce the sampling rate of the normalized audio signal from the original value to the target value.
  5. 如权利要求1或2所述的方法,其中,所述融合特征的概率通过如下方式计算得到:The method according to claim 1 or 2, wherein the probability of the fusion feature is calculated as follows:
    $P(z_k) = \dfrac{e^{z_k}}{\sum_{j} e^{z_j}}$
    where $z_k$ denotes the k-th value of the fully connected layer, $z_1$ corresponds to the audio sample vector containing noise, and $z_0$ corresponds to the audio sample vector without noise.
  6. 如权利要求3所述的方法,其中,卷积层的计算公式如下:The method according to claim 3, wherein the calculation formula of the convolutional layer is as follows:
    $X_i = \delta(w_i \ast X_{i-1} + b_i)$
    where $i$ indexes the i-th convolutional layer, $\delta$ is the activation function, $X$ is the audio signal, $w_i$ is the weight of the convolutional layer, and $b_i$ is its bias.
  7. 如权利要求2所述的方法,其中,还包括:The method of claim 2, further comprising:
    采用最邻近算法,对每一包含杂音的扫频信号计算得到多个近邻;Use the nearest neighbor algorithm to calculate multiple neighbors for each frequency sweep signal containing noise;
    对所述多个近邻中的任意两个进行随机线性插值,以生成包含杂音的仿真扫频信号;performing random linear interpolation on any two of the plurality of neighbors to generate a simulated frequency sweep signal containing noise;
    重复上述步骤,直至所述包含杂音的扫频信号的数量与所述包含杂音的仿真扫频信号的数量之和,与所述不包含杂音的扫频信号的数量相等。The above steps are repeated until the sum of the number of noise-containing frequency sweep signals and the number of noise-containing simulated frequency sweep signals is equal to the number of noise-free frequency sweep signals.
  8. 一种扬声器的杂音检测装置,其中,包括:A noise detection device for a loudspeaker, comprising:
    采集模块,配置成采集扬声器中的音频信号;A collection module configured to collect audio signals in the loudspeaker;
    提取模块,配置成将所述音频信号分别在多个尺度上进行卷积,生成与多个尺度对应的多个特征;An extraction module configured to convolve the audio signal on multiple scales to generate multiple features corresponding to multiple scales;
    生成模块,配置成根据所述多个特征生成融合特征,并根据预训练的分类模型确定所述融合特征的概率;A generating module configured to generate a fusion feature according to the plurality of features, and determine the probability of the fusion feature according to a pre-trained classification model;
    第一确定模块,配置成若所述概率大于等于阈值,则确定所述音频信号包含杂音;A first determination module configured to determine that the audio signal contains noise if the probability is greater than or equal to a threshold;
    第二确定模块,配置成若所述概率小于所述阈值,则确定所述音频信号不包含杂音。The second determination module is configured to determine that the audio signal does not contain noise if the probability is less than the threshold.
  9. 如权利要求8所述的装置,其中,还包括:The apparatus of claim 8, further comprising:
    训练模块,配置成获取包含杂音的扫频信号和不包含杂音的扫频信号;将扫频信号分别在多个尺度上进行卷积,生成与多个尺度对应的多个样本特征;将所述多个样本特征融合生成样本融合特征,并确定所述样本融合特征的预测概率;根据所述预测概率和扫频信号的标注值训练分类模型。The training module is configured to obtain a frequency sweep signal containing noise and a frequency sweep signal not containing noise; the frequency sweep signal is respectively convoluted on multiple scales to generate multiple sample features corresponding to multiple scales; the described A plurality of sample features are fused to generate a sample fusion feature, and a prediction probability of the sample fusion feature is determined; a classification model is trained according to the prediction probability and the labeled value of the frequency sweep signal.
  10. 如权利要求8或9所述的装置,其中,所述多个尺度包括第一尺度、第二尺度和第三尺度,且所述第一尺度小于所述第二尺度,所述第二尺度小于所述第三尺度;The apparatus according to claim 8 or 9, wherein the plurality of scales comprises a first scale, a second scale and a third scale, and the first scale is smaller than the second scale, and the second scale is smaller than said third dimension;
    The extraction module is specifically configured to convolve the audio signal on the first scale, the second scale, and the third scale respectively, so as to extract a first feature, a second feature, and a third feature of the audio signal in the time domain corresponding to the first scale, the second scale, and the third scale.
  11. 如权利要求8或9所述的装置,其中,还包括:The device according to claim 8 or 9, further comprising:
    预处理模块,配置成对扬声器中的音频信号进行归一化处理;确定音频信号采样率的原始值和目标值,并将归一化处理后的音频信号的采样率由原始值降低至目标值。A preprocessing module configured to normalize the audio signal in the speaker; determine the original value and target value of the audio signal sampling rate, and reduce the normalized sampling rate of the audio signal from the original value to the target value .
  12. 如权利要求8或9所述的装置,其中,所述融合特征的概率通过如下方式计算得到:The device according to claim 8 or 9, wherein the probability of the fusion feature is calculated as follows:
    $P(z_k) = \dfrac{e^{z_k}}{\sum_{j} e^{z_j}}$
    where $z_k$ denotes the k-th value of the fully connected layer, $z_1$ corresponds to the audio sample vector containing noise, and $z_0$ corresponds to the audio sample vector without noise.
  13. 如权利要求9所述的装置,其中,还包括:The device of claim 9, further comprising:
    获取模块,配置成采用最邻近算法,对每一包含杂音的扫频信号计算得到多个近邻;对所述多个近邻中的任意两个进行随机线性插值,以生成包含杂音的仿真扫频信号;重复上述步骤,直至所述包含杂音的扫频信号的数量与所述包含杂音的仿真扫频信号的数量之和,与所 述不包含杂音的扫频信号的数量相等。The acquisition module is configured to use the nearest neighbor algorithm to calculate multiple neighbors for each frequency sweep signal containing noise; perform random linear interpolation on any two of the multiple neighbors to generate a simulated frequency sweep signal containing noise ; Repeat the above steps until the sum of the number of noise-containing frequency sweep signals and the number of noise-containing simulated frequency sweep signals is equal to the number of noise-free frequency sweep signals.
  14. 一种电子设备,包括存储器和一个或多个处理器,所述存储器存储有计算机可读指令,所述计算机可读指令被所述一个或多个处理器执行时,使得所述一个或多个处理器执行权利要求1至7中任一项所述的扬声器的杂音检测方法的步骤。An electronic device comprising a memory and one or more processors, the memory storing computer readable instructions that, when executed by the one or more processors, cause the one or more The processor executes the steps of the loudspeaker noise detection method according to any one of claims 1 to 7.
  15. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the loudspeaker noise detection method according to any one of claims 1 to 7.
PCT/CN2021/115791 2021-07-22 2021-08-31 Method and apparatus for detecting noise of loudspeaker, and electronic device and storage medium WO2023000444A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110833182.0 2021-07-22
CN202110833182.0A CN113766405A (en) 2021-07-22 2021-07-22 Method and device for detecting noise of loudspeaker, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023000444A1 true WO2023000444A1 (en) 2023-01-26

Family

ID=78787853

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/115791 WO2023000444A1 (en) 2021-07-22 2021-08-31 Method and apparatus for detecting noise of loudspeaker, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN113766405A (en)
WO (1) WO2023000444A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627891A (en) * 2022-05-16 2022-06-14 山东捷瑞信息技术产业研究院有限公司 Moving coil loudspeaker quality detection method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346102A (en) * 2018-09-18 2019-02-15 腾讯音乐娱乐科技(深圳)有限公司 Detection method, device and the storage medium of audio beginning sonic boom
CN109711281A (en) * 2018-12-10 2019-05-03 复旦大学 A kind of pedestrian based on deep learning identifies again identifies fusion method with feature
US20200194009A1 (en) * 2018-12-14 2020-06-18 Samsung Electronics Co., Ltd. Display apparatus and method of controlling the same
CN112199548A (en) * 2020-09-28 2021-01-08 华南理工大学 Music audio classification method based on convolution cyclic neural network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222218B (en) * 2019-04-18 2021-07-09 杭州电子科技大学 Image retrieval method based on multi-scale NetVLAD and depth hash
CN110414323A (en) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Mood detection method, device, electronic equipment and storage medium
CN112232258A (en) * 2020-10-27 2021-01-15 腾讯科技(深圳)有限公司 Information processing method and device and computer readable storage medium
CN112966778B (en) * 2021-03-29 2024-03-15 上海冰鉴信息科技有限公司 Data processing method and device for unbalanced sample data

Also Published As

Publication number Publication date
CN113766405A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
Li et al. Glance and gaze: A collaborative learning framework for single-channel speech enhancement
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
US9666183B2 (en) Deep neural net based filter prediction for audio event classification and extraction
WO2019101123A1 (en) Voice activity detection method, related device, and apparatus
US9640194B1 (en) Noise suppression for speech processing based on machine-learning mask estimation
US9131295B2 (en) Multi-microphone audio source separation based on combined statistical angle distributions
WO2016095218A1 (en) Speaker identification using spatial information
US11514925B2 (en) Using a predictive model to automatically enhance audio having various audio quality issues
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN111312273A (en) Reverberation elimination method, apparatus, computer device and storage medium
Zhang et al. Multi-channel multi-frame ADL-MVDR for target speech separation
CN111369976A (en) Method and device for testing voice recognition equipment
JP6306528B2 (en) Acoustic model learning support device and acoustic model learning support method
CN111868823A (en) Sound source separation method, device and equipment
CN112185410A (en) Audio processing method and device
Shankar et al. Efficient two-microphone speech enhancement using basic recurrent neural network cell for hearing and hearing aids
WO2023000444A1 (en) Method and apparatus for detecting noise of loudspeaker, and electronic device and storage medium
KR102194194B1 (en) Method, apparatus for blind signal seperating and electronic device
CN112951263B (en) Speech enhancement method, apparatus, device and storage medium
CN116705045B (en) Echo cancellation method, apparatus, computer device and storage medium
Wu et al. Self-supervised speech denoising using only noisy audio signals
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
CN114283833A (en) Speech enhancement model training method, speech enhancement method, related device and medium
US20190385590A1 (en) Generating device, generating method, and non-transitory computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21950685

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE