WO2023000444A1 - Loudspeaker noise detection method and apparatus, electronic device, and storage medium - Google Patents



Publication number
WO2023000444A1
WO2023000444A1 · PCT/CN2021/115791 · CN2021115791W
Authority
WO
WIPO (PCT)
Prior art keywords: scale, noise, audio signal, frequency sweep, probability
Application number: PCT/CN2021/115791
Other languages: English (en), Chinese (zh)
Inventor: 宋广伟
Original assignee: 上海闻泰信息技术有限公司
Application filed by 上海闻泰信息技术有限公司
Publication of WO2023000444A1



Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R29/00: Monitoring arrangements; Testing arrangements
    • H04R29/001: Monitoring arrangements; Testing arrangements for loudspeakers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination

Definitions

  • The disclosure relates to a loudspeaker noise detection method and device, an electronic device, and a storage medium.
  • Micro speakers are widely used with the spread of smart hardware such as smart speakers, tablet computers, and mobile phones.
  • Noise detection technology has become a key factor in determining production quality, and the accuracy and efficiency requirements on noise detection methods are becoming more and more stringent.
  • In the related art, noise detection is carried out through a hardware detection system: an audio signal generator is used to excite the micro-speaker, and the sound pressure signal is obtained through an artificial ear.
  • The Rub value at each frequency point is extracted as a feature, and detection and recognition are performed through empirical threshold adjudication.
  • The tester manually selects, based on testing multiple signals, the judgment threshold for the presence of noise at each frequency point. For noise detection scenarios requiring high precision, the judgment threshold is difficult to set, and the detection accuracy needs to be improved.
  • A loudspeaker noise detection method and device, an electronic device, and a storage medium are provided.
  • a noise detection method for a loudspeaker comprising:
  • the method also includes:
  • a classification model is trained according to the predicted probability and the labeled value of the frequency sweep signal.
  • The plurality of scales includes a first scale, a second scale, and a third scale, where the first scale is smaller than the second scale and the second scale is smaller than the third scale; convolving the audio signal on multiple scales to generate multiple features corresponding to the multiple scales includes:
  • the method also includes:
  • An original value and a target value of the sampling rate of the audio signal are determined, and the sampling rate of the normalized audio signal is reduced from the original value to the target value.
  • The probability of the fusion feature is calculated as follows:
  • p_k = e^{z_k} / (e^{z_0} + e^{z_1}), where z_k represents the k-th value of the fully connected layer, z_1 being produced by the weight vector corresponding to audio samples containing noise and z_0 by the weight vector corresponding to audio samples not containing noise.
  • The calculation formula of the convolutional layer is as follows:
  • y_i = f(w_i ⊛ X + b_i), where i represents the i-th convolutional layer, f is the activation function, ⊛ denotes the convolution operation, X represents the audio signal, w_i represents the weight of the convolutional layer, and b_i represents the bias of the convolutional layer.
  • the method also includes:
  • a noise detection device for a loudspeaker comprising:
  • a collection module configured to collect audio signals in the loudspeaker
  • An extraction module configured to convolve the audio signal on multiple scales to generate multiple features corresponding to multiple scales
  • a generating module configured to generate a fusion feature according to the plurality of features, and determine the probability of the fusion feature according to a pre-trained classification model
  • a first determination module configured to determine that the audio signal contains noise if the probability is greater than or equal to a threshold
  • the second determination module is configured to determine that the audio signal does not contain noise if the probability is less than the threshold.
  • The device further includes a training module configured to: acquire a frequency sweep signal containing noise and a frequency sweep signal not containing noise; respectively perform convolution on the frequency sweep signal on multiple scales to generate multiple sample features corresponding to the multiple scales; fuse the multiple sample features to generate a sample fusion feature and determine its prediction probability; and train the classification model according to the prediction probability and the labeled value of the frequency sweep signal.
  • The multiple scales include a first scale, a second scale, and a third scale, where the first scale is smaller than the second scale and the second scale is smaller than the third scale; the audio signal is convolved on the first scale, the second scale, and the third scale respectively, to extract the first feature, the second feature, and the third feature of the audio signal in the time domain corresponding to the first scale, the second scale, and the third scale.
  • The device further includes a preprocessing module configured to: normalize the audio signal in the speaker; determine the original value and the target value of the sampling rate of the audio signal; and reduce the sampling rate of the normalized audio signal from the original value to the target value.
  • The probability of the fusion feature is calculated as follows:
  • p_k = e^{z_k} / (e^{z_0} + e^{z_1}), where z_k represents the k-th value of the fully connected layer, z_1 being produced by the weight vector corresponding to audio samples containing noise and z_0 by the weight vector corresponding to audio samples not containing noise.
  • The calculation formula of the convolutional layer is as follows:
  • y_i = f(w_i ⊛ X + b_i), where i represents the i-th convolutional layer, f is the activation function, ⊛ denotes the convolution operation, X represents the audio signal, w_i represents the weight of the convolutional layer, and b_i represents the bias of the convolutional layer.
  • The device further includes an acquisition module configured to: use the nearest neighbor algorithm to calculate multiple neighbors for each frequency sweep signal containing noise; perform random linear interpolation between any two of the multiple neighbors to generate simulated frequency sweep signals containing noise; and repeat the above steps until the sum of the number of frequency sweep signals containing noise and the number of simulated frequency sweep signals containing noise equals the number of frequency sweep signals not containing noise.
  • An electronic device is provided, including a memory and one or more processors, the memory storing computer-readable instructions, and the one or more processors executing the computer-readable instructions to implement the steps of the noise detection method for a loudspeaker provided by any embodiment of the present disclosure.
  • One or more non-transitory computer-readable storage media are provided, having stored thereon computer-readable instructions that, when executed by one or more processors, implement the steps of the loudspeaker noise detection method provided by any embodiment of the present disclosure.
  • FIG. 1 is a schematic flowchart of a noise detection method for a speaker provided by one or more embodiments of the present disclosure.
  • FIG. 2 is a schematic flowchart of model training provided by one or more embodiments of the present disclosure.
  • FIG. 3 is a flowchart of noise detection provided by one or more embodiments of the present disclosure.
  • FIG. 4 is a schematic structural diagram of a loudspeaker noise detection device provided by one or more embodiments of the present disclosure.
  • FIG. 5 is a schematic structural diagram of an electronic device provided by one or more embodiments of the present disclosure.
  • Words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present disclosure shall not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of such words is intended to present related concepts in a concrete manner. In addition, in the description of the embodiments of the present disclosure, unless otherwise specified, "plurality" means two or more.
  • a noise detection method for a speaker is provided.
  • This embodiment is illustrated with the method applied to a terminal. It can be understood that the method can also be applied to a server, or to a system including a terminal and a server, where it is realized through interaction between the terminal and the server.
  • the method includes the following steps:
  • Step 102: collect the audio signal in the speaker.
  • micro speakers are audio output devices of smart hardware such as smart speakers, tablet computers, and mobile phones.
  • A micro speaker is typically composed of a frame, magnet, pole pieces, diaphragm, voice coil, front cover, wiring board, and damping cloth.
  • The audio signal can be a standard frequency sweep signal from 20 Hz to 20 kHz, or speech, music, etc.
  • The audio signal in the speaker can be preprocessed. Specifically, the audio signal can be normalized, mapping it to (0, 1) so that all test data are on the same scale, which reduces the amount of calculation and avoids abnormal test results caused by inconsistent scales. Furthermore, an original value and a target value of the sampling rate of the audio signal are determined, and the sampling rate of the normalized audio signal is reduced from the original value to the target value, where the original value is greater than the target value.
  • For example, the original value of the sampling rate of the audio signal is 48 kHz.
  • The audio signal is reduced from the original sampling rate to 5 kHz by down-sampling, because the noise of the micro-speaker is usually low-frequency noise and the hearing range of concern for the human ear is 20 Hz to 2000 Hz.
  • In this way the sampling rate of the audio signal is kept consistent; while the retained band remains within the hearing range of the human ear, the amount of data is reduced and detection efficiency is improved.
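  • The preprocessing just described can be sketched as follows, under the example rates given in the text (48 kHz down to 5 kHz); the plain decimation step is an assumption, since the text does not name a resampling method:

```python
def preprocess(samples, orig_rate=48_000, target_rate=5_000):
    """Min-max normalize the audio into (0, 1), then down-sample.

    The rates are the examples from the text; plain decimation is an
    assumption (a production pipeline would low-pass filter first to
    avoid aliasing).
    """
    lo, hi = min(samples), max(samples)
    normalized = [(s - lo) / (hi - lo) for s in samples]
    factor = orig_rate // target_rate  # 48_000 // 5_000 == 9
    return normalized[::factor]       # keep every 9th sample
```

  • Every test signal then lives on the same (0, 1) scale at a consistent, reduced sampling rate.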
  • Step 104: convolve the audio signal on multiple scales to generate multiple features corresponding to the multiple scales.
  • Multiple scales of different sizes can be set, and the audio signal is convolved on each scale to generate multiple features, so as to extract feature information of the audio signal at different scales in the time domain.
  • the scale refers to the convolution kernel scale.
  • the multiple scales include a first scale, a second scale and a third scale, and the first scale is smaller than the second scale, and the second scale is smaller than the third scale.
  • Convolving the audio signal on multiple scales to generate multiple features corresponding to the multiple scales includes: respectively convolving the audio signal on the first scale, the second scale, and the third scale, to extract the first feature, the second feature, and the third feature of the audio signal in the time domain corresponding respectively to the first scale, the second scale, and the third scale.
  • The above convolution operation is implemented through a convolution layer, and the calculation formula of the convolution layer is as follows:
  • y_i = f(w_i ⊛ X + b_i), where i represents the i-th convolutional layer, f is the activation function, ⊛ denotes the convolution operation, X represents the audio signal, w_i represents the weight of the convolutional layer, and b_i represents the bias of the convolutional layer.
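  • A minimal sketch of the convolutional-layer computation described above, together with the multi-scale feature extraction of step 104; ReLU is assumed for the activation f, and the all-ones kernels are illustrative stand-ins for trained weights:

```python
def conv1d(x, w, b, stride=1):
    """One 1-D convolutional layer: y = f(w conv x + b), ReLU as f."""
    k = len(w)
    out = []
    for start in range(0, len(x) - k + 1, stride):
        z = sum(wi * xi for wi, xi in zip(w, x[start:start + k])) + b
        out.append(max(z, 0.0))  # ReLU activation
    return out

def multi_scale_features(x, kernels):
    """Convolve the same signal with kernels of scales K1 < K2 < K3."""
    return [conv1d(x, w, 0.0) for w in kernels]
```

  • Each returned feature sequence shortens as the kernel scale (and, if used, the stride) grows, which is how time-domain information at different scales is captured.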
  • Step 106: generate a fusion feature according to the multiple features, and determine the probability of the fusion feature according to the pre-trained classification model.
  • The first feature, the second feature, and the third feature are fused into a one-dimensional feature, and this one-dimensional feature serves as the fusion feature.
  • the classification model is obtained by training the audio signal containing noise and the audio signal not containing noise as training samples.
  • the input of the pre-trained classification model is the fusion feature, and the output is the probability of the fusion feature.
  • The probability of the fusion feature indicates the probability that the audio signal contains noise.
  • the classification model includes a fully connected layer and a noise detection function, and the noise detection function may be a softmax function.
  • The probability of the fusion feature is calculated as follows:
  • p_k = e^{z_k} / (e^{z_0} + e^{z_1}), where z_k represents the k-th value of the fully connected layer, z_1 being produced by the weight vector corresponding to audio samples containing noise and z_0 by the weight vector corresponding to audio samples not containing noise.
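  • Step 106 can be sketched as follows. Concatenation is used for fusion, and index 1 is taken as the "contains noise" class to match the label value 1 used in training; the fully connected weights here are illustrative, not trained ones:

```python
import math

def fusion_probability(per_scale_features, weights, biases):
    """Fuse per-scale features into one 1-D feature, apply a fully
    connected layer, then softmax: p_k = e^{z_k} / (e^{z_0} + e^{z_1})."""
    fused = [v for feature in per_scale_features for v in feature]
    # Fully connected layer: one z_k per class (0 = clean, 1 = noisy).
    z = [sum(w * x for w, x in zip(row, fused)) + b
         for row, b in zip(weights, biases)]
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(zk - m) for zk in z]
    return exps[1] / sum(exps)  # probability the audio contains noise
```

  • The returned value is then compared against the threshold in step 108 (0.5 in the example given).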
  • Step 108: if the probability is greater than or equal to the threshold, determine that the audio signal contains noise; if the probability is less than the threshold, determine that the audio signal does not contain noise.
  • Whether the audio signal contains noise may be determined according to the probability of the fusion feature.
  • the threshold is 0.5, and if the probability output by the classification model is greater than or equal to 0.5, it is determined that the audio signal contains noise, otherwise it is determined that the audio signal does not contain noise.
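  • The decision rule of step 108 reduces to a single comparison; 0.5 is the example threshold from the text:

```python
def detect_noise(probability, threshold=0.5):
    """Step 108: the audio is judged noisy when the classification
    probability meets or exceeds the threshold."""
    return probability >= threshold
```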
  • The audio signal in the loudspeaker is convolved on multiple scales respectively to generate multiple features corresponding to the multiple scales; a fusion feature is then generated from the multiple features, its probability is determined by the pre-trained classification model, and whether the audio signal contains noise is judged according to that probability. In this way, fusion feature information of the audio signal at different time-domain scales can be mined, and the presence of noise judged by a computed probability.
  • The hardware detection scheme in the related art imposes high requirements on the accuracy of the audio signal generator, its judgment threshold is difficult to set, and detection takes a long time. Compared with that solution, this disclosure reduces judgment errors caused by improper threshold selection and improves both noise detection efficiency and accuracy.
  • The training process will be described below.
  • FIG. 2 is a schematic flowchart of model training provided by an embodiment of the present disclosure; as shown in FIG. 2, it includes the following steps:
  • Step 202: acquire a frequency sweep signal containing noise and a frequency sweep signal not containing noise.
  • Frequency sweep signals with and without noise played by the micro-speaker can be collected, so that the collected frequency sweep signals can be used as a training set and input to the constructed convolutional network with kernels of different scales and step sizes.
  • Preprocessing may be performed on the frequency sweep signals with and without noise. Specifically, normalization may be performed on the frequency sweep signal, mapping it to (0, 1). Furthermore, an original value and a target value of the sampling rate of the frequency sweep signal are determined, and the sampling rate of the normalized frequency sweep signal is reduced from the original value to the target value, where the original value is greater than the target value.
  • Step 204: convolve the frequency sweep signal on multiple scales to generate multiple sample features corresponding to the multiple scales.
  • Multiple scales of different sizes can be set, and the frequency sweep signal (both the sweep containing noise and the sweep not containing noise) is convolved on each scale to generate multiple features, so as to extract feature information of the signal at different scales in the time domain.
  • the scale refers to the convolution kernel scale.
  • The multiple scales include a first scale K1, a second scale K2, and a third scale K3, where K1 is smaller than K2 and K2 is smaller than K3.
  • The frequency sweep signal is convolved on each scale to extract the first sample feature, the second sample feature, and the third sample feature corresponding respectively to the first scale K1, the second scale K2, and the third scale K3.
  • multiple step sizes can also be set for the convolution kernel.
  • The multiple step sizes include a first step size S1, a second step size S2, and a third step size S3, where S1 is smaller than S2 and S2 is smaller than S3.
  • In this way, time-domain feature information of the frequency sweep signal can be extracted over windows from small to large in length.
  • The above convolution operation is implemented through a convolution layer, and the calculation formula of the convolution layer is as follows:
  • y_i = f(w_i ⊛ X + b_i), where i represents the i-th convolutional layer, f is the activation function, ⊛ denotes the convolution operation, X represents the audio signal, w_i represents the weight of the convolutional layer, and b_i represents the bias of the convolutional layer.
  • Step 206: fuse the multiple sample features to generate a sample fusion feature, and determine the prediction probability of the sample fusion feature.
  • sample features are fused and merged into one-dimensional features to generate sample fusion features.
  • For example, the first sample feature, the second sample feature, and the third sample feature are fused and merged into a one-dimensional feature, and this one-dimensional feature is used as the sample fusion feature.
  • the classification model includes a fully connected layer and a noise detection function, and the noise detection function may be a softmax function.
  • The predicted probability of the sample fusion feature is calculated as follows:
  • p_k = e^{z_k} / (e^{z_0} + e^{z_1}), where z_k represents the k-th value of the fully connected layer, z_1 being produced by the weight vector corresponding to audio samples containing noise and z_0 by the weight vector corresponding to audio samples not containing noise.
  • Step 208: train the classification model according to the predicted probability and the labeled value of the frequency sweep signal.
  • each frequency sweep signal corresponds to a label value
  • a frequency sweep signal containing noise corresponds to a label value of 1
  • a frequency sweep signal that does not contain noise corresponds to a label value of 0.
  • A multi-scale end-to-end convolutional neural network can be constructed, including a convolutional layer, a fully connected layer, and a noise detection function. The network is trained on the training set; when the model has converged and its accuracy exceeds a preset value, the trained convolutional neural network model is saved. This model is used to determine whether an input audio signal contains noise.
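  • The final training stage of step 208 (the fully connected layer plus softmax fitted to the 0/1 label values with cross-entropy) can be sketched as follows. The patent trains the multi-scale convolutional layers end to end as well, which this sketch omits; the learning rate and epoch count are arbitrary choices:

```python
import math

def train_softmax_classifier(features, labels, lr=0.5, epochs=300):
    """Fit weights w and biases b of a 2-class fully connected +
    softmax head by stochastic gradient descent on cross-entropy."""
    dim = len(features[0])
    w = [[0.0] * dim for _ in range(2)]  # one weight row per class
    b = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = [sum(wi * xi for wi, xi in zip(row, x)) + bk
                 for row, bk in zip(w, b)]
            m = max(z)
            exps = [math.exp(zk - m) for zk in z]
            p = [e / sum(exps) for e in exps]
            for k in range(2):
                grad = p[k] - (1.0 if k == y else 0.0)  # dCE/dz_k
                for j in range(dim):
                    w[k][j] -= lr * grad * x[j]
                b[k] -= lr * grad
    return w, b
```

  • On linearly separable sample fusion features this converges to a clean split; the real model keeps training until convergence and a preset accuracy, as the text describes.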
  • The input of the model can also be other acoustic features of the audio signal played by the micro-speaker, such as the spectrum or the log-mel spectrum.
  • Fusion feature information at different time-domain scales is extracted from the frequency sweep signals with and without noise, and the model is trained using the predicted probability of the sample fusion feature together with the labeled value of the frequency sweep signal.
  • the pre-trained model is applied to the audio signal noise detection in the loudspeaker, which improves the noise detection processing efficiency and detection accuracy.
  • The training set can be constructed in the following manner: using the nearest neighbor algorithm, calculate multiple neighbors for each frequency sweep signal containing noise; perform random linear interpolation between any two of the multiple neighbors to generate a simulated frequency sweep signal containing noise; repeat these steps until the sum of the number of frequency sweep signals containing noise and the number of simulated frequency sweep signals containing noise equals the number of frequency sweep signals not containing noise.
  • For example, the numbers of collected frequency sweep signals with noise and without noise played by micro-speakers (that is, non-conforming products and good products in the production-line test) are 90 and 3600 respectively.
  • This disclosure processes the 90 non-conforming samples to generate simulated non-conforming samples until their number matches that of the good products: two samples are randomly selected from the nearest neighbors of a non-conforming sample for random linear interpolation; a new simulated non-conforming sample is constructed, and the new samples are combined with the original data to generate a new training set.
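  • The augmentation above can be sketched as a SMOTE-style loop; the neighbour count k and the Euclidean distance are assumptions, as the text specifies only a "nearest neighbor algorithm" and "random linear interpolation":

```python
import random

def oversample_noisy(noisy, n_target, k=3, seed=0):
    """Generate simulated noisy sweep signals until original plus
    synthetic noisy samples number n_target (the clean-sample count)."""
    rng = random.Random(seed)
    synthetic = []
    while len(noisy) + len(synthetic) < n_target:
        base = rng.choice(noisy)
        # k nearest neighbours of the chosen sample (squared Euclidean).
        nearest = sorted(noisy, key=lambda s: sum((u - v) ** 2
                                                  for u, v in zip(s, base)))[:k]
        a, b = rng.sample(nearest, 2)
        t = rng.random()  # random linear interpolation between a and b
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic
```

  • With 90 noisy and 3600 clean sweeps as in the example, n_target = 3600 yields 3510 synthetic noisy samples to balance the training set.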
  • FIG. 3 is a flowchart of a noise detection scenario provided by an embodiment of the present disclosure.
  • A loudspeaker noise detection device is provided, including: a collection module 41, an extraction module 42, a generation module 43, a first determination module 44, and a second determination module 45.
  • The collection module 41 is configured to collect the audio signal in the speaker.
  • the extraction module 42 is configured to convolve the audio signal on multiple scales respectively to generate multiple features corresponding to the multiple scales.
  • the generating module 43 is configured to generate fusion features according to multiple features, and determine the probability of fusion features according to the pre-trained classification model.
  • the first determination module 44 is configured to determine that the audio signal contains noise if the probability is greater than or equal to a threshold.
  • the second determination module 45 is configured to determine that the audio signal does not contain noise if the probability is less than the threshold.
  • The device further includes a training module configured to: acquire a frequency sweep signal containing noise and a frequency sweep signal not containing noise; respectively perform convolution on the frequency sweep signal on multiple scales to generate multiple sample features corresponding to the multiple scales; fuse the multiple sample features to generate a sample fusion feature and determine its prediction probability; and train the classification model according to the prediction probability and the labeled value of the frequency sweep signal.
  • the multiple scales include a first scale, a second scale, and a third scale
  • the first scale is smaller than the second scale
  • the second scale is smaller than the third scale
  • The extraction module 42 is specifically configured to: perform convolution on the audio signal on the first scale, the second scale, and the third scale respectively, to extract the first feature, the second feature, and the third feature of the audio signal in the time domain corresponding to the first scale, the second scale, and the third scale.
  • The device further includes a preprocessing module configured to: perform normalization processing on the audio signal in the speaker; determine the original value and target value of the sampling rate of the audio signal; and reduce the sampling rate of the normalized audio signal from the original value to the target value.
  • The probability of the fused feature is calculated as follows:
  • p_k = e^{z_k} / (e^{z_0} + e^{z_1}), where z_k represents the k-th value of the fully connected layer, z_1 being produced by the weight vector corresponding to audio samples containing noise and z_0 by the weight vector corresponding to audio samples not containing noise.
  • The calculation formula of the convolutional layer is as follows:
  • y_i = f(w_i ⊛ X + b_i), where i represents the i-th convolutional layer, f is the activation function, ⊛ denotes the convolution operation, X represents the audio signal, w_i represents the weight of the convolutional layer, and b_i represents the bias of the convolutional layer.
  • The device further includes an acquisition module configured to: use a nearest neighbor algorithm to calculate multiple neighbors for each frequency sweep signal containing noise; perform random linear interpolation between any two of the multiple neighbors to generate simulated frequency sweep signals containing noise; and repeat the above steps until the sum of the number of frequency sweep signals containing noise and the number of simulated frequency sweep signals containing noise equals the number of frequency sweep signals not containing noise.
  • each module in the noise detection device for the above-mentioned loudspeaker can be fully or partially realized by software, hardware and a combination thereof.
  • the above-mentioned modules can be embedded in hardware or independent of one or more processors in the electronic device, and can also be stored in the memory of the electronic device in the form of software, so that one or more processors can call and execute the above The operation corresponding to the module.
  • an electronic device is provided.
  • the electronic device may be a terminal, and its internal structure may be as shown in FIG. 5 .
  • the electronic device includes one or more processors, memory, communication interface, display screen, and input device connected by a system bus. Wherein, one or more processors of the electronic device are used to provide calculation and control capabilities.
  • the memory of the electronic device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer readable instructions.
  • the internal memory provides an environment for the execution of the operating system and computer readable instructions in the non-volatile storage medium.
  • The communication interface of the electronic device is used to communicate with an external terminal in a wired or wireless manner; the wireless manner can be realized through wireless fidelity (WIFI), an operator network, near field communication (NFC), or other technologies.
  • When the computer-readable instructions are executed by one or more processors, a loudspeaker noise detection method is realized.
  • the display screen of the electronic device may be a liquid crystal display screen or an electronic ink display screen
  • The input device of the electronic device may be a touch layer covering the display screen, a button, trackball, or touch pad provided on the housing of the electronic device, or an external keyboard, touchpad, or mouse.
  • FIG. 5 is only a block diagram of a partial structure related to the disclosed solution and does not constitute a limitation on the electronic device to which the disclosed solution is applied.
  • A specific electronic device may include more or fewer components than shown in the figures, combine some components, or have a different arrangement of components.
  • The device for detecting speaker noise may be implemented in the form of computer-readable instructions, and the computer-readable instructions may be run on the electronic device shown in FIG. 5.
  • The memory of the electronic device can store the program modules that form the speaker noise detection device, such as the collection module 41, the extraction module 42, the generation module 43, the first determination module 44, and the second determination module 45 shown in FIG. 4.
  • The computer-readable instructions constituted by these program modules enable the one or more processors to execute the steps of the speaker noise detection method according to the various embodiments of the present disclosure described in this specification.
  • The electronic device shown in FIG. 5 may collect the audio signal in the speaker through the collection module 41 of the speaker noise detection device shown in FIG. 4.
  • Through the extraction module 42, the electronic device may convolve the audio signal on multiple scales to generate multiple features corresponding to the multiple scales.
  • Through the generation module 43, the electronic device can generate a fusion feature from the multiple features and determine the probability of the fusion feature according to a pre-trained classification model.
  • Through the first determination module 44, the electronic device may determine that the audio signal contains noise if the probability is greater than or equal to a threshold.
  • Through the second determination module 45, the electronic device may determine that the audio signal does not contain noise if the probability is less than the threshold.
  • An electronic device is provided, including a memory and one or more processors. The memory stores computer-readable instructions, and the one or more processors execute the computer-readable instructions to implement the following steps: collecting the audio signal in the speaker; convolving the audio signal on multiple scales to generate multiple features corresponding to the multiple scales; generating a fusion feature from the multiple features and determining the probability of the fusion feature according to a pre-trained classification model; if the probability is greater than or equal to a threshold, determining that the audio signal contains noise; and if the probability is less than the threshold, determining that the audio signal does not contain noise.
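The detection steps above can be sketched in Python. This is only an illustrative stand-in, not the patented implementation: the kernel widths, the pooled summary statistics, and the logistic scoring function are assumptions in place of the trained multi-scale convolutional network and classification model.

```python
import numpy as np

def multi_scale_features(audio, kernel_sizes=(3, 5, 7)):
    """Convolve the 1-D audio signal at several scales (kernel widths)
    and pool each result into a small fixed-size feature vector."""
    feats = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k                      # averaging kernel for this scale
        conv = np.convolve(audio, kernel, mode="valid")
        # summary statistics stand in for a learned feature map
        feats.append(np.array([conv.mean(), conv.std(), np.abs(conv).max()]))
    return feats

def fuse_and_classify(feats, weights, bias, threshold=0.5):
    """Concatenate the per-scale features and score the fused vector with a
    logistic (sigmoid) model; probability >= threshold means 'contains noise'."""
    fused = np.concatenate(feats)
    prob = 1.0 / (1.0 + np.exp(-(fused @ weights + bias)))
    return prob, bool(prob >= threshold)
```

With the three kernel sizes above, the fused vector has nine entries, so `weights` must have length nine; `threshold` plays the role of the probability threshold described in the claims.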
  • The following steps can also be implemented: acquiring frequency sweep signals that contain noise and frequency sweep signals that do not contain noise; convolving the frequency sweep signals on multiple scales to generate multiple sample features corresponding to the multiple scales; fusing the multiple sample features to generate a sample fusion feature and determining the prediction probability of the sample fusion feature; and training the classification model according to the prediction probability and the labeled values of the frequency sweep signals.
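The training step in this bullet can be illustrated with a minimal sketch. A plain logistic model fitted by gradient descent on the cross-entropy loss stands in for the disclosure's classification model; the learning rate and epoch count are arbitrary assumptions.

```python
import numpy as np

def train_classifier(features, labels, lr=0.1, epochs=200):
    """Fit a logistic classifier on fused sample features against labeled
    values (1 = the sweep signal contains noise, 0 = it does not)."""
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        logits = features @ w + b
        p = 1.0 / (1.0 + np.exp(-logits))    # predicted probability per sample
        grad = p - labels                    # gradient of cross-entropy w.r.t. logits
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b
```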
  • The following steps can also be implemented: convolving the audio signal on a first scale, a second scale, and a third scale, respectively, to extract a first feature, a second feature, and a third feature of the audio signal in the time domain corresponding to the first, second, and third scales.
  • The following steps can also be implemented: performing normalization processing on the audio signal in the speaker; and determining the original value and the target value of the sampling rate of the audio signal and reducing the sampling rate of the normalized audio signal from the original value to the target value.
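A minimal sketch of this preprocessing step, assuming peak normalization and integer-factor decimation (the disclosure does not fix the exact normalization scheme, and a production pipeline would low-pass filter before decimating to avoid aliasing):

```python
import numpy as np

def normalize(audio):
    """Scale the signal into [-1, 1] by its peak absolute amplitude."""
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio

def downsample(audio, original_rate, target_rate):
    """Reduce the sampling rate from its original value to a target value
    by keeping every n-th sample (requires an integer rate ratio)."""
    assert original_rate % target_rate == 0, "target rate must divide original rate"
    step = original_rate // target_rate
    return audio[::step]
```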
  • The following steps can also be implemented: using the nearest neighbor algorithm, calculating multiple neighbors for each frequency sweep signal containing noise; performing random linear interpolation between the signal and any one of its neighbors to generate a simulated frequency sweep signal containing noise; and repeating these steps until the sum of the number of frequency sweep signals containing noise and the number of simulated frequency sweep signals containing noise equals the number of frequency sweep signals that do not contain noise.
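The augmentation loop in this bullet closely resembles the SMOTE oversampling technique; the sketch below is written under that assumption, with the neighbour count `k`, the Euclidean distance metric, and the random seed as illustrative choices rather than values from the disclosure.

```python
import numpy as np

def oversample_noisy_sweeps(noisy_samples, n_needed, k=3, seed=0):
    """Generate simulated noise-containing sweep signals: pick a minority
    sample, find its k nearest neighbours (Euclidean distance), and
    interpolate linearly between the sample and a randomly chosen neighbour."""
    rng = np.random.default_rng(seed)
    synthetic = []
    while len(synthetic) < n_needed:
        i = rng.integers(len(noisy_samples))
        x = noisy_samples[i]
        dists = np.linalg.norm(noisy_samples - x, axis=1)
        dists[i] = np.inf                      # exclude the sample itself
        neighbours = np.argsort(dists)[:k]
        nb = noisy_samples[rng.choice(neighbours)]
        t = rng.random()                       # interpolation factor in [0, 1)
        synthetic.append(x + t * (nb - x))
    return np.stack(synthetic)
```

Here `n_needed` would be chosen so that the original plus simulated noisy sweeps match the count of noise-free sweeps, as the repetition condition in the bullet describes.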
  • When the computer-readable instructions are executed by the one or more processors, the following steps are implemented: the audio signal in the speaker is convolved on multiple scales to generate multiple features corresponding to the multiple scales; a fusion feature is then generated from the multiple features; the probability of the fusion feature is determined according to the pre-trained classification model; and whether the audio signal contains noise is determined according to the probability. In this way, the fusion feature information of the audio signal on different scales in the time domain can be mined and the presence of noise judged by calculating a probability, which improves the accuracy of feature detection, reduces the computational complexity on the test side, and improves the efficiency and accuracy of noise detection.
  • One or more non-transitory computer-readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the following steps are implemented: collecting the audio signal in the speaker; convolving the audio signal on multiple scales to generate multiple features corresponding to the multiple scales; generating a fusion feature from the multiple features and determining the probability of the fusion feature according to a pre-trained classification model; if the probability is greater than or equal to a threshold, determining that the audio signal contains noise; and if the probability is less than the threshold, determining that the audio signal does not contain noise.
  • The following steps can also be implemented: acquiring frequency sweep signals that contain noise and frequency sweep signals that do not contain noise; convolving the frequency sweep signals on multiple scales to generate multiple sample features corresponding to the multiple scales; fusing the multiple sample features to generate a sample fusion feature and determining the prediction probability of the sample fusion feature; and training the classification model according to the prediction probability and the labeled values of the frequency sweep signals.
  • The following steps can also be implemented: convolving the audio signal on a first scale, a second scale, and a third scale, respectively, to extract a first feature, a second feature, and a third feature of the audio signal in the time domain corresponding to the first, second, and third scales.
  • The following steps can also be implemented: performing normalization processing on the audio signal in the speaker; and determining the original value and the target value of the sampling rate of the audio signal and reducing the sampling rate of the normalized audio signal from the original value to the target value.
  • The following steps can also be implemented: using the nearest neighbor algorithm, calculating multiple neighbors for each frequency sweep signal containing noise; performing random linear interpolation between the signal and any one of its neighbors to generate a simulated frequency sweep signal containing noise; and repeating these steps until the sum of the number of frequency sweep signals containing noise and the number of simulated frequency sweep signals containing noise equals the number of frequency sweep signals that do not contain noise.
  • When the computer-readable instructions are executed by the one or more processors, the following steps are implemented: the audio signal in the speaker is convolved on multiple scales to generate multiple features corresponding to the multiple scales; a fusion feature is then generated from the multiple features; the probability of the fusion feature is determined according to the pre-trained classification model; and whether the audio signal contains noise is determined according to the probability. In this way, the fusion feature information of the audio signal on different scales in the time domain can be mined and the presence of noise judged by calculating a probability, which improves the detection accuracy of the feature information, reduces the computational complexity on the test side, and improves the noise detection processing efficiency and detection accuracy.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory or optical memory, etc.
  • Volatile memory can include random access memory (RAM), such as static RAM (SRAM) or dynamic RAM (DRAM), or external cache memory.
  • The loudspeaker noise detection method provided by the present disclosure convolves the audio signal in the speaker on multiple scales to generate multiple features corresponding to the multiple scales, then determines, according to a pre-trained classification model, the probability of the fusion feature generated from the multiple features, and determines whether the audio signal contains noise according to that probability. The fusion feature information of the audio signal on different scales in the time domain can therefore be mined and noise detected by calculating a probability; this improves the accuracy of feature detection, reduces computational complexity on the test side, improves noise detection processing efficiency and detection accuracy, and has strong industrial applicability.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present invention, which falls within the technical field of artificial intelligence, relates to a method and apparatus for detecting loudspeaker noise, an electronic device, and a storage medium. The method comprises: collecting an audio signal in a loudspeaker (102); convolving the audio signal on a plurality of scales so as to generate a plurality of features corresponding to the plurality of scales (104); generating a fused feature from the plurality of features, and determining a probability of the fused feature according to a pre-trained classification model (106); and, if the probability is greater than or equal to a threshold value, determining that the audio signal contains noise, and if the probability is less than the threshold value, determining that the audio signal does not contain noise (108). By using this method, noise detection accuracy and processing efficiency can be improved.
PCT/CN2021/115791 2021-07-22 2021-08-31 Method and apparatus for detecting loudspeaker noise, electronic device and storage medium WO2023000444A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110833182.0A 2021-07-22 2021-07-22 Speaker noise detection method and apparatus, electronic device and storage medium
CN202110833182.0 2021-07-22

Publications (1)

Publication Number Publication Date
WO2023000444A1 true WO2023000444A1 (fr) 2023-01-26

Family

ID=78787853

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/115791 WO2023000444A1 (fr) 2021-07-22 2021-08-31 Method and apparatus for detecting loudspeaker noise, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN113766405A (fr)
WO (1) WO2023000444A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627891A (zh) * 2022-05-16 2022-06-14 山东捷瑞信息技术产业研究院有限公司 Method and device for quality detection of a moving-coil loudspeaker

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346102A (zh) * 2018-09-18 2019-02-15 腾讯音乐娱乐科技(深圳)有限公司 Method, device and storage medium for detecting pop noise at the beginning of audio
CN109711281A (zh) * 2018-12-10 2019-05-03 复旦大学 Deep-learning-based method fusing pedestrian re-identification and feature recognition
US20200194009A1 (en) * 2018-12-14 2020-06-18 Samsung Electronics Co., Ltd. Display apparatus and method of controlling the same
CN112199548A (zh) * 2020-09-28 2021-01-08 华南理工大学 Music audio classification method based on convolutional recurrent neural network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222218B (zh) * 2019-04-18 2021-07-09 杭州电子科技大学 Image retrieval method based on multi-scale NetVLAD and deep hashing
CN110414323A (zh) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Emotion detection method and apparatus, electronic device and storage medium
CN112232258A (zh) * 2020-10-27 2021-01-15 腾讯科技(深圳)有限公司 Information processing method and apparatus, and computer-readable storage medium
CN112966778B (zh) * 2021-03-29 2024-03-15 上海冰鉴信息科技有限公司 Data processing method and device for imbalanced sample data

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346102A (zh) * 2018-09-18 2019-02-15 腾讯音乐娱乐科技(深圳)有限公司 Method, device and storage medium for detecting pop noise at the beginning of audio
CN109711281A (zh) * 2018-12-10 2019-05-03 复旦大学 Deep-learning-based method fusing pedestrian re-identification and feature recognition
US20200194009A1 (en) * 2018-12-14 2020-06-18 Samsung Electronics Co., Ltd. Display apparatus and method of controlling the same
CN112199548A (zh) * 2020-09-28 2021-01-08 华南理工大学 Music audio classification method based on convolutional recurrent neural network

Also Published As

Publication number Publication date
CN113766405A (zh) 2021-12-07

Similar Documents

Publication Publication Date Title
Li et al. Glance and gaze: A collaborative learning framework for single-channel speech enhancement
CN110600017B (zh) Speech processing model training method, speech recognition method, system and apparatus
US9666183B2 (en) Deep neural net based filter prediction for audio event classification and extraction
US9640194B1 (en) Noise suppression for speech processing based on machine-learning mask estimation
US9131295B2 (en) Multi-microphone audio source separation based on combined statistical angle distributions
WO2016095218A1 (fr) Speaker identification using spatial information
US11514925B2 (en) Using a predictive model to automatically enhance audio having various audio quality issues
CN111785288B (zh) Speech enhancement method, apparatus, device and storage medium
CN111312273A (zh) Reverberation cancellation method and apparatus, computer device and storage medium
Zhang et al. Multi-channel multi-frame ADL-MVDR for target speech separation
CN111369976A (zh) Method and test apparatus for testing a speech recognition device
JP6306528B2 (ja) Acoustic model training support apparatus and acoustic model training support method
CN111868823A (zh) Sound source separation method, apparatus and device
CN112185410A (zh) Audio processing method and apparatus
WO2023000444A1 (fr) Method and apparatus for detecting loudspeaker noise, electronic device and storage medium
KR102194194B1 (ko) Method, apparatus and electronic device for blind signal separation
CN112951263B (zh) Speech enhancement method, apparatus, device and storage medium
CN116705045B (zh) Echo cancellation method and apparatus, computer device and storage medium
WO2023226572A1 (fr) Feature representation extraction method and apparatus, device, medium and program product
CN110739006A (zh) Audio processing method and apparatus, storage medium and electronic device
Wu et al. Self-supervised speech denoising using only noisy audio signals
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
CN114283833A (zh) Speech enhancement model training method, speech enhancement method, related device and medium
CN114302301A (zh) Frequency response correction method and related product
CN113555031A (zh) Speech enhancement model training method and apparatus, and speech enhancement method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21950685

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE