CN113963719A - Deep learning-based sound classification method and apparatus, storage medium, and computer - Google Patents

Deep learning-based sound classification method and apparatus, storage medium, and computer

Info

Publication number
CN113963719A
CN113963719A
Authority
CN
China
Prior art keywords
training
window
verification
time
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010700261.XA
Other languages
Chinese (zh)
Inventor
韩旭
钟亘明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongsheng Suzhou Intelligent Technology Co ltd
Original Assignee
Dongsheng Suzhou Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongsheng Suzhou Intelligent Technology Co ltd filed Critical Dongsheng Suzhou Intelligent Technology Co ltd
Priority to CN202010700261.XA
Publication of CN113963719A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention provides a sound classification method and apparatus, a storage medium, and a computer. The sound classification method comprises the following steps: providing a training set, a verification set, and a convolutional neural network model; extracting time-domain features and frequency-domain features from a sample in a windowed manner, and superimposing the extracted time-domain and frequency-domain features to obtain the time-frequency combination features of the sample; inputting the time-frequency combination features obtained from the training samples in the training set into the convolutional neural network model for training, inputting the time-frequency combination features obtained from the verification samples in the verification set into the model for verification, and obtaining a trained and verified convolutional neural network model through multiple rounds of training and verification. The method thus takes frequency-energy variation into account while keeping the overall sound information highly recognizable, and improves classification accuracy.

Description

Deep learning-based sound classification method and apparatus, storage medium, and computer
Technical Field
The present invention relates to the field of deep learning, and in particular, to a method and an apparatus for sound classification based on deep learning, a storage medium, and a computer.
Background
Over the last decade, as the computing power of hardware has improved, neural networks have gradually stood out among machine learning algorithms; in particular, the introduction of the convolutional neural network (CNN) has pushed the accuracy of neural networks in image classification far beyond that of other algorithms. Compared with the flourishing image field, sound classification has not been studied as thoroughly, yet it has many application scenarios in real life, such as judging whether an automobile engine is running with a fault, determining whether a rotating bearing is normal, and checking whether a keyboard keycap is installed correctly. Traditional sound analysis techniques extract features such as energy intensity, zero-crossing rate, and short-time energy, or extract a spectrogram through time-frequency analysis, and then make a judgment using algorithms such as clustering or decision trees.
Therefore, there is a need to provide a new solution to overcome the related problems.
Disclosure of Invention
The invention aims to provide a sound classification method and apparatus, a storage medium, and a computer that combine time-domain and frequency-domain features, so that frequency-energy variation is taken into account while the overall sound information remains highly recognizable, thereby improving classification accuracy.
To achieve this object, according to one aspect of the present invention, a sound classification method is provided, comprising: providing a training set, a verification set, and a convolutional neural network model, wherein the training set comprises a plurality of training samples, the verification set comprises one or more verification samples, and each sample is a segment of labeled sound data; adding a first window to a sample, computing over the sound data in the first window to extract time-domain features, adding a second window to the sound data in the first window, converting the sound data in the second window from the time domain to the frequency domain and then extracting frequency-domain features, and superimposing the extracted time-domain and frequency-domain features to obtain the time-frequency combination features of the sample; and inputting the time-frequency combination features obtained from the training samples in the training set into the convolutional neural network model for training, inputting the time-frequency combination features obtained from the verification samples in the verification set into the model for verification, and obtaining a trained and verified convolutional neural network model through multiple rounds of training and verification.
According to another aspect of the present invention, a sound classification apparatus is provided, comprising: a feature extraction module for adding a first window to each sample, computing over the sound data in the first window to extract time-domain features, adding a second window to the sound data in the first window, converting the sound data in the second window from the time domain to the frequency domain and then extracting frequency-domain features, and superimposing the extracted time-domain and frequency-domain features to obtain the time-frequency combination features of the sample, wherein the data input to the feature extraction module comprise a training set, a verification set, and a test set, the training set comprises a plurality of training samples, the verification set comprises one or more verification samples, the test set comprises one or more test samples, and each sample is a segment of labeled sound data; and a convolutional neural network model configured to receive the time-frequency combination features obtained from the training samples in the training set for training and the time-frequency combination features obtained from the verification samples in the verification set for verification, a trained and verified convolutional neural network model being obtained through multiple rounds of training and verification.
According to another aspect of the present invention, a storage medium is provided that stores program instructions, the program instructions being executable to perform a sound classification method. The sound classification method comprises the following steps: providing a training set, a verification set, and a convolutional neural network model, wherein the training set comprises a plurality of training samples, the verification set comprises one or more verification samples, and each sample is a segment of labeled sound data; adding a first window to a sample, computing over the sound data in the first window to extract time-domain features, adding a second window to the sound data in the first window, converting the sound data in the second window from the time domain to the frequency domain and then extracting frequency-domain features, and superimposing the extracted time-domain and frequency-domain features to obtain the time-frequency combination features of the sample; and inputting the time-frequency combination features obtained from the training samples in the training set into the convolutional neural network model for training, inputting the time-frequency combination features obtained from the verification samples in the verification set into the model for verification, and obtaining a trained and verified convolutional neural network model through multiple rounds of training and verification.
According to another aspect of the present invention, a computer is provided that includes a processor and a memory, the memory storing program instructions that the processor executes to perform a sound classification method. The sound classification method comprises the following steps: providing a training set, a verification set, and a convolutional neural network model, wherein the training set comprises a plurality of training samples, the verification set comprises one or more verification samples, and each sample is a segment of labeled sound data; adding a first window to a sample, computing over the sound data in the first window to extract time-domain features, adding a second window to the sound data in the first window, converting the sound data in the second window from the time domain to the frequency domain and then extracting frequency-domain features, and superimposing the extracted time-domain and frequency-domain features to obtain the time-frequency combination features of the sample; and inputting the time-frequency combination features obtained from the training samples in the training set into the convolutional neural network model for training, inputting the time-frequency combination features obtained from the verification samples in the verification set into the model for verification, and obtaining a trained and verified convolutional neural network model through multiple rounds of training and verification.
Compared with the prior art, the method uses two layers of windows: a larger window for computing the time-domain features, and a smaller window within it for the short-time Fourier transform that yields the frequency-domain features. Combining time-domain and frequency-domain features takes frequency-energy variation into account while keeping the overall sound information highly recognizable. In addition, a structurally optimized convolutional neural network model extracts deep time-frequency features through convolution for classification, which improves noise robustness, tolerance of systematic errors, and classification accuracy.
Drawings
FIG. 1 is a diagram illustrating a time-frequency combination feature of the present invention formed by superposition of frequency-domain features and time-domain features in one embodiment;
FIG. 2 is a flow diagram illustrating a sound classification method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a sound classification apparatus according to an embodiment of the present invention.
Detailed Description
To further explain the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description of the embodiments, structures, features and effects according to the present invention will be given with reference to the accompanying drawings and preferred embodiments.
Fig. 2 is a flow chart of the sound classification method 100 according to an embodiment of the present invention. As shown in fig. 2, the sound classification method 100 is generally divided into a training phase and a usage phase.
As shown in fig. 2, the sound classification method 100 includes the following steps or operations.
Step 110, providing a training set, a validation set, and a Convolutional Neural Network (CNN) model.
The training set includes a plurality of training samples, and the verification set includes one or more verification samples, each sample being a segment of labeled sound data. The labels may include a yes (OK) label and a no (NG) label; the labels may be assigned manually, samples carrying the yes label are considered OK, and samples carrying the no label are considered NG. Of course, there may also be multiple label types, such as a first type, a second type, a third type, and so on, in which case the trained convolutional neural network model classifies an input sample into the first type, the second type, or the third type.
Additionally, a test set may also be provided, the test set including one or more test samples. The test sample is used for testing the convolutional neural network model to determine whether the convolutional neural network model can be formally used. Of course, in some embodiments, the test set may not be provided, depending on the application.
A sample is obtained as follows: an initial sample is provided, and the initial sample is preprocessed to form the sample. Preferably, the sample set (the set formed by the samples) may be expanded by methods such as noise addition, data perturbation, speed perturbation, pitch adjustment, and the like.
Specifically, sound is collected by a recording device; in a scene with high noise, the background sound may also need to be recorded separately as background data. In this way a plurality of initial samples can be obtained. The preprocessing includes: merging data channels, for example merging the left and right channels into one data channel; and/or converting all initial samples to the same sampling rate through resampling; and/or cutting initial samples that are too long and padding those that are too short; and/or performing noise reduction or sound enhancement on the initial samples; and labeling the initial samples. The labeling of the initial samples may be done manually.
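For illustration only, a minimal preprocessing sketch in Python is given below; librosa is assumed to be available, and the 16 kHz target sampling rate, the roughly 1 s target length, and the file-path/label interface are example assumptions, not values fixed by this description.

```python
# Hypothetical preprocessing sketch: merge channels, resample, cut or pad, attach label.
import numpy as np
import librosa

TARGET_SR = 16000        # assumed common sampling rate after resampling
TARGET_LEN = TARGET_SR   # assumed fixed sample length (about 1 s of audio)

def preprocess(path, label):
    # librosa merges the channels to mono and resamples in a single call
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    # cut samples that are too long, pad short samples with zeros
    if len(y) > TARGET_LEN:
        y = y[:TARGET_LEN]
    else:
        y = np.pad(y, (0, TARGET_LEN - len(y)))
    # optional noise reduction or sound enhancement would be applied here
    return y, label      # label: e.g. 1 for OK, 0 for NG, assigned manually
```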
In one embodiment, the set of samples may be randomly divided in a certain proportion to form the training set, the verification set, and the test set. Note that a sample assigned to the training set is called a training sample, one assigned to the verification set a verification sample, and one assigned to the test set a test sample; technically there is no essential difference among them.
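A random split by proportion could be sketched as follows; the 70/15/15 ratio and the fixed random seed are assumed example values.

```python
import random

def split_samples(samples, train_frac=0.7, val_frac=0.15, seed=0):
    # Randomly divide the preprocessed samples into training, verification and test sets.
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = int(len(samples) * train_frac)
    n_val = int(len(samples) * val_frac)
    train_set = samples[:n_train]
    val_set = samples[n_train:n_train + n_val]
    test_set = samples[n_train + n_val:]
    return train_set, val_set, test_set
```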
Step 121: add a first window to a sample, and compute over the sound data in the first window to extract time-domain features.
Specifically, the first window is a rectangular window whose size is a first predetermined duration, for example about 1 s. The time-domain features include one or more of the mean, standard deviation, amplitude, root mean square, maximum point, skewness factor, kurtosis factor, margin factor, and crest factor.
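As an illustration, the windowed time-domain statistics listed above could be computed as in the following sketch; scipy's skewness and kurtosis are used, and the definitions of the dimensionless factors (margin, crest) are common textbook forms that may differ from the inventors' own.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def time_domain_features(frame):
    """Statistics over one first-window frame (e.g. about 1 s of samples)."""
    abs_frame = np.abs(frame)
    rms = np.sqrt(np.mean(frame ** 2))
    peak = np.max(abs_frame)
    # smoothed amplitude used by one common definition of the margin factor
    sra = np.mean(np.sqrt(abs_frame)) ** 2
    return np.array([
        np.mean(frame),            # mean
        np.std(frame),             # standard deviation
        np.mean(abs_frame),        # amplitude (mean absolute value)
        rms,                       # root mean square
        peak,                      # maximum point
        skew(frame),               # skewness factor
        kurtosis(frame),           # kurtosis factor
        peak / (sra + 1e-12),      # margin factor
        peak / (rms + 1e-12),      # crest factor
    ])
```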
Step 122: add a second window to the sound data in the first window, convert the sound data in the second window from the time domain to the frequency domain, and then extract frequency-domain features.
The sound data in the second window are converted from the time domain to the frequency domain through a short-time Fourier transform. The second window is a Hanning window whose size is a second predetermined duration, for example 20 ms; the first predetermined duration is N times the second predetermined duration, where N is an integer greater than or equal to 2, for example N is 50. Specifically, the frequency-domain sound data in the second window are converted to the Mel scale and the logarithm is taken to obtain the frequency-domain features. Transforming the data to the Mel scale gives the convolutional neural network (CNN) model a degree of frequency discrimination similar to that of human hearing.
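A log-Mel computation consistent with this step might look like the sketch below; the 20 ms Hanning window, the non-overlapping hop, and the 64 Mel bands are assumed parameters.

```python
import numpy as np
import librosa

def log_mel_features(frame, sr=16000, win_ms=20, n_mels=64):
    """STFT with a Hanning window, Mel projection, then log, for one first-window frame."""
    win = int(sr * win_ms / 1000)                       # second window, e.g. 20 ms
    mel = librosa.feature.melspectrogram(
        y=frame, sr=sr,
        n_fft=win, win_length=win, hop_length=win,      # non-overlapping second windows
        window="hann", n_mels=n_mels, power=2.0,
    )
    return np.log(mel + 1e-10)                          # shape: (n_mels, number of second windows)
```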
Step 123: superimpose the extracted time-domain features and frequency-domain features to obtain the time-frequency combination features of the sample. As shown in FIG. 1, the horizontal direction is time, the lower part shows the time-domain features, and the upper part shows the frequency-domain features. Since the first window is N times as long as the second window, the upper part of FIG. 1 consists of N frequency-domain feature vectors arranged in time order and aligned in time with the time-domain features, so that the time-frequency combination feature, also called a feature matrix, is formed.
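One simple way to assemble such a feature matrix is to place the log-Mel columns on top and broadcast the per-window time-domain statistics underneath them, reusing the helper functions sketched above; this is only one possible layout, not necessarily the exact arrangement of FIG. 1.

```python
import numpy as np

def time_frequency_feature(frame, sr=16000):
    """Stack frequency-domain features (top) and time-domain features (bottom)."""
    freq = log_mel_features(frame, sr=sr)               # (n_mels, N)
    time = time_domain_features(frame)                  # (n_time_feats,)
    # repeat the per-frame statistics along the time axis so both parts align in time
    time_block = np.repeat(time[:, None], freq.shape[1], axis=1)
    return np.vstack([freq, time_block])                # the "feature matrix"
```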
In this way, the convolutional neural network model can use information from the time and frequency domains simultaneously; and because the features come from different time scales, longer-range temporal dependencies are retained, so the model can mine deep features more effectively.
For ease of understanding, steps 121, 122 and 123 may be collectively referred to as a time-frequency combination feature extraction step, which is used to extract time-frequency combination features from the samples. The time-frequency combination feature can be extracted from samples such as training samples and verification samples through a time-frequency combination feature extraction step, and the step can be applied to training sets, verification sets and test sets.
Step 130: input the time-frequency combination features obtained from the training samples in the training set into the convolutional neural network model for training, input the time-frequency combination features obtained from the verification samples in the verification set into the model for verification, and obtain a trained and verified convolutional neural network model through repeated training and verification.
Preferably, the convolutional neural network model comprises a convolutional layer, a pooling layer, dense blocks, fully connected layers, and a sigmoid function layer connected in sequence; an activation function layer and a normalization (batch-norm) layer are arranged in sequence between the convolutional layer and the pooling layer, an activation function layer and a normalization layer are arranged in sequence between the dense blocks and the fully connected layers, and the sigmoid function layer shrinks the result to between 0 and 1. Specifically, there are two dense blocks, referred to as the first dense block and the second dense block, and two fully connected layers, referred to as the first fully connected layer and the second fully connected layer. The first dense block is connected to the pooling layer; an activation function layer and a normalization layer are arranged in sequence between the first dense block and the second dense block, and again between the second dense block and the first fully connected layer; the second fully connected layer is connected after the first fully connected layer. The normalization layers are used to accelerate training and to highlight distribution differences among the data. A regularization term is added to the loss function so that good generalization ability is retained even over long training, allowing the neural network to learn the features better and achieve higher accuracy. In this convolutional neural network model, dense blocks are used to extract deep features; their strong deep-feature extraction capability yields better features. At the end, the fully connected layers serve as the classifier, and the result is finally shrunk to between 0 and 1 by the sigmoid function layer.
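A rough PyTorch sketch of a network with this layout (convolution, activation, batch normalization, pooling, two dense blocks, two fully connected layers, and a sigmoid output) is given below; the channel counts, kernel sizes, and the internal DenseNet-style structure of the dense blocks are assumptions, since the description does not fix them.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """A small DenseNet-style block: each layer sees the concatenation of all earlier outputs."""
    def __init__(self, in_ch, growth=16, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1),
            ))
            ch += growth
        self.out_channels = ch

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)
        return x

class SoundCNN(nn.Module):
    def __init__(self, n_rows, n_cols):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.act1 = nn.ReLU()
        self.bn1 = nn.BatchNorm2d(32)
        self.pool = nn.MaxPool2d(2)
        self.dense1 = DenseBlock(32)
        c1 = self.dense1.out_channels
        self.mid = nn.Sequential(nn.ReLU(), nn.BatchNorm2d(c1))     # activation + normalization
        self.dense2 = DenseBlock(c1)
        c2 = self.dense2.out_channels
        self.tail = nn.Sequential(nn.ReLU(), nn.BatchNorm2d(c2))    # activation + normalization
        flat = c2 * (n_rows // 2) * (n_cols // 2)
        self.fc1 = nn.Linear(flat, 64)                               # first fully connected layer
        self.fc2 = nn.Linear(64, 1)                                  # second fully connected layer

    def forward(self, x):               # x: (batch, 1, n_rows, n_cols) feature matrices
        x = self.pool(self.bn1(self.act1(self.conv(x))))
        x = self.tail(self.dense2(self.mid(self.dense1(x))))
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        return torch.sigmoid(self.fc2(x))   # result shrunk to between 0 and 1
```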
Parameters are optimized by the back-propagation (BP) algorithm during training; training is stopped when the loss function stabilizes, and the convolutional neural network model is saved.
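The training procedure (back-propagation, a regularization term on the parameters, stopping when the loss no longer changes, then saving the model) could be sketched as follows, assuming the SoundCNN above and data loaders that yield (feature matrix, label) batches; the learning rate, weight decay, and plateau criterion are assumed values.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, max_epochs=200, patience=10):
    # weight_decay plays the role of the regularization term added to the loss
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    loss_fn = nn.BCELoss()                 # binary cross entropy for the sigmoid output
    best, stalled = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:          # back-propagation on each batch
            opt.zero_grad()
            loss = loss_fn(model(x).squeeze(1), y.float())
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x).squeeze(1), y.float()).item()
                           for x, y in val_loader) / max(len(val_loader), 1)
        # stop when the verification loss is stable (no longer improving)
        if val_loss < best - 1e-4:
            best, stalled = val_loss, 0
            torch.save(model.state_dict(), "sound_cnn.pt")   # store the model
        else:
            stalled += 1
            if stalled >= patience:
                break
    return model
```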
Step 140: the time-frequency combination features are extracted from a test sample in the same manner as in step 120 (the time-frequency combination feature extraction formed by steps 121 to 123).
Step 150: the trained and verified convolutional neural network model is used to detect and classify the time-frequency combination features of the test sample.
As shown in FIG. 2, steps 110-130 may be collectively referred to as the training verification phase of the convolutional neural network model, and steps 140 and 150 may be referred to as the testing phase.
Application example:
Bearing-fault detection for the packaging equipment of a known manufacturer. The bearings in the equipment support rotation and reduce friction and are an important component of the machinery. Once a bearing fails it can seriously damage the equipment, so real-time monitoring of the bearings can reduce the losses caused by bearing faults. At present the manufacturer listens to the different bearing positions day after day by ear, using stethoscope-like equipment. Because the space around a bearing is very small, the stethoscope sometimes interferes with the operation of the equipment, so artificial intelligence is chosen to replace manual inspection, enabling real-time detection without harming the equipment or the operator.
The detected categories are: bearing good or bearing faulty, i.e., the "yes" (OK) label indicates a good bearing and the "no" (NG) label indicates a faulty bearing.
Detection difficulty: the bearing is located inside the machine, the background noise is particularly loud, and the bearing is a very small object whose low friction means it produces only a quiet sound. Even after a noise-reduction algorithm, the signal-to-noise ratio is still not very high. How to capture the desired signal within the noise and classify it is the key factor determining the final accuracy. The traditional approach extracts MFCC or log-mel features in the frequency domain and classifies them with a neural network, but when classifying sound segments with high noise energy such methods have low accuracy.
Since the bearing sound is produced by mechanical rotation and is periodic in time, the windowed time-domain features reflect the rotation of the bearing well, while the frequency-domain features better reflect how the energy at different frequencies changes over time. The time-frequency combination features are obtained through a noise-reduction algorithm and time-frequency analysis; these superimposed time-frequency features reflect both the frequency-energy variation and the short-time time-domain energy. Training is carried out with the convolutional neural network, using cross entropy as the loss function; performance on the verification set is good during training and the accuracy is very high. Training is stopped when the loss function stabilizes, producing the final convolutional neural network model.
Finally, detection is carried out on the test set: with the over-detection (false-alarm) rate kept at 0, the miss rate stays below 0.5% and the overall accuracy exceeds 99%.
Thus, two layers of windows are used in the method: a larger window is selected for computing the time-domain features, and a smaller window is used for the short-time Fourier transform that yields the frequency-domain features. Combining time-domain and frequency-domain features takes frequency-energy variation into account while keeping the overall sound information highly recognizable. In addition, a structurally optimized convolutional neural network model extracts deep time-frequency features through convolution for classification, which improves noise robustness, tolerance of systematic errors, and classification accuracy.
According to another aspect of the present invention, the present invention can also be realized as a sound classification apparatus. Fig. 3 is a schematic structural diagram of a sound classification apparatus 300 according to an embodiment of the present invention. The sound classification apparatus 300 includes a feature extraction module 310 and a convolutional neural network model 320.
The feature extraction module 310 adds a first window to each sample, calculates sound data in the first window to extract time-domain features, adds a second window to the sound data in the first window, converts the sound data in the second window from time domain to frequency domain, then extracts frequency-domain features, and superimposes the extracted time-domain features and frequency-domain features to obtain time-frequency combination features of the sample, wherein the data input to the feature extraction module includes a training set, a verification set, and a test set, the training set includes a plurality of training samples, the verification set includes one or more verification samples, the test set includes one or more test samples, and each sample is a segment of labeled sound data.
The convolutional neural network model 320 is configured to receive the time-frequency combination features obtained from the training samples in the training set for training and the time-frequency combination features obtained from the verification samples in the verification set for verification; a trained and verified convolutional neural network model is obtained through continuous training and verification. The time-frequency combination features obtained from the test samples in the test set are then detected using the trained and verified convolutional neural network model 320.
Since the sound classification apparatus 300 is technically identical to the sound classification method 100, the common details are not repeated here.
According to another aspect of the present invention, a storage medium is provided that stores program instructions, the program instructions being executable to perform the sound classification method described above. The sound classification method comprises the following steps: providing a training set, a verification set, and a convolutional neural network model, wherein the training set comprises a plurality of training samples, the verification set comprises one or more verification samples, and each sample is a segment of labeled sound data; adding a first window to a sample, computing over the sound data in the first window to extract time-domain features, adding a second window to the sound data in the first window, converting the sound data in the second window from the time domain to the frequency domain and then extracting frequency-domain features, and superimposing the extracted time-domain and frequency-domain features to obtain the time-frequency combination features of the sample; and inputting the time-frequency combination features obtained from the training samples in the training set into the convolutional neural network model for training, inputting the time-frequency combination features obtained from the verification samples in the verification set into the model for verification, and obtaining a trained and verified convolutional neural network model through continuous training and verification. The remaining steps of the sound classification method 100 are not repeated here.
According to another aspect of the present invention, a computer is provided that comprises a processor and a memory, the memory storing program instructions that the processor executes to perform the sound classification method. The sound classification method comprises the following steps: providing a training set, a verification set, and a convolutional neural network model, wherein the training set comprises a plurality of training samples, the verification set comprises one or more verification samples, and each sample is a segment of labeled sound data; adding a first window to a sample, computing over the sound data in the first window to extract time-domain features, adding a second window to the sound data in the first window, converting the sound data in the second window from the time domain to the frequency domain and then extracting frequency-domain features, and superimposing the extracted time-domain and frequency-domain features to obtain the time-frequency combination features of the sample; and inputting the time-frequency combination features obtained from the training samples in the training set into the convolutional neural network model for training, inputting the time-frequency combination features obtained from the verification samples in the verification set into the model for verification, and obtaining a trained and verified convolutional neural network model through continuous training and verification. The remaining steps of the sound classification method 100 are not repeated here.
As used herein, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, including not only those elements listed, but also other elements not expressly listed.
In this document, the terms front, back, upper and lower are used to define the components in the drawings and the positions of the components relative to each other, and are used for clarity and convenience of the technical solution. It is to be understood that the use of the directional terms should not be taken to limit the scope of the claims.
The features of the embodiments and embodiments described herein above may be combined with each other without conflict.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (13)

1. A method of sound classification, comprising:
providing a training set, a validation set and a convolutional neural network model, wherein the training set comprises a plurality of training samples, the validation set comprises one or more validation samples, and each sample is a piece of marked sound data;
adding a first window to a sample, calculating sound data in the first window to extract time domain features, adding a second window to the sound data in the first window, converting the sound data in the second window from a time domain to a frequency domain, then extracting frequency domain features, and overlapping the extracted time domain features and the extracted frequency domain features to obtain time-frequency combination features of the sample;
inputting the time-frequency combination characteristics obtained based on the training samples in the training set into a convolutional neural network model for training, inputting the time-frequency combination characteristics obtained based on the verification samples in the verification set into the convolutional neural network model for verification, and obtaining the convolutional neural network model after training verification through multiple times of training and verification.
2. The sound classification method according to claim 1, characterized in that it further comprises: providing a test set, the test set comprising one or more test samples;
and inputting the time-frequency combination characteristics obtained based on the test samples in the test set to the convolutional neural network model after training verification for test detection.
3. The sound classification method according to claim 1, characterized in that the obtaining of the samples comprises:
providing an initial sample; and
preprocessing an initial sample to form the sample, the preprocessing comprising: converting all initial samples to the same sampling rate through resampling; and/or cutting initial samples that are too long and padding those that are too short; and/or performing noise reduction or sound enhancement on the initial samples; and labeling the initial samples,
and dividing the set of samples to form a training set and a verification set.
4. The sound classification method according to claim 1,
the first window is a rectangular window whose size is a first predetermined duration, the second window is a Hanning window (a type of window function) whose size is a second predetermined duration, the first predetermined duration is N times the second predetermined duration, and N is an integer greater than or equal to 2.
5. The deep learning-based sound classification method according to claim 4, wherein the time-domain features extracted based on the first window and the frequency-domain features extracted based on the N second windows are combined to form time-frequency combined features.
6. The sound classification method according to claim 1,
the time domain features include one or more of a mean, a standard deviation, an amplitude, a root mean square, a maximum point, a skewness factor, a kurtosis factor, a margin factor, and a crest factor,
and converting the sound data in the second window from a time domain to a frequency domain through short-time Fourier transform, converting the sound data in the frequency domain to a Mel (MEL) scale, and calculating logarithm of the sound data to obtain frequency domain characteristics.
7. The sound classification method according to claim 1,
the convolutional neural network model comprises a convolutional layer, a pooling layer, a dense block, a full connection layer and a sigmoid function layer which are sequentially connected, an activation function layer and a normalization layer are sequentially arranged between the convolutional layer and the pooling layer, an activation function layer and a normalization layer are sequentially arranged between the dense block and the full connection layer, and the sigmoid function layer is used for enabling a result to be shrunk between 0 and 1.
8. The sound classification method according to claim 7, wherein there are two dense blocks, referred to respectively as a first dense block and a second dense block, and two fully connected layers, referred to respectively as a first fully connected layer and a second fully connected layer; the first dense block is connected to the pooling layer, an activation function layer and a normalization layer are arranged in sequence between the first dense block and the second dense block, an activation function layer and a normalization layer are arranged in sequence between the second dense block and the first fully connected layer, and the second fully connected layer is connected after the first fully connected layer; the normalization layers are used to accelerate training and to highlight distribution differences among the data.
9. The sound classification method of claim 7, characterized in that the parameters are optimized by the BP algorithm during the training process, and the training is terminated when the loss function is stable.
10. A sound classification apparatus, characterized in that it comprises:
the feature extraction module is used for adding a first window to each sample, calculating sound data in the first window to extract time domain features, adding a second window to the sound data in the first window, converting the sound data in the second window from a time domain to a frequency domain, then extracting frequency domain features, and overlapping the extracted time domain features and the extracted frequency domain features to obtain time-frequency combination features of the samples, wherein the data input into the feature extraction module comprises a training set and a verification set, the training set comprises a plurality of training samples, the verification set comprises one or more verification samples, and each sample is a segment of marked sound data;
and the convolutional neural network model is configured to receive the time-frequency combination characteristics obtained based on the training samples in the training set for training, receive the time-frequency combination characteristics obtained based on the verification samples in the verification set for verification, and obtain the convolutional neural network model after the training verification through multiple times of training and verification.
11. The sound classification apparatus of claim 10,
the data input into the feature extraction module comprises a test set, the test set comprises one or more test samples, and the time-frequency combination features obtained based on the test samples in the test set are detected by using a trained and verified convolutional neural network model.
12. A storage medium storing program instructions which when executed are operative to perform a sound classification method according to any one of claims 1 to 9.
13. A computer, characterized in that it comprises a processor and a memory, in which are stored program instructions that the processor executes in order to perform the sound classification method according to any one of claims 1-9.
CN202010700261.XA 2020-07-20 2020-07-20 Deep learning-based sound classification method and apparatus, storage medium, and computer Pending CN113963719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010700261.XA CN113963719A (en) 2020-07-20 2020-07-20 Deep learning-based sound classification method and apparatus, storage medium, and computer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010700261.XA CN113963719A (en) 2020-07-20 2020-07-20 Deep learning-based sound classification method and apparatus, storage medium, and computer

Publications (1)

Publication Number Publication Date
CN113963719A true CN113963719A (en) 2022-01-21

Family

ID=79459641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010700261.XA Pending CN113963719A (en) 2020-07-20 2020-07-20 Deep learning-based sound classification method and apparatus, storage medium, and computer

Country Status (1)

Country Link
CN (1) CN113963719A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116087339A (en) * 2023-04-11 2023-05-09 中国科学院地质与地球物理研究所 Near-bit lithology recognition method and device based on intelligent voiceprint recognition
CN116959477A (en) * 2023-09-19 2023-10-27 杭州爱华仪器有限公司 Convolutional neural network-based noise source classification method and device
CN116959477B (en) * 2023-09-19 2023-12-19 杭州爱华仪器有限公司 Convolutional neural network-based noise source classification method and device

Similar Documents

Publication Publication Date Title
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
EP2538034B1 (en) MFCC and CELP to detect turbine engine faults
Socoró et al. Development of an Anomalous Noise Event Detection Algorithm for dynamic road traffic noise mapping
Venter et al. Automatic detection of African elephant (Loxodonta africana) infrasonic vocalisations from recordings
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN110428854B (en) Voice endpoint detection method and device for vehicle-mounted terminal and computer equipment
Deshmukh et al. Speech based emotion recognition using machine learning
CN113963719A (en) Deep learning-based sound classification method and apparatus, storage medium, and computer
Archana et al. Gender identification and performance analysis of speech signals
CN106910495A (en) A kind of audio classification system and method for being applied to abnormal sound detection
CN115457966B (en) Pig cough sound identification method based on improved DS evidence theory multi-classifier fusion
KR102314824B1 (en) Acoustic event detection method based on deep learning
CN115081473A (en) Multi-feature fusion brake noise classification and identification method
US10522160B2 (en) Methods and apparatus to identify a source of speech captured at a wearable electronic device
CN111128178A (en) Voice recognition method based on facial expression analysis
CN111986699A (en) Sound event detection method based on full convolution network
KR102066718B1 (en) Acoustic Tunnel Accident Detection System
Al-Karawi et al. Improving short utterance speaker verification by combining MFCC and Entrocy in Noisy conditions
Rahman et al. Dynamic time warping assisted svm classifier for bangla speech recognition
CN113674763A (en) Whistling sound identification method, system, equipment and storage medium by utilizing line spectrum characteristics
CN116665649A (en) Synthetic voice detection method based on prosody characteristics
CN112053686A (en) Audio interruption method and device and computer readable storage medium
Bonifaco et al. Comparative analysis of filipino-based rhinolalia aperta speech using mel frequency cepstral analysis and Perceptual Linear Prediction
US20240062745A1 (en) Systems, methods, and devices for low-power audio signal detection
CN115662464B (en) Method and system for intelligently identifying environmental noise

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination