
CN112216287A - Environmental sound identification method based on ensemble learning and convolutional neural network

Info

Publication number: CN112216287A
Authority: CN (China)
Prior art keywords: neural network, data, convolutional neural, sound, training
Prior art date: 2020-09-25
Legal status: Pending
Application number: CN202011020706.6A
Other languages: Chinese (zh)
Inventors: 陈俊 (Chen Jun), 谢维 (Xie Wei), 王震宇 (Wang Zhenyu), 郭宏成 (Guo Hongcheng)
Current assignee: Jiangsu Lishi Technology Co., Ltd.
Original assignee: Jiangsu Lishi Technology Co., Ltd.
Priority date: 2020-09-25
Filing date: 2020-09-25
Publication date: 2021-01-12
Application filed by Jiangsu Lishi Technology Co., Ltd.


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/04: Training, enrolment or model building
    • G10L17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks
    • G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window


Abstract

The invention discloses an environmental sound identification method based on ensemble learning and a convolutional neural network, comprising the following steps: S1, feature extraction: the original audio is framed and windowed, the Mel energy spectrum of the sound is obtained with a Mel filter bank, and the final Mel energy spectrum features are taken as the data set; S2, model training: k-fold cross validation combined with mixup data augmentation is used to train k convolutional neural network models on the data set; S3, sound testing: the sound sample under test is identified by the trained convolutional neural network models. By training k models with k-fold cross validation and combining them for sound recognition, the method greatly enhances the generalization ability of the models and effectively alleviates overfitting; for the case of a small data volume, mixup data augmentation further improves generalization by mixing the original samples.

Description

Environmental sound identification method based on ensemble learning and convolutional neural network
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to an environmental sound identification method based on ensemble learning and a convolutional neural network.
Background
In audio research, environmental sound identification is an important field with great application potential in security monitoring, medical monitoring, smart homes, scene analysis, and other areas. Compared with speech, environmental sounds are noise-like and have a wide frequency spectrum, which makes their recognition more challenging.
Existing environmental sound recognition methods based on a convolutional neural network generally divide the available data into a training set and a test set, train a model on the training set until it converges, test it on the test set during training, save the model that performs best on the test set, and finally use the saved convolutional neural network for environmental sound recognition.
Existing identification methods based on a convolutional neural network, on a combination of convolutional and recurrent neural networks, or on a Gaussian mixture model all train a single model on the available environmental audio data and use it to identify unknown environmental audio; models trained in this way generalize poorly and are prone to overfitting.
Disclosure of Invention
In view of the above defects of the prior art, the technical problem to be solved by the present invention is to provide an environmental sound recognition method based on ensemble learning and a convolutional neural network that trains k models with k-fold cross validation and combines them for sound recognition, thereby greatly enhancing the generalization ability of the models and effectively alleviating overfitting; for the case of a small data volume, mixup data augmentation further enhances generalization by mixing the original samples.
To achieve the above object, the present invention provides an environmental sound identification method based on ensemble learning and a convolutional neural network, comprising the following steps:
S1, feature extraction: the original audio is framed and windowed; for each short-time analysis window, the corresponding amplitude spectrum is obtained by FFT (fast Fourier transform) and squared to give the energy spectrum of the sound; the Mel energy spectrum is then obtained with a Mel filter bank and subjected to a log nonlinear transformation, yielding the final Mel energy spectrum features, which serve as the data set;
S2, model training: the data set is divided into k equal parts using k-fold cross validation, one part being taken as test data and the remaining k-1 parts as training data; the training data are then mixed by mixup data augmentation and used for model training, and the model performing best on the test data is saved; this operation is repeated k times in total, yielding k convolutional neural network models;
S3, sound testing: the same feature extraction as in step S1 is applied to the sound sample under test to obtain its Mel energy spectrum features as the test sample; the test sample is input into the k trained convolutional neural network models, whose outputs are fed into a combination module; the combination module takes the mode of these outputs as the final output of the ensemble model, which is compared with the classes of the test set samples to compute the environmental sound recognition rate.
Further, framing and windowing the original audio in step S1 specifically comprises: grouping every N sampling points of the audio data into an observation unit called a frame, keeping an overlapping region between adjacent frames, and multiplying each frame by a window function to eliminate the signal discontinuities that would otherwise arise at the frame boundaries, as illustrated by the sketch below.
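For illustration only (the patent discloses no code), a minimal NumPy sketch of this framing-and-windowing step; the frame length of 1764 samples, hop of 882 samples, and Hann window are taken from the embodiment described later, and all function and variable names are our own:

```python
# Illustrative framing-and-windowing sketch (not from the patent).
# Assumes a 1-D signal at least one frame long.
import numpy as np

def frame_and_window(signal, frame_len=1764, hop_len=882):
    """Split a signal into overlapping frames and taper each with a Hann window."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)  # suppresses discontinuities at frame edges
    frames = np.stack([signal[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * window  # shape: (n_frames, frame_len)
```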
Further, in each round of step S2, a different one of the k parts is selected as the test data, so that each of the k parts serves as test data exactly once, the remaining k-1 parts serving as training data.
Further, the mixup data augmentation in step S2 is specifically: two feature samples are selected at random and mixed in proportion, and a new training sample and a new label are constructed by linear interpolation; the sample and label are finally processed by the following formulas:

x̃ = λ·x_i + (1 − λ)·x_j

ỹ = λ·y_i + (1 − λ)·y_j

where (x_i, y_i) and (x_j, y_j) are training sample pairs from the original data set, i.e. training samples and their corresponding labels, and λ is a parameter drawn from a Beta distribution, λ ~ Beta(α, α).
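For illustration (the patent discloses no code), a minimal NumPy sketch of this mixing step, assuming one-hot label vectors and a hypothetical hyperparameter `alpha`:

```python
# Illustrative mixup sketch (not from the patent): linear interpolation of two
# feature/label pairs with lambda drawn from Beta(alpha, alpha).
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2):
    """Return a mixed training sample and label; y_i and y_j are one-hot vectors."""
    lam = np.random.beta(alpha, alpha)
    return lam * x_i + (1.0 - lam) * x_j, lam * y_i + (1.0 - lam) * y_j
```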
Further, during the model training of step S2, the convolution kernels and weights are initialized with the Glorot uniform initializer, and the biases are initialized to all zeros.
Further, during the model training of step S2, the network parameters are updated with the Adam algorithm; when the number of network iterations reaches a preset limit or the recognition accuracy on the validation set no longer improves, training is stopped and the trained convolutional neural network model is saved.
The beneficial effects of the invention are as follows: the method trains k models with k-fold cross validation and combines them for sound recognition, greatly enhancing the generalization ability of the models and effectively alleviating overfitting; for the case of a small data volume, mixup data augmentation further improves generalization by mixing the original samples.
The conception, specific structure, and technical effects of the present invention are further described below with reference to the accompanying drawings, so that its objects, features, and effects can be fully understood.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the ensemble model prediction of the present invention.
Detailed Description
As shown in FIG. 1, the environmental sound identification method based on ensemble learning and a convolutional neural network comprises the following steps:
s1, extracting characteristics, wherein for the convenience of speech analysis, N sampling points are firstly collected into an observation unit called a frame, so as to avoid overlarge change of two adjacent frames, and therefore, an overlapping area is formed between the two adjacent frames. Each frame is substituted into a window function to eliminate signal discontinuities that may be caused across the frames. For each short-time analysis window, obtaining a corresponding amplitude spectrum through FFT, taking a square to obtain an energy spectrum of sound, then obtaining a Mel energy spectrum of the sound by utilizing a Mel filter bank, and then obtaining log nonlinear transformation of the Mel energy spectrum to obtain the final Mel energy spectrum characteristic;
S2, model training. The data set is divided into k equal parts using k-fold cross validation, one part being taken as test data and the remaining k-1 parts as training data. Because the data set is small, this embodiment additionally mixes the feature data with mixup data augmentation before model training to improve the generalization ability of the model. The training set is fed into the convolutional neural network for supervised training, the model performing best on the test data is saved, and the whole procedure is repeated k times in total, yielding k convolutional neural network models. During training, the convolution kernels and weights are initialized with the Glorot uniform initializer and the biases to all zeros; network parameters are updated with the Adam algorithm, and training stops and the trained convolutional neural network model is saved when the number of iterations reaches the preset limit or the recognition accuracy on the validation set has not improved for a long time.
Mixup data augmentation selects two feature samples at random, mixes them in proportion, and constructs a new training sample and a new label by linear interpolation; the sample and label are finally processed as:

x̃ = λ·x_i + (1 − λ)·x_j

ỹ = λ·y_i + (1 − λ)·y_j

where (x_i, y_i) and (x_j, y_j) are training sample pairs (training samples and their corresponding labels) from the original data set, and λ is a parameter drawn from a Beta distribution, λ ~ Beta(α, α). An illustrative sketch of the resulting training loop follows.
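As an illustration of how steps S1 and S2 might fit together (none of this code is from the patent), the following sketch uses scikit-learn's KFold and a hypothetical `build_model()` returning a compiled Keras classifier; for brevity, mixup is applied once to the whole training fold and the held-out fold doubles as the validation set:

```python
# Illustrative k-fold ensemble training loop. Assumptions (ours, not the
# patent's): build_model(), one-hot labels, batch size, patience, file names.
import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras

def train_ensemble(features, labels, k=5, alpha=0.2, epochs=100):
    """Train one CNN per fold; return the k best-on-fold models."""
    models = []
    for fold, (tr, te) in enumerate(
            KFold(n_splits=k, shuffle=True, random_state=0).split(features)):
        x_tr, y_tr = features[tr], labels[tr]
        x_te, y_te = features[te], labels[te]

        # mixup: pair each training sample with a random partner
        perm = np.random.permutation(len(x_tr))
        lam = np.random.beta(alpha, alpha, size=(len(x_tr), 1, 1, 1))
        x_mix = lam * x_tr + (1 - lam) * x_tr[perm]
        lam_y = lam.reshape(-1, 1)
        y_mix = lam_y * y_tr + (1 - lam_y) * y_tr[perm]

        model = build_model()  # hypothetical: compiled Keras CNN (sketched below)
        best = keras.callbacks.ModelCheckpoint(
            f"fold_{fold}.keras", monitor="val_accuracy", save_best_only=True)
        stop = keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=20)
        model.fit(x_mix, y_mix, validation_data=(x_te, y_te),
                  epochs=epochs, batch_size=32, callbacks=[best, stop])
        models.append(keras.models.load_model(f"fold_{fold}.keras"))
    return models
```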
S3, testing. The same feature extraction as in the training stage is applied to the sound sample under test to obtain its Mel energy spectrum features as the test sample; the test sample is input into the k trained convolutional neural network models, whose outputs are fed into a combination module; the combination module takes the mode of these outputs as the final output of the ensemble model, which is compared with the classes of the test set samples to compute the environmental sound recognition rate. FIG. 2 is a schematic diagram of the ensemble model prediction.
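A minimal sketch of the combination module's vote (illustrative; assumes the model list from the training sketch above, test features of shape (n, 40, 251, 1), and integer class labels):

```python
# Illustrative majority-vote combination (not from the patent).
import numpy as np

def ensemble_predict(models, x_test):
    """Each model votes with its argmax class; the per-sample mode is the output."""
    votes = np.stack([np.argmax(m.predict(x_test), axis=1) for m in models])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

def recognition_rate(models, x_test, y_true):
    """Fraction of test samples whose ensemble vote matches the true class."""
    return float(np.mean(ensemble_predict(models, x_test) == y_true))
```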
Specifically, this embodiment evaluates the convolutional neural network with the mixup method on the ESC-50 data set. ESC-50 contains 2000 natural environmental sound clips, each 5 seconds long with a sampling rate of 44.1 kHz. The data set covers 5 major categories (animal sounds, natural soundscapes, human non-speech sounds, indoor sounds, and urban outdoor sounds), each comprising 10 sound classes with 40 samples per class. The data set details are shown in Table 1.
TABLE 1 ambient Sound data set
(Table 1 is reproduced only as an image in the original publication and is not included here.)
The sound signal is framed with a Hann window, each frame containing 1764 sampling points; to preserve continuity between adjacent frames, a frame shift of 882 sampling points is used. The amplitude spectrum of the sound is obtained by FFT and squared to give the energy spectrum, which is converted to a Mel energy spectrum with a Mel filter bank. Finally, to strengthen the low-frequency representation of the sound and bring out feature information hidden in the low-frequency part, this embodiment applies a log nonlinear transformation to the Mel energy spectrum, yielding 2000 Mel energy spectrum features of size 40 × 251, of which 1600 form the training set and the remaining 400 the test set. The 1600 training samples are further divided into a training set and a validation set at a ratio of 4:1, the training set being used to train the models and the validation set to select the best model to save. An illustrative extraction sketch follows.
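For illustration (the library choice is ours, not the patent's), the same extraction can be written compactly with librosa, using the embodiment's parameters (44.1 kHz audio, 1764-sample Hann-windowed frames, 882-sample frame shift, 40 Mel bands):

```python
# Illustrative log-Mel extraction with librosa; the epsilon in the log is our
# own guard against log(0).
import librosa
import numpy as np

def extract_log_mel(path, sr=44100, n_fft=1764, hop=882, n_mels=40):
    y, _ = librosa.load(path, sr=sr)  # a 5 s ESC-50 clip -> 220500 samples
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop,
        window="hann", power=2.0, n_mels=n_mels)  # |FFT|^2 -> Mel filter bank
    return np.log(mel + 1e-10)  # log nonlinearity; shape approx. (40, 251)
```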
The convolutional neural network comprises six convolutional layers, four max-pooling layers, one global average pooling layer, and three fully connected layers, where: a max-pooling layer follows each of the first two convolutional layers, and a max-pooling layer follows every two of the last four convolutional layers; the global average pooling layer sits between the convolution/pooling stack and the fully connected layers. The six convolutional layers have 64, 128, 256, 512, and 512 convolution kernels, respectively; each kernel is 3 × 3 with a stride of 3, zero padding, and ReLU activation. The four max-pooling layers use 2 × 2 windows with zero padding. The first two fully connected layers each have 256 nodes with ReLU activation; the number of nodes in the last fully connected layer equals the number of sound classes, and since ESC-50 has 50 classes this layer has 50 nodes with softmax activation. Table 2 lists the specific model parameter settings, and an illustrative code sketch follows Table 2.
TABLE 2 model parameter settings
(Table 2 is reproduced only as an image in the original publication and is not included here.)
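A Keras sketch consistent with this description (illustrative, with two flagged assumptions: we use stride 1 rather than the stated stride of 3, since a stride of 3 combined with four poolings would collapse the 40 × 251 input, and we guess the filter sequence 64, 128, 256, 256, 512, 512 because the published text lists five values for six layers). This plays the role of the `build_model()` assumed in the training sketch above:

```python
# Illustrative Keras model for the described architecture; the stride and the
# sixth filter count are our assumptions (see the lead-in above).
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(40, 251, 1), n_classes=50):
    conv = dict(kernel_size=3, padding="same", activation="relu",
                kernel_initializer="glorot_uniform",  # Glorot uniform init
                bias_initializer="zeros")             # all-zero biases
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(64, **conv),  layers.MaxPooling2D(2, padding="same"),
        layers.Conv2D(128, **conv), layers.MaxPooling2D(2, padding="same"),
        layers.Conv2D(256, **conv), layers.Conv2D(256, **conv),
        layers.MaxPooling2D(2, padding="same"),
        layers.Conv2D(512, **conv), layers.Conv2D(512, **conv),
        layers.MaxPooling2D(2, padding="same"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",  # Adam parameter updates
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```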
The k of the k-fold cross validation used in training is set to 5, and after training the 5 models are combined for sound recognition. Table 3 compares the performance on ESC-50 of the ensemble-learning-based CNN proposed here with other methods. The invention achieves the best performance to date on the public ESC-50 environmental sound data set: compared with a single CNN model that also uses Mel spectrum feature extraction and mixup data augmentation, the recognition accuracy of the proposed ensemble CNN is 6.25% higher than Single CNN, and it is 13.1% higher than EnvNet-v2 with data augmentation.
TABLE 3 comparison of Performance of different ambient Sound identification methods
(Table 3 is reproduced only as an image in the original publication and is not included here.)
In conclusion, the method trains k models with k-fold cross validation and combines them for sound recognition, greatly enhancing the generalization ability of the models and effectively alleviating overfitting; for the case of a small data volume, mixup data augmentation further improves generalization by mixing the original samples.
The preferred embodiments of the invention have been described in detail above. It should be understood that those of ordinary skill in the art could devise numerous modifications and variations in light of the present teachings without departing from the inventive concept. Therefore, any technical solution that a person skilled in the art can obtain through logical analysis, reasoning, or limited experimentation based on the prior art and the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims (6)

1. An environmental sound identification method based on ensemble learning and a convolutional neural network, characterized by comprising the following steps:
S1, feature extraction: the original audio is framed and windowed; for each short-time analysis window, the corresponding amplitude spectrum is obtained by FFT (fast Fourier transform) and squared to give the energy spectrum of the sound; the Mel energy spectrum is then obtained with a Mel filter bank and subjected to a log nonlinear transformation, yielding the final Mel energy spectrum features, which serve as the data set;
S2, model training: the data set is divided into k equal parts using k-fold cross validation, one part being taken as test data and the remaining k-1 parts as training data; the training data are then mixed by mixup data augmentation and used for model training, and the model performing best on the test data is saved; this operation is repeated k times in total, yielding k convolutional neural network models;
S3, sound testing: the same feature extraction as in step S1 is applied to the sound sample under test to obtain its Mel energy spectrum features as the test sample; the test sample is input into the k trained convolutional neural network models, whose outputs are fed into a combination module; the combination module takes the mode of these outputs as the final output of the ensemble model, which is compared with the classes of the test set samples to compute the environmental sound recognition rate.
2. The ensemble learning and convolutional neural network-based environmental sound recognition method of claim 1, wherein framing and windowing the original audio in step S1 specifically comprises: grouping every N sampling points of the audio data into an observation unit called a frame, keeping an overlapping region between adjacent frames, and multiplying each frame by a window function to eliminate the signal discontinuities that would otherwise arise at the frame boundaries.
3. The ensemble learning and convolutional neural network-based environmental sound recognition method of claim 1, wherein in each round of step S2 a different one of the k parts is selected as the test data, so that each of the k parts serves as test data exactly once, the remaining k-1 parts serving as training data.
4. The ensemble learning and convolutional neural network-based environmental sound recognition method of claim 1, wherein the mixup data augmentation of step S2 is specifically: two feature samples are selected at random and mixed in proportion, and a new training sample and a new label are constructed by linear interpolation; the sample and label are finally processed by the following formulas:

x̃ = λ·x_i + (1 − λ)·x_j

ỹ = λ·y_i + (1 − λ)·y_j

where (x_i, y_i) and (x_j, y_j) are training sample pairs from the original data set, i.e. training samples and their corresponding labels, and λ is a parameter drawn from a Beta distribution, λ ~ Beta(α, α).
5. The ensemble learning and convolutional neural network-based environmental sound recognition method of claim 1, wherein during the model training of step S2 the convolution kernels and weights are initialized with the Glorot uniform initializer and the biases are initialized to all zeros.
6. The ensemble learning and convolutional neural network-based environmental sound recognition method of claim 1, wherein during the model training of step S2 the network parameters are updated with the Adam algorithm, and when the number of network iterations reaches a preset limit or the recognition accuracy on the validation set no longer improves, training is stopped and the trained convolutional neural network model is saved.
CN202011020706.6A (filed 2020-09-25, priority date 2020-09-25, published as CN112216287A, status pending): Environmental sound identification method based on ensemble learning and convolutional neural network

Priority Applications (1)

Application number: CN202011020706.6A (published as CN112216287A); priority date: 2020-09-25; filing date: 2020-09-25; title: Environmental sound identification method based on ensemble learning and convolutional neural network


Publications (1)

Publication number: CN112216287A; publication date: 2021-01-12

Family

ID=74051077

Family Applications (1)

Application number: CN202011020706.6A (published as CN112216287A, status pending); title: Environmental sound identification method based on ensemble learning and convolutional neural network

Country Status (1)

Country: CN; publication: CN112216287A



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8831942B1 (en) * 2010-03-19 2014-09-09 Narus, Inc. System and method for pitch based gender identification with suspicious speaker detection
CN109215637A (en) * 2017-06-30 2019-01-15 三星Sds株式会社 Audio recognition method
CN109065030A (en) * 2018-08-01 2018-12-21 上海大学 Ambient sound recognition methods and system based on convolutional neural networks
CN110189769A (en) * 2019-05-23 2019-08-30 复钧智能科技(苏州)有限公司 Abnormal sound detection method based on multiple convolutional neural networks models couplings

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
吴佳; 陈森朋; 陈修云; 周瑞: "Model selection and hyperparameter optimization based on reinforcement learning" (in Chinese), Journal of University of Electronic Science and Technology of China, no. 02, 30 March 2020 (2020-03-30) *
苍岩; 罗顺元; 乔玉龙: "Classification of pig sounds based on deep neural networks" (in Chinese), Transactions of the Chinese Society of Agricultural Engineering, no. 09, 8 May 2020 (2020-05-08) *
陈维高; 朱卫纲; 唐晓婧; 贾鑫: "Application of stacked denoising autoencoders in waveform unit recognition" (in Chinese), Journal of Harbin Institute of Technology, no. 11, 4 May 2018 (2018-05-04) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560822A (en) * 2021-02-23 2021-03-26 江苏聆世科技有限公司 Road sound signal classification method based on convolutional neural network
CN112560822B (en) * 2021-02-23 2021-05-14 江苏聆世科技有限公司 Road sound signal classification method based on convolutional neural network
CN113628641A (en) * 2021-06-08 2021-11-09 广东工业大学 Method for checking mouth and nose breathing based on deep learning
CN113591733A (en) * 2021-08-04 2021-11-02 中国人民解放军国防科技大学 Underwater acoustic communication modulation mode classification identification method based on integrated neural network model
CN114912539A (en) * 2022-05-30 2022-08-16 吉林大学 Environmental sound classification method and system based on reinforcement learning
CN114912539B (en) * 2022-05-30 2024-07-09 吉林大学 Environmental sound classification method and system based on reinforcement learning


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination