CN108122562A

CN108122562A - An audio classification method based on a convolutional neural network and a random forest

Info

Publication number: CN108122562A
Application number: CN201810037337.8A
Authority: CN
Inventors: 彭德中; 付炜
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2018-01-16
Filing date: 2018-01-16
Publication date: 2018-06-05

Abstract

The invention discloses an audio classification method based on a convolutional neural network and a random forest. The method includes: S1: performing spectrum analysis on the original audio data set, including segmenting, framing, windowing, and Fourier transform, to obtain the spectrogram corresponding to each original audio file; S2: using the obtained spectrograms as input to train a convolutional neural network feature extractor; S3: removing the softmax layer of the convolutional neural network and extracting the high-level features of the spectrograms; S4: training a random forest classifier on the extracted high-level features; S5: classifying audio with the trained random forest, based on the high-level features extracted by the convolutional neural network. Because feature extraction is performed by the convolutional neural network, the invention avoids the cumbersome process of manually constructing and extracting features. To address the insufficient generalization ability that results from using softmax as the classifier of the convolutional neural network, a random forest replaces the softmax layer as the final classifier. High precision and recall were achieved in testing.

Description

An Audio Classification Method Based on Convolutional Neural Network and Random Forest

Technical field

The invention belongs to the field of machine learning and relates to an audio classification method based on a convolutional neural network and a random forest.

Background

The development of the Internet and multimedia technology has filled daily life with audio. Music websites in particular host very large collections of audio files in widely varying styles. Faced with this volume, audio retrieval helps users find the files they need quickly and accurately. Audio classification is a prerequisite for audio retrieval, but manually classifying a large number of audio files is time-consuming and tedious, and accuracy declines as listeners become fatigued. Fast and accurate automatic classification is therefore needed for large audio collections. Audio classification has been studied extensively. One example is a two-level method that combines hidden Markov models and support vector machines: the hidden Markov model first performs a preliminary classification that identifies the two most likely classes, and the corresponding support vector machine classifier then makes the final decision. Another method classifies audio according to the similarity between audio contents, representing each audio file by its set of pitches and classifying with an LDA topic model. Gaussian mixture models, decision trees, and other models have also been used as classifiers. However, most of these methods construct features manually in the traditional way, which is cumbersome and yields features that are not sufficiently expressive. Moreover, they rely on a single classifier, which limits the generalization ability of the model.

In recent years deep learning has become increasingly popular. Its models contain multiple hidden layers and combine low-level features into more abstract high-level representations, which capture distributed feature representations of the data better than traditional manually constructed features. In view of the current situation and the problems above, it is necessary to design an audio classification method based on deep learning.

Summary of the invention

The technical problem to be solved by the present invention is to provide an audio classification method based on a convolutional neural network and a random forest. The method uses a convolutional neural network to extract high-level features automatically and uses a random forest to overcome the weak generalization ability of a single classifier, achieving high precision and recall.

The technical solution of the invention is as follows:

An audio classification method based on a convolutional neural network and a random forest, comprising the following steps.

Step 1: Perform spectrum analysis on the original audio files to obtain their spectrograms. Audio files are often long, and a spectrogram computed directly from a whole file is too large, so later model training would consume excessive system resources. The original audio is therefore divided into segments of suitable length, and spectrum analysis is performed on each segment, including framing, windowing, and the short-time Fourier transform. Suppose $x(n)$ is a long sequence and $w(n)$ is a window function of length $N$. Windowing $x(n)$ with $w(n)$ gives the $N$-point sequence $x_N(n)$, that is,

$$x_N(n) = x(n)\,w(n)$$

In the frequency domain,

$$X_N(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\theta})\, W(e^{j(\omega-\theta)})\, d\theta$$

The formula for the short-time Fourier transform is as follows:

$$\mathrm{STFT}\{x(n)\}(m, \omega) = \sum_{n=-\infty}^{\infty} x(n)\, w(n-m)\, e^{-j\omega n}$$

where $x(n)$ is the original signal and $w(n)$ is the window function. Spectrum analysis yields the spectrogram corresponding to the audio.
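The framing, windowing, and transform steps of Step 1 can be illustrated with a minimal sketch. It uses only the Python standard library and a direct DFT, so it is slow and meant for illustration. The Hann window, the frame length of 32, and the hop of 16 are assumptions chosen here; the text does not state which window or frame parameters were used.

```python
import cmath
import math


def hann(n_points):
    """Hann window of length n_points (an assumed choice of window)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def stft(signal, frame_len, hop):
    """Frame the signal, window each frame, and take its DFT.

    Returns a list of frames; each frame holds the magnitudes of
    frequency bins 0 .. frame_len // 2.
    """
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Windowing: multiply the current frame by the window, sample by sample.
        windowed = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Toy input: a pure tone that sits exactly on bin 4 of a 32-point frame.
frame_len, hop = 32, 16
tone = [math.sin(2 * math.pi * 4 * n / frame_len) for n in range(128)]
spectrogram = stft(tone, frame_len, hop)

# Index of the strongest frequency bin in every frame.
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spectrogram]
```

Each row of `spectrogram` is one time frame, so stacking the rows gives the time-frequency image that the later steps treat as a picture.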

Step 2: Use the spectrograms obtained in Step 1 as the training set to train an improved convolutional neural network. The network has 14 layers, including convolutional layers, downsampling layers, a Dropout layer, a Flatten layer, fully connected layers, a BatchNormalization layer, and a softmax layer, and uses cross entropy as the loss function. The layers are specified as follows:

Input: spectrogram of size 248*248;

Layer 1: convolutional layer, 64 kernels of size (5,5), strides=1, output feature map size (244,244);

Layer 2: downsampling layer, kernel size (2,2), output feature map size (122,122);

Layer 3: convolutional layer, 128 kernels of size (3,3), strides=2, output feature map size (60,60);

Layer 4: downsampling layer, kernel size (2,2), output feature map size (30,30);

Layer 5: convolutional layer, 256 kernels of size (3,3), strides=2, output feature map size (14,14);

Layer 6: downsampling layer, kernel size (2,2), output feature map size (7,7);

Layer 7: convolutional layer, 512 kernels of size (2,2), strides=1, output feature map size (6,6);

Layer 8: downsampling layer, kernel size (2,2), output feature map size (3,3);

Layer 9: Dropout layer, dropout=0.5, which disables neurons with a given probability during training to prevent overfitting;

Layer 10: Flatten layer, which flattens the multi-dimensional data into one dimension as a transition to the fully connected layers;

Layer 11: fully connected layer with 128 output neurons;

Layer 12: Batch Normalization, which normalizes the input signal while preserving the expressive ability of the model;

Layer 13: fully connected layer with 9 output neurons, because the data set used has 9 classes;

Layer 14: softmax layer, the classifier, whose output is the final probability distribution, each value representing the probability of one class.
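The feature map sizes in the layer list can be checked with the usual output-size formula for an unpadded convolution or pooling, floor((size - kernel) / stride) + 1. The sketch below assumes no padding and a pooling stride of 2; neither is stated in the text, but both are consistent with every listed size.

```python
def out_size(size, kernel, stride):
    """Output length of an unpadded convolution or pooling along one axis."""
    return (size - kernel) // stride + 1


# (name, kernel, stride) for Layers 1 to 8, in the order listed.
# Pooling stride 2 and zero padding are assumptions, not stated in the text.
layers = [
    ("Layer 1 conv", 5, 1),
    ("Layer 2 pool", 2, 2),
    ("Layer 3 conv", 3, 2),
    ("Layer 4 pool", 2, 2),
    ("Layer 5 conv", 3, 2),
    ("Layer 6 pool", 2, 2),
    ("Layer 7 conv", 2, 1),
    ("Layer 8 pool", 2, 2),
]

size = 248  # input spectrogram is 248 x 248
sizes = []
for name, kernel, stride in layers:
    size = out_size(size, kernel, stride)
    sizes.append(size)

# Layer 7 has 512 kernels, so Flatten (Layer 10) yields this many values.
flattened = size * size * 512
```

Under these assumptions `sizes` reproduces the listed sequence 244, 122, 60, 30, 14, 7, 6, 3, and the Flatten layer passes 4608 values to the 128-neuron fully connected layer.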

Step 3: Remove the softmax layer from the convolutional neural network trained in Step 2, and use the output of the last fully connected layer as the high-level features of the spectrogram.
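To make Step 3 concrete, the sketch below contrasts the softmax output with the raw output of the last fully connected layer. The nine numbers are hypothetical values invented for illustration; only the vector length of 9 comes from the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output of the last fully connected layer for one spectrogram.
logits = [1.2, -0.3, 0.8, 2.5, 0.0, -1.1, 0.4, 1.9, -0.7]

# With the softmax layer: a probability distribution over the 9 classes.
probs = softmax(logits)

# With the softmax layer removed: the raw 9-dimensional vector is kept
# and passed on as the feature vector for the random forest.
features = list(logits)
```

Softmax preserves the ordering of the scores, so the most likely class is the same in both views. What changes is that the downstream classifier receives unnormalized values and can learn its own decision boundaries over them.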

Step 4: Train a random forest classifier on the high-level features extracted in Step 3. Gini impurity is used as the criterion for feature selection in the decision trees. The algorithm is described as follows:

Input: sample set D = {(x1,y1), (x2,y2), …, (xm,ym)}, number of weak classifier iterations T;

Output: final strong classifier f(x);

For t = 1, 2, …, T:

a) Perform the t-th round of random sampling from the original data set, drawing m samples in total, to obtain the sampling set Dt;

b) Use the sampling set Dt to build the t-th decision tree Gt(x). Randomly select a subset of all the sample features, then choose the best feature from that subset to split the decision tree into left and right subtrees.
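The two ingredients of the algorithm above, bootstrap sampling and a Gini-based split chosen from a random subset of features, can be sketched as follows. This is a simplified illustration with hypothetical helper names (`gini`, `split_gini`, `best_split`, `bootstrap`) and a tiny made-up data set; it shows a single split, not a full tree or forest.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def split_gini(rows, labels, feature, threshold):
    """Weighted Gini impurity after splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    if not left or not right:
        return gini(labels)  # degenerate split: nothing is separated
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(rows, labels, n_candidates, rng):
    """Best (impurity, feature, threshold) among a random subset of features."""
    features = rng.sample(range(len(rows[0])), n_candidates)
    candidates = [(split_gini(rows, labels, f, x[f]), f, x[f])
                  for f in features for x in rows]
    return min(candidates)


def bootstrap(rows, labels, rng):
    """Draw len(rows) samples with replacement (one sampling round)."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Made-up data: feature 0 separates the two classes, feature 1 does not.
rows = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [0.8, 2.0]]
labels = ["a", "a", "b", "b"]
rng = random.Random(0)

root_impurity = gini(labels)
split = best_split(rows, labels, n_candidates=2, rng=rng)
sample_rows, sample_labels = bootstrap(rows, labels, rng)
```

Here `split` selects feature 0 with threshold 0.2, which drives the weighted impurity from 0.5 to 0. A full random forest repeats the bootstrap for each tree and the random feature selection while growing the tree.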

Step 5: Apply the spectrum analysis of Step 1 to the audio to be classified to obtain its spectrogram, then use the convolutional neural network from Step 3, with the softmax layer removed, to extract the high-level features of the spectrogram. Finally, input the extracted features into the random forest classifier trained in Step 4 to classify the audio. The class that receives the most votes from the T weak learners is the final class.
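The voting rule at the end of Step 5 amounts to a majority vote over the per-tree predictions. A minimal sketch, with made-up votes from five trees:

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees.

    Ties go to the class that appears first in the list, which is an
    arbitrary choice made for this sketch.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions of T = 5 decision trees for one audio segment.
votes = ["Jazz", "Blues", "Jazz", "Rock", "Jazz"]
final_label = majority_vote(votes)  # "Jazz"
```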

The present invention proposes an audio classification method based on deep learning that uses a hybrid model combining a convolutional neural network and a random forest. To address the insufficient feature extraction of traditional models, the invention converts audio into spectrograms and then uses a convolutional neural network to extract their high-level features, which exploits the strong image feature extraction ability of convolutional neural networks and simplifies the feature extraction process. To address the weak generalization ability of a single classifier, the invention uses a random forest, which builds multiple decision trees for classification and so draws on the advantages of ensemble learning to compensate for the shortcomings of a single classifier. The classification results show that the invention achieves high precision and recall.

Description of the drawings

Fig. 1 is a flow chart of the audio classification method based on a convolutional neural network and a random forest according to the present invention.

Fig. 2 is a spectrogram obtained after spectrum analysis.

Fig. 3 is a flow chart of high-level feature extraction using the improved convolutional neural network.

Detailed description

The specific implementation of the present invention is further described below with reference to the drawings and an embodiment. The following embodiment only illustrates the present invention and is not intended to limit its scope.

Embodiment 1 is an example of the present invention. It uses the "GTZAN Genre Collection" as the data set and takes audio files of nine different genres as the training and test sets. The nine classes are Blues, Classical, Country, Disco, Jazz, Metal, Pop, Reggae, and Rock.

1. Divide each audio file into 6 segments of equal length, each carrying the same label as the file. Apply framing, windowing, and Fourier transform to each segment to obtain its spectrogram. Fig. 2 shows such a spectrogram. Read in each spectrogram, convert it to grayscale, and resize it to 248*248. Finally, save the pixel values of the resized image to an array as one sample of the convolutional neural network data set. These operations yield the data set D(5400,248,248), that is, 5400 spectrograms, each 248 wide and 248 high. The data set is split into a training set (80%) and a test set (20%), giving the training set T(4320,248,248) and the test set V(1080,248,248).
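The data set sizes in this step follow from simple counting, sketched below together with an equal-length split. The figure of 100 clips per genre is inferred here from 5400 / (9 × 6) rather than stated in the text, and `split_equal` is a hypothetical helper that drops any leftover samples at the end of a clip.

```python
def split_equal(samples, n_segments):
    """Cut a sequence into n_segments pieces of equal length.

    Samples left over at the end (fewer than n_segments) are discarded,
    which is an assumption of this sketch.
    """
    seg_len = len(samples) // n_segments
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


# Counting: genres x clips per genre x segments per clip.
n_genres, clips_per_genre, n_segments = 9, 100, 6  # 100 is inferred, see above
total_spectrograms = n_genres * clips_per_genre * n_segments

# 80% / 20% split, using integer arithmetic to avoid rounding surprises.
n_train = total_spectrograms * 80 // 100
n_test = total_spectrograms - n_train

# Example on a dummy "clip" of 60 samples.
segments = split_equal(list(range(60)), n_segments)
```

Because every segment inherits the label of its source file, segments from one file can land on both sides of a random 80/20 split. The text does not say whether the split was made per file or per segment.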

2. Train the convolutional neural network model on the training set T(4320,248,248). The network has 14 layers in total, including convolutional layers, downsampling layers, fully connected layers, a Dropout layer, and a Batch Normalization layer.

3. After the convolutional neural network has been trained, remove the final softmax layer. Use the trained network to extract deeper features from the spectrograms: the original training set T(4320,248,248), made up of spectrograms, is transformed into a new training set T'(4320,9), and the original test set V(1080,248,248) is transformed into a new test set V'(1080,9).

4. Use the new training set T' and test set V' to train the random forest as the final classifier. The following parameter combinations are tried:

Parameter            Values
n_estimators         [10, 50, 100]
min_samples_split    [2, 3, 4]
min_samples_leaf     [1, 2, 3]

After selection, the best parameter combination is n_estimators: 100, min_samples_split: 3, min_samples_leaf: 1. Once the random forest has been trained, it is evaluated on the test set, with the following results:
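The search space described above is small enough to enumerate directly. The sketch below only lists the candidate combinations; it does not train anything. The parameter names match those of scikit-learn's `RandomForestClassifier`, although the text does not name a library or say how the candidates were scored, so any actual search procedure would be an assumption.

```python
from itertools import product

# Parameter values as listed in the text.
param_grid = {
    "n_estimators": [10, 50, 100],
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": [1, 2, 3],
}

# Every combination of one value per parameter: 3 x 3 x 3 candidates.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

# The combination reported as best in the text.
best = {"n_estimators": 100, "min_samples_split": 3, "min_samples_leaf": 1}
```

There are 27 candidates in total, and the reported best combination is one of them.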

Class        Precision    Recall    F1-score    Support
0            0.80         0.74      0.77        118
1            0.89         0.92      0.90        133
2            0.75         0.80      0.78        117
3            0.75         0.83      0.79        118
4            0.93         0.88      0.90        134
5            0.94         0.90      0.92        108
6            0.88         0.85      0.87        103
7            0.86         0.78      0.82        124
8            0.64         0.68      0.66        125
Avg/total    0.83         0.82      0.82        1080

The table above shows that the method classifies audio automatically with good accuracy: the average precision reaches 83% and the average recall reaches 82%.
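The averages in the last row of the results table can be reproduced from the per-class rows. The sketch below assumes they are support-weighted means, which is not stated in the text but matches the reported figures to two decimals.

```python
# (precision, recall, f1, support) for classes 0 to 8, copied from the table.
rows = [
    (0.80, 0.74, 0.77, 118),
    (0.89, 0.92, 0.90, 133),
    (0.75, 0.80, 0.78, 117),
    (0.75, 0.83, 0.79, 118),
    (0.93, 0.88, 0.90, 134),
    (0.94, 0.90, 0.92, 108),
    (0.88, 0.85, 0.87, 103),
    (0.86, 0.78, 0.82, 124),
    (0.64, 0.68, 0.66, 125),
]

total_support = sum(r[3] for r in rows)


def weighted_mean(col):
    """Mean of one metric column, weighted by the number of test samples."""
    return sum(r[col] * r[3] for r in rows) / total_support


avg_precision = round(weighted_mean(0), 2)  # 0.83
avg_recall = round(weighted_mean(1), 2)     # 0.82
avg_f1 = round(weighted_mean(2), 2)         # 0.82
```

The supports sum to 1080, the size of the test set V'.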

Claims

1. An audio classification method based on a convolutional neural network and a random forest, characterized by comprising the following steps:

Step 1: Perform spectrum analysis on the original audio data set. First, divide each long audio file into several segments of equal length, each segment carrying the same label as the file. Then apply framing, windowing, and Fourier transform to each segment to obtain its spectrogram, which serves as one sample of the new training set;

Step 2: Use all spectrograms obtained in Step 1 and their corresponding labels to train an improved convolutional neural network, which has 14 layers;

Step 3: Remove the softmax layer of the convolutional neural network trained in Step 2, then use the network to extract the high-level features of all spectrograms;

Step 4: Train a random forest classifier on the high-level features of the spectrograms extracted in Step 3, using Gini impurity as the criterion for decision tree feature selection;

Step 5: Apply the spectrum analysis of Step 1 to the audio to be classified to obtain its spectrogram, then use the convolutional neural network from Step 3, with the softmax layer removed, to extract the high-level features of the spectrogram, and finally input the extracted features into the random forest classifier trained in Step 4 to classify the audio, the final classification result being obtained by voting.

2. The audio classification method based on a convolutional neural network and a random forest according to claim 1, characterized in that the method extracts audio features in two stages: the first stage applies spectrum analysis to obtain the spectrogram corresponding to the audio, which is a preliminary extraction of its low-level time-frequency features; the second stage uses the improved convolutional neural network to extract high-level features from the spectrogram.

3. The audio classification method based on a convolutional neural network and a random forest according to claim 1, characterized in that, to overcome the weak generalization ability that results from using softmax as the classifier of the convolutional neural network, a random forest replaces the last layer of the convolutional neural network as the final audio classifier.