CN108122562A - A kind of audio frequency classification method based on convolutional neural networks and random forest - Google Patents

A kind of audio frequency classification method based on convolutional neural networks and random forest

Info

Publication number
CN108122562A
Authority
CN
China
Prior art keywords
convolutional neural
neural network
audio
random forest
spectrogram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810037337.8A
Other languages
Chinese (zh)
Inventor
彭德中
付炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201810037337.8A
Publication of CN108122562A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an audio classification method based on a convolutional neural network and a random forest. The method comprises: S1: performing spectral analysis on the original audio dataset, including segmenting, framing, windowing, and Fourier transform, to obtain the spectrogram corresponding to each original audio file; S2: training a convolutional neural network feature extractor with the obtained spectrograms as input; S3: removing the softmax layer of the convolutional neural network and extracting high-level features of the spectrograms; S4: training a random forest classifier on the extracted high-level features; S5: classifying audio with the trained random forest, based on the high-level features extracted by the convolutional neural network. Because feature extraction is performed by the convolutional neural network, the method avoids the tedious process of hand-crafting features. To address the insufficient generalization ability that results from using softmax as the classifier of the convolutional neural network, a random forest replaces the softmax layer as the final classifier. High precision and recall were achieved in testing.

Description

An Audio Classification Method Based on a Convolutional Neural Network and a Random Forest

Technical Field

The invention belongs to the field of machine learning and relates to an audio classification method based on a convolutional neural network and a random forest.

Background

The development of the Internet and multimedia technology has filled our lives with audio. Music websites in particular host huge numbers of audio files in widely differing styles. Faced with this volume, audio retrieval helps us find the files we need quickly and accurately. Audio classification is a prerequisite for audio retrieval, but manually classifying a large number of audio files is time-consuming and tedious, and the accuracy of manual classification drops as listeners suffer hearing fatigue. Fast and accurate automatic classification is therefore necessary for large audio collections. Audio classification has been studied extensively. One approach is a two-stage method that combines hidden Markov models and support vector machines: the hidden Markov model first performs a preliminary classification to determine the two most likely classes, and the corresponding support vector machine classifier then makes the final decision. Another approach classifies audio by the similarity between audio contents, representing each audio file by its set of pitches and classifying with an LDA topic model. Gaussian mixture models, decision trees, and other models have also been used as classifiers. Most of these methods, however, construct features by hand in the traditional way, which is tedious and yields features that are not sufficiently informative. They also rely on a single classifier, which limits the generalization ability of the model.

In recent years deep learning has become increasingly popular. Its architectures contain multiple hidden layers and combine low-level features into more abstract high-level attributes or features, which lets them discover distributed feature representations of the data and outperform traditional hand-crafted features. Given this state of the art and the problems above, it is necessary to design a deep-learning-based audio classification method.

Summary of the Invention

The technical problem to be solved by the invention is to provide an audio classification method based on a convolutional neural network and a random forest. The method uses the convolutional neural network to extract high-level features automatically and uses the random forest to overcome the weak generalization ability of a single classifier, achieving high precision and recall.

The technical solution of the invention is as follows:

An audio classification method based on a convolutional neural network and a random forest, comprising the following steps.

Step 1: Perform spectral analysis on the original audio files to obtain their spectrograms. Audio files are often long, so a spectrogram computed directly from a whole file is too large and makes later model training consume a lot of system resources. The original audio is therefore split into suitable segments, and spectral analysis (framing, windowing, and short-time Fourier transform) is applied to each segment. Suppose $x(n)$ is a long sequence and $w(n)$ is a window function of length $N$. Windowing $x(n)$ with $w(n)$ gives the $N$-point sequence $x_N(n)$, that is,

$$x_N(n) = x(n)\,w(n).$$

In the frequency domain,

$$X_N(e^{j\omega}) = \frac{1}{2\pi}\int_{-\pi}^{\pi} X(e^{j\theta})\,W(e^{j(\omega-\theta)})\,d\theta.$$

The formula for the short-time Fourier transform is as follows:

$$\mathrm{STFT}(m,\omega) = \sum_{n=-\infty}^{\infty} x(n)\,w(n-m)\,e^{-j\omega n},$$

where $x(n)$ is the original signal and $w(n)$ is the window function. Spectral analysis yields the spectrogram corresponding to the audio.
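The framing, windowing, and transform chain of step 1 can be sketched in pure Python as below. This is only an illustration under stated assumptions: the Hann window, the frame length of 64, the hop of 32, and the test tone are hypothetical choices, since the source does not say which window or frame parameters were used.

```python
import cmath
import math


def hann(n_points):
    """Symmetric Hann window of length n_points (one common choice of w(n))."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def stft_magnitude(signal, frame_len, hop):
    """Frame the signal, window each frame, and take the DFT magnitude.

    Returns a list of frames; each frame holds frame_len // 2 + 1 magnitudes
    (the non-negative frequency bins).
    """
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # x_N(n) = x(n) w(n) for the current frame
        windowed = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Hypothetical test signal: a pure tone that falls exactly on DFT bin 8.
frame_len, hop = 64, 32
tone = [math.sin(2 * math.pi * 8 * n / frame_len) for n in range(256)]
spectrogram = stft_magnitude(tone, frame_len, hop)
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spectrogram]
```

Each row of `spectrogram` is one time frame. Rendering the rows as pixel columns gives the kind of image that is later fed to the network.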

Step 2: Use the spectrograms obtained in step 1 as the training set to train an improved convolutional neural network. The network has 14 layers, including convolutional layers, downsampling layers, a Dropout layer, a Flatten layer, fully connected layers, a Batch Normalization layer, and a softmax layer, and uses cross entropy as the loss function. The layers are specified as follows:

Input: spectrogram of size 248*248;

Layer1: convolutional layer, kernel size (5,5), 64 kernels, strides=1, output feature map size (244,244);

Layer2: downsampling layer, kernel size (2,2), output feature map size (122,122);

Layer3: convolutional layer, kernel size (3,3), 128 kernels, strides=2, output feature map size (60,60);

Layer4: downsampling layer, kernel size (2,2), output feature map size (30,30);

Layer5: convolutional layer, kernel size (3,3), 256 kernels, strides=2, output feature map size (14,14);

Layer6: downsampling layer, kernel size (2,2), output feature map size (7,7);

Layer7: convolutional layer, kernel size (2,2), 512 kernels, strides=1, output feature map size (6,6);

Layer8: downsampling layer, kernel size (2,2), output feature map size (3,3);

Layer9: Dropout layer, dropout=0.5, which deactivates neurons with the given probability during training to prevent overfitting;

Layer10: Flatten layer, which flattens the multi-dimensional data into one dimension as a transition to the fully connected layers;

Layer11: fully connected layer with 128 output neurons;

Layer12: Batch Normalization, which normalizes the input signal while preserving the expressive power of the model;

Layer13: fully connected layer with 9 output neurons, because the dataset used has 9 classes;

Layer14: softmax layer, the classifier, whose output is the final probability distribution, each value being the probability of one class.
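The feature map sizes listed for Layer1 through Layer8 can be checked with a few lines of arithmetic. The sketch below assumes convolutions and pooling without padding and pooling strides equal to the pooling kernel size; the source does not state either, but both assumptions reproduce the listed sizes. The layer names are hypothetical labels.

```python
def out_size(size, kernel, stride):
    """Output side length of an unpadded ('valid') convolution or pooling."""
    return (size - kernel) // stride + 1


# (name, kernel, stride) for Layer1..Layer8; pooling layers are assumed to
# use a stride equal to their kernel size.
layers = [
    ("conv1", 5, 1), ("pool1", 2, 2),
    ("conv2", 3, 2), ("pool2", 2, 2),
    ("conv3", 3, 2), ("pool3", 2, 2),
    ("conv4", 2, 1), ("pool4", 2, 2),
]

size = 248  # input spectrogram side length
sizes = []
for name, kernel, stride in layers:
    size = out_size(size, kernel, stride)
    sizes.append(size)

# Length of the vector produced by the Flatten layer (512 channels of 3x3).
flattened = sizes[-1] * sizes[-1] * 512
```

Under these assumptions the Flatten layer passes a 4608-element vector to the 128-neuron fully connected layer.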

Step 3: Remove the softmax layer from the convolutional neural network trained in step 2, and use the output of the last fully connected layer as the high-level features of the spectrogram.

Step 4: Train a random forest classifier on the high-level features extracted in step 3. Gini impurity is used as the criterion for feature selection in the decision trees. The algorithm is as follows:

Input: sample set D = {(x1,y1), (x2,y2), …, (xm,ym)}; number of weak-classifier iterations T;

Output: the final strong classifier f(x);

For t = 1, 2, …, T:

a) Perform the t-th round of random sampling from the original dataset, drawing m samples in total, to obtain the sampling set Dt;

b) Build the t-th decision tree Gt(x) from the sampling set Dt. When splitting a node, randomly select a subset of all sample features, then choose the best feature from that subset to divide the node into left and right subtrees.
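The building blocks of step 4 can be sketched as follows: the Gini impurity used to score a split, the random sampling of step a), and the random feature subset of step b). Sampling with replacement and a subset size of 3 out of 9 features are assumptions made for illustration; the source specifies neither. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def split_gini(left_labels, right_labels):
    """Size-weighted Gini impurity of a candidate left/right split."""
    total = len(left_labels) + len(right_labels)
    return (len(left_labels) / total * gini_impurity(left_labels)
            + len(right_labels) / total * gini_impurity(right_labels))


def bootstrap_sample(dataset, rng):
    """Draw m samples with replacement from a dataset of size m (assumed)."""
    return [rng.choice(dataset) for _ in dataset]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct candidate feature indices for one node split."""
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)
# Toy dataset of (feature vector, label) pairs, for illustration only.
toy = [((0.1, 0.9), "a"), ((0.2, 0.8), "a"), ((0.9, 0.1), "b"), ((0.8, 0.2), "b")]
d_t = bootstrap_sample(toy, rng)
candidates = random_feature_subset(n_features=9, k=3, rng=rng)
```

A tree builder would evaluate `split_gini` for each candidate feature and threshold and keep the split with the lowest value.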

Step 5: Apply the spectral analysis of step 1 to the audio to be classified to obtain its spectrogram, extract the high-level features of the spectrogram with the convolutional neural network from step 3 (with the softmax layer removed), and feed the extracted features into the random forest classifier trained in step 4. The class that receives the most votes from the T weak learners is the final class.
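The voting rule at the end of step 5 amounts to taking the most frequent prediction among the T trees. A minimal sketch follows; the votes are made up, and the tie-breaking rule is an assumption because the source does not address ties.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the most trees.

    Ties go to the class that appears first in the input, which is how
    Counter.most_common orders elements with equal counts.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from T = 5 trees for one audio segment.
votes = ["Jazz", "Blues", "Jazz", "Rock", "Jazz"]
predicted = majority_vote(votes)
```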

The invention proposes a deep-learning-based audio classification method that uses a hybrid model combining a convolutional neural network and a random forest. To address the insufficient feature extraction of traditional models, the invention converts audio into spectrograms and extracts their high-level features with a convolutional neural network, which exploits the strong image feature extraction ability of convolutional neural networks and simplifies the feature extraction process. To address the weak generalization ability of a single classifier, the invention adopts a random forest, which exploits the advantages of ensemble learning by building multiple decision trees for classification and so compensates for the shortcomings of a single classifier. The classification results show that the invention achieves high precision and recall.

Description of the Drawings

Figure 1 is a flowchart of the audio classification method based on a convolutional neural network and a random forest according to the invention.

Figure 2 shows a spectrogram obtained by spectral analysis.

Figure 3 is a flowchart of high-level feature extraction with the improved convolutional neural network.

Detailed Description

The specific implementation of the invention is further described below with reference to the drawings and an embodiment. The following embodiment only illustrates the invention and is not intended to limit its scope.

Embodiment 1 is an example of the invention. It uses the "GTZAN Genre Collection" as the dataset and takes audio files of nine of its genres as the training and test sets. The nine classes are Blues, Classical, Country, Disco, Jazz, Metal, Pop, Reggae, and Rock.

1. Each audio file is divided into 6 segments of equal length, and every segment carries the same label as its source file. Each segment is framed, windowed, and Fourier transformed to obtain its spectrogram. Figure 2 shows one such spectrogram. Each spectrogram image is read in, converted to grayscale, and resized to 248*248. The pixel values of the resized image are then saved to an array as one sample of the convolutional neural network dataset. These operations yield the dataset D(5400,248,248), that is, 5400 spectrograms, each 248 wide and 248 high. The dataset is split into a training set (80%) and a test set (20%), giving the training set T(4320,248,248) and the test set V(1080,248,248).
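The dataset sizes quoted above follow from simple counting, sketched below. The figure of 100 clips per genre is an assumption based on the public GTZAN collection; the source only states the resulting 5400 spectrograms. The helper names are hypothetical.

```python
def split_into_segments(samples, n_segments):
    """Cut a sequence into n_segments equal-length pieces, dropping any remainder."""
    seg_len = len(samples) // n_segments
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


def train_test_counts(n_samples, train_fraction):
    """Number of training and test samples for a given split fraction."""
    n_train = round(n_samples * train_fraction)
    return n_train, n_samples - n_train


# Assumption: 9 genres x 100 clips each, as in the public GTZAN collection.
n_clips = 9 * 100
n_spectrograms = n_clips * 6  # 6 segments per clip
n_train, n_test = train_test_counts(n_spectrograms, 0.8)

# Toy "audio" of 60 samples, split into 6 segments of 10 samples each.
segments = split_into_segments(list(range(60)), 6)
```

Because all six segments of a clip share one label, a split made at the segment level can place segments of the same recording in both the training and the test set; the source does not say whether the split was made per clip or per segment.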

2. The convolutional neural network model is trained on the training set T(4320,248,248). The network has 14 layers in total, including convolutional layers, downsampling layers, fully connected layers, a Dropout layer, and a Batch Normalization layer.

3. Once the convolutional neural network is trained, the final softmax layer is removed. The trained network is then used to extract deeper features from the spectrograms: the original training set T(4320,248,248) of spectrograms is transformed into the new training set T'(4320,9), and the original test set V(1080,248,248) is transformed into the new test set V'(1080,9).

4. The random forest, which serves as the final classifier, is trained and evaluated with the new training set T' and test set V'. Different parameter combinations are tried, as follows:

Parameter            Values
n_estimators         [10, 50, 100]
min_samples_split    [2, 3, 4]
min_samples_leaf     [1, 2, 3]
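The parameter table describes a grid of 3 x 3 x 3 = 27 candidate settings. The sketch below only enumerates that grid. The parameter names happen to match arguments of scikit-learn's `RandomForestClassifier`, but the source does not name the library or the search procedure, so no training call is shown.

```python
from itertools import product

param_grid = {
    "n_estimators": [10, 50, 100],
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": [1, 2, 3],
}

# One dict per point of the grid, in the order the values are listed.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[n] for n in names))]
```

Each entry of `combinations` would be used to train one candidate forest, and the best-scoring entry kept.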

After selection, the best parameter combination is n_estimators: 100, min_samples_split: 3, min_samples_leaf: 1. Once the random forest is trained, it is evaluated on the test set, with the following results:

Class        Precision    Recall    F1-score    Support
0            0.80         0.74      0.77        118
1            0.89         0.92      0.90        133
2            0.75         0.80      0.78        117
3            0.75         0.83      0.79        118
4            0.93         0.88      0.90        134
5            0.94         0.90      0.92        108
6            0.88         0.85      0.87        103
7            0.86         0.78      0.82        124
8            0.64         0.68      0.66        125
Avg/total    0.83         0.82      0.82        1080

As the table above shows, the method classifies audio automatically with good accuracy: the average precision reaches 83% and the average recall reaches 82%.
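The Avg/total row is consistent with a support-weighted mean of the per-class values, which the short check below reproduces. That the authors computed the averages this way is an inference from the numbers, not something the source states.

```python
# (precision, recall, f1, support) per class, copied from the results table.
rows = [
    (0.80, 0.74, 0.77, 118),
    (0.89, 0.92, 0.90, 133),
    (0.75, 0.80, 0.78, 117),
    (0.75, 0.83, 0.79, 118),
    (0.93, 0.88, 0.90, 134),
    (0.94, 0.90, 0.92, 108),
    (0.88, 0.85, 0.87, 103),
    (0.86, 0.78, 0.82, 124),
    (0.64, 0.68, 0.66, 125),
]

total_support = sum(r[3] for r in rows)


def weighted_mean(column):
    """Support-weighted mean of one metric column, rounded to two decimals."""
    return round(sum(r[column] * r[3] for r in rows) / total_support, 2)


avg_precision = weighted_mean(0)
avg_recall = weighted_mean(1)
avg_f1 = weighted_mean(2)
```

The supports sum to 1080, the size of the test set V'.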

Claims (3)

1. An audio classification method based on a convolutional neural network and a random forest, characterized by comprising the following steps:
Step 1: Perform spectral analysis on the original audio dataset. Each long audio file is first divided into several segments of equal length, every segment carrying the same label; each segment is then framed, windowed, and Fourier transformed to obtain its spectrogram, which serves as one sample of the new training set;
Step 2: Use all the spectrograms obtained in step 1 and their corresponding labels to train an improved convolutional neural network, the network having 14 layers;
Step 3: Remove the softmax layer of the convolutional neural network learned in step 2, then use the convolutional neural network to extract the high-level features of all spectrograms;
Step 4: Train a random forest classifier on the high-level spectrogram features extracted in step 3, using Gini impurity as the criterion for feature selection in the decision trees;
Step 5: Apply the spectral analysis of step 1 to the audio to be classified to obtain its spectrogram, extract the high-level features of the spectrogram with the convolutional neural network from step 3 with the softmax layer removed, and feed the extracted features into the random forest classifier trained in step 4 for audio classification, the final classification result being obtained by voting.

2. The audio classification method based on a convolutional neural network and a random forest according to claim 1, characterized in that, with respect to audio features, the method comprises two stages of feature extraction: the first stage obtains the spectrogram corresponding to the audio through spectral analysis, providing a preliminary extraction of low-level time-frequency features, and the second stage uses the improved convolutional neural network to extract high-level features from the spectrogram.

3. The audio classification method based on a convolutional neural network and a random forest according to claim 1, characterized in that, to overcome the weak generalization ability that results from using softmax as the classifier of the convolutional neural network, the method replaces the last layer of the convolutional neural network with a random forest as the final audio classifier.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810037337.8A CN108122562A (en) 2018-01-16 2018-01-16 A kind of audio frequency classification method based on convolutional neural networks and random forest

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810037337.8A CN108122562A (en) 2018-01-16 2018-01-16 A kind of audio frequency classification method based on convolutional neural networks and random forest

Publications (1)

Publication Number Publication Date
CN108122562A (en) 2018-06-05

Family

ID=62232892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810037337.8A Pending CN108122562A (en) 2018-01-16 2018-01-16 A kind of audio frequency classification method based on convolutional neural networks and random forest

Country Status (1)

Country Link
CN (1) CN108122562A (en)

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766461A (en) * 2018-07-17 2018-11-06 厦门美图之家科技有限公司 Audio feature extraction methods and device
CN109002529A (en) * 2018-07-17 2018-12-14 厦门美图之家科技有限公司 Audio search method and device
CN109166593A (en) * 2018-08-17 2019-01-08 腾讯音乐娱乐科技(深圳)有限公司 audio data processing method, device and storage medium
CN109493881A (en) * 2018-11-22 2019-03-19 北京奇虎科技有限公司 A kind of labeling processing method of audio, device and calculate equipment
CN109684506A (en) * 2018-11-22 2019-04-26 北京奇虎科技有限公司 A kind of labeling processing method of video, device and calculate equipment
CN109739112A (en) * 2018-12-29 2019-05-10 张卫校 A kind of wobble objects control method and wobble objects
CN109949825A (en) * 2019-03-06 2019-06-28 河北工业大学 Noise classification method based on FPGA-accelerated PCNN algorithm
CN110010128A (en) * 2019-04-09 2019-07-12 天津松下汽车电子开发有限公司 A kind of sound control method and system of high discrimination
CN110324657A (en) * 2019-05-29 2019-10-11 北京奇艺世纪科技有限公司 Model generation, method for processing video frequency, device, electronic equipment and storage medium
CN110414483A (en) * 2019-08-13 2019-11-05 山东浪潮人工智能研究院有限公司 A face recognition method and system based on deep neural network and random forest
CN110600038A (en) * 2019-08-23 2019-12-20 北京工业大学 Audio fingerprint dimension reduction method based on discrete kini coefficient
CN110675893A (en) * 2019-09-19 2020-01-10 腾讯音乐娱乐科技(深圳)有限公司 Song identification method and device, storage medium and electronic equipment
CN110808033A (en) * 2019-09-25 2020-02-18 武汉科技大学 An Audio Classification Method Based on Double Data Augmentation Strategy
CN110931045A (en) * 2019-12-20 2020-03-27 重庆大学 Audio feature generation method based on convolutional neural network
CN110931046A (en) * 2019-11-29 2020-03-27 福州大学 Audio high-level semantic feature extraction method and system for overlapped sound event detection
CN110933236A (en) * 2019-10-25 2020-03-27 杭州哲信信息技术有限公司 Machine learning-based null number identification method
CN111159464A (en) * 2019-12-26 2020-05-15 腾讯科技(深圳)有限公司 Audio clip detection method and related equipment
CN111179971A (en) * 2019-12-03 2020-05-19 杭州网易云音乐科技有限公司 Nondestructive audio detection method and device, electronic equipment and storage medium
CN111508526A (en) * 2020-04-10 2020-08-07 腾讯音乐娱乐科技(深圳)有限公司 Method and device for detecting audio beat information and storage medium
CN111583890A (en) * 2019-02-15 2020-08-25 阿里巴巴集团控股有限公司 Audio classification method and device
CN112735386A (en) * 2021-01-18 2021-04-30 苏州大学 Voice recognition method based on glottal wave information
CN113313197A (en) * 2021-06-17 2021-08-27 哈尔滨工业大学 Full-connection neural network training method
CN113729715A (en) * 2021-10-11 2021-12-03 山东大学 Parkinson's disease intelligent diagnosis system based on finger pressure
CN113901977A (en) * 2020-06-22 2022-01-07 中国电力科学研究院有限公司 A deep learning-based method and system for identifying electricity theft by power users
CN115064184A (en) * 2022-06-28 2022-09-16 镁佳(北京)科技有限公司 Audio file musical instrument content identification vector representation method and device
US11905926B2 (en) * 2019-12-31 2024-02-20 Envision Digital International Pte. Ltd. Method and apparatus for inspecting wind turbine blade, and device and storage medium thereof
CN118098270A (en) * 2024-04-24 2024-05-28 安徽大学 A noise source tracing method based on feature extraction and feature fusion

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106408015A (en) * 2016-09-13 2017-02-15 电子科技大学成都研究院 Road fork identification and depth estimation method based on convolutional neural network
CN106952274A (en) * 2017-03-14 2017-07-14 西安电子科技大学 Pedestrian detection and ranging method based on stereo vision
CN107066553A (en) * 2017-03-24 2017-08-18 北京工业大学 A kind of short text classification method based on convolutional neural networks and random forest
CN107492383A (en) * 2017-08-07 2017-12-19 上海六界信息技术有限公司 Screening technique, device, equipment and the storage medium of live content
CN107491606A (en) * 2017-08-17 2017-12-19 安徽工业大学 Variable working condition epicyclic gearbox sun gear method for diagnosing faults based on more attribute convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
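曹林林 (Cao Linlin): "Application of convolutional neural networks in high-resolution remote sensing image classification", Science of Surveying and Mapping (《测绘科学》) *
罗建华 (Luo Jianhua): "Hyperspectral remote sensing image classification based on deep convolutional neural networks", Journal of Xihua University (《西华大学学报》) *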

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766461A (en) * 2018-07-17 2018-11-06 厦门美图之家科技有限公司 Audio feature extraction methods and device
CN109002529A (en) * 2018-07-17 2018-12-14 厦门美图之家科技有限公司 Audio search method and device
CN108766461B (en) * 2018-07-17 2021-01-26 厦门美图之家科技有限公司 Audio feature extraction method and device
CN109002529B (en) * 2018-07-17 2021-02-02 厦门美图之家科技有限公司 Audio retrieval method and device
CN109166593A (en) * 2018-08-17 2019-01-08 腾讯音乐娱乐科技(深圳)有限公司 audio data processing method, device and storage medium
CN109684506A (en) * 2018-11-22 2019-04-26 北京奇虎科技有限公司 Video tagging processing method and device and computing equipment
CN109493881B (en) * 2018-11-22 2023-12-05 北京奇虎科技有限公司 Method and device for labeling audio and computing equipment
CN109684506B (en) * 2018-11-22 2023-10-20 三六零科技集团有限公司 Video tagging processing method and device and computing equipment
CN109493881A (en) * 2018-11-22 2019-03-19 北京奇虎科技有限公司 Method and device for labeling audio and computing equipment
CN109739112A (en) * 2018-12-29 2019-05-10 张卫校 Swinging object control method and swinging object
CN109739112B (en) * 2018-12-29 2022-03-04 张卫校 Swinging object control method and swinging object
CN111583890A (en) * 2019-02-15 2020-08-25 阿里巴巴集团控股有限公司 Audio classification method and device
CN109949825A (en) * 2019-03-06 2019-06-28 河北工业大学 Noise classification method based on FPGA-accelerated PCNN algorithm
CN110010128A (en) * 2019-04-09 2019-07-12 天津松下汽车电子开发有限公司 Voice control method and system with high recognition rate
CN110324657A (en) * 2019-05-29 2019-10-11 北京奇艺世纪科技有限公司 Model generation method, video processing method, device, electronic equipment and storage medium
CN110414483A (en) * 2019-08-13 2019-11-05 山东浪潮人工智能研究院有限公司 A face recognition method and system based on deep neural network and random forest
CN110600038B (en) * 2019-08-23 2022-04-05 北京工业大学 A Dimensionality Reduction Method of Audio Fingerprint Based on Discrete Gini Coefficient
CN110600038A (en) * 2019-08-23 2019-12-20 北京工业大学 Audio fingerprint dimension reduction method based on discrete Gini coefficient
CN110675893A (en) * 2019-09-19 2020-01-10 腾讯音乐娱乐科技(深圳)有限公司 Song identification method and device, storage medium and electronic equipment
CN110808033A (en) * 2019-09-25 2020-02-18 武汉科技大学 An Audio Classification Method Based on Double Data Augmentation Strategy
CN110808033B (en) * 2019-09-25 2022-04-15 武汉科技大学 Audio classification method based on dual data enhancement strategy
CN110933236A (en) * 2019-10-25 2020-03-27 杭州哲信信息技术有限公司 Machine learning-based null number identification method
CN110931046A (en) * 2019-11-29 2020-03-27 福州大学 Audio high-level semantic feature extraction method and system for overlapped sound event detection
CN111179971A (en) * 2019-12-03 2020-05-19 杭州网易云音乐科技有限公司 Nondestructive audio detection method and device, electronic equipment and storage medium
CN110931045A (en) * 2019-12-20 2020-03-27 重庆大学 Audio feature generation method based on convolutional neural network
CN111159464A (en) * 2019-12-26 2020-05-15 腾讯科技(深圳)有限公司 Audio clip detection method and related equipment
CN111159464B (en) * 2019-12-26 2023-12-15 腾讯科技(深圳)有限公司 Audio clip detection method and related equipment
US11905926B2 (en) * 2019-12-31 2024-02-20 Envision Digital International Pte. Ltd. Method and apparatus for inspecting wind turbine blade, and device and storage medium thereof
CN111508526B (en) * 2020-04-10 2022-07-01 腾讯音乐娱乐科技(深圳)有限公司 Method and device for detecting audio beat information and storage medium
CN111508526A (en) * 2020-04-10 2020-08-07 腾讯音乐娱乐科技(深圳)有限公司 Method and device for detecting audio beat information and storage medium
CN113901977A (en) * 2020-06-22 2022-01-07 中国电力科学研究院有限公司 A deep learning-based method and system for identifying electricity theft by power users
CN112735386B (en) * 2021-01-18 2023-03-24 苏州大学 Voice recognition method based on glottal wave information
CN112735386A (en) * 2021-01-18 2021-04-30 苏州大学 Voice recognition method based on glottal wave information
CN113313197A (en) * 2021-06-17 2021-08-27 哈尔滨工业大学 Full-connection neural network training method
CN113729715A (en) * 2021-10-11 2021-12-03 山东大学 Parkinson's disease intelligent diagnosis system based on finger pressure
CN115064184A (en) * 2022-06-28 2022-09-16 镁佳(北京)科技有限公司 Audio file musical instrument content identification vector representation method and device
CN118098270A (en) * 2024-04-24 2024-05-28 安徽大学 A noise source tracing method based on feature extraction and feature fusion

Similar Documents

Publication Publication Date Title
CN108122562A (en) A kind of audio frequency classification method based on convolutional neural networks and random forest
CN101247470B (en) Method realized by computer for detecting scene boundaries in videos
US10515292B2 (en) Joint acoustic and visual processing
CN107301171A (en) Text sentiment analysis method and system based on sentiment dictionary learning
CN111986699B (en) Sound event detection method based on full convolution network
CN109308912A (en) Music style recognition method, device, computer equipment and storage medium
CN109508379A (en) Short text clustering method based on weighted word vector representation and combined similarity
CN103268339A (en) Method and system for named entity recognition in microblog messages
CN110120218A (en) Expressway large vehicle recognition method based on GMM-HMM
CN110399478A (en) Event discovery method and device
CN104166684A (en) Cross-media retrieval method based on uniform sparse representation
CN108846047A (en) Image retrieval method and system based on convolutional features
CN110990563A (en) A method and system for constructing traditional cultural material library based on artificial intelligence
CN112489689B (en) Cross-database voice emotion recognition method and device based on multi-scale difference countermeasure
CN109086794B (en) Driving behavior pattern recognition method based on T-LDA topic model
CN110910175A (en) Tourist ticket product portrait generation method
CN107357785A (en) Topic feature word extraction method and system, and sentiment polarity determination method and system
Flamary et al. Spoken WordCloud: Clustering recurrent patterns in speech
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN109492105A (en) Text sentiment classification method based on multi-feature ensemble learning
Ferragne et al. Towards phonetic interpretability in deep learning applied to voice comparison
Blanchard et al. Getting the subtext without the text: Scalable multimodal sentiment classification from visual and acoustic modalities
CN107292348A (en) Bagging_BSJ short text classification method
CN108985369A (en) Identically distributed ensemble prediction method and system for imbalanced dataset classification
CN108920451A (en) Text sentiment analysis method based on dynamic threshold and multiple classifiers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (Application publication date: 20180605)