CN110085218A

CN110085218A - An audio scene recognition method based on a feature pyramid network

Info

Publication number: CN110085218A
Application number: CN201910233193.8A
Authority: CN
Inventors: 张涛; 梁晋华
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2019-03-26
Filing date: 2019-03-26
Publication date: 2019-08-02

Abstract

An audio scene recognition method based on a feature pyramid network: a feature pyramid network model for audio scene recognition is established; the model is trained on a training set comprising audio files of different scene categories and their corresponding scene categories; the audio file to be recognized is read and truncated; Mel features are extracted to obtain a two-dimensional Mel spectrogram of each audio frame, which is normalized and forward-propagated through the trained model to obtain the prediction probabilities for the different audio scene categories, and the scene category with the highest prediction probability is taken as the prediction output for the audio frame corresponding to that spectrogram; finally, a prediction is made for the entire audio file to be recognized. The invention makes full use of low-level feature information and improves model performance. It can exploit the information carried by the growing volume of data available under the current big-data trend, and its prediction speed is fast.

Description

An audio scene recognition method based on a feature pyramid network

Technical field

The present invention relates to an audio scene recognition method, and more particularly to an audio scene recognition method based on a feature pyramid network.

Background art

Audio scene recognition is a method in which a machine processes the data stream of a recorded or uploaded audio file, with the aim of enabling the machine to imitate humans in identifying the specific background information behind the audio (for example, a park, a street, or a restaurant).

In the field of machine learning, many different models and audio feature representations have been proposed to solve the scene recognition problem. Research on using neural networks for scene audio had already appeared as early as 1997. In 1998, Liu et al. used recurrent neural networks (RNNs) and a nearest-neighbor classifier to distinguish five classes of ambient sound. However, because too many parameters were introduced during training, the model complexity of these two neural networks was very high and their performance after training was poor. In the challenge held by the IEEE AASP in 2013, many participating teams tried traditional machine learning methods, such as Gaussian mixture models (GMMs), support vector machines (SVMs), tree-based methods, and bag-based methods, to distinguish 10 different sound scene classes. Although these methods have lower computational complexity, their model structures are relatively simple and cannot fully exploit the growing volume of data available under the current big-data trend, so traditional machine learning methods cannot achieve satisfactory audio scene recognition results.

In recent years, the introduction of convolutional neural networks (CNNs) has advanced the application of neural networks and deep learning in fields such as pattern recognition. Their ideas of local receptive fields and weight sharing reduce the number of model parameters while capturing more features, which improves network performance. In 2017, Valenti et al. applied CNNs to audio scene recognition and achieved good results. However, the bottom-up feature extraction process of conventional CNNs cannot effectively use the detail information in low-level features.

Recently, the computer vision field has proposed a method of building a feature pyramid from a CNN, which makes full use of low-level feature information while retaining high-level abstract semantic information.

Summary of the invention

The technical problem to be solved by the invention is to provide an audio scene recognition method based on a feature pyramid network that achieves higher accuracy.

The technical solution adopted by the invention is an audio scene recognition method based on a feature pyramid network, comprising the following steps:

1) Establish a feature pyramid network model for audio scene recognition;

2) Input a training set, comprising audio files of different scene categories and their corresponding scene categories, into the feature pyramid network model for audio scene recognition, and train the model;

3) Read the audio file to be recognized and truncate it;

4) Extract Mel features to obtain a two-dimensional Mel spectrogram of each audio frame, and normalize it;

5) Input each normalized two-dimensional Mel spectrogram into the trained feature pyramid network model for audio scene recognition and perform forward propagation; the final softmax layer yields, frame by frame, the prediction probabilities for the different audio scene categories, and the scene category with the highest prediction probability is taken as the prediction output for the audio frame corresponding to that two-dimensional Mel spectrogram;

6) Make a prediction for the entire audio file to be recognized: among the prediction outputs of all audio frames, the audio scene category that occurs most frequently is output as the prediction result for the entire audio file.
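Steps 5) and 6) can be illustrated with a minimal sketch in plain Python. The label names, the raw scores, and the helper names `softmax`, `predict_frame`, and `predict_file` are hypothetical, and the tie-breaking rule (first label seen wins) is an assumption that the text does not specify.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_frame(logits, labels):
    """Step 5: pick the label with the highest softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]


def predict_file(per_frame_logits, labels):
    """Step 6: majority vote over the per-frame predictions.

    Ties are resolved in favour of the label counted first (an assumption).
    """
    votes = Counter(predict_frame(l, labels) for l in per_frame_logits)
    return votes.most_common(1)[0][0]


# Hypothetical scene labels and network outputs for three audio frames.
labels = ["park", "street", "restaurant"]
per_frame_logits = [[2.0, 0.5, 0.1], [0.2, 1.7, 0.3], [1.9, 0.4, 0.2]]
result = predict_file(per_frame_logits, labels)
```

Here two of the three frames are classified as "park", so the vote returns that label for the whole file.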

The feature pyramid network model for audio scene recognition described in step 1) uses Xception as the backbone of the feature pyramid network model; the predictor in the model consists, in order from input to output, of a convolutional layer with a 3 × 3 kernel, a global pooling layer, a fully connected layer, and a softmax layer.

The truncation described in step 3) cuts the audio file to be recognized into several signal segments with a fixed duration of 10 s.
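The truncation can be sketched as follows. This is only an illustration: the function name `split_into_segments` and the toy sample rate are hypothetical, and discarding a trailing segment shorter than 10 s is an assumption, since the text does not say how the remainder is handled.

```python
def split_into_segments(samples, sample_rate, segment_seconds=10):
    """Cut a sample sequence into consecutive fixed-duration segments.

    A trailing remainder shorter than one segment is dropped (assumption).
    """
    seg_len = int(sample_rate * segment_seconds)
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    n_full = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]


# 25 s of silence at a toy sample rate of 100 Hz gives two full 10 s segments.
signal = [0.0] * 2500
segments = split_into_segments(signal, sample_rate=100)
```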

The extraction of Mel features described in step 4) comprises:

(4.1) applying framing and windowing to each signal segment;

(4.2) filtering each resulting audio frame through a Mel filter bank, computing the energy passing through each Mel filter within each time step of the audio frame, assembling all the filter energies obtained within each time step into an energy vector, and merging the energy vectors of all time steps to obtain the two-dimensional Mel spectrogram of the corresponding audio frame.
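Step (4.2) can be sketched in plain Python as below. Several details are assumptions rather than the patent's stated method: triangular filters equally spaced on the Mel scale, the common Hz-to-Mel conversion 2595·log10(1 + f/700), and the use of the squared FFT magnitude as the power term. All function names and the toy parameters are hypothetical.

```python
import math


def hz_to_mel(f):
    """Common Hz-to-Mel conversion (assumed; not stated in the text)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank(num_filters, num_bins, sample_rate):
    """Triangular filters over num_bins FFT bins spanning 0..sample_rate/2."""
    max_mel = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(max_mel * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bin_hz = [k * (sample_rate / 2.0) / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_energies(magnitudes, bank):
    """Energy vector for one time step: one weighted power sum per filter."""
    power = [x * x for x in magnitudes]
    return [sum(p * h for p, h in zip(power, row)) for row in bank]


def mel_spectrogram(magnitude_frames, bank):
    """Merge per-time-step energy vectors: rows are filters, columns are time steps."""
    columns = [mel_energies(frame, bank) for frame in magnitude_frames]
    return [list(row) for row in zip(*columns)]


# Toy example: 4 filters, 33 FFT bins, three time steps of flat spectra.
bank = mel_filter_bank(num_filters=4, num_bins=33, sample_rate=8000)
frames = [[1.0] * 33, [0.5] * 33, [0.0] * 33]
spec = mel_spectrogram(frames, bank)
```

Halving the magnitude in the second time step scales every filter energy by a quarter, which is a quick sanity check on the power computation.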

The energy passing through each Mel filter within each time step of the audio frame, as computed in step (4.2), is

where M is the number of Mel filters, H(k) is the transfer function of the Mel filter, and X(k) is the corresponding FFT amplitude.
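For orientation only, a commonly used form of the Mel filter energy that is consistent with the symbols defined above can be written as follows. This is an assumption, not the formula of the patent: the filter index m, the number of FFT bins N, and the use of the squared amplitude are introduced here for illustration.

```latex
E(m) = \sum_{k=0}^{N-1} \lvert X(k) \rvert^{2} \, H_m(k), \qquad 0 \le m < M
```

Under this reading, H_m(k) denotes the transfer function H(k) of the m-th of the M Mel filters, and the M values E(0), ..., E(M-1) form the energy vector of one time step.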

Because the audio scene recognition method based on a feature pyramid network of the invention uses neural networks from deep learning, it can make full use of the information carried by the growing volume of data available under the current big-data trend. Moreover, since only the forward-propagation prediction process is involved in actual use, its prediction speed is fast. Compared with conventional CNN methods, the invention makes full use of low-level feature information and can improve model performance without increasing model complexity.

Brief description of the drawings

Fig. 1 is a flow chart of the audio scene recognition method based on a feature pyramid network of the invention.

Specific embodiment

The audio scene recognition method based on a feature pyramid network of the invention is described in detail below with reference to an embodiment and the accompanying drawing.

The audio scene recognition method based on a feature pyramid network of the invention comprises the following steps:

1) Establish a feature pyramid network model for audio scene recognition;

The feature pyramid network model for audio scene recognition uses Xception as the backbone of the feature pyramid network model; the predictor in the model consists, in order from input to output, of a convolutional layer with a 3 × 3 kernel, a global pooling layer, a fully connected layer, and a softmax layer.

2) Input a training set, comprising audio files of different scene categories and their corresponding scene categories, into the feature pyramid network model for audio scene recognition, and train the model; once the network has been trained on the training set, the prediction process involves only forward propagation;

To converge quickly and obtain the best performance, the Adam optimizer is used during training, together with adaptive learning-rate decay.
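The training setup can be illustrated with a toy, one-dimensional sketch of the Adam update combined with a decaying learning rate. The exponential decay schedule, all hyperparameter values, and the function name `adam_minimize` are assumptions chosen for illustration; the text does not specify which decay rule or settings were used.

```python
import math


def adam_minimize(grad_fn, x0, steps=200, base_lr=0.1, decay=0.99,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a 1-D function with Adam and an exponentially decayed step size."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        lr = base_lr * decay ** (t - 1)        # assumed decay schedule
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g      # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g  # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)         # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

In a real training run the same update would be applied to every network weight by a deep learning framework's optimizer rather than written by hand.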

3) Read the audio file to be recognized and truncate it; the truncation cuts the audio file to be recognized into several signal segments with a fixed duration of 10 s.

4) Extract Mel features to obtain a two-dimensional Mel spectrogram of each audio frame, and normalize it; the extraction of Mel features comprises:

(4.1) applying framing and windowing to each signal segment;

(4.2) filtering each resulting audio frame through a Mel filter bank, computing the energy passing through each Mel filter within each time step of the audio frame, assembling all the filter energies obtained within each time step into an energy vector, and merging the energy vectors of all time steps to obtain the two-dimensional Mel spectrogram of the corresponding audio frame.

The energy passing through each Mel filter within each time step of the audio frame is computed as:

Specific example is given below:

1. Read the audio signal and truncate it, cutting it into segments with a fixed duration of 10 s each;

2. Apply framing and windowing to each fixed-duration signal: each frame contains 2048 samples and is multiplied by a 2048-point Hamming window;

3. Pass the framed signal through a Mel filter bank for feature extraction and take the logarithm; the number of filters is 134, the filter window length is 1704 points, and adjacent frames overlap by 852 points;

4. Normalize the resulting Mel spectrogram;

5. Input the normalized Mel spectrogram into the ASCFPN network and perform forward propagation;

6. Use voting to tally the prediction results of the individual frames, and output the scene category predicted most often as the prediction result for the entire audio.
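Items 2 and 4 of the example above can be sketched as follows. The 2048-sample frame and Hamming window come from the text; the hop size of 1024 samples, the zero-mean, unit-variance normalization, and the function names are assumptions, since the text does not state the hop for the 2048-sample frames or which normalization was applied.

```python
import math


def hamming(n):
    """Symmetric n-point Hamming window."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(samples, frame_len=2048, hop=1024):
    """Split a signal into overlapping frames and apply a Hamming window.

    The hop size is a hypothetical default, not a value from the text.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


def normalize(spectrogram):
    """Scale a 2-D spectrogram to zero mean and unit variance (assumed scheme)."""
    values = [v for row in spectrogram for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # avoid division by zero for constant input
    return [[(v - mean) / std for v in row] for row in spectrogram]


signal = [1.0] * 5120
frames = frame_signal(signal)
norm = normalize([[1.0, 2.0], [3.0, 4.0]])
```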

Table 1. Comparison of audio scene recognition algorithms

As shown in Table 1, ASCFPN is the algorithm proposed by the invention. On the same data set, the accuracy of the ASCFPN-based audio scene recognition method is clearly higher than that of the other two baseline methods, so the method proposed in the invention performs better.

Claims

1. An audio scene recognition method based on a feature pyramid network, characterized by comprising the following steps:

1) Establish a feature pyramid network model for audio scene recognition;

2) Input a training set, comprising audio files of different scene categories and their corresponding scene categories, into the feature pyramid network model for audio scene recognition, and train the model;

3) Read the audio file to be recognized and truncate it;

4) Extract Mel features to obtain a two-dimensional Mel spectrogram of each audio frame, and normalize it;

5) Input each normalized two-dimensional Mel spectrogram into the trained feature pyramid network model for audio scene recognition and perform forward propagation; the final softmax layer yields, frame by frame, the prediction probabilities for the different audio scene categories, and the scene category with the highest prediction probability is taken as the prediction output for the audio frame corresponding to that two-dimensional Mel spectrogram;

6) Make a prediction for the entire audio file to be recognized: among the prediction outputs of all audio frames, the audio scene category that occurs most frequently is output as the prediction result for the entire audio file.

2. The audio scene recognition method based on a feature pyramid network according to claim 1, characterized in that the feature pyramid network model for audio scene recognition described in step 1) uses Xception as the backbone of the feature pyramid network model, and the predictor in the model consists, in order from input to output, of a convolutional layer with a 3 × 3 kernel, a global pooling layer, a fully connected layer, and a softmax layer.

3. The audio scene recognition method based on a feature pyramid network according to claim 1, characterized in that the truncation described in step 3) cuts the audio file to be recognized into several signal segments with a fixed duration of 10 s.

4. The audio scene recognition method based on a feature pyramid network according to claim 1, characterized in that the extraction of Mel features described in step 4) comprises:

(4.1) applying framing and windowing to each signal segment;

(4.2) filtering each resulting audio frame through a Mel filter bank, computing the energy passing through each Mel filter within each time step of the audio frame, assembling all the filter energies obtained within each time step into an energy vector, and merging the energy vectors of all time steps to obtain the two-dimensional Mel spectrogram of the corresponding audio frame.

5. The audio scene recognition method based on a feature pyramid network according to claim 4, characterized in that the energy passing through each Mel filter within each time step of the audio frame, as computed in step (4.2), is

where M is the number of Mel filters, H(k) is the transfer function of the Mel filter, and X(k) is the corresponding FFT amplitude.