CN110782878B - Attention mechanism-based multi-scale audio scene recognition method


Info

Publication number
CN110782878B
CN110782878B
Authority
CN
China
Prior art keywords
scale
processing
audio
layer
module
Prior art date
Legal status
Active
Application number
CN201910960088.4A
Other languages
Chinese (zh)
Other versions
CN110782878A (en)
Inventor
张涛 (Zhang Tao)
梁晋华 (Liang Jinhua)
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201910960088.4A priority Critical patent/CN110782878B/en
Publication of CN110782878A publication Critical patent/CN110782878A/en
Application granted granted Critical
Publication of CN110782878B publication Critical patent/CN110782878B/en

Classifications

    • G - PHYSICS
      • G06 - COMPUTING; CALCULATING OR COUNTING
        • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 - Computing arrangements based on biological models
            • G06N3/02 - Neural networks
              • G06N3/04 - Architecture, e.g. interconnection topology
                • G06N3/045 - Combinations of networks
      • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L15/00 - Speech recognition
            • G10L15/04 - Segmentation; Word boundary detection
            • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
              • G10L15/063 - Training
            • G10L15/08 - Speech classification or search
              • G10L15/16 - Speech classification or search using artificial neural networks
            • G10L15/28 - Constructional details of speech recognition systems
              • G10L15/285 - Memory allocation or algorithm optimisation to reduce hardware requirements


Abstract

A multi-scale audio scene recognition method based on an attention mechanism comprises the following steps: establishing an attention-mechanism-based multi-scale audio scene recognition convolutional neural network model for accurately recognizing audio scenes with different frequency band ranges and different durations; inputting a training set of audio files covering different scene categories, together with their corresponding scene labels, into the model and training it; reading an audio file and preprocessing it to obtain audio signal segments; extracting a logarithmic Mel spectrogram from each audio signal segment; and inputting the logarithmic Mel spectrogram into the trained model to obtain the final scene category. The method achieves good recognition accuracy on multi-scale sound scenes with different frequency band ranges and durations, and can be applied to embedded mobile devices and the like.

Description

Attention mechanism-based multi-scale audio scene recognition method
Technical Field
The invention relates to audio scene recognition methods, and in particular to a multi-scale audio scene recognition method based on an attention mechanism.
Background
Audio scene recognition is a class of methods that let a machine identify the specific background information behind audio (e.g., a park, a street, or a restaurant) by processing a recorded audio file or an uploaded data stream, so that the machine can mimic human perception.
In the field of machine learning, many different models and audio feature representations have been proposed to solve the scene recognition problem. Studies applying neural networks to scene audio date back to 1997. In 1998, Liu et al. used recurrent neural networks (RNNs) and a nearest-neighbor classifier to distinguish five classes of ambient sounds. However, because too many parameters were introduced during training, both of these neural network models had very high complexity and performed poorly after training. In the challenge hosted by the IEEE AASP in 2013, many participating teams attempted to distinguish 10 classes of sound scenes using conventional machine learning methods, such as Gaussian mixture models (GMMs), support vector machines (SVMs), tree-based methods, and bag-based methods. Although these methods have low computational complexity, their model structures are relatively simple and cannot fully exploit the ever-growing amounts of data available under the current big-data trend, so they cannot achieve satisfactory audio scene recognition performance.
In recent years, the introduction of convolutional neural networks (CNNs) has advanced the application of neural networks and deep learning in fields such as pattern recognition. The ideas of local perception and weight sharing allow more features to be captured while reducing model parameters, thereby improving network performance. In 2017, Valenti et al. applied CNNs to the field of audio scene recognition and obtained good results. In 2018, Kong et al. proposed a CNN structure with 8 convolutional layers and achieved good performance in the DCASE challenge held by the IEEE AASP that year. However, existing CNN-based methods usually perform scene recognition with single-scale sound features, so the trained models tend to suit only certain special scene types, and the overall scene recognition accuracy is not ideal.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a multi-scale audio scene recognition method, based on an attention mechanism, that offers higher accuracy and better real-time performance.
The technical solution adopted by the invention is as follows: a multi-scale audio scene recognition method based on an attention mechanism comprises the following steps:
1) establishing an attention-mechanism-based multi-scale audio scene recognition convolutional neural network model, which is used for accurately recognizing audio scenes with different frequency band ranges and different durations;
2) inputting a training set of audio files covering different scene categories, together with their corresponding scene labels, into the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model and training the model;
3) reading an audio file and preprocessing it to obtain an audio signal segment;
4) extracting a logarithmic Mel spectrogram from said audio signal segment;
5) inputting the logarithmic Mel spectrogram into the trained attention-mechanism-based multi-scale audio scene recognition convolutional neural network model to obtain the final scene category.
The attention-mechanism-based multi-scale audio scene recognition convolutional neural network model in step 1) comprises the following components in series: a feature extraction module, built from an Xception model, for extracting features of different scales from the received logarithmic Mel spectrogram; a feature processing module for processing the different-scale features extracted by the feature extraction module to obtain feature vectors representing the different scales; an attention module for fusing the feature vectors representing the different scales and performing scene classification; and a weight distribution module for processing the lowest-scale features output by the feature extraction module and feeding the result to the attention module.
The different-scale features output by the second, third and fourth pooling layers of the feature extraction module are respectively sent to the feature processing module, and the lowest-scale features output by the first pooling layer of the feature extraction module are sent to the weight distribution module.
The feature processing module comprises:
a first transverse connection structure, which sequentially applies 1 × 1 convolution, 3 × 3 convolution and global pooling to the received upper-layer scale features to obtain an upper-layer scale feature vector, sends the vector to the attention module, and sends the upper-layer scale feature information after the 1 × 1 convolution to the second transverse connection structure;
a second transverse connection structure, which applies a 1 × 1 convolution to the received middle-layer scale features and up-samples the received upper-layer scale feature information, adds the two results to obtain middle-layer scale feature information, applies 3 × 3 convolution and global pooling to it to obtain a middle-layer scale feature vector, sends the vector to the attention module, and sends the middle-layer scale feature information to the third transverse connection structure;
and a third transverse connection structure, which applies a 1 × 1 convolution to the received bottom-layer scale features and up-samples the received middle-layer scale feature information, adds the two results to obtain bottom-layer scale feature information, and applies 3 × 3 convolution and global pooling to it to obtain a bottom-layer scale feature vector, which is sent to the attention module.
The weight distribution module sequentially applies 1 × 1 convolution, 3 × 3 convolution, global pooling and fully connected layer processing to the lowest-scale features to obtain three weight coefficients that distribute attention across the different scales, and sends the weight coefficients to the attention module.
The attention module computes a weighted average of the upper-layer, middle-layer and bottom-layer scale feature vectors output by the feature processing module, using the three weight coefficients output by the weight distribution module, and then sequentially applies fully connected layer processing and classification to obtain the final scene category.
The preprocessing in step 3) truncates the input signal into signal segments of fixed 10 s duration.
Step 4) comprises the following steps:
(1) performing frame division and windowing on the input audio signal segment;
(2) passing the resulting audio frames through a Mel filter bank: within each time step, the energy passing through each Mel filter is computed; the energies of all filters within a time step form an energy vector, and the energy vectors of all time steps are combined to obtain the two-dimensional Mel spectrogram of the audio frame;
(3) taking the logarithm of the two-dimensional Mel spectrogram to obtain the logarithmic Mel spectrogram.
Compared with conventional single-scale convolutional neural network methods, the attention-mechanism-based multi-scale audio scene recognition method of the invention achieves good recognition accuracy on multi-scale sound scenes with different frequency band ranges and durations, and at equivalent model complexity it attains higher overall accuracy. Because the model complexity is low, memory consumption in practical applications is small and real-time performance is good, so the method can be applied to embedded mobile devices and the like. In addition, since the method relies heavily on tensor computation, it can run on GPUs/TPUs and the like to greatly increase processing speed.
Drawings
FIG. 1 is a flow chart of the multi-scale audio scene recognition method based on an attention mechanism according to the present invention;
FIG. 2 is a schematic diagram of the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model of the present invention;
FIG. 3 is a schematic diagram of the first transverse connection structure of the feature processing module in the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model;
FIG. 4 is a schematic diagram of the second transverse connection structure of the feature processing module in the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model;
FIG. 5 is a schematic diagram of the third transverse connection structure of the feature processing module in the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model;
FIG. 6 is a schematic diagram of the weight distribution module in the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model;
FIG. 7 is a schematic diagram of the attention module in the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model.
Detailed Description
The multi-scale audio scene recognition method based on an attention mechanism of the present invention is described in detail below with reference to embodiments and the drawings.
The invention discloses a multi-scale audio scene recognition method based on an attention mechanism, which extracts the sound features of key acoustic events with different durations and frequency band distributions in a scene. On this basis, the invention uses an attention mechanism to distribute attention over the extracted sound features of each scale, so that the features of all scales are fused into a new sound feature for recognizing the sound scene.
As shown in fig. 1, the method for identifying a multi-scale audio scene based on an attention mechanism of the present invention includes the following steps:
1) establishing an attention-mechanism-based multi-scale audio scene recognition convolutional neural network model, which is used for accurately recognizing audio scenes with different frequency band ranges and different durations; wherein:
As shown in fig. 2, the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model comprises: a feature extraction module 1, built from an Xception model, for extracting features of different scales from the received logarithmic Mel spectrogram; a feature processing module 2 for processing the different-scale features extracted by the feature extraction module 1 to obtain feature vectors representing the different scales; an attention module 4 for fusing the feature vectors representing the different scales and performing scene classification; and a weight distribution module 3 for processing the lowest-scale features output by the feature extraction module 1 and feeding the result to the attention module 4.
The different-scale features output by the second, third and fourth pooling layers of the feature extraction module 1 are respectively sent to the feature processing module 2, and the lowest-scale features output by the first pooling layer of the feature extraction module 1 are sent to the weight distribution module 3.
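The patent does not state how these intermediate pooling outputs are routed out of the backbone. One common way, sketched below in Python/PyTorch purely as an assumption, is to register forward hooks on the four pooling layers so that a single forward pass through the Xception backbone exposes all four scale features; the function name and the way the pooling layers are located are hypothetical:

```python
from typing import Dict, List

import torch
import torch.nn as nn


def tap_pooling_outputs(pool_layers: List[nn.Module]) -> Dict[int, torch.Tensor]:
    """Collect the outputs of the backbone's first four pooling layers.

    `pool_layers` is an ordered list of those pooling modules; how to locate
    them inside a particular Xception implementation differs between
    libraries and is left open here.
    """
    feats: Dict[int, torch.Tensor] = {}
    for i, layer in enumerate(pool_layers):
        # The default argument pins the index at hook-registration time.
        layer.register_forward_hook(
            lambda module, inputs, output, i=i: feats.__setitem__(i, output))
    return feats

# After one forward pass of the backbone, feats[0] (first pooling layer,
# lowest scale) goes to the weight distribution module 3, while feats[1],
# feats[2] and feats[3] go to the feature processing module 2.
```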
The feature processing module 2 comprises:
As shown in fig. 3, the first transverse connection structure 2.1 sequentially applies 1 × 1 convolution, 3 × 3 convolution and global pooling to the received upper-layer scale features to obtain an upper-layer scale feature vector, sends the vector to the attention module 4, and sends the upper-layer scale feature information after the 1 × 1 convolution to the second transverse connection structure 2.2;
As shown in fig. 4, the second transverse connection structure 2.2 applies a 1 × 1 convolution to the received middle-layer scale features and up-samples the received upper-layer scale feature information, adds the two results to obtain middle-layer scale feature information, applies 3 × 3 convolution and global pooling to it to obtain a middle-layer scale feature vector, sends the vector to the attention module 4, and sends the middle-layer scale feature information to the third transverse connection structure 2.3;
As shown in fig. 5, the third transverse connection structure 2.3 applies a 1 × 1 convolution to the received bottom-layer scale features and up-samples the received middle-layer scale feature information, adds the two results to obtain bottom-layer scale feature information, and applies 3 × 3 convolution and global pooling to it to obtain a bottom-layer scale feature vector, which is sent to the attention module 4. A sketch of these three structures is given below.
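To make the data flow concrete, here is a minimal PyTorch sketch of the three transverse connection structures. The channel width `c` and the nearest-neighbour interpolation mode are illustrative assumptions not fixed by the patent; only the order of operations (1 × 1 convolution, upsample-and-add, 3 × 3 convolution, global pooling) and the routing between the three structures follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureProcessingModule(nn.Module):
    """Sketch of the transverse connection structures 2.1, 2.2 and 2.3.

    Inputs are the upper-, middle- and bottom-layer scale features taken
    from three pooling stages of the backbone; the channel width `c` is an
    illustrative assumption.
    """

    def __init__(self, ch_upper: int, ch_middle: int, ch_bottom: int, c: int = 256):
        super().__init__()
        self.lat_u = nn.Conv2d(ch_upper, c, kernel_size=1)          # 1 x 1 conv (2.1)
        self.lat_m = nn.Conv2d(ch_middle, c, kernel_size=1)         # 1 x 1 conv (2.2)
        self.lat_b = nn.Conv2d(ch_bottom, c, kernel_size=1)         # 1 x 1 conv (2.3)
        self.smooth_u = nn.Conv2d(c, c, kernel_size=3, padding=1)   # 3 x 3 convs
        self.smooth_m = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.smooth_b = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.gap = nn.AdaptiveAvgPool2d(1)                          # global pooling

    def forward(self, f_upper, f_middle, f_bottom):
        # Structure 2.1: 1 x 1 conv, then 3 x 3 conv + global pooling.
        p_u = self.lat_u(f_upper)
        v_u = self.gap(self.smooth_u(p_u)).flatten(1)

        # Structure 2.2: 1 x 1 conv on the middle features, up-sample the
        # 1 x 1-convolved upper features, add, then 3 x 3 conv + pooling.
        p_m = self.lat_m(f_middle) + F.interpolate(
            p_u, size=f_middle.shape[-2:], mode="nearest")
        v_m = self.gap(self.smooth_m(p_m)).flatten(1)

        # Structure 2.3: the same pattern one level down.
        p_b = self.lat_b(f_bottom) + F.interpolate(
            p_m, size=f_bottom.shape[-2:], mode="nearest")
        v_b = self.gap(self.smooth_b(p_b)).flatten(1)

        # Three scale feature vectors for the attention module 4.
        return v_u, v_m, v_b
```

The 1 × 1 convolutions align the channel counts so that coarser maps can be up-sampled and added to finer ones, mirroring the lateral-connection pattern of feature pyramid networks.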
As shown in fig. 6, the weight distribution module 3 sequentially applies 1 × 1 convolution, 3 × 3 convolution, global pooling and fully connected layer processing to the lowest-scale features to obtain three weight coefficients that distribute attention across the different scales, and sends the weight coefficients to the attention module 4.
As shown in fig. 7, the attention module 4 computes a weighted average of the upper-layer, middle-layer and bottom-layer scale feature vectors output by the feature processing module 2, using the three weight coefficients output by the weight distribution module 3, and then sequentially applies fully connected layer processing and classification to obtain the final scene category. A sketch of these two modules follows.
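The weight distribution module 3 and the attention module 4 can be sketched in the same spirit. The softmax normalization of the three coefficients and the number of scene classes are assumptions; the patent states only that three weight coefficients are produced and used for a weighted average followed by fully connected layer processing and classification.

```python
import torch
import torch.nn as nn


class WeightDistributionModule(nn.Module):
    """Sketch of module 3: 1 x 1 conv, 3 x 3 conv, global pooling and a
    fully connected layer producing three attention weights."""

    def __init__(self, ch_in: int, c: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch_in, c, kernel_size=1),
            nn.Conv2d(c, c, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(c, 3),  # one coefficient per scale
        )

    def forward(self, f_lowest):
        # Softmax normalization is an assumption; the text says only that
        # three weight coefficients are output.
        return torch.softmax(self.body(f_lowest), dim=1)


class AttentionModule(nn.Module):
    """Sketch of module 4: weighted average of the three scale vectors,
    then a fully connected layer for classification."""

    def __init__(self, dim: int, num_classes: int = 10):  # class count assumed
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, v_u, v_m, v_b, weights):
        stacked = torch.stack([v_u, v_m, v_b], dim=1)     # (batch, 3, dim)
        fused = (weights.unsqueeze(-1) * stacked).sum(1)  # weighted average
        return self.fc(fused)                             # class scores

# Usage: weights = weight_module(f_lowest)
#        logits  = attention(v_u, v_m, v_b, weights)
```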
2) Inputting a training set of audio files covering different scene categories, together with their corresponding scene labels, into the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model and training the model;
3) Reading an audio file and preprocessing it to obtain audio signal segments; the preprocessing truncates the input signal into segments of fixed 10 s duration to facilitate the subsequent processing. A minimal sketch of this step follows.
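A minimal sketch of this preprocessing, assuming single-channel audio loaded with librosa; the sampling rate and the zero-padding of a trailing short remainder are assumptions, since the text fixes only the 10 s segment length:

```python
import librosa
import numpy as np


def split_into_segments(path: str, seg_seconds: int = 10, sr: int = 44100):
    """Cut an audio file into fixed-length 10 s segments.

    The sampling rate and the zero-padding of the last, shorter remainder
    are illustrative assumptions.
    """
    y, sr = librosa.load(path, sr=sr, mono=True)
    seg_len = seg_seconds * sr
    n_segs = int(np.ceil(len(y) / seg_len))
    y = np.pad(y, (0, n_segs * seg_len - len(y)))        # pad the tail
    return y.reshape(n_segs, seg_len)                    # one row per segment
```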
4) Extracting a logarithmic Mel spectrogram from said audio signal segment; this comprises the following steps:
(1) performing frame division and windowing on the input audio signal segment; a speech signal is a typical non-stationary signal, but because the motion of the vocal organs is very slow compared with the speed of sound-wave vibration, a speech signal is generally considered stationary over a period of 10 ms to 30 ms; the signal under test is therefore framed and windowed.
(2) passing the resulting audio frames through a Mel filter bank: within each time step, the energy passing through each Mel filter is computed; the energies of all filters within a time step form an energy vector, and the energy vectors of all time steps are combined to obtain the two-dimensional Mel spectrogram of the audio frame.
The energy passing through the m-th Mel filter in each time step of the audio frame is calculated by the following formula:

E_m = Σ_k |X(k)|² · H_m(k),  m = 1, 2, …, M

where M is the number of Mel filters, H_m(k) is the transfer function of the m-th Mel filter, and X(k) is the corresponding FFT amplitude value.
(3) taking the logarithm of the two-dimensional Mel spectrogram to obtain the logarithmic Mel spectrogram. A direct implementation sketch of steps (1) to (3) is given below.
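Steps (1) to (3) can be written down directly. The sketch below uses numpy and librosa with placeholder frame, hop and filter-count values; the concrete values of the embodiment appear in the example further below.

```python
import librosa
import numpy as np


def log_mel_spectrogram(segment, sr=44100, n_fft=2048, hop=1024, n_mels=64):
    """Sketch of steps (1) to (3): framing/windowing, Mel filter-bank
    energies E_m = sum_k |X(k)|^2 H_m(k), then the logarithm.

    Frame, hop and filter counts here are placeholders, not the values
    used in the embodiment below.
    """
    # (1) Framing and windowing are handled inside the STFT.
    X = librosa.stft(segment, n_fft=n_fft, hop_length=hop, window="hamming")
    power = np.abs(X) ** 2                                # |X(k)|^2 per frame
    # (2) Mel filter bank H_m(k); energies per filter and time step.
    H = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energy = H @ power                                # (n_mels, n_frames)
    # (3) Logarithm, with a small floor to avoid log(0).
    return np.log(mel_energy + 1e-10)
```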
5) Inputting the logarithmic Mel spectrogram into the trained attention-mechanism-based multi-scale audio scene recognition convolutional neural network model to obtain the final scene category.
Specific examples are given below:
for fast convergence and best performance, an Adam optimizer is used and an adaptive learning rate attenuation is set in the training process, and 32 samples are trained each time. The concrete measures are as follows:
1. Read the audio signal and truncate it; each recording is cut into speech segments of fixed 10 s duration.
2. Frame and window the fixed-duration speech signal; each frame contains 2048 sampling points and is weighted with a 2048-point Hamming window.
3. Pass the framed signal through a Mel filter bank for feature extraction and take the logarithm to obtain the logarithmic Mel spectrogram; the number of filters is 134, the filter window length is 1704 points, and adjacent frames overlap by 852 points.
4. Input the logarithmic Mel spectrogram into the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model and perform forward propagation to obtain the final scene category.
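A hedged training-loop sketch under the settings above (Adam optimizer, adaptive learning-rate decay, batch size 32); the initial learning rate, the epoch count, the ReduceLROnPlateau policy and the dataset object are illustrative assumptions, since the embodiment fixes only the optimizer, the decay and the batch size:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train(model: nn.Module, train_set, epochs: int = 100, lr: float = 1e-3):
    """Train with Adam, adaptive LR decay and batch size 32.

    The LR schedule and epoch count are assumptions for illustration.
    """
    loader = DataLoader(train_set, batch_size=32, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        total = 0.0
        for log_mel, label in loader:          # batches of log-Mel spectrograms
            opt.zero_grad()
            loss = loss_fn(model(log_mel), label)
            loss.backward()
            opt.step()
            total += loss.item()
        sched.step(total / len(loader))        # decay the LR when loss plateaus
    return model
```

Steps 1 to 4 can likewise be strung together for inference. Interpreting the 1704-point filter window as librosa's win_length and the 852-point overlap as its hop_length, as well as the sampling rate and the trained `model` object, are assumptions about the filing:

```python
import librosa
import numpy as np
import torch


def predict_scene(model, path: str, sr: int = 44100) -> int:
    """Sketch of inference steps 1 to 4 with the embodiment's parameters."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    y = y[: 10 * sr]                                        # step 1: one 10 s segment
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, win_length=1704, hop_length=852,
        window="hamming", n_mels=134)                       # steps 2 and 3
    log_mel = np.log(mel + 1e-10)
    x = torch.from_numpy(log_mel).float()[None, None]       # (1, 1, mels, frames)
    with torch.no_grad():
        scores = model(x)                                   # step 4: forward pass
    return int(scores.argmax(dim=1))                        # predicted scene index
```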
TABLE 1 Comparison of various audio scene recognition algorithms
[Table 1 is reproduced as an image in the original filing; it compares the recognition accuracy of the proposed method with that of two reference methods on the same data set.]
As shown in Table 1, on the same data set the accuracy of the attention-mechanism-based multi-scale audio scene recognition method is clearly higher than that of the two reference methods, demonstrating the better performance of the method.

Claims (6)

1. A multi-scale audio scene recognition method based on an attention mechanism is characterized by comprising the following steps:
1) establishing a multi-scale audio scene recognition convolutional neural network model based on an attention mechanism, wherein the model is used for accurately recognizing audio scenes with different frequency band ranges and different durations;
the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model comprises the following components in series: a feature extraction module (1), built from an Xception model, for extracting features of different scales from the received logarithmic Mel spectrogram; a feature processing module (2) for processing the different-scale features extracted by the feature extraction module (1) to obtain feature vectors representing the different scales; an attention module (4) for fusing the feature vectors representing the different scales and performing scene classification; and a weight distribution module (3) for processing the lowest-scale features output by the feature extraction module (1) and feeding the result to the attention module (4); wherein the feature processing module (2) comprises:
a first transverse connection structure (2.1), which sequentially applies 1 × 1 convolution, 3 × 3 convolution and global pooling to the received upper-layer scale features to obtain an upper-layer scale feature vector, sends the vector to the attention module (4), and sends the upper-layer scale feature information after the 1 × 1 convolution to the second transverse connection structure (2.2);
a second transverse connection structure (2.2), which applies a 1 × 1 convolution to the received middle-layer scale features and up-samples the received upper-layer scale feature information, adds the two results to obtain middle-layer scale feature information, applies 3 × 3 convolution and global pooling to it to obtain a middle-layer scale feature vector, sends the vector to the attention module (4), and sends the middle-layer scale feature information to the third transverse connection structure (2.3);
a third transverse connection structure (2.3), which applies a 1 × 1 convolution to the received bottom-layer scale features and up-samples the received middle-layer scale feature information, adds the two results to obtain bottom-layer scale feature information, and applies 3 × 3 convolution and global pooling to it to obtain a bottom-layer scale feature vector, which is sent to the attention module (4);
2) inputting a training set of audio files covering different scene categories, together with their corresponding scene labels, into the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model and training the model;
3) reading an audio file and preprocessing it to obtain an audio signal segment;
4) extracting a logarithmic Mel spectrogram from said audio signal segment;
5) inputting the logarithmic Mel spectrogram into the trained attention-mechanism-based multi-scale audio scene recognition convolutional neural network model to obtain the final scene category.
2. The attention-mechanism-based multi-scale audio scene recognition method according to claim 1, wherein the different-scale features output by the second, third and fourth pooling layers of the feature extraction module (1) are respectively fed into the feature processing module (2), and the lowest-scale features output by the first pooling layer of the feature extraction module (1) are fed into the weight distribution module (3).
3. The attention-mechanism-based multi-scale audio scene recognition method according to claim 1, wherein the weight distribution module (3) sequentially applies 1 × 1 convolution, 3 × 3 convolution, global pooling and fully connected layer processing to the lowest-scale features to obtain three weight coefficients that distribute attention across the different scales, and sends the weight coefficients to the attention module (4).
4. The attention-mechanism-based multi-scale audio scene recognition method according to claim 1, wherein the attention module (4) computes a weighted average of the upper-layer, middle-layer and bottom-layer scale feature vectors output by the feature processing module (2), using the three weight coefficients output by the weight distribution module (3), and then sequentially applies fully connected layer processing and classification to obtain the final scene category.
5. The method as claimed in claim 1, wherein the preprocessing in step 3) truncates the input signal into signal segments of fixed 10 s duration.
6. The attention mechanism-based multi-scale audio scene recognition method according to claim 1, wherein the step 4) comprises:
(1) performing frame division and windowing on an input audio signal segment;
(2) passing the resulting audio frames through a Mel filter bank: within each time step, the energy passing through each Mel filter is computed; the energies of all filters within a time step form an energy vector, and the energy vectors of all time steps are combined to obtain the two-dimensional Mel spectrogram of the audio frame;
(3) taking the logarithm of the two-dimensional Mel spectrogram to obtain the logarithmic Mel spectrogram.
CN201910960088.4A 2019-10-10 2019-10-10 Attention mechanism-based multi-scale audio scene recognition method Active CN110782878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910960088.4A CN110782878B (en) 2019-10-10 2019-10-10 Attention mechanism-based multi-scale audio scene recognition method


Publications (2)

Publication Number  Publication Date
CN110782878A (en)  2020-02-11
CN110782878B (en)  2022-04-05

Family

ID=69385078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910960088.4A Active CN110782878B (en) 2019-10-10 2019-10-10 Attention mechanism-based multi-scale audio scene recognition method

Country Status (1)

Country Link
CN (1) CN110782878B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111309965B (en) * 2020-03-20 2024-02-13 腾讯科技(深圳)有限公司 Audio matching method, device, computer equipment and storage medium
CN111477250B (en) * 2020-04-07 2023-11-28 北京达佳互联信息技术有限公司 Audio scene recognition method, training method and device for audio scene recognition model
CN111816205B (en) * 2020-07-09 2023-06-20 中国人民解放军战略支援部队航天工程大学 Airplane audio-based intelligent recognition method for airplane models
CN112036467B (en) * 2020-08-27 2024-01-12 北京鹰瞳科技发展股份有限公司 Abnormal heart sound identification method and device based on multi-scale attention neural network
CN112633175A (en) * 2020-12-24 2021-04-09 哈尔滨理工大学 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
CN112885372B (en) * 2021-01-15 2022-08-09 国网山东省电力公司威海供电公司 Intelligent diagnosis method, system, terminal and medium for power equipment fault sound
CN112562741B (en) * 2021-02-20 2021-05-04 金陵科技学院 Singing voice detection method based on dot product self-attention convolution neural network
CN112700794B (en) * 2021-03-23 2021-06-22 北京达佳互联信息技术有限公司 Audio scene classification method and device, electronic equipment and storage medium
CN114245280B (en) * 2021-12-20 2023-06-23 清华大学深圳国际研究生院 Scene self-adaptive hearing aid audio enhancement system based on neural network
CN116825131A (en) * 2022-06-24 2023-09-29 南方电网调峰调频发电有限公司储能科研院 Power plant equipment state auditory monitoring method integrating frequency band self-downward attention mechanism
CN116030800A (en) * 2023-03-30 2023-04-28 南昌航天广信科技有限责任公司 Audio classification recognition method, system, computer and readable storage medium
CN116543795B (en) * 2023-06-29 2023-08-29 天津大学 Sound scene classification method based on multi-mode feature fusion


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679085B2 (en) * 2017-10-31 2020-06-09 University Of Florida Research Foundation, Incorporated Apparatus and method for detecting scene text in an image

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN110085218A (en) * 2019-03-26 2019-08-02 天津大学 A kind of audio scene recognition method based on feature pyramid network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Acoustic scene classification using convolutional neural networks and multi-scale multi-feature extraction; An Dang et al.; 2018 IEEE International Conference on Consumer Electronics (ICCE); 2018-03-29; pp. 1-4 *
Feature Learning With Matrix Factorization Applied to Acoustic Scene Classification; Victor Bisot et al.; IEEE/ACM Transactions on Audio, Speech, and Language Processing; 2017-05-23; pp. 1216-1229 *
Image semantic segmentation based on multi-scale feature fusion (基于多尺度特征融合的图像语义分割); Hua Jing et al.; Journal of China Jiliang University (中国计量大学学报); 2019-09-30; pp. 323-330 *

Also Published As

Publication number  Publication date
CN110782878A (en)  2020-02-11


Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant