CN113362854A - Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment - Google Patents

Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment

Info

Publication number
CN113362854A
Authority
CN
China
Prior art keywords
acoustic event
event detection
attention mechanism
detection method
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110619344.0A
Other languages
Chinese (zh)
Other versions
CN113362854B (en)
Inventor
韩纪庆
关亚东
薛嘉宾
郑贵滨
郑铁然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202110619344.0A
Publication of CN113362854A
Application granted
Publication of CN113362854B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 — specially adapted for particular use
    • G10L25/51 — specially adapted for particular use for comparison or discrimination
    • G10L25/03 — characterised by the type of extracted parameters
    • G10L25/18 — the extracted parameters being spectral information of each sub-band
    • G10L25/27 — characterised by the analysis technique
    • G10L25/30 — characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

An acoustic event detection method, system, storage medium and device based on a sparse self-attention mechanism, belonging to the field of machine auditory intelligence. The method aims to solve the problem that existing temporal feature extraction networks cannot model temporal structure effectively, which limits the performance of existing acoustic event detection systems. A Mel spectrogram is first extracted from the input audio signal; the Mel spectrogram is then input into a convolutional neural network to extract local features, and temporal features are extracted with a Transformer Encoder based on a sparse self-attention mechanism; finally, the features are input into a fully connected layer for classification, the results are post-processed, and the category and start-stop time of each detected acoustic event are output. The method is mainly intended for the detection of acoustic events.

Description

Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment
Technical Field
The invention belongs to the field of machine auditory intelligence and relates to an acoustic event detection method, system, storage medium and device.
Background
Acoustic event detection refers to analyzing and processing a sound signal to identify the types of acoustic events occurring in the audio signal and the start and end time of each acoustic event. Acoustic event detection has wide application prospects in security, smart homes, smart cities, multimedia information retrieval, biodiversity monitoring, and environmental perception for intelligent robots.
An existing acoustic event detection system structurally comprises a spectral feature extraction module, a neural network module and a post-processing module, with the neural network module as its core. The neural network module consists of two main parts: a local feature extraction network and a temporal feature extraction network. The existing temporal feature extraction network usually adopts a self-attention mechanism, so that when processing the features of a specific time instant the network is influenced by the features of all time instants in the audio segment; in practice, the features of many time instants are useless or even harmful for modeling the features of the current instant. The network therefore cannot model temporal structure effectively, which limits the performance of existing acoustic event detection systems and affects the practicality of the prior art.
Disclosure of Invention
The invention aims to solve the problem that existing temporal feature extraction networks cannot model temporal structure effectively, which limits the performance of existing acoustic event detection systems.
An acoustic event detection method based on a sparse self-attention mechanism comprises the following steps:
A Mel spectrogram is first extracted from the input audio signal; the Mel spectrogram is then input into a convolutional neural network to extract local features, and temporal features are extracted with a Transformer Encoder based on a sparse self-attention mechanism; finally, the temporal features are input into a fully connected layer for classification, the results are post-processed, and the category and start-stop time of each detected acoustic event are output;
the process of extracting the temporal features comprises the following steps:
the extracted local features H_i are input into a single-layer Transformer Encoder model, and the attention weights are normalized with a sparse normalization method; the normalization of the attention weight matrix comprises the following steps:
2.1 Let the t-th column of A be A_t, and sort the elements of A_t in descending order; A is the attention weight matrix in the self-attention layer;
2.2 Find the intermediate parameter k_t satisfying:
k_t = max{ k ∈ [T] | 1 + k·A_{t,k} > Σ_{j≤k} A_{t,j} }
where T is the size of the time dimension, [T] = {1, 2, ..., T}, and A_{t,k}, A_{t,j} are the k-th and j-th elements of the vector A_t, respectively;
2.3 Compute the threshold τ_t:
τ_t = (Σ_{j≤k_t} A_{t,j} − 1) / k_t
2.4 For each element j of A_t, compute:
A'_{t,j} = [A_{t,j} − τ_t]_+
where [·]_+ denotes [x]_+ = max{0, x};
2.5 Return to step 2.1 until t = T, obtaining the normalized attention weight matrix A'.
Further, the attention weight matrix in the self-attention layer is:
A = QKᵀ / √(d_k)
where Q and K are the Query and Key matrices in self-attention, respectively, and d_k is the feature dimension.
Further, the Mel spectrogram is input into the convolutional neural network to extract local features; the convolutional neural network is composed of at least one convolution module, wherein each convolution module comprises a convolutional layer, a normalization layer, a nonlinear layer and a max-pooling layer.
Further, the convolutional neural network for extracting local features is composed of seven convolution modules, and the numbers of convolution filters are 16, 32, 64, 128 and 128 in sequence; the pooling sizes of the max-pooling layers are (2,2), (1,2) and (1,2).
The convolutional layers in each convolution module are two-dimensional convolutional layers with a kernel size of (3,3) and a stride of (1,1).
Further, the final classification with the fully connected layer comprises the following steps:
the features are classified using a fully connected layer with a hidden size of 128 and a Sigmoid activation function.
Further, the process of post-processing the results and outputting the category and start-stop time of each detected acoustic event comprises the following steps:
the output probabilities are smoothed with a median filter to obtain the acoustic event prediction probability ŷ_{t,c}; a value of ŷ_{t,c} greater than 0.5 indicates that an acoustic event of class c occurs at time t, and otherwise no event of class c occurs; this gives, for each time instant, the information of whether each sound event is active, from which the onset and offset times of each sound event are obtained.
Further, the process of extracting the mel spectrogram from the input audio signal comprises the following steps:
the input sound signal is a sound segment of 10 seconds, and the sampling rate is 16 kHz; the Mel spectrum extraction process adopts a window length of 2048, frame shift of 255, 128 Mel domain filters, and combines the numbersThe values are mapped to a natural log domain; finally, a 10 second sound fragment, the extracted Mel spectrogram XiIs (648,128); where 648 is the number of frames and 128 is the order of the mel-filter coefficients.
An acoustic event detection system based on a sparse self-attention mechanism, the system being configured to perform a sparse self-attention mechanism based acoustic event detection method.
A storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement a sparse self-attention mechanism based acoustic event detection method.
An apparatus comprising a processor and a memory, the memory having stored therein at least one instruction that is loaded and executed by the processor to implement a sparse self-attention mechanism based acoustic event detection method.
Advantageous effects:
By sparsifying the attention weights, the proposed method reduces the coupling of the model to irrelevant time instants when modeling the temporal structure of sound events, achieving more effective temporal modeling and improving the performance of existing acoustic event detection systems. The method has been verified on an internationally published acoustic event detection dataset, and the results show that its classification performance is greatly improved compared with existing systems.
Drawings
Fig. 1 is a schematic diagram of an acoustic event detection method based on a sparse self-attention mechanism.
Fig. 2 is a schematic structural diagram of the convolutional neural network portion in fig. 1, wherein x7 indicates that the left bracket contains modules stacked 7 times.
FIG. 3 is a schematic structural diagram of the Transformer Encoder network portion of FIG. 1, which includes the proposed self-attention weight sparsification method.
Fig. 4 is a graph comparing the detection performance of the method of the present invention and the original baseline system on the international public data set.
Detailed Description
The first embodiment is as follows:
the invention provides an acoustic event detection method based on a sparse self-attention mechanism, which is used for replacing a Softmax normalization method in the self-attention mechanism with a sparse normalization method, so that coupling with useless features at other moments can be selectively reduced when a time sequence feature extraction network carries out time sequence modeling, and the time sequence features of an acoustic signal can be more effectively modeled.
Fig. 1 is a schematic diagram of an implementation model of the acoustic event detection method based on a sparse self-attention mechanism. First, a Mel spectrogram is extracted from the input audio signal; the Mel spectrogram is then input into a convolutional neural network (CNN) to extract local features, and temporal features are extracted with a Transformer Encoder based on a sparse self-attention mechanism; finally, the temporal features are input into a fully connected layer for classification, the results are post-processed, and the category and start-stop time of each detected acoustic event are output. Specifically, the method comprises the following steps:
step 1, extracting local features of the audio signal.
Step 1.1, extracting a Mel spectrogram.
First, the commonly used Mel spectrogram features are extracted from the input sound signal as the model input. In some embodiments, the input sound signal is a 10-second sound segment with a sample rate of 16 kHz. The Mel spectrum extraction uses a window length of 2048, a frame shift of 255 and 128 Mel-domain filters, and maps the values to the natural-log domain. For a 10-second sound segment, the extracted Mel spectrogram X_i has shape (648, 128), where 648 is the number of frames and 128 is the number of Mel filter bands.
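As an illustration of this step only (not the patent's reference code), the following minimal sketch uses the librosa library with the parameters above; the function name is an assumption, and the exact frame count can differ slightly from 648 depending on padding conventions.

import numpy as np
import librosa

def extract_log_mel(wav_path):
    # Load a 10-second clip at a 16 kHz sampling rate.
    y, sr = librosa.load(wav_path, sr=16000, duration=10.0)
    # Window length 2048, frame shift 255, 128 Mel-domain filters.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=255, n_mels=128)
    # Map the values to the natural-log domain (small offset avoids log(0)).
    log_mel = np.log(mel + 1e-10)
    # Transpose to (frames, 128) so that X_i has the (frame, Mel-band) layout.
    return log_mel.T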
Step 1.2, extracting local features.
The extracted Mel spectrogram is input into a convolutional neural network model. The convolutional neural network consists of a series of convolution modules, each containing a convolutional layer, a normalization layer, a nonlinear layer and a max-pooling layer, as shown in Fig. 2. In some embodiments, 7 convolution modules are used; the two-dimensional convolution kernel size is (3,3) with a stride of (1,1), and the numbers of convolution filters are (16, 32, 64, 128) in that order. The pooling sizes of the max-pooling layers are ((2,2), (2,2), (1,2), (1,2), (1,2)). All two-dimensional convolutional layers, max-pooling layers, batch normalization layers and rectified linear units are standard components of common neural network frameworks.
The input Mel spectrogram is mapped by the convolutional neural network to local features H_i, where H_i has shape (157, 128); 157 is the time dimension and 128 is the feature dimension.
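A hedged PyTorch sketch of this local-feature extractor follows. The kernel size, stride, and the filter and pooling values listed above are used where given; the remaining per-module values, the class names, and the handling of the residual frequency axis are illustrative assumptions rather than the patent's reference implementation.

import torch
import torch.nn as nn

class ConvModule(nn.Module):
    # One convolution module: conv layer -> batch norm -> ReLU -> max pooling.
    def __init__(self, in_ch, out_ch, pool):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(3, 3), stride=(1, 1), padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.MaxPool2d(pool),
        )

    def forward(self, x):
        return self.net(x)

class LocalFeatureCNN(nn.Module):
    # Stack of 7 convolution modules mapping the Mel spectrogram to local features H_i.
    def __init__(self):
        super().__init__()
        channels = [1, 16, 32, 64, 128, 128, 128, 128]                     # trailing 128s assumed
        pools = [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2)]   # last two pools assumed
        self.blocks = nn.Sequential(
            *[ConvModule(channels[i], channels[i + 1], pools[i]) for i in range(7)]
        )

    def forward(self, mel):                  # mel: (batch, 1, frames, 128)
        h = self.blocks(mel)                 # (batch, 128, frames', freq')
        h = h.mean(dim=-1)                   # collapse the remaining frequency axis (assumption)
        return h.transpose(1, 2)             # (batch, frames', 128), i.e. H_i per clip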
Step 2, extracting temporal features.
The extracted local features H_i are input into a single-layer Transformer Encoder model. The Transformer Encoder consists of a fully connected layer, a self-attention layer and a dropout layer; the detailed configuration and parameters of the model are shown in FIG. 3. The number of attention heads is 16, the dimension of the linear mapping layer is 512, and the dropout rate is 0.2. The attention weights are normalized with the proposed sparse normalization method; the other components are standard components of common neural network frameworks. The network output tensor is denoted M_i, with shape (157, 128).
The attention weight matrix in the self-attention layer is computed as:
A = QKᵀ / √(d_k)    (1)
where Q and K are the Query and Key matrices in self-attention, respectively, and d_k is the feature dimension.
The normalization operation on the obtained attention weight matrix is as follows:
2.1 Let the t-th column of A be A_t, and sort the elements of A_t in descending order;
2.2 Find the intermediate parameter k_t satisfying:
k_t = max{ k ∈ [T] | 1 + k·A_{t,k} > Σ_{j≤k} A_{t,j} }    (2)
where T is the size of the time dimension, [T] = {1, 2, ..., T}, and A_{t,k}, A_{t,j} are the k-th and j-th elements of the vector A_t, respectively;
2.3 Compute the threshold τ_t:
τ_t = (Σ_{j≤k_t} A_{t,j} − 1) / k_t    (3)
2.4 For each element j of A_t, compute:
A'_{t,j} = [A_{t,j} − τ_t]_+    (4)
where [·]_+ denotes [x]_+ = max{0, x}.
2.5 Return to step 2.1 until t = T, obtaining the normalized attention weight matrix A'.
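For concreteness, steps 2.1-2.5 amount to a sparsemax-style projection of each column of A. The following is a minimal NumPy sketch, written under the assumption that A holds the scaled scores of equation (1) and that each column A_t is processed independently as described above; the function name and the usage lines at the end are illustrative assumptions, not the patent's reference implementation.

import numpy as np

def sparse_normalize(A):
    # Steps 2.1-2.5: sparse normalization of each column A_t of the weight matrix A.
    A = np.asarray(A, dtype=np.float64)
    T = A.shape[0]                                   # size of the time dimension
    A_prime = np.zeros_like(A)
    for t in range(A.shape[1]):                      # loop over columns A_t
        a = A[:, t]
        srt = np.sort(a)[::-1]                       # 2.1: sort in descending order
        csum = np.cumsum(srt)
        ks = np.arange(1, T + 1)
        k_t = ks[1 + ks * srt > csum].max()          # 2.2: largest k with 1 + k*A_{t,k} > sum_{j<=k} A_{t,j}
        tau_t = (csum[k_t - 1] - 1.0) / k_t          # 2.3: threshold tau_t
        A_prime[:, t] = np.maximum(a - tau_t, 0.0)   # 2.4: [A_{t,j} - tau_t]_+
    return A_prime                                   # 2.5: normalized weight matrix A'

# Illustrative usage with random Query/Key matrices (shapes assumed):
T, d_k = 157, 128
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k))
A = Q @ K.T / np.sqrt(d_k)         # scaled attention scores, equation (1)
A_prime = sparse_normalize(A)      # each column now sums to 1, with relatively small weights set exactly to 0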
The characteristics of the proposed procedure are analyzed as follows. When performing temporal modeling for an acoustic event of class S at a certain time t_1, the ideal situation is to take a weighted sum only over the features of the time instants in the audio segment that belong to class S, while the weights of features at other time instants t_2 not belonging to class S are 0; here the weight means the similarity between the feature at time t_1 and the feature at time t_2. In general, a larger attention weight value reflects similarity between acoustic events of the same class, while a smaller attention weight value often reflects similarity between acoustic events of different classes.
Compared with the ordinary normalized attention weights obtained by the softmax transformation, the normalized attention weights A' obtained by the above method can ignore relatively small attention weight values. Therefore, when the normalized attention weights are used for the weighted summation of features at different time instants, the sparse attention weights A' make the neural network less affected by acoustic event features of other categories when modeling acoustic event features of the same category. Moreover, the invention does not simply zero out small weight values with a hard threshold; instead, it considers all weight values jointly and adaptively sets the relatively small ones to 0.
In conclusion, the proposed method is beneficial to the neural network's modeling of sound events and improves the performance of the acoustic event detection system.
Step 3, feature classification.
The features are classified with a fully connected layer, with a hidden size of 128 and an output dimension of (157, 10); the activation function is a Sigmoid, and the output matrix is denoted O_i.
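A small PyTorch sketch of this classification step, under the assumption that the 128-dimensional encoder output M_i is mapped frame-wise to the 10 classes; the class name is illustrative.

import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    # Fully connected classification head with Sigmoid activation.
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, m):                    # m: (batch, 157, 128), encoder output M_i
        return torch.sigmoid(self.fc(m))     # O_i: (batch, 157, 10) frame-wise probabilities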
Step 4, post-processing.
The output probabilities are smoothed with a median filter to obtain the acoustic event prediction probability ŷ_{t,c}. A value of ŷ_{t,c} greater than 0.5 indicates that an acoustic event of class c occurs at time t; otherwise no event of class c occurs. This yields, for each time instant, the information of whether each sound event is active, from which the onset and offset times of each sound event are obtained.
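A hedged sketch of this post-processing using scipy's median filter; the filter length, the frame-to-seconds conversion and the function name are illustrative assumptions rather than values fixed by the text.

import numpy as np
from scipy.signal import medfilt

def post_process(probs, frame_dur=10.0 / 157, win=7, thr=0.5):
    # probs: (157, 10) frame-wise probabilities O_i; returns (class, onset_s, offset_s) events.
    events = []
    for c in range(probs.shape[1]):
        smoothed = medfilt(probs[:, c], kernel_size=win)   # median-filter smoothing
        active = smoothed > thr                            # class c active at frame t when y_hat > 0.5
        t = 0
        while t < len(active):
            if active[t]:
                onset = t
                while t < len(active) and active[t]:
                    t += 1
                events.append((c, onset * frame_dur, t * frame_dur))
            else:
                t += 1
    return events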
Examples
To verify the effectiveness of the method, the scheme of the first embodiment was evaluated on the internationally published acoustic event detection dataset DESED and compared with the original baseline method. As shown in Fig. 4, the proposed method achieves better recognition performance than the original baseline system on all ten acoustic event classes. The average performance of the original system on this dataset is 44.22%, while the average performance of the proposed method is 47.65%, exceeding the performance of the first-ranked single model of DCASE 2020 Challenge Task 4. The experimental results therefore fully verify the effectiveness of the invention.
The second embodiment is as follows:
the embodiment is an acoustic event detection system based on a sparse self-attention mechanism, and the system is used for executing an acoustic event detection method based on the sparse self-attention mechanism.
The third embodiment is as follows:
the present embodiment is a storage medium having at least one instruction stored therein, where the at least one instruction is loaded and executed by a processor to implement a sparse auto-attention mechanism based acoustic event detection method.
The fourth embodiment is as follows:
the present embodiment is an apparatus comprising a processor and a memory, the storage medium having at least one instruction stored therein, the at least one instruction being loaded and executed by the processor to implement a sparse self-attention mechanism based acoustic event detection method.
The above-described calculation examples of the present invention are merely to explain the calculation model and the calculation flow of the present invention in detail, and are not intended to limit the embodiments of the present invention. It will be apparent to those skilled in the art that other variations and modifications of the present invention can be made based on the above description, and it is not intended to be exhaustive or to limit the invention to the precise form disclosed, and all such modifications and variations are possible and contemplated as falling within the scope of the invention.

Claims (10)

1. An acoustic event detection method based on a sparse self-attention mechanism comprises the following steps:
A Mel spectrogram is first extracted from the input audio signal; the Mel spectrogram is then input into a convolutional neural network to extract local features, and temporal features are extracted with a Transformer Encoder based on a sparse self-attention mechanism; finally, the temporal features are input into a fully connected layer for classification, the results are post-processed, and the category and start-stop time of each detected acoustic event are output;
the method is characterized in that the process of extracting the temporal features comprises the following steps:
the extracted local features H_i are input into a single-layer Transformer Encoder model, and the attention weights are normalized with a sparse normalization method; the normalization of the attention weight matrix comprises the following steps:
2.1 Let the t-th column of A be A_t, and sort the elements of A_t in descending order; A is the attention weight matrix in the self-attention layer;
2.2 Find the intermediate parameter k_t satisfying:
k_t = max{ k ∈ [T] | 1 + k·A_{t,k} > Σ_{j≤k} A_{t,j} }
where T is the size of the time dimension, [T] = {1, 2, ..., T}, and A_{t,k}, A_{t,j} are the k-th and j-th elements of the vector A_t, respectively;
2.3 Compute the threshold τ_t:
τ_t = (Σ_{j≤k_t} A_{t,j} − 1) / k_t
2.4 For each element j of A_t, compute:
A'_{t,j} = [A_{t,j} − τ_t]_+
where [·]_+ denotes [x]_+ = max{0, x};
2.5 Return to step 2.1 until t = T, obtaining the normalized attention weight matrix A'.
2. The sparse self-attention mechanism-based acoustic event detection method according to claim 1, wherein the attention weight matrix in the self-attention layer is:
A = QKᵀ / √(d_k)
where Q and K are the Query and Key matrices in self-attention, respectively, and d_k is the feature dimension.
3. The sparse self-attention mechanism-based acoustic event detection method according to claim 2, wherein the Mel spectrogram is input into the convolutional neural network to extract local features, the convolutional neural network being composed of at least one convolution module, the convolution module comprising a convolutional layer, a normalization layer, a nonlinear layer and a max-pooling layer.
4. The sparse self-attention mechanism-based acoustic event detection method according to claim 3, wherein the convolutional neural network for extracting local features is composed of seven convolution modules, and the numbers of convolution filters are 16, 32, 64 and 128 in sequence; the pooling sizes of the max-pooling layers are (2,2), (1,2) and (1,2).
The convolutional layers in each convolution module are two-dimensional convolutional layers with a kernel size of (3,3) and a stride of (1,1).
5. The sparse self-attention mechanism-based acoustic event detection method according to claim 4, wherein the final classification with the fully connected layer comprises the following steps:
the features are classified using a fully connected layer with a hidden size of 128 and a Sigmoid activation function.
6. The sparse self-attention mechanism-based acoustic event detection method according to any one of claims 1 to 5, wherein the step of post-processing the results and outputting the category and start-stop time of each detected acoustic event comprises the following steps:
the output probabilities are smoothed with a median filter to obtain the acoustic event prediction probability ŷ_{t,c}; a value of ŷ_{t,c} greater than 0.5 indicates that an acoustic event of class c occurs at time t, and otherwise no event of class c occurs; this gives, for each time instant, the information of whether each sound event is active, from which the onset and offset times of each sound event are obtained.
7. The sparse self-attention mechanism-based acoustic event detection method according to claim 6, wherein the process of extracting the Mel spectrogram from the input audio signal comprises the following steps:
the input sound signal is a 10-second sound segment with a sampling rate of 16 kHz; the Mel spectrum extraction uses a window length of 2048, a frame shift of 255 and 128 Mel-domain filters, and the values are mapped to the natural-log domain; for a 10-second sound segment, the extracted Mel spectrogram X_i has shape (648, 128), where 648 is the number of frames and 128 is the number of Mel filter bands.
8. An acoustic event detection system based on a sparse self-attention mechanism, characterized in that the system is configured to perform a sparse self-attention mechanism based acoustic event detection method according to one of claims 1 to 7.
9. A storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement a sparse self-attention mechanism based acoustic event detection method as claimed in any one of claims 1 to 7.
10. An apparatus comprising a processor and a memory, wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement a sparse self-attention mechanism based acoustic event detection method as claimed in any one of claims 1 to 7.
CN202110619344.0A 2021-06-03 2021-06-03 Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment Active CN113362854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110619344.0A CN113362854B (en) 2021-06-03 2021-06-03 Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110619344.0A CN113362854B (en) 2021-06-03 2021-06-03 Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN113362854A true CN113362854A (en) 2021-09-07
CN113362854B CN113362854B (en) 2022-11-15

Family

ID=77531749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110619344.0A Active CN113362854B (en) 2021-06-03 2021-06-03 Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN113362854B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023245991A1 (en) * 2022-06-24 2023-12-28 南方电网调峰调频发电有限公司储能科研院 Power plant equipment state auditory monitoring method merging frequency band top-down attention mechanism

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110600059A (en) * 2019-09-05 2019-12-20 Oppo广东移动通信有限公司 Acoustic event detection method and device, electronic equipment and storage medium
CN111145718A (en) * 2019-12-30 2020-05-12 中国科学院声学研究所 Chinese mandarin character-voice conversion method based on self-attention mechanism
CN111899760A (en) * 2020-07-17 2020-11-06 北京达佳互联信息技术有限公司 Audio event detection method and device, electronic equipment and storage medium
CN111933188A (en) * 2020-09-14 2020-11-13 电子科技大学 Sound event detection method based on convolutional neural network
CN112802484A (en) * 2021-04-12 2021-05-14 四川大学 Panda sound event detection method and system under mixed audio frequency
CN113223506A (en) * 2021-05-28 2021-08-06 思必驰科技股份有限公司 Speech recognition model training method and speech recognition method
US20220068462A1 (en) * 2020-08-28 2022-03-03 doc.ai, Inc. Artificial Memory for use in Cognitive Behavioral Therapy Chatbot
US20220108698A1 (en) * 2020-10-07 2022-04-07 Mitsubishi Electric Research Laboratories, Inc. System and Method for Producing Metadata of an Audio Signal

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110600059A (en) * 2019-09-05 2019-12-20 Oppo广东移动通信有限公司 Acoustic event detection method and device, electronic equipment and storage medium
CN111145718A (en) * 2019-12-30 2020-05-12 中国科学院声学研究所 Chinese mandarin character-voice conversion method based on self-attention mechanism
CN111899760A (en) * 2020-07-17 2020-11-06 北京达佳互联信息技术有限公司 Audio event detection method and device, electronic equipment and storage medium
US20220068462A1 (en) * 2020-08-28 2022-03-03 doc.ai, Inc. Artificial Memory for use in Cognitive Behavioral Therapy Chatbot
CN111933188A (en) * 2020-09-14 2020-11-13 电子科技大学 Sound event detection method based on convolutional neural network
US20220108698A1 (en) * 2020-10-07 2022-04-07 Mitsubishi Electric Research Laboratories, Inc. System and Method for Producing Metadata of an Audio Signal
CN112802484A (en) * 2021-04-12 2021-05-14 四川大学 Panda sound event detection method and system under mixed audio frequency
CN113223506A (en) * 2021-05-28 2021-08-06 思必驰科技股份有限公司 Speech recognition model training method and speech recognition method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANDRÉ F. T. MARTINS et al.: "From softmax to sparsemax: A sparse model of attention and multi-label classification", arXiv:1602.02068v2
GONÇALO M. CORREIA et al.: "Adaptively Sparse Transformers", arXiv:1909.00015v1
KOICHI MIYAZAKI et al.: "Convolution-augmented Transformer for semi-supervised sound event detection", Detection and Classification of Acoustic Scenes and Events (DCASE) 2020
M. KOICHI et al.: "Weakly supervised sound event detection with self-attention", ICASSP 2020
QIUQIANG KONG: "Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization", arXiv:1912.04761v2

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023245991A1 (en) * 2022-06-24 2023-12-28 南方电网调峰调频发电有限公司储能科研院 Power plant equipment state auditory monitoring method merging frequency band top-down attention mechanism

Also Published As

Publication number Publication date
CN113362854B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
Bayar et al. On the robustness of constrained convolutional neural networks to jpeg post-compression for image resampling detection
CN111651504A (en) Multi-element time sequence multilayer space-time dependence modeling method based on deep learning
US20140019390A1 (en) Apparatus and method for audio fingerprinting
CN111564163B (en) RNN-based multiple fake operation voice detection method
CN110968845B (en) Detection method for LSB steganography based on convolutional neural network generation
CN111526144A (en) Abnormal flow detection method and system based on DVAE-Catboost
CN113362854B (en) Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment
CN111091839A (en) Voice awakening method and device, storage medium and intelligent device
CN116150509B (en) Threat information identification method, system, equipment and medium for social media network
CN111276133B (en) Audio recognition method, system, mobile terminal and storage medium
CN116527357A (en) Web attack detection method based on gate control converter
Liang et al. Image resampling detection based on convolutional neural network
CN109617864B (en) Website identification method and website identification system
Dehdar et al. Image steganalysis using modified graph clustering based ant colony optimization and Random Forest
CN114615010A (en) Design method of edge server-side intrusion prevention system based on deep learning
CN106909944B (en) Face picture clustering method
CN117375896A (en) Intrusion detection method and system based on multi-scale space-time feature residual fusion
CN112418173A (en) Abnormal sound identification method and device and electronic equipment
Xin et al. Research on feature selection of intrusion detection based on deep learning
CN116935303A (en) Weak supervision self-training video anomaly detection method
CN114554491A (en) Wireless local area network intrusion detection method based on improved SSAE and DNN models
CN114171057A (en) Transformer event detection method and system based on voiceprint
Kung et al. Augment deep BP-parameter learning with local XAI-structural learning
CN112769619A (en) Multi-classification network fault prediction method based on decision tree
Jia et al. A Method of Malicious Data Flow Detection Based on Convolutional Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant