CN113744758A - Sound event detection method based on 2-DenseGRUNet model - Google Patents

Sound event detection method based on 2-DenseGRUNet model

Info

Publication number
CN113744758A
CN113744758A CN202111089655.7A CN202111089655A CN 113744758 A
Authority
CN
China
Prior art keywords
layer
model
feature
densegrunet
event detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111089655.7A
Other languages
Chinese (zh)
Other versions
CN113744758B (en
Inventor
曹毅
黄子龙
费鸿博
吴伟官
夏宇
周辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202111089655.7A priority Critical patent/CN113744758B/en
Publication of CN113744758A publication Critical patent/CN113744758A/en
Application granted granted Critical
Publication of CN113744758B publication Critical patent/CN113744758B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The sound event detection method based on the 2-DenseGRUNet model provided by the invention builds a sound event detection model by adding a gated recurrent unit (GRU) network to a 2nd-order DenseNet network model; compared with conventional convolutional neural network and recurrent neural network models, the sound event detection model in the technical scheme of this patent combines the advantages of the 2-DenseNet and the GRU, fuses feature information more efficiently, obtains more effective feature information, and performs temporal modeling effectively. The sound event detection model of the technical scheme of this patent has a lower average segment error rate and a higher F-Score in urban sound event detection, so sound classification results based on this method are more accurate.

Description

Sound event detection method based on 2-DenseGRUNet model
Technical Field
The invention relates to the technical field of sound detection, in particular to a sound event detection method based on a 2-DenseGRUNet model.
Background
Sound carries a large amount of information about life scenes and physical events in a city, and automatically extracting this information by intelligently sensing each sound source with deep learning methods has great potential and broad application prospects for building smart cities. In smart cities, sound event detection is an important basis for the recognition and semantic understanding of environmental sound scenes. Research on urban sound event detection is mainly applied to environment perception, factory equipment monitoring, urban security, automatic driving and similar fields. In the prior art, urban sound event detection is mainly implemented with MLP, CNN and LSTM network models. However, when these three network models are evaluated by F-Score, an index that combines Precision and Recall as a harmonic mean, they score poorly because of their high average segment error rates, which limits their range of practical applications.
Disclosure of Invention
In order to solve the problem of the high average segment error rate of urban sound event detection in the prior art, the invention provides a sound event detection method based on the 2-DenseGRUNet model, which can extract more effective acoustic information when processing audio data and has better temporal modeling capability, so that the model achieves a lower average segment error rate and higher usability in urban sound event detection.
The technical scheme of the invention is as follows: the sound event detection method based on the 2-DenseGRUNet model comprises the following steps:
s1: collecting audio data to be processed, preprocessing an original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operation comprises the following steps: sampling and quantizing, pre-emphasis processing and windowing;
s2: analyzing the time domain and the frequency domain of the audio frame sequence, extracting a Mel frequency cepstrum coefficient, and outputting an original characteristic vector sequence;
s3: reconstructing the feature information and the label of the original feature vector sequence, and outputting a reconstructed feature vector sequence after feature reconstruction processing;
converting the starting time, the ending time and the category of the sound event in the original feature vector sequence into a starting frame, an ending frame and an event label corresponding to the reconstructed feature vector sequence;
s4: constructing a sound event detection model, and performing iterative training on the model to obtain the trained sound event detection model;
s5: after the reconstructed feature vector sequence is processed, inputting the processed reconstructed feature vector sequence into the trained sound event detection model for identification and detection, and obtaining a sound event detection result of the audio data to be processed;
the method is characterized in that:
the sound event detection model comprises: an input layer, a 2nd-order DenseNet model and GRU units; all the GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are arranged in sequence between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model comprises consecutive 2-DenseBlock structures, with a Transition layer structure arranged after each 2-DenseBlock structure;
in each 2-DenseBlock structure, the connections between feature layers are correlation connections based on a 2nd-order Markov model, and the input of the current feature layer is related to the outputs of the previous 2 feature layers; each Transition layer structure comprises a convolution layer and a pooling layer.
It is further characterized in that:
the 2-order Densenet model is sequentially connected with the GRU units in series, and the 2-order Densenet model is connected with the first GRU unit through a reshape layer; a Time Distributed layer, a full connection layer and an output layer are sequentially arranged behind the GRU unit;
each 2-DenseBlock comprises feature layers which are connected in sequence; each feature layer comprises a1 × 1 convolution layer and a 3 × 3 convolution layer which are continuous, and in the feature layer, input data are subjected to batch standardization processing and activation function processing before entering the convolution layers for convolution processing; merging and cascading the last convolution layer and the next convolution layer in each feature layer through collocation respectively; a dropout layer is added between the first characteristic layer and the second characteristic layer in each 2-DenseBlock;
extracting the features of the mel-frequency cepstrum coefficient in the step S2, wherein the mel-frequency cepstrum coefficient extracted by the specific dimension is 40 mfcc; the structure of the original feature vector sequence obtained in step S2 is a 2-dimensional vector, the first-dimensional vector is a number of frames after sampling the audio data to be processed, and the second-dimensional vector is a dimension of the mel-frequency cepstrum coefficient;
in step S3, when reconstructing the feature information and the label of the original feature vector sequence, the specific duration of the data in the original feature vector sequence is 1S, the number of frames of the mfcc of the corresponding reconstructed feature vector sequence is 128 frames, and the following in the original sound event label: from the starting time to the ending time, the method comprises the following steps: starting frame number to ending frame number;
in step S5, before the reconstructed feature vector sequence is input to the trained voice event detection model, it is necessary to convert the feature vector from a 2-dimensional vector to a 3-dimensional vector, and input the 3-dimensional vector to a network model, where the 3 rd-dimensional vector is the number of channels in the voice event detection model;
the characteristic input of the GRU unit is a two-dimensional characteristic vector;
the Transition Layer comprises: one convolution kernel is 1 × 1 convolution layer, one maximum pooling layer with pooling size [2,2 ];
the number of the neurons of the full connection layer is set to be 256 or 128;
the output layer is implemented based on a Sigmoid function.
The sound event detection method based on the 2-DenseGRUNet model provided by the invention builds a sound event detection model by adding a gated recurrent unit (GRU) network to a 2nd-order DenseNet network model; compared with conventional convolutional neural network and recurrent neural network models, the sound event detection model in the technical scheme of this patent combines the advantages of the 2-DenseNet and the GRU, fuses feature information more efficiently, obtains more effective feature information, and performs temporal modeling effectively. The sound event detection model of the technical scheme of this patent has a lower average segment error rate and a higher F-Score in urban sound event detection, so sound classification results based on this method are more accurate.
Drawings
FIG. 1 is a flow chart of feature reconstruction in sound event detection according to the present invention;
FIG. 2 is a network block diagram of the 2-DenseGRUNet model of the present invention;
FIG. 3 is a schematic diagram of a 2-DenseBlock in a 2-DenseNet network according to the present invention;
fig. 4 is a schematic diagram of the gated recurrent unit GRU in the present invention.
Detailed Description
As shown in fig. 1 to 4, the sound event detection method based on the 2-DenseGRUNet model of the present invention specifically includes the following steps.
S1: collecting audio data to be processed, preprocessing an original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operation comprises the following steps: sampling and quantizing, pre-emphasis processing and windowing.
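As an illustration of this preprocessing chain, the sketch below applies pre-emphasis, framing and windowing to a raw signal; the 0.97 pre-emphasis coefficient and the Hamming window are common choices assumed here (not fixed by the patent), and the frame and hop lengths mirror the values given later in the embodiment.

```python
import numpy as np

def preprocess(signal: np.ndarray, frame_len: int = 2048, hop_len: int = 1024,
               pre_emphasis: float = 0.97) -> np.ndarray:
    """Return an array of windowed frames, shape (n_frames, frame_len)."""
    # Pre-emphasis: y[n] = x[n] - a * x[n-1], boosting high frequencies
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # Split the signal into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = emphasized[idx]
    # Windowing to reduce spectral leakage at frame boundaries
    return frames * np.hamming(frame_len)
```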
S2: performing time domain and frequency domain analysis on the audio frame sequence, extracting a Mel frequency cepstrum coefficient, and outputting an original characteristic vector sequence;
extracting the characteristics of the Mel frequency cepstrum coefficient, wherein the Mel frequency cepstrum coefficient extracted by the specific dimension is 40 mfcc; the structure of the original feature vector sequence obtained in step S2 is a 2-dimensional vector, the first-dimensional vector is the number of frames after sampling of the audio data to be processed, and the second-dimensional vector is the dimension of the mel-frequency cepstrum coefficient, in this embodiment, the mel-frequency cepstrum coefficient is 40mfcc, and the second-dimensional vector is 40.
S3: reconstructing the feature information and the label of the original feature vector sequence, and outputting a reconstructed feature vector sequence after feature reconstruction processing;
converting the starting time, the ending time and the category of the sound event in the original characteristic vector sequence into a starting frame, an ending frame and an event label corresponding to the reconstructed characteristic vector sequence;
when reconstructing the feature information and the label of the original feature vector sequence, the specific duration of data in the original feature vector sequence is 1s, the number of frames of mfcc of the corresponding reconstructed feature vector sequence is 128 frames, and the following data in the original sound event label: from the starting time to the ending time, the method comprises the following steps: number of start frames to number of end frames.
As shown in fig. 1, in step S3, the detailed implementation process of reconstructing the feature information and the tag of the original feature vector sequence is as follows:
let the existing original audio sample include: the file names of the files are a013, a010, a129, b099, b008 and b100, and the time length of each file is as follows: 3min44s, 3min30s, 4min0s, 4min0s, 3min30s, 3min01 s. Splicing the audio sample files with different durations according to a time dimension to construct total audio data, wherein the total duration of the total audio data is T:
T=3min44s+3min30s+4min0s+4min0s+3min30s+3min01s
extracting the tag information of the audio frequency from the total audio frequency data; each segment of file in the total audio data is respectively marked as:
time-start, time-end, category, respectively, representing the [ start time, end time, category ] of the audio event corresponding to the original audio sample.
Features are extracted from the labeled total audio data with f = 44.1 kHz, n_fft = 2048, win_len = 2048, hop_len = 1024, t = 0.0232 s and segment_len = 128;
where f is the sampling frequency, n_fft is the length of the fast Fourier transform, win_len is the number of sampling points per frame, hop_len is the number of sampling points between two adjacent frames, t is the duration of each frame, and segment_len = 128 means that, after feature extraction, the audio of total duration T is divided into segments of 128 frames each;
after the feature reconstruction, the total audio data with the length T, which contains different samples, respectively correspond to the audio segments labeled as: frame _ start, frame _ end, one-hot;
representing for each audio segment after segmentation: [ starting frame number, ending frame number, one-hot encoding of category ];
and finally, outputting the reconstructed feature vector sequence after the reconstructed feature processing, namely representing the reconstructed feature vector sequence by frame _ start, frame _ end and one-hot.
S4: and constructing a sound event detection model, and performing iterative training on the model to obtain the trained sound event detection model.
The sound event detection model of the invention is a network model constructed by combining a 2nd-order DenseNet model with gated recurrent unit (GRU) units, referred to for short as the 2-DenseGRUNet model. The sound event detection model uses the 2nd-order DenseNet model as the feature extraction network at the front end, followed in series by GRU units with good temporal modeling capability, so that the feature information of sound events can be fused efficiently, more effective feature information can be obtained, temporal modeling can be performed effectively, and the average segment error rate on sound segments can be reduced.
The sound event detection model comprises: an input layer, a 2nd-order DenseNet model and GRU units; all GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are arranged in sequence between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model comprises m consecutive 2-DenseBlock structures, where m is a natural number greater than or equal to 1;
a Transition layer structure is arranged after each 2-DenseBlock structure;
the 2nd-order DenseNet model is connected in series with the GRU units in sequence, and is connected to the first GRU unit through a reshape layer; a TimeDistributed layer, a fully connected layer and an output layer are arranged in sequence after the GRU units; the feature input of the GRU unit is a two-dimensional feature vector; the output layer is implemented based on a Sigmoid function; the Transition Layer comprises one convolution layer with a 1×1 convolution kernel and one max pooling layer with pooling size [2, 2]. The number of neurons in the fully connected layer is set to 256 or 128 to tune parameters and suppress overfitting in the network, and finally the detection result is output after processing by a Sigmoid function.
The structure of the feature layer in 2-DenseBlock is shown in Table 1 below:
table 1: structure of characteristic layer in 2-DenseBlock
Conv(1×1)
BN(·)
ReLU activation function
Conv(3×3)
BN(·)
ReLU activation function
Concatenate function
dropout layer
Each 2-DenseBlock structure comprises l feature layers which are connected in sequence;
as shown in table 1, each feature layer includes 2 convolutional layers, and in the feature layer, after the convolutional layers perform convolution processing on input data, batch normalization processing (BN) and ReLU activation function processing are also performed;
merging and cascading the last convolution layer and the next convolution layer in each feature layer through Concatenate respectively; and a dropout layer is arranged between the first characteristic layer and the second characteristic layer in each 2-DenseBlock to perform small overfitting inhibition, so that parameter adjustment of a network model in the later stage is facilitated.
DenseBlock is used as the basic structure of the DenseNet model, and DenseBlock in 2-order DenseNet is also two-order; that is, in each 2-DenseBlock structure, the connection between the feature layers is based on the correlation connection of a 2-step Markov model, and the current feature layer input is related to the output of the previous 2 feature layers; each Transition layer structure comprises a convolution layer and a pooling layer; the feature layer carries out weight sharing in a Transition layer structure, carries out time sequence distinguishing to the maximum extent and detects the starting time and the ending time of a sound event;
each 2-DenseBlock comprises l feature layers which are connected in sequence, wherein l is a natural number which is more than or equal to 1; each feature layer comprises a1 × 1 convolution layer and a 3 × 3 convolution layer which are continuous, and in the feature layer, input data are subjected to batch standardization processing and activation function processing before entering the convolution layers for convolution processing; merging and cascading the last convolution layer and the next convolution layer in each feature layer through collocation respectively; and a dropout layer is added between the first characteristic layer and the second characteristic layer in each 2-DenseBlock.
In the 2-DenseBlock network structure, the 1×1 and 3×3 convolution layers in each feature layer form a set of nonlinear transformations. As shown in fig. 3 of the drawings, the 2-DenseBlock comprises 4 feature layers; the inputs of the nonlinearly transforming feature layers are denoted X_1, X_2, ..., X_l, U1-U8 are the convolution layers within the feature layers, and W2-W9 are the weight coefficient matrices corresponding to each convolution layer;
in 2-DenseBlock, layer 3 initiates the feature output U of the network convolution transformcCan be defined as:
Figure BDA0003266437950000041
wherein, [ X ]l,Xl-1,Xl-2]Representing the current layer to carry out channel number merging cascade operation through 2-order related connection mode, using the feature mapping of the first two layers as the input of the current layer, W3×3And W1×1Representing convolution kernel sizes of 1 × 1 and 3 × 3 kernel functions, respectively, BN (·) representing batch normalization, f (·) representing a ReLU activation function, BN (·) representing batch normalization, B representing a bias coefficient;
each Transition layer structure comprises a convolution layer and a pooling layer; the convolution kernel of the convolution layer is 1 x1, feature dimension reduction processing is carried out on the convolution layer, then the convolution layer is connected with the pooling layer, the size of the matrix is reduced through the pooling layer processing, the parameters of the final full-connection layer are reduced, and the expression formula is as follows:
y^l_{k,m,n} = ( Σ_{(i,j)∈R_{m,n}} x(i,j)^p )^{1/p}
where y^l_{k,m,n} denotes the output of the pooling layer, l is the number of feature layers contained in each 2-DenseBlock structure, k is the number of channels of the feature map, m and n give the size of the pooling window, x(i, j) corresponds to a pixel of the feature map within the pooling region R_{m,n}, and p is a pre-specified parameter; when p tends to infinity, y^l_{k,m,n} takes the maximum value within the pooling region, i.e. max pooling.
In the embodiment shown in fig. 3, when the number of layers l is 4, the output of layer 1 is X_1, which is forward-propagated without a Concatenate layer as the input X_2 of layer 2; the feature map input to layer 3 is related only to the feature maps output by layer 2 and layer 1, i.e. X_3 = f([x_3, x_2, x_1]); the feature map input to layer 4 is related only to the feature maps output by layer 3 and layer 2, i.e. X_4 = f([x_4, x_3, x_2]).
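A hedged Keras sketch of one 2-DenseBlock and its Transition Layer follows (keras + TensorFlow is the framework named later in this description). Each feature layer concatenates only the outputs of the previous two layers, which is what distinguishes the 2nd-order connection from standard DenseNet; the growth rate, the 1×1 bottleneck width and the dropout rate are illustrative assumptions, not values fixed by the patent.

```python
from tensorflow.keras import layers

def feature_layer(x, growth: int):
    """One feature layer as in Table 1: Conv(1x1)-BN-ReLU, Conv(3x3)-BN-ReLU."""
    y = layers.Conv2D(4 * growth, 1, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(growth, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.Activation("relu")(y)

def dense_block_2(x, n_layers: int = 4, growth: int = 16, dropout: float = 0.2):
    """2-DenseBlock: each layer sees only the previous two layers' outputs."""
    prev, curr = None, x
    for i in range(n_layers):
        inp = curr if prev is None else layers.Concatenate()([curr, prev])
        out = feature_layer(inp, growth)
        if i == 0:                 # dropout between the 1st and 2nd feature layer
            out = layers.Dropout(dropout)(out)
        prev, curr = curr, out
    return curr

def transition_layer(x, channels: int):
    """Transition Layer: 1x1 conv for channel reduction, then [2, 2] max pooling."""
    y = layers.Conv2D(channels, 1, padding="same")(x)
    return layers.MaxPooling2D(pool_size=(2, 2))(y)
```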
As shown in fig. 4, the GRU unit is a gated recurrent unit, a gating mechanism in recurrent neural networks. The gated recurrent unit mechanism contains an update gate z_t and a reset gate r_t. The update gate z_t controls how much information from the previous state x_{t-1} is carried into the current state x_t, and the reset gate r_t controls how much information from the previous state is written into the current candidate set, which gives the GRU good performance in temporal modeling. The expressions are as follows:
z_t = σ(W_z · [h_{t-1}, x_t])
r_t = σ(W_r · [h_{t-1}, x_t])
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
where σ(·) denotes the fully connected layer with its activation function, h̃_t denotes the candidate hidden state, h_t denotes the output, and ⊙ denotes element-wise multiplication; W_z is the weight coefficient of the update gate, W_r is the weight coefficient of the reset gate, and tanh(·) denotes the tanh activation function.
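To make the gating explicit, the following NumPy sketch evaluates the four equations above for a single time step; in the actual model the Keras GRU layer would be used, and the weight shapes here are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """x_t: (d_in,), h_prev: (d_h,), each W*: (d_h, d_h + d_in)."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ concat)                    # update gate
    r_t = sigmoid(Wr @ concat)                    # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde    # new hidden state / output
    return h_t
```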
As shown in fig. 2, the feature vector sequence input to the sound event detection model first passes through one convolution operation and one pooling operation, and is then fed in turn into m consecutive 2-DenseBlocks, each followed by a Transition Layer; after processing by the 2-DenseNet(m), which comprises the m consecutive 2-DenseBlock structures and Transition Layers, a reshape layer converts the feature dimensions, turning the three-dimensional feature tensor into a two-dimensional feature vector; n gated recurrent units (GRU) connected next perform temporal modeling; the result then passes through a TimeDistributed layer for time-series tensor operations and through a fully connected layer for detection, and finally the detection result is output after processing by a Sigmoid function. The number m of 2-DenseBlocks and the number of layers l are evaluated according to the actual hardware conditions and the complexity of the data. In the embodiments shown in fig. 2 and fig. 3 of the drawings, n is 2, m is 2 and l is 4.
The Sigmoid function f(z) is given by:
f(z) = 1 / (1 + e^(-z))
S5: after processing, the reconstructed feature vector sequence is input into the trained sound event detection model for recognition and detection, obtaining the sound event detection result of the audio data to be processed; before the reconstructed feature vector sequence is input into the trained sound event detection model, the feature vector needs to be converted from a 2-dimensional vector into a 3-dimensional vector, where the 3rd dimension is the number of channels in the sound event detection model.
In this technical scheme, the 2nd-order DenseNet extracts the information in the feature maps more effectively, and the gated recurrent unit (GRU) model is simple, which makes it better suited to building larger networks; in this patent, the 2nd-order DenseNet network is combined with 2 gated recurrent unit (GRU) layers, which is more efficient from a computational point of view; based on this technical scheme, the frequency-domain information of the feature maps can be extracted effectively, the temporal features of long audio sequences can be captured effectively, and the classification and regression tasks in detection can be realized more efficiently.
Table 2 below shows an example of the network structure of the 2-DenseGRUNet model; a code sketch of this structure follows the table.
Table 2: example of the 2-DenseGRUNet model
Input:mfcc[128,80,1]
Conv(3×3):[128,80,32]
Pooling(2×2):[64,80,32]
2-Denseblock(1):[32,80,16]
Transition Layer(16,80,8)
2-Denseblock(2):[16,80,8]
Transition Layer(16,80,8)
Reshape:[64,160]
GRU(1):[64,64]
GRU(2):[64,32]
TimeDistributed:[64,6]
Fully connected layer: [64, 6]
Output(Sigmoid):[64,6]
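The sketch below assembles an end-to-end model following Table 2, reusing the dense_block_2 and transition_layer helpers sketched earlier; where Table 2 is ambiguous (e.g. the exact pooling axes and channel counts), the values chosen here are assumptions that merely keep the shapes consistent, not the patented configuration.

```python
from tensorflow.keras import layers, models

def build_2_densegrunet(n_classes: int = 6):
    inp = layers.Input(shape=(128, 80, 1))                  # Input: mfcc [128, 80, 1]
    x = layers.Conv2D(32, 3, padding="same")(inp)           # Conv(3x3)
    x = layers.MaxPooling2D(pool_size=(2, 1))(x)            # pool time axis -> [64, 80, 32]
    x = dense_block_2(x, n_layers=4, growth=16)             # 2-Denseblock(1)
    x = transition_layer(x, channels=8)                     # Transition Layer
    x = dense_block_2(x, n_layers=4, growth=8)              # 2-Denseblock(2)
    x = transition_layer(x, channels=8)                     # Transition Layer
    x = layers.Reshape((64, -1))(x)                         # 3-D features -> 2-D sequence
    x = layers.GRU(64, return_sequences=True)(x)            # GRU(1)
    x = layers.GRU(32, return_sequences=True)(x)            # GRU(2)
    x = layers.TimeDistributed(layers.Dense(n_classes))(x)  # TimeDistributed + dense
    out = layers.Activation("sigmoid")(x)                   # frame-wise sigmoid output
    return models.Model(inp, out)
```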
The DCASE 2017 dataset is used; the dataset contains 6 detection categories, each with time labels giving the start and end times of the events. The 2-DenseGRUNet model shown in Table 2 is used, with m set to 2, i.e. the feature vector sequence input to the detection model passes through one convolution operation and one pooling operation and is then fed in turn into 2 consecutive 2-DenseBlocks; according to the data conditions and the performance of the equipment, l is 4 in each 2-DenseBlock structure, i.e. each 2-DenseBlock comprises 4 feature layers; n is 2, i.e. the number of gated recurrent unit (GRU) layers is 2.
Time-domain and frequency-domain analysis is performed on the audio frame sequence, the Mel-frequency cepstrum coefficients are extracted, and the feature vector sequence is output; the number of sampled frames for the input audio data in the DCASE 2017 dataset is 128, and the feature dimension of the selected Mel-frequency cepstrum coefficients is 40 MFCC: with 40 Mel filter banks, the 40-dimensional MFCC features are extracted, and the output Mel cepstrum coefficient feature vector sequence is (128, 40).
The 2-dimensional vector is converted into 3-dimensional data by reshape, because the number of input channels in the network structure of the 2-DenseNet model is 1; after conversion to three-dimensional data, the feature vectors for DCASE 2017 are (128, 40, 1); the three-dimensional feature vectors are later converted into two-dimensional feature vectors (256, 40) and input into the gated recurrent unit (GRU) for time series modeling.
The feature vectors are input into the 2-DenseNet model: the input feature map sequence first passes through a convolution layer, is then pooled by the pooling layer, and the resulting three-dimensional data is fed in turn into 2 consecutive 2-DenseBlocks.
In each 2-DenseBlock there are 4 feature layers, i.e. 4 2-DenseBlock functions whose input is a sequence of feature maps. Within the 2-DenseBlock function, batch normalization (BN) is performed first, with the ReLU function as the activation function, followed by the convolution layer; this procedure is performed twice within the function, with a first convolution kernel of size 1×1 and a second of size 3×3. The specific operation of the 2-DenseBlock function (denoted 2-DenseBlock in the formula) is therefore:
2-DenseBlock([X_l, X_{l-1}, X_{l-2}]) = f(BN(W_{3×3} · f(BN(W_{1×1} · [X_l, X_{l-1}, X_{l-2}] + b)) + b)) (the same transform as the definition of U_c above)
three-dimensional data processed by two continuous 2-DenseBlock and Transition Layer is converted into two-dimensional eigenvectors, input into a gating circulation unit GRU for Time sequence modeling, then enter a Time Distributed Layer for Time sequence tensor operation, share weight, maximally model Time sequence, detect the starting Time and the ending Time of a sound event, and finally output after the detection result is processed by a Sigmoid function.
The experiments are run under a Windows 10 system with a GTX1060 graphics card, an i7-8750H CPU and 16 GB of memory, using keras + TensorFlow as the deep learning framework and the DCASE 2017 dataset. First, comparison experiments with different features, different feature dimensions, and network structures with and without the GRU units are carried out on the DCASE 2017 dataset to verify the segment error rate and F-score of the 2-DenseGRUNet model; then, the good performance of the 2-DenseGRUNet model is verified through comparison experiments with existing research models.
Detection experiments on the audio data are carried out on the DCASE 2017 dataset with the 2-DenseNet and 2-DenseGRUNet network models, using the extracted 40-dimensional and 128-dimensional Mel cepstral coefficient features and the 40-dimensional gammatone cepstral coefficient (GFCC) features; the specific results are shown in Table 3 below.
Table 3: effect fusion experimental result of 2-DenseGRUNet model
Model (model) Feature(s) Average fragment error rate F-score of
2-DenseNet 40mfcc 0.566 0.610
2-DenseNet 128mfcc 0.562 0.613
2-DenseNet 40gfcc 0.586 0.607
2-DenseGRUNet 40mfcc 0.543 0.651
2-DenseGRUNet 128mfcc 0.541 0.648
2-DenseGRUNet 40gfcc 0.563 0.631
According to the experimental results, compared with the 2-DenseNet model without the gated recurrent unit (GRU), the 2-DenseGRUNet model adopted by the invention reduces the average segment error rate by 2.3%, 2.1% and 2.3% and improves the F-score by 4.1%, 3.5% and 2.4% under the 40mfcc, 128mfcc and 40gfcc features respectively; under the 2-DenseGRUNet model, 40mfcc reduces the average segment error rate by 2.0% and improves the F-score by 2.0% relative to 40gfcc; with 40mfcc relative to 128mfcc, the average segment error rate and F-score change by only about 0.1%, but using 40mfcc effectively reduces the model training time and the computational requirements. In conclusion, the 2-DenseGRUNet model with 40mfcc fuses feature information more efficiently, obtains more effective feature information, and performs time series modeling effectively. The best average segment error rate is 0.543 and the F-score is 65.12%.
Further tests of the 2-DenseGRUNet model are carried out on the DCASE 2017 dataset; the test results are compared with the accuracy of existing models from researchers at home and abroad, and the comparison results are shown in Table 4.
Table 4: Test results of different models
[Table 4 is provided as an image in the original publication; it lists the average segment error rate and F-score of the baseline MLP model, the LSTM model and the 2-DenseGRUNet model on the DCASE 2017 dataset.]
Compared with the test results of researchers at home and abroad, the 2-DenseGRUNet model adopted in the technical scheme of the invention reduces the average segment error rate by 14.7% and improves the F-score by 8.4% relative to the baseline MLP model, and reduces the average segment error rate by 1.9% and improves the F-score by 3.9% relative to the LSTM model; the average segment error rate of the technical scheme of the invention is significantly reduced and the F-score is significantly improved.
In summary, when processing audio data, the technical scheme provided by the invention fuses feature information more efficiently and obtains more effective feature information, and the model has a lower average segment error rate and a higher F-score; that is, the sound classification realized based on the method of the invention is more accurate.

Claims (10)

1. The sound event detection method based on the 2-DenseGRUNet model comprises the following steps:
s1: collecting audio data to be processed, preprocessing an original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operation comprises the following steps: sampling and quantizing, pre-emphasis processing and windowing;
s2: analyzing the time domain and the frequency domain of the audio frame sequence, extracting a Mel frequency cepstrum coefficient, and outputting an original characteristic vector sequence;
s3: reconstructing the feature information and the label of the original feature vector sequence, and outputting a reconstructed feature vector sequence after feature reconstruction processing;
converting the starting time, the ending time and the category of the sound event in the original feature vector sequence into a starting frame, an ending frame and an event label corresponding to the reconstructed feature vector sequence;
s4: constructing a sound event detection model, and performing iterative training on the model to obtain the trained sound event detection model;
s5: after the reconstructed feature vector sequence is processed, inputting the processed reconstructed feature vector sequence into the trained sound event detection model for identification and detection, and obtaining a sound event detection result of the audio data to be processed;
the method is characterized in that:
the sound event detection model comprises: an input layer, a 2nd-order DenseNet model and GRU units; all of the GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are arranged in sequence between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model comprises consecutive 2-DenseBlock structures, with a Transition layer structure arranged after each 2-DenseBlock structure;
in each 2-DenseBlock structure, the connections between feature layers are correlation connections based on a 2nd-order Markov model, and the input of the current feature layer is related to the outputs of the previous 2 feature layers; each Transition layer structure comprises a convolution layer and a pooling layer.
2. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the 2nd-order DenseNet model is connected in series with the GRU units in sequence, and is connected to the first GRU unit through a reshape layer; a TimeDistributed layer, a fully connected layer and an output layer are arranged in sequence after the GRU units.
3. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: each 2-DenseBlock comprises feature layers connected in sequence; each feature layer comprises a consecutive 1×1 convolution layer and 3×3 convolution layer, and in the feature layer the input data undergoes batch normalization and activation-function processing before entering the convolution layers for convolution; within each feature layer, the previous convolution layer and the next convolution layer are merged and cascaded through a Concatenate operation; and a dropout layer is added between the first and second feature layers in each 2-DenseBlock.
4. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: when extracting the Mel-frequency cepstrum coefficient features in step S2, the specific dimension of the extracted MFCCs is 40; the original feature vector sequence obtained in step S2 is a 2-dimensional vector, whose first dimension is the number of frames after sampling the audio data to be processed and whose second dimension is the dimension of the Mel-frequency cepstrum coefficients.
5. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: in step S3, when reconstructing the feature information and labels of the original feature vector sequence, the specific duration of the data in the original feature vector sequence is 1 s, the number of MFCC frames of the corresponding reconstructed feature vector sequence is 128, and the [start time, end time] in the original sound event label is converted into [start frame number, end frame number].
6. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: in step S5, before the reconstructed feature vector sequence is input into the trained sound event detection model, the feature vector needs to be converted from a 2-dimensional vector into a 3-dimensional vector, where the 3rd dimension is the number of channels in the sound event detection model.
7. The sound event detection method based on the 2-DenseGRUNet model according to claim 2, wherein: the feature input of the GRU unit is a two-dimensional feature vector.
8. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the Transition Layer comprises: one convolution layer with a 1×1 convolution kernel and one max pooling layer with pooling size [2, 2].
9. The sound event detection method based on the 2-DenseGRUNet model according to claim 2, wherein: the number of neurons in the fully connected layer is set to 256 or 128.
10. The sound event detection method based on the 2-DenseGRUNet model according to claim 2, wherein: the output layer is implemented based on a Sigmoid function.
CN202111089655.7A 2021-09-16 2021-09-16 Sound event detection method based on 2-DenseGRUNet model Active CN113744758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111089655.7A CN113744758B (en) 2021-09-16 2021-09-16 Sound event detection method based on 2-DenseGRUNet model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111089655.7A CN113744758B (en) 2021-09-16 2021-09-16 Sound event detection method based on 2-DenseGRUNet model

Publications (2)

Publication Number Publication Date
CN113744758A true CN113744758A (en) 2021-12-03
CN113744758B CN113744758B (en) 2023-12-01

Family

ID=78739499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111089655.7A Active CN113744758B (en) 2021-09-16 2021-09-16 Sound event detection method based on 2-DenseGRUNet model

Country Status (1)

Country Link
CN (1) CN113744758B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373985A1 (en) * 2017-06-23 2018-12-27 Nvidia Corporation Transforming convolutional neural networks for visual sequence learning
CN109065075A (en) * 2018-09-26 2018-12-21 广州势必可赢网络科技有限公司 A kind of method of speech processing, device, system and computer readable storage medium
CN109949824A (en) * 2019-01-24 2019-06-28 江南大学 City sound event classification method based on N-DenseNet and higher-dimension mfcc feature
CN110084292A (en) * 2019-04-18 2019-08-02 江南大学 Object detection method based on DenseNet and multi-scale feature fusion
CN112890828A (en) * 2021-01-14 2021-06-04 重庆兆琨智医科技有限公司 Electroencephalogram signal identification method and system for densely connecting gating network
CN113012714A (en) * 2021-02-22 2021-06-22 哈尔滨工程大学 Acoustic event detection method based on pixel attention mechanism capsule network model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
曹毅 (CAO Yi): "N-DenseNet城市声音事件分类模型" [N-DenseNet urban sound event classification model], 《西安电子科技大学学报》 [Journal of Xidian University], vol. 46, no. 6, pages 10-15 *

Also Published As

Publication number Publication date
CN113744758B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
Gupta et al. Comparing recurrent convolutional neural networks for large scale bird species classification
CN110390952B (en) City sound event classification method based on dual-feature 2-DenseNet parallel connection
CN110827804B (en) Sound event labeling method from audio frame sequence to event label sequence
CN107221320A (en) Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
WO2020028760A1 (en) System and method for neural network orchestration
CN109448703B (en) Audio scene recognition method and system combining deep neural network and topic model
WO2022048239A1 (en) Audio processing method and device
CN111161715A (en) Specific sound event retrieval and positioning method based on sequence classification
Passricha et al. A comparative analysis of pooling strategies for convolutional neural network based Hindi ASR
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
Zhang et al. Learning audio sequence representations for acoustic event classification
Zhang et al. Temporal Transformer Networks for Acoustic Scene Classification.
CN111666996A (en) High-precision equipment source identification method based on attention mechanism
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN113129908B (en) End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
Mustika et al. Comparison of keras optimizers for earthquake signal classification based on deep neural networks
CN112489689B (en) Cross-database voice emotion recognition method and device based on multi-scale difference countermeasure
CN112466284B (en) Mask voice identification method
CN116741159A (en) Audio classification and model training method and device, electronic equipment and storage medium
Liu et al. Bird song classification based on improved Bi-LSTM-DenseNet network
CN113744758B (en) Sound event detection method based on 2-DenseGRUNet model
CN112861949B (en) Emotion prediction method and system based on face and sound
WO2021147084A1 (en) Systems and methods for emotion recognition in user-generated video(ugv)
Spoorthy et al. Polyphonic Sound Event Detection Using Mel-Pseudo Constant Q-Transform and Deep Neural Network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant