CN113744758A - Sound event detection method based on 2-DenseGRUNet model - Google Patents

Sound event detection method based on 2-DenseGRUNet model

Info

Publication number
CN113744758A
CN113744758A CN202111089655.7A CN202111089655A CN 113744758 A
Authority
CN
China
Prior art keywords
layer
model
feature
densegrunet
event detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111089655.7A
Other languages
Chinese (zh)
Other versions
CN113744758B (en
Inventor
曹毅
黄子龙
费鸿博
吴伟官
夏宇
周辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202111089655.7A priority Critical patent/CN113744758B/en
Publication of CN113744758A publication Critical patent/CN113744758A/en
Application granted granted Critical
Publication of CN113744758B publication Critical patent/CN113744758B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The sound event detection method based on the 2-DenseGRUNet model provided by the invention builds a sound event detection model by adding a gated recurrent unit (GRU) network to a 2nd-order DenseNet network model; compared with conventional convolutional neural network and recurrent neural network models, the sound event detection model in the technical scheme of this patent combines the advantages of the 2-DenseNet and the GRU, fuses feature information more efficiently, obtains more effective feature information, and performs temporal modeling effectively. The sound event detection model of the technical scheme of this patent has a lower average segment error rate and a higher F-Score in urban sound event detection, so sound classification results based on this method are more accurate.

Description

Sound event detection method based on 2-DenseGRUNet model
Technical Field
The invention relates to the technical field of sound detection, in particular to a sound event detection method based on a 2-DenseGRUNet model.
Background
Sound carries a large amount of information about life scenes and physical events in a city, and automatically extracting this information by intelligently sensing each sound source with deep learning methods has great potential and broad application prospects for building smart cities. In smart cities, sound event detection is an important basis for the recognition and semantic understanding of environmental sound scenes. Research on urban sound event detection is mainly applied to environment perception, factory equipment monitoring, urban security, automatic driving and similar fields. In the prior art, urban sound event detection is mainly implemented with MLP, CNN and LSTM network models. However, when these three network models are evaluated by F-Score, an index that combines Precision and Recall as a harmonic mean, they score poorly because of their high average segment error rates, which limits their range of practical applications.
Disclosure of Invention
In order to solve the problem of the high average segment error rate of urban sound event detection in the prior art, the invention provides a sound event detection method based on the 2-DenseGRUNet model, which can extract more effective acoustic information when processing audio data and has better temporal modeling capability, so that the model achieves a lower average segment error rate and higher usability in urban sound event detection.
The technical scheme of the invention is as follows: the sound event detection method based on the 2-DenseGRUNet model comprises the following steps:
s1: collecting audio data to be processed, preprocessing an original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operation comprises the following steps: sampling and quantizing, pre-emphasis processing and windowing;
s2: analyzing the time domain and the frequency domain of the audio frame sequence, extracting a Mel frequency cepstrum coefficient, and outputting an original characteristic vector sequence;
s3: reconstructing the feature information and the label of the original feature vector sequence, and outputting a reconstructed feature vector sequence after feature reconstruction processing;
converting the starting time, the ending time and the category of the sound event in the original feature vector sequence into a starting frame, an ending frame and an event label corresponding to the reconstructed feature vector sequence;
s4: constructing a sound event detection model, and performing iterative training on the model to obtain the trained sound event detection model;
s5: after the reconstructed feature vector sequence is processed, inputting the processed reconstructed feature vector sequence into the trained sound event detection model for identification and detection, and obtaining a sound event detection result of the audio data to be processed;
the method is characterized in that:
the sound event detection model comprises: an input layer, a 2nd-order DenseNet model and GRU units; all the GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are arranged in sequence between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model comprises consecutive 2-DenseBlock structures, with a Transition layer structure arranged after each 2-DenseBlock structure;
in each 2-DenseBlock structure, the connections between feature layers are correlation connections based on a 2nd-order Markov model, and the input of the current feature layer is related to the outputs of the previous 2 feature layers; each Transition layer structure comprises a convolution layer and a pooling layer.
It is further characterized in that:
the 2-order Densenet model is sequentially connected with the GRU units in series, and the 2-order Densenet model is connected with the first GRU unit through a reshape layer; a Time Distributed layer, a full connection layer and an output layer are sequentially arranged behind the GRU unit;
each 2-DenseBlock comprises feature layers which are connected in sequence; each feature layer comprises a1 × 1 convolution layer and a 3 × 3 convolution layer which are continuous, and in the feature layer, input data are subjected to batch standardization processing and activation function processing before entering the convolution layers for convolution processing; merging and cascading the last convolution layer and the next convolution layer in each feature layer through collocation respectively; a dropout layer is added between the first characteristic layer and the second characteristic layer in each 2-DenseBlock;
extracting the features of the mel-frequency cepstrum coefficient in the step S2, wherein the mel-frequency cepstrum coefficient extracted by the specific dimension is 40 mfcc; the structure of the original feature vector sequence obtained in step S2 is a 2-dimensional vector, the first-dimensional vector is a number of frames after sampling the audio data to be processed, and the second-dimensional vector is a dimension of the mel-frequency cepstrum coefficient;
in step S3, when reconstructing the feature information and the label of the original feature vector sequence, the specific duration of the data in the original feature vector sequence is 1S, the number of frames of the mfcc of the corresponding reconstructed feature vector sequence is 128 frames, and the following in the original sound event label: from the starting time to the ending time, the method comprises the following steps: starting frame number to ending frame number;
in step S5, before the reconstructed feature vector sequence is input to the trained voice event detection model, it is necessary to convert the feature vector from a 2-dimensional vector to a 3-dimensional vector, and input the 3-dimensional vector to a network model, where the 3 rd-dimensional vector is the number of channels in the voice event detection model;
the characteristic input of the GRU unit is a two-dimensional characteristic vector;
the Transition Layer comprises: one convolution kernel is 1 × 1 convolution layer, one maximum pooling layer with pooling size [2,2 ];
the number of the neurons of the full connection layer is set to be 256 or 128;
the output layer is implemented based on a Sigmoid function.
The sound event detection method based on the 2-DenseGRUNet model provided by the invention builds a sound event detection model by adding a gated recurrent unit (GRU) network to a 2nd-order DenseNet network model; compared with conventional convolutional neural network and recurrent neural network models, the sound event detection model in the technical scheme of this patent combines the advantages of the 2-DenseNet and the GRU, fuses feature information more efficiently, obtains more effective feature information, and performs temporal modeling effectively. The sound event detection model of the technical scheme of this patent has a lower average segment error rate and a higher F-Score in urban sound event detection, so sound classification results based on this method are more accurate.
Drawings
FIG. 1 is a flow chart of feature reconstruction in sound event detection according to the present invention;
FIG. 2 is a network block diagram of the 2-DenseGRUNet model of the present invention;
FIG. 3 is a schematic diagram of a 2-DenseBlock in a 2-DenseNet network according to the present invention;
fig. 4 is a schematic diagram of the gated recurrent unit GRU in the present invention.
Detailed Description
As shown in fig. 1 to 4, the sound event detection method based on the 2-DenseGRUNet model of the present invention specifically includes the following steps.
S1: collecting audio data to be processed, preprocessing an original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operation comprises the following steps: sampling and quantizing, pre-emphasis processing and windowing.
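As an illustration of this preprocessing chain, the sketch below applies pre-emphasis, framing and windowing to a raw signal; the 0.97 pre-emphasis coefficient and the Hamming window are common choices assumed here (not fixed by the patent), and the frame and hop lengths mirror the values given later in the embodiment.

```python
import numpy as np

def preprocess(signal: np.ndarray, frame_len: int = 2048, hop_len: int = 1024,
               pre_emphasis: float = 0.97) -> np.ndarray:
    """Return an array of windowed frames, shape (n_frames, frame_len)."""
    # Pre-emphasis: y[n] = x[n] - a * x[n-1], boosting high frequencies
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # Split the signal into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = emphasized[idx]
    # Windowing to reduce spectral leakage at frame boundaries
    return frames * np.hamming(frame_len)
```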
S2: performing time domain and frequency domain analysis on the audio frame sequence, extracting a Mel frequency cepstrum coefficient, and outputting an original characteristic vector sequence;
extracting the characteristics of the Mel frequency cepstrum coefficient, wherein the Mel frequency cepstrum coefficient extracted by the specific dimension is 40 mfcc; the structure of the original feature vector sequence obtained in step S2 is a 2-dimensional vector, the first-dimensional vector is the number of frames after sampling of the audio data to be processed, and the second-dimensional vector is the dimension of the mel-frequency cepstrum coefficient, in this embodiment, the mel-frequency cepstrum coefficient is 40mfcc, and the second-dimensional vector is 40.
S3: reconstructing the feature information and the label of the original feature vector sequence, and outputting a reconstructed feature vector sequence after feature reconstruction processing;
converting the starting time, the ending time and the category of the sound event in the original characteristic vector sequence into a starting frame, an ending frame and an event label corresponding to the reconstructed characteristic vector sequence;
when reconstructing the feature information and the label of the original feature vector sequence, the specific duration of data in the original feature vector sequence is 1s, the number of frames of mfcc of the corresponding reconstructed feature vector sequence is 128 frames, and the following data in the original sound event label: from the starting time to the ending time, the method comprises the following steps: number of start frames to number of end frames.
As shown in fig. 1, in step S3, the detailed implementation process of reconstructing the feature information and the tag of the original feature vector sequence is as follows:
let the existing original audio sample include: the file names of the files are a013, a010, a129, b099, b008 and b100, and the time length of each file is as follows: 3min44s, 3min30s, 4min0s, 4min0s, 3min30s, 3min01 s. Splicing the audio sample files with different durations according to a time dimension to construct total audio data, wherein the total duration of the total audio data is T:
T=3min44s+3min30s+4min0s+4min0s+3min30s+3min01s
extracting the tag information of the audio frequency from the total audio frequency data; each segment of file in the total audio data is respectively marked as:
time-start, time-end, category, respectively, representing the [ start time, end time, category ] of the audio event corresponding to the original audio sample.
Features are extracted from the labeled total audio data with f = 44.1 kHz, n_fft = 2048, win_len = 2048, hop_len = 1024, t = 0.0232 s and segment_len = 128;
where f is the sampling frequency, n_fft is the length of the fast Fourier transform, win_len is the number of sampling points per frame, hop_len is the number of sampling points between two adjacent frames, t is the duration of each frame, and segment_len = 128 means that, after feature extraction, the audio of total duration T is divided into segments of 128 frames each;
after the feature reconstruction, the total audio data with the length T, which contains different samples, respectively correspond to the audio segments labeled as: frame _ start, frame _ end, one-hot;
representing for each audio segment after segmentation: [ starting frame number, ending frame number, one-hot encoding of category ];
and finally, outputting the reconstructed feature vector sequence after the reconstructed feature processing, namely representing the reconstructed feature vector sequence by frame _ start, frame _ end and one-hot.
S4: and constructing a sound event detection model, and performing iterative training on the model to obtain the trained sound event detection model.
The sound event detection model of the invention is a network model constructed by combining a 2nd-order DenseNet model with gated recurrent unit (GRU) units, referred to for short as the 2-DenseGRUNet model. The sound event detection model uses the 2nd-order DenseNet model as the feature extraction network at the front end, followed in series by GRU units with good temporal modeling capability, so that the feature information of sound events can be fused efficiently, more effective feature information can be obtained, temporal modeling can be performed effectively, and the average segment error rate on sound segments can be reduced.
The sound event detection model comprises: an input layer, a 2nd-order DenseNet model and GRU units; all GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are arranged in sequence between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model comprises m consecutive 2-DenseBlock structures, where m is a natural number greater than or equal to 1;
a Transition layer structure is arranged after each 2-DenseBlock structure;
the 2nd-order DenseNet model is connected in series with the GRU units in sequence, and is connected to the first GRU unit through a reshape layer; a TimeDistributed layer, a fully connected layer and an output layer are arranged in sequence after the GRU units; the feature input of the GRU unit is a two-dimensional feature vector; the output layer is implemented based on a Sigmoid function; the Transition Layer comprises one convolution layer with a 1×1 convolution kernel and one max pooling layer with pooling size [2, 2]. The number of neurons in the fully connected layer is set to 256 or 128 to tune parameters and suppress overfitting in the network, and finally the detection result is output after processing by a Sigmoid function.
The structure of the feature layer in 2-DenseBlock is shown in Table 1 below:
table 1: structure of characteristic layer in 2-DenseBlock
Conv(1×1)
BN(·)
ReLU activation function
Conv(3×3)
BN(·)
ReLU activation function
Concatenate function
dropout layer
Each 2-DenseBlock structure comprises l feature layers which are connected in sequence;
as shown in table 1, each feature layer includes 2 convolutional layers, and in the feature layer, after the convolutional layers perform convolution processing on input data, batch normalization processing (BN) and ReLU activation function processing are also performed;
merging and cascading the last convolution layer and the next convolution layer in each feature layer through Concatenate respectively; and a dropout layer is arranged between the first characteristic layer and the second characteristic layer in each 2-DenseBlock to perform small overfitting inhibition, so that parameter adjustment of a network model in the later stage is facilitated.
DenseBlock is used as the basic structure of the DenseNet model, and DenseBlock in 2-order DenseNet is also two-order; that is, in each 2-DenseBlock structure, the connection between the feature layers is based on the correlation connection of a 2-step Markov model, and the current feature layer input is related to the output of the previous 2 feature layers; each Transition layer structure comprises a convolution layer and a pooling layer; the feature layer carries out weight sharing in a Transition layer structure, carries out time sequence distinguishing to the maximum extent and detects the starting time and the ending time of a sound event;
each 2-DenseBlock comprises l feature layers which are connected in sequence, wherein l is a natural number which is more than or equal to 1; each feature layer comprises a1 × 1 convolution layer and a 3 × 3 convolution layer which are continuous, and in the feature layer, input data are subjected to batch standardization processing and activation function processing before entering the convolution layers for convolution processing; merging and cascading the last convolution layer and the next convolution layer in each feature layer through collocation respectively; and a dropout layer is added between the first characteristic layer and the second characteristic layer in each 2-DenseBlock.
In the 2-DenseBlock network structure, the 1×1 and 3×3 convolution layers in each feature layer form a set of nonlinear transformations. As shown in fig. 3 of the drawings, the 2-DenseBlock comprises 4 feature layers; the inputs of the nonlinearly transforming feature layers are denoted X_1, X_2, ..., X_l, U1-U8 are the convolution layers within the feature layers, and W2-W9 are the weight coefficient matrices corresponding to each convolution layer;
in 2-DenseBlock, layer 3 initiates the feature output U of the network convolution transformcCan be defined as:
Figure BDA0003266437950000041
wherein, [ X ]l,Xl-1,Xl-2]Representing the current layer to carry out channel number merging cascade operation through 2-order related connection mode, using the feature mapping of the first two layers as the input of the current layer, W3×3And W1×1Representing convolution kernel sizes of 1 × 1 and 3 × 3 kernel functions, respectively, BN (·) representing batch normalization, f (·) representing a ReLU activation function, BN (·) representing batch normalization, B representing a bias coefficient;
each Transition layer structure comprises a convolution layer and a pooling layer; the convolution kernel of the convolution layer is 1 x1, feature dimension reduction processing is carried out on the convolution layer, then the convolution layer is connected with the pooling layer, the size of the matrix is reduced through the pooling layer processing, the parameters of the final full-connection layer are reduced, and the expression formula is as follows:
y^l_{k,m,n} = ( Σ_{(i,j)∈R_{m,n}} x(i,j)^p )^{1/p}
where y^l_{k,m,n} denotes the output of the pooling layer, l is the number of feature layers contained in each 2-DenseBlock structure, k is the number of channels of the feature map, m and n give the size of the pooling window, x(i, j) corresponds to a pixel of the feature map within the pooling region R_{m,n}, and p is a pre-specified parameter; when p tends to infinity, y^l_{k,m,n} takes the maximum value within the pooling region, i.e. max pooling.
In the embodiment shown in fig. 3, when the number of layers l is 4, the output of layer 1 is X_1, which is forward-propagated without a Concatenate layer as the input X_2 of layer 2; the feature map input to layer 3 is related only to the feature maps output by layer 2 and layer 1, i.e. X_3 = f([x_3, x_2, x_1]); the feature map input to layer 4 is related only to the feature maps output by layer 3 and layer 2, i.e. X_4 = f([x_4, x_3, x_2]).
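A hedged Keras sketch of one 2-DenseBlock and its Transition Layer follows (keras + TensorFlow is the framework named later in this description). Each feature layer concatenates only the outputs of the previous two layers, which is what distinguishes the 2nd-order connection from standard DenseNet; the growth rate, the 1×1 bottleneck width and the dropout rate are illustrative assumptions, not values fixed by the patent.

```python
from tensorflow.keras import layers

def feature_layer(x, growth: int):
    """One feature layer as in Table 1: Conv(1x1)-BN-ReLU, Conv(3x3)-BN-ReLU."""
    y = layers.Conv2D(4 * growth, 1, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(growth, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.Activation("relu")(y)

def dense_block_2(x, n_layers: int = 4, growth: int = 16, dropout: float = 0.2):
    """2-DenseBlock: each layer sees only the previous two layers' outputs."""
    prev, curr = None, x
    for i in range(n_layers):
        inp = curr if prev is None else layers.Concatenate()([curr, prev])
        out = feature_layer(inp, growth)
        if i == 0:                 # dropout between the 1st and 2nd feature layer
            out = layers.Dropout(dropout)(out)
        prev, curr = curr, out
    return curr

def transition_layer(x, channels: int):
    """Transition Layer: 1x1 conv for channel reduction, then [2, 2] max pooling."""
    y = layers.Conv2D(channels, 1, padding="same")(x)
    return layers.MaxPooling2D(pool_size=(2, 2))(y)
```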
As shown in fig. 4, the GRU unit is a gated recurrent unit, a gating mechanism in recurrent neural networks. The gated recurrent unit mechanism contains an update gate z_t and a reset gate r_t. The update gate z_t controls how much information from the previous state x_{t-1} is carried into the current state x_t, and the reset gate r_t controls how much information from the previous state is written into the current candidate set, which gives the GRU good performance in temporal modeling. The expressions are as follows:
z_t = σ(W_z · [h_{t-1}, x_t])
r_t = σ(W_r · [h_{t-1}, x_t])
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
where σ(·) denotes the fully connected layer with its activation function, h̃_t denotes the candidate hidden state, h_t denotes the output, and ⊙ denotes element-wise multiplication; W_z is the weight coefficient of the update gate, W_r is the weight coefficient of the reset gate, and tanh(·) denotes the tanh activation function.
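To make the gating explicit, the following NumPy sketch evaluates the four equations above for a single time step; in the actual model the Keras GRU layer would be used, and the weight shapes here are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """x_t: (d_in,), h_prev: (d_h,), each W*: (d_h, d_h + d_in)."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ concat)                    # update gate
    r_t = sigmoid(Wr @ concat)                    # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde    # new hidden state / output
    return h_t
```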
As shown in fig. 2, the feature vector sequence input to the sound event detection model first passes through one convolution operation and one pooling operation, and is then fed in turn into m consecutive 2-DenseBlocks, each followed by a Transition Layer; after processing by the 2-DenseNet(m), which comprises the m consecutive 2-DenseBlock structures and Transition Layers, a reshape layer converts the feature dimensions, turning the three-dimensional feature tensor into a two-dimensional feature vector; n gated recurrent units (GRU) connected next perform temporal modeling; the result then passes through a TimeDistributed layer for time-series tensor operations and through a fully connected layer for detection, and finally the detection result is output after processing by a Sigmoid function. The number m of 2-DenseBlocks and the number of layers l are evaluated according to the actual hardware conditions and the complexity of the data. In the embodiments shown in fig. 2 and fig. 3 of the drawings, n is 2, m is 2 and l is 4.
The Sigmoid function f(z) is given by:
f(z) = 1 / (1 + e^(-z))
S5: after processing, the reconstructed feature vector sequence is input into the trained sound event detection model for recognition and detection, obtaining the sound event detection result of the audio data to be processed; before the reconstructed feature vector sequence is input into the trained sound event detection model, the feature vector needs to be converted from a 2-dimensional vector into a 3-dimensional vector, where the 3rd dimension is the number of channels in the sound event detection model.
In this technical scheme, the 2nd-order DenseNet extracts the information in the feature maps more effectively, and the gated recurrent unit (GRU) model is simple, which makes it better suited to building larger networks; in this patent, the 2nd-order DenseNet network is combined with 2 gated recurrent unit (GRU) layers, which is more efficient from a computational point of view; based on this technical scheme, the frequency-domain information of the feature maps can be extracted effectively, the temporal features of long audio sequences can be captured effectively, and the classification and regression tasks in detection can be realized more efficiently.
Table 2 below shows an example of the network structure of the 2-DenseGRUNet model; a code sketch of this structure follows the table.
Table 2: example of the 2-DenseGRUNet model
Input:mfcc[128,80,1]
Conv(3×3):[128,80,32]
Pooling(2×2):[64,80,32]
2-Denseblock(1):[32,80,16]
Transition Layer(16,80,8)
2-Denseblock(2):[16,80,8]
Transition Layer(16,80,8)
Reshape:[64,160]
GRU(1):[64,64]
GRU(2):[64,32]
TimeDistributed:[64,6]
Fully connected layer: [64, 6]
Output(Sigmoid):[64,6]
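The sketch below assembles an end-to-end model following Table 2, reusing the dense_block_2 and transition_layer helpers sketched earlier; where Table 2 is ambiguous (e.g. the exact pooling axes and channel counts), the values chosen here are assumptions that merely keep the shapes consistent, not the patented configuration.

```python
from tensorflow.keras import layers, models

def build_2_densegrunet(n_classes: int = 6):
    inp = layers.Input(shape=(128, 80, 1))                  # Input: mfcc [128, 80, 1]
    x = layers.Conv2D(32, 3, padding="same")(inp)           # Conv(3x3)
    x = layers.MaxPooling2D(pool_size=(2, 1))(x)            # pool time axis -> [64, 80, 32]
    x = dense_block_2(x, n_layers=4, growth=16)             # 2-Denseblock(1)
    x = transition_layer(x, channels=8)                     # Transition Layer
    x = dense_block_2(x, n_layers=4, growth=8)              # 2-Denseblock(2)
    x = transition_layer(x, channels=8)                     # Transition Layer
    x = layers.Reshape((64, -1))(x)                         # 3-D features -> 2-D sequence
    x = layers.GRU(64, return_sequences=True)(x)            # GRU(1)
    x = layers.GRU(32, return_sequences=True)(x)            # GRU(2)
    x = layers.TimeDistributed(layers.Dense(n_classes))(x)  # TimeDistributed + dense
    out = layers.Activation("sigmoid")(x)                   # frame-wise sigmoid output
    return models.Model(inp, out)
```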
The DCASE 2017 dataset is used; the dataset contains 6 detection categories, each with time labels giving the start and end times of the events. The 2-DenseGRUNet model shown in Table 2 is used, with m set to 2, i.e. the feature vector sequence input to the detection model passes through one convolution operation and one pooling operation and is then fed in turn into 2 consecutive 2-DenseBlocks; according to the data conditions and the performance of the equipment, l is 4 in each 2-DenseBlock structure, i.e. each 2-DenseBlock comprises 4 feature layers; n is 2, i.e. the number of gated recurrent unit (GRU) layers is 2.
Time-domain and frequency-domain analysis is performed on the audio frame sequence, the Mel-frequency cepstrum coefficients are extracted, and the feature vector sequence is output; the number of sampled frames for the input audio data in the DCASE 2017 dataset is 128, and the feature dimension of the selected Mel-frequency cepstrum coefficients is 40 MFCC: with 40 Mel filter banks, the 40-dimensional MFCC features are extracted, and the output Mel cepstrum coefficient feature vector sequence is (128, 40).
The 2-dimensional vector is converted into 3-dimensional data by reshape, because the number of input channels in the network structure of the 2-DenseNet model is 1; after conversion to three-dimensional data, the feature vectors for DCASE 2017 are (128, 40, 1); the three-dimensional feature vectors are later converted into two-dimensional feature vectors (256, 40) and input into the gated recurrent unit (GRU) for time series modeling.
The feature vectors are input into the 2-DenseNet model: the input feature map sequence first passes through a convolution layer, is then pooled by the pooling layer, and the resulting three-dimensional data is fed in turn into 2 consecutive 2-DenseBlocks.
In each 2-DenseBlock there are 4 feature layers, i.e. 4 2-DenseBlock functions whose input is a sequence of feature maps. Within the 2-DenseBlock function, batch normalization (BN) is performed first, with the ReLU function as the activation function, followed by the convolution layer; this procedure is performed twice within the function, with a first convolution kernel of size 1×1 and a second of size 3×3. The specific operation of the 2-DenseBlock function (denoted 2-DenseBlock in the formula) is therefore:
2-DenseBlock([X_l, X_{l-1}, X_{l-2}]) = f(BN(W_{3×3} · f(BN(W_{1×1} · [X_l, X_{l-1}, X_{l-2}] + b)) + b)) (the same transform as the definition of U_c above)
three-dimensional data processed by two continuous 2-DenseBlock and Transition Layer is converted into two-dimensional eigenvectors, input into a gating circulation unit GRU for Time sequence modeling, then enter a Time Distributed Layer for Time sequence tensor operation, share weight, maximally model Time sequence, detect the starting Time and the ending Time of a sound event, and finally output after the detection result is processed by a Sigmoid function.
The experiments are run under a Windows 10 system with a GTX1060 graphics card, an i7-8750H CPU and 16 GB of memory, using keras + TensorFlow as the deep learning framework and the DCASE 2017 dataset. First, comparison experiments with different features, different feature dimensions, and network structures with and without the GRU units are carried out on the DCASE 2017 dataset to verify the segment error rate and F-score of the 2-DenseGRUNet model; then, the good performance of the 2-DenseGRUNet model is verified through comparison experiments with existing research models.
Detection experiments on the audio data are carried out on the DCASE 2017 dataset with the 2-DenseNet and 2-DenseGRUNet network models, using the extracted 40-dimensional and 128-dimensional Mel cepstral coefficient features and the 40-dimensional gammatone cepstral coefficient (GFCC) features; the specific results are shown in Table 3 below.
Table 3: effect fusion experimental result of 2-DenseGRUNet model
Model (model) Feature(s) Average fragment error rate F-score of
2-DenseNet 40mfcc 0.566 0.610
2-DenseNet 128mfcc 0.562 0.613
2-DenseNet 40gfcc 0.586 0.607
2-DenseGRUNet 40mfcc 0.543 0.651
2-DenseGRUNet 128mfcc 0.541 0.648
2-DenseGRUNet 40gfcc 0.563 0.631
According to the experimental results, compared with the 2-DenseNet model without the gated recurrent unit (GRU), the 2-DenseGRUNet model adopted by the invention reduces the average segment error rate by 2.3%, 2.1% and 2.3% and improves the F-score by 4.1%, 3.5% and 2.4% under the 40mfcc, 128mfcc and 40gfcc features respectively; under the 2-DenseGRUNet model, 40mfcc reduces the average segment error rate by 2.0% and improves the F-score by 2.0% relative to 40gfcc; with 40mfcc relative to 128mfcc, the average segment error rate and F-score change by only about 0.1%, but using 40mfcc effectively reduces the model training time and the computational requirements. In conclusion, the 2-DenseGRUNet model with 40mfcc fuses feature information more efficiently, obtains more effective feature information, and performs time series modeling effectively. The best average segment error rate is 0.543 and the F-score is 65.12%.
Further tests of the 2-DenseGRUNet model are carried out on the DCASE 2017 dataset; the test results are compared with the accuracy of existing models from researchers at home and abroad, and the comparison results are shown in Table 4.
Table 4: Test results of different models
[Table 4 is provided as an image in the original publication; it lists the average segment error rate and F-score of the baseline MLP model, the LSTM model and the 2-DenseGRUNet model on the DCASE 2017 dataset.]
Compared with the test results of researchers at home and abroad, the 2-DenseGRUNet model adopted in the technical scheme of the invention reduces the average segment error rate by 14.7% and improves the F-score by 8.4% relative to the baseline MLP model, and reduces the average segment error rate by 1.9% and improves the F-score by 3.9% relative to the LSTM model; the average segment error rate of the technical scheme of the invention is significantly reduced and the F-score is significantly improved.
In summary, when processing audio data, the technical scheme provided by the invention fuses feature information more efficiently and obtains more effective feature information, and the model has a lower average segment error rate and a higher F-score; that is, the sound classification realized based on the method of the invention is more accurate.

Claims (10)

1. The sound event detection method based on the 2-DenseGRUNet model comprises the following steps:
s1: collecting audio data to be processed, preprocessing an original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operation comprises the following steps: sampling and quantizing, pre-emphasis processing and windowing;
s2: analyzing the time domain and the frequency domain of the audio frame sequence, extracting a Mel frequency cepstrum coefficient, and outputting an original characteristic vector sequence;
s3: reconstructing the feature information and the label of the original feature vector sequence, and outputting a reconstructed feature vector sequence after feature reconstruction processing;
converting the starting time, the ending time and the category of the sound event in the original feature vector sequence into a starting frame, an ending frame and an event label corresponding to the reconstructed feature vector sequence;
s4: constructing a sound event detection model, and performing iterative training on the model to obtain the trained sound event detection model;
s5: after the reconstructed feature vector sequence is processed, inputting the processed reconstructed feature vector sequence into the trained sound event detection model for identification and detection, and obtaining a sound event detection result of the audio data to be processed;
the method is characterized in that:
the sound event detection model comprises: an input layer, a 2nd-order DenseNet model and GRU units; all of the GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are arranged in sequence between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model comprises consecutive 2-DenseBlock structures, with a Transition layer structure arranged after each 2-DenseBlock structure;
in each 2-DenseBlock structure, the connections between feature layers are correlation connections based on a 2nd-order Markov model, and the input of the current feature layer is related to the outputs of the previous 2 feature layers; each Transition layer structure comprises a convolution layer and a pooling layer.
2. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the 2nd-order DenseNet model is connected in series with the GRU units in sequence, and is connected to the first GRU unit through a reshape layer; a TimeDistributed layer, a fully connected layer and an output layer are arranged in sequence after the GRU units.
3. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: each 2-DenseBlock comprises feature layers connected in sequence; each feature layer comprises a consecutive 1×1 convolution layer and 3×3 convolution layer, and in the feature layer the input data undergoes batch normalization and activation-function processing before entering the convolution layers for convolution; within each feature layer, the previous convolution layer and the next convolution layer are merged and cascaded through a Concatenate operation; and a dropout layer is added between the first and second feature layers in each 2-DenseBlock.
4. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: when extracting the Mel-frequency cepstrum coefficient features in step S2, the specific dimension of the extracted MFCCs is 40; the original feature vector sequence obtained in step S2 is a 2-dimensional vector, whose first dimension is the number of frames after sampling the audio data to be processed and whose second dimension is the dimension of the Mel-frequency cepstrum coefficients.
5. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: in step S3, when reconstructing the feature information and labels of the original feature vector sequence, the specific duration of the data in the original feature vector sequence is 1 s, the number of MFCC frames of the corresponding reconstructed feature vector sequence is 128, and the [start time, end time] in the original sound event label is converted into [start frame number, end frame number].
6. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: in step S5, before the reconstructed feature vector sequence is input into the trained sound event detection model, the feature vector needs to be converted from a 2-dimensional vector into a 3-dimensional vector, where the 3rd dimension is the number of channels in the sound event detection model.
7. The sound event detection method based on the 2-DenseGRUNet model according to claim 2, wherein: the feature input of the GRU unit is a two-dimensional feature vector.
8. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the Transition Layer comprises: one convolution layer with a 1×1 convolution kernel and one max pooling layer with pooling size [2, 2].
9. The sound event detection method based on the 2-DenseGRUNet model according to claim 2, wherein: the number of neurons in the fully connected layer is set to 256 or 128.
10. The sound event detection method based on the 2-DenseGRUNet model according to claim 2, wherein: the output layer is implemented based on a Sigmoid function.
CN202111089655.7A 2021-09-16 2021-09-16 Sound event detection method based on 2-DenseGRUNet model Active CN113744758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111089655.7A CN113744758B (en) 2021-09-16 2021-09-16 Sound event detection method based on 2-DenseGRUNet model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111089655.7A CN113744758B (en) 2021-09-16 2021-09-16 Sound event detection method based on 2-DenseGRUNet model

Publications (2)

Publication Number Publication Date
CN113744758A true CN113744758A (en) 2021-12-03
CN113744758B CN113744758B (en) 2023-12-01

Family

ID=78739499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111089655.7A Active CN113744758B (en) 2021-09-16 2021-09-16 Sound event detection method based on 2-DenseGRUNet model

Country Status (1)

Country Link
CN (1) CN113744758B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373985A1 (en) * 2017-06-23 2018-12-27 Nvidia Corporation Transforming convolutional neural networks for visual sequence learning
CN109065075A (en) * 2018-09-26 2018-12-21 广州势必可赢网络科技有限公司 A kind of method of speech processing, device, system and computer readable storage medium
CN109949824A (en) * 2019-01-24 2019-06-28 江南大学 City sound event classification method based on N-DenseNet and higher-dimension mfcc feature
CN110084292A (en) * 2019-04-18 2019-08-02 江南大学 Object detection method based on DenseNet and multi-scale feature fusion
CN112890828A (en) * 2021-01-14 2021-06-04 重庆兆琨智医科技有限公司 Electroencephalogram signal identification method and system for densely connecting gating network
CN113012714A (en) * 2021-02-22 2021-06-22 哈尔滨工程大学 Acoustic event detection method based on pixel attention mechanism capsule network model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
曹毅 (CAO Yi): "N-DenseNet城市声音事件分类模型" [N-DenseNet urban sound event classification model], 《西安电子科技大学学报》 [Journal of Xidian University], vol. 46, no. 6, pages 10-15 *

Also Published As

Publication number Publication date
CN113744758B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
Gupta et al. Comparing recurrent convolutional neural networks for large scale bird species classification
CN110390952B (en) City sound event classification method based on dual-feature 2-DenseNet parallel connection
CN110827804B (en) Sound event labeling method from audio frame sequence to event label sequence
CN107221320A (en) Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
WO2020028760A1 (en) System and method for neural network orchestration
CN109448703B (en) Audio scene recognition method and system combining deep neural network and topic model
WO2022048239A1 (en) Audio processing method and device
CN111161715A (en) Specific sound event retrieval and positioning method based on sequence classification
Passricha et al. A comparative analysis of pooling strategies for convolutional neural network based Hindi ASR
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
Zhang et al. Learning audio sequence representations for acoustic event classification
Zhang et al. Temporal Transformer Networks for Acoustic Scene Classification.
CN111666996A (en) High-precision equipment source identification method based on attention mechanism
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN113129908B (en) End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
Mustika et al. Comparison of keras optimizers for earthquake signal classification based on deep neural networks
CN112489689B (en) Cross-database voice emotion recognition method and device based on multi-scale difference countermeasure
CN112466284B (en) Mask voice identification method
CN116741159A (en) Audio classification and model training method and device, electronic equipment and storage medium
Liu et al. Bird song classification based on improved Bi-LSTM-DenseNet network
CN113744758B (en) Sound event detection method based on 2-DenseGRUNet model
CN112861949B (en) Emotion prediction method and system based on face and sound
WO2021147084A1 (en) Systems and methods for emotion recognition in user-generated video(ugv)
Spoorthy et al. Polyphonic Sound Event Detection Using Mel-Pseudo Constant Q-Transform and Deep Neural Network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant