CN116013276A - Indoor environment sound automatic classification method based on lightweight ECAPA-TDNN neural network


Info

Publication number
CN116013276A
CN116013276A
Authority
CN
China
Prior art keywords
ecapa
tdnn
neural network
lightweight
training
Prior art date
Legal status
Pending
Application number
CN202211715093.7A
Other languages
Chinese (zh)
Inventor
杨俊杰
丁家辉
翁士龙
胡锦业
谢胜利
刘子瑜
李津
Current Assignee
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN202211715093.7A
Publication of CN116013276A


Abstract

The invention discloses an automatic classification method for indoor scene environmental sounds based on a lightweight ECAPA-TDNN neural network, relating to the technical field of environmental sound classification, and comprising the following steps. First, the initial environmental sound data are augmented by time masking, frequency masking, and time-shifting of the audio data. Second, Mel spectrogram features of the indoor scene environmental sound data are extracted through pre-emphasis, short-time Fourier transform, Mel filtering, and related steps, and the resulting Mel spectrogram feature data are divided into a training set and a test set. Third, an ECAPA-TDNN network model is constructed and its neuron parameters are optimized on the training set; the trained neural network is then used to classify the environmental sound test set. Compared with environmental sound classification methods built on conventional training and classification frameworks, the proposed method offers higher accuracy, wider applicability, and lower consumption of computing resources.

Description

Indoor environment sound automatic classification method based on lightweight ECAPA-TDNN neural network
Technical Field
The invention relates to the technical field of environmental sound classification, and in particular to a technique for accurately classifying environmental sounds in complex indoor scenes based on a lightweight time-delay neural network.
Background
Sound is an important information carrier and is often used to assist environment sensing and information-based decision making; because it is easy to collect and is not limited by lighting or viewing angle, it is widely applied in smart-home safety detection. A smart device receiving indoor audio signals can use environmental sound recognition to detect and judge events involving occupants, such as a baby crying, an elderly person falling, or knocking, and can quickly sense changes in the environment, such as approaching or receding footsteps, so as to react and decide in a timely and reasonable manner. Developing a high-precision environmental sound recognition technique is therefore important. At the current state of the art, deep-learning-based methods are the mainstream approach to environmental sound recognition, but two main challenges remain: 1. environmental sound types are complex, the raw features of different original environmental sound signals are poorly separated, and the amount of data is small, which makes the subsequent label classification task difficult; 2. the neural networks used in most environmental sound classification work have large parameter counts and complex computations, and are not suitable for deployment on terminal devices. Based on this analysis, the invention provides an automatic classification method for indoor environmental sounds based on a lightweight ECAPA-TDNN neural network to solve the problems set forth above.
Disclosure of Invention
The invention aims to provide an automatic classification method for indoor environmental sounds based on a lightweight ECAPA-TDNN neural network, so as to solve the problems set forth in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions: an indoor environment sound automatic classification method based on a lightweight ECAPA-TDNN neural network comprises the following steps:
S1, audio preprocessing: converting the multi-channel environmental sound signal to a fixed number of channels and performing resampling and length standardization;
S2, data augmentation: applying time masking and frequency masking to the environmental sound signal from S1, and time-shifting the audio to augment the environmental sound data;
S3, feature extraction: pre-emphasizing the augmented environmental sound data, applying the short-time Fourier transform (STFT) and Mel filtering, and outputting Mel spectrum feature vectors; applying cepstral mean subtraction to the Mel spectrum feature vectors as a second normalization step, and building and loading a feature data set;
S4, constructing the ECAPA-TDNN classifier: feeding the environmental sound feature data set extracted in S3 into a (channel-raising) convolution layer, using Kaiming initialization with zeroed biases, and adding nonlinear relations between the network layers; after batch normalization, the data iterate through the convolution layers, and once the channel count reaches a threshold the result is passed to the next layer;
S5, constructing the squeeze-and-excitation (SE) module and the pooling and linear-layer modules of the ECAPA-TDNN classifier: applying average pooling, convolution, and nonlinear transformation to the output of S4;
S6, training phase: inputting the labels and the extracted Mel spectrogram features of the environmental audio into the network for structure and parameter optimization training of the ECAPA-TDNN model; training is complete once the iterations reach the empirically set maximum;
S7, testing: classifying the environmental sound features in the test sample data set with the trained ECAPA-TDNN classifier to obtain the test classification results.
Preferably, the SpecAugment method is applied to the log-Mel spectrogram obtained in step S3; this strategy randomly masks some frames in the time domain and some channels in the frequency domain.
Preferably, the features extracted in step S3 are Mel spectrograms, and cepstral mean subtraction is applied to the Mel spectrogram feature vectors as a second normalization step.
Preferably, in step S5 the squeeze-and-excitation module of the ECAPA-TDNN is constructed, and the output of S4 undergoes average pooling, convolution, and nonlinear transformation.
Preferably, the ECAPA-TDNN classification model in step S6 is trained for 450-500 iterations.
Preferably, when the ECAPA-TDNN classification model is trained for 450-500 iterations, the resulting loss rate and accuracy gradually converge. The loss function is the cross-entropy loss used in the detailed description:

L = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log(p_{i,c})

where N is the number of training samples, C the number of classes, y_{i,c} the one-hot (0/1) label, and p_{i,c} the predicted probability of class c for sample i.
Compared with the prior art, the invention has the following beneficial effects:
the invention extracts the features of the mel spectrogram of each type of different environmental sound signals (such as common environmental sounds including television sounds, cooking sounds, running water sounds, alarm sounds, crying sounds of children, curtain drawing sounds and the like); building a training set and a data set for the extracted features of the mel spectrogram and loading the training set and the data set; and constructing an ECAPA-TDNN network model, training an ECAPA-TDNN network classification model, and finishing classification recognition. The method has the advantages that the Mel spectrogram characteristics are extracted from the environmental audio frequency and used for training the ECAPA-TDNN classification network to classify, and compared with the environmental audio classification method using the traditional training classification framework, the method provided by the invention has higher accuracy and less consumption of computational resources.
Drawings
FIG. 1 is a system flow chart of an embodiment of an automatic classification method of indoor environmental sounds based on a lightweight ECAPA-TDNN neural network;
FIG. 2 is a time domain waveform diagram of an environmental sound signal according to an embodiment of an automatic classification method of indoor environmental sound based on a lightweight ECAPA-TDNN neural network;
FIG. 3 is a time domain waveform diagram of a preprocessed ambient sound signal according to an embodiment of an automatic classification method of indoor ambient sound based on a lightweight ECAPA-TDNN neural network;
FIG. 4 is a graph of a frequency spectrum of an ambient sound signal with negative decibels removed according to an embodiment of an automatic classification method of indoor ambient sound based on a lightweight ECAPA-TDNN neural network;
FIG. 5 is a Mel spectrum of an environmental sound signal according to an embodiment of an automatic classification method of indoor environmental sound based on a lightweight ECAPA-TDNN neural network;
FIG. 6 is a schematic diagram of a process for extracting features of a Mel spectrogram of an embodiment of an automatic classification method of indoor environmental sounds based on a lightweight ECAPA-TDNN neural network;
FIG. 7 is a frame diagram of ECAPA-TDNN of an embodiment of an automatic classification method of indoor environmental sound based on a lightweight ECAPA-TDNN neural network;
FIG. 8 is a graph showing the variation and convergence of the loss function value according to an embodiment of the automatic classification method of indoor environmental sounds based on a lightweight ECAPA-TDNN neural network;
FIG. 9 is an F1-score change chart for environmental sound classification and recognition in the automatic classification method of indoor environmental sounds based on a lightweight ECAPA-TDNN neural network;
FIG. 10 is a diagram showing the recognition accuracy change rates for various environmental sounds in the automatic classification method of indoor environmental sounds based on a lightweight ECAPA-TDNN neural network.
Detailed Description
The following describes the technical solutions in the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the present invention.
Referring to fig. 1-7, the invention provides an indoor environment sound automatic classification method based on a lightweight ECAPA-TDNN neural network, which comprises the following steps:
S1, audio preprocessing:
S11, converting the multichannel original environmental sound signal into a two-channel signal.
S12, setting the audio sampling rate to 22050 Hz.
S13, standardizing the audio length: truncating all audio signals longer than 10 seconds and padding the shorter ones so that every clip is 10 seconds long.
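A minimal sketch of this preprocessing stage (S11-S13) is given below. The patent does not name an implementation library; librosa and NumPy are assumed here, with the two-channel layout, 22050 Hz rate, and 10-second target taken from the steps above.

```python
import numpy as np
import librosa

def preprocess_audio(path, sr=22050, duration_s=10.0, n_channels=2):
    """S1 sketch: fixed channel count, fixed sample rate, fixed length."""
    y, _ = librosa.load(path, sr=sr, mono=False)  # resample to 22050 Hz
    if y.ndim == 1:                               # mono input: duplicate to two channels
        y = np.stack([y, y])
    y = y[:n_channels]                            # keep a fixed number of channels
    target_len = int(sr * duration_s)
    if y.shape[1] > target_len:                   # truncate clips longer than 10 s
        y = y[:, :target_len]
    else:                                         # zero-pad shorter clips to 10 s
        y = np.pad(y, ((0, 0), (0, target_len - y.shape[1])))
    return y
```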
S2, data augmentation:
S21, calculating the average frequency aver_Fre, randomly masking the Mel channels in the interval around it (rendered as a horizontal bar on the spectrogram), and returning the remaining data unchanged to generate the Mel spectrogram.
S22, time-shifting the audio data: given a random number r and a shift limit S, the shift amount is defined as r × S × audio length.
S23, generating the Mel spectrogram: the Mel filter bank size is set to 64, and the short-time Fourier transform window length is set to 1024 with no frame shift (non-overlapping frames).
S24, converting the power scale to the decibel scale, clipping values more than 80 dB below the peak.
S25, label processing: according to the low-complexity home environmental sound challenge data set of the 2022 iFLYTEK A.I. Developer Competition used in the experiment, features are extracted for six types of environmental sounds (television sound, stir-frying sound, running water sound, alarm sound, child crying, and curtain drawing sound) to obtain the environmental sound Mel spectrogram features, and the training-set data are labeled. The target environmental sound types are assigned to different groups, so the experiment uses six categories for classification: television sound, label 0; stir-frying sound, label 1; running water sound, label 2; curtain drawing sound, label 3; alarm sound, label 4; child crying, label 5. A code sketch covering S21-S25 is given below.
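The sketch below follows S21-S25 on an (n_mels, frames) Mel spectrogram. The patent does not fully specify the aver_Fre-centered masking interval or the shift limit S, so the uniformly random masking band, the value s=0.4, and the English class keys are illustrative assumptions.

```python
import numpy as np

# Label values from S25; the English keys are illustrative shorthand.
LABELS = {"television": 0, "stir_fry": 1, "running_water": 2,
          "curtain": 3, "alarm": 4, "child_cry": 5}

def time_shift(y, s=0.4):
    """S22 sketch: shift by r * s * audio_length samples, r random in [0, 1)."""
    r = np.random.rand()
    return np.roll(y, int(r * s * y.shape[-1]), axis=-1)

def mask_mel(mel, max_freq_bins=8, max_time_frames=16):
    """S21-style masking: zero a random band of Mel channels and a run of frames."""
    mel = mel.copy()
    f0 = np.random.randint(0, mel.shape[0] - max_freq_bins)
    mel[f0:f0 + np.random.randint(1, max_freq_bins + 1), :] = 0.0    # frequency mask
    t0 = np.random.randint(0, mel.shape[1] - max_time_frames)
    mel[:, t0:t0 + np.random.randint(1, max_time_frames + 1)] = 0.0  # time mask
    return mel
```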
S3, applying cepstral mean subtraction to the Mel spectrogram feature vectors as a second normalization step, and constructing the training set and test set;
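A feature-extraction sketch consistent with S23, S24, and the mean subtraction of S3 (64 Mel filters, a 1024-sample STFT window with non-overlapping frames per the "no frame shift" wording, and dB values clipped 80 dB below the peak). librosa is again an assumed choice, and the input is a single channel of the preprocessed audio.

```python
import librosa

def extract_mel_features(y, sr=22050, n_mels=64, n_fft=1024):
    """S23/S24/S3 sketch: Mel spectrogram -> dB scale -> mean subtraction."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=n_fft,  # hop = window: no frame overlap
        n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, top_db=80.0)  # clip 80 dB below the peak
    # Cepstral-mean-subtraction-style normalization along the time axis.
    return mel_db - mel_db.mean(axis=-1, keepdims=True)
```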
S4, constructing the classifier module of the ECAPA-TDNN:
S41, the environmental sound feature tensor extracted in S2 first enters the input layer of the ECAPA-TDNN and then a (channel-raising) convolution layer with kernel size (3, 3), padding (1, 1), and stride (2, 2). Kaiming initialization with zeroed biases is used to add nonlinear relations between the network layers. After batch normalization, the data iterate through the input layer, with the channel count doubling each pass; if the channel count has not yet reached 256, the output re-enters the layer as input, and once 256 channels are reached the result is passed to the next module;
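A PyTorch reading of S41 is sketched below: stacked (3, 3) convolutions with padding (1, 1) and stride (2, 2), Kaiming-initialized weights with zeroed biases, batch normalization, and channel doubling until 256 channels are reached. It is an interpretation of the text, not the patent's exact layer list; the two input channels follow the two-channel audio of S11.

```python
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    """S41 sketch: repeated conv blocks, doubling channels until the 256 threshold."""
    def __init__(self, in_channels=2, max_channels=256):
        super().__init__()
        layers, ch = [], in_channels
        while ch < max_channels:
            conv = nn.Conv2d(ch, ch * 2, kernel_size=(3, 3),
                             stride=(2, 2), padding=(1, 1))
            nn.init.kaiming_normal_(conv.weight)  # Kaiming normalization
            nn.init.zeros_(conv.bias)             # bias zeroing
            layers += [conv, nn.BatchNorm2d(ch * 2), nn.ReLU()]
            ch *= 2
        self.body = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, channels, n_mels, frames)
        return self.body(x)
```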
S5, constructing the squeeze-and-excitation (SE) module of the ECAPA-TDNN and applying average pooling, convolution, and nonlinear transformation to the output of S4:
S51, let the output tensor of the Mel spectrogram features of the various environmental sounds after the classifier module, of size C × H × W (where H and W are the length and width of the audio features), enter the module's average pooling layer and be converted into a C × 1 × 1 feature tensor. It then enters convolution layer A, with kernel size set to 1, padding set to 0, and stride set to 1; the reduction ratio is empirically set to 16, giving 16 channels. After a ReLU activation, the tensor is fed into convolution layer B (the excitation module), which restores the channel count to 256 (the restoration operation).
S52, based on the output tensor from S51 and the tensor produced by the squeeze-and-excitation module, the updated input is average-pooled, reducing the audio tensor to size 256 × 1 × 1, and then enters the linear layers. Dropout is applied there: across 5 linear layers the dropout probability increases from 0.1 to 0.5, and the output is the average of the summed outputs of the 5 linear layers.
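S51-S52 can be read as a standard squeeze-and-excitation block (256 channels squeezed to 16 under the reduction ratio 16, then restored to 256) followed by a pooled head whose dropout probability grows from 0.1 to 0.5 over five passes, with the outputs averaged. The sketch below takes the multi-sample-dropout reading, sharing one linear layer across the five dropout branches; a variant with five separate linear layers is equally consistent with the text.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """S51 sketch: squeeze (H, W) away, excite per-channel weights."""
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # (B, C, H, W) -> (B, C, 1, 1)
        self.conv_a = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.conv_b = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x):
        w = torch.relu(self.conv_a(self.squeeze(x)))  # 256 -> 16 channels
        w = torch.sigmoid(self.conv_b(w))             # 16 -> 256, gate in [0, 1]
        return x * w                                  # reweight the input channels

class MultiDropoutHead(nn.Module):
    """S52 sketch: pooled features, dropout 0.1..0.5, averaged outputs."""
    def __init__(self, in_features=256, num_classes=6):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # -> (B, 256, 1, 1)
        self.drops = nn.ModuleList(nn.Dropout(p)
                                   for p in (0.1, 0.2, 0.3, 0.4, 0.5))
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, x):
        x = self.pool(x).flatten(1)                   # (B, 256)
        return torch.stack([self.fc(d(x)) for d in self.drops]).mean(dim=0)
```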
S6, training the ECAPA-TDNN classification model: the Mel spectrogram features of the processed environmental sound signals are fed into the network for supervised label training of the ECAPA-TDNN model; the Adam optimizer is used with an initial learning rate of 0.001, and the learning rate is changed dynamically with a linear annealing strategy;
S7, classification test: the trained ECAPA-TDNN network structure and parameters are used to classify the environmental sound features in the test sample data set, obtaining classification results and recognizing the various environmental sounds.
In step S2, frequency masking and time masking are applied to the environmental sound signal from S1 to enhance the features of the environmental sound signal.
The squeeze-and-excitation module constructed for the ECAPA-TDNN model in step S5 reduces the number of audio feature channels from 256 to 16 and restores it through a convolution layer, improving the model's sensitivity to channel-wise features.
In step S6, the ECAPA-TDNN classification model is trained for 450-500 iterations while the experimental parameters are continuously tuned, finally yielding the most suitable parameters. In the experiments, the loss rate and accuracy of the model were observed to converge gradually after about 400 training iterations, so the number of training iterations was set to 450. In each run, the model parameters obtained from the last epoch are used for the next round of training; when the framework is retrained, the changes in loss rate and accuracy remain very small, so 450 iterations were chosen.
When the ECAPA-TDNN classification model is trained for 450-500 iterations, the resulting loss rate and accuracy gradually converge. The loss function is the cross-entropy loss defined above; its value is zero only when the predicted class probability exactly matches the one-hot (0/1) environmental sound label, i.e., only a correct, fully confident classification yields zero loss.
Step one, audio preprocessing stage: performing channel conversion, resampling, and truncation on the environmental sound data downloaded from the low-complexity home environmental sound challenge data set of the 2022 iFLYTEK A.I. Developer Competition, to construct environmental sound data as heard in a real indoor scene;
Step two, data augmentation and labeling stage: applying frequency masking and time masking to the preprocessed environmental sound data to enhance the signal features, using strategies such as time-shifting the audio to enlarge the amount of environmental sound data, extracting Mel spectrogram features from the processed signals, and labeling them to construct the data set;
Step three, environmental sound classification stage: feeding the training set from the data set obtained in step two into the ECAPA-TDNN classification network, which finally classifies six different environmental sounds, thereby training the classification framework.
Embodiment one:
The feasibility and superiority of the algorithm are illustrated below by classifying different environmental sounds with the environmental sound classification model under this strategy.
First, data processing is carried out. The selected data set comes from the low-complexity home environmental sound challenge of the 2022 iFLYTEK A.I. Developer Competition, and the environmental sounds fall into six classes: television sound, stir-frying sound, running water sound, alarm sound, child crying, and curtain drawing sound. The multichannel original environmental sound signals are first converted to two-channel signals and the sampling rate is set to 22050 Hz; the audio length is then standardized by truncating all signals longer than 10 seconds and padding the shorter ones, after which the experiment proceeds.
The experiment then applies data augmentation to the six classes of preprocessed environmental sounds. The audio of interest is collected in an indoor home scene and must be classified effectively to meet the requirements of environmental sound judgment; since the available data are limited, which would affect subsequent network training, the original environmental sound data set is augmented and labeled. The average frequency aver_Fre of each environmental sound is computed, Mel channels in an interval around it are randomly masked (a horizontal bar on the spectrogram), and the remaining data are returned unchanged to generate the Mel spectrogram. The audio is then time-shifted by a random amount proportional to the audio length. The power scale is converted to the decibel scale with values clipped 80 dB below the peak, and the augmented data set is built and loaded.
Taking television sound as an example, the classifier module of the ECAPA-TDNN is constructed. The extracted television environmental sound feature tensor first enters the input layer of the ECAPA-TDNN and then a (channel-raising) convolution layer with kernel size (3, 3), padding (1, 1), and stride (2, 2); Kaiming initialization with zeroed biases adds nonlinear relations between the network layers. After batch normalization, the data iterate through the input layer, doubling the channel count each pass; if the count has not reached 256, the output re-enters the layer as input, and once 256 channels are reached the result is passed to the next module.
Let the output tensor of the Mel spectrogram features of the television environmental sound after the classifier module, of size C × H × W (where H and W are the length and width of the television audio features), enter the module's average pooling layer and be converted into a C × 1 × 1 feature tensor. It then enters convolution layer A with kernel size 1, padding 0, and stride 1; the reduction ratio is empirically set to 16, giving 16 channels. After a ReLU activation, the tensor is fed into convolution layer B (the excitation module), which restores the channel count to 256.
Based on the output tensor x of the television environmental sound after the classifier module and the tensor produced by the squeeze-and-excitation module, the updated x is fed into an average pooling layer, which reduces the audio tensor to size 256 × 1 × 1. It then enters the linear layers, where dropout is applied: across 5 linear layers the dropout probability increases from 0.1 to 0.5, and the output is the average of the summed outputs of the 5 linear layers.
Training the ECAPA-TDNN classification model: the Mel spectrogram features of the processed television environmental sound signal are fed into the network for supervised label training of the ECAPA-TDNN model. At every iteration the input is redefined as the input minus its mean, divided by its standard deviation; after the model's classification layer, the loss value is computed with a cross-entropy loss function. The Adam optimizer is used with an initial learning rate of 0.001 to accelerate gradient descent, and a scheduler step dynamically changes the learning rate with a linear annealing strategy. When the number of iterations reaches 500, the network is considered to have completed the training process for the television environmental sound.
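A training-loop sketch for this paragraph: per-batch standardization of the input, cross-entropy loss, Adam at learning rate 0.001, and a learning-rate schedule stepped once per epoch. PyTorch's LinearLR is an assumed stand-in for the patent's linear annealing strategy, and model and loader are assumed to come from the earlier stages.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=500, lr=1e-3, device="cpu"):
    """S6 sketch: standardized inputs, cross-entropy, Adam + linear annealing."""
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Linear annealing of the learning rate, stepped once per epoch.
    scheduler = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=1.0, end_factor=0.01, total_iters=epochs)
    for _ in range(epochs):
        for mel, label in loader:
            mel, label = mel.to(device), label.to(device)
            mel = (mel - mel.mean()) / (mel.std() + 1e-8)  # input standardization
            loss = criterion(model(mel), label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```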
What is not described in detail in this specification is all that is known to those skilled in the art.
Although the present invention has been described with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the described embodiments may be modified or their elements replaced by equivalents; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention falls within its scope.

Claims (6)

1. An indoor environment sound automatic classification method based on a lightweight ECAPA-TDNN neural network is characterized by comprising the following steps:
S1, audio preprocessing: converting the multi-channel environmental sound signal to a fixed number of channels and performing resampling and length standardization;
S2, data augmentation: applying time masking and frequency masking to the environmental sound signal from S1, and time-shifting the audio to augment the environmental sound data;
S3, feature extraction: pre-emphasizing the augmented environmental sound data, applying the short-time Fourier transform and Mel filtering, and outputting Mel spectrum feature vectors; applying cepstral mean subtraction to the Mel spectrum feature vectors as a second normalization step, and building and loading a feature data set;
S4, feeding the environmental sound feature data set extracted in S3 into a convolution layer, using Kaiming initialization with zeroed biases to add nonlinear relations between the network layers; after batch normalization, the data iterate through the convolution layers, and once the channel count reaches a threshold the result is passed to the next layer;
S5, constructing the squeeze-and-excitation module and the pooling and linear-layer modules of the ECAPA-TDNN classifier: applying average pooling, convolution, and nonlinear transformation to the output of S4;
S6, training phase: training on the output of S5 together with the label information, completing the structure and parameter optimization of the ECAPA-TDNN classifier after the maximum number of iterations;
S7, testing: classifying the environmental sound features in the test sample data set with the trained ECAPA-TDNN classifier to obtain the test classification results.
2. The method for automatically classifying indoor environmental sounds based on the lightweight ECAPA-TDNN neural network according to claim 1, characterized in that: the SpecAugment method is applied to the log-Mel spectrogram obtained in step S3; this strategy randomly masks some frames in the time domain and some channels in the frequency domain.
3. The method for automatically classifying indoor environmental sounds based on the lightweight ECAPA-TDNN neural network according to claim 1, characterized in that: the features extracted in step S3 are Mel spectrograms, and cepstral mean subtraction is applied to the Mel spectrogram feature vectors as a second normalization step.
4. The method for automatically classifying indoor environmental sounds based on the lightweight ECAPA-TDNN neural network according to claim 1, characterized in that: in step S5 the squeeze-and-excitation module of the ECAPA-TDNN is constructed, and the output of S4 undergoes average pooling, convolution, and nonlinear transformation.
5. The method for automatically classifying indoor environmental sounds based on the lightweight ECAPA-TDNN neural network according to claim 1, characterized in that: the structure and parameter optimization of the ECAPA-TDNN classifier in step S6 is trained for 450-500 iterations.
6. The method for automatically classifying indoor environmental sounds based on the lightweight ECAPA-TDNN neural network according to claim 5, characterized in that: the loss function for training the ECAPA-TDNN classifier is the cross-entropy loss

L = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log(p_{i,c})

where y_{i,c} is the one-hot label of sample i and p_{i,c} the predicted probability of class c; when the number of training iterations is 450-500, the resulting loss rate and accuracy gradually converge.

Priority Applications (1)

Application Number: CN202211715093.7A | Priority date: 2022-12-28 | Filing date: 2022-12-28 | Title: Indoor environment sound automatic classification method based on lightweight ECAPA-TDNN neural network

Publications (1)

Publication Number: CN116013276A | Publication Date: 2023-04-25

Family

ID=86033185

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116386661A (en) * 2023-06-05 2023-07-04 成都启英泰伦科技有限公司 Crying detection model training method based on dual attention and data enhancement
CN116386661B (en) * 2023-06-05 2023-08-08 成都启英泰伦科技有限公司 Crying detection model training method based on dual attention and data enhancement


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination