CN114386518A

CN114386518A - Lightweight abnormal sound event detection method based on adaptive-width self-attention mechanism

Info

Publication number: CN114386518A
Application number: CN202210039999.5A
Authority: CN
Inventors: 安正义; 姚雨; 宋浠瑜; 王玫; 仇洪冰
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2022-01-14
Filing date: 2022-01-14
Publication date: 2022-04-22

Abstract

The invention discloses a lightweight abnormal sound event detection method based on an adaptive-width self-attention mechanism. First, signal processing is performed on labeled audio to obtain a time-frequency feature representation of the audio. Second, the labeled feature representation (usually a vector or matrix) is fed as input to an adaptive-width self-attention mechanism model that has a defined loss function and randomly initialized attention weights; the loss between the model output and the label is computed, the attention weights are updated with the back-propagation algorithm, and the three attention input weights are updated iteratively until the loss function reaches a minimum or an acceptable state. Finally, the weight parameters are stored using a lightweight method and used as the model to predict an unlabeled segment of audio, so that abnormal sound events are predicted quickly and accurately.

Description

Lightweight abnormal sound event detection method based on adaptive-width self-attention mechanism

Technical Field

The invention relates to a method for detecting overlapping abnormal sound events by using an adaptive-width self-attention mechanism, and in particular to a lightweight abnormal sound event detection method based on the adaptive-width self-attention mechanism.

Background

Abnormal sound event detection belongs to the research field of acoustic event recognition and has important application value in smart-city scenarios such as smart homes, urban road anomaly detection and fault detection.

The sound event detection task mainly comprises signal processing and a machine learning model. Common signal processing methods include noise addition, the Fast Fourier Transform (FFT), and Mel-frequency cepstral coefficient (MFCC) feature extraction.

Some existing methods construct a learning model with a neural network to detect sound events, including schemes that use a Convolutional Neural Network (CNN) and a network structure based on a Recurrent Neural Network (RNN), trained alone or in combination. However, such models have poor real-time performance, are difficult to optimize during training, and predict slowly. The reason lies in the nature of the network structures. The CNN solves the problem of excessively large input data, but it cannot learn the event features in a sound event, and its large number of parameters easily causes overfitting. The RNN carries the corresponding time information when extracting features, but it introduces the long-term dependence problem and the vanishing gradient phenomenon; later variants solve some of these problems, but because the RNN has a sequential structure, parallel computation is difficult and its computation speed remains relatively slow. Compared with the CNN and the RNN, the self-attention mechanism has lower complexity and fewer parameters, so it requires less computing power and saves computer resources; it also overcomes the RNN's inability to run in parallel and therefore has a large advantage in computation speed; and it can attend better to the information of longer sequences, although it sometimes attends to more information than necessary.

In 2020, the team of Dr. Helen at Queen Mary University of London compared the recognition performance of the same machine learning model before and after a width-controlled self-attention mechanism was used. The results show that the F-score of the model using the width-controlled self-attention mechanism improved by 8.45% and the error rate decreased by 0.15, which demonstrates that controlling the attention width is effective for acoustic event detection. In addition, as deep neural networks develop, network structures become richer and the storage space of models grows larger, so that most detection models remain at the theoretical stage and are difficult to deploy on portable devices such as mobile terminals.

Disclosure of Invention

The invention provides a lightweight abnormal sound event detection method based on an adaptive-width self-attention mechanism, aimed at the problems of current Sound Event Detection (SED): the prediction model is large, prediction is slow, the dependence on computing resources is excessive, and real-time prediction is difficult. The method can classify and detect the abnormal sound events contained in a segment of audio and, with the same signal processing, achieves a better recognition effect than a CRNN-based method. It overcomes the low computation speed and the lack of parallel computation of the RNN, and it uses the idea of lightweighting to compress the model size at the cost of a small loss of recognition performance, so that the model can be deployed on a mobile terminal or other portable equipment.

The technical scheme for realizing the purpose of the invention is as follows:

The lightweight abnormal sound event detection method based on the adaptive-width self-attention mechanism comprises the following steps:

(1) constructing a synthetic audio data set, and labeling and classifying each audio containing a plurality of abnormal sound events;

(2) preprocessing the data set and extracting features, and feeding them into the built adaptive-width self-attention mechanism model for iterative network training until the model is optimal;

(3) compressing the model by using a lightweight method to obtain a lightweight detection model of the adaptive-width self-attention mechanism;

(4) preprocessing the audio to be detected and extracting features, and feeding them into the compressed detection model for detection to obtain a prediction result.

The labeling and classification in step (1) are as follows: first, a certain number of labeled single-sound-event audios are taken, each class of sound event is numbered, and the total number of sound event classes C is obtained. Then, some sound events are randomly synthesized to obtain synthetic audio, and each synthetic audio is labeled with the classes of sound events used in its synthesis. Finally, a label file is exported; the file records the names of the audio files and, under each audio file name, the classes of sound events that occur.

The preprocessing and feature extraction in step (2) and step (4) are as follows. The audio is resampled at a sampling rate of 16 kHz. The audio waveform is then standardized: the waveform data are mapped uniformly to a common range by max normalization, in which the waveform data are the data obtained by reading the audio file (.wav) with a Python wav package.

A short-time Fourier transform (STFT) is then used to extract 40-dimensional log-mel frequency cepstral coefficients for all audio; the specific parameters are the sampling rate, the frame length and the frame overlap.

The extracted 40-dimensional log-mel frequency cepstral coefficients are normalized with z-score: for audio of a given duration in seconds containing T frames, the log-mel cepstral coefficients obtained from the STFT are mapped by subtracting their mean and dividing by their standard deviation, so that the mapped log-mel frequency cepstral coefficients have mean 0 and variance 1.
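As a non-limiting illustration, the two normalization steps described above may be sketched in Python as follows. The function names `max_normalize` and `zscore` are hypothetical, dividing by the peak absolute sample (giving the range [-1, 1]) is an assumed reading of "normalized with max", and applying the z-score separately to each coefficient is likewise an assumption, since the text does not state over which axis the mean and variance are taken.

```python
import math

def max_normalize(samples):
    """Scale a waveform by its peak absolute value (assumed target range [-1, 1])."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return [float(s) for s in samples]  # silent clip: nothing to scale
    return [s / peak for s in samples]

def zscore(features):
    """Standardize frame-by-coefficient features to mean 0, variance 1.

    Assumption: statistics are computed per coefficient over all frames.
    """
    n_frames = len(features)
    n_coeffs = len(features[0])
    out = [[0.0] * n_coeffs for _ in range(n_frames)]
    for j in range(n_coeffs):
        col = [frame[j] for frame in features]
        mean = sum(col) / n_frames
        var = sum((v - mean) ** 2 for v in col) / n_frames
        std = math.sqrt(var) or 1.0  # guard against a constant coefficient
        for i in range(n_frames):
            out[i][j] = (col[i] - mean) / std
    return out

# Toy data: four raw samples and three frames of two coefficients each.
wave = max_normalize([0, 1200, -2400, 600])
feats = zscore([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

In an actual pipeline the 40-dimensional log-mel features would come from an STFT-based front end; the toy lists here only stand in for that output.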

The audio labels are processed as follows: labels in units of seconds are converted into labels in units of frames. Each label file is transformed into a frame-level audio label encoding matrix whose entries are 0 or 1; the number of columns of the matrix is the total number of frames, and the number of rows is the total number of sound event classes. For a synthetic abnormal sound event audio containing several classes of sound events, the conversion of the audio label encoding matrix from units of seconds to units of frames is as follows:

first, a zero matrix is created whose number of columns is the number of frames of the audio of the given duration in seconds and whose number of rows is the number of sound event classes; when the label indicates that a sound event of a given class occurs during a time interval, the row vector corresponding to that class is taken, the duration is converted into a length in frames, and the corresponding zero entries are set to 1;

finally, the vectors of the individual sound events are combined into a matrix, which is the audio label encoding matrix of the synthetic abnormal sound event audio.
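The conversion from second-level labels to a frame-level 0/1 matrix may be sketched as follows. This is only an illustrative sketch: the function name `encode_labels`, the `(class_index, onset_s, offset_s)` tuple layout and the frame rate of 4 frames per second are assumptions chosen for the example and are not taken from the label file format of the invention.

```python
import math

def encode_labels(events, n_classes, duration_s, frames_per_second):
    """Build a (classes x frames) matrix of 0/1 labels.

    events: iterable of (class_index, onset_s, offset_s) tuples (hypothetical layout).
    """
    n_frames = int(round(duration_s * frames_per_second))
    # Start from an all-zero matrix: one row per class, one column per frame.
    matrix = [[0] * n_frames for _ in range(n_classes)]
    for class_index, onset_s, offset_s in events:
        # Convert the time interval in seconds into a frame interval.
        start = max(0, int(onset_s * frames_per_second))
        stop = min(n_frames, int(math.ceil(offset_s * frames_per_second)))
        for t in range(start, stop):
            matrix[class_index][t] = 1
    return matrix

# A 2-second clip with 3 classes: class 0 active 0.5-1.5 s, class 2 active 1.0-2.0 s.
label_matrix = encode_labels(
    events=[(0, 0.5, 1.5), (2, 1.0, 2.0)],
    n_classes=3,
    duration_s=2.0,
    frames_per_second=4,
)
```

Overlapping events simply set entries in different rows of the same columns, which is how the matrix represents simultaneous sound events.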

The lightweight detection model of the adaptive-width self-attention mechanism in step (3) is built as follows:

1) pre-training the model:

A self-attention mechanism model network is built with a Python framework as follows: the model consists of three convolutions, three poolings, one gated recurrent unit (GRU) layer, one adaptive-width self-attention layer and one time-distributed layer; wherein: the first layer is the input layer, which takes the 40-dimensional log-mel cepstral coefficients; the second layer is a 2-D convolution layer (5 × 5 convolution kernel, 64 channels) followed by 2-D max pooling (5 × 1); the third layer is a 2-D convolution layer (5 × 5 convolution kernel, 64 channels) followed by 2-D max pooling (4 × 1); the fourth layer is a 2-D convolution layer (5 × 5 convolution kernel, 64 channels) followed by 2-D max pooling (2 × 1); the fifth layer uses Reshape and Permute to reduce the dimension of and transpose the output of the fourth layer; the sixth layer is a GRU with 64 neurons activated by tanh; the seventh layer is the adaptive-width self-attention mechanism, of the additive attention type, activated by sigmoid; the eighth layer is a time-distributed (TimeDistributed) Dense layer activated by sigmoid, whose number of units equals the number of sound event classes. Each convolutional layer uses a convolution kernel stride of 1, is followed by a normalization layer and 'ReLU' activation, and has dropout added to improve the generalization capability of the model.
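The effect of the three pooling stages on the feature dimensions can be checked with a short calculation. The sketch below assumes that each 5 × 5 convolution uses stride 1 with padding that preserves the feature-map size, and that the pooling factors 5, 4 and 2 act on the frequency axis while the time axis is left at its full resolution; neither assumption is stated explicitly in the text. The clip length of 256 frames and the helper name `conv_stack_shape` are hypothetical.

```python
def conv_stack_shape(n_mels, n_frames, pools, channels=64):
    """Track (channels, freq, time) through the conv + max-pool blocks.

    Assumes size-preserving convolutions, so only pooling changes the shape.
    """
    freq, time = n_mels, n_frames
    for pool_freq, pool_time in pools:
        freq //= pool_freq
        time //= pool_time
    return channels, freq, time

# 40 mel coefficients pooled by 5, then 4, then 2 along frequency.
shape = conv_stack_shape(n_mels=40, n_frames=256, pools=[(5, 1), (4, 1), (2, 1)])

# After Reshape/Permute, each time step carries channels * freq features.
gru_input_size = shape[0] * shape[1]
```

Under these assumptions the frequency axis collapses from 40 to 1 (40 / 5 / 4 / 2), so the GRU receives one 64-dimensional vector per frame and the model can emit one prediction per frame.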

Then, the output of the sixth layer is fed as input to the attention layer and multiplied by the three attention weight matrices of that layer to obtain the three attention matrices query Q, key K and value V; the attention weights, that is, the relevance between the current position of each output and the other positions of the sequence, are obtained through a series of operations. The attention weights are trained iteratively until the loss function is minimized, at which point the model is optimal. In addition, a self-attention mechanism model with adaptive width is adopted, so that the width is adjusted through each training iteration until it is optimal. Wherein:

2) self-attention mechanism model

The feature sequence read from an audio file and processed is multiplied by each of the corresponding attention weight matrices to obtain the attention input matrices Q, K and V, whose dimension is the output dimension of the attention mechanism. The attention scores are then computed from these matrices for each time position, using a preset parameter, and the final output is obtained from the attention scores and V.

3) adaptive-width self-attention mechanism model

The attention width is treated as a training parameter, placed in the model and learned together with the other parameters, so that the attention width is selected adaptively. In the implementation, a mask function is introduced. The function maps a distance to [0,1] and is non-increasing; it is parameterized by the set maximum attention width and by a slope that represents the decay of the attention width. The attention score is then computed with this mask applied.

The adaptive-width self-attention mechanism sacrifices some sequence information, but it saves computation time, filters out interfering information, improves computational efficiency, and improves the effectiveness and reliability of the method for detecting abnormal sound events on urban roads.
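One common way to realize such a mask is the soft ramp used in adaptive attention span models, m(x) = min(max((R + z - x) / R, 0), 1), where x is the distance to the current position, z is the learnable width and R sets the length of the ramp. The sketch below uses that formulation as an assumption; it is not claimed to be the exact formula of the invention, and the names `soft_mask`, `masked_attention_weights`, `z` and `ramp` are hypothetical. Measuring the distance symmetrically with `abs(t - r)` is also an assumption.

```python
import math

def soft_mask(distance, z, ramp):
    """Non-increasing map from distance to [0, 1]: 1 inside the width z,
    then a linear decay to 0 over `ramp` further positions."""
    return min(max((ramp + z - distance) / ramp, 0.0), 1.0)

def masked_attention_weights(scores, t, z, ramp):
    """Softmax over similarity scores for position t, re-weighted by the mask.

    scores[r] is the raw similarity between position t and position r.
    """
    peak = max(scores)  # subtract the maximum for numerical stability
    weighted = [
        soft_mask(abs(t - r), z, ramp) * math.exp(s - peak)
        for r, s in enumerate(scores)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]

# Six positions with equal raw scores, current position t = 0, width 2, ramp 2.
weights = masked_attention_weights([0.0] * 6, t=0, z=2.0, ramp=2.0)
```

Positions beyond `z + ramp` receive exactly zero weight, so they can be skipped during computation; because the mask is piecewise linear in `z`, the width can be learned by gradient descent together with the other parameters.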

Lightweighting: the trained adaptive-width self-attention mechanism detection model uses low-precision (16-bit) floating-point numbers in place of high-precision (32-bit) floating-point numbers for storage and prediction. The general form of lightweighting is a quantization that relates the number before quantization to the number after quantization through a quantization factor and a zero point. The zero point is the quantized value that corresponds to 0 in the original value range; because the weights and inputs contain many zeros (for example from padding or from ReLU), the real number 0 must be represented exactly during quantization.

The quantization factor determines the error between the quantized model and the original model, so its selection is important. So that the values lie within the specified bit representation range (e.g., 16 bits) after quantization, the quantization factor is selected according to the maximum and minimum of the object before quantization.
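Both ideas in this passage can be illustrated briefly. The first part of the sketch stores weights as 16-bit floats using the half-precision format of Python's standard `struct` module. The second part uses the widely used affine quantization form r = S (q - Z) with S = (r_max - r_min) / (q_max - q_min), which is assumed here because the formulas of the invention are not reproduced in the text; the function names and the unsigned 16-bit target range are hypothetical choices for the example.

```python
import struct

def to_float16_bytes(weights):
    """Serialize weights as little-endian IEEE 754 half-precision floats."""
    return struct.pack(f"<{len(weights)}e", *weights)

def from_float16_bytes(blob):
    """Read back half-precision floats (2 bytes per value)."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))

def affine_params(r_min, r_max, q_min, q_max):
    """Scale S and zero point Z of r = S * (q - Z), with 0 kept in range."""
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, q_min, q_max):
    q = int(round(r / scale)) + zero_point
    return min(max(q, q_min), q_max)  # clamp to the representable range

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

# Half-precision storage: 4 values take 8 bytes instead of 16.
blob = to_float16_bytes([0.0, 0.5, -1.25, 3.0])
restored = from_float16_bytes(blob)

# Affine quantization of the range [-1, 3] onto unsigned 16-bit integers.
scale, zero_point = affine_params(-1.0, 3.0, 0, 65535)
zero_roundtrip = dequantize(quantize(0.0, scale, zero_point, 0, 65535), scale, zero_point)
error = abs(dequantize(quantize(1.234, scale, zero_point, 0, 65535), scale, zero_point) - 1.234)
```

Because the zero point is an integer, the real value 0 maps to an exact quantized value and back to exactly 0, which is the property the text requires for padded and ReLU-zeroed entries.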

4) Lightweight model of the adaptive-width self-attention mechanism

The training data set, namely the log-mel cepstral coefficients of the synthetic audio, is fed into the adaptive-width self-attention mechanism lightweight model built from the adaptive-width self-attention mechanism model; the initial values of the weights of all layers in the model are assigned randomly by PyTorch. The model produces an output matrix of size C × T, where C is the total number of event classes and T is the total number of frames. To compute the loss of the true positive prediction labels, the two corresponding matrices are multiplied element by element to obtain the output used in the loss, and finally the binary cross-entropy loss function is computed.

The gradient is back-propagated using the Adam gradient descent method with the learning rate set to 0.001, the weight parameters are updated, and training is iterated until the loss reaches its minimum; the model parameters are then stored to obtain the adaptive-width self-attention mechanism detection model.
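The loss computation can be illustrated with a plain-Python sketch of element-wise masking followed by binary cross-entropy averaged over all class and frame entries. Which matrix is used as the mask is not recoverable from the text, so the `mask` argument below is purely hypothetical, as are the function names; in practice the same quantities would be computed with the tensor operations of the training framework.

```python
import math

def apply_mask(pred, mask):
    """Element-by-element product of two matrices of equal shape."""
    return [[p * m for p, m in zip(p_row, m_row)] for p_row, m_row in zip(pred, mask)]

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy over every (class, frame) entry."""
    total, count = 0.0, 0
    for p_row, y_row in zip(pred, target):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count

# Two classes, two frames: sigmoid outputs and 0/1 frame labels.
pred = [[0.9, 0.1], [0.5, 0.5]]
target = [[1, 0], [1, 0]]
loss = binary_cross_entropy(pred, target)

# Hypothetical mask that keeps only the first class.
masked = apply_mask(pred, [[1, 1], [0, 0]])
```

A confident correct prediction (0.9 for a label of 1) contributes about 0.105 to the sum, while an undecided prediction (0.5) contributes about 0.693, so minimizing the mean pushes the frame-level probabilities toward the labels.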

The detection of the audio to be predicted with the trained detection model in step (4) is as follows: the audio to be detected, whose labels are unknown, is preprocessed and its features are extracted in the same way as for the adaptive-width self-attention mechanism model; the features are fed into the trained lightweight model of the adaptive-width self-attention mechanism to obtain the neural network probability output, which is stored;

the optimal decision threshold is searched using the f1-score as the criterion. According to the decision threshold, the prediction labels are obtained by binarization; the start and end time frame nodes of each sound event are determined from the label prediction output matrix, and the cosine similarity between the frames adjacent to these frame nodes is computed in the neural network probability output matrix; if the similarity is greater than 0.5, the frame is extended, that is, the time stamp in the label matrix is extended;

finally, the prediction matrix after label extension is obtained, the recognition result is obtained, and the prediction is complete.
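The binarization and boundary extension can be sketched as follows. This is one possible reading of the step, under stated assumptions: each frame is represented by its column of class probabilities, the cosine similarity is taken between a boundary frame and its neighbor outside the event, and the extension stops after at most `max_extend` frames. The names `binarize`, `extend_events` and `max_extend` are hypothetical.

```python
import math

def binarize(prob, threshold):
    return [[1 if p >= threshold else 0 for p in row] for row in prob]

def column(matrix, t):
    return [row[t] for row in matrix]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def extend_events(prob, labels, sim_threshold=0.5, max_extend=2):
    """Grow each event outward while neighboring probability frames stay similar."""
    n_frames = len(labels[0])
    out = [row[:] for row in labels]
    for c, row in enumerate(labels):
        for t in range(n_frames):
            if row[t] != 1:
                continue
            is_onset = t == 0 or row[t - 1] == 0
            is_offset = t == n_frames - 1 or row[t + 1] == 0
            for is_edge, step in ((is_onset, -1), (is_offset, 1)):
                if not is_edge:
                    continue
                cur = t
                for _ in range(max_extend):
                    nxt = cur + step
                    if not 0 <= nxt < n_frames:
                        break
                    if cosine(column(prob, cur), column(prob, nxt)) <= sim_threshold:
                        break
                    out[c][nxt] = 1  # extend the time stamp by one frame
                    cur = nxt
    return out

# Two classes, five frames of network probabilities.
prob = [[0.1, 0.4, 0.9, 0.8, 0.1],
        [0.9, 0.1, 0.1, 0.1, 0.9]]
labels = binarize(prob, 0.5)
extended = extend_events(prob, labels)
```

In this example the onset of the first class moves one frame earlier, because frame 1 resembles frame 2 in the probability output, while the other boundaries stay in place because their neighboring frames are dominated by the other class.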

The invention has the following advantages: it provides a method for rapidly and accurately detecting the start and stop times and the event classes of sound events in audio containing various abnormal sound events. The method builds on the patent 'a sound event labeling and identifying method adopting double Token labels' (patent application number: 202110465526.7). A compressed detection model is obtained through lightweight training of the adaptive-width self-attention mechanism model; the audio to be detected is input to the detection model, and the classes of the abnormal sound events are obtained in real time with the trained detection model. The method compensates for the CNN's inability to carry time information, avoids to a certain extent the vanishing gradients and long-term dependence that the RNN may cause, exploits the parallel computation of the self-attention mechanism, and adaptively selects the width of the self-attention. With the help of lightweighting, it attends adaptively to input information of a certain width while saving memory space, and it improves the recognition effect and the prediction speed. It is used to improve the accuracy and effectiveness of urban road abnormal sound event detection and is expected to be deployed on mobile terminals or other portable equipment.

Drawings

FIG. 1 is a flowchart of the detection method in an embodiment of the present invention;

FIG. 2 is a diagram illustrating a self-attention mechanism model according to an embodiment of the present invention;

FIG. 3 is a model diagram of a width-controlled self-attention mechanism according to an embodiment of the present invention;

FIG. 4 is a diagram of mask functions according to an embodiment of the present invention.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

Fig. 1 shows the main detection flowchart of the lightweight abnormal sound event detection method based on the adaptive-width self-attention mechanism. The method classifies and identifies a segment of audio containing a plurality of sounds, and the detection process is as follows:

The whole flowchart is mainly divided into 7 modules. First, a synthetic sound data set is constructed; second, the data set is preprocessed and features are extracted; then, the features are fed into the built adaptive-width self-attention mechanism model for iterative network training until the model is optimal, and the model is compressed with a lightweight method when the model parameters are saved; finally, the model is saved. The audio to be predicted undergoes the same preprocessing and feature extraction as the data set and is then fed into the saved detection model to obtain a prediction result. The specific method comprises the following steps:

(1) Labeling and classifying audio each containing a plurality of abnormal sound events:

In abnormal event detection based on machine learning, an acoustic event detection model is generally trained on a large amount of labeled audio data, and the resulting model is used to predict the event classes and the corresponding times of occurrence in an unknown segment of audio. Because a machine learning model learns only from its inputs and outputs, the size and realism of the data set tend to have a large influence on the quality of the detection result.

Overlapping sound events themselves comprise multiple simultaneous events and are very difficult to label, so audio containing a plurality of abnormal sound events is generally obtained by synthesizing several single sound events. First, a certain number of labeled single-sound-event audios are taken, each class of sound event is numbered, and the total number of sound event classes is obtained; each synthetic audio is labeled with the classes of sound events used in its synthesis. Finally, a label file is exported; the file records the names of the audio files and, under each audio file name, the classes of sound events that occur.

The artificially synthesized abnormal sound event data set has a clear label for each sound event class, but it fits abnormal sound events occurring in a real environment poorly, so some deviation in prediction is difficult to avoid. Abnormal sound events in a real environment are usually labeled by human listening, which is highly subjective, time-consuming and labor-intensive; moreover, multiple classes of events occur simultaneously in abnormal sound events, which multiplies the labeling cost. To enlarge the data set and prevent overfitting, the audio data are expanded to three times the original amount with the following data augmentation methods: random audio scaling, time masking, frequency masking, adding random noise, and audio sample mixing (mixup). The augmentation enriches the data set and helps prevent overfitting.
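Two of the listed augmentations may be sketched as follows, purely as an illustration. The function names, the mixing coefficient `lam` and the flat per-clip label vectors are hypothetical simplifications; in the method described here the labels are frame-level matrices, which would be mixed entry by entry in the same way.

```python
def mixup(wave_a, labels_a, wave_b, labels_b, lam):
    """Blend two equally long clips and their labels with weight lam in [0, 1]."""
    def mix(xs, ys):
        return [lam * x + (1.0 - lam) * y for x, y in zip(xs, ys)]
    return mix(wave_a, wave_b), mix(labels_a, labels_b)

def time_mask(features, start, width):
    """Zero out `width` consecutive frames of a frame-by-coefficient feature list."""
    return [
        [0.0] * len(frame) if start <= i < start + width else list(frame)
        for i, frame in enumerate(features)
    ]

# Mix a clip of class 0 with a clip of class 1, weighting the first at 0.75.
mixed_wave, mixed_labels = mixup([1.0, 0.0], [1, 0], [0.0, 1.0], [0, 1], lam=0.75)

# Mask the middle frame of a three-frame feature sequence.
masked_features = time_mask([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], start=1, width=1)
```

Frequency masking follows the same pattern with the roles of frames and coefficients exchanged. In practice `lam`, `start` and `width` would be drawn at random for each training example.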

(2) Audio data preprocessing and feature extraction:

Since the audio may originate from a variety of different devices, it is first resampled at a sampling rate of 16 kHz. The audio waveform is then standardized: the waveform data are mapped uniformly to a common range by max normalization, in which the waveform data are the data obtained by reading an audio file (.wav) with a Python wav package. Then, a short-time Fourier transform (STFT) is used to extract 40-dimensional log-mel frequency cepstral coefficients for all audio; the specific parameters are the sampling rate, the frame length and the frame overlap.

The extracted 40-dimensional log-mel frequency cepstral coefficients are normalized with z-score as follows: for audio of a given duration in seconds containing a given number of frames, the log-mel cepstral coefficients obtained from the STFT are mapped by subtracting their mean and dividing by their standard deviation, so that the mapped log-mel frequency cepstral coefficients have mean 0 and variance 1.

For the audio labels: labels in units of seconds are converted into labels in units of frames. Each label file is transformed into a frame-level audio label encoding matrix whose entries are 0 or 1; the number of columns of the matrix is the total number of frames, and the number of rows is the total number of sound event classes. For an audio containing several classes of sound events, a zero matrix is first created whose number of columns is the number of frames of the audio of the given duration in seconds and whose number of rows is the number of sound event classes.

If the label indicates that a sound event of a given class occurs during a time interval, the row vector corresponding to that class is taken, the duration is converted into a length in frames, and the corresponding zero entries are set to 1.

(3) Building a pre-training model:

A self-attention mechanism model network is built with a Python framework as follows: the model consists of three convolutions, three poolings, one gated recurrent unit (GRU) layer, one adaptive-width self-attention layer and one time-distributed layer. The first layer is the input layer, which takes the 40-dimensional log-mel cepstral coefficients; the second layer is a 2-D convolution layer (5 × 5 convolution kernel, 64 channels) followed by 2-D max pooling (5 × 1); the third layer is a 2-D convolution layer (5 × 5 convolution kernel, 64 channels) followed by 2-D max pooling (4 × 1); the fourth layer is a 2-D convolution layer (5 × 5 convolution kernel, 64 channels) followed by 2-D max pooling (2 × 1); the fifth layer uses Reshape and Permute to reduce the dimension of and transpose the output of the fourth layer; the sixth layer is a GRU with 64 neurons activated by tanh; the seventh layer is the adaptive-width self-attention mechanism, of the additive attention type, activated by sigmoid; the eighth layer is a time-distributed (TimeDistributed) Dense layer activated by sigmoid, whose number of units equals the number of sound event classes. Each convolutional layer uses a convolution kernel stride of 1, is followed by a normalization layer and 'ReLU' activation, and has dropout added to improve the generalization capability of the model.

The attention layer of the invention is as follows: the output of the sixth layer is fed as input to the attention layer and multiplied by the three attention weight matrices of that layer to obtain the three attention matrices query Q, key K and value V; the attention weights, that is, the relevance between the current position of each output and the other positions of the sequence, are obtained through a series of operations. The attention weights are trained iteratively until the loss function is minimized, at which point the model is optimal. Because sequence information far from the current position is relatively unimportant, a width-controlled self-attention mechanism is adopted; and to avoid losing important information, a self-attention mechanism with adaptive width is used, so that the width is adjusted through each training iteration until it is optimal. In this way, the computation time can be reduced to a certain extent without losing important information, and the working efficiency of the model is improved.

1) Self-attention mechanism model: as shown in FIG. 2, the input is the feature sequence read from an audio file and processed. It is multiplied by each of the corresponding attention weight matrices to obtain the attention input matrices Q, K and V, whose dimension is the output dimension of the attention mechanism. The attention scores are then computed from these matrices for each time position, using a preset parameter, and the final output is obtained from the attention scores and V.

2) Width-controlled self-attention mechanism model: although the traditional self-attention mechanism model overcomes some defects of CNN-based and RNN-based models in some SED tasks, when the audio is too long the traditional self-attention mechanism must attend to all of the sequence information, so the computation time is relatively long, while information at moments far from the current moment is relatively unimportant and may even contain interference. Hence the self-attention mechanism that controls the attention width, shown in FIG. 3: under this attention model, attention is restricted to the sequence information within the attention width around the current position, so that not all sequence information needs to be attended to and the efficiency of model computation is improved. At the cost of sacrificing a small amount of sequence information, the computation time is greatly reduced and the computational performance of the system is greatly improved.

3) Adaptive-width self-attention mechanism model: in order to save computer resources, filter out interfering information and select the attention width better, the invention provides an adaptive-width self-attention mechanism. In the adaptive-width self-attention mechanism, the attention width is also treated as a training parameter, placed in the model and learned together with the other parameters, so that the attention width is selected adaptively. In the implementation, a mask function as shown in FIG. 4 is introduced. The function maps a distance to [0,1] and is non-increasing; it is parameterized by the set maximum attention width and by a slope that represents the decay of the attention width. The attention score is then computed with this mask applied.

Lightweighting: in the lightweighting process, the trained adaptive-width self-attention mechanism detection model replaces high-precision (32-bit) floating-point numbers with low-precision (16-bit) floating-point numbers during storage and prediction. This saves about half of the storage space and reduces the computation latency during prediction, thereby saving storage resources and accelerating computation. The general form of lightweighting is a quantization that relates the number before quantization to the number after quantization through a quantization factor and a zero point; since the weights and inputs contain many zeros (for example from padding or from ReLU), the real number 0 must be represented exactly during quantization.

The quantization factor is selected according to the maximum and minimum of the object before quantization.

4) The lightweight model of the adaptive-width self-attention mechanism is as follows: the training data set, namely the log-mel cepstral coefficients of the synthetic audio, is fed into the adaptive-width self-attention mechanism lightweight model built in module 3); the initial values of the weights of all layers in the model are assigned randomly by PyTorch, and the model output is obtained. The two corresponding matrices are multiplied element by element to obtain the output used in the loss, and finally the binary cross-entropy loss function is computed.

(4) Detecting the audio to be predicted with the trained detection model: the audio to be detected, whose labels are unknown, undergoes the same preprocessing and feature extraction as for the self-attention mechanism model and is then fed into the trained lightweight model of the adaptive-width self-attention mechanism; the neural network probability output is obtained and stored. The optimal decision threshold is searched using the f1-score as the criterion, and the prediction labels are obtained by binarization according to the decision threshold. The specific implementation is as follows: the start and end time frame nodes of each sound event are determined from the label prediction output matrix, and the cosine similarity between the frames adjacent to these frame nodes is computed in the neural network probability output matrix. If the similarity is greater than 0.5, the frame is extended, that is, the time stamp in the label matrix is extended. Finally, the prediction matrix after label extension is obtained, the recognition result is obtained, and the prediction is complete.

Note: the extension to the left and to the right must not exceed the preset value of the hyperparameter collar, which is generally between 50 ms and 250 ms.
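The threshold search mentioned above can be sketched as a simple grid search. The sketch assumes a frame-level micro-averaged F1 computed on a labeled validation set and a small list of candidate thresholds; the text does not state which F1 variant or which candidate grid is used, and the names `f1_score` and `best_threshold` are hypothetical.

```python
def f1_score(pred, truth):
    """Frame-level micro F1 over all (class, frame) entries of 0/1 matrices."""
    tp = fp = fn = 0
    for p_row, y_row in zip(pred, truth):
        for p, y in zip(p_row, y_row):
            if p == 1 and y == 1:
                tp += 1
            elif p == 1 and y == 0:
                fp += 1
            elif p == 0 and y == 1:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(prob, truth, candidates):
    """Return the candidate threshold whose binarized output has the highest F1."""
    def score(th):
        pred = [[1 if p >= th else 0 for p in row] for row in prob]
        return f1_score(pred, truth)
    return max(candidates, key=score)

# Validation probabilities and reference labels for two classes, three frames.
prob = [[0.2, 0.6, 0.9], [0.4, 0.3, 0.7]]
truth = [[0, 1, 1], [0, 0, 1]]
threshold = best_threshold(prob, truth, candidates=[0.3, 0.5, 0.8])
```

With these toy values the threshold 0.3 admits two false positives and 0.8 misses two true events, so 0.5 is selected; the chosen threshold is then reused unchanged when unlabeled audio is binarized.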

Claims

1. A lightweight abnormal sound event detection method based on an adaptive-width self-attention mechanism, characterized in that the method comprises the following steps:

2. The lightweight abnormal sound event detection method based on the adaptive-width self-attention mechanism according to claim 1, characterized in that the labeling and classification in step (1) are as follows: first, a certain number of labeled single-sound-event audios are taken, each class of sound event is numbered, and the total number of sound event classes is obtained; each synthetic audio is labeled with the classes of sound events used in its synthesis.

3. The lightweight abnormal sound event detection method based on the adaptive width adaptive attention mechanism according to claim 1, wherein the preprocessing and feature extraction in step (2) and step (4) are as follows: the audio is first resampled at a sampling rate of 16 kHz, the audio waveform is then normalized, and the waveform data are uniformly mapped onto

using max normalization:

where:

the audio file (.wav) is read with a Python wav package to obtain the data; 40-dimensional logarithmic Mel-frequency cepstral coefficients are extracted from all audio by the short-time Fourier transform (STFT), with the following specific parameters:

the sampling rate is

with frame-overlapped sampling

suppose that, over

seconds, the logarithmic Mel cepstral coefficients obtained from the STFT are

where

is the number of frames in

seconds,

,

the mapped logarithmic Mel-frequency cepstral coefficients are then obtained:

which have a mean of 0 and a variance of 1.
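The two normalization steps of this claim can be sketched as follows, in plain Python and purely for illustration. The target interval [-1, 1] for the waveform and the use of a single global mean and variance for the features are assumptions, since the corresponding formulas are not reproduced in this text; the function names are hypothetical.

```python
import math

def peak_normalize(waveform):
    """Max normalization: divide by the largest absolute sample value.

    Assumption: the target interval is [-1, 1].
    """
    peak = max(abs(s) for s in waveform)
    return [s / peak for s in waveform] if peak else list(waveform)

def standardize(features):
    """Map a feature matrix (list of rows) to mean 0 and variance 1.

    Assumption: one global mean and variance over all coefficients.
    """
    values = [v for row in features for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # avoid division by zero for constant input
    return [[(v - mean) / std for v in row] for row in features]
```

Per-coefficient statistics (one mean and variance per Mel band) would be an equally plausible reading; only the aggregation in `standardize` would change.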

4. The lightweight abnormal sound event detection method based on the adaptive width adaptive attention mechanism according to claim 1, wherein the audio labels of step (1) are processed as follows: the labels in units of seconds are converted into labels in units of frames, and each label file is transformed into a frame-level audio label encoding matrix whose entries are 0 and 1; the number of columns of the matrix is the total number of frames, and the number of rows of the matrix is

, the total number of acoustic event categories; specifically:

first, a zero matrix with

rows and

columns is generated, where the audio duration is

seconds and the number of rows

is the number of acoustic event categories; when the label indicates that the

-th class of sound event occurs at time

, first take that which corresponds to the

-th class of sound event
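The frame-level label encoding described in this claim can be illustrated with the following minimal sketch. It assumes that each annotation is a triple of class index, onset in seconds, and offset in seconds, and that onsets are rounded down and offsets rounded up to whole frames; the function name and the `frames_per_second` parameter are hypothetical.

```python
import math

def encode_labels(events, n_classes, duration_s, frames_per_second):
    """Build a 0/1 matrix with one row per class and one column per frame."""
    n_frames = int(round(duration_s * frames_per_second))
    matrix = [[0] * n_frames for _ in range(n_classes)]  # zero matrix first
    for class_idx, onset_s, offset_s in events:
        # Assumed rounding: floor for the onset, ceil for the offset.
        first = max(0, math.floor(onset_s * frames_per_second))
        last = min(n_frames, math.ceil(offset_s * frames_per_second))
        for t in range(first, last):
            matrix[class_idx][t] = 1
    return matrix
```

Overlapping events of different classes simply set ones in different rows of the same columns, which is what allows overlapped sound events to be represented.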

5. The lightweight abnormal sound event detection method based on the adaptive width adaptive attention mechanism according to claim 1, wherein the lightweight detection model of the adaptive-width self-attention mechanism in step (3) is built as follows:

1) pre-training the model:

a Python framework is used to build the self-attention mechanism model network as follows: the model consists of three convolution layers, three pooling layers, one gated recurrent unit (GRU) layer, one adaptive-width self-attention layer, and one time-distributed layer; wherein: the first layer is the input layer, which takes the 40-dimensional logarithmic Mel cepstral coefficients as input; the second layer is a 2-D convolution (5 x 5 kernel, 64 channels) followed by 2-D max pooling (5 x 1); the third layer is a 2-D convolution (5 x 5 kernel, 64 channels) followed by 2-D max pooling (4 x 1); the fourth layer is a 2-D convolution (5 x 5 kernel, 64 channels) followed by 2-D max pooling (2 x 1); the fifth layer uses Reshape and Permute to reduce the dimensionality of, and transpose, the output of the fourth layer; the sixth layer is a GRU with 64 neurons, activated by tanh; the seventh layer is the adaptive-width self-attention mechanism, of the additive attention type, activated by sigmoid; the eighth layer is a time-distributed Dense layer with sigmoid activation, whose number of units equals the number of acoustic event categories; each convolutional layer uses a kernel of size

with a stride of 1; each convolutional layer is followed by a normalization layer and ReLU activation, and dropout is added;
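The three pooling sizes multiply to 5 x 4 x 2 = 40, which matches the 40-dimensional input. A short worked calculation, given here as a sketch, shows how the frequency axis would shrink if the pooling acts along frequency only and the convolutions preserve size (stride 1 with "same" padding); both of these are assumptions not stated in the text, and the function name is hypothetical.

```python
def frequency_bins_after_pooling(n_mels, pool_sizes):
    """Track the size of the frequency axis through successive poolings.

    Assumption: convolutions keep the size unchanged, so only pooling matters.
    """
    bins = n_mels
    trace = [bins]
    for p in pool_sizes:
        bins //= p
        trace.append(bins)
    return trace

print(frequency_bins_after_pooling(40, [5, 4, 2]))  # [40, 8, 2, 1]
```

Under these assumptions the frequency axis collapses to a single bin, so after the reshape each time frame carries 64 features (one per channel) into the GRU while the number of frames is unchanged.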

the output of the sixth layer is then taken as the input of the attention layer and multiplied by the three attention weight matrices of that layer to obtain the three attention matrices query Q, key K, and value V; the attention weights, i.e. the correlation between the current position of each output and the other positions of the sequence, are obtained from these by computation; the attention weights are trained iteratively until the loss function is minimized, at which point the model is considered optimal; in addition, a self-attention mechanism model with adaptive width is used, in which the width is adjusted at each training iteration so as to reach the optimum; wherein:

2) self-attention mechanism model

The feature sequence read from an audio file and processed is

; it is multiplied by the corresponding attention weight matrices, respectively,

to obtain the attention input matrices

:

where

,

is the dimension of the attention mechanism output; the following operation is then carried out:

where

is preset and

represents a time position; the final output is:
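A self-attention step of the kind described in this item can be sketched as follows in plain Python. The scaled dot-product form with a softmax over time positions is an assumption: the formulas of this item are not reproduced in this text, and the text elsewhere names an additive attention type for the layer. The names `w_q`, `w_k`, and `w_v` are hypothetical stand-ins for the three attention weight matrices.

```python
import math

def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """x has one row per time frame; returns (output, attention weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])  # dimension of the attention output
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # similarity of every position with every other
    # Assumed form: scale by sqrt(d), then normalize each row with softmax.
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights
```

Each row of `weights` expresses how strongly one time position attends to every other position, which is the correlation referred to in the text.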

3) self-adaptive width self-attention mechanism model

The attention width is also treated as a trainable parameter and is learned jointly with the rest of the model, so that the attention width is selected adaptively; in the implementation, a mask function is introduced

:

This function maps a distance

to [0,1]; it is a non-increasing function parameterized by

, where

is the maximum attention width that is set;

the adaptive-width self-attention mechanism sacrifices some sequence information but in return saves computation time, filters out interfering information, and improves computational efficiency, thereby improving the effectiveness and reliability of the urban road abnormal sound event detection method;
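A mask function with the stated properties (non-increasing, mapping a distance to [0,1], controlled by a width parameter) can be sketched as follows. The piecewise-linear form used here is the soft mask known from the adaptive attention span literature and is an assumption, since the formula is not reproduced in this text; `ramp` is a hypothetical softness hyperparameter and `z` stands for the learned width.

```python
def soft_mask(distance, z, ramp):
    """Non-increasing map from a distance to [0, 1].

    Assumed form: 1 up to distance z, then a linear ramp down to 0 at z + ramp.
    """
    return min(max((ramp + z - distance) / ramp, 0.0), 1.0)

def apply_mask(weights, position, z, ramp):
    """Mask one row of attention weights around `position` and renormalize."""
    masked = [w * soft_mask(abs(position - t), z, ramp)
              for t, w in enumerate(weights)]
    total = sum(masked)
    return [m / total for m in masked] if total else masked
```

Positions farther than `z + ramp` from the current frame receive zero weight, which is how a narrower width trades sequence information for less computation.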

where

and

are the number before quantization and the number after quantization, respectively,

is the quantization factor,

and the quantization factor

is chosen according to the following formula:

where

and

are the maximum and minimum values of the object before quantization, respectively;
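Weight quantization with a factor derived from the maximum and minimum values can be illustrated by the following sketch. It assumes uniform quantization to `n_bits` bits with the scale taken as (max - min) / (2^n_bits - 1); the exact formula of the text is not reproduced here, so this choice, the bit width, and the function names are all assumptions.

```python
def quantize(values, n_bits=8):
    """Map floats to integers in [0, 2**n_bits - 1] (assumed uniform scheme)."""
    lo, hi = min(values), max(values)
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant input
    return [round((v - lo) / scale) for v in values], scale, lo

def dequantize(quantized, scale, lo):
    """Approximate reconstruction of the original floats."""
    return [lo + scale * q for q in quantized]
```

Storing the integers together with `scale` and `lo` instead of 32-bit floats is what reduces the size of the saved weights; the reconstruction error per value is at most half of `scale`.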

4) lightweight model of adaptive width self-attention mechanism

where C is the total number of event categories and T is the total number of frames; the loss between the true labels and the predicted labels is calculated as follows:

Multiplying

and

element by element gives the output

Finally, the following binary cross-entropy loss function is calculated:
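A binary cross-entropy over a label matrix with C rows (classes) and T columns (frames) can be sketched as follows. Averaging over all C x T entries and clipping the probabilities with `eps` are assumptions, as the loss formula is not reproduced in this text; the function name is hypothetical.

```python
import math

def binary_cross_entropy(truth, probs, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total, count = 0.0, 0
    for t_row, p_row in zip(truth, probs):
        for y, p in zip(t_row, p_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count
```

Because each class is scored independently per frame, several classes can be active in the same frame, which suits overlapped sound events.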

6. The lightweight abnormal sound event detection method based on the adaptive width adaptive attention mechanism according to claim 1, wherein the prediction with the trained detection model in step (4) comprises: subjecting the audio to be detected, whose labels are unknown, to the same preprocessing and feature extraction as was used for the adaptive-width self-attention mechanism model, feeding it into the trained lightweight model of the adaptive-width self-attention mechanism to obtain the probability output of the neural network, and storing that output;

searching for the optimal decision threshold, using the F1-score as the criterion

According to the decision threshold