CN114386518A - Lightweight abnormal sound event detection method based on adaptive width adaptive attention mechanism - Google Patents

Lightweight abnormal sound event detection method based on adaptive width adaptive attention mechanism

Info

Publication number
CN114386518A
Authority
CN
China
Prior art keywords
adaptive
self
model
audio
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210039999.5A
Other languages
Chinese (zh)
Inventor
安正义
姚雨
宋浠瑜
王玫
仇洪冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN202210039999.5A
Publication of CN114386518A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a lightweight abnormal sound event detection method based on an adaptive-width self-attention mechanism. First, labeled audio is signal-processed to obtain a time-frequency feature representation of the audio. Second, the labeled feature representation (usually a vector or matrix) is used as the input to an adaptive-width self-attention mechanism model with a defined loss function and randomly initialized attention weights; the loss value between the model output and the label is computed, the attention weights are updated with the back-propagation algorithm, and the three attention weight matrices are iteratively updated until the loss function reaches a minimum or an acceptable value. Finally, the weight parameters are stored with a lightweight method, and the stored model is used to predict an unlabeled piece of audio, so that abnormal sound events are predicted quickly and accurately.

Description

Lightweight abnormal sound event detection method based on adaptive width adaptive attention mechanism
Technical Field
The invention relates to a method for detecting abnormal overlapping sound events with an adaptive-width self-attention mechanism, and in particular to a lightweight abnormal sound event detection method based on an adaptive-width self-attention mechanism.
Background
Abnormal sound event detection belongs to the research field of acoustic event recognition and has important application value in smart homes, urban road anomaly detection, fault detection, and other smart-city scenarios.
A sound event detection task mainly comprises signal processing and a machine learning model; common signal-processing steps include noise addition, the fast Fourier transform (FFT), and Mel-frequency cepstral coefficient (MFCC) feature extraction.
Some existing methods construct learning models with neural networks to detect sound events, including schemes that achieve sound event detection with convolutional neural network (CNN) models, or with network structures based on recurrent neural networks (RNN), trained alone or in combination. However, the real-time performance of such models is poor, the training is difficult to optimize, and prediction is slow. The reasons lie in the nature of the network structures. The CNN alleviates the problem of overly large input data, but it cannot learn the temporal features of events in a sound recording, and its large number of parameters easily causes overfitting. The RNN carries the corresponding temporal information when extracting features, but it brings the long-term dependence problem and the vanishing-gradient phenomenon; later variants solved some of these problems, but because the RNN is a sequential structure it is hard to parallelize, so its computation speed has always been relatively slow. Compared with the CNN and the RNN, the self-attention mechanism has lower complexity and fewer parameters, so it demands less computing capacity and saves computer resources; it also removes the RNN's obstacle to parallel operation, giving it a great advantage in computation speed; and it can attend better to longer sequences, although it sometimes attends to too much information. In 2020, Dr. Helen's team at Queen Mary University of London compared the recognition performance of the same machine learning model before and after using a width-controlled self-attention mechanism; the results show that the model using the width-controlled self-attention mechanism improved the F-score by 8.45% and reduced the error rate by 0.15, proving that controlling the attention width is effective for acoustic event detection. Moreover, with the development of deep neural networks, network structures have become richer and model storage ever larger, so most detection models remain at the theoretical stage and are difficult to deploy on portable devices such as mobile terminals.
Disclosure of Invention
The invention provides a lightweight abnormal sound event detection method based on an adaptive-width self-attention mechanism, aimed at the problems in current sound event detection (SED) that the prediction model is large, the prediction speed is low, computing resources are relied upon excessively, and real-time prediction is difficult. The method can classify and detect the abnormal sound events contained in a piece of audio; under the same signal-processing pipeline its recognition is better than a CRNN-based method's, it avoids the RNN's slow and non-parallel operation, and by applying the idea of lightweighting it compresses the model size at a small cost in recognition performance, so that the model can be deployed on a mobile terminal or other portable device.
The technical scheme for realizing the purpose of the invention is as follows:
the light-weight abnormal sound event detection method based on the adaptive width adaptive attention mechanism comprises the following steps:
(1) constructing a synthetic audio data set, and labeling and classifying each audio containing a plurality of abnormal sound events;
(2) preprocessing and feature extraction are carried out on the data set, and the data set is sent into a built self-adaptive width self-attention mechanism model for network iterative training until the model is optimal;
(3) compressing the model by using a lightweight method to obtain a lightweight detection model of a self-adaptive width self-attention mechanism;
(4) and preprocessing the audio to be detected, extracting features, and sending the audio to be detected into a compressed detection model for detection to obtain a prediction result.
The labeling and classification in step (1) are as follows: first, a certain number of labeled single-sound-event audio clips are taken and each class of sound event is numbered, giving the total number of sound event classes $C$. Then some sound events are randomly synthesized to obtain synthetic audio, and the audio is labeled $y = (y_1, y_2, \dots, y_C)$, where $y_i$ indicates that the $i$-th class of sound event is used in the synthesis. Finally a label file is exported; the file records the names of the audio files and, under each audio file name, every class of sound event that occurs.
The preprocessing and feature extraction in steps (2) and (4) are as follows: the audio is resampled at a sampling rate of 16 kHz, the audio waveform is then standardized, and the waveform data are uniformly mapped onto $[-1, 1]$ by max normalization:

$$\hat{x} = \frac{x}{\max(|x|)},$$

where $x$ is the data obtained by reading the audio file (.wav) with a Python wave-reading package. The short-time Fourier transform (STFT) is used to extract 40-dimensional log-Mel cepstral coefficients for all audio, with the frame length and frame overlap set for the 16 kHz sampling rate. The 40-dimensional log-Mel cepstral coefficients are then normalized with the z-score:

suppose the log-Mel cepstrum obtained from the STFT of a $T$-second audio clip is $X \in \mathbb{R}^{N \times 40}$, where $N$ is the number of frames in $T$ seconds, with mean $\mu$ and standard deviation $\sigma$. The mapped log-Mel cepstral coefficients are

$$\hat{X} = \frac{X - \mu}{\sigma},$$

which have mean 0 and variance 1.
The audio labels are handled as follows: the labels in units of seconds are converted into labels in units of frames, and each label file is transformed to obtain an audio label encoding matrix in units of frames. The label encoding consists of 0 and 1 elements; the number of columns of the matrix is the total number of frames, and the number of rows is $C$, the total number of sound event classes. The process of converting the audio label encoding matrix of an abnormal sound clip containing $C$ classes of sound events from units of seconds to units of frames is as follows:

first, generate a zero matrix of $C$ rows and $N$ columns, where the audio duration is $T$ seconds and the number of rows $C$ is the number of sound event classes. When the label says the $i$-th class of sound event occurs during $[t_{\text{start}}, t_{\text{end}}]$, take the row vector corresponding to the $i$-th class of sound event, convert the duration into a length in units of frames, and set the corresponding zero entries to 1.

Finally, the vectors of the individual sound events are combined into a matrix, which is the audio label encoding matrix of the synthetic abnormal sound clip.
The lightweight detection model of the adaptive-width self-attention mechanism in step (3) is built as follows:
1) Pre-training the model:
A self-attention mechanism model network is built with a Python deep-learning framework as follows: the model consists of 3 convolutions, 3 poolings, one gated recurrent unit (GRU), one adaptive-width self-attention layer, and one time-distributed layer. The first layer is the input layer, which takes the 40-dimensional log-Mel cepstral coefficients. The second layer is a 2-D convolution (5×5 kernel, 64 input channels) followed by 2-D max pooling (5×1); the third layer is a 2-D convolution (5×5 kernel, 64 channels) followed by 2-D max pooling (4×1); the fourth layer is a 2-D convolution (5×5 kernel, 64 channels) followed by 2-D max pooling (2×1). The fifth layer applies reshape and permute to reduce the dimension of and transpose the output of the fourth layer. The sixth layer is a GRU with 64 neurons activated with tanh. The seventh layer is the adaptive-width self-attention mechanism, of the additive-attention type, activated with sigmoid. The eighth layer is a time-distributed (TimeDistributed) dense layer activated with sigmoid whose width is the number of sound event classes. Each convolution layer uses a 5×5 kernel with stride 1, is followed by a normalization layer, is activated with the ReLU function, and uses dropout to improve the generalization ability of the model.
Then the output of the sixth layer is fed into the attention layer as its input and multiplied by the attention layer's three attention weight matrices to obtain the three attention matrices: query Q, key K, and value V; a series of operations yields the attention weights, i.e., the correlation between each output's current position and the other positions of the sequence. The attention weights are trained iteratively until the loss function is minimized, i.e., the model is optimal. Meanwhile, the self-attention mechanism model with adaptive width is adopted, and the width is adjusted at each training iteration until it is optimal. Specifically:
2) Self-attention mechanism model
The feature sequence $X$ read from an audio file and processed is multiplied by the corresponding attention weight matrices $W^{Q}, W^{K}, W^{V}$ to obtain the attention input matrices

$$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V},$$

where $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d \times d_{k}}$ and $d_{k}$ is the dimension of the attention mechanism output. The following operations are then carried out:

$$e_{ts} = \frac{q_{t}^{\top} k_{s}}{\sqrt{d_{k}}}, \qquad a_{ts} = \frac{\exp(e_{ts})}{\sum_{r} \exp(e_{tr})},$$

where $\sqrt{d_{k}}$ is a preset scaling factor and $t$ denotes a time position. The final output is

$$o_{t} = \sum_{s} a_{ts} v_{s}.$$
3) Adaptive-width self-attention mechanism model
The attention width is used as a training parameter, put into the model, and trained and learned together with it, so that the attention width is selected adaptively. In implementation, a mask function $m_{z}$ is introduced that maps a distance $x$ onto $[0, 1]$; it is a non-increasing function parameterized by a trainable width $z$:

$$m_{z}(x) = \min\left[\max\left[\frac{1}{R}(R + z - x),\, 0\right],\, 1\right],$$

where $z \in [0, S]$, $S$ is the set maximum attention width and $R$ is a slope representing the decay of the attention width. The attention score then becomes

$$a_{ts} = \frac{m_{z}(t - s)\, \exp\left(q_{t}^{\top} k_{s} / \sqrt{d_{k}}\right)}{\sum_{r} m_{z}(t - r)\, \exp\left(q_{t}^{\top} k_{r} / \sqrt{d_{k}}\right)}.$$

To a certain extent the adaptive-width self-attention mechanism sacrifices some sequence information, but it saves operation time, filters interference information, improves operation efficiency, and improves the effectiveness and reliability of the urban road abnormal sound event detection method.
Lightweighting: during storage and prediction, the trained adaptive-width self-attention detection model uses low-precision (16-bit) floating-point numbers in place of high-precision (32-bit) floating-point numbers. The general form of the lightweighting (quantization) is

$$q = \operatorname{round}\left(\frac{x}{S}\right) + Z,$$

where $x$ and $q$ are the numbers before and after quantization respectively, $S$ is the quantization factor, and $Z$ is the quantized value of 0 in the original value domain; because there are many 0s in the weights and inputs (e.g., from padding or the ReLU), the real number 0 must be represented exactly when quantizing.

The quantization factor $S$ determines the error between the quantized model and the original model, so its selection is important. So that the quantized values fall within the specified bit representation range (e.g., 16 bits), the quantization factor $S$ is chosen by the following formula:

$$S = \frac{x_{\max} - x_{\min}}{q_{\max} - q_{\min}},$$

where $x_{\max}$ and $x_{\min}$ are respectively the maximum and minimum of the object before quantization, and $q_{\max}$ and $q_{\min}$ bound the specified bit representation range.
4) Lightweight model of the adaptive-width self-attention mechanism
The training data set, i.e., the log-Mel cepstral coefficients of the synthetic audio, is fed into the adaptive-width self-attention lightweight model built from the adaptive-width self-attention mechanism model; the initial values of the weights of all layers in the model are given randomly by PyTorch, and the output $\hat{Y} \in \mathbb{R}^{C \times T}$ is obtained, where $C$ is the total number of event classes and $T$ is the total number of frames. The true-positive prediction-label weight $M$ is computed, $M$ and $\hat{Y}$ are multiplied element by element to obtain the output $\tilde{Y}$, and finally the following binary cross-entropy loss function is computed:

$$L = -\frac{1}{CT} \sum_{c=1}^{C} \sum_{t=1}^{T} \left[ y_{ct} \log \tilde{y}_{ct} + (1 - y_{ct}) \log\left(1 - \tilde{y}_{ct}\right) \right].$$

The gradient is back-propagated with the Adam gradient descent method at a learning rate of 0.001 to update the weight parameters; training is iterated until the loss reaches its minimum, and the model parameters are saved to obtain the adaptive-width self-attention detection model.
The method of using the trained detection model in step (4) to detect the audio to be predicted is as follows: the audio to be detected, whose labels are unknown, is preprocessed and its features extracted in the same way as for the adaptive-width self-attention model, and it is fed into the trained lightweight adaptive-width self-attention model to obtain the neural network probability output, which is stored.

The optimal decision threshold $\theta$ is searched for with the f1-score as the criterion, and the prediction result in the label is obtained by binarization according to the decision threshold $\theta$: the time frame nodes at which a sound event starts and ends are determined from the label prediction output matrix, and the cosine similarity of the adjacent frames of the frame nodes is computed on the corresponding neural network probability output matrix; if the similarity is greater than 0.5, the frame is extended, i.e., the time stamps in the label matrix are extended.

Finally, the prediction matrix after label extension is obtained, the recognition result is obtained, and the prediction is completed.
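As an illustration of this threshold search, the following is a minimal sketch that grid-searches $\theta$ on a validation set with the frame-level f1-score as the criterion; the grid step and the use of scikit-learn's f1_score are assumptions for illustration, not details specified by the invention.

```python
import numpy as np
from sklearn.metrics import f1_score

def search_threshold(probs: np.ndarray, labels: np.ndarray) -> float:
    """Grid-search the decision threshold theta that maximizes the frame-level f1-score.
    probs, labels: (num_classes, frames) arrays from a validation set."""
    best_theta, best_f1 = 0.5, -1.0
    for theta in np.arange(0.05, 0.95, 0.05):
        pred = (probs > theta).astype(int)           # binarize at the candidate threshold
        score = f1_score(labels.ravel(), pred.ravel())
        if score > best_f1:
            best_theta, best_f1 = theta, score
    return float(best_theta)
```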
The advantages of the invention are as follows: the invention provides a method for quickly and accurately detecting the start and end times and the event classes of the sound events in a piece of audio containing multiple abnormal sound events. Building on the patent "A sound event labeling and identifying method adopting double Token labels" (application number 202110465526.7), a compressed detection model is obtained by lightweight training of the adaptive-width self-attention mechanism model; the audio to be detected is input to the detection model, and the trained detection model yields the classes of the abnormal sound events in real time. The method compensates for the CNN's inability to carry temporal information, avoids to a certain extent the vanishing gradients and long-term dependence that the RNN may cause, exploits the self-attention mechanism's capability for parallel operation, and adaptively selects the attention width; with the help of lightweighting it attends adaptively to input information of a certain width while saving memory space, improves the recognition effect and the prediction speed, improves the accuracy and effectiveness of urban road abnormal sound event detection, and is expected to be deployed on mobile terminals or other portable devices.
Drawings
FIG. 1 is a flowchart of the detection method in an embodiment of the present invention;
FIG. 2 is a diagram illustrating a self-attention mechanism model according to an embodiment of the present invention;
FIG. 3 is a model diagram of an adaptive width adaptive attention mechanism according to an embodiment of the present invention;
FIG. 4 is a diagram of mask functions according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
Fig. 1 shows the main detection flowchart of the lightweight abnormal sound event detection method based on the adaptive-width self-attention mechanism. The method classifies and recognizes a piece of audio containing multiple sounds; the detection process is as follows:
The overall flowchart is mainly divided into 7 modules: first, a synthetic sound data set is constructed; second, the data set is preprocessed and features are extracted; the features are then fed into the built adaptive-width self-attention mechanism model for iterative network training until the model is optimal, and the model is compressed with a lightweight method when the model parameters are saved; finally, the model is saved. The audio to be predicted, after the same preprocessing and feature extraction as the data set, is fed into the saved detection model to obtain the prediction result. The specific method is as follows:
(1) tagging and classifying audio each containing a plurality of anomalous acoustic events:
In abnormal sound event detection based on machine learning, an acoustic event detection model is generally trained from a large amount of labeled audio data; the resulting model is used to predict an unknown piece of audio, i.e., the event classes and the corresponding occurrence times. Since machine learning sees only inputs and outputs, the size and realism of the data set often have a large influence on the quality of the detection result.
Overlapping sound events themselves comprise multiple simultaneous events and are very difficult to label, so we generally prefer to synthesize audio containing multiple abnormal sound events from several single sound events. First, a certain number of labeled single-sound-event audio clips are taken and each class of sound event is numbered, giving the total number of sound event classes $C$. Then some sound events are randomly synthesized to obtain synthetic audio, and the audio is labeled $y = (y_1, y_2, \dots, y_C)$, where $y_i$ indicates that the $i$-th class of sound event is used in the synthesis. Finally a label file is exported; the file records the names of the audio files and, under each audio file name, every class of sound event that occurs.
An artificially synthesized abnormal sound event data set has a clear label for each sound event class, but it obviously fits poorly to abnormal sound events occurring in a real environment, and deviation in prediction is hard to avoid. Labeling abnormal sound events in a real environment usually relies on listening with human ears, which is highly subjective as well as time- and labor-consuming, and since multiple classes of events occur simultaneously in abnormal sound recordings, the labeling cost is multiplied. To increase the amount of data and prevent overfitting, the following data expansion methods are used to expand the audio data to three times the original: random audio scaling, time masking, frequency masking, adding random noise, and audio sample mixing (mixup). The expanded data enriches the data set and prevents overfitting.
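By way of illustration, the following is a minimal sketch of two of these expansion operations, adding random noise and mixup, applied to waveform arrays; the function names, the 20 dB signal-to-noise ratio, and the Beta mixing coefficient are assumptions for illustration rather than values fixed by the invention.

```python
import numpy as np

def add_random_noise(wave: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add white Gaussian noise at an assumed 20 dB signal-to-noise ratio."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def mixup(wave_a: np.ndarray, wave_b: np.ndarray, alpha: float = 0.2):
    """Mix two equal-length waveforms; their label matrices are mixed with the same lam."""
    lam = np.random.beta(alpha, alpha)
    return lam * wave_a + (1.0 - lam) * wave_b, lam
```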
(2) Audio data preprocessing and feature extraction:
Since the audio may originate from a variety of different devices, the audio is resampled at a sampling rate of 16 kHz. The audio waveforms are then standardized and the waveform data uniformly mapped onto $[-1, 1]$ by max normalization:

$$\hat{x} = \frac{x}{\max(|x|)},$$

where $x$ is the data obtained by reading the audio file (.wav) with a Python wave-reading package. Then the short-time Fourier transform (STFT) is used to extract 40-dimensional log-Mel cepstral coefficients for all audio, with the frame length and frame overlap set for the 16 kHz sampling rate. The 40-dimensional log-Mel cepstral coefficients are then normalized with the z-score as follows: suppose the log-Mel cepstrum obtained from the STFT of a $T$-second audio clip is $X \in \mathbb{R}^{N \times 40}$, where $N$ is the number of frames in $T$ seconds, with mean $\mu$ and standard deviation $\sigma$. The mapped log-Mel cepstral coefficients are

$$\hat{X} = \frac{X - \mu}{\sigma},$$

which have mean 0 and variance 1.
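A minimal sketch of this preprocessing pipeline is given below, using librosa; the STFT frame length and hop length shown are assumed placeholder values, since the invention's exact frame settings are not reproduced here.

```python
import numpy as np
import librosa

def extract_features(wav_path: str, sr: int = 16000, n_mels: int = 40,
                     n_fft: int = 1024, hop_length: int = 512) -> np.ndarray:
    """Resample to 16 kHz, max-normalize, and return z-scored 40-band log-Mel features."""
    wave, _ = librosa.load(wav_path, sr=sr)            # resample to 16 kHz
    wave = wave / np.max(np.abs(wave))                 # max normalization onto [-1, 1]
    mel = librosa.feature.melspectrogram(y=wave, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                 # log-Mel representation, shape (40, N)
    return (log_mel - log_mel.mean()) / log_mel.std()  # z-score: mean 0, variance 1
```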
For the audio labels: the labels in units of seconds are converted into labels in units of frames, and each label file is transformed to obtain an audio label encoding matrix in units of frames. The label encoding consists of 0 and 1 elements; the number of columns of the matrix is the total number of frames, and the number of rows is $C$, the total number of sound event classes. The process of converting the audio label encoding matrix of an abnormal sound clip containing $C$ classes of sound events from units of seconds to units of frames is as follows:

first, generate a zero matrix of $C$ rows and $N$ columns, where the audio duration is $T$ seconds and the number of rows $C$ is the number of sound event classes.

If the label says the $i$-th class of sound event occurs during $[t_{\text{start}}, t_{\text{end}}]$, take the row vector corresponding to the $i$-th class of sound event, convert the duration into a length in units of frames, and set the corresponding zero entries to 1.

Finally, the vectors of the individual sound events are combined into a matrix, which is the audio label encoding matrix of the synthetic abnormal sound clip.
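This second-to-frame label conversion can be sketched as follows; the (class, onset, offset) tuple format of the parsed label file is an assumption for illustration.

```python
import numpy as np

def encode_labels(events, num_classes: int, total_frames: int,
                  duration_s: float) -> np.ndarray:
    """Convert second-based labels into a (num_classes x total_frames) 0/1 matrix.
    `events` is a list of (class_index, onset_s, offset_s) tuples."""
    frames_per_second = total_frames / duration_s
    label = np.zeros((num_classes, total_frames), dtype=np.float32)
    for cls, onset, offset in events:
        start = int(round(onset * frames_per_second))   # seconds -> frame index
        end = int(round(offset * frames_per_second))
        label[cls, start:end] = 1.0                     # mark the active frames of this class
    return label
```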
(3) Building a pre-training model:
A self-attention mechanism model network is built with a Python deep-learning framework as follows: the model consists of 3 convolutions, 3 poolings, one gated recurrent unit (GRU), one adaptive-width self-attention layer, and one time-distributed layer. The first layer takes the 40-dimensional log-Mel cepstral coefficients as input. The second layer is a 2-D convolution (5×5 kernel, 64 input channels) followed by 2-D max pooling (5×1); the third layer is a 2-D convolution (5×5 kernel, 64 channels) followed by 2-D max pooling (4×1); the fourth layer is a 2-D convolution (5×5 kernel, 64 channels) followed by 2-D max pooling (2×1). The fifth layer applies reshape and permute to reduce the dimension of and transpose the output of the fourth layer. The sixth layer is a GRU with 64 neurons activated with tanh. The seventh layer is the adaptive-width self-attention mechanism, of the additive-attention type, activated with sigmoid. The eighth layer is a time-distributed (TimeDistributed) dense layer activated with sigmoid whose width is the number of sound event classes. Each convolution layer uses a 5×5 kernel with stride 1, is followed by a normalization layer, is activated with the ReLU function, and adds dropout to improve the generalization ability of the model.
The attention layer is specifically arranged in the invention as follows: the output of the sixth layer is fed into the attention layer as its input and multiplied by the attention layer's three attention weight matrices to obtain the three attention matrices: query Q, key K, and value V; a series of operations yields the attention weights, i.e., the correlation between each output's current position and the other positions of the sequence. The attention weights are trained iteratively until the loss function is minimized, i.e., the model is optimal. Because sequence information far from the current position is of relatively low importance, a width-controlled self-attention scheme is adopted; at the same time, to avoid losing important information, the self-attention mechanism with adaptive width adjusts the width at each training iteration until it is optimal. Thus, without losing important information, the computation time can be reduced to a certain extent and the working efficiency of the model improved.
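A minimal PyTorch sketch of the described stack follows, assuming the input is shaped (batch, 1, 40 Mel bands, frames); the padding, dropout rate, and the use of a standard attention layer as a stand-in for the adaptive-width attention (sketched separately below) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SEDNet(nn.Module):
    """3x (Conv2d -> BatchNorm -> ReLU -> Dropout -> MaxPool) -> GRU -> attention -> per-frame sigmoid."""
    def __init__(self, num_classes: int, channels: int = 64):
        super().__init__()
        def block(c_in, pool):
            return nn.Sequential(
                nn.Conv2d(c_in, channels, kernel_size=5, stride=1, padding=2),
                nn.BatchNorm2d(channels), nn.ReLU(), nn.Dropout(0.2),
                nn.MaxPool2d(pool))
        # pooling shrinks the 40-band Mel axis (40 -> 8 -> 2 -> 1) and keeps the time axis
        self.cnn = nn.Sequential(block(1, (5, 1)), block(channels, (4, 1)),
                                 block(channels, (2, 1)))
        self.gru = nn.GRU(channels, 64, batch_first=True)
        # stand-in for the adaptive-width self-attention layer sketched later
        self.attn = nn.MultiheadAttention(embed_dim=64, num_heads=1, batch_first=True)
        self.head = nn.Linear(64, num_classes)      # applied per frame, i.e. TimeDistributed

    def forward(self, x):                            # x: (batch, 1, 40, frames)
        z = self.cnn(x)                              # (batch, 64, 1, frames)
        z = z.squeeze(2).permute(0, 2, 1)            # reshape/permute -> (batch, frames, 64)
        z, _ = self.gru(z)
        z, _ = self.attn(z, z, z)
        return torch.sigmoid(self.head(z))           # (batch, frames, num_classes)
```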
1) Self-attention mechanism model: as shown in FIG. 2, the input is the feature sequence $X$ read from an audio file and processed, which is multiplied by the corresponding attention weight matrices $W^{Q}, W^{K}, W^{V}$ to obtain the attention input matrices

$$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V},$$

where $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d \times d_{k}}$ and $d_{k}$ is the dimension of the attention mechanism output. Then the following operations are carried out:

$$e_{ts} = \frac{q_{t}^{\top} k_{s}}{\sqrt{d_{k}}}, \qquad a_{ts} = \frac{\exp(e_{ts})}{\sum_{r} \exp(e_{tr})},$$

where $\sqrt{d_{k}}$ is a preset scaling factor and $t$ denotes a time position. The final output is

$$o_{t} = \sum_{s} a_{ts} v_{s}.$$
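A direct sketch of these operations, with the projection matrices as explicit arguments, can read as follows; the names and shapes are illustrative.

```python
import math
import torch

def self_attention(x: torch.Tensor, w_q: torch.Tensor,
                   w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product self-attention. x: (frames, d); w_q/w_k/w_v: (d, d_k)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                        # attention inputs Q, K, V
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # e_ts = q_t . k_s / sqrt(d_k)
    a = torch.softmax(scores, dim=-1)                          # attention weights a_ts
    return a @ v                                               # o_t = sum_s a_ts * v_s
```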
2) Adaptive-width self-attention mechanism model: although the traditional self-attention mechanism model overcomes some defects of CNN- and RNN-based SED schemes, when the audio is too long the traditional self-attention mechanism must attend to all of the sequence information, so the operation time is relatively long, and information at moments far from the current moment is of relatively little importance and may even contain interference. Hence the self-attention mechanism with controlled attention width shown in FIG. 3: under this attention model, the attention range around the current position $t$ is restricted according to the attention width, all of the sequence information no longer needs to be attended to, and the efficiency of model operation improves. With only a small sacrifice of sequence information, the operation time is greatly reduced and the operation performance of the system much improved.

Adaptive-width self-attention mechanism model: to save computer resources and filter interference information, the attention width should be chosen well, so the invention provides an adaptive-width self-attention mechanism: the attention width is used as a training parameter, put into the model, and trained and learned together with it, so that the attention width is selected adaptively. In implementation, the mask function $m_{z}$ shown in FIG. 4 is introduced; it maps a distance $x$ onto $[0, 1]$ and is a non-increasing function parameterized by a trainable width $z$:

$$m_{z}(x) = \min\left[\max\left[\frac{1}{R}(R + z - x),\, 0\right],\, 1\right],$$

where $z \in [0, S]$, $S$ is the set maximum attention width and $R$ is a slope representing the decay of the attention width. The attention score then becomes

$$a_{ts} = \frac{m_{z}(t - s)\, \exp\left(q_{t}^{\top} k_{s} / \sqrt{d_{k}}\right)}{\sum_{r} m_{z}(t - r)\, \exp\left(q_{t}^{\top} k_{r} / \sqrt{d_{k}}\right)}.$$

To a certain extent the adaptive-width self-attention mechanism sacrifices some sequence information, but it saves operation time, filters interference information, improves operation efficiency, and improves the effectiveness and reliability of the urban road abnormal sound event detection method.
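The soft mask and the masked attention score can be sketched as a PyTorch module; initializing the trainable width $z$ at half the maximum span is an assumption for illustration.

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """m_z(x) = clamp((R + z - x) / R, 0, 1) with a trainable span z in [0, S]."""
    def __init__(self, max_span: float, ramp: float):
        super().__init__()
        self.max_span, self.ramp = max_span, ramp
        self.z = nn.Parameter(torch.tensor(0.5 * max_span))   # trainable attention width

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        """scores: (frames, frames) raw logits e_ts; returns masked, renormalized weights."""
        t = torch.arange(scores.size(0), dtype=scores.dtype, device=scores.device)
        dist = (t.unsqueeze(1) - t.unsqueeze(0)).abs()        # |t - s| distance matrix
        z = self.z.clamp(0.0, self.max_span)
        mask = ((self.ramp + z - dist) / self.ramp).clamp(0.0, 1.0)   # m_z(t - s)
        # subtract the row max before exp for numerical stability
        w = mask * torch.exp(scores - scores.max(dim=-1, keepdim=True).values)
        return w / (w.sum(dim=-1, keepdim=True) + 1e-9)       # renormalize per query
```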
Lightweighting: in the lightweighting process, the trained adaptive-width self-attention detection model replaces high-precision (32-bit) floating-point numbers with low-precision (16-bit) floating-point numbers during storage and prediction. This saves about half of the storage space and reduces computation latency during prediction, thereby saving storage resources and accelerating computation. The general form of the lightweighting (quantization) is

$$q = \operatorname{round}\left(\frac{x}{S}\right) + Z,$$

where $x$ and $q$ are the numbers before and after quantization respectively, $S$ is the quantization factor, and $Z$ is the quantized value of 0 in the original value domain; because there are many 0s in the weights and inputs (e.g., from padding or the ReLU), the real number 0 must be represented exactly when quantizing.

The quantization factor $S$ determines the error between the quantized model and the original model, so its selection is important. So that the quantized values fall within the specified bit representation range (e.g., 16 bits), the quantization factor $S$ is chosen by

$$S = \frac{x_{\max} - x_{\min}}{q_{\max} - q_{\min}},$$

where $x_{\max}$ and $x_{\min}$ are respectively the maximum and minimum of the object before quantization, and $q_{\max}$ and $q_{\min}$ bound the specified bit representation range.
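A sketch of both lightweighting variants named here follows: computing the quantization factor and zero point for an integer bit range, and the simpler 16-bit float cast; the helper names are illustrative.

```python
import numpy as np

def quantization_params(x: np.ndarray, q_min: int, q_max: int):
    """S = (x_max - x_min) / (q_max - q_min); Z is the quantized value of real 0."""
    scale = (x.max() - x.min()) / (q_max - q_min)
    zero_point = int(round(q_min - x.min() / scale))   # so that real 0 maps exactly
    return scale, zero_point

def quantize(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """q = round(x / S) + Z."""
    return np.round(x / scale) + zero_point

# The 16-bit floating-point variant used by the method is simply a cast:
# weights_fp16 = weights_fp32.astype(np.float16)      # roughly halves storage
```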
4) The lightweight model of the adaptive-width self-attention mechanism is as follows: the training data set, i.e., the log-Mel cepstral coefficients of the synthetic audio, is fed into the adaptive-width self-attention lightweight model built in module 3); the initial values of the weights of all layers in the model are given randomly by PyTorch, and the output $\hat{Y} \in \mathbb{R}^{C \times T}$ is obtained, where $C$ is the total number of event classes and $T$ is the total number of frames. The true-positive prediction-label weight $M$ is computed, $M$ and $\hat{Y}$ are multiplied element by element to obtain the output $\tilde{Y}$, and finally the following binary cross-entropy loss function is computed:

$$L = -\frac{1}{CT} \sum_{c=1}^{C} \sum_{t=1}^{T} \left[ y_{ct} \log \tilde{y}_{ct} + (1 - y_{ct}) \log\left(1 - \tilde{y}_{ct}\right) \right].$$

The gradient is back-propagated with the Adam gradient descent method at a learning rate of 0.001 to update the weight parameters; training is iterated until the loss reaches its minimum, and the model parameters are saved to obtain the adaptive-width self-attention detection model.
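A minimal training-step sketch under the stated settings (Adam, learning rate 0.001, binary cross-entropy, saving the parameters) follows; the model object, data loader, and epoch count are assumed to be supplied by the surrounding code.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, epochs: int = 100, device: str = "cpu") -> None:
    """Iterative training with Adam (lr 0.001) and binary cross-entropy."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.BCELoss()                    # binary cross-entropy on sigmoid outputs
    model.to(device).train()
    for _ in range(epochs):
        for features, labels in loader:         # labels: (batch, frames, num_classes)
            features, labels = features.to(device), labels.to(device)
            loss = criterion(model(features), labels)
            optimizer.zero_grad()
            loss.backward()                     # gradient back-propagation
            optimizer.step()                    # update the weight parameters
    torch.save(model.state_dict(), "sed_model.pt")   # save the model parameters
```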
(4) Detecting the audio to be predicted with the trained detection model: the audio to be detected, whose labels are unknown, is preprocessed and its features extracted in the same way as for the self-attention model, and it is then fed into the trained lightweight adaptive-width self-attention model to obtain the neural network probability output, which is stored. The optimal decision threshold $\theta$ is searched for with the f1-score as the criterion, and the prediction result in the label is obtained by binarization according to the decision threshold $\theta$. The specific implementation is as follows: the time frame nodes at which a sound event starts and ends are determined from the label prediction output matrix, and the cosine similarity of the adjacent frames of the frame nodes is computed on the corresponding neural network probability output matrix. If the similarity is greater than 0.5, the frame is extended, i.e., the time stamps in the label matrix are extended. Finally, the prediction matrix after label extension is obtained, the recognition result is obtained, and the prediction is completed.
Note: the extension on the left and right sides must not exceed a preset hyperparameter value, generally in the range of 50 ms to 250 ms.
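A sketch of this post-processing follows: binarize at $\theta$, then extend each event boundary while adjacent probability-output frames stay cosine-similar, with the extension capped per the note above; the cap of 10 frames is an assumed illustrative value.

```python
import numpy as np

def extend_events(probs: np.ndarray, theta: float, sim_thresh: float = 0.5,
                  max_extend: int = 10) -> np.ndarray:
    """probs: (num_classes, frames) network probabilities; returns the extended 0/1 matrix."""
    label = (probs > theta).astype(np.int8)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    for c in range(label.shape[0]):
        onsets = np.where(np.diff(label[c], prepend=0) == 1)[0]
        offsets = np.where(np.diff(label[c], append=0) == -1)[0]
        for t in onsets:                        # grow each event start leftwards
            for _ in range(max_extend):
                if t > 0 and cos(probs[:, t - 1], probs[:, t]) > sim_thresh:
                    t -= 1
                    label[c, t] = 1
                else:
                    break
        for t in offsets:                       # grow each event end rightwards
            for _ in range(max_extend):
                if t < label.shape[1] - 1 and cos(probs[:, t + 1], probs[:, t]) > sim_thresh:
                    t += 1
                    label[c, t] = 1
                else:
                    break
    return label
```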

Claims (6)

1. A lightweight abnormal sound event detection method based on an adaptive-width self-attention mechanism, characterized by comprising the following steps:
(1) constructing a synthetic audio data set, and labeling and classifying each audio clip containing a plurality of abnormal sound events;
(2) preprocessing the data set and extracting features, then feeding them into the built adaptive-width self-attention mechanism model for iterative network training until the model is optimal;
(3) compressing the model with a lightweight method to obtain the lightweight detection model of the adaptive-width self-attention mechanism;
(4) preprocessing the audio to be detected, extracting its features, and feeding them into the compressed detection model for detection to obtain the prediction result.
2. The lightweight abnormal sound event detection method based on the adaptive-width self-attention mechanism according to claim 1, characterized in that the labeling and classification in step (1) are as follows: first, a certain number of labeled single-sound-event audio clips are taken and each class of sound event is numbered, giving the total number of sound event classes $C$; then some sound events are randomly synthesized to obtain synthetic audio, and the audio is labeled $y = (y_1, y_2, \dots, y_C)$, where $y_i$ indicates that the $i$-th class of sound event is used in the synthesis; finally a label file is exported, the file recording the names of the audio files and, under each audio file name, every class of sound event that occurs.
3. The lightweight abnormal sound event detection method based on the adaptive-width self-attention mechanism according to claim 1, characterized in that the preprocessing and feature extraction in steps (2) and (4) are as follows: the audio is resampled at a sampling rate of 16 kHz, the audio waveform is then standardized, and the waveform data are uniformly mapped onto $[-1, 1]$ by max normalization:

$$\hat{x} = \frac{x}{\max(|x|)},$$

where $x$ is the data obtained by reading the audio file (.wav) with a Python wave-reading package; the short-time Fourier transform (STFT) is used to extract 40-dimensional log-Mel cepstral coefficients for all audio, with the frame length and frame overlap set for the 16 kHz sampling rate; the 40-dimensional log-Mel cepstral coefficients are then normalized with the z-score:

suppose the log-Mel cepstrum obtained from the STFT of a $T$-second audio clip is $X \in \mathbb{R}^{N \times 40}$, where $N$ is the number of frames in $T$ seconds, with mean $\mu$ and standard deviation $\sigma$; the mapped log-Mel cepstral coefficients are

$$\hat{X} = \frac{X - \mu}{\sigma},$$

which have mean 0 and variance 1.
4. The lightweight abnormal sound event detection method based on the adaptive-width self-attention mechanism according to claim 1, characterized in that the audio labels in step (1) are handled as follows: the labels in units of seconds are converted into labels in units of frames, and each label file is transformed to obtain an audio label encoding matrix in units of frames; the label encoding consists of 0 and 1 elements, the number of columns of the matrix is the total number of frames, and the number of rows is $C$, the total number of sound event classes; the process of converting the audio label encoding matrix of an abnormal sound clip containing $C$ classes of sound events from units of seconds to units of frames is as follows:

first, generate a zero matrix of $C$ rows and $N$ columns, where the audio duration is $T$ seconds and the number of rows $C$ is the number of sound event classes; when the label says the $i$-th class of sound event occurs during $[t_{\text{start}}, t_{\text{end}}]$, take the row vector corresponding to the $i$-th class of sound event, convert the duration into a length in units of frames, and set the corresponding zero entries to 1;

finally, the vectors of the individual sound events are combined into a matrix, which is the audio label encoding matrix of the synthetic abnormal sound clip.
5. The lightweight abnormal sound event detection method based on the adaptive-width self-attention mechanism according to claim 1, characterized in that the lightweight detection model of the adaptive-width self-attention mechanism in step (3) is built as follows:
1) Pre-training the model:
A self-attention mechanism model network is built with a Python deep-learning framework as follows: the model consists of 3 convolutions, 3 poolings, one gated recurrent unit (GRU), one adaptive-width self-attention layer, and one time-distributed layer, wherein: the first layer is the input layer, which takes the 40-dimensional log-Mel cepstral coefficients; the second layer is a 2-D convolution (5×5 kernel, 64 input channels) followed by 2-D max pooling (5×1); the third layer is a 2-D convolution (5×5 kernel, 64 channels) followed by 2-D max pooling (4×1); the fourth layer is a 2-D convolution (5×5 kernel, 64 channels) followed by 2-D max pooling (2×1); the fifth layer applies reshape and permute to reduce the dimension of and transpose the output of the fourth layer; the sixth layer is a GRU with 64 neurons activated with tanh; the seventh layer is the adaptive-width self-attention mechanism, of the additive-attention type, activated with sigmoid; the eighth layer is a time-distributed (TimeDistributed) dense layer activated with sigmoid whose width is the number of sound event classes; each convolution layer uses a 5×5 kernel with stride 1, is followed by a normalization layer, is activated with the ReLU function, and adds dropout;
then the output of the sixth layer is fed into the attention layer as its input and multiplied by the attention layer's three attention weight matrices to obtain the three attention matrices: query Q, key K, and value V; the operations below yield the attention weights, i.e., the correlation between each output's current position and the other positions of the sequence; the attention weights are trained iteratively until the loss function is minimized, i.e., the model is optimal; meanwhile, the self-attention mechanism model with adaptive width is adopted, and the width is adjusted at each training iteration until it is optimal; wherein:
2) Self-attention mechanism model
The feature sequence $X$ read from an audio file and processed is multiplied by the corresponding attention weight matrices $W^{Q}, W^{K}, W^{V}$ to obtain the attention input matrices

$$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V},$$

where $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d \times d_{k}}$ and $d_{k}$ is the dimension of the attention mechanism output; the following operations are then carried out:

$$e_{ts} = \frac{q_{t}^{\top} k_{s}}{\sqrt{d_{k}}}, \qquad a_{ts} = \frac{\exp(e_{ts})}{\sum_{r} \exp(e_{tr})},$$

where $\sqrt{d_{k}}$ is a preset scaling factor and $t$ denotes a time position; the final output is

$$o_{t} = \sum_{s} a_{ts} v_{s};$$

3) Adaptive-width self-attention mechanism model
The attention width is also used as a training parameter, put into the model, and trained and learned together with it, so that the attention width is selected adaptively; in implementation, a mask function $m_{z}$ is introduced that maps a distance $x$ onto $[0, 1]$ and is a non-increasing function parameterized by a trainable width $z$:

$$m_{z}(x) = \min\left[\max\left[\frac{1}{R}(R + z - x),\, 0\right],\, 1\right],$$

where $z \in [0, S]$, $S$ is the set maximum attention width and $R$ is a slope representing the decay of the attention width; the attention score then becomes

$$a_{ts} = \frac{m_{z}(t - s)\, \exp\left(q_{t}^{\top} k_{s} / \sqrt{d_{k}}\right)}{\sum_{r} m_{z}(t - r)\, \exp\left(q_{t}^{\top} k_{r} / \sqrt{d_{k}}\right)};$$

to a certain extent the adaptive-width self-attention mechanism sacrifices some sequence information, but it saves operation time, filters interference information, improves operation efficiency, and improves the effectiveness and reliability of the urban road abnormal sound event detection method;
Lightweighting: during storage and prediction, the trained adaptive-width self-attention detection model uses low-precision (16-bit) floating-point numbers in place of high-precision (32-bit) floating-point numbers; the general form of the lightweighting (quantization) is

$$q = \operatorname{round}\left(\frac{x}{S}\right) + Z,$$

where $x$ and $q$ are the numbers before and after quantization respectively, $S$ is the quantization factor, and $Z$ is the quantized value of 0 in the original value domain; because there are many 0s in the weights and inputs (e.g., from padding or the ReLU), the real number 0 must be represented exactly when quantizing;

the quantization factor $S$ determines the error between the quantized model and the original model, so its selection is important; so that the quantized values fall within the specified bit representation range (e.g., 16 bits), the quantization factor $S$ is chosen by the following formula:

$$S = \frac{x_{\max} - x_{\min}}{q_{\max} - q_{\min}},$$

where $x_{\max}$ and $x_{\min}$ are respectively the maximum and minimum of the object before quantization, and $q_{\max}$ and $q_{\min}$ bound the specified bit representation range;
4) Lightweight model of the adaptive-width self-attention mechanism
The training data set, i.e., the log-Mel cepstral coefficients of the synthetic audio, is fed into the adaptive-width self-attention lightweight model built from the adaptive-width self-attention mechanism model; the initial values of the weights of all layers in the model are given randomly by PyTorch, and the output $\hat{Y} \in \mathbb{R}^{C \times T}$ is obtained, where $C$ is the total number of event classes and $T$ is the total number of frames; the true-positive prediction-label weight $M$ is computed, $M$ and $\hat{Y}$ are multiplied element by element to obtain the output $\tilde{Y}$, and finally the following binary cross-entropy loss function is computed:

$$L = -\frac{1}{CT} \sum_{c=1}^{C} \sum_{t=1}^{T} \left[ y_{ct} \log \tilde{y}_{ct} + (1 - y_{ct}) \log\left(1 - \tilde{y}_{ct}\right) \right];$$

the gradient is back-propagated with the Adam gradient descent method at a learning rate of 0.001 to update the weight parameters, training is iterated until the loss reaches its minimum, and the model parameters are saved to obtain the adaptive-width self-attention detection model.
6. The lightweight abnormal sound event detection method based on the adaptive-width self-attention mechanism according to claim 1, characterized in that the method of using the trained detection model in step (4) to detect the audio to be predicted is as follows: the audio to be detected, whose labels are unknown, is preprocessed and its features extracted in the same way as for the adaptive-width self-attention model, and it is fed into the trained lightweight adaptive-width self-attention model to obtain the neural network probability output, which is stored;

the optimal decision threshold $\theta$ is searched for with the f1-score as the criterion, and the prediction result in the label is obtained by binarization according to the decision threshold $\theta$: the time frame nodes at which a sound event starts and ends are determined from the label prediction output matrix, and the cosine similarity of the adjacent frames of the frame nodes is computed on the corresponding neural network probability output matrix; if the similarity is greater than 0.5, the frame is extended, i.e., the time stamps in the label matrix are extended;

finally, the prediction matrix after label extension is obtained, the recognition result is obtained, and the prediction is completed.
CN202210039999.5A 2022-01-14 2022-01-14 Lightweight abnormal sound event detection method based on adaptive width adaptive attention mechanism Pending CN114386518A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210039999.5A CN114386518A (en) 2022-01-14 2022-01-14 Lightweight abnormal sound event detection method based on adaptive width adaptive attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210039999.5A CN114386518A (en) 2022-01-14 2022-01-14 Lightweight abnormal sound event detection method based on adaptive width adaptive attention mechanism

Publications (1)

Publication Number Publication Date
CN114386518A true CN114386518A (en) 2022-04-22

Family

ID=81202792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210039999.5A Pending CN114386518A (en) 2022-01-14 2022-01-14 Lightweight abnormal sound event detection method based on adaptive width adaptive attention mechanism

Country Status (1)

Country Link
CN (1) CN114386518A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083423A (en) * 2022-07-21 2022-09-20 中国科学院自动化研究所 Data processing method and device for voice identification
CN115083423B (en) * 2022-07-21 2022-11-15 中国科学院自动化研究所 Data processing method and device for voice authentication
CN115096375A (en) * 2022-08-22 2022-09-23 启东亦大通自动化设备有限公司 Carrier roller running state monitoring method and device based on carrier roller carrying trolley detection
CN115096375B (en) * 2022-08-22 2022-11-04 启东亦大通自动化设备有限公司 Carrier roller running state monitoring method and device based on carrier roller carrying trolley detection
CN116152722A (en) * 2023-04-19 2023-05-23 南京邮电大学 Video anomaly detection method based on combination of residual attention block and self-selection learning

Similar Documents

Publication Publication Date Title
Demir et al. A new pyramidal concatenated CNN approach for environmental sound classification
CN114386518A (en) Lightweight abnormal sound event detection method based on adaptive width adaptive attention mechanism
Su et al. Performance analysis of multiple aggregated acoustic features for environment sound classification
CN112885372B (en) Intelligent diagnosis method, system, terminal and medium for power equipment fault sound
Das et al. Urban sound classification using convolutional neural network and long short term memory based on multiple features
Davis et al. Environmental sound classification using deep convolutional neural networks and data augmentation
CN110310666B (en) Musical instrument identification method and system based on SE convolutional network
CN117095694B (en) Bird song recognition method based on tag hierarchical structure attribute relationship
CN110853630B (en) Lightweight speech recognition method facing edge calculation
CN109308912A (en) Music style recognition methods, device, computer equipment and storage medium
Wang et al. What affects the performance of convolutional neural networks for audio event classification
Colonna et al. Feature subset selection for automatically classifying anuran calls using sensor networks
CN114898773A (en) Synthetic speech detection method based on deep self-attention neural network classifier
CN113591733B (en) Underwater acoustic communication modulation mode classification identification method based on integrated neural network model
CN114913872A (en) Time-frequency double-domain audio classification method and system based on convolutional neural network
CN114999525A (en) Light-weight environment voice recognition method based on neural network
Saddam Wind sounds classification using different audio feature extraction techniques
CN114420151B (en) Speech emotion recognition method based on parallel tensor decomposition convolutional neural network
CN115563500A (en) Power distribution equipment partial discharge mode identification method, device and system based on data enhancement technology
Sattigeri et al. A scalable feature learning and tag prediction framework for natural environment sounds
CN114187923A (en) Convolutional neural network audio identification method based on one-dimensional attention mechanism
CN113936667A (en) Bird song recognition model training method, recognition method and storage medium
CN114038479A (en) Bird song recognition and classification method and device for coping with low signal-to-noise ratio and storage medium
Nasiri et al. Audiomask: Robust sound event detection using mask r-cnn and frame-level classifier
CN112767968A (en) Voice objective evaluation optimal feature group screening method based on discriminative complementary information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination