CN113140226B - Sound event marking and identifying method adopting double Token labels - Google Patents

Sound event marking and identifying method adopting double Token labels

Info

Publication number
CN113140226B
Authority
CN
China
Prior art keywords
audio
layer
label
matrix
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110465526.7A
Other languages
Chinese (zh)
Other versions
CN113140226A (en)
Inventor
姚雨
宋浠瑜
王玫
仇洪冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202110465526.7A priority Critical patent/CN113140226B/en
Publication of CN113140226A publication Critical patent/CN113140226A/en
Application granted granted Critical
Publication of CN113140226B publication Critical patent/CN113140226B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a sound event labeling and identification method adopting double Token labels, comprising a sound event labeling process and an identification process. The labeling process comprises the following steps: 1-1) defining the audio label format; 1-2) completing the labeling of all audio in the data set. The identification process comprises the following steps: 2-1) constructing an audio data set; 2-2) preprocessing the audio data and extracting features; 2-3) augmenting the audio data; 2-4) building a convolutional recurrent neural network; 2-5) training the convolutional recurrent neural network to learn a detection model; 2-6) identifying the audio to be detected with the trained detection model. The method widens the range of recognizable sound events at lower cost while maintaining accuracy, and enables accurate sound event detection and monitoring in everyday living environments, thereby better serving the construction of smart cities.

Description

Sound event marking and identifying method adopting double Token labels
Technical Field
The invention relates to the field of sound event detection, and in particular to a sound event labeling and identification method adopting double Token labels.
Background
In everyday living environments, the various types of sound carry a great deal of information about the surroundings and the physical events occurring in them. Research on Sound Event Detection (SED) helps people better perceive the acoustic scene they are in, identify the types of sound sources and obtain timestamps of events of interest, and therefore has important practical significance. It can be applied in smart-city and smart-home scenarios such as urban environmental noise monitoring, safety monitoring in public places, and monitoring the behavior of the elderly and children indoors; for example, gunshots, screams and the sound of burning objects can be automatically detected and identified in acoustic surveillance applications. SED is thus of great value for human-computer interaction, auditory perception and meeting the many detection needs of society.
The sound event detection task relies on signal processing methods and machine learning models. A sound event detection model is usually trained on a large amount of labeled audio data; the resulting model can then make predictions on a segment of audio with unknown labels, typically predicting the sound event categories and the corresponding timestamps. Specifically, a signal processing method is applied to labeled audio to obtain a time-frequency feature representation; this labeled representation is fed as input to a machine learning model, which defines a loss function and randomly initializes its weight parameters. The loss between the forward-propagated output and the labels is computed, and the weights are then updated by back-propagation; after repeated iterations until the loss becomes small, the weight parameters constitute the sound event recognition model, which can make predictions on unlabeled audio and thereby achieve sound event detection. This process of iteration and weight updating is the machine learning training process. Since machine learning follows the principle of garbage in, garbage out, the accuracy, quality and quantity of the labeled data greatly influence the performance of the detection model. Data labels come in two forms: strong labels, which accurately mark the sound event categories and their timestamps and thus reflect the number and positions of sound events in a piece of audio, and weak labels, which only mark whether a certain type of sound event occurs, without reflecting how many times it sounds or at what points in time.
Data quantity and labeling quality generally cannot both be achieved. If the audio data are strongly labeled, the trained model obtains a more accurate and detailed label description and can predict the start and end times (timestamps) of possibly overlapping sound events. However, strong labeling is usually done by human listening and manual annotation, which requires sustained attention throughout and is recorded with professional software; it is a very time- and labor-consuming task, and when a piece of audio mixes several types of sound events that overlap in time, the cost of strong labeling multiplies. Weak labeling only marks whether an event of interest occurs in a segment of audio, reducing the labor cost of annotating the data set at the cost of discarding part of the temporal information; correspondingly, a model trained on a weakly labeled data set cannot predict the timing of sound events and its recognition rate is not high. Commonly used data sets are: the Detection and Classification of Acoustic Scenes and Events (DCASE 2017) sound event detection data set, whose labels are accurate but whose sample categories and quantities are small, so a model trained on it has a narrow recognition range and poor generality; and the Google AudioSet weakly labeled data set, which has many sample categories and a large sample quantity but, limited by cost, low labeling precision, so a model trained on it has a wider recognition range but lower recognition accuracy.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a sound event labeling and identification method adopting double Token labels. The method widens the range of recognizable sound events at lower cost while maintaining accuracy, and enables accurate sound event detection and monitoring in everyday living environments, thereby better serving the construction of smart cities.
The technical scheme for realizing the purpose of the invention is as follows:
a sound event labeling and identifying method adopting double Token labels comprises a sound event labeling process and an identifying process, wherein the sound event labeling process comprises the following steps:
1-1) audio label format: the original audio data containing various sound events are played in the audio annotation software Audacity, and the labeling step is: within the occurrence time range of each sound event in the audio, randomly select two Tokens, denoted Ci_start and Ci_end respectively, where C represents the sound event category;
1-2) repeating the labeling step to finish all audio labeling in the data set;
the identification process is as follows:
2-1) constructing an audio data set: sound event audio is collected according to the requirements of the detection task to form an audio data set; building the data set requires a large amount of labeled audio. First, the types of sound events to be detected are determined according to the detection requirements. The audio annotation software Audacity is used to play the sound event audio to be labeled; while the audio plays, the sound event category and timestamp are marked with mouse clicks in the software's Label Track, completing the labeling of the audio data: two points are randomly selected within the sounding time range of the heard sound event, giving two Tokens, Ci_start and Ci_end respectively, where C represents the sound event category. Because accurately marking the boundaries of overlapping sound events in strong labeling requires a person to replay the audio repeatedly, the invention instead assigns the two Tokens at random, omitting the time-consuming and tedious process of repeated playback and boundary determination and thereby saving labor; the reduction of labeling information in this simplified method has a negative effect on recognition, which is resolved by designing a matching convolutional recurrent neural network. Finally, Audacity is used to export a label file, which records the audio file name, the sound event categories occurring under each audio file name, and the timestamp of each sound event;
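As an illustration of this step, the following sketch (not part of the patent text) shows how an exported Audacity label track could be parsed into (category, start, end) event tuples. It assumes the standard tab-separated Audacity export format, one line per label of the form start, end and text, and that the Token text follows the pattern "category_start" / "category_end"; the function name and pairing logic are illustrative only.

# A minimal sketch of parsing an Audacity label-track export under the
# double-Token scheme; assumes label text like "gunshot_start" / "gunshot_end".
from collections import defaultdict

def parse_double_token_labels(label_file):
    pending = defaultdict(list)   # category -> start times waiting for an end Token
    events = []                   # collected (category, start_s, end_s) tuples
    with open(label_file, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) < 3:
                continue
            t, text = float(parts[0]), parts[2]
            cls, _, token = text.rpartition("_")   # "gunshot_start" -> ("gunshot", "start")
            if token == "start":
                pending[cls].append(t)
            elif token == "end" and pending[cls]:
                events.append((cls, pending[cls].pop(0), t))
    return events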
2-2) audio data preprocessing and feature extraction:
for audio: because the audio may come from different recording devices, the processing platform resamples all audio at 16 kHz; after resampling, the audio waveform data are normalized so that the values fall in the range (-1, 1), using max normalization: x(t) = s(t) / max(|s(t)|). A 128-dimensional logarithmic Mel energy spectrum is then extracted from all audio using the short-time Fourier transform, with the following parameters: nfft = 2048, sampling frequency 16 kHz, 1/2-frame overlap. Finally, the logarithmic Mel energy spectrum is z-score normalized: suppose the input logarithmic Mel energy spectra are X1, X2, ..., Xn, then
Yi = (Xi − μ) / σ,
wherein
μ = (1/n) Σ Xi and σ = sqrt( (1/n) Σ (Xi − μ)² ),
obtaining the normalized logarithmic Mel energy spectra Y1, Y2, ..., Yn with mean 0 and variance 1;
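A minimal preprocessing sketch is given below for illustration, assuming the librosa library; the 16 kHz resampling, max normalization, nfft = 2048, 1/2-frame overlap, 128 Mel bands and z-score normalization follow the text, while the specific function calls are an implementation choice.

# A minimal sketch of the audio preprocessing and feature extraction step.
import numpy as np
import librosa

def extract_features(path, sr=16000, nfft=2048, n_mels=128):
    s, _ = librosa.load(path, sr=sr)                  # resample to 16 kHz
    s = s / (np.max(np.abs(s)) + 1e-9)                # max normalization: x(t) = s(t) / max(|s(t)|)
    mel = librosa.feature.melspectrogram(
        y=s, sr=sr, n_fft=nfft, hop_length=nfft // 2, n_mels=n_mels)  # 1/2-frame overlap
    log_mel = librosa.power_to_db(mel)                # 128-dim log-Mel energy spectrum
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)        # z-score normalization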
for audio labels: labels in units of seconds are converted into labels in units of frames. Each label file is transformed by the following steps to obtain an audio label encoding matrix in units of frames; the label encoding matrix consists of 0 and 1 elements, the number of columns n of the matrix is the number of frames, and the number of rows m is the number of sound event categories. The conversion of an audio label encoding matrix containing m classes of sound events from seconds to frames is as follows:
step 1: assuming the sampling frequency is sr and the audio duration is t, generate a zero matrix with m rows and n columns, where the number of columns n = sr × t and the number of rows m is the number of sound event categories;
step 2: determine the timestamp of each sound event in units of frames: let timestamp_second be the timestamp in seconds, nfft the frame length and hop_length the frame overlap; the timestamp conversion formula is:
timestamp_frame = timestamp_second ÷ nfft ÷ (1 − hop_length);
step 3: the matrix values within the range covered by timestamp_frame, i.e. between the start frame and the end frame of each sound event, are changed from 0 to 1;
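The following sketch illustrates the construction of the frame-level 0/1 label encoding matrix. As an assumption, the frame index here is derived from the STFT hop size (nfft/2 at 16 kHz) so that the matrix width matches the log-Mel feature matrix from step 2-2); the event-list format (class index, start in seconds, end in seconds) is illustrative.

# A minimal sketch of converting second-based labels to a frame-based 0/1 matrix.
import numpy as np

def encode_labels(events, n_classes, n_frames, sr=16000, hop=1024):
    """events: list of (class_index, start_s, end_s); returns (n_classes, n_frames)."""
    Y = np.zeros((n_classes, n_frames), dtype=np.float32)
    for cls, start_s, end_s in events:
        f0 = int(round(start_s * sr / hop))           # seconds -> frame index
        f1 = int(round(end_s * sr / hop))
        Y[cls, f0:min(f1 + 1, n_frames)] = 1.0        # set 1 between start and end frames
    return Y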
2-3) audio data augmentation: to improve the generalization performance of the neural network and prevent overfitting, the audio data are augmented to three times the original amount using the following augmentation methods: random audio scaling, time masking, frequency masking, adding random noise, and audio sample mixing (mixup);
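The listed augmentations could be sketched as follows; this is an illustrative implementation, applied for brevity directly to a log-Mel spectrogram X of shape (n_mels, n_frames) and its label matrix Y, with mixup drawing its mixing weight from a Beta distribution.

# A minimal augmentation sketch (not from the patent): scaling, noise,
# time/frequency masking and mixup applied to a log-Mel spectrogram.
import numpy as np

def augment(X, Y, X2=None, Y2=None, rng=np.random):
    X = X * rng.uniform(0.8, 1.2)                       # random scaling
    X = X + rng.normal(0, 0.01, X.shape)                # add random noise
    t0 = rng.randint(0, X.shape[1] - 10)                # time masking (10-frame block)
    X[:, t0:t0 + 10] = 0.0
    f0 = rng.randint(0, X.shape[0] - 8)                 # frequency masking (8-band block)
    X[f0:f0 + 8, :] = 0.0
    if X2 is not None:                                  # mixup with a second sample
        lam = rng.beta(0.2, 0.2)
        X = lam * X + (1 - lam) * X2
        if Y2 is not None:
            Y = np.maximum(Y, Y2)                       # union of event labels
    return X, Y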
2-4) building the convolutional recurrent neural network: the following convolutional recurrent neural network is built with the PyTorch framework: the first layer is the input layer, taking the 128-dimensional logarithmic Mel energy spectrum; the second layer is a 2-D convolutional layer with 16 channels followed by 2 × 2 pooling; the third layer is a 2-D convolutional layer with 32 channels followed by 2 × 2 pooling; the fourth layer is a 2-D convolutional layer with 64 channels followed by 2 × 2 pooling; the fifth layer is a 2-D convolutional layer with 128 channels followed by 2 × 1 pooling; the sixth layer is a 2-D convolutional layer with 256 channels followed by 2 × 1 pooling, after which the output feature map tensor is flattened; the seventh layer is a one-dimensional convolutional layer with 256 channels; the eighth layer is a bidirectional recurrent neural network with two GRU layers of 256 neurons; the ninth layer is the output layer, consisting of fully connected layers of 256 and 80 neurons with ReLU activation, followed by a final fully connected layer activated by sigmoid whose number of neurons equals the number of sound event categories. Each convolutional layer uses 3 × 3 convolution kernels with stride 1 and is followed by a batch normalization layer and ReLU activation;
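A sketch of this network under a PyTorch implementation is given below, assuming an input of shape (batch, 1, 128 Mel bands, T frames); the channel counts, pooling sizes, GRU width and 256/80-neuron head follow the text, while padding and dimension handling are implementation choices.

# A minimal sketch of the convolutional recurrent neural network described above.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        def block(cin, cout, pool):
            # 3x3 conv (stride 1) + batch normalization + ReLU + pooling
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=1, padding=1),
                nn.BatchNorm2d(cout), nn.ReLU(),
                nn.MaxPool2d(pool))
        self.cnn = nn.Sequential(
            block(1, 16, (2, 2)), block(16, 32, (2, 2)), block(32, 64, (2, 2)),
            block(64, 128, (2, 1)), block(128, 256, (2, 1)))
        self.conv1d = nn.Conv1d(256 * 4, 256, kernel_size=3, padding=1)  # after flattening the frequency axis
        self.gru = nn.GRU(256, 256, num_layers=2, bidirectional=True, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 80), nn.ReLU(),
            nn.Linear(80, n_classes), nn.Sigmoid())

    def forward(self, x):                      # x: (B, 1, 128, T)
        z = self.cnn(x)                        # (B, 256, 4, T/8)
        z = z.flatten(1, 2)                    # flatten channels x frequency -> (B, 1024, T/8)
        z = self.conv1d(z)                     # (B, 256, T/8)
        z, _ = self.gru(z.transpose(1, 2))     # (B, T/8, 512)
        return self.head(z)                    # frame-wise class probabilities (B, T/8, n_classes)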
2-5) training the convolutional recurrent neural network to learn the detection model: the training data, i.e. the logarithmic Mel energy spectra of the audio, are fed into the convolutional recurrent neural network built in step 2-4), whose initial weight parameters are assigned randomly by PyTorch, to obtain the output
Ŷ ∈ [0, 1]^(C×T),
where C is the number of sound event categories and T is the total number of frames. The loss is computed on the true-positive prediction labels: the double-Token label matrix Yp is multiplied element by element with Ŷ to obtain the output
Ŷp = Yp ⊙ Ŷ,
and finally the following binary cross-entropy loss function is computed:
BCE(Ŷp, Yp) = −Σ [ Yp · log(Ŷp) + (1 − Yp) · log(1 − Ŷp) ];
the gradient is back-propagated using the Adam gradient descent method with a learning rate of 0.001; the weight parameters are updated and training iterates until the loss no longer decreases, after which the model parameters are saved;
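The training step could be sketched as follows, assuming the CRNN sketch above and batches of (log-Mel features, double-Token label matrices) already pooled to the network's output frame rate; the element-wise masking Yp ⊙ Ŷ and the Adam settings (learning rate 0.001) follow the text.

# A minimal training-loop sketch for the masked binary cross-entropy loss.
import torch
import torch.nn as nn

def train(model, loader, device="cpu", epochs=100):
    opt = torch.optim.Adam(model.parameters(), lr=0.001)
    bce = nn.BCELoss()
    model.to(device).train()
    for epoch in range(epochs):
        total = 0.0
        for x, y in loader:                 # x: (B, 1, 128, T), y: (B, T', C) double-Token matrix
            x, y = x.to(device), y.to(device)
            y_hat = model(x)                # frame-wise probabilities
            y_hat_p = y * y_hat             # keep only true-positive positions (Yp element-wise Y_hat)
            loss = bce(y_hat_p, y)          # binary cross-entropy
            opt.zero_grad()
            loss.backward()                 # gradient back-propagation
            opt.step()
            total += loss.item()
        print(f"epoch {epoch}: loss {total / len(loader):.4f}")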
2-6) identifying the audio to be detected with the trained detection model: the unlabeled audio to be detected is normalized and its logarithmic Mel energy spectrum extracted, then fed into the trained convolutional recurrent neural network to obtain the probability output of the neural network, which is stored; an optimal decision threshold α is searched for with the F1-score as the criterion, and the output is binarized with the decision threshold α to obtain the prediction under the double-Token labels. Compared with a strong-label detection model, the double-Token result does not fully cover the true timestamps of the sound events, so to reduce false-negative predictions a label extension strategy is adopted, specifically: for the sound event start and end frame nodes determined by the double-Token prediction output matrix, compute the cosine similarity of the adjacent frames at the corresponding frame node of the neural network probability output matrix; if the similarity is greater than 0.5, extend the frame, i.e. extend the timestamps in the double-Token label matrix. The prediction matrix after label extension is finally obtained, giving the recognition result and completing the recognition; the extension to the left and right must not exceed a preset hyper-parameter collar value, generally taken from the maximum and minimum durations of all sound events (250 ms-50 ms).
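The post-processing described above could be sketched as follows; the thresholding with α and the cosine-similarity extension rule follow the text, while treating each class as one contiguous active region and handling the collar via a max_extend parameter are simplifying assumptions of this sketch.

# A minimal sketch of binarization with threshold alpha plus label extension.
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def extend_predictions(probs, alpha=0.5, sim_thr=0.5, max_extend=10):
    """probs: (C, T) network probability output; returns extended binary prediction (C, T)."""
    pred = (probs >= alpha).astype(np.float32)
    C, T = pred.shape
    for c in range(C):
        active = np.where(pred[c] > 0)[0]
        if active.size == 0:
            continue
        start, end = active[0], active[-1]
        # extend leftwards while adjacent probability columns stay similar
        f, steps = start, 0
        while f > 0 and steps < max_extend and cos_sim(probs[:, f], probs[:, f - 1]) > sim_thr:
            f -= 1; steps += 1
            pred[c, f] = 1.0
        # extend rightwards symmetrically
        f, steps = end, 0
        while f < T - 1 and steps < max_extend and cos_sim(probs[:, f], probs[:, f + 1]) > sim_thr:
            f += 1; steps += 1
            pred[c, f] = 1.0
    return pred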
This technical scheme targets the respective characteristics of strongly and weakly labeled data and finds a compromise by combining them with the deep-learning paradigm: it proposes a labor-saving idea for moving from weak labels toward strong labels, and updates the prediction results obtained under double-Token labeling through audio analysis and the machine learning model, so that the sound event detection results are better than those achievable with weak labels.
While ensuring accuracy, the method widens the range of recognizable sound events at lower cost and enables accurate sound event detection and monitoring in everyday living environments, thereby better serving the construction of smart cities.
Drawings
FIG. 1 is a flow chart of an identification process in an embodiment;
FIG. 2 is a schematic diagram illustrating the use of audio annotation software to annotate a segment of audio in the process of constructing a data set according to an embodiment;
FIG. 3 is a schematic diagram of double-Token labels in an embodiment containing 3 classes of sound events.
Detailed Description
The invention will be further elucidated with reference to the drawings and examples, without however being limited thereto.
Embodiment:
Referring to FIG. 1, a sound event labeling and identification method adopting double Token labels includes a sound event labeling process and an identification process, where the labeling process is as follows:
1-1) audio label format: the audio annotation software Audacity is used to play the original audio data containing various sound events, and the labeling step is: within the occurrence time range of each sound event in the audio, randomly select two Tokens, denoted Ci_start and Ci_end respectively, where C represents the sound event category;
1-2) repeating the labeling step to finish all audio labeling in the data set;
the identification process is as follows:
2-1) constructing an audio data set: sound event audio is collected according to the detection task requirements to form an audio data set; building the data set requires a large amount of labeled audio. The types of sound events to be detected are determined according to the detection requirements; in this embodiment there are 3 classes: gunshots, screams and sirens. These are obtained by manual recording and by downloading audio files containing gunshots, screams and sirens from video sites such as YouTube and Bilibili. The collected audio is cut into segments of 10 s duration, in which the various sound events may overlap repeatedly; through repeated recording and downloading, 3000 audio clips containing the three classes of sound events are obtained. The audio annotation software Audacity is used to play the 3000 clips of sound event audio to be labeled; while the audio plays, the sound event category and timestamp are marked with mouse clicks in the software's Label Track, completing the audio data labeling, as shown in FIG. 2. Each mouse click in the Audacity Label Track follows this labeling method: without carefully listening to determine the start and end points of the sound event, two points are randomly selected within the sounding time range of the heard sound event, giving two Tokens, Ci_start and Ci_end respectively, where C represents the sound event category. Because accurately labeling the boundaries of overlapping sound events in strong labeling requires repeated playback, this embodiment assigns the two Tokens at random to omit the time-consuming and tedious process of repeated playback and boundary determination, saving labor; the reduction of labeling information in this simplified method has a negative effect on recognition, which is resolved by designing a matching convolutional recurrent neural network. As shown in FIG. 3, the upper part of FIG. 3 simulates the double-Token labeling process, where only two Tokens are taken at random within the occurrence range of each sound event; the middle part of FIG. 3 shows the true label values; the lower part of FIG. 3 shows the double-Token label encoding matrix. Compared with the true values, the label coverage is incomplete because randomly selecting Tokens does not require carefully spending effort to determine the boundaries of overlapping sound events. Comparing strong labels with double-Token labels in label-matrix form, the vertical axis of the label matrix in FIG. 3 represents the sound event categories (assumed to be 3), the horizontal axis represents the frame index, 1 indicates that the sound event is sounding at the frame index shown, and blank (0) indicates that no sound event occurs;
2-2) audio data preprocessing and feature extraction:
for audio: because the audio may come from different recording devices, the processing platform resamples all audio at 16 kHz; after resampling, the audio waveform data are normalized so that the values fall in the range (-1, 1), using max normalization: x(t) = s(t) / max(|s(t)|). A 128-dimensional logarithmic Mel energy spectrum is then extracted from all audio using the short-time Fourier transform, with the following parameters: frame length nfft = 2048, sampling frequency 16 kHz, 1/2-frame overlap. Finally, the logarithmic Mel energy spectrum is z-score normalized: suppose the input logarithmic Mel energy spectra are X1, X2, ..., Xn, then
Yi = (Xi − μ) / σ,
wherein
μ = (1/n) Σ Xi and σ = sqrt( (1/n) Σ (Xi − μ)² ),
obtaining the normalized logarithmic Mel energy spectra Y1, Y2, ..., Yn with mean 0 and variance 1;
for audio labels: labels in units of seconds are converted into labels in units of frames; each label file is transformed to obtain an audio label encoding matrix in units of frames, consisting of 0 and 1 elements. The number of columns n of the matrix is the number of frames, here n = 160000; the number of rows m is the number of sound event categories, here m = 3. In this example nfft = 2048 and hop_length = 1/2 are used to extract the Mel energy spectrum, and an audio label encoding matrix containing m classes of sound events is converted from seconds to frames as follows:
step 1: assume the sampling rate is sr, here sr = 16000, and the audio duration is t, here 10 s; then the number of matrix columns n = sr × t = 160000 and the number of matrix rows m is the number of sound event categories, 3, i.e. a zero matrix with 3 rows and 160000 columns is generated;
step 2: determine the timestamp of each sound event in units of frames: let timestamp_second be the timestamp in seconds and hop_length the frame overlap; the timestamp conversion formula is:
timestamp_frame = timestamp_second ÷ 2048 ÷ (1 − 1/2);
step 3: the matrix values within the range covered by timestamp_frame, i.e. between the start frame and the end frame of each sound event, are changed from 0 to 1;
2-3) audio data augmentation: to improve the generalization performance of the neural network and prevent overfitting, the audio data are augmented from the original 3000 clips to three times that amount (9000) using the following augmentation methods: random audio scaling, time masking, frequency masking, adding random noise, and audio sample mixing (mixup);
2-4) building the convolutional recurrent neural network: the following convolutional recurrent neural network is built with the PyTorch framework: the first layer is the input layer, taking the 128-dimensional logarithmic Mel energy spectrum; the second layer is a 2-D convolutional layer with 16 channels followed by 2 × 2 pooling; the third layer is a 2-D convolutional layer with 32 channels followed by 2 × 2 pooling; the fourth layer is a 2-D convolutional layer with 64 channels followed by 2 × 2 pooling; the fifth layer is a 2-D convolutional layer with 128 channels followed by 2 × 1 pooling; the sixth layer is a 2-D convolutional layer with 256 channels followed by 2 × 1 pooling, after which the output feature map tensor is flattened; the seventh layer is a one-dimensional convolutional layer with 256 channels; the eighth layer is a bidirectional recurrent neural network with two GRU layers of 256 neurons; the ninth layer is the output layer, consisting of fully connected layers of 256 and 80 neurons with ReLU activation, followed by a final fully connected layer activated by sigmoid whose number of neurons equals the number of sound event categories. Each convolutional layer uses 3 × 3 convolution kernels with stride 1 and is followed by a batch normalization layer and ReLU activation;
2-5) training the convolutional recurrent neural network to learn the detection model: the training data, i.e. the logarithmic Mel energy spectra of the audio, are fed into the convolutional recurrent neural network built in step 2-4), whose initial weight parameters are assigned randomly by PyTorch, to obtain the output
Ŷ ∈ [0, 1]^(C×T),
where C is the number of sound event categories and T is the total number of frames. The loss is computed on the true-positive prediction labels: the double-Token label matrix Yp is multiplied element by element with Ŷ to obtain the output
Ŷp = Yp ⊙ Ŷ,
and finally the following binary cross-entropy loss function is computed:
BCE(Ŷp, Yp) = −Σ [ Yp · log(Ŷp) + (1 − Yp) · log(1 − Ŷp) ];
the gradient is back-propagated using the Adam gradient descent method with a learning rate of 0.001; the weight parameters are updated and training iterates until the loss no longer decreases, after which the model parameters are saved;
2-6) identifying the audio to be detected with the trained detection model: the unlabeled audio to be detected is normalized and its logarithmic Mel energy spectrum extracted, then fed into the trained convolutional recurrent neural network to obtain the probability output of the neural network, which is stored; an optimal decision threshold α is searched for with the F1-score as the criterion, and the output is binarized with the decision threshold α to obtain the prediction under the double-Token labels. Compared with a strong-label detection model, the double-Token result does not fully cover the true timestamps of the sound events, so to reduce false-negative predictions the invention adopts a label extension strategy, specifically: for the sound event start and end frame nodes determined by the double-Token prediction output matrix, compute the cosine similarity of the adjacent frames at the corresponding frame node of the neural network probability output matrix; if the similarity is greater than 0.5, extend the frame, i.e. extend the timestamps in the double-Token label matrix. The prediction matrix after label extension is finally obtained, giving the recognition result and completing the recognition; in this embodiment the extension to the left and right does not exceed the preset hyper-parameter collar value, taken from the maximum and minimum durations of all sound events (250 ms-50 ms).

Claims (1)

1. A sound event labeling and identifying method adopting double Token labels is characterized by comprising a sound event labeling process and an identifying process, wherein the sound event labeling process comprises the following steps:
1-1) audio label format: the audio annotation software Audacity is used to play the original audio data containing various sound events, and the labeling step is: within the occurrence time range of each sound event in the audio, randomly select two Tokens, denoted Ci_start and Ci_end respectively, where C represents the sound event category;
1-2) repeating the labeling step to finish all audio labeling in the data set;
the identification process is as follows:
2-1) constructing an audio data set: sound event audio is collected according to the detection task requirements to form an audio data set; first the types of sound events to be detected are determined, then the audio annotation software Audacity is used to play the sound event audio to be labeled; while the audio plays, the sound event category and timestamp are marked with mouse clicks in the software's Label Track, completing the audio data labeling: two points are randomly selected within the sounding time range of the heard sound event, giving two Tokens, Ci_start and Ci_end respectively, where C represents the sound event category; finally Audacity is used to export a label file, which records the audio file name, the sound event categories occurring under each audio file name, and the timestamp of each sound event;
2-2) audio data preprocessing and feature extraction:
for audio: all audio is resampled at 16 kHz; after resampling, the audio waveform data are normalized so that the values fall in the range (-1, 1), using max normalization: x(t) = s(t) / max(|s(t)|); a 128-dimensional logarithmic Mel energy spectrum is then extracted from all audio using the short-time Fourier transform, with parameters nfft = 2048, sampling frequency 16 kHz and 1/2-frame overlap; finally the logarithmic Mel energy spectrum is z-score normalized: suppose the input logarithmic Mel energy spectra are X1, X2, ..., Xn, then
Yi = (Xi − μ) / σ,
wherein
μ = (1/n) Σ Xi and σ = sqrt( (1/n) Σ (Xi − μ)² ),
obtaining the normalized logarithmic Mel energy spectra Y1, Y2, ..., Yn with mean 0 and variance 1;
for audio labels: labels in units of seconds are converted into labels in units of frames; each label file is transformed by the following steps to obtain an audio label encoding matrix in units of frames, consisting of 0 and 1 elements, where the number of columns n of the matrix is the number of frames and the number of rows m is the number of sound event categories; the conversion of an audio label encoding matrix containing m classes of sound events from seconds to frames is as follows:
step 1: assuming the sampling frequency is sr and the audio duration is t, generate a zero matrix with m rows and n columns, where the number of columns n = sr × t and the number of rows m is the number of sound event categories;
step 2: determine the timestamp of each sound event in units of frames: let timestamp_second be the timestamp in seconds, frame_length the frame length and hop_length the frame overlap; the timestamp conversion formula is:
timestamp_frame = timestamp_second ÷ nfft ÷ (frame_length − hop_length);
step 3: the matrix values within the range covered by timestamp_frame, i.e. between the start frame and the end frame of each sound event, are changed from 0 to 1;
2-3) audio data augmentation: the original audio data are augmented to three times the original amount using the following augmentation methods: random audio scaling, time masking, frequency masking, adding random noise, and audio sample mixing (mixup);
2-4) building the convolutional recurrent neural network: the following convolutional recurrent neural network is built with the PyTorch framework: the first layer is the input layer, taking the 128-dimensional logarithmic Mel energy spectrum; the second layer is a 2-D convolutional layer with 16 channels followed by 2 × 2 pooling; the third layer is a 2-D convolutional layer with 32 channels followed by 2 × 2 pooling; the fourth layer is a 2-D convolutional layer with 64 channels followed by 2 × 2 pooling; the fifth layer is a 2-D convolutional layer with 128 channels followed by 2 × 1 pooling; the sixth layer is a 2-D convolutional layer with 256 channels followed by 2 × 1 pooling, after which the output feature map tensor is flattened; the seventh layer is a one-dimensional convolutional layer with 256 channels; the eighth layer is a bidirectional recurrent neural network with two GRU layers of 256 neurons; the ninth layer is the output layer, consisting of fully connected layers of 256 and 80 neurons with ReLU activation, followed by a final fully connected layer activated by sigmoid whose number of neurons equals the number of sound event categories; each convolutional layer uses 3 × 3 convolution kernels with stride 1 and is followed by a batch normalization layer and ReLU activation;
2-5) training the convolutional recurrent neural network to learn the detection model: the training data, i.e. the logarithmic Mel energy spectra of the audio, are fed into the convolutional recurrent neural network built in step 2-4), whose initial weight parameters are assigned randomly by PyTorch, to obtain the output
Ŷ ∈ [0, 1]^(C×T),
where C is the number of sound event categories and T is the total number of frames; the loss is computed on the true-positive prediction labels: the double-Token label matrix Yp is multiplied element by element with Ŷ to obtain the output
Ŷp = Yp ⊙ Ŷ,
and finally the following binary cross-entropy loss function is computed:
BCE(Ŷp, Yp) = −Σ [ Yp · log(Ŷp) + (1 − Yp) · log(1 − Ŷp) ];
the gradient is back-propagated using the Adam gradient descent method with a learning rate of 0.001; the weight parameters are updated and training iterates until the loss no longer decreases, after which the model parameters are saved;
2-6) identifying the audio to be detected with the trained detection model: the unlabeled audio to be detected is normalized and its logarithmic Mel energy spectrum extracted, then fed into the trained convolutional recurrent neural network to obtain the probability output of the neural network, which is stored; an optimal decision threshold α is searched for with the F1-score as the criterion, and the output is binarized with the decision threshold α to obtain the prediction under the double-Token labels; the specific method is: for the sound event start and end frame nodes determined by the double-Token prediction output matrix, compute the cosine similarity of the adjacent frames at the corresponding frame node of the neural network probability output matrix; if the similarity is greater than 0.5, extend the frame, i.e. extend the timestamps in the double-Token label matrix; the prediction matrix after label extension is finally obtained, giving the recognition result and completing the recognition.
CN202110465526.7A 2021-04-28 2021-04-28 Sound event marking and identifying method adopting double Token labels Active CN113140226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110465526.7A CN113140226B (en) 2021-04-28 2021-04-28 Sound event marking and identifying method adopting double Token labels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110465526.7A CN113140226B (en) 2021-04-28 2021-04-28 Sound event marking and identifying method adopting double Token labels

Publications (2)

Publication Number Publication Date
CN113140226A CN113140226A (en) 2021-07-20
CN113140226B true CN113140226B (en) 2022-06-21

Family

ID=76816250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110465526.7A Active CN113140226B (en) 2021-04-28 2021-04-28 Sound event marking and identifying method adopting double Token labels

Country Status (1)

Country Link
CN (1) CN113140226B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299314B (en) * 2021-07-27 2021-11-02 北京世纪好未来教育科技有限公司 Training method, device and equipment of audio event recognition model
CN113963228B (en) * 2021-09-14 2024-07-02 电信科学技术第五研究所有限公司 Voice event extraction method based on deep learning feature connection analysis
CN113593606B (en) * 2021-09-30 2022-02-15 清华大学 Audio recognition method and device, computer equipment and computer-readable storage medium
CN114373484A (en) * 2022-03-22 2022-04-19 南京邮电大学 Voice-driven small sample learning method for Parkinson disease multi-symptom characteristic parameters
CN115206294B (en) * 2022-09-16 2022-12-06 深圳比特微电子科技有限公司 Training method, sound event detection method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065030A (en) * 2018-08-01 2018-12-21 上海大学 Ambient sound recognition methods and system based on convolutional neural networks
CN110827804A (en) * 2019-11-14 2020-02-21 福州大学 Sound event labeling method from audio frame sequence to event label sequence
CN110990534A (en) * 2019-11-29 2020-04-10 北京搜狗科技发展有限公司 Data processing method and device and data processing device
US10783434B1 (en) * 2019-10-07 2020-09-22 Audio Analytic Ltd Method of training a sound event recognition system
CN112447189A (en) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Voice event detection method and device, electronic equipment and computer storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11216724B2 (en) * 2017-12-07 2022-01-04 Intel Corporation Acoustic event detection based on modelling of sequence of event subparts
US11024291B2 (en) * 2018-11-21 2021-06-01 Sri International Real-time class recognition for an audio stream
KR102444411B1 (en) * 2019-03-29 2022-09-20 한국전자통신연구원 Method and apparatus for detecting sound event considering the characteristics of each sound event

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065030A (en) * 2018-08-01 2018-12-21 上海大学 Ambient sound recognition methods and system based on convolutional neural networks
US10783434B1 (en) * 2019-10-07 2020-09-22 Audio Analytic Ltd Method of training a sound event recognition system
CN110827804A (en) * 2019-11-14 2020-02-21 福州大学 Sound event labeling method from audio frame sequence to event label sequence
CN110990534A (en) * 2019-11-29 2020-04-10 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN112447189A (en) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Voice event detection method and device, electronic equipment and computer storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A first attempt at polyphonic sound event detection using connectionist temporal classification; Y. Wang; 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 20170619; 2986-2990 *
Environmental Sound Recognition Based on Double-input Convolutional Neural Network Model; M. Wang; 2020 IEEE 2nd International Conference on Civil Aviation Safety and Information Technology; 20210309; 620-624 *
Sound event detection using point-labeled data; Kim B; 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA); 20191231; 1-5 *
Street environmental sound event detection method based on improved log-Mel spectrogram features (in Chinese); Zhang Liujun; Journal of Guilin University of Electronic Technology; 20200531; 411-417 *
Research on multi-sound-event detection methods based on deep neural networks (in Chinese); Liu Yaming; China Master's Theses Full-text Database, Information Science and Technology; 20190831; I136-86 *

Also Published As

Publication number Publication date
CN113140226A (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN113140226B (en) Sound event marking and identifying method adopting double Token labels
Mac Aodha et al. Bat detective—Deep learning tools for bat acoustic signal detection
Priyadarshani et al. Automated birdsong recognition in complex acoustic environments: a review
Kasten et al. The remote environmental assessment laboratory's acoustic library: An archive for studying soundscape ecology
Mesaros et al. TUT database for acoustic scene classification and sound event detection
Heittola et al. Context-dependent sound event detection
Ntalampiras Bird species identification via transfer learning from music genres
Keen et al. A comparison of similarity-based approaches in the classification of flight calls of four species of North American wood-warblers (Parulidae)
Zhong et al. Acoustic detection of regionally rare bird species through deep convolutional neural networks
CN111180025B (en) Method, device and inquiry system for representing text vectors of medical records
CN111429943B (en) Joint detection method for music and relative loudness of music in audio
Zhang et al. Learning audio sequence representations for acoustic event classification
Hou et al. Transfer learning for improving singing-voice detection in polyphonic instrumental music
Morales et al. Method for passive acoustic monitoring of bird communities using UMAP and a deep neural network
Wang et al. Automated call detection for acoustic surveys with structured calls of varying length
CN107578785B (en) Music continuous emotion characteristic analysis and evaluation method based on Gamma distribution analysis
Soni et al. Automatic audio event recognition schemes for context-aware audio computing devices
Rulff et al. Urban Rhapsody: Large‐scale exploration of urban soundscapes
Xia et al. Sound event detection using multiple optimized kernels
CN117877516A (en) Sound event detection method based on cross-model two-stage training
Wang et al. A hierarchical birdsong feature extraction architecture combining static and dynamic modeling
Zhang et al. Learning audio sequence representations for acoustic event classification
Martin-Morato et al. On the robustness of deep features for audio event classification in adverse environments
Schuller et al. New avenues in audio intelligence: Towards holistic real-life audio understanding
Pan et al. Tree size estimation from a feller-buncher’s cutting sound

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20210720

Assignee: Wuhan xingeno Technology Co.,Ltd.

Assignor: GUILIN University OF ELECTRONIC TECHNOLOGY

Contract record no.: X2022450000387

Denomination of invention: An Acoustic Event Labeling and Recognition Method Using Double Token Tags

Granted publication date: 20220621

License type: Common License

Record date: 20221226
