CN113140226B - Sound event marking and identifying method adopting double Token labels - Google Patents

Sound event marking and identifying method adopting double Token labels

Info

Publication number
CN113140226B
Authority
CN
China
Prior art keywords
audio
layer
label
matrix
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110465526.7A
Other languages
Chinese (zh)
Other versions
CN113140226A (en)
Inventor
姚雨
宋浠瑜
王玫
仇洪冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202110465526.7A priority Critical patent/CN113140226B/en
Publication of CN113140226A publication Critical patent/CN113140226A/en
Application granted granted Critical
Publication of CN113140226B publication Critical patent/CN113140226B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a sound event labeling and identification method adopting double Token labels, comprising a sound event labeling process and an identification process. The labeling process comprises the following steps: 1-1) defining the audio label format; 1-2) completing the labeling of all audio in the data set. The identification process comprises the following steps: 2-1) constructing an audio data set; 2-2) preprocessing the audio data and extracting features; 2-3) augmenting the audio data; 2-4) building a convolutional recurrent neural network; 2-5) training the convolutional recurrent neural network to learn a detection model; 2-6) identifying the audio to be detected with the trained detection model. The method widens the range of recognizable sound events at lower cost while maintaining accuracy, and enables accurate sound event detection and monitoring in everyday living environments, thereby better serving the construction of smart cities.

Description

Sound event marking and identifying method adopting double Token labels
Technical Field
The invention relates to the field of sound event detection, and in particular to a sound event labeling and identification method adopting double Token labels.
Background
In everyday living environments, the various types of sound carry a great deal of information about the surroundings and the physical events occurring in them. Research on Sound Event Detection (SED) helps people better perceive the acoustic scene they are in, identify the types of sound sources and obtain timestamps of events of interest, and therefore has important practical significance. It can be applied in smart-city and smart-home scenarios such as urban environmental noise monitoring, safety monitoring in public places, and monitoring the behavior of the elderly and children indoors; for example, gunshots, screams and the sound of burning objects can be automatically detected and identified in acoustic surveillance applications. SED is thus of great value for human-computer interaction, auditory perception and meeting the many detection needs of society.
The sound event detection task relies on signal processing methods and machine learning models. A sound event detection model is usually trained on a large amount of labeled audio data; the resulting model can then make predictions on a segment of audio with unknown labels, typically predicting the sound event categories and the corresponding timestamps. Specifically, a signal processing method is applied to labeled audio to obtain a time-frequency feature representation; this labeled representation is fed as input to a machine learning model, which defines a loss function and randomly initializes its weight parameters. The loss between the forward-propagated output and the labels is computed, and the weights are then updated by back-propagation; after repeated iterations until the loss becomes small, the weight parameters constitute the sound event recognition model, which can make predictions on unlabeled audio and thereby achieve sound event detection. This process of iteration and weight updating is the machine learning training process. Since machine learning follows the principle of garbage in, garbage out, the accuracy, quality and quantity of the labeled data greatly influence the performance of the detection model. Data labels come in two forms: strong labels, which accurately mark the sound event categories and their timestamps and thus reflect the number and positions of sound events in a piece of audio, and weak labels, which only mark whether a certain type of sound event occurs, without reflecting how many times it sounds or at what points in time.
Data quantity and labeling quality generally cannot both be achieved. If the audio data are strongly labeled, the trained model obtains a more accurate and detailed label description and can predict the start and end times (timestamps) of possibly overlapping sound events. However, strong labeling is usually done by human listening and manual annotation, which requires sustained attention throughout and is recorded with professional software; it is a very time- and labor-consuming task, and when a piece of audio mixes several types of sound events that overlap in time, the cost of strong labeling multiplies. Weak labeling only marks whether an event of interest occurs in a segment of audio, reducing the labor cost of annotating the data set at the cost of discarding part of the temporal information; correspondingly, a model trained on a weakly labeled data set cannot predict the timing of sound events and its recognition rate is not high. Commonly used data sets are: the Detection and Classification of Acoustic Scenes and Events (DCASE 2017) sound event detection data set, whose labels are accurate but whose sample categories and quantities are small, so a model trained on it has a narrow recognition range and poor generality; and the Google AudioSet weakly labeled data set, which has many sample categories and a large sample quantity but, limited by cost, low labeling precision, so a model trained on it has a wider recognition range but lower recognition accuracy.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a sound event labeling and identification method adopting double Token labels. The method widens the range of recognizable sound events at lower cost while maintaining accuracy, and enables accurate sound event detection and monitoring in everyday living environments, thereby better serving the construction of smart cities.
The technical scheme for realizing the purpose of the invention is as follows:
a sound event labeling and identifying method adopting double Token labels comprises a sound event labeling process and an identifying process, wherein the sound event labeling process comprises the following steps:
1-1) audio label format: the original audio data containing various sound events are played in the audio annotation software Audacity, and the labeling step is: within the occurrence time range of each sound event in the audio, randomly select two Tokens, denoted Ci_start and Ci_end respectively, where C represents the sound event category;
1-2) repeating the labeling step to finish all audio labeling in the data set;
the identification process is as follows:
2-1) constructing an audio data set: sound event audio is collected according to the requirements of the detection task to form an audio data set; building the data set requires a large amount of labeled audio. First, the types of sound events to be detected are determined according to the detection requirements. The audio annotation software Audacity is used to play the sound event audio to be labeled; while the audio plays, the sound event category and timestamp are marked with mouse clicks in the software's Label Track, completing the labeling of the audio data: two points are randomly selected within the sounding time range of the heard sound event, giving two Tokens, Ci_start and Ci_end respectively, where C represents the sound event category. Because accurately marking the boundaries of overlapping sound events in strong labeling requires a person to replay the audio repeatedly, the invention instead assigns the two Tokens at random, omitting the time-consuming and tedious process of repeated playback and boundary determination and thereby saving labor; the reduction of labeling information in this simplified method has a negative effect on recognition, which is resolved by designing a matching convolutional recurrent neural network. Finally, Audacity is used to export a label file, which records the audio file name, the sound event categories occurring under each audio file name, and the timestamp of each sound event;
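As an illustration of this step, the following sketch (not part of the patent text) shows how an exported Audacity label track could be parsed into (category, start, end) event tuples. It assumes the standard tab-separated Audacity export format, one line per label of the form start, end and text, and that the Token text follows the pattern "category_start" / "category_end"; the function name and pairing logic are illustrative only.

# A minimal sketch of parsing an Audacity label-track export under the
# double-Token scheme; assumes label text like "gunshot_start" / "gunshot_end".
from collections import defaultdict

def parse_double_token_labels(label_file):
    pending = defaultdict(list)   # category -> start times waiting for an end Token
    events = []                   # collected (category, start_s, end_s) tuples
    with open(label_file, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) < 3:
                continue
            t, text = float(parts[0]), parts[2]
            cls, _, token = text.rpartition("_")   # "gunshot_start" -> ("gunshot", "start")
            if token == "start":
                pending[cls].append(t)
            elif token == "end" and pending[cls]:
                events.append((cls, pending[cls].pop(0), t))
    return events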
2-2) audio data preprocessing and feature extraction:
for audio: because the audio may come from different recording devices, the processing platform resamples all audio at 16 kHz; after resampling, the audio waveform data are normalized so that the values fall in the range (-1, 1), using max normalization: x(t) = s(t) / max(|s(t)|). A 128-dimensional logarithmic Mel energy spectrum is then extracted from all audio using the short-time Fourier transform, with the following parameters: nfft = 2048, sampling frequency 16 kHz, 1/2-frame overlap. Finally, the logarithmic Mel energy spectrum is z-score normalized: suppose the input logarithmic Mel energy spectra are X1, X2, ..., Xn, then
Yi = (Xi − μ) / σ,
wherein
μ = (1/n) Σ Xi and σ = sqrt( (1/n) Σ (Xi − μ)² ),
obtaining the normalized logarithmic Mel energy spectra Y1, Y2, ..., Yn with mean 0 and variance 1;
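A minimal preprocessing sketch is given below for illustration, assuming the librosa library; the 16 kHz resampling, max normalization, nfft = 2048, 1/2-frame overlap, 128 Mel bands and z-score normalization follow the text, while the specific function calls are an implementation choice.

# A minimal sketch of the audio preprocessing and feature extraction step.
import numpy as np
import librosa

def extract_features(path, sr=16000, nfft=2048, n_mels=128):
    s, _ = librosa.load(path, sr=sr)                  # resample to 16 kHz
    s = s / (np.max(np.abs(s)) + 1e-9)                # max normalization: x(t) = s(t) / max(|s(t)|)
    mel = librosa.feature.melspectrogram(
        y=s, sr=sr, n_fft=nfft, hop_length=nfft // 2, n_mels=n_mels)  # 1/2-frame overlap
    log_mel = librosa.power_to_db(mel)                # 128-dim log-Mel energy spectrum
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)        # z-score normalization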
for audio labels: labels in units of seconds are converted into labels in units of frames. Each label file is transformed by the following steps to obtain an audio label encoding matrix in units of frames; the label encoding matrix consists of 0 and 1 elements, the number of columns n of the matrix is the number of frames, and the number of rows m is the number of sound event categories. The conversion of an audio label encoding matrix containing m classes of sound events from seconds to frames is as follows:
step 1: assuming the sampling frequency is sr and the audio duration is t, generate a zero matrix with m rows and n columns, where the number of columns n = sr × t and the number of rows m is the number of sound event categories;
step 2: determine the timestamp of each sound event in units of frames: let timestamp_second be the timestamp in seconds, nfft the frame length and hop_length the frame overlap; the timestamp conversion formula is:
timestamp_frame = timestamp_second ÷ nfft ÷ (1 − hop_length);
step 3: the matrix values within the range covered by timestamp_frame, i.e. between the start frame and the end frame of each sound event, are changed from 0 to 1;
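The following sketch illustrates the construction of the frame-level 0/1 label encoding matrix. As an assumption, the frame index here is derived from the STFT hop size (nfft/2 at 16 kHz) so that the matrix width matches the log-Mel feature matrix from step 2-2); the event-list format (class index, start in seconds, end in seconds) is illustrative.

# A minimal sketch of converting second-based labels to a frame-based 0/1 matrix.
import numpy as np

def encode_labels(events, n_classes, n_frames, sr=16000, hop=1024):
    """events: list of (class_index, start_s, end_s); returns (n_classes, n_frames)."""
    Y = np.zeros((n_classes, n_frames), dtype=np.float32)
    for cls, start_s, end_s in events:
        f0 = int(round(start_s * sr / hop))           # seconds -> frame index
        f1 = int(round(end_s * sr / hop))
        Y[cls, f0:min(f1 + 1, n_frames)] = 1.0        # set 1 between start and end frames
    return Y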
2-3) audio data augmentation: to improve the generalization performance of the neural network and prevent overfitting, the audio data are augmented to three times the original amount using the following augmentation methods: random audio scaling, time masking, frequency masking, adding random noise, and audio sample mixing (mixup);
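The listed augmentations could be sketched as follows; this is an illustrative implementation, applied for brevity directly to a log-Mel spectrogram X of shape (n_mels, n_frames) and its label matrix Y, with mixup drawing its mixing weight from a Beta distribution.

# A minimal augmentation sketch (not from the patent): scaling, noise,
# time/frequency masking and mixup applied to a log-Mel spectrogram.
import numpy as np

def augment(X, Y, X2=None, Y2=None, rng=np.random):
    X = X * rng.uniform(0.8, 1.2)                       # random scaling
    X = X + rng.normal(0, 0.01, X.shape)                # add random noise
    t0 = rng.randint(0, X.shape[1] - 10)                # time masking (10-frame block)
    X[:, t0:t0 + 10] = 0.0
    f0 = rng.randint(0, X.shape[0] - 8)                 # frequency masking (8-band block)
    X[f0:f0 + 8, :] = 0.0
    if X2 is not None:                                  # mixup with a second sample
        lam = rng.beta(0.2, 0.2)
        X = lam * X + (1 - lam) * X2
        if Y2 is not None:
            Y = np.maximum(Y, Y2)                       # union of event labels
    return X, Y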
2-4) building the convolutional recurrent neural network: the following convolutional recurrent neural network is built with the PyTorch framework: the first layer is the input layer, taking the 128-dimensional logarithmic Mel energy spectrum; the second layer is a 2-D convolutional layer with 16 channels followed by 2 × 2 pooling; the third layer is a 2-D convolutional layer with 32 channels followed by 2 × 2 pooling; the fourth layer is a 2-D convolutional layer with 64 channels followed by 2 × 2 pooling; the fifth layer is a 2-D convolutional layer with 128 channels followed by 2 × 1 pooling; the sixth layer is a 2-D convolutional layer with 256 channels followed by 2 × 1 pooling, after which the output feature map tensor is flattened; the seventh layer is a one-dimensional convolutional layer with 256 channels; the eighth layer is a bidirectional recurrent neural network with two GRU layers of 256 neurons; the ninth layer is the output layer, consisting of fully connected layers of 256 and 80 neurons with ReLU activation, followed by a final fully connected layer activated by sigmoid whose number of neurons equals the number of sound event categories. Each convolutional layer uses 3 × 3 convolution kernels with stride 1 and is followed by a batch normalization layer and ReLU activation;
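A sketch of this network under a PyTorch implementation is given below, assuming an input of shape (batch, 1, 128 Mel bands, T frames); the channel counts, pooling sizes, GRU width and 256/80-neuron head follow the text, while padding and dimension handling are implementation choices.

# A minimal sketch of the convolutional recurrent neural network described above.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        def block(cin, cout, pool):
            # 3x3 conv (stride 1) + batch normalization + ReLU + pooling
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=1, padding=1),
                nn.BatchNorm2d(cout), nn.ReLU(),
                nn.MaxPool2d(pool))
        self.cnn = nn.Sequential(
            block(1, 16, (2, 2)), block(16, 32, (2, 2)), block(32, 64, (2, 2)),
            block(64, 128, (2, 1)), block(128, 256, (2, 1)))
        self.conv1d = nn.Conv1d(256 * 4, 256, kernel_size=3, padding=1)  # after flattening the frequency axis
        self.gru = nn.GRU(256, 256, num_layers=2, bidirectional=True, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 80), nn.ReLU(),
            nn.Linear(80, n_classes), nn.Sigmoid())

    def forward(self, x):                      # x: (B, 1, 128, T)
        z = self.cnn(x)                        # (B, 256, 4, T/8)
        z = z.flatten(1, 2)                    # flatten channels x frequency -> (B, 1024, T/8)
        z = self.conv1d(z)                     # (B, 256, T/8)
        z, _ = self.gru(z.transpose(1, 2))     # (B, T/8, 512)
        return self.head(z)                    # frame-wise class probabilities (B, T/8, n_classes)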
2-5) training the convolutional recurrent neural network to learn the detection model: the training data, i.e. the logarithmic Mel energy spectra of the audio, are fed into the convolutional recurrent neural network built in step 2-4), whose initial weight parameters are assigned randomly by PyTorch, to obtain the output
Ŷ ∈ [0, 1]^(C×T),
where C is the number of sound event categories and T is the total number of frames. The loss is computed on the true-positive prediction labels: the double-Token label matrix Yp is multiplied element by element with Ŷ to obtain the output
Ŷp = Yp ⊙ Ŷ,
and finally the following binary cross-entropy loss function is computed:
BCE(Ŷp, Yp) = −Σ [ Yp · log(Ŷp) + (1 − Yp) · log(1 − Ŷp) ];
the gradient is back-propagated using the Adam gradient descent method with a learning rate of 0.001; the weight parameters are updated and training iterates until the loss no longer decreases, after which the model parameters are saved;
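The training step could be sketched as follows, assuming the CRNN sketch above and batches of (log-Mel features, double-Token label matrices) already pooled to the network's output frame rate; the element-wise masking Yp ⊙ Ŷ and the Adam settings (learning rate 0.001) follow the text.

# A minimal training-loop sketch for the masked binary cross-entropy loss.
import torch
import torch.nn as nn

def train(model, loader, device="cpu", epochs=100):
    opt = torch.optim.Adam(model.parameters(), lr=0.001)
    bce = nn.BCELoss()
    model.to(device).train()
    for epoch in range(epochs):
        total = 0.0
        for x, y in loader:                 # x: (B, 1, 128, T), y: (B, T', C) double-Token matrix
            x, y = x.to(device), y.to(device)
            y_hat = model(x)                # frame-wise probabilities
            y_hat_p = y * y_hat             # keep only true-positive positions (Yp element-wise Y_hat)
            loss = bce(y_hat_p, y)          # binary cross-entropy
            opt.zero_grad()
            loss.backward()                 # gradient back-propagation
            opt.step()
            total += loss.item()
        print(f"epoch {epoch}: loss {total / len(loader):.4f}")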
2-6) identifying the audio to be detected with the trained detection model: the unlabeled audio to be detected is normalized and its logarithmic Mel energy spectrum extracted, then fed into the trained convolutional recurrent neural network to obtain the probability output of the neural network, which is stored; an optimal decision threshold α is searched for with the F1-score as the criterion, and the output is binarized with the decision threshold α to obtain the prediction under the double-Token labels. Compared with a strong-label detection model, the double-Token result does not fully cover the true timestamps of the sound events, so to reduce false-negative predictions a label extension strategy is adopted, specifically: for the sound event start and end frame nodes determined by the double-Token prediction output matrix, compute the cosine similarity of the adjacent frames at the corresponding frame node of the neural network probability output matrix; if the similarity is greater than 0.5, extend the frame, i.e. extend the timestamps in the double-Token label matrix. The prediction matrix after label extension is finally obtained, giving the recognition result and completing the recognition; the extension to the left and right must not exceed a preset hyper-parameter collar value, generally taken from the maximum and minimum durations of all sound events (250 ms-50 ms).
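The post-processing described above could be sketched as follows; the thresholding with α and the cosine-similarity extension rule follow the text, while treating each class as one contiguous active region and handling the collar via a max_extend parameter are simplifying assumptions of this sketch.

# A minimal sketch of binarization with threshold alpha plus label extension.
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def extend_predictions(probs, alpha=0.5, sim_thr=0.5, max_extend=10):
    """probs: (C, T) network probability output; returns extended binary prediction (C, T)."""
    pred = (probs >= alpha).astype(np.float32)
    C, T = pred.shape
    for c in range(C):
        active = np.where(pred[c] > 0)[0]
        if active.size == 0:
            continue
        start, end = active[0], active[-1]
        # extend leftwards while adjacent probability columns stay similar
        f, steps = start, 0
        while f > 0 and steps < max_extend and cos_sim(probs[:, f], probs[:, f - 1]) > sim_thr:
            f -= 1; steps += 1
            pred[c, f] = 1.0
        # extend rightwards symmetrically
        f, steps = end, 0
        while f < T - 1 and steps < max_extend and cos_sim(probs[:, f], probs[:, f + 1]) > sim_thr:
            f += 1; steps += 1
            pred[c, f] = 1.0
    return pred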
This technical scheme targets the respective characteristics of strongly and weakly labeled data and finds a compromise by combining them with the deep-learning paradigm: it proposes a labor-saving idea for moving from weak labels toward strong labels, and updates the prediction results obtained under double-Token labeling through audio analysis and the machine learning model, so that the sound event detection results are better than those achievable with weak labels.
While ensuring accuracy, the method widens the range of recognizable sound events at lower cost and enables accurate sound event detection and monitoring in everyday living environments, thereby better serving the construction of smart cities.
Drawings
FIG. 1 is a flow chart of an identification process in an embodiment;
FIG. 2 is a schematic diagram illustrating the use of audio annotation software to annotate a segment of audio in the process of constructing a data set according to an embodiment;
FIG. 3 is a schematic diagram of double-Token labels in an embodiment containing 3 classes of sound events.
Detailed Description
The invention will be further elucidated with reference to the drawings and examples, without however being limited thereto.
Embodiment:
Referring to FIG. 1, a sound event labeling and identification method adopting double Token labels includes a sound event labeling process and an identification process, where the labeling process is as follows:
1-1) audio label format: the audio annotation software Audacity is used to play the original audio data containing various sound events, and the labeling step is: within the occurrence time range of each sound event in the audio, randomly select two Tokens, denoted Ci_start and Ci_end respectively, where C represents the sound event category;
1-2) repeating the labeling step to finish all audio labeling in the data set;
the identification process is as follows:
2-1) constructing an audio data set: sound event audio is collected according to the detection task requirements to form an audio data set; building the data set requires a large amount of labeled audio. The types of sound events to be detected are determined according to the detection requirements; in this embodiment there are 3 classes: gunshots, screams and sirens. These are obtained by manual recording and by downloading audio files containing gunshots, screams and sirens from video sites such as YouTube and Bilibili. The collected audio is cut into segments of 10 s duration, in which the various sound events may overlap repeatedly; through repeated recording and downloading, 3000 audio clips containing the three classes of sound events are obtained. The audio annotation software Audacity is used to play the 3000 clips of sound event audio to be labeled; while the audio plays, the sound event category and timestamp are marked with mouse clicks in the software's Label Track, completing the audio data labeling, as shown in FIG. 2. Each mouse click in the Audacity Label Track follows this labeling method: without carefully listening to determine the start and end points of the sound event, two points are randomly selected within the sounding time range of the heard sound event, giving two Tokens, Ci_start and Ci_end respectively, where C represents the sound event category. Because accurately labeling the boundaries of overlapping sound events in strong labeling requires repeated playback, this embodiment assigns the two Tokens at random to omit the time-consuming and tedious process of repeated playback and boundary determination, saving labor; the reduction of labeling information in this simplified method has a negative effect on recognition, which is resolved by designing a matching convolutional recurrent neural network. As shown in FIG. 3, the upper part of FIG. 3 simulates the double-Token labeling process, where only two Tokens are taken at random within the occurrence range of each sound event; the middle part of FIG. 3 shows the true label values; the lower part of FIG. 3 shows the double-Token label encoding matrix. Compared with the true values, the label coverage is incomplete because randomly selecting Tokens does not require carefully spending effort to determine the boundaries of overlapping sound events. Comparing strong labels with double-Token labels in label-matrix form, the vertical axis of the label matrix in FIG. 3 represents the sound event categories (assumed to be 3), the horizontal axis represents the frame index, 1 indicates that the sound event is sounding at the frame index shown, and blank (0) indicates that no sound event occurs;
2-2) audio data preprocessing and feature extraction:
for audio: because the audio may come from different recording devices, the processing platform resamples all audio at 16 kHz; after resampling, the audio waveform data are normalized so that the values fall in the range (-1, 1), using max normalization: x(t) = s(t) / max(|s(t)|). A 128-dimensional logarithmic Mel energy spectrum is then extracted from all audio using the short-time Fourier transform, with the following parameters: frame length nfft = 2048, sampling frequency 16 kHz, 1/2-frame overlap. Finally, the logarithmic Mel energy spectrum is z-score normalized: suppose the input logarithmic Mel energy spectra are X1, X2, ..., Xn, then
Yi = (Xi − μ) / σ,
wherein
μ = (1/n) Σ Xi and σ = sqrt( (1/n) Σ (Xi − μ)² ),
obtaining the normalized logarithmic Mel energy spectra Y1, Y2, ..., Yn with mean 0 and variance 1;
for audio labels: labels in units of seconds are converted into labels in units of frames; each label file is transformed to obtain an audio label encoding matrix in units of frames, consisting of 0 and 1 elements. The number of columns n of the matrix is the number of frames, here n = 160000; the number of rows m is the number of sound event categories, here m = 3. In this example nfft = 2048 and hop_length = 1/2 are used to extract the Mel energy spectrum, and an audio label encoding matrix containing m classes of sound events is converted from seconds to frames as follows:
step 1: assume the sampling rate is sr, here sr = 16000, and the audio duration is t, here 10 s; then the number of matrix columns n = sr × t = 160000 and the number of matrix rows m is the number of sound event categories, 3, i.e. a zero matrix with 3 rows and 160000 columns is generated;
step 2: determine the timestamp of each sound event in units of frames: let timestamp_second be the timestamp in seconds and hop_length the frame overlap; the timestamp conversion formula is:
timestamp_frame = timestamp_second ÷ 2048 ÷ (1 − 1/2);
step 3: the matrix values within the range covered by timestamp_frame, i.e. between the start frame and the end frame of each sound event, are changed from 0 to 1;
2-3) audio data augmentation: to improve the generalization performance of the neural network and prevent overfitting, the audio data are augmented from the original 3000 clips to three times that amount (9000) using the following augmentation methods: random audio scaling, time masking, frequency masking, adding random noise, and audio sample mixing (mixup);
2-4) building the convolutional recurrent neural network: the following convolutional recurrent neural network is built with the PyTorch framework: the first layer is the input layer, taking the 128-dimensional logarithmic Mel energy spectrum; the second layer is a 2-D convolutional layer with 16 channels followed by 2 × 2 pooling; the third layer is a 2-D convolutional layer with 32 channels followed by 2 × 2 pooling; the fourth layer is a 2-D convolutional layer with 64 channels followed by 2 × 2 pooling; the fifth layer is a 2-D convolutional layer with 128 channels followed by 2 × 1 pooling; the sixth layer is a 2-D convolutional layer with 256 channels followed by 2 × 1 pooling, after which the output feature map tensor is flattened; the seventh layer is a one-dimensional convolutional layer with 256 channels; the eighth layer is a bidirectional recurrent neural network with two GRU layers of 256 neurons; the ninth layer is the output layer, consisting of fully connected layers of 256 and 80 neurons with ReLU activation, followed by a final fully connected layer activated by sigmoid whose number of neurons equals the number of sound event categories. Each convolutional layer uses 3 × 3 convolution kernels with stride 1 and is followed by a batch normalization layer and ReLU activation;
2-5) training the convolutional recurrent neural network to learn the detection model: the training data, i.e. the logarithmic Mel energy spectra of the audio, are fed into the convolutional recurrent neural network built in step 2-4), whose initial weight parameters are assigned randomly by PyTorch, to obtain the output
Ŷ ∈ [0, 1]^(C×T),
where C is the number of sound event categories and T is the total number of frames. The loss is computed on the true-positive prediction labels: the double-Token label matrix Yp is multiplied element by element with Ŷ to obtain the output
Ŷp = Yp ⊙ Ŷ,
and finally the following binary cross-entropy loss function is computed:
BCE(Ŷp, Yp) = −Σ [ Yp · log(Ŷp) + (1 − Yp) · log(1 − Ŷp) ];
the gradient is back-propagated using the Adam gradient descent method with a learning rate of 0.001; the weight parameters are updated and training iterates until the loss no longer decreases, after which the model parameters are saved;
2-6) identifying the audio to be detected with the trained detection model: the unlabeled audio to be detected is normalized and its logarithmic Mel energy spectrum extracted, then fed into the trained convolutional recurrent neural network to obtain the probability output of the neural network, which is stored; an optimal decision threshold α is searched for with the F1-score as the criterion, and the output is binarized with the decision threshold α to obtain the prediction under the double-Token labels. Compared with a strong-label detection model, the double-Token result does not fully cover the true timestamps of the sound events, so to reduce false-negative predictions the invention adopts a label extension strategy, specifically: for the sound event start and end frame nodes determined by the double-Token prediction output matrix, compute the cosine similarity of the adjacent frames at the corresponding frame node of the neural network probability output matrix; if the similarity is greater than 0.5, extend the frame, i.e. extend the timestamps in the double-Token label matrix. The prediction matrix after label extension is finally obtained, giving the recognition result and completing the recognition; in this embodiment the extension to the left and right does not exceed the preset hyper-parameter collar value, taken from the maximum and minimum durations of all sound events (250 ms-50 ms).

Claims (1)

1. A sound event labeling and identifying method adopting double Token labels is characterized by comprising a sound event labeling process and an identifying process, wherein the sound event labeling process comprises the following steps:
1-1) audio label format: the audio annotation software Audacity is used to play the original audio data containing various sound events, and the labeling step is: within the occurrence time range of each sound event in the audio, randomly select two Tokens, denoted Ci_start and Ci_end respectively, where C represents the sound event category;
1-2) repeating the labeling step to finish all audio labeling in the data set;
the identification process is as follows:
2-1) constructing an audio data set: sound event audio is collected according to the detection task requirements to form an audio data set; first the types of sound events to be detected are determined, then the audio annotation software Audacity is used to play the sound event audio to be labeled; while the audio plays, the sound event category and timestamp are marked with mouse clicks in the software's Label Track, completing the audio data labeling: two points are randomly selected within the sounding time range of the heard sound event, giving two Tokens, Ci_start and Ci_end respectively, where C represents the sound event category; finally Audacity is used to export a label file, which records the audio file name, the sound event categories occurring under each audio file name, and the timestamp of each sound event;
2-2) audio data preprocessing and feature extraction:
for audio: all audio is resampled at 16 kHz; after resampling, the audio waveform data are normalized so that the values fall in the range (-1, 1), using max normalization: x(t) = s(t) / max(|s(t)|); a 128-dimensional logarithmic Mel energy spectrum is then extracted from all audio using the short-time Fourier transform, with parameters nfft = 2048, sampling frequency 16 kHz and 1/2-frame overlap; finally the logarithmic Mel energy spectrum is z-score normalized: suppose the input logarithmic Mel energy spectra are X1, X2, ..., Xn, then
Yi = (Xi − μ) / σ,
wherein
μ = (1/n) Σ Xi and σ = sqrt( (1/n) Σ (Xi − μ)² ),
obtaining the normalized logarithmic Mel energy spectra Y1, Y2, ..., Yn with mean 0 and variance 1;
for audio labels: labels in units of seconds are converted into labels in units of frames; each label file is transformed by the following steps to obtain an audio label encoding matrix in units of frames, consisting of 0 and 1 elements, where the number of columns n of the matrix is the number of frames and the number of rows m is the number of sound event categories; the conversion of an audio label encoding matrix containing m classes of sound events from seconds to frames is as follows:
step 1: assuming the sampling frequency is sr and the audio duration is t, generate a zero matrix with m rows and n columns, where the number of columns n = sr × t and the number of rows m is the number of sound event categories;
step 2: determine the timestamp of each sound event in units of frames: let timestamp_second be the timestamp in seconds, frame_length the frame length and hop_length the frame overlap; the timestamp conversion formula is:
timestamp_frame = timestamp_second ÷ nfft ÷ (frame_length − hop_length);
step 3: the matrix values within the range covered by timestamp_frame, i.e. between the start frame and the end frame of each sound event, are changed from 0 to 1;
2-3) audio data augmentation: the original audio data are augmented to three times the original amount using the following augmentation methods: random audio scaling, time masking, frequency masking, adding random noise, and audio sample mixing (mixup);
2-4) building the convolutional recurrent neural network: the following convolutional recurrent neural network is built with the PyTorch framework: the first layer is the input layer, taking the 128-dimensional logarithmic Mel energy spectrum; the second layer is a 2-D convolutional layer with 16 channels followed by 2 × 2 pooling; the third layer is a 2-D convolutional layer with 32 channels followed by 2 × 2 pooling; the fourth layer is a 2-D convolutional layer with 64 channels followed by 2 × 2 pooling; the fifth layer is a 2-D convolutional layer with 128 channels followed by 2 × 1 pooling; the sixth layer is a 2-D convolutional layer with 256 channels followed by 2 × 1 pooling, after which the output feature map tensor is flattened; the seventh layer is a one-dimensional convolutional layer with 256 channels; the eighth layer is a bidirectional recurrent neural network with two GRU layers of 256 neurons; the ninth layer is the output layer, consisting of fully connected layers of 256 and 80 neurons with ReLU activation, followed by a final fully connected layer activated by sigmoid whose number of neurons equals the number of sound event categories; each convolutional layer uses 3 × 3 convolution kernels with stride 1 and is followed by a batch normalization layer and ReLU activation;
2-5) training the convolutional recurrent neural network to learn the detection model: the training data, i.e. the logarithmic Mel energy spectra of the audio, are fed into the convolutional recurrent neural network built in step 2-4), whose initial weight parameters are assigned randomly by PyTorch, to obtain the output
Ŷ ∈ [0, 1]^(C×T),
where C is the number of sound event categories and T is the total number of frames; the loss is computed on the true-positive prediction labels: the double-Token label matrix Yp is multiplied element by element with Ŷ to obtain the output
Ŷp = Yp ⊙ Ŷ,
and finally the following binary cross-entropy loss function is computed:
BCE(Ŷp, Yp) = −Σ [ Yp · log(Ŷp) + (1 − Yp) · log(1 − Ŷp) ];
the gradient is back-propagated using the Adam gradient descent method with a learning rate of 0.001; the weight parameters are updated and training iterates until the loss no longer decreases, after which the model parameters are saved;
2-6) identifying the audio to be detected with the trained detection model: the unlabeled audio to be detected is normalized and its logarithmic Mel energy spectrum extracted, then fed into the trained convolutional recurrent neural network to obtain the probability output of the neural network, which is stored; an optimal decision threshold α is searched for with the F1-score as the criterion, and the output is binarized with the decision threshold α to obtain the prediction under the double-Token labels; the specific method is: for the sound event start and end frame nodes determined by the double-Token prediction output matrix, compute the cosine similarity of the adjacent frames at the corresponding frame node of the neural network probability output matrix; if the similarity is greater than 0.5, extend the frame, i.e. extend the timestamps in the double-Token label matrix; the prediction matrix after label extension is finally obtained, giving the recognition result and completing the recognition.
CN202110465526.7A 2021-04-28 2021-04-28 Sound event marking and identifying method adopting double Token labels Active CN113140226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110465526.7A CN113140226B (en) 2021-04-28 2021-04-28 Sound event marking and identifying method adopting double Token labels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110465526.7A CN113140226B (en) 2021-04-28 2021-04-28 Sound event marking and identifying method adopting double Token labels

Publications (2)

Publication Number Publication Date
CN113140226A CN113140226A (en) 2021-07-20
CN113140226B true CN113140226B (en) 2022-06-21

Family

ID=76816250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110465526.7A Active CN113140226B (en) 2021-04-28 2021-04-28 Sound event marking and identifying method adopting double Token labels

Country Status (1)

Country Link
CN (1) CN113140226B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299314B (en) * 2021-07-27 2021-11-02 北京世纪好未来教育科技有限公司 Training method, device and equipment of audio event recognition model
CN113963228B (en) * 2021-09-14 2024-07-02 电信科学技术第五研究所有限公司 Voice event extraction method based on deep learning feature connection analysis
CN113593606B (en) * 2021-09-30 2022-02-15 清华大学 Audio recognition method and device, computer equipment and computer-readable storage medium
CN114373484A (en) * 2022-03-22 2022-04-19 南京邮电大学 Voice-driven small sample learning method for Parkinson disease multi-symptom characteristic parameters
CN115206294B (en) * 2022-09-16 2022-12-06 深圳比特微电子科技有限公司 Training method, sound event detection method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065030A (en) * 2018-08-01 2018-12-21 上海大学 Ambient sound recognition methods and system based on convolutional neural networks
CN110827804A (en) * 2019-11-14 2020-02-21 福州大学 Sound event labeling method from audio frame sequence to event label sequence
CN110990534A (en) * 2019-11-29 2020-04-10 北京搜狗科技发展有限公司 Data processing method and device and data processing device
US10783434B1 (en) * 2019-10-07 2020-09-22 Audio Analytic Ltd Method of training a sound event recognition system
CN112447189A (en) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Voice event detection method and device, electronic equipment and computer storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11216724B2 (en) * 2017-12-07 2022-01-04 Intel Corporation Acoustic event detection based on modelling of sequence of event subparts
US11024291B2 (en) * 2018-11-21 2021-06-01 Sri International Real-time class recognition for an audio stream
KR102444411B1 (en) * 2019-03-29 2022-09-20 한국전자통신연구원 Method and apparatus for detecting sound event considering the characteristics of each sound event

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065030A (en) * 2018-08-01 2018-12-21 上海大学 Ambient sound recognition methods and system based on convolutional neural networks
US10783434B1 (en) * 2019-10-07 2020-09-22 Audio Analytic Ltd Method of training a sound event recognition system
CN110827804A (en) * 2019-11-14 2020-02-21 福州大学 Sound event labeling method from audio frame sequence to event label sequence
CN110990534A (en) * 2019-11-29 2020-04-10 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN112447189A (en) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Voice event detection method and device, electronic equipment and computer storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A first attempt at polyphonic sound event detection using connectionist temporal classification; Y. Wang; 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 20170619; 2986-2990 *
Environmental Sound Recognition Based on Double-input Convolutional Neural Network Model; M. Wang; 2020 IEEE 2nd International Conference on Civil Aviation Safety and Information Technology; 20210309; 620-624 *
Sound event detection using point-labeled data; Kim B; 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA); 20191231; 1-5 *
Street environmental sound event detection method based on improved log-Mel spectrogram features (in Chinese); Zhang Liujun; Journal of Guilin University of Electronic Technology; 20200531; 411-417 *
Research on multi-sound-event detection methods based on deep neural networks (in Chinese); Liu Yaming; China Master's Theses Full-text Database, Information Science and Technology; 20190831; I136-86 *

Also Published As

Publication number Publication date
CN113140226A (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN113140226B (en) Sound event marking and identifying method adopting double Token labels
Mac Aodha et al. Bat detective—Deep learning tools for bat acoustic signal detection
Priyadarshani et al. Automated birdsong recognition in complex acoustic environments: a review
Kasten et al. The remote environmental assessment laboratory's acoustic library: An archive for studying soundscape ecology
Mesaros et al. TUT database for acoustic scene classification and sound event detection
Heittola et al. Context-dependent sound event detection
Ntalampiras Bird species identification via transfer learning from music genres
Keen et al. A comparison of similarity-based approaches in the classification of flight calls of four species of North American wood-warblers (Parulidae)
Zhong et al. Acoustic detection of regionally rare bird species through deep convolutional neural networks
CN111180025B (en) Method, device and inquiry system for representing text vectors of medical records
CN111429943B (en) Joint detection method for music and relative loudness of music in audio
Zhang et al. Learning audio sequence representations for acoustic event classification
Hou et al. Transfer learning for improving singing-voice detection in polyphonic instrumental music
Morales et al. Method for passive acoustic monitoring of bird communities using UMAP and a deep neural network
Wang et al. Automated call detection for acoustic surveys with structured calls of varying length
CN107578785B (en) Music continuous emotion characteristic analysis and evaluation method based on Gamma distribution analysis
Soni et al. Automatic audio event recognition schemes for context-aware audio computing devices
Rulff et al. Urban Rhapsody: Large‐scale exploration of urban soundscapes
Xia et al. Sound event detection using multiple optimized kernels
CN117877516A (en) Sound event detection method based on cross-model two-stage training
Wang et al. A hierarchical birdsong feature extraction architecture combining static and dynamic modeling
Zhang et al. Learning audio sequence representations for acoustic event classification
Martin-Morato et al. On the robustness of deep features for audio event classification in adverse environments
Schuller et al. New avenues in audio intelligence: Towards holistic real-life audio understanding
Pan et al. Tree size estimation from a feller-buncher’s cutting sound

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20210720

Assignee: Wuhan xingeno Technology Co.,Ltd.

Assignor: GUILIN University OF ELECTRONIC TECHNOLOGY

Contract record no.: X2022450000387

Denomination of invention: An Acoustic Event Labeling and Recognition Method Using Double Token Tags

Granted publication date: 20220621

License type: Common License

Record date: 20221226
