CN113724733A - Training method of biological sound event detection model and detection method of sound event
- Publication number: CN113724733A
- Application number: CN202111012585.5A
- Authority: CN (China)
- Prior art keywords: audio, sample audio, sample, data set, feature vector
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a training method of a biological sound event detection model and a sound event detection method. In the pre-training stage, the recorded original audio (sample audio) containing a biological sound event is first resampled, manual features (a first feature matrix) are then extracted and fed into a high-dimensional feature extractor, class prototypes are calculated from the output high-dimensional feature vectors, and metric classification is performed. An embedding propagation module, designed to improve the generalization capability of the system, is used in the second stage of system training: during the second-stage fine-tuning, the high-dimensional feature vector output by the high-dimensional feature extractor first undergoes embedding propagation to obtain an embedded interpolation vector (a second feature vector), and a class prototype is then calculated to perform metric classification. This solves the technical problem in the prior art that model robustness is poor when training a biological sound event detection model.
Description
Technical Field
The invention relates to the technical field of Artificial Intelligence (AI) and sound event detection, and in particular to a training method of a biological sound event detection model and a detection method of a sound event.
Background
The excellent performance of deep learning methods on computer vision tasks such as classification, semantic segmentation and object detection has accelerated the development of artificial intelligence. At the same time, progress in the visual field alone cannot meet the growing demand for intelligent living, and the application requirements of intelligent audio technology in daily-life scenarios are increasingly diverse, for example acoustic scene classification, abnormal machine sound event detection, sound event localization, sound activity detection of domestic events, text content recognition of audio signals, and rare biological sound event detection.
Acoustic scene classification aims to identify the place where a device is located from the surrounding acoustic environment; different places differ both visually and acoustically, for example a loud train whistle is very unlikely to be heard in an office. Abnormal machine sound event detection refers to monitoring the running sound of a machine in real time so that an alarm can be raised promptly when the machine fails, greatly reducing the cost of manual inspection. Sound event localization aims to capture the spatio-temporal characteristics of an acoustic scene and make intelligent application scenarios observable, and can serve a wide range of machine cognition tasks, such as reasoning-based navigation. Sound activity monitoring of domestic events serves home intelligence by monitoring the various sound events in a household so that subsequent devices can take action. Text recognition of audio signals corresponds to the subtitle function commonly seen in audio and video. Rare biological sound event detection helps biological researchers determine the presence of species in nature in preparation for subsequent studies; in this case labelled data for the species is particularly difficult to obtain.
However, supervised deep learning approaches typically require training on large amounts of labelled data, which is scarce for most applications and costly to collect, for example bio-sound event detection of rare species and custom sound event detection. Sound event detection has a wide range of practical application scenarios, so a sound event detection technique that needs only a small amount of prior knowledge would bring great convenience to the task, but it is also very challenging. When a sound event detection task can only provide a small amount of target sample data, a deep network is usually needed instead of a simple shallow convolutional neural network in order to obtain a more effective high-dimensional feature representation; however, a fixed high-dimensional feature extraction model learned by a deep network on a small amount of training data that does not match the test data leads to overfitting.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the invention provide a training method of a biological sound event detection model and a detection method of a sound event, which at least solve the technical problem in the prior art that the robustness of the model is poor when training a biological sound event detection model.
According to an aspect of an embodiment of the present invention, there is provided a training method of a bio-acoustic event detection model, including: obtaining a sample audio data set containing biological sound events and a sample audio tag data set corresponding to the sample audio data set, wherein each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set; inputting each sample audio in the sample audio data set into a to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, wherein the to-be-trained sound event detection model comprises N types of standard audio used for comparison with the sample audio, and N is a positive integer greater than or equal to 1; processing the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio; performing a regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio; obtaining a predicted audio tag data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors, wherein the predicted audio tag data set comprises a predicted audio tag of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond one-to-one to the N types of standard audio; and determining the to-be-trained sound event detection model as a target sound detection model when the loss function corresponding to the sample audio tag data set and the predicted audio tag data set meets a preset condition.
Optionally, performing the regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain the second feature vector corresponding to each sample audio includes: obtaining a group of feature vectors corresponding to the high-dimensional feature vectors; calculating the Euclidean distance of each pair of feature vectors in the group of feature vectors; calculating an adjacency matrix from the Euclidean distances; applying a Laplacian operation to the adjacency matrix to obtain a propagation matrix; and determining the second feature vector from the propagation matrix.
Optionally, obtaining the predicted audio tag data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors includes: performing the following for each sample audio in the sample audio data set: calculating the similarity between the second feature vector corresponding to the sample audio and each of the N types of standard audio feature vectors to obtain N similarity values; determining the target standard audio feature vector corresponding to the minimum value among the N similarity values; and determining the target sample label of the type of standard audio corresponding to the target standard audio feature vector as the predicted audio label of the sample audio, wherein the predicted audio label data set of the sample audio data set comprises a predicted audio label of each sample audio in the sample audio data set.
Optionally, inputting each sample audio in the sample audio data set into the to-be-trained sound event detection model to obtain the first feature matrix corresponding to each sample audio in the sample audio data set includes: resampling each sample audio to obtain a sampled sample audio; and inputting the sampled sample audio into the to-be-trained sound event detection model to obtain the first feature matrix corresponding to each sample audio.
Optionally, inputting the sampled sample audio into the to-be-trained sound event detection model to obtain the first feature matrix corresponding to each sample audio includes: performing framing and windowing operations on the sampled sample audio to obtain an intermediate sample audio; and performing a discrete Fourier transform on the intermediate sample audio to obtain the first feature matrix.
Optionally, resampling each sample audio to obtain the sampled sample audio includes: performing up-sampling or down-sampling on each sample audio to obtain the sampled sample audio, i.e. the resampling comprises up-sampling or down-sampling.
According to another aspect of the embodiments of the present invention, a method for detecting a bio-acoustic event by using the target sound detection model determined by the above method includes: inputting a target biological sound to be detected into the target sound detection model to obtain a first feature matrix; processing the first feature matrix through a high-dimensional feature extractor to obtain a high-dimensional feature vector; performing a regularized interpolation operation on the high-dimensional feature vector to obtain a corresponding second feature vector; calculating similarity values between the second feature vector and each of the N types of standard feature vectors, and determining the type of standard feature vector corresponding to the minimum similarity value; and obtaining the target audio label corresponding to that type of standard feature vector, and determining the target label as the audio label of the target biological audio.
According to another aspect of the embodiments of the present invention, there is also provided a training apparatus for a bio-acoustic event detection model, including: an obtaining unit, configured to obtain a sample audio data set containing biological sound events and a sample audio tag data set corresponding to the sample audio data set, where each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set; a first input unit, configured to input each sample audio in the sample audio data set into a to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, where the to-be-trained sound event detection model comprises N types of standard audio used for comparison with the sample audio, and N is a positive integer greater than or equal to 1; a first processing unit, configured to process the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio; a first interpolation processing unit, configured to perform a regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio; a first determining unit, configured to determine, according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors, a predicted audio tag data set corresponding to the sample audio data set, where the predicted audio tag data set includes a predicted audio tag of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond one-to-one to the N types of standard audio; and a second determining unit, configured to determine the to-be-trained sound event detection model as a target sound detection model when the loss function corresponding to the sample audio tag data set and the predicted audio tag data set meets a preset condition.
Optionally, the first interpolation processing unit includes: an obtaining module, configured to obtain a group of feature vectors corresponding to the high-dimensional feature vectors; a first calculation module, configured to calculate the Euclidean distance of each pair of feature vectors in the group of feature vectors; a second calculation module, configured to calculate an adjacency matrix from the Euclidean distances; a third calculation module, configured to apply a Laplacian operation to the adjacency matrix to obtain a propagation matrix; and a first determining module, configured to determine the second feature vector from the propagation matrix.
Optionally, the first determining unit performs the following for each sample audio in the sample audio data set and includes: a fourth calculation module, configured to calculate the similarities between the second feature vector corresponding to each sample audio and the N types of standard audio feature vectors to obtain N similarity values; a second determining module, configured to determine the target standard audio feature vector corresponding to the minimum value among the N similarity values; and a third determining module, configured to determine the target sample label of the type of standard audio corresponding to the target standard audio feature vector as the predicted audio label of the sample audio, where the predicted audio label data set of the sample audio data set includes a predicted audio label of each sample audio in the sample audio data set.
Optionally, the first input unit includes: a sampling processing module, configured to resample each sample audio to obtain a sampled sample audio; and an input module, configured to input the sampled sample audio into the to-be-trained sound event detection model to obtain the first feature matrix corresponding to each sample audio.
Optionally, the input module includes: a processing submodule, configured to perform framing and windowing operations on the sampled sample audio to obtain an intermediate sample audio; and a fourth determining module, configured to perform a discrete Fourier transform on the intermediate sample audio to obtain the first feature matrix.
Optionally, the sampling processing module includes: a sampling processing sub-module, configured to perform up-sampling or down-sampling on each sample audio to obtain the sampled sample audio, i.e. the resampling processing comprises up-sampling or down-sampling.
According to another aspect of the embodiments of the present invention, there is also provided an apparatus for detecting a biological sound event using the target sound detection model determined by any of the above methods, including: a second input unit, configured to input a target biological sound to be detected into the target sound detection model to obtain a first feature matrix; a second processing unit, configured to process the first feature matrix through a high-dimensional feature extractor to obtain a high-dimensional feature vector; a second interpolation processing unit, configured to perform a regularized interpolation operation on the high-dimensional feature vector to obtain a corresponding second feature vector; a calculating unit, configured to calculate similarity values between the second feature vector and each of the N types of standard feature vectors and determine the type of standard feature vector corresponding to the minimum similarity value; and a third determining unit, configured to obtain the target audio label corresponding to that type of standard feature vector and determine the target label as the audio label of the target biological audio.
In the embodiment of the invention, in the pre-training stage, the recorded original audio (sample audio) containing biological sound events is first resampled, manual features (the first feature matrix) are then extracted and fed into a high-dimensional feature extractor, class prototypes are calculated from the output high-dimensional feature vectors, and metric classification is performed. In the second stage of system training, during fine-tuning, the high-dimensional feature vector output by the high-dimensional feature extractor undergoes embedding propagation to obtain an embedded interpolation vector (the second feature vector), and a class prototype is then calculated to perform metric classification.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware structure of a mobile terminal according to an alternative training method for a bio-sound event detection model in an embodiment of the present invention;
FIG. 2 is a flow chart of an alternative method of training a bioacoustic event detection model in accordance with embodiments of the present invention;
FIG. 3 is a flow chart of an alternative method of bio-sound event detection according to an embodiment of the present invention;
FIG. 4 is a flow chart of an alternative training method for a network bioacoustic event classification model based on pre-trained embedded propagation prototypes according to an embodiment of the present invention;
FIG. 5 is a flow chart of an alternative PCEN spectrogram extraction, according to an embodiment of the present invention;
FIG. 6 is a block diagram of an alternative residual convolutional neural network in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of an alternative acoustic detection model training strategy according to an embodiment of the present invention;
FIG. 8 is an apparatus diagram of an alternative method of training a bioacoustic event detection model in accordance with embodiments of the present invention;
fig. 9 is an apparatus diagram of an alternative method of bio-sound event detection according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The training of the bio-acoustic event detection model provided in the embodiments of the present application may be performed in a mobile terminal, a computer terminal, or a similar computing device. Taking the example of the method running on a mobile terminal, fig. 1 is a block diagram of a hardware structure of the mobile terminal of the training method for detecting a biological sound event according to the embodiment of the present invention. As shown in fig. 1, the mobile terminal 10 may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of application software, such as a computer program corresponding to the sound event detection method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In this embodiment, a training method of a biological sound event detection model is provided, fig. 2 is a flowchart of training a biological sound event detection model according to an embodiment of the present invention, and as shown in fig. 2, the training process of the biological sound event detection model includes the following steps:
step S202, a sample audio data set including a biological sound event and a sample audio tag data set corresponding to the sample audio data set are obtained, wherein each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set.
Step S204, inputting each sample audio in the sample audio data set into a to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, wherein the to-be-trained sound event detection model comprises N types of standard audio used for comparing with the sample audio, and N is a positive integer greater than or equal to 1.
Step S206, the first feature matrix corresponding to each sample audio is processed by the high-dimensional feature extractor, and a high-dimensional feature vector corresponding to each sample audio is obtained.
And step S208, performing regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio.
Step S210, obtaining a predicted audio tag data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N-type standard feature vectors, where the predicted audio tag data set includes a predicted audio tag of each sample audio in the sample audio data set, and the N-type standard feature vectors correspond to the N-type standard audios one to one.
Step S212, under the condition that the loss function corresponding to the sample audio label data set and the prediction audio label data set meets the preset condition, determining the sound event detection model to be trained as the target sound detection model.
Optionally, in this embodiment, the target sound detection model may include, but is not limited to, applying various audio processing-based scenarios, such as speech recognition, semantic recognition, and the like.
The sample audio data set may include, but is not limited to, the audio Development dataset of the bio-acoustic Sound Event Detection task in the DCASE2021 (Detection and Classification of Acoustic Scenes and Events) challenge, which contains 19 subclasses under 4 major classes: common birds, ferrets, porgy, and cave birds.
In this embodiment, the biological sound event detection model may use a residual convolutional neural network to map the input PCEN features into a high-dimensional space, f_φ(x_i), obtaining high-dimensional feature vector representations; each class prototype c_k can then be obtained by averaging the high-dimensional feature vectors of that class:

c_k = (1 / |S_k|) Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i)

where S_k is the set of training samples belonging to class k.
by the embodiment provided by the application, a sample audio data set containing a biological sound event and a sample audio tag data set corresponding to the sample audio data set are obtained, wherein each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set; inputting each sample audio frequency in the sample audio frequency data set into a to-be-trained sound event detection model to obtain a first characteristic matrix corresponding to each sample audio frequency in the sample audio frequency data set, wherein the to-be-trained sound event detection model comprises N types of standard audio frequencies used for comparing with the sample audio frequency, and N is a positive integer greater than or equal to 1; processing the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio; performing regularization interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio; obtaining a predicted audio label data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors, wherein the predicted audio label data set comprises a predicted audio label of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond to the N types of standard audio one to one; and under the condition that the loss functions corresponding to the sample audio label data set and the predicted audio label data set meet preset conditions, determining the sound event detection model to be trained as a target sound detection model. After the high-dimensional feature representation is obtained, a group of interpolation representations of the high-dimensional feature vectors are obtained through the embedding propagation module, and the robustness and the generalization capability of the detection model are improved through the embedding propagation regularization method.
Optionally, performing the regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain the second feature vector corresponding to each sample audio may include: obtaining a group of feature vectors corresponding to the high-dimensional feature vectors; calculating the Euclidean distance of each pair of feature vectors in the group; calculating an adjacency matrix from the Euclidean distances; applying a Laplacian operation to the adjacency matrix to obtain a propagation matrix; and determining the second feature vector from the propagation matrix.
In this embodiment, given a set of feature vectors {z_i} output by the network for an event, the squared Euclidean distance d_ij^2 = ||z_i - z_j||^2 is first computed for each pair of features. These distances are then used to calculate an adjacency matrix, e.g. with a radial basis function, A_ij = exp(-d_ij^2 / σ^2) for i ≠ j, and A_ii = 0. Next, a Laplacian operation is applied to the adjacency matrix, e.g. the symmetrically normalized form L = D^(-1/2) A D^(-1/2), where D is the degree matrix with D_ii = Σ_j A_ij.
From this, the propagation matrix P = (I - αL)^(-1) can be obtained. The propagation matrix maps each feature vector into another feature space, and each mapped feature can be viewed as a weighted sum of the other feature vectors, so the aggregated feature vectors have the effect of removing unwanted noise from the features. Moreover, the embedding method is simple to implement and is compatible with a wide range of feature extractors and classifiers.
Optionally, obtaining the predicted audio tag data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors may include: performing the following for each sample audio in the sample audio data set: calculating the similarity between the second feature vector corresponding to the sample audio and each of the N types of standard audio feature vectors to obtain N similarity values; determining the target standard audio feature vector corresponding to the minimum value among the N similarity values; and determining the target sample label of the type of standard audio corresponding to the target standard audio feature vector as the predicted audio label of the sample audio, where the predicted audio label data set of the sample audio data set comprises the predicted audio label of each sample audio in the sample audio data set.
Optionally, inputting each sample audio in the sample audio data set into the sound event detection model to be trained, to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, where the method may include: resampling each sample audio to obtain a sampled sample audio; and inputting the sampled sample audio into a sound event detection model to be trained to obtain a first feature matrix corresponding to each sample audio.
Optionally, inputting the sample audio into the to-be-trained sound event detection model to obtain the first feature matrix corresponding to each sample audio, which may include: performing framing and windowing processing operation on the sampling sample audio to obtain an intermediate sample audio; and carrying out discrete Fourier transform on the intermediate sample audio to obtain a first characteristic matrix.
Optionally, resampling each sample audio to obtain a sample audio, where the resampling may include: and performing up-sampling or down-sampling on each sample audio to obtain a sample audio, wherein the resampling comprises up-sampling or down-sampling.
According to another aspect of the embodiments of the present invention, a method for detecting a bio-acoustic event by using a target acoustic detection model determined by the above method may include:
step S302, inputting the target biological sound to be detected into a target sound detection model to obtain a first feature matrix.
And step S304, processing the first feature matrix through a high-dimensional feature extractor to obtain a high-dimensional feature vector.
And step S306, performing regularized interpolation operation on the high-dimensional feature vectors to obtain corresponding second feature vectors.
Step S308, calculating similarity values between the second feature vectors and the N types of standard feature vectors respectively, and determining the type of standard feature vector corresponding to the minimum similarity value.
Step S310, obtaining the target audio tag corresponding to that type of standard feature vector, and determining the target tag as the audio tag of the target biological audio.
According to the embodiment provided by the application, the target biological sound to be detected is input into the target sound detection model to obtain a first feature matrix; the first feature matrix is processed by a high-dimensional feature extractor to obtain a high-dimensional feature vector; a regularized interpolation operation is performed on the high-dimensional feature vector to obtain the corresponding second feature vector; similarity values between the second feature vector and each of the N types of standard feature vectors are calculated, and the type of standard feature vector corresponding to the minimum similarity value is determined; the target audio tag corresponding to that type of standard feature vector is obtained and determined as the audio tag of the target biological audio. In this way an accurate audio tag can be obtained.
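To make the inference flow of steps S302 to S310 concrete, the following is a minimal sketch; the names feature_extractor, class_prototypes and class_labels are hypothetical placeholders, and for brevity the embedding propagation of the query together with its support set is folded into the feature_extractor call:

```python
import numpy as np

def detect_event(target_pcen, feature_extractor, class_prototypes, class_labels):
    """Steps S302-S310: extract the high-dimensional (propagated) feature vector
    of the target audio, compute its distance to each class prototype, and
    return the label of the closest prototype (smallest distance = highest
    similarity in the sense used above)."""
    z = feature_extractor(target_pcen)                     # second feature vector
    dists = np.sum((class_prototypes - z) ** 2, axis=1)    # squared Euclidean distances
    return class_labels[int(np.argmin(dists))]             # predicted audio label
```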
As an alternative embodiment, a training method for the pre-training-based embedding propagation prototype network biological sound event classification model is provided, comprising two stages: pre-training and fine-tuning. The specific implementation is shown in fig. 4. The left half is the pre-training stage: the recorded original audio (sample audio) containing biological sound events is first resampled, manual features are then extracted and fed into the high-dimensional feature extractor, class prototypes are calculated from the output high-dimensional feature vectors, and metric classification is performed. The right half is the embedding propagation module, designed to improve the generalization capability of the system and used in the second stage of system training. In the second-stage fine-tuning, the high-dimensional feature vector output by the high-dimensional feature extractor undergoes embedding propagation to obtain an embedded interpolation vector, and a class prototype is then calculated to perform metric classification.
In this embodiment, the two-stage acoustic model training strategy and the embedding propagation regularization method enhance the model's representation of high-dimensional features and improve its robustness and generalization capability. A general-purpose representation model is learned in the pre-training stage, and the representation of the customized data is then learned in a targeted manner in the fine-tuning stage, which improves metric classification performance on data unseen during testing.
In this embodiment, a two-stage training method is proposed to improve the robustness of the model. As shown in fig. 4, there are four modules in the first stage: first, the recorded original audio containing biological sound events is converted into a PCEN feature matrix representation that is simple and convenient to process; second, a high-dimensional feature extractor based on a residual convolutional neural network (Resnet12); third, class prototypes are calculated from the high-dimensional feature vectors; and fourth, a metric classification module.
And the fine tuning training of the second stage is realized by adding a regularization method of embedded propagation in the first stage, so that the generalization capability of the prototype classification network is improved.
In this embodiment, a method for training a network biological sound event classification model based on pre-training embedded propagation prototype is provided, which includes the following 4 steps:
step one, extracting characteristics of diverse biological audio data
In this embodiment, the acquired audio data set may come from the DCASE (Detection and Classification of Acoustic Scenes and Events) challenge; the audio Development dataset of the bio-acoustic Sound Event Detection task in DCASE2021 is selected, which contains 19 subclasses under 4 major classes: common birds, ferrets, porgy, and cave birds.
The data set includes a Training dataset and a Validation dataset. The training data includes 4 subsets, i.e. the 4 major species, and each subset contains a different number of subclasses: BV contains 11 subclasses, with 5 audio files totalling 10 hours; HT contains 3 subclasses, with 3 audio files totalling 3 hours; JD contains 1 subclass, with 1 audio file totalling 10 minutes; MT contains 4 subclasses, with 2 audio files totalling 1 hour 10 minutes. Each audio in the validation data set has only two categories, target event sounds (positive examples) and non-target event sounds (negative examples), and it includes two subsets, HV (2 audio files, 2 hours) and PB (6 audio files, 3 hours). The details of the data set are shown in Table 1.
Table 1. DCASE2021 task5 development set data

| Data set | Number of audio files | Sampling rate (Hz) | Total duration | Number of categories | Number of events |
| --- | --- | --- | --- | --- | --- |
| BV | 5 | 24000 | 10 h | 11 | 2662 |
| HT | 3 | 6000 | 3 h | 3 | 435 |
| JD | 1 | 22050 | 10 min | 1 | 355 |
| MT | 2 | 8000 | 1 h 10 min | 4 | 1234 |
| HV | 2 | 6000 | 2 h | 2 | 50 |
| PB | 6 | 44100 | 3 h | 2 | 260 |
Original audio data are resampled according to bioacoustic characteristics, and then PCEN manual characteristics are extracted.
It should be noted that the sources of the biological audio data are different, and the sampling rates are different, and in this embodiment, as shown in fig. 5, a flowchart of PCEN spectrogram extraction is shown. The specific procedure is as follows.
First, the audio data are resampled (and the value range scaled), framed and windowed; a discrete Fourier transform is then applied; the frequency-domain features are then pooled by the energy of a set of Mel-frequency filter bands to obtain one feature value per band; and finally the PCEN operation is applied.
In the embodiment, uniform resampling processing is adopted for the class imbalance problem, and global regularization processing is performed on the features before the features are sent to the network.
In practical applications, all sample audio segments are 0.2 s long, and per-channel energy normalized manual features (PCEN) are extracted as the input to the model. PCEN is a feature extraction approach that improves the robustness of the spectrogram to channel distortion by combining Dynamic Range Compression (DRC) and Adaptive Gain Control (AGC) with temporal integration. Before PCEN feature extraction the original audio is scaled to the range [-2^31, 2^31]. The specific process is as follows: the audio data first undergo value-range scaling (resampling), framing and windowing; a discrete Fourier transform is then applied; the frequency-domain features are pooled by the energy of a set of Mel-frequency filter bands to obtain one feature value per band; and finally the PCEN operation is applied:

PCEN(t,f) = (E(t,f) / (ε + M(t,f))^α + δ)^r - δ^r

M(t,f) = s·E(t,f) + (1-s)·M(t-1,f)

where 0 < α, ε, r < 1, δ > 1, E(t,f) is the Mel time-frequency energy, and M(t,f) is its smoothed version. In the practical application of training the biological sound event detection model, because the sampling rates of the data differ greatly, up-sampling or down-sampling is applied so that all audio has a common sampling rate of 22050 Hz. In the framing operation, the frame length and the frame shift are 1024 and 256 sampling points respectively; the Mel filter bank uses 128 triangular windows, so a 128-dimensional spectrogram of 17 frames is finally obtained.
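Under the stated settings (22050 Hz, frame length 1024, frame shift 256, 128 Mel bands), the feature extraction can be sketched with librosa roughly as follows; this is an illustrative approximation, not the exact implementation of the embodiment:

```python
import librosa
import numpy as np

def extract_pcen(path, sr=22050, n_fft=1024, hop_length=256, n_mels=128):
    """Resample the audio to a common rate, frame/window it via the STFT,
    pool the spectrum with a Mel filter bank, and apply PCEN."""
    y, _ = librosa.load(path, sr=sr)                      # load + resample to 22050 Hz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        n_mels=n_mels, power=1.0)                         # magnitude Mel spectrogram
    # Scale toward the integer range [-2^31, 2^31] mentioned above before PCEN
    pcen = librosa.pcen(mel * (2 ** 31), sr=sr, hop_length=hop_length)
    return pcen.astype(np.float32)                        # shape: (n_mels, n_frames)
```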
Step two, building a prototype network model
In this embodiment, the prototype classification network uses a convolutional neural network (CNN) for metric-based classification of few-sample events. Through training, the convolutional neural network part serves as a high-dimensional feature extractor with strong generalization capability. Its input is either manually extracted features or a feature representation of the original audio extracted by other models. These features are mapped by the convolutional neural network of the prototype network into high-dimensional feature vector representations, and the mean vector of the high-dimensional features of each class is computed as that class's prototype.
In metric classification, the sample to be classified is represented by the corresponding high-dimensional feature vector through the same convolutional neural network, and classification is then performed by measuring the distance to each class prototype: the sample is assigned to the class whose prototype is nearest. Common distance measures include the squared Euclidean distance, cosine similarity, and the like.
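A minimal PyTorch sketch of the prototype computation and metric classification just described follows; the tensor shapes and helper names are assumptions made for illustration:

```python
import torch

def class_prototypes(support_feats, support_labels, n_classes):
    """Average the high-dimensional feature vectors of each class to obtain
    the class prototypes c_k (support_feats: [n_support, d])."""
    return torch.stack([
        support_feats[support_labels == k].mean(dim=0) for k in range(n_classes)
    ])

def metric_classify(query_feats, prototypes):
    """Assign each query to the class whose prototype is closest under the
    squared Euclidean distance (cosine similarity would be a drop-in choice)."""
    d2 = torch.cdist(query_feats, prototypes, p=2) ** 2    # [n_query, n_classes]
    return d2.argmin(dim=1)                                # predicted class indices
```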
Step three, pre-training in the first stage
It should be noted that deepening the number of network layers is beneficial to effectively extracting high-dimensional feature information, but for a training mode with few samples, the overfitting phenomenon is aggravated, and the generalization of the model is not facilitated.
In this embodiment, a pre-training-based two-stage acoustic model training strategy learns a general biological sound high-dimensional feature representation model by training a large amount of animal sound event data in a pre-training stage, so as to provide a better initialization model for the second-stage training, accelerate convergence of a target task, and avoid an overfitting situation on a small sample data set. Therefore, the deep network can be used for extracting effective high-dimensional feature representation, and the overfitting problem caused by too few samples and too deep network can be prevented.
The training data consists of a large amount of animal sound audio. A PCEN feature matrix (corresponding to the first feature matrix) is first extracted and fed into a prototype network that uses Resnet12 as the high-dimensional feature extractor to extract high-dimensional features and compute class prototypes; the squared Euclidean distance between the high-dimensional feature vector of a sample under test and each class prototype is then computed, and the predicted class is the one whose prototype is closest. After this training is completed, the model parameters are fixed and used as the initialization parameters of the next-stage model.
In the pre-training stage, AudioSet data is fed into Resnet12 in batches to obtain high-dimensional feature vector representations of length 1024. A mean class prototype is computed for each class, the similarity (squared Euclidean distance) between the sample under test and each class prototype is calculated, and the class of the sample is predicted from these similarities.
Step four, second-stage fine-tuning training based on embedding propagation
Embedding Propagation (EP) is an unsupervised, non-parametric regularization method belonging to the family of manifold smoothing. One drawback of prototype networks used for few-sample event classification is that they easily overfit to the small amount of training data, while in practical applications the training data often differ greatly from the test data; the trained prototype network therefore needs strong generalization capability so that it can adapt to and extract high-dimensional features from more types of audio. Embedding propagation uses the similarity of the network's output high-dimensional features on a graph (constructed from the pairwise similarities of the high-dimensional features using Radial Basis Functions (RBFs)) to output a set of interpolations that capture higher-order interactions between the embedded vectors; using interpolated embeddings yields smoother decision boundaries and increases the robustness of the model to noise.
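To make the computation concrete, the following is a minimal NumPy sketch of the embedding propagation described here and in the earlier section (pairwise distances, RBF adjacency, normalized Laplacian, propagation matrix); the scale parameter sigma and the coefficient alpha are assumed values, not ones fixed by the invention:

```python
import numpy as np

def embedding_propagation(z, alpha=0.5, sigma=1.0):
    """Map a batch of high-dimensional feature vectors z (shape [n, d]) to
    their embedded interpolation vectors: pairwise squared Euclidean distances
    -> RBF adjacency with zero diagonal -> normalized Laplacian ->
    propagation matrix P = (I - alpha*L)^-1 -> propagated embeddings P @ z."""
    n = z.shape[0]
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)  # pairwise distances
    A = np.exp(-d2 / sigma ** 2)                                 # RBF similarities
    np.fill_diagonal(A, 0.0)                                     # A_ii = 0
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg + 1e-8))
    L = d_inv_sqrt @ A @ d_inv_sqrt                              # normalized Laplacian
    P = np.linalg.inv(np.eye(n) - alpha * L)                     # propagation matrix
    return P @ z                                                 # interpolated embeddings
```

Each row of the result is a weighted combination of all feature vectors in the batch, which is what produces the smoother decision boundaries mentioned above.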
In the second stage, PCEN manual features are first extracted from the original audio (the specific data) and fed into the prototype network, whose initialization parameters are the model parameters obtained from first-stage training. After the high-dimensional feature representation is obtained, a group of interpolated representations of the high-dimensional feature vectors is obtained through the embedding propagation module; the class prototypes are computed on this basis, and classification prediction in the subsequent calculation and training stages follows the same metric classification criterion. The embedding propagation module is non-parametric and does not need to be initialized.
In the fine-tuning stage, the training data of DCASE2021 task5 is used; the same high-dimensional feature vector representation is extracted, the features are then passed through the embedding propagation module to obtain their regularized interpolated representations, and the class prototypes and similarities are calculated by the same method for classification.
In this embodiment, the prototype network uses a residual convolutional neural network whose specific structure is shown in fig. 6. The input PCEN features are mapped into a high-dimensional space through f_φ(x_i) to obtain high-dimensional feature vector representations, and each class prototype c_k can be obtained by averaging the high-dimensional feature vectors of that class:

c_k = (1 / |S_k|) Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i)

where S_k is the set of training samples belonging to class k.
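The text does not reproduce the exact layer configuration of fig. 6, so the following PyTorch block is only an assumed Resnet12-style residual block (three 3x3 convolutions, a 1x1 shortcut projection, 2x2 max pooling), given for illustration; stacking several such blocks followed by global pooling would produce the high-dimensional vector f_φ(x_i):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """One assumed residual block of a Resnet12-style feature extractor."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # 1x1 projection
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        # Residual sum followed by ReLU and spatial down-sampling
        return self.pool(F.relu(self.body(x) + self.shortcut(x)))
```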
in the embodiment, the prototype network based on the measurement can perform training of less sample data by constructing the epicode, and compared with the traditional supervised learning, the overfitting problem caused by less data is alleviated to a certain extent. However, the difference between the training data and the test data is still a concern. Through a two-stage acoustic model training strategy, universal initialization parameters of a high-dimensional biological sound feature extraction model are provided for training on a specific data set, model convergence is accelerated, and model robustness is improved. In addition, classification boundaries between classes can be blurred through embedding a propagated regularization mode, so that the classification capability of the model is improved.
It should be noted that, in order to improve the robustness of the model and the detection performance of the system, fig. 7 provides a schematic diagram of the acoustic detection model training strategy, which is described in detail as follows.
1) AudioSet pre-training phase
AudioSet contains weakly labelled 10 s clips and strongly labelled data, covering about 600 classes. The weakly labelled data comes in two forms: one is audio automatically cut from YouTube videos according to the start/end point label information and the corresponding audio URLs provided by AudioSet; the other is 128-dimensional features extracted by the VGGish model. The strongly labelled data has only one form: audio cut from YouTube videos, for which AudioSet provides the label, the URL and the corresponding strong-label timestamp information. The data used for pre-training are the 39 subclasses of the "Animal" class in the strongly labelled data (excluding 4 duplicate or ambiguous subclasses, e.g. the "hissing" of snakes and of steamers belongs to the same label, and the "roaring" of various different animals belongs to the same label), a total of 17.8 hours.
PCEN features are first extracted from the audio data and fed into a 12-layer residual network, and metric-based loss training is carried out by partitioning the data into episodes in the manner of a prototypical network. The data are fed into the system in batches of 16 classes with 5 samples per class, the same per-class sample count used in the second-stage fine-tuning.
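As an illustration of this episodic, metric-based training, the sketch below (PyTorch assumed) builds one 16-way, 5-shot episode and computes a prototypical loss; the query count per class and the requirement that each class hold enough examples are assumptions not stated in the text, and the features are taken as already encoded.

```python
import torch
import torch.nn.functional as F

def sample_episode(features_by_class, n_way=16, k_shot=5, q_query=5):
    """Randomly build one episode from a dict {class_id: tensor [num_examples, dim]}."""
    class_ids = list(features_by_class.keys())
    chosen = [class_ids[i] for i in torch.randperm(len(class_ids))[:n_way].tolist()]
    support, query, query_labels = [], [], []
    for new_label, c in enumerate(chosen):
        pool = features_by_class[c]
        idx = torch.randperm(pool.shape[0])[: k_shot + q_query]
        support.append(pool[idx[:k_shot]])   # [k_shot, dim] support examples of this class
        query.append(pool[idx[k_shot:]])     # [q_query, dim] query examples of this class
        query_labels += [new_label] * q_query
    return torch.stack(support), torch.cat(query), torch.tensor(query_labels)

def prototypical_loss(support, query, query_labels):
    """Metric-based loss: softmax over negative squared distances to class prototypes."""
    prototypes = support.mean(dim=1)              # [n_way, dim], one prototype per class
    dists = torch.cdist(query, prototypes) ** 2   # [n_query_total, n_way]
    return F.cross_entropy(-dists, query_labels)
```

In the actual pipeline, the PCEN frames would first pass through the 12-layer residual network to produce the per-class feature tensors used here.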
2) DCASE2021 task5 training dataset trimming phase:
The DCASE2021 task 5 training data set amounts to 14 hours and 20 minutes. Because of the class imbalance in the data set, the audio data are first evenly resampled after PCEN feature extraction so that all 19 classes in the data set have the same number of samples. The data are then fed in batches into the prototypical network, now carrying the better initialization parameters, and pass sequentially through high-dimensional feature extraction, embedding propagation and interpolated feature vector representation, after which metric-based classification is performed.
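A minimal sketch of the class-balancing step, assuming the examples are grouped per class and that balancing is done by random oversampling with replacement up to the largest class size (the exact balancing scheme is not specified in the text):

```python
import random

def balance_classes(examples_by_class):
    """Oversample smaller classes (with replacement) so every class reaches the size of the largest one."""
    target = max(len(items) for items in examples_by_class.values())
    balanced = {}
    for label, items in examples_by_class.items():
        extra = random.choices(items, k=target - len(items)) if len(items) < target else []
        balanced[label] = list(items) + extra
    return balanced
```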
It is further noted that the detection system may be evaluated after the training of the bio-acoustic event detection model.
In this example, F-measure was used to evaluate the performance of the system in the following way:
the recall rate and the precision are two mutually contradictory measurement criteria, and the F-measure gives consideration to the two indexes and calculates the harmonic average of the recall rate and the precision. In the embodiment, the F-measure final score is obtained by each subset, namely, the F-measure of each subset is calculated firstly, and then the F-measures of all the subsets are taken and averaged. TP is counted when the timestamp of the predicted event is more than 30% of the intersection ratio (IoU) with the true tag. The UNK class is included in the data set, i.e. the unknown animal is called, and is processed separately because the prediction of the system has the correct possibility although the human ear cannot be identified. The method specifically comprises the following steps:
it should be noted that if the audio does not predict positive/unknown, the FN counts are the total positive events of the whole audio, i.e. the top 5 known tags of each long audio are counted as FN.
The same feature extraction is applied to the audio data of the test set, the PCEN features of the data to be tested are fed into the classification system to obtain the final prediction result, and the F-measure is calculated from the output.
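A minimal sketch of the IoU-based scoring described above, under the assumption that events are (onset, offset) pairs in seconds, that a prediction counts as TP when its temporal IoU with a ground-truth event exceeds 0.3, and that matching is done greedily one-to-one (the matching scheme is an assumption):

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (onset, offset) intervals, in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

def count_tp_fn(predictions, ground_truth, iou_threshold=0.3):
    """Greedily match predictions to ground-truth events; unmatched ground truth counts as FN."""
    matched = set()
    tp = 0
    for p in predictions:
        for i, g in enumerate(ground_truth):
            if i not in matched and temporal_iou(p, g) > iou_threshold:
                matched.add(i)
                tp += 1
                break
    fn = len(ground_truth) - len(matched)
    return tp, fn
```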
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a device for detecting a sound event is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 8 is a block diagram of a training apparatus for a bio-acoustic event detection model according to an embodiment of the present invention; as shown in Fig. 8, the apparatus includes:
an obtaining unit 81, configured to obtain a sample audio data set including a biological sound event and a sample audio tag data set corresponding to the sample audio data set, where each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set.
The first input unit 83 is configured to input each sample audio in the sample audio data set into a to-be-trained sound event detection model, so as to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, where the to-be-trained sound event detection model includes N types of standard audio used for comparison with the sample audio, and N is a positive integer greater than or equal to 1.
And the first processing unit 85 is configured to process the first feature matrix corresponding to each sample audio through the high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio.
The first interpolation processing unit 87 is configured to perform regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio.
The first determining unit 89 is configured to determine, according to the second feature vector corresponding to each sample audio and the N-type standard feature vectors, a predicted audio tag data set corresponding to the sample audio data set, where the predicted audio tag data set includes a predicted audio tag of each sample audio in the sample audio data set, and the N-type standard feature vectors correspond to the N-type standard audios one to one.
The second determining unit 811 is configured to determine the sound event detection model to be trained as the target sound detection model when the loss function corresponding to the sample audio tag data set and the predicted audio tag data set satisfies a preset condition.
With the embodiment provided by the present application, the obtaining unit 81 obtains a sample audio data set containing a biological sound event and a sample audio tag data set corresponding to the sample audio data set, where each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set; the first input unit 83 inputs each sample audio in the sample audio data set into the to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, where the to-be-trained sound event detection model includes N types of standard audio used for comparison with the sample audio, and N is a positive integer greater than or equal to 1; the first processing unit 85 processes the first feature matrix corresponding to each sample audio through the high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio; the first interpolation processing unit 87 performs a regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio; the first determining unit 89 determines a predicted audio tag data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors, where the predicted audio tag data set includes a predicted audio tag of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond to the N types of standard audio one to one; and the second determining unit 811 determines the sound event detection model to be trained as the target sound detection model when the loss function corresponding to the sample audio tag data set and the predicted audio tag data set satisfies the preset condition. This improves the robustness of the model, and the added embedding propagation regularization improves its generalization capability.
Optionally, the first interpolation processing unit 87 may include: an acquisition module for acquiring a group of feature vectors corresponding to the high-dimensional feature vectors; a first calculation module for calculating the Euclidean distance between each pair of feature vectors in the group of feature vectors; a second calculation module for calculating an adjacency matrix from the Euclidean distances; a third calculation module for performing a Laplacian operation on the adjacency matrix to obtain a propagation matrix; and a first determining module for determining the second feature vector from the propagation matrix.
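A minimal sketch of these embedding propagation steps, assuming an RBF kernel for the adjacency matrix, a symmetrically normalized Laplacian, and a propagation matrix of the form (I + αL)^(-1); σ and α are free hyperparameters not given in the text:

```python
import torch

def embedding_propagation(z, sigma=1.0, alpha=0.5):
    """z: [n, dim] high-dimensional feature vectors; returns their regularized interpolations."""
    n = z.shape[0]
    dist2 = torch.cdist(z, z) ** 2                        # pairwise squared Euclidean distances
    adjacency = torch.exp(-dist2 / (2 * sigma ** 2))      # RBF adjacency matrix
    adjacency.fill_diagonal_(0)
    degree = adjacency.sum(dim=1).clamp(min=1e-8)
    d_inv_sqrt = torch.diag(degree.rsqrt())
    laplacian = torch.eye(n) - d_inv_sqrt @ adjacency @ d_inv_sqrt    # symmetric normalized Laplacian
    propagation = torch.linalg.inv(torch.eye(n) + alpha * laplacian)  # propagation matrix
    return propagation @ z                                 # second (interpolated) feature vectors
```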
Optionally, the first determining unit 89 may include modules that perform the following for each sample audio in the sample audio data set: a fourth calculation module for calculating the similarity between the second feature vector corresponding to each sample audio and the N types of standard audio feature vectors, respectively, to obtain N similarity values; a second determining module for determining the target standard audio feature vector corresponding to the minimum value among the N similarity values; and a third determining module for determining the target sample label of the type of standard audio corresponding to the target standard audio feature vector as the predicted audio label of the sample audio, where the predicted audio label data set of the sample audio data set includes a predicted audio label of each sample audio in the sample audio data set.
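A minimal sketch of this prediction rule, assuming the "similarity" being minimized is the Euclidean distance between the interpolated feature vector and each class prototype (the exact similarity measure is not specified here):

```python
import torch

def predict_label(second_vector, prototypes, labels):
    """second_vector: [dim]; prototypes: [N, dim]; labels: list of the N class labels."""
    distances = torch.norm(prototypes - second_vector, dim=1)   # one distance per class
    return labels[int(torch.argmin(distances))]                 # label of the closest class
```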
Optionally, the first input unit 83 may include: a sampling processing module for resampling each sample audio to obtain a sampling sample audio; and an input module for inputting the sampling sample audio into the sound event detection model to be trained to obtain a first feature matrix corresponding to each sample audio.
The first input module may include: a processing submodule for performing framing and windowing operations on the sampling sample audio to obtain an intermediate sample audio; and a fourth determining module for performing a discrete Fourier transform on the intermediate sample audio to obtain the first feature matrix.
Optionally, the sampling processing module includes a sampling processing sub-module configured to perform up-sampling or down-sampling on each sample audio to obtain a sampling sample audio, where the resampling processing comprises up-sampling or down-sampling.
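A minimal sketch of the resampling, framing/windowing and DFT steps, assuming librosa is used and that PCEN is computed on a mel spectrogram; the target sampling rate, FFT size, hop length and PCEN input scaling are placeholder values not specified in the text:

```python
import librosa
import numpy as np

def first_feature_matrix(path, target_sr=22050, n_fft=1024, hop_length=256):
    """Resample, frame/window, apply the DFT, then compute PCEN on a mel spectrogram."""
    y, sr = librosa.load(path, sr=None)                              # load at the native rate
    y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)         # up- or down-sampling
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length,
                        window="hann")                               # framing + windowing + DFT
    mel = librosa.feature.melspectrogram(S=np.abs(stft) ** 2, sr=target_sr)
    return librosa.pcen(mel * (2 ** 31), sr=target_sr, hop_length=hop_length)
```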
Fig. 9 is a block diagram of a bio-acoustic event detection apparatus according to an embodiment of the present invention; as shown in Fig. 9, the apparatus includes:
The second input unit 901 is configured to input a target biological sound to be detected into the target sound detection model to obtain a first feature matrix.
And a second processing unit 903, configured to process the first feature matrix through a high-dimensional feature extractor to obtain a high-dimensional feature vector.
And the second interpolation processing unit 905 is configured to perform regularized interpolation operation on the high-dimensional feature vectors to obtain corresponding second feature vectors.
The calculating unit 907 is configured to calculate similarity values between the second feature vectors and the N types of standard feature vectors, and determine a type of standard feature vector corresponding to the smallest similarity value.
And a third determining unit 909, configured to obtain a target audio tag corresponding to one type of standard feature vector, and determine the target audio tag as an audio tag of the target bio-audio.
With the embodiment provided by the present application, the second input unit 901 inputs the target biological sound to be detected into the target sound detection model to obtain a first feature matrix; the second processing unit 903 processes the first feature matrix through the high-dimensional feature extractor to obtain a high-dimensional feature vector; the second interpolation processing unit 905 performs a regularized interpolation operation on the high-dimensional feature vector to obtain the corresponding second feature vector; the calculating unit 907 calculates the similarity values between the second feature vector and the N types of standard feature vectors and determines the type of standard feature vector corresponding to the smallest similarity value; and the third determining unit 909 acquires the target audio label corresponding to that type of standard feature vector and determines it as the audio label of the target biological audio. Determining the audio label from the second feature vector obtained by the regularized interpolation operation improves the accuracy with which the model determines the audio label.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring a sample audio data set containing a biological sound event and a sample audio label data set corresponding to the sample audio data set, wherein each sample audio in the sample audio data set corresponds to one sample audio label in the sample audio label data set;
s2, inputting each sample audio in the sample audio data set into a to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, wherein the to-be-trained sound event detection model comprises N types of standard audio used for comparing with the sample audio, and N is a positive integer greater than or equal to 1;
s3, processing the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio;
s4, performing regularization interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio;
s5, obtaining a predicted audio label data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors, wherein the predicted audio label data set comprises a predicted audio label of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond to the N types of standard audio one to one;
and S6, under the condition that the loss function corresponding to the sample audio label data set and the predicted audio label data set meets the preset condition, determining the sound event detection model to be trained as the target sound detection model.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring a sample audio data set containing a biological sound event and a sample audio label data set corresponding to the sample audio data set, wherein each sample audio in the sample audio data set corresponds to one sample audio label in the sample audio label data set;
s2, inputting each sample audio in the sample audio data set into a to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, wherein the to-be-trained sound event detection model comprises N types of standard audio used for comparing with the sample audio, and N is a positive integer greater than or equal to 1;
s3, processing the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio;
s4, performing regularization interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio;
s5, obtaining a predicted audio label data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors, wherein the predicted audio label data set comprises a predicted audio label of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond to the N types of standard audio one to one;
and S6, under the condition that the loss function corresponding to the sample audio label data set and the predicted audio label data set meets the preset condition, determining the sound event detection model to be trained as the target sound detection model.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Claims (16)
1. A training method of a biological sound event detection model is characterized by comprising the following steps:
obtaining a sample audio data set containing a biological sound event and a sample audio tag data set corresponding to the sample audio data set, wherein each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set;
inputting each sample audio frequency in the sample audio frequency data set into a to-be-trained sound event detection model to obtain a first characteristic matrix corresponding to each sample audio frequency in the sample audio frequency data set, wherein the to-be-trained sound event detection model comprises N types of standard audio frequencies used for comparing with the sample audio frequency, and N is a positive integer greater than or equal to 1;
processing the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio;
performing regularization interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio;
determining a predicted audio label data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and N types of standard feature vectors, wherein the predicted audio label data set comprises a predicted audio label of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond to the N types of standard audio one to one;
and under the condition that the loss function corresponding to the sample audio tag data set and the predicted audio tag data set meets a preset condition, determining the sound event detection model to be trained as a target sound detection model.
2. The method according to claim 1, wherein the performing a regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio comprises:
acquiring a group of feature vectors corresponding to the high-dimensional feature vectors;
calculating a Euclidean distance for each pair of feature vectors in the set of feature vectors;
calculating an adjacency matrix according to the Euclidean distance;
performing a Laplacian operation on the adjacency matrix to obtain a propagation matrix;
and determining the second feature vector according to the propagation matrix.
3. The method according to claim 1, wherein determining the predicted audio tag data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors comprises:
performing the following for each sample audio in the sample audio data set:
calculating the similarity between the second feature vector corresponding to each sample audio and the N types of standard audio feature vectors respectively to obtain N similarity values;
determining a target standard audio feature vector corresponding to the minimum value in the N similarity values;
and determining a target sample label of a type of standard audio corresponding to the target standard audio feature vector as a predicted audio label of the sample audio, wherein the predicted audio label data set of the sample audio data set comprises a predicted audio label of each sample audio in the sample audio data set.
4. The method of claim 1, wherein the inputting each sample audio in the sample audio data set into a sound event detection model to be trained to obtain a first feature matrix corresponding to each sample audio in the sample audio data set comprises:
resampling each sample audio to obtain a sampled sample audio;
and inputting the sampling sample audio into the sound event detection model to be trained to obtain the first feature matrix corresponding to each sample audio.
5. The method according to claim 4, wherein the inputting the sample audio into the sound event detection model to be trained to obtain the first feature matrix corresponding to each sample audio comprises:
performing framing and windowing processing operation on the sampling sample audio to obtain an intermediate sample audio;
and performing discrete Fourier transform on the intermediate sample audio to obtain the first characteristic matrix.
6. The method of claim 4, wherein the resampling each sample audio to obtain a sampling sample audio comprises:
and performing up-sampling or down-sampling on each sample audio to obtain the sampling sample audio, wherein the resampling comprises up-sampling or down-sampling.
7. A method for detecting a biological sound event using the target sound detection model determined by the method of any one of claims 1 to 6, comprising:
inputting target biological sound to be detected into a target sound detection model to obtain a first characteristic matrix;
processing the first feature matrix through a high-dimensional feature extractor to obtain a high-dimensional feature vector;
performing regularization interpolation operation on the high-dimensional feature vector to obtain a corresponding second feature vector;
calculating similarity values between the second feature vectors and N types of standard feature vectors respectively, and determining a type of standard feature vector corresponding to the minimum similarity value;
and acquiring a target audio label corresponding to the type of standard feature vector, and determining the target audio label as the audio label of the target biological audio.
8. A training device for a biological sound event detection model, comprising:
an obtaining unit, configured to obtain a sample audio data set including a biological sound event and a sample audio tag data set corresponding to the sample audio data set, where each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set;
the first input unit is used for inputting each sample audio frequency in the sample audio frequency data set into a to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio frequency in the sample audio frequency data set, wherein the to-be-trained sound event detection model comprises N types of standard audio frequencies used for comparing with the sample audio frequencies, and N is a positive integer greater than or equal to 1;
the first processing unit is used for processing the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio;
the first interpolation processing unit is used for performing regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio;
a first determining unit, configured to determine, according to the second feature vector corresponding to each sample audio and N types of standard feature vectors, a predicted audio tag data set corresponding to the sample audio data set, where the predicted audio tag data set includes a predicted audio tag of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond to N types of standard audio one-to-one;
and the second determining unit is used for determining the sound event detection model to be trained as a target sound detection model under the condition that the loss function corresponding to the sample audio tag data set and the predicted audio tag data set meets a preset condition.
9. The apparatus according to claim 8, wherein the first interpolation processing unit includes:
the acquisition module is used for acquiring a group of feature vectors corresponding to the high-dimensional feature vectors;
a first calculation module for calculating a euclidean distance of each pair of feature vectors in the set of feature vectors;
the second calculation module is used for calculating an adjacency matrix according to the Euclidean distance;
a third calculation module, configured to perform a Laplacian operation on the adjacency matrix to obtain a propagation matrix;
a first determining module, configured to determine the second feature vector according to the propagation matrix.
10. The apparatus of claim 8, wherein the first determining unit comprises:
performing the following for each sample audio in the sample audio data set:
a fourth calculating module, configured to calculate similarities between the second feature vector corresponding to each sample audio and the N types of standard audio feature vectors, respectively, so as to obtain N similarity values;
a second determining module, configured to determine a target standard audio feature vector corresponding to a minimum value of the N similarity values;
a third determining module, configured to determine a target sample label of a type of standard audio corresponding to the target standard audio feature vector as a predicted audio label of the sample audio, where a predicted audio label dataset of the sample audio dataset includes a predicted audio label of each sample audio in the sample audio dataset.
11. The apparatus of claim 8, wherein the first input unit comprises:
the sampling processing module is used for resampling each sample audio to obtain a sampling sample audio;
and the input module is used for inputting the sampling sample audio frequency into the sound event detection model to be trained to obtain the first characteristic matrix corresponding to each sample audio frequency.
12. The apparatus of claim 11, wherein the first input module comprises:
the processing submodule is used for performing framing and windowing processing operations on the sampling sample audio to obtain an intermediate sample audio;
and the fourth determining module is used for performing discrete Fourier transform on the intermediate sample audio to obtain the first characteristic matrix.
13. The apparatus of claim 11, wherein the sample processing module comprises:
and the sampling processing sub-module is used for performing up-sampling or down-sampling on each sample audio to obtain the sampling sample audio, wherein the resampling processing comprises up-sampling or down-sampling.
14. A bioacoustic event detection device, wherein detection is performed by the target sound detection model determined by the method of any one of claims 1 to 6, comprising:
the second input unit is used for inputting target biological sound to be detected into the target sound detection model to obtain a first characteristic matrix;
the second processing unit is used for processing the first feature matrix through a high-dimensional feature extractor to obtain a high-dimensional feature vector;
the second interpolation processing unit is used for carrying out regularized interpolation operation on the high-dimensional characteristic vector to obtain a corresponding second characteristic vector;
the calculating unit is used for calculating similarity values between the second feature vectors and the N types of standard feature vectors respectively and determining the type of standard feature vector corresponding to the minimum similarity value;
and the third determining unit is used for acquiring the target audio label corresponding to the type of standard feature vector and determining the target audio label as the audio label of the target biological audio.
15. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 6 or 7 when executed.
16. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6 or 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111012585.5A CN113724733B (en) | 2021-08-31 | 2021-08-31 | Biological sound event detection model training method and sound event detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111012585.5A CN113724733B (en) | 2021-08-31 | 2021-08-31 | Biological sound event detection model training method and sound event detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113724733A true CN113724733A (en) | 2021-11-30 |
CN113724733B CN113724733B (en) | 2023-08-01 |
Family
ID=78679714
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111012585.5A Active CN113724733B (en) | 2021-08-31 | 2021-08-31 | Biological sound event detection model training method and sound event detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113724733B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2015057630A (en) * | 2013-08-13 | 2015-03-26 | 日本電信電話株式会社 | Acoustic event identification model learning device, acoustic event detection device, acoustic event identification model learning method, acoustic event detection method, and program |
US20210076966A1 (en) * | 2014-09-23 | 2021-03-18 | Surgical Safety Technologies Inc. | System and method for biometric data capture for event prediction |
US20170278513A1 (en) * | 2016-03-23 | 2017-09-28 | Google Inc. | Adaptive audio enhancement for multichannel speech recognition |
CN106653032A (en) * | 2016-11-23 | 2017-05-10 | 福州大学 | Animal sound detecting method based on multiband energy distribution in low signal-to-noise-ratio environment |
CN109192222A (en) * | 2018-07-23 | 2019-01-11 | 浙江大学 | A kind of sound abnormality detecting system based on deep learning |
WO2020153572A1 (en) * | 2019-01-21 | 2020-07-30 | 휴멜로 주식회사 | Method and apparatus for training sound event detection model |
CN111337277A (en) * | 2020-02-21 | 2020-06-26 | 云知声智能科技股份有限公司 | Household appliance fault determination method and device based on voice recognition |
CN111443328A (en) * | 2020-03-16 | 2020-07-24 | 上海大学 | Sound event detection and positioning method based on deep learning |
CN112863492A (en) * | 2020-12-31 | 2021-05-28 | 思必驰科技股份有限公司 | Sound event positioning model training method and device |
CN113205820A (en) * | 2021-04-22 | 2021-08-03 | 武汉大学 | Method for generating voice coder for voice event detection |
Non-Patent Citations (2)
Title |
---|
TIANTIAN TANG: "CNN-based Discriminative Training for Domain Compensation in Acoustic Event Detection with Frame-wise Classifier", arXiv *
LIU Yaming: "Research on Multiple Sound Event Detection Methods Based on Deep Neural Networks", China Masters' Theses Full-text Database *
Also Published As
Publication number | Publication date |
---|---|
CN113724733B (en) | 2023-08-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110797021B (en) | Hybrid speech recognition network training method, hybrid speech recognition device and storage medium | |
US10332507B2 (en) | Method and device for waking up via speech based on artificial intelligence | |
CN111477250B (en) | Audio scene recognition method, training method and device for audio scene recognition model | |
CN111814810A (en) | Image recognition method and device, electronic equipment and storage medium | |
CN113724734B (en) | Sound event detection method and device, storage medium and electronic device | |
CN111862951A (en) | Voice endpoint detection method and device, storage medium and electronic equipment | |
CN111312286A (en) | Age identification method, age identification device, age identification equipment and computer readable storage medium | |
CN114267345B (en) | Model training method, voice processing method and device | |
CN115035913B (en) | Sound abnormity detection method | |
CN115565548A (en) | Abnormal sound detection method, abnormal sound detection device, storage medium and electronic equipment | |
CN114582325A (en) | Audio detection method and device, computer equipment and storage medium | |
CN111341333A (en) | Noise detection method, noise detection device, medium, and electronic apparatus | |
CN116705059B (en) | Audio semi-supervised automatic clustering method, device, equipment and medium | |
CN111951808B (en) | Voice interaction method, device, terminal equipment and medium | |
CN113724733B (en) | Biological sound event detection model training method and sound event detection method | |
CN113889086A (en) | Training method of voice recognition model, voice recognition method and related device | |
CN113948089B (en) | Voiceprint model training and voiceprint recognition methods, devices, equipment and media | |
CN113762382A (en) | Model training and scene recognition method, device, equipment and medium | |
CN112489678A (en) | Scene recognition method and device based on channel characteristics | |
CN113782033B (en) | Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium | |
CN115099372B (en) | Classification identification method and device | |
CN110827811A (en) | Voice control method and device for household electrical appliance | |
CN114400009B (en) | Voiceprint recognition method and device and electronic equipment | |
CN115796185A (en) | Semantic intention determination method and device, storage medium and electronic device | |
CN116304129A (en) | Method for determining associated object and content recommendation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CP02 | Change in the address of a patent holder | Address after: No. 100 Guilin Road, Xuhui District, Shanghai 200234; Patentee after: SHANGHAI NORMAL University; YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD. Address before: No. 100 Guilin Road, Minhang District, Shanghai 200233; Patentee before: SHANGHAI NORMAL University; YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.
CP02 | Change in the address of a patent holder |