CN113724733A - Training method of biological sound event detection model and detection method of sound event
- Publication number: CN113724733A
- Application number: CN202111012585.5A
- Authority: CN (China)
- Prior art keywords: audio, sample audio, sample, data set, feature vector
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a training method of a biological sound event detection model and a sound event detection method. In the pre-training stage, the recorded original audio (sample audio) containing a biological sound event is first resampled, manual features (a first feature matrix) are then extracted and fed into a high-dimensional feature extractor, class prototypes are calculated from the output high-dimensional feature vectors, and metric classification is performed. An embedding propagation module, designed to improve the generalization capability of the system, is used in the second stage of system training: during the second-stage fine-tuning, the high-dimensional feature vector output by the high-dimensional feature extractor first undergoes embedding propagation to obtain an embedded interpolation vector (a second feature vector), and a class prototype is then calculated to perform metric classification. This solves the technical problem in the prior art that model robustness is poor when training a biological sound event detection model.
Description
Technical Field
The invention relates to the technical field of Artificial Intelligence (AI) and sound event detection, and in particular to a training method of a biological sound event detection model and a detection method of a sound event.
Background
The excellent performance of deep learning methods on computer vision tasks such as classification, semantic segmentation and object detection has accelerated the development of artificial intelligence. At the same time, progress in the visual field alone cannot meet the growing demand for intelligent living, and the application requirements of intelligent audio technology in daily-life scenarios are increasingly diverse, for example acoustic scene classification, abnormal machine sound event detection, sound event localization, sound activity detection of domestic events, text content recognition of audio signals, and rare biological sound event detection.
Acoustic scene classification aims to identify the place where a device is located from the surrounding acoustic environment; different places differ both visually and acoustically, for example a loud train whistle is very unlikely to be heard in an office. Abnormal machine sound event detection refers to monitoring the running sound of a machine in real time so that an alarm can be raised promptly when the machine fails, greatly reducing the cost of manual inspection. Sound event localization aims to capture the spatio-temporal characteristics of an acoustic scene and make intelligent application scenarios observable, and can serve a wide range of machine cognition tasks, such as reasoning-based navigation. Sound activity monitoring of domestic events serves home intelligence by monitoring the various sound events in a household so that subsequent devices can take action. Text recognition of audio signals corresponds to the subtitle function commonly seen in audio and video. Rare biological sound event detection helps biological researchers determine the presence of species in nature in preparation for subsequent studies; in this case labelled data for the species is particularly difficult to obtain.
However, supervised deep learning approaches typically require training on large amounts of labelled data, which is scarce for most applications and costly to collect, for example bio-sound event detection of rare species and custom sound event detection. Sound event detection has a wide range of practical application scenarios, so a sound event detection technique that needs only a small amount of prior knowledge would bring great convenience to the task, but it is also very challenging. When a sound event detection task can only provide a small amount of target sample data, a deep network is usually needed instead of a simple shallow convolutional neural network in order to obtain a more effective high-dimensional feature representation; however, a fixed high-dimensional feature extraction model learned by a deep network on a small amount of training data that does not match the test data leads to overfitting.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the invention provide a training method of a biological sound event detection model and a detection method of a sound event, which at least solve the technical problem in the prior art that the robustness of the model is poor when training a biological sound event detection model.
According to an aspect of an embodiment of the present invention, there is provided a training method of a bio-acoustic event detection model, including: obtaining a sample audio data set containing biological sound events and a sample audio tag data set corresponding to the sample audio data set, wherein each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set; inputting each sample audio in the sample audio data set into a to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, wherein the to-be-trained sound event detection model comprises N types of standard audio used for comparison with the sample audio, and N is a positive integer greater than or equal to 1; processing the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio; performing a regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio; obtaining a predicted audio tag data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors, wherein the predicted audio tag data set comprises a predicted audio tag of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond one-to-one to the N types of standard audio; and determining the to-be-trained sound event detection model as a target sound detection model when the loss function corresponding to the sample audio tag data set and the predicted audio tag data set meets a preset condition.
Optionally, performing the regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain the second feature vector corresponding to each sample audio includes: obtaining a group of feature vectors corresponding to the high-dimensional feature vectors; calculating the Euclidean distance of each pair of feature vectors in the group of feature vectors; calculating an adjacency matrix from the Euclidean distances; applying a Laplacian operation to the adjacency matrix to obtain a propagation matrix; and determining the second feature vector from the propagation matrix.
Optionally, obtaining the predicted audio tag data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors includes: performing the following for each sample audio in the sample audio data set: calculating the similarity between the second feature vector corresponding to the sample audio and each of the N types of standard audio feature vectors to obtain N similarity values; determining the target standard audio feature vector corresponding to the minimum value among the N similarity values; and determining the target sample label of the type of standard audio corresponding to the target standard audio feature vector as the predicted audio label of the sample audio, wherein the predicted audio label data set of the sample audio data set comprises a predicted audio label of each sample audio in the sample audio data set.
Optionally, inputting each sample audio in the sample audio data set into the to-be-trained sound event detection model to obtain the first feature matrix corresponding to each sample audio in the sample audio data set includes: resampling each sample audio to obtain a sampled sample audio; and inputting the sampled sample audio into the to-be-trained sound event detection model to obtain the first feature matrix corresponding to each sample audio.
Optionally, inputting the sampled sample audio into the to-be-trained sound event detection model to obtain the first feature matrix corresponding to each sample audio includes: performing framing and windowing operations on the sampled sample audio to obtain an intermediate sample audio; and performing a discrete Fourier transform on the intermediate sample audio to obtain the first feature matrix.
Optionally, resampling each sample audio to obtain the sampled sample audio includes: performing up-sampling or down-sampling on each sample audio to obtain the sampled sample audio, i.e. the resampling comprises up-sampling or down-sampling.
According to another aspect of the embodiments of the present invention, a method for detecting a bio-acoustic event by using the target sound detection model determined by the above method includes: inputting a target biological sound to be detected into the target sound detection model to obtain a first feature matrix; processing the first feature matrix through a high-dimensional feature extractor to obtain a high-dimensional feature vector; performing a regularized interpolation operation on the high-dimensional feature vector to obtain a corresponding second feature vector; calculating similarity values between the second feature vector and each of the N types of standard feature vectors, and determining the type of standard feature vector corresponding to the minimum similarity value; and obtaining the target audio label corresponding to that type of standard feature vector, and determining the target label as the audio label of the target biological audio.
According to another aspect of the embodiments of the present invention, there is also provided a training apparatus for a bio-acoustic event detection model, including: an obtaining unit, configured to obtain a sample audio data set containing biological sound events and a sample audio tag data set corresponding to the sample audio data set, where each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set; a first input unit, configured to input each sample audio in the sample audio data set into a to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, where the to-be-trained sound event detection model comprises N types of standard audio used for comparison with the sample audio, and N is a positive integer greater than or equal to 1; a first processing unit, configured to process the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio; a first interpolation processing unit, configured to perform a regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio; a first determining unit, configured to determine, according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors, a predicted audio tag data set corresponding to the sample audio data set, where the predicted audio tag data set includes a predicted audio tag of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond one-to-one to the N types of standard audio; and a second determining unit, configured to determine the to-be-trained sound event detection model as a target sound detection model when the loss function corresponding to the sample audio tag data set and the predicted audio tag data set meets a preset condition.
Optionally, the first interpolation processing unit includes: an obtaining module, configured to obtain a group of feature vectors corresponding to the high-dimensional feature vectors; a first calculation module, configured to calculate the Euclidean distance of each pair of feature vectors in the group of feature vectors; a second calculation module, configured to calculate an adjacency matrix from the Euclidean distances; a third calculation module, configured to apply a Laplacian operation to the adjacency matrix to obtain a propagation matrix; and a first determining module, configured to determine the second feature vector from the propagation matrix.
Optionally, the first determining unit performs the following for each sample audio in the sample audio data set and includes: a fourth calculation module, configured to calculate the similarities between the second feature vector corresponding to each sample audio and the N types of standard audio feature vectors to obtain N similarity values; a second determining module, configured to determine the target standard audio feature vector corresponding to the minimum value among the N similarity values; and a third determining module, configured to determine the target sample label of the type of standard audio corresponding to the target standard audio feature vector as the predicted audio label of the sample audio, where the predicted audio label data set of the sample audio data set includes a predicted audio label of each sample audio in the sample audio data set.
Optionally, the first input unit includes: a sampling processing module, configured to resample each sample audio to obtain a sampled sample audio; and an input module, configured to input the sampled sample audio into the to-be-trained sound event detection model to obtain the first feature matrix corresponding to each sample audio.
Optionally, the input module includes: a processing submodule, configured to perform framing and windowing operations on the sampled sample audio to obtain an intermediate sample audio; and a fourth determining module, configured to perform a discrete Fourier transform on the intermediate sample audio to obtain the first feature matrix.
Optionally, the sampling processing module includes: a sampling processing sub-module, configured to perform up-sampling or down-sampling on each sample audio to obtain the sampled sample audio, i.e. the resampling processing comprises up-sampling or down-sampling.
According to another aspect of the embodiments of the present invention, there is also provided an apparatus for detecting a biological sound event using the target sound detection model determined by any of the above methods, including: a second input unit, configured to input a target biological sound to be detected into the target sound detection model to obtain a first feature matrix; a second processing unit, configured to process the first feature matrix through a high-dimensional feature extractor to obtain a high-dimensional feature vector; a second interpolation processing unit, configured to perform a regularized interpolation operation on the high-dimensional feature vector to obtain a corresponding second feature vector; a calculating unit, configured to calculate similarity values between the second feature vector and each of the N types of standard feature vectors and determine the type of standard feature vector corresponding to the minimum similarity value; and a third determining unit, configured to obtain the target audio label corresponding to that type of standard feature vector and determine the target label as the audio label of the target biological audio.
In the embodiment of the invention, in the pre-training stage, the recorded original audio (sample audio) containing biological sound events is first resampled, manual features (the first feature matrix) are then extracted and fed into a high-dimensional feature extractor, class prototypes are calculated from the output high-dimensional feature vectors, and metric classification is performed. In the second stage of system training, during fine-tuning, the high-dimensional feature vector output by the high-dimensional feature extractor undergoes embedding propagation to obtain an embedded interpolation vector (the second feature vector), and a class prototype is then calculated to perform metric classification.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware structure of a mobile terminal according to an alternative training method for a bio-sound event detection model in an embodiment of the present invention;
FIG. 2 is a flow chart of an alternative method of training a bioacoustic event detection model in accordance with embodiments of the present invention;
FIG. 3 is a flow chart of an alternative method of bio-sound event detection according to an embodiment of the present invention;
FIG. 4 is a flow chart of an alternative training method for a network bioacoustic event classification model based on pre-trained embedded propagation prototypes according to an embodiment of the present invention;
FIG. 5 is a flow chart of an alternative PCEN spectrogram extraction, according to an embodiment of the present invention;
FIG. 6 is a block diagram of an alternative residual convolutional neural network in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of an alternative acoustic detection model training strategy according to an embodiment of the present invention;
FIG. 8 is an apparatus diagram of an alternative method of training a bioacoustic event detection model in accordance with embodiments of the present invention;
fig. 9 is an apparatus diagram of an alternative method of bio-sound event detection according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The training of the bio-acoustic event detection model provided in the embodiments of the present application may be performed in a mobile terminal, a computer terminal, or a similar computing device. Taking the example of the method running on a mobile terminal, fig. 1 is a block diagram of a hardware structure of the mobile terminal of the training method for detecting a biological sound event according to the embodiment of the present invention. As shown in fig. 1, the mobile terminal 10 may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of application software, such as a computer program corresponding to the sound event detection method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In this embodiment, a training method of a biological sound event detection model is provided, fig. 2 is a flowchart of training a biological sound event detection model according to an embodiment of the present invention, and as shown in fig. 2, the training process of the biological sound event detection model includes the following steps:
step S202, a sample audio data set including a biological sound event and a sample audio tag data set corresponding to the sample audio data set are obtained, wherein each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set.
Step S204, inputting each sample audio in the sample audio data set into a to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, wherein the to-be-trained sound event detection model comprises N types of standard audio used for comparing with the sample audio, and N is a positive integer greater than or equal to 1.
Step S206, the first feature matrix corresponding to each sample audio is processed by the high-dimensional feature extractor, and a high-dimensional feature vector corresponding to each sample audio is obtained.
And step S208, performing regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio.
Step S210, obtaining a predicted audio tag data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N-type standard feature vectors, where the predicted audio tag data set includes a predicted audio tag of each sample audio in the sample audio data set, and the N-type standard feature vectors correspond to the N-type standard audios one to one.
Step S212, under the condition that the loss function corresponding to the sample audio label data set and the prediction audio label data set meets the preset condition, determining the sound event detection model to be trained as the target sound detection model.
Optionally, in this embodiment, the target sound detection model may include, but is not limited to, applying various audio processing-based scenarios, such as speech recognition, semantic recognition, and the like.
The sample audio data set may include, but is not limited to, the audio Development dataset of the bio-acoustic Sound Event Detection task in the DCASE2021 (Detection and Classification of Acoustic Scenes and Events) challenge, which contains 19 subclasses under 4 major classes: common birds, ferrets, porgy, and cave birds.
In this embodiment, the biological sound event detection model may use a residual convolutional neural network to map the input PCEN features into a high-dimensional space, f_φ(x_i), obtaining high-dimensional feature vector representations; each class prototype c_k can then be obtained by averaging the high-dimensional feature vectors of that class:

c_k = (1 / |S_k|) Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i)

where S_k is the set of training samples belonging to class k.
by the embodiment provided by the application, a sample audio data set containing a biological sound event and a sample audio tag data set corresponding to the sample audio data set are obtained, wherein each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set; inputting each sample audio frequency in the sample audio frequency data set into a to-be-trained sound event detection model to obtain a first characteristic matrix corresponding to each sample audio frequency in the sample audio frequency data set, wherein the to-be-trained sound event detection model comprises N types of standard audio frequencies used for comparing with the sample audio frequency, and N is a positive integer greater than or equal to 1; processing the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio; performing regularization interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio; obtaining a predicted audio label data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors, wherein the predicted audio label data set comprises a predicted audio label of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond to the N types of standard audio one to one; and under the condition that the loss functions corresponding to the sample audio label data set and the predicted audio label data set meet preset conditions, determining the sound event detection model to be trained as a target sound detection model. After the high-dimensional feature representation is obtained, a group of interpolation representations of the high-dimensional feature vectors are obtained through the embedding propagation module, and the robustness and the generalization capability of the detection model are improved through the embedding propagation regularization method.
Optionally, performing the regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain the second feature vector corresponding to each sample audio may include: obtaining a group of feature vectors corresponding to the high-dimensional feature vectors; calculating the Euclidean distance of each pair of feature vectors in the group; calculating an adjacency matrix from the Euclidean distances; applying a Laplacian operation to the adjacency matrix to obtain a propagation matrix; and determining the second feature vector from the propagation matrix.
In this embodiment, given a set of feature vectors {z_i} output by the network for an event, the squared Euclidean distance d_ij^2 = ||z_i - z_j||^2 is first computed for each pair of features. These distances are then used to calculate an adjacency matrix, e.g. with a radial basis function, A_ij = exp(-d_ij^2 / σ^2) for i ≠ j, and A_ii = 0. Next, a Laplacian operation is applied to the adjacency matrix, e.g. the symmetrically normalized form L = D^(-1/2) A D^(-1/2), where D is the degree matrix with D_ii = Σ_j A_ij.
From this, the propagation matrix P = (I - αL)^(-1) can be obtained. The propagation matrix maps each feature vector into another feature space, and each mapped feature can be viewed as a weighted sum of the other feature vectors, so the aggregated feature vectors have the effect of removing unwanted noise from the features. Moreover, the embedding method is simple to implement and is compatible with a wide range of feature extractors and classifiers.
Optionally, obtaining the predicted audio tag data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors may include: performing the following for each sample audio in the sample audio data set: calculating the similarity between the second feature vector corresponding to the sample audio and each of the N types of standard audio feature vectors to obtain N similarity values; determining the target standard audio feature vector corresponding to the minimum value among the N similarity values; and determining the target sample label of the type of standard audio corresponding to the target standard audio feature vector as the predicted audio label of the sample audio, where the predicted audio label data set of the sample audio data set comprises the predicted audio label of each sample audio in the sample audio data set.
Optionally, inputting each sample audio in the sample audio data set into the sound event detection model to be trained, to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, where the method may include: resampling each sample audio to obtain a sampled sample audio; and inputting the sampled sample audio into a sound event detection model to be trained to obtain a first feature matrix corresponding to each sample audio.
Optionally, inputting the sample audio into the to-be-trained sound event detection model to obtain the first feature matrix corresponding to each sample audio, which may include: performing framing and windowing processing operation on the sampling sample audio to obtain an intermediate sample audio; and carrying out discrete Fourier transform on the intermediate sample audio to obtain a first characteristic matrix.
Optionally, resampling each sample audio to obtain a sample audio, where the resampling may include: and performing up-sampling or down-sampling on each sample audio to obtain a sample audio, wherein the resampling comprises up-sampling or down-sampling.
According to another aspect of the embodiments of the present invention, a method for detecting a bio-acoustic event by using a target acoustic detection model determined by the above method may include:
step S302, inputting the target biological sound to be detected into a target sound detection model to obtain a first feature matrix.
And step S304, processing the first feature matrix through a high-dimensional feature extractor to obtain a high-dimensional feature vector.
And step S306, performing regularized interpolation operation on the high-dimensional feature vectors to obtain corresponding second feature vectors.
Step S308, calculating similarity values between the second feature vectors and the N types of standard feature vectors respectively, and determining the type of standard feature vector corresponding to the minimum similarity value.
Step S310, obtaining the target audio tag corresponding to that type of standard feature vector, and determining the target tag as the audio tag of the target biological audio.
According to the embodiment provided by the application, the target biological sound to be detected is input into the target sound detection model to obtain a first feature matrix; the first feature matrix is processed by a high-dimensional feature extractor to obtain a high-dimensional feature vector; a regularized interpolation operation is performed on the high-dimensional feature vector to obtain the corresponding second feature vector; similarity values between the second feature vector and each of the N types of standard feature vectors are calculated, and the type of standard feature vector corresponding to the minimum similarity value is determined; the target audio tag corresponding to that type of standard feature vector is obtained and determined as the audio tag of the target biological audio. In this way an accurate audio tag can be obtained.
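To make the inference flow of steps S302 to S310 concrete, the following is a minimal sketch; the names feature_extractor, class_prototypes and class_labels are hypothetical placeholders, and for brevity the embedding propagation of the query together with its support set is folded into the feature_extractor call:

```python
import numpy as np

def detect_event(target_pcen, feature_extractor, class_prototypes, class_labels):
    """Steps S302-S310: extract the high-dimensional (propagated) feature vector
    of the target audio, compute its distance to each class prototype, and
    return the label of the closest prototype (smallest distance = highest
    similarity in the sense used above)."""
    z = feature_extractor(target_pcen)                     # second feature vector
    dists = np.sum((class_prototypes - z) ** 2, axis=1)    # squared Euclidean distances
    return class_labels[int(np.argmin(dists))]             # predicted audio label
```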
As an alternative embodiment, a training method for the pre-training-based embedding propagation prototype network biological sound event classification model is provided, comprising two stages: pre-training and fine-tuning. The specific implementation is shown in fig. 4. The left half is the pre-training stage: the recorded original audio (sample audio) containing biological sound events is first resampled, manual features are then extracted and fed into the high-dimensional feature extractor, class prototypes are calculated from the output high-dimensional feature vectors, and metric classification is performed. The right half is the embedding propagation module, designed to improve the generalization capability of the system and used in the second stage of system training. In the second-stage fine-tuning, the high-dimensional feature vector output by the high-dimensional feature extractor undergoes embedding propagation to obtain an embedded interpolation vector, and a class prototype is then calculated to perform metric classification.
In this embodiment, the two-stage acoustic model training strategy and the embedding propagation regularization method enhance the model's representation of high-dimensional features and improve its robustness and generalization capability. A general-purpose representation model is learned in the pre-training stage, and the representation of the customized data is then learned in a targeted manner in the fine-tuning stage, which improves metric classification performance on data unseen during testing.
In this embodiment, a two-stage training method is proposed to improve the robustness of the model. As shown in fig. 4, there are four modules in the first stage: first, the recorded original audio containing biological sound events is converted into a PCEN feature matrix representation that is simple and convenient to process; second, a high-dimensional feature extractor based on a residual convolutional neural network (Resnet12); third, class prototypes are calculated from the high-dimensional feature vectors; and fourth, a metric classification module.
And the fine tuning training of the second stage is realized by adding a regularization method of embedded propagation in the first stage, so that the generalization capability of the prototype classification network is improved.
In this embodiment, a method for training a network biological sound event classification model based on pre-training embedded propagation prototype is provided, which includes the following 4 steps:
step one, extracting characteristics of diverse biological audio data
In this embodiment, the acquired audio data set may come from the DCASE (Detection and Classification of Acoustic Scenes and Events) challenge; the audio Development dataset of the bio-acoustic Sound Event Detection task in DCASE2021 is selected, which contains 19 subclasses under 4 major classes: common birds, ferrets, porgy, and cave birds.
The data set includes a Training dataset and a Validation dataset. The training data includes 4 subsets, i.e. the 4 major species, and each subset contains a different number of subclasses: BV contains 11 subclasses, with 5 audio files totalling 10 hours; HT contains 3 subclasses, with 3 audio files totalling 3 hours; JD contains 1 subclass, with 1 audio file totalling 10 minutes; MT contains 4 subclasses, with 2 audio files totalling 1 hour 10 minutes. Each audio in the validation data set has only two categories, target event sounds (positive examples) and non-target event sounds (negative examples), and it includes two subsets, HV (2 audio files, 2 hours) and PB (6 audio files, 3 hours). The details of the data set are shown in Table 1.
Table 1. DCASE2021 task5 development set data

| Data set | Number of audio files | Sampling rate (Hz) | Total duration | Number of categories | Number of events |
| --- | --- | --- | --- | --- | --- |
| BV | 5 | 24000 | 10 h | 11 | 2662 |
| HT | 3 | 6000 | 3 h | 3 | 435 |
| JD | 1 | 22050 | 10 min | 1 | 355 |
| MT | 2 | 8000 | 1 h 10 min | 4 | 1234 |
| HV | 2 | 6000 | 2 h | 2 | 50 |
| PB | 6 | 44100 | 3 h | 2 | 260 |
Original audio data are resampled according to bioacoustic characteristics, and then PCEN manual characteristics are extracted.
It should be noted that the sources of the biological audio data are different, and the sampling rates are different, and in this embodiment, as shown in fig. 5, a flowchart of PCEN spectrogram extraction is shown. The specific procedure is as follows.
First, the audio data are resampled (and the value range scaled), framed and windowed; a discrete Fourier transform is then applied; the frequency-domain features are then pooled by the energy of a set of Mel-frequency filter bands to obtain one feature value per band; and finally the PCEN operation is applied.
In the embodiment, uniform resampling processing is adopted for the class imbalance problem, and global regularization processing is performed on the features before the features are sent to the network.
In practical applications, all sample audio segments are 0.2 s long, and per-channel energy normalized manual features (PCEN) are extracted as the input to the model. PCEN is a feature extraction approach that improves the robustness of the spectrogram to channel distortion by combining Dynamic Range Compression (DRC) and Adaptive Gain Control (AGC) with temporal integration. Before PCEN feature extraction the original audio is scaled to the range [-2^31, 2^31]. The specific process is as follows: the audio data first undergo value-range scaling (resampling), framing and windowing; a discrete Fourier transform is then applied; the frequency-domain features are pooled by the energy of a set of Mel-frequency filter bands to obtain one feature value per band; and finally the PCEN operation is applied:

PCEN(t,f) = (E(t,f) / (ε + M(t,f))^α + δ)^r - δ^r

M(t,f) = s·E(t,f) + (1-s)·M(t-1,f)

where 0 < α, ε, r < 1, δ > 1, E(t,f) is the Mel time-frequency energy, and M(t,f) is its smoothed version. In the practical application of training the biological sound event detection model, because the sampling rates of the data differ greatly, up-sampling or down-sampling is applied so that all audio has a common sampling rate of 22050 Hz. In the framing operation, the frame length and the frame shift are 1024 and 256 sampling points respectively; the Mel filter bank uses 128 triangular windows, so a 128-dimensional spectrogram of 17 frames is finally obtained.
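Under the stated settings (22050 Hz, frame length 1024, frame shift 256, 128 Mel bands), the feature extraction can be sketched with librosa roughly as follows; this is an illustrative approximation, not the exact implementation of the embodiment:

```python
import librosa
import numpy as np

def extract_pcen(path, sr=22050, n_fft=1024, hop_length=256, n_mels=128):
    """Resample the audio to a common rate, frame/window it via the STFT,
    pool the spectrum with a Mel filter bank, and apply PCEN."""
    y, _ = librosa.load(path, sr=sr)                      # load + resample to 22050 Hz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        n_mels=n_mels, power=1.0)                         # magnitude Mel spectrogram
    # Scale toward the integer range [-2^31, 2^31] mentioned above before PCEN
    pcen = librosa.pcen(mel * (2 ** 31), sr=sr, hop_length=hop_length)
    return pcen.astype(np.float32)                        # shape: (n_mels, n_frames)
```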
Step two, building a prototype network model
In this embodiment, the prototype classification network uses a convolutional neural network (CNN) for metric-based classification of few-sample events. Through training, the convolutional neural network part serves as a high-dimensional feature extractor with strong generalization capability. Its input is either manually extracted features or a feature representation of the original audio extracted by other models. These features are mapped by the convolutional neural network of the prototype network into high-dimensional feature vector representations, and the mean vector of the high-dimensional features of each class is computed as that class's prototype.
In metric classification, the sample to be classified is represented by the corresponding high-dimensional feature vector through the same convolutional neural network, and classification is then performed by measuring the distance to each class prototype: the sample is assigned to the class whose prototype is nearest. Common distance measures include the squared Euclidean distance, cosine similarity, and the like.
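A minimal PyTorch sketch of the prototype computation and metric classification just described follows; the tensor shapes and helper names are assumptions made for illustration:

```python
import torch

def class_prototypes(support_feats, support_labels, n_classes):
    """Average the high-dimensional feature vectors of each class to obtain
    the class prototypes c_k (support_feats: [n_support, d])."""
    return torch.stack([
        support_feats[support_labels == k].mean(dim=0) for k in range(n_classes)
    ])

def metric_classify(query_feats, prototypes):
    """Assign each query to the class whose prototype is closest under the
    squared Euclidean distance (cosine similarity would be a drop-in choice)."""
    d2 = torch.cdist(query_feats, prototypes, p=2) ** 2    # [n_query, n_classes]
    return d2.argmin(dim=1)                                # predicted class indices
```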
Step three, pre-training in the first stage
It should be noted that deepening the number of network layers is beneficial to effectively extracting high-dimensional feature information, but for a training mode with few samples, the overfitting phenomenon is aggravated, and the generalization of the model is not facilitated.
In this embodiment, a pre-training-based two-stage acoustic model training strategy learns a general biological sound high-dimensional feature representation model by training a large amount of animal sound event data in a pre-training stage, so as to provide a better initialization model for the second-stage training, accelerate convergence of a target task, and avoid an overfitting situation on a small sample data set. Therefore, the deep network can be used for extracting effective high-dimensional feature representation, and the overfitting problem caused by too few samples and too deep network can be prevented.
The training data consists of a large amount of animal sound audio. A PCEN feature matrix (corresponding to the first feature matrix) is first extracted and fed into a prototype network that uses Resnet12 as the high-dimensional feature extractor to extract high-dimensional features and compute class prototypes; the squared Euclidean distance between the high-dimensional feature vector of a sample under test and each class prototype is then computed, and the predicted class is the one whose prototype is closest. After this training is completed, the model parameters are fixed and used as the initialization parameters of the next-stage model.
In the pre-training stage, AudioSet data is fed into Resnet12 in batches to obtain high-dimensional feature vector representations of length 1024. A mean class prototype is computed for each class, the similarity (squared Euclidean distance) between the sample under test and each class prototype is calculated, and the class of the sample is predicted from these similarities.
Step four, second-stage fine-tuning training based on embedding propagation
Embedding Propagation (EP) is an unsupervised, non-parametric regularization method belonging to the family of manifold smoothing. One drawback of prototype networks used for few-sample event classification is that they easily overfit to the small amount of training data, while in practical applications the training data often differ greatly from the test data; the trained prototype network therefore needs strong generalization capability so that it can adapt to and extract high-dimensional features from more types of audio. Embedding propagation uses the similarity of the network's output high-dimensional features on a graph (constructed from the pairwise similarities of the high-dimensional features using Radial Basis Functions (RBFs)) to output a set of interpolations that capture higher-order interactions between the embedded vectors; using interpolated embeddings yields smoother decision boundaries and increases the robustness of the model to noise.
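To make the computation concrete, the following is a minimal NumPy sketch of the embedding propagation described here and in the earlier section (pairwise distances, RBF adjacency, normalized Laplacian, propagation matrix); the scale parameter sigma and the coefficient alpha are assumed values, not ones fixed by the invention:

```python
import numpy as np

def embedding_propagation(z, alpha=0.5, sigma=1.0):
    """Map a batch of high-dimensional feature vectors z (shape [n, d]) to
    their embedded interpolation vectors: pairwise squared Euclidean distances
    -> RBF adjacency with zero diagonal -> normalized Laplacian ->
    propagation matrix P = (I - alpha*L)^-1 -> propagated embeddings P @ z."""
    n = z.shape[0]
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)  # pairwise distances
    A = np.exp(-d2 / sigma ** 2)                                 # RBF similarities
    np.fill_diagonal(A, 0.0)                                     # A_ii = 0
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg + 1e-8))
    L = d_inv_sqrt @ A @ d_inv_sqrt                              # normalized Laplacian
    P = np.linalg.inv(np.eye(n) - alpha * L)                     # propagation matrix
    return P @ z                                                 # interpolated embeddings
```

Each row of the result is a weighted combination of all feature vectors in the batch, which is what produces the smoother decision boundaries mentioned above.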
In the second stage, PCEN manual features are first extracted from the original audio (the specific data) and fed into the prototype network, whose initialization parameters are the model parameters obtained from first-stage training. After the high-dimensional feature representation is obtained, a group of interpolated representations of the high-dimensional feature vectors is obtained through the embedding propagation module; the class prototypes are computed on this basis, and classification prediction in the subsequent calculation and training stages follows the same metric classification criterion. The embedding propagation module is non-parametric and does not need to be initialized.
In the fine-tuning stage, the training data of DCASE2021 task5 is used; the same high-dimensional feature vector representation is extracted, the features are then passed through the embedding propagation module to obtain their regularized interpolated representations, and the class prototypes and similarities are calculated by the same method for classification.
In this embodiment, the prototype network uses a residual convolutional neural network whose specific structure is shown in fig. 6. The input PCEN features are mapped into a high-dimensional space through f_φ(x_i) to obtain high-dimensional feature vector representations, and each class prototype c_k can be obtained by averaging the high-dimensional feature vectors of that class:

c_k = (1 / |S_k|) Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i)

where S_k is the set of training samples belonging to class k.
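The text does not reproduce the exact layer configuration of fig. 6, so the following PyTorch block is only an assumed Resnet12-style residual block (three 3x3 convolutions, a 1x1 shortcut projection, 2x2 max pooling), given for illustration; stacking several such blocks followed by global pooling would produce the high-dimensional vector f_φ(x_i):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """One assumed residual block of a Resnet12-style feature extractor."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # 1x1 projection
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        # Residual sum followed by ReLU and spatial down-sampling
        return self.pool(F.relu(self.body(x) + self.shortcut(x)))
```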
in the embodiment, the prototype network based on the measurement can perform training of less sample data by constructing the epicode, and compared with the traditional supervised learning, the overfitting problem caused by less data is alleviated to a certain extent. However, the difference between the training data and the test data is still a concern. Through a two-stage acoustic model training strategy, universal initialization parameters of a high-dimensional biological sound feature extraction model are provided for training on a specific data set, model convergence is accelerated, and model robustness is improved. In addition, classification boundaries between classes can be blurred through embedding a propagated regularization mode, so that the classification capability of the model is improved.
It should be noted that, in order to improve the robustness of the model and the detection performance of the system, fig. 7 provides a schematic diagram of the acoustic detection model training strategy, which is described in detail as follows.
1) AudioSet pre-training phase
AudioSet contains weakly labelled 10 s clips and strongly labelled data, covering about 600 classes. The weakly labelled data comes in two forms: one is audio automatically cut from YouTube videos according to the start/end point label information and the corresponding audio URLs provided by AudioSet; the other is 128-dimensional features extracted by the VGGish model. The strongly labelled data has only one form: audio cut from YouTube videos, for which AudioSet provides the label, the URL and the corresponding strong-label timestamp information. The data used for pre-training are the 39 subclasses of the "Animal" class in the strongly labelled data (excluding 4 duplicate or ambiguous subclasses, e.g. the "hissing" of snakes and of steamers belongs to the same label, and the "roaring" of various different animals belongs to the same label), a total of 17.8 hours.
PCEN features are first extracted from the audio data and fed into a 12-layer residual network, and metric-based loss training is carried out by partitioning the data into episodes in the manner of a prototypical network. The data are fed into the system in batches of 16 classes with 5 samples per class, the same per-class sample count used in the second-stage fine-tuning.
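As an illustration of this episodic, metric-based training, the sketch below (PyTorch assumed) builds one 16-way, 5-shot episode and computes a prototypical loss; the query count per class and the requirement that each class hold enough examples are assumptions not stated in the text, and the features are taken as already encoded.

```python
import torch
import torch.nn.functional as F

def sample_episode(features_by_class, n_way=16, k_shot=5, q_query=5):
    """Randomly build one episode from a dict {class_id: tensor [num_examples, dim]}."""
    class_ids = list(features_by_class.keys())
    chosen = [class_ids[i] for i in torch.randperm(len(class_ids))[:n_way].tolist()]
    support, query, query_labels = [], [], []
    for new_label, c in enumerate(chosen):
        pool = features_by_class[c]
        idx = torch.randperm(pool.shape[0])[: k_shot + q_query]
        support.append(pool[idx[:k_shot]])   # [k_shot, dim] support examples of this class
        query.append(pool[idx[k_shot:]])     # [q_query, dim] query examples of this class
        query_labels += [new_label] * q_query
    return torch.stack(support), torch.cat(query), torch.tensor(query_labels)

def prototypical_loss(support, query, query_labels):
    """Metric-based loss: softmax over negative squared distances to class prototypes."""
    prototypes = support.mean(dim=1)              # [n_way, dim], one prototype per class
    dists = torch.cdist(query, prototypes) ** 2   # [n_query_total, n_way]
    return F.cross_entropy(-dists, query_labels)
```

In the actual pipeline, the PCEN frames would first pass through the 12-layer residual network to produce the per-class feature tensors used here.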
2) DCASE2021 task5 training dataset trimming phase:
The DCASE2021 task 5 training data set amounts to 14 hours and 20 minutes. Because of the class imbalance in the data set, the audio data are first evenly resampled after PCEN feature extraction so that all 19 classes in the data set have the same number of samples. The data are then fed in batches into the prototypical network, now carrying the better initialization parameters, and pass sequentially through high-dimensional feature extraction, embedding propagation and interpolated feature vector representation, after which metric-based classification is performed.
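A minimal sketch of the class-balancing step, assuming the examples are grouped per class and that balancing is done by random oversampling with replacement up to the largest class size (the exact balancing scheme is not specified in the text):

```python
import random

def balance_classes(examples_by_class):
    """Oversample smaller classes (with replacement) so every class reaches the size of the largest one."""
    target = max(len(items) for items in examples_by_class.values())
    balanced = {}
    for label, items in examples_by_class.items():
        extra = random.choices(items, k=target - len(items)) if len(items) < target else []
        balanced[label] = list(items) + extra
    return balanced
```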
It is further noted that the detection system may be evaluated after the training of the bio-acoustic event detection model.
In this example, F-measure was used to evaluate the performance of the system in the following way:
the recall rate and the precision are two mutually contradictory measurement criteria, and the F-measure gives consideration to the two indexes and calculates the harmonic average of the recall rate and the precision. In the embodiment, the F-measure final score is obtained by each subset, namely, the F-measure of each subset is calculated firstly, and then the F-measures of all the subsets are taken and averaged. TP is counted when the timestamp of the predicted event is more than 30% of the intersection ratio (IoU) with the true tag. The UNK class is included in the data set, i.e. the unknown animal is called, and is processed separately because the prediction of the system has the correct possibility although the human ear cannot be identified. The method specifically comprises the following steps:
it should be noted that if the audio does not predict positive/unknown, the FN counts are the total positive events of the whole audio, i.e. the top 5 known tags of each long audio are counted as FN.
The same feature extraction is applied to the audio data of the test set, the PCEN features of the data to be tested are fed into the classification system to obtain the final prediction result, and the F-measure is calculated from the output.
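A minimal sketch of the IoU-based scoring described above, under the assumption that events are (onset, offset) pairs in seconds, that a prediction counts as TP when its temporal IoU with a ground-truth event exceeds 0.3, and that matching is done greedily one-to-one (the matching scheme is an assumption):

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (onset, offset) intervals, in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

def count_tp_fn(predictions, ground_truth, iou_threshold=0.3):
    """Greedily match predictions to ground-truth events; unmatched ground truth counts as FN."""
    matched = set()
    tp = 0
    for p in predictions:
        for i, g in enumerate(ground_truth):
            if i not in matched and temporal_iou(p, g) > iou_threshold:
                matched.add(i)
                tp += 1
                break
    fn = len(ground_truth) - len(matched)
    return tp, fn
```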
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a device for detecting a sound event is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 8 is a block diagram of a training apparatus for a bio-acoustic event detection model according to an embodiment of the present invention; as shown in Fig. 8, the apparatus includes:
an obtaining unit 81, configured to obtain a sample audio data set including a biological sound event and a sample audio tag data set corresponding to the sample audio data set, where each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set.
The first input unit 83 is configured to input each sample audio in the sample audio data set into a to-be-trained sound event detection model, so as to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, where the to-be-trained sound event detection model includes N types of standard audio used for comparison with the sample audio, and N is a positive integer greater than or equal to 1.
And the first processing unit 85 is configured to process the first feature matrix corresponding to each sample audio through the high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio.
The first interpolation processing unit 87 is configured to perform regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio.
The first determining unit 89 is configured to determine, according to the second feature vector corresponding to each sample audio and the N-type standard feature vectors, a predicted audio tag data set corresponding to the sample audio data set, where the predicted audio tag data set includes a predicted audio tag of each sample audio in the sample audio data set, and the N-type standard feature vectors correspond to the N-type standard audios one to one.
The second determining unit 811 is configured to determine the sound event detection model to be trained as the target sound detection model when the loss function corresponding to the sample audio tag data set and the predicted audio tag data set satisfies a preset condition.
With the embodiment provided by the present application, the obtaining unit 81 obtains a sample audio data set containing a biological sound event and a sample audio tag data set corresponding to the sample audio data set, where each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set; the first input unit 83 inputs each sample audio in the sample audio data set into the to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, where the to-be-trained sound event detection model includes N types of standard audio used for comparison with the sample audio, and N is a positive integer greater than or equal to 1; the first processing unit 85 processes the first feature matrix corresponding to each sample audio through the high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio; the first interpolation processing unit 87 performs a regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio; the first determining unit 89 determines a predicted audio tag data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors, where the predicted audio tag data set includes a predicted audio tag of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond to the N types of standard audio one to one; and the second determining unit 811 determines the sound event detection model to be trained as the target sound detection model when the loss function corresponding to the sample audio tag data set and the predicted audio tag data set satisfies the preset condition. This improves the robustness of the model, and the added embedding propagation regularization improves its generalization capability.
Optionally, the first interpolation processing unit 87 may include: an acquisition module for acquiring a group of feature vectors corresponding to the high-dimensional feature vectors; a first calculation module for calculating the Euclidean distance between each pair of feature vectors in the group of feature vectors; a second calculation module for calculating an adjacency matrix from the Euclidean distances; a third calculation module for performing a Laplacian operation on the adjacency matrix to obtain a propagation matrix; and a first determining module for determining the second feature vector from the propagation matrix.
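A minimal sketch of these embedding propagation steps, assuming an RBF kernel for the adjacency matrix, a symmetrically normalized Laplacian, and a propagation matrix of the form (I + αL)^(-1); σ and α are free hyperparameters not given in the text:

```python
import torch

def embedding_propagation(z, sigma=1.0, alpha=0.5):
    """z: [n, dim] high-dimensional feature vectors; returns their regularized interpolations."""
    n = z.shape[0]
    dist2 = torch.cdist(z, z) ** 2                        # pairwise squared Euclidean distances
    adjacency = torch.exp(-dist2 / (2 * sigma ** 2))      # RBF adjacency matrix
    adjacency.fill_diagonal_(0)
    degree = adjacency.sum(dim=1).clamp(min=1e-8)
    d_inv_sqrt = torch.diag(degree.rsqrt())
    laplacian = torch.eye(n) - d_inv_sqrt @ adjacency @ d_inv_sqrt    # symmetric normalized Laplacian
    propagation = torch.linalg.inv(torch.eye(n) + alpha * laplacian)  # propagation matrix
    return propagation @ z                                 # second (interpolated) feature vectors
```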
Optionally, the first determining unit 89 may include modules that perform the following for each sample audio in the sample audio data set: a fourth calculation module for calculating the similarity between the second feature vector corresponding to each sample audio and the N types of standard audio feature vectors, respectively, to obtain N similarity values; a second determining module for determining the target standard audio feature vector corresponding to the minimum value among the N similarity values; and a third determining module for determining the target sample label of the type of standard audio corresponding to the target standard audio feature vector as the predicted audio label of the sample audio, where the predicted audio label data set of the sample audio data set includes a predicted audio label of each sample audio in the sample audio data set.
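A minimal sketch of this prediction rule, assuming the "similarity" being minimized is the Euclidean distance between the interpolated feature vector and each class prototype (the exact similarity measure is not specified here):

```python
import torch

def predict_label(second_vector, prototypes, labels):
    """second_vector: [dim]; prototypes: [N, dim]; labels: list of the N class labels."""
    distances = torch.norm(prototypes - second_vector, dim=1)   # one distance per class
    return labels[int(torch.argmin(distances))]                 # label of the closest class
```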
Optionally, the first input unit 83 may include: a sampling processing module for resampling each sample audio to obtain a sampling sample audio; and an input module for inputting the sampling sample audio into the sound event detection model to be trained to obtain a first feature matrix corresponding to each sample audio.
The first input module may include: a processing submodule for performing framing and windowing operations on the sampling sample audio to obtain an intermediate sample audio; and a fourth determining module for performing a discrete Fourier transform on the intermediate sample audio to obtain the first feature matrix.
Optionally, the sampling processing module includes a sampling processing sub-module configured to perform up-sampling or down-sampling on each sample audio to obtain a sampling sample audio, where the resampling processing comprises up-sampling or down-sampling.
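A minimal sketch of the resampling, framing/windowing and DFT steps, assuming librosa is used and that PCEN is computed on a mel spectrogram; the target sampling rate, FFT size, hop length and PCEN input scaling are placeholder values not specified in the text:

```python
import librosa
import numpy as np

def first_feature_matrix(path, target_sr=22050, n_fft=1024, hop_length=256):
    """Resample, frame/window, apply the DFT, then compute PCEN on a mel spectrogram."""
    y, sr = librosa.load(path, sr=None)                              # load at the native rate
    y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)         # up- or down-sampling
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length,
                        window="hann")                               # framing + windowing + DFT
    mel = librosa.feature.melspectrogram(S=np.abs(stft) ** 2, sr=target_sr)
    return librosa.pcen(mel * (2 ** 31), sr=target_sr, hop_length=hop_length)
```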
Fig. 9 is a block diagram of a bio-acoustic event detection apparatus according to an embodiment of the present invention; as shown in Fig. 9, the apparatus includes:
The second input unit 901 is configured to input a target biological sound to be detected into the target sound detection model to obtain a first feature matrix.
And a second processing unit 903, configured to process the first feature matrix through a high-dimensional feature extractor to obtain a high-dimensional feature vector.
And the second interpolation processing unit 905 is configured to perform regularized interpolation operation on the high-dimensional feature vectors to obtain corresponding second feature vectors.
The calculating unit 907 is configured to calculate similarity values between the second feature vectors and the N types of standard feature vectors, and determine a type of standard feature vector corresponding to the smallest similarity value.
And a third determining unit 909, configured to obtain a target audio tag corresponding to one type of standard feature vector, and determine the target audio tag as an audio tag of the target bio-audio.
With the embodiment provided by the present application, the second input unit 901 inputs the target biological sound to be detected into the target sound detection model to obtain a first feature matrix; the second processing unit 903 processes the first feature matrix through the high-dimensional feature extractor to obtain a high-dimensional feature vector; the second interpolation processing unit 905 performs a regularized interpolation operation on the high-dimensional feature vector to obtain the corresponding second feature vector; the calculating unit 907 calculates the similarity values between the second feature vector and the N types of standard feature vectors and determines the type of standard feature vector corresponding to the smallest similarity value; and the third determining unit 909 acquires the target audio label corresponding to that type of standard feature vector and determines it as the audio label of the target biological audio. Determining the audio label from the second feature vector obtained by the regularized interpolation operation improves the accuracy with which the model determines the audio label.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring a sample audio data set containing a biological sound event and a sample audio label data set corresponding to the sample audio data set, wherein each sample audio in the sample audio data set corresponds to one sample audio label in the sample audio label data set;
s2, inputting each sample audio in the sample audio data set into a to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, wherein the to-be-trained sound event detection model comprises N types of standard audio used for comparing with the sample audio, and N is a positive integer greater than or equal to 1;
s3, processing the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio;
s4, performing regularization interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio;
s5, obtaining a predicted audio label data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors, wherein the predicted audio label data set comprises a predicted audio label of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond to the N types of standard audio one to one;
and S6, under the condition that the loss function corresponding to the sample audio label data set and the predicted audio label data set meets the preset condition, determining the sound event detection model to be trained as the target sound detection model.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring a sample audio data set containing a biological sound event and a sample audio label data set corresponding to the sample audio data set, wherein each sample audio in the sample audio data set corresponds to one sample audio label in the sample audio label data set;
s2, inputting each sample audio in the sample audio data set into a to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio in the sample audio data set, wherein the to-be-trained sound event detection model comprises N types of standard audio used for comparing with the sample audio, and N is a positive integer greater than or equal to 1;
s3, processing the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio;
s4, performing regularization interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio;
s5, obtaining a predicted audio label data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors, wherein the predicted audio label data set comprises a predicted audio label of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond to the N types of standard audio one to one;
and S6, under the condition that the loss function corresponding to the sample audio label data set and the predicted audio label data set meets the preset condition, determining the sound event detection model to be trained as the target sound detection model.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Claims (16)
1. A training method of a biological sound event detection model is characterized by comprising the following steps:
obtaining a sample audio data set containing a biological sound event and a sample audio tag data set corresponding to the sample audio data set, wherein each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set;
inputting each sample audio frequency in the sample audio frequency data set into a to-be-trained sound event detection model to obtain a first characteristic matrix corresponding to each sample audio frequency in the sample audio frequency data set, wherein the to-be-trained sound event detection model comprises N types of standard audio frequencies used for comparing with the sample audio frequency, and N is a positive integer greater than or equal to 1;
processing the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio;
performing regularization interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio;
determining a predicted audio label data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and N types of standard feature vectors, wherein the predicted audio label data set comprises a predicted audio label of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond to the N types of standard audio one to one;
and under the condition that the loss function corresponding to the sample audio tag data set and the predicted audio tag data set meets a preset condition, determining the sound event detection model to be trained as a target sound detection model.
2. The method according to claim 1, wherein the performing a regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio comprises:
acquiring a group of feature vectors corresponding to the high-dimensional feature vectors;
calculating a Euclidean distance for each pair of feature vectors in the set of feature vectors;
calculating an adjacency matrix according to the Euclidean distance;
performing a Laplacian operation on the adjacency matrix to obtain a propagation matrix;
and determining the second feature vector according to the propagation matrix.
3. The method according to claim 1, wherein determining the predicted audio tag data set corresponding to the sample audio data set according to the second feature vector corresponding to each sample audio and the N types of standard feature vectors comprises:
performing the following for each sample audio in the sample audio data set:
calculating the similarity between the second feature vector corresponding to each sample audio and the N types of standard audio feature vectors respectively to obtain N similarity values;
determining a target standard audio feature vector corresponding to the minimum value in the N similarity values;
and determining a target sample label of a type of standard audio corresponding to the target standard audio feature vector as a predicted audio label of the sample audio, wherein the predicted audio label data set of the sample audio data set comprises a predicted audio label of each sample audio in the sample audio data set.
4. The method of claim 1, wherein the inputting each sample audio in the sample audio data set into a sound event detection model to be trained to obtain a first feature matrix corresponding to each sample audio in the sample audio data set comprises:
resampling each sample audio to obtain a sampled sample audio;
and inputting the sampling sample audio into the sound event detection model to be trained to obtain the first feature matrix corresponding to each sample audio.
5. The method according to claim 4, wherein the inputting the sample audio into the sound event detection model to be trained to obtain the first feature matrix corresponding to each sample audio comprises:
performing framing and windowing processing operation on the sampling sample audio to obtain an intermediate sample audio;
and performing discrete Fourier transform on the intermediate sample audio to obtain the first characteristic matrix.
6. The method of claim 4, wherein the resampling each sample audio to obtain a sampling sample audio comprises:
and performing up-sampling or down-sampling on each sample audio to obtain the sampling sample audio, wherein the resampling comprises up-sampling or down-sampling.
7. A method for detecting a biological sound event using the target sound detection model determined by the method of any one of claims 1 to 6, comprising:
inputting target biological sound to be detected into a target sound detection model to obtain a first characteristic matrix;
processing the first feature matrix through a high-dimensional feature extractor to obtain a high-dimensional feature vector;
performing regularization interpolation operation on the high-dimensional feature vector to obtain a corresponding second feature vector;
calculating similarity values between the second feature vectors and N types of standard feature vectors respectively, and determining a type of standard feature vector corresponding to the minimum similarity value;
and acquiring a target audio label corresponding to the type of standard feature vector, and determining the target audio label as the audio label of the target biological audio.
8. A training device for a biological sound event detection model, comprising:
an obtaining unit, configured to obtain a sample audio data set including a biological sound event and a sample audio tag data set corresponding to the sample audio data set, where each sample audio in the sample audio data set corresponds to one sample audio tag in the sample audio tag data set;
the first input unit is used for inputting each sample audio frequency in the sample audio frequency data set into a to-be-trained sound event detection model to obtain a first feature matrix corresponding to each sample audio frequency in the sample audio frequency data set, wherein the to-be-trained sound event detection model comprises N types of standard audio frequencies used for comparing with the sample audio frequencies, and N is a positive integer greater than or equal to 1;
the first processing unit is used for processing the first feature matrix corresponding to each sample audio through a high-dimensional feature extractor to obtain a high-dimensional feature vector corresponding to each sample audio;
the first interpolation processing unit is used for performing regularized interpolation operation on the high-dimensional feature vector corresponding to each sample audio to obtain a second feature vector corresponding to each sample audio;
a first determining unit, configured to determine, according to the second feature vector corresponding to each sample audio and N types of standard feature vectors, a predicted audio tag data set corresponding to the sample audio data set, where the predicted audio tag data set includes a predicted audio tag of each sample audio in the sample audio data set, and the N types of standard feature vectors correspond to N types of standard audio one-to-one;
and the second determining unit is used for determining the sound event detection model to be trained as a target sound detection model under the condition that the loss function corresponding to the sample audio tag data set and the predicted audio tag data set meets a preset condition.
9. The apparatus according to claim 8, wherein the first interpolation processing unit includes:
the acquisition module is used for acquiring a group of feature vectors corresponding to the high-dimensional feature vectors;
a first calculation module for calculating a euclidean distance of each pair of feature vectors in the set of feature vectors;
the second calculation module is used for calculating an adjacency matrix according to the Euclidean distance;
a third calculation module, configured to perform a Laplacian operation on the adjacency matrix to obtain a propagation matrix;
a first determining module, configured to determine the second feature vector according to the propagation matrix.
10. The apparatus of claim 8, wherein the first determining unit comprises:
performing the following for each sample audio in the sample audio data set:
a fourth calculating module, configured to calculate similarities between the second feature vector corresponding to each sample audio and the N types of standard audio feature vectors, respectively, so as to obtain N similarity values;
a second determining module, configured to determine a target standard audio feature vector corresponding to a minimum value of the N similarity values;
a third determining module, configured to determine a target sample label of a type of standard audio corresponding to the target standard audio feature vector as a predicted audio label of the sample audio, where a predicted audio label dataset of the sample audio dataset includes a predicted audio label of each sample audio in the sample audio dataset.
11. The apparatus of claim 8, wherein the first input unit comprises:
the sampling processing module is used for resampling each sample audio to obtain a sampling sample audio;
and the input module is used for inputting the sampling sample audio frequency into the sound event detection model to be trained to obtain the first characteristic matrix corresponding to each sample audio frequency.
12. The apparatus of claim 11, wherein the first input module comprises:
the processing submodule is used for performing framing and windowing processing operations on the sampling sample audio to obtain an intermediate sample audio;
and the fourth determining module is used for performing discrete Fourier transform on the intermediate sample audio to obtain the first characteristic matrix.
13. The apparatus of claim 11, wherein the sample processing module comprises:
and the sampling processing sub-module is used for performing up-sampling or down-sampling on each sample audio to obtain the sampling sample audio, wherein the resampling processing comprises up-sampling or down-sampling.
14. A bioacoustic event detection device, wherein detection is performed by the target sound detection model determined by the method of any one of claims 1 to 6, comprising:
the second input unit is used for inputting target biological sound to be detected into the target sound detection model to obtain a first characteristic matrix;
the second processing unit is used for processing the first feature matrix through a high-dimensional feature extractor to obtain a high-dimensional feature vector;
the second interpolation processing unit is used for carrying out regularized interpolation operation on the high-dimensional characteristic vector to obtain a corresponding second characteristic vector;
the calculating unit is used for calculating similarity values between the second feature vectors and the N types of standard feature vectors respectively and determining the type of standard feature vector corresponding to the minimum similarity value;
and the third determining unit is used for acquiring the target audio label corresponding to the type of standard feature vector and determining the target audio label as the audio label of the target biological audio.
15. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 6 or 7 when executed.
16. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6 or 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111012585.5A CN113724733B (en) | 2021-08-31 | 2021-08-31 | Biological sound event detection model training method and sound event detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111012585.5A CN113724733B (en) | 2021-08-31 | 2021-08-31 | Biological sound event detection model training method and sound event detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113724733A true CN113724733A (en) | 2021-11-30 |
CN113724733B CN113724733B (en) | 2023-08-01 |
Family
ID=78679714
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111012585.5A Active CN113724733B (en) | 2021-08-31 | 2021-08-31 | Biological sound event detection model training method and sound event detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113724733B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2015057630A (en) * | 2013-08-13 | 2015-03-26 | 日本電信電話株式会社 | Acoustic event identification model learning device, acoustic event detection device, acoustic event identification model learning method, acoustic event detection method, and program |
US20210076966A1 (en) * | 2014-09-23 | 2021-03-18 | Surgical Safety Technologies Inc. | System and method for biometric data capture for event prediction |
US20170278513A1 (en) * | 2016-03-23 | 2017-09-28 | Google Inc. | Adaptive audio enhancement for multichannel speech recognition |
CN106653032A (en) * | 2016-11-23 | 2017-05-10 | 福州大学 | Animal sound detecting method based on multiband energy distribution in low signal-to-noise-ratio environment |
CN109192222A (en) * | 2018-07-23 | 2019-01-11 | 浙江大学 | A kind of sound abnormality detecting system based on deep learning |
WO2020153572A1 (en) * | 2019-01-21 | 2020-07-30 | 휴멜로 주식회사 | Method and apparatus for training sound event detection model |
CN111337277A (en) * | 2020-02-21 | 2020-06-26 | 云知声智能科技股份有限公司 | Household appliance fault determination method and device based on voice recognition |
CN111443328A (en) * | 2020-03-16 | 2020-07-24 | 上海大学 | Sound event detection and positioning method based on deep learning |
CN112863492A (en) * | 2020-12-31 | 2021-05-28 | 思必驰科技股份有限公司 | Sound event positioning model training method and device |
CN113205820A (en) * | 2021-04-22 | 2021-08-03 | 武汉大学 | Method for generating voice coder for voice event detection |
Non-Patent Citations (2)
Title |
---|
TIANTIAN TANG: "CNN-based Discriminative Training for Domain Compensation in Acoustic Event Detection with Frame-wise Classifier", arXiv *
LIU Yaming: "Research on Multiple Sound Event Detection Methods Based on Deep Neural Networks", China Masters' Theses Full-text Database *
Also Published As
Publication number | Publication date |
---|---|
CN113724733B (en) | 2023-08-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110797021B (en) | Hybrid speech recognition network training method, hybrid speech recognition device and storage medium | |
US10332507B2 (en) | Method and device for waking up via speech based on artificial intelligence | |
CN111477250B (en) | Audio scene recognition method, training method and device for audio scene recognition model | |
CN111814810A (en) | Image recognition method and device, electronic equipment and storage medium | |
CN113724734B (en) | Sound event detection method and device, storage medium and electronic device | |
CN111862951A (en) | Voice endpoint detection method and device, storage medium and electronic equipment | |
CN111312286A (en) | Age identification method, age identification device, age identification equipment and computer readable storage medium | |
CN114267345B (en) | Model training method, voice processing method and device | |
CN115035913B (en) | Sound abnormity detection method | |
CN115565548A (en) | Abnormal sound detection method, abnormal sound detection device, storage medium and electronic equipment | |
CN114582325A (en) | Audio detection method and device, computer equipment and storage medium | |
CN111341333A (en) | Noise detection method, noise detection device, medium, and electronic apparatus | |
CN116705059B (en) | Audio semi-supervised automatic clustering method, device, equipment and medium | |
CN111951808B (en) | Voice interaction method, device, terminal equipment and medium | |
CN113724733B (en) | Biological sound event detection model training method and sound event detection method | |
CN113889086A (en) | Training method of voice recognition model, voice recognition method and related device | |
CN113948089B (en) | Voiceprint model training and voiceprint recognition methods, devices, equipment and media | |
CN113762382A (en) | Model training and scene recognition method, device, equipment and medium | |
CN112489678A (en) | Scene recognition method and device based on channel characteristics | |
CN113782033B (en) | Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium | |
CN115099372B (en) | Classification identification method and device | |
CN110827811A (en) | Voice control method and device for household electrical appliance | |
CN114400009B (en) | Voiceprint recognition method and device and electronic equipment | |
CN115796185A (en) | Semantic intention determination method and device, storage medium and electronic device | |
CN116304129A (en) | Method for determining associated object and content recommendation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CP02 | Change in the address of a patent holder | Address after: No. 100 Guilin Road, Xuhui District, Shanghai 200234; Patentee after: SHANGHAI NORMAL University; YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD. Address before: No. 100 Guilin Road, Minhang District, Shanghai 200233; Patentee before: SHANGHAI NORMAL University; YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.
CP02 | Change in the address of a patent holder |