CN116524939A - ECAPA-TDNN-based automatic identification method for bird song species - Google Patents
- Publication number
- CN116524939A (application number CN202310439188.9A)
- Authority
- CN
- China
- Prior art keywords: bird song, ecapa, tdnn, bird, song
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/78—Detection of presence or absence of voice signals
Abstract
The invention discloses an ECAPA-TDNN-based automatic identification method for bird song species, which comprises the following steps: collecting bird song segments, preprocessing them and extracting features to obtain Mel frequency cepstrum coefficients; feeding the Mel frequency cepstrum coefficients into an ECAPA-TDNN network for training; removing mute segments through a voice endpoint detection algorithm based on a Gaussian mixture model and extracting segments containing bird song; obtaining the Mel frequency cepstrum coefficients corresponding to the bird song segments and inputting them into the trained model for recognition to obtain the results; and displaying the identification results one by one on a graphical user interface, counting the number of each category, drawing a spectrogram, and displaying the identification information in an exported table. According to the invention, the ECAPA-TDNN model improves the accuracy of bird song classification, automatic segmentation and preliminary analysis of long bird song recordings are realized, the workload of manual cutting is reduced, and subsequent in-depth analysis in the ecological field is facilitated.
Description
Technical Field
The invention belongs to the technical field of acoustic monitoring and audio signal identification, and particularly relates to an ECAPA-TDNN-based automatic identification method for bird song species.
Background
Birds are an important component of ecosystems and provide an important basis for ecologists to understand regional biodiversity and climate change. Passive acoustic monitoring is low-cost, wide-ranging and non-invasive, which makes bird song signals an important data source for monitoring bird activity. With the worsening ecological environment and growing management challenges in recent years, species identification, behavior analysis and acoustic index research based on bird song have great application value.
At present, species identification algorithms based on bird song mainly include: 1) classification methods based on template matching, such as the dynamic time warping (DTW) template algorithm, which suffer from a heavy computational load and low efficiency; 2) traditional machine learning algorithms, such as random forests, support vector machines and hidden Markov models, which are strongly affected by noise and require datasets with a high signal-to-noise ratio; 3) species identification using deep learning, such as AlexNet and VGG16 models, which is currently popular but rarely applied to bird-song-based species identification in China.
The ECAPA-TDNN model, proposed in 2020, introduces an SE (squeeze-and-excitation) module and a channel attention mechanism so that the model learns more global information from audio data, and it has become a mainstream voiceprint model. The open-source voiceprint recognition system released by Baidu PaddleSpeech uses ECAPA-TDNN to extract voiceprint features and achieves an equal error rate as low as 0.95%.
Thus, the major problems of the prior art are: the field of bird song species identification lacks a verified, high-accuracy mainstream neural network model. In addition, current bird song recognition algorithms require manually cut segments containing only bird song for prediction; when long bird song recordings are input, the manual workload is large and observer bias is introduced, which hinders in-depth analysis and follow-up research.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides an ECAPA-TDNN-based automatic identification method for bird song species, which improves identification accuracy and realizes automatic segmentation of bird song audio and species identification.
The technical solution for realizing the purpose of the invention is as follows: an ECAPA-TDNN-based method for automatically identifying bird song species, the method comprising the steps of:
step 1, collecting a bird song signal, preprocessing the bird song signal to construct a bird song data set, and then obtaining a Mel frequency cepstrum coefficient through feature extraction;
step 2, inputting the Mel frequency cepstrum coefficients into an ECAPA-TDNN network for training to obtain an ECAPA-TDNN bird song classification model;
step 3, inputting bird song audio, removing mute fragments through a voice endpoint detection algorithm based on a Gaussian mixture model, and extracting fragments containing bird song;
step 4, extracting the Mel frequency cepstrum coefficients from the segments containing bird song in the manner of step 1, and inputting them into the ECAPA-TDNN bird song classification model for recognition to obtain the recognition result.
Further, the preprocessing in step 1 specifically includes:
step 1-1-1, aiming at a bird song signal, eliminating a direct current component and carrying out pre-emphasis;
step 1-1-2, performing high-pass filtering;
step 1-1-3, carrying out framing treatment;
step 1-1-4, windowing is carried out by using a Hanning window.
Further, the pre-emphasis in step 1-1-1 is specifically performed through the transfer function H(z) = 1 - a·z⁻¹, where a is the pre-emphasis coefficient.
Further, in the step 1, the obtaining the mel frequency cepstrum coefficient through feature extraction specifically includes:
step 1-2-1, performing short-time Fourier transform on a bird song signal in a bird song data set, taking an absolute value of the obtained value, and then squaring to obtain an energy spectrogram;
step 1-2-2, constructing a Mel filter group, and performing dot product operation with the energy spectrum to obtain a Mel spectrogram;
step 1-2-3, taking the logarithm of the Mel spectrogram;
step 1-2-4, performing discrete cosine transform, and taking the first P data to obtain a Mel frequency cepstrum coefficient; and P is an integer.
Further, the ECAPA-TDNN network comprises a convolutional neural layer, three SE-Res2Block layers, an Attentive Statistics Pooling layer and a fully connected layer;
inputting the mel-frequency cepstrum coefficient into an ECAPA-TDNN network model for training, wherein the specific process comprises the following steps:
inputting mel frequency cepstrum coefficients into a convolutional neural layer to obtain potential audio features;
carrying out multi-layer feature fusion on the potential audio features through an SE-Res2Block layer, and extracting global information;
the output of the three SE-Res2Block layers is connected in series according to the characteristic dimension;
the average value and the standard deviation based on the attention mechanism are obtained through an Attentive Statistics Pooling (statistics pooling with an attention mechanism) layer, and a vector is obtained by concatenation along the feature dimension;
carrying out softmax classification on the vectors through a full connection layer to obtain classification results;
based on the classification result, updating the network parameters through back propagation using a cross entropy loss function to obtain the ECAPA-TDNN bird song classification model.
Further, the step 3 specifically includes the following steps:
step 3-1, preprocessing and framing the bird song audio, wherein the preprocessing flow is the same as in step 1; creating a plurality of classes, dividing the bird song audio into segments and storing them in the classes;
step 3-2, mute judgment, specifically comprising:
step 3-2-1, dividing sub-bands for the segments, and calculating the logarithmic energy of the sub-bands;
step 3-2-2, when the total energy of the frame is larger than the minimum energy required to trigger the audio signal, calculating for each sub-band the probability P(X|H1) of a voiced signal based on the voiced Gaussian mixture model and the probability P(X|H0) of noise based on the noise Gaussian mixture model;
step 3-2-3, calculating the likelihood ratio of the sub-band based on the two probabilities: likelihood ratio = log(P(X|H1)/P(X|H0));
step 3-2-4, judging the frame to be a voiced segment if the likelihood ratio of any sub-band meets a preset threshold;
or accumulating the likelihood ratios of all sub-bands into an overall likelihood ratio, and judging the frame to be a voiced segment if the overall likelihood ratio meets a preset threshold;
otherwise, judging the frame to be a mute segment;
step 3-3, collecting sound fragments, which specifically comprises the following steps:
step 3-3-1, creating a double-ended data container (deque), adding objects to it, and obtaining the number of voiced objects;
step 3-3-2, when the number of voiced segments is greater than 90% of the capacity of the data container, judging that bird song has started, and writing the data in the current data container into a constructed empty list;
step 3-3-3, repeating step 3-3-1 and step 3-3-2, and ending the list writing when the number of mute segments exceeds 90% of the container capacity;
step 3-3-4, returning the list data, namely the segments containing only bird song.
Further, the step 3-2-1 specifically comprises: according to the frequency spectrum characteristics and energy distribution of the bird song, dividing each frame of the bird song signal into six sub-bands of 200-2000Hz, 2000-3000Hz, 3000-3500Hz, 3500-4500Hz, 4500-8000Hz and 8000-24000Hz, calculating the sub-band energy and the total energy, and taking the logarithm.
Further, the method further comprises:
step 5, displaying the identification results one by one on a graphical user interface, counting the number of identification results in each category, drawing a spectrogram, and displaying the identification information in an exported table, wherein the identification information comprises the start and stop times of the bird song, the identification results and the similarities.
Further, the specific process of step 5 includes:
step 5-1, building a graphical user interface, displaying the indices of the segmented fragments, the identification results and the similarities one by one, and counting the numbers by bird song category;
step 5-2, drawing a spectrogram of the input bird song audio;
step 5-3, creating a new table and storing the start and stop times, the identification results and the similarity information of the bird song.
Compared with the prior art, the invention has the remarkable advantages that:
1) Compared with traditional machine learning, deep learning can quickly and accurately learn latent audio features; the ECAPA-TDNN model emphasizes the channel attention mechanism and multi-layer feature fusion, and experimental results show that the accuracy of bird song species classification is significantly improved.
2) The silence detection algorithm designed on the basis of the Gaussian mixture model realizes automatic cutting of bird song fragments with high accuracy and good segmentation results.
3) The graphical user interface designed by the invention is convenient and intuitive; the user can independently select the audio to be identified, and the statistics, plotting and table functions provide preliminary analysis and display of the audio to be identified.
The invention is described in further detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of an ECAPA-TDNN-based method for automatically identifying a bird song species.
FIG. 2 is a block diagram of ECAPA-TDNN.
FIG. 3 is a block diagram of the SE-Res2Block module of ECAPA-TDNN.
Fig. 4 is a flow chart of silence detection based on a gaussian mixture model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that, if directional indications (such as up, down, left, right, front and rear) are included in the embodiments of the present invention, the directional indications are merely used to explain the relative positional relationship, movement conditions, etc. between the components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indications change accordingly.
In addition, if descriptions of "first", "second", etc. appear in the embodiments of the present invention, they are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. The technical solutions of the embodiments may also be combined with each other, but only on the basis that they can be realized by those skilled in the art; when technical solutions are contradictory or cannot be realized, their combination should be considered absent and outside the scope of protection claimed in the present invention.
In one embodiment, in conjunction with FIG. 1, there is provided an ECAPA-TDNN-based method for automatically identifying a bird song species, the method comprising the steps of:
step 1, collecting a bird song signal, preprocessing the bird song signal to form a purer bird song data set, and then obtaining a Mel frequency cepstrum coefficient through feature extraction;
the pretreatment comprises the following steps:
step 1-1-1, for the bird song signal, eliminating the DC component, pre-emphasis is performed, specifically by a transfer function of H (z) =1-az -1 A is a pre-emphasis coefficient, a=0.97, to boost high frequency components in the signal;
step 1-1-2, performing high-pass filtering; because the voice noise and the environmental noise are mainly concentrated below 350Hz, the signals pass through an 8-order Butterworth high-pass filter with the cut-off frequency of 350Hz so as to obtain purer bird sound signals;
step 1-1-3, carrying out frame division processing, wherein the frame length is 2048, and the frame shift is 512;
step 1-1-4, windowing is carried out by using a hanning window so as to eliminate the discontinuity between frames after framing.
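For illustration, a minimal preprocessing sketch corresponding to steps 1-1-1 to 1-1-4 is given below. It assumes a 48 kHz sampling rate (implied by the 8000-24000 Hz sub-band used later) and NumPy/SciPy as the signal-processing libraries; the function and parameter names are hypothetical and not prescribed by the invention.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def preprocess(signal, sr=48000, a=0.97, frame_len=2048, hop=512):
    """Preprocess a bird song signal: remove DC, pre-emphasize, high-pass filter,
    then frame and window (steps 1-1-1 to 1-1-4)."""
    # Step 1-1-1: remove the DC component and apply pre-emphasis H(z) = 1 - a*z^-1
    signal = signal - np.mean(signal)
    signal = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Step 1-1-2: 8th-order Butterworth high-pass filter, 350 Hz cut-off
    sos = butter(8, 350, btype="highpass", fs=sr, output="sos")
    signal = sosfilt(sos, signal)
    # Steps 1-1-3 and 1-1-4: framing (length 2048, shift 512) with a Hanning window
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames
```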
The obtaining of the mel frequency cepstrum coefficient through feature extraction specifically comprises:
step 1-2-1, performing short-time Fourier transform on a bird song signal in a bird song data set, taking an absolute value of the obtained value, and then squaring to obtain an energy spectrogram;
step 1-2-2, constructing a Mel filter group, wherein the number of filters is 128, and performing dot product operation with the energy spectrum to obtain a Mel spectrogram;
step 1-2-3, taking the logarithm of the Mel spectrogram;
step 1-2-4, performing a discrete cosine transform and taking the first 80 coefficients (i.e., P = 80) to obtain the Mel frequency cepstrum coefficients; a feature-extraction sketch is given below.
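The feature extraction of steps 1-2-1 to 1-2-4 could be sketched as follows, assuming librosa and SciPy are available; the 128-filter Mel bank and the first 80 coefficients follow the values stated above, while the function names are illustrative only.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def extract_mfcc(signal, sr=48000, n_fft=2048, hop=512, n_mels=128, n_mfcc=80):
    """Steps 1-2-1 to 1-2-4: STFT energy spectrogram, 128-filter Mel bank,
    log compression, DCT, keep the first 80 coefficients."""
    # Step 1-2-1: short-time Fourier transform -> |.|^2 energy spectrogram
    energy_spec = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop)) ** 2
    # Step 1-2-2: apply a 128-filter Mel filter bank to the energy spectrogram
    mel_spec = librosa.feature.melspectrogram(S=energy_spec, sr=sr, n_mels=n_mels)
    # Step 1-2-3: take the logarithm of the Mel spectrogram
    log_mel = np.log(mel_spec + 1e-10)
    # Step 1-2-4: discrete cosine transform, keep the first 80 (P = 80) coefficients
    mfcc = dct(log_mel, axis=0, type=2, norm="ortho")[:n_mfcc]
    return mfcc  # shape: (80, number of frames)
```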
Step 2, inputting the Mel frequency cepstrum coefficients into the ECAPA-TDNN network (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network) for training to obtain the ECAPA-TDNN bird song classification model; in connection with fig. 2, the ECAPA-TDNN network includes a convolutional neural layer, three SE-Res2Block layers, an Attentive Statistics Pooling layer (statistics pooling with an attention mechanism), and a fully connected layer. The specific process of this step comprises:
step 2-1, obtaining potential audio characteristics by passing the mel frequency cepstrum coefficient through a convolutional neural layer;
referring to fig. 3, step 2-2, performing multi-layer feature fusion on the potential audio features through an SE-Res2Block layer, and extracting global information;
step 2-3, concatenating the outputs of the three SE-Res2Block layers along the feature dimension;
step 2-4, obtaining the attention-based mean and standard deviation through an Attentive Statistics Pooling layer, and concatenating them along the feature dimension to obtain a 3072-dimensional vector;
step 2-5, carrying out softmax classification on the vectors through a full connection layer to obtain classification results;
and 2-6, updating network parameters by using a cross entropy loss function through back propagation based on the classification result to obtain the ECAPA-TDNN birdcasting classification model.
Step 3, inputting 1-minute bird song audio, removing mute fragments by a voice endpoint detection algorithm based on a Gaussian mixture model, and extracting fragments containing bird song. Referring to fig. 4, this specifically includes:
step 3-1, preprocessing and framing the bird song audio, wherein the preprocessing flow is the same as that of step 1; creating a plurality of classes, dividing the bird song audio into fragments every 100 ms, and storing them in the classes;
step 3-2, mute judgment, specifically comprising:
step 3-2-1, dividing the segment into sub-bands and calculating the logarithmic energy of each sub-band; according to the spectral characteristics and energy distribution of bird song, each frame of the bird song signal is divided into six sub-bands of 200-2000 Hz, 2000-3000 Hz, 3000-3500 Hz, 3500-4500 Hz, 4500-8000 Hz and 8000-24000 Hz, the sub-band energies and the total energy are calculated, and their logarithms are taken;
step 3-2-2, when the total energy of the frame is larger than the minimum energy required to trigger the audio signal, calculating for each sub-band the probability P(X|H1) of a voiced signal based on the voiced Gaussian mixture model and the probability P(X|H0) of noise based on the noise Gaussian mixture model;
step 3-2-3, calculating the likelihood ratio of the sub-band based on the two probabilities: likelihood ratio = log(P(X|H1)/P(X|H0));
step 3-2-4, judging the frame to be a voiced segment if the likelihood ratio of a certain sub-band meets a preset threshold (local);
or accumulating the likelihood ratios of all sub-bands into an overall likelihood ratio, and judging the frame to be a voiced segment if the overall likelihood ratio meets a preset threshold (global); the frame is voiced as long as either the local or the global criterion is satisfied;
otherwise, judging the frame to be a mute segment;
step 3-3, collecting sound fragments, which specifically comprises the following steps:
step 3-3-1, creating a double-ended data container (deque) and adding objects to it (each an instance of a Frame class together with its silence detection result), and obtaining the number of voiced objects;
step 3-3-2, when the number of voiced segments is greater than 90% of the capacity of the data container, judging that bird song has started, and writing the data in the current data container into a constructed empty list;
step 3-3-3, repeating step 3-3-1 and step 3-3-2, and ending the list writing when the number of mute segments exceeds 90% of the container capacity;
step 3-3-4, returning the list data, namely the fragments containing only bird song; a code sketch of this silence detection and collection procedure is given below.
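A simplified sketch of the silence detection and segment collection of steps 3-1 to 3-3-4 follows. The thresholds, the window size and the use of scikit-learn Gaussian mixtures (assumed to be pre-fitted on voiced and noise frames) are illustrative assumptions; for brevity a single one-dimensional mixture is shared across sub-bands rather than one model per sub-band.

```python
import collections
import numpy as np
from sklearn.mixture import GaussianMixture

SUBBANDS = [(200, 2000), (2000, 3000), (3000, 3500),
            (3500, 4500), (4500, 8000), (8000, 24000)]

# gmm_voiced and gmm_noise below are assumed to be GaussianMixture models
# pre-fitted on sub-band log energies of voiced and noise frames, e.g.:
#   gmm_voiced = GaussianMixture(n_components=2).fit(voiced_energies.reshape(-1, 1))

def subband_log_energies(frame, sr=48000):
    """Step 3-2-1: logarithmic energy of the six sub-bands of one frame."""
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    power = np.abs(np.fft.rfft(frame)) ** 2
    return np.array([np.log(power[(freqs >= lo) & (freqs < hi)].sum() + 1e-10)
                     for lo, hi in SUBBANDS])

def is_voiced(frame, gmm_voiced, gmm_noise, sr=48000,
              local_thr=3.0, global_thr=6.0, min_energy=1e-6):
    """Steps 3-2-2 to 3-2-4: likelihood-ratio test between voiced and noise GMMs."""
    if np.sum(frame ** 2) < min_energy:            # below the trigger energy -> mute
        return False
    e = subband_log_energies(frame, sr).reshape(-1, 1)
    # log P(X|H1) - log P(X|H0) = log(P(X|H1)/P(X|H0)) per sub-band
    llr = gmm_voiced.score_samples(e) - gmm_noise.score_samples(e)
    return bool(np.any(llr > local_thr) or llr.sum() > global_thr)  # local OR global

def collect_voiced_segments(frames, gmm_voiced, gmm_noise, window=30):
    """Steps 3-3-1 to 3-3-4: deque-based collection of segments containing bird song."""
    ring = collections.deque(maxlen=window)        # double-ended data container
    segments, current, triggered = [], [], False
    for frame in frames:
        ring.append((frame, is_voiced(frame, gmm_voiced, gmm_noise)))
        num_voiced = sum(v for _, v in ring)
        if not triggered and num_voiced > 0.9 * ring.maxlen:   # >90% voiced -> song starts
            triggered = True
            current = [f for f, _ in ring]
        elif triggered:
            current.append(frame)
            if (len(ring) - num_voiced) > 0.9 * ring.maxlen:   # >90% mute -> song ends
                segments.append(np.concatenate(current))
                current, triggered = [], False
    return segments                                # list of arrays containing only bird song
```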
Step 4, extracting the Mel frequency cepstrum coefficients from the segments containing bird song in the manner of step 1, and inputting them into the ECAPA-TDNN bird song classification model for recognition to obtain the recognition results.
Step 5, displaying the identification results one by one on a graphical user interface, counting the number of identification results in each category, drawing a spectrogram, and displaying the identification information in an exported table, wherein the identification information comprises the start and stop times of the bird song, the identification results, the similarities, etc. This step specifically comprises:
step 5-1, building a graphical user interface, displaying the indices of the segmented fragments, the identification results and the similarities one by one, and counting the numbers by bird song category;
step 5-2, drawing a spectrogram of the input bird song audio;
step 5-3, creating a new Excel table and storing the start and stop times, the identification results, the similarity information, etc. of the bird song; an export sketch is given below.
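A minimal export sketch for step 5 is given below, assuming pandas (with openpyxl), matplotlib and librosa for the table, statistics and spectrogram; the file names and record format are hypothetical.

```python
import numpy as np
import pandas as pd
import librosa
import librosa.display
import matplotlib.pyplot as plt

def export_results(records, audio, sr, xlsx_path="recognition_results.xlsx"):
    """records: list of dicts with keys start_s, stop_s, species, similarity (hypothetical)."""
    df = pd.DataFrame(records, columns=["start_s", "stop_s", "species", "similarity"])
    df.to_excel(xlsx_path, index=False)             # step 5-3: exported Excel table
    counts = df["species"].value_counts()           # step 5-1: counts per bird song category
    # Step 5-2: spectrogram of the full input bird song audio
    S = librosa.amplitude_to_db(np.abs(librosa.stft(audio)), ref=np.max)
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(S, sr=sr, x_axis="time", y_axis="hz")
    plt.colorbar(format="%+2.0f dB")
    plt.savefig("spectrogram.png")
    return counts
```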
In one embodiment, an ECAPA-TDNN based automated bird song species identification system is provided, the system comprising:
the first module is used for preprocessing the bird song signal and extracting the characteristics to obtain a Mel frequency cepstrum coefficient;
the second module is used for training the ECAPA-TDNN network based on the Mel frequency cepstrum coefficients to obtain an ECAPA-TDNN bird song classification model;
the third module is used for performing silence detection, removing silence fragments and generating a data set only containing bird song fragments;
a fourth module, configured to perform preprocessing and feature extraction on the segmented bird song segments and to identify them through the ECAPA-TDNN bird song classification model;
and the fifth module is used for realizing user interaction and displaying classification and analysis results.
Specific limitations regarding the ECAPA-TDNN-based bird song species automatic identification system may be found in the above description of the method for automatically identifying bird song species based on ECAPA-TDNN, and will not be described in detail herein. The modules in the ECAPA-TDNN-based bird song species automatic identification system can be fully or partially implemented by software, hardware and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of when executing the computer program:
step 1, collecting a bird song signal, preprocessing the bird song signal to construct a bird song data set, and then obtaining a Mel frequency cepstrum coefficient through feature extraction;
step 2, inputting the Mel frequency cepstrum coefficients into an ECAPA-TDNN network for training to obtain an ECAPA-TDNN bird song classification model;
step 3, inputting bird song audio, removing mute fragments through a voice endpoint detection algorithm based on a Gaussian mixture model, and extracting fragments containing bird song;
step 4, extracting a Mel frequency cepstrum coefficient based on the segment containing the bird song according to the mode of the step 1, inputting the ECAPA-TDNN bird song classification model for recognition, and obtaining a recognition result;
step 5, displaying the identification results one by one on a graphical user interface, counting the number of identification results in each category, drawing a spectrogram, and displaying the identification information in an exported table, wherein the identification information comprises the start and stop times of the bird song, the identification results and the similarities.
For specific limitations on each step, reference may be made to the above limitations on the ECAPA-TDNN-based method for automatic identification of bird song species, which are not described in detail herein.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
step 1, collecting a bird song signal, preprocessing the bird song signal to construct a bird song data set, and then obtaining a Mel frequency cepstrum coefficient through feature extraction;
step 2, inputting the Mel frequency cepstrum coefficients into an ECAPA-TDNN network for training to obtain an ECAPA-TDNN bird song classification model;
step 3, inputting bird song audio, removing mute fragments through a voice endpoint detection algorithm based on a Gaussian mixture model, and extracting fragments containing bird song;
step 4, extracting a Mel frequency cepstrum coefficient based on the segment containing the bird song according to the mode of the step 1, inputting the ECAPA-TDNN bird song classification model for recognition, and obtaining a recognition result;
step 5, displaying the identification results one by one on a graphical user interface, counting the number of identification results in each category, drawing a spectrogram, and displaying the identification information in an exported table, wherein the identification information comprises the start and stop times of the bird song, the identification results and the similarities.
For specific limitations on each step, reference may be made to the above limitations on the ECAPA-TDNN-based method for automatic identification of bird song species, which are not described in detail herein.
The method is convenient and fast, highly practical and accurate; it fully exploits the advantages of deep learning, realizes automatic segmentation and species identification of bird song recording segments, and is of great significance for studying ecosystem biodiversity and protecting endangered birds.
The foregoing has outlined and described the basic principles, features and advantages of the present invention. It will be understood by those skilled in the art that the foregoing embodiments do not limit the invention; they merely illustrate its principles, and various modifications, equivalent substitutions and improvements may be made without departing from the spirit and scope of the invention, all of which fall within the scope of protection claimed.
Claims (10)
1. An ECAPA-TDNN-based automatic identification method for a bird song species, which is characterized by comprising the following steps:
step 1, collecting a bird song signal, preprocessing the bird song signal to construct a bird song data set, and then obtaining a Mel frequency cepstrum coefficient through feature extraction;
step 2, inputting the Mel frequency cepstrum coefficients into an ECAPA-TDNN network for training to obtain an ECAPA-TDNN bird song classification model;
step 3, inputting bird song audio, removing mute fragments through a voice endpoint detection algorithm based on a Gaussian mixture model, and extracting fragments containing bird song;
step 4, extracting the Mel frequency cepstrum coefficients from the segments containing bird song in the manner of step 1, and inputting them into the ECAPA-TDNN bird song classification model for recognition to obtain the recognition result.
2. The ECAPA-TDNN based method for automatic identification of a bird song species according to claim 1, wherein the preprocessing in step 1 specifically includes:
step 1-1-1, aiming at a bird song signal, eliminating a direct current component and carrying out pre-emphasis;
step 1-1-2, performing high-pass filtering;
step 1-1-3, carrying out framing treatment;
step 1-1-4, windowing is carried out by using a Hanning window.
3. The ECAPA-TDNN based automatic identification method of a bird song species according to claim 2, wherein the pre-emphasis in step 1-1-1 is specifically performed through the transfer function H(z) = 1 - a·z⁻¹, where a is the pre-emphasis coefficient.
4. The automatic identification method of a bird song species based on ECAPA-TDNN according to claim 2, wherein the obtaining a mel frequency cepstrum coefficient by feature extraction in step 1 specifically includes:
step 1-2-1, performing short-time Fourier transform on a bird song signal in a bird song data set, taking an absolute value of the obtained value, and then squaring to obtain an energy spectrogram;
step 1-2-2, constructing a Mel filter group, and performing dot product operation with the energy spectrum to obtain a Mel spectrogram;
step 1-2-3, taking the logarithm of the Mel spectrogram;
step 1-2-4, performing discrete cosine transform, and taking the first P data to obtain a Mel frequency cepstrum coefficient; and P is an integer.
5. The ECAPA-TDNN based automatic identification method of a bird song species as claimed in claim 1, wherein the ECAPA-TDNN network includes a convolutional neural layer, three SE-Res2Block layers, an Attentive Statistics Pooling layer, and a fully connected layer;
inputting the mel-frequency cepstrum coefficient into an ECAPA-TDNN network model for training, wherein the specific process comprises the following steps:
inputting mel frequency cepstrum coefficients into a convolutional neural layer to obtain potential audio features;
carrying out multi-layer feature fusion on the potential audio features through an SE-Res2Block layer, and extracting global information;
the output of the three SE-Res2Block layers is connected in series according to the characteristic dimension;
the average value and the standard deviation based on the attention mechanism are obtained through an Attentive Statistics Pooling layer, and a vector is obtained by concatenation along the feature dimension;
carrying out softmax classification on the vectors through a full connection layer to obtain classification results;
based on the classification result, updating the network parameters through back propagation using a cross entropy loss function to obtain the ECAPA-TDNN bird song classification model.
6. The ECAPA-TDNN based method for automatic identification of a bird song species of claim 1, wherein step 3 specifically comprises the following steps:
step 3-1, preprocessing and framing the bird song audio, wherein the preprocessing flow is the same as in step 1; creating a plurality of classes, dividing the bird song audio into segments and storing them in the classes;
step 3-2, mute judgment, specifically comprising:
step 3-2-1, dividing sub-bands for the segments, and calculating the logarithmic energy of the sub-bands;
step 3-2-2, when the total energy of the frame is larger than the minimum energy required to trigger the audio signal, calculating for each sub-band the probability P(X|H1) of a voiced signal based on the voiced Gaussian mixture model and the probability P(X|H0) of noise based on the noise Gaussian mixture model;
step 3-2-3, calculating the likelihood ratio of the sub-band based on the two probabilities: likelihood ratio = log(P(X|H1)/P(X|H0));
step 3-2-4, judging the frame to be a voiced segment if the likelihood ratio of any sub-band meets a preset threshold;
or accumulating the likelihood ratios of all sub-bands into an overall likelihood ratio, and judging the frame to be a voiced segment if the overall likelihood ratio meets a preset threshold;
otherwise, judging the frame to be a mute segment;
step 3-3, collecting sound fragments, which specifically comprises the following steps:
step 3-3-1, creating a double-ended data container (deque), adding objects to it, and obtaining the number of voiced objects;
step 3-3-2, when the number of voiced segments is greater than 90% of the capacity of the data container, judging that bird song has started, and writing the data in the current data container into a constructed empty list;
step 3-3-3, repeating step 3-3-1 and step 3-3-2, and ending the list writing when the number of mute segments exceeds 90% of the container capacity;
step 3-3-4, returning the list data, namely the segments containing only bird song.
7. The ECAPA-TDNN based method for automatic identification of a bird song species of claim 6, wherein step 3-2-1 specifically comprises: according to the frequency spectrum characteristics and energy distribution of the bird song, dividing each frame of the bird song signal into six sub-bands of 200-2000Hz, 2000-3000Hz, 3000-3500Hz, 3500-4500Hz, 4500-8000Hz and 8000-24000Hz, calculating the sub-band energy and the total energy, and taking the logarithm.
8. The ECAPA-TDNN based method for automatic identification of a bird song species of claim 6, further comprising:
step 5, displaying the identification results one by one on a graphical user interface, counting the number of identification results in each category, drawing a spectrogram, and displaying the identification information in an exported table, wherein the identification information comprises the start and stop times of the bird song, the identification results and the similarities.
9. The ECAPA-TDNN based bird song species automatic identification method of claim 8, wherein the step 5 specific process includes:
step 5-1, building a graphical user interface, displaying the indices of the segmented fragments, the identification results and the similarities one by one, and counting the numbers by bird song category;
step 5-2, drawing a spectrogram of the input bird song audio;
step 5-3, creating a new table and storing the start and stop times, the identification results and the similarity information of the bird song.
10. An ECAPA-TDNN based automatic bird song species identification system based on the method according to any one of claims 1 to 9, characterized in that the system comprises:
the first module is used for preprocessing the bird song signal and extracting the characteristics to obtain a Mel frequency cepstrum coefficient;
the second module is used for training the ECAPA-TDNN network based on the Mel frequency cepstrum coefficients to obtain an ECAPA-TDNN bird song classification model;
the third module is used for performing silence detection, removing silence fragments and generating a data set only containing bird song fragments;
a fourth module, configured to perform preprocessing and feature extraction on the segmented bird song segments and to identify them through the ECAPA-TDNN bird song classification model;
and the fifth module is used for realizing user interaction and displaying classification and analysis results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310439188.9A CN116524939A (en) | 2023-04-23 | 2023-04-23 | ECAPA-TDNN-based automatic identification method for bird song species |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116524939A true CN116524939A (en) | 2023-08-01 |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117095694A (en) * | 2023-10-18 | 2023-11-21 | 中国科学技术大学 | Bird song recognition method based on tag hierarchical structure attribute relationship |
CN117095694B (en) * | 2023-10-18 | 2024-02-23 | 中国科学技术大学 | Bird song recognition method based on tag hierarchical structure attribute relationship |
CN117727309A (en) * | 2024-02-18 | 2024-03-19 | 百鸟数据科技(北京)有限责任公司 | Automatic identification method for bird song species based on TDNN structure |
CN117727332A (en) * | 2024-02-18 | 2024-03-19 | 百鸟数据科技(北京)有限责任公司 | Ecological population assessment method based on language spectrum feature analysis |
CN117727309B (en) * | 2024-02-18 | 2024-04-26 | 百鸟数据科技(北京)有限责任公司 | Automatic identification method for bird song species based on TDNN structure |
CN117727332B (en) * | 2024-02-18 | 2024-04-26 | 百鸟数据科技(北京)有限责任公司 | Ecological population assessment method based on language spectrum feature analysis |
CN117746871A (en) * | 2024-02-21 | 2024-03-22 | 南方科技大学 | Cloud-based bird song detection method and system |
CN117746871B (en) * | 2024-02-21 | 2024-07-16 | 深圳市规划和自然资源数据管理中心(深圳市空间地理信息中心) | Cloud-based bird song detection method and system |
Legal Events
Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |