CN116524939A - ECAPA-TDNN-based automatic identification method for bird song species - Google Patents

Info

Publication number
CN116524939A
CN116524939A
Authority
CN
China
Prior art keywords
bird song
ecapa
tdnn
bird
song
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310439188.9A
Other languages
Chinese (zh)
Inventor
赵兆
鞠然然
许志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202310439188.9A priority Critical patent/CN116524939A/en
Publication of CN116524939A publication Critical patent/CN116524939A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L17/26 - Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/24 - the extracted parameters being the cepstrum
    • G10L25/78 - Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The invention discloses an ECAPA-TDNN-based automatic identification method for bird song species, which comprises the following steps: collecting bird song segments, then preprocessing them and extracting features to obtain Mel-frequency cepstral coefficients (MFCCs); feeding the MFCCs into an ECAPA-TDNN network for training; removing silent segments with a voice endpoint detection algorithm based on a Gaussian mixture model and extracting the segments that contain bird song; obtaining the MFCCs of those segments and inputting them into the trained model for recognition; and displaying the recognition results one by one on a graphical user interface, counting the number per category, drawing a spectrogram, and exporting the identification information to a table. The ECAPA-TDNN model improves accuracy in the bird song classification setting, and the method automatically classifies and preliminarily analyzes long bird song recordings, reducing the workload of manual cutting and facilitating subsequent in-depth analysis in the ecological field.

Description

ECAPA-TDNN-based automatic identification method for bird song species
Technical Field
The invention belongs to the technical field of acoustic monitoring and audio signal identification, and particularly relates to an ECAPA-TDNN-based automatic identification method for bird song species.
Background
Birds are an important component of ecosystems and give ecologists an important basis for understanding regional biodiversity and climate change. Passive acoustic monitoring is low-cost, wide-ranging, and non-invasive, which makes bird song signals an important data source for monitoring bird activity. With the continuing deterioration of the ecological environment in recent years and the serious difficulties this creates for management, species identification, behavior analysis, and acoustic index research based on bird song have great application value.
At present, species identification algorithms based on bird song mainly fall into three categories: 1) template-matching methods, such as the dynamic time warping (DTW) template algorithm, which suffer from heavy computation and low efficiency; 2) traditional machine learning algorithms, such as random forests, support vector machines, and hidden Markov models, which are strongly affected by noise and demand a high signal-to-noise ratio from the data set; and 3) deep learning methods, such as the AlexNet and VGG16 models, which are currently popular for species identification but are still rarely applied to bird-song-based species identification in China.
The ECAPA-TDNN model, proposed in 2020, introduced squeeze-and-excitation (SE) modules and a channel attention mechanism so that the model learns more global information from audio data, and it has become a mainstream voiceprint model. The open-source voiceprint recognition system released under Baidu's PaddleSpeech uses ECAPA-TDNN to extract voiceprint features and achieves an equal error rate as low as 0.95%.
The main problems with the prior art are therefore these: the field of bird song species identification lacks a verified, high-accuracy mainstream neural network model. In addition, current bird song recognition algorithms require segments containing only bird song to be cut out manually before prediction; when long bird song recordings are input, the manual workload is large and observer bias is introduced, hindering in-depth analysis and follow-up research.
Disclosure of Invention
The invention aims to solve the problems in the prior art by providing an ECAPA-TDNN-based automatic identification method for bird song species that improves identification accuracy and realizes automatic segmentation of bird song audio together with species identification.
The technical solution for realizing the purpose of the invention is as follows: an ECAPA-TDNN-based method for automatically identifying bird song species, the method comprising the following steps:
step 1, collecting bird song signals, preprocessing them to construct a bird song data set, and then obtaining Mel-frequency cepstral coefficients through feature extraction;
step 2, inputting the Mel-frequency cepstral coefficients into an ECAPA-TDNN network for training to obtain an ECAPA-TDNN bird song classification model;
step 3, inputting bird song audio, removing silent segments with a voice endpoint detection algorithm based on a Gaussian mixture model, and extracting the segments containing bird song;
step 4, extracting Mel-frequency cepstral coefficients from the segments containing bird song in the manner of step 1 and inputting them into the ECAPA-TDNN bird song classification model for recognition to obtain the recognition results.
Further, the preprocessing in step 1 specifically includes:
step 1-1-1, for the bird song signal, eliminating the direct current component and performing pre-emphasis;
step 1-1-2, performing high-pass filtering;
step 1-1-3, performing framing;
step 1-1-4, windowing with a Hanning window.
Further, the pre-emphasis in step 1-1-1 is specifically performed through the transfer function H(z) = 1 - a·z⁻¹, where a is the pre-emphasis coefficient.
Further, obtaining the Mel-frequency cepstral coefficients through feature extraction in step 1 specifically includes:
step 1-2-1, performing a short-time Fourier transform on each bird song signal in the bird song data set, taking the absolute value of the result, and squaring it to obtain the energy spectrogram;
step 1-2-2, constructing a Mel filter bank and computing its dot product with the energy spectrum to obtain the Mel spectrogram;
step 1-2-3, taking the logarithm of the Mel spectrogram;
step 1-2-4, performing a discrete cosine transform and keeping the first P coefficients, where P is an integer, to obtain the Mel-frequency cepstral coefficients.
Further, the ECAPA-TDNN network comprises a convolutional layer, three SE-Res2Block layers, an Attentive Statistics Pooling layer, and a fully connected layer;
the specific process of inputting the Mel-frequency cepstral coefficients into the ECAPA-TDNN network model for training comprises:
inputting the Mel-frequency cepstral coefficients into the convolutional layer to obtain latent audio features;
performing multi-layer feature fusion on the latent audio features through the SE-Res2Block layers and extracting global information;
concatenating the outputs of the three SE-Res2Block layers along the feature dimension;
obtaining an attention-weighted mean and standard deviation through the Attentive Statistics Pooling layer (statistics pooling with an attention mechanism) and concatenating them along the feature dimension to obtain a vector;
performing softmax classification on the vector through the fully connected layer to obtain the classification result;
based on the classification result, updating the network parameters by back-propagation with a cross-entropy loss function to obtain the ECAPA-TDNN bird song classification model.
Further, step 3 specifically includes the following steps:
step 3-1, preprocessing and framing the bird song audio, the preprocessing flow being the same as in step 1, creating several classes, and dividing the bird song audio into segments stored in those classes;
step 3-2, silence judgment, specifically comprising:
step 3-2-1, dividing the segment into sub-bands and calculating the sub-band log energies;
step 3-2-2, when the total energy of the frame is greater than the minimum energy required to trigger the audio signal, calculating for each sub-band the speech probability P(X|H1) from the speech Gaussian mixture model and the noise probability P(X|H0) from the noise Gaussian mixture model;
step 3-2-3, calculating the likelihood ratio of the sub-band from the two probabilities: likelihood ratio = log(P(X|H1) / P(X|H0));
step 3-2-4, if the likelihood ratio of any single sub-band meets a preset threshold, judging the frame to be a voiced segment;
or, accumulating the likelihood ratios of all sub-bands into an overall likelihood ratio and, if the overall likelihood ratio meets a preset threshold, judging the frame to be a voiced segment;
otherwise, judging the frame to be a silent segment;
step 3-3, collecting the voiced segments, which specifically comprises the following steps:
step 3-3-1, creating a double-ended data container, adding objects to it, and counting the number of voiced frames;
step 3-3-2, when the number of voiced frames exceeds 90% of the container capacity, judging that bird song has started and writing the data currently in the container into a newly constructed empty list;
step 3-3-3, repeating steps 3-3-1 and 3-3-2, and ending the list writing when the number of silent frames exceeds 90%;
step 3-3-4, returning the list data, i.e., the segments containing only bird song.
Further, step 3-2-1 specifically comprises: according to the spectral characteristics and energy distribution of bird song, dividing each frame of the bird song signal into six sub-bands of 200-2000Hz, 2000-3000Hz, 3000-3500Hz, 3500-4500Hz, 4500-8000Hz, and 8000-24000Hz, calculating the sub-band energies and the total energy, and taking their logarithms.
Further, the method further comprises:
step 5, displaying the recognition results one by one on a graphical user interface, counting the number of recognition results in each category, drawing a spectrogram, and exporting the identification information, including the start and stop times of the bird song, the recognition results, and the similarity, to a table.
Further, the specific process of step 5 includes:
step 5-1, building a graphical user interface, displaying the index, recognition result, and similarity of each segmented fragment one by one, and counting the numbers by bird song species;
step 5-2, drawing a spectrogram of the input bird song audio;
step 5-3, creating a new table and storing the start and stop times, recognition results, and similarity information of the bird song.
Compared with the prior art, the invention has the remarkable advantages that:
1) Compared with traditional machine learning, deep learning can quickly and accurately learn latent audio features; the ECAPA-TDNN model emphasizes a channel attention mechanism and multi-layer feature fusion, and experimental results show that it significantly improves the accuracy of bird song species classification.
2) The silence detection algorithm designed on the basis of a Gaussian mixture model automatically cuts out bird song fragments with high accuracy and good segmentation quality.
3) The graphical user interface designed by the invention is convenient and intuitive: the user can independently select the audio to be identified, and the statistics, plotting, and table functions provide a preliminary analysis and display of that audio.
The invention is described in further detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of the ECAPA-TDNN-based automatic identification method for bird song species.
FIG. 2 is a structural diagram of ECAPA-TDNN.
FIG. 3 is a diagram of the SE-Res2Block module in ECAPA-TDNN.
FIG. 4 is a flow chart of silence detection based on a Gaussian mixture model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that any directional indications in the embodiments of the present invention (such as up, down, left, right, front, and rear) are merely used to explain the relative positional relationships, movement conditions, etc. between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly.
In addition, descriptions such as "first" and "second" in the embodiments of the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features; a feature qualified by "first" or "second" may thus explicitly or implicitly include at least one such feature. The technical solutions of the embodiments may also be combined with each other, provided that the combination can be realized by those skilled in the art; when technical solutions contradict each other or cannot be realized, their combination should be considered absent and outside the scope of protection claimed by the present invention.
In one embodiment, with reference to FIG. 1, an ECAPA-TDNN-based method for automatically identifying bird song species is provided, the method comprising the following steps:
step 1, collecting bird song signals, preprocessing them to form a cleaner bird song data set, and then obtaining the Mel-frequency cepstral coefficients (MFCCs) through feature extraction;
the pretreatment comprises the following steps:
step 1-1-1, for the bird song signal, eliminating the DC component, pre-emphasis is performed, specifically by a transfer function of H (z) =1-az -1 A is a pre-emphasis coefficient, a=0.97, to boost high frequency components in the signal;
step 1-1-2, performing high-pass filtering; because the voice noise and the environmental noise are mainly concentrated below 350Hz, the signals pass through an 8-order Butterworth high-pass filter with the cut-off frequency of 350Hz so as to obtain purer bird sound signals;
step 1-1-3, carrying out frame division processing, wherein the frame length is 2048, and the frame shift is 512;
step 1-1-4, windowing is carried out by using a hanning window so as to eliminate the discontinuity between frames after framing.
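A minimal Python sketch of this preprocessing chain is given below. The 48 kHz sample rate is an assumption (consistent with the 8000-24000Hz sub-band used in step 3-2-1); the other parameter values follow the steps above.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def preprocess(x, fs=48000, a=0.97, frame_len=2048, hop=512):
    """Steps 1-1-1 to 1-1-4; fs=48000 is an assumed sample rate."""
    x = x - np.mean(x)                        # step 1-1-1: remove the DC component
    x = np.append(x[0], x[1:] - a * x[:-1])   # pre-emphasis: H(z) = 1 - a*z^-1, a = 0.97
    sos = butter(8, 350, btype="highpass", fs=fs, output="sos")
    x = sosfilt(sos, x)                       # step 1-1-2: 8th-order Butterworth HPF at 350 Hz
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hanning(frame_len)     # steps 1-1-3/1-1-4: framing + Hanning window
```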
Obtaining the Mel-frequency cepstral coefficients through feature extraction specifically comprises (a code sketch follows step 1-2-4):
step 1-2-1, performing a short-time Fourier transform on each bird song signal in the bird song data set, taking the absolute value of the result, and squaring it to obtain the energy spectrogram;
step 1-2-2, constructing a Mel filter bank with 128 filters and computing its dot product with the energy spectrum to obtain the Mel spectrogram;
step 1-2-3, taking the logarithm of the Mel spectrogram;
step 1-2-4, performing a discrete cosine transform and keeping the first 80 coefficients (P = 80) to obtain the Mel-frequency cepstral coefficients.
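The following sketch illustrates steps 1-2-1 to 1-2-4 on the windowed frames returned by the preprocessing sketch above; using librosa for the Mel filter bank is an implementation choice, not prescribed by the invention.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc(frames, fs=48000, n_fft=2048, n_mels=128, n_mfcc=80):
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2       # step 1-2-1: energy spectrum
    mel_fb = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=n_mels)
    mel_spec = spec @ mel_fb.T                                     # step 1-2-2: 128-filter Mel spectrogram
    log_mel = np.log(mel_spec + 1e-10)                             # step 1-2-3: logarithm
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_mfcc]  # step 1-2-4: first P = 80 coefficients
```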
Step 2, inputting the MFCCs into the ECAPA-TDNN network (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network) for training to obtain the ECAPA-TDNN bird song classification model. With reference to FIG. 2, the ECAPA-TDNN network comprises a convolutional layer, three SE-Res2Block layers, an Attentive Statistics Pooling layer (statistics pooling with an attention mechanism), and a fully connected layer. The specific process of this step comprises (a training sketch follows step 2-6):
step 2-1, passing the MFCCs through the convolutional layer to obtain latent audio features;
step 2-2, with reference to FIG. 3, performing multi-layer feature fusion on the latent audio features through the SE-Res2Block layers and extracting global information;
step 2-3, concatenating the outputs of the three SE-Res2Block layers along the feature dimension;
step 2-4, obtaining an attention-weighted mean and standard deviation through the Attentive Statistics Pooling layer and concatenating them along the feature dimension into a 3072-dimensional vector;
step 2-5, performing softmax classification on the vector through the fully connected layer to obtain the classification result;
step 2-6, based on the classification result, updating the network parameters by back-propagation with a cross-entropy loss function to obtain the ECAPA-TDNN bird song classification model.
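One way to realize steps 2-1 to 2-6 without re-implementing the network is SpeechBrain's ECAPA-TDNN module, whose default configuration matches the structure above (three SE-Res2Blocks and attentive statistics pooling over 1536 channels, giving a 3072-dimensional statistic). The sketch below assumes that library; the class count and learning rate are placeholders.

```python
import torch
from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN

n_species = 20                                      # hypothetical number of bird species
model = ECAPA_TDNN(input_size=80, lin_neurons=192)  # 80-dim MFCC input (step 1-2-4)
head = torch.nn.Linear(192, n_species)              # fully connected layer (step 2-5)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()               # softmax + cross-entropy (steps 2-5/2-6)

def train_step(mfcc_batch, labels):
    # mfcc_batch: (batch, time, 80); the model applies the convolutional layer,
    # SE-Res2Blocks, and attentive statistics pooling internally (steps 2-1 to 2-4)
    emb = model(mfcc_batch).squeeze(1)
    loss = loss_fn(head(emb), labels)
    opt.zero_grad(); loss.backward(); opt.step()    # step 2-6: back-propagation
    return loss.item()
```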
Step 3, inputting 1 minute of bird song audio, removing silent segments with the voice endpoint detection algorithm based on a Gaussian mixture model, and extracting the segments containing bird song. With reference to FIG. 4, this specifically comprises:
step 3-1, preprocessing and framing the bird song audio (the preprocessing flow is the same as in step 1), creating several classes, and dividing the bird song audio into 100 ms segments stored in those classes;
step 3-2, silence judgment (a code sketch follows this step), specifically comprising:
step 3-2-1, dividing the segment into sub-bands and calculating the sub-band log energies: according to the spectral characteristics and energy distribution of bird song signals, each frame of the bird song signal is divided into six sub-bands of 200-2000Hz, 2000-3000Hz, 3000-3500Hz, 3500-4500Hz, 4500-8000Hz, and 8000-24000Hz, the sub-band energies and the total energy are calculated, and their logarithms are taken;
step 3-2-2, when the total energy of the frame is greater than the minimum energy required to trigger the audio signal, calculating for each sub-band the speech probability P(X|H1) from the speech Gaussian mixture model and the noise probability P(X|H0) from the noise Gaussian mixture model;
step 3-2-3, calculating the likelihood ratio of the sub-band from the two probabilities: likelihood ratio = log(P(X|H1) / P(X|H0));
step 3-2-4, if the likelihood ratio of any single sub-band meets a preset threshold, judging the frame to be a voiced segment (local decision);
or, accumulating the likelihood ratios of all sub-bands into an overall likelihood ratio and, if the overall likelihood ratio meets a preset threshold, judging the frame to be a voiced segment (global decision); the frame is voiced as long as either the local or the global condition is satisfied;
otherwise, judging the frame to be a silent segment;
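A sketch of this frame-level decision is given below. It assumes one pre-trained bird-song GMM and one noise GMM per sub-band (here sklearn GaussianMixture objects); the energy gate and the two thresholds are illustrative placeholders, since their values are not published in the text.

```python
import numpy as np

BANDS = [(200, 2000), (2000, 3000), (3000, 3500),
         (3500, 4500), (4500, 8000), (8000, 24000)]   # step 3-2-1 sub-bands

def is_voiced(frame, fs, song_gmms, noise_gmms,
              e_min=1e-6, local_thr=2.0, global_thr=6.0):
    spec = np.abs(np.fft.rfft(frame)) ** 2
    if spec.sum() <= e_min:                           # energy gate (step 3-2-2)
        return False
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    log_e = [np.log(spec[(freqs >= lo) & (freqs < hi)].sum() + 1e-12)
             for lo, hi in BANDS]                     # sub-band log energies
    lrs = [song_gmms[k].score_samples([[e]])[0]       # log P(X|H1)
           - noise_gmms[k].score_samples([[e]])[0]    # - log P(X|H0)  (step 3-2-3)
           for k, e in enumerate(log_e)]
    return max(lrs) > local_thr or sum(lrs) > global_thr  # step 3-2-4: local OR global
```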
step 3-3, collecting the voiced segments (a code sketch follows step 3-3-4), which specifically comprises the following steps:
step 3-3-1, creating a double-ended data container and adding objects to it (instances of a Frame class together with their silence-detection results), then counting the number of voiced frames;
step 3-3-2, when the number of voiced frames exceeds 90% of the container capacity, judging that bird song has started and writing the data currently in the container into a newly constructed empty list;
step 3-3-3, repeating steps 3-3-1 and 3-3-2, and ending the list writing when the number of silent frames exceeds 90%;
step 3-3-4, returning the list data, i.e., the segments containing only bird song.
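The collector below is a minimal sketch of steps 3-3-1 to 3-3-4 using Python's collections.deque as the double-ended container; frames is an iterable of (frame, voiced_flag) pairs from the 100 ms segments, and the buffer size maxlen is an assumed value.

```python
from collections import deque

def collect_song(frames, maxlen=10):
    ring = deque(maxlen=maxlen)            # step 3-3-1: double-ended data container
    triggered, segment, segments = False, [], []
    for frame, voiced in frames:
        ring.append((frame, voiced))
        n_voiced = sum(1 for _, v in ring if v)
        if not triggered and n_voiced > 0.9 * ring.maxlen:
            triggered = True               # step 3-3-2: bird song starts
            segment = [f for f, _ in ring] # flush buffered frames into a new list
            ring.clear()
        elif triggered:
            segment.append(frame)
            if len(ring) - n_voiced > 0.9 * ring.maxlen:
                triggered = False          # step 3-3-3: >90% silent frames end the segment
                segments.append(segment)
                segment = []
                ring.clear()
    if triggered and segment:
        segments.append(segment)
    return segments                        # step 3-3-4: fragments containing only bird song
```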
Step 4, extracting the MFCCs from the segments containing bird song in the manner of step 1 and inputting them into the ECAPA-TDNN bird song classification model for recognition to obtain the recognition results.
Step 5, displaying the recognition results one by one on a graphical user interface, counting the number of results in each category, drawing a spectrogram, and exporting the identification information, including the start and stop times of the bird song, the recognition results, the similarity, and so on, to a table. This specifically comprises (a code sketch follows step 5-3):
step 5-1, building a graphical user interface, displaying the index, recognition result, and similarity of each segmented fragment one by one, and counting the numbers by bird song species;
step 5-2, drawing a spectrogram of the input bird song audio;
step 5-3, creating a new Excel table and storing the start and stop times, recognition results, similarity information, and so on, of the bird song.
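The plotting and export of steps 5-2 and 5-3 can be sketched as below (the GUI layer of step 5-1 is omitted); results is assumed to be a list of (start_s, end_s, species, similarity) tuples produced in step 4, and the output file name is a placeholder.

```python
import pandas as pd
import matplotlib.pyplot as plt

def report(audio, fs, results, xlsx_path="bird_song_results.xlsx"):
    plt.specgram(audio, NFFT=2048, Fs=fs, noverlap=1536)   # step 5-2: spectrogram
    plt.xlabel("Time (s)"); plt.ylabel("Frequency (Hz)")
    plt.show()
    df = pd.DataFrame(results, columns=["start (s)", "end (s)", "species", "similarity"])
    print(df["species"].value_counts())    # step 5-1 statistics: count per species
    df.to_excel(xlsx_path, index=False)    # step 5-3: export the table (needs openpyxl)
```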
In one embodiment, an ECAPA-TDNN based automated bird song species identification system is provided, the system comprising:
the first module is used for preprocessing the bird song signal and extracting the characteristics to obtain a Mel frequency cepstrum coefficient;
the second module is used for training the ECAPA-TDNN network based on the Mel frequency cepstrum coefficient to obtain an ECAPA-TDNN birdcasting classification model;
the third module is used for performing silence detection, removing silence fragments and generating a data set only containing bird song fragments;
a fourth module, configured to perform pretreatment and feature extraction on the segmented bird song segments, and identify the bird song segments through an ECAPA-TDNN bird song classification model;
and the fifth module is used for realizing user interaction and displaying classification and analysis results.
Specific limitations on the ECAPA-TDNN-based automatic bird song species identification system may be found in the above description of the method and are not repeated here. Each module in the system may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of when executing the computer program:
step 1, collecting bird song signals, preprocessing them to construct a bird song data set, and then obtaining Mel-frequency cepstral coefficients through feature extraction;
step 2, inputting the Mel-frequency cepstral coefficients into an ECAPA-TDNN network for training to obtain an ECAPA-TDNN bird song classification model;
step 3, inputting bird song audio, removing silent segments with a voice endpoint detection algorithm based on a Gaussian mixture model, and extracting the segments containing bird song;
step 4, extracting Mel-frequency cepstral coefficients from the segments containing bird song in the manner of step 1 and inputting them into the ECAPA-TDNN bird song classification model for recognition to obtain the recognition results;
step 5, displaying the recognition results one by one on a graphical user interface, counting the number of recognition results in each category, drawing a spectrogram, and exporting the identification information, including the start and stop times of the bird song, the recognition results, and the similarity, to a table.
For specific limitations on each step, reference may be made to the above limitations on the ECAPA-TDNN-based method for automatic identification of bird song species, which are not described in detail herein.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
step 1, collecting bird song signals, preprocessing them to construct a bird song data set, and then obtaining Mel-frequency cepstral coefficients through feature extraction;
step 2, inputting the Mel-frequency cepstral coefficients into an ECAPA-TDNN network for training to obtain an ECAPA-TDNN bird song classification model;
step 3, inputting bird song audio, removing silent segments with a voice endpoint detection algorithm based on a Gaussian mixture model, and extracting the segments containing bird song;
step 4, extracting Mel-frequency cepstral coefficients from the segments containing bird song in the manner of step 1 and inputting them into the ECAPA-TDNN bird song classification model for recognition to obtain the recognition results;
step 5, displaying the recognition results one by one on a graphical user interface, counting the number of recognition results in each category, drawing a spectrogram, and exporting the identification information, including the start and stop times of the bird song, the recognition results, and the similarity, to a table.
For specific limitations on each step, reference may be made to the above limitations on the ECAPA-TDNN-based method for automatic identification of bird song species, which are not described in detail herein.
The method is convenient and fast, highly practical, and highly accurate; it fully exploits the advantages of deep learning, realizes automatic segmentation and species identification of bird song recordings, and is of great significance for studying ecosystem biodiversity and protecting endangered birds.
The foregoing has outlined and described the basic principles, main features, and advantages of the present invention. It will be understood by those skilled in the art that the invention is not limited by the foregoing embodiments; the above embodiments and descriptions merely illustrate the principles of the invention, and various modifications, equivalent substitutions, and improvements may be made without departing from the spirit and scope of the invention.

Claims (10)

1. An ECAPA-TDNN-based automatic identification method for bird song species, characterized in that the method comprises the following steps:
step 1, collecting bird song signals, preprocessing them to construct a bird song data set, and then obtaining Mel-frequency cepstral coefficients through feature extraction;
step 2, inputting the Mel-frequency cepstral coefficients into an ECAPA-TDNN network for training to obtain an ECAPA-TDNN bird song classification model;
step 3, inputting bird song audio, removing silent segments with a voice endpoint detection algorithm based on a Gaussian mixture model, and extracting the segments containing bird song;
step 4, extracting Mel-frequency cepstral coefficients from the segments containing bird song in the manner of step 1 and inputting them into the ECAPA-TDNN bird song classification model for recognition to obtain the recognition results.
2. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 1, wherein the preprocessing in step 1 specifically comprises:
step 1-1-1, for the bird song signal, eliminating the direct current component and performing pre-emphasis;
step 1-1-2, performing high-pass filtering;
step 1-1-3, performing framing;
step 1-1-4, windowing with a Hanning window.
3. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 2, wherein the pre-emphasis in step 1-1-1 is specifically performed through the transfer function H(z) = 1 - a·z⁻¹, where a is the pre-emphasis coefficient.
4. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 2, wherein obtaining the Mel-frequency cepstral coefficients through feature extraction in step 1 specifically comprises:
step 1-2-1, performing a short-time Fourier transform on each bird song signal in the bird song data set, taking the absolute value of the result, and squaring it to obtain the energy spectrogram;
step 1-2-2, constructing a Mel filter bank and computing its dot product with the energy spectrum to obtain the Mel spectrogram;
step 1-2-3, taking the logarithm of the Mel spectrogram;
step 1-2-4, performing a discrete cosine transform and keeping the first P coefficients, where P is an integer, to obtain the Mel-frequency cepstral coefficients.
5. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 1, wherein the ECAPA-TDNN network comprises a convolutional layer, three SE-Res2Block layers, an Attentive Statistics Pooling layer, and a fully connected layer;
the specific process of inputting the Mel-frequency cepstral coefficients into the ECAPA-TDNN network model for training comprises:
inputting the Mel-frequency cepstral coefficients into the convolutional layer to obtain latent audio features;
performing multi-layer feature fusion on the latent audio features through the SE-Res2Block layers and extracting global information;
concatenating the outputs of the three SE-Res2Block layers along the feature dimension;
obtaining an attention-weighted mean and standard deviation through the Attentive Statistics Pooling layer and concatenating them along the feature dimension to obtain a vector;
performing softmax classification on the vector through the fully connected layer to obtain the classification result;
based on the classification result, updating the network parameters by back-propagation with a cross-entropy loss function to obtain the ECAPA-TDNN bird song classification model.
6. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 1, wherein step 3 specifically comprises the following steps:
step 3-1, preprocessing and framing the bird song audio, the preprocessing flow being the same as in step 1, creating several classes, and dividing the bird song audio into segments stored in those classes;
step 3-2, silence judgment, specifically comprising:
step 3-2-1, dividing the segment into sub-bands and calculating the sub-band log energies;
step 3-2-2, when the total energy of the frame is greater than the minimum energy required to trigger the audio signal, calculating for each sub-band the speech probability P(X|H1) from the speech Gaussian mixture model and the noise probability P(X|H0) from the noise Gaussian mixture model;
step 3-2-3, calculating the likelihood ratio of the sub-band from the two probabilities: likelihood ratio = log(P(X|H1) / P(X|H0));
step 3-2-4, if the likelihood ratio of any single sub-band meets a preset threshold, judging the frame to be a voiced segment;
or, accumulating the likelihood ratios of all sub-bands into an overall likelihood ratio and, if the overall likelihood ratio meets a preset threshold, judging the frame to be a voiced segment;
otherwise, judging the frame to be a silent segment;
step 3-3, collecting the voiced segments, which specifically comprises the following steps:
step 3-3-1, creating a double-ended data container, adding objects to it, and counting the number of voiced frames;
step 3-3-2, when the number of voiced frames exceeds 90% of the container capacity, judging that bird song has started and writing the data currently in the container into a newly constructed empty list;
step 3-3-3, repeating steps 3-3-1 and 3-3-2, and ending the list writing when the number of silent frames exceeds 90%;
step 3-3-4, returning the list data, i.e., the segments containing only bird song.
7. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 6, wherein step 3-2-1 specifically comprises: according to the spectral characteristics and energy distribution of bird song, dividing each frame of the bird song signal into six sub-bands of 200-2000Hz, 2000-3000Hz, 3000-3500Hz, 3500-4500Hz, 4500-8000Hz, and 8000-24000Hz, calculating the sub-band energies and the total energy, and taking their logarithms.
8. The ECAPA-TDNN based method for automatic identification of a bird song species of claim 6, further comprising:
step 5, displaying the recognition results one by one on a graphical user interface, counting the number of recognition results in each category, drawing a spectrogram, and exporting the identification information, including the start and stop times of the bird song, the recognition results, and the similarity, to a table.
9. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 8, wherein the specific process of step 5 includes:
step 5-1, building a graphical user interface, displaying the index, recognition result, and similarity of each segmented fragment one by one, and counting the numbers by bird song species;
step 5-2, drawing a spectrogram of the input bird song audio;
step 5-3, creating a new table and storing the start and stop times, recognition results, and similarity information of the bird song.
10. An ECAPA-TDNN based automatic bird song species identification system based on the method according to any one of claims 1 to 9, characterized in that the system comprises:
the first module is used for preprocessing the bird song signal and extracting features to obtain the Mel-frequency cepstral coefficients;
the second module is used for training the ECAPA-TDNN network on the Mel-frequency cepstral coefficients to obtain the ECAPA-TDNN bird song classification model;
the third module is used for performing silence detection, removing silent fragments, and generating a data set containing only bird song fragments;
the fourth module is used for preprocessing the segmented bird song fragments, extracting their features, and identifying them with the ECAPA-TDNN bird song classification model;
and the fifth module is used for realizing user interaction and displaying classification and analysis results.
CN202310439188.9A 2023-04-23 2023-04-23 ECAPA-TDNN-based automatic identification method for bird song species Pending CN116524939A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310439188.9A CN116524939A (en) 2023-04-23 2023-04-23 ECAPA-TDNN-based automatic identification method for bird song species

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310439188.9A CN116524939A (en) 2023-04-23 2023-04-23 ECAPA-TDNN-based automatic identification method for bird song species

Publications (1)

Publication Number Publication Date
CN116524939A 2023-08-01

Family

ID=87396896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310439188.9A Pending CN116524939A (en) 2023-04-23 2023-04-23 ECAPA-TDNN-based automatic identification method for bird song species

Country Status (1)

Country Link
CN (1) CN116524939A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095694A (en) * 2023-10-18 2023-11-21 中国科学技术大学 Bird song recognition method based on tag hierarchical structure attribute relationship
CN117095694B (en) * 2023-10-18 2024-02-23 中国科学技术大学 Bird song recognition method based on tag hierarchical structure attribute relationship
CN117727309A (en) * 2024-02-18 2024-03-19 百鸟数据科技(北京)有限责任公司 Automatic identification method for bird song species based on TDNN structure
CN117727332A (en) * 2024-02-18 2024-03-19 百鸟数据科技(北京)有限责任公司 Ecological population assessment method based on language spectrum feature analysis
CN117727309B (en) * 2024-02-18 2024-04-26 百鸟数据科技(北京)有限责任公司 Automatic identification method for bird song species based on TDNN structure
CN117727332B (en) * 2024-02-18 2024-04-26 百鸟数据科技(北京)有限责任公司 Ecological population assessment method based on language spectrum feature analysis
CN117746871A (en) * 2024-02-21 2024-03-22 南方科技大学 Cloud-based bird song detection method and system
CN117746871B (en) * 2024-02-21 2024-07-16 深圳市规划和自然资源数据管理中心(深圳市空间地理信息中心) Cloud-based bird song detection method and system

Similar Documents

Publication Publication Date Title
CN110085251B (en) Human voice extraction method, human voice extraction device and related products
CN109065031B (en) Voice labeling method, device and equipment
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN116524939A (en) ECAPA-TDNN-based automatic identification method for bird song species
CN111063341B (en) Method and system for segmenting and clustering multi-person voice in complex environment
CN107274916B (en) Method and device for operating audio/video file based on voiceprint information
CN111279414B (en) Segmentation-based feature extraction for sound scene classification
CN100485780C (en) Quick audio-frequency separating method based on tonic frequency
CN107305541A (en) Speech recognition text segmentation method and device
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN102486920A (en) Audio event detection method and device
CN103700370A (en) Broadcast television voice recognition method and system
CN110880329A (en) Audio identification method and equipment and storage medium
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
CN102915729B (en) Speech keyword spotting system and system and method of creating dictionary for the speech keyword spotting system
CN110070859B (en) Voice recognition method and device
CN103559879A (en) Method and device for extracting acoustic features in language identification system
CN106409298A (en) Identification method of sound rerecording attack
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
CN102073631A (en) Video news unit dividing method by using association rule technology
CN114141252A (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN106531195B (en) A kind of dialogue collision detection method and device
CN111276124B (en) Keyword recognition method, device, equipment and readable storage medium
CN111091809A (en) Regional accent recognition method and device based on depth feature fusion
CN112885330A (en) Language identification method and system based on low-resource audio

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination