CN116524939A - ECAPA-TDNN-based automatic identification method for bird song species - Google Patents

Info

Publication number
CN116524939A
CN116524939A
Authority
CN
China
Prior art keywords
bird song
ecapa
tdnn
bird
song
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310439188.9A
Other languages
Chinese (zh)
Inventor
赵兆
鞠然然
许志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202310439188.9A priority Critical patent/CN116524939A/en
Publication of CN116524939A publication Critical patent/CN116524939A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L17/26 - Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/24 - the extracted parameters being the cepstrum
    • G10L25/78 - Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The invention discloses an ECAPA-TDNN-based automatic identification method for bird song species, which comprises the following steps: collecting bird song segments, then preprocessing them and extracting features to obtain Mel-frequency cepstral coefficients (MFCCs); feeding the MFCCs into an ECAPA-TDNN network for training; removing silent segments with a voice endpoint detection algorithm based on a Gaussian mixture model and extracting the segments that contain bird song; obtaining the MFCCs of those segments and inputting them into the trained model for recognition; and displaying the recognition results one by one on a graphical user interface, counting the number per category, drawing a spectrogram, and exporting the identification information to a table. The ECAPA-TDNN model improves accuracy in the bird song classification setting, and the method automatically classifies and preliminarily analyzes long bird song recordings, reducing the workload of manual cutting and facilitating subsequent in-depth analysis in the ecological field.

Description

ECAPA-TDNN-based automatic identification method for bird song species
Technical Field
The invention belongs to the technical field of acoustic monitoring and audio signal identification, and particularly relates to an ECAPA-TDNN-based automatic identification method for bird song species.
Background
Birds are an important component of ecosystems and give ecologists an important basis for understanding regional biodiversity and climate change. Passive acoustic monitoring is low-cost, wide-ranging, and non-invasive, which makes bird song signals an important data source for monitoring bird activity. With the continuing deterioration of the ecological environment in recent years and the serious difficulties this creates for management, species identification, behavior analysis, and acoustic index research based on bird song have great application value.
At present, species identification algorithms based on bird song mainly fall into three categories: 1) template-matching methods, such as the dynamic time warping (DTW) template algorithm, which suffer from heavy computation and low efficiency; 2) traditional machine learning algorithms, such as random forests, support vector machines, and hidden Markov models, which are strongly affected by noise and demand a high signal-to-noise ratio from the data set; and 3) deep learning methods, such as the AlexNet and VGG16 models, which are currently popular for species identification but are still rarely applied to bird-song-based species identification in China.
The ECAPA-TDNN model, proposed in 2020, introduced squeeze-and-excitation (SE) modules and a channel attention mechanism so that the model learns more global information from audio data, and it has become a mainstream voiceprint model. The open-source voiceprint recognition system released under Baidu's PaddleSpeech uses ECAPA-TDNN to extract voiceprint features and achieves an equal error rate as low as 0.95%.
The main problems with the prior art are therefore these: the field of bird song species identification lacks a verified, high-accuracy mainstream neural network model. In addition, current bird song recognition algorithms require segments containing only bird song to be cut out manually before prediction; when long bird song recordings are input, the manual workload is large and observer bias is introduced, hindering in-depth analysis and follow-up research.
Disclosure of Invention
The invention aims to solve the problems in the prior art by providing an ECAPA-TDNN-based automatic identification method for bird song species that improves identification accuracy and realizes automatic segmentation of bird song audio together with species identification.
The technical solution for realizing the purpose of the invention is as follows: an ECAPA-TDNN-based method for automatically identifying bird song species, the method comprising the following steps:
step 1, collecting bird song signals, preprocessing them to construct a bird song data set, and then obtaining Mel-frequency cepstral coefficients through feature extraction;
step 2, inputting the Mel-frequency cepstral coefficients into an ECAPA-TDNN network for training to obtain an ECAPA-TDNN bird song classification model;
step 3, inputting bird song audio, removing silent segments with a voice endpoint detection algorithm based on a Gaussian mixture model, and extracting the segments containing bird song;
step 4, extracting Mel-frequency cepstral coefficients from the segments containing bird song in the manner of step 1 and inputting them into the ECAPA-TDNN bird song classification model for recognition to obtain the recognition results.
Further, the preprocessing in step 1 specifically includes:
step 1-1-1, for the bird song signal, eliminating the direct current component and performing pre-emphasis;
step 1-1-2, performing high-pass filtering;
step 1-1-3, performing framing;
step 1-1-4, windowing with a Hanning window.
Further, the pre-emphasis in step 1-1-1 is specifically performed through the transfer function H(z) = 1 - a·z⁻¹, where a is the pre-emphasis coefficient.
Further, obtaining the Mel-frequency cepstral coefficients through feature extraction in step 1 specifically includes:
step 1-2-1, performing a short-time Fourier transform on each bird song signal in the bird song data set, taking the absolute value of the result, and squaring it to obtain the energy spectrogram;
step 1-2-2, constructing a Mel filter bank and computing its dot product with the energy spectrum to obtain the Mel spectrogram;
step 1-2-3, taking the logarithm of the Mel spectrogram;
step 1-2-4, performing a discrete cosine transform and keeping the first P coefficients, where P is an integer, to obtain the Mel-frequency cepstral coefficients.
Further, the ECAPA-TDNN network comprises a convolutional layer, three SE-Res2Block layers, an Attentive Statistics Pooling layer, and a fully connected layer;
the specific process of inputting the Mel-frequency cepstral coefficients into the ECAPA-TDNN network model for training comprises:
inputting the Mel-frequency cepstral coefficients into the convolutional layer to obtain latent audio features;
performing multi-layer feature fusion on the latent audio features through the SE-Res2Block layers and extracting global information;
concatenating the outputs of the three SE-Res2Block layers along the feature dimension;
obtaining an attention-weighted mean and standard deviation through the Attentive Statistics Pooling layer (statistics pooling with an attention mechanism) and concatenating them along the feature dimension to obtain a vector;
performing softmax classification on the vector through the fully connected layer to obtain the classification result;
based on the classification result, updating the network parameters by back-propagation with a cross-entropy loss function to obtain the ECAPA-TDNN bird song classification model.
Further, step 3 specifically includes the following steps:
step 3-1, preprocessing and framing the bird song audio, the preprocessing flow being the same as in step 1, creating several classes, and dividing the bird song audio into segments stored in those classes;
step 3-2, silence judgment, specifically comprising:
step 3-2-1, dividing the segment into sub-bands and calculating the sub-band log energies;
step 3-2-2, when the total energy of the frame is greater than the minimum energy required to trigger the audio signal, calculating for each sub-band the speech probability P(X|H1) from the speech Gaussian mixture model and the noise probability P(X|H0) from the noise Gaussian mixture model;
step 3-2-3, calculating the likelihood ratio of the sub-band from the two probabilities: likelihood ratio = log(P(X|H1) / P(X|H0));
step 3-2-4, if the likelihood ratio of any single sub-band meets a preset threshold, judging the frame to be a voiced segment;
or, accumulating the likelihood ratios of all sub-bands into an overall likelihood ratio and, if the overall likelihood ratio meets a preset threshold, judging the frame to be a voiced segment;
otherwise, judging the frame to be a silent segment;
step 3-3, collecting the voiced segments, which specifically comprises the following steps:
step 3-3-1, creating a double-ended data container, adding objects to it, and counting the number of voiced frames;
step 3-3-2, when the number of voiced frames exceeds 90% of the container capacity, judging that bird song has started and writing the data currently in the container into a newly constructed empty list;
step 3-3-3, repeating steps 3-3-1 and 3-3-2, and ending the list writing when the number of silent frames exceeds 90%;
step 3-3-4, returning the list data, i.e., the segments containing only bird song.
Further, step 3-2-1 specifically comprises: according to the spectral characteristics and energy distribution of bird song, dividing each frame of the bird song signal into six sub-bands of 200-2000Hz, 2000-3000Hz, 3000-3500Hz, 3500-4500Hz, 4500-8000Hz, and 8000-24000Hz, calculating the sub-band energies and the total energy, and taking their logarithms.
Further, the method further comprises:
step 5, displaying the recognition results one by one on a graphical user interface, counting the number of recognition results in each category, drawing a spectrogram, and exporting the identification information, including the start and stop times of the bird song, the recognition results, and the similarity, to a table.
Further, the specific process of step 5 includes:
step 5-1, building a graphical user interface, displaying the index, recognition result, and similarity of each segmented fragment one by one, and counting the numbers by bird song species;
step 5-2, drawing a spectrogram of the input bird song audio;
step 5-3, creating a new table and storing the start and stop times, recognition results, and similarity information of the bird song.
Compared with the prior art, the invention has the remarkable advantages that:
1) Compared with traditional machine learning, deep learning can quickly and accurately learn latent audio features; the ECAPA-TDNN model emphasizes a channel attention mechanism and multi-layer feature fusion, and experimental results show that it significantly improves the accuracy of bird song species classification.
2) The silence detection algorithm designed on the basis of a Gaussian mixture model automatically cuts out bird song fragments with high accuracy and good segmentation quality.
3) The graphical user interface designed by the invention is convenient and intuitive: the user can independently select the audio to be identified, and the statistics, plotting, and table functions provide a preliminary analysis and display of that audio.
The invention is described in further detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of the ECAPA-TDNN-based automatic identification method for bird song species.
FIG. 2 is a structural diagram of ECAPA-TDNN.
FIG. 3 is a diagram of the SE-Res2Block module in ECAPA-TDNN.
FIG. 4 is a flow chart of silence detection based on a Gaussian mixture model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that any directional indications in the embodiments of the present invention (such as up, down, left, right, front, and rear) are merely used to explain the relative positional relationships, movement conditions, etc. between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly.
In addition, descriptions such as "first" and "second" in the embodiments of the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features; a feature qualified by "first" or "second" may thus explicitly or implicitly include at least one such feature. The technical solutions of the embodiments may also be combined with each other, provided that the combination can be realized by those skilled in the art; when technical solutions contradict each other or cannot be realized, their combination should be considered absent and outside the scope of protection claimed by the present invention.
In one embodiment, with reference to FIG. 1, an ECAPA-TDNN-based method for automatically identifying bird song species is provided, the method comprising the following steps:
step 1, collecting bird song signals, preprocessing them to form a cleaner bird song data set, and then obtaining the Mel-frequency cepstral coefficients (MFCCs) through feature extraction;
the pretreatment comprises the following steps:
step 1-1-1, for the bird song signal, eliminating the DC component, pre-emphasis is performed, specifically by a transfer function of H (z) =1-az -1 A is a pre-emphasis coefficient, a=0.97, to boost high frequency components in the signal;
step 1-1-2, performing high-pass filtering; because the voice noise and the environmental noise are mainly concentrated below 350Hz, the signals pass through an 8-order Butterworth high-pass filter with the cut-off frequency of 350Hz so as to obtain purer bird sound signals;
step 1-1-3, carrying out frame division processing, wherein the frame length is 2048, and the frame shift is 512;
step 1-1-4, windowing is carried out by using a hanning window so as to eliminate the discontinuity between frames after framing.
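A minimal Python sketch of this preprocessing chain is given below. The 48 kHz sample rate is an assumption (consistent with the 8000-24000Hz sub-band used in step 3-2-1); the other parameter values follow the steps above.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def preprocess(x, fs=48000, a=0.97, frame_len=2048, hop=512):
    """Steps 1-1-1 to 1-1-4; fs=48000 is an assumed sample rate."""
    x = x - np.mean(x)                        # step 1-1-1: remove the DC component
    x = np.append(x[0], x[1:] - a * x[:-1])   # pre-emphasis: H(z) = 1 - a*z^-1, a = 0.97
    sos = butter(8, 350, btype="highpass", fs=fs, output="sos")
    x = sosfilt(sos, x)                       # step 1-1-2: 8th-order Butterworth HPF at 350 Hz
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hanning(frame_len)     # steps 1-1-3/1-1-4: framing + Hanning window
```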
Obtaining the Mel-frequency cepstral coefficients through feature extraction specifically comprises (a code sketch follows step 1-2-4):
step 1-2-1, performing a short-time Fourier transform on each bird song signal in the bird song data set, taking the absolute value of the result, and squaring it to obtain the energy spectrogram;
step 1-2-2, constructing a Mel filter bank with 128 filters and computing its dot product with the energy spectrum to obtain the Mel spectrogram;
step 1-2-3, taking the logarithm of the Mel spectrogram;
step 1-2-4, performing a discrete cosine transform and keeping the first 80 coefficients (P = 80) to obtain the Mel-frequency cepstral coefficients.
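The following sketch illustrates steps 1-2-1 to 1-2-4 on the windowed frames returned by the preprocessing sketch above; using librosa for the Mel filter bank is an implementation choice, not prescribed by the invention.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc(frames, fs=48000, n_fft=2048, n_mels=128, n_mfcc=80):
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2       # step 1-2-1: energy spectrum
    mel_fb = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=n_mels)
    mel_spec = spec @ mel_fb.T                                     # step 1-2-2: 128-filter Mel spectrogram
    log_mel = np.log(mel_spec + 1e-10)                             # step 1-2-3: logarithm
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_mfcc]  # step 1-2-4: first P = 80 coefficients
```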
Step 2, inputting the MFCCs into the ECAPA-TDNN network (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network) for training to obtain the ECAPA-TDNN bird song classification model. With reference to FIG. 2, the ECAPA-TDNN network comprises a convolutional layer, three SE-Res2Block layers, an Attentive Statistics Pooling layer (statistics pooling with an attention mechanism), and a fully connected layer. The specific process of this step comprises (a training sketch follows step 2-6):
step 2-1, passing the MFCCs through the convolutional layer to obtain latent audio features;
step 2-2, with reference to FIG. 3, performing multi-layer feature fusion on the latent audio features through the SE-Res2Block layers and extracting global information;
step 2-3, concatenating the outputs of the three SE-Res2Block layers along the feature dimension;
step 2-4, obtaining an attention-weighted mean and standard deviation through the Attentive Statistics Pooling layer and concatenating them along the feature dimension into a 3072-dimensional vector;
step 2-5, performing softmax classification on the vector through the fully connected layer to obtain the classification result;
step 2-6, based on the classification result, updating the network parameters by back-propagation with a cross-entropy loss function to obtain the ECAPA-TDNN bird song classification model.
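One way to realize steps 2-1 to 2-6 without re-implementing the network is SpeechBrain's ECAPA-TDNN module, whose default configuration matches the structure above (three SE-Res2Blocks and attentive statistics pooling over 1536 channels, giving a 3072-dimensional statistic). The sketch below assumes that library; the class count and learning rate are placeholders.

```python
import torch
from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN

n_species = 20                                      # hypothetical number of bird species
model = ECAPA_TDNN(input_size=80, lin_neurons=192)  # 80-dim MFCC input (step 1-2-4)
head = torch.nn.Linear(192, n_species)              # fully connected layer (step 2-5)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()               # softmax + cross-entropy (steps 2-5/2-6)

def train_step(mfcc_batch, labels):
    # mfcc_batch: (batch, time, 80); the model applies the convolutional layer,
    # SE-Res2Blocks, and attentive statistics pooling internally (steps 2-1 to 2-4)
    emb = model(mfcc_batch).squeeze(1)
    loss = loss_fn(head(emb), labels)
    opt.zero_grad(); loss.backward(); opt.step()    # step 2-6: back-propagation
    return loss.item()
```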
Step 3, inputting 1 minute of bird song audio, removing silent segments with the voice endpoint detection algorithm based on a Gaussian mixture model, and extracting the segments containing bird song. With reference to FIG. 4, this specifically comprises:
step 3-1, preprocessing and framing the bird song audio (the preprocessing flow is the same as in step 1), creating several classes, and dividing the bird song audio into 100 ms segments stored in those classes;
step 3-2, silence judgment (a code sketch follows this step), specifically comprising:
step 3-2-1, dividing the segment into sub-bands and calculating the sub-band log energies: according to the spectral characteristics and energy distribution of bird song signals, each frame of the bird song signal is divided into six sub-bands of 200-2000Hz, 2000-3000Hz, 3000-3500Hz, 3500-4500Hz, 4500-8000Hz, and 8000-24000Hz, the sub-band energies and the total energy are calculated, and their logarithms are taken;
step 3-2-2, when the total energy of the frame is greater than the minimum energy required to trigger the audio signal, calculating for each sub-band the speech probability P(X|H1) from the speech Gaussian mixture model and the noise probability P(X|H0) from the noise Gaussian mixture model;
step 3-2-3, calculating the likelihood ratio of the sub-band from the two probabilities: likelihood ratio = log(P(X|H1) / P(X|H0));
step 3-2-4, if the likelihood ratio of any single sub-band meets a preset threshold, judging the frame to be a voiced segment (local decision);
or, accumulating the likelihood ratios of all sub-bands into an overall likelihood ratio and, if the overall likelihood ratio meets a preset threshold, judging the frame to be a voiced segment (global decision); the frame is voiced as long as either the local or the global condition is satisfied;
otherwise, judging the frame to be a silent segment;
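A sketch of this frame-level decision is given below. It assumes one pre-trained bird-song GMM and one noise GMM per sub-band (here sklearn GaussianMixture objects); the energy gate and the two thresholds are illustrative placeholders, since their values are not published in the text.

```python
import numpy as np

BANDS = [(200, 2000), (2000, 3000), (3000, 3500),
         (3500, 4500), (4500, 8000), (8000, 24000)]   # step 3-2-1 sub-bands

def is_voiced(frame, fs, song_gmms, noise_gmms,
              e_min=1e-6, local_thr=2.0, global_thr=6.0):
    spec = np.abs(np.fft.rfft(frame)) ** 2
    if spec.sum() <= e_min:                           # energy gate (step 3-2-2)
        return False
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    log_e = [np.log(spec[(freqs >= lo) & (freqs < hi)].sum() + 1e-12)
             for lo, hi in BANDS]                     # sub-band log energies
    lrs = [song_gmms[k].score_samples([[e]])[0]       # log P(X|H1)
           - noise_gmms[k].score_samples([[e]])[0]    # - log P(X|H0)  (step 3-2-3)
           for k, e in enumerate(log_e)]
    return max(lrs) > local_thr or sum(lrs) > global_thr  # step 3-2-4: local OR global
```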
step 3-3, collecting the voiced segments (a code sketch follows step 3-3-4), which specifically comprises the following steps:
step 3-3-1, creating a double-ended data container and adding objects to it (instances of a Frame class together with their silence-detection results), then counting the number of voiced frames;
step 3-3-2, when the number of voiced frames exceeds 90% of the container capacity, judging that bird song has started and writing the data currently in the container into a newly constructed empty list;
step 3-3-3, repeating steps 3-3-1 and 3-3-2, and ending the list writing when the number of silent frames exceeds 90%;
step 3-3-4, returning the list data, i.e., the segments containing only bird song.
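The collector below is a minimal sketch of steps 3-3-1 to 3-3-4 using Python's collections.deque as the double-ended container; frames is an iterable of (frame, voiced_flag) pairs from the 100 ms segments, and the buffer size maxlen is an assumed value.

```python
from collections import deque

def collect_song(frames, maxlen=10):
    ring = deque(maxlen=maxlen)            # step 3-3-1: double-ended data container
    triggered, segment, segments = False, [], []
    for frame, voiced in frames:
        ring.append((frame, voiced))
        n_voiced = sum(1 for _, v in ring if v)
        if not triggered and n_voiced > 0.9 * ring.maxlen:
            triggered = True               # step 3-3-2: bird song starts
            segment = [f for f, _ in ring] # flush buffered frames into a new list
            ring.clear()
        elif triggered:
            segment.append(frame)
            if len(ring) - n_voiced > 0.9 * ring.maxlen:
                triggered = False          # step 3-3-3: >90% silent frames end the segment
                segments.append(segment)
                segment = []
                ring.clear()
    if triggered and segment:
        segments.append(segment)
    return segments                        # step 3-3-4: fragments containing only bird song
```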
Step 4, extracting the MFCCs from the segments containing bird song in the manner of step 1 and inputting them into the ECAPA-TDNN bird song classification model for recognition to obtain the recognition results.
Step 5, displaying the recognition results one by one on a graphical user interface, counting the number of results in each category, drawing a spectrogram, and exporting the identification information, including the start and stop times of the bird song, the recognition results, the similarity, and so on, to a table. This specifically comprises (a code sketch follows step 5-3):
step 5-1, building a graphical user interface, displaying the index, recognition result, and similarity of each segmented fragment one by one, and counting the numbers by bird song species;
step 5-2, drawing a spectrogram of the input bird song audio;
step 5-3, creating a new Excel table and storing the start and stop times, recognition results, similarity information, and so on, of the bird song.
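The plotting and export of steps 5-2 and 5-3 can be sketched as below (the GUI layer of step 5-1 is omitted); results is assumed to be a list of (start_s, end_s, species, similarity) tuples produced in step 4, and the output file name is a placeholder.

```python
import pandas as pd
import matplotlib.pyplot as plt

def report(audio, fs, results, xlsx_path="bird_song_results.xlsx"):
    plt.specgram(audio, NFFT=2048, Fs=fs, noverlap=1536)   # step 5-2: spectrogram
    plt.xlabel("Time (s)"); plt.ylabel("Frequency (Hz)")
    plt.show()
    df = pd.DataFrame(results, columns=["start (s)", "end (s)", "species", "similarity"])
    print(df["species"].value_counts())    # step 5-1 statistics: count per species
    df.to_excel(xlsx_path, index=False)    # step 5-3: export the table (needs openpyxl)
```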
In one embodiment, an ECAPA-TDNN based automated bird song species identification system is provided, the system comprising:
the first module is used for preprocessing the bird song signal and extracting the characteristics to obtain a Mel frequency cepstrum coefficient;
the second module is used for training the ECAPA-TDNN network based on the Mel frequency cepstrum coefficient to obtain an ECAPA-TDNN birdcasting classification model;
the third module is used for performing silence detection, removing silence fragments and generating a data set only containing bird song fragments;
a fourth module, configured to perform pretreatment and feature extraction on the segmented bird song segments, and identify the bird song segments through an ECAPA-TDNN bird song classification model;
and the fifth module is used for realizing user interaction and displaying classification and analysis results.
Specific limitations on the ECAPA-TDNN-based automatic bird song species identification system may be found in the above description of the method and are not repeated here. Each module in the system may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of when executing the computer program:
step 1, collecting bird song signals, preprocessing them to construct a bird song data set, and then obtaining Mel-frequency cepstral coefficients through feature extraction;
step 2, inputting the Mel-frequency cepstral coefficients into an ECAPA-TDNN network for training to obtain an ECAPA-TDNN bird song classification model;
step 3, inputting bird song audio, removing silent segments with a voice endpoint detection algorithm based on a Gaussian mixture model, and extracting the segments containing bird song;
step 4, extracting Mel-frequency cepstral coefficients from the segments containing bird song in the manner of step 1 and inputting them into the ECAPA-TDNN bird song classification model for recognition to obtain the recognition results;
step 5, displaying the recognition results one by one on a graphical user interface, counting the number of recognition results in each category, drawing a spectrogram, and exporting the identification information, including the start and stop times of the bird song, the recognition results, and the similarity, to a table.
For specific limitations on each step, reference may be made to the above limitations on the ECAPA-TDNN-based method for automatic identification of bird song species, which are not described in detail herein.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
step 1, collecting bird song signals, preprocessing them to construct a bird song data set, and then obtaining Mel-frequency cepstral coefficients through feature extraction;
step 2, inputting the Mel-frequency cepstral coefficients into an ECAPA-TDNN network for training to obtain an ECAPA-TDNN bird song classification model;
step 3, inputting bird song audio, removing silent segments with a voice endpoint detection algorithm based on a Gaussian mixture model, and extracting the segments containing bird song;
step 4, extracting Mel-frequency cepstral coefficients from the segments containing bird song in the manner of step 1 and inputting them into the ECAPA-TDNN bird song classification model for recognition to obtain the recognition results;
step 5, displaying the recognition results one by one on a graphical user interface, counting the number of recognition results in each category, drawing a spectrogram, and exporting the identification information, including the start and stop times of the bird song, the recognition results, and the similarity, to a table.
For specific limitations on each step, reference may be made to the above limitations on the ECAPA-TDNN-based method for automatic identification of bird song species, which are not described in detail herein.
The method is convenient and fast, highly practical, and highly accurate; it fully exploits the advantages of deep learning, realizes automatic segmentation and species identification of bird song recordings, and is of great significance for studying ecosystem biodiversity and protecting endangered birds.
The foregoing has outlined and described the basic principles, main features, and advantages of the present invention. It will be understood by those skilled in the art that the invention is not limited by the foregoing embodiments; the above embodiments and descriptions merely illustrate the principles of the invention, and various modifications, equivalent substitutions, and improvements may be made without departing from the spirit and scope of the invention.

Claims (10)

1. An ECAPA-TDNN-based automatic identification method for bird song species, characterized in that the method comprises the following steps:
step 1, collecting bird song signals, preprocessing them to construct a bird song data set, and then obtaining Mel-frequency cepstral coefficients through feature extraction;
step 2, inputting the Mel-frequency cepstral coefficients into an ECAPA-TDNN network for training to obtain an ECAPA-TDNN bird song classification model;
step 3, inputting bird song audio, removing silent segments with a voice endpoint detection algorithm based on a Gaussian mixture model, and extracting the segments containing bird song;
step 4, extracting Mel-frequency cepstral coefficients from the segments containing bird song in the manner of step 1 and inputting them into the ECAPA-TDNN bird song classification model for recognition to obtain the recognition results.
2. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 1, wherein the preprocessing in step 1 specifically comprises:
step 1-1-1, for the bird song signal, eliminating the direct current component and performing pre-emphasis;
step 1-1-2, performing high-pass filtering;
step 1-1-3, performing framing;
step 1-1-4, windowing with a Hanning window.
3. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 2, wherein the pre-emphasis in step 1-1-1 is specifically performed through the transfer function H(z) = 1 - a·z⁻¹, where a is the pre-emphasis coefficient.
4. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 2, wherein obtaining the Mel-frequency cepstral coefficients through feature extraction in step 1 specifically comprises:
step 1-2-1, performing a short-time Fourier transform on each bird song signal in the bird song data set, taking the absolute value of the result, and squaring it to obtain the energy spectrogram;
step 1-2-2, constructing a Mel filter bank and computing its dot product with the energy spectrum to obtain the Mel spectrogram;
step 1-2-3, taking the logarithm of the Mel spectrogram;
step 1-2-4, performing a discrete cosine transform and keeping the first P coefficients, where P is an integer, to obtain the Mel-frequency cepstral coefficients.
5. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 1, wherein the ECAPA-TDNN network comprises a convolutional layer, three SE-Res2Block layers, an Attentive Statistics Pooling layer, and a fully connected layer;
the specific process of inputting the Mel-frequency cepstral coefficients into the ECAPA-TDNN network model for training comprises:
inputting the Mel-frequency cepstral coefficients into the convolutional layer to obtain latent audio features;
performing multi-layer feature fusion on the latent audio features through the SE-Res2Block layers and extracting global information;
concatenating the outputs of the three SE-Res2Block layers along the feature dimension;
obtaining an attention-weighted mean and standard deviation through the Attentive Statistics Pooling layer and concatenating them along the feature dimension to obtain a vector;
performing softmax classification on the vector through the fully connected layer to obtain the classification result;
based on the classification result, updating the network parameters by back-propagation with a cross-entropy loss function to obtain the ECAPA-TDNN bird song classification model.
6. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 1, wherein step 3 specifically comprises the following steps:
step 3-1, preprocessing and framing the bird song audio, the preprocessing flow being the same as in step 1, creating several classes, and dividing the bird song audio into segments stored in those classes;
step 3-2, silence judgment, specifically comprising:
step 3-2-1, dividing the segment into sub-bands and calculating the sub-band log energies;
step 3-2-2, when the total energy of the frame is greater than the minimum energy required to trigger the audio signal, calculating for each sub-band the speech probability P(X|H1) from the speech Gaussian mixture model and the noise probability P(X|H0) from the noise Gaussian mixture model;
step 3-2-3, calculating the likelihood ratio of the sub-band from the two probabilities: likelihood ratio = log(P(X|H1) / P(X|H0));
step 3-2-4, if the likelihood ratio of any single sub-band meets a preset threshold, judging the frame to be a voiced segment;
or, accumulating the likelihood ratios of all sub-bands into an overall likelihood ratio and, if the overall likelihood ratio meets a preset threshold, judging the frame to be a voiced segment;
otherwise, judging the frame to be a silent segment;
step 3-3, collecting the voiced segments, which specifically comprises the following steps:
step 3-3-1, creating a double-ended data container, adding objects to it, and counting the number of voiced frames;
step 3-3-2, when the number of voiced frames exceeds 90% of the container capacity, judging that bird song has started and writing the data currently in the container into a newly constructed empty list;
step 3-3-3, repeating steps 3-3-1 and 3-3-2, and ending the list writing when the number of silent frames exceeds 90%;
step 3-3-4, returning the list data, i.e., the segments containing only bird song.
7. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 6, wherein step 3-2-1 specifically comprises: according to the spectral characteristics and energy distribution of bird song, dividing each frame of the bird song signal into six sub-bands of 200-2000Hz, 2000-3000Hz, 3000-3500Hz, 3500-4500Hz, 4500-8000Hz, and 8000-24000Hz, calculating the sub-band energies and the total energy, and taking their logarithms.
8. The ECAPA-TDNN based method for automatic identification of a bird song species of claim 6, further comprising:
step 5, displaying the recognition results one by one on a graphical user interface, counting the number of recognition results in each category, drawing a spectrogram, and exporting the identification information, including the start and stop times of the bird song, the recognition results, and the similarity, to a table.
9. The ECAPA-TDNN-based automatic identification method for bird song species according to claim 8, wherein the specific process of step 5 includes:
step 5-1, building a graphical user interface, displaying the index, recognition result, and similarity of each segmented fragment one by one, and counting the numbers by bird song species;
step 5-2, drawing a spectrogram of the input bird song audio;
step 5-3, creating a new table and storing the start and stop times, recognition results, and similarity information of the bird song.
10. An ECAPA-TDNN based automatic bird song species identification system based on the method according to any one of claims 1 to 9, characterized in that the system comprises:
the first module is used for preprocessing the bird song signal and extracting features to obtain the Mel-frequency cepstral coefficients;
the second module is used for training the ECAPA-TDNN network on the Mel-frequency cepstral coefficients to obtain the ECAPA-TDNN bird song classification model;
the third module is used for performing silence detection, removing silent fragments, and generating a data set containing only bird song fragments;
the fourth module is used for preprocessing the segmented bird song fragments, extracting their features, and identifying them with the ECAPA-TDNN bird song classification model;
and the fifth module is used for realizing user interaction and displaying classification and analysis results.
CN202310439188.9A 2023-04-23 2023-04-23 ECAPA-TDNN-based automatic identification method for bird song species Pending CN116524939A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310439188.9A CN116524939A (en) 2023-04-23 2023-04-23 ECAPA-TDNN-based automatic identification method for bird song species

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310439188.9A CN116524939A (en) 2023-04-23 2023-04-23 ECAPA-TDNN-based automatic identification method for bird song species

Publications (1)

Publication Number Publication Date
CN116524939A 2023-08-01

Family

ID=87396896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310439188.9A Pending CN116524939A (en) 2023-04-23 2023-04-23 ECAPA-TDNN-based automatic identification method for bird song species

Country Status (1)

Country Link
CN (1) CN116524939A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095694A (en) * 2023-10-18 2023-11-21 中国科学技术大学 Bird song recognition method based on tag hierarchical structure attribute relationship
CN117095694B (en) * 2023-10-18 2024-02-23 中国科学技术大学 Bird song recognition method based on tag hierarchical structure attribute relationship
CN117727309A (en) * 2024-02-18 2024-03-19 百鸟数据科技(北京)有限责任公司 Automatic identification method for bird song species based on TDNN structure
CN117727332A (en) * 2024-02-18 2024-03-19 百鸟数据科技(北京)有限责任公司 Ecological population assessment method based on language spectrum feature analysis
CN117727309B (en) * 2024-02-18 2024-04-26 百鸟数据科技(北京)有限责任公司 Automatic identification method for bird song species based on TDNN structure
CN117727332B (en) * 2024-02-18 2024-04-26 百鸟数据科技(北京)有限责任公司 Ecological population assessment method based on language spectrum feature analysis
CN117746871A (en) * 2024-02-21 2024-03-22 南方科技大学 Cloud-based bird song detection method and system
CN117746871B (en) * 2024-02-21 2024-07-16 深圳市规划和自然资源数据管理中心(深圳市空间地理信息中心) Cloud-based bird song detection method and system

Similar Documents

Publication Publication Date Title
CN110085251B (en) Human voice extraction method, human voice extraction device and related products
CN109065031B (en) Voice labeling method, device and equipment
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN116524939A (en) ECAPA-TDNN-based automatic identification method for bird song species
CN111063341B (en) Method and system for segmenting and clustering multi-person voice in complex environment
CN107274916B (en) Method and device for operating audio/video file based on voiceprint information
CN111279414B (en) Segmentation-based feature extraction for sound scene classification
CN100485780C (en) Quick audio-frequency separating method based on tonic frequency
CN107305541A (en) Speech recognition text segmentation method and device
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN102486920A (en) Audio event detection method and device
CN103700370A (en) Broadcast television voice recognition method and system
CN110880329A (en) Audio identification method and equipment and storage medium
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
CN102915729B (en) Speech keyword spotting system and system and method of creating dictionary for the speech keyword spotting system
CN110070859B (en) Voice recognition method and device
CN103559879A (en) Method and device for extracting acoustic features in language identification system
CN106409298A (en) Identification method of sound rerecording attack
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
CN102073631A (en) Video news unit dividing method by using association rule technology
CN114141252A (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN106531195B (en) A kind of dialogue collision detection method and device
CN111276124B (en) Keyword recognition method, device, equipment and readable storage medium
CN111091809A (en) Regional accent recognition method and device based on depth feature fusion
CN112885330A (en) Language identification method and system based on low-resource audio

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination