CN115294994A - Bird sound automatic identification system in real environment - Google Patents
- Publication number
- CN115294994A (application number CN202210739725.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L17/00 — Speaker identification or verification techniques
- G10L17/02 — Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/06 — Decision making techniques; pattern matching strategies
- G10L17/26 — Recognition of special voice characteristics, e.g. for use in lie detectors; recognition of animal voices
Abstract
The invention relates to the technical field of birdsong signal extraction and recognition, and in particular to an automatic bird song recognition system for real environments. The system comprises a preprocessing module for framing, windowing and filtering the bird song audio file; a feature extraction module for extracting the spectral features and music score features of the bird song audio file; a feature combination module for combining the extracted spectral features and music score features into a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set; and a species classification prediction module for determining the bird species from the obtained feature set of the bird song audio file. The invention removes the need for manual feature extraction, reducing cost and shortening the processing cycle; by combining feature sets, the key information in the song signal is captured more comprehensively; and by constructing the AMResNet deep learning network model, more accurate recognition and classification are achieved and the generalization ability of the model is improved.
Description
Technical Field
The invention relates to the technical field of birdsong signal extraction and recognition, and in particular to an automatic bird song recognition system for real environments.
Background
Birds are one of the representative groups of wild animals and an important component of the ecosystem; their survival and development help maintain the balance and stability of the whole ecosystem. Bird surveys and monitoring provide essential information about bird populations, such as species, numbers, life habits, quality of life and habitat conditions; they help researchers grasp the current state of bird resources and their dynamic changes, and provide a basis for the effective protection, sustainable use and scientific management of bird resources. However, traditional bird survey and monitoring methods suffer from long monitoring cycles, limited monitoring range and high labor intensity, and can no longer meet today's demands for digitized, automated and intelligent monitoring of bird species.
Bird song is one of the important biological characteristics of birds; it is highly distinctive and is widely applied in the classification of bird species. On this basis, bird surveys by acoustic recognition, using automatic recording equipment and recognition software, can overcome the above defects and enable efficient, non-invasive, low-interference and large-scale monitoring. Research on bird song helps people understand patterns of bird life activity such as breeding behavior and life habits, and enables automatic counting of individual birds or species, so that birds can be protected more effectively.
Bird song recognition methods can be divided into traditional recognition methods and deep learning methods.
The traditional approach to bird song recognition is classification by pattern matching, most commonly the Dynamic Time Warping (DTW) algorithm. This algorithm achieves high recognition accuracy, but its matching computation is excessive, which hurts recognition efficiency. Subsequently, feature-based classification models came into wide use; methods commonly used at home and abroad include Hidden Markov Models (HMM), Gaussian Mixture Models (GMM), Support Vector Machines (SVM), Random Forests (RF), Artificial Neural Networks (ANN), k-Nearest Neighbors (kNN), Bayesian network learning, and mixtures of these models. However, such methods face great difficulty in extracting suitable discriminative features.
With the development of deep learning, deep neural networks can automatically learn highly complex data features, thereby overcoming the traditional problems that features are hard to craft manually, generalization is unsatisfactory, and deep features cannot be extracted; in recent years they have achieved remarkable results in this application. In 1997, McIlraith and Card (McIlraith A. L., Card H. C. Bird song identification using artificial neural networks and statistical analysis [C] // Proceedings of the Conference on Electrical and Computer Engineering, 1997) applied artificial neural networks and statistical analysis to bird song identification. To further improve the accuracy of acoustic recognition, Convolutional Neural Networks (CNN), which excel in image classification tasks, have become a hot topic in sound classification research. Lasseck (M. Lasseck, Bird species identification in soundscapes, Working Notes of CLEF 2019) achieved recognition of about 1500 bird species in 2019 by feeding spectrograms into a convolutional neural network, with an average recognition rate of no less than 70%. Deep learning can therefore bring better recognition performance to the bird song recognition problem.
However, when processing bird song in noisy environments, the recognition performance of existing models is still insufficient, so the problem requires deeper study. The following technical problems remain:
(1) Existing traditional song recognition methods are pattern matching processes: features must be extracted manually, the processing cycle is long, and recognition efficiency is low, making these methods hard to apply to the large-scale, low-latency monitoring scenarios required for bird statistical analysis.
(2) Most existing recognition methods take a single feature map as input, which limits the recognition performance of the network model. Bird song is a non-stationary signal without fixed substructures or patterns, and a single feature may fail to capture the important audio information, making it difficult to avoid recognition errors caused by similar-sounding noise.
(3) Existing deep learning methods suffer from poor generalization and limited practicality. Current studies evaluate classification models only in a single scene; in real, noise-filled environments, methods that combine detection and classification of multiple species are rare, and methods that achieve good results are rarer still.
Disclosure of Invention
The present invention aims to provide an automatic bird song recognition system for real environments, so as to solve the problems of the prior art described in the background section.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides an automatic bird song recognition system for real environments, comprising:
a preprocessing module for framing, windowing and filtering the bird song audio file;
a feature extraction module for extracting the spectral features and music score features of the bird song audio file;
a feature combination module for combining the extracted spectral features and music score features into a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set;
and a species classification prediction module for determining the bird species from the obtained feature set of the bird song audio file.
Further, the species classification prediction module is an AMResNet network, which comprises a convolutional layer (Conv), a batch normalization layer (BN), a rectified linear unit (ReLU), a max pooling layer (MaxPool), 4 structure blocks (ARBlock), an average pooling layer (AvgPool) and a fully connected layer (FC); in each structure block, an attention layer is connected in series with a residual layer;
the attention layer comprises a channel attention module and a spatial attention module, which apply weights to the channel and spatial dimensions respectively, concentrating the model on the most important information in the time and frequency domains and filtering out irrelevant noise;
the residual layer consists of two residual structures, each comprising a sequentially connected Conv3×3-BN-ReLU operation and a skip connection.
Further, the spectral features include: the log Mel spectrum (Log-mel) and Mel-frequency cepstral coefficients (MFCC);
the music score features include: chroma (Chroma), spectral contrast (Spectral contrast) and tonal centroid (Tonnetz) features.
Further, in the channel attention module, a MaxPool-MLP sequential connection and an AvgPool-MLP sequential connection are combined by addition;
the channel dimension of the output data is then reduced to 1 by a convolutional layer with kernel size 3 and stride 1, and the attention weights are obtained using sigmoid as the activation function, calculated as follows:
A_C(x) = σ(W_mlp(Avg(x)) + W_mlp(Max(x)))
where x and A_C(x) denote the input and output of the channel attention module respectively, Avg(·) and Max(·) denote average pooling and max pooling, W_mlp(·) denotes the multi-layer perceptron, and σ(·) is the sigmoid function.
Further, in the spatial attention module, the MaxPool and AvgPool outputs are concatenated (Concat) along the channel dimension, aggregating the feature map produced by the channel attention module into an H × W × 2 tensor; the attention weights of the spatial attention module are then obtained through a two-dimensional convolution with kernel size 7 and padding 3 followed by a sigmoid function, calculated as follows:
A_S(x) = σ(f_3([Avg(A_C(x)); Max(A_C(x))]))
where x and A_S(x) denote the input and output of the spatial attention module respectively, and f_3(·) denotes the convolution with kernel size 7 and padding 3.
Further, the output Y_i of the attention layer is obtained by multiplying the input feature tensor X_i by the outputs of the two attention modules:
Y_i = X_i ⊗ A_C(X_i) ⊗ A_S(X_i)
where ⊗ denotes element-wise multiplication with broadcasting.
further, the calculation process of the residual layer is as follows:
y=x+F(x,w)
wherein x and y represent the input and output of the residual structure, respectively, and w is the corresponding weight of the input element;
the output calculation of the residual network formed by the residual layers comprises the following steps:
where L and L denote the number of residual layers, f ReLU () Indicating the ReLU activation function.
The invention also provides a method for the automatic bird song recognition system in a real environment, comprising the following steps:
s1, performing framing, windowing and filter processing on a read bird song audio file, and extracting frequency spectrum characteristics and music score characteristics;
s2, combining the extracted features to obtain a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set;
and S3, the obtained feature set is used as input and sent into an AMResNet network for low-dimensional to high-dimensional feature learning, and bird species predicted by the model are output.
Further, in S2, the feature combination of the extracted features mainly comprises:
concatenating the chroma, spectral contrast and tonal centroid features to obtain the extended features (CST);
aggregating the log Mel spectrum with the extended features to obtain the Log-CST feature set;
aggregating the Mel-frequency cepstral coefficients with the extended features to obtain the MFCC-CST feature set;
aggregating the log Mel spectrum, the Mel-frequency cepstral coefficients and the extended features to obtain the Log-MFCC-CST feature set;
the Log-CST, MFCC-CST and Log-MFCC-CST feature sets are all combined in a linear manner.
Further, the process by which the AMResNet network is trained in advance on bird audio file features comprises:
acquiring bird audio through a mobile terminal and extracting its spectral features and music score features;
loading each bird species label, together with the spectral and music score features, onto a PC equipped with the AMResNet network;
the PC first compares the bird species against the previously catalogued species to determine whether it is a new species;
if it is a new species, the spectral and music score features are first received and recorded as new sound data; a pre-trained weight file is then downloaded and loaded, transfer learning is performed on the spectral features, music score features and bird species, and training is carried out on the deployed AMResNet network; the newly trained feature data set and pre-trained weight file are uploaded, the bird song data set in the database is updated, and background staff are alerted to monitor the new bird species;
if it is not a new species, only the pre-trained weight file needs to be downloaded; the spectral and music score features are then used as input to the AMResNet network, the species predicted by the network is output, and the result is displayed; at the same time, the data are sent to the backend and the bird counts in the database are updated, making it convenient for staff to monitor trends in bird species.
The invention has at least the following beneficial effects:
the proposed recognition method removes the need for manual feature extraction, reduces labor cost, shortens the recognition cycle, and makes real-time monitoring of changes in bird species possible;
the invention designs and implements an effective feature combination scheme, combining the log Mel spectrum (Log-mel), Mel-frequency cepstral coefficients (MFCC), chroma (Chroma), spectral contrast (Spectral contrast) and tonal centroid (Tonnetz) features into new feature sets, thereby capturing the key information in the song signal more comprehensively;
the invention builds the AMResNet deep learning network model, a residual network doubly combined with channel and spatial attention mechanisms; the skip connections in the residual network alleviate vanishing gradients and network degradation, allowing a deeper network architecture to be built and more accurate recognition and classification to be achieved; the attention mechanism, by weighting channels and spatial positions, focuses the model on the most important information in the time and frequency domains, suppresses the noise component of the features, and improves the generalization ability of the model.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of bird song recognition according to the present invention;
FIG. 2 is a spectral plot of a feature;
FIG. 3 is a block diagram of a feature set;
FIG. 4 is a schematic diagram of an AMResNet model processing a single channel feature set input;
FIG. 5 is an architectural diagram of an attention layer;
FIG. 6 is a diagram illustrating a residual structure in a residual layer;
FIG. 7 shows the change in feature maps in a model without the attention layer (a) and a model with the attention layer (b);
FIG. 8 is a graph of an AMResNet confusion matrix;
FIG. 9 is a ROC plot under ten-fold cross-validation;
FIG. 10 is a structural diagram of the automatic bird song recognition system.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, the overall bird song recognition process of the present invention is divided into three main stages: feature extraction, feature combination, and species classification prediction with AMResNet.
First, the read-in bird song audio file is framed, windowed and filtered, and the spectral features and music score features are extracted: five features in total, namely the log Mel spectrum, Mel-frequency cepstral coefficients, chroma, spectral contrast and tonal centroid. These five features are then combined in a certain manner to obtain a single-channel feature set. Finally, the feature set is fed into the AMResNet network, which compares it against the audio file features of all birds trained in advance and outputs the bird species obtained from the comparison.
To this end, the system of the invention comprises at least: the preprocessing module is used for performing framing, windowing and filter processing on the bird song audio file;
the characteristic extraction module is used for extracting the frequency spectrum characteristic and the music score characteristic of the bird song audio file;
the characteristic combination module is used for combining the extracted frequency spectrum characteristic and the extracted music score characteristic to obtain a Log-CST characteristic set, a MFCC-CST characteristic set and a Log-MFCC-CST characteristic set;
and the species classification prediction module is used for obtaining bird species according to the feature set of the obtained bird song audio file.
Specifically, the detailed technical scheme is as follows:
(1) Feature extraction
The read-in bird song audio files are stored in MP3 format; to suit the input of a deep learning network model, their features must be extracted computationally. Using the audio processing library Librosa, the window length of the Fast Fourier Transform (FFT) is set to 1024 and the frame shift to 512; the number of channels is 40 for both the log Mel spectrum and the Mel-frequency cepstral coefficients, and 12, 7 and 6 for chroma, spectral contrast and tonal centroid respectively. The extracted feature matrices of the log Mel spectrum and the Mel-frequency cepstral coefficients are therefore both of size 40 × 63, while the chroma, spectral contrast and tonal centroid features are of size 12 × 63, 7 × 63 and 6 × 63 respectively; the spectrograms of the five features are shown in fig. 2.
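As an illustration of the framing and windowing step, the following NumPy sketch (a hypothetical stand-in for the Librosa calls, assuming a Hann window and no center padding) splits a signal into overlapping frames using the window length of 1024 and frame shift of 512 quoted above; with roughly 1.5 s of audio this yields the 63 frames seen in the feature matrix sizes:

```python
import numpy as np

def frame_signal(y, n_fft=1024, hop=512):
    """Split a 1-D signal into overlapping, Hann-windowed frames
    (n_fft and hop match the STFT settings quoted in the text)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop      # no center padding assumed
    return np.stack([y[i * hop : i * hop + n_fft] * window
                     for i in range(n_frames)])

# ~1.5 s of synthetic audio at an assumed 22050 Hz sample rate
y = np.sin(2 * np.pi * 3000 * np.arange(32768) / 22050)
frames = frame_signal(y)
print(frames.shape)  # (63, 1024)
```

Each row of `frames` would then feed the FFT that underlies all five spectral and music score features.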
(2) Feature combination
Audio features contain rich information, but different features carry different information; combining the features in a certain manner maximizes the useful information obtained.
The log Mel spectrum and Mel-frequency cepstral coefficients are the most commonly used features in automatic audio recognition. Chroma, spectral contrast and tonal centroid features are the most commonly used features in Music Information Retrieval (MIR) and, after concatenation, serve as the extended features (CST). The log Mel spectrum aggregated with the extended features forms the Log-CST feature set; the Mel-frequency cepstral coefficients aggregated with the extended features form the MFCC-CST feature set; and the log Mel spectrum, Mel-frequency cepstral coefficients and extended features together form the Log-MFCC-CST feature set. All feature sets are combined linearly, with Log-CST (FIG. 3(a)), MFCC-CST (FIG. 3(b)) and Log-MFCC-CST (FIG. 3(c)) of sizes 65 × 63, 65 × 63 and 105 × 63 respectively.
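The linear combination described above amounts to stacking the feature matrices along the channel (row) axis. A minimal NumPy sketch with random stand-in matrices (not real audio features) reproduces the stated sizes:

```python
import numpy as np

# Shapes as reported in the description: 40 x 63 spectral features,
# 12/7/6 x 63 music score features (random stand-ins for illustration).
log_mel = np.random.rand(40, 63)
mfcc = np.random.rand(40, 63)
chroma, contrast, tonnetz = (np.random.rand(c, 63) for c in (12, 7, 6))

cst = np.concatenate([chroma, contrast, tonnetz])        # 25 x 63 extended features
log_cst = np.concatenate([log_mel, cst])                 # 65 x 63
mfcc_cst = np.concatenate([mfcc, cst])                   # 65 x 63
log_mfcc_cst = np.concatenate([log_mel, mfcc, cst])      # 105 x 63
```

The 63-frame time axis is shared by all five features, which is what makes this row-wise stacking well defined.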
(3) Deep learning network AMResNet
AMResNet is designed to solve the species classification problem for bird song; it is a residual network from the vision domain combined with an attention mechanism. The structure of AMResNet is shown in fig. 4: the main branch comprises a convolutional layer, a batch normalization layer, a max pooling layer, structure blocks 1-4, an average pooling layer and a fully connected layer. In each structure block, the attention layer, containing a channel attention module and a spatial attention module, is concatenated with the residual layer, containing two skip connections. N denotes the number of channels in the four blocks, with values 64, 128, 256 and 512 respectively. The input data is a single-channel feature map from the feature set, and its size is reduced by half after it passes through the first 7 × 7 convolutional layer and the 2 × 2 max pooling layer. The feature maps produced by the attention layer and the residual layer have the same size as their input, but the number of channels changes. After average pooling following the 4th block, the feature maps for flattening are all of size 1 × 1 and are delivered to a fully connected layer with 1024 hidden units; finally, a tensor of the corresponding size is output according to the number of categories in the bird song data set.
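The downsampling at the front of the network can be traced with standard conv/pool output-size arithmetic. The sketch below assumes stride 2 and padding 3 for the 7 × 7 convolution and stride 2 for the 2 × 2 max pooling (common settings, not stated explicitly in the text), applied to the 105 × 63 Log-MFCC-CST input:

```python
def conv_out(n, kernel, stride, pad):
    """Output size of a conv/pool along one dimension: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

h, w = 105, 63                                              # Log-MFCC-CST input
h, w = (conv_out(d, 7, stride=2, pad=3) for d in (h, w))    # 7x7 convolutional layer
h, w = (conv_out(d, 2, stride=2, pad=0) for d in (h, w))    # 2x2 max pooling layer
print(h, w)  # 26 16
```

Under these assumptions each of the two layers roughly halves the spatial size, consistent with the description above.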
Referring to fig. 10, an intelligent identification process of the bird song automatic identification system in a real environment specifically includes:
acquiring bird audio through a mobile terminal and extracting its spectral features and music score features;
loading each bird species label, together with the spectral and music score features, onto a PC equipped with the AMResNet network;
the PC first compares the bird species against the previously catalogued species to determine whether it is a new species;
if it is a new species, the spectral and music score features are first received and recorded as new sound data; a pre-trained weight file is then downloaded and loaded, transfer learning is performed on the spectral features, music score features and bird species, and training is carried out on the deployed AMResNet network; the newly trained feature data set and pre-trained weight file are uploaded, the bird song data set in the database is updated, and background staff are alerted to monitor the new bird species;
if it is not a new species, only the pre-trained weight file needs to be downloaded; the spectral and music score features are then used as input to the AMResNet network, the species predicted by the network is output, and the result is displayed; at the same time, the data are sent to the backend and the bird counts in the database are updated, making it convenient for staff to monitor trends in bird species.
3.1 attention layer
Each attention layer consists of a channel attention module (fig. 5, left box) and a spatial attention module (fig. 5, right box), which apply weights to the channel and spatial dimensions respectively, focusing the model on the most important information in the time and frequency domains and thereby filtering out irrelevant noise.
In the channel attention module, a MaxPool-MLP sequential connection and an AvgPool-MLP sequential connection are combined by addition. The channel dimension of the output data is then reduced to 1 by a convolutional layer with kernel size 3 and stride 1, and sigmoid is used as the activation function to obtain the attention weights. The calculation process is as follows:
A_C(x) = σ(W_mlp(Avg(x)) + W_mlp(Max(x)))
where x and A_C(x) denote the input and output of the channel attention module respectively, Avg(·) and Max(·) denote average pooling and max pooling, W_mlp(·) denotes the multi-layer perceptron, and σ(·) is the sigmoid function.
In the spatial attention module, the MaxPool and AvgPool outputs are concatenated (Concat) along the channel dimension: the feature maps output by the channel attention module are aggregated into an H × W × 2 tensor, and the attention weight of the spatial attention module is obtained through a two-dimensional convolution with kernel size 7 and padding 3, followed by a sigmoid function. The calculation process is as follows:
A_S(x) = σ(f_3([Avg(A_C(x)); Max(A_C(x))]))
wherein x and A_S(x) represent the input and output of the spatial attention module respectively, and f_3() denotes the convolution operation with kernel size 7 and padding 3.
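A corresponding NumPy sketch of the spatial attention computation (assuming a channel-first (C, H, W) input that has already passed through the channel attention module, and a hypothetical 7 × 7 × 2 convolution kernel):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(img, kernel, pad):
    # Naive single-channel 2-D convolution with zero padding (stride 1).
    h, w = img.shape
    k = kernel.shape[0]
    padded = np.pad(img, pad)
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

def spatial_attention(x, kernel):
    """A_S(x) = sigmoid(f_3([Avg(x); Max(x)])), kernel size 7, padding 3.

    x:      (C, H, W) feature map
    kernel: (7, 7, 2) weights applied to the stacked [Avg; Max] maps
    """
    avg = x.mean(axis=0)        # channel-wise average pooling -> (H, W)
    mx = x.max(axis=0)          # channel-wise max pooling -> (H, W)
    z = conv2d_same(avg, kernel[..., 0], 3) + conv2d_same(mx, kernel[..., 1], 3)
    return sigmoid(z)           # per-position attention weights -> (H, W)
```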
Finally, the output of the entire attention layer is obtained from the input feature tensor X_i and the outputs of the two attention modules. The calculation process is as follows:

Y_i = A_S(A_C(X_i) ⊗ X_i) ⊗ (A_C(X_i) ⊗ X_i)

wherein ⊗ denotes element-wise multiplication.
3.2 Residual layer
Each residual layer consists of two residual structures (fig. 6), and each residual structure consists of a sequentially connected Conv3 × 3-BN-ReLU operation and a skip connection. Compared with a general deep neural network, the deep structures in a residual network are no longer required to learn identity mappings that are difficult to fit; instead, they fit a residual function. As long as the residual function F() equals 0, the structure reduces to an identity transformation, and fitting the residual is simple and easy to realize. The calculation process is as follows:
y=x+F(x,w)
where x and y represent the input and output of the residual structure, respectively, and w is the corresponding weight of the input element.
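A minimal NumPy sketch of one residual structure (single-channel 2-D input; batch normalization omitted for brevity, and the 3 × 3 convolution weights w1, w2 are hypothetical):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv3x3(x, w):
    # Naive 3x3 convolution with zero padding 1 and stride 1 ('same' output size).
    h, wd = x.shape
    p = np.pad(x, 1)
    out = np.empty((h, wd))
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * w)
    return out

def residual_block(x, w1, w2):
    # y = x + F(x, w), with F = Conv3x3 -> ReLU -> Conv3x3 (BN omitted).
    return x + conv3x3(relu(conv3x3(x, w1)), w2)
```

When the second convolution's weights are all zero, F(x, w) = 0 and the block reduces to the identity transformation described above.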
And finally, the residual network outputs the combined result of all the deep structures, and the calculation process is as follows:

x_L = f_ReLU(x_l + Σ_{i=l}^{L-1} F(x_i, w_i))

wherein l and L denote the indices of the current and last residual layers respectively, and f_ReLU() denotes the ReLU activation function.
(4) Statistical analysis
In this study, the evaluation indices used to test the performance of the proposed method are: accuracy (Accuracy), precision (Precision), recall (Recall) and F1 score (F1-score). Accuracy is the gold-standard index of a classification model and is suitable for both binary and multi-class tasks. For a classification model f() evaluated on a test set D of n samples, the accuracy is calculated as follows:

Accuracy = (1/n) Σ_{(x,y)∈D} I(f(x) = y)

wherein I() is the indicator function, which equals 1 when the prediction f(x) matches the true label y and 0 otherwise.
the precision rate represents the proportion of correctly classified bird song samples in the prediction tag set, and the calculation process is as follows:
the recall rate represents the proportion of correctly classified bird song samples in the real label set, and the calculation process is as follows:
the F1 score is a harmonic mean of the precision and recall, and is calculated as follows:
to define these indices, values for True Positive (TP), true Negative (TN), false Positive (FP) and False Negative (FN) were also used in this study.
(5) Experimental verification
The bird song data set, composed of 12651 song recordings in real environments provided by the Beijing Academy of Artificial Intelligence (BAAI), is used as the experimental object; it covers 19 species: gray goose (AA), great swan (CC), green duck (AP), green wing duck (ACr), western quail (CQ), pheasant (PCo), red throat submerged bird (GS), cocket (ACi), common whorlled plotter (PCa), eagle (AG), eurasian (BB), western rice-stem chicken (WC), boney chicken (FA), black wing long-foot snipe (HH), phoenix-head wheat chicken (VV), white waist snipe (TC), snipe (TT), snipe (TG) and sparrow (Pa). During the experiment, all bird song data were divided into a training set (8863 song recordings) and a test set (3788 song recordings), and the model was trained by ten-fold cross-validation. We tested the experimental results of different combinations of feature sets (Table 1), different numbers of attention layers (Table 2), and the presence or absence of attention layers (fig. 7), and verified the performance of AMResNet using the confusion matrix (fig. 8), the ROC curve (fig. 9), and the precision, recall and F1 score (Table 3). In addition, we compared the recognition accuracy of AMResNet with seven other common classification models (Table 4): the Gaussian mixture model (GMM), hidden Markov model (HMM), three-layer cascaded artificial neural network (ANN), ResNet-18, ResNet-34, ResNet-50 and Vision Transformer (ViT).
The experimental results show that the combined Log-CST features (Table 1) used in this study are the most effective, and the four attention layers used in AMResNet (Table 2) work best. Compared with the model without attention layers (fig. 7(a)), the model with attention layers (fig. 7(b)) effectively removes the noise part of the feature map (yellow box in fig. 7(a)) and highlights its relevant part (red box in fig. 7(b)). On the ten-fold cross-validated ROC curves (fig. 9), AMResNet achieved a good mean AUC value, and it also achieved good classification in the confusion matrix (fig. 8) and good identification of each species (Table 3). Finally, in the comparison experiment between different models (Table 4), the AMResNet model, which combines the advantages of the residual network and the attention mechanism, not only deepens the network while reducing the amount of calculation, but also gives higher weight to the important information of the input data, thereby obtaining the best recognition effect.
TABLE 1 AMResNet identification accuracy comparison based on feature sets of different combinations
TABLE 2 comparison of model identification accuracy for different numbers of attention layers
TABLE 3 Identification results of AMResNet on each species
TABLE 4 Comparison of bird song identification accuracy for different models
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (10)
1. An automatic bird song recognition system in a real environment, comprising:
the preprocessing module is used for framing, windowing and filtering the bird song audio file;
the feature extraction module is used for extracting the spectral features and score features of the bird song audio file;
the feature combination module is used for combining the extracted spectral features and score features to obtain a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set;
and the species classification prediction module is used for obtaining bird species according to the feature set of the bird song audio file.
2. The automatic bird song recognition system in a real environment according to claim 1, wherein the species classification prediction module is an AMResNet network comprising a convolutional layer (Conv), a batch normalization layer (BN), a modified linear unit (ReLU), a max pooling layer (MaxPool), 4 structured blocks (ARBlock), an average pooling layer (AvgPool), and a full connection layer (FC); in the structure block, an attention layer is connected with a residual error layer in series;
the attention layer comprises a channel attention module and a spatial attention module, which weight the channels and the spatial positions respectively, concentrating the model on the most important information in the time and frequency domains and filtering out irrelevant noise;
the residual layer consists of two residual structures, each residual structure comprising a sequentially connected Conv3 x 3-BN-ReLU operation and a jump connection.
3. The system of claim 1, wherein the spectral features comprise: the log-mel spectrum (Log-mel) and mel-frequency cepstral coefficients (MFCC);
the score features include: chroma (Chroma), spectral contrast (Spectral contrast) and tonal centroid (Tonnetz).
4. The automatic bird song recognition system of claim 2, wherein in the channel attention module, the MaxPool-MLP sequential connection operation and the AvgPool-MLP sequential connection operation are combined by addition;
reducing the channel dimension of the output data to 1 through a convolution layer with kernel size 3 and stride 1, and obtaining the attention weight by using sigmoid as the activation function, calculated as follows:
A_C(x) = σ(W_mlp(Avg(x)) + W_mlp(Max(x)))
wherein x and A_C(x) represent the input and output of the channel attention module respectively, Avg() and Max() denote average pooling and maximum pooling, W_mlp() denotes the multi-layer perceptron, and σ() is the sigmoid function.
5. The system according to claim 4, wherein in the spatial attention module, the MaxPool and AvgPool outputs are concatenated (Concat) along the channel dimension, the feature maps output by the channel attention module are aggregated into an H × W × 2 tensor, and the attention weight of the spatial attention module is obtained through a two-dimensional convolution operation with kernel size 7 and padding 3 and a sigmoid function, calculated as follows:
A_S(x) = σ(f_3([Avg(A_C(x)); Max(A_C(x))]))
wherein x and A_S(x) represent the input and output of the spatial attention module respectively, and f_3() denotes the convolution operation with kernel size 7 and padding 3.
7. The system according to claim 2, wherein the residual layer is calculated as follows:
y=x+F(x,w)
wherein x and y represent the input and output of the residual structure, respectively, and w is the corresponding weight of the input element;
the output of the residual network formed by the residual layers is calculated as follows:

x_L = f_ReLU(x_l + Σ_{i=l}^{L-1} F(x_i, w_i))

wherein l and L denote the indices of the current and last residual layers respectively, and f_ReLU() denotes the ReLU activation function.
8. A method for an automatic bird song recognition system in a real environment, comprising the steps of:
S1, performing framing, windowing and filtering processing on the read bird song audio file, and extracting spectral features and score features;
s2, combining the extracted features to obtain a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set;
and S3, the obtained feature set is used as input and sent into an AMResNet network for low-dimensional to high-dimensional feature learning, and bird species predicted by the model are output.
9. The method according to claim 8, wherein the step S2 of combining the extracted features mainly includes:
concatenating the chroma, spectral contrast and tonal centroid features to obtain an extended feature;
aggregating the log-mel spectrum and the extended feature to obtain the Log-CST feature set;
aggregating the mel-frequency cepstral coefficients and the extended feature to obtain the MFCC-CST feature set;
aggregating the log-mel spectrum, mel-frequency cepstral coefficients and the extended feature to obtain the Log-MFCC-CST feature set;
wherein the Log-CST feature set, the MFCC-CST feature set and the Log-MFCC-CST feature set are all combined in a linear manner.
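As an illustration of this linear (concatenation-based) combination, a NumPy sketch with hypothetical per-frame feature matrices (the row counts — 12 chroma bins, 7 spectral-contrast bands, 6 tonal-centroid dimensions, 128 log-mel bands, 20 MFCCs — are common defaults, not values fixed by the claims):

```python
import numpy as np

n_frames = 100
chroma   = np.random.rand(12, n_frames)   # chroma
contrast = np.random.rand(7, n_frames)    # spectral contrast
tonnetz  = np.random.rand(6, n_frames)    # tonal centroid
log_mel  = np.random.rand(128, n_frames)  # log-mel spectrum
mfcc     = np.random.rand(20, n_frames)   # mel-frequency cepstral coefficients

# Extended feature: concatenate chroma, spectral contrast and tonal centroid.
cst = np.concatenate([chroma, contrast, tonnetz], axis=0)

log_cst      = np.concatenate([log_mel, cst], axis=0)        # Log-CST feature set
mfcc_cst     = np.concatenate([mfcc, cst], axis=0)           # MFCC-CST feature set
log_mfcc_cst = np.concatenate([log_mel, mfcc, cst], axis=0)  # Log-MFCC-CST feature set
```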
10. The method of claim 8, wherein the process by which the AMResNet network is pre-trained on the audio file features of each bird comprises:
acquiring bird audio through a mobile terminal, and extracting its spectral features and score features;
uploading the species, spectral features and score features of each bird to a PC terminal equipped with the AMResNet network;
the PC terminal first compares the bird species with the pre-divided species to judge whether it is a new species;
if the bird is a new species, the spectral features and score features are first received and recorded as new sound data; then a pre-training weight file is downloaded and loaded, transfer learning is performed on the spectral features, score features and bird species, and training is carried out on the deployed AMResNet network; finally, the newly trained feature data set and pre-training weight file are uploaded, the bird song data set in the database is updated, and background staff are reminded to monitor the new bird species;
if the bird is not a new species, only the pre-training weight file needs to be downloaded; the spectral features and score features are then used as the input of the AMResNet network, the species predicted by the network is output, and the result is displayed; simultaneously, the data are transmitted to the background and the count of this bird species in the database is updated, facilitating the staff's monitoring of change trends in bird species.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210739725.7A CN115294994A (en) | 2022-06-28 | 2022-06-28 | Bird sound automatic identification system in real environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115294994A true CN115294994A (en) | 2022-11-04 |
Family
ID=83819691
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210739725.7A Pending CN115294994A (en) | 2022-06-28 | 2022-06-28 | Bird sound automatic identification system in real environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115294994A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114974268A (en) * | 2022-06-08 | 2022-08-30 | 江苏麦克马尼生态科技有限公司 | Bird song recognition monitoring system and method based on Internet of things |
CN114974268B (en) * | 2022-06-08 | 2023-09-05 | 江苏麦克马尼生态科技有限公司 | Bird song recognition monitoring system and method based on Internet of things |
CN116206612A (en) * | 2023-03-02 | 2023-06-02 | 中国科学院半导体研究所 | Bird voice recognition method, model training method, device and electronic equipment |
CN116559778A (en) * | 2023-07-11 | 2023-08-08 | 海纳科德(湖北)科技有限公司 | Vehicle whistle positioning method and system based on deep learning |
CN116559778B (en) * | 2023-07-11 | 2023-09-29 | 海纳科德(湖北)科技有限公司 | Vehicle whistle positioning method and system based on deep learning |
CN117095694A (en) * | 2023-10-18 | 2023-11-21 | 中国科学技术大学 | Bird song recognition method based on tag hierarchical structure attribute relationship |
CN117095694B (en) * | 2023-10-18 | 2024-02-23 | 中国科学技术大学 | Bird song recognition method based on tag hierarchical structure attribute relationship |
CN118098270A (en) * | 2024-04-24 | 2024-05-28 | 安徽大学 | Noise tracing method based on feature extraction and feature fusion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115294994A (en) | Bird sound automatic identification system in real environment | |
Priyadarshani et al. | Automated birdsong recognition in complex acoustic environments: a review | |
CN110491416B (en) | Telephone voice emotion analysis and identification method based on LSTM and SAE | |
CN111048114A (en) | Equipment and method for detecting abnormal sound of equipment | |
CN110211594B (en) | Speaker identification method based on twin network model and KNN algorithm | |
CN108520753A (en) | Voice lie detection method based on the two-way length of convolution memory network in short-term | |
WO2018166316A1 (en) | Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures | |
CN108806694A (en) | A kind of teaching Work attendance method based on voice recognition | |
Himawan et al. | 3d convolution recurrent neural networks for bird sound detection | |
Ting Yuan et al. | Frog sound identification system for frog species recognition | |
CN115410711B (en) | White feather broiler health monitoring method based on sound signal characteristics and random forest | |
CN107193378A (en) | Emotion decision maker and method based on brain wave machine learning | |
CN111048097A (en) | Twin network voiceprint recognition method based on 3D convolution | |
CN111986699A (en) | Sound event detection method based on full convolution network | |
CN116842460A (en) | Cough-related disease identification method and system based on attention mechanism and residual neural network | |
CN112200238A (en) | Hard rock tension-shear fracture identification method and device based on sound characteristics | |
CN115578678A (en) | Fish feeding intensity classification method and system | |
Xiao et al. | AMResNet: An automatic recognition model of bird sounds in real environment | |
CN114863905A (en) | Voice category acquisition method and device, electronic equipment and storage medium | |
CN112329819A (en) | Underwater target identification method based on multi-network fusion | |
Chinmayi et al. | Emotion Classification Using Deep Learning | |
CN112052880A (en) | Underwater sound target identification method based on weight updating support vector machine | |
Wang et al. | A hierarchical birdsong feature extraction architecture combining static and dynamic modeling | |
CN113936667A (en) | Bird song recognition model training method, recognition method and storage medium | |
CN115170942A (en) | Fish behavior identification method with multilevel fusion of sound and vision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||