CN115294994A - Bird sound automatic identification system in real environment - Google Patents

Bird sound automatic identification system in real environment

Info

Publication number
CN115294994A
Authority
CN
China
Prior art keywords
characteristic
bird
cst
species
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210739725.7A
Other languages
Chinese (zh)
Inventor
肖汉光
刘代代
陈凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN202210739725.7A priority Critical patent/CN115294994A/en
Publication of CN115294994A publication Critical patent/CN115294994A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of bird song signal extraction and recognition, in particular to an automatic bird song recognition system in a real environment. The system comprises a preprocessing module for framing, windowing and filtering the bird song audio file; a feature extraction module for extracting spectral features and music features of the bird song audio file; a feature combination module for combining the extracted spectral features and music features into a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set; and a species classification prediction module for determining the bird species from the feature set of the bird song audio file. The invention removes the difficulty of manually extracting features, reduces cost and shortens the recognition period; by combining feature sets, the key information in the bird song signal is extracted more comprehensively; and by constructing the AMResNet deep learning network model, more accurate recognition and classification are achieved and the generalization capability of the model is improved.

Description

Bird sound automatic identification system in real environment
Technical Field
The invention relates to the technical field of bird song signal extraction and recognition, in particular to an automatic bird song recognition system in a real environment.
Background
Birds are one of the representative groups of wild animals and an important component of the ecosystem; their survival and development help maintain the balance and stability of the entire ecosystem. Bird surveys and monitoring provide necessary information on the species, numbers, habits, quality of life and habitat conditions of bird populations, help researchers grasp the current state and dynamic changes of bird resources, and provide a basis for their effective protection, sustainable utilization and scientific management. However, traditional bird survey and monitoring methods suffer from long monitoring periods, limited monitoring ranges and high labor intensity, and cannot meet today's requirements for digitized, automated and intelligent monitoring of bird species.
Bird song is one of the important biological characteristics of birds; it is highly distinctive and is widely used in the classification of bird species. On this basis, realizing bird species surveys through sound recognition with automatic recording equipment and recognition software can overcome the above defects and achieve efficient, non-invasive, low-interference and large-scale monitoring. Research on bird song helps people understand life-activity patterns such as breeding behavior and living habits, and enables automatic counting of individual birds or species, so that birds can be protected more effectively.
Bird song recognition methods can be divided into traditional recognition methods and deep learning methods.
The traditional approach to bird song recognition is pattern-matching classification, most commonly the Dynamic Time Warping (DTW) algorithm. DTW achieves high recognition accuracy, but its matching computation is very expensive, which hurts recognition efficiency. Feature-based classification models were subsequently widely adopted; methods commonly used at home and abroad include Hidden Markov Models (HMM), Gaussian Mixture Models (GMM), Support Vector Machines (SVM), Random Forests (RF), Artificial Neural Networks (ANN), k-Nearest Neighbors (kNN), Bayesian network learning, and mixtures of these models. However, such methods face great difficulty in extracting suitable discriminative features.
With the development of deep learning, deep neural networks can automatically learn highly complex data features, overcoming the traditional problems that features are hard to craft manually, generalization is unsatisfactory and deep features cannot be extracted; notable results have been achieved in recent applications. As early as 1997, McIlraith and Card (McIlraith A L, Card H C. Bird song identification using artificial neural networks and statistical analysis [C]// Canadian Conference on Electrical and Computer Engineering, 1997) applied artificial neural networks combined with statistical analysis to bird song identification. To further improve acoustic recognition accuracy, the Convolutional Neural Network (CNN), which excels in image classification tasks, has become a hot spot in sound classification research. Lasseck (M. Lasseck, Bird species identification in soundscapes, Working Notes of CLEF 2019) used spectrograms as the input of a convolutional neural network to recognize 1500 kinds of bird song, with an average recognition rate of no less than 70%. Deep learning can therefore bring better recognition performance to the bird song recognition problem.
However, when processing bird song in noisy environments, the recognition performance of existing models is still not good enough, so the problem requires deeper study. The following technical problems mainly remain:
(1) Existing traditional song recognition methods are pattern-matching processes: features must be extracted manually, the processing period is long and recognition efficiency is low, making them hard to apply to the large-scale, low-latency monitoring scenarios required for bird statistical analysis.
(2) Most existing recognition methods take a single feature map as input, which limits the recognition performance of the network model. Bird song is a non-stationary signal without clearly meaningful substructures or patterns; a single feature may fail to capture important audio information, making it difficult to avoid recognition errors caused by similar noise.
(3) Existing deep learning methods suffer from poor generalization and limited practicality. Current studies mostly evaluate classification models in a single scene; in real environments full of noise, methods that combine detection and classification of multiple species are rare, and methods that achieve good results are rarer still.
Disclosure of Invention
The present invention aims to provide an automatic bird song recognition system in a real environment to solve the problems of the prior art described in the background.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a bird song automatic identification system in real environment, comprising:
a preprocessing module for framing, windowing and filtering the bird song audio file;
a feature extraction module for extracting spectral features and music features of the bird song audio file;
a feature combination module for combining the extracted spectral features and music features into a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set;
and a species classification prediction module for determining the bird species from the feature set of the bird song audio file.
Further, the species classification prediction module is an AMResNet network, which comprises a convolutional layer (Conv), a batch normalization layer (BN), a rectified linear unit (ReLU), a max pooling layer (MaxPool), four structure blocks (ARBlock), an average pooling layer (AvgPool) and a fully connected layer (FC); in each structure block, an attention layer is connected in series with a residual layer;
the attention layer comprises a channel attention module and a spatial attention module, which weight the channels and spatial positions respectively, focusing the model on the most important information in the time and frequency domains and filtering out irrelevant noise;
the residual layer consists of two residual structures, each comprising a sequentially connected Conv3 × 3-BN-ReLU operation and a skip connection.
Further, the spectral features include: the log mel spectrogram (Log-mel) and mel-frequency cepstral coefficients (MFCC);
the music features include: chroma (Chroma), spectral contrast (Spectral contrast) and tonal centroid (Tonnetz).
Further, in the channel attention module, a MaxPool-MLP sequential connection and an AvgPool-MLP sequential connection are combined by addition;
the channel dimension of the output is then reduced to 1 by one convolution layer with kernel size 3 and stride 1, and sigmoid is used as the activation function to obtain the attention weight, calculated as follows:
A_C(x) = σ(W_mlp(Avg(x)) + W_mlp(Max(x)))
where x and A_C(x) represent the input and output of the channel attention module respectively, Avg() and Max() denote average pooling and max pooling, W_mlp() denotes the multi-layer perceptron, and σ() is the sigmoid function.
Further, in the spatial attention module, the MaxPool and AvgPool outputs are concatenated (Concat) along the channel dimension, aggregating the feature map produced by the channel attention module into an H × W × 2 tensor; the attention weight of the spatial attention module is then obtained through a two-dimensional convolution with kernel size 7 and padding 3 followed by a sigmoid function, calculated as follows:
A_S(x) = σ(f_3([Avg(A_C(x)); Max(A_C(x))]))
where x and A_S(x) represent the input and output of the spatial attention module respectively, and f_3() denotes the convolution operation with kernel size 7 and padding 3.
Further, the output X̂_i of the attention layer is obtained by multiplying the input feature tensor X_i with the outputs of the two attention modules:
X̂_i = X_i ⊗ A_C(X_i) ⊗ A_S(X_i)
where ⊗ denotes element-wise multiplication.
Further, the residual layer is calculated as follows:
y = x + F(x, w)
where x and y represent the input and output of the residual structure respectively, and w denotes the corresponding weights of the input elements;
the output of the residual network formed by the residual layers is calculated as:
x_L = f_ReLU(x_l + Σ_{i=l}^{L-1} F(x_i, w_i))
where l and L denote the indices of the shallower and deeper residual layers, and f_ReLU() denotes the ReLU activation function.
The invention also provides a method for the automatic bird song recognition system in a real environment, comprising the following steps:
S1, framing, windowing and filtering the read bird song audio file, and extracting spectral features and music features;
S2, combining the extracted features to obtain a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set;
S3, feeding the obtained feature set into the AMResNet network as input for low-dimensional to high-dimensional feature learning, and outputting the bird species predicted by the model.
Further, in S2, the combination of the extracted features mainly comprises:
stitching the chroma, spectral contrast and tonal centroid to obtain the extended (CST) features;
aggregating the log mel spectrogram and the extended features to obtain the Log-CST feature set;
aggregating the mel-frequency cepstral coefficients and the extended features to obtain the MFCC-CST feature set;
aggregating the log mel spectrogram, the mel-frequency cepstral coefficients and the extended features to obtain the Log-MFCC-CST feature set;
the Log-CST, MFCC-CST and Log-MFCC-CST feature sets are all combined in a linear manner.
Further, the process by which the AMResNet network is trained in advance on bird audio file features comprises the following steps:
acquiring bird audio through a mobile terminal, and extracting its spectral features and music features;
uploading each bird species together with the spectral features and music features to a PC terminal on which the AMResNet network is installed;
the PC terminal first compares the bird species with the pre-divided species to judge whether it is a new species;
if it is a new species, the spectral features and music features are first received and recorded as new sound data; a pre-trained weight file is then downloaded and loaded, transfer learning is performed on the spectral features, music features and bird species, and training is carried out on the deployed AMResNet network; the newly trained feature data set and pre-trained weight file are uploaded, the bird song data set in the database is updated, and background staff are reminded to monitor the new bird species;
if it is not a new species, only the pre-trained weight file needs to be downloaded; the spectral features and music features are then used as the input of the AMResNet network, the species predicted by the network is output, and the result is displayed; at the same time, the data are transferred to the backend and the count of this bird species in the database is updated, making it convenient for staff to monitor trends in bird species change.
The invention has at least the following beneficial effects:
the identification method provided by the invention eliminates the difficulty of manually extracting the characteristics, reduces the labor cost, shortens the identification period and makes the real-time monitoring of the bird species change possible;
the invention designs and realizes an effective characteristic combination mode, and combines logarithmic Mel frequency spectrum (Logmel), mel cepstrum coefficient (MFCC), chroma (Chroma), spectral contrast (Spectral _ contrast) and hue centroid characteristic (Tonnetz) into a new characteristic set, thereby more comprehensively extracting key information in the singing signal;
the invention builds an AMResNet deep learning network model which is based on a residual error network and is doubly combined with an attention mechanism on a channel and a space; the jump connection in the residual error network relieves the problems of gradient loss and network degradation, so that a deeper network architecture can be constructed, and more accurate identification and classification can be realized; the attention mechanism focuses the model on the most important information in the time domain and the frequency domain by weighting the channel and the space, ignores the noise part in the characteristics and improves the generalization capability of the model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of bird song recognition according to the present invention;
FIG. 2 shows the spectrograms of the five features;
FIG. 3 is a block diagram of a feature set;
FIG. 4 is a schematic diagram of an AMResNet model processing a single channel feature set input;
FIG. 5 is an architectural diagram of an attention layer;
FIG. 6 is a diagram illustrating a residual structure in a residual layer;
FIG. 7 is a graph showing the change of feature maps in a model without attention layers (a) and a model with attention layers (b);
FIG. 8 is a graph of an AMResNet confusion matrix;
FIG. 9 is a ROC plot under ten-fold cross validation.
Fig. 10 is a structural view of an automatic bird song recognition system.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, the overall bird song recognition process of the present invention is divided into three main stages: feature extraction, feature combination, and species classification prediction with AMResNet.
First, the read bird song audio file is framed, windowed and filtered, and the spectral features and music features are extracted, comprising five features: the log mel spectrogram, mel-frequency cepstral coefficients, chroma, spectral contrast and tonal centroid. The five features are then combined in a certain manner to obtain a single-channel feature set. Finally, the feature set is fed into the AMResNet network as input, compared against the audio file features of all birds trained in advance, and the bird species obtained from this comparison is output.
To this end, the system of the invention comprises at least: a preprocessing module for framing, windowing and filtering the bird song audio file;
a feature extraction module for extracting spectral features and music features of the bird song audio file;
a feature combination module for combining the extracted spectral features and music features into a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set;
and a species classification prediction module for determining the bird species from the feature set of the bird song audio file.
Specifically, the detailed technical scheme is as follows:
(1) Feature extraction
The read bird song audio files are stored in MP3 format; to fit the input of the deep learning network model, their features must be extracted computationally. Using the audio processing library Librosa, the window length of the Fast Fourier Transform (FFT) is set to 1024 and the frame offset to 512; the channel counts of the log mel spectrogram and mel-frequency cepstral coefficients are both 40, while those of chroma, spectral contrast and tonal centroid are 12, 7 and 6, respectively. The extracted feature matrices of the log mel spectrogram and mel-frequency cepstral coefficients are therefore both of size 40 × 63, and the chroma, spectral contrast and tonal centroid features are of size 12 × 63, 7 × 63 and 6 × 63, respectively; the spectrograms of the five features are shown in FIG. 2.
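By way of illustration, a minimal Python sketch of this extraction step with the Librosa API might look as follows; the file name is hypothetical and the sample rate is left at Librosa's default, since the description fixes only the FFT window, frame offset and channel counts:

    import librosa

    # Hypothetical input file; the description stores recordings in MP3 format.
    y, sr = librosa.load("birdsong_clip.mp3")  # default sample rate (an assumption)

    n_fft, hop = 1024, 512  # FFT window length and frame offset from the description

    # Spectral features: log mel spectrogram and MFCC, 40 channels each -> (40, T)
    log_mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=40))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_fft=n_fft, hop_length=hop)

    # Music features: chroma (12, T), spectral contrast (7, T), tonal centroid (6, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    tonnetz = librosa.feature.tonnetz(y=y, sr=sr)

Here T is the number of frames; for the clip durations used in this work, T = 63.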
(2) Feature combination
Audio features contain rich information, but different features carry different information; combining the features in a suitable manner maximizes the useful information obtained.
The log mel spectrogram and mel-frequency cepstral coefficients are the most commonly used features in automatic audio recognition. Chroma, spectral contrast and tonal centroid features are the most commonly used features in Music Information Retrieval (MIR) and, after stitching, serve as the extended features (CST). The log mel spectrogram aggregated with the extended features forms the Log-CST feature set; the mel-frequency cepstral coefficients aggregated with the extended features form the MFCC-CST feature set; and the log mel spectrogram, mel-frequency cepstral coefficients and extended features aggregated together form the Log-MFCC-CST feature set. All feature sets are combined in a linear manner; the sizes of Log-CST (FIG. 3(a)), MFCC-CST (FIG. 3(b)) and Log-MFCC-CST (FIG. 3(c)) are 65 × 63, 65 × 63 and 105 × 63, respectively.
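Since the combination is linear, it reduces to channel-wise concatenation of the feature matrices extracted above; a sketch (assuming each matrix has 63 frames) might be:

    import numpy as np

    # Extended CST features: chroma, spectral contrast and tonal centroid stitched together
    cst = np.concatenate([chroma, contrast, tonnetz], axis=0)        # (25, 63)

    log_cst = np.concatenate([log_mel, cst], axis=0)                 # (65, 63)
    mfcc_cst = np.concatenate([mfcc, cst], axis=0)                   # (65, 63)
    log_mfcc_cst = np.concatenate([log_mel, mfcc, cst], axis=0)      # (105, 63)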
(3) Deep learning network AMResNet
AMResNet is designed to solve the species classification problem for bird song; it is a residual network from the vision domain combined with an attention mechanism. The structure of AMResNet is shown in FIG. 4: the main branch comprises a convolutional layer, a batch normalization layer, a max pooling layer, four structure blocks, an average pooling layer and a fully connected layer. In each structure block, an attention layer containing a channel attention module and a spatial attention module is connected in series with a residual layer containing two skip connections. N denotes the number of channels in the four blocks, with values 64, 128, 256 and 512, respectively. The input data is a single-channel feature map of the feature set; its size is halved after the first 7 × 7 convolutional layer and the 2 × 2 max pooling layer. The feature maps produced by the attention layer and the residual layer have the same size as their input, but the number of channels changes. Since average pooling is performed after the fourth block, the feature maps to be flattened are all of size 1 × 1; they are passed to a fully connected layer with 1024 hidden units, and tensors of the corresponding size are finally output according to the number of categories in the bird song data set. (A schematic implementation is sketched at the end of Section 3.2.)
Referring to fig. 10, the intelligent recognition process of the automatic bird song recognition system in a real environment specifically comprises:
acquiring bird audio through a mobile terminal, and extracting its spectral features and music features;
uploading the bird species together with the spectral features and music features to a PC terminal on which the AMResNet network is installed;
the PC terminal first compares the bird species with the pre-divided species to judge whether it is a new species;
if it is a new species, the spectral features and music features are first received and recorded as new sound data; a pre-trained weight file is then downloaded and loaded, transfer learning is performed on the spectral features, music features and bird species, and training is carried out on the deployed AMResNet network; the newly trained feature data set and pre-trained weight file are uploaded, the bird song data set in the database is updated, and background staff are reminded to monitor the new bird species;
if it is not a new species, only the pre-trained weight file needs to be downloaded; the spectral features and music features are then used as the input of the AMResNet network, the species predicted by the network is output, and the result is displayed; at the same time, the data are transferred to the backend and the count of this bird species in the database is updated, making it convenient for staff to monitor trends in bird species change.
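The branching logic of this workflow can be summarized by the schematic Python sketch below; all names are hypothetical, and the transfer learning, weight upload and staff notification steps are elided:

    from typing import Callable, Dict, Set

    def handle_recording(label: str, features, known_species: Set[str],
                         counts: Dict[str, int], predict: Callable) -> str:
        # Schematic routing of one labeled recording, per the workflow above.
        if label not in known_species:
            # New species: record its sounds as new data; transfer-learn AMResNet,
            # upload the updated weights and remind staff (all elided here).
            known_species.add(label)
            counts[label] = 1
            return label
        # Known species: run AMResNet inference with pre-trained weights,
        # then update the species count in the database.
        species = predict(features)
        counts[species] = counts.get(species, 0) + 1
        return species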
3.1 Attention layer
Each attention layer consists of a channel attention module (FIG. 5, left box) and a spatial attention module (FIG. 5, right box), which weight the channels and spatial positions respectively, focusing the model on the most important information in the time and frequency domains and thereby filtering out irrelevant noise.
In the channel attention module, a MaxPool-MLP sequential connection and an AvgPool-MLP sequential connection are combined by addition. The channel dimension of the output is then reduced to 1 by one convolution layer with kernel size 3 and stride 1, and sigmoid is used as the activation function to obtain the attention weight. The calculation is as follows:
A_C(x) = σ(W_mlp(Avg(x)) + W_mlp(Max(x)))
where x and A_C(x) represent the input and output of the channel attention module respectively, Avg() and Max() denote average pooling and max pooling, W_mlp() denotes the multi-layer perceptron, and σ() is the sigmoid function.
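A minimal PyTorch sketch of this channel attention module is given below; the shared two-layer MLP and the reduction ratio of 16 are assumptions in the style of CBAM, as the description does not fix them:

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        # A_C(x) = sigmoid(W_mlp(Avg(x)) + W_mlp(Max(x)))
        def __init__(self, channels: int, reduction: int = 16):  # reduction is assumed
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
            b, c, _, _ = x.shape
            avg = self.mlp(x.mean(dim=(2, 3)))  # AvgPool -> MLP
            mx = self.mlp(x.amax(dim=(2, 3)))   # MaxPool -> MLP
            return torch.sigmoid(avg + mx).view(b, c, 1, 1)  # per-channel weights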
In the spatial attention module, the MaxPool and AvgPool outputs are concatenated (Concat) along the channel dimension, aggregating the feature map produced by the channel attention module into an H × W × 2 tensor; the attention weight of the spatial attention module is then obtained through a two-dimensional convolution with kernel size 7 and padding 3 followed by a sigmoid function. The calculation is as follows:
A_S(x) = σ(f_3([Avg(A_C(x)); Max(A_C(x))]))
where x and A_S(x) represent the input and output of the spatial attention module respectively, and f_3() denotes the convolution operation with kernel size 7 and padding 3.
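A matching sketch of the spatial attention module; the kernel size of 7 and padding of 3 follow the description, while the bias-free convolution is an assumption:

    import torch
    import torch.nn as nn

    class SpatialAttention(nn.Module):
        # A_S(x) = sigmoid(f_3([Avg(x); Max(x)])), kernel size 7, padding 3
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
            avg = x.mean(dim=1, keepdim=True)  # channel-wise AvgPool -> (B, 1, H, W)
            mx = x.amax(dim=1, keepdim=True)   # channel-wise MaxPool -> (B, 1, H, W)
            return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))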
Finally, the output X̂_i of the entire attention layer is obtained by multiplying the input feature tensor X_i with the outputs of the two attention modules:
X̂_i = X_i ⊗ A_C(X_i) ⊗ A_S(X_i)
where ⊗ denotes element-wise multiplication.
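Reusing the two module sketches above, the attention layer can then be composed by applying the two weightings in sequence, matching the element-wise products in the formula:

    import torch.nn as nn

    class AttentionLayer(nn.Module):
        # X_hat = X (*) A_C(X) (*) A_S(.), applied sequentially
        def __init__(self, channels: int):
            super().__init__()
            self.ca = ChannelAttention(channels)  # sketch above
            self.sa = SpatialAttention()          # sketch above

        def forward(self, x):
            x = x * self.ca(x)  # weight the channels
            x = x * self.sa(x)  # weight the spatial positions of the channel-weighted map
            return x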
3.2 Residual layer
Each residual layer consists of two residual structures (FIG. 6), each composed of a sequentially connected Conv3 × 3-BN-ReLU operation and a skip connection. Compared with an ordinary deep neural network, the deep layers in a residual network no longer have to fit an identity mapping directly; they are instead designed as a fitting operation on the residual: whenever the residual function F() equals 0, the layer reduces to an identity transformation, so fitting the residual is simple and easy to realize. The calculation is as follows:
y=x+F(x,w)
where x and y represent the input and output of the residual structure respectively, and w denotes the corresponding weights of the input elements.
Finally, the residual network outputs the combined result of all deep structures, calculated as follows:
x_L = f_ReLU(x_l + Σ_{i=l}^{L-1} F(x_i, w_i))
where l and L denote the indices of the shallower and deeper residual layers, and f_ReLU() denotes the ReLU activation function.
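Putting the pieces together, a schematic AMResNet built from the attention layer sketched in Section 3.1 might look as follows; the 1 × 1 projection between stages and the stride choices are assumptions, since FIG. 4 is not reproduced here:

    import torch
    import torch.nn as nn

    class ResidualUnit(nn.Module):
        # One residual structure: Conv3x3-BN-ReLU plus a skip connection, y = x + F(x, w)
        def __init__(self, channels: int):
            super().__init__()
            self.f = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):
            return x + self.f(x)

    class ARBlock(nn.Module):
        # Structure block: attention layer in series with two residual structures
        def __init__(self, in_ch: int, out_ch: int):
            super().__init__()
            # 1x1 projection when the channel count changes (an assumption)
            self.proj = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
            self.attn = AttentionLayer(out_ch)  # sketch in Section 3.1
            self.res = nn.Sequential(ResidualUnit(out_ch), ResidualUnit(out_ch))

        def forward(self, x):
            return self.res(self.attn(self.proj(x)))

    class AMResNet(nn.Module):
        # Main branch: Conv7x7-BN-ReLU-MaxPool, four ARBlocks (64/128/256/512), AvgPool, FC
        def __init__(self, num_classes: int = 19):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(1, 64, 7, stride=1, padding=3, bias=False),  # single-channel input
                nn.BatchNorm2d(64),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),  # halves the feature map, as in the description
            )
            self.blocks = nn.Sequential(
                ARBlock(64, 64), ARBlock(64, 128), ARBlock(128, 256), ARBlock(256, 512))
            self.head = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(512, 1024), nn.ReLU(inplace=True),  # FC with 1024 hidden units
                nn.Linear(1024, num_classes))

        def forward(self, x):  # x: (B, 1, H, W), e.g. a 65 x 63 Log-CST map
            return self.head(self.blocks(self.stem(x)))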
(4) Statistical analysis
In this study, the evaluation indices used to test the performance of the proposed method are: accuracy (Accuracy), precision (Precision), recall (Recall) and F1 score (F1-score). Accuracy is the gold-standard index for classification models and is suitable for both binary and multi-class tasks. For a classification model f() evaluated on a test set D of n samples, the accuracy is calculated as follows:
acc(f; D) = (1/n) Σ_{i=1}^{n} I(f(x_i) = y_i)
where I() is the indicator function and (x_i, y_i) are the test samples and their true labels.
Precision represents the proportion of correctly classified bird song samples within the predicted label set, calculated as:
Precision = TP / (TP + FP)
Recall represents the proportion of correctly classified bird song samples within the true label set, calculated as:
Recall = TP / (TP + FN)
The F1 score is the harmonic mean of precision and recall, calculated as:
F1 = (2 × Precision × Recall) / (Precision + Recall)
to define these indices, values for True Positive (TP), true Negative (TN), false Positive (FP) and False Negative (FN) were also used in this study.
(5) Experimental verification
The bird song data set provided by the Beijing Academy of Artificial Intelligence (BAAI), composed of 12651 song recordings made in real environments, was used as the experimental subject; its 19 species are gray goose (AA), great swan (CC), green duck (AP), green wing duck (ACr), western quail (CQ), pheasant (PCo), red throat submerged bird (GS), cocket (ACi), common whorlled plotter (PCa), eagle (AG), eurasian (BB), western rice-stem chicken (WC), boney chicken (FA), black wing long-foot snipe (HH), phoenix-head wheat chicken (VV), white waist snipe (TC), snipe (TT), snipe (TG) and sparrow (Pa). In the experiments, the data set was divided into a training set (8863 song recordings) and a test set (3788 song recordings), and the model was trained with ten-fold cross-validation. We tested the results of different feature-set combinations (Table 1), different numbers of attention layers (Table 2) and the presence or absence of attention layers (FIG. 7), and verified the performance of AMResNet using the confusion matrix (FIG. 8), the ROC curve (FIG. 9), and precision, recall and F1 score (Table 3). In addition, we compared the recognition accuracy of AMResNet with seven other common classification models (Table 4): the Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), a three-layer cascaded Artificial Neural Network (ANN), ResNet-18, ResNet-34, ResNet-50 and the Vision Transformer (ViT).
The experimental results show that the combined Log-CST features used in this study are the most effective (Table 1), and that the four attention layers used in AMResNet work best (Table 2). Compared with the model without attention layers (FIG. 7(a)), the model with attention layers (FIG. 7(b)) effectively removes the noise parts of the feature maps (yellow box in FIG. 7(a)) and highlights their relevant parts (red box in FIG. 7(b)). On the ten-fold cross-validated ROC curves (FIG. 9), AMResNet achieved good mean AUC values, and it also achieved good classification in the confusion matrix (FIG. 8) and good recognition of each species (Table 3). Finally, in the comparison of different models (Table 4), the AMResNet model, combining the advantages of the residual network and the attention mechanism, not only deepens the network while reducing the amount of computation but also assigns higher weight to the important information of the input data, thereby obtaining the best recognition performance.
TABLE 1 AMResNet identification accuracy comparison based on feature sets of different combinations
[Table 1 is presented as an image in the original publication.]
TABLE 2 comparison of model identification accuracy for different numbers of attention layers
[Table 2 is presented as an image in the original publication.]
TABLE 3 Recognition performance of AMResNet on each species
[Table 3 is presented as an image in the original publication.]
TABLE 4 comparison of accuracy rates of chirp identification for different models
[Table 4 is presented as an image in the original publication.]
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (10)

1. An automatic bird song recognition system in a real environment, comprising:
the preprocessing module is used for framing, windowing and filtering the bird song audio file;
the feature extraction module is used for extracting the spectral features and music features of the bird song audio file;
the feature combination module is used for combining the extracted spectral features and music features into a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set;
and the species classification prediction module is used for determining the bird species from the feature set of the bird song audio file.
2. The automatic bird song recognition system in a real environment according to claim 1, wherein the species classification prediction module is an AMResNet network comprising a convolutional layer (Conv), a batch normalization layer (BN), a rectified linear unit (ReLU), a max pooling layer (MaxPool), four structure blocks (ARBlock), an average pooling layer (AvgPool) and a fully connected layer (FC); in each structure block, an attention layer is connected in series with a residual layer;
the attention layer comprises a channel attention module and a spatial attention module, which weight the channels and spatial positions respectively, focusing the model on the most important information in the time and frequency domains and filtering out irrelevant noise;
the residual layer consists of two residual structures, each comprising a sequentially connected Conv3 × 3-BN-ReLU operation and a skip connection.
3. The system of claim 1, wherein the spectral features comprise: the log mel spectrogram (Log-mel) and mel-frequency cepstral coefficients (MFCC);
the music features comprise: chroma (Chroma), spectral contrast (Spectral contrast) and tonal centroid (Tonnetz).
4. The automatic bird song recognition system of claim 2, wherein in the channel attention module, the MaxPool-MLP sequential connection operation and the AvgPool-MLP sequential connection operation are combined by addition;
the channel dimension of the output is then reduced to 1 by one convolution layer with kernel size 3 and stride 1, and sigmoid is used as the activation function to obtain the attention weight, calculated as follows:
A_C(x) = σ(W_mlp(Avg(x)) + W_mlp(Max(x)))
where x and A_C(x) represent the input and output of the channel attention module respectively, Avg() and Max() denote average pooling and max pooling, W_mlp() denotes the multi-layer perceptron, and σ() is the sigmoid function.
5. The system according to claim 4, wherein in the spatial attention module, the MaxPool and AvgPool outputs are concatenated (Concat) along the channel dimension, aggregating the feature map produced by the channel attention module into an H × W × 2 tensor, and the attention weight of the spatial attention module is obtained through a two-dimensional convolution with kernel size 7 and padding 3 followed by a sigmoid function, calculated as follows:
A_S(x) = σ(f_3([Avg(A_C(x)); Max(A_C(x))]))
where x and A_S(x) represent the input and output of the spatial attention module respectively, and f_3() denotes the convolution operation with kernel size 7 and padding 3.
6. The system of claim 5, wherein the output X̂_i of the attention layer is obtained by multiplying the input feature tensor X_i with the outputs of the two attention modules:
X̂_i = X_i ⊗ A_C(X_i) ⊗ A_S(X_i)
where ⊗ denotes element-wise multiplication.
7. The system according to claim 2, wherein the residual layer is calculated as follows:
y = x + F(x, w)
where x and y represent the input and output of the residual structure respectively, and w denotes the corresponding weights of the input elements;
the output of the residual network formed by the residual layers is calculated as:
x_L = f_ReLU(x_l + Σ_{i=l}^{L-1} F(x_i, w_i))
where l and L denote the indices of the shallower and deeper residual layers, and f_ReLU() denotes the ReLU activation function.
8. A method for an automatic bird song recognition system in a real environment, comprising the steps of:
S1, framing, windowing and filtering the read bird song audio file, and extracting spectral features and music features;
S2, combining the extracted features to obtain a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set;
S3, feeding the obtained feature set into the AMResNet network as input for low-dimensional to high-dimensional feature learning, and outputting the bird species predicted by the model.
9. The method according to claim 8, wherein the step S2 of combining the extracted features mainly includes:
stitching the chroma, spectral contrast and tonal centroid to obtain the extended features;
aggregating the log mel spectrogram and the extended features to obtain the Log-CST feature set;
aggregating the mel-frequency cepstral coefficients and the extended features to obtain the MFCC-CST feature set;
aggregating the log mel spectrogram, the mel-frequency cepstral coefficients and the extended features to obtain the Log-MFCC-CST feature set;
the Log-CST, MFCC-CST and Log-MFCC-CST feature sets are all combined in a linear manner.
10. The method of claim 8, wherein the process by which the AMResNet network is trained in advance on the audio file features of each bird comprises:
acquiring bird audio through a mobile terminal, and extracting its spectral features and music features;
uploading each bird species together with the spectral features and music features to a PC terminal on which the AMResNet network is installed;
the PC terminal first compares the bird species with the pre-divided species to judge whether it is a new species;
if it is a new species, the spectral features and music features are first received and recorded as new sound data; a pre-trained weight file is then downloaded and loaded, transfer learning is performed on the spectral features, music features and bird species, and training is carried out on the deployed AMResNet network; the newly trained feature data set and pre-trained weight file are uploaded, the bird song data set in the database is updated, and background staff are reminded to monitor the new bird species;
if it is not a new species, only the pre-trained weight file needs to be downloaded; the spectral features and music features are then used as the input of the AMResNet network, the species predicted by the network is output, and the result is displayed; at the same time, the data are transferred to the backend and the count of this bird species in the database is updated, making it convenient for staff to monitor trends in bird species change.
CN202210739725.7A 2022-06-28 2022-06-28 Bird sound automatic identification system in real environment Pending CN115294994A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210739725.7A CN115294994A (en) 2022-06-28 2022-06-28 Bird sound automatic identification system in real environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210739725.7A CN115294994A (en) 2022-06-28 2022-06-28 Bird sound automatic identification system in real environment

Publications (1)

Publication Number Publication Date
CN115294994A true CN115294994A (en) 2022-11-04

Family

ID=83819691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210739725.7A Pending CN115294994A (en) 2022-06-28 2022-06-28 Bird sound automatic identification system in real environment

Country Status (1)

Country Link
CN (1) CN115294994A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114974268A (en) * 2022-06-08 2022-08-30 江苏麦克马尼生态科技有限公司 Bird song recognition monitoring system and method based on Internet of things
CN114974268B (en) * 2022-06-08 2023-09-05 江苏麦克马尼生态科技有限公司 Bird song recognition monitoring system and method based on Internet of things
CN116206612A (en) * 2023-03-02 2023-06-02 中国科学院半导体研究所 Bird voice recognition method, model training method, device and electronic equipment
CN116559778A (en) * 2023-07-11 2023-08-08 海纳科德(湖北)科技有限公司 Vehicle whistle positioning method and system based on deep learning
CN116559778B (en) * 2023-07-11 2023-09-29 海纳科德(湖北)科技有限公司 Vehicle whistle positioning method and system based on deep learning
CN117095694A (en) * 2023-10-18 2023-11-21 中国科学技术大学 Bird song recognition method based on tag hierarchical structure attribute relationship
CN117095694B (en) * 2023-10-18 2024-02-23 中国科学技术大学 Bird song recognition method based on tag hierarchical structure attribute relationship
CN118098270A (en) * 2024-04-24 2024-05-28 安徽大学 Noise tracing method based on feature extraction and feature fusion

Similar Documents

Publication Publication Date Title
CN115294994A (en) Bird sound automatic identification system in real environment
Priyadarshani et al. Automated birdsong recognition in complex acoustic environments: a review
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN111048114A (en) Equipment and method for detecting abnormal sound of equipment
CN110211594B (en) Speaker identification method based on twin network model and KNN algorithm
CN108520753A (en) Voice lie detection method based on the two-way length of convolution memory network in short-term
WO2018166316A1 (en) Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures
CN108806694A (en) A kind of teaching Work attendance method based on voice recognition
Himawan et al. 3d convolution recurrent neural networks for bird sound detection
Ting Yuan et al. Frog sound identification system for frog species recognition
CN115410711B (en) White feather broiler health monitoring method based on sound signal characteristics and random forest
CN107193378A (en) Emotion decision maker and method based on brain wave machine learning
CN111048097A (en) Twin network voiceprint recognition method based on 3D convolution
CN111986699A (en) Sound event detection method based on full convolution network
CN116842460A (en) Cough-related disease identification method and system based on attention mechanism and residual neural network
CN112200238A (en) Hard rock tension-shear fracture identification method and device based on sound characteristics
CN115578678A (en) Fish feeding intensity classification method and system
Xiao et al. AMResNet: An automatic recognition model of bird sounds in real environment
CN114863905A (en) Voice category acquisition method and device, electronic equipment and storage medium
CN112329819A (en) Underwater target identification method based on multi-network fusion
Chinmayi et al. Emotion Classification Using Deep Learning
CN112052880A (en) Underwater sound target identification method based on weight updating support vector machine
Wang et al. A hierarchical birdsong feature extraction architecture combining static and dynamic modeling
CN113936667A (en) Bird song recognition model training method, recognition method and storage medium
CN115170942A (en) Fish behavior identification method with multilevel fusion of sound and vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination