CN115294994A - Bird sound automatic identification system in real environment - Google Patents
- Publication number
- CN115294994A (application number CN202210739725.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L17/00 — Speaker identification or verification techniques
- G10L17/02 — Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/06 — Decision making techniques; pattern matching strategies
- G10L17/26 — Recognition of special voice characteristics, e.g. for use in lie detectors; recognition of animal voices
Abstract
The invention relates to the technical field of birdsong signal extraction and recognition, and in particular to an automatic bird song recognition system for real environments. The system comprises a preprocessing module for framing, windowing and filtering the bird song audio file; a feature extraction module for extracting the spectral features and music score features of the bird song audio file; a feature combination module for combining the extracted spectral features and music score features into a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set; and a species classification prediction module for determining the bird species from the obtained feature set of the bird song audio file. The invention removes the need for manual feature extraction, reducing cost and shortening the processing cycle; by combining feature sets, the key information in the song signal is captured more comprehensively; and by constructing the AMResNet deep learning network model, more accurate recognition and classification are achieved and the generalization ability of the model is improved.
Description
Technical Field
The invention relates to the technical field of birdsong signal extraction and recognition, and in particular to an automatic bird song recognition system for real environments.
Background
Birds are one of the representative groups of wild animals and an important component of the ecosystem; their survival and development help maintain the balance and stability of the whole ecosystem. Bird surveys and monitoring provide essential information about bird populations, such as species, numbers, life habits, quality of life and habitat conditions; they help researchers grasp the current state of bird resources and their dynamic changes, and provide a basis for the effective protection, sustainable use and scientific management of bird resources. However, traditional bird survey and monitoring methods suffer from long monitoring cycles, limited monitoring range and high labor intensity, and can no longer meet today's demands for digitized, automated and intelligent monitoring of bird species.
Bird song is one of the important biological characteristics of birds; it is highly distinctive and is widely applied in the classification of bird species. On this basis, bird surveys by acoustic recognition, using automatic recording equipment and recognition software, can overcome the above defects and enable efficient, non-invasive, low-interference and large-scale monitoring. Research on bird song helps people understand patterns of bird life activity such as breeding behavior and life habits, and enables automatic counting of individual birds or species, so that birds can be protected more effectively.
Bird song recognition methods can be divided into traditional recognition methods and deep learning methods.
The traditional approach to bird song recognition is classification by pattern matching, most commonly the Dynamic Time Warping (DTW) algorithm. This algorithm achieves high recognition accuracy, but its matching computation is excessive, which hurts recognition efficiency. Subsequently, feature-based classification models came into wide use; methods commonly used at home and abroad include Hidden Markov Models (HMM), Gaussian Mixture Models (GMM), Support Vector Machines (SVM), Random Forests (RF), Artificial Neural Networks (ANN), k-Nearest Neighbors (kNN), Bayesian network learning, and mixtures of these models. However, such methods face great difficulty in extracting suitable discriminative features.
With the development of deep learning, deep neural networks can automatically learn highly complex data features, thereby overcoming the traditional problems that features are hard to craft manually, generalization is unsatisfactory, and deep features cannot be extracted; in recent years they have achieved remarkable results in this application. In 1997, McIlraith and Card (McIlraith A. L., Card H. C. Bird song identification using artificial neural networks and statistical analysis [C] // Proceedings of the Conference on Electrical and Computer Engineering, 1997) applied artificial neural networks and statistical analysis to bird song identification. To further improve the accuracy of acoustic recognition, Convolutional Neural Networks (CNN), which excel in image classification tasks, have become a hot topic in sound classification research. Lasseck (M. Lasseck, Bird species identification in soundscapes, Working Notes of CLEF 2019) achieved recognition of about 1500 bird species in 2019 by feeding spectrograms into a convolutional neural network, with an average recognition rate of no less than 70%. Deep learning can therefore bring better recognition performance to the bird song recognition problem.
However, when processing bird song in noisy environments, the recognition performance of existing models is still insufficient, so the problem requires deeper study. The following technical problems remain:
(1) Existing traditional song recognition methods are pattern matching processes: features must be extracted manually, the processing cycle is long, and recognition efficiency is low, making these methods hard to apply to the large-scale, low-latency monitoring scenarios required for bird statistical analysis.
(2) Most existing recognition methods take a single feature map as input, which limits the recognition performance of the network model. Bird song is a non-stationary signal without fixed substructures or patterns, and a single feature may fail to capture the important audio information, making it difficult to avoid recognition errors caused by similar-sounding noise.
(3) Existing deep learning methods suffer from poor generalization and limited practicality. Current studies evaluate classification models only in a single scene; in real, noise-filled environments, methods that combine detection and classification of multiple species are rare, and methods that achieve good results are rarer still.
Disclosure of Invention
The present invention aims to provide an automatic bird song recognition system for real environments, so as to solve the problems of the prior art described in the background section.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides an automatic bird song recognition system for real environments, comprising:
a preprocessing module for framing, windowing and filtering the bird song audio file;
a feature extraction module for extracting the spectral features and music score features of the bird song audio file;
a feature combination module for combining the extracted spectral features and music score features into a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set;
and a species classification prediction module for determining the bird species from the obtained feature set of the bird song audio file.
Further, the species classification prediction module is an AMResNet network, which comprises a convolutional layer (Conv), a batch normalization layer (BN), a rectified linear unit (ReLU), a max pooling layer (MaxPool), 4 structure blocks (ARBlock), an average pooling layer (AvgPool) and a fully connected layer (FC); in each structure block, an attention layer is connected in series with a residual layer;
the attention layer comprises a channel attention module and a spatial attention module, which apply weights to the channel and spatial dimensions respectively, concentrating the model on the most important information in the time and frequency domains and filtering out irrelevant noise;
the residual layer consists of two residual structures, each comprising a sequentially connected Conv3×3-BN-ReLU operation and a skip connection.
Further, the spectral features include: the log Mel spectrum (Log-mel) and Mel-frequency cepstral coefficients (MFCC);
the music score features include: chroma (Chroma), spectral contrast (Spectral contrast) and tonal centroid (Tonnetz) features.
Further, in the channel attention module, a MaxPool-MLP sequential connection and an AvgPool-MLP sequential connection are combined by addition;
the channel dimension of the output data is then reduced to 1 by a convolutional layer with kernel size 3 and stride 1, and the attention weights are obtained using sigmoid as the activation function, calculated as follows:
A_C(x) = σ(W_mlp(Avg(x)) + W_mlp(Max(x)))
where x and A_C(x) denote the input and output of the channel attention module respectively, Avg(·) and Max(·) denote average pooling and max pooling, W_mlp(·) denotes the multi-layer perceptron, and σ(·) is the sigmoid function.
Further, in the spatial attention module, the MaxPool and AvgPool outputs are concatenated (Concat) along the channel dimension, aggregating the feature map produced by the channel attention module into an H × W × 2 tensor; the attention weights of the spatial attention module are then obtained through a two-dimensional convolution with kernel size 7 and padding 3 followed by a sigmoid function, calculated as follows:
A_S(x) = σ(f_3([Avg(A_C(x)); Max(A_C(x))]))
where x and A_S(x) denote the input and output of the spatial attention module respectively, and f_3(·) denotes the convolution with kernel size 7 and padding 3.
Further, the output Y_i of the attention layer is obtained by multiplying the input feature tensor X_i by the outputs of the two attention modules:
Y_i = X_i ⊗ A_C(X_i) ⊗ A_S(X_i)
where ⊗ denotes element-wise multiplication with broadcasting.
further, the calculation process of the residual layer is as follows:
y=x+F(x,w)
wherein x and y represent the input and output of the residual structure, respectively, and w is the corresponding weight of the input element;
the output calculation of the residual network formed by the residual layers comprises the following steps:
where L and L denote the number of residual layers, f ReLU () Indicating the ReLU activation function.
The invention also provides a method for the automatic bird song recognition system in a real environment, comprising the following steps:
s1, performing framing, windowing and filter processing on a read bird song audio file, and extracting frequency spectrum characteristics and music score characteristics;
s2, combining the extracted features to obtain a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set;
and S3, the obtained feature set is used as input and sent into an AMResNet network for low-dimensional to high-dimensional feature learning, and bird species predicted by the model are output.
Further, in S2, the feature combination of the extracted features mainly comprises:
concatenating the chroma, spectral contrast and tonal centroid features to obtain the extended features (CST);
aggregating the log Mel spectrum with the extended features to obtain the Log-CST feature set;
aggregating the Mel-frequency cepstral coefficients with the extended features to obtain the MFCC-CST feature set;
aggregating the log Mel spectrum, the Mel-frequency cepstral coefficients and the extended features to obtain the Log-MFCC-CST feature set;
the Log-CST, MFCC-CST and Log-MFCC-CST feature sets are all combined in a linear manner.
Further, the process by which the AMResNet network is trained in advance on bird audio file features comprises:
acquiring bird audio through a mobile terminal and extracting its spectral features and music score features;
loading each bird species label, together with the spectral and music score features, onto a PC equipped with the AMResNet network;
the PC first compares the bird species against the previously catalogued species to determine whether it is a new species;
if it is a new species, the spectral and music score features are first received and recorded as new sound data; a pre-trained weight file is then downloaded and loaded, transfer learning is performed on the spectral features, music score features and bird species, and training is carried out on the deployed AMResNet network; the newly trained feature data set and pre-trained weight file are uploaded, the bird song data set in the database is updated, and background staff are alerted to monitor the new bird species;
if it is not a new species, only the pre-trained weight file needs to be downloaded; the spectral and music score features are then used as input to the AMResNet network, the species predicted by the network is output, and the result is displayed; at the same time, the data are sent to the backend and the bird counts in the database are updated, making it convenient for staff to monitor trends in bird species.
The invention has at least the following beneficial effects:
the proposed recognition method removes the need for manual feature extraction, reduces labor cost, shortens the recognition cycle, and makes real-time monitoring of changes in bird species possible;
the invention designs and implements an effective feature combination scheme, combining the log Mel spectrum (Log-mel), Mel-frequency cepstral coefficients (MFCC), chroma (Chroma), spectral contrast (Spectral contrast) and tonal centroid (Tonnetz) features into new feature sets, thereby capturing the key information in the song signal more comprehensively;
the invention builds the AMResNet deep learning network model, a residual network doubly combined with channel and spatial attention mechanisms; the skip connections in the residual network alleviate vanishing gradients and network degradation, allowing a deeper network architecture to be built and more accurate recognition and classification to be achieved; the attention mechanism, by weighting channels and spatial positions, focuses the model on the most important information in the time and frequency domains, suppresses the noise component of the features, and improves the generalization ability of the model.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of bird song recognition according to the present invention;
FIG. 2 is a spectral plot of a feature;
FIG. 3 is a block diagram of a feature set;
FIG. 4 is a schematic diagram of an AMResNet model processing a single channel feature set input;
FIG. 5 is an architectural diagram of an attention layer;
FIG. 6 is a diagram illustrating a residual structure in a residual layer;
FIG. 7 shows the change in feature maps in a model without the attention layer (a) and a model with the attention layer (b);
FIG. 8 is a graph of an AMResNet confusion matrix;
FIG. 9 is a ROC plot under ten-fold cross-validation;
FIG. 10 is a structural diagram of the automatic bird song recognition system.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, the overall bird song recognition process of the present invention is divided into three main stages: feature extraction, feature combination, and species classification prediction with AMResNet.
First, the read-in bird song audio file is framed, windowed and filtered, and the spectral features and music score features are extracted: five features in total, namely the log Mel spectrum, Mel-frequency cepstral coefficients, chroma, spectral contrast and tonal centroid. These five features are then combined in a certain manner to obtain a single-channel feature set. Finally, the feature set is fed into the AMResNet network, which compares it against the audio file features of all birds trained in advance and outputs the bird species obtained from the comparison.
To this end, the system of the invention comprises at least: the preprocessing module is used for performing framing, windowing and filter processing on the bird song audio file;
the characteristic extraction module is used for extracting the frequency spectrum characteristic and the music score characteristic of the bird song audio file;
the characteristic combination module is used for combining the extracted frequency spectrum characteristic and the extracted music score characteristic to obtain a Log-CST characteristic set, a MFCC-CST characteristic set and a Log-MFCC-CST characteristic set;
and the species classification prediction module is used for obtaining bird species according to the feature set of the obtained bird song audio file.
Specifically, the detailed technical scheme is as follows:
(1) Feature extraction
The read-in bird song audio files are stored in MP3 format; to suit the input of a deep learning network model, their features must be extracted computationally. Using the audio processing library Librosa, the window length of the Fast Fourier Transform (FFT) is set to 1024 and the frame shift to 512; the number of channels is 40 for both the log Mel spectrum and the Mel-frequency cepstral coefficients, and 12, 7 and 6 for chroma, spectral contrast and tonal centroid respectively. The extracted feature matrices of the log Mel spectrum and the Mel-frequency cepstral coefficients are therefore both of size 40 × 63, while the chroma, spectral contrast and tonal centroid features are of size 12 × 63, 7 × 63 and 6 × 63 respectively; the spectrograms of the five features are shown in fig. 2.
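As an illustration of the framing and windowing step, the following NumPy sketch (a hypothetical stand-in for the Librosa calls, assuming a Hann window and no center padding) splits a signal into overlapping frames using the window length of 1024 and frame shift of 512 quoted above; with roughly 1.5 s of audio this yields the 63 frames seen in the feature matrix sizes:

```python
import numpy as np

def frame_signal(y, n_fft=1024, hop=512):
    """Split a 1-D signal into overlapping, Hann-windowed frames
    (n_fft and hop match the STFT settings quoted in the text)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop      # no center padding assumed
    return np.stack([y[i * hop : i * hop + n_fft] * window
                     for i in range(n_frames)])

# ~1.5 s of synthetic audio at an assumed 22050 Hz sample rate
y = np.sin(2 * np.pi * 3000 * np.arange(32768) / 22050)
frames = frame_signal(y)
print(frames.shape)  # (63, 1024)
```

Each row of `frames` would then feed the FFT that underlies all five spectral and music score features.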
(2) Feature combination
Audio features contain rich information, but different features carry different information; combining the features in a certain manner maximizes the useful information obtained.
The log Mel spectrum and Mel-frequency cepstral coefficients are the most commonly used features in automatic audio recognition. Chroma, spectral contrast and tonal centroid features are the most commonly used features in Music Information Retrieval (MIR) and, after concatenation, serve as the extended features (CST). The log Mel spectrum aggregated with the extended features forms the Log-CST feature set; the Mel-frequency cepstral coefficients aggregated with the extended features form the MFCC-CST feature set; and the log Mel spectrum, Mel-frequency cepstral coefficients and extended features together form the Log-MFCC-CST feature set. All feature sets are combined linearly, with Log-CST (FIG. 3(a)), MFCC-CST (FIG. 3(b)) and Log-MFCC-CST (FIG. 3(c)) of sizes 65 × 63, 65 × 63 and 105 × 63 respectively.
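The linear combination described above amounts to stacking the feature matrices along the channel (row) axis. A minimal NumPy sketch with random stand-in matrices (not real audio features) reproduces the stated sizes:

```python
import numpy as np

# Shapes as reported in the description: 40 x 63 spectral features,
# 12/7/6 x 63 music score features (random stand-ins for illustration).
log_mel = np.random.rand(40, 63)
mfcc = np.random.rand(40, 63)
chroma, contrast, tonnetz = (np.random.rand(c, 63) for c in (12, 7, 6))

cst = np.concatenate([chroma, contrast, tonnetz])        # 25 x 63 extended features
log_cst = np.concatenate([log_mel, cst])                 # 65 x 63
mfcc_cst = np.concatenate([mfcc, cst])                   # 65 x 63
log_mfcc_cst = np.concatenate([log_mel, mfcc, cst])      # 105 x 63
```

The 63-frame time axis is shared by all five features, which is what makes this row-wise stacking well defined.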
(3) Deep learning network AMResNet
AMResNet is designed to solve the species classification problem for bird song; it is a residual network from the vision domain combined with an attention mechanism. The structure of AMResNet is shown in fig. 4: the main branch comprises a convolutional layer, a batch normalization layer, a max pooling layer, structure blocks 1-4, an average pooling layer and a fully connected layer. In each structure block, the attention layer, containing a channel attention module and a spatial attention module, is concatenated with the residual layer, containing two skip connections. N denotes the number of channels in the four blocks, with values 64, 128, 256 and 512 respectively. The input data is a single-channel feature map from the feature set, and its size is reduced by half after it passes through the first 7 × 7 convolutional layer and the 2 × 2 max pooling layer. The feature maps produced by the attention layer and the residual layer have the same size as their input, but the number of channels changes. After average pooling following the 4th block, the feature maps for flattening are all of size 1 × 1 and are delivered to a fully connected layer with 1024 hidden units; finally, a tensor of the corresponding size is output according to the number of categories in the bird song data set.
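The downsampling at the front of the network can be traced with standard conv/pool output-size arithmetic. The sketch below assumes stride 2 and padding 3 for the 7 × 7 convolution and stride 2 for the 2 × 2 max pooling (common settings, not stated explicitly in the text), applied to the 105 × 63 Log-MFCC-CST input:

```python
def conv_out(n, kernel, stride, pad):
    """Output size of a conv/pool along one dimension: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

h, w = 105, 63                                              # Log-MFCC-CST input
h, w = (conv_out(d, 7, stride=2, pad=3) for d in (h, w))    # 7x7 convolutional layer
h, w = (conv_out(d, 2, stride=2, pad=0) for d in (h, w))    # 2x2 max pooling layer
print(h, w)  # 26 16
```

Under these assumptions each of the two layers roughly halves the spatial size, consistent with the description above.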
Referring to fig. 10, an intelligent identification process of the bird song automatic identification system in a real environment specifically includes:
acquiring bird audio through a mobile terminal and extracting its spectral features and music score features;
loading each bird species label, together with the spectral and music score features, onto a PC equipped with the AMResNet network;
the PC first compares the bird species against the previously catalogued species to determine whether it is a new species;
if it is a new species, the spectral and music score features are first received and recorded as new sound data; a pre-trained weight file is then downloaded and loaded, transfer learning is performed on the spectral features, music score features and bird species, and training is carried out on the deployed AMResNet network; the newly trained feature data set and pre-trained weight file are uploaded, the bird song data set in the database is updated, and background staff are alerted to monitor the new bird species;
if it is not a new species, only the pre-trained weight file needs to be downloaded; the spectral and music score features are then used as input to the AMResNet network, the species predicted by the network is output, and the result is displayed; at the same time, the data are sent to the backend and the bird counts in the database are updated, making it convenient for staff to monitor trends in bird species.
3.1 attention layer
Each attention layer consists of a channel attention module (fig. 5, left box) and a spatial attention module (fig. 5, right box), which apply weights to the channel and spatial dimensions respectively, focusing the model on the most important information in the time and frequency domains and thereby filtering out irrelevant noise.
In the channel attention module, a MaxPool-MLP sequential connection and an AvgPool-MLP sequential connection are combined by addition. The channel dimension of the output data is then reduced to 1 by a convolutional layer with kernel size 3 and stride 1, and sigmoid is used as the activation function to obtain the attention weights. The calculation process is as follows:
A_C(x) = σ(W_mlp(Avg(x)) + W_mlp(Max(x)))
where x and A_C(x) denote the input and output of the channel attention module respectively, Avg(·) and Max(·) denote average pooling and max pooling, W_mlp(·) denotes the multi-layer perceptron, and σ(·) is the sigmoid function.
In the spatial attention module, the MaxPool and AvgPool outputs are concatenated (Concat) along the channel dimension: the feature maps output by the channel attention module are aggregated into an H × W × 2 tensor, and the attention weight of the spatial attention module is obtained through a two-dimensional convolution with kernel size 7 and padding 3, followed by a sigmoid function. The calculation process is as follows:
A_S(x) = σ(f_3([Avg(A_C(x)); Max(A_C(x))]))
wherein x and A_S(x) represent the input and output of the spatial attention module respectively, and f_3() denotes the convolution operation with kernel size 7 and padding 3.
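A corresponding NumPy sketch of the spatial attention computation (assuming a channel-first (C, H, W) input that has already passed through the channel attention module, and a hypothetical 7 × 7 × 2 convolution kernel):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(img, kernel, pad):
    # Naive single-channel 2-D convolution with zero padding (stride 1).
    h, w = img.shape
    k = kernel.shape[0]
    padded = np.pad(img, pad)
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

def spatial_attention(x, kernel):
    """A_S(x) = sigmoid(f_3([Avg(x); Max(x)])), kernel size 7, padding 3.

    x:      (C, H, W) feature map
    kernel: (7, 7, 2) weights applied to the stacked [Avg; Max] maps
    """
    avg = x.mean(axis=0)        # channel-wise average pooling -> (H, W)
    mx = x.max(axis=0)          # channel-wise max pooling -> (H, W)
    z = conv2d_same(avg, kernel[..., 0], 3) + conv2d_same(mx, kernel[..., 1], 3)
    return sigmoid(z)           # per-position attention weights -> (H, W)
```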
Finally, the output of the entire attention layer is obtained from the input feature tensor X_i and the outputs of the two attention modules. The calculation process is as follows:

Y_i = A_S(A_C(X_i) ⊗ X_i) ⊗ (A_C(X_i) ⊗ X_i)

wherein ⊗ denotes element-wise multiplication.
3.2 Residual layer
Each residual layer consists of two residual structures (fig. 6), and each residual structure consists of a sequentially connected Conv3 × 3-BN-ReLU operation and a skip connection. Compared with a general deep neural network, the deep structures in a residual network are no longer required to learn identity mappings that are difficult to fit; instead, they fit a residual function. As long as the residual function F() equals 0, the structure reduces to an identity transformation, and fitting the residual is simple and easy to realize. The calculation process is as follows:
y=x+F(x,w)
where x and y represent the input and output of the residual structure, respectively, and w is the corresponding weight of the input element.
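A minimal NumPy sketch of one residual structure (single-channel 2-D input; batch normalization omitted for brevity, and the 3 × 3 convolution weights w1, w2 are hypothetical):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv3x3(x, w):
    # Naive 3x3 convolution with zero padding 1 and stride 1 ('same' output size).
    h, wd = x.shape
    p = np.pad(x, 1)
    out = np.empty((h, wd))
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * w)
    return out

def residual_block(x, w1, w2):
    # y = x + F(x, w), with F = Conv3x3 -> ReLU -> Conv3x3 (BN omitted).
    return x + conv3x3(relu(conv3x3(x, w1)), w2)
```

When the second convolution's weights are all zero, F(x, w) = 0 and the block reduces to the identity transformation described above.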
And finally, the residual network outputs the combined result of all the deep structures, and the calculation process is as follows:

x_L = f_ReLU(x_l + Σ_{i=l}^{L-1} F(x_i, w_i))

wherein l and L denote the indices of the current and last residual layers respectively, and f_ReLU() denotes the ReLU activation function.
(4) Statistical analysis
In this study, the evaluation indices used to test the performance of the proposed method are: accuracy (Accuracy), precision (Precision), recall (Recall) and F1 score (F1-score). Accuracy is the gold-standard index of a classification model and is suitable for both binary and multi-class tasks. For a classification model f() evaluated on a test set D of n samples, the accuracy is calculated as follows:

Accuracy = (1/n) Σ_{(x,y)∈D} I(f(x) = y)

wherein I() is the indicator function, which equals 1 when the prediction f(x) matches the true label y and 0 otherwise.
the precision rate represents the proportion of correctly classified bird song samples in the prediction tag set, and the calculation process is as follows:
the recall rate represents the proportion of correctly classified bird song samples in the real label set, and the calculation process is as follows:
the F1 score is a harmonic mean of the precision and recall, and is calculated as follows:
to define these indices, values for True Positive (TP), true Negative (TN), false Positive (FP) and False Negative (FN) were also used in this study.
(5) Experimental verification
The bird song data set, composed of 12651 song recordings in real environments provided by the Beijing Academy of Artificial Intelligence (BAAI), is used as the experimental object; it covers 19 species: gray goose (AA), great swan (CC), green duck (AP), green wing duck (ACr), western quail (CQ), pheasant (PCo), red throat submerged bird (GS), cocket (ACi), common whorlled plotter (PCa), eagle (AG), eurasian (BB), western rice-stem chicken (WC), boney chicken (FA), black wing long-foot snipe (HH), phoenix-head wheat chicken (VV), white waist snipe (TC), snipe (TT), snipe (TG) and sparrow (Pa). During the experiment, all bird song data were divided into a training set (8863 song recordings) and a test set (3788 song recordings), and the model was trained by ten-fold cross-validation. We tested the experimental results of different combinations of feature sets (Table 1), different numbers of attention layers (Table 2), and the presence or absence of attention layers (fig. 7), and verified the performance of AMResNet using the confusion matrix (fig. 8), the ROC curve (fig. 9), and the precision, recall and F1 score (Table 3). In addition, we compared the recognition accuracy of AMResNet with seven other common classification models (Table 4): the Gaussian mixture model (GMM), hidden Markov model (HMM), three-layer cascaded artificial neural network (ANN), ResNet-18, ResNet-34, ResNet-50 and Vision Transformer (ViT).
The experimental results show that the combined Log-CST features (Table 1) used in this study are the most effective, and the four attention layers used in AMResNet (Table 2) work best. Compared with the model without attention layers (fig. 7(a)), the model with attention layers (fig. 7(b)) effectively removes the noise part of the feature map (yellow box in fig. 7(a)) and highlights its relevant part (red box in fig. 7(b)). On the ten-fold cross-validated ROC curves (fig. 9), AMResNet achieved a good mean AUC value, and it also achieved good classification in the confusion matrix (fig. 8) and good identification of each species (Table 3). Finally, in the comparison experiment between different models (Table 4), the AMResNet model, which combines the advantages of the residual network and the attention mechanism, not only deepens the network while reducing the amount of calculation, but also gives higher weight to the important information of the input data, thereby obtaining the best recognition effect.
TABLE 1 AMResNet identification accuracy comparison based on feature sets of different combinations
TABLE 2 comparison of model identification accuracy for different numbers of attention layers
TABLE 3 Identification results of AMResNet on each species
TABLE 4 Comparison of bird song identification accuracy for different models
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (10)
1. An automatic bird song recognition system in a real environment, comprising:
the preprocessing module is used for framing, windowing and filtering the bird song audio file;
the feature extraction module is used for extracting the spectral features and score features of the bird song audio file;
the feature combination module is used for combining the extracted spectral features and score features to obtain a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set;
and the species classification prediction module is used for obtaining bird species according to the feature set of the bird song audio file.
2. The automatic bird song recognition system in a real environment according to claim 1, wherein the species classification prediction module is an AMResNet network comprising a convolutional layer (Conv), a batch normalization layer (BN), a modified linear unit (ReLU), a max pooling layer (MaxPool), 4 structured blocks (ARBlock), an average pooling layer (AvgPool), and a full connection layer (FC); in the structure block, an attention layer is connected with a residual error layer in series;
the attention layer comprises a channel attention module and a spatial attention module, which weight the channels and the spatial positions respectively, concentrating the model on the most important information in the time and frequency domains and filtering out irrelevant noise;
the residual layer consists of two residual structures, each residual structure comprising a sequentially connected Conv3 x 3-BN-ReLU operation and a jump connection.
3. The system of claim 1, wherein the spectral features comprise: the log-mel spectrum (Log-mel) and mel-frequency cepstral coefficients (MFCC);
the score features include: chroma (Chroma), spectral contrast (Spectral contrast) and tonal centroid (Tonnetz).
4. The automatic bird song recognition system of claim 2, wherein in the channel attention module, the MaxPool-MLP sequential connection operation and the AvgPool-MLP sequential connection operation are combined by addition;
reducing the channel dimension of the output data to 1 through a convolution layer with kernel size 3 and stride 1, and obtaining the attention weight by using sigmoid as the activation function, calculated as follows:
A_C(x) = σ(W_mlp(Avg(x)) + W_mlp(Max(x)))
wherein x and A_C(x) represent the input and output of the channel attention module respectively, Avg() and Max() denote average pooling and maximum pooling, W_mlp() denotes the multi-layer perceptron, and σ() is the sigmoid function.
5. The system according to claim 4, wherein in the spatial attention module, the MaxPool and AvgPool outputs are concatenated (Concat) along the channel dimension, the feature maps output by the channel attention module are aggregated into an H × W × 2 tensor, and the attention weight of the spatial attention module is obtained through a two-dimensional convolution operation with kernel size 7 and padding 3 and a sigmoid function, calculated as follows:
A_S(x) = σ(f_3([Avg(A_C(x)); Max(A_C(x))]))
wherein x and A_S(x) represent the input and output of the spatial attention module respectively, and f_3() denotes the convolution operation with kernel size 7 and padding 3.
7. The system according to claim 2, wherein the residual layer is calculated as follows:
y=x+F(x,w)
wherein x and y represent the input and output of the residual structure, respectively, and w is the corresponding weight of the input element;
the output of the residual network formed by the residual layers is calculated as follows:

x_L = f_ReLU(x_l + Σ_{i=l}^{L-1} F(x_i, w_i))

wherein l and L denote the indices of the current and last residual layers respectively, and f_ReLU() denotes the ReLU activation function.
8. A method for an automatic bird song recognition system in a real environment, comprising the steps of:
S1, performing framing, windowing and filtering processing on the read bird song audio file, and extracting spectral features and score features;
s2, combining the extracted features to obtain a Log-CST feature set, an MFCC-CST feature set and a Log-MFCC-CST feature set;
and S3, the obtained feature set is used as input and sent into an AMResNet network for low-dimensional to high-dimensional feature learning, and bird species predicted by the model are output.
9. The method according to claim 8, wherein the step S2 of combining the extracted features mainly includes:
concatenating the chroma, spectral contrast and tonal centroid features to obtain an extended feature;
aggregating the log-mel spectrum and the extended feature to obtain the Log-CST feature set;
aggregating the mel-frequency cepstral coefficients and the extended feature to obtain the MFCC-CST feature set;
aggregating the log-mel spectrum, mel-frequency cepstral coefficients and the extended feature to obtain the Log-MFCC-CST feature set;
wherein the Log-CST feature set, the MFCC-CST feature set and the Log-MFCC-CST feature set are all combined in a linear manner.
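As an illustration of this linear (concatenation-based) combination, a NumPy sketch with hypothetical per-frame feature matrices (the row counts — 12 chroma bins, 7 spectral-contrast bands, 6 tonal-centroid dimensions, 128 log-mel bands, 20 MFCCs — are common defaults, not values fixed by the claims):

```python
import numpy as np

n_frames = 100
chroma   = np.random.rand(12, n_frames)   # chroma
contrast = np.random.rand(7, n_frames)    # spectral contrast
tonnetz  = np.random.rand(6, n_frames)    # tonal centroid
log_mel  = np.random.rand(128, n_frames)  # log-mel spectrum
mfcc     = np.random.rand(20, n_frames)   # mel-frequency cepstral coefficients

# Extended feature: concatenate chroma, spectral contrast and tonal centroid.
cst = np.concatenate([chroma, contrast, tonnetz], axis=0)

log_cst      = np.concatenate([log_mel, cst], axis=0)        # Log-CST feature set
mfcc_cst     = np.concatenate([mfcc, cst], axis=0)           # MFCC-CST feature set
log_mfcc_cst = np.concatenate([log_mel, mfcc, cst], axis=0)  # Log-MFCC-CST feature set
```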
10. The method of claim 8, wherein the process by which the AMResNet network is pre-trained on the audio file features of each bird comprises:
acquiring bird audio through a mobile terminal, and extracting its spectral features and score features;
uploading the species, spectral features and score features of each bird to a PC terminal equipped with the AMResNet network;
the PC terminal first compares the bird species with the pre-divided species to judge whether it is a new species;
if the bird is a new species, the spectral features and score features are first received and recorded as new sound data; then a pre-training weight file is downloaded and loaded, transfer learning is performed on the spectral features, score features and bird species, and training is carried out on the deployed AMResNet network; finally, the newly trained feature data set and pre-training weight file are uploaded, the bird song data set in the database is updated, and background staff are reminded to monitor the new bird species;
if the bird is not a new species, only the pre-training weight file needs to be downloaded; the spectral features and score features are then used as the input of the AMResNet network, the species predicted by the network is output, and the result is displayed; simultaneously, the data are transmitted to the background and the count of this bird species in the database is updated, facilitating the staff's monitoring of change trends in bird species.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210739725.7A CN115294994A (en) | 2022-06-28 | 2022-06-28 | Bird sound automatic identification system in real environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115294994A true CN115294994A (en) | 2022-11-04 |
Family
ID=83819691
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210739725.7A Pending CN115294994A (en) | 2022-06-28 | 2022-06-28 | Bird sound automatic identification system in real environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115294994A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114974268A (en) * | 2022-06-08 | 2022-08-30 | 江苏麦克马尼生态科技有限公司 | Bird song recognition monitoring system and method based on Internet of things |
CN114974268B (en) * | 2022-06-08 | 2023-09-05 | 江苏麦克马尼生态科技有限公司 | Bird song recognition monitoring system and method based on Internet of things |
CN116206612A (en) * | 2023-03-02 | 2023-06-02 | 中国科学院半导体研究所 | Bird voice recognition method, model training method, device and electronic equipment |
CN116559778A (en) * | 2023-07-11 | 2023-08-08 | 海纳科德(湖北)科技有限公司 | Vehicle whistle positioning method and system based on deep learning |
CN116559778B (en) * | 2023-07-11 | 2023-09-29 | 海纳科德(湖北)科技有限公司 | Vehicle whistle positioning method and system based on deep learning |
CN117095694A (en) * | 2023-10-18 | 2023-11-21 | 中国科学技术大学 | Bird song recognition method based on tag hierarchical structure attribute relationship |
CN117095694B (en) * | 2023-10-18 | 2024-02-23 | 中国科学技术大学 | Bird song recognition method based on tag hierarchical structure attribute relationship |
CN118098270A (en) * | 2024-04-24 | 2024-05-28 | 安徽大学 | Noise tracing method based on feature extraction and feature fusion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115294994A (en) | Bird sound automatic identification system in real environment | |
Priyadarshani et al. | Automated birdsong recognition in complex acoustic environments: a review | |
CN110491416B (en) | Telephone voice emotion analysis and identification method based on LSTM and SAE | |
CN111048114A (en) | Equipment and method for detecting abnormal sound of equipment | |
CN110211594B (en) | Speaker identification method based on twin network model and KNN algorithm | |
CN108520753A (en) | Voice lie detection method based on the two-way length of convolution memory network in short-term | |
WO2018166316A1 (en) | Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures | |
CN108806694A (en) | A kind of teaching Work attendance method based on voice recognition | |
Himawan et al. | 3d convolution recurrent neural networks for bird sound detection | |
Ting Yuan et al. | Frog sound identification system for frog species recognition | |
CN115410711B (en) | White feather broiler health monitoring method based on sound signal characteristics and random forest | |
CN107193378A (en) | Emotion decision maker and method based on brain wave machine learning | |
CN111048097A (en) | Twin network voiceprint recognition method based on 3D convolution | |
CN111986699A (en) | Sound event detection method based on full convolution network | |
CN116842460A (en) | Cough-related disease identification method and system based on attention mechanism and residual neural network | |
CN112200238A (en) | Hard rock tension-shear fracture identification method and device based on sound characteristics | |
CN115578678A (en) | Fish feeding intensity classification method and system | |
Xiao et al. | AMResNet: An automatic recognition model of bird sounds in real environment | |
CN114863905A (en) | Voice category acquisition method and device, electronic equipment and storage medium | |
CN112329819A (en) | Underwater target identification method based on multi-network fusion | |
Chinmayi et al. | Emotion Classification Using Deep Learning | |
CN112052880A (en) | Underwater sound target identification method based on weight updating support vector machine | |
Wang et al. | A hierarchical birdsong feature extraction architecture combining static and dynamic modeling | |
CN113936667A (en) | Bird song recognition model training method, recognition method and storage medium | |
CN115170942A (en) | Fish behavior identification method with multilevel fusion of sound and vision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||