CN114863938A - Bird language identification method and system based on attention residual error and feature fusion - Google Patents
- Publication number
- CN114863938A (application number CN202210570511.1A)
- Authority
- CN
- China
- Prior art keywords: bird, sound, attention, layer, residual error
- Prior art date: 2022-05-24
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L17/00: Speaker identification or verification techniques
- G10L17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04: Training, enrolment or model building
- G10L17/18: Artificial neural networks; connectionist approaches
- G10L17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; recognition of animal voices
Abstract
The invention discloses a bird sound identification method and system based on attention residual learning and feature fusion, comprising the following steps: first, the bird sound training set is preprocessed by framing and windowing; next, the preprocessed training set is processed by two feature extraction methods, and the resulting feature information is converted into an energy spectrogram image; in the training stage, a residual network equipped with a horizontal-vertical attention module computes the probability of each bird species under a cross-entropy loss function, yielding a final classification and prediction layer that classifies bird sounds. A bird sound recognition system is also designed and implemented to perform recognition and classification with the proposed method. The method improves recognition accuracy for highly confusable bird sounds, and the test results demonstrate its effectiveness.
Description
Technical Field
The invention relates to the technical field of deep-learning-based bird voiceprint recognition, and in particular to a bird language identification method and system based on attention residual learning and feature fusion.
Background
Birds are an important indicator for evaluating the health of an ecosystem: as a key component of the ecosystem, their presence and migration patterns often serve as early-warning signals for the environmental health of a given area. In recent decades, the protection of avian biodiversity has received growing attention, making bird voiceprint recognition technology increasingly significant. The vocal structures and organs of each bird species differ to some degree, and these biological characteristics cannot be copied; they can therefore be used to identify species, and bird voiceprint technology applies voiceprint recognition to the characteristics specific to each bird species. Current bird voiceprint recognition can be divided, by model type, into traditional methods and deep-learning-based methods. Traditional methods mainly use Gaussian mixture models with maximum likelihood estimation to select the highest-scoring sound class, while deep-learning-based methods train, recognize, and detect with neural network models. Compared with traditional and classical machine learning methods, deep-learning-based methods perform considerably better on bird sound recognition tasks. With the rapid development of artificial intelligence and deep learning, bird voiceprint recognition has broad application prospects in the field of environmental protection.
Document 1 (Lee C H, Han C C, Chuang C C. Automatic Classification of Bird Species From Their Sounds Using Two-Dimensional Cepstral Coefficients [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2008, 16(8):1541-1550.) uses a single audio feature and improves its expressiveness by extracting dynamic and static components and fusing them, thereby raising identification accuracy. Document 2 (Efremova D B, Sankupellay M, Konovalov D A. Data-Efficient Classification of Birdcall Through Convolutional Neural Networks Transfer Learning [C]. In: Digital Image Computing: Techniques and Applications, 2019, 294-301.) uses a ResNet50 deep convolutional neural network as the model to increase the speed of bird identification. Document 3 (Yang Chunyong, Qi Hongda, Peng Yanqiu, et al. Research on energy spectrograms fused with voiceprint information for bird recognition [J]. Applied Acoustics, 2020, 39(3):453-.) combines LBP and HOG features with a classifier algorithm and additionally uses spectral information from a generative adversarial network for data augmentation, further improving the recognition rate.
Most deep-learning-based recognition models use large convolutional neural networks; although this improves the recognition rate, the growth in parameter count inevitably makes training difficult and detection slow. For feature extraction, a single method is generally used, but a single set of feature parameters cannot fully express all the characteristics of bird sounds during recognition and detection, which imposes certain limitations.
Disclosure of Invention
To address the limitations of single audio features and the large number of network parameters, the invention provides a bird language identification method based on attention residual learning and feature fusion. Two feature extraction methods are used and their outputs are fused into one set of feature information, which is converted into an energy spectrogram. The spectrogram is fed into a bird language identification and classification convolutional neural network, which generates the corresponding feature images by down-sampling; a residual network with a horizontal-vertical attention module then effectively attends to the channel relationships among the feature images, reducing computational cost while improving recognition accuracy.
A bird language identification method based on attention residual and feature fusion, characterized by comprising the following steps:
S1, collecting the sounds of various birds in the natural environment to form a sound training set; marking each sound with the bird species to which it belongs, and controlling the duration of each sound to between 2s and 30s, wherein each sound contains the song of a single bird;
S2, sampling the sound training set of step S1 at the same sampling frequency, and then unifying the audio durations through the preprocessing operations of framing and windowing;
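The framing-and-windowing preprocessing of step S2 can be sketched as follows. The 400-sample frame and 160-sample hop (25 ms and 10 ms at a 16 kHz sampling rate) are illustrative assumptions, since the patent does not fix these values:

```python
import numpy as np

def frame_and_window(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames and apply a Hamming window.

    Assumes len(signal) >= frame_len; frame_len/hop are tuning assumptions.
    """
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Hamming window tapers frame edges to reduce spectral leakage
    return frames * np.hamming(frame_len)
```

For a 0.1 s clip at 16 kHz (1600 samples) this yields 8 windowed frames of 400 samples each.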
s3, obtaining feature information through two feature extraction methods, and finally converting the feature information into an energy spectrogram;
S31, processing the preprocessed sound training set with a Mel triangular filter-bank algorithm followed by cepstral mean-variance normalization to obtain a vector score F; processing the preprocessed sound training set with a Gammatone filtering algorithm augmented with noise suppression, followed by cepstral mean-variance normalization, to obtain a vector score G; and fusing the two vector scores to obtain the feature information f:
f=ωF+(1-ω)G
where ω represents the mixing weight coefficient.
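A minimal sketch of step S31's normalization and fusion: `cmvn` stands in for the cepstral mean-variance normalization applied to each feature matrix, and ω = 0.5 is an illustrative assumption, since the patent leaves the mixing weight unspecified:

```python
import numpy as np

def cmvn(feat):
    # Cepstral mean-variance normalization: zero mean, unit variance
    # per coefficient across all frames (rows are frames).
    return (feat - feat.mean(axis=0)) / (feat.std(axis=0) + 1e-8)

def fuse(F, G, omega=0.5):
    # f = omega * F + (1 - omega) * G, the fusion rule from the patent;
    # omega = 0.5 is only a placeholder value.
    return omega * np.asarray(F) + (1.0 - omega) * np.asarray(G)
```

With ω = 1 the fused result reduces to the Mel-based score F alone; with ω = 0 it reduces to the Gammatone-based score G.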
S32, converting the feature information f obtained in S31 into an energy spectrogram.
S33, performing image enhancement on the energy spectrogram, where the enhancement operations comprise random color-to-grayscale transformation and image rotation.
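Step S32 above converts the fused feature information into an energy spectrogram image. A minimal sketch of one such conversion; the dB scaling, the 80 dB dynamic-range clip, and the 8-bit normalization are illustrative assumptions not fixed by the patent:

```python
import numpy as np

def to_energy_db(f, top_db=80.0):
    """Convert a (frames x coefficients) feature matrix to an 8-bit
    dB-scaled energy image, clipped to a dynamic range of top_db."""
    power = np.abs(f) ** 2
    db = 10.0 * np.log10(np.maximum(power, 1e-10))
    db = np.maximum(db, db.max() - top_db)        # clip dynamic range
    # normalize to [0, 255] so the result can be saved as an image
    img = 255.0 * (db - db.min()) / max(db.max() - db.min(), 1e-10)
    return img.astype(np.uint8)
```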
The constructed bird language identification and classification convolutional neural network is specifically as follows:
the network comprises, in order, 3 convolutional layers with 3×3 kernels and stride 2, a max pooling layer, an activation function layer, and 48 residual structure layers with horizontal-vertical attention modules; each residual structure layer with a horizontal-vertical attention module contains convolutional layers, an activation function layer, an average pooling layer, and a batch normalization layer; the network applies global average pooling at the last layer and an activation function layer after every convolution operation.
Inputting the energy spectrogram into the bird language identification and classification convolutional neural network, the 3 convolutional layers with 3×3 kernels and stride 2 perform down-sampling to obtain the feature image Q; the process is expressed as:
Q = F_{3×3}(F_{3×3}(F_{3×3}(f)))
processing the feature image Q by the residual structure layers with the horizontal-vertical attention module, where each such layer comprises: a 1×1 convolutional layer, a 3×3 convolutional layer, a 1×1 convolutional layer, a batch normalization layer, an activation function layer, the horizontal-vertical attention module, and a residual connection; the process can be expressed as:
F_out = F_HW(F_{1×1}(F_{3×3}(F_{1×1}(x)))) + x
F_HW is the horizontal-vertical attention module, composed of two attention sub-modules acting in the vertical and horizontal directions respectively, and is expressed as:
F_HW = F_H + F_W
wherein δ denotes the sigmoid function, Conv(x) denotes a convolution with a 1×1 kernel, and AvgPool denotes an average pooling operation.
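The patent names only the operations δ (sigmoid), a 1×1 convolution, and average pooling for F_H and F_W without giving their explicit formulas, so the following NumPy sketch is a plausible reconstruction, not the patented implementation: each branch average-pools along one spatial axis, mixes channels with a 1×1-convolution weight matrix, applies a sigmoid, and reweights the feature map; the two branches are summed (F_HW = F_H + F_W) and a residual connection is added:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hw_attention(x, w_h, w_w):
    """Horizontal-vertical attention over a feature map x of shape (C, H, W).

    w_h, w_w: (C, C) weight matrices standing in for 1x1 convolutions.
    """
    pool_h = x.mean(axis=2, keepdims=True)             # AvgPool over W -> (C, H, 1)
    att_h = sigmoid(np.einsum('oc,chw->ohw', w_h, pool_h))
    pool_w = x.mean(axis=1, keepdims=True)             # AvgPool over H -> (C, 1, W)
    att_w = sigmoid(np.einsum('oc,chw->ohw', w_w, pool_w))
    # F_H and F_W each reweight x; their outputs are summed (F_HW = F_H + F_W)
    return x * att_h + x * att_w

def residual_block(x, w_h, w_w):
    # Residual connection around the attention module only; the 1x1/3x3
    # convolutions and batch normalization of the full layer are omitted.
    return hw_attention(x, w_h, w_w) + x
```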
S4, constructing the bird language identification and classification convolutional neural network; inputting the energy spectrograms obtained in step S3 into the constructed network for training. The loss function is the categorical cross-entropy loss; an optimization strategy and hyper-parameters are set, and cyclic iterative training continuously reduces the loss until the set number of iterations is completed, after which the trained weight parameters are saved;
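The categorical cross-entropy loss used in step S4 can be computed as below; this is a standalone sketch of the loss alone, with the network that produces the logits omitted:

```python
import numpy as np

def categorical_cross_entropy(logits, labels):
    """Mean softmax cross-entropy; logits (N, K), labels are integer ids (N,)."""
    z = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # negative log-probability of the true class, averaged over the batch
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

Uniform logits over K classes give a loss of ln K, and the loss approaches 0 as the network becomes confident in the correct species.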
S5, building a bird language recognition system based on attention residual and feature fusion from the convolutional neural network constructed in step S4 and the trained weight parameters; the system performs bird language recognition and classification on the energy spectrograms to be detected, and marks and counts all input bird spectrograms.
The invention also provides a bird language recognition system based on attention residual and feature fusion, comprising the following modules:
a bird sound acquisition module, configured to acquire the bird sound data set to be processed;
a bird sound recognition model acquisition module, configured to build a bird sound recognizer from the recognition model and parameter file obtained by the above bird language identification method based on attention residual and feature fusion, for recognizing and classifying bird sound categories;
and a bird counting module, configured to count the number of each obtained bird species.
Advantageous effects:
1. The invention provides a bird language identification method based on attention residual learning and feature fusion. Because the bird sound data set contains a large number of short original bird song recordings of many species, the invention first preprocesses the audio signals, then converts them into energy spectrograms through audio feature extraction, and finally extracts image features with an attention residual network, which accelerates the training of the recognition and classification network and reduces the number of network parameters.
2. When extracting audio features, different feature extraction methods are adopted; after the features are obtained, cepstral mean-variance normalization and feature warping are applied, reducing channel mismatch and channel effects that may be present in the audio.
3. The invention designs and implements a bird sound recognition system that performs recognition using the bird language identification method based on attention residual and feature fusion.
Drawings
FIG. 1 is an overall diagram of the model structure of the bird language identification method used in an embodiment of the present invention;
FIG. 2 illustrates the attention structure of the bird language identification method according to an embodiment of the present invention, where sub-graph (a) is the overall structure of the attention residual module and sub-graph (b) shows the vertical and horizontal attention sub-structures within the attention module;
FIG. 3 is a schematic flow chart of the bird sound recognition system according to an embodiment of the present invention;
FIG. 4 is a block diagram of the bird sound recognition system according to an embodiment of the present invention;
FIG. 5 compares feature extraction with and without the method of the invention: sub-graph (a) is the energy spectrogram obtained using only the Mel triangular filtering algorithm, sub-graph (b) is the energy spectrogram obtained using only the noise-suppressed Gammatone filtering algorithm, and sub-graph (c) is the energy spectrogram obtained by the feature fusion method of the invention;
FIG. 6 compares confusion matrices: sub-graph (a) uses only the Mel triangular filtering feature extraction method, sub-graph (b) uses only the noise-suppressed Gammatone filtering method, sub-graph (c) uses the feature fusion method, and sub-graph (d) uses the method of the invention.
Detailed Description
In order to make the technical features, objects and advantages of the present invention more comprehensible, one embodiment of the present invention is further described with reference to the accompanying drawings. The examples are given solely for the purpose of illustration and are not to be construed as limitations of the present invention, as numerous insubstantial modifications and adaptations of the invention may be made by those skilled in the art based on the teachings herein.
A bird language identification method based on attention residual error and feature fusion is characterized by comprising the following steps:
S1, collecting the sounds of various birds in the natural environment to form a sound training set; marking each sound with the bird species to which it belongs, and controlling the duration of each sound to between 2s and 30s, wherein each sound contains the song of a single bird; at least 200 recordings are screened for each bird species.
S2, sampling the sound training set in the step S1 by using the same sampling frequency, and then unifying the audio time of the sound training set through the preprocessing operations of framing and windowing;
s3, obtaining feature information through two feature extraction methods, and finally converting the feature information into an energy spectrogram;
S4, constructing the bird language identification and classification convolutional neural network; inputting the energy spectrograms obtained in step S3 into the constructed network for training. The loss function is the categorical cross-entropy loss; an optimization strategy and hyper-parameters are set, and cyclic iterative training continuously reduces the loss until the set number of iterations is completed, after which the trained weight parameters are saved. The number of iterations is chosen such that, by the time it is reached, the loss value no longer decreases appreciably and training has converged;
S5, building a bird language recognition system based on attention residual and feature fusion from the convolutional neural network constructed in step S4 and the trained weight parameters; the system performs bird language recognition and classification on the energy spectrograms to be detected, and marks and counts all input bird spectrograms.
As a specific embodiment of the present invention, step S3 specifically comprises the following steps:
S31, processing the preprocessed sound training set with a Mel triangular filter-bank algorithm followed by cepstral mean-variance normalization to obtain a vector score F; processing the preprocessed sound training set with a Gammatone filtering algorithm augmented with noise suppression, followed by cepstral mean-variance normalization, to obtain a vector score G; and fusing the two vector scores to obtain the feature information f:
f=ωF+(1-ω)G
where ω represents the mixing weight coefficient.
S32, converting the feature information f obtained in S31 into an energy spectrogram.
S33, performing image enhancement on the energy spectrogram, where the enhancement operations comprise random color-to-grayscale transformation and image rotation.
As a specific embodiment of the present invention, the bird language identification and classification convolutional neural network constructed in step S4 is specifically as follows:
the network comprises, in order, 3 convolutional layers with 3×3 kernels and stride 2, a max pooling layer, an activation function layer, and 48 residual structure layers with horizontal-vertical attention modules; each residual structure layer with a horizontal-vertical attention module contains convolutional layers, an activation function layer, an average pooling layer, and a batch normalization layer; the network applies global average pooling at the last layer and an activation function layer after every convolution operation.
Inputting the energy spectrogram into the bird language identification and classification convolutional neural network, the 3 convolutional layers with 3×3 kernels and stride 2 perform down-sampling to obtain the feature image Q; the process is expressed as:
Q = F_{3×3}(F_{3×3}(F_{3×3}(f)))
processing the feature image Q by the residual structure layers with the horizontal-vertical attention module, where each such layer comprises: a 1×1 convolutional layer, a 3×3 convolutional layer, a 1×1 convolutional layer, a batch normalization layer, an activation function layer, the horizontal-vertical attention module, and a residual connection; the process can be expressed as:
F_out = F_HW(F_{1×1}(F_{3×3}(F_{1×1}(x)))) + x
F_HW is the horizontal-vertical attention module, composed of two attention sub-modules acting in the vertical and horizontal directions respectively, and is expressed as:
F_HW = F_H + F_W
wherein δ denotes the sigmoid function, Conv(x) denotes a convolution with a 1×1 kernel, and AvgPool denotes an average pooling operation.
Simulation experiment
Fig. 5 shows that the frequency content in energy spectrogram (c), produced by the proposed method, is more distinct than in sub-graphs (a) and (b), indicating that the feature fusion method of the invention enhances the frequency representation and benefits the recognition rate.
The recognition rates of the proposed method and the comparison methods are shown in Table 1, where the Feature 1 method uses only the Mel triangular filtering algorithm with the residual network, the Feature 2 method uses only the noise-suppressed Gammatone filtering algorithm with the residual network, and the feature fusion method uses the fused features with the residual network (without the attention module).
Table 1. Bird sound recognition and classification evaluation indices in the simulation experiment

Method | Average precision (%) | Average recall (%) | Average F1 (%)
---|---|---|---
Feature 1 method | 88.96 | 86.06 | 88.06
Feature 2 method | 90.17 | 89.14 | 89.14
Feature fusion method | 93.43 | 90.91 | 92.15
Method of the invention | 93.62 | 90.59 | 92.17
As can be seen from Table 1 and Fig. 6, the feature fusion and attention residual network method improves accuracy and the F1 score, and its classification performance is superior to any single feature extraction method combined with a residual network. The simulation results show that the method of the invention improves the expressiveness of the extracted bird sound features and markedly improves recognition performance without adding extra computational cost to the network.
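The average precision, recall, and F1 columns of Table 1 can be recovered from confusion matrices like those in Fig. 6 by macro-averaging over classes; a minimal sketch:

```python
import numpy as np

def macro_metrics(conf):
    """Macro-averaged precision, recall, F1 from a confusion matrix
    where conf[i, j] counts samples of true class i predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    precision = tp / np.maximum(conf.sum(axis=0), 1e-12)  # per predicted class
    recall = tp / np.maximum(conf.sum(axis=1), 1e-12)     # per true class
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision.mean(), recall.mean(), f1.mean()
```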
The method of the invention has been described above, and those skilled in the art can implement it on the basis of this description. Other embodiments obtained by a person skilled in the art without inventive effort on the basis of the above disclosure shall fall within the scope of protection of the invention.
Claims (4)
1. A bird language identification method based on attention residual and feature fusion, characterized by comprising the following steps:
S1, collecting the sounds of various birds in the natural environment to form a sound training set; marking each sound with the bird species to which it belongs, and controlling the duration of each sound to between 2s and 30s, wherein each sound contains the song of a single bird;
S2, sampling the sound training set of step S1 at the same sampling frequency, and then unifying the audio durations through the preprocessing operations of framing and windowing;
S3, obtaining feature information through two feature extraction methods, and converting the feature information into an energy spectrogram;
S4, constructing a bird language identification and classification convolutional neural network; inputting the energy spectrogram obtained in step S3 into the constructed network for training, wherein the loss function is the categorical cross-entropy loss; setting an optimization strategy and hyper-parameters, and performing cyclic iterative training so that the loss decreases continuously until the set number of iterations is completed, whereupon the trained weight parameters are saved;
S5, building a bird language recognition system based on attention residual and feature fusion from the convolutional neural network constructed in step S4 and the trained weight parameters, using the system to recognize and classify bird sounds, and marking and counting all input bird sounds.
2. The method for identifying bird language based on attention residual error and feature fusion according to claim 1, wherein the step S3 comprises the following steps:
S31, processing the preprocessed sound training set with a Mel triangular filter-bank algorithm followed by cepstral mean-variance normalization to obtain a vector score F; processing the preprocessed sound training set with a Gammatone filtering algorithm augmented with noise suppression, followed by cepstral mean-variance normalization, to obtain a vector score G; and fusing the two vector scores to obtain the feature information f:
f=ωF+(1-ω)G
where ω represents the mixing weight coefficient.
S32, converting the feature information f obtained in S31 into an energy spectrogram.
S33, performing image enhancement on the energy spectrogram, where the enhancement operations comprise random color-to-grayscale transformation and image rotation.
3. The bird language identification method based on attention residual and feature fusion according to claim 1, wherein the bird language identification and classification convolutional neural network constructed in step S4 is specifically as follows:
the network comprises, in order, 3 convolutional layers with 3×3 kernels and stride 2, a max pooling layer, an activation function layer, and 48 residual structure layers with horizontal-vertical attention modules; each residual structure layer with a horizontal-vertical attention module contains convolutional layers, an activation function layer, an average pooling layer, and a batch normalization layer; the network applies global average pooling at the last layer and an activation function layer after every convolution operation.
Inputting the energy spectrogram into the bird recognition classification convolutional neural network, and performing down-sampling operation on 3 convolutional layers with 3 × 3 step lengths of 2 to obtain a characteristic image Q, wherein the process is represented as follows:
Q = F_3×3(F_3×3(F_3×3(f)))
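The three stacked stride-2 convolutions above each halve the spatial resolution. A single-channel sketch of this stem, with a placeholder kernel standing in for the learned weights (the real network uses multi-channel learned filters), could look like:

```python
import numpy as np

def conv3x3_s2(x, k):
    """3x3 convolution, stride 2, zero-padding 1, single channel (illustrative only)."""
    xp = np.pad(x, 1)                       # zero-pad by 1 on every side
    H = (x.shape[0] + 1) // 2               # output height after stride-2 conv
    W = (x.shape[1] + 1) // 2               # output width after stride-2 conv
    y = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            y[i, j] = np.sum(xp[2*i:2*i+3, 2*j:2*j+3] * k)
    return y

spec = np.random.rand(64, 64)               # stand-in for the energy spectrogram f
k = np.ones((3, 3)) / 9.0                   # placeholder kernel (learned in practice)
Q = conv3x3_s2(conv3x3_s2(conv3x3_s2(spec, k), k), k)
print(Q.shape)  # (8, 8): 64 -> 32 -> 16 -> 8
```

Each application of F_3×3 with stride 2 halves both spatial dimensions, so a 64×64 spectrogram yields an 8×8 feature image Q after the three layers.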
The feature image Q is then processed by the residual-structure layers with horizontal-vertical attention modules. Each such layer comprises a 1×1 convolutional layer, a 3×3 convolutional layer, a 1×1 convolutional layer, a batch-normalization layer, an activation-function layer, a horizontal-vertical attention module and a residual connection; the process can be expressed as:
F_out = F_HW(F_1×1(F_3×3(F_1×1(x)))) + x
F_HW is the horizontal-vertical attention module, composed of two attention submodules acting in the vertical and horizontal directions respectively, and is expressed as:
F_HW = F_H + F_W
where δ denotes the sigmoid function, Conv(x) denotes a convolution with a 1×1 kernel, and AvgPool denotes an average-pooling operation.
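The residual block and its attention branch can be sketched as below. The patent's exact F_H and F_W formulas are not reproduced in this excerpt, so the gating here (sigmoid of per-row and per-column average pooling) is a plausible coordinate-attention-style reading of the stated components (δ, 1×1 Conv, AvgPool), not the claimed formulation; the convolutions and batch normalization are replaced by identity stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hw_attention(x):
    """Horizontal-vertical attention sketch: gate x with sigmoid maps from
    per-row and per-column average pooling (assumed form, not the patent's)."""
    h_pool = x.mean(axis=1, keepdims=True)   # AvgPool along width  -> shape (H, 1)
    w_pool = x.mean(axis=0, keepdims=True)   # AvgPool along height -> shape (1, W)
    F_H = sigmoid(h_pool) * x                # vertical-direction submodule
    F_W = sigmoid(w_pool) * x                # horizontal-direction submodule
    return F_H + F_W                         # F_HW = F_H + F_W

def residual_block(x):
    """F_out = F_HW(F_1x1(F_3x3(F_1x1(x)))) + x, with the 1x1/3x3 convolutions
    and batch normalization omitted as identity stand-ins."""
    branch = hw_attention(x)
    return branch + x                        # residual connection

Q = np.random.rand(8, 8)
out = residual_block(Q)
print(out.shape)  # (8, 8)
```

The additive residual connection keeps the block's output shape identical to its input, which is what lets 48 such layers be stacked without further down-sampling.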
4. A bird language recognition system based on attention residual error and feature fusion, characterized by comprising the following modules:
the bird sound acquisition module is configured to acquire a bird sound data set to be processed;
a bird sound recognition model acquisition module, configured to form a bird sound recognizer from the bird sound recognition model and the parameter file obtained by the bird language identification method based on attention residual error and feature fusion of claim 1, for bird sound category recognition and classification;
and a bird counting module, configured to count the number of identified bird species.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210570511.1A CN114863938A (en) | 2022-05-24 | 2022-05-24 | Bird language identification method and system based on attention residual error and feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114863938A true CN114863938A (en) | 2022-08-05 |
Family
ID=82638609
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210570511.1A Pending CN114863938A (en) | 2022-05-24 | 2022-05-24 | Bird language identification method and system based on attention residual error and feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114863938A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1661123A1 (en) * | 2003-08-28 | 2006-05-31 | Wildlife Acoustics, Inc. | Method and apparatus for automatically identifying animal species from their vocalizations |
US20110112839A1 (en) * | 2009-09-03 | 2011-05-12 | Honda Motor Co., Ltd. | Command recognition device, command recognition method, and command recognition robot |
CN106780911A (en) * | 2016-12-30 | 2017-05-31 | 西南石油大学 | A kind of gate inhibition's voice coding, decoding system and method |
CN107393542A (en) * | 2017-06-28 | 2017-11-24 | 北京林业大学 | A kind of birds species identification method based on binary channels neutral net |
CN109979441A (en) * | 2019-04-03 | 2019-07-05 | 中国计量大学 | A kind of birds recognition methods based on deep learning |
CN110246504A (en) * | 2019-05-20 | 2019-09-17 | 平安科技(深圳)有限公司 | Birds sound identification method, device, computer equipment and storage medium |
CN111754988A (en) * | 2020-06-23 | 2020-10-09 | 南京工程学院 | Sound scene classification method based on attention mechanism and double-path depth residual error network |
CN113724712A (en) * | 2021-08-10 | 2021-11-30 | 南京信息工程大学 | Bird sound identification method based on multi-feature fusion and combination model |
CN113936667A (en) * | 2021-09-14 | 2022-01-14 | 广州大学 | Bird song recognition model training method, recognition method and storage medium |
CN114373476A (en) * | 2022-01-11 | 2022-04-19 | 江西师范大学 | Sound scene classification method based on multi-scale residual attention network |
Non-Patent Citations (2)
Title |
---|
KONG TINGLI, ET AL.: "Birdcall Identification and Prediction Based on ResNeSt Model", 2021 IEEE 21st International Conference on Communication Technology (ICCT), 4 January 2022 (2022-01-04) * |
XIE YUNCHENG: "Research and Application of Bird Sound Recognition Based on Deep Learning", China Master's Theses Full-text Database (Engineering Science and Technology II), no. 1, 15 January 2022 (2022-01-15) * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116206612A (en) * | 2023-03-02 | 2023-06-02 | 中国科学院半导体研究所 | Bird voice recognition method, model training method, device and electronic equipment |
CN117275491A (en) * | 2023-11-17 | 2023-12-22 | 青岛科技大学 | Sound classification method based on audio conversion and time diagram neural network |
CN117275491B (en) * | 2023-11-17 | 2024-01-30 | 青岛科技大学 | Sound classification method based on audio conversion and time attention seeking neural network |
CN117292693A (en) * | 2023-11-27 | 2023-12-26 | 安徽大学 | CRNN rare animal identification and positioning method integrated with self-attention mechanism |
CN117292693B (en) * | 2023-11-27 | 2024-02-09 | 安徽大学 | CRNN rare animal identification and positioning method integrated with self-attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||