CN114038479A

CN114038479A - Bird song recognition and classification method and device for coping with low signal-to-noise ratio and storage medium

Info

Publication number: CN114038479A
Application number: CN202111323056.7A
Authority: CN
Inventors: 陈爱斌; 伍安芸; 周国雄; 刘志华; 彭伟雄
Original assignee: Central South University of Forestry and Technology
Current assignee: Central South University of Forestry and Technology
Priority date: 2021-11-09
Filing date: 2021-11-09
Publication date: 2022-02-11

Abstract

The invention discloses a low signal-to-noise ratio bird song recognition and classification method, a device and a storage medium, and relates to artificial intelligence. The method comprises the following steps: extracting a time sequence signal of the audio to be identified; stacking audio frames frame by frame according to the time sequence of the sampling point values in the time sequence signal to obtain a first characteristic matrix; carrying out endpoint detection on the first feature matrix by using a preset detection algorithm to obtain a second feature matrix of the real sounding interval of the bird song; stacking the audio frames frame by frame according to the time sequence again for the sampling point values in the real sounding interval of the bird song to obtain a calibrated third feature matrix; and performing feature extraction on the third feature matrix by using a convolutional neural network so as to input the third feature matrix into a gated cyclic network for processing to obtain an identification classification result. When the method is used for classifying and identifying the bird song, the accuracy of voice endpoint detection and the accuracy of bird song classification are both high.

Description

Bird song recognition and classification method and device for coping with low signal-to-noise ratio and storage medium

Technical Field

The invention relates to the field of artificial intelligence, in particular to a low signal-to-noise ratio bird song recognition and classification method, a device and a storage medium.

Background

Bird song is as the indispensable sound in the forest, and the main part of bird song is closely related with local ecological environment quality, and birds is the important factor index that ecological scientist carried out environmental quality and assesses. Birds usually live in dense forests, and the traditional bird singing identification method mostly relies on experts to enter the forests for identification, so that the method has certain limitation.

At present, due to the rise of deep learning, the classification of bird song becomes a popular research branch in the field of audio classification, and the classification has an extremely important promoting effect on helping ecologists to evaluate the environmental quality. There are a lot of studies on deep learning, and a number of excellent algorithm models are applied to the field of audio classification, such as deep learning for urban environmental sound classification, human dialect classification, music genre classification, and the like. However, there are few studies on classification of bird song, and since bird song data collected in real life is often contaminated with non-target sound features, there is a problem in how to correctly recognize bird song in a low signal-to-noise ratio.

The existing identification method for setting bird song has two defects: firstly, the current bird song recognition directly converts the collected bird song into a spectrogram, and then sends the spectrogram into a network for recognition, so that environmental noise mixed with the acquired bird song can be used as an effective characteristic by the network, and thus, the classification of the bird song is certainly negatively influenced. Therefore, how to efficiently reduce the noise of the audio features is an urgent problem to be researched; secondly, bird song has the characteristic of time domain continuity, then most of the existing recognition models do not consider one point, so that the robustness of the models is not very strong, and how to accurately detect and separate bird song features is an important prerequisite for utilizing the characteristics of bird song features. However, existing detection algorithms are not applicable in low signal-to-noise ratio situations, and thus algorithms for bird song endpoint detection remain to be improved.

Disclosure of Invention

The invention provides a method, a device and a storage medium for identifying and classifying bird song, which are used for solving the problems that the accuracy of voice endpoint detection is not high and the accuracy of classifying bird song is not high when the existing bird song is identified.

The technical scheme provided by the invention for the technical problem is as follows:

in one aspect, the present invention provides a method for identifying and classifying birdsong coping with low signal-to-noise ratio, where the method includes:

extracting a time sequence signal of the audio to be identified;

stacking audio frames frame by frame according to the time sequence of the sampling point values in the time sequence signal to obtain a first characteristic matrix;

carrying out endpoint detection on the first feature matrix by using a preset detection algorithm to obtain a second feature matrix of the real sounding interval of the bird song;

stacking the audio frames frame by frame according to the time sequence again for the sampling point values in the real sounding interval to obtain a calibrated third feature matrix, wherein the first dimension of the second feature matrix is aligned with the first dimension of the third feature matrix;

and performing feature extraction on the third feature matrix by using a convolutional neural network so as to input the third feature matrix into a gated cyclic network for processing to obtain an identification classification result.

According to the method for identifying and classifying the bird song coping with the low signal-to-noise ratio, the performing end point detection on the first feature matrix by using a preset detection algorithm to obtain a second feature matrix of the real sounding interval of the bird song comprises the following steps:

detecting each first utterance candidate segment in the first feature matrix by adopting an energy detection algorithm;

merging and un-merging the adjacent first utterance candidate segments in each first utterance candidate segment according to a preset rule to obtain each second utterance candidate segment;

and selecting the second sounding candidate segment with the largest audio frame number in the second sounding candidate segments as the real sounding interval of the bird song of the second characteristic matrix.

According to the method for identifying and classifying the bird song coping with the low signal-to-noise ratio, the step of detecting each first sound-emitting candidate segment in the first feature matrix by adopting an energy detection algorithm comprises the following steps:

solving the energy sum of all audio frames in the first characteristic matrix;

calculating an energy average value of the audio frame according to the energy sum;

comparing the energy sum of each audio frame in the first feature matrix with the energy mean value one by one;

if the energy sum of the current audio frame is larger than the energy average value, judging that the current audio frame is a candidate frame for the singing and sounding of the bird;

recording all the bird singing sound production candidate frames to obtain a candidate frame list;

and stacking continuous audio frames in the candidate frame list according to the audio frame sequence in the first feature matrix to obtain each first pronunciation candidate segment.

According to the method for identifying and classifying the bird song coping with the low signal-to-noise ratio, the merging and un-merging the adjacent first utterance candidate segments in each first utterance candidate segment according to a preset rule to obtain each second utterance candidate segment comprises:

performing primary processing on the first utterance candidate segments to obtain a primary processing list, wherein the first processing comprises merging first merged utterance candidate segments obtained by corresponding adjacent first utterance candidate segments if the number of discontinuous audio frames between adjacent first utterance candidate segments in each first utterance candidate segment does not exceed a first threshold; otherwise, not merging; the primary processing list comprises primary processing sound production candidate segments, including the first combined sound production candidate segment obtained by combination and the first sound production candidate segment which is not combined; if the number of the first-stage processing sounding candidate segments is equal to 1, ending the processing;

if the number of the primary processing sounding candidate segments is more than 2, performing secondary processing on the primary processing sounding candidate segments to obtain a secondary processing list, wherein the secondary processing comprises the steps of obtaining the minimum value of discontinuous audio frame numbers between adjacent primary processing sounding candidate segments, and if the minimum value is not more than a second threshold value, combining second combined sounding candidate segments obtained by corresponding adjacent primary processing sounding candidate segments; otherwise, not merging; the secondary processing list comprises secondary processing sounding candidate segments, including the second combined sounding candidate segment obtained by combination and the primary processing sounding candidate segment which is not combined; wherein the second threshold is greater than the first threshold; if the number of the secondary processing sounding candidate segments is more than 2, repeating the secondary processing; if the number of the secondary processing sounding candidate segments is equal to 1, ending the processing;

and acquiring the primary processing sounding candidate segment obtained after finishing the processing or the secondary processing sounding candidate segment obtained after finishing the processing as the second sounding candidate segment.

According to the bird song recognition and classification method dealing with the low signal-to-noise ratio, the merging and un-merging the adjacent first utterance candidate segments in each first utterance candidate segment according to a preset rule to obtain each second utterance candidate segment further includes:

if the number of the primary processing sounding candidate segments is equal to 2 or the number of the secondary processing sounding candidate segments is equal to 2, performing three-level processing on the corresponding primary processing sounding candidate segments or the corresponding secondary processing sounding candidate segments to obtain a three-level processing list, wherein the three-level processing comprises combining the corresponding adjacent primary processing sounding candidate segments to obtain a third combined sounding candidate segment if the number of discontinuous audio frames between the adjacent primary processing sounding candidate segments does not exceed a third threshold value; otherwise, not merging; if the number of discontinuous audio frames between the adjacent secondary processing sounding candidate segments does not exceed a third threshold value, combining the corresponding adjacent secondary processing sounding candidate segments to obtain a fourth combined sounding candidate segment; otherwise, not merging; wherein the third threshold is greater than the second threshold; finishing the processing;

and acquiring the candidate section of the sounding obtained by the three-stage processing as the second candidate section of the sounding.

According to the method for identifying and classifying the bird song coping with the low signal-to-noise ratio, the step of extracting the features of the third feature matrix by using the convolutional neural network so as to input the third feature matrix into the gated cyclic network for processing to obtain the identification and classification result comprises the following steps:

inputting the third feature matrix into a convolutional neural network for pre-feature extraction to obtain a fourth feature matrix;

performing dimensionality reduction processing on the fourth feature matrix to obtain a fifth feature matrix with strong time domain continuity;

inputting the fifth feature matrix into the gated loop network to obtain a final classification prediction score;

and inputting the classification prediction score into an Argmax () function to obtain the identification classification result.

According to the method for identifying and classifying the bird song coping with the low signal-to-noise ratio, the step of inputting the third feature matrix into a convolutional neural network for pre-feature extraction to obtain a fourth feature matrix comprises the following steps:

inputting the third feature matrix into a convolutional neural network for convolution operation;

arranging the values obtained after the convolution operation according to the corresponding positions in the third feature matrix to obtain a sixth feature matrix;

activating each element in the sixth feature matrix by using an activation function to obtain an activation value;

and putting the activation values to corresponding positions in the third feature matrix and performing pooling operation to obtain the fourth feature matrix.

According to the method for identifying and classifying the bird song coping with the low signal-to-noise ratio, before the end point detection is performed on the first feature matrix by using a preset detection algorithm to obtain the second feature matrix of the real sounding interval of the bird song, the method further comprises the following steps:

and carrying out noise reduction treatment on the first characteristic matrix by adopting a weighted average threshold algorithm.

In a second aspect, the present invention provides a bird song recognition and classification apparatus coping with a low signal-to-noise ratio, the apparatus comprising:

the extraction module is used for extracting a time sequence signal of the audio to be identified;

the first processing module is used for stacking audio frames frame by frame according to the time sequence of the sampling point values in the time sequence signal to obtain a first characteristic matrix;

the detection module is used for carrying out endpoint detection on the first characteristic matrix by using a preset detection algorithm so as to obtain a second characteristic matrix of the real sounding interval of the bird song;

a second processing module, configured to stack audio frames frame by frame again according to a time sequence for sampling point values within the bird song true sounding interval to obtain a calibrated third feature matrix, where a first dimension of the second feature matrix is aligned with a first dimension of the third feature matrix;

and the identification and classification module is used for extracting the characteristics of the third characteristic matrix by using a convolutional neural network so as to input the third characteristic matrix into the gated cyclic network for processing to obtain an identification and classification result.

In a third aspect, the present invention also provides a storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the method for identifying and classifying bird song coping with low signal-to-noise ratio as described above.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

according to the low signal-to-noise ratio bird song recognition and classification method, the unconverted signal sampling point values are directly used as the feature matrix, compared with the traditional method that the spectrum features are obtained by a series of conversion of the audio, the operation of obtaining the features is simpler, and the calculation resources are saved. Meanwhile, the target characteristics of the bird song can be well detected under the condition of more noise by adopting an endpoint detection algorithm based on frame-level energy, then a two-dimensional convolutional neural network is adopted as a pre-characteristic extractor, and then the characteristics with strong time domain attributes are processed by utilizing a gated cyclic network, so that a good classification effect is achieved under the condition of low signal-to-noise ratio. Therefore, the method for identifying and classifying the bird song which deals with the low signal-to-noise ratio has relatively high accuracy rate of voice endpoint detection and relatively high accuracy of bird song classification.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a flowchart of a low snr bird song recognition and classification method according to an embodiment of the present invention;

FIG. 2 is a functional block diagram of the bird song recognition and classification apparatus for dealing with low SNR according to the present invention;

FIG. 3 is a diagram illustrating the noise reduction effect of the noise reduction process according to the present invention;

fig. 4 is a diagram illustrating the effectiveness of the endpoint detection algorithm provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Referring to fig. 1, a flow chart of a bird song recognition and classification method for handling a low signal-to-noise ratio according to an embodiment of the present invention is provided. The low signal to noise ratio bird song recognition and classification method is mainly used for carrying out bird song recognition and classification on the acquired material audio, and has good recognition efficiency and accuracy.

As shown in fig. 1, the method for identifying and classifying a bird song coping with a low signal-to-noise ratio according to the present embodiment may include the following steps:

s101: extracting a time sequence signal of the audio to be identified, wherein the audio to be identified can be a real-time recorded audio, or an audio file which is pre-recorded and stored in a memory, or of course, the audio to be identified can also be an audio file which is obtained by using an automatic script program from the largest public audio website of birds.

This step can be implemented by extracting the time-series signal y of the audio to be recognized using a library of librosa audio signal extractions in python, where y ═ { x ═₁，x₂，x₃，…，x_NAnd N is the length of the signal sequence. Here, librosa is a python toolkit for audio and music analysis and processing, and can perform time-frequency processing and feature extraction on an audio file.

S102: and stacking the audio frames frame by frame according to the sampling point values in the time sequence signal according to the time sequence to obtain a first feature matrix.

In this step, the specific implementation process may include normalizing the time series signal, that is, stacking audio frames frame by frame according to a time sequence on the sampling point values in the time series signal to obtain the first feature matrix of a fixed size, where each audio frame includes 1024 sampling point values.

The first feature matrix may be denoted as T and may be represented by the following equation:

where, the number of rows of T is 600, the number of columns is 1024, and x_nmIs the sampling point value in the y sequence, and the first feature matrix T obtained after normalization is a matrix with the size of 600 multiplied by 1024. Here, the first feature matrix T is saved in the form of a picture.

S103: and carrying out endpoint detection on the first characteristic matrix by using a preset detection algorithm to obtain a second characteristic matrix of the real sounding interval of the bird song.

In this step, the preset detection algorithm may be an energy detection algorithm, and in the process of implementing this step, the following substeps may be specifically included:

s1031: detecting each first utterance candidate segment in the first feature matrix by using an energy detection algorithm, wherein the detection step is as follows:

s10311: and solving the energy sum of all audio frames in the first characteristic matrix, wherein the energy sum can be calculated as s and satisfies the relation:

wherein s is_nFor the sum of the energies of each audio frame, the relation is satisfied:

s10312: calculating an energy mean value of the audio frame according to the energy sum, wherein d can be calculated, and the energy mean value satisfies the relation: d is s/600.

S10313: comparing the energy sum of each audio frame in the first feature matrix with the energy mean value one by one, specifically, sequentially judging the energy sum s of each audio frame in the first feature matrix_nWhether d is greater than or equal to d.

S10314: and if the energy sum of the current audio frame is larger than the energy average value, judging that the current audio frame is a candidate frame for the bird singing and sounding. It will be appreciated that the current audio frame is only a candidate frame for utterance that is more likely to be a true bird song.

S10315: recording all of the bird song production candidate frames to obtain a candidate frame list, which may be w₀And satisfies the relation:

wherein n is more than or equal to 1_N≤n,n＝600。

S10316: selecting the candidate according to the audio frame order in the first feature matrixSuccessive audio frames in the frame list are stacked to obtain each of the first utterance candidate segments. Here, the first utterance candidate may be counted as_RjAnd satisfies the relation: r_j＝(n₁,n₂,…,n_i),(n₁,n₂,…,n_i)∈w₀And all the stacked first utterance candidates may be chronologically maintained in a stacking process list, which may be w and satisfies the relation: w ═ R₁,R₂,…,R_j) Wherein j represents_w0J is more than or equal to 1 and less than or equal to 300.

S1032: merging and un-merging the adjacent sound production candidate segments in each first sound production candidate segment according to a preset rule to obtain each second sound production candidate segment, wherein the sound production candidate segments in the stacking processing list w are processed by adopting an automatic breakpoint technology, and the processing steps are as follows:

s10321: performing primary processing on the first utterance candidate segments to obtain a primary processing list, wherein the first processing comprises merging first merged utterance candidate segments obtained by corresponding adjacent first utterance candidate segments if the number of discontinuous audio frames between adjacent first utterance candidate segments in each first utterance candidate segment does not exceed a first threshold; otherwise, not merging; the primary processing list comprises primary processing sound production candidate segments, including the first combined sound production candidate segment obtained by combination and the first sound production candidate segment which is not combined; and if the number of the primary processing sounding candidate segments is equal to 1, ending the processing. The first threshold is a1, which may be 15.

Here, the adjacent first utterance candidate segments R are sequentially calculated_j、R_j+1The number of discontinuous audio frames (which may be num) between them, if num ≦ a1, the two adjacent utterance candidates are merged, denoted as R_j＝R_j+R_j+1Otherwise, not merging. The primary processing list obtained by primary processing can be counted as w₁And satisfies the relation: w is a₁＝(R₁,R₂,…,R_i) Wherein i is more than or equal to 1 and less than or equal to j.

S10322: if the number of the primary processing sounding candidate segments is more than 2, performing secondary processing on the primary processing sounding candidate segments to obtain a secondary processing list, wherein the secondary processing comprises the steps of obtaining the minimum value of discontinuous audio frame numbers between adjacent primary processing sounding candidate segments, and if the minimum value is not more than a second threshold value, combining second combined sounding candidate segments obtained by corresponding adjacent primary processing sounding candidate segments; otherwise, not merging; the secondary processing list comprises secondary processing sounding candidate segments, including the second combined sounding candidate segment obtained by combination and the primary processing sounding candidate segment which is not combined; wherein the second threshold is greater than the first threshold; if the number of the secondary processing sounding candidate segments is more than 2, repeating the secondary processing; and if the number of the secondary processing sounding candidate segments is equal to 1, ending the processing. The second threshold is a2, which may take the value 30.

Here, the list w is processed by computing the primary processing₁The number of the middle-stage processing utterance candidate segments (which can be counted as len1) is used as an execution condition: when the number of the primary processing sounding candidate segments is 1 (len1 is 1), ending the processing, and taking the primary processing sounding candidate segments as the second sounding candidate segments; when the number of the first-stage processing sounding candidate segments is more than 2, a first-stage processing list w is calculated₁The number of discontinuous audio frames between all adjacent primary processing sound production candidate segments is counted as n_sAnd satisfies the relation: ns (num1, num2, …, num-1), and then taking the minimum value n thereof_minAnd satisfies the relation: n is_min＝min(n_s)。

If n is_minIf the value is more than 2 × a1, taking the secondary processing sound production candidate segment as the second sound production candidate segment; if n is_minIf num is less than or equal to a2, the two adjacent primary processing sound production candidate segments are merged and expressed as R_i＝R_i+R_i+1Otherwise, not merging. The secondary processing list obtained by secondary processing can be counted as w₂And satisfies the relation: w2 ═ R₁,R₂,…,R_ii) Wherein, 1 is less than or equal to ii and less than or equal to i.

Computing a Secondary processing List w₂The number of the secondary processing utterance candidate segments (which can be counted as len2) is middle, when the number of the secondary processing utterance candidate segments is 1 (len2 is 1), the processing is finished, and the secondary processing utterance candidate segments are used as the second utterance candidate segments; and when the number of the secondary processing sounding candidate segments is more than 2, performing secondary processing again until the number of the secondary processing sounding candidate segments is 1 or 2.

When the number of the primary processing utterance candidate segments is 2, the step in S10323 is performed; when the number of secondary-processing utterance candidate segments is 2, the step in S10323 is also performed.

S10323: if the number of the primary processing sounding candidate segments is equal to 2 or the number of the secondary processing sounding candidate segments is equal to 2, performing three-level processing on the corresponding primary processing sounding candidate segments or the corresponding secondary processing sounding candidate segments to obtain a three-level processing list, wherein the three-level processing comprises combining the corresponding adjacent primary processing sounding candidate segments to obtain a third combined sounding candidate segment if the number of discontinuous audio frames between the adjacent primary processing sounding candidate segments does not exceed a third threshold value; otherwise, not merging; if the number of discontinuous audio frames between the adjacent secondary processing sounding candidate segments does not exceed a third threshold value, combining the corresponding adjacent secondary processing sounding candidate segments to obtain a fourth combined sounding candidate segment; otherwise, not merging; wherein the third threshold is greater than the second threshold. The third threshold is a3, and preferably 40.

Here, if the number of the one-stage processing utterance candidate segments is equal to 2, the adjacent one-stage processing utterance candidate segments R are sequentially calculated_i、R_i+1The discontinuous audio frame number num between the two adjacent first-level processing sound production candidate segments is merged and is represented as R if num is less than or equal to a3_i＝R_i+R_i+1Otherwise, the combination is not carried out, and the processing is finished. The three-level processing list obtained by the three-level processing can be calculated as w₃And satisfies the relation: w3 ═ (R1, R2, …, Riii), where 1. ltoreq. iii. ltoreq.i.

If the second level processing is soundingIf the number of the candidate segments is equal to 2, the adjacent two-stage processing sounding candidate segments R are calculated in sequence_ii、R_ii+1And if num is less than or equal to a3, merging two adjacent secondary processing sound production candidate segments, namely Rii-Rii + Rii +1, otherwise, not merging, and then ending the processing. Similarly, the list of three-stage treatments obtained by the three-stage treatment can be designated as w₃And satisfies the relation: w3 ═ (R1, R2, …, Riii), where 1. ltoreq. iii. ltoreq.i.

S10324: and acquiring the primary processing sound production candidate section obtained after finishing the processing, the secondary processing sound production candidate section obtained after finishing the processing or the sound production candidate section obtained through the tertiary processing as the second sound production candidate section.

Here, when the number of the utterance candidate segments for the first-stage processing is 1, the list w for the third-stage processing may be made₃First order processing list w₁(ii) a Similarly, when the number of the second-stage processing sounding candidate segments is 1, the third-stage processing list w can be ordered₃First order processing list w₂。

S1033: and selecting the second sounding candidate segment with the largest audio frame number in the second sounding candidate segments as the actual sounding interval of the bird song.

According to a three-level processing list w₃Selecting the segment R with the largest audio frame number_iii＝max(w₃) And taking the first value and the last value in the sounding candidate segment as end values to respectively represent the starting frame and the ending frame of the bird song. Where the first value may be counted as C and the last value may be counted as D.

Before this step, the first feature matrix may be further subjected to noise reduction, where all the element values x in the first feature matrix may be calculated by using a weighted average threshold method_nmSum of (a) is calculated as

Then, an average value ave is obtained, wherein the average value ave satisfies the following relation: ave sum/D, where D represents the magnitude of the first feature matrixSmall and equal to 600 × 1024.

After the average value ave is calculated, weighting is performed to obtain an average threshold Q, which satisfies the following relation: q is (1+ α) × ave, where α is a weighting coefficient, and is 0.15.

Then, performing noise reduction processing on the first feature matrix T according to the average threshold Q, specifically, sequentially processing elements in the first feature matrix T according to the average threshold Q, and the processing manner is:

i.e. at the current element value x_nmWhen less than Q, the current element value x is set_nmSet to 0; at the current element value x_nmWhen greater than or equal to Q, the current element value x is retained_nm。

The pre-processing feature matrix, designated as T1, is obtained by the following method:

s104: and stacking the audio frames frame by frame according to the time sequence again for the sampling point values in the real sounding interval to obtain a calibrated third feature matrix, wherein the first dimension of the second feature matrix is aligned with the first dimension of the third feature matrix.

The values between the C-th audio frame and the D-th audio frame are extracted from the first feature matrix T according to the first value C and the last value D, and a second feature matrix T2 representing the real sound segment of the bird song is obtained, which can be represented by the following formula:

wherein, C is more than or equal to 1 and less than or equal to D is less than or equal to 600.

And then, normalizing the second feature matrix again to complete matrix calibration, namely stacking the audio frames frame by frame according to the time sequence again for the sampling point values in the real sounding interval, so as to obtain a third feature matrix, wherein the first dimension of the second feature matrix is aligned with the first dimension of the third feature matrix. Here, the third feature matrix is T3, which can be represented by the following formula:

it will be appreciated that the third feature matrix T3 is a 600 x 1024 matrix corresponding in size to the first feature matrix T.

It is understood that the bird song data obtained through the processing in step S105 may be trained as an experimental sample, the training process includes forward propagation and backward propagation, the model parameters are updated according to the loss, and the training is finished and the model is saved when the loss value is small and tends to be stable, so as to obtain a trained model.

S106: and performing feature extraction on the third feature matrix by using a convolutional neural network so as to input the third feature matrix into a gated cyclic network for processing to obtain an identification classification result.

In this step, the size of the third feature matrix T3 is n₀×m₀Wherein n is₀＝600,m₀1024. In the process of implementing this step, the following sub-steps may be specifically included:

s1061: inputting the third feature matrix into a convolutional neural network for pre-feature extraction to obtain a fourth feature matrix, which may include the following steps:

s10611: inputting the third feature matrix into a convolutional neural network for convolution operation, specifically, inputting the third feature matrix T3 into a three-layer two-dimensional convolutional neural network, where the convolution operation performed by each layer of convolutional neural network satisfies the following relation:

wherein i ═ 1,2,3]Representing a current network hierarchy; [ f ] of_h,f_w]Denotes the size of the convolution filter with a sliding step of s on the vertical and horizontal axes_h,s_w]P represents the fill size for the matrix dimension; n is_ii、m_iiAnd represents the dimension size of the third feature matrix T3 obtained after the convolution operation. Here, the convolution operation will be of size [ f ]_h,f_w]The weight matrix z of the filter of (2) can be represented by:

s10612: and arranging the values obtained after the convolution operation according to the corresponding positions in the third feature matrix to obtain a sixth feature matrix. Specifically, the region value X of the feature matrix introduced from the previous layer is multiplied in turn to obtain a value X_ij. Wherein, the region value X satisfies the following relation:

and the value x obtained_ijSatisfies the following relation:

the step size of the filter is s for each shift in the feature matrix_h,s_w]Then arranging the obtained values according to the relative positions of the values in the original matrix to obtain a feature matrix T after convolution_jAs the sixth feature matrix, it can be represented by the following equation:

s10613: activating each element in the sixth feature matrix by using an activation function to obtain an activation value, specifically, activating the feature matrix T by using the activation function_jEach element in the table is activated to obtain a corresponding activation value, and then the activation value is put back to its position in the original matrix, where g (x) can be used to represent the LEAKRELU () activation function, and the feature matrix T_jEach element in (b) is a value obtained by activation processing of an activation function, wherein the activation function g (x) satisfies the following relation:

s10614: and putting the activation values to corresponding positions in the third feature matrix and performing pooling operation to obtain the fourth feature matrix. Specifically, a pooling operation is performed to obtain a feature matrix, the pooling operation satisfying the following relation:

wherein n is_i,m_iFor the dimension of the feature matrix obtained after pooling operation, p_cRepresenting the fill size of the feature matrix before pooling, the size of the pooling kernel being f_hc,f_wc]The step length of sliding on the vertical axis and the horizontal axis is s_hc,s_wc]. Here, the maximum pooling type is used, i.e. the size of the pooling kernel is a region, and accordingly, in the feature matrix T_jThe areas with the same size are selected,satisfies the following relation:

then, the maximum value of the elements in the matrix is obtained as the representative element of the area, and the characteristic matrix T after the pooling operation is obtained_cAs the fourth feature matrix T4, the representative elements satisfy the relation:

feature matrix T_cSatisfies the following relation:

by integrating S10611 to S10614, the three-layer two-dimensional convolutional neural network is used to extract high-level features, each convolutional operation includes a maximum pooling operation, and a feature matrix with a size of 8 × 17 × 10 is finally extracted as a final high-level feature, and the input and output sizes and parameter settings of each layer can be shown in the following table:

here, using the RMSProp gradient descent optimization algorithm as the gradient optimization method of the convolutional network, the momentum parameter is set to 0.7, the initial learning rate lr is 0.001, and the average accuracy (MAP) is used as the classification prediction effect index, while the cross entropy (cross entropy) loss function is selected as the loss measurement function of the network.

The loss function satisfies the following relation:

loss＝-(y×log(h(x))+(1-y)×log(1-h(x)))，

wherein y represents the case that the prediction result of the current sample is true or false, the value is 0 or 1, h (x) is the fraction of the model predicted by the current sample, and loss is the loss value.

S1062: and performing dimensionality reduction on the fourth feature matrix to obtain a fifth feature matrix with strong time domain continuity, wherein the fourth feature matrix T4 is reduced to be a 17 × 80 fifth feature matrix T5, which can support a subsequent gated round-robin network (GRU).

Here, after the convolution operation and the pooling operation of step S1061, the fifth feature matrix of 17 × 80 with strong time-domain continuity is obtained by concatenating the obtained fourth feature matrix (corresponding to the final high-level features) in the frequency dimension.

S1063: inputting the fifth feature matrix into the gated loop network to obtain a final classification prediction score, wherein the output of the last layer of the gated loop network is used as the final classification prediction score by utilizing the characteristic that the gated loop network is good at processing time series data, and the classification prediction score can be counted as T6.

S1064: inputting the classification prediction score into an Argmax () function to obtain the recognition classification result, specifically, inputting the classification prediction score into the Argmax () function to obtain a one-hot vector of the final classification result, wherein the one-hot vector is expressed as a one-dimensional vector such as [1,0,0,0], and the vector representation model presets the current experimental sample as the first bird.

Furthermore, the noise reduction is carried out by adopting a weighted average threshold method, most of non-target characteristics can be removed, and a better noise reduction effect is achieved.

Referring to fig. 2, a functional block diagram of the bird song recognition and classification apparatus for dealing with low signal-to-noise ratio provided by the present invention is shown. The bird song recognition and classification device 100 for dealing with the low signal-to-noise ratio comprises an extraction module 11, a first processing module 12, a detection module 13, a second processing module 14 and a recognition and classification module 15, and all processes of the bird song recognition and classification method for dealing with the low signal-to-noise ratio are realized through the cooperation of all modules, so that the corresponding effect is realized. Wherein:

and the extraction module 11 is configured to extract a time series signal of the audio to be identified.

The first processing module 12 is configured to stack audio frames frame by frame according to a time sequence for the sample point values in the time series signal to obtain a first feature matrix.

And the detection module 13 is configured to perform endpoint detection on the first feature matrix by using a preset detection algorithm to obtain a second feature matrix of the true sounding interval of the bird song.

A second processing module 14, configured to stack audio frames frame by frame again according to a time sequence for sampling point values in the bird song true sounding interval, so as to obtain a calibrated third feature matrix, where a first dimension of the second feature matrix is aligned with a first dimension of the third feature matrix.

And the recognition and classification module 15 is configured to perform feature extraction on the third feature matrix by using a convolutional neural network so as to input the third feature matrix into a gated cyclic network for processing to obtain a recognition and classification result.

It is to be understood that, in addition to the above modules, the apparatus 100 for identifying and classifying a bird song that deals with a low signal-to-noise ratio may further include other modules, such as a noise reduction processing module, and the like, wherein the noise reduction processing module is configured to perform noise reduction processing on the first feature matrix by using a weighted average threshold algorithm.

Furthermore, the present invention provides a computer arrangement comprising a processor for implementing the steps of the above-described method for identifying and classifying birdsong coping with low signal-to-noise ratio when executing a computer program stored in a memory.

The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like which is the control center for the computer device and which connects the various parts of the overall computer device using various interfaces and lines.

The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the mobile phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

Furthermore, the present invention also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in the aforementioned bird song recognition classification method coping with a low signal-to-noise ratio.

Referring to fig. 3, a noise reduction effect graph after noise reduction processing in the present invention is shown, and comparing the effect graphs before noise reduction and after noise reduction, it can be seen that the background noise in the picture can be better removed by using the weighted average threshold method.

Referring to fig. 4, it is a diagram of the effectiveness proof of the endpoint detection algorithm provided by the present invention, where the left side is a picture without endpoint detection processing, the middle is a bird song picture divided after endpoint detection, and the right side is a bird song picture after renormalization.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for identifying and classifying bird song coping with low signal-to-noise ratio, the method comprising:

extracting a time sequence signal of the audio to be identified;

2. The method for identifying and classifying the bird song coping with the low signal-to-noise ratio as claimed in claim 1, wherein the performing the end point detection on the first feature matrix by using the preset detection algorithm to obtain the second feature matrix of the true sounding interval comprises:

3. The method of claim 2, wherein the detecting each first acoustic candidate segment in the first feature matrix using an energy detection algorithm comprises:

solving the energy sum of all audio frames in the first characteristic matrix;

4. The method according to claim 2, wherein the merging and non-merging the neighboring first utterance candidates for each of the first utterance candidates to obtain each of the second utterance candidates comprises:

5. The method according to claim 4, wherein the merging and non-merging adjacent first utterance candidates in each of the first utterance candidates to obtain each of second utterance candidates according to a preset rule further comprises:

6. The method for identifying and classifying the bird song coping with the low signal-to-noise ratio according to claim 1, wherein the extracting the features of the third feature matrix by using the convolutional neural network to input the feature matrix into the gated cyclic network for processing to obtain the identification and classification result comprises:

7. The method of claim 6, wherein the step of inputting the third feature matrix into a convolutional neural network for pre-feature extraction to obtain a fourth feature matrix comprises:

8. The method for identifying and classifying the bird song coping with the low signal-to-noise ratio according to any one of claims 1 to 6, wherein before the performing the end point detection on the first feature matrix by using a preset detection algorithm to obtain the second feature matrix of the true sounding interval, the method further comprises:

9. A bird song recognition and classification apparatus that copes with a low signal-to-noise ratio, the apparatus comprising:

10. A storage medium having stored thereon a computer program, wherein the computer program, when being executed by a processor, implements the steps of the method for identifying and classifying birdsong coping with low signal-to-noise ratio according to any one of claims 1 to 8.