CN118173102A - Bird voiceprint recognition method in complex scenes - Google Patents

Bird voiceprint recognition method in complex scenes

Info

Publication number
CN118173102A
Authority
CN
China
Prior art keywords
frame
audio
bird
bird song
potential
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410600097.3A
Other languages
Chinese (zh)
Inventor
高树会 (Gao Shuhui)
许志方 (Xu Zhifang)
杨兆宇 (Yang Zhaoyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bainiao Data Technology Beijing Co ltd
Original Assignee
Bainiao Data Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bainiao Data Technology Beijing Co ltd
Priority to CN202410600097.3A
Publication of CN118173102A
Legal status: Pending


Abstract

The application relates to the technical field of voice recognition, and in particular to a bird voiceprint recognition method in complex scenes, comprising the following steps: collecting spectrograms and energy; acquiring an audio mutation index for each sampling point in each frame of environmental sound; acquiring a suspected bird song start-stop point sequence for each frame of environmental sound; acquiring an order sequence for each frame; acquiring an audio mutation repetition index for each frame; acquiring the potential and non-potential bird song audio frames; acquiring the formants of each frame; acquiring the potential bird song formant sharpness coefficient of each potential bird song audio frame; acquiring the updated weight coefficient of the loss function; and identifying bird voiceprints with a neural network model. The method improves the accuracy of bird song signal recognition by a bird voiceprint recognition model in complex scenes.

Description

Bird voiceprint recognition method in complex scenes
Technical Field
The application relates to the technical field of voice recognition, and in particular to a bird voiceprint recognition method in complex scenes.
Background
An intelligent wetland monitoring system mainly monitors the species and numbers present in a wetland environment, identifies human disturbance of that environment, and monitors water quality. Birds are the most representative group among wetland wild animals, are an important component of the wetland ecosystem, and reflect changes in the wetland environment sensitively and profoundly, so identifying species and monitoring bird numbers is critical. Conventional bird monitoring relies mainly on video and audio; for forest birds that are secretive by nature, audio monitoring is the primary means. Sound-receiving equipment is deployed in the areas of a protection zone where bird activity is to be monitored; sound is received in real time and processed at the edge, bird song is identified, and the recognition results and original sound files are returned to the server for the protection zone's staff to analyze bird activity. The bird voiceprint recognition effect is therefore very important and directly affects the accuracy of bird monitoring.
In a field wetland scene, however, background sounds are complex, bird song is affected by environmental sounds, and data annotation costs are high. The usual audio classification methods classify fixed sound categories and give relatively little consideration to specific background sounds and sound levels. The prior art adopts generic data enhancement and modeling methods and therefore performs poorly on specific background noise.
Disclosure of Invention
To solve the above technical problems, the application provides a bird voiceprint recognition method in complex scenes.
The bird voiceprint recognition method in complex scenes adopts the following technical scheme:
an embodiment of the application provides a bird voiceprint recognition method in complex scenes, comprising the following steps:
collecting the spectrogram of each frame of environmental sound and the energy of each sampling point in a training dataset;
acquiring an audio mutation index for each sampling point in each frame of environmental sound according to the energy; acquiring a suspected bird song start-stop point sequence for each frame of environmental sound according to the audio mutation indexes;
acquiring an order sequence for each frame of environmental sound according to the suspected bird song start-stop point sequence; acquiring the audio mutation repetition index of each frame of environmental sound according to the relation between the suspected bird song start-stop point sequence and the order sequence; acquiring the potential bird song audio frames and non-potential bird song audio frames according to the audio mutation repetition indexes;
obtaining the formants of each frame of environmental sound according to the spectrogram; acquiring the potential bird song formant sharpness coefficient of each potential bird song audio frame according to its formants; acquiring the updated weight coefficient of the loss function according to the potential bird song formant sharpness coefficients;
and identifying bird voiceprints using a neural network model according to the updated weight coefficient.
Further, the acquiring an audio mutation index for each sampling point in each frame of environmental sound according to the energy includes:
for the energy of each frame of environmental sound, sliding a window of preset size over it; for the energy of the window's central sampling point, computing the absolute difference between the energy of the central sampling point and that of the adjacent next sampling point as the first absolute difference, and the absolute difference between the energy of the central sampling point and that of the adjacent previous sampling point as the second absolute difference; computing the sum of the second absolute difference and a preset adjusting parameter; computing the ratio of the first absolute difference to that sum; and evaluating the base-2 logarithm of the ratio;
computing the standard deviation of the energies of all sampling points in the sliding window, and taking the product of the absolute value of the logarithm and the standard deviation as the audio mutation index of the central sampling point of the sliding window in each frame of environmental sound.
Further, the acquiring a suspected bird song start-stop point sequence for each frame of environmental sound according to the audio mutation indexes includes:
for each frame of environmental sound, arranging the audio mutation indexes of all sampling points in descending order, taking a preset number of the largest audio mutation indexes, and arranging them in the order of their corresponding sampling points within the frame to form the frame's suspected bird song start-stop point sequence.
Further, the acquiring an order sequence for each frame of environmental sound according to the suspected bird song start-stop point sequence includes:
forming the order sequence of each frame of environmental sound from the rank values that the audio mutation indexes of the suspected bird song start-stop point sequence take when sorted from small to large.
Further, the acquiring the audio mutation repetition index of each frame of environmental sound according to the relation between the suspected bird song start-stop point sequence and the order sequence includes:
for each frame of environmental sound, computing the Pearson correlation coefficient between the suspected bird song start-stop point sequence and the order sequence, computing the Shannon entropy of all audio mutation indexes in the suspected bird song start-stop point sequence, computing the sum of the Shannon entropy and a preset adjusting parameter, and taking the ratio of the absolute value of the Pearson correlation coefficient to that sum as the frame's audio mutation repetition index.
Further, the acquiring the potential bird song audio frames and non-potential bird song audio frames according to the audio mutation repetition indexes includes:
taking the audio mutation repetition index of each frame of environmental sound as the input of a clustering algorithm to obtain two clusters, marking each frame of environmental sound in the cluster with the larger audio mutation repetition index as a potential bird song audio frame, and each frame in the other cluster as a non-potential bird song audio frame.
Further, the obtaining the formants of each frame of environmental sound according to the spectrogram includes: taking the spectrogram of each frame of environmental sound as the input of an interpolation method and acquiring the center frequency and bandwidth of each formant of the frame.
Further, the acquiring the potential bird song formant sharpness coefficient of each potential bird song audio frame according to its formants includes:
for each potential bird song audio frame, computing the kurtosis of each of its formants over the formant's bandwidth;
the potential bird song formant sharpness coefficient is expressed as:

$$Q_l = \frac{1}{n_l} \sum_{j=1}^{n_l} \frac{K_j \cdot \log_2(f_j \cdot A_j)}{B_j}$$

where $Q_l$ denotes the potential bird song formant sharpness coefficient of the $l$-th potential bird song audio frame, $n_l$ the number of formants contained in the $l$-th frame, $K_j$ the kurtosis of the $j$-th formant, $f_j$ the center frequency of the $j$-th formant in the $l$-th frame, $A_j$ the amplitude corresponding to the center frequency of the $j$-th formant, $B_j$ the bandwidth of the $j$-th formant, and $\log_2$ the base-2 logarithm.
Further, the formula of the updated weight coefficient is:

$$\alpha = \alpha_0 + k \cdot e^{-\frac{K_1}{K_2} \cdot \bar{Q}}$$

where $\alpha$ denotes the updated weight coefficient; $\alpha_0$ is a preset basic parameter and $k$ a preset control coefficient; $e^{(\cdot)}$ is the exponential function with natural base; $K_1$ is the number of potential bird song frames in the training dataset, $K_2$ the number of non-potential bird song frames, and $\bar{Q}$ the mean of the potential bird song formant sharpness coefficients of all potential bird song audio frames.
Further, the identifying bird voiceprints using a neural network model according to the updated weight coefficient includes:
substituting the updated weight coefficient into the loss function, using it as the loss function of the neural network model, taking the mel spectrogram of the environmental sound in the training dataset as the model input, and obtaining the sound category as the model output.
The application has at least the following beneficial effects:
The method analyzes the abrupt start-stop and rapid amplitude-change characteristics of bird song in the energy of environmental sound to compute an audio mutation index, obtains a suspected bird song start-stop point sequence from that index, and judges whether the repetition similarity typical of bird song is evident by computing an audio mutation repetition index from the stability of the sequence. Potential bird song audio frames are extracted from the audio mutation repetition indexes; the formant characteristics in each such frame's spectrogram are then analyzed, and a potential bird song formant sharpness coefficient is constructed to measure how strongly the bird song signal is disturbed. The weight parameter of the loss function is computed from the number of potential bird song audio frames and the potential bird song formant sharpness coefficients. The benefit is that the relative influence between background sounds and bird song in training samples collected in complex scenes is taken into account, improving the training of the CNN-based bird voiceprint recognition model and thereby the accuracy of bird song signal recognition in complex scenes.
Drawings
To illustrate the embodiments of the application or the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below are obviously only some embodiments of the application; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of the bird voiceprint recognition method in complex scenes provided by the application;
FIG. 2 is a flow chart for obtaining the updated weight coefficient.
Detailed Description
To further describe the technical means adopted by the application to achieve its intended aim and their effects, the specific implementation, structure, characteristics and effects of the bird voiceprint recognition method in complex scenes are described in detail below with reference to the accompanying drawings and preferred embodiments. In the following description, different instances of "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
The following describes the specific scheme of the bird voiceprint recognition method in complex scenes provided by the application with reference to the accompanying drawings.
The application provides a bird voiceprint recognition method in complex scenes; referring to fig. 1, the method comprises the following steps:
Step S001, acquiring the environmental sound and preprocessing it.
First, environmental sound is acquired; in this embodiment, audio data of the environmental sound is collected by deploying shotgun microphones in a wetland protection area.
A digital camera is used to collect video of the surroundings in the wetland protection area, and an expert confirms all categories of environmental sound that may occur based on the audio and surrounding video returned by the voiceprint equipment deployed in the protection area. Environmental sounds that can be captured are collected continuously with the audio equipment; those that cannot are collected from public datasets on the Internet, the public dataset chosen in this embodiment being the AudioSet dataset.
This collection is updated once every quarter, and an entire iteration period lasts 2 years. Taking one iteration period as an example, the types of the sound signals in the collected environmental audio are labeled according to the voiceprint recognition equipment deployed in the protection area. Major categories are labeled first: for example, wind sound, water flow sound, insect sound and bird song in the environmental sound are labeled 1, 2, 3 and 4 respectively. A detail label is then applied: for example, bird song - cuckoo is labeled (4, A).
The collected audio data of the environmental sound is divided into a training dataset and a test dataset for the neural network at a ratio of 8:2.
The environmental sound in the acquired training dataset is preprocessed. First, the training dataset is sampled, with the sampling frequency set to 8 kHz in this embodiment, to obtain the energy of each sampling point in each frame of environmental sound. Pre-emphasis is then applied to improve the signal-to-noise ratio: the environmental sound is filtered by a high-pass filter whose coefficient is 0.95 in this embodiment. Because the frequency of bird audio generally changes over time, a stable frequency contour is difficult to obtain; this embodiment therefore sets the frame length to 30 ms and applies framing and windowing to the environmental sound. A fast Fourier transform (FFT) is then applied within each frame of environmental sound to obtain its spectrogram. The FFT is a well-known technique and its details are not repeated.
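As a concrete illustration of this preprocessing chain, a minimal NumPy sketch is given below; the sampling rate, pre-emphasis coefficient, and frame length follow the values stated above, while the function name, the non-overlapping framing, and the Hamming window are assumptions rather than details fixed by the embodiment.

```python
import numpy as np

def preprocess(signal, sr=8000, pre_emph=0.95, frame_ms=30):
    """Pre-emphasize, frame, window, and FFT an environmental-sound signal.

    A sketch of the step S001 preprocessing; names are illustrative.
    """
    # Pre-emphasis: first-order high-pass filter to raise the SNR
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    # Framing: 30 ms frames (240 samples at 8 kHz), here without overlap
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(emphasized) // frame_len
    frames = emphasized[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Windowing (Hamming) followed by FFT to obtain each frame's spectrum
    windowed = frames * np.hamming(frame_len)
    spectra = np.abs(np.fft.rfft(windowed, axis=1))

    # Per-sample energy within each frame (squared amplitude)
    energy = windowed ** 2
    return energy, spectra
```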
The energy of each sampling point of each frame of environmental sound, and its spectrogram, are thus obtained.
Step S002, designing the loss function calculation scheme of the recognition model's classification layer: constructing an audio mutation index based on the waveform amplitude-variation characteristics of bird song in the time domain; analyzing the repetition similarity of bird song start-stop points and constructing an audio mutation repetition index; analyzing the formant characteristics of each frame's audio spectrogram and constructing a potential bird song formant sharpness coefficient; and computing the updated weight coefficient of the loss function.
The environmental sounds of the wetland protection zone mainly comprise wind, rain, water flow, frog sounds, bird song, other animal sounds and the like. To identify bird voiceprints, bird song must first be picked out from the complex environmental sound. This embodiment performs bird voiceprint recognition with a neural network model, namely a CNN, whose loss function is set as a cross entropy loss.
Specifically, the loss computed at the classification layer of the CNN model is divided into two parts: a major-class cross entropy loss, denoted $Loss_{large}$, and a subclass cross entropy loss, denoted $Loss_{small}$. Weights are assigned to the two, and the loss is calculated as:

$$Loss = (1 - \alpha) \cdot Loss_{large} + \alpha \cdot Loss_{small}$$

where $Loss$ is the loss calculated by the classification layer, $Loss_{large}$ the major-class cross entropy loss, $Loss_{small}$ the subclass cross entropy loss, and $\alpha$ the weight coefficient. The major-class cross entropy loss mainly distinguishes bird song from the other sounds in the environment, while the subclass cross entropy loss distinguishes specific song categories within bird song; the setting of the weight coefficient $\alpha$ during model training therefore has a crucial influence on the recognition result.
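The two-part loss might be sketched in PyTorch as follows, under the reconstruction of the weighting above; the two-head layout and tensor names are assumptions, not details fixed by the application.

```python
import torch.nn.functional as F

def classification_loss(logits_large, logits_small, y_large, y_small, alpha):
    """Loss = (1 - alpha) * Loss_large + alpha * Loss_small.

    logits_large: scores over the major classes (wind, water, insect, bird song, ...).
    logits_small: scores over the fine-grained song categories.
    alpha: weight coefficient from step S002.
    """
    loss_large = F.cross_entropy(logits_large, y_large)  # bird song vs. other sounds
    loss_small = F.cross_entropy(logits_small, y_small)  # species within bird song
    return (1 - alpha) * loss_large + alpha * loss_small
```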
In the neural network training dataset obtained in step S001, the amplitude and loudness of bird song fluctuate in the time domain more than other animal sounds, wind and water sounds: the start and end of the waveform are usually abrupt, the amplitude changes quickly, and the bird song has a repeating, similar sound pattern. Other environmental sounds, such as wind and rain, usually appear as continuous low-frequency noise with some randomness and irregularity. The likelihood that bird song is present is therefore analyzed through the waveform characteristics of the environmental sound in the time domain.
Denote the energy of the $i$-th sampling point of the $t$-th frame of environmental sound as $E_i$. A sliding window is moved from left to right over the frame's energy with step 1, the center point of the window being $i$. An audio mutation index is first constructed as:

$$M_i = \left| \log_2 \frac{\left| E_{i+1} - E_i \right|}{\left| E_i - E_{i-1} \right| + \varepsilon} \right| \cdot \sigma_i$$

where $M_i$ denotes the audio mutation index of the $i$-th sampling point in the $t$-th frame of environmental sound, $\log_2$ the base-2 logarithm, $E_{i+1}$ the energy of the $(i+1)$-th sampling point in the frame, $E_i$ and $E_{i-1}$ the energies of the $i$-th and $(i-1)$-th sampling points, $\varepsilon$ an adjusting parameter (0.1 in this embodiment) that prevents the denominator from being 0, and $\sigma_i$ the standard deviation of all energies within the $i$-th sliding window.
If the energy of the environmental sound varies more sharply around the $i$-th point, the difference between adjacent amplitudes is larger and the ratio $\left| E_{i+1} - E_i \right| / (\left| E_i - E_{i-1} \right| + \varepsilon)$ lies farther from 1; at the same time, the more pronounced the local fluctuation, the larger $\sigma_i$. The larger the audio mutation index $M_i$, the more likely the point belongs to the start or end of a bird song segment.
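Under these definitions, the audio mutation index admits a direct NumPy sketch; the window size and the tiny numerator floor are assumptions (the embodiment regularizes only the denominator and leaves the window size as a preset).

```python
import numpy as np

def mutation_index(energy, win=11, eps=0.1):
    """Audio mutation index M_i for each sample of one frame's energy.

    M_i = |log2(|E_{i+1}-E_i| / (|E_i-E_{i-1}| + eps))| * std(window around i).
    """
    m = np.zeros_like(energy, dtype=float)
    half = win // 2
    for i in range(1, len(energy) - 1):
        # Tiny floor avoids log2(0); the embodiment regularizes only the denominator
        num = abs(energy[i + 1] - energy[i]) + 1e-12
        den = abs(energy[i] - energy[i - 1]) + eps
        lo, hi = max(0, i - half), min(len(energy), i + half + 1)
        m[i] = abs(np.log2(num / den)) * np.std(energy[lo:hi])
    return m
```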
The audio mutation indexes of all sampling points in the $t$-th frame of environmental sound can thus be calculated. Arranging them in descending order and taking the top 10% of values, then ordering those values by their sampling points within the $t$-th frame, yields the suspected bird song start-stop point sequence of the $t$-th frame, denoted $S_t$.
The sequence of rank values that the audio mutation indexes of $S_t$ take when sorted from small to large within the frame is named the order sequence of the $t$-th frame of environmental sound, denoted $O_t$. Based on the short-term repetition similarity of bird song, an audio mutation repetition index is constructed as:

$$R_t = \frac{\left| \rho(S_t, O_t) \right|}{H(S_t) + \varepsilon}$$

where $R_t$ denotes the audio mutation repetition index of the $t$-th frame of environmental sound, $S_t$ the suspected bird song start-stop point sequence of the frame, $O_t$ the order sequence of the frame, $\rho(S_t, O_t)$ the Pearson correlation coefficient between the sequences $S_t$ and $O_t$, $H(S_t)$ the Shannon entropy of the sequence $S_t$, and $\varepsilon$ an adjusting parameter (0.1 in this embodiment) that prevents the denominator from being 0. The calculation of the Pearson correlation coefficient and Shannon entropy is well known and is not detailed here.
The larger $\left| \rho(S_t, O_t) \right|$, the more clearly regular the distribution of suspected bird song start-stop points; the smaller $H(S_t)$, the more stable the audio amplitude at the suspected start-stop points. Accordingly, the larger $R_t$, the more pronounced the repetition similarity of bird song within the $t$-th frame.
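A sketch of the repetition index under this reading follows; SciPy's pearsonr and a histogram-based Shannon entropy stand in for details the embodiment leaves unspecified.

```python
import numpy as np
from scipy.stats import pearsonr

def repetition_index(m, top_frac=0.10, eps=0.1):
    """Audio mutation repetition index R_t of one frame.

    m: audio mutation indexes of all samples in the frame.
    """
    k = max(2, int(len(m) * top_frac))
    idx = np.sort(np.argsort(m)[-k:])     # top-10% samples, restored to time order
    s = m[idx]                            # suspected start-stop point sequence S_t
    o = np.argsort(np.argsort(s))         # rank (order) sequence O_t

    rho, _ = pearsonr(s, o)               # regularity of the start-stop points
    hist, _ = np.histogram(s, bins=10)    # Shannon entropy over binned index values
    p = hist[hist > 0] / hist.sum()
    h = -(p * np.log2(p)).sum()
    return abs(rho) / (h + eps)
```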
Furthermore, to analyze the degree to which the bird song in the collected environmental sound is disturbed, the frames in which bird song is present are extracted. This embodiment clusters the environmental sound with the K-means algorithm: the input is the audio mutation repetition index of each frame in the neural network training dataset, the number of clusters is set to 2, and the output is the two clusters and their cluster centers.
The audio mutation repetition indexes of the two cluster centers are compared: all frames in the cluster with the larger value are marked as the potential bird song audio frames, their number being denoted K1, and all frames in the cluster with the smaller center value are marked as the non-potential bird song audio frames, their number being denoted K2. The K-means clustering algorithm is a known technique and is not detailed in this embodiment.
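This frame partition can be sketched with scikit-learn's KMeans; n_clusters=2 follows the embodiment, while the remaining parameters and names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_frames(rep_indexes):
    """Cluster per-frame repetition indexes into potential / non-potential frames."""
    x = np.asarray(rep_indexes, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x)
    bird_cluster = int(np.argmax(km.cluster_centers_.ravel()))  # larger-center cluster
    potential = km.labels_ == bird_cluster   # potential bird song audio frames (K1)
    return potential, ~potential             # non-potential frames (K2)
```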
Within a potential bird song audio frame, bird song typically appears as a high-frequency sound signal, visible as high-frequency components in the spectrogram. The significance of these high-frequency signals increases with the degree of audio mutation, making them easier to identify in the spectrogram. Moreover, when the high-frequency formants in the spectrogram show a sharper morphology, the bird song is less disturbed by other sound signals in the environment, preserving its clear characteristics and high recognizability.
In this embodiment, the formants in the spectrogram are calculated by LPC interpolation: the input is the spectrogram within the $l$-th frame, and the output is the center frequency and bandwidth of each formant in the spectrogram. LPC interpolation is a well-known technique and its details are not repeated.
Denote the center frequency of the $j$-th formant within the $l$-th potential bird song audio frame as $f_j$, the amplitude corresponding to that center frequency as $A_j$, and the bandwidth of the $j$-th formant as $B_j$; the kurtosis of the $j$-th formant over its bandwidth is calculated and denoted $K_j$. The calculation of kurtosis is a known technique and is not described here.
Formants with greater bandwidth may indicate that the environmental sound itself is complex, reducing formant sharpness; that is, the frame may contain several different sounds or other environmental sounds at once. The potential bird song formant sharpness coefficient is therefore constructed as:

$$Q_l = \frac{1}{n_l} \sum_{j=1}^{n_l} \frac{K_j \cdot \log_2(f_j \cdot A_j)}{B_j}$$

where $Q_l$ denotes the potential bird song formant sharpness coefficient within the $l$-th potential bird song audio frame, $n_l$ the number of formants contained in the $l$-th frame, $K_j$ the kurtosis of the $j$-th formant, $f_j$ the center frequency of the $j$-th formant in the $l$-th frame, $A_j$ the amplitude corresponding to the center frequency of the $j$-th formant, $B_j$ the bandwidth of the $j$-th formant, and $\log_2$ the base-2 logarithm.
The larger $K_j$, the greater the degree of aggregation of the $j$-th formant; the higher $f_j$, the greater the likelihood that the frame contains bird song; and the larger $A_j$, the stronger the frequency response of the bird song in the environmental sound. Because the overlapping of various sounds widens formant bandwidth, a smaller $B_j$ indicates lower complexity of the environmental sound. The larger the calculated $Q_l$, the less the bird song signal is disturbed.
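Given the per-formant parameters produced by the LPC step, the sharpness coefficient as reconstructed above could be computed as follows; the 1/n averaging follows that reconstruction of the garbled formula and should be read as an assumption.

```python
import numpy as np

def sharpness_coefficient(kurtosis, center_freq, amplitude, bandwidth):
    """Q_l = (1/n) * sum_j K_j * log2(f_j * A_j) / B_j  (reconstructed form).

    All arguments are per-formant arrays for one potential bird song frame;
    center frequencies and amplitudes are assumed positive so log2 is defined.
    """
    k = np.asarray(kurtosis, dtype=float)
    f = np.asarray(center_freq, dtype=float)
    a = np.asarray(amplitude, dtype=float)
    b = np.asarray(bandwidth, dtype=float)
    return float(np.mean(k * np.log2(f * a) / b))
```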
The potential bird song formant sharpness coefficients of all potential bird song audio frames of the training dataset's environmental sound are thus obtained. In the initial stage of CNN training, the smaller the weight coefficient $\alpha$, the better the model distinguishes bird song from other audio.
Therefore, the updated weight coefficient $\alpha$ is calculated by combining the number of potential bird song audio frames in the training dataset with the potential bird song formant sharpness coefficient of each potential bird song audio frame:

$$\alpha = \alpha_0 + k \cdot e^{-\frac{K_1}{K_2} \cdot \bar{Q}}$$

where $\alpha$ denotes the updated weight coefficient; $\alpha_0$ is a basic parameter, set to 0.05 in this embodiment to prevent $\alpha$ from being too small; $k$ is a control coefficient, set to 0.1, which controls the range of $\alpha$; $e^{(\cdot)}$ is the exponential function with natural base $e$; $K_1$ is the number of potential bird song frames in the training dataset and $K_2$ the number of non-potential bird song frames; and $\bar{Q}$ is the mean of the potential bird song formant sharpness coefficients of all potential bird song audio frames. The flowchart for obtaining the updated weight coefficient is shown in fig. 2.
The larger $K_1/K_2$, the more bird song appears in the training dataset; the larger $\bar{Q}$, the purer the collected bird song; and the smaller the calculated $\alpha$, the larger the share of the major-class cross entropy loss when the classification layer computes the loss, which is more advantageous for identifying bird song within the environmental sound.
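A one-function sketch of the updated weight coefficient as reconstructed, with alpha_0 = 0.05 and k = 0.1 as stated for this embodiment:

```python
import math

def updated_alpha(k1, k2, q_mean, alpha0=0.05, k=0.1):
    """alpha = alpha0 + k * exp(-(K1 / K2) * Q_mean), per the reconstruction above."""
    return alpha0 + k * math.exp(-(k1 / k2) * q_mean)
```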
Step S003, training the model based on the designed loss function calculation scheme and completing the recognition of bird voiceprints.
The audio data in the training dataset is converted into mel spectrograms; the calculation of the mel spectrogram is a well-known technique and its details are not repeated.
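A typical mel-spectrogram conversion with librosa is sketched below; the embodiment states only that mel spectrograms are used, so the parameter choices here (n_mels in particular) are assumptions.

```python
import librosa
import numpy as np

def to_mel(path, sr=8000, n_mels=64):
    """Load an audio file and return its log-mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)
```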
The input of the CNN neural network model is the mel spectrogram of the training dataset audio; training uses the Adam optimizer with a batch size of 32 and a learning rate of 0.001. The updated weight coefficient $\alpha$ is substituted into the loss function of step S002, which is used as the loss function of the CNN, while the parameter $\alpha$ varies as the number of training epochs increases, balancing the model's ability to distinguish major classes and fine-grained classes, as detailed below.
Model training starts from the calculated $\alpha$ value. During the first 10 epochs, the major-class loss is optimized primarily, with the aim of distinguishing bird song from background and other animal sounds, and $\alpha$ remains unchanged.
In the epochs after the 10th, the weight of the subclass loss function is gradually increased, eventually approaching 0.5, so that the model can learn the differences between fine-grained classes while keeping major-class errors small. The CNN method is a well-known technique and its details are not repeated. The model output includes major and minor categories, for example background sound - wind sound and bird song - cuckoo. A sketch of this weight schedule follows.
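One way to realize the schedule is sketched below, holding alpha fixed for the first 10 epochs and then ramping the subclass weight toward 0.5; the linear ramp shape is an assumption, since the embodiment states only the endpoints.

```python
def alpha_at_epoch(epoch, alpha_init, total_epochs, warmup=10, target=0.5):
    """Subclass-loss weight: constant during warmup, then a linear ramp toward 0.5."""
    if epoch < warmup:
        return alpha_init
    progress = (epoch - warmup) / max(1, total_epochs - warmup)
    return alpha_init + (target - alpha_init) * min(1.0, progress)
```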
Thus, the recognition of the bird voiceprint is completed.
It should be noted that the order of the above embodiments is for description only and does not indicate their relative merit; the foregoing describes specific embodiments of this specification. The processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results; in some embodiments, multitasking and parallel processing are also possible and may be advantageous.
In this specification, each embodiment is described in a progressive manner, and the same or similar parts of each embodiment are referred to each other, and each embodiment mainly describes differences from other embodiments.
The above embodiments are intended only to illustrate, not limit, the technical solutions of the present application; modifications to the technical solutions described in the foregoing embodiments, or equivalent replacements of some technical features, that do not depart in essence from the scope of the technical solutions of the embodiments of the present application all fall within the protection scope of the present application.

Claims (10)

1. A bird voiceprint recognition method in complex scenes, characterized by comprising the following steps:
collecting the spectrogram of each frame of environmental sound and the energy of each sampling point in a training dataset;
acquiring an audio mutation index for each sampling point in each frame of environmental sound according to the energy; acquiring a suspected bird song start-stop point sequence for each frame of environmental sound according to the audio mutation indexes;
acquiring an order sequence for each frame of environmental sound according to the suspected bird song start-stop point sequence; acquiring the audio mutation repetition index of each frame of environmental sound according to the relation between the suspected bird song start-stop point sequence and the order sequence; acquiring the potential bird song audio frames and non-potential bird song audio frames according to the audio mutation repetition indexes;
obtaining the formants of each frame of environmental sound according to the spectrogram; acquiring the potential bird song formant sharpness coefficient of each potential bird song audio frame according to its formants; acquiring the updated weight coefficient of the loss function according to the potential bird song formant sharpness coefficients;
and identifying bird voiceprints using a neural network model according to the updated weight coefficient.
2. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the acquiring an audio mutation index for each sampling point in each frame of environmental sound according to the energy comprises:
for the energy of each frame of environmental sound, sliding a window of preset size over it; for the energy of the window's central sampling point, computing the absolute difference between the energy of the central sampling point and that of the adjacent next sampling point as the first absolute difference, and the absolute difference between the energy of the central sampling point and that of the adjacent previous sampling point as the second absolute difference; computing the sum of the second absolute difference and a preset adjusting parameter; computing the ratio of the first absolute difference to that sum; and evaluating the base-2 logarithm of the ratio;
computing the standard deviation of the energies of all sampling points in the sliding window, and taking the product of the absolute value of the logarithm and the standard deviation as the audio mutation index of the central sampling point of the sliding window in each frame of environmental sound.
3. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the acquiring a suspected bird song start-stop point sequence for each frame of environmental sound according to the audio mutation indexes comprises:
for each frame of environmental sound, arranging the audio mutation indexes of all sampling points in descending order, taking a preset number of the largest audio mutation indexes, and arranging them in the order of their corresponding sampling points within the frame to form the frame's suspected bird song start-stop point sequence.
4. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the acquiring an order sequence for each frame of environmental sound according to the suspected bird song start-stop point sequence comprises:
forming the order sequence of each frame of environmental sound from the rank values that the audio mutation indexes of the suspected bird song start-stop point sequence take when sorted from small to large.
5. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the acquiring the audio mutation repetition index of each frame of environmental sound according to the relation between the suspected bird song start-stop point sequence and the order sequence comprises:
for each frame of environmental sound, computing the Pearson correlation coefficient between the suspected bird song start-stop point sequence and the order sequence, computing the Shannon entropy of all audio mutation indexes in the suspected bird song start-stop point sequence, computing the sum of the Shannon entropy and a preset adjusting parameter, and taking the ratio of the absolute value of the Pearson correlation coefficient to that sum as the frame's audio mutation repetition index.
6. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the acquiring the potential bird song audio frames and non-potential bird song audio frames according to the audio mutation repetition indexes comprises:
taking the audio mutation repetition index of each frame of environmental sound as the input of a clustering algorithm to obtain two clusters, marking each frame of environmental sound in the cluster with the larger audio mutation repetition index as a potential bird song audio frame, and each frame in the other cluster as a non-potential bird song audio frame.
7. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the obtaining the formants of each frame of environmental sound according to the spectrogram comprises: taking the spectrogram of each frame of environmental sound as the input of an interpolation method and acquiring the center frequency and bandwidth of each formant of the frame.
8. The bird voiceprint recognition method in complex scenes according to claim 7, wherein the acquiring the potential bird song formant sharpness coefficient of each potential bird song audio frame according to its formants comprises:
for each potential bird song audio frame, computing the kurtosis of each of its formants over the formant's bandwidth;
the potential bird song formant sharpness coefficient is expressed as:

$$Q_l = \frac{1}{n_l} \sum_{j=1}^{n_l} \frac{K_j \cdot \log_2(f_j \cdot A_j)}{B_j}$$

where $Q_l$ denotes the potential bird song formant sharpness coefficient of the $l$-th potential bird song audio frame, $n_l$ the number of formants contained in the $l$-th frame, $K_j$ the kurtosis of the $j$-th formant, $f_j$ the center frequency of the $j$-th formant in the $l$-th frame, $A_j$ the amplitude corresponding to the center frequency of the $j$-th formant, $B_j$ the bandwidth of the $j$-th formant, and $\log_2$ the base-2 logarithm.
9. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the formula of the updated weight coefficient is:

$$\alpha = \alpha_0 + k \cdot e^{-\frac{K_1}{K_2} \cdot \bar{Q}}$$

where $\alpha$ denotes the updated weight coefficient; $\alpha_0$ is a preset basic parameter and $k$ a preset control coefficient; $e^{(\cdot)}$ is the exponential function with natural base; $K_1$ is the number of potential bird song frames in the training dataset, $K_2$ the number of non-potential bird song frames, and $\bar{Q}$ the mean of the potential bird song formant sharpness coefficients of all potential bird song audio frames.
10. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the identifying bird voiceprints using a neural network model according to the updated weight coefficient comprises:
substituting the updated weight coefficient into the loss function, using it as the loss function of the neural network model, taking the mel spectrogram of the environmental sound in the training dataset as the model input, and obtaining the sound category as the model output.
CN202410600097.3A (filed 2024-05-15, priority 2024-05-15) Bird voiceprint recognition method in complex scenes - Pending - CN118173102A (en)

Priority Applications (1)

Application Number: CN202410600097.3A · Priority/Filing Date: 2024-05-15 · Title: Bird voiceprint recognition method in complex scenes


Publications (1)

Publication Number: CN118173102A · Publication Date: 2024-06-11

Family

ID=91351017



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination