CN118173102A - Bird voiceprint recognition method in complex scenes - Google Patents

Bird voiceprint recognition method in complex scenes

Info

Publication number
CN118173102A
Authority
CN
China
Prior art keywords
frame
audio
bird
bird song
potential
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410600097.3A
Other languages
Chinese (zh)
Inventor
高树会 (Gao Shuhui)
许志方 (Xu Zhifang)
杨兆宇 (Yang Zhaoyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bainiao Data Technology Beijing Co ltd
Original Assignee
Bainiao Data Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bainiao Data Technology Beijing Co ltd
Priority to CN202410600097.3A
Publication of CN118173102A
Legal status: Pending


Abstract

The application relates to the technical field of voice recognition, and in particular to a bird voiceprint recognition method in complex scenes, comprising the following steps: collecting spectrograms and energy; acquiring an audio mutation index for each sampling point in each frame of environmental sound; acquiring a suspected bird song start-stop point sequence for each frame of environmental sound; acquiring an order sequence for each frame; acquiring an audio mutation repetition index for each frame; acquiring the potential and non-potential bird song audio frames; acquiring the formants of each frame; acquiring the potential bird song formant sharpness coefficient of each potential bird song audio frame; acquiring the updated weight coefficient of the loss function; and identifying bird voiceprints with a neural network model. The method improves the accuracy of bird song signal recognition by a bird voiceprint recognition model in complex scenes.

Description

Bird voiceprint recognition method in complex scenes
Technical Field
The application relates to the technical field of voice recognition, and in particular to a bird voiceprint recognition method in complex scenes.
Background
An intelligent wetland monitoring system mainly monitors the species and numbers present in a wetland environment, identifies human disturbance of that environment, and monitors water quality. Birds are the most representative group among wetland wild animals, are an important component of the wetland ecosystem, and reflect changes in the wetland environment sensitively and profoundly, so identifying species and monitoring bird numbers is critical. Conventional bird monitoring relies mainly on video and audio; for forest birds that are secretive by nature, audio monitoring is the primary means. Sound-receiving equipment is deployed in the areas of a protection zone where bird activity is to be monitored; sound is received in real time and processed at the edge, bird song is identified, and the recognition results and original sound files are returned to the server for the protection zone's staff to analyze bird activity. The bird voiceprint recognition effect is therefore very important and directly affects the accuracy of bird monitoring.
In a field wetland scene, however, background sounds are complex, bird song is affected by environmental sounds, and data annotation costs are high. The usual audio classification methods classify fixed sound categories and give relatively little consideration to specific background sounds and sound levels. The prior art adopts generic data enhancement and modeling methods and therefore performs poorly on specific background noise.
Disclosure of Invention
To solve the above technical problems, the application provides a bird voiceprint recognition method in complex scenes.
The bird voiceprint recognition method in complex scenes adopts the following technical scheme:
an embodiment of the application provides a bird voiceprint recognition method in complex scenes, comprising the following steps:
collecting the spectrogram of each frame of environmental sound and the energy of each sampling point in a training dataset;
acquiring an audio mutation index for each sampling point in each frame of environmental sound according to the energy; acquiring a suspected bird song start-stop point sequence for each frame of environmental sound according to the audio mutation indexes;
acquiring an order sequence for each frame of environmental sound according to the suspected bird song start-stop point sequence; acquiring the audio mutation repetition index of each frame of environmental sound according to the relation between the suspected bird song start-stop point sequence and the order sequence; acquiring the potential bird song audio frames and non-potential bird song audio frames according to the audio mutation repetition indexes;
obtaining the formants of each frame of environmental sound according to the spectrogram; acquiring the potential bird song formant sharpness coefficient of each potential bird song audio frame according to its formants; acquiring the updated weight coefficient of the loss function according to the potential bird song formant sharpness coefficients;
and identifying bird voiceprints using a neural network model according to the updated weight coefficient.
Further, the acquiring an audio mutation index for each sampling point in each frame of environmental sound according to the energy includes:
for the energy of each frame of environmental sound, sliding a window of preset size over it; for the energy of the window's central sampling point, computing the absolute difference between the energy of the central sampling point and that of the adjacent next sampling point as the first absolute difference, and the absolute difference between the energy of the central sampling point and that of the adjacent previous sampling point as the second absolute difference; computing the sum of the second absolute difference and a preset adjusting parameter; computing the ratio of the first absolute difference to that sum; and evaluating the base-2 logarithm of the ratio;
computing the standard deviation of the energies of all sampling points in the sliding window, and taking the product of the absolute value of the logarithm and the standard deviation as the audio mutation index of the central sampling point of the sliding window in each frame of environmental sound.
Further, the acquiring a suspected bird song start-stop point sequence for each frame of environmental sound according to the audio mutation indexes includes:
for each frame of environmental sound, arranging the audio mutation indexes of all sampling points in descending order, taking a preset number of the largest audio mutation indexes, and arranging them in the order of their corresponding sampling points within the frame to form the frame's suspected bird song start-stop point sequence.
Further, the acquiring an order sequence for each frame of environmental sound according to the suspected bird song start-stop point sequence includes:
forming the order sequence of each frame of environmental sound from the rank values that the audio mutation indexes of the suspected bird song start-stop point sequence take when sorted from small to large.
Further, the acquiring the audio mutation repetition index of each frame of environmental sound according to the relation between the suspected bird song start-stop point sequence and the order sequence includes:
for each frame of environmental sound, computing the Pearson correlation coefficient between the suspected bird song start-stop point sequence and the order sequence, computing the Shannon entropy of all audio mutation indexes in the suspected bird song start-stop point sequence, computing the sum of the Shannon entropy and a preset adjusting parameter, and taking the ratio of the absolute value of the Pearson correlation coefficient to that sum as the frame's audio mutation repetition index.
Further, the acquiring the potential bird song audio frames and non-potential bird song audio frames according to the audio mutation repetition indexes includes:
taking the audio mutation repetition index of each frame of environmental sound as the input of a clustering algorithm to obtain two clusters, marking each frame of environmental sound in the cluster with the larger audio mutation repetition index as a potential bird song audio frame, and each frame in the other cluster as a non-potential bird song audio frame.
Further, the obtaining the formants of each frame of environmental sound according to the spectrogram includes: taking the spectrogram of each frame of environmental sound as the input of an interpolation method and acquiring the center frequency and bandwidth of each formant of the frame.
Further, the acquiring the potential bird song formant sharpness coefficient of each potential bird song audio frame according to its formants includes:
for each potential bird song audio frame, computing the kurtosis of each of its formants over the formant's bandwidth;
the potential bird song formant sharpness coefficient is expressed as:

$$Q_l = \frac{1}{n_l} \sum_{j=1}^{n_l} \frac{K_j \cdot \log_2(f_j \cdot A_j)}{B_j}$$

where $Q_l$ denotes the potential bird song formant sharpness coefficient of the $l$-th potential bird song audio frame, $n_l$ the number of formants contained in the $l$-th frame, $K_j$ the kurtosis of the $j$-th formant, $f_j$ the center frequency of the $j$-th formant in the $l$-th frame, $A_j$ the amplitude corresponding to the center frequency of the $j$-th formant, $B_j$ the bandwidth of the $j$-th formant, and $\log_2$ the base-2 logarithm.
Further, the formula of the updated weight coefficient is:

$$\alpha = \alpha_0 + k \cdot e^{-\frac{K_1}{K_2} \cdot \bar{Q}}$$

where $\alpha$ denotes the updated weight coefficient; $\alpha_0$ is a preset basic parameter and $k$ a preset control coefficient; $e^{(\cdot)}$ is the exponential function with natural base; $K_1$ is the number of potential bird song frames in the training dataset, $K_2$ the number of non-potential bird song frames, and $\bar{Q}$ the mean of the potential bird song formant sharpness coefficients of all potential bird song audio frames.
Further, the identifying bird voiceprints using a neural network model according to the updated weight coefficient includes:
substituting the updated weight coefficient into the loss function, using it as the loss function of the neural network model, taking the mel spectrogram of the environmental sound in the training dataset as the model input, and obtaining the sound category as the model output.
The application has at least the following beneficial effects:
The method analyzes the abrupt start-stop and rapid amplitude-change characteristics of bird song in the energy of environmental sound to compute an audio mutation index, obtains a suspected bird song start-stop point sequence from that index, and judges whether the repetition similarity typical of bird song is evident by computing an audio mutation repetition index from the stability of the sequence. Potential bird song audio frames are extracted from the audio mutation repetition indexes; the formant characteristics in each such frame's spectrogram are then analyzed, and a potential bird song formant sharpness coefficient is constructed to measure how strongly the bird song signal is disturbed. The weight parameter of the loss function is computed from the number of potential bird song audio frames and the potential bird song formant sharpness coefficients. The benefit is that the relative influence between background sounds and bird song in training samples collected in complex scenes is taken into account, improving the training of the CNN-based bird voiceprint recognition model and thereby the accuracy of bird song signal recognition in complex scenes.
Drawings
To illustrate the embodiments of the application or the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below are obviously only some embodiments of the application; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of the bird voiceprint recognition method in complex scenes provided by the application;
FIG. 2 is a flow chart for obtaining the updated weight coefficient.
Detailed Description
To further describe the technical means adopted by the application to achieve its intended aim and their effects, the specific implementation, structure, characteristics and effects of the bird voiceprint recognition method in complex scenes are described in detail below with reference to the accompanying drawings and preferred embodiments. In the following description, different instances of "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
The following describes the specific scheme of the bird voiceprint recognition method in complex scenes provided by the application with reference to the accompanying drawings.
The application provides a bird voiceprint recognition method in complex scenes; referring to fig. 1, the method comprises the following steps:
Step S001, acquiring the environmental sound and preprocessing it.
First, environmental sound is acquired; in this embodiment, audio data of the environmental sound is collected by deploying shotgun microphones in a wetland protection area.
A digital camera is used to collect video of the surroundings in the wetland protection area, and an expert confirms all categories of environmental sound that may occur based on the audio and surrounding video returned by the voiceprint equipment deployed in the protection area. Environmental sounds that can be captured are collected continuously with the audio equipment; those that cannot are collected from public datasets on the Internet, the public dataset chosen in this embodiment being the AudioSet dataset.
This collection is updated once every quarter, and an entire iteration period lasts 2 years. Taking one iteration period as an example, the types of the sound signals in the collected environmental audio are labeled according to the voiceprint recognition equipment deployed in the protection area. Major categories are labeled first: for example, wind sound, water flow sound, insect sound and bird song in the environmental sound are labeled 1, 2, 3 and 4 respectively. A detail label is then applied: for example, bird song - cuckoo is labeled (4, A).
The collected audio data of the environmental sound is divided into a training dataset and a test dataset for the neural network at a ratio of 8:2.
The environmental sound in the acquired training dataset is preprocessed. First, the training dataset is sampled, with the sampling frequency set to 8 kHz in this embodiment, to obtain the energy of each sampling point in each frame of environmental sound. Pre-emphasis is then applied to improve the signal-to-noise ratio: the environmental sound is filtered by a high-pass filter whose coefficient is 0.95 in this embodiment. Because the frequency of bird audio generally changes over time, a stable frequency contour is difficult to obtain; this embodiment therefore sets the frame length to 30 ms and applies framing and windowing to the environmental sound. A fast Fourier transform (FFT) is then applied within each frame of environmental sound to obtain its spectrogram. The FFT is a well-known technique and its details are not repeated.
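As a concrete illustration of this preprocessing chain, a minimal NumPy sketch is given below; the sampling rate, pre-emphasis coefficient, and frame length follow the values stated above, while the function name, the non-overlapping framing, and the Hamming window are assumptions rather than details fixed by the embodiment.

```python
import numpy as np

def preprocess(signal, sr=8000, pre_emph=0.95, frame_ms=30):
    """Pre-emphasize, frame, window, and FFT an environmental-sound signal.

    A sketch of the step S001 preprocessing; names are illustrative.
    """
    # Pre-emphasis: first-order high-pass filter to raise the SNR
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    # Framing: 30 ms frames (240 samples at 8 kHz), here without overlap
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(emphasized) // frame_len
    frames = emphasized[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Windowing (Hamming) followed by FFT to obtain each frame's spectrum
    windowed = frames * np.hamming(frame_len)
    spectra = np.abs(np.fft.rfft(windowed, axis=1))

    # Per-sample energy within each frame (squared amplitude)
    energy = windowed ** 2
    return energy, spectra
```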
The energy of each sampling point of each frame of environmental sound, and its spectrogram, are thus obtained.
Step S002, designing the loss function calculation scheme of the recognition model's classification layer: constructing an audio mutation index based on the waveform amplitude-variation characteristics of bird song in the time domain; analyzing the repetition similarity of bird song start-stop points and constructing an audio mutation repetition index; analyzing the formant characteristics of each frame's audio spectrogram and constructing a potential bird song formant sharpness coefficient; and computing the updated weight coefficient of the loss function.
The environmental sounds of the wetland protection zone mainly comprise wind, rain, water flow, frog sounds, bird song, other animal sounds and the like. To identify bird voiceprints, bird song must first be picked out from the complex environmental sound. This embodiment performs bird voiceprint recognition with a neural network model, namely a CNN, whose loss function is set as a cross entropy loss.
Specifically, the loss computed at the classification layer of the CNN model is divided into two parts: a major-class cross entropy loss, denoted $Loss_{large}$, and a subclass cross entropy loss, denoted $Loss_{small}$. Weights are assigned to the two, and the loss is calculated as:

$$Loss = (1 - \alpha) \cdot Loss_{large} + \alpha \cdot Loss_{small}$$

where $Loss$ is the loss calculated by the classification layer, $Loss_{large}$ the major-class cross entropy loss, $Loss_{small}$ the subclass cross entropy loss, and $\alpha$ the weight coefficient. The major-class cross entropy loss mainly distinguishes bird song from the other sounds in the environment, while the subclass cross entropy loss distinguishes specific song categories within bird song; the setting of the weight coefficient $\alpha$ during model training therefore has a crucial influence on the recognition result.
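The two-part loss might be sketched in PyTorch as follows, under the reconstruction of the weighting above; the two-head layout and tensor names are assumptions, not details fixed by the application.

```python
import torch.nn.functional as F

def classification_loss(logits_large, logits_small, y_large, y_small, alpha):
    """Loss = (1 - alpha) * Loss_large + alpha * Loss_small.

    logits_large: scores over the major classes (wind, water, insect, bird song, ...).
    logits_small: scores over the fine-grained song categories.
    alpha: weight coefficient from step S002.
    """
    loss_large = F.cross_entropy(logits_large, y_large)  # bird song vs. other sounds
    loss_small = F.cross_entropy(logits_small, y_small)  # species within bird song
    return (1 - alpha) * loss_large + alpha * loss_small
```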
In the neural network training dataset obtained in step S001, the amplitude and loudness of bird song fluctuate in the time domain more than other animal sounds, wind and water sounds: the start and end of the waveform are usually abrupt, the amplitude changes quickly, and the bird song has a repeating, similar sound pattern. Other environmental sounds, such as wind and rain, usually appear as continuous low-frequency noise with some randomness and irregularity. The likelihood that bird song is present is therefore analyzed through the waveform characteristics of the environmental sound in the time domain.
Denote the energy of the $i$-th sampling point of the $t$-th frame of environmental sound as $E_i$. A sliding window is moved from left to right over the frame's energy with step 1, the center point of the window being $i$. An audio mutation index is first constructed as:

$$M_i = \left| \log_2 \frac{\left| E_{i+1} - E_i \right|}{\left| E_i - E_{i-1} \right| + \varepsilon} \right| \cdot \sigma_i$$

where $M_i$ denotes the audio mutation index of the $i$-th sampling point in the $t$-th frame of environmental sound, $\log_2$ the base-2 logarithm, $E_{i+1}$ the energy of the $(i+1)$-th sampling point in the frame, $E_i$ and $E_{i-1}$ the energies of the $i$-th and $(i-1)$-th sampling points, $\varepsilon$ an adjusting parameter (0.1 in this embodiment) that prevents the denominator from being 0, and $\sigma_i$ the standard deviation of all energies within the $i$-th sliding window.
If the energy of the environmental sound varies more sharply around the $i$-th point, the difference between adjacent amplitudes is larger and the ratio $\left| E_{i+1} - E_i \right| / (\left| E_i - E_{i-1} \right| + \varepsilon)$ lies farther from 1; at the same time, the more pronounced the local fluctuation, the larger $\sigma_i$. The larger the audio mutation index $M_i$, the more likely the point belongs to the start or end of a bird song segment.
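Under these definitions, the audio mutation index admits a direct NumPy sketch; the window size and the tiny numerator floor are assumptions (the embodiment regularizes only the denominator and leaves the window size as a preset).

```python
import numpy as np

def mutation_index(energy, win=11, eps=0.1):
    """Audio mutation index M_i for each sample of one frame's energy.

    M_i = |log2(|E_{i+1}-E_i| / (|E_i-E_{i-1}| + eps))| * std(window around i).
    """
    m = np.zeros_like(energy, dtype=float)
    half = win // 2
    for i in range(1, len(energy) - 1):
        # Tiny floor avoids log2(0); the embodiment regularizes only the denominator
        num = abs(energy[i + 1] - energy[i]) + 1e-12
        den = abs(energy[i] - energy[i - 1]) + eps
        lo, hi = max(0, i - half), min(len(energy), i + half + 1)
        m[i] = abs(np.log2(num / den)) * np.std(energy[lo:hi])
    return m
```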
The audio mutation indexes of all sampling points in the $t$-th frame of environmental sound can thus be calculated. Arranging them in descending order and taking the top 10% of values, then ordering those values by their sampling points within the $t$-th frame, yields the suspected bird song start-stop point sequence of the $t$-th frame, denoted $S_t$.
The sequence of rank values that the audio mutation indexes of $S_t$ take when sorted from small to large within the frame is named the order sequence of the $t$-th frame of environmental sound, denoted $O_t$. Based on the short-term repetition similarity of bird song, an audio mutation repetition index is constructed as:

$$R_t = \frac{\left| \rho(S_t, O_t) \right|}{H(S_t) + \varepsilon}$$

where $R_t$ denotes the audio mutation repetition index of the $t$-th frame of environmental sound, $S_t$ the suspected bird song start-stop point sequence of the frame, $O_t$ the order sequence of the frame, $\rho(S_t, O_t)$ the Pearson correlation coefficient between the sequences $S_t$ and $O_t$, $H(S_t)$ the Shannon entropy of the sequence $S_t$, and $\varepsilon$ an adjusting parameter (0.1 in this embodiment) that prevents the denominator from being 0. The calculation of the Pearson correlation coefficient and Shannon entropy is well known and is not detailed here.
The larger $\left| \rho(S_t, O_t) \right|$, the more clearly regular the distribution of suspected bird song start-stop points; the smaller $H(S_t)$, the more stable the audio amplitude at the suspected start-stop points. Accordingly, the larger $R_t$, the more pronounced the repetition similarity of bird song within the $t$-th frame.
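A sketch of the repetition index under this reading follows; SciPy's pearsonr and a histogram-based Shannon entropy stand in for details the embodiment leaves unspecified.

```python
import numpy as np
from scipy.stats import pearsonr

def repetition_index(m, top_frac=0.10, eps=0.1):
    """Audio mutation repetition index R_t of one frame.

    m: audio mutation indexes of all samples in the frame.
    """
    k = max(2, int(len(m) * top_frac))
    idx = np.sort(np.argsort(m)[-k:])     # top-10% samples, restored to time order
    s = m[idx]                            # suspected start-stop point sequence S_t
    o = np.argsort(np.argsort(s))         # rank (order) sequence O_t

    rho, _ = pearsonr(s, o)               # regularity of the start-stop points
    hist, _ = np.histogram(s, bins=10)    # Shannon entropy over binned index values
    p = hist[hist > 0] / hist.sum()
    h = -(p * np.log2(p)).sum()
    return abs(rho) / (h + eps)
```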
Furthermore, to analyze the degree to which the bird song in the collected environmental sound is disturbed, the frames in which bird song is present are extracted. This embodiment clusters the environmental sound with the K-means algorithm: the input is the audio mutation repetition index of each frame in the neural network training dataset, the number of clusters is set to 2, and the output is the two clusters and their cluster centers.
The audio mutation repetition indexes of the two cluster centers are compared: all frames in the cluster with the larger value are marked as the potential bird song audio frames, their number being denoted K1, and all frames in the cluster with the smaller center value are marked as the non-potential bird song audio frames, their number being denoted K2. The K-means clustering algorithm is a known technique and is not detailed in this embodiment.
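This frame partition can be sketched with scikit-learn's KMeans; n_clusters=2 follows the embodiment, while the remaining parameters and names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_frames(rep_indexes):
    """Cluster per-frame repetition indexes into potential / non-potential frames."""
    x = np.asarray(rep_indexes, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x)
    bird_cluster = int(np.argmax(km.cluster_centers_.ravel()))  # larger-center cluster
    potential = km.labels_ == bird_cluster   # potential bird song audio frames (K1)
    return potential, ~potential             # non-potential frames (K2)
```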
Within a potential bird song audio frame, bird song typically appears as a high-frequency sound signal, visible as high-frequency components in the spectrogram. The significance of these high-frequency signals increases with the degree of audio mutation, making them easier to identify in the spectrogram. Moreover, when the high-frequency formants in the spectrogram show a sharper morphology, the bird song is less disturbed by other sound signals in the environment, preserving its clear characteristics and high recognizability.
In this embodiment, the formants in the spectrogram are calculated by LPC interpolation: the input is the spectrogram within the $l$-th frame, and the output is the center frequency and bandwidth of each formant in the spectrogram. LPC interpolation is a well-known technique and its details are not repeated.
Denote the center frequency of the $j$-th formant within the $l$-th potential bird song audio frame as $f_j$, the amplitude corresponding to that center frequency as $A_j$, and the bandwidth of the $j$-th formant as $B_j$; the kurtosis of the $j$-th formant over its bandwidth is calculated and denoted $K_j$. The calculation of kurtosis is a known technique and is not described here.
Formants with greater bandwidth may indicate that the environmental sound itself is complex, reducing formant sharpness; that is, the frame may contain several different sounds or other environmental sounds at once. The potential bird song formant sharpness coefficient is therefore constructed as:

$$Q_l = \frac{1}{n_l} \sum_{j=1}^{n_l} \frac{K_j \cdot \log_2(f_j \cdot A_j)}{B_j}$$

where $Q_l$ denotes the potential bird song formant sharpness coefficient within the $l$-th potential bird song audio frame, $n_l$ the number of formants contained in the $l$-th frame, $K_j$ the kurtosis of the $j$-th formant, $f_j$ the center frequency of the $j$-th formant in the $l$-th frame, $A_j$ the amplitude corresponding to the center frequency of the $j$-th formant, $B_j$ the bandwidth of the $j$-th formant, and $\log_2$ the base-2 logarithm.
The larger $K_j$, the greater the degree of aggregation of the $j$-th formant; the higher $f_j$, the greater the likelihood that the frame contains bird song; and the larger $A_j$, the stronger the frequency response of the bird song in the environmental sound. Because the overlapping of various sounds widens formant bandwidth, a smaller $B_j$ indicates lower complexity of the environmental sound. The larger the calculated $Q_l$, the less the bird song signal is disturbed.
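Given the per-formant parameters produced by the LPC step, the sharpness coefficient as reconstructed above could be computed as follows; the 1/n averaging follows that reconstruction of the garbled formula and should be read as an assumption.

```python
import numpy as np

def sharpness_coefficient(kurtosis, center_freq, amplitude, bandwidth):
    """Q_l = (1/n) * sum_j K_j * log2(f_j * A_j) / B_j  (reconstructed form).

    All arguments are per-formant arrays for one potential bird song frame;
    center frequencies and amplitudes are assumed positive so log2 is defined.
    """
    k = np.asarray(kurtosis, dtype=float)
    f = np.asarray(center_freq, dtype=float)
    a = np.asarray(amplitude, dtype=float)
    b = np.asarray(bandwidth, dtype=float)
    return float(np.mean(k * np.log2(f * a) / b))
```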
The potential bird song formant sharpness coefficients of all potential bird song audio frames of the training dataset's environmental sound are thus obtained. In the initial stage of CNN training, the smaller the weight coefficient $\alpha$, the better the model distinguishes bird song from other audio.
Therefore, the updated weight coefficient $\alpha$ is calculated by combining the number of potential bird song audio frames in the training dataset with the potential bird song formant sharpness coefficient of each potential bird song audio frame:

$$\alpha = \alpha_0 + k \cdot e^{-\frac{K_1}{K_2} \cdot \bar{Q}}$$

where $\alpha$ denotes the updated weight coefficient; $\alpha_0$ is a basic parameter, set to 0.05 in this embodiment to prevent $\alpha$ from being too small; $k$ is a control coefficient, set to 0.1, which controls the range of $\alpha$; $e^{(\cdot)}$ is the exponential function with natural base $e$; $K_1$ is the number of potential bird song frames in the training dataset and $K_2$ the number of non-potential bird song frames; and $\bar{Q}$ is the mean of the potential bird song formant sharpness coefficients of all potential bird song audio frames. The flowchart for obtaining the updated weight coefficient is shown in fig. 2.
The larger $K_1/K_2$, the more bird song appears in the training dataset; the larger $\bar{Q}$, the purer the collected bird song; and the smaller the calculated $\alpha$, the larger the share of the major-class cross entropy loss when the classification layer computes the loss, which is more advantageous for identifying bird song within the environmental sound.
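A one-function sketch of the updated weight coefficient as reconstructed, with alpha_0 = 0.05 and k = 0.1 as stated for this embodiment:

```python
import math

def updated_alpha(k1, k2, q_mean, alpha0=0.05, k=0.1):
    """alpha = alpha0 + k * exp(-(K1 / K2) * Q_mean), per the reconstruction above."""
    return alpha0 + k * math.exp(-(k1 / k2) * q_mean)
```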
Step S003, training the model based on the designed loss function calculation scheme and completing the recognition of bird voiceprints.
The audio data in the training dataset is converted into mel spectrograms; the calculation of the mel spectrogram is a well-known technique and its details are not repeated.
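A typical mel-spectrogram conversion with librosa is sketched below; the embodiment states only that mel spectrograms are used, so the parameter choices here (n_mels in particular) are assumptions.

```python
import librosa
import numpy as np

def to_mel(path, sr=8000, n_mels=64):
    """Load an audio file and return its log-mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)
```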
The input of the CNN neural network model is the mel spectrogram of the training dataset audio; training uses the Adam optimizer with a batch size of 32 and a learning rate of 0.001. The updated weight coefficient $\alpha$ is substituted into the loss function of step S002, which is used as the loss function of the CNN, while the parameter $\alpha$ varies as the number of training epochs increases, balancing the model's ability to distinguish major classes and fine-grained classes, as detailed below.
Model training starts from the calculated $\alpha$ value. During the first 10 epochs, the major-class loss is optimized primarily, with the aim of distinguishing bird song from background and other animal sounds, and $\alpha$ remains unchanged.
In the epochs after the 10th, the weight of the subclass loss function is gradually increased, eventually approaching 0.5, so that the model can learn the differences between fine-grained classes while keeping major-class errors small. The CNN method is a well-known technique and its details are not repeated. The model output includes major and minor categories, for example background sound - wind sound and bird song - cuckoo. A sketch of this weight schedule follows.
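One way to realize the schedule is sketched below, holding alpha fixed for the first 10 epochs and then ramping the subclass weight toward 0.5; the linear ramp shape is an assumption, since the embodiment states only the endpoints.

```python
def alpha_at_epoch(epoch, alpha_init, total_epochs, warmup=10, target=0.5):
    """Subclass-loss weight: constant during warmup, then a linear ramp toward 0.5."""
    if epoch < warmup:
        return alpha_init
    progress = (epoch - warmup) / max(1, total_epochs - warmup)
    return alpha_init + (target - alpha_init) * min(1.0, progress)
```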
Thus, the recognition of the bird voiceprint is completed.
It should be noted that the order of the above embodiments is for description only and does not indicate their relative merit; the foregoing describes specific embodiments of this specification. The processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results; in some embodiments, multitasking and parallel processing are also possible and may be advantageous.
In this specification, each embodiment is described in a progressive manner, and the same or similar parts of each embodiment are referred to each other, and each embodiment mainly describes differences from other embodiments.
The above embodiments are intended only to illustrate, not limit, the technical solutions of the present application; modifications to the technical solutions described in the foregoing embodiments, or equivalent replacements of some technical features, that do not depart in essence from the scope of the technical solutions of the embodiments of the present application all fall within the protection scope of the present application.

Claims (10)

1. A bird voiceprint recognition method in complex scenes, characterized by comprising the following steps:
collecting the spectrogram of each frame of environmental sound and the energy of each sampling point in a training dataset;
acquiring an audio mutation index for each sampling point in each frame of environmental sound according to the energy; acquiring a suspected bird song start-stop point sequence for each frame of environmental sound according to the audio mutation indexes;
acquiring an order sequence for each frame of environmental sound according to the suspected bird song start-stop point sequence; acquiring the audio mutation repetition index of each frame of environmental sound according to the relation between the suspected bird song start-stop point sequence and the order sequence; acquiring the potential bird song audio frames and non-potential bird song audio frames according to the audio mutation repetition indexes;
obtaining the formants of each frame of environmental sound according to the spectrogram; acquiring the potential bird song formant sharpness coefficient of each potential bird song audio frame according to its formants; acquiring the updated weight coefficient of the loss function according to the potential bird song formant sharpness coefficients;
and identifying bird voiceprints using a neural network model according to the updated weight coefficient.
2. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the acquiring an audio mutation index for each sampling point in each frame of environmental sound according to the energy comprises:
for the energy of each frame of environmental sound, sliding a window of preset size over it; for the energy of the window's central sampling point, computing the absolute difference between the energy of the central sampling point and that of the adjacent next sampling point as the first absolute difference, and the absolute difference between the energy of the central sampling point and that of the adjacent previous sampling point as the second absolute difference; computing the sum of the second absolute difference and a preset adjusting parameter; computing the ratio of the first absolute difference to that sum; and evaluating the base-2 logarithm of the ratio;
computing the standard deviation of the energies of all sampling points in the sliding window, and taking the product of the absolute value of the logarithm and the standard deviation as the audio mutation index of the central sampling point of the sliding window in each frame of environmental sound.
3. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the acquiring a suspected bird song start-stop point sequence for each frame of environmental sound according to the audio mutation indexes comprises:
for each frame of environmental sound, arranging the audio mutation indexes of all sampling points in descending order, taking a preset number of the largest audio mutation indexes, and arranging them in the order of their corresponding sampling points within the frame to form the frame's suspected bird song start-stop point sequence.
4. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the acquiring an order sequence for each frame of environmental sound according to the suspected bird song start-stop point sequence comprises:
forming the order sequence of each frame of environmental sound from the rank values that the audio mutation indexes of the suspected bird song start-stop point sequence take when sorted from small to large.
5. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the acquiring the audio mutation repetition index of each frame of environmental sound according to the relation between the suspected bird song start-stop point sequence and the order sequence comprises:
for each frame of environmental sound, computing the Pearson correlation coefficient between the suspected bird song start-stop point sequence and the order sequence, computing the Shannon entropy of all audio mutation indexes in the suspected bird song start-stop point sequence, computing the sum of the Shannon entropy and a preset adjusting parameter, and taking the ratio of the absolute value of the Pearson correlation coefficient to that sum as the frame's audio mutation repetition index.
6. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the acquiring the potential bird song audio frames and non-potential bird song audio frames according to the audio mutation repetition indexes comprises:
taking the audio mutation repetition index of each frame of environmental sound as the input of a clustering algorithm to obtain two clusters, marking each frame of environmental sound in the cluster with the larger audio mutation repetition index as a potential bird song audio frame, and each frame in the other cluster as a non-potential bird song audio frame.
7. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the obtaining the formants of each frame of environmental sound according to the spectrogram comprises: taking the spectrogram of each frame of environmental sound as the input of an interpolation method and acquiring the center frequency and bandwidth of each formant of the frame.
8. The bird voiceprint recognition method in complex scenes according to claim 7, wherein the acquiring the potential bird song formant sharpness coefficient of each potential bird song audio frame according to its formants comprises:
for each potential bird song audio frame, computing the kurtosis of each of its formants over the formant's bandwidth;
the potential bird song formant sharpness coefficient is expressed as:

$$Q_l = \frac{1}{n_l} \sum_{j=1}^{n_l} \frac{K_j \cdot \log_2(f_j \cdot A_j)}{B_j}$$

where $Q_l$ denotes the potential bird song formant sharpness coefficient of the $l$-th potential bird song audio frame, $n_l$ the number of formants contained in the $l$-th frame, $K_j$ the kurtosis of the $j$-th formant, $f_j$ the center frequency of the $j$-th formant in the $l$-th frame, $A_j$ the amplitude corresponding to the center frequency of the $j$-th formant, $B_j$ the bandwidth of the $j$-th formant, and $\log_2$ the base-2 logarithm.
9. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the formula of the updated weight coefficient is:

$$\alpha = \alpha_0 + k \cdot e^{-\frac{K_1}{K_2} \cdot \bar{Q}}$$

where $\alpha$ denotes the updated weight coefficient; $\alpha_0$ is a preset basic parameter and $k$ a preset control coefficient; $e^{(\cdot)}$ is the exponential function with natural base; $K_1$ is the number of potential bird song frames in the training dataset, $K_2$ the number of non-potential bird song frames, and $\bar{Q}$ the mean of the potential bird song formant sharpness coefficients of all potential bird song audio frames.
10. The bird voiceprint recognition method in complex scenes according to claim 1, wherein the identifying bird voiceprints using a neural network model according to the updated weight coefficient comprises:
substituting the updated weight coefficient into the loss function, using it as the loss function of the neural network model, taking the mel spectrogram of the environmental sound in the training dataset as the model input, and obtaining the sound category as the model output.
CN202410600097.3A (filed 2024-05-15, priority 2024-05-15) Bird voiceprint recognition method in complex scenes - Pending - CN118173102A (en)

Priority Applications (1)

Application Number: CN202410600097.3A · Priority/Filing Date: 2024-05-15 · Title: Bird voiceprint recognition method in complex scenes


Publications (1)

Publication Number: CN118173102A · Publication Date: 2024-06-11

Family

ID=91351017



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination