CN114038479A - Bird song recognition and classification method and device for coping with low signal-to-noise ratio and storage medium - Google Patents

Bird song recognition and classification method and device for coping with low signal-to-noise ratio and storage medium Download PDF

Info

Publication number
CN114038479A
CN114038479A CN202111323056.7A CN202111323056A CN114038479A CN 114038479 A CN114038479 A CN 114038479A CN 202111323056 A CN202111323056 A CN 202111323056A CN 114038479 A CN114038479 A CN 114038479A
Authority
CN
China
Prior art keywords
feature matrix
sounding
processing
candidate
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111323056.7A
Other languages
Chinese (zh)
Inventor
陈爱斌
伍安芸
周国雄
刘志华
彭伟雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University of Forestry and Technology
Original Assignee
Central South University of Forestry and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University of Forestry and Technology filed Critical Central South University of Forestry and Technology
Priority to CN202111323056.7A priority Critical patent/CN114038479A/en
Publication of CN114038479A publication Critical patent/CN114038479A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/87Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention discloses a low signal-to-noise ratio bird song recognition and classification method, a device and a storage medium, and relates to artificial intelligence. The method comprises the following steps: extracting a time sequence signal of the audio to be identified; stacking audio frames frame by frame according to the time sequence of the sampling point values in the time sequence signal to obtain a first characteristic matrix; carrying out endpoint detection on the first feature matrix by using a preset detection algorithm to obtain a second feature matrix of the real sounding interval of the bird song; stacking the audio frames frame by frame according to the time sequence again for the sampling point values in the real sounding interval of the bird song to obtain a calibrated third feature matrix; and performing feature extraction on the third feature matrix by using a convolutional neural network so as to input the third feature matrix into a gated cyclic network for processing to obtain an identification classification result. When the method is used for classifying and identifying the bird song, the accuracy of voice endpoint detection and the accuracy of bird song classification are both high.

Description

Bird song recognition and classification method and device for coping with low signal-to-noise ratio and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a low signal-to-noise ratio bird song recognition and classification method, a device and a storage medium.
Background
Bird song is as the indispensable sound in the forest, and the main part of bird song is closely related with local ecological environment quality, and birds is the important factor index that ecological scientist carried out environmental quality and assesses. Birds usually live in dense forests, and the traditional bird singing identification method mostly relies on experts to enter the forests for identification, so that the method has certain limitation.
At present, due to the rise of deep learning, the classification of bird song becomes a popular research branch in the field of audio classification, and the classification has an extremely important promoting effect on helping ecologists to evaluate the environmental quality. There are a lot of studies on deep learning, and a number of excellent algorithm models are applied to the field of audio classification, such as deep learning for urban environmental sound classification, human dialect classification, music genre classification, and the like. However, there are few studies on classification of bird song, and since bird song data collected in real life is often contaminated with non-target sound features, there is a problem in how to correctly recognize bird song in a low signal-to-noise ratio.
The existing identification method for setting bird song has two defects: firstly, the current bird song recognition directly converts the collected bird song into a spectrogram, and then sends the spectrogram into a network for recognition, so that environmental noise mixed with the acquired bird song can be used as an effective characteristic by the network, and thus, the classification of the bird song is certainly negatively influenced. Therefore, how to efficiently reduce the noise of the audio features is an urgent problem to be researched; secondly, bird song has the characteristic of time domain continuity, then most of the existing recognition models do not consider one point, so that the robustness of the models is not very strong, and how to accurately detect and separate bird song features is an important prerequisite for utilizing the characteristics of bird song features. However, existing detection algorithms are not applicable in low signal-to-noise ratio situations, and thus algorithms for bird song endpoint detection remain to be improved.
Disclosure of Invention
The invention provides a method, a device and a storage medium for identifying and classifying bird song, which are used for solving the problems that the accuracy of voice endpoint detection is not high and the accuracy of classifying bird song is not high when the existing bird song is identified.
The technical scheme provided by the invention for the technical problem is as follows:
in one aspect, the present invention provides a method for identifying and classifying birdsong coping with low signal-to-noise ratio, where the method includes:
extracting a time sequence signal of the audio to be identified;
stacking audio frames frame by frame according to the time sequence of the sampling point values in the time sequence signal to obtain a first characteristic matrix;
carrying out endpoint detection on the first feature matrix by using a preset detection algorithm to obtain a second feature matrix of the real sounding interval of the bird song;
stacking the audio frames frame by frame according to the time sequence again for the sampling point values in the real sounding interval to obtain a calibrated third feature matrix, wherein the first dimension of the second feature matrix is aligned with the first dimension of the third feature matrix;
and performing feature extraction on the third feature matrix by using a convolutional neural network so as to input the third feature matrix into a gated cyclic network for processing to obtain an identification classification result.
According to the method for identifying and classifying the bird song coping with the low signal-to-noise ratio, the performing end point detection on the first feature matrix by using a preset detection algorithm to obtain a second feature matrix of the real sounding interval of the bird song comprises the following steps:
detecting each first utterance candidate segment in the first feature matrix by adopting an energy detection algorithm;
merging and un-merging the adjacent first utterance candidate segments in each first utterance candidate segment according to a preset rule to obtain each second utterance candidate segment;
and selecting the second sounding candidate segment with the largest audio frame number in the second sounding candidate segments as the real sounding interval of the bird song of the second characteristic matrix.
According to the method for identifying and classifying the bird song coping with the low signal-to-noise ratio, the step of detecting each first sound-emitting candidate segment in the first feature matrix by adopting an energy detection algorithm comprises the following steps:
solving the energy sum of all audio frames in the first characteristic matrix;
calculating an energy average value of the audio frame according to the energy sum;
comparing the energy sum of each audio frame in the first feature matrix with the energy mean value one by one;
if the energy sum of the current audio frame is larger than the energy average value, judging that the current audio frame is a candidate frame for the singing and sounding of the bird;
recording all the bird singing sound production candidate frames to obtain a candidate frame list;
and stacking continuous audio frames in the candidate frame list according to the audio frame sequence in the first feature matrix to obtain each first pronunciation candidate segment.
According to the method for identifying and classifying the bird song coping with the low signal-to-noise ratio, the merging and un-merging the adjacent first utterance candidate segments in each first utterance candidate segment according to a preset rule to obtain each second utterance candidate segment comprises:
performing primary processing on the first utterance candidate segments to obtain a primary processing list, wherein the first processing comprises merging first merged utterance candidate segments obtained by corresponding adjacent first utterance candidate segments if the number of discontinuous audio frames between adjacent first utterance candidate segments in each first utterance candidate segment does not exceed a first threshold; otherwise, not merging; the primary processing list comprises primary processing sound production candidate segments, including the first combined sound production candidate segment obtained by combination and the first sound production candidate segment which is not combined; if the number of the first-stage processing sounding candidate segments is equal to 1, ending the processing;
if the number of the primary processing sounding candidate segments is more than 2, performing secondary processing on the primary processing sounding candidate segments to obtain a secondary processing list, wherein the secondary processing comprises the steps of obtaining the minimum value of discontinuous audio frame numbers between adjacent primary processing sounding candidate segments, and if the minimum value is not more than a second threshold value, combining second combined sounding candidate segments obtained by corresponding adjacent primary processing sounding candidate segments; otherwise, not merging; the secondary processing list comprises secondary processing sounding candidate segments, including the second combined sounding candidate segment obtained by combination and the primary processing sounding candidate segment which is not combined; wherein the second threshold is greater than the first threshold; if the number of the secondary processing sounding candidate segments is more than 2, repeating the secondary processing; if the number of the secondary processing sounding candidate segments is equal to 1, ending the processing;
and acquiring the primary processing sounding candidate segment obtained after finishing the processing or the secondary processing sounding candidate segment obtained after finishing the processing as the second sounding candidate segment.
According to the bird song recognition and classification method dealing with the low signal-to-noise ratio, the merging and un-merging the adjacent first utterance candidate segments in each first utterance candidate segment according to a preset rule to obtain each second utterance candidate segment further includes:
if the number of the primary processing sounding candidate segments is equal to 2 or the number of the secondary processing sounding candidate segments is equal to 2, performing three-level processing on the corresponding primary processing sounding candidate segments or the corresponding secondary processing sounding candidate segments to obtain a three-level processing list, wherein the three-level processing comprises combining the corresponding adjacent primary processing sounding candidate segments to obtain a third combined sounding candidate segment if the number of discontinuous audio frames between the adjacent primary processing sounding candidate segments does not exceed a third threshold value; otherwise, not merging; if the number of discontinuous audio frames between the adjacent secondary processing sounding candidate segments does not exceed a third threshold value, combining the corresponding adjacent secondary processing sounding candidate segments to obtain a fourth combined sounding candidate segment; otherwise, not merging; wherein the third threshold is greater than the second threshold; finishing the processing;
and acquiring the candidate section of the sounding obtained by the three-stage processing as the second candidate section of the sounding.
According to the method for identifying and classifying the bird song coping with the low signal-to-noise ratio, the step of extracting the features of the third feature matrix by using the convolutional neural network so as to input the third feature matrix into the gated cyclic network for processing to obtain the identification and classification result comprises the following steps:
inputting the third feature matrix into a convolutional neural network for pre-feature extraction to obtain a fourth feature matrix;
performing dimensionality reduction processing on the fourth feature matrix to obtain a fifth feature matrix with strong time domain continuity;
inputting the fifth feature matrix into the gated loop network to obtain a final classification prediction score;
and inputting the classification prediction score into an Argmax () function to obtain the identification classification result.
According to the method for identifying and classifying the bird song coping with the low signal-to-noise ratio, the step of inputting the third feature matrix into a convolutional neural network for pre-feature extraction to obtain a fourth feature matrix comprises the following steps:
inputting the third feature matrix into a convolutional neural network for convolution operation;
arranging the values obtained after the convolution operation according to the corresponding positions in the third feature matrix to obtain a sixth feature matrix;
activating each element in the sixth feature matrix by using an activation function to obtain an activation value;
and putting the activation values to corresponding positions in the third feature matrix and performing pooling operation to obtain the fourth feature matrix.
According to the method for identifying and classifying the bird song coping with the low signal-to-noise ratio, before the end point detection is performed on the first feature matrix by using a preset detection algorithm to obtain the second feature matrix of the real sounding interval of the bird song, the method further comprises the following steps:
and carrying out noise reduction treatment on the first characteristic matrix by adopting a weighted average threshold algorithm.
In a second aspect, the present invention provides a bird song recognition and classification apparatus coping with a low signal-to-noise ratio, the apparatus comprising:
the extraction module is used for extracting a time sequence signal of the audio to be identified;
the first processing module is used for stacking audio frames frame by frame according to the time sequence of the sampling point values in the time sequence signal to obtain a first characteristic matrix;
the detection module is used for carrying out endpoint detection on the first characteristic matrix by using a preset detection algorithm so as to obtain a second characteristic matrix of the real sounding interval of the bird song;
a second processing module, configured to stack audio frames frame by frame again according to a time sequence for sampling point values within the bird song true sounding interval to obtain a calibrated third feature matrix, where a first dimension of the second feature matrix is aligned with a first dimension of the third feature matrix;
and the identification and classification module is used for extracting the characteristics of the third characteristic matrix by using a convolutional neural network so as to input the third characteristic matrix into the gated cyclic network for processing to obtain an identification and classification result.
In a third aspect, the present invention also provides a storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the method for identifying and classifying bird song coping with low signal-to-noise ratio as described above.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
according to the low signal-to-noise ratio bird song recognition and classification method, the unconverted signal sampling point values are directly used as the feature matrix, compared with the traditional method that the spectrum features are obtained by a series of conversion of the audio, the operation of obtaining the features is simpler, and the calculation resources are saved. Meanwhile, the target characteristics of the bird song can be well detected under the condition of more noise by adopting an endpoint detection algorithm based on frame-level energy, then a two-dimensional convolutional neural network is adopted as a pre-characteristic extractor, and then the characteristics with strong time domain attributes are processed by utilizing a gated cyclic network, so that a good classification effect is achieved under the condition of low signal-to-noise ratio. Therefore, the method for identifying and classifying the bird song which deals with the low signal-to-noise ratio has relatively high accuracy rate of voice endpoint detection and relatively high accuracy of bird song classification.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a low snr bird song recognition and classification method according to an embodiment of the present invention;
FIG. 2 is a functional block diagram of the bird song recognition and classification apparatus for dealing with low SNR according to the present invention;
FIG. 3 is a diagram illustrating the noise reduction effect of the noise reduction process according to the present invention;
fig. 4 is a diagram illustrating the effectiveness of the endpoint detection algorithm provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Referring to fig. 1, a flow chart of a bird song recognition and classification method for handling a low signal-to-noise ratio according to an embodiment of the present invention is provided. The low signal to noise ratio bird song recognition and classification method is mainly used for carrying out bird song recognition and classification on the acquired material audio, and has good recognition efficiency and accuracy.
As shown in fig. 1, the method for identifying and classifying a bird song coping with a low signal-to-noise ratio according to the present embodiment may include the following steps:
s101: extracting a time sequence signal of the audio to be identified, wherein the audio to be identified can be a real-time recorded audio, or an audio file which is pre-recorded and stored in a memory, or of course, the audio to be identified can also be an audio file which is obtained by using an automatic script program from the largest public audio website of birds.
This step can be implemented by extracting the time-series signal y of the audio to be recognized using a library of librosa audio signal extractions in python, where y ═ { x ═1,x2,x3,…,xNAnd N is the length of the signal sequence. Here, librosa is a python toolkit for audio and music analysis and processing, and can perform time-frequency processing and feature extraction on an audio file.
S102: and stacking the audio frames frame by frame according to the sampling point values in the time sequence signal according to the time sequence to obtain a first feature matrix.
In this step, the specific implementation process may include normalizing the time series signal, that is, stacking audio frames frame by frame according to a time sequence on the sampling point values in the time series signal to obtain the first feature matrix of a fixed size, where each audio frame includes 1024 sampling point values.
The first feature matrix may be denoted as T and may be represented by the following equation:
Figure BDA0003346001980000061
where, the number of rows of T is 600, the number of columns is 1024, and xnmIs the sampling point value in the y sequence, and the first feature matrix T obtained after normalization is a matrix with the size of 600 multiplied by 1024. Here, the first feature matrix T is saved in the form of a picture.
S103: and carrying out endpoint detection on the first characteristic matrix by using a preset detection algorithm to obtain a second characteristic matrix of the real sounding interval of the bird song.
In this step, the preset detection algorithm may be an energy detection algorithm, and in the process of implementing this step, the following substeps may be specifically included:
s1031: detecting each first utterance candidate segment in the first feature matrix by using an energy detection algorithm, wherein the detection step is as follows:
s10311: and solving the energy sum of all audio frames in the first characteristic matrix, wherein the energy sum can be calculated as s and satisfies the relation:
Figure BDA0003346001980000071
wherein s isnFor the sum of the energies of each audio frame, the relation is satisfied:
Figure BDA0003346001980000072
s10312: calculating an energy mean value of the audio frame according to the energy sum, wherein d can be calculated, and the energy mean value satisfies the relation: d is s/600.
S10313: comparing the energy sum of each audio frame in the first feature matrix with the energy mean value one by one, specifically, sequentially judging the energy sum s of each audio frame in the first feature matrixnWhether d is greater than or equal to d.
S10314: and if the energy sum of the current audio frame is larger than the energy average value, judging that the current audio frame is a candidate frame for the bird singing and sounding. It will be appreciated that the current audio frame is only a candidate frame for utterance that is more likely to be a true bird song.
S10315: recording all of the bird song production candidate frames to obtain a candidate frame list, which may be w0And satisfies the relation:
Figure BDA0003346001980000073
wherein n is more than or equal to 1N≤n,n=600。
S10316: selecting the candidate according to the audio frame order in the first feature matrixSuccessive audio frames in the frame list are stacked to obtain each of the first utterance candidate segments. Here, the first utterance candidate may be counted asRjAnd satisfies the relation: rj=(n1,n2,…,ni),(n1,n2,…,ni)∈w0And all the stacked first utterance candidates may be chronologically maintained in a stacking process list, which may be w and satisfies the relation: w ═ R1,R2,…,Rj) Wherein j representsw0J is more than or equal to 1 and less than or equal to 300.
S1032: merging and un-merging the adjacent sound production candidate segments in each first sound production candidate segment according to a preset rule to obtain each second sound production candidate segment, wherein the sound production candidate segments in the stacking processing list w are processed by adopting an automatic breakpoint technology, and the processing steps are as follows:
s10321: performing primary processing on the first utterance candidate segments to obtain a primary processing list, wherein the first processing comprises merging first merged utterance candidate segments obtained by corresponding adjacent first utterance candidate segments if the number of discontinuous audio frames between adjacent first utterance candidate segments in each first utterance candidate segment does not exceed a first threshold; otherwise, not merging; the primary processing list comprises primary processing sound production candidate segments, including the first combined sound production candidate segment obtained by combination and the first sound production candidate segment which is not combined; and if the number of the primary processing sounding candidate segments is equal to 1, ending the processing. The first threshold is a1, which may be 15.
Here, the adjacent first utterance candidate segments R are sequentially calculatedj、Rj+1The number of discontinuous audio frames (which may be num) between them, if num ≦ a1, the two adjacent utterance candidates are merged, denoted as Rj=Rj+Rj+1Otherwise, not merging. The primary processing list obtained by primary processing can be counted as w1And satisfies the relation: w is a1=(R1,R2,…,Ri) Wherein i is more than or equal to 1 and less than or equal to j.
S10322: if the number of the primary processing sounding candidate segments is more than 2, performing secondary processing on the primary processing sounding candidate segments to obtain a secondary processing list, wherein the secondary processing comprises the steps of obtaining the minimum value of discontinuous audio frame numbers between adjacent primary processing sounding candidate segments, and if the minimum value is not more than a second threshold value, combining second combined sounding candidate segments obtained by corresponding adjacent primary processing sounding candidate segments; otherwise, not merging; the secondary processing list comprises secondary processing sounding candidate segments, including the second combined sounding candidate segment obtained by combination and the primary processing sounding candidate segment which is not combined; wherein the second threshold is greater than the first threshold; if the number of the secondary processing sounding candidate segments is more than 2, repeating the secondary processing; and if the number of the secondary processing sounding candidate segments is equal to 1, ending the processing. The second threshold is a2, which may take the value 30.
Here, the list w is processed by computing the primary processing1The number of the middle-stage processing utterance candidate segments (which can be counted as len1) is used as an execution condition: when the number of the primary processing sounding candidate segments is 1 (len1 is 1), ending the processing, and taking the primary processing sounding candidate segments as the second sounding candidate segments; when the number of the first-stage processing sounding candidate segments is more than 2, a first-stage processing list w is calculated1The number of discontinuous audio frames between all adjacent primary processing sound production candidate segments is counted as nsAnd satisfies the relation: ns (num1, num2, …, num-1), and then taking the minimum value n thereofminAnd satisfies the relation: n ismin=min(ns)。
If n isminIf the value is more than 2 × a1, taking the secondary processing sound production candidate segment as the second sound production candidate segment; if n isminIf num is less than or equal to a2, the two adjacent primary processing sound production candidate segments are merged and expressed as Ri=Ri+Ri+1Otherwise, not merging. The secondary processing list obtained by secondary processing can be counted as w2And satisfies the relation: w2 ═ R1,R2,…,Rii) Wherein, 1 is less than or equal to ii and less than or equal to i.
Computing a Secondary processing List w2The number of the secondary processing utterance candidate segments (which can be counted as len2) is middle, when the number of the secondary processing utterance candidate segments is 1 (len2 is 1), the processing is finished, and the secondary processing utterance candidate segments are used as the second utterance candidate segments; and when the number of the secondary processing sounding candidate segments is more than 2, performing secondary processing again until the number of the secondary processing sounding candidate segments is 1 or 2.
When the number of the primary processing utterance candidate segments is 2, the step in S10323 is performed; when the number of secondary-processing utterance candidate segments is 2, the step in S10323 is also performed.
S10323: if the number of the primary processing sounding candidate segments is equal to 2 or the number of the secondary processing sounding candidate segments is equal to 2, performing three-level processing on the corresponding primary processing sounding candidate segments or the corresponding secondary processing sounding candidate segments to obtain a three-level processing list, wherein the three-level processing comprises combining the corresponding adjacent primary processing sounding candidate segments to obtain a third combined sounding candidate segment if the number of discontinuous audio frames between the adjacent primary processing sounding candidate segments does not exceed a third threshold value; otherwise, not merging; if the number of discontinuous audio frames between the adjacent secondary processing sounding candidate segments does not exceed a third threshold value, combining the corresponding adjacent secondary processing sounding candidate segments to obtain a fourth combined sounding candidate segment; otherwise, not merging; wherein the third threshold is greater than the second threshold. The third threshold is a3, and preferably 40.
Here, if the number of the one-stage processing utterance candidate segments is equal to 2, the adjacent one-stage processing utterance candidate segments R are sequentially calculatedi、Ri+1The discontinuous audio frame number num between the two adjacent first-level processing sound production candidate segments is merged and is represented as R if num is less than or equal to a3i=Ri+Ri+1Otherwise, the combination is not carried out, and the processing is finished. The three-level processing list obtained by the three-level processing can be calculated as w3And satisfies the relation: w3 ═ (R1, R2, …, Riii), where 1. ltoreq. iii. ltoreq.i.
If the second level processing is soundingIf the number of the candidate segments is equal to 2, the adjacent two-stage processing sounding candidate segments R are calculated in sequenceii、Rii+1And if num is less than or equal to a3, merging two adjacent secondary processing sound production candidate segments, namely Rii-Rii + Rii +1, otherwise, not merging, and then ending the processing. Similarly, the list of three-stage treatments obtained by the three-stage treatment can be designated as w3And satisfies the relation: w3 ═ (R1, R2, …, Riii), where 1. ltoreq. iii. ltoreq.i.
S10324: and acquiring the primary processing sound production candidate section obtained after finishing the processing, the secondary processing sound production candidate section obtained after finishing the processing or the sound production candidate section obtained through the tertiary processing as the second sound production candidate section.
Here, when the number of the utterance candidate segments for the first-stage processing is 1, the list w for the third-stage processing may be made3First order processing list w1(ii) a Similarly, when the number of the second-stage processing sounding candidate segments is 1, the third-stage processing list w can be ordered3First order processing list w2
S1033: and selecting the second sounding candidate segment with the largest audio frame number in the second sounding candidate segments as the actual sounding interval of the bird song.
According to a three-level processing list w3Selecting the segment R with the largest audio frame numberiii=max(w3) And taking the first value and the last value in the sounding candidate segment as end values to respectively represent the starting frame and the ending frame of the bird song. Where the first value may be counted as C and the last value may be counted as D.
Before this step, the first feature matrix may be further subjected to noise reduction, where all the element values x in the first feature matrix may be calculated by using a weighted average threshold methodnmSum of (a) is calculated as
Figure BDA0003346001980000101
Then, an average value ave is obtained, wherein the average value ave satisfies the following relation: ave sum/D, where D represents the magnitude of the first feature matrixSmall and equal to 600 × 1024.
After the average value ave is calculated, weighting is performed to obtain an average threshold Q, which satisfies the following relation: q is (1+ α) × ave, where α is a weighting coefficient, and is 0.15.
Then, performing noise reduction processing on the first feature matrix T according to the average threshold Q, specifically, sequentially processing elements in the first feature matrix T according to the average threshold Q, and the processing manner is:
Figure BDA0003346001980000102
i.e. at the current element value xnmWhen less than Q, the current element value x is setnmSet to 0; at the current element value xnmWhen greater than or equal to Q, the current element value x is retainednm
The pre-processing feature matrix, designated as T1, is obtained by the following method:
Figure BDA0003346001980000103
s104: and stacking the audio frames frame by frame according to the time sequence again for the sampling point values in the real sounding interval to obtain a calibrated third feature matrix, wherein the first dimension of the second feature matrix is aligned with the first dimension of the third feature matrix.
The values between the C-th audio frame and the D-th audio frame are extracted from the first feature matrix T according to the first value C and the last value D, and a second feature matrix T2 representing the real sound segment of the bird song is obtained, which can be represented by the following formula:
Figure BDA0003346001980000104
wherein, C is more than or equal to 1 and less than or equal to D is less than or equal to 600.
And then, normalizing the second feature matrix again to complete matrix calibration, namely stacking the audio frames frame by frame according to the time sequence again for the sampling point values in the real sounding interval, so as to obtain a third feature matrix, wherein the first dimension of the second feature matrix is aligned with the first dimension of the third feature matrix. Here, the third feature matrix is T3, which can be represented by the following formula:
Figure BDA0003346001980000111
it will be appreciated that the third feature matrix T3 is a 600 x 1024 matrix corresponding in size to the first feature matrix T.
It is understood that the bird song data obtained through the processing in step S105 may be trained as an experimental sample, the training process includes forward propagation and backward propagation, the model parameters are updated according to the loss, and the training is finished and the model is saved when the loss value is small and tends to be stable, so as to obtain a trained model.
S106: and performing feature extraction on the third feature matrix by using a convolutional neural network so as to input the third feature matrix into a gated cyclic network for processing to obtain an identification classification result.
In this step, the size of the third feature matrix T3 is n0×m0Wherein n is0=600,m01024. In the process of implementing this step, the following sub-steps may be specifically included:
s1061: inputting the third feature matrix into a convolutional neural network for pre-feature extraction to obtain a fourth feature matrix, which may include the following steps:
s10611: inputting the third feature matrix into a convolutional neural network for convolution operation, specifically, inputting the third feature matrix T3 into a three-layer two-dimensional convolutional neural network, where the convolution operation performed by each layer of convolutional neural network satisfies the following relation:
Figure BDA0003346001980000112
Figure BDA0003346001980000113
wherein i ═ 1,2,3]Representing a current network hierarchy; [ f ] ofh,fw]Denotes the size of the convolution filter with a sliding step of s on the vertical and horizontal axesh,sw]P represents the fill size for the matrix dimension; n isii、miiAnd represents the dimension size of the third feature matrix T3 obtained after the convolution operation. Here, the convolution operation will be of size [ f ]h,fw]The weight matrix z of the filter of (2) can be represented by:
Figure BDA0003346001980000114
s10612: and arranging the values obtained after the convolution operation according to the corresponding positions in the third feature matrix to obtain a sixth feature matrix. Specifically, the region value X of the feature matrix introduced from the previous layer is multiplied in turn to obtain a value Xij. Wherein, the region value X satisfies the following relation:
Figure BDA0003346001980000121
and the value x obtainedijSatisfies the following relation:
Figure BDA0003346001980000122
the step size of the filter is s for each shift in the feature matrixh,sw]Then arranging the obtained values according to the relative positions of the values in the original matrix to obtain a feature matrix T after convolutionjAs the sixth feature matrix, it can be represented by the following equation:
Figure BDA0003346001980000123
s10613: activating each element in the sixth feature matrix by using an activation function to obtain an activation value, specifically, activating the feature matrix T by using the activation functionjEach element in the table is activated to obtain a corresponding activation value, and then the activation value is put back to its position in the original matrix, where g (x) can be used to represent the LEAKRELU () activation function, and the feature matrix TjEach element in (b) is a value obtained by activation processing of an activation function, wherein the activation function g (x) satisfies the following relation:
Figure BDA0003346001980000124
Figure BDA0003346001980000125
s10614: and putting the activation values to corresponding positions in the third feature matrix and performing pooling operation to obtain the fourth feature matrix. Specifically, a pooling operation is performed to obtain a feature matrix, the pooling operation satisfying the following relation:
Figure BDA0003346001980000126
Figure BDA0003346001980000127
wherein n isi,miFor the dimension of the feature matrix obtained after pooling operation, pcRepresenting the fill size of the feature matrix before pooling, the size of the pooling kernel being fhc,fwc]The step length of sliding on the vertical axis and the horizontal axis is shc,swc]. Here, the maximum pooling type is used, i.e. the size of the pooling kernel is a region, and accordingly, in the feature matrix TjThe areas with the same size are selected,satisfies the following relation:
Figure BDA0003346001980000131
then, the maximum value of the elements in the matrix is obtained as the representative element of the area, and the characteristic matrix T after the pooling operation is obtainedcAs the fourth feature matrix T4, the representative elements satisfy the relation:
Figure BDA0003346001980000132
feature matrix TcSatisfies the following relation:
Figure BDA0003346001980000133
by integrating S10611 to S10614, the three-layer two-dimensional convolutional neural network is used to extract high-level features, each convolutional operation includes a maximum pooling operation, and a feature matrix with a size of 8 × 17 × 10 is finally extracted as a final high-level feature, and the input and output sizes and parameter settings of each layer can be shown in the following table:
Figure BDA0003346001980000134
here, using the RMSProp gradient descent optimization algorithm as the gradient optimization method of the convolutional network, the momentum parameter is set to 0.7, the initial learning rate lr is 0.001, and the average accuracy (MAP) is used as the classification prediction effect index, while the cross entropy (cross entropy) loss function is selected as the loss measurement function of the network.
The loss function satisfies the following relation:
loss=-(y×log(h(x))+(1-y)×log(1-h(x))),
wherein y represents the case that the prediction result of the current sample is true or false, the value is 0 or 1, h (x) is the fraction of the model predicted by the current sample, and loss is the loss value.
S1062: and performing dimensionality reduction on the fourth feature matrix to obtain a fifth feature matrix with strong time domain continuity, wherein the fourth feature matrix T4 is reduced to be a 17 × 80 fifth feature matrix T5, which can support a subsequent gated round-robin network (GRU).
Here, after the convolution operation and the pooling operation of step S1061, the fifth feature matrix of 17 × 80 with strong time-domain continuity is obtained by concatenating the obtained fourth feature matrix (corresponding to the final high-level features) in the frequency dimension.
S1063: inputting the fifth feature matrix into the gated loop network to obtain a final classification prediction score, wherein the output of the last layer of the gated loop network is used as the final classification prediction score by utilizing the characteristic that the gated loop network is good at processing time series data, and the classification prediction score can be counted as T6.
S1064: inputting the classification prediction score into an Argmax () function to obtain the recognition classification result, specifically, inputting the classification prediction score into the Argmax () function to obtain a one-hot vector of the final classification result, wherein the one-hot vector is expressed as a one-dimensional vector such as [1,0,0,0], and the vector representation model presets the current experimental sample as the first bird.
According to the low signal-to-noise ratio bird song recognition and classification method, the unconverted signal sampling point values are directly used as the feature matrix, compared with the traditional method that the spectrum features are obtained by a series of conversion of the audio, the operation of obtaining the features is simpler, and the calculation resources are saved. Meanwhile, the target characteristics of the bird song can be well detected under the condition of more noise by adopting an endpoint detection algorithm based on frame-level energy, then a two-dimensional convolutional neural network is adopted as a pre-characteristic extractor, and then the characteristics with strong time domain attributes are processed by utilizing a gated cyclic network, so that a good classification effect is achieved under the condition of low signal-to-noise ratio. Therefore, the method for identifying and classifying the bird song which deals with the low signal-to-noise ratio has relatively high accuracy rate of voice endpoint detection and relatively high accuracy of bird song classification.
Furthermore, the noise reduction is carried out by adopting a weighted average threshold method, most of non-target characteristics can be removed, and a better noise reduction effect is achieved.
Referring to fig. 2, a functional block diagram of the bird song recognition and classification apparatus for dealing with low signal-to-noise ratio provided by the present invention is shown. The bird song recognition and classification device 100 for dealing with the low signal-to-noise ratio comprises an extraction module 11, a first processing module 12, a detection module 13, a second processing module 14 and a recognition and classification module 15, and all processes of the bird song recognition and classification method for dealing with the low signal-to-noise ratio are realized through the cooperation of all modules, so that the corresponding effect is realized. Wherein:
and the extraction module 11 is configured to extract a time series signal of the audio to be identified.
The first processing module 12 is configured to stack audio frames frame by frame according to a time sequence for the sample point values in the time series signal to obtain a first feature matrix.
And the detection module 13 is configured to perform endpoint detection on the first feature matrix by using a preset detection algorithm to obtain a second feature matrix of the true sounding interval of the bird song.
A second processing module 14, configured to stack audio frames frame by frame again according to a time sequence for sampling point values in the bird song true sounding interval, so as to obtain a calibrated third feature matrix, where a first dimension of the second feature matrix is aligned with a first dimension of the third feature matrix.
And the recognition and classification module 15 is configured to perform feature extraction on the third feature matrix by using a convolutional neural network so as to input the third feature matrix into a gated cyclic network for processing to obtain a recognition and classification result.
It is to be understood that, in addition to the above modules, the apparatus 100 for identifying and classifying a bird song that deals with a low signal-to-noise ratio may further include other modules, such as a noise reduction processing module, and the like, wherein the noise reduction processing module is configured to perform noise reduction processing on the first feature matrix by using a weighted average threshold algorithm.
Furthermore, the present invention provides a computer arrangement comprising a processor for implementing the steps of the above-described method for identifying and classifying birdsong coping with low signal-to-noise ratio when executing a computer program stored in a memory.
The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like which is the control center for the computer device and which connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the mobile phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
Furthermore, the present invention also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in the aforementioned bird song recognition classification method coping with a low signal-to-noise ratio.
Referring to fig. 3, a noise reduction effect graph after noise reduction processing in the present invention is shown, and comparing the effect graphs before noise reduction and after noise reduction, it can be seen that the background noise in the picture can be better removed by using the weighted average threshold method.
Referring to fig. 4, it is a diagram of the effectiveness proof of the endpoint detection algorithm provided by the present invention, where the left side is a picture without endpoint detection processing, the middle is a bird song picture divided after endpoint detection, and the right side is a bird song picture after renormalization.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method for identifying and classifying bird song coping with low signal-to-noise ratio, the method comprising:
extracting a time sequence signal of the audio to be identified;
stacking audio frames frame by frame according to the time sequence of the sampling point values in the time sequence signal to obtain a first characteristic matrix;
carrying out endpoint detection on the first feature matrix by using a preset detection algorithm to obtain a second feature matrix of the real sounding interval of the bird song;
stacking the audio frames frame by frame according to the time sequence again for the sampling point values in the real sounding interval to obtain a calibrated third feature matrix, wherein the first dimension of the second feature matrix is aligned with the first dimension of the third feature matrix;
and performing feature extraction on the third feature matrix by using a convolutional neural network so as to input the third feature matrix into a gated cyclic network for processing to obtain an identification classification result.
2. The method for identifying and classifying the bird song coping with the low signal-to-noise ratio as claimed in claim 1, wherein the performing the end point detection on the first feature matrix by using the preset detection algorithm to obtain the second feature matrix of the true sounding interval comprises:
detecting each first utterance candidate segment in the first feature matrix by adopting an energy detection algorithm;
merging and un-merging the adjacent first utterance candidate segments in each first utterance candidate segment according to a preset rule to obtain each second utterance candidate segment;
and selecting the second sounding candidate segment with the largest audio frame number in the second sounding candidate segments as the real sounding interval of the bird song of the second characteristic matrix.
3. The method of claim 2, wherein the detecting each first acoustic candidate segment in the first feature matrix using an energy detection algorithm comprises:
solving the energy sum of all audio frames in the first characteristic matrix;
calculating an energy average value of the audio frame according to the energy sum;
comparing the energy sum of each audio frame in the first feature matrix with the energy mean value one by one;
if the energy sum of the current audio frame is larger than the energy average value, judging that the current audio frame is a candidate frame for the singing and sounding of the bird;
recording all the bird singing sound production candidate frames to obtain a candidate frame list;
and stacking continuous audio frames in the candidate frame list according to the audio frame sequence in the first feature matrix to obtain each first pronunciation candidate segment.
4. The method according to claim 2, wherein the merging and non-merging the neighboring first utterance candidates for each of the first utterance candidates to obtain each of the second utterance candidates comprises:
performing primary processing on the first utterance candidate segments to obtain a primary processing list, wherein the first processing comprises merging first merged utterance candidate segments obtained by corresponding adjacent first utterance candidate segments if the number of discontinuous audio frames between adjacent first utterance candidate segments in each first utterance candidate segment does not exceed a first threshold; otherwise, not merging; the primary processing list comprises primary processing sound production candidate segments, including the first combined sound production candidate segment obtained by combination and the first sound production candidate segment which is not combined; if the number of the first-stage processing sounding candidate segments is equal to 1, ending the processing;
if the number of the primary processing sounding candidate segments is more than 2, performing secondary processing on the primary processing sounding candidate segments to obtain a secondary processing list, wherein the secondary processing comprises the steps of obtaining the minimum value of discontinuous audio frame numbers between adjacent primary processing sounding candidate segments, and if the minimum value is not more than a second threshold value, combining second combined sounding candidate segments obtained by corresponding adjacent primary processing sounding candidate segments; otherwise, not merging; the secondary processing list comprises secondary processing sounding candidate segments, including the second combined sounding candidate segment obtained by combination and the primary processing sounding candidate segment which is not combined; wherein the second threshold is greater than the first threshold; if the number of the secondary processing sounding candidate segments is more than 2, repeating the secondary processing; if the number of the secondary processing sounding candidate segments is equal to 1, ending the processing;
and acquiring the primary processing sounding candidate segment obtained after finishing the processing or the secondary processing sounding candidate segment obtained after finishing the processing as the second sounding candidate segment.
5. The method according to claim 4, wherein the merging and non-merging adjacent first utterance candidates in each of the first utterance candidates to obtain each of second utterance candidates according to a preset rule further comprises:
if the number of the primary processing sounding candidate segments is equal to 2 or the number of the secondary processing sounding candidate segments is equal to 2, performing three-level processing on the corresponding primary processing sounding candidate segments or the corresponding secondary processing sounding candidate segments to obtain a three-level processing list, wherein the three-level processing comprises combining the corresponding adjacent primary processing sounding candidate segments to obtain a third combined sounding candidate segment if the number of discontinuous audio frames between the adjacent primary processing sounding candidate segments does not exceed a third threshold value; otherwise, not merging; if the number of discontinuous audio frames between the adjacent secondary processing sounding candidate segments does not exceed a third threshold value, combining the corresponding adjacent secondary processing sounding candidate segments to obtain a fourth combined sounding candidate segment; otherwise, not merging; wherein the third threshold is greater than the second threshold; finishing the processing;
and acquiring the candidate section of the sounding obtained by the three-stage processing as the second candidate section of the sounding.
6. The method for identifying and classifying the bird song coping with the low signal-to-noise ratio according to claim 1, wherein the extracting the features of the third feature matrix by using the convolutional neural network to input the feature matrix into the gated cyclic network for processing to obtain the identification and classification result comprises:
inputting the third feature matrix into a convolutional neural network for pre-feature extraction to obtain a fourth feature matrix;
performing dimensionality reduction processing on the fourth feature matrix to obtain a fifth feature matrix with strong time domain continuity;
inputting the fifth feature matrix into the gated loop network to obtain a final classification prediction score;
and inputting the classification prediction score into an Argmax () function to obtain the identification classification result.
7. The method of claim 6, wherein the step of inputting the third feature matrix into a convolutional neural network for pre-feature extraction to obtain a fourth feature matrix comprises:
inputting the third feature matrix into a convolutional neural network for convolution operation;
arranging the values obtained after the convolution operation according to the corresponding positions in the third feature matrix to obtain a sixth feature matrix;
activating each element in the sixth feature matrix by using an activation function to obtain an activation value;
and putting the activation values to corresponding positions in the third feature matrix and performing pooling operation to obtain the fourth feature matrix.
8. The method for identifying and classifying the bird song coping with the low signal-to-noise ratio according to any one of claims 1 to 6, wherein before the performing the end point detection on the first feature matrix by using a preset detection algorithm to obtain the second feature matrix of the true sounding interval, the method further comprises:
and carrying out noise reduction treatment on the first characteristic matrix by adopting a weighted average threshold algorithm.
9. A bird song recognition and classification apparatus that copes with a low signal-to-noise ratio, the apparatus comprising:
the extraction module is used for extracting a time sequence signal of the audio to be identified;
the first processing module is used for stacking audio frames frame by frame according to the time sequence of the sampling point values in the time sequence signal to obtain a first characteristic matrix;
the detection module is used for carrying out endpoint detection on the first characteristic matrix by using a preset detection algorithm so as to obtain a second characteristic matrix of the real sounding interval of the bird song;
a second processing module, configured to stack audio frames frame by frame again according to a time sequence for sampling point values within the bird song true sounding interval to obtain a calibrated third feature matrix, where a first dimension of the second feature matrix is aligned with a first dimension of the third feature matrix;
and the identification and classification module is used for extracting the characteristics of the third characteristic matrix by using a convolutional neural network so as to input the third characteristic matrix into the gated cyclic network for processing to obtain an identification and classification result.
10. A storage medium having stored thereon a computer program, wherein the computer program, when being executed by a processor, implements the steps of the method for identifying and classifying birdsong coping with low signal-to-noise ratio according to any one of claims 1 to 8.
CN202111323056.7A 2021-11-09 2021-11-09 Bird song recognition and classification method and device for coping with low signal-to-noise ratio and storage medium Pending CN114038479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111323056.7A CN114038479A (en) 2021-11-09 2021-11-09 Bird song recognition and classification method and device for coping with low signal-to-noise ratio and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111323056.7A CN114038479A (en) 2021-11-09 2021-11-09 Bird song recognition and classification method and device for coping with low signal-to-noise ratio and storage medium

Publications (1)

Publication Number Publication Date
CN114038479A true CN114038479A (en) 2022-02-11

Family

ID=80143710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111323056.7A Pending CN114038479A (en) 2021-11-09 2021-11-09 Bird song recognition and classification method and device for coping with low signal-to-noise ratio and storage medium

Country Status (1)

Country Link
CN (1) CN114038479A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863937A (en) * 2022-05-17 2022-08-05 武汉工程大学 Hybrid birdsong identification method based on deep migration learning and XGboost

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863937A (en) * 2022-05-17 2022-08-05 武汉工程大学 Hybrid birdsong identification method based on deep migration learning and XGboost

Similar Documents

Publication Publication Date Title
Lim et al. Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks.
Stöter et al. Countnet: Estimating the number of concurrent speakers using supervised learning
Dennis et al. Image feature representation of the subband power distribution for robust sound event classification
KR100745976B1 (en) Method and apparatus for classifying voice and non-voice using sound model
CN111179975A (en) Voice endpoint detection method for emotion recognition, electronic device and storage medium
Dennis Sound event recognition in unstructured environments using spectrogram image processing
CN106710599A (en) Particular sound source detection method and particular sound source detection system based on deep neural network
Qian et al. Wavelets revisited for the classification of acoustic scenes
Massoudi et al. Urban sound classification using CNN
Tan et al. Evaluation of a Sparse Representation-Based Classifier For Bird Phrase Classification Under Limited Data Conditions.
CN113327626A (en) Voice noise reduction method, device, equipment and storage medium
CN112183107A (en) Audio processing method and device
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
Battaglino et al. Acoustic context recognition using local binary pattern codebooks
Naranjo-Alcazar et al. On the performance of residual block design alternatives in convolutional neural networks for end-to-end audio classification
Kumar et al. Intelligent Audio Signal Processing for Detecting Rainforest Species Using Deep Learning.
Ntalampiras et al. Exploiting temporal feature integration for generalized sound recognition
CN114038479A (en) Bird song recognition and classification method and device for coping with low signal-to-noise ratio and storage medium
Liu et al. Birdsong classification based on multi feature channel fusion
Ozerov et al. GMM-based classification from noisy features
Bang et al. Evaluation of various feature sets and feature selection towards automatic recognition of bird species
Thakare et al. Comparative analysis of emotion recognition system
Hajihashemi et al. Novel time-frequency based scheme for detecting sound events from sound background in audio segments
JP2011191542A (en) Voice classification device, voice classification method, and program for voice classification
Nicolson et al. Sum-product networks for robust automatic speaker identification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination