WO2016176887A1 - Animal sound recognition method based on dual spectrogram features - Google Patents
Animal sound recognition method based on dual spectrogram features
- Publication number
- WO2016176887A1 WO2016176887A1 PCT/CN2015/080284 CN2015080284W WO2016176887A1 WO 2016176887 A1 WO2016176887 A1 WO 2016176887A1 CN 2015080284 W CN2015080284 W CN 2015080284W WO 2016176887 A1 WO2016176887 A1 WO 2016176887A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sound
- lbpv
- lbp
- equivalent
- feature
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 34
- 241001465754 Metazoa Species 0.000 title claims abstract description 32
- 239000013598 vector Substances 0.000 claims abstract description 58
- 230000005236 sound signal Effects 0.000 claims abstract description 50
- 239000011159 matrix material Substances 0.000 claims abstract description 49
- 238000007637 random forest analysis Methods 0.000 claims abstract description 20
- 238000012549 training Methods 0.000 claims abstract description 16
- 238000001228 spectrum Methods 0.000 claims description 92
- 238000010586 diagram Methods 0.000 claims description 18
- 238000003066 decision tree Methods 0.000 claims description 16
- 238000000354 decomposition reaction Methods 0.000 claims description 15
- 238000004422 calculation algorithm Methods 0.000 claims description 13
- 238000009499 grossing Methods 0.000 claims description 12
- 230000008569 process Effects 0.000 claims description 9
- 238000012360 testing method Methods 0.000 claims description 9
- 238000009826 distribution Methods 0.000 claims description 8
- 238000005070 sampling Methods 0.000 claims description 7
- 238000006243 chemical reaction Methods 0.000 claims description 6
- 230000003595 spectral effect Effects 0.000 claims description 6
- 230000009977 dual effect Effects 0.000 claims description 5
- 230000002457 bidirectional effect Effects 0.000 claims description 4
- 238000004364 calculation method Methods 0.000 claims description 4
- 230000007704 transition Effects 0.000 claims description 4
- 125000004122 cyclic group Chemical group 0.000 claims description 3
- 230000001419 dependent effect Effects 0.000 claims description 3
- 238000012545 processing Methods 0.000 claims description 3
- 238000010845 search algorithm Methods 0.000 claims description 3
- 238000012706 support-vector machine Methods 0.000 description 7
- 241000124879 Grus leucogeranus Species 0.000 description 4
- 238000001514 detection method Methods 0.000 description 4
- 238000004458 analytical method Methods 0.000 description 3
- 238000013528 artificial neural network Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 2
- 230000005484 gravity Effects 0.000 description 2
- 238000000513 principal component analysis Methods 0.000 description 2
- 241001125281 Eubalaena glacialis Species 0.000 description 1
- 241000288140 Gruiformes Species 0.000 description 1
- 230000003044 adaptive effect Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000003708 edge detection Methods 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 230000036039 immunity Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000000630 rising effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
Definitions
- the invention relates to an animal sound recognition method based on dual spectrogram features.
- the ecological environment is closely related to our lives, and the animal sounds in the ecological environment contain rich information. Through the identification of animal sounds, we can understand and analyze their living habits and distribution, so that they can be effectively monitored and protected. In recent years, animal voice recognition has received increasing attention.
- Animal sound recognition is generally based on spectrograms, time series, Mel Frequency Cepstrum Coefficients (MFCC), sound library indexing and wavelet packet decomposition, with classification performed by, for example, a support vector machine (SVM).
- Typical methods include recognizing animal sounds based on Spectrogram Correlation, and Right Whale call detection that applies an 'edge' detector to smoothed spectrograms to extract features.
- Classical text-based database query methods have also been adapted, giving index-based animal sound retrieval and animal sound retrieval based on context variables. Recently, Exadaktylos et al. determined the state of animals through sound recognition for optimizing livestock production.
- Features combining time and frequency mainly include time and frequency features, wavelet-domain features, and features extracted by a Gabor-dictionary matching pursuit algorithm.
- Recent research also includes low-signal-to-noise sound event recognition based on Wavelet Packets filtering, sound event recognition based on high-pass filtered MFCC extended features, and sound event recognition and detection based on random regression forests with multiple cross-superframes.
- the matching pursuit algorithm is used to select important atoms from Gabor dictionary, and the characteristics of sound events are determined by principal component analysis (PCA) and linear discriminant analysis (LDA).
- PCA principal component analysis
- LDA linear discriminant analysis
- an SVM classifier is then used for classification and recognition; this is notably effective for recognizing low-SNR sound events.
- As for spectrogram-related features, the spectrogram is mainly obtained by applying a Short-Time Fourier Transform (STFT) to the sound signal, so that some image recognition methods can be applied to low-SNR sound recognition.
- STFT Short-Time Fourier Transform
- Khunarsal et al. propose an environmental sound classification method that combines feedforward neural networks and k-nearest neighbors (k-NN) using spectrogram pattern matching. We also extracted the gray level co-occurrence matrix features from the spectrogram and combined the random forest classifier to identify the bird sounds.
- Duan et al. proposed a sound enhancement algorithm based on non-negative spectrogram decomposition.
- Dennis et al. proposed a sound event recognition method based on the characteristics of the spectrogram.
- Czarnecki and Moszyński use the Concentrated Spectrograph method for time-frequency analysis of sound signals.
- Dennis et al. proposed Local Spectrogram Features to identify overlapping sound events using a generalized Hough Transform voting system.
- McLoughlin et al. proposed Spectrogram Image-based Front End Features to classify sound events using SVM and Deep Neural Network classifiers.
- the sub-band power distribution (SPD) feature proposed by Dennis et al. separates reliable sound events from noise in the spectrum and identifies the features using the nearest neighbor classifier (kNN). This method can also identify related sound events when the signal-to-noise ratio is as low as 0 dB. However, for different sound environments, the overall recognition accuracy is still low for various low SNR sound signals.
- an animal voice recognition method based on dual features of a sound spectrum diagram which is characterized by the following steps:
- Step S1 establishing a sound sample library for pre-storing sound samples
- Step S2 collecting a sound signal to be identified
- Step S3 converting the pre-stored sound sample and the sound signal to be recognized into a sound spectrum map
- Step S4 Normalizing the spectrogram, and performing eigenvalue decomposition and projection on the normalized spectrogram, and converting it into a projection feature XK ;
- Step S5 converting the sound spectrum into an equivalent LBP value matrix u, and counting the variance of the pixel corresponding to each equivalent LBP value and the surrounding pixel gray value to form a feature vector LBPV;
- Step S6 combining the projection feature X K and the feature vector LBPV to form a two-layer feature X K +LBPV;
- Step S7 taking the two-layer feature set corresponding to the pre-stored sound samples in the sound sample library as a training sample set, and taking the two-layer feature corresponding to the sound signal to be identified as an input sample, and obtaining, through random forest training, the category in the sound sample library corresponding to the sound signal to be identified, and outputting the result.
- step S3 conversion process is as follows:
- step S4 is as follows:
- the normalized log scale vector S t represents data of the t-th frame of the normalized log scale
- the matrix U ∈ R^(N×N) contains all the eigenvectors μ_1, ..., μ_N of the matrix C, Λ is a diagonal matrix whose diagonal elements are the eigenvalues λ_1, ..., λ_N, which represent the weights of the corresponding eigenvectors, with λ_1 ≥ λ_2 ≥ ... ≥ λ_N; the contribution ratio η_K of the first K eigenvalues is then calculated by the following formula, and η_K measures the importance of the first K eigenvectors in representing the sound:
- the matrix U carries the main information of the sound; the first K eigenvectors are selected to form a basis vector matrix U_K ∈ R^(N×K), and the projection feature X_K is the projection of the spectrogram matrix X onto the basis vector matrix U_K ∈ R^(N×K):
- step S5 is as follows:
- the texture T is a joint distribution T of P pixels on the ring neighborhood with radius R centered on the g c pixel:
- the binary pattern is calculated according to the 0/1 sequence of the joint distribution T sorted in a specific direction combined with the LBP operator to form an LBP value, that is, LBP P, R :
- the superscript u2 indicates that the U value corresponding to the LBP is at most 2; the equivalent patterns reduce the number of patterns from 2^P to P(P-1)+2, and all patterns other than the equivalent patterns are grouped into the (P(P-1)+3)-th class;
- the equivalent LBP is extracted from the spectrogram, and each pixel (m, n) obtains an equivalent LBP value.
- These equivalent LBP values form an equivalent LBP map, which is the equivalent LBP value matrix u; counting the frequency of occurrence of each value in the equivalent LBP map yields the texture feature vector of the spectrogram, but equivalent LBP maps with the same equivalent LBP values may have different textures.
- the variance of the pixel corresponding to each equivalent LBP value and the gray value of the surrounding pixels is calculated to form a feature vector LBPV.
- the kth component LBPV(k) of the feature vector LBPV is expressed as:
- the integer k ranges over k ∈ [1, P(P-1)+3], and w(m, n, k) denotes the weight (the equivalent LBP value) of pixel (m, n) in the spectrogram corresponding to the k-th component of LBPV;
- LBPV(k) accumulates, over all pixels in the spectrogram, the weights of the equivalent LBP values corresponding to the k-th component; according to formula (14), the obtained LBPV(1), LBPV(2), ..., LBPV(k), ..., LBPV(P(P-1)+3) finally form a feature vector LBPV of size P(P-1)+3.
- step S7 is as follows:
- The two-layer feature corresponding to the sound signal collected by the test sound module is used as the input sample; it is placed at the root node of each of the s decision trees in the random forest and passed down according to the classification rules of each decision tree until reaching a leaf node, whose class label is that decision tree's vote for the category l of the two-layer feature.
- The s decision trees of the random forest each vote on the category l of the two-layer feature, yielding s votes; these s votes are tallied, and the category l with the most votes is the category corresponding to the two-layer feature.
- a sound enhancement step is further included between step S2 and step S3, in which the pre-stored sound samples and the sound signal to be identified are enhanced; the enhancement processing uses a short-time spectral estimation algorithm.
- the sound signal y(t) can be expressed as:
- s(t) is the animal sound
- n(t) is the ambient sound
- the magnitude spectrum Y(k,l) is obtained by performing an STFT on the sound signal y(t), where k is the frame index and l is the frequency index;
- short-time spectral estimation is composed of three parts: estimation of the ambient sound power spectrum N(k,l), calculation of the gain factor G(k,l), and calculation of the enhanced sound signal magnitude spectrum F(k,l):
- Step S81 Smoothing the power spectrum
- Step S82 Find the minimum spectral component of S(k, l) by a combined forward and backward bidirectional search algorithm:
- S_min(k,l) = max{S_min1(k,l), S_min2(k,l)} (20), S_min1(k,l) = min{S(i,l)}, k-D+1 ≤ i ≤ k (21), S_min2(k,l) = min{S(i,l)}, k ≤ i ≤ k+D-1 (22), where S_min1(k,l) represents the minimum over the D frames of the forward search, S_min2(k,l) represents the minimum over the D frames of the backward search, and S_min(k,l) represents the minimum spectral component obtained by the bidirectional search;
- Step S83 Calculating the probability that the animal sound exists:
- α_1 is a constant smoothing parameter, set to α_1 = 0.2;
- H(k, l) is the criterion for the presence of ambient sound:
- φ(k) is the frequency-dependent decision threshold
- Step S84 Calculating the time-frequency smoothing factor η(k, l) to perform ambient sound power spectrum estimation.
- α_2 = 0.95 is set according to the actual situation, and it is obvious that α_2 ≤ η(k, l) ≤ 1.
- the noise power spectrum can be estimated from η(k,l):
- Step S85 Calculating the spectrum gain factor:
- Step S86 obtaining an enhanced audio signal amplitude spectrum:
- the invention has the following beneficial effects:
- the present invention proposes a two-layer feature combining projection features and LBPV features for animal sound recognition in various environments, which not only improves the recognition rate, but also has high noise immunity;
- the present invention proposes the use of a random forest identifier for the identification of two-layer features
- the present invention proposes a short-time spectrum estimation sound enhancement combined with a two-layer feature and a random forest architecture, and is particularly suitable for low SNR animal sound recognition.
- Figure 1 is a flow chart of the algorithm of the present invention.
- FIG. 2 is a block diagram of a system according to an embodiment of the present invention.
- FIG. 3 is a system block diagram of a second embodiment of the present invention.
- FIG. 4 is a schematic diagram of a module for sound enhancement using a short time spectrum estimation algorithm according to an embodiment of the present invention.
- Figure 5a is a spectrogram of a Siberian crane call in an embodiment of the present invention.
- Figure 5b is a normalized log-scale spectrogram of the Siberian crane call in accordance with an embodiment of the present invention.
- Fig. 6 is a graph showing, for the Siberian crane call of an embodiment of the present invention, the proportion of the sum of the first K eigenvalues to the sum of all eigenvalues.
- Fig. 7a is a schematic diagram showing the gray values of a 3×3 image region in an embodiment of the present invention.
- Figure 7b is a diagram showing the LBP value of the intermediate pixel point c of Figure 7a of the present invention.
- Figure 7c is an equivalent LBP diagram formed by the present invention for calculating the equivalent LBP value for the solid-line frame portion of Figure 7a.
- Figure 7d is a schematic diagram of the variance matrix v of the pixels corresponding to the solid-line box portion of Figure 7a.
- Figure 7e is a frequency histogram of each mode of the present invention.
- Figure 7f is the LBPV histogram formed by computing LBPV(k) from the variances of Figure 7d, using the equivalent LBP values of Figure 7c and the sequence numbers k of Table 1.
- Figure 8a is an equivalent LBP diagram transformed from Figure 5a of the present invention.
- Figure 8b is an equivalent LBP histogram of Figure 8a of the present invention.
- Figure 8c is the LBPV histogram of Figure 8a.
- Figure 9 is a schematic diagram showing the basic principle of the random forest of the present invention.
- the present invention provides an animal sound recognition method based on a dual feature of a sound spectrum, which comprises the following steps:
- Step S1 establishing a sound sample library for pre-storing sound samples
- Step S2 collecting a sound signal to be identified
- Step S3 converting the pre-stored sound sample and the sound signal to be recognized into a sound spectrum map
- Step S4 Normalizing the spectrogram, and performing eigenvalue decomposition and projection on the normalized spectrogram, and converting it into a projection feature XK ;
- Step S5 converting the sound spectrum into an equivalent LBP value matrix u, and counting the variance of the pixel corresponding to each equivalent LBP value and the surrounding pixel gray value to form a feature vector LBPV;
- Step S6 combining the projection feature X K and the feature vector LBPV to form a two-layer feature X K +LBPV;
- Step S7 taking the two-layer feature set corresponding to the pre-stored sound samples in the sound sample library as a training sample set, and taking the two-layer feature corresponding to the sound signal to be identified as an input sample, and obtaining, through random forest training, the category in the sound sample library corresponding to the sound signal to be identified, and outputting the result.
- step S3 conversion process is as follows:
- step S4 is as follows:
- the normalized log scale vector S t represents the data of the t-th frame of the normalized log scale
- FIG. 5b shows the spectrogram of the normalized log scale of FIG. 5a.
- the matrix U ⁇ R N ⁇ N contains all the eigenvectors ⁇ 1 , . . . , ⁇ N of the matrix C
- ⁇ is a diagonal matrix
- the elements on the diagonal are eigenvalues ⁇ 1 , . . . , ⁇ N
- the eigenvalues ⁇ 1 , . . . , ⁇ N represent the weights of the corresponding feature vectors
- the magnitude λ_n of an eigenvalue reflects the importance of its corresponding eigenvector μ_n for the sound: the larger the eigenvalue, the more important the corresponding eigenvector;
- the matrix U carries the main information of the sound; the first K eigenvectors are selected to form a basis vector matrix U_K ∈ R^(N×K), and the projection feature X_K is the projection of the spectrogram matrix X onto the basis vector matrix U_K ∈ R^(N×K):
- step S5 is as follows:
- LBPV is a vector formed by accumulating the variances of all pixels corresponding to each mode in the ULBP.
- the equivalent LBP value describes the spatial structure of the image texture feature, and the variance represents the contrast information, and the LBPV vector combines the two features.
- the texture T is a joint distribution T of P pixels on the ring neighborhood with radius R centered on the g c pixel:
- g c represents the pixel value of the central pixel of the ring domain
- s is Symbol function:
- the binary pattern is calculated according to the 0/1 sequence of the joint distribution T sorted in a specific direction combined with the LBP operator to form an LBP value, that is, LBP P, R :
- for edge pixels, the corresponding pixels can first be extended in the manner shown by the broken line in Fig. 7a, and the LBP value is then calculated with equation (11).
- an equivalent mode which corresponds to a cyclic binary from 0 to 1 or from 1 to 0 up to two times.
- the U value represents the number of transitions in the equivalent mode, and the equivalent value is determined by the U value:
- the superscript u2 of the LBP value indicates that the U value corresponding to the LBP is at most 2; the equivalent patterns reduce the number of patterns from 2^P to P(P-1)+2, and all patterns other than the equivalent patterns are grouped into the (P(P-1)+3)-th class. Taking Figure 7a as an example, with P = 8 and R = 1,
- 59 equivalent LBP values can be obtained. Mapping them to the sequence numbers k = 1-59 gives the correspondence between equivalent LBP values and sequence numbers k shown in Table 1, where ULBP(k) is the LBP value corresponding to sequence number k;
- the equivalent LBP is extracted, and each pixel (m, n) obtains an equivalent LBP value.
- These equivalent LBP values form an equivalent LBP map, which is the equivalent LBP value matrix u; counting the frequency of occurrence of each value in the equivalent LBP map yields the texture feature vector of the spectrogram. FIG. 7c shows the equivalent LBP map obtained by computing equivalent LBP values for the solid-line box portion of FIG. 7a;
- this equivalent LBP map is likewise a matrix consisting of equivalent LBP values, i.e., an equivalent LBP value matrix u.
- Figure 7e shows the frequency histogram of each pattern appearing, that is, the texture feature vector of Figure 7a;
- however, equivalent LBP maps with the same equivalent LBP values may have different textures; therefore, we use the variance to represent the contrast information:
- the larger the variance, the greater the texture variation in the region. For each equivalent LBP value, the variance between the corresponding pixel and the surrounding pixel gray values is accumulated to form a feature vector LBPV, and the kth component LBPV(k) of the feature vector LBPV is expressed as:
- the integer k ranges over k ∈ [1, P(P-1)+3], and w(m, n, k) denotes the weight (the equivalent LBP value) of pixel (m, n) in the spectrogram corresponding to the k-th component of LBPV;
- LBPV(k) accumulates, over all pixels in the spectrogram, the weights of the equivalent LBP values corresponding to the k-th component; according to formula (14), the obtained LBPV(1), LBPV(2), ..., LBPV(k), ..., LBPV(P(P-1)+3) finally form a feature vector LBPV of size P(P-1)+3.
- Figure 7d is the variance matrix v of the corresponding pixel in the solid line region of Figure 7a
- Figure 7f is the LBPV histogram, i.e. the LBPV feature, formed by computing LBPV(k) from the variances of Figure 7d using the equivalent LBP values of Figure 7c and the corresponding sequence numbers k of Table 1; its schematic process is as follows:
- LBPV = {0, ..., LBPV(38), 0, ..., LBPV(44), 0, LBPV(46), 0, 0, LBPV(49), 0, ..., LBPV(58), 0}; substituting the corresponding values gives the histogram shown in Figure 7f.
- Figs. 8a-8c show the comparison of the LBP histogram of the equivalent mode with the LBPV histogram.
- in Figure 8b, the equivalent LBP value 255 occurs with a particularly high frequency, that is, the proportion of the binary pattern 11111111 is particularly high, which means that blank regions of the spectrogram, or regions with identical gray values, account for a particularly high proportion;
- the LBPV histogram of Figure 8c, which uses the variance of the surrounding gray values as the weight, better reflects the texture changes in the spectrogram, which is beneficial for classification and recognition.
- step S7 is as follows:
- Random forest is an integrated classifier algorithm that uses multiple decision tree classifiers to discriminate data.
- the principle is shown in Figure 9.
- bootstrap resampling is applied to the two-layer feature set (or the projection feature set, or the LBPV feature vector set) corresponding to the sound samples pre-stored in the sound sample module, which serves as the training sample set; s decision trees are generated from it, forming a random forest.
- the process of recognizing the sound to be recognized using the random forest is as follows: the two-layer feature corresponding to the sound signal collected by the test sound module
- (or the projection feature X_k, or the feature vector LBPV) is the input sample; it is placed at the root node of each of the s decision trees in the random forest and passed downward according to the classification rules of each decision tree until reaching a leaf node, whose
- class label is that decision tree's vote for the category l of the two-layer feature;
- the s decision trees of the random forest each vote on the category l of the two-layer feature, yielding s votes; these votes are tallied, and the category with the most votes is the category of the two-layer feature.
- a sound enhancement is further included between the step S2 and the step S3, and the pre-stored sound sample and the sound signal to be recognized are enhanced.
- the enhancement process employs a short time spectrum estimation algorithm.
- the sound signal y(t) can be expressed as:
- s(t) is the animal sound
- n(t) is the ambient sound
- the magnitude spectrum Y(k,l) is obtained by performing an STFT on the sound signal y(t), where k is the frame index and l is the frequency index;
- short-time spectral estimation is composed of three parts: estimation of the ambient sound power spectrum N(k,l), calculation of the gain factor G(k,l), and calculation of the enhanced sound signal magnitude spectrum F(k,l):
- Step S81 Smoothing the power spectrum
- Step S82 Find the S(k, l) minimum spectral component by a forward and backward combined bidirectional search algorithm:
- S min1 (k, l) represents the minimum value of the forward search D frame
- S min2 (k, l) represents the minimum value of the backward search D frame
- S_min(k, l) represents the minimum spectral component obtained by the bidirectional search
- Step S83 Calculating the probability that the animal sound exists:
- α_1 is a constant smoothing parameter, set to α_1 = 0.2.
- H(k, l) is the criterion for the existence of ambient sound:
- φ(k) is the frequency-dependent decision threshold
- Step S84 Calculating the time-frequency smoothing factor η(k, l) to perform ambient sound power spectrum estimation.
- η(k,l) = α_2 + (1-α_2)P(k,l) (26)
- α_2 = 0.95 is set according to the actual situation; obviously, α_2 ≤ η(k,l) ≤ 1.
- the noise power spectrum can be estimated from η(k,l):
- Step S85 Calculating the spectrum gain factor:
- Step S86 obtaining an enhanced audio signal amplitude spectrum:
- the system used in the present invention includes a spectrogram module.
- the input end of the spectrogram module is connected to a sound sample library module and a test sound module, and the output end of the spectrogram module is connected to the input ends of a projection
- feature module and an LBPV feature module; the output ends of the projection feature module and the LBPV feature module are respectively connected to the input end of a two-layer feature module, and the output end of the two-layer feature module is connected in sequence to an RF identification module and a result output module;
- the sound spectrum map module converts the sound sample pre-stored in the sound sample library module and the sound signal collected by the test sound module into a sound spectrum map
- the projection feature module normalizes the spectrogram outputted by the spectrogram module, and performs eigenvalue decomposition and projection on the normalized spectrogram to obtain a projection feature X K ;
- the LBPV feature module converts the spectrogram outputted by the spectrogram module into an equivalent LBP value matrix u, and calculates a variance of a pixel corresponding to each equivalent LBP value and a surrounding pixel gray value to form a feature vector LBPV. ;
- the two-layer feature module combines the projection feature X K output by the projection feature module and the feature vector LBPV output by the LBPV feature module to form a two-layer feature X K +LBPV;
- the RF identification module takes the two-layer feature set corresponding to the sound samples pre-stored in the sound sample module as the training sample set and the two-layer feature corresponding to the sound signal collected by the test sound module as the input sample; through random forest training, the category corresponding to the sound samples pre-stored in the sound sample library module is obtained and sent to the result output module.
- a sound enhancement module is further included; the output end of the sound enhancement module is connected to the input end of the spectrogram module, and the input end of the sound enhancement module is connected to the sound sample library module and the test sound module.
- the sound enhancement module uses a sound enhancement algorithm to enhance the sound signal; among various sound enhancement algorithms, comparison shows that the short-time spectral estimation algorithm is the most effective, as shown in Figure 4.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
- Circuit For Audible Band Transducer (AREA)
- Complex Calculations (AREA)
Abstract
An animal sound recognition method based on dual spectrogram features, comprising the following steps: establishing a sound sample library; collecting the sound signal to be recognized; converting the pre-stored sound samples and the sound signal to be recognized into spectrograms; normalizing the spectrograms and performing eigenvalue decomposition and projection to obtain a projection feature X_K; converting the spectrograms into an equivalent LBP value matrix u and computing, for each equivalent LBP value, the variance between the corresponding pixel and its surrounding pixel gray values to form a feature vector LBPV; combining the projection feature X_K and the feature vector LBPV into a two-layer feature X_K+LBPV; taking the two-layer feature set corresponding to the pre-stored sound samples in the sound sample library as the training sample set and the two-layer feature corresponding to the sound signal to be recognized as the input sample, and obtaining, through random forest training, the category in the sound sample library corresponding to the sound signal to be recognized and outputting the result. The method improves the recognition rate of various low-SNR animal sounds in different acoustic environments.
Description
The invention relates to an animal sound recognition method based on dual spectrogram features.
The ecological environment is closely related to our lives, and animal calls in the ecological environment carry rich information. By recognizing animal sounds, we can understand and analyze their habits and distribution, and thus monitor and protect them effectively. In recent years, animal sound recognition has received increasing attention.
Animal sound recognition is generally based on spectrograms, time series, Mel Frequency Cepstrum Coefficients (MFCC), sound library indexing and wavelet packet decomposition, with classification performed by, for example, a Support Vector Machine (SVM). Typical methods include recognizing animal sounds by Spectrogram Correlation, detecting Right Whale calls by applying an 'edge' detector to smoothed spectrograms, recognizing animal sounds from time-series features, and classifying bird sounds with MFCC combined with SVM. In addition, drawing on classical text-based database query methods, index-based animal sound retrieval and animal sound retrieval based on context variables have also been used. Recently, Exadaktylos et al. determined the state of animals through sound recognition for optimizing livestock production. Potamitis et al. proposed recognizing specific bird sounds in continuous, real field recordings. In our own recent work, we proposed a bird sound detection method that, after adaptive energy detection (AED), uses Mel-scale wavelet packet decomposition sub-band cepstral coefficient (MWSCC) features and MFCC combined with a support vector machine (SVM).
Because real environments contain all kinds of noise, recognizing animal sounds is challenging. In particular, recognition is especially difficult when the SNR of a sound signal acquired in real time is very low. The analysis, classification and recognition of low-SNR sound signals has already received some research attention. Common features for low-SNR sound recognition include combined time-frequency features and spectrogram-based and related features.
Combined time-frequency features mainly include time and frequency features, wavelet-domain features, and features extracted by a Gabor-dictionary matching pursuit algorithm. Recent work also includes low-SNR sound event recognition based on wavelet packet filtering, sound event recognition based on high-pass-filtered extended MFCC features, and sound event recognition and detection based on random regression forests over multiple overlapping super-frames. In one approach, the matching pursuit algorithm selects important atoms from a Gabor dictionary, principal component analysis (PCA) and linear discriminant analysis (LDA) determine the sound event features, and an SVM classifier performs the final classification; the recognition of low-SNR sound events is markedly effective.
As for spectrograms and related features, the spectrogram is obtained mainly by applying a Short-Time Fourier Transform (STFT) to the sound signal; with image features available, some image recognition methods can be applied to low-SNR sound recognition. For example, Khunarsal et al. proposed an environmental sound classification method that combines spectrogram pattern matching with feed-forward neural networks and k-nearest neighbors (k-NN). We have also extracted gray-level co-occurrence matrix features from spectrograms and recognized bird sounds with a random forest classifier. For non-stationary noise environments, Duan et al. proposed a sound enhancement algorithm based on non-negative spectrogram decomposition. Dennis et al. proposed a sound event recognition method based on spectrogram features. Czarnecki and Moszyński used the Concentrated Spectrograph method for time-frequency analysis of sound signals. Dennis et al. proposed Local Spectrogram Features to recognize overlapping sound events with a generalized Hough Transform voting system. McLoughlin et al. proposed Spectrogram Image-based Front End Features, classifying sound events with SVM and Deep Neural Network classifiers. In particular, the sub-band power distribution (SPD) feature proposed by Dennis et al. separates reliable sound events from noise in the spectrogram and recognizes the features with a nearest-neighbor (kNN) classifier; this method can still recognize the relevant sound events at SNRs as low as 0 dB. However, across different acoustic environments, the overall recognition accuracy for various low-SNR sound signals remains low.
Summary of the Invention
The object of the present invention is to provide an animal sound recognition method based on dual spectrogram features, which improves the recognition rate of various low-SNR animal sounds in different acoustic environments.
To achieve the above object, the present invention adopts the following technical solution: an animal sound recognition method based on dual spectrogram features, characterized by comprising the following steps:
Step S1: establish a sound sample library for pre-storing sound samples;
Step S2: collect the sound signal to be recognized;
Step S3: convert the pre-stored sound samples and the sound signal to be recognized into spectrograms;
Step S4: normalize the spectrogram, perform eigenvalue decomposition and projection on the normalized spectrogram, and convert it into a projection feature X_K;
Step S5: convert the spectrogram into an equivalent LBP value matrix u, and compute, for each equivalent LBP value, the variance between the corresponding pixel and its surrounding pixel gray values to form a feature vector LBPV;
Step S6: combine the projection feature X_K and the feature vector LBPV to form the two-layer feature X_K+LBPV;
Step S7: taking the two-layer feature set corresponding to the pre-stored sound samples in the sound sample library as the training sample set and the two-layer feature corresponding to the sound signal to be recognized as the input sample, obtain, through random forest training, the category in the sound sample library corresponding to the sound signal to be recognized and output the result.
Further, the conversion process of step S3 is as follows:
Apply an STFT to the pre-stored sound sample or the collected sound signal to obtain its magnitude spectrum S(t,f), where t is the frame index and f is the frequency index; the two-dimensional image formed by converting the values of the magnitude spectrum S(t,f) into gray levels is the spectrogram.
Further, the details of step S4 are as follows:
The normalized log-scale vector S_t represents the data of the t-th frame on the normalized log scale;
Assume the magnitude spectrum S(t,f) has M frames in total, and represent the vectors of the M frames as a spectrogram matrix X = [S_1, ..., S_t, ..., S_M]^T, X ∈ R^(M×N). Since eigen-decomposition requires a square matrix, compute C = X^T X to obtain the covariance matrix C ∈ R^(N×N) of the matrix X, and reduce the dimensionality of the covariance matrix C by eigenvalue decomposition according to the following formulas:
C = UΛU^T (3)
C = λ_1 u_1 u_1' + λ_2 u_2 u_2' + ... + λ_N u_N u_N' (5)
C ≈ λ_1 u_1 u_1' + λ_2 u_2 u_2' + ... + λ_K u_K u_K', K << N (6)
where the matrix U ∈ R^(N×N) contains all eigenvectors μ_1, ..., μ_N of matrix C, Λ is a diagonal matrix whose diagonal elements are the eigenvalues λ_1, ..., λ_N, which represent the weights of the corresponding eigenvectors, with λ_1 ≥ λ_2 ≥ ... ≥ λ_N; the contribution ratio η_K of the first K eigenvalues is then calculated by the following formula to measure the importance of the first K eigenvectors in representing the sound:
The matrix U carries the main information of the sound. The first K eigenvectors are selected to form the basis vector matrix U_K ∈ R^(N×K), and the projection feature X_K is the projection of the spectrogram matrix X onto the basis vector matrix U_K ∈ R^(N×K):
X_K = XU_K (8)
where X_K ∈ R^(M×K).
Further, the details of step S5 are as follows:
The texture T is the joint distribution T of P pixels on a circular neighborhood of radius R centered on pixel g_c:
T ≈ t(s(g_0 - g_c), s(g_1 - g_c), ..., s(g_(P-1) - g_c)) (9)
where g_c denotes the pixel value of the center pixel of the circular neighborhood, g_i (i = 0, 1, ..., P-1) denotes the gray values of the P pixels on the circular neighborhood, and s is the sign function:
The binary pattern is computed by combining the LBP operator with the 0/1 sequence formed by ordering the joint distribution T in a specific direction, forming the LBP value, i.e. LBP_{P,R}:
With P pixels on the circular neighborhood, LBP produces 2^P binary patterns, i.e. 2^P different LBP values;
An equivalent (uniform) pattern is introduced: its corresponding circular binary code has at most two transitions from 0 to 1 or from 1 to 0; the U value denotes the number of transitions, and the equivalent pattern is determined by the U value:
For an M×N spectrogram, extract its equivalent LBP; each pixel (m,n) yields an equivalent LBP value, and these values form an equivalent LBP map, which is the equivalent LBP value matrix u. Counting the frequency of each value in the equivalent LBP map yields the texture feature vector of the spectrogram. However, equivalent LBP maps with the same equivalent LBP values may have different textures, so the variance between the pixel corresponding to each equivalent LBP value and the surrounding pixel gray values is computed to form the feature vector LBPV; the k-th component LBPV(k) of the feature vector LBPV is expressed as:
where the integer k ranges over k ∈ [1, P(P-1)+3], w(m,n,k) denotes the weight (the equivalent LBP value) of pixel (m,n) in the spectrogram corresponding to the k-th component of LBPV, and LBPV(k) accumulates the weights of the equivalent LBP values corresponding to the k-th component over all pixels in the spectrogram. According to formula (14), the obtained LBPV(1), LBPV(2), ..., LBPV(k), ..., LBPV(P(P-1)+3) finally form a feature vector LBPV of size P(P-1)+3.
Further, the details of step S7 are as follows:
Take the two-layer feature set corresponding to the sound samples pre-stored in the sound sample module as the training sample set; from this training sample set, perform bootstrap resampling to generate s decision trees, forming a random forest;
Take the two-layer feature corresponding to the sound signal collected by the test sound module as the input sample and place it at the root node of each of the s decision trees in the random forest; it is passed downward according to the classification rules of each decision tree until it reaches a leaf node, whose class label is that decision tree's vote for the category l of the two-layer feature. The s decision trees of the random forest each vote on the category l of the two-layer feature, yielding s votes; these s votes are tallied, and the category l with the most votes is the category of the two-layer feature.
In an embodiment of the present invention, a sound enhancement step is further included between step S2 and step S3, in which the pre-stored sound samples and the sound signal to be recognized are enhanced; the enhancement uses a short-time spectral estimation algorithm.
Further, the details of the short-time spectral estimation are as follows:
The sound signal y(t) can be expressed as:
y(t) = s(t) + n(t) (18)
where s(t) is the animal sound and n(t) is the ambient sound. Applying an STFT to the sound signal y(t) yields its magnitude spectrum Y(k,l), where k is the frame index and l is the frequency index. Short-time spectral estimation consists of three parts: estimation of the ambient sound power spectrum N(k,l), calculation of the gain factor G(k,l), and calculation of the enhanced sound signal magnitude spectrum F(k,l):
Step S81: smooth the noisy signal power spectrum |Y(k,l)|² to obtain the smoothed power spectrum:
S(k,l) = αS(k-1,l) + (1-α)|Y(k,l)|² (19)
where α is the smoothing coefficient, α = 0.7;
Step S82: find the minimum spectral component of S(k,l) by a combined forward and backward bidirectional search algorithm:
S_min(k,l) = max{S_min1(k,l), S_min2(k,l)} (20)
S_min1(k,l) = min{S(i,l)}, k-D+1 ≤ i ≤ k (21)
S_min2(k,l) = min{S(i,l)}, k ≤ i ≤ k+D-1 (22), where S_min1(k,l) is the minimum over the D frames of the forward search, S_min2(k,l) is the minimum over the D frames of the backward search, and S_min(k,l) is the minimum spectral component obtained by the bidirectional search;
Step S83: calculate the probability that animal sound is present:
P(k,l) = α_1 P(k-1,l) + (1-α_1)H(k,l) (23)
where α_1 is a constant smoothing parameter, here set to α_1 = 0.2, and H(k,l) is the criterion for the presence of ambient sound:
where φ(k) is the frequency-dependent decision threshold:
where L_f and H_f denote the minimum and maximum of the frequency range in which the audio signal is concentrated, L_f = 1 kHz, H_f = 18 kHz, and F_s denotes the sampling frequency;
Step S84: calculate the time-frequency smoothing factor η(k,l) for ambient sound power spectrum estimation:
η(k,l) = α_2 + (1-α_2)P(k,l) (26)
where α_2 = 0.95 is set according to the actual situation; clearly α_2 ≤ η(k,l) ≤ 1. The noise power spectrum can then be estimated from η(k,l):
N(k,l) = η(k,l)N(k-1,l) + (1-η(k,l))|Y(k,l)|² (27)
The above is the estimation process of the ambient sound power spectrum N(k,l);
Step S85: calculate the spectral gain factor:
G(k,l) = C(k,l)/(C(k,l) + σN(k,l)) (28)
where C(k,l) = |Y(k,l)|² − N(k,l) denotes the power spectrum of the clean sound signal and σ is the over-subtraction factor, whose value is:
Step S86: obtain the enhanced audio signal magnitude spectrum:
F(k,l) = |G(k,l) × |Y(k,l)|²|^(1/2) (30).
Compared with the prior art, the present invention has the following beneficial effects:
1. The present invention proposes a two-layer feature combining projection features and LBPV features for animal sound recognition in various environments, which not only improves the recognition rate but also has high noise robustness;
2. The present invention proposes using a random forest recognizer for recognition with the two-layer feature;
3. The present invention proposes an architecture combining short-time spectral estimation sound enhancement with the two-layer feature and the random forest, which is particularly suitable for low-SNR animal sound recognition.
Figure 1 is a flow chart of the algorithm of the present invention.
Figure 2 is a system block diagram of Embodiment 1 of the present invention.
Figure 3 is a system block diagram of Embodiment 2 of the present invention.
Figure 4 is a schematic diagram of the module in which sound enhancement uses the short-time spectral estimation algorithm in an embodiment of the present invention.
Figure 5a is a spectrogram of a Siberian crane call in an embodiment of the present invention.
Figure 5b is a normalized log-scale spectrogram of the Siberian crane call in an embodiment of the present invention.
Figure 6 shows, for the Siberian crane call of an embodiment of the present invention, the proportion of the sum of the first K eigenvalues to the sum of all eigenvalues.
Figure 7a is a schematic diagram of the gray values of a 3×3 image region in an embodiment of the present invention.
Figure 7b is a schematic diagram of the LBP value of the center pixel c of Figure 7a.
Figure 7c is the equivalent LBP map formed by computing equivalent LBP values for the solid-line box portion of Figure 7a.
Figure 7d is a schematic diagram of the variance matrix v of the pixels corresponding to the solid-line box portion of Figure 7a.
Figure 7e is the frequency histogram of each pattern.
Figure 7f is the LBPV histogram formed by computing LBPV(k) from the variances of Figure 7d, using the equivalent LBP values of Figure 7c and the sequence numbers k of Table 1.
Figure 8a is the equivalent LBP map converted from Figure 5a.
Figure 8b is the equivalent LBP histogram of Figure 8a.
Figure 8c is the LBPV histogram of Figure 8a.
Figure 9 is a schematic diagram of the basic principle of the random forest of the present invention.
The present invention is further described below with reference to the drawings and embodiments.
Referring to Figure 1, the present invention provides an animal sound recognition method based on dual spectrogram features, characterized by comprising the following steps:
Step S1: establish a sound sample library for pre-storing sound samples;
Step S2: collect the sound signal to be recognized;
Step S3: convert the pre-stored sound samples and the sound signal to be recognized into spectrograms;
Step S4: normalize the spectrogram, perform eigenvalue decomposition and projection on the normalized spectrogram, and convert it into a projection feature X_K;
Step S5: convert the spectrogram into an equivalent LBP value matrix u, and compute, for each equivalent LBP value, the variance between the corresponding pixel and its surrounding pixel gray values to form a feature vector LBPV;
Step S6: combine the projection feature X_K and the feature vector LBPV to form the two-layer feature X_K+LBPV;
Step S7: taking the two-layer feature set corresponding to the pre-stored sound samples in the sound sample library as the training sample set and the two-layer feature corresponding to the sound signal to be recognized as the input sample, obtain, through random forest training, the category in the sound sample library corresponding to the sound signal to be recognized and output the result.
Further, the conversion process of step S3 is as follows:
Apply an STFT to the pre-stored sound sample or the collected sound signal to obtain its magnitude spectrum S(t,f), where t is the frame index and f is the frequency index; the two-dimensional image formed by converting the values of the magnitude spectrum S(t,f) into gray levels is the spectrogram. Figure 5a shows the spectrogram of a Siberian crane call.
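A minimal sketch of step S3, assuming scipy/numpy are available and the signal is already loaded as a one-dimensional array; the window length, hop size, and the 8-bit gray-level mapping are illustrative choices not specified by the text:

```python
import numpy as np
from scipy.signal import stft

def to_spectrogram(y, fs, n_fft=512, hop=256):
    """Convert a 1-D sound signal into a magnitude spectrum and a gray-level spectrogram image."""
    # STFT: scipy returns frequency bins along rows and frames along columns
    _, _, Z = stft(y, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    S = np.abs(Z).T                          # magnitude spectrum S(t, f), t = frame index
    log_S = np.log(S + 1e-10)                # log scale before gray-level conversion
    # map log magnitudes to gray levels 0..255 to obtain the spectrogram image
    rng = log_S.max() - log_S.min() + 1e-10
    gray = (255 * (log_S - log_S.min()) / rng).astype(np.uint8)
    return S, gray
```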
Further, the details of step S4 are as follows:
The normalized log-scale vector S_t represents the data of the t-th frame on the normalized log scale; Figure 5b shows the normalized log-scale spectrogram of Figure 5a. Because the dimensionality of these vectors is too high, they are not suitable for direct classification and must be converted into a low-dimensional representation;
Eigenvalue decomposition is a simple and effective method of low-dimensional representation, and we use it to reduce the dimensionality. Assume the magnitude spectrum S(t,f) has M frames in total, and represent the vectors of the M frames as a spectrogram matrix X = [S_1, ..., S_t, ..., S_M]^T, X ∈ R^(M×N). Since eigen-decomposition requires a square matrix, compute C = X^T X to obtain the covariance matrix C ∈ R^(N×N) of the matrix X, and reduce the dimensionality of the covariance matrix C by eigenvalue decomposition according to the following formulas:
C = UΛU^T (3)
C = λ_1 u_1 u_1' + λ_2 u_2 u_2' + ... + λ_N u_N u_N' (5)
C ≈ λ_1 u_1 u_1' + λ_2 u_2 u_2' + ... + λ_K u_K u_K', K << N (6)
where the matrix U ∈ R^(N×N) contains all eigenvectors μ_1, ..., μ_N of matrix C, Λ is a diagonal matrix whose diagonal elements are the eigenvalues λ_1, ..., λ_N, which represent the weights of the corresponding eigenvectors, with λ_1 ≥ λ_2 ≥ ... ≥ λ_N. The magnitude λ_n of an eigenvalue reflects the importance of its corresponding eigenvector μ_n for the sound: the larger the eigenvalue, the more important the corresponding eigenvector. The contribution ratio η_K of the first K eigenvalues is then calculated by the following formula to measure the importance of the first K eigenvectors in representing the sound. Figure 6 shows, for the Siberian crane call of an embodiment of the present invention, the proportion of the sum of the first K eigenvalues to the sum of all eigenvalues; the proportion rises quickly for K ≤ 10, and as K increases further the rise flattens out and gradually approaches 100%:
The matrix U carries the main information of the sound. The first K eigenvectors are selected to form the basis vector matrix U_K ∈ R^(N×K), and the projection feature X_K is the projection of the spectrogram matrix X onto the basis vector matrix U_K ∈ R^(N×K):
X_K = XU_K (8)
where X_K ∈ R^(M×K).
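A minimal sketch of step S4, under the assumption that equation (7) (not reproduced in the text) defines η_K as the sum of the first K eigenvalues over the sum of all eigenvalues, as the caption of Figure 6 suggests; the per-frame normalization used here is likewise an assumed realization of the "normalized log scale":

```python
import numpy as np

def projection_feature(S, K=10):
    """Project the normalized log-scale spectrogram matrix X onto the top-K eigenvectors.

    S: magnitude spectrum, shape (M frames, N frequency bins).
    Returns X_K (M x K) and the contribution ratio eta_K of the first K eigenvalues.
    """
    X = np.log(S + 1e-10)
    X = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-10)
    C = X.T @ X                                  # covariance matrix, N x N (eq. before (3))
    eigvals, eigvecs = np.linalg.eigh(C)         # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # sort so lambda_1 >= ... >= lambda_N
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eta_K = eigvals[:K].sum() / eigvals.sum()    # assumed form of eq. (7)
    U_K = eigvecs[:, :K]                         # basis vector matrix, N x K
    X_K = X @ U_K                                # projection feature, eq. (8)
    return X_K, eta_K
```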
Further, the details of step S5 are as follows:
LBPV is the vector formed by accumulating the variances of all pixels corresponding to each pattern of the ULBP; the equivalent LBP value describes the spatial structure of the image texture, the variance represents contrast information, and the LBPV vector combines both.
The texture T is the joint distribution T of P pixels on a circular neighborhood of radius R centered on pixel g_c:
T ≈ t(s(g_0 - g_c), s(g_1 - g_c), ..., s(g_(P-1) - g_c)) (9)
where g_c denotes the pixel value of the center pixel of the circular neighborhood, g_i (i = 0, 1, ..., P-1) denotes the gray values of the P pixels on the circular neighborhood, and s is the sign function:
The binary pattern is computed by combining the LBP operator with the 0/1 sequence formed by ordering the joint distribution T in a specific direction, forming the LBP value, i.e. LBP_{P,R}:
The solid-line box in Figure 7a is a schematic diagram of the pixel gray values of a 3×3 image region in an embodiment of the present invention. The LBP value of the center pixel c with gray value 80 is computed as shown in Figure 7b, where (141≥80)→1, (109≥80)→1, (89≥80)→1, (68<80)→0, (48<80)→0, (52<80)→0, (60<80)→0, (89≥80)→1, so LBP_{P,R} = (11100001)_2 = (225)_10. For the LBP values of edge pixels, the corresponding pixels can first be extended in the manner shown by the broken line in Figure 7a and then computed with equation (11).
With P pixels on the circular neighborhood, LBP produces 2^P binary patterns, i.e. 2^P different LBP values;
Since the great majority of patterns contain at most two transitions from 1 to 0 or from 0 to 1, an equivalent (uniform) pattern is introduced: its corresponding circular binary code has at most two transitions from 0 to 1 or from 1 to 0; the U value denotes the number of transitions, and the equivalent pattern is determined by the U value:
where the superscript u2 of the LBP value indicates that the U value corresponding to the LBP is at most 2. The equivalent patterns reduce the number of patterns from 2^P to P(P-1)+2, and all patterns other than the equivalent patterns are grouped into the (P(P-1)+3)-th class. Taking Figure 7a as an example, when P = 8 and R = 1 the number of equivalent patterns is 59; according to formula (13), 59 equivalent LBP values can be obtained. Mapping them to the sequence numbers k = 1-59 gives the correspondence between equivalent LBP values and sequence numbers k shown in Table 1, where ULBP(k) is the LBP value corresponding to sequence number k;
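A minimal sketch of the equivalent (uniform) LBP computation for P = 8, R = 1. The transition-count test U ≤ 2 stands in for equations (12)-(13), which are not reproduced in the text, and the mapping of the uniform codes plus one extra bin to sequence numbers k is one possible realization of Table 1, not the table itself:

```python
import numpy as np

P, R = 8, 1

def u_value(code):
    """Number of 0/1 transitions in the circular binary code of an LBP value."""
    bits = [(code >> i) & 1 for i in range(P)]
    return sum(bits[i] != bits[(i + 1) % P] for i in range(P))

# uniform codes (U <= 2) in ascending order -> sequence numbers k; non-uniform codes share OTHER_K
UNIFORM = sorted(c for c in range(2 ** P) if u_value(c) <= 2)
K_OF_CODE = {c: k for k, c in enumerate(UNIFORM, start=1)}
OTHER_K = len(UNIFORM) + 1            # P(P-1)+3 = 59 for P = 8

def equivalent_lbp_map(img):
    """Equivalent LBP value matrix u for a gray-level spectrogram image."""
    padded = np.pad(img, R, mode='edge')      # extend edge pixels before comparison
    H, W = img.shape
    u = np.zeros((H, W), dtype=np.int32)
    # neighbour offsets taken in a fixed circular order
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for m in range(H):
        for n in range(W):
            c = padded[m + R, n + R]
            code = 0
            for b, (dy, dx) in enumerate(offs):
                code |= int(padded[m + R + dy, n + R + dx] >= c) << (P - 1 - b)
            u[m, n] = code
    return u
```

K_OF_CODE and OTHER_K are reused by the LBPV sketch further below.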
Table 1. Correspondence between equivalent LBP values and sequence numbers k
For an M×N spectrogram, extract its equivalent LBP; each pixel (m,n) yields an equivalent LBP value, and these values form an equivalent LBP map, which is the equivalent LBP value matrix u. Counting the frequency of each value in the equivalent LBP map yields the texture feature vector of the spectrogram. Figure 7c is the equivalent LBP map formed by computing equivalent LBP values for the solid-line box portion of Figure 7a; it is likewise a matrix of equivalent LBP values, i.e. the equivalent LBP value matrix u. Figure 7e shows the frequency histogram of each pattern, i.e. the texture feature vector of Figure 7a;
However, equivalent LBP maps with the same equivalent LBP values may have different textures. We therefore use the variance to represent contrast information: the larger the variance, the greater the texture variation in the region. The variance between the pixel corresponding to each equivalent LBP value and the surrounding pixel gray values is computed to form the feature vector LBPV; the k-th component LBPV(k) of the feature vector LBPV is expressed as:
where the integer k ranges over k ∈ [1, P(P-1)+3], w(m,n,k) denotes the weight (the equivalent LBP value) of pixel (m,n) in the spectrogram corresponding to the k-th component of LBPV, and LBPV(k) accumulates the weights of the equivalent LBP values corresponding to the k-th component over all pixels in the spectrogram. According to formula (14), the obtained LBPV(1), LBPV(2), ..., LBPV(k), ..., LBPV(P(P-1)+3) finally form a feature vector LBPV of size P(P-1)+3;
Figure 7d is the variance matrix v of the pixels in the solid-line region of Figure 7a, and Figure 7f is the LBPV histogram, i.e. the LBPV feature, formed by computing LBPV(k) from the variances of Figure 7d using the equivalent LBP values of Figure 7c and the corresponding sequence numbers k of Table 1. The schematic process is as follows:
u(0,0) = u(0,1) = 193 = ULBP(38) → v(0,0) + v(0,1) = 577 + 653 → LBPV(38) = 1230,
u(0,2) = u(1,2) = 241 = ULBP(49) → v(0,2) + v(1,2) = 218 + 446 → LBPV(49) = 664,
u(1,0) = u(1,1) = 225 = ULBP(44) → v(1,0) + v(1,1) = 1111 + 880 → LBPV(44) = 1991,
u(2,0) = u(2,1) = 231 = ULBP(46) → v(2,0) + v(2,1) = 216 + 197 → LBPV(46) = 413,
u(2,2) = 255 = ULBP(58) → v(2,2) = 132 → LBPV(58) = 132,
Therefore,
LBPV = {0, ..., LBPV(38), 0, ..., LBPV(44), 0, LBPV(46), 0, 0, LBPV(49), 0, ..., LBPV(58), 0}; substituting the corresponding values gives
LBPV = {0, ..., 1230, 0, ..., 1991, 0, 413, 0, 0, 664, 0, ..., 132, 0}, whose histogram is shown in Figure 7f;
Taking the spectrogram of the Siberian crane call in Figure 5a as an example, Figures 8a-8c compare the LBP histogram of the equivalent patterns with the LBPV histogram. In Figure 8b, the equivalent LBP value 255 occurs with a particularly high frequency, i.e. the proportion of the binary pattern 11111111 is particularly high. According to equation (10), when g_n ≥ g_c, s(g_n − g_c) = 1; that is, when the gray value of a neighborhood pixel is greater than or equal to that of the center pixel, the corresponding bit of the binary pattern is 1, which means that blank regions of the spectrogram, or regions with identical gray values, account for a particularly high proportion. Compared with the equivalent LBP histogram, the LBPV histogram shown in Figure 8c, which uses the variance of the surrounding pixel gray values as weights, better reflects the texture variation in the spectrogram and is beneficial for classification and recognition.
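A minimal sketch of the LBPV feature following the description of equation (14): for each pixel, a local gray-level variance over its 3×3 neighborhood (an assumed realization of the variance matrix v) is accumulated into the histogram bin k of its equivalent LBP value. It reuses K_OF_CODE and OTHER_K from the sketch above:

```python
import numpy as np

def lbpv_feature(img, u):
    """Accumulate per-pixel local variances into the P(P-1)+3 equivalent-LBP bins (eq. (14))."""
    H, W = img.shape
    padded = np.pad(img.astype(np.float64), 1, mode='edge')
    lbpv = np.zeros(OTHER_K)                   # size 59 for P = 8
    for m in range(H):
        for n in range(W):
            neigh = padded[m:m + 3, n:n + 3]   # 3x3 neighborhood of pixel (m, n)
            var = neigh.var()                  # contrast weight w(m, n, k)
            k = K_OF_CODE.get(int(u[m, n]), OTHER_K)
            lbpv[k - 1] += var
    return lbpv
```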
Therefore, in the next step we combine the projection feature X_k and the feature vector LBPV to form the two-layer feature X_k+LBPV as the feature for animal sound recognition in various environments. Of course, the projection feature X_k or the feature vector LBPV alone could also be used as the feature for animal recognition, but the two-layer feature achieves a higher recognition rate than either of them.
Further, the details of step S7 are as follows:
Random forest is an ensemble classifier algorithm that uses multiple decision tree classifiers to discriminate data; its principle is shown in Figure 9. By bootstrap resampling, the two-layer feature set (or the projection feature set, or the feature vector set W = {LBPV_1, LBPV_2, ..., LBPV_Q}) corresponding to the sound samples pre-stored in the sound sample module is taken as the training
sample set; from this training sample set, bootstrap resampling generates s decision trees, forming a random forest, and the classification result for test data is determined by the score formed by the votes of the s trees in the forest;
The process of recognizing the sound to be recognized with the random forest is as follows: the two-layer feature (or projection feature X_k, or feature vector LBPV) corresponding to the sound signal collected by the test sound module is used as the input sample and placed at the root node of each of the s decision trees of the random forest; it is passed downward according to the classification rules of each decision tree until it reaches a leaf node, whose class label is that decision tree's vote for the category l of the two-layer feature. The s decision trees of the random forest each vote on the category l of the two-layer feature, yielding s votes; these votes are tallied, and the category l with the most votes is the category of the two-layer feature.
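A minimal sketch of step S7 using scikit-learn's RandomForestClassifier as a stand-in for the bootstrap-sampled forest of s decision trees described above; s = 100 is illustrative, and averaging the projection feature over frames before concatenation is an assumption, since the text does not say how recordings with different numbers of frames are reduced to a fixed-length vector:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def two_layer_feature(X_K, lbpv):
    """Step S6: combine the projection feature and the LBPV vector into one feature vector.

    X_K has one row per frame; mean-pooling over frames is an assumed choice.
    """
    return np.concatenate([X_K.mean(axis=0), lbpv])

def train_forest(train_features, train_labels, s=100):
    """train_features: (Q samples x D) two-layer features; train_labels: category l per sample."""
    forest = RandomForestClassifier(n_estimators=s, bootstrap=True)
    forest.fit(train_features, train_labels)
    return forest

# recognition: each of the s trees votes for a class; predict() returns the majority vote
# label = forest.predict(two_layer_feature(X_K, lbpv).reshape(1, -1))[0]
```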
In an embodiment of the present invention, for sound samples heavily contaminated by noise, a sound enhancement step is further included between step S2 and step S3, in which the pre-stored sound samples and the sound signal to be recognized are enhanced; the enhancement uses a short-time spectral estimation algorithm.
Further, the details of the short-time spectral estimation are as follows:
The sound signal y(t) can be expressed as:
y(t) = s(t) + n(t) (18)
where s(t) is the animal sound and n(t) is the ambient sound. Applying an STFT to the sound signal y(t) yields its magnitude spectrum Y(k,l), where k is the frame index and l is the frequency index. Short-time spectral estimation consists of three parts: estimation of the ambient sound power spectrum N(k,l), calculation of the gain factor G(k,l), and calculation of the enhanced sound signal magnitude spectrum F(k,l):
Step S81: smooth the noisy signal power spectrum |Y(k,l)|² to obtain the smoothed power spectrum:
S(k,l) = αS(k-1,l) + (1-α)|Y(k,l)|² (19)
where α is the smoothing coefficient, α = 0.7;
Step S82: find the minimum spectral component of S(k,l) by a combined forward and backward bidirectional search algorithm:
S_min(k,l) = max{S_min1(k,l), S_min2(k,l)} (20)
S_min1(k,l) = min{S(i,l)}, k-D+1 ≤ i ≤ k (21)
S_min2(k,l) = min{S(i,l)}, k ≤ i ≤ k+D-1 (22)
where S_min1(k,l) is the minimum over the D frames of the forward search, S_min2(k,l) is the minimum over the D frames of the backward search, and S_min(k,l) is the minimum spectral component obtained by the bidirectional search;
Step S83: calculate the probability that animal sound is present:
P(k,l) = α_1 P(k-1,l) + (1-α_1)H(k,l) (23)
where α_1 is a constant smoothing parameter, here set to α_1 = 0.2, and H(k,l) is the criterion for the presence of ambient sound:
where φ(k) is the frequency-dependent decision threshold:
where L_f and H_f denote the minimum and maximum of the frequency range in which the audio signal is concentrated, L_f = 1 kHz, H_f = 18 kHz, and F_s denotes the sampling frequency;
Step S84: calculate the time-frequency smoothing factor η(k,l) for ambient sound power spectrum estimation:
η(k,l) = α_2 + (1-α_2)P(k,l) (26), where α_2 = 0.95 is set according to the actual situation; clearly α_2 ≤ η(k,l) ≤ 1. The noise power spectrum can then be estimated from η(k,l):
N(k,l) = η(k,l)N(k-1,l) + (1-η(k,l))|Y(k,l)|² (27)
The above is the estimation process of the ambient sound power spectrum N(k,l);
Step S85: calculate the spectral gain factor:
G(k,l) = C(k,l)/(C(k,l) + σN(k,l)) (28)
where C(k,l) = |Y(k,l)|² − N(k,l) denotes the power spectrum of the clean sound signal and σ is the over-subtraction factor, whose value is:
Step S86: obtain the enhanced audio signal magnitude spectrum:
F(k,l) = |G(k,l) × |Y(k,l)|²|^(1/2) (30).
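A minimal sketch of steps S81-S86 operating per frequency bin. Equations (24), (25) and (29) for H(k,l), φ(k) and σ are not reproduced in the text, so a simple ratio test against a fixed threshold and a constant over-subtraction factor are used here purely as placeholders:

```python
import numpy as np

def enhance_magnitude(Y, alpha=0.7, alpha1=0.2, alpha2=0.95, D=10, phi=2.0, sigma=1.0):
    """Short-time spectral estimation enhancement of a magnitude spectrum Y[k, l]
    (k = frame index, l = frequency index). Returns the enhanced magnitude F[k, l]."""
    power = np.abs(Y) ** 2
    K, L = power.shape
    S = np.zeros_like(power)
    prob = np.zeros_like(power)
    N = np.zeros_like(power)
    # S81: recursive smoothing of the noisy power spectrum, eq. (19)
    S[0] = power[0]
    for k in range(1, K):
        S[k] = alpha * S[k - 1] + (1 - alpha) * power[k]
    N[0] = power[0]                              # initialize noise estimate with the first frame
    for k in range(1, K):
        # S82: bidirectional minimum search over D frames, eqs. (20)-(22)
        fwd = S[max(0, k - D + 1):k + 1].min(axis=0)
        bwd = S[k:min(K, k + D)].min(axis=0)
        S_min = np.maximum(fwd, bwd)
        # S83: animal-sound presence probability, eq. (23); H is a placeholder ratio test
        H = (S[k] / (S_min + 1e-10) > phi).astype(float)
        prob[k] = alpha1 * prob[k - 1] + (1 - alpha1) * H
        # S84: time-frequency smoothing factor and noise power spectrum, eqs. (26)-(27)
        eta = alpha2 + (1 - alpha2) * prob[k]
        N[k] = eta * N[k - 1] + (1 - eta) * power[k]
    # S85-S86: spectral gain factor and enhanced magnitude spectrum, eqs. (28) and (30)
    C = np.maximum(power - N, 0.0)
    G = C / (C + sigma * N + 1e-10)
    return np.sqrt(G * power)
```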
To help those of ordinary skill in the art better understand the technical solution of the present invention, the present invention is further described below in connection with the system.
The system used by the present invention, shown in Figure 2, comprises a spectrogram module. The input end of the spectrogram module is connected to a sound sample library module and a test sound module; the output end of the spectrogram module is connected to the input ends of a projection feature module and an LBPV feature module; the output ends of the projection feature module and the LBPV feature module are each connected to the input end of a two-layer feature module; and the output end of the two-layer feature module is connected in sequence to an RF recognition module and a result output module;
the spectrogram module converts the sound samples pre-stored in the sound sample library module and the sound signals collected by the test sound module into spectrograms;
the projection feature module normalizes the spectrogram output by the spectrogram module and performs eigenvalue decomposition and projection on the normalized spectrogram to obtain the projection feature X_K;
the LBPV feature module converts the spectrogram output by the spectrogram module into the equivalent LBP value matrix u and computes, for each equivalent LBP value, the variance between the corresponding pixel and the surrounding pixel gray values to form the feature vector LBPV;
the two-layer feature module combines the projection feature X_K output by the projection feature module and the feature vector LBPV output by the LBPV feature module to form the two-layer feature X_K+LBPV;
the RF recognition module takes the two-layer feature set corresponding to the sound samples pre-stored in the sound sample module as the training sample set and the two-layer feature corresponding to the sound signal collected by the test sound module as the input sample; through random forest training, it obtains the category, among the sound samples pre-stored in the sound sample library module, corresponding to the collected sound signal and sends it to the result output module.
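An end-to-end sketch tying the sketches above together in the order of the module chain of Figure 2 (first embodiment, without the sound enhancement module); the WAV loading, mono mixing, file paths and class labels are placeholders, not part of the original disclosure:

```python
import numpy as np
from scipy.io import wavfile

def extract_two_layer(path, K=10):
    fs, y = wavfile.read(path)                    # sound sample library / test sound module
    y = y.astype(np.float64)
    if y.ndim > 1:
        y = y.mean(axis=1)                        # mix to mono (assumption)
    S, gray = to_spectrogram(y, fs)               # spectrogram module
    X_K, _ = projection_feature(S, K)             # projection feature module
    u = equivalent_lbp_map(gray)                  # LBPV feature module
    lbpv = lbpv_feature(gray, u)
    return two_layer_feature(X_K, lbpv)           # two-layer feature module

# training on a placeholder sample library of (wav path, class label) pairs
library = [("samples/crane_01.wav", "crane"), ("samples/frog_01.wav", "frog")]
X_train = np.stack([extract_two_layer(p) for p, _ in library])
forest = train_forest(X_train, [lbl for _, lbl in library])     # RF recognition module
print(forest.predict(extract_two_layer("test/unknown.wav").reshape(1, -1)))  # result output
```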
In another embodiment of the present invention, a sound enhancement module is further included. The output end of the sound enhancement module is connected to the input end of the spectrogram module, and the input end of the sound enhancement module is connected to the sound sample library module and the test sound module, as shown in Figure 3. The sound enhancement module enhances the sound signal with a sound enhancement algorithm; among the various sound enhancement algorithms, comparison shows that the short-time spectral estimation algorithm gives the most outstanding results, as shown in Figure 4.
The above are only preferred embodiments of the present invention, and all equivalent changes and modifications made within the scope of the patent claims of the present invention shall fall within the scope of the present invention.
Claims (7)
- An animal sound recognition method based on dual spectrogram features, characterized by comprising the following steps: Step S1: establish a sound sample library for pre-storing sound samples; Step S2: collect the sound signal to be recognized; Step S3: respectively convert the pre-stored sound samples and the sound signal to be recognized into spectrograms; Step S4: normalize the spectrogram, perform eigenvalue decomposition and projection on the normalized spectrogram, and convert it into a projection feature X_K; Step S5: convert the spectrogram into an equivalent LBP value matrix u, and compute, for each equivalent LBP value, the variance between the corresponding pixel and the surrounding pixel gray values to form a feature vector LBPV; Step S6: combine the projection feature X_K and the feature vector LBPV to form the two-layer feature X_K+LBPV; Step S7: taking the two-layer feature set corresponding to the pre-stored sound samples in the sound sample library as the training sample set and the two-layer feature corresponding to the sound signal to be recognized as the input sample, obtain, through random forest training, the category in the sound sample library corresponding to the sound signal to be recognized and output the result.
- The animal sound recognition method based on dual spectrogram features according to claim 1, characterized in that the conversion process of step S3 is as follows: apply an STFT to the pre-stored sound sample or the collected sound signal to obtain its magnitude spectrum S(t,f), where t is the frame index and f is the frequency index; the two-dimensional image formed by converting the values of the magnitude spectrum S(t,f) into gray levels is the spectrogram.
- The animal sound recognition method based on dual spectrogram features according to claim 2, characterized in that the details of step S4 are as follows: the normalized log-scale vector S_t represents the data of the t-th frame on the normalized log scale; assume the magnitude spectrum S(t,f) has M frames in total, and represent the vectors of the M frames as a spectrogram matrix X = [S_1, ..., S_t, ..., S_M]^T, X ∈ R^(M×N); since eigen-decomposition requires a square matrix, compute C = X^T X to obtain the covariance matrix C ∈ R^(N×N) of the matrix X, and reduce the dimensionality of the covariance matrix C by eigenvalue decomposition according to the following formulas: C = UΛU^T (3), C = λ_1 u_1 u_1' + λ_2 u_2 u_2' + ... + λ_N u_N u_N' (5), C ≈ λ_1 u_1 u_1' + λ_2 u_2 u_2' + ... + λ_K u_K u_K', K << N (6), where the matrix U ∈ R^(N×N) contains all eigenvectors μ_1, ..., μ_N of matrix C, Λ is a diagonal matrix whose diagonal elements are the eigenvalues λ_1, ..., λ_N, which represent the weights of the corresponding eigenvectors, with λ_1 ≥ λ_2 ≥ ... ≥ λ_N; the contribution ratio η_K of the first K eigenvalues is then calculated by the following formula to measure the importance of the first K eigenvectors in representing the sound: the matrix U carries the main information of the sound; the first K eigenvectors are selected to form the basis vector matrix U_K ∈ R^(N×K), and the projection feature X_K is the projection of the spectrogram matrix X onto the basis vector matrix U_K ∈ R^(N×K): X_K = XU_K (8), where X_K ∈ R^(M×K).
- The animal sound recognition method based on dual spectrogram features according to claim 1, characterized in that the details of step S5 are as follows: the texture T is the joint distribution T of P pixels on a circular neighborhood of radius R centered on pixel g_c: T ≈ t(s(g_0 − g_c), s(g_1 − g_c), ..., s(g_(P−1) − g_c)) (9), where g_c denotes the pixel value of the center pixel of the circular neighborhood, g_i (i = 0, 1, ..., P−1) denotes the gray values of the P pixels on the circular neighborhood, and s is the sign function; the binary pattern is computed by combining the LBP operator with the 0/1 sequence formed by ordering the joint distribution T in a specific direction, forming the LBP value, i.e. LBP_{P,R}; with P pixels on the circular neighborhood, LBP produces 2^P binary patterns, i.e. 2^P different LBP values; an equivalent pattern is introduced, whose corresponding circular binary code has at most two transitions from 0 to 1 or from 1 to 0; the U value denotes the number of transitions, and the equivalent pattern is determined by the U value; for an M×N spectrogram, extract its equivalent LBP; each pixel (m,n) yields an equivalent LBP value, and these values form an equivalent LBP map, which is the equivalent LBP value matrix u; counting the frequency of each value in the equivalent LBP map yields the texture feature vector of the spectrogram; however, equivalent LBP maps with the same equivalent LBP values may have different textures, so the variance between the pixel corresponding to each equivalent LBP value and the surrounding pixel gray values is computed to form the feature vector LBPV, whose k-th component LBPV(k) is expressed as: where the integer k ranges over k ∈ [1, P(P−1)+3], w(m,n,k) denotes the weight (the equivalent LBP value) of pixel (m,n) in the spectrogram corresponding to the k-th component of LBPV, and LBPV(k) accumulates the weights of the equivalent LBP values corresponding to the k-th component over all pixels in the spectrogram; according to formula (14), the obtained LBPV(1), LBPV(2), ..., LBPV(k), ..., LBPV(P(P−1)+3) finally form a feature vector LBPV of size P(P−1)+3.
- The animal sound recognition method based on dual spectrogram features according to claim 1, characterized in that a sound enhancement step is further included between step S2 and step S3, in which the pre-stored sound samples and the sound signal to be recognized are enhanced, the enhancement using a short-time spectral estimation algorithm.
- The animal sound recognition method based on dual spectrogram features according to claim 6, characterized in that the details of the short-time spectral estimation algorithm are as follows: the sound signal y(t) can be expressed as: y(t) = s(t) + n(t) (18), where s(t) is the animal sound and n(t) is the ambient sound; applying an STFT to the sound signal y(t) yields its magnitude spectrum Y(k,l), where k is the frame index and l is the frequency index; short-time spectral estimation consists of three parts: estimation of the ambient sound power spectrum N(k,l), calculation of the gain factor G(k,l), and calculation of the enhanced sound signal magnitude spectrum F(k,l): Step S81: smooth the noisy signal power spectrum |Y(k,l)|² to obtain the smoothed power spectrum: S(k,l) = αS(k−1,l) + (1−α)|Y(k,l)|² (19), where α is the smoothing coefficient, α = 0.7; Step S82: find the minimum spectral component of S(k,l) by a combined forward and backward bidirectional search algorithm: S_min(k,l) = max{S_min1(k,l), S_min2(k,l)} (20), S_min1(k,l) = min{S(i,l)}, k−D+1 ≤ i ≤ k (21), S_min2(k,l) = min{S(i,l)}, k ≤ i ≤ k+D−1 (22), where S_min1(k,l) is the minimum over the D frames of the forward search, S_min2(k,l) is the minimum over the D frames of the backward search, and S_min(k,l) is the minimum spectral component obtained by the bidirectional search; Step S83: calculate the probability that animal sound is present: P(k,l) = α_1 P(k−1,l) + (1−α_1)H(k,l) (23), where α_1 is a constant smoothing parameter, here set to α_1 = 0.2, and H(k,l) is the criterion for the presence of ambient sound: where φ(k) is the frequency-dependent decision threshold: where L_f and H_f denote the minimum and maximum of the frequency range in which the audio signal is concentrated, L_f = 1 kHz, H_f = 18 kHz, and F_s denotes the sampling frequency; Step S84: calculate the time-frequency smoothing factor η(k,l) for ambient sound power spectrum estimation: η(k,l) = α_2 + (1−α_2)P(k,l) (26), where α_2 = 0.95 is set according to the actual situation, and clearly α_2 ≤ η(k,l) ≤ 1; the noise power spectrum can then be estimated from η(k,l): N(k,l) = η(k,l)N(k−1,l) + (1−η(k,l))|Y(k,l)|² (27); the above is the estimation process of the ambient sound power spectrum N(k,l); Step S85: calculate the spectral gain factor: G(k,l) = C(k,l)/(C(k,l) + σN(k,l)) (28), where C(k,l) = |Y(k,l)|² − N(k,l) denotes the power spectrum of the clean sound signal and σ is the over-subtraction factor, whose value is: Step S86: obtain the enhanced audio signal magnitude spectrum: F(k,l) = |G(k,l) × |Y(k,l)|²|^(1/2) (30).
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510226082.6 | 2015-05-06 | ||
CN201510226082.6A CN104882144B (zh) | 2015-05-06 | 2015-05-06 | 基于声谱图双特征的动物声音识别方法 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016176887A1 true WO2016176887A1 (zh) | 2016-11-10 |
Family
ID=53949612
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2015/080284 WO2016176887A1 (zh) | 2015-05-06 | 2015-05-29 | 基于声谱图双特征的动物声音识别方法 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104882144B (zh) |
WO (1) | WO2016176887A1 (zh) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109256141A (zh) * | 2018-09-13 | 2019-01-22 | 芯盾(北京)信息技术有限公司 | 利用语音信道进行数据传输的方法 |
CN109740423A (zh) * | 2018-11-22 | 2019-05-10 | 霍尔果斯奇妙软件科技有限公司 | 基于人脸和小波包分析的种族识别方法及系统 |
CN109949825A (zh) * | 2019-03-06 | 2019-06-28 | 河北工业大学 | 基于fpga加速的pcnn算法的噪声分类方法 |
CN111276158A (zh) * | 2020-01-22 | 2020-06-12 | 嘉兴学院 | 一种基于语谱图纹理特征的音频场景识别方法 |
CN111540368A (zh) * | 2020-05-07 | 2020-08-14 | 广州大学 | 一种稳健的鸟声提取方法、装置及计算机可读存储介质 |
CN112153461A (zh) * | 2020-09-25 | 2020-12-29 | 北京百度网讯科技有限公司 | 用于定位发声物的方法、装置、电子设备及可读存储介质 |
CN113823295A (zh) * | 2021-10-12 | 2021-12-21 | 青岛农业大学 | 一种通过羊的声音智能识别发情状态的方法 |
CN114187479A (zh) * | 2021-12-28 | 2022-03-15 | 河南大学 | 一种基于空谱特征联合的高光谱图像分类方法 |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105304078B (zh) * | 2015-10-28 | 2019-04-30 | 中国电子科技集团公司第三研究所 | 目标声数据训练装置和目标声数据训练方法 |
CN105489228A (zh) * | 2015-12-08 | 2016-04-13 | 杭州百世伽信息科技有限公司 | 一种基于频域图处理的干罗音识别方法 |
CN105959789B (zh) * | 2016-05-26 | 2018-11-20 | 无锡天脉聚源传媒科技有限公司 | 一种节目频道确定方法及装置 |
CN107436599A (zh) * | 2016-05-26 | 2017-12-05 | 北京空间技术研制试验中心 | 近距离在轨操作航天器的敏捷运动规划方法 |
CN106653032B (zh) * | 2016-11-23 | 2019-11-12 | 福州大学 | 低信噪比环境下基于多频带能量分布的动物声音检测方法 |
CN106531174A (zh) * | 2016-11-27 | 2017-03-22 | 福州大学 | 基于小波包分解和声谱图特征的动物声音识别方法 |
CN108205535A (zh) * | 2016-12-16 | 2018-06-26 | 北京酷我科技有限公司 | 情感标注的方法及其系统 |
CN107424248A (zh) * | 2017-04-13 | 2017-12-01 | 成都步共享科技有限公司 | 一种共享自行车的声纹开锁方法 |
CN107393550B (zh) * | 2017-07-14 | 2021-03-19 | 深圳永顺智信息科技有限公司 | 语音处理方法及装置 |
CN107369451B (zh) * | 2017-07-18 | 2020-12-22 | 北京市计算中心 | 一种辅助鸟类繁殖期的物候研究的鸟类声音识别方法 |
CN109409434B (zh) * | 2018-02-05 | 2021-05-18 | 福州大学 | 基于随机森林的肝脏疾病数据分类规则提取的方法 |
CN109065034B (zh) * | 2018-09-25 | 2023-09-08 | 河南理工大学 | 一种基于声音特征识别的婴儿哭声翻译方法 |
CN109597305A (zh) * | 2018-12-03 | 2019-04-09 | 东华大学 | 基于语言信号分析和大数据分析的服装震动智能提醒系统 |
CN110390952B (zh) * | 2019-06-21 | 2021-10-22 | 江南大学 | 基于双特征2-DenseNet并联的城市声音事件分类方法 |
CN110827837B (zh) * | 2019-10-18 | 2022-02-22 | 中山大学 | 一种基于深度学习的鲸鱼活动音频分类方法 |
CN111626093B (zh) * | 2020-03-27 | 2023-12-26 | 国网江西省电力有限公司电力科学研究院 | 一种基于鸣声功率谱密度的输电线路相关鸟种识别方法 |
CN112721933B (zh) * | 2020-07-28 | 2022-01-04 | 盐城工业职业技术学院 | 一种基于语音识别的农用拖拉机的控制终端 |
CN112735444B (zh) * | 2020-12-25 | 2024-01-09 | 浙江弄潮儿智慧科技有限公司 | 一种具有模型匹配的中华凤头燕鸥识别系统及其模型匹配方法 |
CN112687068B (zh) * | 2021-03-19 | 2021-05-28 | 四川通信科研规划设计有限责任公司 | 一种基于微波和振动传感器数据的侵入检测方法 |
CN114400009B (zh) * | 2022-03-10 | 2022-07-12 | 深圳市声扬科技有限公司 | 声纹识别方法、装置以及电子设备 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102522082A (zh) * | 2011-12-27 | 2012-06-27 | 重庆大学 | 一种公共场所异常声音的识别与定位方法 |
CN103474072A (zh) * | 2013-10-11 | 2013-12-25 | 福州大学 | 利用纹理特征与随机森林的快速抗噪鸟鸣声识别方法 |
CN103474066A (zh) * | 2013-10-11 | 2013-12-25 | 福州大学 | 基于多频带信号重构的生态声音识别方法 |
CN103489446A (zh) * | 2013-10-10 | 2014-01-01 | 福州大学 | 复杂环境下基于自适应能量检测的鸟鸣识别方法 |
US8838260B2 (en) * | 2009-10-07 | 2014-09-16 | Sony Corporation | Animal-machine audio interaction system |
-
2015
- 2015-05-06 CN CN201510226082.6A patent/CN104882144B/zh not_active Expired - Fee Related
- 2015-05-29 WO PCT/CN2015/080284 patent/WO2016176887A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8838260B2 (en) * | 2009-10-07 | 2014-09-16 | Sony Corporation | Animal-machine audio interaction system |
CN102522082A (zh) * | 2011-12-27 | 2012-06-27 | 重庆大学 | 一种公共场所异常声音的识别与定位方法 |
CN103489446A (zh) * | 2013-10-10 | 2014-01-01 | 福州大学 | 复杂环境下基于自适应能量检测的鸟鸣识别方法 |
CN103474072A (zh) * | 2013-10-11 | 2013-12-25 | 福州大学 | 利用纹理特征与随机森林的快速抗噪鸟鸣声识别方法 |
CN103474066A (zh) * | 2013-10-11 | 2013-12-25 | 福州大学 | 基于多频带信号重构的生态声音识别方法 |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109256141A (zh) * | 2018-09-13 | 2019-01-22 | 芯盾(北京)信息技术有限公司 | 利用语音信道进行数据传输的方法 |
CN109740423A (zh) * | 2018-11-22 | 2019-05-10 | 霍尔果斯奇妙软件科技有限公司 | 基于人脸和小波包分析的种族识别方法及系统 |
CN109949825A (zh) * | 2019-03-06 | 2019-06-28 | 河北工业大学 | 基于fpga加速的pcnn算法的噪声分类方法 |
CN111276158A (zh) * | 2020-01-22 | 2020-06-12 | 嘉兴学院 | 一种基于语谱图纹理特征的音频场景识别方法 |
CN111540368A (zh) * | 2020-05-07 | 2020-08-14 | 广州大学 | 一种稳健的鸟声提取方法、装置及计算机可读存储介质 |
CN111540368B (zh) * | 2020-05-07 | 2023-03-14 | 广州大学 | 一种稳健的鸟声提取方法、装置及计算机可读存储介质 |
CN112153461A (zh) * | 2020-09-25 | 2020-12-29 | 北京百度网讯科技有限公司 | 用于定位发声物的方法、装置、电子设备及可读存储介质 |
CN112153461B (zh) * | 2020-09-25 | 2022-11-18 | 北京百度网讯科技有限公司 | 用于定位发声物的方法、装置、电子设备及可读存储介质 |
CN113823295A (zh) * | 2021-10-12 | 2021-12-21 | 青岛农业大学 | 一种通过羊的声音智能识别发情状态的方法 |
CN114187479A (zh) * | 2021-12-28 | 2022-03-15 | 河南大学 | 一种基于空谱特征联合的高光谱图像分类方法 |
Also Published As
Publication number | Publication date |
---|---|
CN104882144A (zh) | 2015-09-02 |
CN104882144B (zh) | 2018-10-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2016176887A1 (zh) | 基于声谱图双特征的动物声音识别方法 | |
WO2021208287A1 (zh) | 用于情绪识别的语音端点检测方法、装置、电子设备及存储介质 | |
Reney et al. | An efficient method to face and emotion detection | |
US8428945B2 (en) | Acoustic signal classification system | |
WO2016155047A1 (zh) | 低信噪比声场景下声音事件的识别方法 | |
CN109446948A (zh) | 一种基于Android平台的人脸和语音多生物特征融合认证方法 | |
US20180277146A1 (en) | System and method for anhedonia measurement using acoustic and contextual cues | |
Renjith et al. | Speech based emotion recognition in Tamil and Telugu using LPCC and hurst parameters—A comparitive study using KNN and ANN classifiers | |
Whitehill et al. | Whosecough: In-the-wild cougher verification using multitask learning | |
EP3816996A1 (en) | Information processing device, control method, and program | |
Singh et al. | Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition | |
Ramashini et al. | A Novel Approach of Audio Based Feature Optimisation for Bird Classification. | |
Tsau et al. | Content/context-adaptive feature selection for environmental sound recognition | |
Yue et al. | Speaker age recognition based on isolated words by using SVM | |
Sas et al. | Gender recognition using neural networks and ASR techniques | |
Rajesh | Performance analysis of ML algorithms to detect gender based on voice | |
Zhong et al. | Gender recognition of speech based on decision tree model | |
Li et al. | Aging face verification in score-age space using single reference image template | |
Zhang et al. | Sparse coding for sound event classification | |
Shawkat | Evaluation of Human Voice Biometrics and Frog Bioacoustics Identification Systems Based on Feature Extraction Method and Classifiers | |
CN112669881B (zh) | 一种语音检测方法、装置、终端及存储介质 | |
Keyvanrad et al. | Feature selection and dimension reduction for automatic gender identification | |
Farhood et al. | Investigation on model selection criteria for speaker identification | |
Koshtura | INFORMATION TECHNOLOGY FOR GENDER RECOGNITION BY VOICE | |
Patil et al. | Unveiling the State-of-the-Art: A Comprehensive Survey on Voice Activity Detection Techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15891139 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 15891139 Country of ref document: EP Kind code of ref document: A1 |