WO2021088176A1 - Low signal-to-noise ratio sound event detection method based on binary multi-band energy distribution - Google Patents

Low signal-to-noise ratio sound event detection method based on binary multi-band energy distribution

Info

Publication number
WO2021088176A1
WO2021088176A1 PCT/CN2019/123469 CN2019123469W WO2021088176A1 WO 2021088176 A1 WO2021088176 A1 WO 2021088176A1 CN 2019123469 W CN2019123469 W CN 2019123469W WO 2021088176 A1 WO2021088176 A1 WO 2021088176A1
Authority
WO
WIPO (PCT)
Prior art keywords
bmbpd
sound
noise ratio
sound event
dctz
Prior art date
Application number
PCT/CN2019/123469
Other languages
English (en)
French (fr)
Inventor
李应
吴灵菲
王庆
池哲坚
Original Assignee
福州大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 福州大学
Publication of WO2021088176A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Definitions

  • The present invention belongs to the field of sound event detection (SED), and in particular relates to a low signal-to-noise ratio sound event detection method based on binary multi-band energy distribution.
  • Sound event detection is the task of assigning the audio content of a short sound clip to one of a set of pre-trained classes.
  • Sound event detection has been a research hotspot in the field of acoustic analysis.
  • Acoustic event detection has been applied to the fields of acoustic monitoring, bioacoustic monitoring, environmental sound, context-aware auxiliary robots, music genre classification, and multimedia archiving.
  • feature representation mainly includes conventional audio feature representation (R. Grzeszick, A. Plinge, and G. A. Fink, "Bag-of-features methods for acoustic event detection and classification," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 6, pp. 1242-1252, Jun. 2017), deep audio features extracted by deep neural networks (Y. Li, X. Zhang, H. Jin, X. Li, Q. Wang, Q. He, and Q.
  • Multimedia, vol. 21, no. 6, pp. 1359-1371, Jun. 2019
  • large-scale audio annotation (Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "Audio set classification with attention model: a probabilistic perspective," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 316-320), abnormal sound event detection (Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada, "Unsupervised detection of anomalous sound based on deep learning and the Neyman-Pearson lemma," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no.
  • polyphonic sound event detection mainly includes convolutional neural networks for polyphonic sound event detection (E. Cakir, G. Parascandolo, T. Heittola, H. Huttunen, T. Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 6, pp. 1291-1303, Jun. 2017), polyphonic event tracking using linear dynamical systems (E. Benetos, G. Lafay, M. Lagrange, and M. D. Plumbley, "Polyphonic sound event tracking using linear dynamical systems," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 6, pp. 1266-1267, Jun.
  • convolutional neural networks for polyphonic sound event detection: E. Cakir, G. Parascandolo, T. Heittola, H. Huttunen, T.
  • MFCC: F. Zheng, G. Zhang, and Z. Song, "Comparison of different implementations of MFCC," Journal of Computer Science and Technology, vol. 16, no. 6, pp. 582-589, 2001
  • PNCC: C. Kim and R. M. Stern, "Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring," in Proc. IEEE Int. Conf.
  • image features are extracted by Jet mapping, which maps the gray-scale log spectrogram into 3 sub-images; each sub-image is then divided into 9 × 9 blocks; the mean and variance of each block are extracted to obtain a 486 (2 × 3 × 9 × 9)-dimensional feature vector; finally, the features are used for SVM training and classification.
  • the document "Image feature representation of the subband power distribution for robust sound event classification" further improves the spectrogram, the spectral analysis, and the choice of classifier.
  • the spectrogram and its analysis include: a gray-scale gammatone spectrogram, the sub-band energy distribution (SPD), and contrast enhancement to form an enhanced sub-band energy distribution map.
  • Further processing of the image features includes frame-missing mask estimation and removal of unreliable dimensions. Classification is then performed with a k-nearest neighbor (kNN) classifier based on the Hellinger distance.
  • kNN k-nearest neighbor classifier
  • Jet is also used to map the sub-band energy distribution map into 3 sub-images; each sub-image is then divided into 10 × 10 blocks; the mean and variance are extracted to obtain a 600 (2 × 3 × 10 × 10)-dimensional feature vector; finally, kNN modeling and classification are performed on the features.
  • the principle of Jet mapping is shown in Figure 2 of the accompanying drawings in the specification.
  • the horizontal axis represents the gray value of the sub-band energy distribution map, and the range is [0, 1].
  • the vertical axis represents the value of the gray value after Jet mapping. Among them, the three sub-images are respectively mapped according to the three RGB polylines in Figure 2.
  • the Jet mapping used in the sub-band energy distribution map has the following problems.
  • a point with a gray value of 1 indicates that the probability density of the corresponding frequency band and energy level is large; such points are related to the low-energy part of the sound event or to background noise.
  • in a low signal-to-noise ratio environment, the low-energy part of the sound event is heavily interfered with and easily becomes an unreliable region, and the background noise part needs to be suppressed even more.
  • under the Red polyline of Jet mapping, a gray value of 0.5 in the sub-band energy distribution map is mapped to 0.5, and a point with a gray value of 1 is also mapped to 0.5, which is equivalent to adding more unreliable components to the red sub-image.
  • the present invention proposes a new sound event detection framework that uses a unique combination of Binary Multi-Band Power Distribution (BMBPD) and random forest (RF) to provide excellent performance.
  • BMBPD binarizes the pixels of the MBPD map whose gray level is below a certain threshold to 1 and the rest to 0, which highlights the pixels related to sound events in the MBPD map while suppressing the influence of noise, thereby reducing the influence of noise on the sound event under test in low signal-to-noise ratio environments.
  • a block-wise discrete cosine transform (DCT) is applied to the BMBPD map, the main part of the zigzag (Z) coding of the DCT coefficients is used as the feature of the sound event, namely BMBPD-DCTZ, and a random forest (RF) classifier is used to train on and detect BMBPD-DCTZ.
  • DCT discrete cosine transform
  • RF random forest
  • Step S1: Filter the sound signal y(t) through the gammatone filter bank to obtain y_f[t]; take the logarithm of y_f[t] to form the corresponding gammatone spectrogram S_g(f, t);
  • Step S2: Normalize the energy spectrum of each sound signal to obtain the normalized energy spectrum G(f, t);
  • Step S3: Calculate the multi-band energy distribution of G(f, t) to obtain the MBPD map M(f, b);
  • Step S4: Binarize the MBPD map M(f, b) to obtain the BMBPD map M_R(f, b);
  • Step S5: Divide the BMBPD map M_R(f, b) into blocks, and perform DCT (discrete cosine transform) on the sub-blocks;
  • Step S6: Perform Zigzag scanning on the DCT coefficients to obtain a 1-dimensional arrangement of DCT coefficients, and take the first m DCT coefficients as BMBPD-DCTZ;
  • Step S7: Use BMBPD-DCTZ as the feature and RF (random forest) as the classifier to classify and/or identify BMBPD-DCTZ.
  • in step S1:
  • f represents the center frequency of the gammatone filter
  • t represents the frame index
  • step S2
  • in step S3, suppose that G(f, t) has a total of B energy levels; a statistics-based non-parametric method is used to compute the probability density of the energy elements of each frequency sub-band f, giving the probability distribution M(f, b) over the energy levels of each sub-band: M(f, b) = (1/W) Σ_{t=1..W} I_b(G(f, t))
  • W is the number of frames of the sound signal
  • M(f, b) represents the ratio of the elements with energy level b in the frequency band f to the total number of elements in the frequency band (0 ≤ M(f, b) ≤ 1)
  • I_b(G(f, t)) is an indicator function.
  • when G(f, t) belongs to the energy level b, its value is 1, otherwise it is 0;
  • step S4
  • the range of the threshold n is in the interval [1, W].
  • in step S5, the BMBPD map M_R(f, b) is divided into 8 × 8 blocks, and after DCT is performed on each sub-block, 8 × 8 DCT coefficients are obtained.
  • in step S6, the first 5 of the 64 coefficients in the 1-dimensional Zigzag arrangement are taken as BMBPD-DCTZ.
  • step S7 to identify BMBPD-DCTZ specifically includes the following steps:
  • Step S71 Data feature setting: Place the BMBPD-DCTZ feature of the sound signal to be measured at the root node of all nk decision trees in the random forest;
  • Step S72 Decision tree setting and decision: according to the classification rules of each decision tree, the feature is passed down from the root node until it reaches a leaf node.
  • the class label corresponding to the leaf node is the vote cast by that decision tree on the class of the BMBPD-DCTZ feature.
  • Step S73 Random forest detection: the nk decision trees of the random forest vote on the class of the BMBPD-DCTZ feature of each sound signal to be tested; the votes are counted, and the class label with the most votes is the final class label of the sound signal to be tested.
  • the BMBPD map is 64 energy levels × 256 frequency bands.
  • the present invention and its preferred solutions use the binary multi-band energy distribution to generate features for low signal-to-noise ratio sound event detection.
  • the binarized multi-band energy distribution provides better resolution control and can be tuned to better capture low signal-to-noise ratio sound event information.
  • the binary multi-band energy distribution map is divided into 8*8 blocks and the Zigzag coefficients of the discrete cosine transform are extracted as features.
  • random forest will be used to model and detect features. The accuracy of this method under both stationary noise conditions and non-stationary noise conditions is significantly better than the commonly used methods.
  • the advantages of the method of the present invention are more prominent.
  • the BMBPD map can also be combined with deep learning technology to further improve and increase the detection rate of various sound events under low signal-to-noise ratio.
  • Fig. 1 is a schematic diagram of sound event classification in the prior art where spectrogram features are used for non-matching conditions;
  • FIG. 2 is a schematic diagram of Jet mapping rules (SPD, mapping) in the prior art
  • FIG. 3 is a schematic diagram of a low signal-to-noise ratio sound event detection process according to an embodiment of the present invention
  • FIG. 4 is a gammatone spectrum diagram and MBPD diagram of a fox call according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of the 8 × 8 blocks of a BMBPD map according to an embodiment of the present invention.
  • FIG. 6 is a partial enlarged schematic diagram of the sub-block in the black box among the 8 × 8 sub-blocks of the BMBPD map according to an embodiment of the present invention
  • FIG. 7 is a schematic diagram of DCT coefficients and Zigzag scanning of sub-blocks in black boxes according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of the Zigzag arrangement of 64 DCT coefficients and the top 5 Zigzag coefficients according to an embodiment of the present invention
  • FIG. 9 is a schematic diagram of the detection rate (Precision, Threshold) of different values of n according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of a comparison (SNR (dB), Precision (%)) of detection accuracy between the method of the embodiment of the present invention and other existing methods;
  • the detection method provided by the embodiment of the present invention can be summarized as follows: sound events are detected through the sequential steps of gray-scale gammatone spectrogram, BMBPD map, 8 × 8 blocking, DCT, Zigzag coding, and RF classifier.
  • BMBPD converts sound data into a BMBPD map by counting the probability densities of 64 energy levels in 256 frequency bands in the signal.
  • the steps of forming a BMBPD map are as follows:
  • f represents the center frequency of the filter
  • t represents the frame index
  • the gammatone spectrogram and MBPD display of the fox's call are provided.
  • Multi-band energy distribution: statistics of the energy distribution of G(f, t) are computed to obtain M(f, b), which is the MBPD map shown in Figure 4(b). If there are B energy levels in total, a statistics-based non-parametric method is used to compute the probability density of the energy elements of each frequency sub-band f, giving the probability distribution M(f, b) over the energy levels of each frequency sub-band.
  • W is the number of frames of the sound clip.
  • W = 198
  • M(f, b) represents the ratio of elements with energy level b in the frequency band f to the total number of elements in the frequency band (0 ≤ M(f, b) ≤ 1).
  • I_b(G(f, t)) is an indicator function. When G(f, t) belongs to the energy level b, its value is 1, otherwise it is 0.
  • Binarization Binarize M(f, b) to obtain M R (f, b).
  • the range of the threshold n is in the interval [1, W].
  • Performing a Discrete Cosine Transform (DCT) on an image concentrates the important visual information of the image into a small number of DCT coefficients (G. A. Papakostas, D. E. Koulouriotis, and E. G. Karakasis, "Efficient 2-D DCT computation from an image representation point of view," London, UK, Intch Open, pp. 21-34, 2009).
  • in the DCT coefficient matrix, the magnitudes of the DCT coefficients decrease in order along the direction from upper left to lower right.
  • the first coefficient in the upper left corner, called the DC coefficient of the DCT, is the average value of the image pixels.
  • the other coefficients are called alternating current (AC) coefficients. The closer the AC coefficient is to the upper left corner, the more image information it contains.
  • the BMBPD picture composed of M B (f, b) is divided into blocks, and then DCT is performed on the sub-blocks.
  • each sub-block and its pixels carry the distribution of the sound signal in the corresponding frequency band and energy level.
  • FIG. 6 corresponds to the black-framed sub-block indicated by the arrow in FIG. 5, which is the energy distribution of frequency bands 129 to 136 and energy levels 23 to 30 in the BMBPD map.
  • after DCT is performed on the sub-block, an 8 × 8 DCT coefficient matrix as shown in FIG. 7 is obtained.
  • the 8 × 8 DCT coefficients in Fig. 7 are scanned in Zigzag order, giving a 1-dimensional arrangement of 64 DCT coefficients as shown in Fig. 8(a).
  • the importance of the 64 DCT coefficients in this 1-dimensional arrangement to visual information is arranged in order from left to right.
  • the coefficients on the left side of the one-dimensional arrangement can represent the main information of the image.
  • a part of the coefficients on the left side of the 1-dimensional arrangement is used to represent the main information of the 8 × 8 sub-block, that is, the energy distribution of the sound signal in a specific frequency band.
  • the first 5 of the 64 coefficients in the 1-dimensional Zigzag arrangement are taken as the feature of the 8 × 8 sub-block. This feature, obtained by Zigzag scanning of the DCT coefficients of the BMBPD sub-block, is referred to as BMBPD-DCTZ for short.
  • Random forest (RF, L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001) is an ensemble classifier algorithm that uses multiple decision tree classifiers to discriminate data.
  • the steps of random forest to detect sound events are: (1) Data feature setting.
  • the BMBPD-DCTZ feature of the sound data to be tested is placed at the root node of all nk decision trees in the random forest.
  • the class label corresponding to the leaf node is the vote made by this decision tree on the class of the BMBPD-DCTZ feature.
  • the nk decision trees of the random forest vote on the BMBPD-DCTZ feature category of each sound signal to be tested. Count the votes of nk decision trees in the forest, and the class label with the most votes is the class label corresponding to the final sample to be tested.
  • the experimental data uses two data sets, an animal sound event set and an office sound event set.
  • 50 animal sound events come from the Freesound sound database (F. Font, G. Roma, P. Herrera, and X. Serra, "Characterization of the Freesound online community,” in Proc. 3rd int. Workshop Cognitive Inf. Process. , May 2012, pp.1-6), including the sounds of different birds and mammals.
  • Virtanen, and M. D. Plumbley, "Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge," IEEE Trans. Audio, Speech, Lang. Process., vol. 26, no. 2, pp. 379-393, Feb. 2018).
  • 50 kinds of animal sound events are the main ones, and 11 kinds of office sound events are used as further auxiliary verification.
  • the six noise environments used in the experiment can be divided into two categories, namely stationary noise and non-stationary noise.
  • the stationary noise is pink noise, and the non-stationary noises are the sounds of running water, wind, road, ocean waves, and rain, which simulate real-scene sounds.
  • the format of noise samples and sound events is a mono ".wav" format, and the sampling rate is 44.1kHz.
  • the relevant parameters are: the frame length is 25ms, the frame shift is 10ms, the number of filter banks is 256, and the center frequency is between 50Hz and fs/2.
  • the size of the threshold n has different effects on the detection rate under different signal-to-noise ratio conditions.
  • low signal-to-noise ratio conditions such as -5dB, 0dB and 5dB
  • the influence of n is not obvious.
  • the detection rate first improves and then continues to decline.
  • the MFCC feature uses a triangular filter bank of 256 filters to extract 12-dimensional DCT coefficients.
  • a 32-order gammatone filter and 12-dimensional DCT coefficients are used to extract PNCC features.
  • Table 3 The average detection rate of animal sound events with different characteristics in six noise environments (%)
  • Table 3 shows the detection results of animal sound events with several features under six background noises including flowing water, pink noise, wind, ocean waves, roads and rain.
  • the detection results for office sound events in their original form and in pink noise at signal-to-noise ratios of 5dB, 0dB, and -5dB are shown in Table 4. It can be seen from Table 3 and Table 4 that the BMBPD-DCTZ features perform significantly better than the commonly used features. In particular, in pink noise at a signal-to-noise ratio of -5dB, office sound events still reach an average detection rate of 58.8 ± 3.7%.
  • Table 5 shows the average detection accuracy of animal sound events in six noise environments and four different signal-to-noise ratios.
  • Figure 10 shows the detection accuracy of four signal-to-noise ratio sound events under various noise environments.
  • Table 6 shows the detection accuracy of different methods for office sound events.
  • the SNET method is currently the most advanced deep learning-based method, which uses pre-trained SoundNet CNN (snet) (Y. Aytar, C. Vondrick, and A. Torralba, "Soundnet: Learning sound representations from unlabeled video, "in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 892-900)
  • the output of the internal layer is used as a feature.
  • using the features of the pool5 layer suggested by the author using them in (ATYusuf Aytar, Carl Vondrick, and A. Torralba, “Soundnet: Learning sound representations from unlabeled video.” 2016. [Source: Internet].
  • the method of this embodiment has obvious advantages. In particular, at -5dB, the average detection rate of the method in this embodiment is 58 ± 3.7%, which is better than the 48 ± 6.6% of Snet and the 54.6 ± 5.4% of MBPD. This shows that the method of this embodiment is strongly robust to various low signal-to-noise ratio sound events and has a certain ability to suppress the influence of noise.
  • From Fig. 10 it can be seen that for -10dB sound events, the detection rates in the wind sound environment in Fig. 10(c) and the rain sound environment in Fig. 10(f) are relatively low. The detection rate in the wind sound environment is 51.0 ± 5.2%, slightly higher than the 50.0 ± 4.3% in the rain sound environment. The specific detection results in the wind and rain environments are analyzed further in Figure 11(a) and Figure 11(c), which show the detection results for the 50 types of animal sound events in the wind sound environment and the rain sound environment, respectively.
  • the color value of the point with the coordinate (x, y) is the number of sound events belonging to the x category that are detected as the y category.
  • the x and y values are the sequential numbers of animal sounds in Table 1. Further analysis revealed the following changes.
  • the detection rate of the method in this embodiment can be further improved.
  • the detection errors of sound events are mainly concentrated in being misdetected as the 4th and 7th categories.
  • the detection errors of sound events are mainly concentrated in the misdetection of the 4th, 10th and 16th categories.
  • the error rates of being misidentified as category 4, category 10, and category 16 are 7.5%, 18.1%, and 7.8%, respectively.
  • the reason for this error is that the BMBPD of the low signal-to-noise ratio sound event is similar to the BMBPD of the sound event of the wrong category. Therefore, if specific enhancements related to the environmental sound are applied to low SNR sound events and the value of n is adjusted appropriately for different environments, the detection performance for low SNR sound events can be further improved.
  • the high energy and related parts of the sound event may be compressed.
  • a low signal-to-noise ratio means that the energy of the ambient sound is high.
  • the height of the BMBPD related to the sound event is compressed. But as shown in Figure 12(h)(i), the key green long frame and red circle are still clear.
  • the green frame and red circle in Figure 12(h)(i) are the key basis for the detection of low-signal-to-noise ratio sound events.
  • the sound event has a frequency band and energy different from the background noise, it can be reflected in the BMBPD diagram using the method of this embodiment.
  • the MBPD is divided into 256 frequency bands and the BMBPD is divided into 8 × 8 blocks. In practical applications, adjustments can be made according to the changes in the sound range of the specific sound event to be measured.
  • the detection rate of the method in this embodiment is not ideal due to the influence of non-stationary environmental sounds.
  • specific sound events can be divided into more detailed frequency bands and energy levels, and more effective BMBPD-DCTZ can be extracted to improve the detection rate.
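
The MBPD statistics of step S3 and the binarization of step S4 described in the list above can be illustrated with a minimal standard-library Python sketch. Several details are assumptions, not taken from the patent text: the function names `mbpd` and `binarize` are hypothetical, energy levels are formed by uniform quantization of normalized energies in [0, 1], the threshold is applied as n / W, empty bins stay 0, and the comparison is strict.

```python
def mbpd(G, num_levels):
    """Multi-band energy distribution M(f, b).

    G is a list of frequency bands; each band is a list of W normalized
    energies in [0, 1]. M[f][b] is the fraction of frames in band f whose
    energy falls in level b, i.e. (1/W) * sum_t I_b(G(f, t)).
    """
    M = []
    for band in G:
        W = len(band)
        counts = [0] * num_levels
        for g in band:
            # Assumption: uniform quantization of [0, 1] into num_levels bins.
            b = min(int(g * num_levels), num_levels - 1)
            counts[b] += 1
        M.append([c / W for c in counts])
    return M


def binarize(M, n, W):
    """Binary map M_R(f, b).

    Assumption: a pixel becomes 1 when it is non-empty and its probability
    is below n / W; empty bins and high-density bins become 0.
    """
    threshold = n / W
    return [[1 if 0 < m < threshold else 0 for m in row] for row in M]


# One band, four frames, four energy levels.
G = [[0.0, 0.1, 0.1, 0.9]]
M = mbpd(G, num_levels=4)
M_R = binarize(M, n=2, W=4)
print(M)    # three frames fall in level 0, one in level 3
print(M_R)  # only the sparse high-energy bin is kept
```

In this toy example the densely populated low-energy bin (typical of background noise) is suppressed, while the sparsely populated bin is retained, which is the qualitative behaviour the text attributes to BMBPD.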

Abstract

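A low signal-to-noise ratio sound event detection method based on binary multi-band energy distribution, which uses a combination of Binary Multi-Band Power Distribution (BMBPD) and random forest (RF) to provide excellent performance. BMBPD binarizes the pixels of the MBPD map whose gray level is below a certain threshold to 1 and the rest to 0, which highlights the pixels related to sound events in the MBPD map while suppressing the influence of noise, thereby reducing the influence of noise on the sound event under test in low signal-to-noise ratio environments. A block-wise discrete cosine transform (DCT) is applied to the BMBPD map, the main part of the zigzag (Z) coding of the DCT coefficients is used as the feature of the sound event, namely BMBPD-DCTZ, and a random forest (RF) classifier is used to train on and detect BMBPD-DCTZ. The method applies to a wide range of sound levels and is strongly robust in severe non-stationary noise.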

Description

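Low signal-to-noise ratio sound event detection method based on binary multi-band energy distribution
Technical Field
The present invention belongs to the field of sound event detection (SED), and in particular relates to a low signal-to-noise ratio sound event detection method based on binary multi-band energy distribution.
Background Art
Sound event detection (SED) is the task of assigning the audio content of a short sound clip to one of a set of pre-trained classes. For the past 20 years, sound event detection has been a research hotspot in the field of acoustic analysis. It has been applied to acoustic monitoring, bioacoustic monitoring, environmental sound, context-aware assistive robots, music genre classification, multimedia archiving, and other fields.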
Current research on the classification and detection of sound events can be grouped into three areas: feature representation, deep-learning-based sound event classification and detection, and polyphonic sound event detection. Feature representation mainly includes conventional audio feature representation (R. Grzeszick, A. Plinge, and G. A. Fink, "Bag-of-features methods for acoustic event detection and classification," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 6, pp. 1242-1252, Jun. 2017), deep audio features extracted by deep neural networks (Y. Li, X. Zhang, H. Jin, X. Li, Q. Wang, Q. He, and Q. Huang, "Using multi-stream hierarchical deep neural network to extract deep audio feature for acoustic event detection," Multimed Tools Appl., vol. 77, pp. 897-916, 2018), spectrogram features extracted from left singular vectors (Manjunath. M and S. G. Koolagudi, "Segmentation and characterization of acoustic event spectrograms using singular value decomposition," Expert Systems Appl., vol. 120, pp. 413-425, 2019), and nonlinear time-normalized representation (I. M. Morato, M. Cobos, and F. J. Ferri, "Adaptive Mid-Term representations for robust audio event classification," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 12, pp. 2381-2392, Dec. 2018). Deep-learning-based sound event detection includes deep-learning-based sound event classification and detection (X. Xia, R. Togneri, F. Sohel, and D. Huang, "Auxiliary classifier generative adversarial network with soft labels in imbalanced acoustic event detection," IEEE Trans. Multimedia, vol. 21, no. 6, pp. 1359-1371, Jun. 2019), large-scale audio tagging (Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "Audio set classification with attention model: a probabilistic perspective," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 316-320), anomalous sound event detection (Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada, "Unsupervised detection of anomalous sound based on deep learning and the neyman-pearson lemma," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 1, pp. 212-224, Jan. 2019), and weakly labeled sound event detection (B. McFee, J. Salamon, and J. P. Bello, "Adaptive pooling operators for weakly labeled sound event detection," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 11, pp. 2180-2193, Nov. 2018; Q. Kong, Y. Xu, I. Sobieraj, W. Wang, and M. D. Plumbley, "Sound event detection and time-frequency segmentation from weakly labelled data," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 4, pp. 777-778, Apr. 2019). Polyphonic sound event detection mainly includes convolutional neural networks for polyphonic sound event detection (E. Cakir, G. Parascandolo, T. Heittola, H. Huttunen, T. Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 6, pp. 1291-1303, Jun. 2017), polyphonic event tracking using linear dynamical systems (E. Benetos, G. Lafay, M. Lagrange, and M. D. Plumbley, "Polyphonic sound event tracking using linear dynamical systems," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 6, pp. 1266-1267, Jun. 2017), and spectrogram-based multi-task audio classification (Y. Zeng, H. Mao, Hua; and D. Peng, "Spectrogram based multi-task audio classification," Multimed Tools Appl., vol. 78, 2019, pp. 3705-3722). These studies show that, for a specific sound scene, the relevant sound events can be classified and detected to some extent if the signal-to-noise ratio is suitable.
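However, in many such applications, sound events occur under a wide variety of challenging noise conditions, and the signal-to-noise ratio (SNR) may even approach -10 dB (Z. Feng, Q. Zhou, J. Zhang, and P. Jiang, "A target guided subband filter for acoustic event detection in noisy environments using wavelet packets," IEEE Trans. Audio, Speech, Lang. Process., vol. 23, no. 2, pp. 361-372, Feb. 2015). The goal of low-SNR sound event detection is to detect and recognize weak sound events in complex acoustic environments. In practice, sound event detection in low-SNR, complex acoustic scenes remains a challenging problem.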
In low-SNR, complex acoustic scenes, the noise contains diverse, non-stationary background sounds, which usually degrade classification and detection performance. In addition, several studies (J. Dennis, H. D. Tran, and E. S. Chng, "Image feature representation of the subband power distribution for robust sound event classification," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 2, pp. 367-377, Feb. 2013; J. Dennis, H. D. Tran, and H. Li, "Spectrogram image feature for sound event classification in mismatched conditions," IEEE Signal Process. Lett., vol. 18, no. 2, pp. 130-133, Feb. 2011) show that, unlike structured signals, environmental audio may contain strong temporal characteristics or a broad, flat spectrum. This may make several conventionally used features, such as MFCC (F. Zheng, G. Zhang, and Z. Song, "Comparison of different implementations of MFCC," Journal of Computer Science and Technology, vol. 16, no. 6, pp. 582-589, 2001), PNCC (C. Kim and R. M. Stern, "Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2010, pp. 4574-4577), LBP (T. Kobayashi and J. Ye, "Acoustic feature extraction by statistics based local binary pattern for environmental sound classification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2014, pp. 3052-3056), and HOG (A. Rakotomamonjy and G. Gasso, "Histogram of gradients of time-frequency representations for audio scene classification," IEEE Trans. Audio, Speech, Lang. Process., vol. 23, no. 1, pp. 142-153, Jan. 2015), as well as existing methods such as SPD-KNN, SIF-SVM, ELBP-HOG (S. Abidin, R. Togneri, and F. Sohel, "Enhanced LBP texture features from time frequency representations for acoustic scene classification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2017, pp. 626-630), and MP-SVM (J. Wang, C. Lin, and B. Chen, "Gabor-based nonuniform scale-frequency map for environmental sound classification in home automation," IEEE Trans. Autom. Sci. Eng., vol. 11, no. 2, pp. 607-613, Apr. 2014), unsuitable for sound event detection at low SNR. Moreover, for deep-learning-based sound event classification and detection, aspects of real-life applications other than recognition performance should also be considered, such as the use of carefully recorded sound examples and the computational cost of the system (S. Sigtia, A. M. Stark, S. Krstulović, and M. D. Plumbley, "Automatic environmental sound recognition: Performance versus computational cost," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 11, pp. 2096-2107, Nov. 2016). Therefore, how to provide a feasible feature extraction and classification method for unstructured signals remains a focus of attention.
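For low-SNR sound events, Dennis et al., in J. Dennis, H. D. Tran, and E. S. Chng, "Image feature representation of the subband power distribution for robust sound event classification," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 2, pp. 367-377, Feb. 2013 and J. Dennis, H. D. Tran, and H. Li, "Spectrogram image feature for sound event classification in mismatched conditions," IEEE Signal Process. Lett., vol. 18, no. 2, pp. 130-133, Feb. 2011, use SPD image features and spectrogram image features for sound event classification under mismatched conditions. The process by which these two methods classify and detect sound events is shown in Fig. 1 of the accompanying drawings. The process of J. Dennis, H. D. Tran, and H. Li, "Spectrogram image feature for sound event classification in mismatched conditions," IEEE Signal Process. Lett., vol. 18, no. 2, pp. 130-133, Feb. 2011 is shown in the thin-line box in the lower half of Fig. 1 and comprises a gray-scale log spectrogram, image feature extraction, and SVM classification. With this method, the detection rate reaches 79.4% at 0 dB. In that paper, image features are first extracted by Jet mapping, which maps the gray-scale log spectrogram into 3 sub-images; each sub-image is then divided into 9×9 blocks; the mean and variance of each block are extracted to obtain a 486 (2×3×9×9)-dimensional feature vector; finally, SVM training and classification are performed.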
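
The 486-dimensional feature count in the paragraph above follows from 3 sub-images × 9×9 blocks × 2 statistics per block. The following is a minimal sketch of that block-statistics step using only the Python standard library. The function names `block_stats` and `sif_features`, the integer block boundaries, and the use of population variance are illustrative assumptions, not details taken from the cited paper.

```python
from statistics import fmean, pvariance


def block_stats(image, blocks_per_side):
    """Mean and variance of each block of a 2-D image (list of rows)."""
    rows, cols = len(image), len(image[0])
    feats = []
    for i in range(blocks_per_side):
        r0 = i * rows // blocks_per_side
        r1 = (i + 1) * rows // blocks_per_side
        for j in range(blocks_per_side):
            c0 = j * cols // blocks_per_side
            c1 = (j + 1) * cols // blocks_per_side
            pixels = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            feats.append(fmean(pixels))
            # Assumption: population variance; the paper may use another estimator.
            feats.append(pvariance(pixels))
    return feats


def sif_features(sub_images, blocks_per_side=9):
    """Concatenate block statistics of all sub-images into one vector."""
    feats = []
    for img in sub_images:
        feats.extend(block_stats(img, blocks_per_side))
    return feats


# Three constant 18x18 sub-images stand in for the Jet-mapped R, G, B images.
img = [[0.5] * 18 for _ in range(18)]
features = sif_features([img, img, img])
print(len(features))  # 2 statistics x 3 sub-images x 9 x 9 blocks
```

Calling the same function with `blocks_per_side=10` gives the 600-dimensional variant mentioned for the SPD-based method.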
以文献“Spectrogram image feature for sound event classification in mismatched conditions”的图像特征抽取方法为基础,如图1的上半部分的粗线框所示,文献“Image feature representation of the subband power distribution for robust sound event classification”在频谱、频谱分析及分类器的选择上进行了进一步的改进。其中,频谱及分析包括:灰度gammatone谱图、子带能量分布(SPD)、对比增强形成增强的子带能量分布图。对图像特征的进一步处理包括:帧缺失掩饰估计,去除不可靠维度。然后再用基于Hellinger距离的k近邻分类器(kNN)分类。采用这种方法,在信噪比为0dB情况下,对声音事件的检测率可以达到90.4%。在文献“Imag e feature representation of the subband power distribution for robust sound event classification”中,也是通过Jet把子带能量分布图映射成3张子图;然后对每张子图进行10×10分块;再提取均值与方差,得到600(2×3×10×10)维向量作为特征;最后,对特征进行kNN的建模与分类。
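The block statistics behind the 486- and 600-dimensional vectors can be sketched as follows. This is a minimal illustration, not code from either cited paper: the function names `block_stats` and `image_feature` are hypothetical, images are plain nested lists, and population variance is assumed.

```python
def block_stats(image, grid):
    """Split a 2-D list into grid x grid blocks; return [mean, variance] per block."""
    rows, cols = len(image), len(image[0])
    feats = []
    for bi in range(grid):
        r0, r1 = bi * rows // grid, (bi + 1) * rows // grid
        for bj in range(grid):
            c0, c1 = bj * cols // grid, (bj + 1) * cols // grid
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            mean = sum(block) / len(block)
            var = sum((v - mean) ** 2 for v in block) / len(block)  # population variance (assumed)
            feats.extend([mean, var])
    return feats


def image_feature(subimages, grid):
    """Concatenate the block statistics of all sub-images (e.g. the three Jet sub-images)."""
    feats = []
    for sub in subimages:
        feats.extend(block_stats(sub, grid))
    return feats
```

With three sub-images, `grid=9` gives 2×3×9×9 = 486 values and `grid=10` gives 2×3×10×10 = 600 values, matching the dimensions quoted above. Each image must have at least `grid` rows and columns so that no block is empty.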
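The main measures by which "Image feature representation of the subband power distribution for robust sound event classification" achieves low-SNR sound event classification are missing-feature mask estimation and removal of unreliable dimensions. The retained SPD is then used to represent the features of the sound event.
Summary of the Invention
Technical Problem
Further analysis of the SPD shows that the method of "Image feature representation of the subband power distribution for robust sound event classification" may run into problems for sound events at even lower SNR, such as -5 dB or -10 dB. At lower SNR, applying the Jet mapping directly to the enhanced SPD image cannot ensure that each sub-image retains as much as possible of the reliable regions related to the sound event.
The principle of the Jet mapping is shown in Fig. 2 of the accompanying drawings. The horizontal axis is the gray value of the SPD image, in the range [0, 1]. The vertical axis is the value obtained after Jet mapping. The three sub-images are mapped according to the three RGB polylines in Fig. 2. Applying the Jet mapping to the SPD image has the following problems.
1) Blue sub-image. In the SPD image, a point with gray value 0 means that the probability density of the corresponding band and energy level is 0, i.e. no energy of that level exists in that band. Under the Blue polyline of Jet, however, such a point is mapped to 0.5 in the blue sub-image, which amounts to adding extra "noise" to the blue sub-image.
2) Red sub-image. In the SPD image, a point with gray value 1 means that the probability density of the corresponding band and energy level is high, and it is associated with the low-energy part of the sound event or with background noise. In low-SNR environments, the low-energy part of a sound event is heavily disturbed and easily becomes an unreliable region, while the background noise needs to be suppressed. Under the Red polyline of Jet, a gray value of 0.5 is mapped to 0.5, and a gray value of 1 is likewise mapped to 0.5, which amounts to adding more unreliable components to the red sub-image.
3) Green sub-image. Compared with the Blue and Red polylines, the Green polyline not only introduces no extra "noise" but also suppresses the original noise to some extent. Its only shortcoming is that it covers only SPD values in [0.125, 0.875].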
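The three observations above can be checked numerically with the widely used piecewise-linear form of the Jet colormap. The sketch below assumes that Fig. 2 follows this standard form; the function name `jet` is illustrative.

```python
def _clamp01(v):
    """Clip a value to the range [0, 1]."""
    return max(0.0, min(1.0, v))


def jet(x):
    """Return (red, green, blue) for a gray value x in [0, 1] (standard Jet, assumed)."""
    r = _clamp01(1.5 - abs(4.0 * x - 3.0))
    g = _clamp01(1.5 - abs(4.0 * x - 2.0))
    b = _clamp01(1.5 - abs(4.0 * x - 1.0))
    return r, g, b


print(jet(0.0))    # blue channel is 0.5 although the SPD value is 0
print(jet(1.0))    # red channel is 0.5 for the highest SPD value
print(jet(0.5))    # red channel is also 0.5 at mid-gray
print(jet(0.125), jet(0.875))  # green channel is 0 at both ends of its support
```

Under this assumption the blue channel maps gray value 0 to 0.5, the red channel maps both 0.5 and 1 to 0.5, and the green channel is non-zero only strictly between 0.125 and 0.875.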
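Solution to the Problem
Technical Solution
The invention proposes a new sound event detection framework that uses a unique combination of the Binary Multi-Band Power Distribution (BMBPD) and a random forest (RF) to deliver good performance. In BMBPD, pixels of the MBPD image whose gray value is below a given threshold are binarized to 1 and all others to 0. This highlights the pixels of the MBPD image that relate to the sound event while suppressing noise, and so reduces the effect of noise on the sound event under test in low-SNR environments. The BMBPD image is divided into blocks and each block is transformed by the discrete cosine transform (DCT); the leading part of the zigzag-coded DCT coefficients is taken as the sound event feature, BMBPD-DCTZ, and a random forest (RF) classifier is used to train on and detect BMBPD-DCTZ.
Specifically, the invention adopts the following technical solution:
A low-SNR sound event detection method based on the binary multi-band power distribution, characterized by comprising the following steps:
Step S1: filter the sound signal y(t) with a gammatone filter bank to obtain y_f[t]; take the logarithm of y_f[t] to form the corresponding gammatone spectrogram S_g(f,t);
Step S2: normalize the energy spectrum
Figure PCTCN2019123469-appb-000001
of each sound signal to obtain the normalized energy spectrum G(f,t);
Step S3: compute the statistics of the multi-band power distribution of G(f,t) to obtain the MBPD image M(f,b);
Step S4: binarize the MBPD image M(f,b) to obtain the BMBPD image M_R(f,b);
Step S5: divide the BMBPD image M_R(f,b) into blocks and apply the DCT (discrete cosine transform) to each sub-block;
Step S6: apply a zigzag scan to the DCT coefficients to obtain a 1-D arrangement of the DCT coefficients, and take the first m DCT coefficients as BMBPD-DCTZ;
Step S7: with BMBPD-DCTZ as the feature and an RF (random forest) as the classifier, classify and/or recognize BMBPD-DCTZ.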
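Preferably, in step S1,
$S_g(f,t) = \lg|y_f[t]|$   (1);
where f denotes the center frequency of the gammatone filter and t denotes the frame index;
in step S2,
Figure PCTCN2019123469-appb-000002
Preferably, in step S3, let G(f,t) have B energy levels in total; a statistics-based non-parametric method is used to estimate the probability density of the energy elements of each frequency subband f, giving the probability distribution M(f,b) of each energy level in each frequency subband:
Figure PCTCN2019123469-appb-000003
Figure PCTCN2019123469-appb-000004
where W is the number of frames of the sound signal; M(f,b) is the proportion of elements in band f with energy level b among all elements of that band (0 ≤ M(f,b) ≤ 1); and I_b(G(f,t)) is an indicator function equal to 1 when G(f,t) belongs to energy level b and 0 otherwise;
in step S4,
Figure PCTCN2019123469-appb-000005
where the threshold n lies in the interval [1, W].
Preferably, in step S5, the BMBPD image M_R(f,b) is divided into 8×8 blocks, and applying the DCT to each sub-block yields 8×8 DCT coefficients.
Preferably, in step S6, the first 5 of the 64 coefficients in the 1-D zigzag arrangement are taken as BMBPD-DCTZ.
Preferably, recognizing BMBPD-DCTZ in step S7 comprises the following steps:
Step S71: data feature setup: place the BMBPD-DCTZ feature of the sound signal under test at the root node of each of the nk decision trees in the random forest;
Step S72: decision tree setup and decision: following the classification rules of the decision tree, pass the feature down from the root node until a leaf node is reached; the class label of that leaf node is the vote cast by this decision tree for the class of the BMBPD-DCTZ feature;
Step S73: random forest detection: each of the nk decision trees of the random forest votes on the class of the BMBPD-DCTZ feature of every sound signal under test; the class label with the most votes among the nk trees is the final class label of the sound signal under test.
Preferably, the BMBPD image has 64 energy levels × 256 frequency bands.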
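Advantageous Effects of the Invention
Advantageous Effects
The principal feature of the invention and its preferred embodiments is the use of the binary multi-band power distribution to generate features for low-SNR sound event detection. Compared with the scheme based on the subband power distribution and Jet mapping, the binarized multi-band power distribution provides better resolution control and can be tuned to better capture low-SNR sound event information. The BMBPD image is then divided into 8×8 blocks and the zigzag coefficients of the discrete cosine transform are extracted as features. Finally, a random forest is used to model and detect the features. The accuracy of the method is markedly better than that of commonly used methods under both stationary and non-stationary noise. The advantage is particularly pronounced when the SNR is below 0 dB. Building on the MBPD, BMBPD and BMBPD-DCTZ methods provided by the invention, the BMBPD image can also be combined with deep learning techniques in practical applications to further improve the detection rate of sound events at low SNR.
With the scheme provided by the invention, comprehensive experiments were carried out under a series of challenging noise conditions on a database of 50 classes of environmental sound events and on the DCASE 2016 dataset of 11 classes of office sound events. The results show that the method is applicable over a wide range of sound levels and is robust in severe non-stationary noise.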
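Brief Description of the Drawings
Description of the Drawings
The invention is described in further detail below with reference to the drawings and specific embodiments:
Fig. 1 is a schematic diagram of prior-art sound event classification with spectrogram image features under mismatched conditions;
Fig. 2 is a schematic diagram of the prior-art Jet mapping rule (axes: SPD, mapping);
Fig. 3 is a flowchart of low-SNR sound event detection according to an embodiment of the invention;
Fig. 4 shows the gammatone spectrogram and the MBPD of a fox call according to an embodiment of the invention;
Fig. 5 shows the 8×8 blocking of the BMBPD image according to an embodiment of the invention;
Fig. 6 is an enlarged view of the black-framed sub-block in the 8×8 blocking of the BMBPD image according to an embodiment of the invention;
Fig. 7 shows the DCT coefficients of the black-framed sub-block and the zigzag scan according to an embodiment of the invention;
Fig. 8 shows the zigzag arrangement of the 64 DCT coefficients and the first 5 zigzag coefficients according to an embodiment of the invention;
Fig. 9 shows the detection rate for different values of n (axes: Precision, Threshold) according to an embodiment of the invention;
Fig. 10 compares the detection precision of the method of the embodiment with that of other existing methods (axes: SNR (dB), Precision (%));
Fig. 11 shows the average detection results in wind and rain environments at -10 dB for n=6 and n=2 according to an embodiment of the invention;
Fig. 12 shows the gammatone spectrograms of a fox call, rain, and a fox call in rain at -10 dB, together with their BMBPD images for n=6 and n=2, according to an embodiment of the invention (compared with Fig. 12(b)(c), the green-box and red-circle regions in Fig. 12(h)(i) are compressed to some extent).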
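Embodiments of the Invention
Modes for Carrying Out the Invention
To make the features and advantages of this patent easier to understand, embodiments are described in detail below:
As shown in Fig. 3, the detection method of this embodiment can be summarized as detecting sound events through the following sequence of steps: grayscale gammatone spectrogram, BMBPD image, 8×8 blocking and DCT, zigzag coding, and RF classifier.
Specifically, in this embodiment, BMBPD converts sound data into a BMBPD image by computing the probability density of 64 energy levels in each of 256 frequency bands of the signal. The BMBPD image is formed as follows:
1) Gammatone spectrogram. The sound signal y(t) is filtered by a gammatone filter bank to obtain y_f[t]. Taking the logarithm of y_f[t], i.e. applying dynamic compression to y_f[t], forms the corresponding gammatone spectrogram S_g(f,t). Fig. 4(a) shows the gammatone spectrogram of a fox call.
$S_g(f,t) = \lg|y_f[t]|$   (1)
where f denotes the center frequency of the filter and t denotes the frame index.
2) Normalized energy spectrum. The energy spectrum of each sound signal is normalized to obtain the normalized energy spectrum G(f,t).
Figure PCTCN2019123469-appb-000006
Fig. 4 shows the gammatone spectrogram and the MBPD of a fox call.
3) Multi-band power distribution. The statistics of the energy distribution of G(f,t) are computed to obtain M(f,b), i.e. the MBPD image shown in Fig. 4(b). If there are B energy levels in total, a statistics-based non-parametric method is used to estimate the probability density of the energy elements of each frequency subband f, which gives the probability distribution M(f,b) of each energy level in each frequency subband.
Figure PCTCN2019123469-appb-000007
Figure PCTCN2019123469-appb-000008
where W is the number of frames of the sound clip. For the sound samples of this embodiment, W=198. M(f,b) is the proportion of elements in band f with energy level b among all elements of that band (0 ≤ M(f,b) ≤ 1). I_b(G(f,t)) is an indicator function equal to 1 when G(f,t) belongs to energy level b and 0 otherwise.
4) Binarization. M(f,b) is binarized to obtain M_R(f,b).
Figure PCTCN2019123469-appb-000009
where the threshold n lies in the interval [1, W]. Binarization converts the MBPD of Fig. 4(b) into the BMBPD image shown in Fig. 5.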
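Steps 2) to 4) can be sketched in a few lines of Python. Because the normalization and binarization formulas appear only as equation images, the sketch rests on stated assumptions: min-max normalization to [0, 1], B uniform energy levels, and a literal reading of the text in which a pixel becomes 1 when its frame count W·M(f,b) is below n. Whether empty cells (count 0) should be excluded cannot be determined from the text. All function names are hypothetical.

```python
def normalize(spec):
    """Min-max scale a log spectrogram (bands x frames) to [0, 1] (assumed form of Eq. 2)."""
    lo = min(min(row) for row in spec)
    hi = max(max(row) for row in spec)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant input
    return [[(v - lo) / span for v in row] for row in spec]


def mbpd(G, B):
    """M[f][b]: fraction of frames in band f whose normalized energy falls in level b."""
    W = len(G[0])
    M = []
    for row in G:
        counts = [0] * B
        for v in row:
            b = min(int(v * B), B - 1)  # uniform levels (assumed); v == 1.0 goes to the top level
            counts[b] += 1
        M.append([c / W for c in counts])
    return M


def binarize(M, W, n):
    """Assumed rule: pixels whose frame count W*M is below the threshold n become 1, others 0."""
    return [[1 if round(m * W) < n else 0 for m in row] for row in M]


spec = [[0.0, 1.0, 2.0, 3.0], [3.0, 3.0, 3.0, 0.0]]  # toy input: 2 bands x 4 frames
M = mbpd(normalize(spec), B=4)
M_R = binarize(M, W=4, n=2)
```

Each row of `M` sums to 1 because every frame of a band falls into exactly one energy level. In the embodiment the corresponding sizes would be 256 bands, B=64 and W=198.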
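Next comes the discrete cosine transform of the BMBPD image:
Applying the discrete cosine transform (DCT) to an image concentrates its important visual information in a small number of DCT coefficients (G. A. Papakostas, D. E. Koulouriotis, and E. G. Karakasis, "Efficient 2-D DCT computation from an image representation point of view," London, UK, Intech Open, pp. 21-34, 2009). In general, the magnitudes of the coefficients in the DCT coefficient matrix decrease from the upper left to the lower right. The first coefficient, in the upper-left corner, is called the DC coefficient of the DCT and is proportional to the mean of the image pixels. The other coefficients are called AC coefficients. The closer an AC coefficient lies to the upper-left corner, the more image information it carries. Exploiting these properties, the BMBPD image formed by M_R(f,b) is divided into blocks and the DCT is applied to each sub-block.
Inspired by the high efficiency of 8×8 sub-block coding in image processing, this embodiment divides the 64×256 BMBPD image into 8×8 blocks, i.e. the BMBPD image of Fig. 5 is split into 8×32=256 sub-blocks of size 8×8. Each sub-block and its pixels carry the distribution of the sound signal over the corresponding bands and energy levels. For example, Fig. 6 shows the black-framed sub-block indicated by the arrow in Fig. 5, which is the power distribution of the BMBPD image over bands 129 to 136 and energy levels 23 to 30. Applying the DCT to the sub-block yields the 8×8 table of DCT coefficients shown in Fig. 7.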
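The blocking and per-block DCT described above can be illustrated as follows. The sketch assumes the orthonormal 2-D DCT-II, which the text does not specify, and evaluates it directly for clarity rather than speed; `split_blocks` and `dct2` are hypothetical names.

```python
import math


def split_blocks(image, size=8):
    """Cut a 2-D list into non-overlapping size x size blocks, row by row."""
    rows, cols = len(image), len(image[0])
    return [[row[c:c + size] for row in image[r:r + size]]
            for r in range(0, rows, size) for c in range(0, cols, size)]


def dct2(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct evaluation, assumed normalization)."""
    N = len(block)

    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


image = [[(r + c) % 2 for c in range(256)] for r in range(64)]  # toy 64 x 256 binary image
blocks = split_blocks(image)          # 8 x 32 = 256 sub-blocks
coeffs = dct2(blocks[0])              # 8 x 8 DCT coefficients of the first sub-block
```

With this normalization the DC coefficient of an 8×8 block equals 8 times the block mean, so an all-ones block has a DC value of 8 and AC values of 0.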
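Next comes the zigzag scan:
The zigzag scan indicated by the arrows in Fig. 7 (J. A. Lay and L. Guan, "Image retrieval based on energy histograms of the low frequency DCT coefficients," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1999, pp. 3009-3012) orders the image information by importance. Zigzag scanning the 8×8 DCT coefficients of Fig. 7 yields the 1-D arrangement of 64 DCT coefficients shown in Fig. 8(a). In this arrangement, the 64 coefficients are ordered from left to right by their importance to the visual information. In image processing, a few of the leftmost coefficients of the 1-D arrangement suffice to represent the main information of the image. In this embodiment, a few of the leftmost coefficients are used to represent the main information of an 8×8 sub-block, i.e. the power distribution of the sound signal in specific bands. Based on a comprehensive experimental analysis, and as shown in Fig. 8(b), the first 5 of the 64 coefficients in the 1-D zigzag arrangement are taken as the feature of the 8×8 sub-block. This feature, obtained by zigzag scanning the DCT coefficients of a BMBPD sub-block, is referred to as BMBPD-DCTZ.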
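A zigzag scan and the selection of the first 5 coefficients might look like the sketch below. It assumes the scan order commonly used in JPEG (start at the upper-left corner, first move right); the exact direction drawn in Fig. 7 is not recoverable from the text. `zigzag_indices` and `dctz` are hypothetical names.

```python
def zigzag_indices(n=8):
    """Return (row, col) pairs of an n x n matrix in JPEG-style zigzag order (assumed)."""
    order = []
    for s in range(2 * n - 1):
        # All cells on the anti-diagonal row + col == s, listed with row increasing.
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()  # even diagonals run from bottom-left to top-right
        order.extend(diag)
    return order


def dctz(coeffs, m=5):
    """Take the first m coefficients of a square DCT matrix in zigzag order."""
    return [coeffs[i][j] for i, j in zigzag_indices(len(coeffs))[:m]]


demo = [[r * 8 + c for c in range(8)] for r in range(8)]  # cell value encodes its position
print(zigzag_indices(8)[:5])
print(dctz(demo))
```

With m=5, the retained positions are the DC coefficient and the four lowest-frequency AC coefficients next to it.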
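Next comes the random forest classifier:
Preliminary experiments show that, with BMBPD-DCTZ as the feature, SVM, GMM and DNN classifiers offer no performance advantage over RF at low SNR and with limited available data. RF is therefore used to classify BMBPD-DCTZ.
A random forest (RF; L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001) is an ensemble classifier that uses multiple decision tree classifiers to discriminate data. A random forest detects a sound event in three steps. (1) Data feature setup. The BMBPD-DCTZ feature of the sound data under test is placed at the root node of each of the nk decision trees in the random forest. (2) Decision tree setup and decision. Following the classification rules of the decision tree, the feature is passed down from the root node until a leaf node is reached. The class label of that leaf node is the vote cast by this decision tree for the class of the BMBPD-DCTZ feature. (3) Random forest detection. Each of the nk decision trees of the random forest votes on the class of the BMBPD-DCTZ feature of every sound signal under test. The votes of the nk trees are counted, and the class label with the most votes is the final class label of the sample under test.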
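The three voting steps can be illustrated with a toy forest. The tree encoding used here, a nested tuple `(feature_index, threshold, left, right)` with string leaves, is purely hypothetical and chosen for brevity; training the trees is outside the scope of the sketch.

```python
from collections import Counter


def tree_vote(node, x):
    """Pass feature vector x from the root down to a leaf and return the leaf's class label."""
    while not isinstance(node, str):
        idx, threshold, left, right = node
        node = left if x[idx] <= threshold else right
    return node


def forest_predict(trees, x):
    """Let every tree vote and return the label with the most votes."""
    votes = Counter(tree_vote(tree, x) for tree in trees)
    return votes.most_common(1)[0][0]


# Three hand-written trees; the labels and thresholds are illustrative only.
trees = [
    (0, 0.5, "fox", "rain"),
    (1, 1.0, "fox", (0, 0.1, "wind", "rain")),
    "rain",
]
print(forest_predict(trees, [0.2, 0.3]))
```

In the toy forest, two trees vote "fox" and one votes "rain" for the input `[0.2, 0.3]`, so the majority label is "fox".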
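This embodiment was validated experimentally using the scheme described above.
As shown in Table 1, the experiments use two datasets: an animal sound event set and an office sound event set. The 50 classes of animal sound events come from the Freesound database (F. Font, G. Roma, P. Herrera, and X. Serra, "Characterization of the Freesound online community," in Proc. 3rd Int. Workshop Cognitive Inf. Process., May 2012, pp. 1-6) and include various bird songs and mammal calls. Each class has 30 samples. The 11 classes of common office sound events come from DCASE 2016 Task 2 (A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley, "Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge," IEEE Trans. Audio, Speech, Lang. Process., vol. 26, no. 2, pp. 379-393, Feb. 2018). Each class has 20 samples. The experiments focus on the 50 classes of animal sound events, with the 11 classes of office sound events serving as supplementary validation. The six noise environments used fall into two categories, stationary and non-stationary. The stationary noise is pink noise; the non-stationary noises, which simulate real-scene sounds, are flowing water, wind, road, ocean waves and rain. Noise samples and sound events are mono ".wav" files with a sampling rate of 44.1 kHz.
Table 1. Sound event sample sets
Figure PCTCN2019123469-appb-000010
The experimental parameters are as follows: the frame length is 25 ms, the frame shift is 10 ms, the filter bank has 256 filters, and the center frequencies lie between 50 Hz and fs/2. Taking into account the experimental samples and feature dimension of this embodiment and the recommendations of (L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001), the number of decision trees in the random forest classifier is set to k=800 for the animal sound event set and k=500 for the office sound event set. When a non-leaf node of a decision tree is split, the number of preselected feature components is m=11. Four experiments are carried out to verify the detection performance of the method: 1) selection of the binary multi-band power distribution threshold; 2) performance of the BMBPD-DCTZ feature combined with the RF classifier; 3) comparison of the BMBPD-DCTZ feature with commonly used features; 4) comparison of BMBPD-DCTZ-RF with existing methods. Evaluation uses a simple classification precision, TP/(TP+FP), where TP is the number of true positives and FP is the number of false positives. The effectiveness of the method is demonstrated by comparing its classification precision with that of classical and existing methods.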
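The quantities in this paragraph are easy to compute. The sketch below assumes that the per-block features are concatenated into one vector (the text does not say so explicitly) and notes, as an observation only, that the rule int(log2(M) + 1) discussed in Breiman's paper yields 11 for that vector length. The source does not state how m=11 was derived; all function names are hypothetical.

```python
import math


def precision(tp, fp):
    """Classification precision TP / (TP + FP); returns 0.0 when nothing was detected."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def feature_length(levels=64, bands=256, block=8, kept=5):
    """Length of the feature vector if the kept coefficients of all blocks are concatenated."""
    return (levels // block) * (bands // block) * kept


def breiman_split_size(num_features):
    """int(log2(M) + 1), one of the split sizes used in Breiman (2001)."""
    return int(math.log2(num_features) + 1)


M = feature_length()           # 256 blocks x 5 coefficients
m = breiman_split_size(M)      # candidate features tried at each split
print(M, m, precision(9, 1))
```

Under the concatenation assumption the feature vector has 1280 components, and the rule gives 11, which coincides with the value of m reported above.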
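Experimental results and analysis:
1) Selection of the BMBPD threshold n. Experiments were run with n = 2, 4, 6, 8, 10, 12, 14, 16, 18 and 20. Fig. 9 shows the average random forest detection results under three-fold cross-validation in the six noise environments at four SNRs: -10 dB, -5 dB, 0 dB and 5 dB.
Fig. 9 shows that the threshold n affects the detection rate to different degrees at different SNRs. At the relatively low SNRs of -5 dB, 0 dB and 5 dB, the effect of n is not pronounced. At the very low SNR of -10 dB, however, the detection rate first rises and then declines steadily as n increases, peaking at n=6. Therefore n=6 is used in the subsequent experiments.
2) Effectiveness of BMBPD-DCTZ with RF. To demonstrate the effectiveness of combining BMBPD-DCTZ with the RF classifier, cross-validation experiments were run on the animal sound event set. Table 2 shows the average detection results of three-fold cross-validation at four SNRs (-10 dB, -5 dB, 0 dB and 5 dB) and with six background noises (flowing water, pink noise, wind, ocean waves, road and rain). Table 2 shows that the BMBPD-DCTZ feature performs well under both stationary and non-stationary noise. At the low SNR of -5 dB, the average detection rate reaches 90.2±2.1%. Even at an SNR as low as -10 dB, the average detection rate still reaches 66.3±12.2%.
Table 2. Cross-validation results for the BMBPD-DCTZ feature (%)
Figure PCTCN2019123469-appb-000011
3) Superiority of BMBPD-DCTZ. To further demonstrate how well the BMBPD-DCTZ feature represents low-SNR sound events, the RF classifier was used to compare BMBPD-DCTZ with several other features: MFCC (F. Zheng, G. Zhang, and Z. Song, "Comparison of different implementations of MFCC," Journal of Computer Science and Technology, vol. 16, no. 6, pp. 582-589, 2001), PNCC (C. Kim and R. M. Stern, "Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2010, pp. 4574-4577), GLCM-SDH (J. Wei and Y. Li, "Rapid bird sound recognition using anti-noise texture features," Acta Electronica Sinica, 2015, 43(1):185-190), LBP (T. Kobayashi and J. Ye, "Acoustic feature extraction by statistics based local binary pattern for environmental sound classification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2014, pp. 3052-3056) and HOG (A. Rakotomamonjy and G. Gasso, "Histogram of gradients of time-frequency representations for audio scene classification," IEEE Trans. Audio, Speech, Lang. Process., vol. 23, no. 1, pp. 142-153, Jan. 2015). The MFCC features use a triangular filter bank of 256 filters and 12 DCT coefficients. The PNCC features are extracted with a 32-order gammatone filter and 12 DCT coefficients.
Table 3. Average detection rate of different features for animal sound events in six noise environments (%)
Figure PCTCN2019123469-appb-000012
Table 4. Detection rate of different features for office sound events (%)
Figure PCTCN2019123469-appb-000013
Table 3 shows the detection results of the features for animal sound events with the six background noises (flowing water, pink noise, wind, ocean waves, road and rain). Table 4 shows the detection results for the clean office sound events and for pink noise at SNRs of 5 dB, 0 dB and -5 dB. Tables 3 and 4 show that the BMBPD-DCTZ feature clearly outperforms the commonly used features. In particular, for pink noise at an SNR of -5 dB, office sound events still reach an average detection rate of 58.8±3.7%.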
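4) Comparison with conventional methods. To further demonstrate how well the BMBPD-DCTZ feature represents low-SNR sound events, the method of this embodiment is compared with SNET (A. T. Yusuf Aytar, Carl Vondrick, and A. Torralba, "Soundnet: Learning sound representations from unlabeled video," 2016. [Online]. Available: https://github.com/cvondrick/soundnet), MBPD (Y. Li and L. Wu, "Detection of sound event under low SNR using multi-band power distribution," Journal of Electronics & Information Technology, 2018, 40(12):2905-2912), ELBP-HOG (S. Abidin, R. Togneri, and F. Sohel, "Enhanced LBP texture features from time frequency representations for acoustic scene classification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2017, pp. 626-630), MP-SVM (J. Wang, C. Lin, and B. Chen, "Gabor-based nonuniform scale-frequency map for environmental sound classification in home automation," IEEE Trans. Autom. Sci. Eng., vol. 11, no. 2, pp. 607-613, Apr. 2014), SIF-SVM (J. Dennis, H. D. Tran, and H. Li, "Spectrogram image feature for sound event classification in mismatched conditions," IEEE Signal Process. Lett., vol. 18, no. 2, pp. 130-133, Feb. 2011) and SPD-KNN (J. Dennis, H. D. Tran, and E. S. Chng, "Image feature representation of the subband power distribution for robust sound event classification," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 2, pp. 367-377, Feb. 2013). Using three-fold cross-validation, Table 5 gives the average detection precision for animal sound events over the six noise environments at the four SNRs. Fig. 10 gives the detection precision for sound events at the four SNRs in each noise environment. Table 6 gives the detection precision of the different methods for office sound events.
Table 5. Detection precision of different methods for animal sound events (%)
Figure PCTCN2019123469-appb-000014
Table 6. Detection rate of different methods for office sound events (%)
Figure PCTCN2019123469-appb-000015
Note that SNET is a state-of-the-art deep-learning-based method that uses the outputs of the inner layers of a pre-trained SoundNet CNN (snet) (Y. Aytar, C. Vondrick, and A. Torralba, "Soundnet: Learning sound representations from unlabeled video," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 892-900) as features. Specifically, the features of the pool5 layer recommended by the authors are used, with the code they provide in (A. T. Yusuf Aytar, Carl Vondrick, and A. Torralba, "Soundnet: Learning sound representations from unlabeled video," 2016. [Online]. Available: https://github.com/cvondrick/soundnet). Nevertheless, Table 5 and Fig. 10 show that the method of this embodiment achieves higher detection rates than the existing methods at low SNR, with an especially marked advantage at very low SNR.
In Table 6, SNET has the highest detection precision for the clean office sound events. For sound events at 0 dB and lower SNR, however, the method of this embodiment has a clear advantage. In particular, at -5 dB the average detection rate of the method is 58±3.7%, better than the 48±6.6% of Snet and the 54.6±5.4% of MBPD. This indicates that the method is robust across low-SNR sound events and is able to suppress the effect of noise to some extent.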
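Further study considers how different environments affect the detection rate of low-SNR sound events:
A closer look at Fig. 10 shows that, for sound events at -10 dB, the detection rates in the wind environment of Fig. 10(c) and the rain environment of Fig. 10(f) are relatively low. The detection rate is 51.0±5.2% in wind, slightly higher than the 50.0±4.3% in rain. The detection results in wind and rain are analyzed further in Fig. 11(a) and Fig. 11(c), which show the detection results of the method for the 50 classes of animal sound events in the wind and rain environments respectively. In these figures, the color value of the point at (x, y) is the number of sound events of class x that were detected as class y, where x and y are the sequence numbers of the animal sounds in Table 1. Further analysis reveals the following.
1) Different values of n affect the detection rate of low-SNR sound events
To analyze how a change in n may affect the detection results, the case n=2 was also analyzed. The results are shown in Fig. 11(b) and Fig. 11(d). Fig. 11 shows that the detection rates differ between n=6 and n=2. In wind, the detection rate with n=2 is 46.5±5.3%, lower than the 51.0±5.2% with n=6. In rain, the detection rate with n=2 is 54.5±4.7%, higher than the 50.0±4.3% with n=6.
Therefore, choosing an appropriate n for each sound environment can further improve the detection rate of the method.
2) Environmental sound affects the detection rate of low-SNR sound events
Fig. 11 further shows that, in wind, detection errors are concentrated in misdetections as class 4 and class 7, whereas in rain they are concentrated in misdetections as class 4, class 10 and class 16. With n=2 in wind, the error rates for misdetection as class 4 and class 7 are 19.6% and 14.8%, higher than the 9.9% and 14.4% with n=6. With n=2 in rain, the error rates for misdetection as class 4, class 10 and class 16 are 7.5%, 18.1% and 7.8%. Although this is broadly comparable to the 9.1%, 14.3% and 9.3% with n=6, different values of n affect the detection rates of different sound events.
These errors arise because the BMBPD of a low-SNR sound event resembles the BMBPD of sound events of the wrong class. Therefore, applying environment-specific enhancement to low-SNR sound events and adjusting n appropriately for each environment would further improve the detection of low-SNR sound events.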
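The per-class error analysis above reads a confusion matrix column by column. A minimal sketch follows, assuming that the "error rate for misdetection as class y" means the share of all test samples that were wrongly assigned to class y; the source does not define the denominator, and the matrix below is toy data rather than the experimental result.

```python
def misdetection_rates(conf):
    """conf[x][y] = number of class-x samples detected as class y.

    Returns, for each class y, the fraction of all samples wrongly detected as y
    (assumed definition).
    """
    total = sum(sum(row) for row in conf)
    k = len(conf)
    return [sum(conf[x][y] for x in range(k) if x != y) / total for y in range(k)]


def detection_rate(conf):
    """Fraction of samples on the diagonal, i.e. detected as their own class."""
    total = sum(sum(row) for row in conf)
    return sum(conf[i][i] for i in range(len(conf))) / total


toy = [[8, 2, 0],
       [1, 9, 0],
       [3, 2, 5]]
print(misdetection_rates(toy), detection_rate(toy))
```

A class with a large off-diagonal column sum, such as classes 4 and 7 in the wind environment described above, attracts many wrong detections.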
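BMBPD images and low SNR:
Fig. 12 shows the gammatone spectrograms of a fox call, rain, and a fox call in rain at -10 dB, together with the BMBPD images of these three sounds for n=6 and n=2. Fig. 12 shows that, at low SNR, the BMBPD image of a sound event exhibits the following main situations.
1) Most of the BMBPD of the sound event is covered by environmental sound. As shown by the blue-line boxes in Fig. 12(g), (h) and (i), in rain at -10 dB most of the BMBPD of the sound event is covered by the BMBPD of the rain.
2) The high-energy part of specific bands is retained. As shown by the long green boxes and red circles in Fig. 12, the specific bands on which the human auditory system relies for perception are still present in the BMBPD of the sound event. In this embodiment, the number of bands is increased from the 50 used in the scheme of (J. Dennis, H. D. Tran, and E. S. Chng, "Image feature representation of the subband power distribution for robust sound event classification," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 2, pp. 367-377, Feb. 2013) to 256 precisely in order to highlight and reinforce this part.
3) The high-energy and related parts of the sound event may be compressed. Low SNR means that the energy of the environmental sound is high. As shown by the blue-line boxes, long green boxes and red circles in Fig. 12, when the high-energy part of the environmental sound exceeds that of the sound event, the height of the BMBPD related to the sound event is compressed. As Fig. 12(h)(i) shows, however, the key green-box and red-circle regions remain clear.
4) The value of n affects the clarity of the key parts. As shown in Fig. 12(h)(i), the proportions of black points belonging to the sound event and to the environmental sound in the BMBPD differ between n=6 and n=2. A higher proportion for the sound event means that the key parts stand out more. For example, in rain with n=2, the black points in the green box of Fig. 12(i) account for a higher proportion of the BMBPD, which favors detection of the sound event.
Therefore, in low-SNR environments, the green-box and red-circle regions in Fig. 12(h)(i) are the key evidence for detecting low-SNR sound events. As long as a sound event in the audio data has bands and energy that differ from the background noise, they show up in the BMBPD image produced by the method of this embodiment. This embodiment divides the MBPD into 256 bands and the BMBPD image into 8×8 blocks. In practice, these settings can be adjusted to the pitch range of the sound events to be detected. As Tables 5 and 6 show, for animal sound events at -10 dB and office sound events at -5 dB, the detection rate of the method is not yet satisfactory because of non-stationary environmental sounds. This can be improved by adjusting n, dividing bands and energy levels more finely for specific sound events, and extracting more effective BMBPD-DCTZ features.
This patent is not limited to the preferred embodiments above. In light of this patent, anyone may derive other forms of low-SNR sound event detection methods based on the binary multi-band power distribution; all equivalent changes and modifications made within the scope of the claims of this invention fall within the scope of this patent.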

Claims (7)

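  1. A low-SNR sound event detection method based on the binary multi-band power distribution, characterized by comprising the following steps:
    Step S1: filter the sound signal y(t) with a gammatone filter bank to obtain y_f[t]; take the logarithm of y_f[t] to form the corresponding gammatone spectrogram S_g(f,t);
    Step S2: normalize the energy spectrum
    Figure PCTCN2019123469-appb-100001
    of each sound signal to obtain the normalized energy spectrum G(f,t);
    Step S3: compute the statistics of the multi-band power distribution of G(f,t) to obtain the MBPD image M(f,b);
    Step S4: binarize the MBPD image M(f,b) to obtain the BMBPD image M_R(f,b);
    Step S5: divide the BMBPD image M_R(f,b) into blocks and apply the DCT to each sub-block;
    Step S6: apply a zigzag scan to the DCT coefficients to obtain a 1-D arrangement of the DCT coefficients, and take the first m DCT coefficients as BMBPD-DCTZ;
    Step S7: with BMBPD-DCTZ as the feature and an RF as the classifier, classify and/or recognize BMBPD-DCTZ.
  2. The low-SNR sound event detection method based on the binary multi-band power distribution according to claim 1, characterized in that:
    in step S1,
    $S_g(f,t) = \lg|y_f[t]|$   (1);
    where f denotes the center frequency of the gammatone filter and t denotes the frame index;
    in step S2,
    Figure PCTCN2019123469-appb-100002
  3. The low-SNR sound event detection method based on the binary multi-band power distribution according to claim 2, characterized in that:
    in step S3, let G(f,t) have B energy levels in total; a statistics-based non-parametric method is used to estimate the probability density of the energy elements of each frequency subband f, giving the probability distribution M(f,b) of each energy level in each frequency subband:
    Figure PCTCN2019123469-appb-100003
    Figure PCTCN2019123469-appb-100004
    where W is the number of frames of the sound signal; M(f,b) is the proportion of elements in band f with energy level b among all elements of that band (0 ≤ M(f,b) ≤ 1); and I_b(G(f,t)) is an indicator function equal to 1 when G(f,t) belongs to energy level b and 0 otherwise;
    in step S4,
    Figure PCTCN2019123469-appb-100005
    where the threshold n lies in the interval [1, W].
  4. The low-SNR sound event detection method based on the binary multi-band power distribution according to claim 3, characterized in that:
    in step S5, the BMBPD image M_R(f,b) is divided into 8×8 blocks, and applying the DCT to each sub-block yields 8×8 DCT coefficients.
  5. The low-SNR sound event detection method based on the binary multi-band power distribution according to claim 4, characterized in that:
    in step S6, the first 5 of the 64 coefficients in the 1-D zigzag arrangement are taken as BMBPD-DCTZ.
  6. The low-SNR sound event detection method based on the binary multi-band power distribution according to claim 3, characterized in that recognizing BMBPD-DCTZ in step S7 comprises the following steps:
    Step S71: data feature setup: place the BMBPD-DCTZ feature of the sound signal under test at the root node of each of the nk decision trees in the random forest;
    Step S72: decision tree setup and decision: following the classification rules of the decision tree, pass the feature down from the root node until a leaf node is reached; the class label of that leaf node is the vote cast by this decision tree for the class of the BMBPD-DCTZ feature;
    Step S73: random forest detection: each of the nk decision trees of the random forest votes on the class of the BMBPD-DCTZ feature of every sound signal under test; the class label with the most votes among the nk trees is the final class label of the sound signal under test.
  7. The low-SNR sound event detection method based on the binary multi-band power distribution according to claim 4, characterized in that the BMBPD image has 64 energy levels × 256 frequency bands.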
PCT/CN2019/123469 2019-11-08 2019-12-06 基于二值多频带能量分布的低信噪比声音事件检测方法 WO2021088176A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911091796.5 2019-11-08
CN201911091796.5A CN110808067A (zh) 2019-11-08 2019-11-08 基于二值多频带能量分布的低信噪比声音事件检测方法

Publications (1)

Publication Number Publication Date
WO2021088176A1 true WO2021088176A1 (zh) 2021-05-14

Family

ID=69501854

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/123469 WO2021088176A1 (zh) 2019-11-08 2019-12-06 基于二值多频带能量分布的低信噪比声音事件检测方法

Country Status (2)

Country Link
CN (1) CN110808067A (zh)
WO (1) WO2021088176A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626102B (zh) * 2020-04-13 2022-04-26 上海交通大学 基于视频弱标记的双模态迭代去噪异常检测方法及终端
CN111261194A (zh) * 2020-04-29 2020-06-09 浙江百应科技有限公司 一种基于pcm技术的音量分析方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104795064A (zh) * 2015-03-30 2015-07-22 福州大学 低信噪比声场景下声音事件的识别方法
CN106653032A (zh) * 2016-11-23 2017-05-10 福州大学 低信噪比环境下基于多频带能量分布的动物声音检测方法
CN107545890A (zh) * 2017-08-31 2018-01-05 桂林电子科技大学 一种声音事件识别方法
CN108305616A (zh) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 一种基于长短时特征提取的音频场景识别方法及装置
US20180322338A1 (en) * 2017-05-02 2018-11-08 King Fahd University Of Petroleum And Minerals Computer implemented method for sign language characterization

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294331B (zh) * 2015-05-11 2020-01-21 阿里巴巴集团控股有限公司 音频信息检索方法及装置

Also Published As

Publication number Publication date
CN110808067A (zh) 2020-02-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19951669

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19951669

Country of ref document: EP

Kind code of ref document: A1