WO2016155047A1 - Method of recognizing sound event in auditory scene having low signal-to-noise ratio - Google Patents


Info

Publication number: WO2016155047A1
Authority: WO (WIPO, PCT)
Prior art keywords: sound, scene, level, event, signal
Application number: PCT/CN2015/077075
Other languages: French (fr), Chinese (zh)
Inventors: 李应, 林巍
Original Assignee: 福州大学 (Fuzhou University)
Application filed by 福州大学

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/08: Speech classification or search
    • G10L 15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Definitions

  • The invention relates to a method for identifying sound events in low signal-to-noise-ratio sound scenes that can effectively improve the recognition rate at low signal-to-noise ratio in various sound scenes.
  • Sound event detection is of great significance for audio forensics [1], environmental sound recognition [2], biological sound monitoring [3], acoustic scene analysis [4], environmental security monitoring [5], real-time military focus detection [6], location tracking and sound source classification [7], patient monitoring [8-12], abnormal event monitoring [13-18], and fault diagnosis with the provision of key information for early maintenance [21, 22].
  • Detecting (recognizing) sound events in a sound scene means trying to identify, in the audio data, the real events hidden within it.
  • Commonly used features include time-domain, frequency-domain and wavelet-domain features [23], features extracted by the Gabor-dictionary matching pursuit algorithm [24,25], features based on wavelet-packet filtering [26], extended features combining high-pass filtering with MFCC [27], and features obtained by decomposing the signal into multiple overlapping super-frames for random regression forests [28].
  • Features closely tied to the spectrogram mainly include the subband power distribution (SPD), the local spectrogram feature (LSF), the Gabor transform, and cosine log scattering (CLS) [29-40].
  • SVM Support Vector Machine
  • GMM Gaussian Mixture Model
  • k-NN k-nearest neighbor
  • KFDA Kernel Fisher Discriminant Analysis
  • GHT Generalised Hough transform
  • HMM Hidden Markov Model
  • ML Maximum Likelihood
  • These feature extraction processes all affect, to different degrees, the characteristics of the sound event, that is, the structure of the sound signal to be measured.
  • Although the spectral-mask estimation algorithm used in missing-feature methods can effectively remove sound event features disturbed by the scene sound [34], it also masks some features of the sound event itself.
  • The short-term estimation of the feature-masking range [41] tends to filter out most sound event features, so its recognition performance is poor.
  • Spectral subtraction [42] processes all frequency bands and inevitably damages the characteristics of sound events.
  • Although multi-band spectral subtraction [43] improves on spectral subtraction, it can still damage the characteristics of sound events.
  • this paper proposes to train the classifier with the sound of the scene sound mixed with the sound event.
  • the scene sounds are superimposed with the sound events according to different signal-to-noise ratios, and the sound data of the sound events in various sound scenes are obtained, and the classifier is trained.
  • The boundary points between the sound event and the scene sound are detected through Empirical Mode Decomposition (EMD) [44], which is part of the Hilbert-Huang transform (HHT).
  • EMD Empirical Mode Decomposition
  • HHT Hilbert-Huang transform
  • the signal-to-noise ratio of the sound event and the type of scene sound are estimated.
  • a classifier is selected to identify the sound event in the sound data.
  • the process of identifying the sound event is to use the signal-to-noise ratio range and the scene sound type to select RF from the RFM, and use the selected RF to identify the sound event.
  • scene sound data and sound event sample sets to train RF or M-RF to identify sound events in the scene.
  • the purpose of the present invention is to provide a method for identifying sound events in a low signal-to-noise ratio sound scene that can effectively improve the recognition rate under a low signal-to-noise ratio in various sound scenes.
  • the technical solution of the present invention is: a method for identifying sound events in a low signal-to-noise ratio sound scene, including the following steps
  • Step S1, training and generation of the random forest matrix: mix the known sound event samples in the sound event sample set with the known scene sound samples in the scene sound sample set to obtain a mixed sound signal set, and store it in the training sound set; generate the feature set of the training sound set by applying GLCM-HOSVD to the sound signals in the training sound set, and train on this feature set to generate the random forest matrix.
  • Step S2, training and generation of the random forest for discriminating the scene sound type: apply GLCM-HOSVD to the known scene sound samples in the scene sound sample set to generate the feature set of the scene sound sample set, and train on this feature set to generate a random forest that discriminates the scene sound type.
  • Step S3, recognizing the sound event to be measured:
  • In the first step, the sound signal to be measured is decomposed into scene sound and sound event by EMD, and the signal-to-noise ratio of the sound event to be measured is calculated.
  • In the second step, the feature values of the scene sound to be measured and of the sound event to be measured are calculated, the feature values of the scene sound to be measured are input into the scene sound type discrimination random forest generated in step S2, and the scene sound type is detected.
  • In the third step, a random forest for sound event recognition is selected from the random forest matrix generated in step S1 according to the scene sound type and the signal-to-noise ratio of the sound event to be measured.
  • In the fourth step, the feature value of the sound event to be measured is classified by the random forest selected in the third step to obtain the sound type.
  • The specific implementation process of the first step of step S3 is as follows:
  • EMD can adaptively decompose the measured sound signal y(t) into a linear superposition of n intrinsic mode functions according to the characteristics of the signal itself, namely y(t) = L1(t) + L2(t) + ... + Ln(t) + r(t), where Li(t) is the i-th intrinsic mode function and r(t) is the residual function.
  • The first-level intrinsic mode function L1(t) mainly contains noise components, with very few effective sound components.
  • The process of detecting the endpoints of the sound to be tested uses the following quantities:
  • H{Li(t)} denotes the Hilbert transform of the intrinsic mode function Li(t).
  • The smoothing window is 0.05 times the sampling rate.
  • The scene sound level N_level is computed from the envelope Fi(t) (formula (6)).
  • k is the window index and Wd is the window length, which is 0.02 times the signal sampling rate.
  • If a sound event exists, skip to step S318.
  • N_level(n) is the scene sound level of the n-th window; after updating the scene sound level N_level(n), jump to step S319.
  • 0.5 is the weight used for updating the level of the sound event.
  • l is the ratio of the length of the sound event segment to that of the scene sound segment. Since the separated sound event segment contains scene sound components, which affect its energy value, l·P_n(t) is used as an estimate of this influence, and the influence of the scene sound on the energy value is removed.
  • The features of the scene sound to be measured, the sound event to be measured, the training sound events, and the known scene sounds are all calculated as follows.
  • GLCM can be expressed as P(i, j | Δx, Δy) = #{ (x, y) | f(x, y) = i, f(x + Δx, y + Δy) = j }, where x, y are the pixel coordinates in the spectrogram, x + Δx ≤ M, y + Δy ≤ N, M × N is the size of the image, and #{S} is the number of elements in the set S.
  • In the singular value decomposition of each mode-n unfolding, A(n) = U(n) Σ(n) V(n)H, U(n) is a unitary matrix, Σ(n) is a positive semi-definite diagonal matrix, and V(n)H, the conjugate transpose of V(n), is a unitary matrix.
  • Σ(n) is obtained according to formula (17); from Σ(n) we obtain the singular values σ(n)_{i_n} of Σ(n), where 1 ≤ n ≤ N and 1 ≤ i_n ≤ I_n.
  • From these singular values, the features of the scene sound to be measured, the sound event to be measured, the training sound events, and the known scene sounds can be calculated.
  • the present invention has the following beneficial effects:
  • A method is proposed that uses EMD and voting over the intrinsic mode functions (IMFs) to detect scene sound and sound event endpoints and to estimate the signal-to-noise ratio: the multi-level intrinsic mode functions are used to detect the boundary points between scene sounds and sound events in the sound data, the final boundary detection result is determined by voting, and the signal-to-noise ratio of the sound event is estimated.
  • IMF Intrinsic Mode Function
  • GLCM-HOSVD features: the spectrogram is converted into a gray-level co-occurrence matrix (GLCM), and the feature values of the sound signal are obtained by performing high-order singular value decomposition (HOSVD) on the GLCM.
  • Random forest matrix: the random forest used to identify sound events is selected according to the scene sound type and the signal-to-noise ratio of the sound event.
  • Random forest (RF) and multi-random forest (M-RF) are proposed for recognizing sound events in real time: real-time scene sounds are mixed with the sound events in the sound event sample set, and an RF or M-RF is trained for real-time sound event recognition.
  • Figure 1 shows the spectrogram GLCM-HOSVD.
  • Figure 2 shows the EMD+GLCM-HOSVD+RFM architecture diagram of sound event recognition in various sound scenarios.
  • Figure 3 is an EMD+GLCM-HOSVD+RF architecture diagram for real-time recognition of sound events in a sound scene.
  • Figure 4 is an EMD+GLCM-HOSVD+M-RF architecture diagram for real-time recognition of sound events in a sound scene.
  • Figure 5 shows the results of endpoint detection in different sound scenes at 0 dB: Figure 5(a) pure sound, Figure 5(b) wind sound scene, Figure 5(c) rain sound scene, Figure 5(d) Gaussian white noise, Figure 5(e) road sound scene, Figure 5(f) airport sound scene.
  • Figure 6 shows the positional relationship between pixel pairs in the gray level co-occurrence matrix.
  • Figure 8 is the basic schematic diagram of random forest.
  • Figure 9 is a diagram of the recognition results of two texture feature extraction methods in different scenes and different signal-to-noise ratios.
  • Figure 9(a) highway scene
  • Figure 10 shows the average recognition results of EMD+M-RF, EMD+RF and pRF in 6 scenarios.
  • Figure 11 is a comparison diagram of recognition rate between EMD+GLCM-HOSVD+M-RF and MP-feature, Figure 11(a) highway scene, Figure 11(b) wind sound scene, Figure 11(c) flowing water scene, Figure 11(d) Rain scene, Figure 11(e) airport noise, Figure 11(f) Gaussian white noise.
  • Figure 12 is a comparison of EMD+GLCM-HOSVD+M-RF and SPD under low signal-to-noise ratio: Figure 12(a) highway scene, Figure 12(b) wind sound scene, Figure 12(c) flowing water scene, Figure 12(d) rain sound scene, Figure 12(f) Gaussian white noise.
  • Figure 13 is a graph of the average recognition rate of the three methods of EMD+GLCM-HOSVD+M-RF, MP-feature and SPD under low signal-to-noise ratio.
  • the invention provides a method for recognizing sound events in a low signal-to-noise ratio sound scene, including the following steps:
  • Step S1, training and generation of the random forest matrix: mix the known sound event samples in the sound event sample set with the known scene sound samples in the scene sound sample set to obtain a mixed sound signal set, and store it in the training sound set; generate the feature set of the training sound set by applying GLCM-HOSVD to the sound signals in the training sound set, and train on this feature set to generate the random forest matrix.
  • Step S2, training and generation of the random forest for discriminating the scene sound type: apply GLCM-HOSVD to the known scene sound samples in the scene sound sample set to generate the feature set of the scene sound sample set, and train on this feature set to generate a random forest that discriminates the scene sound type.
  • Step S3, recognizing the sound event to be measured:
  • In the first step, the sound signal to be measured is decomposed into scene sound and sound event by EMD, and the signal-to-noise ratio of the sound event to be measured is calculated.
  • In the second step, the feature values of the scene sound to be measured and of the sound event to be measured are calculated, the feature values of the scene sound to be measured are input into the scene sound type discrimination random forest generated in step S2, and the scene sound type is detected.
  • In the third step, a random forest for sound event recognition is selected from the random forest matrix generated in step S1 according to the scene sound type and the signal-to-noise ratio of the sound event to be measured.
  • In the fourth step, the feature value of the sound event to be measured is classified by the random forest selected in the third step to obtain the sound type.
  • This part introduces the architecture of sound event recognition based on GLCM-HOSVD in various sound scenes with low signal-to-noise ratio.
  • The process of generating the feature value w from a sound signal through GLCM-HOSVD is shown in Figure 1: the sound signal is converted into a spectrogram, the GLCM of the spectrogram is calculated, and the feature value w of the sound signal is obtained by performing HOSVD on the GLCM.
  • The feature values w involved in this application include the feature set W_l of the training sound set in Figure 2, W_s in Figure 3, and W_sh, W_s and W_sl in Figure 4, all of which are for a known, limited set of scene types.
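The first stage of this process, converting a sound signal into a quantized gray-level spectrogram image (the input to the GLCM), can be sketched as follows. The FFT size, hop length, and the choice of 8 gray levels are illustrative assumptions; the patent does not fix these parameters here.

```python
import numpy as np

def gray_spectrogram(x, fs, n_fft=256, hop=128, levels=8):
    """STFT magnitude in dB, quantized to `levels` gray levels (0 .. levels-1)."""
    win = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        seg = x[start:start + n_fft] * win
        frames.append(np.abs(np.fft.rfft(seg)))
    S = 20 * np.log10(np.array(frames).T + 1e-10)    # frequency x time, in dB
    S = (S - S.min()) / (S.max() - S.min() + 1e-12)  # normalize to [0, 1]
    return np.minimum((S * levels).astype(int), levels - 1)
```

The resulting integer image plays the role of the spectrogram image from which the GLCM is computed in the next stage.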
  • Figure 2 shows the architecture for recognizing sound events in various sound scenarios.
  • This architecture is called EMD, GLCM-HOSVD and RFM architecture.
  • Related content includes: 1) the training and generation of the random forest matrix RFM, shown in the dashed box in Figure 2; 2) the training and generation of the scene sound type discrimination random forest RF_n, shown in the dotted box in Figure 2; 3) the recognition of the sound event to be tested, shown in the dash-dotted box in Figure 2.
  • the random forest matrix training and generation part includes sound event samples, scene sound samples, sound mixing, training sound set, GLCM-HOSVD and random forest matrix RFM.
  • Sound event samples storing various types of known sound event samples.
  • Scene sound samples store S kinds of known types of scene sound samples.
  • Sound mixing superimposes the various known sound event samples and the S kinds of scene sound samples at N different signal-to-noise ratios, generating the S × N classes of mixed sound signals covering S sound scenes and N signal-to-noise ratios, which are stored in the training sound set.
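The sound-mixing step, superimposing an event on a scene sound at a prescribed signal-to-noise ratio, can be sketched with a standard power-ratio scaling; the patent does not give its exact mixing formula, so this is an assumed but conventional construction.

```python
import numpy as np

def mix_at_snr(event, scene, snr_db):
    """Scale the scene sound so that event/scene power ratio equals snr_db, then add."""
    p_event = np.mean(event ** 2)
    p_scene = np.mean(scene ** 2)
    # Gain that makes 10*log10(p_event / (gain^2 * p_scene)) == snr_db.
    gain = np.sqrt(p_event / (p_scene * 10 ** (snr_db / 10.0)))
    return event + gain * scene
```

Looping this over every (event sample, scene sample, SNR) triple yields the S × N mixed training sets described above.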
  • GLCM-HOSVD is performed on the sounds in the training sound set to generate the feature set W_l of the training sound set, where M is the number of sound samples.
  • The S × N feature sets W_l are trained to generate the S × N random forest matrix RFM.
  • In the scene sound type discrimination random forest training and generation part, GLCM-HOSVD is performed on the scene sound samples to generate the scene sound features w_n.
  • The sample set of scene sound features is trained to generate the scene sound type discrimination random forest RF_n, where P is the number of scene sound samples.
  • The sound signal y(t) to be measured is decomposed by EMD into the scene sound part n(t) and the sound event part s(t), and the feature value w_t of the scene sound to be tested is calculated.
  • w_t is input into the scene sound type discrimination random forest RF_n, which detects the scene sound type l_t.
  • The signal-to-noise ratio l_s of the sound event to be measured is calculated.
  • According to l_t and l_s, a random forest RF_s,n for sound event recognition is selected from the random forest matrix; the feature value w_e of the sound event to be measured is calculated, and w_e is classified by RF_s,n to obtain the type l.
  • w_e is the feature value of the sound event to be measured.
  • For this architecture, the scene sound n(t) obtained by EMD segmentation is directly mixed, at the signal-to-noise ratio l_s between n(t) and the sound event s(t), with the M kinds of sound events in the sound event sample library.
  • GLCM-HOSVD is applied to the mixed sound set to generate the feature set W_s, and W_s is used to establish RF_s.
  • The established RF_s is used to identify the feature value w_e of the sound event s(t) to be measured.
  • This part includes the empirical mode decomposition of the sound data, the detection of sound event endpoints, and the calculation of the signal-to-noise ratio between the sound event and the scene sound in the sound data. The sound data are converted into a spectrogram, and the GLCM of the spectrogram is calculated. HOSVD is performed on the GLCM to generate the feature w. The feature set W is used to train the random forest matrix, and the random forests are used to identify the sound events in the sound data.
  • EMD is the core of the HHT transform [44]. EMD can adaptively decompose the original signal y(t) into a linear superposition of n IMFs according to the characteristics of the signal itself, namely y(t) = L1(t) + L2(t) + ... + Ln(t) + r(t), where r(t) is the residual function.
  • The process of using the i-th intrinsic mode function Li(t) to detect the foreground sound endpoints uses the following quantities:
  • H{Li(t)} denotes the Hilbert transform of the intrinsic mode function.
  • The smoothing window is 0.05 times the sampling rate.
  • The scene sound level N_level is computed from the envelope Fi(t) (formula (6)).
  • k is the window index and Wd is the window length, which is 0.02 times the signal sampling rate.
  • N_level(n) is the scene sound level of the n-th window; after updating the scene sound level N_level(n), skip to step 9).
  • 0.5 is the weight used for updating the level of the sound event.
  • The selected level-2 to level-6 intrinsic mode functions Li(t) are processed by the above steps to obtain 5 different endpoint detection results, and the final endpoint detection result is then determined by voting.
  • In Figure 5, the blue part is the sound signal waveform and the red part is the endpoint detection result, where a high value indicates that a sound event is present and a low value indicates that only the scene sound is present.
  • Figures 5(b), (c), (d), (e) and (f) are waveform diagrams and sound event endpoint detection results at a signal strength of 0 dB in the various acoustic scenes. These figures show that the method can essentially detect the sound event segments at 0 dB.
  • A coefficient of 3 is used. The purpose of this process is to correct sound event segments misclassified within the scene sound segment.
  • l is the ratio of the length of the sound event segment to that of the scene sound segment. Since the separated sound event segment contains scene sound components, which affect its energy value, l·P_n(t) is used as an estimate of this influence, and the influence of the scene sound on the energy value is removed. Because of errors in endpoint detection, the calculated signal-to-noise ratio has a certain error.
  • When the SNR calculation result falls in the interval (-6, -0.5), [-0.5, 2.5), [2.5, 7.5), [7.5, 15) or [15, 25) dB, the classifier model trained at -5 dB, 0 dB, 5 dB, 10 dB or 20 dB, respectively, is used to identify the sound event.
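The interval-to-model mapping just described can be written directly as a lookup; the boundary handling follows the listed half-open intervals, and an SNR outside (-6, 25) dB is simply not covered by the description.

```python
def select_model_snr(snr_db):
    """Map an estimated SNR (dB) to the nominal SNR of the classifier model to use."""
    # (interval, nominal model SNR) pairs from the description above.
    bins = [((-6.0, -0.5), -5),
            ((-0.5, 2.5), 0),
            ((2.5, 7.5), 5),
            ((7.5, 15.0), 10),
            ((15.0, 25.0), 20)]
    for (lo, hi), model in bins:
        if lo <= snr_db < hi:
            return model
    raise ValueError("SNR outside the handled range")
```

In the RFM architecture this nominal SNR, together with the detected scene type, indexes the random forest to apply.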
  • GLCM refers to the joint probability distribution of two pixels in the spectrogram that are separated by (Δx, Δy) and have gray levels i and j, respectively.
  • The specific values of Δx and Δy are determined by two parameters: the pixel distance and the generation direction.
  • GLCM can be expressed as P(i, j | Δx, Δy) = #{ (x, y) | f(x, y) = i, f(x + Δx, y + Δy) = j }, where x, y are the pixel coordinates in the spectrogram, x + Δx ≤ M, y + Δy ≤ N, M × N is the size of the image, and #{S} is the number of elements in the set S.
  • Figure 7 shows an example of GLCM generation.
  • Figure 7(a) is an image area of size 4 × 5 with 8 gray levels cut from the spectrogram.
  • The gray pair (4,6) appears twice, so the value in row 4, column 6 of the GLCM in Figure 7(b) is 2, that is, P(4, 6 | 1, 0°) = 2.
  • Likewise, the gray pair (0,1) appears twice from left to right, so the value in row 0, column 1 of the GLCM in Figure 7(b) is 2, that is, P(0, 1 | 1, 0°) = 2.
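The counting rule in this example can be reproduced by a direct implementation. Here the offset (Δx, Δy) is given explicitly; mapping distance d = 1 and angle 0° to a horizontal neighbor in row-major order, i.e. (Δx, Δy) = (0, 1), is an assumed convention, and the tiny test image below is illustrative rather than the image of Figure 7(a).

```python
import numpy as np

def glcm(img, dx, dy, levels):
    """Gray-level co-occurrence matrix: counts of gray pairs (i, j) at offset (dx, dy)."""
    M, N = img.shape
    P = np.zeros((levels, levels), dtype=int)
    for x in range(M):
        for y in range(N):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < M and 0 <= y2 < N:
                P[img[x, y], img[x2, y2]] += 1  # count the ordered pair (i, j)
    return P
```

On an image where the pairs (0,1) and (4,6) each occur twice horizontally, the matrix entries P[0,1] and P[4,6] both come out as 2, matching the counting in the example.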
  • In the singular value decomposition A = UΣV* of an m × n matrix A, U is a unitary matrix of order m × m, Σ is a positive semi-definite m × n diagonal matrix, and V*, the conjugate transpose of V, is a unitary matrix of order n × n.
  • In the mode-n decomposition A(n) = U(n) Σ(n) V(n)H, U(n) is a unitary matrix, Σ(n) is a positive semi-definite diagonal matrix, and V(n)H, the conjugate transpose of V(n), is a unitary matrix.
  • σ(n)_{i_n}, with 1 ≤ n ≤ N and 1 ≤ i_n ≤ I_n, denotes the i_n-th singular value of Σ(n).
  • For a third-order tensor, the Σ(n), namely Σ(1), Σ(2), Σ(3), and from them σ(1), σ(2) and σ(3), can be obtained.
  • σ(1) = [6.31, 5.24, 5.01, 3.08, 2.71, 2.12, 1.91, 1.27]
  • σ(2) = [6.26, 5.66, 4.60, 3.31, 2.77, 2.00, 1.69, 1.00]
  • σ(3) = [6.51, 5.65, 4.43, 3.10, 2.46, 2.16, 1.68, 1.36]
  • σ(1), σ(2) and σ(3) are concatenated into w as the feature for sound event recognition.
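The mode-n singular values can be computed by unfolding the tensor along each mode and taking an ordinary SVD of each unfolding. This is a minimal sketch; treating the GLCMs as a third-order tensor whose three 8-element singular-value vectors are concatenated, as in the lists above, is the assumed layout.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: the mode-n fibers of the tensor become matrix columns."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd_singular_values(T):
    """Singular values sigma(n) of each mode-n unfolding A(n) = U(n) Sigma(n) V(n)H."""
    return [np.linalg.svd(unfold(T, n), compute_uv=False) for n in range(T.ndim)]

def feature_w(T):
    """Concatenate sigma(1), sigma(2), ... into the feature vector w."""
    return np.concatenate(hosvd_singular_values(T))
```

A useful sanity check: each unfolding preserves the Frobenius norm, so the squared singular values of every mode sum to the same total energy of the tensor.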
  • Random forest is an ensemble classifier algorithm that uses multiple decision tree classifiers to discriminate data [49-52].
  • The principle is shown in Figure 8: through bootstrap re-sampling, k new training data sets are drawn from the feature set of the original training samples. These k training data sets are then grown into k decision trees according to the decision tree construction method, and the trees are combined to form a forest.
  • the discriminant result of the test data is determined by the scores formed by voting of k trees in the forest.
  • The process of using the random forest to identify an unknown test sample is as follows. First, as shown in Figure 2 (or Figures 3 and 4), the feature w_t of the scene sound in the sound to be tested, or the feature w_e of the sound event to be tested, is placed at the root node of each of the k decision trees in the random forest. It is then passed down according to the classification rules of each decision tree until it reaches a leaf node; the class label of that leaf node is the vote cast by that decision tree for the class l of the feature w_t or w_e. All k decision trees of the random forest vote for the category of w_t or w_e, the k votes are counted, and the category with the most votes becomes the category label l of w_t or w_e.
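The bootstrap-and-vote procedure can be illustrated with a toy ensemble. This sketch uses depth-1 "stump" trees instead of full decision trees purely to keep the example short; it is not the classifier configuration used in the experiments.

```python
import random
from collections import Counter

class Stump:
    """Depth-1 decision tree: split on one feature at one threshold."""
    def fit(self, X, y):
        maj = Counter(y).most_common(1)[0][0]
        # Fallback: predict the majority class if no useful split exists.
        self.f, self.t, self.l_lab, self.r_lab = 0, float("inf"), maj, maj
        best_err = len(y) + 1
        for f in range(len(X[0])):
            for t in sorted({row[f] for row in X}):
                left = [lab for row, lab in zip(X, y) if row[f] <= t]
                right = [lab for row, lab in zip(X, y) if row[f] > t]
                if not left or not right:
                    continue
                l_lab = Counter(left).most_common(1)[0][0]
                r_lab = Counter(right).most_common(1)[0][0]
                err = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
                if err < best_err:
                    best_err = err
                    self.f, self.t, self.l_lab, self.r_lab = f, t, l_lab, r_lab
        return self

    def predict(self, row):
        return self.l_lab if row[self.f] <= self.t else self.r_lab

class MiniForest:
    """Bootstrap-resample the training features, grow k trees, classify by majority vote."""
    def __init__(self, k=15, seed=0):
        self.k, self.rng = k, random.Random(seed)

    def fit(self, X, y):
        n = len(X)
        self.trees = []
        for _ in range(self.k):
            idx = [self.rng.randrange(n) for _ in range(n)]  # bootstrap sample
            self.trees.append(Stump().fit([X[i] for i in idx], [y[i] for i in idx]))
        return self

    def predict(self, row):
        votes = Counter(tree.predict(row) for tree in self.trees)  # k trees vote
        return votes.most_common(1)[0][0]
```

Replacing the stumps with full decision trees grown on GLCM-HOSVD feature vectors recovers the random forest structure described above.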
  • The 40 kinds of pure bird sounds used in the experiments come from the Freesound sound database [43]. There are 30 samples of each kind of bird sound, 1200 samples in total.
  • the six scene sounds used in the experiment are Gaussian white noise, busy highway scene sound, flowing water scene sound, airport scene sound, rain scene sound, and wind scene sound.
  • The Gaussian white noise is a random signal with a mean of 0 and a power spectral density of 1, generated randomly by computer from a uniform distribution and obtained by whitening.
  • the other scene sounds are recorded in the corresponding sound scenes at a sampling frequency of 44.1kHz.
  • They are uniformly converted into mono WAV sound clips with an 8 kHz sampling frequency, 16-bit sampling precision, and a length of 2 s.
  • A random forest RF_n is established for discriminating the 6 kinds of scene sounds. RF_s,n is selected from the RFM, and the selected RF_s,n is used to identify sound events in the corresponding sound scene at the 5 signal-to-noise ratios to be measured.
  • the pRF method is a random forest trained with 40 simple sound events in the sound event sample library.
  • The RF_s,n method uses the signal-to-noise ratio l_s of the sound event to be measured and the scene sound type l_t in Figure 2 to select the matching RF_s,n from the RFM, and uses the selected RF_s,n to identify the sound event to be measured.
  • The architecture of Figure 2 is simplified into the EMD, GLCM-HOSVD and RF architecture of Figure 3, that is, identification through RF_s.
  • EMD, GLCM-HOSVD and M-RF architecture are compared with MP-feature [27,28].
  • The SVM method combining MP with PCA and LDA, here referred to as MP-feature, uses the matching pursuit algorithm to select important atoms from the Gabor dictionary, and uses principal component analysis (PCA) and linear discriminant analysis (LDA) to determine the features of the sound event.
  • PCA principal component analysis
  • LDA linear discriminant analysis
  • the SVM classifier performs recognition.
  • The SPD method combined with kNN, referred to as SPD, uses the subband power distribution (SPD) to separate the small, reliable, high-energy part of the sound event spectrogram from the scene sound, and the nearest neighbor classifier (kNN) classifies these high-energy features.
  • SPD sub-band power distribution
  • kNN nearest neighbor classifier
  • the 6 types of sound scenes include: road scene sound, wind scene sound, flowing water scene sound, rain scene sound, airport scene sound and Gaussian white noise.
  • the first group of experiments compares the recognition rates of GLCM-HOSVD and GLCM-SDH.
  • the recognition rate is shown in Figure 9.
  • Figures 9(a), (b), (c), (d), (e) and (f) show the recognition rates under different signal-to-noise ratios in the highway, wind sound, flowing water, rain sound, airport sound and Gaussian white noise scenes, respectively.
  • the second set of experiments is the key experiment of this article.
  • These results are discussed in IV.D.
  • Table 1 Use different RF s, n to recognize sound events with different signal-to-noise ratios
  • the test sound signal-to-noise ratio is 19.00% at 0dB, 7.13% at 5dB, 2.38% at 10dB, and 5.43% at 20dB.
  • When the signal-to-noise ratio of RF_s,n matches that of the test sound, a high recognition rate is maintained even for an RF_s,n with a low signal-to-noise ratio such as -5 dB.
  • the third group of experiments, for EMD, GLCM-HOSVD and RF architecture, is referred to as EMD+RF in Figure 10.
  • The estimated signal-to-noise ratio l_s of the sound event to be measured deviates from its true signal-to-noise ratio. This makes the signal-to-noise ratio of RF_s deviate from that of the sound event to be measured, which lowers the recognition rate of RF_s on the sound event to be measured. This effect is especially obvious when the signal-to-noise ratio of RF_s is low.
  • the relevant results are shown in the green histograms in Figure 10.
  • At 20 dB, the average recognition rate over the 6 types of sound scenes is 92%; at 10 dB, 83%; at 5 dB, 77.5%; at 0 dB, 64%; and at -5 dB, 29%.
  • RF denotes the average recognition rate of sound events over the 6 sound scenes.
  • Only at a signal-to-noise ratio of 20 dB is the recognition rate of pRF slightly higher than that of RF_s.
  • In terms of overall recognition results, RF_s is significantly better than RF.
  • In the fourth group of experiments, the EMD, GLCM-HOSVD and M-RF architecture is referred to as EMD+M-RF in Figure 10.
  • the average recognition rate of this method under different signal-to-noise ratios is shown in the red histogram in Figure 10.
  • The EMD, GLCM-HOSVD and M-RF architecture can greatly improve the recognition rate under low signal-to-noise ratio; the related improvements are discussed in IV.D.
  • The recognition results of the two methods, the EMD, GLCM-HOSVD and M-RF architecture and MP-feature, in the 6 types of sound scenes are shown in Figure 11.
  • At low signal-to-noise ratios, such as 5 dB and below, MP-feature cannot identify most sound events.
  • Figure 11(f): because Gaussian white noise has no obvious regularity, it is not easily reconstructed by matching pursuit (MP); therefore MP-feature retains some recognition ability at 5 dB.
  • the EMD, GLCM-HOSVD and M-RF architectures can maintain a recognition rate of more than 80% in various types of scene sound at 0dB. In particular, in the case of -5dB, an average recognition rate of more than 70% is still maintained.
  • Figure 12 shows the recognition rates of EMD, GLCM-HOSVD and M-RF architecture and SPD in 6 types of scene sounds and 3 signal-to-noise ratios of 5dB, 0dB and -5dB.
  • the SPD method in the case of semi-supervision, discards some features that are disturbed by scene sounds, and retains some reliable high-energy features. It can be seen from Figure 12 that although SPD can still maintain a certain degree of recognition rate in the case of 5dB and 0dB, it cannot maintain the normal recognition ability for a lower signal-to-noise ratio, such as -5dB. For 0dB and -5dB, EMD, GLCM-HOSVD and M-RF architecture still maintain good recognition efficiency.
  • This part analyzes the performance of the EMD, GLCM-HOSVD and RFM architecture, the EMD, GLCM-HOSVD and RF architecture, and the EMD, GLCM-HOSVD and M-RF architecture classifiers proposed in this paper for recognizing environmental sounds in various sound scenes, and compares the EMD, GLCM-HOSVD and M-RF architecture with the SPD and MP methods.
  • GLCM-HOSVD is better than GLCM-SDH; with the EMD, GLCM-HOSVD and RFM architecture and the EMD, GLCM-HOSVD and RF architecture, sound events can be detected at low signal-to-noise ratios.
  • the performance of EMD, GLCM-HOSVD and M-RF architecture is better than the SVM method of MP combined with PCA and LDA.
  • EMD, GLCM-HOSVD and M-RF architecture are better than SPD combined with KNN.
  • Figure 13 shows the average detection accuracy of sound events under 3 different signal-to-noise ratios: 5dB, 0dB, and -5dB under 6 types of sound scenarios. It can be seen from Figure 13 that this method can still maintain a high recognition accuracy rate from 0dB to -5dB.
  • the sound events that may occur are limited. Therefore, the number of sound events in the sound event sample library is also limited. Therefore, according to the EMD, GLCM-HOSVD and RF architecture in Figure 3 or the EMD, GLCM-HOSVD and M-RF architecture in Figure 4, the relevant scene sounds are mixed with the sound events in the sample library, and RF s or RF is established sh -RF s -RF sl can be performed in real time. This enables real-time recognition of low signal-to-noise ratio sound events in various sound scenarios.
  • Deviations in the estimated signal-to-noise ratio of the sound event to be measured cause the recognition rate to decrease.
  • The separated environmental sound deviates from the environmental sound of other time periods.
  • One possible improvement is to select several representative non-stationary scene sounds, mix each of them with the sound events in the sample library to generate multiple RFs, and determine the final result by voting over the results of the multiple RFs.
  • This paper proposes a sound event recognition method that effectively improves the recognition rate at low signal-to-noise ratios in various acoustic scenes.
  • The method mixes the scene sound contained in the sound event to be measured with the sound event sample library, extracts features of the sound data through GLCM-HOSVD, and generates an RF for judging the sound event to be measured.
  • The RF generated by this method can be used to recognize sound events at a low signal-to-noise ratio in a specific scene.
  • Experimental results show that even when the signal-to-noise ratio between the sound event and the scene sound is -5dB, the method maintains an average sound event recognition accuracy above 73%.
  • The proposed method thus addresses sound event recognition under low signal-to-noise-ratio conditions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

A method of recognizing a sound event in an auditory scene having a low signal-to-noise ratio. The method comprises: mixing the scene sound contained in a sound event to be tested with a sound event sample set, extracting features of the sound data via the gray-level co-occurrence matrix (GLCM) and higher-order singular value decomposition (HOSVD), and generating a random forest (RF) for recognizing the sound event to be tested. The RF generated by this method can recognize, in a specific scene, sound events in an auditory scene having a low signal-to-noise ratio. The method maintains an average sound event recognition precision exceeding 73% when the signal-to-noise ratio between the sound event and the scene sound is -5dB.

Description

Sound event recognition method in a low signal-to-noise-ratio sound scene

Technical Field
The invention relates to a method for recognizing sound events in a low signal-to-noise-ratio sound scene that can effectively improve the recognition rate at low signal-to-noise ratios in various sound scenes.
Background Art
Recently, sound event detection (recognition) has attracted widespread attention. It is of great significance for audio forensics [1], environmental sound recognition [2], biological sound monitoring [3], acoustic scene analysis [4], environmental security monitoring [5], real-time detection of military points of interest [6], localization, tracking and sound source classification [7], patient monitoring [8-12], abnormal event monitoring [13-18], and fault diagnosis and delivery of key information for early maintenance [21,22]. Detecting (recognizing) sound events in a sound scene attempts to identify, in audio data, the real events hidden within it.
Because environments differ, the sound scenes that coexist with sound events also differ, and they often appear in non-stationary forms. Effectively recognizing sound events in various sound scenes, especially at low signal-to-noise ratios, therefore remains a challenging task. A certain amount of related work already exists [23-40]. These studies mainly cover the extraction of sound signal features and the classification and recognition of those features. For feature extraction, there are two common classes of effective methods: 1) combined time-frequency features, and 2) the spectrogram and its related features. Combined time-frequency features mainly include time, frequency and wavelet-domain features [23], features extracted by the Gabor-dictionary matching pursuit algorithm [24,25], filtering based on wavelet packets [26], extended features of high-pass filtering and MFCC [27], and decomposition into multiple overlapping super-frames with random regression forests [28].
Spectrogram-related features mainly include the subband power distribution (SPD), the local spectrogram feature (LSF), the Gabor transform, and cosine log scattering (CLS) [29-40]. For the classification of sound events and scene sounds, common effective methods include the support vector machine (SVM) [24,29,32,37,40], the Gaussian mixture model (GMM) [23,31,39], k-nearest neighbor (k-NN) [30,34], kernel Fisher discriminant analysis (KFDA) [29,38], voting with the generalised Hough transform (GHT) [19], GMM combined with the hidden Markov model (HMM) [35], and maximum likelihood (ML) [36].
These methods all achieve certain results in sound event recognition. However, the feature extraction process affects, to varying degrees, the features of the sound event, i.e., the structure of the sound signal to be measured itself. Although the spectral-mask estimation algorithm for missing features can effectively remove the features of sound events disturbed by scene sounds [34], it also masks part of the sound event's features. In the case of white noise, the method of short-time estimation of the feature-masking range [41] easily filters out most sound event features, and the recognition effect is very poor. Spectral subtraction [42] processes the signal in all frequency bands and inevitably destroys features of the sound event. Although multi-band spectral subtraction [43] improves on spectral subtraction, it can still destroy sound event features.
To avoid affecting the signal structure of the sound event while suppressing the scene sound, and thereby obtain a higher recognition rate at low signal-to-noise ratios, this paper proposes training the classifier with sounds in which scene sounds are mixed with sound events. During classifier training, scene sounds are superimposed on sound events at different signal-to-noise ratios, yielding sound data for the sound events in various sound scenes, on which the classifier is trained. In detection, the boundary points between sound events and scene sounds are detected via empirical mode decomposition (EMD) [44] from the Hilbert-Huang transform (HHT). From the detected boundary points, the signal-to-noise ratio of the sound event and the type of scene sound are estimated. The signal-to-noise-ratio interval and the scene sound type are then used to select a classifier that recognizes the sound events in the sound data.
For the signal features of various sound events and their scene sounds, this paper draws on the related literature [45-48] and earlier work [49], and uses the gray-level co-occurrence matrix (GLCM) of the spectrogram together with higher-order singular value decomposition (HOSVD) to extract the features of the sound signal. For the classification and recognition of sound events and scene sounds, we use the random forest matrix (RFM), random forests (RF) [50], and multiple random forests (M-RF).
The recognition process for a sound event uses the signal-to-noise-ratio interval and the scene sound type to select an RF from the RFM, and the selected RF recognizes the sound event. In real-time sound event detection, we train an RF or M-RF with scene sound data and the sound event sample set to recognize sound events in the scene.
Summary of the Invention
The purpose of the present invention is to provide a method for recognizing sound events in a low signal-to-noise-ratio sound scene that can effectively improve the recognition rate at low signal-to-noise ratios in various sound scenes.
To achieve the above objective, the technical solution of the present invention is a method for recognizing sound events in a low signal-to-noise-ratio sound scene, comprising the following steps.

Step S1: training and generation of the random forest matrix: mix the known sound event samples in the sound event sample set with the known scene sound samples in the scene sound sample set to obtain a mixed sound signal set, and store it in a training sound set; generate the feature set of the training sound set from its sound signals through GLCM-HOSVD, and train on this feature set to generate the random forest matrix.

Step S2: training and generation of the scene-sound-type discrimination random forest: perform GLCM-HOSVD on the known scene sound samples in the scene sound sample set to generate the feature set of the scene sound sample set, and train on this feature set to generate the scene-sound-type discrimination random forest.

Step S3: recognizing the sound event to be measured:

First, decompose the sound signal to be measured into scene sound and sound event through EMD, and compute the signal-to-noise ratio of the sound event to be measured.

Second, compute the feature values of the scene sound to be measured and of the sound event to be measured, input the feature values of the scene sound to be measured into the scene-sound-type discrimination random forest generated in step S2, and detect the type of the scene sound to be measured.

Third, using the detected scene sound type and the signal-to-noise ratio of the sound event to be measured, select from the random forest matrix generated in step S1 the random forest for sound event recognition.

Fourth, pass the feature values of the sound event to be measured through the random forest selected in the third step to obtain the sound type.
In an embodiment of the present invention, the first step of step S3 is implemented as follows.

The sound signal to be measured, y(t), is passed through EMD, which can adaptively decompose y(t), according to the characteristics of the signal itself, into a linear superposition of n levels of intrinsic mode functions, namely
y(t) = Σ_{i=1}^{n} L_i(t) + r_n(t)   (1)
where r_n(t) is the residual function and the L_i(t) are the n levels of intrinsic mode functions (IMFs).
Among the n IMFs L_i(t), the level-1 IMF L_1(t) mainly contains noise components (the scene sound part) and very few effective sound components (the sound event part). We therefore select only the level-2 to level-6 IMFs, i.e., i = 2, 3, ..., 6, for detecting the endpoints of the sound to be measured. Endpoint detection with the i-th IMF L_i(t) proceeds as follows.
S311: Preprocess the i-th IMF L_i(t):
e_i(t) = |H{L_i(t)}| + L_i(t)   (2)
where H{L_i(t)} denotes the Hilbert transform of the IMF.
S312: Smooth e_i(t):
E_i(t) = (1/σ) Σ_{τ=0}^{σ-1} e_i(t+τ)   (3)
where σ is the smoothing window, taken as 0.05 times the sampling rate.
S313: Normalize E_i(t):
F_i(t) = E_i(t) / max_t E_i(t)   (4)
S314: Compute the sound event level S_level and the scene sound level N_level, and initialize the scene sound level threshold T:
S_level = mean[F_i(t)]   (5)
N_level = β Σ F_i(t)   (6)
T = α S_level   (7)
where α and β are threshold parameters, with α = 4 and β = 0.25.
S315: Compute the average of F_i(t) over the k-th window:
F̄_i(k) = (1/W_d) Σ_{t=(k-1)W_d+1}^{kW_d} F_i(t)   (8)
where k is the window index and W_d is the window length, taken as 0.02 times the signal sampling rate.
S316: Judge whether a sound event is present:
window k contains a sound event if F̄_i(k) > T, and scene sound otherwise.   (9)
If a sound event is present, jump to step S318.
S317: Dynamically estimate the scene sound and update the scene sound level:
N_level(n) = (1-β) N_level(n-1) + β F̄_i(n)   (10)
where N_level(n) is the scene sound level of the n-th window; after updating N_level(n), jump to step S319.
S318: Update the scene sound level threshold:
T = N_level + θ |S_level - N_level|   (11)
where θ is a constant, θ = 0.2.
S319: If the scene sound level threshold was updated in a previous iteration, update the sound event level S_level:
S_level = N_level + λ |T - N_level|   (12)
where λ = 0.5 serves as the weight for updating the sound event level.
S3110: Let k = k + 1 and move the window; if the windows are not exhausted, jump to step S315, otherwise the loop ends.
The selected level-2 to level-6 IMFs L_i(t) are each processed through steps S311 to S3110, producing 5 different endpoint detection results; the final endpoint detection result is then determined by voting.
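As a concrete illustration, steps S311 to S3110 for a single IMF can be sketched as follows. This is a minimal sketch, not the patented implementation: the smoother in (3), the initialization of N_level, and the exact update rules (10) and (11) are assumed forms reconstructed from the surrounding text, and all function and variable names are illustrative.

```python
import numpy as np
from scipy.signal import hilbert

def detect_events(imf, fs, alpha=4.0, beta=0.25, theta=0.2, lam=0.5):
    """Endpoint detection on a single intrinsic mode function (steps S311-S3110)."""
    e = np.abs(hilbert(imf)) + imf                       # S311: eq (2)
    sigma = max(1, int(0.05 * fs))                       # smoothing window, 0.05 x sample rate
    E = np.convolve(e, np.ones(sigma) / sigma, "same")   # S312: assumed moving-average smoother
    F = E / (E.max() + 1e-12)                            # S313: normalize
    S_level = F.mean()                                   # S314: eq (5)
    N_level = beta * F[:sigma].sum()                     # assumed initialization of N_level
    T = alpha * S_level                                  # eq (7)
    Wd = max(1, int(0.02 * fs))                          # S315: window length, 0.02 x sample rate
    flags = np.zeros(len(F) // Wd, dtype=bool)
    updated = False
    for k in range(len(flags)):
        Fk = F[k * Wd:(k + 1) * Wd].mean()               # S315: window average
        flags[k] = Fk > T                                # S316: sound event present?
        if flags[k]:
            T = N_level + theta * abs(S_level - N_level)     # S318 (assumed form)
            updated = True
        else:
            N_level = (1 - beta) * N_level + beta * Fk       # S317 (assumed form)
        if updated:
            S_level = N_level + lam * abs(T - N_level)       # S319: eq (12)
    return flags
```

Running this over the level-2 to level-6 IMFs and taking a majority vote per window would yield the final endpoint decision described above.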
After the sound signal y(t) has been separated into the sound event segment s(t) and the scene sound segment n(t), we smooth the signal energy so that the signal-to-noise ratio can be estimated more accurately. First, the scene sound energy is computed:
P_n(t) = n²(t)   (13)
Next, the scene sound energy is adjusted:
P_n(t) = mean(P_n) if P_n(t) > γ mean(P_n)   (14)
where the coefficient γ = 3; the purpose of this step is to correct sound event segments misclassified into the scene sound segment.
Finally, the signal-to-noise ratio is computed:
SNR = 10 log₁₀ [ (Σ s²(t) − l Σ P_n(t)) / (l Σ P_n(t)) ]   (15)
where l is the ratio of the length of the sound event segment to that of the scene sound segment. Because the separated sound event segment still contains scene sound components, which affect its energy value, l Σ P_n(t) is used as an estimate of this contribution, removing the influence of the scene sound on the energy value.
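The energy smoothing and SNR estimate of equations (13)-(15) can be sketched as follows. The exact arrangement of (15) shown here — event-segment energy minus the scene-sound contribution l·ΣP_n, divided by that contribution — is an assumption reconstructed from the surrounding text, and the names are illustrative.

```python
import numpy as np

def estimate_snr(event, scene, gamma=3.0):
    """SNR estimate for a separated sound event segment (equations (13)-(15))."""
    Pn = scene ** 2                                  # (13): scene sound energy
    m = Pn.mean()
    Pn = np.where(Pn > gamma * m, m, Pn)             # (14): damp frames misclassified as scene sound
    l = len(event) / len(scene)                      # length ratio of event to scene segment
    noise_in_event = l * Pn.sum()                    # estimated scene contribution inside the event
    Ps = np.sum(event ** 2)                          # raw energy of the event segment
    return 10.0 * np.log10(max(Ps - noise_in_event, 1e-12) / max(noise_in_event, 1e-12))
```

For a sinusoidal event buried in Gaussian scene noise of known variance, the returned value tracks the nominal mixing SNR to within a few dB.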
In an embodiment of the present invention, in steps S1 to S3, the features of the scene sound to be measured, of the sound event to be measured, of the training sound events, and of the known scene sounds are computed as follows.
The GLCM can be expressed as:
P(i,j | d,θ) = #{(x,y),(x+Δx,y+Δy) | f(x,y)=i, f(x+Δx,y+Δy)=j}   (16)
where x, y are pixel coordinates in the spectrogram, with x+Δx ≤ M and y+Δy ≤ N, and M×N is the image size; i, j = 0, 1, ..., L-1, where L is the number of gray levels of the image; and #{S} denotes the number of elements in the set S.
From the spectrogram of a sound event, an image region of size M×N with L gray levels is extracted. According to formula (16) and the chosen values of d and θ, the GLCMs are computed and combined into a higher-order matrix (tensor)
A ∈ R^{I_1×I_2×...×I_N}.
Tensor unfolding of A gives the mode-n unfolding A_(n), in which the element
a_{i_1 i_2 ... i_N}
of A is placed at row i_n, column j of a two-dimensional matrix of size I_n×(I_{n+1}×...×I_N×I_1×...×I_{n-1}), where
j = 1 + Σ_{k=1, k≠n}^{N} (i_k − 1) J_k,
with
J_k = (Π_{m=k+1}^{N} I_m)(Π_{m=1}^{n−1} I_m) when k > n,
J_k = Π_{m=k+1}^{n−1} I_m when k < n.
Singular value decomposition of A_(n) gives
A_(n) = U^(n) Σ^(n) V^(n)H   (17)
where U^(n) is a unitary matrix, Σ^(n) is a positive semi-definite diagonal matrix, and V^(n)H, the conjugate transpose of V^(n), is a unitary matrix. From formula (17) we obtain Σ^(n), and from Σ^(n) we obtain
σ^(n) = [σ^(n)_1, σ^(n)_2, ..., σ^(n)_{I_n}]   (18)
The vectors σ^(1) ... σ^(n) ... σ^(N) are taken as the features of the sound event, namely
w = [σ^(1), ..., σ^(n), ..., σ^(N)]   (19)
where 1 ≤ n ≤ N, and σ^(n)_{i_n} denotes the i_n-th singular value of Σ^(n), with 1 ≤ i_n ≤ I_n.
According to the above computation of sound event feature values, the features of the scene sound to be measured, of the sound event to be measured, of the training sound events, and of the known scene sounds can all be computed.
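The feature computation above can be sketched as follows for a third-order case (an 8×8×Q tensor of GLCMs, as in Figure 7). The mode-n unfolding below orders columns differently from the cyclic ordering in the text, but the singular values — and hence the feature w — are unchanged by any column permutation. Function names are illustrative.

```python
import numpy as np

def glcm(img, dx, dy, levels=8):
    """Gray-level co-occurrence matrix P(i, j | d, theta) for one pixel offset (eq. (16))."""
    P = np.zeros((levels, levels), dtype=np.int64)
    h, w = img.shape
    for x in range(h):
        for y in range(w):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < h and 0 <= y2 < w:
                P[img[x, y], img[x2, y2]] += 1
    return P

def unfold(A, n):
    """Mode-n unfolding A_(n): rows indexed by i_n, remaining modes flattened into columns."""
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def glcm_hosvd_features(img, offsets, levels=8):
    """Stack one GLCM per offset into a tensor A; features are the singular values of each A_(n)."""
    A = np.stack([glcm(img, dx, dy, levels) for dx, dy in offsets], axis=2).astype(float)
    sigmas = [np.linalg.svd(unfold(A, n), compute_uv=False) for n in range(A.ndim)]  # Sigma^(n)
    return np.concatenate(sigmas)   # w = [sigma^(1), ..., sigma^(N)]
```

For d = 1 and θ = 0°, 45°, 90°, 135°, the offsets are (0, 1), (-1, 1), (-1, 0), (-1, -1), giving an 8×8×4 tensor and a feature vector of 8 + 8 + 4 = 20 singular values.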
Compared with the prior art, the present invention has the following beneficial effects:
1. A random forest matrix (RFM) is proposed: sound events are mixed with various environmental sounds at different signal-to-noise ratios, and the mixed sounds are used to train classifiers for the sound events.
2. A method using EMD and voting over intrinsic mode functions (IMFs) is proposed to detect the endpoints of scene sounds and sound events and to estimate the signal-to-noise ratio: the boundary points between scene sounds and sound events in the sound data are detected through multi-level IMFs, the final boundary detection result is determined by voting, and the signal-to-noise ratio of the sound event is estimated.
3. The GLCM-HOSVD feature is proposed: the spectrogram is converted into a gray-level co-occurrence matrix (GLCM), and the feature values of the sound signal are obtained by applying higher-order singular value decomposition (HOSVD) to the GLCM.
4. The random forest matrix (RFM) is used to recognize sound events under different scenes and signal-to-noise ratios: the corresponding random forest is selected according to the scene sound type and the signal-to-noise ratio of the sound event, and is used to recognize the sound event.
5. Random forests (RF) and multiple random forests (M-RF) are proposed for real-time sound event recognition: real-time scene sounds are mixed with the sound events in the sound event sample set to train an RF or M-RF for real-time sound event recognition.
Description of the Drawings
Figure 1: GLCM-HOSVD of the spectrogram.
Figure 2: EMD+GLCM-HOSVD+RFM architecture for sound event recognition in various sound scenes.
Figure 3: EMD+GLCM-HOSVD+RF architecture for real-time recognition of sound events in a sound scene.
Figure 4: EMD+GLCM-HOSVD+M-RF architecture for real-time recognition of sound events in a sound scene.
Figure 5: Endpoint detection results in different sound scenes at 0dB; (a) clean sound, (b) wind scene, (c) rain scene, (d) Gaussian white noise, (e) highway scene, (f) airport scene.
Figure 6: Positional relationship between pixel pairs in the gray-level co-occurrence matrix.
Figure 7: Example of GLCM generation; (a) a 4×5 gray-scale image, (b) the GLCM for d_1 = 1 and θ = 0°, (c) the 8×8×8 third-order matrix formed with d_1 = 1, d_2 = 2 and θ = 0°, 45°, 90°, 135°.
Figure 8: Basic principle of the random forest.
Figure 9: Recognition results of the two texture feature extraction methods in different scenes and at different signal-to-noise ratios; (a) highway scene, (b) wind scene, (c) running water scene, (d) rain scene, (e) airport scene, (f) Gaussian white noise.
Figure 10: Average recognition results of EMD+M-RF, EMD+RF and pRF in the 6 scenes.
Figure 11: Comparison of the recognition rates of EMD+GLCM-HOSVD+M-RF and the MP feature; (a) highway scene, (b) wind scene, (c) running water scene, (d) rain scene, (e) airport noise, (f) Gaussian white noise.
Figure 12: Comparison of EMD+GLCM-HOSVD+M-RF and SPD at low signal-to-noise ratios; (a) highway scene, (b) wind scene, (c) running water scene, (d) rain scene, (e) airport noise, (f) Gaussian white noise.
Figure 13: Average recognition rates of the three methods EMD+GLCM-HOSVD+M-RF, MP-feature and SPD at low signal-to-noise ratios.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings.
The present invention provides a method for recognizing sound events in a low signal-to-noise-ratio sound scene, comprising the following steps.

Step S1: training and generation of the random forest matrix: mix the known sound event samples in the sound event sample set with the known scene sound samples in the scene sound sample set to obtain a mixed sound signal set, and store it in a training sound set; generate the feature set of the training sound set from its sound signals through GLCM-HOSVD, and train on this feature set to generate the random forest matrix.

Step S2: training and generation of the scene-sound-type discrimination random forest: perform GLCM-HOSVD on the known scene sound samples in the scene sound sample set to generate the feature set of the scene sound sample set, and train on this feature set to generate the scene-sound-type discrimination random forest.

Step S3: recognizing the sound event to be measured:

First, decompose the sound signal to be measured into scene sound and sound event through EMD, and compute the signal-to-noise ratio of the sound event to be measured.

Second, compute the feature values of the scene sound to be measured and of the sound event to be measured, input the feature values of the scene sound to be measured into the scene-sound-type discrimination random forest generated in step S2, and detect the type of the scene sound to be measured.

Third, using the detected scene sound type and the signal-to-noise ratio of the sound event to be measured, select from the random forest matrix generated in step S1 the random forest for sound event recognition.

Fourth, pass the feature values of the sound event to be measured through the random forest selected in the third step to obtain the sound type.
The implementation of the method of the present invention is described in detail below.

1. Sound event recognition model
这部分,介绍在各种低信噪比的声场景中基于GLCM-HOSVD的声音事件识别的架构。其中,把声音信号经过GLCM-HOSVD,生成特征值w的过程如图1所示。GLCM-HOSVD的过程,就是把声音信号转换成声谱图,计算声谱图的GLCM,通过对GLCM进行HOSVD,得到声音信号的特征值w。本申请中需要涉及的特征值w包括训练图2中的训练声音集的特征集Wl以及图3中的Ws和图4中的Wsh,Ws,Wsl,已知的有限种场景声音的特征值Wn,待测试声音中的场景声音的特征值wt和待测声音事件的特征值weThis part introduces the architecture of sound event recognition based on GLCM-HOSVD in various sound scenes with low signal-to-noise ratio. Among them, the process of passing the sound signal through GLCM-HOSVD to generate the characteristic value w is shown in Figure 1. The process of GLCM-HOSVD is to convert the sound signal into a spectrogram, calculate the GLCM of the spectrogram, and obtain the characteristic value w of the sound signal by performing HOSVD on the GLCM. The feature value w that needs to be involved in this application includes the feature set W l of the training sound set in training figure 2 and W s in figure 3 and W sh , W s , and W sl in figure 4, which are known limited kinds of scenes. The characteristic value W n of the sound, the characteristic value w t of the scene sound in the sound to be tested, and the characteristic value w e of the sound event to be measured.
在各种声场景下识别声音事件的架构如图2所示。这种架构我们称为EMD,GLCM-HOSVD与RFM架构。相关内容包括,1)随机森林矩阵RFM训练与生成部分,如图2中虚线框部分所示;2)场景声音类型判别随机森林RFn训练与生成部分,如图2中点线框部分所示;3)待测声音事件识别部分,如图2中半划线框部分所示。Figure 2 shows the architecture for recognizing sound events in various sound scenarios. This architecture is called EMD, GLCM-HOSVD and RFM architecture. Related content includes: 1) Random forest matrix RFM training and generation part, as shown in the dashed box in Figure 2; 2) Scene sound type discrimination Random forest RF n training and generation part, as shown in the dotted line box in Figure 2. ; 3) The recognition part of the sound event to be tested, as shown in the half-lined frame part in Figure 2.
The random forest matrix training and generation part comprises the sound event samples, the scene sound samples, sound mixing, the training sound set, GLCM-HOSVD, and the random forest matrix RFM. The sound event samples store known sound event samples of various types. The scene sound samples store S known types of scene sound samples. Sound mixing superimposes the known sound event samples onto the S types of scene sound samples at N different signal-to-noise ratios, generating S×N types of mixed sound signal sets (S sound scenes, N signal-to-noise ratios), which are stored in the training sound set. GLCM-HOSVD is applied to the sounds in the training sound set to generate the feature sets of the sound training set,

W_l = {w_1, w_2, …, w_M},

where M is the number of sound samples. RFM: the S×N feature sets W_l are trained to generate the S×N random forest matrix.
In the scene sound type discrimination random forest training and generation part, GLCM-HOSVD is applied to the scene sound samples to generate the scene sound features w_n. The scene sound feature sample set

W_n = {w_1, w_2, …, w_P}

is trained to generate the scene sound type discrimination random forest RF_n, where P is the number of scene sound samples.
In the recognition part for the sound event under test, the sound signal under test y(t) is decomposed by EMD into the scene sound part n(t) and the sound event part s(t). The feature value w_t of the scene sound under test is computed. w_t is fed to the scene sound type discrimination random forest RF_n to detect the scene sound type l_t. From the scene sound n(t) and the sound event s(t), the signal-to-noise ratio l_s of the sound event under test is computed. Using the scene sound type l_t and the signal-to-noise ratio l_s, the random forest RF_{s,n} for sound event recognition is selected from the random forest matrix. The feature value w_e of the sound event under test is computed, and w_e is classified by the random forest RF_{s,n} to obtain the type l. For real-time recognition of sound events in a sound scene, we simplify the architecture of Figure 2 into the EMD, GLCM-HOSVD and RF architecture shown in Figure 3.
In the real-time test, the scene sound n(t) obtained from the EMD segmentation is directly mixed, at the signal-to-noise ratio l_s between the scene sound n(t) and the sound event s(t), with the M sound events in the sound event sample library. GLCM-HOSVD is applied to the mixed sound set to generate the feature set

W_s = {w_1, w_2, …, w_M}.

W_s is used to build RF_s, and the built RF_s is used to classify the feature value w_e of the sound event under test s(t).
In general, the detection of the signal-to-noise ratio of the sound event in a sound signal is biased. Especially at low signal-to-noise ratios, if the estimate of the signal-to-noise ratio is biased, the trained RF classifier may fail to detect the sound event accurately. Therefore, we further extend the EMD, GLCM-HOSVD and RF architecture of Figure 3 into the sound event recognition architecture composed of EMD, GLCM-HOSVD and M-RF shown in Figure 4. For the sound signal, we simultaneously take two signal-to-noise ratios l_sh and l_sl close to the actually detected signal-to-noise ratio l_s (l_sh > l_s > l_sl), and mix each of the three with the sound event samples to form three sound sets. The three mixed sound sets are passed through GLCM-HOSVD to generate

W_sh = {w_1, …, w_M}, W_s = {w_1, …, w_M} and W_sl = {w_1, …, w_M},

which are used to train three RF classifiers RF_sh, RF_s and RF_sl, respectively. When recognizing a sound event, RF_sh, RF_s and RF_sl each classify the event, and the recognition result is finally determined by the votes of all decision trees in the three random forests.
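The M-RF vote described above, in which the ballots of every decision tree are pooled across the three forests, can be sketched as follows. This is an illustrative sketch rather than the patented implementation: each forest is represented simply as a list of per-tree prediction callables (with trained scikit-learn forests these would wrap `rf.estimators_`).

```python
from collections import Counter

def mrf_classify(forests, w_e):
    """Pool the votes of all decision trees across several random forests
    (e.g. RF_sh, RF_s, RF_sl) and return the majority class label.

    forests: list of forests; each forest is a list of callables, one per
             decision tree, mapping a feature vector w_e to a class label.
    """
    # one vote per tree, pooled over every forest
    votes = [tree(w_e) for forest in forests for tree in forest]
    return Counter(votes).most_common(1)[0][0]
```

Because the trees of all three forests vote in a single pool, a classifier trained at a slightly wrong SNR cannot dominate the decision on its own.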
2. Low signal-to-noise ratio sound event recognition
This part includes performing empirical mode decomposition on the sound data, detecting the endpoints of the sound events, and computing the signal-to-noise ratio between the sound events and the scene sound in the sound data. The sound data are converted into a spectrogram, and the GLCM of the sound data is computed. HOSVD is applied to the GLCM to generate the feature w. The feature set W is used to train the random forest matrix, and the random forests are used to recognize the sound events in the sound data.
A. Sound event endpoint detection and signal-to-noise ratio estimation
First, we detect the sound event endpoints through empirical mode decomposition; then, we estimate the signal-to-noise ratio from the scene sound and the sound event endpoints.
EMD is the core of the HHT transform [44]. EMD adaptively decomposes the original signal y(t), according to the characteristics of the signal itself, into a linear superposition of n levels of IMFs, namely

y(t) = Σ_{i=1}^{n} L_i(t) + r_n(t),   (1)

where r_n(t) is the residual function.
Among the n levels of intrinsic mode functions L_i(t), the level-1 intrinsic mode function L_1(t) mainly contains noise components and very little effective sound. Therefore, we select only the level 2-6 intrinsic mode functions, i.e. i = 2, 3, …, 6, for detecting the foreground sound endpoints. The process of foreground sound endpoint detection with the level-i intrinsic mode function L_i(t) is as follows.
1) Pre-process the level-i intrinsic mode function L_i(t):

e_i(t) = |H{L_i(t)}| + L_i(t),   (2)

where H{L_i(t)} denotes the Hilbert transform of the intrinsic mode function.
2) Smooth e_i(t):

E_i(t) = (1/σ) Σ_{τ=t−σ+1}^{t} e_i(τ),   (3)

where σ is the smoothing window, here taken as 0.05 times the sampling rate.
3) Normalize E_i(t):

F_i(t) = E_i(t) / max_t E_i(t).   (4)
4) Compute the sound event level S_level, the scene sound level N_level, and the initial scene sound level threshold T:

S_level = mean[F_i(t)],   (5)

N_level = β Σ F_i(t),   (6)

T = α S_level,   (7)

where α and β are threshold parameters, taken as α = 4, β = 0.25.
5) Compute the average of F_i(t) in the k-th window:

F_avg(k) = (1/W_d) Σ_{t=(k−1)W_d+1}^{kW_d} F_i(t),   (8)

where k is the window index and W_d is the window length, taken as 0.02 times the signal sampling rate.
6) Judge whether a sound event is present:

event(k) = 1 if F_avg(k) > T, otherwise 0.   (9)

If a sound event is present, jump to step 8).
7) Dynamically estimate the scene sound and update the scene sound level N_level:

N_level(n) = (1 − β) N_level(n − 1) + β F_avg(k),   (10)

where N_level(n) is the scene sound level of the n-th window. After updating the scene sound level N_level(n), jump to step 9).
8) Update the scene sound level threshold:

T = (1 − θ) T + θ F_avg(k),   (11)

where θ is a constant, taken as θ = 0.2.
9) If the threshold was updated in a previous iteration, update the sound event level S_level:

S_level = N_level + λ|T − N_level|,   (12)

where λ = 0.5 is the weight for updating the sound event level.
10) k = k + 1; move the window. If the window has not reached the end, jump to step 5); otherwise the loop ends.
After the selected level 2-6 intrinsic mode functions L_i(t) are processed by the above steps, five different endpoint detection results are obtained, and the final endpoint detection result is determined by voting.
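The per-IMF endpoint detection loop can be sketched as below. This is a simplified sketch: equations (3), (4) and (9)-(11) are images in the source, so the moving-average smoothing, max-normalisation, decision rule and threshold update used here are assumptions, and the N_level/S_level bookkeeping of steps 7) and 9) is omitted for brevity.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed via FFT (no SciPy needed)."""
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * h))

def detect_endpoints(imf, fs, alpha=4.0, theta=0.2):
    """Simplified endpoint detection on one IMF L_i(t).
    Returns one boolean per analysis window: True = sound event present."""
    # 1) pre-processing, eq. (2): e_i(t) = |H{L_i(t)}| + L_i(t)
    e = hilbert_envelope(imf) + imf
    # 2) smoothing over a window sigma = 0.05 * fs (assumed moving average)
    sigma = max(1, int(0.05 * fs))
    E = np.convolve(e, np.ones(sigma) / sigma, mode="same")
    # 3) normalisation (assumed max-normalisation)
    F = E / (E.max() + 1e-12)
    # 4) initial threshold, eq. (7): T = alpha * S_level with S_level = mean(F)
    T = alpha * F.mean()
    # 5), 6), 8): windowed means compared against an adapting threshold
    wd = max(1, int(0.02 * fs))
    flags = []
    for k in range(len(F) // wd):
        f_avg = F[k * wd:(k + 1) * wd].mean()
        is_event = f_avg > T                      # decision rule (assumed)
        flags.append(is_event)
        if is_event:
            T = (1 - theta) * T + theta * f_avg   # threshold update (assumed)
    return np.array(flags)
```

Running this on IMFs 2-6 yields five flag arrays; a per-window majority vote over them gives the final endpoint decision, as in the text.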
In Figure 5, the blue part is the sound signal waveform and the red part is the endpoint detection result: a high level indicates that a sound event is present, and a low level indicates that only scene sound is present. Panels (b), (c), (d), (e) and (f) show, for the various sound scenes, the waveforms at a signal strength of 0 dB and the corresponding sound event endpoint detection results. From these figures it can be seen that the method can, at 0 dB, essentially detect the sound segments of the sound events.
After separating the sound signal y(t) into the sound event segment s(t) and the scene sound segment n(t), in order to estimate the signal-to-noise ratio more accurately, we smooth the signal energy. First, compute the scene sound energy:
P_n(t) = n^2(t).   (13)
Next, adjust the scene sound energy:
P_n(t) = mean(P_n), if P_n(t) > γ·mean(P_n),   (14)
where the coefficient γ = 3. The purpose of this step is to adjust sound event segments that were misclassified into the scene sound segment.
Finally, compute the signal-to-noise ratio:
l_s = 10 log_10( (Σ P_s(t) − l Σ P_n(t)) / Σ P_n(t) ),   (15)

where P_s(t) = s^2(t), and l denotes the ratio of the length of the sound event segment to that of the scene sound segment. Since the separated sound event segment contains scene sound components that affect its energy value, l Σ P_n(t) is used as an estimate of this contribution, removing the influence of the scene sound on the energy value. Because endpoint detection is imperfect, the computed signal-to-noise ratio carries some error. Therefore, to match the corresponding classifier model, sound events whose computed signal-to-noise ratio falls in the intervals (−6, −0.5), [−0.5, 2.5), [2.5, 7.5), [7.5, 15) and [15, 25) dB are recognized with the −5 dB, 0 dB, 5 dB, 10 dB and 20 dB classifier models, respectively.
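The SNR estimate and the mapping onto the classifier-model bins can be sketched as follows. Equation (15) is an image in the source, so the exact form used here follows the reconstruction in the surrounding text and should be read as an assumption.

```python
import numpy as np

def estimate_snr(s, n, gamma=3.0):
    """Eqs. (13)-(15): estimate the SNR (in dB) of the event segment s(t)
    relative to the scene segment n(t)."""
    Pn = n ** 2                              # eq. (13): scene sound energy
    m = Pn.mean()
    Pn = np.where(Pn > gamma * m, m, Pn)     # eq. (14): clip misclassified spikes
    l = len(s) / len(n)                      # event / scene length ratio
    # subtract the estimated scene contribution l * sum(Pn) from the
    # event-segment energy before forming the ratio (eq. (15), assumed form)
    return 10 * np.log10((np.sum(s ** 2) - l * np.sum(Pn)) / np.sum(Pn))

def pick_model(snr_db):
    """Map the estimated SNR onto the classifier-model bins from the text."""
    for lo, hi, model in [(-6, -0.5, -5), (-0.5, 2.5, 0), (2.5, 7.5, 5),
                          (7.5, 15, 10), (15, 25, 20)]:
        if lo <= snr_db < hi:
            return model
    raise ValueError("estimated SNR outside the supported range")
```

For example, an estimated SNR of 1 dB falls in [−0.5, 2.5) and therefore selects the 0 dB classifier model.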
B. GLCM of the sound signal
In this part, we compute the GLCM of the spectrogram S(f, t) of each sound segment.
Here, the GLCM is the joint probability distribution of the co-occurrence, in the spectrogram, of two pixels separated by (Δx, Δy) with grey levels i and j respectively, where the specific values of Δx and Δy are determined by two parameters, the pixel distance d and the matrix generation direction θ [46], satisfying Δx = d cos θ and Δy = d sin θ, as shown in Figure 6. The GLCM can be expressed as:
P(i, j | d, θ) = #{(x, y), (x+Δx, y+Δy) | f(x, y) = i, f(x+Δx, y+Δy) = j},   (16)
where x, y denote pixel coordinates in the spectrogram with x+Δx ≤ M and y+Δy ≤ N; M×N is the size of the image; i, j = 0, 1, …, L−1, where L is the number of grey levels of the image; and #(S) denotes the number of elements in the set S. When d and θ are fixed, P(i, j | d, θ) is abbreviated as P(i, j).
Three factors mainly affect the performance and computational complexity of the GLCM: the number of grey levels L, the pixel distance d, and the direction θ. Based on experiments, in this text we take L = 8, d = 1, 2 and θ = 0°, 45°, 90°, 135°.
Figure 7 shows an example of GLCM generation. Figure 7(a) is an image region of size 4×5 with 8 grey levels, cut from the spectrogram. Figure 7(b) is the GLCM of this region for d = 1, θ = 0°, denoted A_1. In Figure 7(a), scanning horizontally from left to right, the grey-level pair (4, 6) appears twice, so the value in row 4, column 6 of the GLCM in Figure 7(b) should be 2; that is, by (16), P(4, 6 | 1, 0°) = 2. Likewise, the grey-level pair (0, 1) in Figure 7(a) appears twice from left to right, so the value in row 0, column 1 of the GLCM in Figure 7(b) is 2, i.e. P(0, 1 | 1, 0°) = 2.
Similarly, for d_1 = 1 with θ = 0°, 45°, 90°, 135° and d_2 = 2 with θ = 0°, 45°, 90°, 135°, we obtain the other seven GLCMs, A_2, …, A_8. We assemble these eight matrices into an 8×8×8 third-order matrix, as shown in Figure 7(c).
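The construction of the eight GLCMs and their stacking into the third-order matrix can be sketched as follows. The integer direction steps realise the offsets (Δx, Δy) = (d cos θ, d sin θ) for the four angles; note that with image rows indexed downward, the sign convention for Δy at 45° and 135° is a choice, and the counting here is one-directional (pairs scanned left to right), as in the Figure 7 example.

```python
import numpy as np

# integer direction steps for the four GLCM angles; multiplied by the
# distance d they give the offsets (dx, dy)
STEPS = {0: (1, 0), 45: (1, 1), 90: (0, 1), 135: (-1, 1)}

def glcm(img, d, theta_deg, levels=8):
    """Eq. (16): co-occurrence counts of grey-level pairs (i, j) at
    offset (dx, dy); img is a 2-D array of grey levels 0..levels-1,
    indexed img[y, x] (row y, column x)."""
    sx, sy = STEPS[theta_deg]
    dx, dy = d * sx, d * sy
    rows, cols = img.shape
    P = np.zeros((levels, levels), dtype=int)
    for y in range(rows):
        for x in range(cols):
            X, Y = x + dx, y + dy
            if 0 <= X < cols and 0 <= Y < rows:
                P[img[y, x], img[Y, X]] += 1   # count the pair (i, j)
    return P

def glcm_tensor(img, levels=8):
    """Stack the eight GLCMs (d in {1, 2} x four angles) into the
    third-order matrix of Figure 7(c)."""
    mats = [glcm(img, d, th, levels) for d in (1, 2) for th in (0, 45, 90, 135)]
    return np.stack(mats)      # shape (8, levels, levels)
```

On the 4×5 patch of Figure 7(a), `glcm(img, 1, 0)` would reproduce the matrix A_1 of Figure 7(b).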
C. HOSVD
To further extract the features of the sound event, we apply HOSVD to the third-order matrix of Figure 7(c).
Here we first review the singular value decomposition [47]. For any m×n matrix M ∈ ℝ^{m×n}, there exists a decomposition

M = U Σ V*,   (17)

where U is an m×m unitary matrix, Σ is a positive semi-definite m×n diagonal matrix, and V*, the conjugate transpose of V, is an n×n unitary matrix.
For a higher-order matrix A ∈ ℝ^{I_1×I_2×…×I_N} of size I_1×I_2×…×I_N, A can be unfolded as a tensor [34] to obtain A^(n) ∈ ℝ^{I_n×(I_{n+1}×…×I_N×I_1×…×I_{n−1})}: the element a_{i_1 i_2 … i_N} of A is placed at row i_n, column j of the two-dimensional matrix of size I_n×(I_{n+1}×…×I_N×I_1×…×I_{n−1}). Here,

j = 1 + Σ_{k=1, k≠n}^{N} (i_k − 1) J_k,

where, when k > n, J_k = Π_{m=1, m≠n}^{k−1} I_m, and when k < n, J_k = Π_{m=1}^{k−1} I_m (the empty product being 1).
Analogously to the singular value decomposition, perform a singular value decomposition of A^(n):

A^(n) = U^(n) Σ^(n) V^(n)H,   (18)

where U^(n) is a unitary matrix, Σ^(n) is a positive semi-definite diagonal matrix, and V^(n)H, the conjugate transpose of V^(n), is a unitary matrix. This yields Σ^(n). From Σ^(n), we obtain

σ^(n) = [σ^(n)_1, σ^(n)_2, …, σ^(n)_{I_n}].   (19)

We take σ^(1) … σ^(n) … σ^(N) as the feature of the sound event, namely

w = [σ^(1) σ^(2) … σ^(N)],   (20)

where 1 ≤ n ≤ N, and σ^(n)_{i_n} denotes the i_n-th singular value of Σ^(n), 1 ≤ i_n ≤ I_n.
Taking the 8×8×8 third-order matrix of Figure 7(c) as an example, it can be written as A ∈ ℝ^{I_1×I_2×I_3} with I_1 = 8, I_2 = 8, I_3 = 8, so that A^(1) ∈ ℝ^{8×64}. Unfolding A along the I_1 dimension gives A^(1); likewise, unfolding along the I_2 and I_3 dimensions gives A^(2) and A^(3).
Therefore, from (18) and (19) we obtain Σ^(n), i.e. Σ^(1), Σ^(2), Σ^(3), and σ^(1), σ^(2) and σ^(3):

Σ^(1) = [diag(σ^(1))  O_{8,56}],

where O_{n,m} denotes a zero matrix of size n×m, and σ^(1) = [6.31, 5.24, 5.01, 3.08, 2.71, 2.12, 1.91, 1.27]. Similarly, σ^(2) = [6.26, 5.66, 4.60, 3.31, 2.77, 2, 1.69, 1] and σ^(3) = [6.51, 5.65, 4.43, 3.10, 2.46, 2.16, 1.68, 1.36]. Finally, according to (20), σ^(1), σ^(2) and σ^(3) are combined into w as the feature for sound event recognition:

w = [6.31 5.24 5.01 3.08 … 3.10 2.46 2.16 1.68 1.36].
In the same way, we obtain the features w_l of the training sound events described in Section 1, the features w_n of the known scene sounds, the feature w_t of the scene sound in the sound under test, and the feature w_e of the sound event under test. For a sound set containing M sound events, we obtain the feature set W = {w_1, …, w_M}. With the feature set W, we can further train the random forests.
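The feature extraction of equations (18)-(20) can be sketched compactly: unfold the tensor along each mode, take the singular values of each unfolding, and concatenate them. The column ordering produced by `reshape` may differ from the index formula given above, but singular values are invariant under column permutation, so the feature w is unaffected.

```python
import numpy as np

def unfold(A, n):
    """Mode-n unfolding: mode n becomes the rows, all other modes the columns."""
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def hosvd_feature(A):
    """Eqs. (18)-(20): concatenate the singular values sigma^(n) of every
    mode-n unfolding A^(n) into the feature vector w."""
    return np.concatenate([np.linalg.svd(unfold(A, n), compute_uv=False)
                           for n in range(A.ndim)])
```

For the 8×8×8 GLCM tensor this yields a 24-dimensional feature w (three groups of 8 singular values), matching the worked example above.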
D. RF recognition
A random forest is an ensemble classifier algorithm that uses multiple decision-tree classifiers to discriminate data [49-52]. Its principle is shown in Figure 7: bootstrap resampling is applied to the feature set of the original training samples to generate k new training data sets. These k newly generated training data sets are then grown into k decision trees according to the decision-tree construction method, and combined to form a forest. The discrimination result for test data is determined by the score formed by the votes of the k trees in the forest.
The process of recognizing an unknown test sample with a random forest is as follows. First, as shown in Figures 2, 3 and 4, the feature w_t of the scene sound in the sound under test, or the feature w_e of the sound event under test, is placed at the root node of each of the k decision trees in the random forest. It is then passed down according to the classification rules of each decision tree until it reaches a leaf node. The class label corresponding to that leaf node is the vote cast by that decision tree for the class l of the feature w_t or w_e. All k decision trees of the random forest vote for the class l of w_t or w_e; these k votes are tallied, and the label with the most votes becomes the class label l corresponding to w_t or w_e.
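The selection of RF_{s,n} from the random forest matrix and the subsequent majority vote can be sketched as follows. This is an illustrative sketch: the forest matrix is modelled as a dictionary keyed by (scene type, SNR bin), and each tree as a callable standing in for a trained decision tree.

```python
from collections import Counter

def recognise(rfm, scene_type, snr_bin, w_e):
    """Select RF_{s,n} from the random forest matrix by scene type l_t and
    SNR bin l_s, pass w_e down every tree, and return the majority label.

    rfm: dict mapping (scene_type, snr_bin) -> forest, where a forest is a
         list of per-tree callables (feature vector -> class label).
    """
    forest = rfm[(scene_type, snr_bin)]
    votes = [tree(w_e) for tree in forest]   # one vote per decision tree
    return Counter(votes).most_common(1)[0][0]
```

In the full pipeline, `scene_type` would come from RF_n applied to w_t and `snr_bin` from the SNR estimate of equation (15).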
3. Experiments
A. Sound event sample set
The 40 kinds of pure bird calls used in the experiments come from the Freesound sound database [43]; each kind of bird call has 30 samples, for a total of 1200 samples. The six scene sounds used in the experiments are Gaussian white noise, busy-highway scene sound, running-water scene sound, airport scene sound, rain scene sound, and wind scene sound. The Gaussian white noise is a uniformly distributed random signal with zero mean and unit power spectral density, randomly generated by computer and obtained by whitening. The other scene sounds were recorded in the corresponding sound scenes at a sampling frequency of 44.1 kHz. To standardize the encoding format and length of these sound files, they were uniformly converted into mono WAV sound clips with a sampling frequency of 8 kHz, a sampling precision of 16 bits, and a length of 2 s.
B. Experimental setup
First, the basic experiments.
1) Comparison of GLCM-HOSVD with GLCM-SDH (sum and difference histograms) [45, 49]. GLCM-HOSVD and GLCM-SDH are used respectively to extract features from sound signals containing sound events at different signal-to-noise ratios, and recognition is performed with random forests. The performance of the two kinds of features, GLCM-HOSVD and GLCM-SDH, for sound event recognition is compared over the six types of sound scenes.
2) The key experiment of this paper. The process, shown in Figure 2, uses the architecture composed of EMD, GLCM-HOSVD and RFM. It comprises: a) the random forest matrix RFM training and generation part; b) the scene sound type discrimination random forest RF_n training and generation part; c) the recognition part for the sound event under test.
a) Random forest matrix RFM training and generation. The 40 kinds of sound event samples (20 of each) are mixed with the six types of scene sound samples at five signal-to-noise ratios (20, 10, 5, 0 and −5 dB), generating 6×5 = 30 mixed sound sets. GLCM-HOSVD is applied to the mixed sounds to produce the feature sets. The 30 feature sets are used to train and generate the 6×5 random forest matrix RFM.
b) Scene sound type discrimination random forest RF_n. A random forest is built to discriminate the six scene sound types. RF_{s,n} is selected from the RFM, and RF_{s,n} is used to recognize the sound events under test for the corresponding sound scene and the five signal-to-noise ratios.
c) Comparison of the EMD, GLCM-HOSVD and RF architecture with the pRF method trained on pure sound events. The pRF method is a random forest trained with the 40 kinds of pure sound events in the sound event sample library. The RF_{s,n} method, as in Figure 2, uses the signal-to-noise ratio l_s of the sound event under test and the scene sound type l_t to select the matching RF_{s,n} from the RFM, and uses the selected RF_{s,n} to recognize the sound event under test. For real-time detection in a known sound scene, we simplify the architecture of Figure 2 into the EMD, GLCM-HOSVD and RF architecture of Figure 3, i.e. recognition through RF_s.
d) Detection performance of the EMD, GLCM-HOSVD and RF architecture. Based on the above experimental results, a practical improvement of the proposed method is made: in concrete applications, the EMD, GLCM-HOSVD and M-RF architecture shown in Figure 4 is adopted.
Second, comparison of the EMD, GLCM-HOSVD and M-RF architecture with MP-feature [27, 28]. In six different sound scenes, the EMD, GLCM-HOSVD and M-RF architecture of Figure 4 is compared with the MP+PCA+LDA+SVM method of [27]. The SVM method combining MP with PCA and LDA, here abbreviated MP-feature, uses the matching pursuit algorithm to select important atoms from a Gabor dictionary, determines the features of the sound event with principal component analysis (PCA) and linear discriminant analysis (LDA), and performs recognition with an SVM classifier.
Third, comparison of the EMD, GLCM-HOSVD and M-RF architecture with SPD. The EMD, GLCM-HOSVD and M-RF architecture is compared with the SPD+KNN method of [20] for sound event recognition performance at 5 dB, 0 dB and −5 dB. In the method combining SPD with KNN, abbreviated SPD, the sub-band power distribution (SPD) separates the small, high-energy, reliable portion of the sound event spectrogram from the scene sound, and the nearest-neighbour classifier (kNN) recognizes this small, high-energy, reliable portion of the spectrogram.
C. Experimental scenes
The six types of sound scenes are: highway scene sound, wind scene sound, running-water scene sound, rain scene sound, airport scene sound, and Gaussian white noise.
4. Results and discussion
A. Basic results
The first group of experiments compares the recognition rates of the GLCM-HOSVD and GLCM-SDH methods. The recognition rates are shown in Figure 9, where panels (a), (b), (c), (d), (e) and (f) give the recognition rates at different signal-to-noise ratios for the highway scene, wind scene, running-water scene, rain scene, airport scene and Gaussian white noise, respectively.
It can be seen that, in most sound scenes, at signal-to-noise ratios of 10-20 dB, the recognition rate with the GLCM-HOSVD feature is about 20% higher than with the GLCM-SDH feature.
In Figure 9(a), despite the instability of the sound scene around the highway, the experimental results still show that the GLCM-HOSVD method is clearly better than GLCM-SDH. Although, as shown in Figure 9(e), in the airport sound scene the GLCM-HOSVD method is slightly below the GLCM-SDH method when the signal-to-noise ratio of the sound event is 0 dB, overall the proposed GLCM-HOSVD method characterizes the texture features of the spectrogram better than the GLCM-SDH method.
The second group of experiments is the key experiment of this paper. In the experiments, the number of predetermined scene sound types is small, so choosing the random forest discriminator ensures correct recognition of the scene sound type; practical applications are discussed in Section 4.D. Each random forest RF_{s,n} in the random forest matrix RFM, for each sound scene and signal-to-noise ratio, is tested with sound events at the five different signal-to-noise ratios in that sound scene. The average recognition rates over the six sound scenes are shown in Table 1.
Table 1. Recognition rates for sound events at different signal-to-noise ratios using different RF_{s,n}
As can be seen from Table 1, when the signal-to-noise ratio of the random forest RF_{s,n} matches that of the test sound, the recognition accuracy is almost unaffected by the signal-to-noise ratio: as the main diagonal of Table 1 shows, the recognition accuracy is high at both high and low signal-to-noise ratios. If the test sound's signal-to-noise ratio deviates from the training sound's, the recognition accuracy falls as the deviation grows. For example, in the first row of Table 1, with an RF_{s,n} signal-to-noise ratio of 20 dB, the recognition rate is 68.63% for a test sound at 10 dB, 46.88% at 5 dB, 27.63% at 0 dB, and 13.75% at −5 dB. Moreover, the lower the RF_{s,n} signal-to-noise ratio, the greater the impact of a mismatch between the RF_{s,n} and test-sound signal-to-noise ratios on the recognition rate. As in the fifth row of Table 1, with an RF_{s,n} signal-to-noise ratio of −5 dB, the recognition rate is 19.00% for a test sound at 0 dB, 7.13% at 5 dB, 2.38% at 10 dB, and 5.43% at 20 dB. However, as long as the RF_{s,n} signal-to-noise ratio is matched to that of the test sound, a high recognition rate is maintained even in the low signal-to-noise case of RF_{s,n} at −5 dB.
The third group of experiments concerns the EMD, GLCM-HOSVD and RF architecture, abbreviated EMD+RF in Figure 10. Under the architecture of Figure 3, the estimated signal-to-noise ratio l_s of the sound event under test deviates from its true signal-to-noise ratio, so the signal-to-noise ratio of RF_s deviates from that of the sound event under test, which lowers the recognition rate of RF_s for the sound event. This is especially evident when the RF_s signal-to-noise ratio is low. The results are shown in the green histogram in Figure 10: when RF_s is at 20 dB, the average recognition rate over the six sound scenes is 92%; at 10 dB, 83%; at 5 dB, 77.5%; at 0 dB, 64%; and at −5 dB, 29%.
For pRF, the average recognition rate of the sound events over the six sound scenes at the various SNRs is labelled RF in Fig. 10. The blue bars show that at an SNR of 20 dB the pRF recognition rate is slightly higher than that of RFs, but overall RFs clearly outperforms pRF.
The fourth set of experiments evaluated the EMD, GLCM-HOSVD and M-RF architecture (EMD+M-RF in Fig. 10). In this experiment, the M-RF was mixed from forests trained at the estimated SNR itself and at SNRs offset from it. The average recognition rates of this method at the various SNRs are shown as the red bars in Fig. 10: the EMD, GLCM-HOSVD and M-RF architecture greatly improves the recognition rate at low SNR. The related improvements are discussed in Section IV.D.
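One way the M-RF combination described above could be realized is to pool the trees of the forests trained at the estimated SNR and at the offset SNRs and take a majority vote over the pooled trees. A sketch under that assumption, treating a forest as a plain list of tree classifiers (illustrative names; the patent's exact mixing rule is not reproduced here):

```python
from collections import Counter

def mix_forests(*forests):
    # Pool the trees of several forests (e.g. those trained at the
    # estimated SNR and at offset SNRs) into one mixed forest.
    mixed = []
    for forest in forests:
        mixed.extend(forest)
    return mixed

def forest_predict(forest, features):
    # Decide the label by majority vote over the pooled trees.
    votes = Counter(tree(features) for tree in forest)
    return votes.most_common(1)[0][0]
```

Because the pooled forest contains trees trained at several SNRs, an error in the SNR estimate is less likely to leave the event under test without any matching trees.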
B. Comparison of the EMD, GLCM-HOSVD and M-RF architecture with MP-feature
Figure 11 shows the recognition results of the two feature-extraction approaches, the EMD, GLCM-HOSVD and M-RF architecture and MP-feature, in the six sound-scene classes. At low SNRs, such as 5 dB and below, the MP feature fails to recognize most sound events in all six scenes. The only exception is Fig. 11(f): because Gaussian white noise exhibits no obvious regularity and is hard to reconstruct by matching pursuit (MP), some recognition ability is retained at 5 dB. In contrast, the EMD, GLCM-HOSVD and M-RF architecture maintains a recognition rate above 80% at 0 dB across all scene types, and still averages above 70% even at -5 dB.
C. Comparison of the EMD, GLCM-HOSVD and M-RF architecture with SPD at low SNR
Figure 12 shows the recognition rates of the EMD, GLCM-HOSVD and M-RF architecture and of SPD for the six scene-sound classes at SNRs of 5 dB, 0 dB and -5 dB. The SPD method, in a semi-supervised manner, discards the features corrupted by the scene sound and retains a subset of reliable high-energy features. As Fig. 12 shows, SPD still maintains a certain recognition rate at 5 dB and 0 dB, but cannot maintain normal recognition ability at lower SNRs such as -5 dB, whereas the EMD, GLCM-HOSVD and M-RF architecture still performs well at 0 dB and -5 dB.
D. Discussion
This part analyzes the performance of the proposed EMD, GLCM-HOSVD and RFM architecture, EMD, GLCM-HOSVD and RF architecture, and EMD, GLCM-HOSVD and M-RF architecture classifiers in recognizing environmental sounds in various sound scenes, and compares the EMD, GLCM-HOSVD and M-RF architecture with the SPD and MP methods.
The experiments show that GLCM-HOSVD outperforms GLCM-SDH, and that the EMD, GLCM-HOSVD and RFM architecture and the EMD, GLCM-HOSVD and RF architecture can detect sound events at low SNR. The EMD, GLCM-HOSVD and M-RF architecture outperforms the SVM method that combines MP with PCA and LDA, and at SNRs below 0 dB it outperforms SPD combined with KNN. Figure 13 shows the average detection accuracy of sound events at three SNRs (5 dB, 0 dB and -5 dB) in the six sound scenes; as Fig. 13 shows, the method maintains a high recognition accuracy from 0 dB down to -5 dB.
As described in basic experiment 2 of Section IV.B, only six sound-scene classes were selected in the experiments, so that using an RF to discriminate the scene sound produces no misjudgment; an erroneous judgment of the scene sound could affect recognition accuracy. In practical applications we adopt the method of Fig. 3 or Fig. 4: the sound-event endpoint detection and SNR estimation method of Section III.A separates the scene sound from the sound event under test; according to the event's SNR, that scene sound is mixed directly with all the sound events in the sound-event sample library to generate a sound-event set for the corresponding scene; the GLCM-HOSVD features of this set are then extracted to train and generate a random forest. Discriminating the event under test with this generated RF ensures that the scene type of the event under test matches the scene-sound type of the random forest.
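The workflow just described can be sketched as a single routine, with each stage passed in as a callable (all names here are placeholders; the EMD separation, GLCM-HOSVD extraction and RF training are assumed to be supplied elsewhere):

```python
import numpy as np

def build_scene_specific_rf(scene, snr_db, sample_library,
                            extract_features, train_rf):
    # Mix the scene sound separated from the signal under test with every
    # event in the sample library at the estimated SNR, extract features
    # from each mixture, and train a forest on the scene-specific set.
    training_set = []
    for label, event in sample_library:
        p_event = np.mean(event ** 2)
        p_scene = np.mean(scene ** 2)
        scale = np.sqrt(p_event / (p_scene * 10 ** (snr_db / 10.0)))
        mixture = event + scale * scene
        training_set.append((extract_features(mixture), label))
    return train_rf(training_set)
```

A forest built this way shares, by construction, both the scene type and the SNR of the event under test.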
In practical applications, only a limited set of sound events can occur in a given environment (scene), so the number of sound events in the sample library is likewise limited. Therefore, mixing the relevant scene sound with the sound events in the sample library and building RFs (or RFsh-RFs-RFsl) according to the EMD, GLCM-HOSVD and RF architecture of Fig. 3 or the EMD, GLCM-HOSVD and M-RF architecture of Fig. 4 can be carried out in real time. This makes real-time recognition of low-SNR sound events possible in various sound scenes.
A further issue, noted in basic experiment 4 of Section IV.A, is that the deviation in the estimated SNR of the event under test lowers the recognition rate. Because scene sounds are non-stationary, the separated environmental sound deviates from the environmental sound of other time periods. One improvement for this problem is to select several representative segments of the non-stationary scene sound, mix each separately with the events in the sample library to generate multiple RFs, and determine the final result by a further vote over the outputs of these RFs.
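The suggested improvement, one RF per representative scene segment followed by a vote, could take the following shape (illustrative helper names; the selection of representative segments is assumed to be done beforehand):

```python
from collections import Counter

def vote_over_segments(scene_segments, test_features, build_rf):
    # Build one random forest per representative scene segment and
    # decide the final label by majority vote over their outputs.
    predictions = [build_rf(segment)(test_features)
                   for segment in scene_segments]
    return Counter(predictions).most_common(1)[0][0]
```

Voting over forests built from different time periods of the scene sound reduces the sensitivity of the result to any single, possibly unrepresentative, separated segment.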
We therefore believe that, on the basis of the proposed EMD, GLCM-HOSVD and RFM architecture, EMD, GLCM-HOSVD and RF architecture, and EMD, GLCM-HOSVD and M-RF architecture classifiers, low-SNR sound events can be recognized in various sound scenes.
In summary, this paper proposes a sound-event recognition method that effectively improves the recognition rate at low SNR in various sound scenes. The method combines the scene sound contained in the sound event under test with the sound-event sample library, extracts features of the sound data through GLCM-HOSVD, and generates an RF for discriminating the sound event under test. The RF generated in this way can recognize sound events at low SNR in a specific scene. Experimental results show that the method maintains an average recognition accuracy above 73% even when the event-to-scene SNR is -5 dB. Compared with the MP and SPD feature-extraction methods, the proposed method to a certain extent solves the problem of recognizing sound events at low SNR.
References:
[1] H. Malik, "Acoustic environment identification and its applications to audio forensics," IEEE Trans. Inf. Foren. Sec., vol. 8, no. 11, pp. 1827-1837, Nov. 2013.
[2] T. Heittola, A. Mesaros, T. Virtanen, A. Eronen, "Sound event detection in multisource environments using source separation," in Proc. CHiME, pp. 36-40, 2011.
[3] C.-H. Lee, S.-B. Hsu, J.-L. Shih, and C.-H. Chou, "Continuous birdsong recognition using Gaussian mixture modeling of image shape features," IEEE Trans. Multimedia, vol. 15, no. 2, pp. 454-464, Feb. 2013.
[4] Z. Shi, J. Han, T. Zheng, and J. Li, "Identification of objectionable audio segments based on pseudo and heterogeneous mixture models," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 3, pp. 611-623, Mar. 2013.
[5] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "An adaptive framework for acoustic monitoring of potential hazards," EURASIP J. Audio, Speech, Music Process., vol. 2009, pp. 1-16, Jan. 2009.
[6] C. Clavel, T. Ehrette, G. Richard, "Events detection for an audio-based surveillance system," in Proc. ICME, pp. 1306-1309, 2005.
[7] H. Zhao and H. Malik, "Audio recording location identification using acoustic environment signature," IEEE Trans. Inf. Foren. Sec., vol. 8, no. 11, pp. 1746-1759, Nov. 2013.
[8] C. Clavel, I. Vasilescu, L. Devillers, G. Richard, T. Ehrette, "Fear-type emotion recognition for future audio-based surveillance systems," Speech Commun., vol. 50, pp. 487-503, 2008.
[9] J. N. Mcnames, A. M. Fraser, "Obstructive sleep apnea classification based on spectrogram patterns in the electrocardiogram," Computers in Cardiology, vol. 27, pp. 749-752, Sep. 2000.
[10] V. Kudriavtsev, V. Polyshchuk, and D. L. Roy, "Heart energy signature spectrogram for cardiovascular diagnosis," BioMedical Engineering Online, vol. 6, no. 1, pp. 16, 2007.
[11] V. N. Varghees, K. I. Ramachandran, "A novel heart sound activity detection framework for automated heart sound analysis," Biomedical Signal Processing and Control, vol. 13, pp. 174-188, Sep. 2014.
[12] A. Gavrovska, V. [Figure PCTCN2015077075-appb-000051], I. Reljin, and B. Reljin, "Automatic heart sound detection in pediatric patients without electrocardiogram reference via pseudo-affine Wigner–Ville distribution and Haar wavelet lifting," Computer Methods and Programs in Biomedicine, vol. 113, no. 2, pp. 515-528, Feb. 2014.
[13] S. Ntalampiras, I. Potamitis, N. Fakotakis, "On acoustic surveillance of hazardous situations," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'09), 2009, pp. 165-168.
[14] S. [Figure PCTCN2015077075-appb-000052], S. [Figure PCTCN2015077075-appb-000053], "Classification and analysis of non-stationary characteristics of crackle and rhonchus lung adventitious sounds," Digital Signal Processing, vol. 28, pp. 18-27, May 2014.
[15] B. Lei, S. A. Rahman, and I. Song, "Content-based classification of breath sound with enhanced features," Neurocomputing, vol. 141, pp. 139-147, Oct. 2014.
[16] Y. Wang, W. Li, J. Zhou, X. Li, and Y. Pu, "Identification of the normal and abnormal heart sounds using wavelet-time entropy features based on OMS-WPD," Future Generation Computer Systems, vol. 37, pp. 488-495, Jul. 2014.
[17] F. Jin, F. Sattar, and D. Y. Goh, "New approaches for spectro-temporal feature extraction with applications to respiratory sound classification," Neurocomputing, vol. 123, pp. 362-371, Jan. 2014.
[18] G. Muhammad, M. Moutasem, "Pathological voice detection and binary classification using MPEG-7 audio features," Biomedical Signal Processing and Control, vol. 11, pp. 1-9, May 2014.
[19] G. Richard, S. Sundaram, and S. Narayanan, "An overview on perceptually motivated audio indexing and classification," Proc. IEEE, vol. 101, no. 9, pp. 1939-1954, Sep. 2013.
[20] R. Yan, R. X. Gao, "Multi-scale enveloping spectrogram for vibration analysis in bearing defect diagnosis," Tribology International, vol. 42, no. 2, pp. 293-302, Feb. 2009.
[21] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, "Content-based multimedia information retrieval: State of the art and challenges," ACM Trans. Multimedia Comput., Commun., Applic., vol. 2, no. 1, pp. 1-19, Feb. 2006.
[22] J. Wang, K. Zhang, K. Madani, and C. Sabourin, "Salient environmental sound detection framework for machine awareness," Neurocomputing, vol. 152, pp. 444-454, Mar. 2015.
[23] S. Ntalampiras, "A novel holistic modeling approach for generalized sound recognition," IEEE Signal Process. Lett., vol. 20, no. 2, pp. 185-188, Feb. 2013.
[24] J.-C. Wang, C.-H. Lin, B.-W. Chen, and M.-K. Tsai, "Gabor-based nonuniform scale-frequency map for environmental sound classification in home automation," IEEE Trans. Autom. Sci. Eng., vol. 11, no. 2, pp. 607-613, Apr. 2014.
[25] S. Chu, S. Narayanan, and C. C. J. Kuo, "Environmental sound recognition with time-frequency audio features," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 6, pp. 1142-1158, Aug. 2009.
[26] Z. R. Feng, Q. Zhou, J. Zhang, P. Jiang, and X. W. Yang, "A target guided subband filter for acoustic event detection in noisy environments using wavelet packets," IEEE Trans. Audio, Speech, Lang. Process., vol. 23, no. 2, pp. 361-372, Feb. 2015.
[27] J. [Figure PCTCN2015077075-appb-000054]-Choez, A. Gallardo-Antolín, "Feature extraction based on the high-pass filtering of audio signals for acoustic event classification," Computer Speech & Language, vol. 30, no. 1, pp. 32-42, Mar. 2015.
[28] H. Phan, M. Maas, R. Mazur, and A. Mertins, "Random regression forests for acoustic event detection and classification," IEEE Trans. Audio, Speech, Lang. Process., vol. 23, no. 1, pp. 20-31, Jan. 2015.
[29] J. Ye, T. Kobayashi, M. Murakawa, T. Higuchi, "Kernel discriminant analysis for environmental sound recognition based on acoustic subspace," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'13), 2013, pp. 808-812.
[30] P. Khunarsal, C. Lursinsap, and T. Raicharoen, "Very short time environmental sound classification based on spectrogram pattern matching," Inform. Sci., vol. 243, pp. 57-74, Sep. 2013.
[31] C. Baugé, M. Lagrange, J. Andén, and S. Mallat, "Representing environmental sounds using the separable scattering transform," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'13), 2013, pp. 8667-8671.
[32] J. C. Wang, C. H. Lin, E. Siahaan, B. W. Chen, and H. L. Chuang, "Mixed sound event verification on wireless sensor network for home automation," IEEE Trans. Ind. Informat., vol. 10, no. 1, pp. 803-812, Feb. 2014.
[33] J. Dennis, H. D. Tran, and E. S. Chng, "Overlapping sound event recognition using local spectrogram features with the generalised Hough transform," Pattern Recognition Lett., vol. 34, no. 9, pp. 1085-1093, Sep. 2013.
[34] J. Dennis, H. D. Tran, and E. S. Chng, "Image feature representation of the subband power distribution for robust sound event classification," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 2, pp. 367-377, Feb. 2013.
[35] T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, "Context-dependent sound event detection," EURASIP J. Audio, Speech, Music Process., vol. 2013, no. 1, pp. 1-13, Jan. 2013.
[36] A. Plinge, R. Grzeszick, and G. A. Fink, "A bag-of-features approach to acoustic event detection," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'14), 2014, pp. 3704-3708.
[37] T. H. Dat, N. W. Z. Terence, J. W. Dennis, and L. Y. Ren, "Generalized Gaussian distribution Kullback-Leibler kernel for robust sound event recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'14), 2014, pp. 5949-5953.
[38] J. Ye, T. Kobayashi, M. Murakawa, and T. Higuchi, "Robust acoustic feature extraction for sound classification based on noise reduction," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'14), 2014, pp. 5944-5948.
[39] S. Deng, J. Han, C. Zhang, T. Zheng, and G. Zheng, "Robust minimum statistics project coefficients feature for acoustic environment recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'14), 2014, pp. 8232-8236.
[40] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Sparse representation based on a bag of spectral exemplars for acoustic event detection," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'14), 2014, pp. 6255-6259.
[41] M. Seltzer, B. Raj, and R. Stern, "A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition," Speech Commun., vol. 43, no. 4, pp. 379-393, 2004.
[42] K. Yamashita, T. Shimamura, "Nonstationary noise estimation using low-frequency regions for spectral subtraction," IEEE Signal Process. Lett., vol. 12, no. 6, pp. 465-468, 2005.
[43] K. Sunil and L. Philipos, "A multi-band spectral subtraction method for enhancing speech corrupted by colored noise," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'02), 2002, vol. 4, pp. 13-17.
[44] H. Huang and J. Q. Pan, "Speech pitch determination based on Hilbert-Huang transform," Signal Process., vol. 86, no. 4, pp. 792-803, 2006.
[45] M. Unser, "Sum and difference histograms for texture classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 8, no. 1, pp. 118-125, 1986.
[46] L. K. Soh and C. Tsatsoulis, "Texture analysis of SAR sea ice imagery using gray level co-occurrence matrices," IEEE Trans. Geosci. Remote S., vol. 37, no. 2, pp. 780-795, 1999.
[47] Z. Xie, G. Liu, C. He, and Y. Wen, "Texture image retrieval based on gray level co-occurrence matrix and singular value decomposition," in Proc. ICMT, pp. 1-3, 2010.
[48] L. D. Lathauwer, B. D. Moor, and J. Vandewalle, "A multilinear singular value decomposition," SIAM J. Matrix Anal. Appli., vol. 21, no. 4, pp. 1253-1278, 2000.
[49] J. Wei, Y. Li, "Specific environmental sounds recognition using time-frequency texture features and random forest," in Proc. CISP, pp. 1705-1709, 2013.
[50] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[51] H. Pang, A. Lin, M. Holford, and B. E. Enerson, "Pathway analysis using random forests classification and regression," Bioinformatics, vol. 22, no. 16, pp. 2028-2036, 2006.
[52] K. L. Unella, L. B. Hayward, J. Scgal, and P. V. Eerdewegh, "Screening large-scale association study data: exploiting interactions using random forests," BMC Genetics, vol. 11, no. 5, pp. 32-37, 2004.
[53] Universitat Pompeu Fabra. Repository of sound under the Creative Commons license, Freesound.org [DB/OL]. http://www.freesound.org, 2012-5-14.
The above are preferred embodiments of the present invention. Any change made according to the technical solution of the present invention, whose resulting functional effect does not exceed the scope of the technical solution of the present invention, falls within the protection scope of the present invention.

Claims (3)

  1. A method for recognizing a sound event in a sound scene with a low signal-to-noise ratio, characterized by comprising the following steps. Step S1: training and generation of the random forest matrix: mix the known sound-event samples in the sound-event sample set with the known scene-sound samples in the scene-sound sample set to obtain a mixed sound-signal set, which is stored in the training sound set; generate the feature set of the training sound set from its sound signals through GLCM-HOSVD, and train on this feature set to generate a random forest matrix. Step S2: training and generation of the scene-sound-type discrimination random forest: perform GLCM-HOSVD on the known scene-sound samples in the scene-sound sample set to generate the feature set of the scene-sound sample set, and train on this feature set to generate the scene-sound-type discrimination random forest. Step S3: recognize the sound event under test:
    In the first step, the sound signal under test is decomposed by EMD into the scene sound and the sound event, and the SNR of the sound event under test is calculated. In the second step, the feature values of the scene sound under test and of the sound event under test are computed, and the feature values of the scene sound under test are input to the scene-sound-type discrimination random forest generated in step S2 to detect the type of the scene sound under test. In the third step, according to the detected scene-sound type and the SNR of the sound event under test, the random forest used for sound-event recognition is selected from the random forest matrix generated in step S1. In the fourth step, the feature values of the sound event under test are fed to the random forest selected in the third step to obtain the sound type.
2. The method for recognizing sound events in a low-signal-to-noise-ratio auditory scene according to claim 1, characterized in that the first step of step S3 is implemented as follows. The sound signal under test y(t) is passed through EMD, which, according to the characteristics of the signal itself, adaptively decomposes y(t) into a linear superposition of n levels of intrinsic mode functions, namely
    y(t) = Σ_{i=1}^{n} L_i(t) + r_n(t)   (1)
    where r_n(t) is the residue function and L_i(t), i = 1, …, n, are the n levels of intrinsic mode functions;
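A minimal sketch of the decomposition in Eq. (1), using cubic-spline envelopes and a fixed number of sifting iterations (a simplification of the usual stopping criteria; the PyEMD package offers a full implementation). By construction the extracted IMFs and the residue sum back to the input signal exactly.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def sift(x, n_sift=8):
    """Extract one IMF with a fixed number of sifting iterations (simplified)."""
    h = x.copy()
    t = np.arange(len(x))
    for _ in range(n_sift):
        maxi = np.where((h[1:-1] > h[:-2]) & (h[1:-1] > h[2:]))[0] + 1
        mini = np.where((h[1:-1] < h[:-2]) & (h[1:-1] < h[2:]))[0] + 1
        if len(maxi) < 4 or len(mini) < 4:      # too few extrema to fit envelopes
            break
        upper = CubicSpline(maxi, h[maxi])(t)   # upper envelope through maxima
        lower = CubicSpline(mini, h[mini])(t)   # lower envelope through minima
        h = h - (upper + lower) / 2             # subtract the local mean
    return h

def emd(y, n_levels=4):
    """Decompose y(t) into IMFs L_i(t) and a residue r_n(t), as in Eq. (1)."""
    imfs, r = [], y.copy()
    for _ in range(n_levels):
        imf = sift(r)
        imfs.append(imf)
        r = r - imf
    return np.array(imfs), r

fs = 1000
t = np.arange(fs) / fs
y = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
imfs, r = emd(y)
```

The first IMFs capture the fastest oscillations, which is why the method below discards level 1 as noise-dominated.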
    Among the n levels of intrinsic mode functions L_i(t), the level-1 function L_1(t) consists mainly of the noise component, with very little effective sound; the noise component is the scene-sound part and the effective sound component is the sound-event part. Therefore only the level-2 to level-6 intrinsic mode functions, i.e. i = 2, 3, …, 6, are selected for endpoint detection of the sound under test. Endpoint detection with the i-th level intrinsic mode function L_i(t) proceeds as follows:
    S311: preprocess the i-th level intrinsic mode function L_i(t):
    e_i(t) = |H{L_i(t)}| + L_i(t)   (2)
    where H{L_i(t)} denotes the Hilbert transform of the intrinsic mode function;
    S312: smooth e_i(t):
    E_i(t) = (1/σ) Σ_{τ=t−σ+1}^{t} e_i(τ)   (3)
    where σ is the smoothing window, taken as 0.05 times the sampling rate;
    S313: normalize E_i(t):
    F_i(t) = E_i(t) / max_t E_i(t)   (4)
    S314: compute the sound-event level S_level, the scene-sound level N_level, and the initial scene-sound level threshold T:
    S_level = mean[F_i(t)]   (5)
    N_level = β Σ F_i(t)   (6)
    T = α S_level   (7)
    where α and β are threshold parameters, taken as α = 4 and β = 0.25;
    S315: compute the mean of F_i(t) in the k-th window:
    M_i(k) = (1/W_d) Σ_{t=(k−1)W_d+1}^{kW_d} F_i(t)   (8)
    where k is the window index and W_d is the window length, taken as 0.02 times the signal sampling rate;
    S316: decide whether a sound event is present:
    a sound event is present in window k if M_i(k) > T, and absent otherwise   (9)
    If a sound event is present, jump to step S318;
    S317: dynamically estimate the scene sound and update the scene-sound level:
    N_level(n) = (1 − β) N_level(n − 1) + β M_i(k)   (10)
    where N_level(n) is the scene-sound level of the n-th window; after updating N_level(n), jump to step S319;
    S318: update the scene-sound level threshold:
    T = (1 − θ) T + θ M_i(k)   (11)
    where θ is a constant, taken as θ = 0.2;
    S319: if the scene-sound level threshold was updated in a previous iteration, update the sound-event level S_level:
    S_level = N_level + λ|T − N_level|   (12)
    where λ = 0.5 serves as the weight for the sound-event level update;
    S3110: set k = k + 1 and move the window; if the windows are not exhausted, jump to step S315, otherwise the loop ends;
    Processing the selected level-2 to level-6 intrinsic mode functions L_i(t) through steps S311 to S3110 yields five different endpoint-detection results, and the final endpoint-detection result is determined by voting among them;
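Steps S311–S3110 can be sketched as a single pass over one intrinsic mode function. The moving-average form of Eq. (3), the recursive forms of Eqs. (10) and (11), and the reading of |H{·}| as the magnitude of the Hilbert transform are working assumptions, since those formulas appear only as images in the source; the parameter values α, β, θ, λ and the window sizes are taken from the text.

```python
import numpy as np
from scipy.signal import hilbert

def detect_events(L_i, fs, alpha=4.0, beta=0.25, theta=0.2, lam=0.5):
    """Per-window event flags for one intrinsic mode function L_i(t)."""
    # S311, Eq. (2): |H{L_i}| + L_i, reading H as the Hilbert transform proper
    e = np.abs(np.imag(hilbert(L_i))) + L_i
    # S312, Eq. (3): moving-average smoothing, window sigma = 0.05 * fs (assumed form)
    sigma = max(1, int(0.05 * fs))
    E = np.convolve(e, np.ones(sigma) / sigma, mode="same")
    # S313, Eq. (4): normalize
    F = E / np.max(np.abs(E))
    Wd = max(1, int(0.02 * fs))             # window length, 0.02 * sampling rate
    S_level = np.mean(F)                    # Eq. (5)
    N_level = beta * np.sum(F[:Wd])         # Eq. (6), summed over one window (assumption)
    T = alpha * S_level                     # Eq. (7)
    flags, t_updated = [], False
    for k in range(len(F) // Wd):
        M = np.mean(F[k * Wd:(k + 1) * Wd])           # S315, Eq. (8)
        if M > T:                                     # S316, Eq. (9): event present
            flags.append(True)
            T = (1 - theta) * T + theta * M           # S318, Eq. (11) (assumed recursion)
            t_updated = True
        else:                                         # S317, Eq. (10) (assumed recursion)
            flags.append(False)
            N_level = (1 - beta) * N_level + beta * M
        if t_updated:
            S_level = N_level + lam * abs(T - N_level)  # S319, Eq. (12)
    return np.array(flags)

# demo: weak noise with a loud tonal burst in the middle
rng = np.random.default_rng(0)
fs = 1000
y = 0.01 * rng.standard_normal(fs)
y[450:550] += np.sin(2 * np.pi * 50 * np.arange(100) / fs)
flags = detect_events(y, fs)
```

Running this on IMF levels 2–6 and taking a majority vote over the five flag arrays gives the final endpoint decision described above.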
    After the sound signal y(t) has been separated into a sound-event segment s(t) and a scene-sound segment n(t), the signal energy is smoothed in order to estimate the signal-to-noise ratio more accurately. First, compute the scene-sound energy:
    P_n(t) = n²(t)   (13)
    Next, adjust the scene-sound energy:
    P_n(t) = mean(P_n) if P_n(t) > γ·mean(P_n)   (14)
    where the coefficient γ = 3; the purpose of this step is to correct sound-event segments that were misclassified into the scene-sound segment;
    Finally, compute the signal-to-noise ratio:
    SNR = 10 log₁₀[(Σ s²(t) − l Σ P_n(t)) / Σ P_n(t)]   (15)
    where l denotes the ratio of the length of the sound-event segment to that of the scene-sound segment; since the separated sound-event segment still contains scene-sound components that affect its energy value, l Σ P_n(t) is used as an estimate of that contribution, removing the influence of the scene sound on the energy value.
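Eqs. (13)–(15) can be sketched as below. The exact placement of the l·ΣP_n(t) correction inside Eq. (15) is an assumption, since that formula is shown only as an image in the source; Eqs. (13) and (14) follow the text directly.

```python
import numpy as np

def estimate_snr(s, n, gamma=3.0):
    """SNR of an event segment s(t) against a scene segment n(t), Eqs. (13)-(15)."""
    Pn = n ** 2                                # Eq. (13): scene-sound energy
    m = Pn.mean()
    Pn = np.where(Pn > gamma * m, m, Pn)       # Eq. (14): clip misclassified bursts
    l = len(s) / len(n)                        # event/scene segment length ratio
    num = np.sum(s ** 2) - l * np.sum(Pn)      # event energy minus scene leakage
    return 10 * np.log10(max(num, 1e-12) / np.sum(Pn))

rng = np.random.default_rng(1)
noise = 0.1 * rng.standard_normal(2000)
loud = np.sin(np.linspace(0, 40 * np.pi, 1000)) + 0.1 * rng.standard_normal(1000)
soft = 0.3 * np.sin(np.linspace(0, 40 * np.pi, 1000)) + 0.1 * rng.standard_normal(1000)
```

A louder event against the same scene segment yields a higher estimate, as expected.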
3. The method for recognizing sound events in a low-signal-to-noise-ratio auditory scene according to claim 1, characterized in that in steps S1 to S3 the features of the scene sound under test, of the sound event under test, of the training sound events, and of the known scene sounds are computed as follows.
    The GLCM can be expressed as
    P(i, j | d, θ) = #{((x, y), (x+Δx, y+Δy)) | f(x, y) = i, f(x+Δx, y+Δy) = j}   (16)
    where (x, y) are pixel coordinates in the spectrogram, with x+Δx ≤ M and y+Δy ≤ N; M×N is the image size; i, j = 0, 1, …, L−1, with L the number of gray levels of the image; and #{S} denotes the number of elements in the set S;
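Eq. (16) counts co-occurring gray-level pairs at a fixed pixel offset. A minimal sketch for a non-negative offset (Δx, Δy) follows; scikit-image's `graycomatrix` provides the same computation with distance/angle parameters and symmetrization and normalization options.

```python
import numpy as np

def glcm(img, dx, dy, levels):
    """P(i, j | d, theta) of Eq. (16) for a non-negative offset (dx, dy)."""
    P = np.zeros((levels, levels), dtype=np.int64)
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            P[img[y, x], img[y + dy, x + dx]] += 1   # count the pair (i, j)
    return P

img = np.array([[0, 0, 1],
                [1, 1, 0]])
P = glcm(img, dx=1, dy=0, levels=2)   # horizontal offset d=1, theta=0
```

Each valid pixel pair contributes one count, so the matrix total equals the number of in-bounds offset pairs.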
    An image region of size M×N with L gray levels is cropped from the spectrogram of the sound event; according to formula (16) and the chosen values of d and θ, the GLCMs are computed and assembled into a higher-order tensor
    A ∈ R^{I_1×I_2×…×I_N}
    Tensor unfolding (mode-n matricization) of A yields A_(n): the element a_{i_1 i_2 … i_N} of A is placed at row i_n and column j of a two-dimensional matrix of size I_n × (I_{n+1}×…×I_N×I_1×…×I_{n−1}), where
    j = Σ_{k=1, k≠n}^{N} (i_k − 1) J_k + 1
    with, when k > n,
    J_k = (Π_{m=k+1}^{N} I_m)·(Π_{m=1}^{n−1} I_m)
    and, when k < n,
    J_k = Π_{m=k+1}^{n−1} I_m
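The mode-n unfolding above, with columns ordered i_{n+1}, …, i_N, i_1, …, i_{n−1} (the De Lathauwer convention, assumed here since the index formulas appear only as images in the source), reduces to a transpose-and-reshape in NumPy:

```python
import numpy as np

def unfold(A, n):
    """Mode-n matricization A_(n): rows indexed by i_n, the remaining axes
    flattened in the order n+1, ..., N-1, 0, ..., n-1 (row-major)."""
    order = [n] + list(range(n + 1, A.ndim)) + list(range(n))
    return np.transpose(A, order).reshape(A.shape[n], -1)

A = np.arange(24).reshape(2, 3, 4)   # I1 = 2, I2 = 3, I3 = 4
A1 = unfold(A, 1)                     # shape (3, 8)
```

With this ordering, element A[i1, i2, i3] of the example lands at row i2, column i3·I1 + i1 of `A1`, matching the k > n and k < n weights above.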
    Applying singular value decomposition to A_(n) gives
    A_(n) = U^(n) Σ^(n) V^(n)H   (17)
    where U^(n) is a unitary matrix, Σ^(n) is a positive semi-definite diagonal matrix, and V^(n)H, the conjugate transpose of V^(n), is a unitary matrix; Σ^(n) is obtained from formula (17), and from Σ^(n) one obtains
    σ^(n) = [σ^(n)_1, σ^(n)_2, …, σ^(n)_{I_n}]   (18)
    The vectors σ^(1) … σ^(n) … σ^(N) are taken as the features of the sound event, namely
    F = [σ^(1), …, σ^(n), …, σ^(N)]   (19)
    where 1 ≤ n ≤ N, and σ^(n)_{i_n} denotes the i_n-th singular value of Σ^(n), with 1 ≤ i_n ≤ I_n;
    According to the above method of computing sound-event feature values, the features of the scene sound under test, of the sound event under test, of the training sound events, and of the known scene sounds can all be computed.
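Eqs. (17)–(19) concatenate the singular values of every mode-n unfolding into the GLCM-HOSVD feature vector; a self-contained sketch:

```python
import numpy as np

def hosvd_features(A):
    """Concatenate the singular values sigma^(n) of each unfolding A_(n), Eqs. (17)-(19)."""
    feats = []
    for n in range(A.ndim):
        order = [n] + list(range(n + 1, A.ndim)) + list(range(n))
        A_n = np.transpose(A, order).reshape(A.shape[n], -1)   # mode-n unfolding
        feats.append(np.linalg.svd(A_n, compute_uv=False))     # diag of Sigma^(n), Eq. (18)
    return np.concatenate(feats)                               # F of Eq. (19)

A = np.random.default_rng(2).normal(size=(3, 4, 5))
F = hosvd_features(A)
```

The feature length is I_1 + I_2 + … + I_N, and each mode's singular-value block preserves the Frobenius norm of the tensor, which makes the feature compact and stable under reordering within a mode.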
PCT/CN2015/077075 2015-03-30 2015-04-21 Method of recognizing sound event in auditory scene having low signal-to-noise ratio WO2016155047A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510141907.4A CN104795064B (en) 2015-03-30 2015-03-30 Method of recognizing sound events in an auditory scene with a low signal-to-noise ratio
CN2015101419074 2015-03-30

Publications (1)

Publication Number Publication Date
WO2016155047A1 true WO2016155047A1 (en) 2016-10-06

Family

ID=53559823

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/077075 WO2016155047A1 (en) 2015-03-30 2015-04-21 Method of recognizing sound event in auditory scene having low signal-to-noise ratio

Country Status (2)

Country Link
CN (1) CN104795064B (en)
WO (1) WO2016155047A1 (en)


Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105788603B (en) * 2016-02-25 2019-04-16 深圳创维数字技术有限公司 A kind of audio identification methods and system based on empirical mode decomposition
CN106653032B (en) * 2016-11-23 2019-11-12 福州大学 Based on the animal sounds detection method of multiband Energy distribution under low signal-to-noise ratio environment
CN109036461A (en) * 2017-06-12 2018-12-18 杭州海康威视数字技术股份有限公司 A kind of output method of notification information, server and monitoring system
CN108303738A (en) * 2018-02-05 2018-07-20 西南石油大学 A kind of earthquake vocal print fluid prediction method based on HHT-MFCC
CN111415653B (en) * 2018-12-18 2023-08-01 百度在线网络技术(北京)有限公司 Method and device for recognizing speech
CN110070895B (en) * 2019-03-11 2021-06-22 江苏大学 Mixed sound event detection method based on factor decomposition of supervised variational encoder
CN111951786A (en) * 2019-05-16 2020-11-17 武汉Tcl集团工业研究院有限公司 Training method and device of voice recognition model, terminal equipment and medium
CN110808067A (en) * 2019-11-08 2020-02-18 福州大学 Low signal-to-noise ratio sound event detection method based on binary multiband energy distribution
CN113593600B (en) * 2021-01-26 2024-03-15 腾讯科技(深圳)有限公司 Mixed voice separation method and device, storage medium and electronic equipment
CN113160844A (en) * 2021-04-27 2021-07-23 山东省计算中心(国家超级计算济南中心) Speech enhancement method and system based on noise background classification
CN113822279B (en) * 2021-11-22 2022-02-11 中国空气动力研究与发展中心计算空气动力研究所 Infrared target detection method, device, equipment and medium based on multi-feature fusion

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003003349A1 (en) * 2001-06-28 2003-01-09 Oticon A/S Method for noise reduction and microphone array for performing noise reduction
CN102081928A (en) * 2010-11-24 2011-06-01 南京邮电大学 Method for separating single-channel mixed voice based on compressed sensing and K-SVD
CN103474072A (en) * 2013-10-11 2013-12-25 福州大学 Rapid anti-noise twitter identification method by utilizing textural features and random forest (RF)
CN103745731A (en) * 2013-12-31 2014-04-23 安徽科大讯飞信息科技股份有限公司 Automatic voice recognition effect testing system and automatic voice recognition effect testing method
WO2014134472A2 (en) * 2013-03-01 2014-09-04 Qualcomm Incorporated Transforming spherical harmonic coefficients


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065034A (en) * 2018-09-25 2018-12-21 河南理工大学 A kind of vagitus interpretation method based on sound characteristic identification
CN109065034B (en) * 2018-09-25 2023-09-08 河南理工大学 Infant crying translation method based on voice feature recognition
WO2021021038A1 (en) 2019-07-30 2021-02-04 Aselsan Elektroni̇k Sanayi̇ Ve Ti̇caret Anoni̇m Şi̇rketi̇ Multi-channel acoustic event detection and classification method
US11830519B2 (en) 2019-07-30 2023-11-28 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi Multi-channel acoustic event detection and classification method

Also Published As

Publication number Publication date
CN104795064A (en) 2015-07-22
CN104795064B (en) 2018-04-13

Similar Documents

Publication Publication Date Title
WO2016155047A1 (en) Method of recognizing sound event in auditory scene having low signal-to-noise ratio
Sharan et al. An overview of applications and advancements in automatic sound recognition
WO2018107810A1 (en) Voiceprint recognition method and apparatus, and electronic device and medium
Dennis et al. Image feature representation of the subband power distribution for robust sound event classification
Uzkent et al. Non-speech environmental sound classification using SVMs with a new set of features
Dennis Sound event recognition in unstructured environments using spectrogram image processing
WO2020220440A1 (en) Gmm-hmm-based method for recognizing large-sized vehicle on expressway
Kong et al. Source separation with weakly labelled data: An approach to computational auditory scene analysis
Jancovic et al. Bird species recognition using unsupervised modeling of individual vocalization elements
Maxime et al. Sound representation and classification benchmark for domestic robots
CN108962229B (en) Single-channel and unsupervised target speaker voice extraction method
Mulimani et al. Segmentation and characterization of acoustic event spectrograms using singular value decomposition
CN111986699B (en) Sound event detection method based on full convolution network
Zhang et al. A pairwise algorithm using the deep stacking network for speech separation and pitch estimation
Salman et al. Machine learning inspired efficient audio drone detection using acoustic features
Wang et al. Audio event detection and classification using extended R-FCN approach
Prazak et al. Speaker diarization using PLDA-based speaker clustering
CN113707175A (en) Acoustic event detection system based on feature decomposition classifier and self-adaptive post-processing
Dennis et al. Analysis of spectrogram image methods for sound event classification
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Song et al. Research on scattering transform of urban sound events detection based on self-attention mechanism
Feki et al. Audio stream analysis for environmental sound classification
Stadelmann et al. Fast and robust speaker clustering using the earth mover's distance and Mixmax models
Arslan et al. Noise robust voice activity detection based on multi-layer feed-forward neural network
Zhang et al. Sparse coding for sound event classification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15887015

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15887015

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.04.2018)
