WO2016155047A1

WO2016155047A1 - Method of recognizing sound event in auditory scene having low signal-to-noise ratio

Info

Publication number: WO2016155047A1
Application number: PCT/CN2015/077075
Authority: WO
Inventors: 李应; 林巍
Original assignee: 福州大学
Priority date: 2015-03-30
Filing date: 2015-04-21
Publication date: 2016-10-06
Also published as: CN104795064A; CN104795064B

Abstract

A method of recognizing a sound event in an auditory scene having a low signal-to-noise ratio. The method comprises: combining a scene sound in a sound event to be tested and a sound event sample set, extracting a feature of sound data via gray level co-occurrence matrix (GLCM)-higher-order singular value decomposition (HOSVD), and generating random forest (RF) data for recognizing the sound event to be tested. The RF data generated in this method can recognize, in a specific scene, the sound event in the auditory scene having a low signal-to-noise ratio. The method maintains a recognition rate of the sound event that exceeds 73% of average precision when the signal-to-noise ratio of the sound event and the scene sound is -5dB.

Description

Sound event recognition method in low signal-to-noise ratio sound scene

Technical field

The invention relates to a method for identifying sound events in a low signal-to-noise ratio sound scene that can effectively improve the recognition rate under a low signal-to-noise ratio in various sound scenes.

Background technique

Recently, sound event detection (recognition) has attracted widespread attention. Sound event detection is for audio forensics [1], environmental sound recognition [2], biological sound monitoring [3], acoustic scene analysis [4], environmental security monitoring [5], real-time military focus detection [6], location tracking Harmony source classification [7], patient monitoring [8-12], abnormal event monitoring [13-18], fault diagnosis, and submission of key information for early maintenance [21, 22] are all of great significance. Detect (recognize) sound events in the sound scene, and try to identify the real events hidden in them in the audio data.

Due to different environments, the sound scenes coexisting with sound events are also different, and often appear in unstable forms. Therefore, it is still a challenging task to effectively identify sound events in various acoustic scenarios, especially at low signal-to-noise ratios. Related work has been studied [23-40]. These studies mainly include the extraction of sound signal features and the classification and recognition of these features. For feature extraction, there are two common effective methods, namely 1) the combination of time and frequency, and 2) the spectrogram and its related features. Regarding the features of the combination of time and frequency, there are mainly time, frequency features, and wavelet domain features [23], features extracted by Gabor dictionary matching pursuit algorithm [24,25], filtering based on Wavelet Packets [26], The extended features of high-pass filtering and MFCC [27], decomposed into multiple intersecting super frames, and proposed based on random regression forest [28]. Regarding the extremely relevant features of the spectrogram, there are mainly subband power distribution (SPD), local spectral feature (Local Spectrogram Feature, LSF), Gabor transform, Cosine Log Scattering (CLS), etc.[ 29-40]. For the classification of sound events and scene sounds, common effective methods include Support Vector Machine (SVM)[24,29,32,37,40], Gaussian Mixture Model (GMM)[23,31 ,39], k-nearest neighbor (k-NN) [30, 34], Kernel Fisher Discriminant Analysis (KFDA) [29,38], Generalised Hough transform (Generalised Hough Transform, GHT) voting [19], GMM combined with Hidden Markov Model (HMM) [35], Maximum Likelihood (ML) [36].

These methods have achieved certain results in the recognition of sound events. However, the feature extraction process all have different degrees of impact on the characteristics of the sound event, that is, the structure of the sound signal to be measured. Although the spectral mask estimation algorithm used for feature missing can effectively remove the features of the sound event disturbed by the scene sound [34], it also masks some of the features of the sound event. In the case of white noise, the method of short-term estimation of feature concealment range [41] is easy to filter out most of the sound event features, and the recognition effect is very poor. Spectral subtraction [42] processes signals in all frequency bands and inevitably destroys the characteristics of sound events. Although the multi-band spectral subtraction [43] has made improvements to the spectral subtraction, there are still cases where the characteristics of the sound event are destroyed.

In order to avoid the influence on the signal structure of the sound event while suppressing the scene sound, so as to obtain a higher recognition rate under the low signal-to-noise ratio, this paper proposes to train the classifier with the sound of the scene sound mixed with the sound event. In the training process of the classifier model, the scene sounds are superimposed with the sound events according to different signal-to-noise ratios, and the sound data of the sound events in various sound scenes are obtained, and the classifier is trained. In the detection process, the boundary point between the sound event and the scene sound is detected through the Empirical Mode Decomposition (EMD) [44] in the Hilbert-Huang transform (HHT) transformation. According to the detected boundary points between the sound event and the scene sound, the signal-to-noise ratio of the sound event and the type of scene sound are estimated. Thus, using the signal-to-noise ratio interval and the type of scene sound, a classifier is selected to identify the sound event in the sound data.

For the signal characteristics of various sound events and scene sounds, this paper summarizes related literature [45-48] and existing work [49], using the gray level co-occurrence matrix (GLCM) of the spectrogram and Higher-Order Singular Value Decomposition (HOSVD) extracts the characteristics of the sound signal. For sound events To As for the classification and recognition of scene sounds, we use Random forests Matrix (RFM), Random forests [50] (Random forests, RF) and Multi Random forests (M-RF).

The process of identifying the sound event is to use the signal-to-noise ratio range and the scene sound type to select RF from the RFM, and use the selected RF to identify the sound event. In real-time sound event detection, we use scene sound data and sound event sample sets to train RF or M-RF to identify sound events in the scene.

Summary of the invention

The purpose of the present invention is to provide a method for identifying sound events in a low signal-to-noise ratio sound scene that can effectively improve the recognition rate under a low signal-to-noise ratio in various sound scenes.

In order to achieve the above objective, the technical solution of the present invention is: a method for identifying sound events in a low signal-to-noise ratio sound scene, including the following steps

Step S1: Training and generation of the random forest matrix: sound mixing the known sound event samples in the sound event sample set and the known scene sound samples in the scene sound sample set to obtain a mixed sound signal set and store it in the training sound set. Generating a feature set of the training sound set through GLCM-HOSVD from the sound signals in the training sound set, and training the feature set of the training sound set to generate a random forest matrix;

Step S2: The training and generation of the random forest for discriminating the type of scene sound: GLCM-HOSVD is performed on the known scene sound samples in the scene sound sample set to generate the feature set of the scene sound sample set, and perform GLCM-HOSVD on the feature set of the scene sound sample set Training to generate a random forest to discriminate the sound type of the scene;

Step S3: Recognizing the sound event to be measured:

In the first step, the sound signal to be measured is decomposed into scene sounds and sound events through EMD, and the signal-to-noise ratio of the sound event to be measured is calculated;

The second step is to calculate the characteristic values of the scene sound to be measured and the sound event to be measured, and input the characteristic values of the scene sound to be measured into the scene sound type discrimination random forest generated in step S2, and detect the sound type of the scene to be measured ；

The third step is to select a random forest for sound event recognition from the random forest matrix generated in step S1 according to the sound type of the scene to be measured and the signal-to-noise ratio of the sound event to be measured;

In the fourth step, the characteristic value of the sound event to be measured is identified through the random forest selected in the third step to obtain the sound type.

In the embodiment of the present invention, the specific implementation process of the first step in step S3 is as follows:

Pass the measured sound signal y(t) through the EMD, and the EMD can adaptively divide the measured sound signal y(t) into the linear superposition of n-level inherent modal functions according to the characteristics of the signal itself, namely

Among them, r _i (t) is the residual function, and _Li (t) is the n-level intrinsic modal function;

Among the n-level inherent modal functions _Li (t), the first-level inherent modal function L ₁ (t) mainly contains noise components, with very few effective sound components. The noise components are the sound part of the scene, and the effective sound components are the sound Event part; therefore, we only select the intrinsic modal function of level 2-6, that is, take i=2,3,...,6 for the detection of the sound endpoint under test; use the i-th intrinsic modal function _Li (t ) The process of detecting the sound endpoints to be tested is as follows:

S311: Perform preprocessing on the i-th inherent modal function _Li (t)

e _i (t)=|H{L _i (t)}|+L _i (t) (2)

Among them, H{L _i (t)} represents the Hilbert transformation of the intrinsic mode function;

S312: _{Smooth e i} (t)

Among them, σ is the smoothing window, which is 0.05 times the sampling rate;

S313: _{Normalize E i} (t)

S314: Calculate the sound event level S _level , the scene sound level N _level, and initialize the scene sound level threshold T

Slevel=mean[F _i (t)] (5)

N _level = βΣF _i (t) (6)

T=αS _level (7)

Among them, α and β are threshold parameters, and α=4, β=0.25;

S315: Calculate _{the average value of F i} (t) in the k-th window

Among them, k is the window index, W _d is the window length, and the signal sampling rate is 0.02 times;

S316: Judge whether there is a sound event

If the sound event exists, skip to step S318;

S317: Dynamically estimate the scene sound and update the scene sound level

Wherein, N _level (n) is the scene sound level of the nth window, and jump to step S319 after _{updating the scene sound level N level (n);}

S318: Update the scene sound level threshold

Among them, θ is a constant, and θ=0.2;

S319: If the scene sound level threshold has been updated in the previous cycle, update the sound event level S _level

S _level ＝N _level +λ|TN _level | (12)

Among them, λ=0.5, which is used as the weight for updating the level of the sound event;

S3110: k=k+1, move the window, if the window does not end, jump to step S315, otherwise the loop ends;

2-6 level selected intrinsic mode function L _i (t) by the above-described processes S3110 to step S311, to give five different endpoint detection result, and then determines the final endpoint detection result of a vote;

After separating the sound signal y(t) into the sound event segment s(t) and the scene sound segment n(t), in order to be able to estimate the signal-to-noise ratio more accurately, we smooth the signal energy and first calculate the scene sound energy:

P _n (t)=n ² (t) (13)

Second, adjust the sound energy of the scene

P _n (t)=mean(P _n ) if P _n (t)>γmean(P _n ) (14)

Among them, the coefficient γ=3, the purpose of this process is to adjust the misclassified sound event segment in the scene sound segment;

Finally calculate the signal-to-noise ratio

Among them, l represents the ratio of the length of the sound event segment to the scene sound segment. Since the separated sound event segment contains scene sound components, it has an impact on the energy value of the sound event segment. Therefore, lΣP _n (t) is used as the influence It is estimated that the influence of the sound of the scene on the energy value is eliminated.

In the embodiment of the present invention, in the steps S1 to S3, the calculation methods for the characteristics of the scene sound to be measured, the characteristics of the sound event to be measured, the characteristics of the training sound event, and the characteristics of the known scene sound are as follows:

GLCM can be expressed as:

P(i,j|d,θ)=#{(x,y),(x+Δx,y+Δy)|f(x,y)=i, f(x+Δx,y+Δy)=j } (16)

Among them, x, y represent the pixel coordinates in the spectrogram, and x+Δx≤M, y+Δy≤N, M×N represents the size of the image; i,j=0,1,...,L-1,L Is the number of gray levels of the image, #{S} represents the number of elements in the set S;

Intercept the image area in the spectrogram of the sound event whose size is M×N and the gray level is L. According to formula (16) and the values of d and θ, calculate the GLCM, and combine each GLCM into a high-order matrix A ,

Perform tensor expansion on the higher-order matrix A to obtain A _(n) ,

Among them, the elements of A

Placed in _{the i n} row and j column of a two-dimensional matrix of _{size I n} ×(I _n+1 ×…I _N ×I ₁ ×…I _{n-1 ), here,}

When k>n,

When k<n,

Perform singular value decomposition on A _{(n) to get}

Among them, U ⁽ⁿ⁾ is a unitary matrix; Σ ⁽ⁿ⁾ is a positive semi-definite diagonal matrix; and V ^(n)H , the conjugate transpose of V, is a unitary matrix; Σ ^{(n) is} obtained according to formula (17), According to Σ ⁽ⁿ⁾ , we can get

Take σ ⁽¹⁾ …σ ⁽ⁿ⁾ …σ ^(N) as the characteristics of sound events, namely

Among them, 1≤n≤N;

It represents Σ ⁽ⁿ⁾ of singular values i _n, 1≤i _{_n} ≤I _n;

According to the above-mentioned method of calculating the feature value of the sound event, the characteristics of the sound event to be measured, the characteristics of the sound event to be measured, the characteristics of the training sound event, and the characteristics of the known scene sound can be calculated.

Compared with the prior art, the present invention has the following beneficial effects:

1. Propose a random forest matrix (RFM): mix sound events with various environmental sounds with different signal-to-noise ratios, and use the mixed sounds to train the sound events as a classifier;

2. Propose a method of using EMD and voting on the Intrinsic Mode Function (IMF) to detect scene sounds and sound event endpoints and estimate the signal-to-noise ratio: the multi-level intrinsic mode function is used to determine the signal-to-noise ratio in the sound data. The boundary points of scene sounds and sound events are detected, the final boundary point detection results are determined by voting, and the signal-to-noise ratio of the sound event is estimated;

3. Propose GLCM-HOSVD features: convert the spectrogram into a gray-level co-occurrence matrix (GLCM), and obtain the eigenvalues of the sound signal by performing high-order singular value decomposition (HOSVD) on GLCM;

4. Use Random Forest Matrix (RFM) to identify sound events in different scenes and signal-to-noise ratios: select the corresponding random forest to identify sound events through the scene sound type and the signal-to-noise ratio of the sound event;

5. Propose random forest (RF) and multi-random forest (M-RF) to recognize sound events in real time: mix real-time scene sounds with sound events in the sound event sample set, and train RF or M-RF for real-time sound event recognition .

Description of the drawings

Figure 1 shows the spectrogram GLCM-HOSVD.

Figure 2 shows the EMD+GLCM-HOSVD+RFM architecture diagram of sound event recognition in various sound scenarios.

Figure 3 is an EMD+GLCM-HOSVD+RF architecture diagram for real-time recognition of sound events in a sound scene.

Figure 4 is an EMD+GLCM-HOSVD+M-RF architecture diagram for real-time recognition of sound events in a sound scene.

Figure 5 shows the results of endpoint detection in different sound scenes of 0db. Figure 5(a) Pure sound, Figure 5((b) Wind sound scene, Figure 5(c) Rain sound scene, Figure 5((d) Gaussian white noise , Figure 5 ((e) road sound scene, Figure 5 ((f) airport sound scene).

Figure 6 shows the positional relationship between pixel pairs in the gray level co-occurrence matrix.

Fig. 7 is an example of gray-level co-occurrence matrix GLCM generation, Fig. 7(a) 4×5 gray-scale image, Fig. 7(b) _{GLCM when d 1} =1 and θ=0°, Fig. 7(c) d ₁ ＝ 1. _{An 8×8×8 third-order matrix formed by d 2} = 2 and θ=0°, 45°, 90°, and 135°.

Figure 8 is the basic schematic diagram of random forest.

Figure 9 is a diagram of the recognition results of two texture feature extraction methods in different scenes and different signal-to-noise ratios. Figure 9(a) highway scene,

Figure 9 (b) wind sound scene, Figure 9 (c) flowing water scene, Figure 9 (d) rain sound scene, Figure 9 (e) airport sound scene, Figure 9 (f) Gaussian white noise.

Figure 10 shows the average recognition results of EMD+M-RF, EMD+RF and pRF in 6 scenarios.

Figure 11 is a comparison diagram of recognition rate between EMD+GLCM-HOSVD+M-RF and MP-feature, Figure 11(a) highway scene, Figure 11(b) wind sound scene, Figure 11(c) flowing water scene, Figure 11(d) Rain scene, Figure 11(e) airport noise, Figure 11(f) Gaussian white noise.

Figure 12 is a comparison diagram of EMD+GLCM-HOSVD+M-RF and SPD under low signal-to-noise ratio, Figure 12(a) highway scene,

Figure 12(b) Wind sound scene, Figure 12(c) Flowing water scene, Figure 12(d) Rain sound scene, Figure 12(e) Airport noise, Figure 12(f) Gaussian To White Noise.

Figure 13 is a graph of the average recognition rate of the three methods of EMD+GLCM-HOSVD+M-RF, MP-feature and SPD under low signal-to-noise ratio.

Detailed ways

The technical solution of the present invention will be described in detail below in conjunction with the accompanying drawings.

The invention provides a method for recognizing sound events in a low signal-to-noise ratio sound scene, including the following steps:

Step S3: Recognizing the sound event to be measured:

The following specifically describes the implementation process of the method of the present invention.

1. Sound event recognition model

This part introduces the architecture of sound event recognition based on GLCM-HOSVD in various sound scenes with low signal-to-noise ratio. Among them, the process of passing the sound signal through GLCM-HOSVD to generate the characteristic value w is shown in Figure 1. The process of GLCM-HOSVD is to convert the sound signal into a spectrogram, calculate the GLCM of the spectrogram, and obtain the characteristic value w of the sound signal by performing HOSVD on the GLCM. The feature value w that needs to be involved in this application includes the feature set W ^l of the training sound set in training figure 2 and W ^{s in} figure 3 and W ^sh , W ^s , and W ^{sl in} figure 4, which are known limited kinds of scenes. The characteristic value W ⁿ ^{of the sound, the characteristic value w t} of the scene sound in the sound to be tested, ^{and the characteristic value w e of} the sound event to be measured.

Figure 2 shows the architecture for recognizing sound events in various sound scenarios. This architecture is called EMD, GLCM-HOSVD and RFM architecture. Related content includes: 1) Random forest matrix RFM training and generation part, as shown in the dashed box in Figure 2; 2) Scene sound type discrimination Random forest RF _n training and generation part, as shown in the dotted line box in Figure 2. ; 3) The recognition part of the sound event to be tested, as shown in the half-lined frame part in Figure 2.

Among them, the random forest matrix training and generation part includes sound event samples, scene sound samples, sound mixing, training sound set, GLCM-HOSVD and random forest matrix RFM. Sound event samples, storing various types of known sound event samples. Scene sound samples, store S kinds of known types of scene sound samples. Sound mixing, which superimposes various known sound event samples and S scene sound samples according to N different signal-to-noise ratios to generate S×N types of mixed sound signals of S sound scenes and N different signal-to-noise ratios Set and stored in the training sound set. GLCM-HOSVD performs GLCM-HOSVD on the sounds in the training sound set to generate the feature set of the sound training set

Among them, M is the number of sound samples. RFM trains the S×N feature sets W1 to generate an S×N random forest matrix.

Scene sound type discrimination random forest training and generation part, GLCM-HOSVD is performed on the scene sound samples to generate scene sound characteristics w ⁿ . Sample set of sound characteristics of the scene

Perform training to generate a scene sound type discrimination random forest RF _n , where P is the number of scene sound samples.

In the sound event recognition part to be measured, the sound signal y(t) to be measured is passed through EMD to decompose the scene sound part n(t) and the sound event part s(t). ^{Calculate the characteristic value w t of the} sound of the scene to be tested. The input scene sound type of ^{w t is} _{judged by random forest RF n} _{, and the sound type l t} of the scene to be tested is detected. Through the scene sound n(t) and the sound event s(t), the signal-to-noise ratio l _{s of} the sound event to be measured is calculated. According to the scene sound type l _t and the signal-to-noise ratio l _{s in} _{the sound event to be tested, a random forest RF s,n} for sound event recognition is selected from the random forest matrix. ^{Calculate the characteristic value w e of} the sound event to be measured, and use w ^e to identify the sound event through the random forest RF _{s,n to} obtain the type l. For the real-time recognition process of sound events in the acoustic scene, we simplified the architecture of Figure 2 to obtain the EMD, GLCM-HOSVD and RF architectures shown in Figure 3.

In the real-time test, only the scene sound n(t) after EMD segmentation needs to be directly compared with the M in the sound event sample library according to the _{signal-to-noise ratio l s of the scene sound n(t) and the sound event s(t).} A variety of sound events are mixed, and the mixed sound set is GLCM-HOSVD to generate a feature set

Use W ^{s to} establish RF _s . The established RF _{s is} used to identify ^{the characteristic value w e of the} sound event s(t) to be measured.

In general, there is a deviation in the detection of the signal-to-noise ratio of the sound event in the sound signal. Especially when the signal-to-noise ratio is low, if the estimation of the signal-to-noise ratio is biased, the trained RF classifier may not be able to accurately detect the sound event. Therefore, we further expand the architecture composed of EMD, GLCM-HOSVD and RF in Figure 3 into the sound event recognition architecture composed of EMD, GLCM-HOSVD and M-RF shown in Figure 4. For the sound signal, we use the two signal-to-noise ratios l _sh and l _sl (l _sh >l _s >l _sl ) with similar values of the actual detected signal-to-noise ratio l _s at the same time, and mix them with the sound event samples to form three sets of sounds. set. Pass the three sets of mixed sound sets through GLCM-HOSVD to generate

with

Train three RF classifiers RF _sh , RF _s , and RF _{sl respectively} . When recognizing sound events, RF _sh , RF _s , and RF _sl are used to recognize the sound events. Finally, all decision trees in the three random forests are voted to determine the recognition result.

2. Low signal-to-noise ratio sound event recognition

This part includes the empirical modal decomposition of the sound data, the detection of the end points of the sound event, and the calculation of the signal-to-noise ratio between the sound event and the scene sound in the sound data. Convert the sound data into a spectrogram, and calculate the GLCM of the sound data. Perform HOSVD on GLCM to generate feature w. Use feature set W to train to generate a random forest matrix, and use random forest pairs to identify sound events in the sound data.

A. Sound event endpoint detection and signal-to-noise ratio estimation

First, we use empirical mode decomposition to detect the sound event endpoint, and then estimate the signal-to-noise ratio based on the scene sound and the sound event endpoint.

EMD is the core of HHT transformation [44]. EMD can adaptively divide the original signal y(t) into the linear superposition of n-level IMF according to the characteristics of the signal itself, namely

Among them, r _i (t) is the residual function.

Among the n-level natural modal functions _Li (t), the first-level natural modal function L ₁ (t) mainly contains noise components, with very few effective sound components. Therefore, we only select 2-6 intrinsic modal functions, that is, i=2,3,...,6, which is used to detect the foreground sound endpoints. The process of using the i-th intrinsic mode function _Li (t) to detect the foreground sound endpoint is as follows.

1) Preprocessing of the i-th inherent modal function _Li (t)

e _i (t)=|H{L _i (t)}|+L _i (t)(2)

Among them, H{L _i (t)} represents the Hilbert transformation of the intrinsic mode function.

2) _{Smooth e i} (t).

Among them, σ is the smoothing window, and here is 0.05 times the sampling rate.

3) _{Normalize E i} (t).

4) Calculate the sound event level S _level , the scene sound level N _level and the initial scene sound level threshold T

S _level = mean[F _i (t)] (5)

N _level = βΣF _i (t) (6)

T=αS _level (7)

Among them, α, β are threshold parameters, and α=4, β=0.25.

5) Calculate _{the average of F i} (t) in the kth window

Among them, k is the window index, W _d is the window length, and the signal sampling rate is 0.02 times.

6) Judge whether there is a sound event.

If the sound event exists, skip to step 8).

7) Dynamically estimate the scene sound, and update the scene sound level N _level .

Among them, N _level (n) is the scene sound level of the nth window. After updating the scene sound level N _level (n), skip to step 9).

8) Update the scene sound level threshold.

Among them, θ is a constant, and θ=0.2. To

9) If the threshold has been updated in the previous cycle, update the sound event level S _level :

S _level ＝N _level +λ|TN _level | (12)

Among them, λ=0.5, which is used as the weight for updating the level of the sound event.

10) k=k+1, move the window. If the window is not over, skip to step 5), otherwise the loop ends.

The selected 2-6 level intrinsic modal function _Li (t) is processed by the above steps, and 5 different endpoint detection results can be obtained, and then the final endpoint detection result is determined by voting.

In Figure 5, the blue part is the sound signal waveform, the red part is the endpoint detection result, the high bit indicates that the sound event is included, and the low bit indicates that only the scene sound is included. (b), (c), (d), (e), (f) are waveform diagrams and sound event endpoint detection results with a signal strength of 0db in various acoustic scenarios. From the above figures, it can be seen that this method can basically detect the sound segment of the sound event under 0db.

After separating the sound signal y(t) into the sound event segment s(t) and the scene sound segment n(t), in order to be able to estimate the signal-to-noise ratio more accurately, we smooth the signal energy. First calculate the sound energy of the scene:

P _n (t)=n ² (t) (13)

Secondly, adjust the sound energy of the scene:

P _n (t)=mean(P _n ) if P _n (t)>γmean(P _n ) (14)

The coefficient γ=3. The purpose of this process is to adjust the misclassified sound event segment in the scene sound segment.

Finally calculate the signal-to-noise ratio:

Among them, l represents the ratio of the length of the sound event segment to the scene sound segment. Since the separated sound event segment contains scene sound components, it affects the energy value of the sound event segment. Therefore, lΣP _n (t) is used as the estimation of the influence, and the influence of the scene sound on the energy value is eliminated. Due to the error in endpoint detection, there is a certain error in the calculation of the signal-to-noise ratio. Therefore, in order to match the corresponding classifier model, the SNR calculation result is in the interval of (-6,-0.5),[-0.5,2.5),[2.5,7.5),[7.5,15),[15,25)dB Inside, the -5dB, 0dB, 5dB, 10dB, and 20dB classifier models are used to identify sound events.

B. GLCM of sound signal

In this part, we calculate the GLCM of the spectrum S(f,t) of each sound segment.

Here, GLCM refers to the joint probability distribution of two pixels that are separated by (Δx, Δy) in the spectrogram, and gray levels are i and j, respectively. The specific ranges of Δx and Δy are determined by two parameters: pixel The spacing d and the matrix generation direction θ[46], and satisfy Δx=dcosθ, Δy=dsinθ, as shown in FIG. 6. GLCM can be expressed as:

P(i,j|d,θ)=#{(x,y),(x+Δx,y+Δy)|f(x,y)=i, f(x+Δx,y+Δy)=j } (16)

Among them, x, y represent the pixel coordinates in the spectrogram, and x+Δx≤M, y+Δy≤N, M×N represents the size of the image; i,j=0,1,...,L-1,L Is the number of gray levels of the image, #(S) represents the number of elements in the set S. When d and θ are determined, P(i,j|d,θ) can be abbreviated as P(i,j). To

There are three main factors that affect the performance and computational complexity of GLCM: the number of gray levels L, the pixel pitch d, and the direction θ. According to the experiment, we choose L=8, d=1, 2, θ=0°, 45°, 90°, 135° in this article.

Figure 7 shows an example of GLCM generation. Figure 7(a) is an image area with a size of 4×5 and a gray level of 8 cut from the spectrogram. Figure 7(b) shows the GLCM corresponding to the image area when d=1 and θ=0°, namely A ₁ . As shown in Figure 7(a), from left to right in the horizontal direction, the gray pair (4,6) appears twice, so the value of the fourth row and sixth column of GLCM in Figure 7(b) should be 2. , That is, according to (16), P(4,6|1,0°)=2. As shown in Figure 7(a), the gray pair (0,1) appears twice from left to right, so the value of row 0 and column 1 of GLCM in Figure 7(b) is 2, that is, P(0 , 1|1, 0°)=2.

Similarly, for d ₁ =1 and θ=0°, 45°, 90°, 135°; d ₂ = 2 and θ=0°, 45°, 90°, 135°, we can get another 7 GLCMs, A ₂ ,…,A ₈ . We put these eight matrices into an 8×8×8 third-order matrix, as shown in Figure 7(c).

C.HOSVD

In order to further extract the characteristics of the sound event, we put the third-order matrix in Figure 7(c) into HOSVD.

Here we first review the singular value decomposition [47]. For any matrix M of size m×n,

Then there is a decomposition such that

M=UΣV ^* (17)

Among them, U: is a unitary matrix of order m×m, Σ is a positive semi-definite m×n order diagonal matrix, and V ^* , the conjugate transpose of V, is a unitary matrix of order n×n.

For a high-order matrix A of _{size I 1} ×I ₂ ×…I _N,

The tensor of A can be expanded [34] to get A _(n)) ,

Among them, put the element of A

It is placed in _{i n} rows and j columns of a two-dimensional matrix of _{size I n} ×(I _n+1 ×...I _N ×I ₁ ×...I _{n-1 ).} Here,

When k>n,

When k<n,

Similar to singular value decomposition, perform singular value decomposition on A _(n),

Among them, U ⁽ⁿ⁾ is a unitary matrix; Σ ⁽ⁿ⁾ is a positive semi-definite diagonal matrix; and V ^(n)H , the conjugate transpose of V, is a unitary matrix. We can get Σ ⁽ⁿ⁾ . According to Σ ⁽ⁿ⁾ , we can get

We take σ ⁽¹⁾ …σ ⁽ⁿ⁾ …σ ^(N) as the characteristics of sound events, namely

Among them, 1≤n≤N;

Represents the i _n th singular value of ^{Σ (n)} _{, 1 ≤ i n} ≤ I _n .

Taking Figure 7(c), the 8×8×8 third-order matrix as an example, it can be expressed as

Among them, I ₁ =8, I ₂ =8, and I ₃ =8. then

Unfold A along I _one dimension to get A ₍₁₎ ,

which is

Similarly, it _{expands along the I 2} and I ₃ dimensions, and A ₍₂₎ and A ₍₃₎ can be obtained.

Therefore, according to (18) and (19), Σ ⁽ⁿ⁾ , namely Σ ⁽¹⁾ , Σ ⁽²⁾ , Σ ⁽³⁾ and σ ⁽¹⁾ , σ ⁽²⁾ and σ ^{(3) can be obtained} .

Among them, On _,m represents a zero matrix with a size of n×m. σ ⁽¹⁾ = [6.31, 5.24, 5.01, 3.08, 2.71, 2.12, 1.91, 1.27]. Similarly, σ ⁽²⁾ ^{= [6.26, 5.66, 4.60, 3.31,2.77, 2, 1.69, 1], σ (3)} = [6.51, 5.65, 4.43, 3.10, 2.46, 2.16, 1.68, 1.36] can be obtained. Finally, according to (20), σ ⁽¹⁾ , σ ⁽²⁾ and σ ^{(3) are} combined into w as the feature of sound event recognition.

w=[6.31 5.24 5.01 3.08... 3.10 2.46 2.16 1.68 1.36].

^{In the same way, we can obtain the characteristics w l} of the training sound event described in Part II, the characteristics of the known scene sound w ⁿ , the characteristics of the scene sound in the sound to be tested w ^t and the characteristic w of the sound event to be tested. ^e . For a sound set containing M sound events, we can obtain a feature set W={w ₁ ,...,w _M }. Through the feature set W, we can further train the random forest.

D. RF identification

Random forest is an ensemble classifier algorithm that uses multiple decision tree classifiers to discriminate data [49-52]. The principle is shown in Figure 7, that is, through self-service re-sampling (Bootstrap) technology, self-service re-sampling from the feature set of the original training sample to generate new k training data sets. Then these newly generated k training data sets are grown into k/decision trees according to the decision tree construction method, and they are combined to form a forest. The discriminant result of the test data is determined by the scores formed by voting of k trees in the forest.

The process of using random forest to identify unknown test samples is as follows. First, as shown in Figure 2 or Figure 3 and Figure 4, the feature w ^t ^{of the scene sound in the sound to be tested or the feature w e of} the sound event to be tested is placed at the root node of all k decision trees in the random forest. Then it is passed down according to the classification rules of the decision tree until it reaches a certain leaf node. The corresponding class label of this leaf node is the vote made by ^{this decision tree on the class l of the feature w t} or w ^e. The k decision trees of the random forest all ^{vote for the category l of w t} or w ^e , and count these k votes. The one with the most votes becomes the category label l corresponding to ^{w t} or w ^e.

Three, experiment

A. Sound event sample set

The 40 kinds of pure bird sounds used in the experiment came from the Freesound sound database [43]. There are 30 samples of each kind of bird sounds, and a total of 1200 samples. The six scene sounds used in the experiment are Gaussian white noise, busy highway scene sound, flowing water scene sound, airport scene sound, rain scene sound, and wind scene sound. Among them, Gaussian white noise is a random signal with a mean value of 0, a power spectral density of 1 and a uniform distribution randomly generated by a computer, and it is obtained by whitening. The other scene sounds are recorded in the corresponding sound scenes at a sampling frequency of 44.1kHz. In order to standardize the encoding format and length of the above sound files, they are uniformly converted into mono WAV format sound clips with a sampling frequency of 8kHz, a sampling accuracy of 16bit, and a length of 2s.

B. Experimental setup

First, the basic experiment.

1) Comparison between GLCM-HOSVD and GLCM-SDH (Sum and difference histograms) [45,49]. GLCM-HOSVD and GLCM-SDH are used to extract the features of sound signals with different signal-to-noise ratios including sound events, and random forest (for identification) is used. Compare the performance of GLCM-HOSVD and GLCM-SDH for sound event recognition in 6 different sound scenes.

1) The key experiment in this article. The process is shown in Figure 2, using an architecture composed of EMD, GLCM-HOSVD and RFM. The content includes: a) Random forest matrix RFM training and generation part; b) Scene sound type discrimination random forest RF _n training and generation part; c) Sound event recognition part to be tested.

Forest matrix RFM training and generation. 40 kinds of sound event samples (20 each) and 6 kinds of scene sound samples are mixed with 5 signal-to-noise ratios of 20, 10, 5, 0, -5dB to generate 6×5=30 A mixed sound set. Perform GLCM-HOSVD on the mixed sound to generate a feature set. Train with 30 feature sets and generate 6×5 random forest matrix RFM.

2) Random forest RF _n for discriminating the sound types of scenes, and establishing a random forest for discriminating 6 sound types of scenes. Select RF _s,n from RFM, and use RF _s,n to identify the corresponding sound scene and 5 kinds of signal-to-noise ratios to be measured.

3) Compare EMD, GLCM-HOSVD and RF architecture with the pRF method of pure sound event training. The pRF method is a random forest trained with 40 simple sound events in the sound event sample library. The RF _s,n method is to use the signal-to-noise ratio l _{s of} the sound event to be measured and the type of scene sound l _{t in} Figure 2 to select the matching RF _s,n from the RFM, and use the selected RF _s,n to be the sound event to be measured Identify it. In the real-time detection of sound scene determination, we simplified the architecture of Figure 2 into the EMD, GLCM-HOSVD and RF architectures of Figure 3, that is, identification _{through RF s.}

4) The detection performance of EMD, GLCM-HOSVD and RF architecture is based on the above experimental results to make practical improvements to the proposed method. That is, in the specific application, the EMD, GLCM-HOSVD and M-RF architecture shown in Figure 4 are adopted.

Secondly, EMD, GLCM-HOSVD and M-RF architecture are compared with MP-feature [27,28]. In 6 different acoustic scenarios, compare the EMD, GLCM-HOSVD and M-RF architectures in Fig. 4 with the MP+PCA+LDA+SVM method in [27]. The SVM method of MP combining PCA and LDA, here referred to as MP-feature, is to use the matching pursuit algorithm to select important atoms from the Gabor dictionary, and use principal component analysis (PCA) and linear discriminant analysis (LDA) to determine the characteristics of the sound event. The SVM classifier performs recognition.

Again, EMD, GLCM-HOSVD and M-RF architecture are compared with SPD. The EMD, GLCM-HOSVD and M-RF architecture and the SPD+KNN method of literature [20] are compared with the sound event recognition performance under the conditions of 5dB, 0dB and -5dB. SPD combines the KNN method, referred to as SPD, in which the sub-band power distribution (SPD) separates the high-energy and reliable small part of the sound event spectrogram from the scene sound, and the nearest neighbor classifier (kNN) treats these high frequencies. To Reliable identification of a small number of spectra.

C. Experimental scenario

The 6 types of sound scenes include: road scene sound, wind scene sound, flowing water scene sound, rain scene sound, airport scene sound and Gaussian white noise.

4. Results and discussion

A. Basic information

The first group of experiments compares the recognition rates of GLCM-HOSVD and GLCM-SDH. The recognition rate is shown in Figure 9. In Figure 9 (a), (b), (c), (d), (e), (f) are highway scenes, wind sound scenes, flowing water scenes, rain sound scenes, airport sound scenes, and Gaussian white noise. The recognition rate under the signal-to-noise ratio.

It can be seen that in most acoustic scenes, when the signal-to-noise ratio is 10-20dB, the recognition rate of using GLCM-HOSVD feature is about 20% higher than using GLCM-SDH feature.

For Figure 9(a), due to the instability of the sound scene around the highway, the experimental result is still that the GLCM-HOSVD method is significantly better than GLCM-SDH. Although, as shown in Figure 9(e), in the airport sound scene, when the signal-to-noise ratio of the sound event is 0dB, the GLCM-HOSVD method is slightly lower than the GLCM-SDH method. But overall, the GLCM-HOSVD method we proposed can better characterize the texture characteristics of the spectrogram than the GLCM-SDH method.

The second set of experiments is the key experiment of this article. In the experiment, we have predetermined fewer types of scene sounds, and choosing a random forest discriminator can ensure the correct recognition of the types of scene sounds. For practical applications, we discuss in IV.D. In the experiment, we test each sound scene and the random forest RF _{s,n of the} signal-to-noise ratio in the random forest matrix RFM with 5 sound events with different signal-to-noise ratios in this sound scene. The average recognition rate under the 6 sound scenes is shown in Table 1.

Table 1 Use different RF _{s, n} to recognize sound events with different signal-to-noise ratios

It can be seen from Table 1 that when _{the signal-to-noise ratio of the random forest RF s,n} matches the signal-to-noise ratio of the test sound, the recognition accuracy is hardly affected by the signal-to-noise ratio. As shown in the main diagonal of Table 1, the recognition accuracy is very high whether it is a high signal-to-noise ratio or a low signal-to-noise ratio. If there is a deviation between the signal-to-noise ratio of the test sound and the signal-to-noise ratio of the training sound, the recognition accuracy will decrease as the deviation increases. For example, in the first row of Table 1, when the RF _s,n signal-to-noise ratio is 20dB and the test sound signal-to-noise ratio is 10dB, it is 68.63%, 5dB is 46.88%, 0dB is 27.63%, and -5dB is 13.75%. Meanwhile, the lower RF _{s, n} _SNR, RF _s, the _n-to-noise ratio SNR of the test sound matching error, the greater the impact on the recognition rate. As shown in the fifth row of Table 1, when the RF _{s, n} signal-to-noise ratio is -5dB, the test sound signal-to-noise ratio is 19.00% at 0dB, 7.13% at 5dB, 2.38% at 10dB, and 5.43% at 20dB. However, as long as it is ensured that the _{signal-to-noise ratio of RF s,n} matches the signal-to-noise ratio of the test sound, even if the RF _s,n is -5dB with a low signal-to-noise ratio, a high recognition rate can be maintained.

The third group of experiments, for EMD, GLCM-HOSVD and RF architecture, is referred to as EMD+RF in Figure 10. According to the architecture of Fig. 3, the estimated signal-to-noise ratio l _{s of} the sound event to be measured has a deviation from its true signal-to-noise ratio. This makes _{the signal-to-noise ratio of RF s} deviate from the signal-to-noise ratio of the sound event to be measured. As a _{result, the recognition rate of the RF s} of the sound event to be measured is reduced. This situation is especially obvious when the _{RF s is low.} The relevant results are shown in the green histograms in Figure 10. Among them, when the RF _s is 20dB, the average recognition rate for the 6 types of sound scenes is 92%; at 10dB, 83%; at 5dB, 77.5%; at 0dB, 64%; at -5dB, 29%.

For pRF, under different signal-to-noise ratios, the average recognition rate of sound events in 6 sound scenes is referred to as RF in Figure 10. From the blue histogram in Figure 10, it can be seen that the recognition rate of pRF is slightly less when the signal-to-noise ratio is 20dB. Higher than RF _s . However, the overall recognition result, RF _{s is} significantly better than the RF recognition result.

The fourth group of experimental EMD, GLCM-HOSVD and M-RF architectures are referred to as EMD+M-RF in Figure 10. In the experiment, we select the signal-to-noise ratio that is different from the estimated signal-to-noise ratio and the estimated signal-to-noise ratio itself to mix M-RF. The average recognition rate of this method under different signal-to-noise ratios is shown in the red histogram in Figure 10. We can see that the EMD, GLCM-HOSVD and M-RF architecture methods can greatly improve the recognition rate under low signal-to-noise ratio. Regarding related improvements, discussed in IV.D.

B. Comparison of EMD, GLCM-HOSVD and M-RF architecture and MP-feature

The recognition situation of the two feature extraction methods of EMD, GLCM-HOSVD and M-RF architecture and MP-feature in 6 types of sound scenes is shown in Figure 11. In Figure 11, in 6 types of sound scenes, the MP feature is at a low signal-to-noise ratio, such as 5dB or less, and most of the sound events cannot be identified. The only exception is Figure 11(f). Since Gaussian white noise does not have obvious laws, it is not easy to restore through matching pursuit (MP). Therefore, it can maintain a certain recognition ability at 5dB. The EMD, GLCM-HOSVD and M-RF architectures can maintain a recognition rate of more than 80% in various types of scene sound at 0dB. In particular, in the case of -5dB, an average recognition rate of more than 70% is still maintained.

C. Comparison of EMD, GLCM-HOSVD and M-RF architecture with SPD under low signal-to-noise ratio

Figure 12 shows the recognition rates of EMD, GLCM-HOSVD and M-RF architecture and SPD in 6 types of scene sounds and 3 signal-to-noise ratios of 5dB, 0dB and -5dB. The SPD method, in the case of semi-supervision, discards some features that are disturbed by scene sounds, and retains some reliable high-energy features. It can be seen from Figure 12 that although SPD can still maintain a certain degree of recognition rate in the case of 5dB and 0dB, it cannot maintain the normal recognition ability for a lower signal-to-noise ratio, such as -5dB. For 0dB and -5dB, EMD, GLCM-HOSVD and M-RF architecture still maintain good recognition efficiency.

D. Discussion

This part analyzes the EMD, GLCM-HOSVD and RFM architecture, EMD, GLCM-HOSVD and RF architecture, and EMD, GLCM-HOSVD and M-RF architecture classifiers proposed in this paper, and recognizes the performance of environmental sounds in various sound scenarios. And compare EMD, GLCM-HOSVD and M-RF architecture with SPD and MP methods.

Experiments show that GLCM-HOSVD is better than GLCM-SDH; using EMD, GLCM-HOSVD and RFM architecture and EMD, GLCM-HOSVD and RF architecture can detect sound events in low signal-to-noise ratio. The performance of EMD, GLCM-HOSVD and M-RF architecture is better than the SVM method of MP combined with PCA and LDA. When the signal-to-noise ratio is lower than 0dB, EMD, GLCM-HOSVD and M-RF architecture are better than SPD combined with KNN. Figure 13 shows the average detection accuracy of sound events under 3 different signal-to-noise ratios: 5dB, 0dB, and -5dB under 6 types of sound scenarios. It can be seen from Figure 13 that this method can still maintain a high recognition accuracy rate from 0dB to -5dB.

As mentioned in IV.B basic experiment 2, this experiment only selects 6 types of sound scenes, so that the use of RF to distinguish the sound of the scene will not cause misjudgment. If the sound of the scene is judged incorrectly, it may affect the recognition accuracy. In practical applications, we use the method shown in Figure 3 or Figure 4. That is, the sound event endpoint detection and signal-to-noise ratio estimation method of III.A is used to separate the scene sound from the sound event to be measured; according to the signal-to-noise ratio of the sound event, it is directly connected with all the sound events in the sound event sample library Mix to generate the sound event set in the corresponding scene; then extract the GLCM-HOSVD feature of the sound event set, train and generate a random forest. Using this generated RF to discriminate the sound event under test can ensure that the scene type of the sound event under test is consistent with the scene sound type of the random forest.

In practical applications, for a certain environment (scene), the sound events that may occur are limited. Therefore, the number of sound events in the sound event sample library is also limited. Therefore, according to the EMD, GLCM-HOSVD and RF architecture in Figure 3 or the EMD, GLCM-HOSVD and M-RF architecture in Figure 4, the relevant scene sounds are mixed with the sound events in the sample library, and RF _s or RF is established _sh -RF _s -RF _sl can be performed in real time. This enables real-time recognition of low signal-to-noise ratio sound events in various sound scenarios.

A further problem, as described in the basic experiment 4. A, the deviation of the estimation of the signal-to-noise ratio of the sound event to be measured causes the recognition rate to decrease. Taking into account the non-stationarity of the scene sound, the separated environmental sound has a deviation from the environmental sound of other time periods. In response to this problem, one of the improved methods is to select multiple representative non-stationary scene sounds, respectively, and the sounds in the sample library To The events are mixed to generate multiple RFs, and the final result is determined by the results of multiple RFs and further voting.

Therefore, we believe that based on the EMD, GLCM-HOSVD and RFM architecture, EMD, GLCM-HOSVD and RF architecture and EMD, GLCM-HOSVD and M-RF architecture classifiers proposed in this article, it can be implemented in various acoustic scenarios. Recognition of sound events with low signal-to-noise ratio.

In summary, this paper proposes a sound event recognition method that can effectively improve the recognition rate under low signal-to-noise ratio in various acoustic scenarios. This method combines the scene sound in the sound event to be measured with the sound event sample library, extracts the characteristics of the sound data through GLCM-HOSVD, and generates an RF for judging the sound event to be measured. The RF generated by this method can be used to recognize sound events under a low signal-to-noise ratio in a specific scene. Experimental results show that this method can make the sound event and scene sound signal-to-noise ratio of -5dB, and maintain an average accuracy of more than 73% of the sound event recognition rate. Compared with the MP and SPD feature extraction methods, to a certain extent, the proposed method solves the problem of sound event recognition in the case of low signal-to-noise ratio.

references:

[1]H.Malik, "Acoustic environment identification and its applications to audio forensics," IEEE Trans.Inf.Foren.Sec.,vol.8,no.11,pp.1827-1837,Nov.2013.

[2] T. Heittola, A. Mesaros, T. Virtanen, A. Eronen, "Sound Event Detection in Multisource Environment Using Source Separation," in Proc. CHiME, pp. 36-40, 2011.

[3]C.-H.Lee,S.-B.Hsu,J.-L.Shih,and C.-H.Chou,"Continuous birdsong recognition using gaussian mixture modeling of image shape features," IEEE Trans.multimedia ,vol.15,no.2,pp.454-464,Feb.2013.

[4]Z.Shi,J.Han,T.Zheng,and J.Li,"Identification of Objectionable Audio Segments Based on Pseudo and Heterogeneous Mixture Models," IEEE Trans.Audio,Speech,Lang.Process.,vol.21 , no. 3, pp. 611-623, Mar. 2013.

[5]S.Ntalampiras,I.Potamitis,and N.Fakotakis,"An adaptive framework for acoustic monitoring of potential hazards," EURASIP J.Audio,Speech,Music Process.vol.2009,pp.1-16,Jan. 2009.

[6]C.Clavel,T.Ehrette,G.Richard,"Events detection for an audio-based surveillance system,"in Proc.ICME,pp.1306-1309,2005.

[7]H.Zhao and H.Malik, "Audio recording location identification using acoustic environment signature," IEEE Trans.Inf.Foren.Sec.,vol.8,no.11,pp.1746-1759,Nov.2013.

[8]C.Clavel,I.Vasilescu,L.Devillers,G.Richard,T.Ehrette,"Fear-type emotion recognition for future audio-based surveillance systems,"Speech Commun.,vol.50,pp.487- 503, 2008.

[9]J.N.Mcnames,A.M.Fraser,"Obstructive sleep apnea classification based on spectrogram patterns in the electrocardiogram,"Computers in Cardiology,vol.27,pp.749-752,Sep.2000.

[10]V.Kudriavtsev,V.Polyshchuk,and D.L.Roy, "Heart energy signature spectrogram for cardiovascular diagnosis," BioMedical Engineering Online, vol. 6, no. 1, pp. 16, 2007.

[11]V.N.Varghees,K.I.Ramachandran,"A novel heart sound activity detection framework for automated heart sound analysis," Biomedical Signal Processing and Contro.,vol.13,pp.174-188,Sep.2014.

[12] A. Gavrovska, V.

I.Reljin,and B.Reljin,"Automatic heart sound detection in pediatric patients without electrocardiogram reference via pseudo-affine Wigner–Ville distribution and Haar wavelet lifting,"Computer methods and programs in biomedicine vol.113,no.2,pp. 515-528, Feb. 2014.

[13]S.Ntalampiras,I.Potamitis,N.Fakotakis,"On Acoustic Surveillance of Hazardous Situations," in Proc.IEEE Int.Conf.Acoust.,Speech,Signal Process.(ICASSP'09),2009,pp. 165-168.

[14]S.

S.

"Classification and analysis of non-stationary characteristics of crackle and rhonchus lung adventitious sounds," Digital Signal Processing, vol. 28, pp. 18-27, May. 2014.

[15]B.Lei,S.A.Rahman,and I.Song,"Content-based classification of breath sound with enhanced features,"Neurocomputing,vol.141,pp.139-147,Oct.2014.

[16]Y.Wang,W.Li,J.Zhou,X.Li,and Y.Pu,"Identification of the normal and abnormal heart sounds using wavelet-time entropy To features based on OMS-WPD,"Future Generation Computer Systems,vol.37,pp.488-495,Jul.2014.

[17]F.Jin, F.Sattar, and D.Y.Goh, "New approaches for spectro-temporal feature extraction with applications to respiratory sound classification," Neurocomputing, vol.123,pp.362-371, Jan.2014.

[18]G. Muhammad, M. Moutasem, "Pathological voice detection and binary classification using MPEG-7 audio features," Biomedical Signal Processing and Control, vol.11,pp.1-9,May.2014.

[19]G.Richard, S.Sundaram, and S.Narayanan, "An overview on perceptually motivated audio indexing and classification," Proc.IEEE,vol.101,no.9,pp.1939-1954,Sep.2013.

[20]R.Yan,R.X.Gao, "Multi-scale enveloping spectrogram for vibration analysis in bearing defect diagnosis," Tribology International, vol.42, no.2, pp.293-302,Feb.2009.

[21]MSLew, N. Sebe, C. Djeraba, and R. Jain, "Content-Based Multimedia Information Retrieval: State of the Art and Challenges," ACM Trans. Multimedia Comput., Commun., Applic., vol. 2, no. 1, pp. 1-19, Feb. 2006.

[22]J. Wang, K. Zhang, K Madani, and C Sabourin, “Salient environmental sound detection framework for machine awareness,” Neurocomputing, vol.152, pp.444-454, Mar. 2015.

[23]S.Ntalampiras, "A novel holistic modeling approach for generalized sound recognition," IEEE Signal Process.Lett.,vol.20,no.2,pp.185-188,Feb.2013.

[24]J.-C.Wang,C,-H.Lin,B,-W,Chen,and M.-K Tsai,"Gabor-based nonuniform scale-frequency map for environmental sound classification in home automation," IEEE Trans.Autom.Sci.Eng.,vol.11,no.2,pp.607-613,Apr.2014.

[25]S.Chu,S.Narayanan,and CCJKuo,"Environmental sound recognition with time-frequency audio features,"IEEE Trans.Audio,Speech,Lang.Process.,vol.17,no.6,pp.1142 -1158, Aug. 2009.

[26]ZRFeng,Q.Zhou,J.Zhang,P.Jiang,and XWYang "A Target Guided Subband Filter for Acoustic Event Detection in Noisy Environment Using Wavelet Packets," IEEE Trans.Audio,Speech,Lang.Process. ,vol.23,no.22,pp.361-372,Feb.2015.

[27]J.

-Choez,A Gallardo-Antolín,"Feature extraction based on the high-pass filtering of audio signals for Acoustic Event Classification,"Computer Speech & Language,vol.30,no.1,pp.32-42,Mar.2015.

[28]H.Phan, M.Maas, R.Mazur, and A. Mertins, "Random Regression Forests for Acoustic Event Detection and Classification," IEEE Trans.Audio, Speech, Lang.Process.,vol.23,no. 1,pp.20-31,Jan.2015.

[29] J. Ye , T. Kobayashi, M. Murakawa, T. Higuchi, "Kernel discriminant analysis for environmental sound recognition based on acoustic subspace," in Proc.IEEE Int.Conf.Acoust.,Speech,Signal Process.( ICASSP'13), 2013, pp.808-812.

[30]P.Khunarsal, C. Lursinsap, and T. Raicharoen, "Very short time environmental sound classification based on spectrogram pattern matching," Inform.Sci.,vol.243,pp.57-74,Sep.2013.

[31]C.Baug′e,M.Lagrange,J.And′en,and S.Mallat,"Representing environmental sounds using the separate scattering transform," in Proc.IEEE Int.Conf.Acoust.,Speech,Signal Process .(ICASSP'13), 2013, pp. 8667-8671.

[32]JCWang, CHLin, E. Siahaan, BWChen, and HLChuang, “Mixed sound event verification on wireless sensor network for home automation,” IEEE Trans. Ind. Informat., vol. 10, no. 1, pp.803–812, Feb. 2014.

[33]J. Dennis, HDTran and ESChng. "Overlapping sound event recognition using local spectrogram features with the generalised hough transform," Pattern Recognition Lett., vol. 34, no. 9, pp. 1085-1093, Sep. 2013.

[34]J.Dennis,HDTran,and ESChng,"Image feature representation of the subband power distribution for robust sound event classification," IEEE Trans.Audio,Speech,Lang.Process.,vol.21,no.2, pp367-377, Feb. 2013.

[35]T.Heittola,A.Mesaros,A.Eronen,and T.Virtanen,"Context-dependent sound event detection," EURASIP J.Audio,Speech,Music Process.,vol.2013,no.1,pp. 1-13, Jan. 2013.

[36]A.Plinge, R.Grzeszick, and GAFink, "A bag-of-features approach to acoustic event detection," in Proc.IEEE Int.Conf.Acoust.,Speech,Signal Process.(ICASSP'14) , 2014, pp. 3704-3708.

[37]THDat, NWZTerence, JWDennis, and LYRen, "Generalized Gaussian distribution kullback-leibler kernel for robust sound event recognition," in Proc.IEEE Int.Conf.Acoust.,Speech,Signal Process.(ICASSP' 14), 2014, pp. 5949-5953.

[38]J.Ye,T.Kobayashi,M Murakawa,and T.Higuchi,"Robust acoustic feature extraction for sound classification based on noise To reduction,"in Proc.IEEE Int.Conf.Acoust.,Speech,Signal Process.(ICASSP’14),2014,pp.5944-4948.

[39]S.Deng,J.Han,C.Zhang,T.Zheng,and G.Zheng, "Robust minimum statistics project coefficients feature for acoustic environment recognition," in Proc. IEEE Int.Conf.Acoust.,Speech, Signal Process. (ICASSP'14), 2014, pp. 8232-8236.

[40]X.Lu,Y.Tsao,S.Matsuda,and C.Hori,"Sparse representation based on a bag of spectral executives for acoustic event detection," in Proc.IEEE Int.Conf.Acoust.,Speech,Signal Process.(ICASSP'14), 2014, pp. 6255-6259.

[41]M.Seltzer,B.Raj,and R.Stern,"A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition," Speech Commun.,vol.43,no.4,pp.379–393,2004 .

[42]K.Yamashita, T.Shimamura, "Nonstationary noise estimation using low-frequency regions for special subtraction," IEEE Signal Process.Lett.,vol.12,no.6,pp.465-468,2005.

[43]K.Sunil and L.Philipos, "A multi-band spectral subtraction method for enhancing speech corrupted by colored noise," in Proc.IEEE Int.Conf.Acoust.,Speech,Signal Process.(ICASSP'02), 2002, vol. 4, pp. 13-17.

[44]H.Huang and J.Q.Pan,"Speech pitch determination based on Hilbert-Huang transform,"Signal Process.,vol.86,no.4,pp.792-803,2006.

[45]M.Unser, "Sum and difference histograms for texture classification," IEEE Trans.Pattern Anal.Mach.Intell.,vol.8,no.1,pp.118-125,1986.

[46]LKSoh and C.Tsatsoulis, "Texture analysis of SAR sea ice imagery using gray level co-occurrence matrices," IEEE Trans.Geosci.Remote S.,vol.37,no.2,pp.780-795, 1999.

[47]Z.Xie,G.Liu,C.He,and Y.Wen,"Texture image retrieval based on gray level co-occurrence matrix and singular valuedecomposition,"in Proc.ICMT,pp.1-3,2010 .

[48]L.D.Lathauwer,B.D.Moor,and J.Vandewalle,"A multilinear singular value decomposition," SIAM J.Matrix Anal.Appli.,vol.21,no.4,pp.1253-1278,2000.

[49]J.Wei,Y.Li, "Specific environmental sounds recognition using time-frequency texture features and random forest", in Proc.CISP,pp.1705-1709,2013.

[50]L.Breiman."Random forests,"Machine Learning,vol.45,no.1,pp.5-32,2001.

[51]H.Pang,A.Lin,M.Holford,and B.E.Enerson,"Pathway analysis using random forests classification and regression,"Bioinformatics,vol.22,no.16,pp.2028-2036,2006.

[52]KLUnella, LBHayward, J.Scgal, and PVEerdewegh, "Screening large-scale association study data: exploiting interactions using random forests", BMC Genetics, vol.11, no.5, pp.32-37, 2004.

[53]Universitat Pompeu Fabra.Repository of sound under the creative commons license,Freesound.org[DB/OL].http://www.freesound.org,2012-5-14.

The above are the preferred embodiments of the present invention. Any changes made according to the technical solution of the present invention and the resulting function does not exceed the scope of the technical solution of the present invention belong to the protection scope of the present invention. To

Claims

A method for identifying sound events in a low signal-to-noise ratio sound scene, which is characterized in that it includes the following steps. Step S1: Random forest matrix training and generation: the known sound event samples in the sound event sample set and the scene sound samples are collected The sound samples of the known scene are mixed to obtain a mixed sound signal set and stored in the training sound set. The sound signals in the training sound set are generated through GLCM-HOSVD to generate the feature set of the training sound set. The feature set is trained to generate a random forest matrix; Step S2: Scene sound type discrimination random forest training and generation: GLCM-HOSVD is performed on the known scene sound samples in the scene sound sample set to generate the feature set of the scene sound sample set, and Train the feature set of the scene sound sample set to generate a scene sound type discrimination random forest; Step S3: Recognize the sound event to be measured:

The first step is to decompose the sound signal to be measured into scene sounds and sound events through EMD, and calculate the signal-to-noise ratio of the sound event to be measured; the second step is to calculate the characteristic values of the scene sound to be measured and the sound event to be measured. And input the characteristic value of the scene sound to be measured into the scene sound type discrimination random forest generated in step S2, and detect the sound type of the scene to be measured; the third step is to pass the sound type of the scene to be measured and the sound event to be measured Select the random forest for sound event recognition from the random forest matrix generated in step S1; the fourth step is to identify the characteristic value of the sound event to be measured through the random forest selected in the third step Get the sound type.
The method for recognizing a sound event in a low signal-to-noise ratio sound scene according to claim 1, wherein the specific realization process of the first step in the step S3 is as follows: the sound signal y(t) to be measured is passed through the EMD , EMD can adaptively divide the sound signal y(t) to be measured into a linear superposition of n-level intrinsic modal functions according to the characteristics of the signal itself, namely

Among them, r i (t) is the residual function, and Li (t) is the n-level intrinsic modal function;

Among the n-level inherent modal functions Li (t), the first-level inherent modal function L 1 (t) mainly contains noise components, with very few effective sound components. The noise components are the sound part of the scene, and the effective sound components are the sound Event part; therefore, we only select the intrinsic modal function of level 2-6, that is, take i=2,3,...,6 for the detection of the sound endpoint under test; use the i-th intrinsic modal function Li (t ) The process of detecting the sound endpoints to be tested is as follows:

S311: Perform preprocessing on the i-th inherent modal function Li (t)

e i (t)=|H{L i (t)}|+L i (t) (2)

Among them, H{L i (t)} represents the Hilbert transformation of the intrinsic mode function;

S312: Smooth e i (t)

Among them, σ is the smoothing window, which is 0.05 times the sampling rate;

S313: Normalize E i (t)

S314: Calculate the sound event level S level , the scene sound level N level, and initialize the scene sound level threshold T

S level = mean[F i (t)] (5)

N level = βΣF i (t) (6)

T=αS level (7)

Among them, α and β are threshold parameters, and α=4, β=0.25;

S315: Calculate the average value of F i (t) in the k-th window

Among them, k is the window index, W d is the window length, and the signal sampling rate is 0.02 times;

S316: Judge whether there is a sound event

If the sound event exists, skip to step S318;

S317: Dynamically estimate the scene sound and update the scene sound level

Wherein, N level (n) is the scene sound level of the nth window, and jump to step S319 after updating the scene sound level N level (n);

S318: Update the scene sound level threshold

Among them, θ is a constant, and θ=0.2;

S319: If the scene sound level threshold has been updated in the previous cycle, update the sound event level S level

S level ＝N level +λ|TN level | (12)

Among them, λ=0.5, which is used as the weight for updating the level of the sound event;

S3110: k=k+1, move the window, if the window does not end, jump to step S315, otherwise the loop ends;

2-6 level selected intrinsic mode function L i (t) by the above-described processes S3110 to step S311, to give five different endpoint detection result, and then determines the final endpoint detection result of a vote;

After separating the sound signal y(t) into the sound event segment s(t) and the scene sound segment n(t), in order to be able to estimate the signal-to-noise ratio more accurately, we smooth the signal energy and first calculate the scene sound energy:

P n (t)=n 2 (t) (13)

Second, adjust the sound energy of the scene

P n (t)=mean(P n ) if P n (t)＞γmean(P n ) (14)

Among them, the coefficient γ=3, the purpose of this process is to adjust the misclassified sound event segment in the scene sound segment;

Finally calculate the signal-to-noise ratio

Among them, l represents the ratio of the length of the sound event segment to the scene sound segment. Since the separated sound event segment contains scene sound components, it has an impact on the energy value of the sound event segment. Therefore, lΣP n (t) is used as the influence It is estimated that the influence of the sound of the scene on the energy value is eliminated.
The method for recognizing sound events in a low signal-to-noise ratio sound scene according to claim 1, characterized in that: in the steps S1 to S3, the characteristics of the sound of the scene to be measured, the characteristics of the sound event to be measured, and the training sound event The calculation method of features and features of known scene sounds is as follows:

GLCM can be expressed as:

P(i,j|d,θ)=#{(x,y),(x+Δx,y+Δy)|f(x,y)=i, f(x+Δx,y+Δy)=j } (16)

Among them, x, y represent the pixel coordinates in the spectrogram, and x+Δx≤M, y+Δy≤N, M×N represents the size of the image; i,j=0,1,...,L-1,L Is the number of gray levels of the image, #{S} represents the number of elements in the set S;

Intercept the image area in the spectrogram of the sound event whose size is M×N and the gray level is L. According to formula (16) and the values of d and θ, calculate the GLCM, and combine each GLCM into a high-order matrix A ,
Perform tensor expansion on the higher-order matrix A to obtain A (n) ,
Among them, the elements of A
Placed in the i n row and j column of a two-dimensional matrix of size I n ×(I n-1 ×…×I N ×I 1 ×…I n-1)
When k>n,
When k<n,

Perform singular value decomposition on A (n) to get

Among them, U (n) is a unitary matrix; Σ (n) is a positive semi-definite diagonal matrix; and V (n)H , the conjugate transpose of V, is a unitary matrix; Σ (n) is obtained according to formula (17), According to Σ (n) , we can get

Take σ (1) …σ (n) …σ (N) as the characteristics of sound events, namely

Among them, 1≤n≤N;
It represents Σ (n) of singular values i n, 1≤i n ≤I n;

According to the above-mentioned method of calculating the feature value of the sound event, the characteristics of the sound event to be measured, the characteristics of the sound event to be measured, the characteristics of the training sound event, and the characteristics of the known scene sound can be calculated. To