KR101732399B1

KR101732399B1 - Sound Detection Method Using Stereo Channel

Info

Publication number: KR101732399B1
Application number: KR1020150155057A
Authority: KR
Inventors: 김홍국; 이동윤; 유승우; 전광명
Original assignee: 광주과학기술원
Priority date: 2015-11-05
Filing date: 2015-11-05
Publication date: 2017-05-08

Abstract

A method for detecting an abnormal acoustic event from a stereo input signal in an environment where background noise is mixed includes: receiving a stereo input acoustic signal; converting the input acoustic signal into the time-frequency domain; decomposing the converted input acoustic signal using a non-negative tensor factorization (NTF) algorithm; primarily determining whether an event sound has occurred by referring to a channel gain value; extracting feature values for the identified event sound signals; performing HMM classification on the extracted event sound signals; and finally determining whether an abnormal acoustic event has occurred according to the HMM classification.

Description

[0001] The present invention relates to a sound detection method using a stereo channel and, more particularly, to a method for classifying and detecting abnormal sounds using a stereo channel in a real environment in which various background noises are mixed.

Acoustic event detection refers to detecting the occurrence of an acoustic event in a recorded sound source and classifying the event according to the characteristics of the sound. Conventional approaches, such as those based on Gaussian mixture models (GMM) and non-negative matrix factorization (NMF), detect acoustic events from a recorded signal.

However, conventional acoustic event detection techniques treat each channel of a stereo signal as an independent single-channel input, which lowers the detection accuracy for acoustic events.

One conventional acoustic detection method derives MFCC feature values and then applies GMMs (Gaussian Mixture Models). This method achieves high detection accuracy for a single, uncomplicated acoustic signal, but its accuracy is low when detecting an abnormal sound in a signal in which several acoustic signals are combined.

To solve this problem, an NMF-based sound detection method has been proposed in which an input signal containing a mixture of several sound signals is separated into its sound sources. In this method, the sound sources are labeled so that each abnormal event can be detected independently. However, because it cannot exploit the channel information of a stereo sound signal, its detection accuracy for abnormal sound events is somewhat degraded.

It is an object of the present invention to provide a sound detection method capable of increasing the detection accuracy of an abnormal sound event by using stereo channel signal information in detecting an abnormal sound event.

An embodiment of the present invention is a method of detecting an abnormal acoustic event using a stereo input signal, comprising: receiving a stereo input acoustic signal; converting the input acoustic signal into a time-frequency domain; decomposing the converted input acoustic signal using a non-negative tensor factorization (NTF) algorithm; primarily determining whether an event sound has occurred by referring to a channel gain value; extracting feature values for the identified event sound signals; performing HMM classification on the extracted event sound signals; and finally determining whether an abnormal acoustic event has occurred according to the HMM classification.

According to the present invention, the sound detection method of the embodiment uses a non-negative tensor factorization (NTF) algorithm to separate a mixture of acoustic signals into individual event sounds and, by using the channel gain obtained from the NTF for acoustic event detection, can detect an abnormal sound with high accuracy even in an environment where several noises are mixed.

The present invention primarily classifies the event signal using the non-negative tensor factorization technique and then classifies background noise and event signals again using a hidden Markov model (HMM), so that the specific sound targeted by the user can be detected in the input signal more accurately and reliably.

FIG. 1 is a flowchart showing a sound detection method according to an embodiment of the present invention.
FIG. 2 is a diagram specifically showing each step of the sound detection method according to the embodiment of the present invention.
FIG. 3 is a diagram illustrating a step of primarily determining whether an event sound is generated based on a channel gain value in the sound detection method according to the embodiment of the present invention.
FIG. 4 is a diagram illustrating a step of extracting a feature value for the discriminated signals in the sound detection method according to the embodiment of the present invention.
FIG. 5 is a diagram illustrating a step of performing HMM classification on the extracted feature values in the sound detection method according to the embodiment of the present invention.
FIG. 6 is a graph comparing the performance of sound detection according to an embodiment of the present invention with the conventional art.
FIG. 7 is a graph comparing the performance of sound detection according to an embodiment of the present invention at different SNR values.

The embodiments of the present invention will be described in detail with reference to the accompanying drawings, but the present invention is not limited to these embodiments. In describing the present invention, a detailed description of well-known functions or constructions may be omitted for the sake of clarity of the present invention.

The present invention relates to a method for detecting a specific event sound among inputted sound signals, and in particular, a method for detecting an event sound in an input sound signal obtained with a stereo channel.

FIG. 1 is a flowchart illustrating a sound detection method according to an embodiment of the present invention.

Referring to FIG. 1, a method for detecting an acoustic event according to an embodiment includes a step (S1) of receiving a stereo input acoustic signal, a step (S2) of converting the received input signal to a mel-amplitude spectrum signal, a step (S3) of decomposing the mel-amplitude spectrum by non-negative tensor factorization, a step (S4) of discriminating whether an event sound is generated based on the channel gain value, a step (S5) of extracting a feature value for the discriminated signals, a step (S6) of verifying the HMM likelihood, and a step (S7) of detecting the occurrence of an abnormal event according to the verification result.

FIG. 2 is a diagram illustrating each step of the sound detection method according to the embodiment of the present invention in detail.

Referring to FIGS. 1 and 2 together, the acoustic event detection method of the present invention assumes that, for sound source separation, the input signal is obtained from a stereo channel. The stereo input signal obtained in step S1 may be expressed by the following equation.

$$y_i^c(n) = \sum_{e=1}^{E} s_i^{c,e}(n) + d_i^c(n)$$

Here, i is the frame number, c is the channel number, e is the event classification number, E is the number of event classifications, $s_i^{c,e}(n)$ is the acoustic event signal of classification e on channel c, and $d_i^c(n)$ is the background noise signal. The stereo input signal $y_i^c(n)$ is assumed to contain a mixture of E acoustic events.

In step S2, the spectrum of the received input signal $y_i^c(n)$ is obtained. A short-time Fourier transform (STFT) is applied to the input signal to obtain its magnitude spectrum $|Y_i^c(k)|$, which is then converted to the mel-amplitude spectrum signal $Y_i^c(m)$.

The mel-amplitude spectrum signal can be expressed by the following equation.

Here, m is the mel-band index and M is the order of the mel-amplitude spectrum.
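As a rough illustration of step S2, the sketch below computes a mel-amplitude spectrum for one stereo frame with NumPy. The equation for the mel-amplitude spectrum is not reproduced in this text, so the triangular filterbank, the mel-scale formula, the Hann window, and every function and parameter name (`mel_filterbank`, `mel_amplitude_frame`, `n_mels`, `n_fft`, `sr`) are assumptions made for illustration rather than details taken from the patent.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula (an assumption; the patent text does not specify one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges_hz[m:m + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mel_amplitude_frame(frame_stereo, fb, n_fft):
    """frame_stereo: (2, N) samples of one frame. Returns (2, n_mels)."""
    window = np.hanning(frame_stereo.shape[1])
    magnitude = np.abs(np.fft.rfft(frame_stereo * window, n=n_fft, axis=1))
    return magnitude @ fb.T

# Hypothetical usage with a random stereo frame.
rng = np.random.default_rng(0)
frame = rng.standard_normal((2, 512))
fb = mel_filterbank(n_mels=20, n_fft=512, sr=16000)
Y_frame = mel_amplitude_frame(frame, fb, n_fft=512)
```

Each row of `Y_frame` holds the mel-amplitude values of one channel, which matches the per-channel indexing used for the stereo signal in the description.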

Then, in step S3, the mel-amplitude spectrum is decomposed by non-negative tensor factorization (NTF). Through NTF, the mel-amplitude spectrum can be expressed as a tensor composed of a channel gain matrix, a frequency basis matrix, and a time activation matrix. This can be expressed by the following equation.

Here, the product operator in the equation is the tensor product, and J is the rank of the basis in the NTF decomposition. The channel gain (C), frequency basis (B), and time activation (A) matrices are as follows.

$C = [C_i^{1,S}, \ldots, C_i^{E,S}, C_i^D]$: channel gain matrix (2 × J)

$C_i^{e,S}$ and $C_i^D$ represent the channel gain matrices of the acoustic event and the background noise, respectively.

$B = [B_i^{1,S}, \ldots, B_i^{E,S}, B_i^D]$: frequency gain matrix (2 × J')

J' represents the base rank of each acoustic event and of the background noise.

$B_i^{e,S}$ and $B_i^D$ represent the frequency gain matrices of the acoustic event and the background noise, respectively.

$A = [A_i^{1,S}, \ldots, A_i^{E,S}, A_i^D]$: time gain matrix (1 × J)

$A_i^{e,S}$ and $A_i^D$ represent the time gain matrices of the acoustic event and the background noise, respectively.

The channel gain and the time gain can be updated by repeatedly applying the following update rules.

Here, h is the iteration index and ∘ denotes element-wise multiplication. $P^h_{c,k,m}$ is the ratio of $Y_i$ to its current approximation formed from the channel gain, the frequency gain, and the time gain. This ratio is used in both the update rule for the channel gain and the update rule for the time gain. Equation (4) can be iterated until the relative decrease of the KL divergence becomes smaller than a predetermined threshold value.
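The update equations themselves are not reproduced in this text, so the following is only a sketch of how such an iteration is commonly written: standard multiplicative updates for the KL divergence, applied to the channel gain and time gain while the frequency basis is held fixed. The model $Y[c,m] \approx \sum_j C[c,j]\,B[m,j]\,A[j]$, the M × J shape assumed for `B`, the random initialization, and the names `ntf_fixed_basis`, `n_iter`, and `tol` are all assumptions, not the patent's Equation (4).

```python
import numpy as np

def ntf_fixed_basis(Y, B, n_iter=200, tol=1e-6, seed=0):
    """Y: (2, M) mel-amplitude frame, B: (M, J) fixed non-negative basis.
    Returns channel gain C (2, J), time gain A (J,), and the KL history."""
    rng = np.random.default_rng(seed)
    n_ch, J = Y.shape[0], B.shape[1]
    C = rng.random((n_ch, J)) + 0.1
    A = rng.random(J) + 0.1
    eps = 1e-12

    def model(C, A):
        return np.einsum("cj,mj,j->cm", C, B, A) + eps

    def kl(V):
        return float(np.sum(Y * np.log((Y + eps) / V) - Y + V))

    history = [kl(model(C, A))]
    for _ in range(n_iter):
        P = Y / model(C, A)                       # ratio of Y to its approximation
        C *= np.einsum("cm,mj,j->cj", P, B, A) / (np.einsum("mj,j->j", B, A) + eps)
        P = Y / model(C, A)                       # recompute with the updated C
        A *= np.einsum("cm,cj,mj->j", P, C, B) / (np.einsum("cj,mj->j", C, B) + eps)
        history.append(kl(model(C, A)))
        # Stop when the relative decrease of the KL divergence is small.
        if history[-2] - history[-1] < tol * max(history[-2], eps):
            break
    return C, A, history

# Hypothetical usage on a synthetic frame built from known factors.
rng = np.random.default_rng(1)
B_demo = rng.random((20, 6)) + 0.05
C_true = rng.random((2, 6)) + 0.05
A_true = rng.random(6) + 0.05
Y_demo = np.einsum("cj,mj,j->cm", C_true, B_demo, A_true)
C_est, A_est, kl_hist = ntf_fixed_basis(Y_demo, B_demo)
```

Because `C` and `A` enter the model only through their product, their individual scales are not unique; only the product of the two is determined by the data.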

$Y_i$ may be expressed as the sum of the event sounds and the background sound as follows.

Then, by applying the tensor product of the respective factors, $Y_i$ can be expressed in terms of the background noise and the abnormal acoustic event signals, as in Equation (6).

As shown in Equation (6), the abnormal acoustic event signal and background noise can be decomposed into a combination of channel gain, frequency gain, and time gain tensor, respectively.

FIG. 3 is a diagram illustrating step S40 of determining whether an event sound is generated based on a channel gain value in the sound detection method according to the embodiment of the present invention. The sound detection method of the present invention decomposes the input signal transformed into the time-frequency domain into a combination of channel gain, time gain, and frequency gain tensors and, in particular, uses the channel gain for the primary detection of an abnormal sound event.

In step S40, whether an event sound is generated is determined based on the channel gain value. To this end, the average of the channel gains is obtained for each event sound. The channel gain of the e-th event can be calculated as follows.

Here, the accompanying symbol represents the number of bases corresponding to each event classification. The average ($C_i$) of all channel gains can be expressed by the following equation.

Here, J represents the total number of bases.

The embodiment first derives the channel gain value for each sound and the average of all channel gains, and then applies a mean-to-max threshold to the average channel gain of the e-th (arbitrary) event. That is, the ratio of the e-th channel gain to the average of all channel gains is calculated and compared with a predetermined threshold value.

If the ratio value is larger than the predetermined threshold ($thr_c$), the e-th acoustic signal is passed on to the HMM classification step. If the ratio value is smaller than the predetermined threshold value, the corresponding signal is determined to be a background noise signal and is not considered in the event occurrence decision. This judgment process can be expressed by the following equation.

If the channel gain ratio of the e-th signal is greater than the predetermined value as shown in Equation (9), the signal is primarily classified as a sound carrying event information, and this signal can be denoted as $Flag_c^e$.
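The primary decision described above can be sketched in a few lines of Python. The equations for the per-event gain and for Equation (9) are not reproduced in this text, so the averaging over both channels, the way bases are assigned to events, and the names `primary_event_flags`, `event_bases`, and `thr_c` are illustrative assumptions.

```python
def primary_event_flags(C, event_bases, thr_c):
    """C: channel gains as a list of rows (channels x J).
    event_bases: dict mapping event number e to the basis indices of that event
    (background-noise bases are simply left out of the dict).
    Returns a dict e -> True if the event passes the channel gain test."""
    n_ch, n_bases = len(C), len(C[0])
    # Average of all channel gains, over every channel and every basis.
    overall_mean = sum(sum(row) for row in C) / (n_ch * n_bases)
    flags = {}
    for e, bases in event_bases.items():
        gains = [C[c][j] for c in range(n_ch) for j in bases]
        event_mean = sum(gains) / len(gains)
        flags[e] = (event_mean / overall_mean) > thr_c
    return flags

# Hypothetical example: 2 channels, 6 bases (2 per event, 2 for background).
C_example = [
    [0.9, 0.8, 0.1, 0.1, 0.2, 0.2],
    [0.7, 0.9, 0.1, 0.2, 0.2, 0.1],
]
flags = primary_event_flags(C_example, {1: range(0, 2), 2: range(2, 4)}, thr_c=1.5)
```

In this example event 1 has a mean gain of 0.825 against an overall mean of 0.375 (ratio 2.2) and passes, while event 2 has a ratio of about 0.33 and is treated as background.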

FIG. 4 is a diagram illustrating step S50 of extracting a feature value for the discriminated signals in the sound detection method according to the embodiment of the present invention.

Referring to FIG. 4, the signals filtered by channel gain through the NTF decomposition may be converted to feature vectors composed of MFCCs so that the event sound can be detected using HMM classification.

That is, MFCC extraction is performed on the signal ($Flag_c^e$) obtained by the NTF with the channel gain taken into account, and the extracted feature values can be expressed as follows.

Then, the average of the feature values and the change value of the feature values are obtained. The change value may be the difference between a feature value and the preceding feature value.

That is, the feature value of the i-th sound is composed of these quantities and can be expressed as follows.
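To make the idea of a change value concrete, the sketch below appends a first-order difference to each feature vector. The feature equations are not reproduced in this text and the translated description of the difference is ambiguous, so treating the change value as the current frame minus the previous frame, using zeros for the first frame, and the name `add_delta` are all assumptions.

```python
def add_delta(frames):
    """frames: list of per-frame feature vectors (e.g. MFCCs).
    Returns vectors of twice the length: [features, delta],
    where delta is the difference from the previous frame."""
    out = []
    prev = frames[0]          # the first frame gets a zero delta
    for cur in frames:
        delta = [a - b for a, b in zip(cur, prev)]
        out.append(list(cur) + delta)
        prev = cur
    return out

# Hypothetical two-dimensional features for two frames.
features = add_delta([[1.0, 2.0], [1.5, 1.0]])
```

Stacking the change value next to the static features lets a classifier see how the spectrum is moving as well as where it currently is.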

Referring to FIG. 5, previously trained acoustic event HMMs ($\Phi^{1,S}, \ldots, \Phi^{E,S}$) and a background noise HMM ($\Phi^D$) can be used for HMM classification. A likelihood for the detected event sound can be derived using these HMMs.

The acoustic event HMM ($\Phi^{e,S}$) and the background noise HMM ($\Phi^D$) may have the initial state distribution (π), the state transition probability matrix (T), and the Gaussian mixture observation (θ) as parameters: $\Phi^{e,S} = [\pi^{e,S}, T^{e,S}, \theta^{e,S}]$ and $\Phi^D = [\pi^D, T^D, \theta^D]$.

A likelihood ($L_i^{Flag}$) for the event sound and a likelihood ($L_i^D$) for the background noise can be derived by calculating probability values from the HMMs and the feature values. The likelihood for the event sound and the likelihood for the background noise are as follows.

Then, the average of the channel gains of the event sounds derived in step S40 is obtained, together with a value obtained by normalizing this average. The likelihood for the event sound is multiplied by this normalized value to give a likelihood weighted by the channel gain, and this value can be used to determine the event sound.

Step S60, in which HMM classification is performed on the feature values extracted in step S50 to verify the likelihood, includes step S61 of performing maximum likelihood classification and step S62 of determining whether the sound is an event sound according to the ratio of the likelihood for the event sound to the likelihood for the background noise.

In step S61, the likelihood for the final event sound and the likelihood for the background noise can be obtained by referring to the feature value obtained in step S50, the average value of the normalized channel gain obtained in step S40, the data ($\Phi^{1,S}, \ldots, \Phi^{E,S}$) from the plurality of pre-trained event sound HMMs, and the data ($\Phi^D$) from the background noise HMM.

In step S62, the ratio of the likelihood of the event sound to the likelihood ($L_i^D$) of the background noise HMM is calculated. If this ratio is greater than a predetermined value, it can be determined that the sound signal contains a specific event. The expression is as follows.

The finally obtained result value ($Flag_i^e$) is either e or 0, and it indicates whether there is an abnormal situation in the current frame.

Here, the threshold in the equation is a preset value. When the detected result $Flag_i^e$ is e, as in Equation (11), it can be determined that the i-th frame includes an abnormal sound. If the result value is 0, it can be determined that only background noise exists in the i-th frame.

If it is determined that the i-th frame includes the abnormal sound through the comparison of the likelihood with the reference value as described above, it can be determined that an abnormal sound exists in the currently input signal.
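The weighting and the final ratio test can be sketched as follows. The likelihood equations and Equation (11) are not reproduced in this text, so several choices here are assumptions: likelihoods are taken in the linear (not log) domain, the mean channel gains are normalized by dividing by their sum, the best-scoring event is chosen before thresholding, and the names `detect_event`, `mean_gains`, and `thr` are hypothetical.

```python
def detect_event(event_likelihoods, mean_gains, background_likelihood, thr):
    """event_likelihoods: dict e -> HMM likelihood of event e (e must be nonzero).
    mean_gains: dict e -> mean channel gain of event e.
    Returns the detected event number e, or 0 for background noise only."""
    total_gain = sum(mean_gains.values())
    best_e, best_ratio = 0, 0.0
    for e, likelihood in event_likelihoods.items():
        weight = mean_gains[e] / total_gain          # normalized channel gain
        ratio = weight * likelihood / background_likelihood
        if ratio > best_ratio:
            best_e, best_ratio = e, ratio
    return best_e if best_ratio > thr else 0

# Hypothetical values for two candidate events in one frame.
likelihoods = {1: 0.02, 2: 0.004}
gains = {1: 0.825, 2: 0.125}
result_event = detect_event(likelihoods, gains, background_likelihood=0.005, thr=1.0)
result_noise = detect_event(likelihoods, gains, background_likelihood=0.05, thr=1.0)
```

With the quieter background model the weighted ratio for event 1 is about 3.5 and the frame is flagged as event 1; with the stronger background model the ratio falls to about 0.35 and the result is 0.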

FIG. 6 is a graph comparing the performance of sound detection according to an embodiment of the present invention with the conventional art, using the F-measure as an index of sound detection accuracy. The F-measure is an accuracy indicator that combines precision and recall, which trade off against each other, and is also referred to as their weighted harmonic mean.
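For reference, the weighted harmonic mean of precision and recall is usually written as $F_\beta = (1+\beta^2)PR / (\beta^2 P + R)$, which reduces to $2PR/(P+R)$ when $\beta = 1$. The text does not state which weight was used in the evaluation, so the default `beta=1.0` below is an assumption, as are the function name and the count-based inputs.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    tp, fp, fn: counts of true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: precision 0.5 and recall 0.5 give F1 = 0.5.
f1_example = f_measure(tp=50, fp=50, fn=50)
```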

FIG. 6(a) compares detection using the GMM algorithm, detection using NMF decomposition, and detection based on the event sound likelihood with the channel gain taken into account as in the embodiment. As shown in the graph, the F-measure is highest, at 0.5042, when the sound detection method of the embodiment is applied, and values close to 0.4 are obtained in the remaining cases.

FIG. 6(b) is a graph showing the relative improvement rate of the F-measure. Compared with the conventional method using the GMM algorithm, the embodiment improved the F-measure by about 23.64%, and compared with the conventional method using the NMF algorithm, the embodiment improved it by about 31.30%. The accuracy of detecting abnormal event sounds is thus remarkably improved.
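The two sets of figures can be cross-checked with simple arithmetic. The text does not define the relative improvement rate or report the baseline F-measures, so the sketch assumes the usual definition, (new − old) / old, and back-computes the baselines it implies. The function name is hypothetical.

```python
def implied_baseline(f_new, improvement_percent):
    """Baseline F-measure implied by a relative improvement of
    improvement_percent, assuming improvement = (f_new - f_old) / f_old."""
    return f_new / (1.0 + improvement_percent / 100.0)

gmm_baseline = implied_baseline(0.5042, 23.64)   # roughly 0.408
nmf_baseline = implied_baseline(0.5042, 31.30)   # roughly 0.384
```

Under that assumption both implied baselines land near 0.4, which agrees with the statement that the remaining cases in FIG. 6(a) give values close to 0.4.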

FIG. 7 is a graph comparing the sound detection performance of the embodiment at different SNR values. Referring to FIG. 7, the F-measures of the same methods as in FIG. 6 are compared at SNRs of 6 dB, 0 dB, and -6 dB.

The proposed method has a higher F-measure than the methods using GMM and NMF at every SNR tested (6 dB, 0 dB, and -6 dB). In particular, in the -6 dB environment, where the SNR is relatively poor, the conventional methods show hardly any detection accuracy, whereas the embodiment shows an F-measure of 0.2588 to 0.3448, which can be judged to be remarkable even in an environment where the background noise dominates.

The sound detection method of the embodiment takes a stereo input signal and uses an NTF algorithm to decompose it into a combination of several tensors. Using the channel gain tensor among them for acoustic event detection can improve the detection accuracy for abnormal sounds even in an environment where several noises are mixed.

The present invention considers the channel gain through the NTF algorithm, and at the same time, compares the likelihoods of the background noise and the event sound through the hidden Markov model, thereby making it possible to more accurately and reliably detect a specific sound intended by the user from among the input signals.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments and that various modifications and applications other than those described above are possible. For example, each component specifically shown in the embodiments of the present invention can be modified and implemented. All changes and modifications that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

1. A method for detecting an abnormal acoustic event using a stereo input signal, the method comprising:
Receiving a stereo input acoustic signal;
Converting the input acoustic signal into a time-frequency domain;
Decomposing the converted input acoustic signal using a non-negative tensor factorization (NTF) algorithm;
Primarily determining whether an event sound is generated by referring to the channel gain value;
Extracting feature values for the identified event sound signals;
Performing HMM classification on the extracted event sound signal; and
Finally determining whether an abnormal acoustic event has occurred according to the HMM classification.

2. The method according to claim 1,
Wherein the step of converting the input acoustic signal into a time-frequency domain comprises performing a short-time Fourier transform (STFT) on the stereo input acoustic signal and then converting the result into a mel-amplitude spectrum signal.

3. The method of claim 2,
Wherein the mel-amplitude spectrum signal is decomposed into a tensor composed of a combination of a channel gain, a frequency gain, and a time gain.

4. The method of claim 3,
Wherein the channel gain and the time gain are continuously updated by repetitive execution of an update rule.

5. The method according to claim 1,
Wherein the step of determining whether an event sound is generated based on the channel gain value comprises:
Calculating an average of all the channel gains, calculating a ratio value of an arbitrary channel gain with respect to the average of all the channel gains, and comparing the ratio value with a preset threshold value.

6. The method of claim 5,
Wherein, when the ratio value is greater than the preset threshold value, only the corresponding event sound signal is passed to the subsequent classification algorithm, and when the ratio value is smaller than the threshold value, the corresponding signal is determined to be background noise.

7. The method according to claim 1,
Wherein the step of extracting feature values for the determined event sound signals comprises:
Converting the signals filtered by the channel gain obtained by the NTF into a feature vector composed of MFCCs.

8. The method according to claim 1,
Wherein the step of performing HMM classification on the extracted event sound signal comprises:
Deriving a likelihood for the detected event sound using a pre-trained acoustic event HMM and a background noise HMM.

9. The method of claim 8,
Wherein the likelihood for the event sound and the likelihood for the background noise are derived by calculating probability values from the acoustic event HMM, the background noise HMM, and the feature values.

10. The method of claim 9,
Wherein the channel gains of the detected event sounds are averaged and normalized, and the normalized value is multiplied by the likelihood for the event sound to obtain a likelihood for the event sound weighted by the channel gain.

11. The method of claim 9,
Wherein the likelihood for a final event sound and the likelihood for the background noise are derived by referring to data from a pre-trained acoustic event HMM and data from a background noise HMM.

12. The method of claim 10,
Wherein a ratio of the likelihood for the final event sound to the likelihood for the background noise is calculated, and the sound signal is determined to include a specific event when the ratio is greater than a preset value.