KR101741418B1 - A method for recognizing sound based on acoustic feature extraction and probabillty model
- Publication number
- KR101741418B1 (application KR1020150056272A)
- Authority
- KR
- South Korea
- Prior art keywords
- sound
- impact sound
- model
- impact
- acoustic
- Prior art date
Classifications
-
- G—PHYSICS
- G08—SIGNALLING
- G08B—SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
- G08B13/00—Burglar, theft or intruder alarms
- G08B13/16—Actuation by interference with mechanical vibrations in air or other fluid
- G08B13/1654—Actuation by interference with mechanical vibrations in air or other fluid using passive vibration detection systems
- G08B13/1672—Actuation by interference with mechanical vibrations in air or other fluid using passive vibration detection systems using sonic detecting means, e.g. a microphone operating in the audio frequency range
-
- G—PHYSICS
- G08—SIGNALLING
- G08B—SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
- G08B21/00—Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
- G08B21/18—Status alarms
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Abstract
The present invention relates to a sound recognition method based on acoustic feature extraction and a probability model for recognizing continuous impact sounds. More particularly, it relates to a method of recognizing a specific impact sound, determining from it whether a dangerous situation has occurred, and controlling a digital apparatus or appliance accordingly.
Description
The present invention relates to a sound recognition method based on acoustic feature extraction and a probability model for recognizing continuous impact sounds. More particularly, it relates to a method of recognizing a specific impact sound, determining from it whether a dangerous situation has occurred, and controlling a digital apparatus or appliance accordingly.
Conventional acoustic feature extraction methods and acoustic recognition model generation methods are designed for speech recognition: they extract acoustic features from the human voice and build probability models on those features. There is therefore a limit to applying the existing methods to the purpose of recognizing a specific sound generated by an object and inferring the surrounding situation from it.
The present invention solves the conventional problems described above by providing a method of judging a dangerous situation by recognizing an unexpected impact sound, and a method of controlling an apparatus or appliance through intentionally generated impact sounds. That is, an unexpected impact sound occurring in a confined space can be judged an emergency and recognized remotely, or intentionally generated consecutive clapping sounds can be used to control a specific object. For example, a light can be turned on or off by producing consecutive impact sounds; or, when a window is broken in an ordinary home, the event can be reported as an emergency to a remote monitoring and control system.
The means for achieving this object comprises a training module, which extracts the features to be used from the collected impact sound database, trains a Gaussian Mixture Model (GMM) on the extracted feature vectors, and trains a second GMM in the same way on a database of various general sounds; and a recognizer module, which detects abrupt changes in the input signal, extracts acoustic features from a fixed-length segment of the signal around each detected change, and performs a Likelihood Ratio Test (LRT) on the segment to verify whether it is an impact sound.
By providing a method of recognizing unexpected impact sounds to judge dangerous situations, and a method of intentionally generating impact sounds to control an apparatus or appliance, the present invention makes it possible to recognize an emergency remotely when an unexpected impact sound occurs, to control a specific object by intentionally generating consecutive impact sounds, and to report emergencies to a remote monitoring and control system.
FIG. 1 is a configuration diagram of the training unit module according to the present invention;
FIG. 2 is a configuration diagram of the recognition unit module according to the present invention;
FIG. 3 shows examples of linear-scale and mel-scale filter banks;
FIG. 4 is a flow chart of the LFCC and MFCC feature extraction process;
FIG. 5 shows an example of impact sound starting point detection; and
FIG. 6 is a state diagram for counting impact sounds.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention will now be described in detail with reference to the accompanying drawings.
As shown in FIGS. 1 and 2, the impact sound recognition system is divided into a training unit module and a recognition unit module. The training unit extracts the acoustic features to be used from the collected impact sound database and trains a Gaussian Mixture Model (GMM) on the extracted feature vectors; a second GMM is trained in the same way on a database of various general sounds so that sounds other than impact sounds can be distinguished. The recognition unit detects abrupt signal changes in the input signal, extracts features from a fixed-length segment of signal starting at each detected change, performs a Likelihood Ratio Test (LRT) to verify whether the segment is an impact sound, and finally determines whether the sequence of verified impact sounds constitutes a command, outputting the final recognition result.
1. Training Unit
As shown in FIG. 1, the training unit extracts acoustic features from a database of impact sounds and from a database of general sounds frequently encountered in home environments, such as conversation and music, and then trains a GMM for each using a machine learning method.
1.1 Database for statistical model learning
To train the statistical models, an impact sound database and a general sound database were constructed and used. The impact sound database consists of 110 seconds of training data recorded in a relatively quiet environment. The general sound database comprises about 150 minutes of data chosen to reflect a realistic home environment: 50 minutes of TV, 8 minutes of music, 61 minutes of conversation, and 30 minutes of miscellaneous indoor sounds. Because the general sound database exists to expose the probability model to as wide a variety of sounds as possible, building it poses no particular difficulty beyond the number of sounds to consider. The impact sounds, by contrast, are very short in duration: if the recorded files were used as-is, the silence and background noise surrounding each impact would be reflected in the model more than the impact itself. To avoid this problem, only the impact sound regions were cut out of each recording and used for statistical model training, as the sketch below illustrates.
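Since only the impact regions are used for training, the trimming can be done with a simple per-frame energy gate. The following is a minimal sketch of one such procedure; the patent does not specify how the regions were cut, so the frame length and the relative threshold `rel_thr` are illustrative assumptions.

```python
import numpy as np

def trim_impact_regions(x, fs, win_ms=25, rel_thr=0.1):
    """Keep only the high-energy frames of a recording -- one possible way
    to cut the impact sound regions out of a file, per section 1.1.
    `rel_thr` is a fraction of the peak frame energy (an assumed value)."""
    win = int(fs * win_ms / 1000)
    n_frames = len(x) // win
    energy = np.array([np.sum(x[i * win:(i + 1) * win] ** 2)
                       for i in range(n_frames)])
    keep = energy > rel_thr * energy.max()
    frames = [x[i * win:(i + 1) * win] for i in range(n_frames) if keep[i]]
    return np.concatenate(frames) if frames else x[:0]
```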
1.2 Acoustic Feature Extraction
To construct the acoustic models of the impact sound and the general sound, two feature extraction methods expressing the characteristics of the acoustic data are used: Mel-Frequency Cepstral Coefficients (MFCC) and Linear Frequency Cepstral Coefficients (LFCC). Both extract features using a filter bank in the frequency domain; the difference is that MFCC applies a nonlinear mel-scale filter bank to the spectrum, whereas LFCC applies a linearly spaced filter bank.
MFCC and LFCC are extracted as follows. In this sound recognition system, the frame size is 25 ms and features are extracted every 12.5 ms. First, pre-emphasis is applied to each frame of the input signal using Equation (1), the Hamming window of Equation (2) is applied, and the Fourier transform is computed.
$$x_p(n) = x(n) - a\,x(n-1) \qquad (1)$$

$$x_w(n) = x_p(n)\left(0.54 - 0.46\cos\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (2)$$
Here N represents the total number of samples in the frame, $x_w(n)$ is the signal after applying the Hamming window, and $a$ is generally a value between 0.95 and 0.98. The filter bank corresponding to each scale in FIG. 3 is then applied to obtain the MFCC or LFCC filter-bank characteristics, and the logarithm is taken, as in Equation (3).
$$E(m) = \log\!\left(\sum_{k} |X(k)|\, H_m(k)\right), \quad m = 1, \dots, B \qquad (3)$$
Here $X(k)$ is the frequency-domain signal after the FFT, $H_m(k)$ denotes the m-th triangular filter in the filter bank, $k$ is the frequency bin index, and B is the number of filters; 26 filter banks are used for both MFCC and LFCC. The reason a filter bank is applied, rather than using the frequency bins of the magnitude spectrum directly, is that frequency-band energies are more robust than individual bin values and retain their character even when noise is mixed into the signal.
Finally, 15th-order coefficients are extracted by applying the DCT of Equation (4). The MFCC and LFCC extraction procedures are summarized in FIG. 4. Because the DCT is applied to the logarithm of the frequency-domain signal, the result captures the envelope of the spectrum, so the frequency-domain characteristics can be represented efficiently by only a few low-order cepstral coefficients. A numpy sketch of Equations (1) through (4) follows Equation (4).
$$c(i) = \sum_{m=1}^{B} E(m)\cos\!\left(\frac{\pi\, i\,(m - 0.5)}{B}\right), \quad i = 1, \dots, 15 \qquad (4)$$
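The pipeline of Equations (1) through (4) can be sketched in a few lines of numpy. The sketch below assumes a 16 kHz sampling rate and a 512-point FFT, neither of which is stated in the text; the filter bank shown is the linearly spaced one (LFCC), and substituting mel-spaced edges over the same range would give the MFCC variant.

```python
import numpy as np

def linear_filterbank(B, n_fft, fs):
    """B triangular filters linearly spaced from 0 to fs/2 (the LFCC bank);
    mel-spaced edges over the same range would give the MFCC bank."""
    edges = np.linspace(0, fs / 2, B + 2)
    bins = np.floor((n_fft // 2) * edges / (fs / 2)).astype(int)
    fb = np.zeros((B, n_fft // 2 + 1))
    for m in range(1, B + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

def cepstral_features(frame, fbank, n_fft=512, a=0.97, n_ceps=15):
    """Equations (1)-(4): pre-emphasis, Hamming window, FFT, log filter-bank
    energies, then DCT to the first 15 cepstral coefficients."""
    pre = np.append(frame[0], frame[1:] - a * frame[:-1])  # (1) x(n) - a*x(n-1)
    xw = pre * np.hamming(len(pre))                        # (2) Hamming window
    mag = np.abs(np.fft.rfft(xw, n_fft))                   # magnitude spectrum
    E = np.log(fbank @ mag + 1e-10)                        # (3) log energies
    B = fbank.shape[0]
    m = np.arange(1, B + 1)
    return np.array([np.dot(E, np.cos(np.pi * i * (m - 0.5) / B))  # (4) DCT
                     for i in range(1, n_ceps + 1)])

# Framing as described in the text: 25 ms frames, 12.5 ms shift (fs assumed).
fs, frame_len, shift = 16000, 400, 200
fbank = linear_filterbank(26, 512, fs)
```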
Finally, Delta features are appended to the feature vectors obtained above in order to capture their time-varying characteristics. The Delta feature is constructed from the feature vectors of the preceding and following frames, and is obtained by the following equation.
$$\Delta c_i = \frac{c_{i+1} - c_{i-1}}{2} \qquad (5)$$
Here $\Delta c_i$ and $c_i$ denote the Delta feature and the original MFCC or LFCC feature vector of the $i$-th frame, respectively. As the equation shows, the Delta feature takes the signal in adjacent frames into account through the frames immediately before and after the current one. The obtained Delta feature is appended to the 15-dimensional MFCC or LFCC vector obtained first, yielding the final 30-dimensional feature vector, as the short sketch below illustrates.
1.3 Gaussian Mixture Model Training
To construct the statistical models of the impact sound and the general sound used by the impact sound recognizer, MFCC and LFCC features are extracted from the impact sound and general sound databases and a Gaussian mixture model is trained on each. Because the Gaussian mixture model estimates the probability distribution of the data with several Gaussian probability density functions, it can effectively represent the general sound class, which spans diverse sounds such as speech and music, as well as the impact sound class. The Gaussian mixture model was therefore adopted as the statistical model for both the impact sound and the general sound.
The Gaussian mixture model is a representative statistical learning technique that estimates the probability density function of the data from training data. It is defined as follows.
$$p(\mathbf{x} \mid \lambda) = \sum_{k=1}^{M} w_k\, g_k(\mathbf{x}) \qquad (6)$$
Here $\mathbf{x}$ is a D-dimensional random vector, namely the audio feature vector; $g_k(\mathbf{x})$ is the density of each element constituting the mixture; and $w_k$ is the weight of the k-th element density, which must satisfy $\sum_{k=1}^{M} w_k = 1$.
The D-dimensional probability density function of the kth Gaussian function is calculated as follows.
$$g_k(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}\, |\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^{\top} \Sigma_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right) \qquad (7)$$
Here $\boldsymbol{\mu}_k$ and $\Sigma_k$ denote the mean vector and the covariance matrix, respectively. The Gaussian mixture density is thus defined by the parameters consisting of the mean vectors, covariance matrices, and mixture weights of the M element densities. These parameters are denoted as follows.
$$\lambda = \{\, w_k,\ \boldsymbol{\mu}_k,\ \Sigma_k \,\}, \quad k = 1, \dots, M \qquad (8)$$
The Gaussian mixture model takes various forms depending on the choice of covariance matrix. In this work, the diagonal covariance form is used for its computational advantages.
The maximum likelihood (ML) estimation method is used to learn the parameters of the Gaussian mixture model. The purpose of ML estimation is to find the model parameters that maximize the likelihood of the GMM for the given training data. For a sequence of T training vectors $X = \{\mathbf{x}_1, \dots, \mathbf{x}_T\}$, the likelihood of the GMM can be expressed as follows.

$$p(X \mid \lambda) = \prod_{t=1}^{T} p(\mathbf{x}_t \mid \lambda) \qquad (9)$$
Equation (9) is a nonlinear function of the parameters $\lambda$ and cannot be maximized directly, so the parameters are estimated with the Expectation-Maximization (EM) algorithm. The estimation of each parameter using the EM algorithm is as follows.

1. Mixture weights
$$\hat{w}_k = \frac{1}{T} \sum_{t=1}^{T} \Pr(k \mid \mathbf{x}_t, \lambda) \qquad (10)$$
2. Mean vector
$$\hat{\boldsymbol{\mu}}_k = \frac{\sum_{t=1}^{T} \Pr(k \mid \mathbf{x}_t, \lambda)\, \mathbf{x}_t}{\sum_{t=1}^{T} \Pr(k \mid \mathbf{x}_t, \lambda)} \qquad (11)$$
3. The covariance matrix
$$\hat{\sigma}_k^2 = \frac{\sum_{t=1}^{T} \Pr(k \mid \mathbf{x}_t, \lambda)\, \mathbf{x}_t^2}{\sum_{t=1}^{T} \Pr(k \mid \mathbf{x}_t, \lambda)} - \hat{\boldsymbol{\mu}}_k^2 \qquad (12)$$

where the squares are taken elementwise, for the diagonal covariance form used here.
Here, the posterior probability for class k is obtained as follows.
$$\Pr(k \mid \mathbf{x}_t, \lambda) = \frac{w_k\, g_k(\mathbf{x}_t)}{\sum_{j=1}^{M} w_j\, g_j(\mathbf{x}_t)} \qquad (13)$$
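In practice the EM updates of Equations (10) through (13) need not be coded by hand; scikit-learn's `GaussianMixture` runs the same algorithm. The sketch below is one possible setup, not the patent's implementation: the library choice, the number of mixtures `M`, and the placeholder feature matrices are all assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder stand-ins for the 30-dimensional feature matrices of
# section 1.2, one row per frame (real features would come from the
# impact sound and general sound databases).
X_impact = np.random.randn(2_000, 30)
X_general = np.random.randn(20_000, 30)

M = 16  # number of mixture components -- not stated in the text
gmm_impact = GaussianMixture(n_components=M, covariance_type="diag").fit(X_impact)
gmm_general = GaussianMixture(n_components=M, covariance_type="diag").fit(X_general)
```

`gmm_impact.score_samples(x)` then returns per-frame log-likelihoods, which is exactly what the ratio test of section 2.3 consumes.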
2. Recognition
As shown in FIG. 2, the recognition unit first performs Abrupt Sound Detection (ASD) on the acoustic signal input through the microphone: frames in which the signal of a specific frequency band changes abruptly are detected as impact sound candidates, and features are extracted from a fixed-length segment of signal starting at each detected frame. The likelihoods of this segment under the impact sound Gaussian mixture model and the general sound Gaussian mixture model are then computed, and a Likelihood Ratio Test (LRT) determines whether the input signal is a single impact sound. Finally, using the verification results, the time intervals and number of the verified impact sounds are examined so that only sequences of two or three impact sounds, corresponding to the target command sounds, are detected.
2.1 Abrupt Sound Detection (ASD)
The ASD module aims to detect the starting point of an impact sound. An impact sound is characterized by acoustic features that appear suddenly and persist only briefly. Sudden signals are therefore judged to be potential impact sounds; after a starting point is detected, an additional verification step determines whether the signal detected by the ASD really is an impact sound.
ASD first considers the signal change in the spectral region where the power of an impact sound is highest. It performs an FFT on each frame and sums the magnitude values from 1.5 kHz to 5 kHz. This sum is compared with the values of the preceding frames to detect abrupt changes in the input signal, as expressed in the following equations.
$$P_i = \sum_{k=w_1}^{w_2} |X_i(k)| \qquad (14)$$

$$R_i = \max\!\left(\frac{P_i}{P_{i-1}},\ \frac{P_i}{P_{i-2}}\right) \qquad (15)$$
Here $P_i$ represents the intensity of the signal in the impact sound frequency region of the $i$-th frame, and $w_1$ and $w_2$ denote the frequency bins corresponding to the 1.5 kHz and 5 kHz limits mentioned above. The ratio of $P_i$ to each of the previous two frames is computed and the larger of the two is taken, so that a rise within two frames is captured; $R_i$ is then compared with a threshold to detect the starting point of an impact sound. In this system the threshold is set to 7. The important point is the effect of this threshold. Raising it reduces the chance that a general sound is selected as an impact sound candidate, but also reduces the chance that a real impact sound is selected in a noisy environment. Conversely, lowering the threshold too far increases the probability that general sounds are selected as candidates, which raises the recognizer's error rate. It is therefore advisable to set this threshold slightly high, so that the recognizer's error rate is reduced and only clear impact sounds trigger a response. This has the advantage of increasing the system's reliability at the cost of merely asking the user to make the impact sound slightly louder. In a purely probability-based recognition system, by contrast, raising the threshold increases the error rate, because obtaining a high probability value requires producing a sound as similar as possible to the training model, which is difficult for the user. In this impact sound recognizer, however, the preprocessing step of filtering impact sound candidates by power gives the user the chance to build a more stable system with only a little care.

A further feature of the ASD module is the following. If every frame whose $R_i$ exceeds the threshold were selected as an impact sound starting point, the result would be too noisy. Therefore, if another threshold exceedance occurs within 200 ms, the event is judged to be caused by noise or some other source rather than by an impact sound; that starting point is not confirmed, and the candidate moves to the next frame exceeding the threshold. These cases are summarized in FIG. 5. With this rule, only one impact sound starting point is found within the 200 ms window, the noisy character of impact sounds is suppressed, and a starting point is detected only for signals that can be confidently judged to be impact sounds. A sketch of this detection procedure follows.
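A sketch of the ASD rule of Equations (14) and (15) with the 200 ms disambiguation described above; the exact form of the ratio and of the within-200-ms rule follow my reading of the text, and the frame shift is assumed to be 12.5 ms.

```python
import numpy as np

def detect_onsets(frames, fs=16000, n_fft=512, thr=7.0, guard=16):
    """frames: iterable of time-domain frames (12.5 ms shift assumed, so
    `guard`=16 frames is roughly 200 ms). P_i sums the magnitude spectrum
    between 1.5 and 5 kHz (eq. (14)); a frame is a candidate when the
    larger of its ratios to the previous two frames exceeds `thr`
    (eq. (15)). A candidate followed by another exceedance within the
    guard window is discarded as noise."""
    w1, w2 = int(1500 * n_fft / fs), int(5000 * n_fft / fs)
    P = [np.abs(np.fft.rfft(f, n_fft))[w1:w2 + 1].sum() for f in frames]
    cand = [i for i in range(2, len(P))
            if max(P[i] / (P[i - 1] + 1e-10),
                   P[i] / (P[i - 2] + 1e-10)) > thr]
    return [i for j, i in enumerate(cand)
            if j + 1 == len(cand) or cand[j + 1] - i >= guard]
```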
2.2 Acoustic Feature Extraction
MFCC or LFCC features are extracted from the 150 ms of signal following the impact sound starting point detected by the ASD module. Because the characteristics of an impact sound do not persist for long, this window generally covers the duration of the sound after its onset.
2.3 Impact Sound Verification Based on the Likelihood Ratio Test
Using the feature vector sequence extracted from the 150 ms segment detected by the ASD module, log-likelihoods are computed against the Gaussian mixture model of each class built in the training unit to determine whether the segment is a single impact sound. The log-likelihood ratio of the classes for the feature vector sequence is as follows.
$$\Lambda(X) = \frac{1}{T} \sum_{t=1}^{T} \left[\log p(\mathbf{x}_t \mid \lambda_{\text{impact}}) - \log p(\mathbf{x}_t \mid \lambda_{\text{general}})\right] \qquad (16)$$
Here $p(\mathbf{x}_t \mid \lambda_{\text{impact}})$ is the likelihood of the $t$-th frame under the impact sound model, and $p(\mathbf{x}_t \mid \lambda_{\text{general}})$ is its likelihood under the general sound model; T is the total number of frames in 150 ms. Finally, the test sound is judged by comparing the log-likelihood ratio of Equation (16) with a threshold, as in Equation (17).

$$\Lambda(X)\ \begin{cases} \ge \theta_{th} & \text{impact sound} \\ < \theta_{th} & \text{general sound} \end{cases} \qquad (17)$$
Here $\theta_{th}$ denotes the threshold for deciding that a sound is an impact sound. Adjusting this threshold controls the trade-off between false alarms and false rejections, allowing the system to be tailored to the user's needs. In a situation where impact sounds must not be missed, the threshold can be lowered, accepting that some general sounds will be misdetected as impact sounds; conversely, in a situation where general sounds must not be mistaken for impact sounds, the threshold can be raised so that only reliable impact sounds are detected. A sketch of this test follows.
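A sketch of the verification step of Equations (16) and (17), reusing the two GMMs trained above; the default threshold is only a guide taken from the values quoted in section 3, not a prescribed constant.

```python
import numpy as np

def is_impact(features, gmm_impact, gmm_general, theta_th=-2.0):
    """features: (T, 30) feature rows from the 150 ms window. Returns True
    when the average per-frame log-likelihood ratio (eq. (16)) meets the
    threshold theta_th (eq. (17)); -2.0 mirrors the MFCC guide value of
    section 3 and is an assumption."""
    llr = np.mean(gmm_impact.score_samples(features)
                  - gmm_general.score_samples(features))
    return llr >= theta_th
```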
2.4 Impact Sound Counting
In the impact sound counting module, the time intervals between verified impact sounds and their number are analyzed. The module tracks the intervals between frames whose verification result is 1 and counts the number of impact sounds produced as a command. The key rule is that, after one impact sound is detected, the count is advanced only if the next impact sound follows between 200 ms and 500 ms later; if no impact sound input arrives before 500 ms have elapsed, the count accumulated so far is returned. This sequence of operations can be expressed as the state diagram of FIG. 6; a sketch of the state machine follows the state descriptions below.

1) WAIT state: waits for the first impact sound. The instant the verification result becomes 1, the impact sound counter is set to 1 and the state transitions to CLAP1.

2) CLAP1 state: the state after one impact sound has occurred, watching for the next impact sound between 200 ms and 500 ms later. If one occurs within that interval, the counter is set to 2 and the state transitions to CLAP2. If 500 ms elapse without the next impact sound, the counter is set to 0 and the state transitions to OVER, because only sequences of two or three impact sounds are recognized as commands.

3) CLAP2 state: the state after two impact sounds have occurred, watching for the next impact sound between 200 ms and 500 ms later. If one occurs within that interval, the counter is set to 3 and the state transitions to CLAP3. If 500 ms elapse without the next impact sound, the counter is set to 2 and the state transitions to OVER.

4) CLAP3 state: the state after three impact sounds have occurred, watching for the next impact sound between 200 ms and 500 ms later. If one occurs within that interval, the counter is set to 0 and the state transitions to OVER. If 500 ms elapse without the next impact sound, the counter is set to 3 and the state transitions to OVER. This is because sequences of four or more impact sounds are not meaningful.

5) OVER state: returns the counter value accumulated so far and transitions back to WAIT. The counter value returned from the OVER state is one of 0, 2, and 3.
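The state machine above can be collapsed into a short loop over verified onset times. This sketch simplifies FIG. 6 in one respect: gaps shorter than 200 ms are treated here as ending the sequence, which is one reading of the diagram.

```python
def count_claps(onset_times_ms):
    """onset_times_ms: sorted times (ms) of verified impact sounds.
    Consecutive onsets 200-500 ms apart extend a run; any other gap ends
    it. Runs of two or three claps yield 2 or 3; runs of one, or of four
    or more, yield 0, matching the OVER-state return values."""
    counts, run = [], 1
    for prev, cur in zip(onset_times_ms, onset_times_ms[1:]):
        if 200 <= cur - prev <= 500:
            run += 1
        else:
            counts.append(run if run in (2, 3) else 0)
            run = 1
    counts.append(run if run in (2, 3) else 0)
    return counts

# e.g. count_claps([0, 300, 1000, 1300, 1600, 3000]) -> [2, 3, 0] (hypothetical)
```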
3. Threshold Considerations
The impact sound system consists largely of the ASD module, the LRT module, and the impact sound counting module, each of which has several adjustable parameters. Of particular importance is the interplay between the thresholds used in the ASD and LRT modules. As mentioned above, the higher the ASD threshold, the harder it becomes to detect impact sound starting points in noisy conditions. When the ASD threshold is raised, the LRT module contributes little to recognition, because most sounds have already been filtered out by the ASD. Conversely, when the ASD threshold is low, the LRT module's contribution grows: starting points can be found even in noisy environments, but many non-impact signals pass through the ASD and must be filtered out by the LRT. Therefore, when setting these two thresholds, the user should be encouraged to produce impact sounds of a certain loudness, and the ASD threshold should be set to a relatively high value (between 8 and 10). The LRT threshold should then be adjusted to match the sounds that pass the ASD, around -2 ± 0.5 for MFCC and -5 ± 1 for LFCC. This is because, as mentioned above, it is much easier to strike an impact sound hard enough to exceed the value required by the ASD module than to strike one that raises the likelihood value, so the system recognizes sounds more stably this way.
Claims (6)
A sound recognition method based on acoustic feature extraction and a probability model for continuous impact sounds, characterized in that the features expressing the characteristics of the sound data in the impact sound database are extracted using Mel-Frequency Cepstral Coefficients (MFCC) and Linear Frequency Cepstral Coefficients (LFCC).
A sound recognition method based on acoustic feature extraction and a probability model for continuous impact sounds, characterized in that the Gaussian mixture model (GMM) uses the diagonal covariance form.
A sound recognition method based on acoustic feature extraction and a probability model for continuous impact sounds, characterized in that the likelihood ratio test decides whether a detected segment is an impact sound through the log-likelihood ratio, against the Gaussian mixture model of each class built in the training unit, of the feature vector sequence extracted from the fixed-length segment detected by the ASD module.
A sound recognition method based on acoustic feature extraction and a probability model for continuous impact sounds, characterized in that the impact sound counting advances the count only when the next impact sound is detected within a predetermined interval after the preceding one, and returns the accumulated count when no impact sound input arrives before the interval is exceeded.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150056272A KR101741418B1 (en) | 2015-04-22 | 2015-04-22 | A method for recognizing sound based on acoustic feature extraction and probabillty model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150056272A KR101741418B1 (en) | 2015-04-22 | 2015-04-22 | A method for recognizing sound based on acoustic feature extraction and probabillty model |
Publications (2)
Publication Number | Publication Date |
---|---|
KR20160125628A KR20160125628A (en) | 2016-11-01 |
KR101741418B1 (en) | 2017-06-02 |
Family
ID=57484882
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020150056272A KR101741418B1 (en) | 2015-04-22 | 2015-04-22 | A method for recognizing sound based on acoustic feature extraction and probabillty model |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR101741418B1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106504510B (en) * | 2016-11-11 | 2021-07-06 | 青岛海尔智能家电科技有限公司 | Remote infrared control method and device |
KR102066718B1 (en) * | 2017-10-26 | 2020-01-15 | 광주과학기술원 | Acoustic Tunnel Accident Detection System |
CN111564163B (en) * | 2020-05-08 | 2023-12-15 | 宁波大学 | RNN-based multiple fake operation voice detection method |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101251373B1 (en) * | 2011-10-27 | 2013-04-05 | 한국과학기술연구원 | Sound classification apparatus and method thereof |
- 2015-04-22: Application KR1020150056272A granted as patent KR101741418B1 (active, IP Right Grant)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101251373B1 (en) * | 2011-10-27 | 2013-04-05 | 한국과학기술연구원 | Sound classification apparatus and method thereof |
Also Published As
Publication number | Publication date |
---|---|
KR20160125628A (en) | 2016-11-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant |