CN109785857B - Abnormal sound event identification method based on MFCC + MP fusion characteristics - Google Patents

Abnormal sound event identification method based on MFCC + MP fusion characteristics

Info

Publication number
CN109785857B
Authority
CN
China
Prior art keywords
sound
abnormal
time
frame
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910153124.6A
Other languages
Chinese (zh)
Other versions
CN109785857A (en)
Inventor
罗丽燕
李芳足
王玫
仇洪冰
宋浠瑜
周陬
覃泓铭
韦金泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN201910153124.6A priority Critical patent/CN109785857B/en
Publication of CN109785857A publication Critical patent/CN109785857A/en
Application granted granted Critical
Publication of CN109785857B publication Critical patent/CN109785857B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Image Analysis (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention discloses an abnormal sound event identification method based on MFCC + MP fusion features, comprising the following steps: 1) first sound preprocessing; 2) first sound feature extraction; 3) classifier training; 4) measured sound input; 5) second sound preprocessing; 6) second feature extraction; 7) classifier application; 8) detection result output. The method has good noise robustness, can effectively detect abnormal sounds in sound signals in low signal-to-noise-ratio environments, addresses the blind-spot problem of video surveillance, and provides useful support for security work.

Description

Abnormal sound event identification method based on MFCC + MP fusion characteristics
Technical Field
The invention relates to the technical field of sound signal identification, in particular to the detection and identification of abnormal sounds such as gunshots, screams, and glass breaking for monitoring abnormal events in public places, and specifically to an abnormal sound event identification method based on fused Mel-frequency cepstral coefficient (MFCC) and Matching Pursuit (MP) features.
Background
Human exploration of speech signal recognition began in the 1950s. In 1952, researchers at AT&T Bell Laboratories built a speaker-dependent recognition system for isolated English digits; implemented with analog electronics, it extracted the formant information of the vowels in digit pronunciations and recognized isolated digits from a specific speaker by simple template matching. After the 1960s, speech recognition technology advanced considerably: Vintsyuk in the Soviet Union proposed using dynamic programming to align two asynchronous speech sequences, and in the 1970s the Japanese scholar Sakoe proposed the dynamic time warping (DTW) algorithm, which effectively solved the problem of speech signals of unequal length. By the 1980s, the introduction of the statistics-based Hidden Markov Model (HMM) and of MFCC features brought speech signal recognition into a period of rapid development, and technologies applied to speech recognition were subsequently studied widely. In the 1990s, with the development of computer technology and the pursuit of more convenient living, human perception of environmental sounds gave rise to a series of studies: the hope was that sound signals collected by a sound sensor on a robot body, processed by a chain of environmental sound perception techniques, would let the robot distinguish the current environment and the events occurring around it, giving it a degree of environmental awareness. Early work along these lines was done by Sawhney and Maes of the Massachusetts Institute of Technology, who achieved 68% classification accuracy by extracting features from sounds and classifying sound scenes with recurrent neural networks and the K-nearest-neighbor algorithm. Researchers in the same laboratory later tackled sound scene classification for continuous sound streams, classifying the extracted sound features with an HMM to obtain preliminary recognition results. Since the beginning of the 21st century, researchers have turned to psychoacoustics and proposed a series of local and global features: Eronen et al. extracted MFCC features from sound signals, used a GMM to describe the feature distribution, and introduced a Hidden Markov Model (HMM) to capture the temporal variation of the GMM, providing a better solution for environmental sound perception. In 2008, the National Natural Science Foundation of China launched the major research plan "Cognitive Computing of Visual and Auditory Information", which starts from the study of human auditory cognition mechanisms, establishes mathematical models, addresses problems such as machine learning and understanding of perceptual data, and aims to build an intelligent unmanned-vehicle platform. In recent years, international competitions on sound event detection, such as the DCASE Challenge, have been held with the aim of finding effective solutions, on a global scale, to sound scene classification and other important problems in the field of sound event detection.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an abnormal sound event identification method based on MFCC + MP fusion features. The method has good noise robustness, can effectively detect abnormal sounds in sound signals in low signal-to-noise-ratio environments, addresses the blind-spot problem of video surveillance, and provides useful support for security work.
The technical scheme for realizing the purpose of the invention is as follows:
an abnormal sound event identification method based on MFCC + MP fusion characteristics is different from the prior art and comprises the following steps:
1) first sound preprocessing: a series of digital processing steps are applied to the sound signals in the sound library to make the signal distribution more stable and to facilitate subsequent sound feature extraction. The first sound preprocessing comprises normalization, framing, and windowing. Normalization scales the collected sound signal to between -1 and 1, which facilitates subsequent processing of the sound signal and training of the neural network. Framing divides a sound signal into a group of short, equal-length time frames; at a sampling frequency of 44.1 kHz, 1024 points are taken as one frame, and two adjacent frames overlap, the overlap being called the frame shift, whose value is given by a formula shown only as an image in the original document. Because ambient sound signals are generally non-stationary but exhibit short-time stationarity, i.e., they appear stationary within 10-30 ms, framing strengthens the stationary character of the sound signal. Windowing makes two adjacent frames smoother and more continuous and reduces spectral leakage; a Hamming window is used;
2) first sound feature extraction: first, the 12-order MFCCs of each frame are extracted; then each frame of the sound signal is decomposed with the MP algorithm over a dictionary of Gabor atoms of the form

$$g_{(s,u,\omega,\theta)}(t) = \frac{1}{\sqrt{s}}\, g\!\left(\frac{t-u}{s}\right)\cos(\omega t + \theta), \qquad g(t) = e^{-\pi t^{2}},$$

where s, u, ω and θ denote the scale (size), time shift, frequency and phase of the atom, respectively, and the parameter indices are positive integers, taking s = 2^p (1 ≤ p ≤ 8); u ∈ {0, 64, 128}; ω = K·i^{2.6} (1 ≤ i ≤ 35) with K = 0.5 × 35^{-2.6}; and θ = 0. The s and ω parameters of the first five atoms, together with the mean and variance of the atoms, are concatenated with the MFCCs as the feature vector of the frame. The feature vector of each frame of the sound segment is then computed, and its first- and second-order difference parameters are taken as dynamic supplementary features; finally, 60 frames are taken as the feature representation of the segment. The classic MFCC features serve as the main sound features, while the MP algorithm supplies a time-frequency representation that is more robust to noise as supplementary features; the flexibility, robustness and physical interpretability of the fused features improve the detection and classification of abnormal sound events in low signal-to-noise-ratio environments;
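For clarity, the MP decomposition referred to above follows the standard greedy iteration of Mallat and Zhang over the Gabor dictionary D: starting from the frame x as the initial residual, each step selects the atom most correlated with the residual and removes its projection, stopping here after five atoms whose (s, ω) parameters enter the feature vector. This is a reference formulation of the generic algorithm, not text from the patent:

```latex
% Standard matching-pursuit iteration over the Gabor dictionary D.
\begin{aligned}
R^{0}x &= x,\\
g_{\gamma_n} &= \arg\max_{g_\gamma \in D} \left|\langle R^{n}x,\, g_\gamma\rangle\right|,\\
R^{n+1}x &= R^{n}x - \langle R^{n}x,\, g_{\gamma_n}\rangle\, g_{\gamma_n}, \qquad n = 0,\dots,4.
\end{aligned}
```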
3) classifier training: the classifier is a convolutional neural network comprising, connected in sequence, convolutional layer c1, pooling layer s1, convolutional layer c2, pooling layer s2, fully-connected layer f1, and an output layer (out layer). In the output layer, a noise category is added alongside the abnormal sound categories to be identified; this category is trained with ambient noise and other sounds so that the classifier assigns input ambient noise and other sounds to it, which reduces the interference of sounds other than the abnormal sounds on the classification result and lowers the false detection rate. When the classifier is trained, a sound library containing a suitable amount of noise is used to train the neural network, which strengthens its generalization ability. The linear convolution kernels of the convolutional neural network are well suited to sound features in time-frequency representation and can extract higher-level, more discriminative features, helping to improve the recognition accuracy for abnormal sound events;
4) measured sound input: collect the measured sound;
5) second sound preprocessing: the measured sound is first normalized, framed and windowed, and then denoised. Denoising uses magnitude spectral subtraction: the short-time magnitude spectrum of pre-collected noise is subtracted from the short-time magnitude spectrum of each frame. The algorithm is simple and easy to implement and helps improve the method's recognition accuracy for abnormal sounds;
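In formula form, writing |Y_k(ω)| for the short-time magnitude spectrum of frame k and |N̂(ω)| for the pre-collected noise magnitude spectrum, the subtraction can be stated as below; the half-wave rectification (flooring at zero) and reuse of the noisy phase are common conventions that the patent does not spell out:

```latex
% Magnitude spectral subtraction; the noisy phase \angle Y_k(\omega)
% is reused when resynthesizing the denoised frame.
|\hat S_k(\omega)| = \max\bigl(|Y_k(\omega)| - |\hat N(\omega)|,\; 0\bigr),
\qquad
\hat S_k(\omega) = |\hat S_k(\omega)|\, e^{\,j\angle Y_k(\omega)}.
```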
6) second feature extraction: the feature extraction for the measured sound is identical to the first feature extraction method in step 2);
7) classifier application: with the 60-frame sound features as the identification features, the classifier trained in step 3) identifies the first 60 frames of the sound segment; if no abnormal sound is identified, the window is moved back 60 frames and identification continues. When an abnormal sound is identified, that moment is marked as the start time of the abnormal sound; detection continues backwards until no abnormal sound is detected, and the previous moment is marked as the end time of the abnormal sound. This detection method can not only detect whether an abnormal sound exists in a sound signal but also accurately locate the start and end times of the abnormal sound;
8) detection result output: output whether the measured sound contains an abnormal sound, together with its start and end times.
The technical scheme has the following advantages:
1) the invention adopts the classic MFCC features as the main sound features and the noise-robust MP features as supplementary features, so that the flexibility, robustness and physical interpretability of the fused features improve the detection and classification of abnormal sound events in low signal-to-noise-ratio environments;
2) the technical scheme uses a convolutional neural network as the classifier; the linear convolution kernels of the network are well suited to sound features in time-frequency representation and extract higher-level, more discriminative features, improving the recognition accuracy for abnormal sound events;
3) the technical scheme trains the neural network model with sound data at different signal-to-noise ratios, so the trained network has stronger generalization ability and the recognition accuracy is effectively improved;
4) when training the classifier, the technical scheme adds a noise category alongside the abnormal sound categories to be recognized and trains it with ambient noise and other sounds, so that the classifier assigns input ambient noise and other sounds to this category, reducing the interference of sounds other than the abnormal sounds on the classification result and lowering the false detection rate;
5) the technical scheme borrows the machine-vision method of locating foreground image positions with a sliding window to localize abnormal sound segments in time within the collected sound signal, so the start and end times of an abnormal sound event can be accurately located.
The method has good noise robustness, can effectively detect abnormal sounds in sound signals in low signal-to-noise-ratio environments, addresses the blind-spot problem of video surveillance, and provides useful support for security work.
Description of the drawings:
FIG. 1 is a schematic flow chart of the method in the example;
FIG. 2 is a schematic flow chart of a first sound preprocessing in the embodiment;
FIG. 3 is a schematic diagram illustrating a first sound feature extraction process in the embodiment;
FIG. 4 is a schematic structural diagram of a convolutional neural network in an embodiment;
FIG. 5 is a schematic view of a second sound preprocessing flow in the embodiment;
fig. 6 is a functional block diagram of an abnormal sound segment localization in the embodiment.
Detailed Description
The invention will be further elucidated with reference to the drawings and examples, without however being limited thereto.
Example:
referring to fig. 1, an abnormal acoustic event identification method based on MFCC + MP fusion features includes the following steps:
1) first sound preprocessing: a series of digital processing steps are applied to the sound signals in the sound library to make the signal distribution more stable and to facilitate subsequent sound feature extraction. As shown in fig. 2, the first sound preprocessing comprises normalization, framing, and windowing. Normalization scales the collected sound signal to between -1 and 1, which facilitates subsequent processing of the sound signal and training of the neural network. Framing divides a sound signal into a group of short, equal-length time frames; at a sampling frequency of 44.1 kHz, 1024 points are taken as one frame, and two adjacent frames overlap, the overlap being called the frame shift, whose value is given by a formula shown only as an image in the original document. Because ambient sound signals are generally non-stationary but exhibit short-time stationarity, i.e., they appear stationary within 10-30 ms, framing strengthens the stationary character of the sound signal. Windowing makes two adjacent frames smoother and more continuous and reduces spectral leakage; a Hamming window is used;
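As a concrete illustration of this preprocessing stage, the following minimal Python sketch implements normalization, framing and Hamming windowing with the parameters stated above (44.1 kHz sampling, 1024-point frames). The half-frame shift of 512 points is an assumption for illustration only, since the patent gives the frame-shift formula only as an image:

```python
import numpy as np

def preprocess(signal, frame_len=1024, frame_shift=512):
    # Normalize the signal to [-1, 1] (epsilon guards against a silent input).
    signal = signal / (np.max(np.abs(signal)) + 1e-12)
    # Split into overlapping, equal-length frames.
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # Apply a Hamming window to smooth frame edges and reduce spectral leakage.
    return frames * np.hamming(frame_len)
```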
2) first sound feature extraction: first, the 12-order MFCCs of each frame are extracted; then each frame of the sound signal is decomposed with the MP algorithm over a dictionary of Gabor atoms of the form

$$g_{(s,u,\omega,\theta)}(t) = \frac{1}{\sqrt{s}}\, g\!\left(\frac{t-u}{s}\right)\cos(\omega t + \theta), \qquad g(t) = e^{-\pi t^{2}},$$

where s, u, ω and θ denote the scale (size), time shift, frequency and phase of the atom, respectively, and the parameter indices are positive integers, taking s = 2^p (1 ≤ p ≤ 8); u ∈ {0, 64, 128}; ω = K·i^{2.6} (1 ≤ i ≤ 35) with K = 0.5 × 35^{-2.6}; and θ = 0. The s and ω parameters of the first five atoms, together with the mean and variance of the atoms, are concatenated with the MFCCs as the feature vector of the frame. The feature vector of each frame of the sound segment is then computed, and its first- and second-order difference parameters are taken as dynamic supplementary features; finally, 60 frames are taken as the feature representation of the segment, as shown in fig. 3. The classic MFCC features serve as the main sound features, while the MP algorithm supplies a time-frequency representation that is more robust to noise as supplementary features; the flexibility, robustness and physical interpretability of the fused features improve the detection and classification of abnormal sound events in low signal-to-noise-ratio environments;
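A sketch of this fused feature extraction is given below, assuming the parameter grid stated above and the standard greedy matching-pursuit iteration. The 12-order MFCCs are computed with librosa; interpreting "the mean and variance of the atoms" as the mean and variance of the five MP coefficients is our assumption, and names such as mp_features are illustrative, not from the patent:

```python
import numpy as np
import librosa

FRAME_LEN = 1024

def gabor_atom(s, u, omega, theta=0.0, n=FRAME_LEN):
    # Gabor atom: scaled Gaussian envelope times a cosine, normalized to unit energy.
    t = np.arange(n)
    g = np.exp(-np.pi * ((t - u) / s) ** 2) * np.cos(omega * t + theta) / np.sqrt(s)
    return g / np.linalg.norm(g)

# Dictionary over the patent's parameter grid: s = 2^p (1 <= p <= 8),
# u in {0, 64, 128}, omega = K * i^2.6 with K = 0.5 * 35^-2.6, theta = 0.
K = 0.5 * 35 ** (-2.6)
PARAMS = [(2 ** p, u, K * i ** 2.6)
          for p in range(1, 9) for u in (0, 64, 128) for i in range(1, 36)]
ATOMS = np.stack([gabor_atom(s, u, w) for s, u, w in PARAMS])

def mp_features(frame, n_atoms=5):
    # Greedy matching pursuit: repeatedly pick the atom with the largest
    # correlation and subtract its projection from the residual.
    residual, s_w, coeffs = frame.astype(float), [], []
    for _ in range(n_atoms):
        c = ATOMS @ residual
        k = int(np.argmax(np.abs(c)))
        residual = residual - c[k] * ATOMS[k]
        s_w += [PARAMS[k][0], PARAMS[k][2]]   # keep s and omega of each atom
        coeffs.append(c[k])
    coeffs = np.array(coeffs)
    # Assumed reading: mean/variance taken over the five MP coefficients.
    return np.array(s_w + [coeffs.mean(), coeffs.var()])

def frame_feature(frame, sr=44100):
    # 12-order MFCCs of the frame, concatenated with the MP features.
    mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=12, n_fft=FRAME_LEN).mean(axis=1)
    return np.concatenate([mfcc, mp_features(frame)])
```

First- and second-order deltas across frames (e.g., librosa.feature.delta) would then be appended as the dynamic supplementary features, and 60 consecutive frame vectors stacked as the segment representation.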
3) classifier training: the classifier is a convolutional neural network (see fig. 4) comprising, connected in sequence, convolutional layer c1, pooling layer s1, convolutional layer c2, pooling layer s2, fully-connected layer f1, and an output layer (out layer). In the output layer, a noise category is added alongside the abnormal sound categories to be identified; this category is trained with ambient noise and other sounds so that the classifier assigns input ambient noise and other sounds to it, which reduces the interference of sounds other than the abnormal sounds on the classification result and lowers the false detection rate. When the classifier is trained, a sound library containing a suitable amount of noise is used to train the neural network, which strengthens its generalization ability. The linear convolution kernels of the convolutional neural network are well suited to sound features in time-frequency representation and can extract higher-level, more discriminative features, helping to improve the recognition accuracy for abnormal sound events;
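A minimal Keras sketch of this classifier topology (c1 → s1 → c2 → s2 → f1 → out) follows. Filter counts, kernel sizes and the input shape (60 frames by the per-frame feature dimension) are illustrative assumptions, as the patent does not specify them; the one extra output unit is the noise category described above:

```python
from tensorflow.keras import layers, models

def build_classifier(n_abnormal_classes, feat_dim, n_frames=60):
    model = models.Sequential([
        layers.Input(shape=(n_frames, feat_dim, 1)),
        layers.Conv2D(16, (3, 3), activation="relu", padding="same"),  # c1
        layers.MaxPooling2D((2, 2)),                                   # s1
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),  # c2
        layers.MaxPooling2D((2, 2)),                                   # s2
        layers.Flatten(),
        layers.Dense(128, activation="relu"),                          # f1
        # One extra "noise" class absorbs ambient noise and other sounds.
        layers.Dense(n_abnormal_classes + 1, activation="softmax"),    # out
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```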
4) measured sound input: collect the measured sound;
5) second sound preprocessing: as shown in fig. 5, the measured sound is first normalized, framed and windowed, and then denoised. Denoising uses magnitude spectral subtraction: the short-time magnitude spectrum of pre-collected noise is subtracted from the short-time magnitude spectrum of each frame. The algorithm is simple and easy to implement and helps improve the method's recognition accuracy for abnormal sounds;
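A sketch of this denoising step is shown below, assuming the noise spectrum is estimated by averaging the magnitude spectra of pre-collected noise-only frames and that the noisy phase is reused for reconstruction (a common choice the patent does not spell out):

```python
import numpy as np

def spectral_subtract(frames, noise_frames):
    # Average noise magnitude spectrum estimated from noise-only frames.
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)
    spec = np.fft.rfft(frames, axis=1)
    # Subtract the noise magnitude, flooring at zero (half-wave rectification).
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    # Resynthesize each frame with the original (noisy) phase.
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)),
                        n=frames.shape[1], axis=1)
```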
6) second feature extraction: referring to fig. 3, the feature extraction for the measured sound is identical to the first feature extraction method in step 2);
7) classifier application: referring to fig. 6, the method of locating foreground image positions with a sliding window in machine vision is borrowed to detect and identify abnormal sound segments in the collected sound segment. With the 60-frame sound features as the identification features, the classifier trained in step 3) identifies the first 60 frames of the sound segment; if no abnormal sound is identified, the window is moved back 60 frames and identification continues. When an abnormal sound is identified, that moment is marked as the start time of the abnormal sound; detection continues backwards until no abnormal sound is detected, and the previous moment is marked as the end time of the abnormal sound. This detection method can not only detect whether an abnormal sound exists in a sound signal but also accurately locate the start and end times of the abnormal sound;
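The sliding-window localization loop can be sketched as below; classify stands for the trained classifier of step 3) applied to one 60-frame feature block, and noise_id is the index of the extra noise category; both names are illustrative:

```python
def locate_abnormal(features, classify, noise_id, win=60):
    # Scan consecutive 60-frame windows: the first abnormal window marks
    # the start time, and the first non-abnormal window after it marks the end.
    start = end = None
    for i in range(0, len(features) - win + 1, win):
        label = classify(features[i : i + win])
        if label != noise_id and start is None:
            start = i
        elif label == noise_id and start is not None:
            end = i
            break
    return start, end   # frame indices; None if no abnormal sound was found
```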
8) detection result output: output whether the measured sound contains an abnormal sound, together with its start and end times.

Claims (1)

1. An abnormal sound event identification method based on MFCC + MP fusion features is characterized by comprising the following steps:
1) first sound preprocessing: perform normalization, framing and windowing on sound signals selected from a sound database, wherein normalization scales the collected sound signal to between -1 and 1; framing divides a sound signal into a group of short, equal-length time frames, where at a sampling frequency of 44.1 kHz 1024 points are taken as one frame and two adjacent frames overlap, the overlap being called the frame shift, whose value is given by a formula shown only as an image in the original document; windowing uses a Hamming window;
2) first sound feature extraction: first extract the 12-order MFCCs of each frame, then decompose each frame of the sound signal with the MP algorithm over a dictionary of Gabor atoms of the form

$$g_{(s,u,\omega,\theta)}(t) = \frac{1}{\sqrt{s}}\, g\!\left(\frac{t-u}{s}\right)\cos(\omega t + \theta), \qquad g(t) = e^{-\pi t^{2}}, \quad \theta \in [0, 2\pi],$$

where s, u, ω and θ respectively represent the scale (size), time shift, frequency and phase of the atom and the parameter indices are positive integers, taking s = 2^p (1 ≤ p ≤ 8); u ∈ {0, 64, 128}; ω = K·i^{2.6} (1 ≤ i ≤ 35) with K = 0.5 × 35^{-2.6}; and θ = 0; concatenate the s and ω parameters of the first five atoms and the mean and variance of the atoms with the MFCCs as the feature vector of the frame; then compute the feature vector of each frame of the sound segment, take its first- and second-order difference parameters as dynamic supplementary features, and finally take 60 frames of the segment as the feature representation of the segment;
3) classifier training: the classifier is a convolutional neural network comprising, connected in sequence, convolutional layer c1, pooling layer s1, convolutional layer c2, pooling layer s2, fully-connected layer f1 and an output layer (out layer); when the classifier is trained, a sound library mixed with noise is used to train the neural network;
4) measured sound input: collect the measured sound;
5) second sound preprocessing: first perform normalization, framing and windowing on the measured sound, then denoise, wherein denoising uses magnitude spectral subtraction to subtract the short-time magnitude spectrum of noise acquired in advance from the short-time magnitude spectrum of each frame of the sound signal;
6) second feature extraction: the feature extraction for the measured sound is identical to the first feature extraction method in step 2);
7) classifier application: borrowing the machine-vision method of locating foreground image positions with a sliding window, detect and identify abnormal sound segments in the collected sound segment; since 60-frame sound features are taken as the identification features, identify from the first 60 frames of the segment with the classifier trained in step 3); if no abnormal sound is identified, move back 60 frames and continue identification; when an abnormal sound is identified, mark that moment as the start time of the abnormal sound, continue detecting until no abnormal sound is detected, and mark the previous moment as the end time of the abnormal sound;
8) detection result output: output whether the measured sound contains an abnormal sound, together with its start and end times.
CN201910153124.6A 2019-02-28 2019-02-28 Abnormal sound event identification method based on MFCC + MP fusion characteristics Active CN109785857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910153124.6A CN109785857B (en) 2019-02-28 2019-02-28 Abnormal sound event identification method based on MFCC + MP fusion characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910153124.6A CN109785857B (en) 2019-02-28 2019-02-28 Abnormal sound event identification method based on MFCC + MP fusion characteristics

Publications (2)

Publication Number Publication Date
CN109785857A CN109785857A (en) 2019-05-21
CN109785857B true CN109785857B (en) 2020-08-14

Family

ID=66486550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910153124.6A Active CN109785857B (en) 2019-02-28 2019-02-28 Abnormal sound event identification method based on MFCC + MP fusion characteristics

Country Status (1)

Country Link
CN (1) CN109785857B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112349298A (en) * 2019-08-09 2021-02-09 阿里巴巴集团控股有限公司 Sound event recognition method, device, equipment and storage medium
CN110598599A (en) * 2019-08-30 2019-12-20 北京工商大学 Method and device for detecting abnormal gait of human body based on Gabor atomic decomposition
CN111144482B (en) * 2019-12-26 2023-10-27 惠州市锦好医疗科技股份有限公司 Scene matching method and device for digital hearing aid and computer equipment
CN112418181B (en) * 2020-12-13 2023-05-02 西北工业大学 Personnel falling water detection method based on convolutional neural network
CN112687294A (en) * 2020-12-21 2021-04-20 重庆科技学院 Vehicle-mounted noise identification method
CN112669879B (en) * 2020-12-24 2022-06-03 山东大学 Air conditioner indoor unit noise anomaly detection method based on time-frequency domain deep learning algorithm
CN112397055B (en) * 2021-01-19 2021-07-27 北京家人智能科技有限公司 Abnormal sound detection method and device and electronic equipment
CN113470654A (en) * 2021-06-02 2021-10-01 国网浙江省电力有限公司绍兴供电公司 Voiceprint automatic identification system and method
CN114202892B (en) * 2021-11-16 2023-04-25 北京航天试验技术研究所 Hydrogen leakage monitoring method
CN115206302B (en) * 2022-06-30 2023-08-18 中国石油大学(华东) Oil tank boiling-over fire early warning method based on micro-explosion noise identification model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108899049A (en) * 2018-05-31 2018-11-27 中国地质大学(武汉) A kind of speech-emotion recognition method and system based on convolutional neural networks
CN109065030A (en) * 2018-08-01 2018-12-21 上海大学 Ambient sound recognition methods and system based on convolutional neural networks
CN109087655A (en) * 2018-07-30 2018-12-25 桂林电子科技大学 A kind of monitoring of traffic route sound and exceptional sound recognition system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10014003B2 (en) * 2015-10-12 2018-07-03 Gwangju Institute Of Science And Technology Sound detection method for recognizing hazard situation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108899049A (en) * 2018-05-31 2018-11-27 中国地质大学(武汉) A kind of speech-emotion recognition method and system based on convolutional neural networks
CN109087655A (en) * 2018-07-30 2018-12-25 桂林电子科技大学 A kind of monitoring of traffic route sound and exceptional sound recognition system
CN109065030A (en) * 2018-08-01 2018-12-21 上海大学 Ambient sound recognition methods and system based on convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Audio Classification Based on Sound Source Feature Analysis; Yang Song; China Master's Theses Full-text Database; 2012-07-31; full text *

Also Published As

Publication number Publication date
CN109785857A (en) 2019-05-21

Similar Documents

Publication Publication Date Title
CN109785857B (en) Abnormal sound event identification method based on MFCC + MP fusion characteristics
Dennis et al. Overlapping sound event recognition using local spectrogram features and the generalised hough transform
Van Segbroeck et al. A robust frontend for VAD: exploiting contextual, discriminative and spectral cues of human voice.
Maxime et al. Sound representation and classification benchmark for domestic robots
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
Venter et al. Automatic detection of African elephant (Loxodonta africana) infrasonic vocalisations from recordings
CN108520753A (en) Voice lie detection method based on the two-way length of convolution memory network in short-term
CN114863937B (en) Mixed bird song recognition method based on deep migration learning and XGBoost
CN108648760A (en) Real-time sound-groove identification System and method for
CN112820279A (en) Parkinson disease detection method based on voice context dynamic characteristics
Wang et al. Audio event detection and classification using extended R-FCN approach
Chee et al. Automatic detection of prolongations and repetitions using LPCC
Illa et al. A comparative study of acoustic-to-articulatory inversion for neutral and whispered speech
Jena et al. Gender recognition of speech signal using knn and svm
Kuang et al. Simplified inverse filter tracked affective acoustic signals classification incorporating deep convolutional neural networks
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN117457031A (en) Emotion recognition method based on global acoustic features and local spectrum features of voice
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
CN116434759A (en) Speaker identification method based on SRS-CL network
Ruinskiy et al. Spectral and textural feature-based system for automatic detection of fricatives and affricates
Sharma et al. Classification of children with specific language impairment using pitch-based parameters
Ardiana et al. Gender Classification Based Speaker’s Voice using YIN Algorithm and MFCC
Neti et al. Joint processing of audio and visual information for multimedia indexing and human-computer interaction.
Wang et al. Environmental sound recognition based on double-input convolutional neural network model
CN113628639A (en) Voice emotion recognition method based on multi-head attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant