CN114613389A - Non-speech audio feature extraction method based on improved MFCC - Google Patents

Non-speech audio feature extraction method based on improved MFCC

Info

Publication number
CN114613389A
CN114613389A
Authority
CN
China
Prior art keywords
frequency
mfcc
functional expression
expression
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210256684.6A
Other languages
Chinese (zh)
Inventor
姜琦
董琦
李红
冯庆胜
丁伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Jiaotong University
Original Assignee
Dalian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Jiaotong University filed Critical Dalian Jiaotong University
Priority to CN202210256684.6A priority Critical patent/CN114613389A/en
Publication of CN114613389A publication Critical patent/CN114613389A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Stereophonic System (AREA)

Abstract

The invention relates to the technical field of audio feature extraction, and particularly discloses a non-speech audio feature extraction method based on an improved MFCC (Mel-frequency cepstrum coefficient), which comprises the following steps: collecting sound signals and preprocessing the collected sound signals; performing MFCC feature extraction on the preprocessed sound signals; performing EMD on the preprocessed sound signals to obtain IMF components, and extracting the time-domain feature vectors and frequency-domain feature vectors of the IMF components; performing first-order and second-order differences on the MFCC coefficients to obtain the MFCC dynamic feature vectors; and performing feature fusion on the calculated MFCC feature vector, time-domain feature vectors, frequency-domain feature vectors and MFCC dynamic feature vectors to obtain an improved multi-scale MFCC feature vector. The invention can effectively extract the high-frequency part of the audio signal, so that the feature information of the sound signal is richer and more comprehensive.

Description

Non-speech audio feature extraction method based on improved MFCC
Technical Field
The invention relates to the technical field of audio feature extraction.
Background
At present, there are three main types of characteristic parameters commonly used in sound signal feature extraction: Linear Predictive Coefficients (LPC), Linear Predictive Cepstral Coefficients (LPCC), and Mel-Frequency Cepstral Coefficients (MFCC). Compared with the first two model-based features, the MFCC makes no assumption or restriction about the sound: it is a set of feature parameters built on the way the human brain processes external sound and on the auditory characteristics of the human ear, and it is currently the feature most frequently used in sound recognition. However, MFCC features are designed around the auditory characteristics of the human ear, which is more sensitive to low-frequency sounds and exhibits a masking effect at high frequencies. Therefore, when facing non-speech audio signals with many high-frequency components, the feature parameters extracted by this method cannot comprehensively represent the acoustic characteristics of the audio and have certain limitations.
The key of the traditional MFCC sound signal feature extraction method is to construct a series of band-pass filter banks (Mel filters) with different weights to simulate the way the human ear conditions sound signals. Research on the human auditory mechanism has found that the traveling wave of a low-frequency sound travels farther along the basilar membrane of the cochlea than that of a high-frequency sound; correspondingly, the Mel filters in the MFCC are fewer in number and sparsely distributed in the high-frequency region, so the traditional MFCC method characterizes the high-frequency part of a sound signal poorly. To overcome these deficiencies of the traditional MFCC and improve its applicability to non-speech audio feature extraction, it is necessary to design a multi-scale fusion MFCC feature extraction method that overcomes the problems of the existing MFCC method.
Disclosure of Invention
In order to solve the above problems in the existing audio feature extraction method, the present invention provides a non-speech audio feature extraction method based on an improved MFCC.
The technical scheme adopted by the invention for realizing the purpose is as follows: a non-speech audio feature extraction method based on improved MFCC comprises the following steps:
s1, collecting sound signals and preprocessing the collected sound signals;
s2, performing MFCC feature extraction on the preprocessed sound signals;
s3, performing EMD on the preprocessed sound signals to obtain IMF components, and extracting time domain characteristic vectors and frequency domain characteristic vectors of the IMF components;
s4, performing first-order and second-order differences on the MFCC coefficients to obtain the MFCC dynamic feature vectors;
and S5, performing feature fusion on the calculated MFCC feature vector, the calculated time domain feature vector, the calculated frequency domain feature vector and the calculated MFCC dynamic feature vector to obtain an improved multi-scale MFCC feature vector.
Preferably, the step S1 includes the following steps:
step S101: normalizing the amplitude of the audio sequence of the sound signal; the functional expression is as follows:
x(m) = x(n) / |x(n)|_max
wherein: x(m) is the normalized sound sequence; x(n) is the original sound sequence; |x(n)|_max is the maximum absolute value of the sound sequence;
step S102: performing framing processing on the normalized audio sequence;
step S103: windowing the audio sequence after framing.
Preferably, in the step S102, the frame length in the framing processing is 20 to 30ms, and the frame shift is 0.3 to 0.5 times the frame length.
Preferably, in step S103, a hamming window is used in the windowing process.
Preferably, the step S2 includes the following steps:
s201: obtaining the spectrum X(k) of each preprocessed time-domain frame of the sound signal through the fast Fourier transform; the functional expression is as follows:
X(k) = Σ_{n=0}^{N-1} x(n) · e^(-j2πnk/N), 0 ≤ k ≤ N-1
wherein: N is the number of points of the Fourier transform, k is the frequency index, and x(n) is a preprocessed time-domain frame of the sound signal;
s202: calculating the energy spectrum |X(k)|^2 of the sound signal by squaring the magnitude spectrum, and then passing |X(k)|^2 through a set of triangular filters that simulate the frequency selectivity of the human ear to perform the Mel nonlinear transformation; the functional expression is as follows:
MelSpec(m) = Σ_{k=0}^{N-1} |X(k)|^2 · H_m(k), 0 ≤ m < M
H_m(k) is the frequency response of the m-th filter; its functional expression is:
H_m(k) = 0,                                  k < f(m-1) or k > f(m+1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)),     f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)),     f(m) ≤ k ≤ f(m+1)
and satisfying
Σ_{m=0}^{M-1} H_m(k) = 1
where f(m) is the center frequency of the m-th triangular filter;
s203: taking the logarithm of each MelSpec(m) obtained from the filter bank to obtain the logarithmic energy E(m); the functional expression is as follows:
E(m) = lg[MelSpec(m)], 0 < m < M
wherein: M is the number of filters;
s204: performing a discrete cosine transform on the logarithmic energy E(m) to obtain a group of Mel cepstrum coefficients F(n); the functional expression is as follows:
F(n) = Σ_{m=1}^{M} E(m) · cos(πn(m - 0.5)/M)
where n is the order of the Mel cepstrum coefficient.
Preferably, in step S3, the IMF components are arranged in order from high frequency to low frequency, the first five IMF components are taken, and the time-domain feature vector and the frequency-domain feature vector thereof are extracted respectively.
Preferably, in step S3, there are 11 time-domain feature vectors, including the average amplitude, standard deviation, square root amplitude, root mean square, peak-to-peak value, skewness, kurtosis, crest factor, margin factor, form factor and pulse index,
the functional expression of the average amplitude is:
X_mean = (1/N) Σ_{i=1}^{N} |x(i)|
the functional expression of the standard deviation is:
σ = sqrt( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^2 )
the functional expression of the square root amplitude is:
X_r = ( (1/N) Σ_{i=1}^{N} sqrt(|x(i)|) )^2
the functional expression of the root mean square is:
X_rms = sqrt( (1/N) Σ_{i=1}^{N} x(i)^2 )
the functional expression of the peak-to-peak value is:
X_pp = max(x(i)) - min(x(i))
the functional expression of the skewness is:
S_k = ( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^3 ) / σ^3
the functional expression of the kurtosis is:
K_u = ( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^4 ) / σ^4
the functional expression of the crest factor is:
C_f = X_P / X_rms
the functional expression of the margin factor is:
L_f = X_P / X_r
the functional expression of the form factor is:
S_f = X_rms / X_mean
the functional expression of the pulse index is:
I_f = X_P / X_mean
wherein: x(i) is the i-th sample of the IMF component, x̄ is its mean value, X_P is the peak value, and N is the corresponding sound signal length.
Preferably, in step S3, there are 2 frequency-domain feature vectors, including the frequency center and the frequency root mean square,
the functional expression of the frequency center is:
FC = ( Σ_{i=1}^{K} f(i) · s(i) ) / ( Σ_{i=1}^{K} s(i) )
the functional expression of the frequency root mean square is:
RMSF = sqrt( ( Σ_{i=1}^{K} f(i)^2 · s(i) ) / ( Σ_{i=1}^{K} s(i) ) )
wherein: K is the number of spectral lines, f(i) is the frequency value of the i-th spectral line, and s(i) is the i-th value of the spectrum.
Preferably, in step S4, the functional expression of the first-order difference of the MFCC coefficients is:
d_t = C_{t+1} - C_t,                                                        t < K
d_t = ( Σ_{k=1}^{K} k · (C_{t+k} - C_{t-k}) ) / sqrt( 2 Σ_{k=1}^{K} k^2 ),  K ≤ t ≤ Q - K
d_t = C_t - C_{t-1},                                                        t > Q - K
wherein: d_t and C_t are the t-th first-order difference and the t-th cepstrum coefficient respectively; Q is the order of the cepstrum coefficients; K is the time difference of the first derivative.
Preferably, in step S4, the second-order difference of the MFCC coefficients is obtained by applying the same difference operation to the first-order differences; the functional expression is:
d2_t = d_{t+1} - d_t,                                                        t < K
d2_t = ( Σ_{k=1}^{K} k · (d_{t+k} - d_{t-k}) ) / sqrt( 2 Σ_{k=1}^{K} k^2 ),  K ≤ t ≤ Q - K
d2_t = d_t - d_{t-1},                                                        t > Q - K
wherein: d2_t and d_t are the t-th second-order and first-order differences respectively; Q is the order of the cepstrum coefficients; K is the time difference of the second derivative.
The non-speech audio feature extraction method based on an improved MFCC (Mel-frequency cepstrum coefficient) of the present invention solves the problem that the traditional MFCC, being designed around the auditory characteristics of the human ear, lacks representation of the high-frequency part of a sound signal. The method can effectively extract the high-frequency part of the audio signal that lies beyond the range the MFCC handles well; it retains the short-time features extracted by the traditional MFCC while also capturing the overall variation of the sound signal; and, through the first-order and second-order differences of the MFCC, it makes the feature information richer and more comprehensive.
Drawings
FIG. 1 is a flow chart of a non-speech audio feature extraction method based on an improved MFCC according to an embodiment of the present invention;
FIG. 2 is a flow chart of pre-processing of a sound signal;
FIG. 3 is a flowchart of an extraction process for MFCC parameters.
Detailed Description
The embodiments of the present invention will be described in further detail with reference to the drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
The method for extracting non-speech audio features based on the improved MFCC in the embodiment, as shown in FIG. 1, includes the following steps:
s1, collecting sound signals and preprocessing the collected sound signals;
s2, performing MFCC feature extraction on the preprocessed sound signals;
s3, performing EMD on the preprocessed sound signals to obtain IMF components, and extracting time domain characteristic vectors and frequency domain characteristic vectors of the IMF components;
s4, performing first-order and second-order differences on the MFCC coefficients to obtain the MFCC dynamic feature vectors;
and S5, performing feature fusion on the calculated MFCC feature vector, the time domain feature vector, the frequency domain feature vector and the MFCC dynamic feature vector to obtain an improved multi-scale MFCC feature vector.
As shown in fig. 2, step S1 may include the following steps:
step S101: normalizing the amplitude of the audio sequence of the sound signal; the functional expression is as follows:
x(m) = x(n) / |x(n)|_max
wherein: x(m) is the normalized sound sequence; x(n) is the original sound sequence; |x(n)|_max is the maximum absolute value of the sound sequence;
step S102: performing framing processing on the normalized audio sequence;
although the sound signal is non-stationary, it can be regarded as stationary over a short interval, so the sound sequence is divided into many very short segments, called frames, in order to obtain the short-time characteristics of the signal; the frame length in the framing processing may be 20 to 30 ms, and the frame shift may be 0.3 to 0.5 times the frame length, so that adjacent frames partially overlap, which avoids losing features because two adjacent frames differ too much;
step S103: windowing the framed audio sequence;
windowing smooths the transition between the beginning and end of each frame, and a Hamming window may be used.
As shown in fig. 3, step S2 may include the steps of:
s201: obtaining the spectrum X(k) of each preprocessed time-domain frame of the sound signal through the fast Fourier transform; the functional expression is as follows:
X(k) = Σ_{n=0}^{N-1} x(n) · e^(-j2πnk/N), 0 ≤ k ≤ N-1
wherein: N is the number of points of the Fourier transform, k is the frequency index, and x(n) is a preprocessed time-domain frame of the sound signal;
s202: calculating the energy spectrum |X(k)|^2 of the sound signal by squaring the magnitude spectrum, and then passing |X(k)|^2 through a set of triangular filters that simulate the frequency selectivity of the human ear to perform the Mel nonlinear transformation; the functional expression is as follows:
MelSpec(m) = Σ_{k=0}^{N-1} |X(k)|^2 · H_m(k), 0 ≤ m < M
H_m(k) is the frequency response of the m-th filter; its functional expression is:
H_m(k) = 0,                                  k < f(m-1) or k > f(m+1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)),     f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)),     f(m) ≤ k ≤ f(m+1)
and satisfying
Σ_{m=0}^{M-1} H_m(k) = 1
where f(m) is the center frequency of the m-th triangular filter;
s203: taking the logarithm of each MelSpec(m) obtained from the filter bank to obtain the logarithmic energy E(m); the functional expression is as follows:
E(m) = lg[MelSpec(m)], 0 < m < M
wherein: M is the number of filters;
s204: performing a discrete cosine transform on the logarithmic energy E(m) to obtain a group of Mel cepstrum coefficients F(n); the functional expression is as follows:
F(n) = Σ_{m=1}^{M} E(m) · cos(πn(m - 0.5)/M)
wherein n is the order of the Mel cepstrum coefficient.
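For reference, steps S201 to S204 can be sketched in NumPy as follows; the sampling rate, FFT size, number of filters and the standard Mel-scale conversion mel = 2595·lg(1 + f/700) are assumptions made for the illustration, since the embodiment does not fix these values.

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_filters=26):
    """Triangular Mel filter bank H_m(k) (illustrative; standard construction assumed)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_points) / sr).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        H[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        H[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return H

def mfcc_frame(frame, sr, n_fft=512, n_filters=26, n_ceps=13):
    """S201-S204 for one frame: FFT -> energy spectrum -> Mel filtering -> log -> DCT."""
    spec = np.fft.rfft(frame, n_fft)                   # S201: X(k)
    energy = np.abs(spec) ** 2                         # S202: |X(k)|^2
    mel_spec = mel_filterbank(sr, n_fft, n_filters) @ energy
    E = np.log10(np.maximum(mel_spec, 1e-12))          # S203: E(m) = lg[MelSpec(m)]
    m = np.arange(1, n_filters + 1)
    n = np.arange(1, n_ceps + 1)[:, None]
    F = (E * np.cos(np.pi * n * (m - 0.5) / n_filters)).sum(axis=1)  # S204: DCT
    return F
```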
for the acquisition of the high frequency components of the sound signal, in addition to the EMD method, an EMD-based modification method such as EEMD, CEEMD, CEEMDAN, iceemda may be used.
In step S3, the IMF components may be arranged in order from high frequency to low frequency, the first five IMF components are taken, and their time-domain feature vectors and frequency-domain feature vectors are extracted respectively. There may be 11 time-domain feature vectors, including the average amplitude, standard deviation, square root amplitude, root mean square, peak-to-peak value, skewness, kurtosis, crest factor, margin factor, form factor and pulse index,
the functional expression of the average amplitude is:
X_mean = (1/N) Σ_{i=1}^{N} |x(i)|
the functional expression of the standard deviation is:
σ = sqrt( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^2 )
the functional expression of the square root amplitude is:
X_r = ( (1/N) Σ_{i=1}^{N} sqrt(|x(i)|) )^2
the functional expression of the root mean square is:
X_rms = sqrt( (1/N) Σ_{i=1}^{N} x(i)^2 )
the functional expression of the peak-to-peak value is:
X_pp = max(x(i)) - min(x(i))
the functional expression of the skewness is:
S_k = ( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^3 ) / σ^3
the functional expression of the kurtosis is:
K_u = ( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^4 ) / σ^4
the functional expression of the crest factor is:
C_f = X_P / X_rms
the functional expression of the margin factor is:
L_f = X_P / X_r
the functional expression of the form factor is:
S_f = X_rms / X_mean
the functional expression of the pulse index is:
I_f = X_P / X_mean
wherein: x(i) is the i-th sample of the IMF component, x̄ is its mean value, X_P is the peak value, and N is the corresponding sound signal length.
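The eleven time-domain statistics listed above can be computed for one IMF component as in the following NumPy sketch, which assumes the standard definitions of these indicators.

```python
import numpy as np

def time_domain_features(x):
    """11 time-domain statistics of one IMF component (standard definitions assumed)."""
    x = np.asarray(x, dtype=float)
    mean_amp = np.mean(np.abs(x))                    # average amplitude
    std = np.std(x)                                  # standard deviation
    sra = np.mean(np.sqrt(np.abs(x))) ** 2           # square root amplitude
    rms = np.sqrt(np.mean(x ** 2))                   # root mean square
    ptp = np.max(x) - np.min(x)                      # peak-to-peak value
    skew = np.mean((x - x.mean()) ** 3) / std ** 3   # skewness
    kurt = np.mean((x - x.mean()) ** 4) / std ** 4   # kurtosis
    peak = np.max(np.abs(x))
    crest = peak / rms                               # crest factor
    margin = peak / sra                              # margin factor
    form = rms / mean_amp                            # form factor
    pulse = peak / mean_amp                          # pulse index
    return np.array([mean_amp, std, sra, rms, ptp, skew, kurt,
                     crest, margin, form, pulse])
```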
In step S3, there may be 2 frequency-domain feature vectors, including the frequency center and the frequency root mean square,
the functional expression of the frequency center is:
FC = ( Σ_{i=1}^{K} f(i) · s(i) ) / ( Σ_{i=1}^{K} s(i) )
the functional expression of the frequency root mean square is:
RMSF = sqrt( ( Σ_{i=1}^{K} f(i)^2 · s(i) ) / ( Σ_{i=1}^{K} s(i) ) )
wherein: K is the number of spectral lines, f(i) is the frequency value of the i-th spectral line, and s(i) is the i-th value of the spectrum.
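Likewise, the two frequency-domain features of an IMF component can be sketched as follows; using the amplitude spectrum of an FFT as the spectrum values s(i) is an assumption of the illustration.

```python
import numpy as np

def frequency_domain_features(x, sr):
    """Frequency center and frequency root mean square of one IMF component."""
    x = np.asarray(x, dtype=float)
    s = np.abs(np.fft.rfft(x))                   # spectrum values s(i) (amplitude spectrum assumed)
    f = np.fft.rfftfreq(len(x), d=1.0 / sr)      # frequency value f(i) of each spectral line
    fc = np.sum(f * s) / np.sum(s)                       # frequency center
    rmsf = np.sqrt(np.sum(f ** 2 * s) / np.sum(s))       # frequency root mean square
    return fc, rmsf
```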
for different sound signals, after EMD decomposition, the method can not be limited to only retaining the first five IMF components, at most retaining all IMF components with the correlation degree greater than 0.3 with the original signal, and subsequently calculating corresponding time domain and frequency domain characteristics, wherein the time domain and frequency domain characteristics of the signals are not limited to the above formula, and can replace other formulas to construct characteristics according to the characteristics of the analyzed sound signals in different aspects, such as root mean square energy representing energy; attack time, zero-crossing rate and autocorrelation in the time domain; spectral centroid in the frequency domain, spectral flatness, spectral flux, etc.
In order to obtain richer information, first-order and second-order differences of the MFCC coefficients are computed to obtain the MFCC dynamic feature vectors.
In step S4, the functional expression of the first-order difference of the MFCC coefficients is:
d_t = C_{t+1} - C_t,                                                        t < K
d_t = ( Σ_{k=1}^{K} k · (C_{t+k} - C_{t-k}) ) / sqrt( 2 Σ_{k=1}^{K} k^2 ),  K ≤ t ≤ Q - K
d_t = C_t - C_{t-1},                                                        t > Q - K
wherein: d_t and C_t are the t-th first-order difference and the t-th cepstrum coefficient respectively; Q is the order of the cepstrum coefficients; K is the time difference of the first derivative.
In step S4, the second-order difference of the MFCC coefficients is obtained by applying the same difference operation to the first-order differences; the functional expression is:
d2_t = d_{t+1} - d_t,                                                        t < K
d2_t = ( Σ_{k=1}^{K} k · (d_{t+k} - d_{t-k}) ) / sqrt( 2 Σ_{k=1}^{K} k^2 ),  K ≤ t ≤ Q - K
d2_t = d_t - d_{t-1},                                                        t > Q - K
wherein: d2_t and d_t are the t-th second-order and first-order differences respectively; Q is the order of the cepstrum coefficients; K is the time difference of the second derivative. The first-order and second-order differences make the feature information richer and more comprehensive.
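The dynamic features can be sketched as below; the weighting and the edge handling are assumptions consistent with the expressions above, and several equivalent delta formulations exist in practice.

```python
import numpy as np

def delta(ceps, K=2):
    """Weighted symmetric difference of an MFCC matrix (frames x coefficients).
    The normalization sqrt(2 * sum(k^2)) matches the expression above; edge frames
    are handled here by edge padding, which is a simplifying assumption."""
    T = ceps.shape[0]
    padded = np.pad(ceps, ((K, K), (0, 0)), mode='edge')
    denom = np.sqrt(2.0 * sum(k * k for k in range(1, K + 1)))
    d = np.zeros_like(ceps, dtype=float)
    for k in range(1, K + 1):
        d += k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
    return d / denom

# The second-order dynamic features are obtained by applying the same operator again:
# delta2 = delta(delta(mfcc_matrix))
```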
The invention not only extracts the short-time features of the traditional MFCC but also captures the overall variation of the sound signal, and it can process not only speech audio but also non-speech audio such as mechanical sound signals.
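Finally, step S5 can be illustrated by simple concatenation, reusing the helper functions from the preceding sketches; averaging the frame-level MFCC and difference coefficients into a single vector per signal is an assumption about how the fusion is realized, since only the fusion of the vectors is specified.

```python
import numpy as np

def fuse_features(mfcc_frames, imfs, sr):
    """S5: concatenate the MFCC features, the MFCC dynamic features and the IMF
    time/frequency-domain features into one multi-scale feature vector.
    Frame-level coefficients are averaged over time (an assumption about the fusion)."""
    d1 = delta(mfcc_frames)                      # first-order difference (sketch above)
    d2 = delta(d1)                               # second-order difference
    mfcc_part = np.concatenate([mfcc_frames.mean(axis=0),
                                d1.mean(axis=0),
                                d2.mean(axis=0)])
    imf_part = np.concatenate([
        np.concatenate([time_domain_features(imf),
                        np.array(frequency_domain_features(imf, sr))])
        for imf in imfs])
    return np.concatenate([mfcc_part, imf_part])
```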
The embodiments of the present invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A non-speech audio feature extraction method based on improved MFCC is characterized by comprising the following steps:
s1, collecting sound signals and preprocessing the collected sound signals;
s2, performing MFCC feature extraction on the preprocessed sound signals;
s3, performing EMD on the preprocessed sound signals to obtain IMF components, and extracting time domain characteristic vectors and frequency domain characteristic vectors of the IMF components;
s4, performing first-order and second-order differences on the MFCC coefficients to obtain the MFCC dynamic feature vectors;
and S5, performing feature fusion on the calculated MFCC feature vector, the time domain feature vector, the frequency domain feature vector and the MFCC dynamic feature vector to obtain an improved multi-scale MFCC feature vector.
2. The improved MFCC-based non-speech audio feature extraction method of claim 1, wherein the step S1 comprises the steps of:
step S101: normalizing the amplitude of the audio sequence of the sound signal; the functional expression is as follows:
x(m) = x(n) / |x(n)|_max
wherein: x(m) is the normalized sound sequence; x(n) is the original sound sequence; |x(n)|_max is the maximum absolute value of the sound sequence;
step S102: performing framing processing on the normalized audio sequence;
step S103: windowing the audio sequence after framing.
3. The method of claim 2, wherein in step S102, the frame length in the framing process is 20-30 ms, and the frame shift is 0.3-0.5 times the frame length.
4. The method of claim 2, wherein in step S103, a hamming window is used in the windowing process.
5. The improved MFCC-based non-speech audio feature extraction method of claim 1, wherein the step S2 comprises the steps of:
s201: obtaining the spectrum X(k) of each preprocessed time-domain frame of the sound signal through the fast Fourier transform; the functional expression is as follows:
X(k) = Σ_{n=0}^{N-1} x(n) · e^(-j2πnk/N), 0 ≤ k ≤ N-1
wherein: N is the number of points of the Fourier transform, k is the frequency index, and x(n) is a preprocessed time-domain frame of the sound signal;
s202: calculating the energy spectrum |X(k)|^2 of the sound signal by squaring the magnitude spectrum, and then passing |X(k)|^2 through a set of triangular filters that simulate the frequency selectivity of the human ear to perform the Mel nonlinear transformation; the functional expression is as follows:
MelSpec(m) = Σ_{k=0}^{N-1} |X(k)|^2 · H_m(k), 0 ≤ m < M
H_m(k) is the frequency response of the m-th filter; its functional expression is:
H_m(k) = 0,                                  k < f(m-1) or k > f(m+1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)),     f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)),     f(m) ≤ k ≤ f(m+1)
and satisfying
Σ_{m=0}^{M-1} H_m(k) = 1
where f(m) is the center frequency of the m-th triangular filter;
s203: taking the logarithm of each MelSpec(m) obtained from the filter bank to obtain the logarithmic energy E(m); the functional expression is as follows:
E(m) = lg[MelSpec(m)], 0 < m < M
wherein: M is the number of filters;
s204: performing a discrete cosine transform on the logarithmic energy E(m) to obtain a group of Mel cepstrum coefficients F(n); the functional expression is as follows:
F(n) = Σ_{m=1}^{M} E(m) · cos(πn(m - 0.5)/M)
where n is the order of the Mel cepstrum coefficient.
6. The method of claim 1, wherein in step S3, the IMF components are arranged in order from high frequency to low frequency, and the first five IMF components are taken to extract their time-domain feature vectors and frequency-domain feature vectors, respectively.
7. The method of claim 1, wherein in step S3, there are 11 time-domain feature vectors, including the mean amplitude, standard deviation, square root amplitude, root mean square, peak-to-peak value, skewness, kurtosis, crest factor, margin factor, form factor and pulse index,
the functional expression of the mean amplitude is:
X_mean = (1/N) Σ_{i=1}^{N} |x(i)|
the functional expression of the standard deviation is:
σ = sqrt( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^2 )
the functional expression of the square root amplitude is:
X_r = ( (1/N) Σ_{i=1}^{N} sqrt(|x(i)|) )^2
the functional expression of the root mean square is:
X_rms = sqrt( (1/N) Σ_{i=1}^{N} x(i)^2 )
the functional expression of the peak-to-peak value is:
X_pp = max(x(i)) - min(x(i))
the functional expression of the skewness is:
S_k = ( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^3 ) / σ^3
the functional expression of the kurtosis is:
K_u = ( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^4 ) / σ^4
the functional expression of the crest factor is:
C_f = X_P / X_rms
the functional expression of the margin factor is:
L_f = X_P / X_r
the functional expression of the form factor is:
S_f = X_rms / X_mean
the functional expression of the pulse index is:
I_f = X_P / X_mean
wherein: x(i) is the i-th sample of the IMF component, x̄ is its mean value, X_P is the peak value, and N is the corresponding sound signal length.
8. The method of claim 1, wherein in step S3, there are 2 frequency-domain feature vectors, including the frequency center and the frequency root mean square,
the functional expression of the frequency center is:
FC = ( Σ_{i=1}^{K} f(i) · s(i) ) / ( Σ_{i=1}^{K} s(i) )
the functional expression of the frequency root mean square is:
RMSF = sqrt( ( Σ_{i=1}^{K} f(i)^2 · s(i) ) / ( Σ_{i=1}^{K} s(i) ) )
wherein: K is the number of spectral lines, f(i) is the frequency value of the i-th spectral line, and s(i) is the i-th value of the spectrum.
9. The method of claim 1, wherein in step S4, the functional expression of the first-order difference of the MFCC coefficients is:
d_t = C_{t+1} - C_t,                                                        t < K
d_t = ( Σ_{k=1}^{K} k · (C_{t+k} - C_{t-k}) ) / sqrt( 2 Σ_{k=1}^{K} k^2 ),  K ≤ t ≤ Q - K
d_t = C_t - C_{t-1},                                                        t > Q - K
wherein: d_t and C_t are the t-th first-order difference and the t-th cepstrum coefficient respectively; Q is the order of the cepstrum coefficients; K is the time difference of the first derivative.
10. The method of claim 1, wherein in step S4, the second-order difference of the MFCC coefficients is obtained by applying the same difference operation to the first-order differences; the functional expression is:
d2_t = d_{t+1} - d_t,                                                        t < K
d2_t = ( Σ_{k=1}^{K} k · (d_{t+k} - d_{t-k}) ) / sqrt( 2 Σ_{k=1}^{K} k^2 ),  K ≤ t ≤ Q - K
d2_t = d_t - d_{t-1},                                                        t > Q - K
wherein: d2_t and d_t are the t-th second-order and first-order differences respectively; Q is the order of the cepstrum coefficients; K is the time difference of the second derivative.
CN202210256684.6A 2022-03-16 2022-03-16 Non-speech audio feature extraction method based on improved MFCC Pending CN114613389A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210256684.6A CN114613389A (en) 2022-03-16 2022-03-16 Non-speech audio feature extraction method based on improved MFCC

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210256684.6A CN114613389A (en) 2022-03-16 2022-03-16 Non-speech audio feature extraction method based on improved MFCC

Publications (1)

Publication Number Publication Date
CN114613389A true CN114613389A (en) 2022-06-10

Family

ID=81862961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210256684.6A Pending CN114613389A (en) 2022-03-16 2022-03-16 Non-speech audio feature extraction method based on improved MFCC

Country Status (1)

Country Link
CN (1) CN114613389A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863951A (en) * 2022-07-11 2022-08-05 中国科学院合肥物质科学研究院 Rapid dysarthria detection method based on modal decomposition
CN114863951B (en) * 2022-07-11 2022-09-23 中国科学院合肥物质科学研究院 Rapid dysarthria detection method based on modal decomposition
CN117153197A (en) * 2023-10-27 2023-12-01 云南师范大学 Speech emotion recognition method, apparatus, and computer-readable storage medium
CN117153197B (en) * 2023-10-27 2024-01-02 云南师范大学 Speech emotion recognition method, apparatus, and computer-readable storage medium
CN117475360A (en) * 2023-12-27 2024-01-30 南京纳实医学科技有限公司 Biological sign extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN
CN117475360B (en) * 2023-12-27 2024-03-26 南京纳实医学科技有限公司 Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN

Similar Documents

Publication Publication Date Title
CN107610715B (en) Similarity calculation method based on multiple sound characteristics
CN114613389A (en) Non-speech audio feature extraction method based on improved MFCC
KR100908121B1 (en) Speech feature vector conversion method and apparatus
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
CN108198545B (en) Speech recognition method based on wavelet transformation
CN108922514B (en) Robust feature extraction method based on low-frequency log spectrum
CN108682432B (en) Speech emotion recognition device
Wanli et al. The research of feature extraction based on MFCC for speaker recognition
CN107274887A (en) Speaker's Further Feature Extraction method based on fusion feature MGFCC
CN110942784A (en) Snore classification system based on support vector machine
CN105679321B (en) Voice recognition method, device and terminal
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
Zhang et al. Low-Delay Speech Enhancement Using Perceptually Motivated Target and Loss.
Ali et al. Speech enhancement using dilated wave-u-net: an experimental analysis
CN112863517B (en) Speech recognition method based on perceptual spectrum convergence rate
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
Singh et al. A comparative study of recognition of speech using improved MFCC algorithms and Rasta filters
CN104900227A (en) Voice characteristic information extraction method and electronic equipment
Nasreen et al. Speech analysis for automatic speech recognition
Rahali et al. Robust Features for Speech Recognition using Temporal Filtering Technique in the Presence of Impulsive Noise
Tantisatirapong et al. Comparison of feature extraction for accent dependent Thai speech recognition system
Vimal Study on the Behaviour of Mel Frequency Cepstral Coffecient Algorithm for Different Windows
CN110634473A (en) Voice digital recognition method based on MFCC
CN110610724A (en) Voice endpoint detection method and device based on non-uniform sub-band separation variance

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination