CN114613389A - Non-speech audio feature extraction method based on improved MFCC - Google Patents

Non-speech audio feature extraction method based on improved MFCC

Info

Publication number
CN114613389A
CN114613389A
Authority
CN
China
Prior art keywords
frequency
mfcc
functional expression
expression
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210256684.6A
Other languages
Chinese (zh)
Inventor
姜琦
董琦
李红
冯庆胜
丁伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Jiaotong University
Original Assignee
Dalian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Jiaotong University filed Critical Dalian Jiaotong University
Priority to CN202210256684.6A priority Critical patent/CN114613389A/en
Publication of CN114613389A publication Critical patent/CN114613389A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Stereophonic System (AREA)

Abstract

The invention relates to the technical field of audio feature extraction, and particularly discloses a non-speech audio feature extraction method based on an improved MFCC (Mel-frequency cepstrum coefficient), which comprises the following steps: collecting sound signals and preprocessing the collected sound signals; performing MFCC feature extraction on the preprocessed sound signals; performing EMD on the preprocessed sound signals to obtain IMF components, and extracting the time-domain feature vectors and frequency-domain feature vectors of the IMF components; performing first-order and second-order differences on the MFCC coefficients to obtain the MFCC dynamic feature vectors; and performing feature fusion on the calculated MFCC feature vector, time-domain feature vectors, frequency-domain feature vectors and MFCC dynamic feature vectors to obtain an improved multi-scale MFCC feature vector. The invention can effectively extract the high-frequency part of the audio signal, so that the feature information of the sound signal is richer and more comprehensive.

Description

Non-speech audio feature extraction method based on improved MFCC
Technical Field
The invention relates to the technical field of audio feature extraction.
Background
At present, there are three main types of characteristic parameters commonly used in sound signal feature extraction: Linear Predictive Coefficients (LPC), Linear Predictive Cepstral Coefficients (LPCC), and Mel-Frequency Cepstral Coefficients (MFCC). Compared with the first two model-based features, the MFCC makes no assumption or restriction about the sound: it is a set of feature parameters built on the way the human brain processes external sound and on the auditory characteristics of the human ear, and it is currently the feature most frequently used in sound recognition. However, MFCC features are designed around the auditory characteristics of the human ear, which is more sensitive to low-frequency sounds and exhibits a masking effect at high frequencies. Therefore, when facing non-speech audio signals with many high-frequency components, the feature parameters extracted by this method cannot comprehensively represent the acoustic characteristics of the audio and have certain limitations.
The key of the traditional MFCC sound signal feature extraction method is to construct a series of band-pass filter banks (Mel filters) with different weights to simulate the way the human ear conditions sound signals. Research on the human auditory mechanism has found that the traveling wave of a low-frequency sound travels farther along the basilar membrane of the cochlea than that of a high-frequency sound; correspondingly, the Mel filters in the MFCC are fewer in number and sparsely distributed in the high-frequency region, so the traditional MFCC method characterizes the high-frequency part of a sound signal poorly. To overcome these deficiencies of the traditional MFCC and improve its applicability to non-speech audio feature extraction, it is necessary to design a multi-scale fusion MFCC feature extraction method that overcomes the problems of the existing MFCC method.
Disclosure of Invention
In order to solve the above problems in the existing audio feature extraction method, the present invention provides a non-speech audio feature extraction method based on an improved MFCC.
The technical scheme adopted by the invention for realizing the purpose is as follows: a non-speech audio feature extraction method based on improved MFCC comprises the following steps:
s1, collecting sound signals and preprocessing the collected sound signals;
s2, performing MFCC feature extraction on the preprocessed sound signals;
s3, performing EMD on the preprocessed sound signals to obtain IMF components, and extracting time domain characteristic vectors and frequency domain characteristic vectors of the IMF components;
s4, performing first-order and second-order differences on the MFCC coefficients to obtain the MFCC dynamic feature vectors;
and S5, performing feature fusion on the calculated MFCC feature vector, the calculated time domain feature vector, the calculated frequency domain feature vector and the calculated MFCC dynamic feature vector to obtain an improved multi-scale MFCC feature vector.
Preferably, the step S1 includes the following steps:
step S101: normalizing the amplitude of the audio sequence of the sound signal; the functional expression is as follows:
x(m) = x(n) / |x(n)|_max
wherein: x(m) is the normalized sound sequence; x(n) is the original sound sequence; |x(n)|_max is the maximum absolute value of the sound sequence;
step S102: performing framing processing on the normalized audio sequence;
step S103: windowing the audio sequence after framing.
Preferably, in the step S102, the frame length in the framing processing is 20 to 30ms, and the frame shift is 0.3 to 0.5 times the frame length.
Preferably, in step S103, a hamming window is used in the windowing process.
Preferably, the step S2 includes the following steps:
s201: obtaining the spectrum X(k) of each preprocessed time-domain frame of the sound signal through the fast Fourier transform; the functional expression is as follows:
X(k) = Σ_{n=0}^{N-1} x(n) · e^(-j2πnk/N), 0 ≤ k ≤ N-1
wherein: N is the number of points of the Fourier transform, k is the frequency index, and x(n) is a preprocessed time-domain frame of the sound signal;
s202: calculating the energy spectrum |X(k)|^2 of the sound signal by squaring the magnitude spectrum, and then passing |X(k)|^2 through a set of triangular filters that simulate the frequency selectivity of the human ear to perform the Mel nonlinear transformation; the functional expression is as follows:
MelSpec(m) = Σ_{k=0}^{N-1} |X(k)|^2 · H_m(k), 0 ≤ m < M
H_m(k) is the frequency response of the m-th filter; its functional expression is:
H_m(k) = 0,                                  k < f(m-1) or k > f(m+1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)),     f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)),     f(m) ≤ k ≤ f(m+1)
and satisfying
Σ_{m=0}^{M-1} H_m(k) = 1
where f(m) is the center frequency of the m-th triangular filter;
s203: taking the logarithm of each MelSpec(m) obtained from the filter bank to obtain the logarithmic energy E(m); the functional expression is as follows:
E(m) = lg[MelSpec(m)], 0 < m < M
wherein: M is the number of filters;
s204: performing a discrete cosine transform on the logarithmic energy E(m) to obtain a group of Mel cepstrum coefficients F(n); the functional expression is as follows:
F(n) = Σ_{m=1}^{M} E(m) · cos(πn(m - 0.5)/M)
where n is the order of the Mel cepstrum coefficient.
Preferably, in step S3, the IMF components are arranged in order from high frequency to low frequency, the first five IMF components are taken, and the time-domain feature vector and the frequency-domain feature vector thereof are extracted respectively.
Preferably, in step S3, there are 11 time-domain feature vectors, including the average amplitude, standard deviation, square root amplitude, root mean square, peak-to-peak value, skewness, kurtosis, crest factor, margin factor, form factor and pulse index,
the functional expression of the average amplitude is:
X_mean = (1/N) Σ_{i=1}^{N} |x(i)|
the functional expression of the standard deviation is:
σ = sqrt( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^2 )
the functional expression of the square root amplitude is:
X_r = ( (1/N) Σ_{i=1}^{N} sqrt(|x(i)|) )^2
the functional expression of the root mean square is:
X_rms = sqrt( (1/N) Σ_{i=1}^{N} x(i)^2 )
the functional expression of the peak-to-peak value is:
X_pp = max(x(i)) - min(x(i))
the functional expression of the skewness is:
S_k = ( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^3 ) / σ^3
the functional expression of the kurtosis is:
K_u = ( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^4 ) / σ^4
the functional expression of the crest factor is:
C_f = X_P / X_rms
the functional expression of the margin factor is:
L_f = X_P / X_r
the functional expression of the form factor is:
S_f = X_rms / X_mean
the functional expression of the pulse index is:
I_f = X_P / X_mean
wherein: x(i) is the i-th sample of the IMF component, x̄ is its mean value, X_P is the peak value, and N is the corresponding sound signal length.
Preferably, in step S3, there are 2 frequency-domain feature vectors, including the frequency center and the frequency root mean square,
the functional expression of the frequency center is:
FC = ( Σ_{i=1}^{K} f(i) · s(i) ) / ( Σ_{i=1}^{K} s(i) )
the functional expression of the frequency root mean square is:
RMSF = sqrt( ( Σ_{i=1}^{K} f(i)^2 · s(i) ) / ( Σ_{i=1}^{K} s(i) ) )
wherein: K is the number of spectral lines, f(i) is the frequency value of the i-th spectral line, and s(i) is the i-th value of the spectrum.
Preferably, in step S4, the functional expression of the first-order difference of the MFCC coefficients is:
d_t = C_{t+1} - C_t,                                                        t < K
d_t = ( Σ_{k=1}^{K} k · (C_{t+k} - C_{t-k}) ) / sqrt( 2 Σ_{k=1}^{K} k^2 ),  K ≤ t ≤ Q - K
d_t = C_t - C_{t-1},                                                        t > Q - K
wherein: d_t and C_t are the t-th first-order difference and the t-th cepstrum coefficient respectively; Q is the order of the cepstrum coefficients; K is the time difference of the first derivative.
Preferably, in step S4, the second-order difference of the MFCC coefficients is obtained by applying the same difference operation to the first-order differences; the functional expression is:
d2_t = d_{t+1} - d_t,                                                        t < K
d2_t = ( Σ_{k=1}^{K} k · (d_{t+k} - d_{t-k}) ) / sqrt( 2 Σ_{k=1}^{K} k^2 ),  K ≤ t ≤ Q - K
d2_t = d_t - d_{t-1},                                                        t > Q - K
wherein: d2_t and d_t are the t-th second-order and first-order differences respectively; Q is the order of the cepstrum coefficients; K is the time difference of the second derivative.
The non-speech audio feature extraction method based on an improved MFCC (Mel-frequency cepstrum coefficient) of the present invention solves the problem that the traditional MFCC, being designed around the auditory characteristics of the human ear, lacks representation of the high-frequency part of a sound signal. The method can effectively extract the high-frequency part of the audio signal that lies beyond the range the MFCC handles well; it retains the short-time features extracted by the traditional MFCC while also capturing the overall variation of the sound signal; and, through the first-order and second-order differences of the MFCC, it makes the feature information richer and more comprehensive.
Drawings
FIG. 1 is a flow chart of a non-speech audio feature extraction method based on an improved MFCC according to an embodiment of the present invention;
FIG. 2 is a flow chart of pre-processing of a sound signal;
FIG. 3 is a flowchart of an extraction process for MFCC parameters.
Detailed Description
The embodiments of the present invention will be described in further detail with reference to the drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
The method for extracting non-speech audio features based on the improved MFCC in the embodiment, as shown in FIG. 1, includes the following steps:
s1, collecting sound signals and preprocessing the collected sound signals;
s2, performing MFCC feature extraction on the preprocessed sound signals;
s3, performing EMD on the preprocessed sound signals to obtain IMF components, and extracting time domain characteristic vectors and frequency domain characteristic vectors of the IMF components;
s4, performing first-order and second-order differences on the MFCC coefficients to obtain the MFCC dynamic feature vectors;
and S5, performing feature fusion on the calculated MFCC feature vector, the time domain feature vector, the frequency domain feature vector and the MFCC dynamic feature vector to obtain an improved multi-scale MFCC feature vector.
As shown in fig. 2, step S1 may include the following steps:
step S101: normalizing the amplitude of the audio sequence of the sound signal; the functional expression is as follows:
x(m) = x(n) / |x(n)|_max
wherein: x(m) is the normalized sound sequence; x(n) is the original sound sequence; |x(n)|_max is the maximum absolute value of the sound sequence;
step S102: performing framing processing on the normalized audio sequence;
although the sound signal is non-stationary, it can be regarded as stationary over a short interval, so the sound sequence is divided into many very short segments, called frames, in order to obtain the short-time characteristics of the signal; the frame length in the framing processing may be 20 to 30 ms, and the frame shift may be 0.3 to 0.5 times the frame length, so that adjacent frames partially overlap, which avoids losing features because two adjacent frames differ too much;
step S103: windowing the framed audio sequence;
windowing smooths the transition between the beginning and end of each frame, and a Hamming window may be used.
As shown in fig. 3, step S2 may include the steps of:
s201: obtaining the spectrum X(k) of each preprocessed time-domain frame of the sound signal through the fast Fourier transform; the functional expression is as follows:
X(k) = Σ_{n=0}^{N-1} x(n) · e^(-j2πnk/N), 0 ≤ k ≤ N-1
wherein: N is the number of points of the Fourier transform, k is the frequency index, and x(n) is a preprocessed time-domain frame of the sound signal;
s202: calculating the energy spectrum |X(k)|^2 of the sound signal by squaring the magnitude spectrum, and then passing |X(k)|^2 through a set of triangular filters that simulate the frequency selectivity of the human ear to perform the Mel nonlinear transformation; the functional expression is as follows:
MelSpec(m) = Σ_{k=0}^{N-1} |X(k)|^2 · H_m(k), 0 ≤ m < M
H_m(k) is the frequency response of the m-th filter; its functional expression is:
H_m(k) = 0,                                  k < f(m-1) or k > f(m+1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)),     f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)),     f(m) ≤ k ≤ f(m+1)
and satisfying
Σ_{m=0}^{M-1} H_m(k) = 1
where f(m) is the center frequency of the m-th triangular filter;
s203: taking the logarithm of each MelSpec(m) obtained from the filter bank to obtain the logarithmic energy E(m); the functional expression is as follows:
E(m) = lg[MelSpec(m)], 0 < m < M
wherein: M is the number of filters;
s204: performing a discrete cosine transform on the logarithmic energy E(m) to obtain a group of Mel cepstrum coefficients F(n); the functional expression is as follows:
F(n) = Σ_{m=1}^{M} E(m) · cos(πn(m - 0.5)/M)
wherein n is the order of the Mel cepstrum coefficient.
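For reference, steps S201 to S204 can be sketched in NumPy as follows; the sampling rate, FFT size, number of filters and the standard Mel-scale conversion mel = 2595·lg(1 + f/700) are assumptions made for the illustration, since the embodiment does not fix these values.

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_filters=26):
    """Triangular Mel filter bank H_m(k) (illustrative; standard construction assumed)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_points) / sr).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        H[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        H[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return H

def mfcc_frame(frame, sr, n_fft=512, n_filters=26, n_ceps=13):
    """S201-S204 for one frame: FFT -> energy spectrum -> Mel filtering -> log -> DCT."""
    spec = np.fft.rfft(frame, n_fft)                   # S201: X(k)
    energy = np.abs(spec) ** 2                         # S202: |X(k)|^2
    mel_spec = mel_filterbank(sr, n_fft, n_filters) @ energy
    E = np.log10(np.maximum(mel_spec, 1e-12))          # S203: E(m) = lg[MelSpec(m)]
    m = np.arange(1, n_filters + 1)
    n = np.arange(1, n_ceps + 1)[:, None]
    F = (E * np.cos(np.pi * n * (m - 0.5) / n_filters)).sum(axis=1)  # S204: DCT
    return F
```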
for the acquisition of the high frequency components of the sound signal, in addition to the EMD method, an EMD-based modification method such as EEMD, CEEMD, CEEMDAN, iceemda may be used.
In step S3, the IMF components may be arranged in order from high frequency to low frequency, the first five IMF components are taken, and their time-domain feature vectors and frequency-domain feature vectors are extracted respectively. There may be 11 time-domain feature vectors, including the average amplitude, standard deviation, square root amplitude, root mean square, peak-to-peak value, skewness, kurtosis, crest factor, margin factor, form factor and pulse index,
the functional expression of the average amplitude is:
X_mean = (1/N) Σ_{i=1}^{N} |x(i)|
the functional expression of the standard deviation is:
σ = sqrt( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^2 )
the functional expression of the square root amplitude is:
X_r = ( (1/N) Σ_{i=1}^{N} sqrt(|x(i)|) )^2
the functional expression of the root mean square is:
X_rms = sqrt( (1/N) Σ_{i=1}^{N} x(i)^2 )
the functional expression of the peak-to-peak value is:
X_pp = max(x(i)) - min(x(i))
the functional expression of the skewness is:
S_k = ( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^3 ) / σ^3
the functional expression of the kurtosis is:
K_u = ( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^4 ) / σ^4
the functional expression of the crest factor is:
C_f = X_P / X_rms
the functional expression of the margin factor is:
L_f = X_P / X_r
the functional expression of the form factor is:
S_f = X_rms / X_mean
the functional expression of the pulse index is:
I_f = X_P / X_mean
wherein: x(i) is the i-th sample of the IMF component, x̄ is its mean value, X_P is the peak value, and N is the corresponding sound signal length.
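The eleven time-domain statistics listed above can be computed for one IMF component as in the following NumPy sketch, which assumes the standard definitions of these indicators.

```python
import numpy as np

def time_domain_features(x):
    """11 time-domain statistics of one IMF component (standard definitions assumed)."""
    x = np.asarray(x, dtype=float)
    mean_amp = np.mean(np.abs(x))                    # average amplitude
    std = np.std(x)                                  # standard deviation
    sra = np.mean(np.sqrt(np.abs(x))) ** 2           # square root amplitude
    rms = np.sqrt(np.mean(x ** 2))                   # root mean square
    ptp = np.max(x) - np.min(x)                      # peak-to-peak value
    skew = np.mean((x - x.mean()) ** 3) / std ** 3   # skewness
    kurt = np.mean((x - x.mean()) ** 4) / std ** 4   # kurtosis
    peak = np.max(np.abs(x))
    crest = peak / rms                               # crest factor
    margin = peak / sra                              # margin factor
    form = rms / mean_amp                            # form factor
    pulse = peak / mean_amp                          # pulse index
    return np.array([mean_amp, std, sra, rms, ptp, skew, kurt,
                     crest, margin, form, pulse])
```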
In step S3, there may be 2 frequency-domain feature vectors, including the frequency center and the frequency root mean square,
the functional expression of the frequency center is:
FC = ( Σ_{i=1}^{K} f(i) · s(i) ) / ( Σ_{i=1}^{K} s(i) )
the functional expression of the frequency root mean square is:
RMSF = sqrt( ( Σ_{i=1}^{K} f(i)^2 · s(i) ) / ( Σ_{i=1}^{K} s(i) ) )
wherein: K is the number of spectral lines, f(i) is the frequency value of the i-th spectral line, and s(i) is the i-th value of the spectrum.
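Likewise, the two frequency-domain features of an IMF component can be sketched as follows; using the amplitude spectrum of an FFT as the spectrum values s(i) is an assumption of the illustration.

```python
import numpy as np

def frequency_domain_features(x, sr):
    """Frequency center and frequency root mean square of one IMF component."""
    x = np.asarray(x, dtype=float)
    s = np.abs(np.fft.rfft(x))                   # spectrum values s(i) (amplitude spectrum assumed)
    f = np.fft.rfftfreq(len(x), d=1.0 / sr)      # frequency value f(i) of each spectral line
    fc = np.sum(f * s) / np.sum(s)                       # frequency center
    rmsf = np.sqrt(np.sum(f ** 2 * s) / np.sum(s))       # frequency root mean square
    return fc, rmsf
```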
for different sound signals, after EMD decomposition, the method can not be limited to only retaining the first five IMF components, at most retaining all IMF components with the correlation degree greater than 0.3 with the original signal, and subsequently calculating corresponding time domain and frequency domain characteristics, wherein the time domain and frequency domain characteristics of the signals are not limited to the above formula, and can replace other formulas to construct characteristics according to the characteristics of the analyzed sound signals in different aspects, such as root mean square energy representing energy; attack time, zero-crossing rate and autocorrelation in the time domain; spectral centroid in the frequency domain, spectral flatness, spectral flux, etc.
In order to obtain richer information, first-order and second-order differences of the MFCC coefficients are computed to obtain the MFCC dynamic feature vectors.
In step S4, the functional expression of the first-order difference of the MFCC coefficients is:
d_t = C_{t+1} - C_t,                                                        t < K
d_t = ( Σ_{k=1}^{K} k · (C_{t+k} - C_{t-k}) ) / sqrt( 2 Σ_{k=1}^{K} k^2 ),  K ≤ t ≤ Q - K
d_t = C_t - C_{t-1},                                                        t > Q - K
wherein: d_t and C_t are the t-th first-order difference and the t-th cepstrum coefficient respectively; Q is the order of the cepstrum coefficients; K is the time difference of the first derivative.
In step S4, the second-order difference of the MFCC coefficients is obtained by applying the same difference operation to the first-order differences; the functional expression is:
d2_t = d_{t+1} - d_t,                                                        t < K
d2_t = ( Σ_{k=1}^{K} k · (d_{t+k} - d_{t-k}) ) / sqrt( 2 Σ_{k=1}^{K} k^2 ),  K ≤ t ≤ Q - K
d2_t = d_t - d_{t-1},                                                        t > Q - K
wherein: d2_t and d_t are the t-th second-order and first-order differences respectively; Q is the order of the cepstrum coefficients; K is the time difference of the second derivative. The first-order and second-order differences make the feature information richer and more comprehensive.
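The dynamic features can be sketched as below; the weighting and the edge handling are assumptions consistent with the expressions above, and several equivalent delta formulations exist in practice.

```python
import numpy as np

def delta(ceps, K=2):
    """Weighted symmetric difference of an MFCC matrix (frames x coefficients).
    The normalization sqrt(2 * sum(k^2)) matches the expression above; edge frames
    are handled here by edge padding, which is a simplifying assumption."""
    T = ceps.shape[0]
    padded = np.pad(ceps, ((K, K), (0, 0)), mode='edge')
    denom = np.sqrt(2.0 * sum(k * k for k in range(1, K + 1)))
    d = np.zeros_like(ceps, dtype=float)
    for k in range(1, K + 1):
        d += k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
    return d / denom

# The second-order dynamic features are obtained by applying the same operator again:
# delta2 = delta(delta(mfcc_matrix))
```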
The invention not only extracts the short-time features of the traditional MFCC but also captures the overall variation of the sound signal, and it can process not only speech audio but also non-speech audio such as mechanical sound signals.
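Finally, step S5 can be illustrated by simple concatenation, reusing the helper functions from the preceding sketches; averaging the frame-level MFCC and difference coefficients into a single vector per signal is an assumption about how the fusion is realized, since only the fusion of the vectors is specified.

```python
import numpy as np

def fuse_features(mfcc_frames, imfs, sr):
    """S5: concatenate the MFCC features, the MFCC dynamic features and the IMF
    time/frequency-domain features into one multi-scale feature vector.
    Frame-level coefficients are averaged over time (an assumption about the fusion)."""
    d1 = delta(mfcc_frames)                      # first-order difference (sketch above)
    d2 = delta(d1)                               # second-order difference
    mfcc_part = np.concatenate([mfcc_frames.mean(axis=0),
                                d1.mean(axis=0),
                                d2.mean(axis=0)])
    imf_part = np.concatenate([
        np.concatenate([time_domain_features(imf),
                        np.array(frequency_domain_features(imf, sr))])
        for imf in imfs])
    return np.concatenate([mfcc_part, imf_part])
```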
The embodiments of the present invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A non-speech audio feature extraction method based on improved MFCC is characterized by comprising the following steps:
s1, collecting sound signals and preprocessing the collected sound signals;
s2, performing MFCC feature extraction on the preprocessed sound signals;
s3, performing EMD on the preprocessed sound signals to obtain IMF components, and extracting time domain characteristic vectors and frequency domain characteristic vectors of the IMF components;
s4, performing first-order and second-order differences on the MFCC coefficients to obtain the MFCC dynamic feature vectors;
and S5, performing feature fusion on the calculated MFCC feature vector, the time domain feature vector, the frequency domain feature vector and the MFCC dynamic feature vector to obtain an improved multi-scale MFCC feature vector.
2. The improved MFCC-based non-speech audio feature extraction method of claim 1, wherein the step S1 comprises the steps of:
step S101: normalizing the amplitude of the audio sequence of the sound signal; the functional expression is as follows:
x(m) = x(n) / |x(n)|_max
wherein: x(m) is the normalized sound sequence; x(n) is the original sound sequence; |x(n)|_max is the maximum absolute value of the sound sequence;
step S102: performing framing processing on the normalized audio sequence;
step S103: windowing the audio sequence after framing.
3. The method of claim 2, wherein in step S102, the frame length in the framing process is 20-30 ms, and the frame shift is 0.3-0.5 times the frame length.
4. The method of claim 2, wherein in step S103, a hamming window is used in the windowing process.
5. The improved MFCC-based non-speech audio feature extraction method of claim 1, wherein the step S2 comprises the steps of:
s201: obtaining the spectrum X(k) of each preprocessed time-domain frame of the sound signal through the fast Fourier transform; the functional expression is as follows:
X(k) = Σ_{n=0}^{N-1} x(n) · e^(-j2πnk/N), 0 ≤ k ≤ N-1
wherein: N is the number of points of the Fourier transform, k is the frequency index, and x(n) is a preprocessed time-domain frame of the sound signal;
s202: calculating the energy spectrum |X(k)|^2 of the sound signal by squaring the magnitude spectrum, and then passing |X(k)|^2 through a set of triangular filters that simulate the frequency selectivity of the human ear to perform the Mel nonlinear transformation; the functional expression is as follows:
MelSpec(m) = Σ_{k=0}^{N-1} |X(k)|^2 · H_m(k), 0 ≤ m < M
H_m(k) is the frequency response of the m-th filter; its functional expression is:
H_m(k) = 0,                                  k < f(m-1) or k > f(m+1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)),     f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)),     f(m) ≤ k ≤ f(m+1)
and satisfying
Σ_{m=0}^{M-1} H_m(k) = 1
where f(m) is the center frequency of the m-th triangular filter;
s203: taking the logarithm of each MelSpec(m) obtained from the filter bank to obtain the logarithmic energy E(m); the functional expression is as follows:
E(m) = lg[MelSpec(m)], 0 < m < M
wherein: M is the number of filters;
s204: performing a discrete cosine transform on the logarithmic energy E(m) to obtain a group of Mel cepstrum coefficients F(n); the functional expression is as follows:
F(n) = Σ_{m=1}^{M} E(m) · cos(πn(m - 0.5)/M)
where n is the order of the Mel cepstrum coefficient.
6. The method of claim 1, wherein in step S3, the IMF components are arranged in order from high frequency to low frequency, and the first five IMF components are taken to extract their time-domain feature vectors and frequency-domain feature vectors, respectively.
7. The method of claim 1, wherein in step S3, there are 11 time-domain feature vectors, including the mean amplitude, standard deviation, square root amplitude, root mean square, peak-to-peak value, skewness, kurtosis, crest factor, margin factor, form factor and pulse index,
the functional expression of the mean amplitude is:
X_mean = (1/N) Σ_{i=1}^{N} |x(i)|
the functional expression of the standard deviation is:
σ = sqrt( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^2 )
the functional expression of the square root amplitude is:
X_r = ( (1/N) Σ_{i=1}^{N} sqrt(|x(i)|) )^2
the functional expression of the root mean square is:
X_rms = sqrt( (1/N) Σ_{i=1}^{N} x(i)^2 )
the functional expression of the peak-to-peak value is:
X_pp = max(x(i)) - min(x(i))
the functional expression of the skewness is:
S_k = ( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^3 ) / σ^3
the functional expression of the kurtosis is:
K_u = ( (1/N) Σ_{i=1}^{N} (x(i) - x̄)^4 ) / σ^4
the functional expression of the crest factor is:
C_f = X_P / X_rms
the functional expression of the margin factor is:
L_f = X_P / X_r
the functional expression of the form factor is:
S_f = X_rms / X_mean
the functional expression of the pulse index is:
I_f = X_P / X_mean
wherein: x(i) is the i-th sample of the IMF component, x̄ is its mean value, X_P is the peak value, and N is the corresponding sound signal length.
8. The method of claim 1, wherein in step S3, there are 2 frequency-domain feature vectors, including the frequency center and the frequency root mean square,
the functional expression of the frequency center is:
FC = ( Σ_{i=1}^{K} f(i) · s(i) ) / ( Σ_{i=1}^{K} s(i) )
the functional expression of the frequency root mean square is:
RMSF = sqrt( ( Σ_{i=1}^{K} f(i)^2 · s(i) ) / ( Σ_{i=1}^{K} s(i) ) )
wherein: K is the number of spectral lines, f(i) is the frequency value of the i-th spectral line, and s(i) is the i-th value of the spectrum.
9. The method of claim 1, wherein in step S4, the functional expression of the first-order difference of the MFCC coefficients is:
d_t = C_{t+1} - C_t,                                                        t < K
d_t = ( Σ_{k=1}^{K} k · (C_{t+k} - C_{t-k}) ) / sqrt( 2 Σ_{k=1}^{K} k^2 ),  K ≤ t ≤ Q - K
d_t = C_t - C_{t-1},                                                        t > Q - K
wherein: d_t and C_t are the t-th first-order difference and the t-th cepstrum coefficient respectively; Q is the order of the cepstrum coefficients; K is the time difference of the first derivative.
10. The method of claim 1, wherein in step S4, the second-order difference of the MFCC coefficients is obtained by applying the same difference operation to the first-order differences; the functional expression is:
d2_t = d_{t+1} - d_t,                                                        t < K
d2_t = ( Σ_{k=1}^{K} k · (d_{t+k} - d_{t-k}) ) / sqrt( 2 Σ_{k=1}^{K} k^2 ),  K ≤ t ≤ Q - K
d2_t = d_t - d_{t-1},                                                        t > Q - K
wherein: d2_t and d_t are the t-th second-order and first-order differences respectively; Q is the order of the cepstrum coefficients; K is the time difference of the second derivative.
CN202210256684.6A 2022-03-16 2022-03-16 Non-speech audio feature extraction method based on improved MFCC Pending CN114613389A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210256684.6A CN114613389A (en) 2022-03-16 2022-03-16 Non-speech audio feature extraction method based on improved MFCC

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210256684.6A CN114613389A (en) 2022-03-16 2022-03-16 Non-speech audio feature extraction method based on improved MFCC

Publications (1)

Publication Number Publication Date
CN114613389A true CN114613389A (en) 2022-06-10

Family

ID=81862961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210256684.6A Pending CN114613389A (en) 2022-03-16 2022-03-16 Non-speech audio feature extraction method based on improved MFCC

Country Status (1)

Country Link
CN (1) CN114613389A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863951A (en) * 2022-07-11 2022-08-05 中国科学院合肥物质科学研究院 Rapid dysarthria detection method based on modal decomposition
CN114863951B (en) * 2022-07-11 2022-09-23 中国科学院合肥物质科学研究院 Rapid dysarthria detection method based on modal decomposition
CN117153197A (en) * 2023-10-27 2023-12-01 云南师范大学 Speech emotion recognition method, apparatus, and computer-readable storage medium
CN117153197B (en) * 2023-10-27 2024-01-02 云南师范大学 Speech emotion recognition method, apparatus, and computer-readable storage medium
CN117475360A (en) * 2023-12-27 2024-01-30 南京纳实医学科技有限公司 Biological sign extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN
CN117475360B (en) * 2023-12-27 2024-03-26 南京纳实医学科技有限公司 Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN

Similar Documents

Publication Publication Date Title
CN107610715B (en) Similarity calculation method based on multiple sound characteristics
CN114613389A (en) Non-speech audio feature extraction method based on improved MFCC
KR100908121B1 (en) Speech feature vector conversion method and apparatus
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
CN108198545B (en) Speech recognition method based on wavelet transformation
CN108922514B (en) Robust feature extraction method based on low-frequency log spectrum
CN108682432B (en) Speech emotion recognition device
Wanli et al. The research of feature extraction based on MFCC for speaker recognition
CN107274887A (en) Speaker's Further Feature Extraction method based on fusion feature MGFCC
CN110942784A (en) Snore classification system based on support vector machine
CN105679321B (en) Voice recognition method, device and terminal
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
Zhang et al. Low-Delay Speech Enhancement Using Perceptually Motivated Target and Loss.
Ali et al. Speech enhancement using dilated wave-u-net: an experimental analysis
CN112863517B (en) Speech recognition method based on perceptual spectrum convergence rate
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
Singh et al. A comparative study of recognition of speech using improved MFCC algorithms and Rasta filters
CN104900227A (en) Voice characteristic information extraction method and electronic equipment
Nasreen et al. Speech analysis for automatic speech recognition
Rahali et al. Robust Features for Speech Recognition using Temporal Filtering Technique in the Presence of Impulsive Noise
Tantisatirapong et al. Comparison of feature extraction for accent dependent Thai speech recognition system
Vimal Study on the Behaviour of Mel Frequency Cepstral Coffecient Algorithm for Different Windows
CN110634473A (en) Voice digital recognition method based on MFCC
CN110610724A (en) Voice endpoint detection method and device based on non-uniform sub-band separation variance

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination