CN112562642A - Dynamic multi-band nonlinear speech feature extraction method

Dynamic multi-band nonlinear speech feature extraction method

Info

Publication number
CN112562642A
CN112562642A (application CN202011198847.7A)
Authority
CN
China
Prior art keywords
frequency
correlation
dimension
point
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011198847.7A
Other languages
Chinese (zh)
Inventor
张晓俊 (Zhang Xiaojun)
伍远博 (Wu Yuanbo)
周长伟 (Zhou Changwei)
朱欣程 (Zhu Xincheng)
陶智 (Tao Zhi)
赵鹤鸣 (Zhao Heming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN202011198847.7A
Publication of CN112562642A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit

Abstract

The invention discloses a dynamic multiband nonlinear speech feature extraction method. A Bark filter bank based on the auditory characteristics of the human ear filters the speech sample and divides it into 24 frequency bands, and a frequency-division factor α is obtained adaptively from the 24 band signals by computing their zero-crossing rates. In the bands from 0 to α, the spectrum and its logarithm are computed and Bark-frequency cepstral coefficient (BFCC) features are extracted via the discrete cosine transform; in the bands from α+1 to 24, the signals are embedded in phase space, the largest Lyapunov exponent and correlation dimension features are extracted, and the features are then normalized. By using an adaptive frequency-division factor and processing each sub-band separately, the processed signal better matches human auditory characteristics and real conditions, so speech feature parameters with better performance can be extracted.

Description

Dynamic multi-band nonlinear speech feature extraction method
Technical Field
The invention relates to a speech recognition method, and in particular to a dynamic multi-band nonlinear speech feature extraction method.
Background
Language is the most natural and convenient communication tool for human beings. Speech recognition technology uses a computer to simulate the human recognition process and convert a human speech signal into corresponding text or commands. Its basic goal is a machine with human-like hearing: one that can receive human speech, understand human intent and respond accordingly, which would be of great help to human development. In recent years, with the rapid development of IT industries such as the internet, computing, mobile phones and communications, many application systems require simple, efficient and friendly human-computer interaction, so natural speech communication between humans and computers has become an important research subject.
Existing speech recognition systems depend strongly on environmental conditions, which introduces variation into the extracted speech feature parameters; improving the robustness of these parameters is therefore key to improving the speech recognition rate.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method for extracting feature parameters for speech recognition, which divides a speech sample into 24 frequency bands using a Bark filter bank that conforms to the auditory characteristics of the human ear, so that the processed signal matches the human auditory system and speech feature parameters with better performance can be extracted.
In order to achieve the technical purpose, the invention is realized by the following technical scheme:
the invention provides a dynamic multiband nonlinear speech feature extraction method, which comprises the following steps:
A Bark filter bank based on the auditory characteristics of the human ear filters the speech sample and divides it into 24 frequency bands, and a frequency-division factor α is obtained adaptively from the 24 band signals; then the following steps are carried out:
(1) in the bands from 0 to α, after computing the logarithmic spectrum of the speech signal, extract the Bark-frequency cepstral coefficient (BFCC) features using the discrete cosine transform, take the mean of each order of parameters, and arrange them;
(2) in the bands from α+1 to 24, embed the signals in phase space, extract the largest Lyapunov exponent and correlation dimension features, take the mean of each order of parameters, and arrange them;
(3) combine the BFCC features, the largest Lyapunov exponent and the correlation dimension features into the dynamic multiband nonlinear feature parameters.
Further, in the dynamic multiband nonlinear speech feature extraction method provided by the invention, extracting the Bark-frequency cepstral coefficient feature parameters in step (1) specifically comprises the following steps:
step 1), the Bark-domain wavelet function is expressed as:

[equation shown in the source only as an image]

from which the functional expression in the auditory perception domain is obtained:

[equation shown in the source only as an image]

where $\Delta b = (b_2 - b_1)/(K - 1)$ is the Bark-domain step, $K$ is a scale parameter, $[b_1, b_2]$ is the auditory-perception frequency bandwidth, and $b$ denotes the auditory perception frequency;
step 2), introduce the functional relation between linear frequency and auditory perception frequency:

$$b = 6.7\,\operatorname{asinh}\!\left(\frac{f - 20}{600}\right)$$

where asinh denotes the inverse hyperbolic sine function;
step 3), substituting the functional relation from step 2) into the auditory-perception-domain expression from step 1) gives the expression of the auditory perception wavelet function at linear frequency:

[equation shown in the source only as an image]

step 4), after the speech energy is calculated, it is passed through the Bark filter bank $BW_m(k)$, $1 \le m \le 24$; the Bark-frequency cepstral parameters are then extracted via the discrete cosine transform of the logarithmic band energies.
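The pipeline of steps 1) to 4) can be sketched in code. The sketch below assumes a triangular filter bank spaced uniformly on the Bark axis, because the patent's wavelet filter shapes appear in the source only as images; the FFT size, window and number of cepstral orders are likewise illustrative assumptions, not values from the patent.

```python
import numpy as np

def hz_to_bark(f):
    # Auditory-perception mapping used by the patent: b = 6.7*asinh((f-20)/600).
    return 6.7 * np.arcsinh((np.asarray(f, float) - 20.0) / 600.0)

def bark_to_hz(b):
    return 600.0 * np.sinh(np.asarray(b, float) / 6.7) + 20.0

def bark_filterbank(n_fft, sr, n_bands=24):
    # Triangular filters with centers uniform in Bark (an assumed stand-in
    # for the patent's Bark-domain wavelet filters).
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = bark_to_hz(np.linspace(hz_to_bark(20.0), hz_to_bark(sr / 2.0),
                                   n_bands + 2))
    fb = np.zeros((n_bands, freqs.size))
    for m in range(n_bands):
        lo, c, hi = edges[m], edges[m + 1], edges[m + 2]
        up = (freqs - lo) / max(c - lo, 1e-9)
        down = (hi - freqs) / max(hi - c, 1e-9)
        fb[m] = np.clip(np.minimum(up, down), 0.0, None)
    return fb

def bfcc(frame, sr, n_fft=512, n_bands=24, n_ceps=12):
    # Step 4): band energies BW_m(k) -> log -> DCT -> cepstral parameters.
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft)) ** 2
    log_e = np.log(bark_filterbank(n_fft, sr, n_bands) @ spec + 1e-12)
    n = np.arange(n_ceps)[:, None]
    k = np.arange(n_bands)[None, :]
    dct = np.cos(np.pi * n * (2 * k + 1) / (2 * n_bands))  # DCT-II basis
    return dct @ log_e
```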
Further, in the dynamic multiband nonlinear speech feature extraction method provided by the present invention, in step (2) the largest Lyapunov exponent is extracted with the Wolf algorithm, which specifically comprises the following steps:
step 1), for a discrete time series $x_1, x_2, x_3, \ldots, x_N$, determine the reconstruction dimension $m$ with the G-P algorithm and the delay interval $\tau$ with the average mutual information method, and reconstruct the phase space $X(t) = (x_t, x_{t-\tau}, \ldots, x_{t-(m-1)\tau})$; the number of phase points is $M = N - (m-1)\tau$, where $N$ is the total number of points in the series;

step 2), among these $M$ phase points, take the initial phase point $x_0$ as the base point and select the point $x_1$ nearest to $x_0$ as the end point, forming the initial vector; the Euclidean distance from base point to end point is denoted $L(t_0)$;

step 3), after a time step (evolution time) the initial vector evolves along the trajectory into a new vector whose base-to-end Euclidean distance is denoted $L(t_1)$; the exponential growth rate over this interval is

$$\lambda_1 = \frac{1}{t_1 - t_0} \ln \frac{L(t_1)}{L(t_0)};$$

step 4), iterate in this way until all phase points have been traversed, and take the mean of the exponential growth rates as the estimate of the largest Lyapunov exponent (LLE):

$$\lambda = \frac{1}{M - 1} \sum_{k=1}^{M-1} \frac{1}{t_k - t_{k-1}} \ln \frac{L(t_k)}{L(t_{k-1})}$$
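A compact sketch of this estimate, assuming $m$ and $\tau$ have already been fixed by the G-P algorithm and average mutual information as in step 1). It simplifies Wolf's procedure by re-picking the nearest neighbor at each evolution step instead of applying Wolf's angle-constrained replacement, and the evolution length and temporal-exclusion window are assumptions.

```python
import numpy as np

def reconstruct(x, m, tau):
    # Delay embedding: row t is (x_t, x_{t+tau}, ..., x_{t+(m-1)tau}).
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[i * tau:i * tau + n] for i in range(m)])

def lle_wolf(x, m, tau, evolve=5, min_sep=10):
    X = reconstruct(np.asarray(x, float), m, tau)
    n = len(X)
    rates, t = [], 0
    while t + evolve < n:
        d = np.linalg.norm(X - X[t], axis=1)
        d[max(0, t - min_sep):t + min_sep + 1] = np.inf  # skip near-in-time points
        d[n - evolve:] = np.inf       # neighbor must survive the evolution
        j = int(np.argmin(d))
        L0 = d[j]
        if not np.isfinite(L0) or L0 == 0.0:
            break
        L1 = np.linalg.norm(X[t + evolve] - X[j + evolve])
        if L1 > 0.0:
            # Exponential growth rate ln(L(t1)/L(t0)) / (t1 - t0) for this step.
            rates.append(np.log(L1 / L0) / evolve)
        t += evolve
    # LLE estimate: mean of the per-step exponential growth rates.
    return float(np.mean(rates)) if rates else 0.0
```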
Further, in the dynamic multiband nonlinear speech feature extraction method provided by the present invention, in step (2), the extraction of the correlation dimension parameters comprises the following steps:
step 1), reconstruct the phase space: for a given one-dimensional time series $x_1, x_2, x_3, \ldots, x_N$, select an appropriate embedding dimension $m_0$ and delay $\tau$, and construct the $m_0$-dimensional phase space

$$X_i = (x_i, x_{i+\tau}, \ldots, x_{i+(m_0-1)\tau}), \quad i = 1, \ldots, M, \quad M = N - (m_0 - 1)\tau;$$

step 2), compute the correlation integral function

$$C(r) = \frac{2}{M(M-1)} \sum_{1 \le i < j \le M} \theta\bigl(r - \lVert X_i - X_j \rVert\bigr)$$

where $\lVert X_i - X_j \rVert$ is the Euclidean distance between the state vectors $X_i$ and $X_j$, and $\theta(u)$ is the step function

$$\theta(u) = \begin{cases} 0, & u \le 0 \\ 1, & u > 0 \end{cases}$$

$C(r)$ is the ratio of the number of point pairs on the phase-space attractor whose distance is less than $r$ to the total number of point pairs, and reflects the degree of convergence or divergence of the phase points;

step 3), estimate the correlation dimension $D$: when the series length $N \to \infty$ and the correlation distance $r$ is small, i.e. $r \to 0$, if the correlation integral $C(r)$ obeys the power law

$$C(r) \propto r^{D}$$

the attractor has fractal character, and the correlation dimension and correlation integral approximately satisfy the log-linear relation $D(m_0) = \ln C(r) / \ln r$; fitting this slope yields the estimate corresponding to $m_0$;

step 4), estimate the embedding dimension: increase $m_0$, substitute it into steps 2) and 3), and repeat the calculation until $D(m_0)$ converges to a saturation value, i.e. the correlation dimension of the system no longer changes as $m_0$ grows; the corresponding $m_0$ is the final embedding dimension.
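The G-P procedure of steps 1) to 4) translates almost directly into code. In this sketch, reconstruct() is the delay-embedding helper from the Lyapunov sketch above; the choice of radii (between the 5th and 50th percentile of pairwise distances) and the saturation tolerance are assumptions about where the scaling region lies.

```python
import numpy as np

def correlation_dimension(x, m0, tau, n_radii=12):
    # Grassberger-Procaccia estimate of D for embedding dimension m0.
    X = reconstruct(np.asarray(x, float), m0, tau)
    diff = X[:, None, :] - X[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)[np.triu_indices(len(X), k=1)]
    radii = np.geomspace(np.percentile(dists, 5), np.percentile(dists, 50),
                         n_radii)
    C = np.array([(dists < r).mean() for r in radii])  # correlation integral C(r)
    ok = C > 0
    # Scaling law C(r) ~ r^D: D is the slope of ln C(r) against ln r.
    D, _ = np.polyfit(np.log(radii[ok]), np.log(C[ok]), 1)
    return D

def embedding_dimension(x, tau, m_max=10, tol=0.05):
    # Step 4): grow m0 until D(m0) saturates; that m0 is the final embedding.
    prev = None
    for m0 in range(2, m_max + 1):
        D = correlation_dimension(x, m0, tau)
        if prev is not None and abs(D - prev) < tol:
            return m0, D
        prev = D
    return m_max, prev
```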
Compared with the prior art, the above technical scheme gives the invention the following beneficial effects:
The invention divides the signal with a Bark filter bank, adaptively obtains the frequency-division factor, and processes the speech signal features band by band, so that the processed signal better matches human auditory characteristics and speech feature parameters with better performance can be extracted.
Drawings
Fig. 1 is a flow chart of the extraction of the frequency-division factor α.
Fig. 2 is a flow chart of dynamic multiband nonlinear speech feature parameter extraction.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the attached drawings:
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring to fig. 1, the present invention provides a method for extracting dynamic multiband nonlinear speech feature parameters. A Bark filter bank conforming to the auditory characteristics of the human ear filters and divides the speech sample, and the frequency-division factor α is obtained adaptively. According to the energy distribution characteristics of speech, cepstral feature parameters based on the Bark frequency describe the bands from 0 to α, while the largest Lyapunov exponent and correlation dimension from nonlinear dynamics describe the bands from α+1 to 24. In this example a speech corpus is used as the experimental object, and the specific method is given below, after a sketch of one possible rule for choosing α:
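Fig. 1 itself is not reproduced in the text; the abstract states only that α is obtained adaptively from the zero-crossing rates of the 24 band signals. The sketch below is therefore one plausible reading, in which α is the last band whose zero-crossing rate stays below a threshold, so that quasi-harmonic low bands receive cepstral features and noisier high bands receive the nonlinear features. The Butterworth band filtering, the threshold value and the decision rule are all assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def zero_crossing_rate(x):
    signs = np.signbit(x).astype(np.int8)
    return float(np.mean(np.abs(np.diff(signs))))

def split_factor(speech, sr, band_edges_hz, zcr_thresh=0.25):
    # band_edges_hz: the 25 edges of the 24 Bark bands (e.g. computed with
    # the bark_to_hz helper from the cepstrum sketch above).
    alpha = 0
    for m in range(24):
        lo = max(band_edges_hz[m], 1.0)
        hi = min(band_edges_hz[m + 1], sr / 2.0 - 1.0)
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        band = sosfiltfilt(sos, speech)
        if zero_crossing_rate(band) < zcr_thresh:
            alpha = m + 1   # bands numbered 1..24; alpha = last low-ZCR band
    return alpha
```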
A. Extraction of the Bark-frequency cepstral parameters comprises the following steps:
step 1) select the Bark-domain wavelet function:

[equation shown in the source only as an image]

from which the functional expression in the auditory perception domain is obtained:

[equation shown in the source only as an image]

where $\Delta b = (b_2 - b_1)/(K - 1)$ is the Bark-domain step, $K$ is a scale parameter, and the auditory-perception frequency bandwidth is $[b_1, b_2]$.
Step 2) introducing a functional relation between linear frequency and auditory perception frequency given by Telien Miller:
b=6.7asinh[(f-20)/600];
step 3) substituting this relation gives the expression of the auditory perception wavelet function at linear frequency:

[equation shown in the source only as an image]

step 4) after the speech energy is calculated, it is passed through the Bark filter bank $BW_m(k)$, $1 \le m \le 24$; the Bark-frequency cepstral parameters are then extracted via the discrete cosine transform of the logarithmic band energies;
B. Extraction of the largest Lyapunov exponent uses the classical Wolf algorithm and comprises the following steps:
step 1) for a discrete time series $x_1, x_2, x_3, \ldots, x_N$, determine the reconstruction dimension $m$ with the G-P algorithm and the delay interval $\tau$ with the average mutual information method, and reconstruct the phase space $X(t) = (x_t, x_{t-\tau}, \ldots, x_{t-(m-1)\tau})$; the number of phase points is $M = N - (m-1)\tau$;

step 2) among these $M$ phase points, take the initial phase point $x_0$ as the base point and select the point $x_1$ nearest to $x_0$ as the end point, forming the initial vector; the Euclidean distance from base point to end point is denoted $L(t_0)$.

step 3) after a time step (evolution time) the initial vector evolves along the trajectory into a new vector whose base-to-end Euclidean distance is denoted $L(t_1)$; the exponential growth rate over this interval is

$$\lambda_1 = \frac{1}{t_1 - t_0} \ln \frac{L(t_1)}{L(t_0)};$$

step 4) iterate in this way until all phase points have been traversed, and take the mean of the exponential growth rates as the estimate of the LLE:

$$\lambda = \frac{1}{M - 1} \sum_{k=1}^{M-1} \frac{1}{t_k - t_{k-1}} \ln \frac{L(t_k)}{L(t_{k-1})}$$
C. Extraction of the correlation dimension parameters comprises the following steps:
step 1), reconstruct the phase space: for a given one-dimensional time series $x_1, x_2, x_3, \ldots, x_N$, select an appropriate embedding dimension $m_0$ and delay $\tau$, and construct the $m_0$-dimensional phase space

$$X_i = (x_i, x_{i+\tau}, \ldots, x_{i+(m_0-1)\tau}), \quad i = 1, \ldots, M, \quad M = N - (m_0 - 1)\tau;$$

step 2), compute the correlation integral function

$$C(r) = \frac{2}{M(M-1)} \sum_{1 \le i < j \le M} \theta\bigl(r - \lVert X_i - X_j \rVert\bigr)$$

where $\lVert X_i - X_j \rVert$ is the Euclidean distance between the state vectors $X_i$ and $X_j$, and $\theta(u)$ is the step function

$$\theta(u) = \begin{cases} 0, & u \le 0 \\ 1, & u > 0 \end{cases}$$

$C(r)$ is the ratio of the number of point pairs on the phase-space attractor whose distance is less than $r$ to the total number of point pairs, and reflects the degree of convergence or divergence of the phase points.

step 3), estimate the correlation dimension $D$: when the series length $N \to \infty$ and the correlation distance $r$ is small, i.e. $r \to 0$, if the correlation integral $C(r)$ obeys the power law

$$C(r) \propto r^{D}$$

the attractor has fractal character, and the correlation dimension and correlation integral approximately satisfy the log-linear relation $D(m_0) = \ln C(r) / \ln r$; fitting this slope yields the estimate corresponding to $m_0$.

step 4), estimate the embedding dimension: increase $m_0$, substitute it into steps 2) and 3), and repeat the calculation until $D(m_0)$ converges to a saturation value, i.e. the correlation dimension of the system no longer changes as $m_0$ grows; the corresponding $m_0$ is the final embedding dimension.
D. Unified characterization comprises the following steps:
1. The frequency-division factor α is obtained from the 24 band signals after frequency division. In the bands from 0 to α, extract the Bark-frequency cepstral coefficient features of the signal, take the mean of each order of parameters, and arrange them.
2. In the bands from α+1 to 24, extract the largest Lyapunov exponent and correlation dimension feature parameters of the signal.
3. The arrangement is shown in the source only as an image; a sketch of this assembly step is given below.
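A minimal sketch of the assembly, reusing the bfcc, lle_wolf and correlation_dimension helpers from the sketches above; frame_signal is a hypothetical framing helper, and the z-score normalization is an assumption, since the abstract mentions normalization without specifying its form.

```python
import numpy as np

def assemble_features(band_signals, sr, alpha, m, tau):
    # band_signals: the 24 band-limited signals, bands numbered 1..24.
    feats = []
    for idx, band in enumerate(band_signals, start=1):
        if idx <= alpha:
            # Mean of each cepstral order over the frames of this band.
            frames = frame_signal(band, sr)          # hypothetical helper
            ceps = np.array([bfcc(f, sr) for f in frames])
            feats.extend(ceps.mean(axis=0))
        else:
            feats.append(lle_wolf(band, m, tau))
            feats.append(correlation_dimension(band, m, tau))
    v = np.asarray(feats, float)
    return (v - v.mean()) / (v.std() + 1e-12)   # assumed z-score normalization
```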
Further, the dynamic multiband nonlinear features are fed to Bayesian network, K-nearest-neighbor, multilayer neural network and support vector machine classifiers, and performance is tested with 10-fold cross-validation, as sketched below.
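This evaluation protocol can be approximated with scikit-learn. Since scikit-learn ships no Bayesian-network classifier, Gaussian naive Bayes stands in for it here; that substitution and all hyperparameters are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def evaluate(features, labels):
    # features: (n_samples, n_dims) dynamic multiband nonlinear vectors.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    classifiers = {
        "NB (stand-in for Bayesian network)": GaussianNB(),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000),
        "SVM": SVC(kernel="rbf"),
    }
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, features, labels, cv=cv)
        print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```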
The results of the experiment are reported in the source only as a table image, which cannot be reconstructed from the text.
the foregoing is only a partial embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (4)

1. A dynamic multi-band nonlinear speech feature extraction method, characterized in that:
a Bark filter bank based on the auditory characteristics of the human ear filters the speech sample and divides it into 24 frequency bands, and a frequency-division factor α is obtained adaptively from the 24 band signals; then the following steps are carried out:
(1) in the bands from 0 to α, after computing the logarithmic spectrum of the speech signal, extract the Bark-frequency cepstral coefficient (BFCC) features using the discrete cosine transform, take the mean of each order of parameters, and arrange them;
(2) in the bands from α+1 to 24, embed the signals in phase space, extract the largest Lyapunov exponent and correlation dimension features, take the mean of each order of parameters, and arrange them;
(3) combine the BFCC features, the largest Lyapunov exponent and the correlation dimension features into the dynamic multiband nonlinear feature parameters.
2. The dynamic multiband nonlinear speech feature extraction method according to claim 1, characterized in that in step (1), extracting the Bark-frequency cepstral coefficient feature parameters specifically comprises the following steps:
step 1), the Bark-domain wavelet function is expressed as:

[equation shown in the source only as an image]

from which the functional expression in the auditory perception domain is obtained:

[equation shown in the source only as an image]

where $\Delta b = (b_2 - b_1)/(K - 1)$ is the Bark-domain step, $K$ is a scale parameter, $[b_1, b_2]$ is the auditory-perception frequency bandwidth, and $b$ denotes the auditory perception frequency;
step 2), introduce the functional relation between linear frequency and auditory perception frequency:

$$b = 6.7\,\operatorname{asinh}\!\left(\frac{f - 20}{600}\right)$$

where asinh denotes the inverse hyperbolic sine function;
step 3), substituting the functional relation from step 2) into the auditory-perception-domain expression from step 1) gives the expression of the auditory perception wavelet function at linear frequency:

[equation shown in the source only as an image]

step 4), after the speech energy is calculated, it is passed through the Bark filter bank $BW_m(k)$, $1 \le m \le 24$, and the Bark-frequency cepstral parameters are extracted via the discrete cosine transform of the logarithmic band energies.
3. The dynamic multiband nonlinear speech feature extraction method according to claim 1, characterized in that in step (2), the largest Lyapunov exponent is extracted with the Wolf algorithm, which specifically comprises the following steps:
step 1), for a discrete time series $x_1, x_2, x_3, \ldots, x_N$, determine the reconstruction dimension $m$ with the G-P algorithm and the delay interval $\tau$ with the average mutual information method, and reconstruct the phase space $X(t) = (x_t, x_{t-\tau}, \ldots, x_{t-(m-1)\tau})$; the number of phase points is $M = N - (m-1)\tau$, where the parameter $N$ is the total number of points in the series;

step 2), among these $M$ phase points, take the initial phase point $x_0$ as the base point and select the point $x_1$ nearest to $x_0$ as the end point, forming the initial vector; the Euclidean distance from base point to end point is denoted $L(t_0)$;

step 3), set a time step (evolution time); the initial vector evolves along the trajectory into a new vector whose base-to-end Euclidean distance is denoted $L(t_1)$, and the exponential growth rate over this interval is

$$\lambda_1 = \frac{1}{t_1 - t_0} \ln \frac{L(t_1)}{L(t_0)};$$

step 4), iterate in this way until all phase points have been traversed, and take the mean of the exponential growth rates as the estimate of the LLE:

$$\lambda = \frac{1}{M - 1} \sum_{k=1}^{M-1} \frac{1}{t_k - t_{k-1}} \ln \frac{L(t_k)}{L(t_{k-1})}$$
4. The dynamic multiband nonlinear speech feature extraction method according to claim 1, characterized in that in step (2), the extraction of the correlation dimension feature parameters comprises the following steps:
step 1), reconstruct the phase space: for a given one-dimensional time series $x_1, x_2, x_3, \ldots, x_N$, select an appropriate embedding dimension $m_0$ and delay $\tau$, and construct the $m_0$-dimensional phase space

$$X_i = (x_i, x_{i+\tau}, \ldots, x_{i+(m_0-1)\tau}), \quad i = 1, \ldots, M, \quad M = N - (m_0 - 1)\tau;$$

step 2), compute the correlation integral function

$$C(r) = \frac{2}{M(M-1)} \sum_{1 \le i < j \le M} \theta\bigl(r - \lVert X_i - X_j \rVert\bigr)$$

where $\lVert X_i - X_j \rVert$ is the Euclidean distance between the state vectors $X_i$ and $X_j$, and $\theta(u)$ is the step function

$$\theta(u) = \begin{cases} 0, & u \le 0 \\ 1, & u > 0 \end{cases}$$

$C(r)$ is the ratio of the number of point pairs on the phase-space attractor whose distance is less than $r$ to the total number of point pairs, and reflects the degree of convergence or divergence of the phase points;

step 3), estimate the correlation dimension $D$: when the series length $N \to \infty$ and the correlation distance $r$ is small, i.e. $r \to 0$, if the correlation integral $C(r)$ obeys the power law

$$C(r) \propto r^{D}$$

the attractor has fractal character, and the correlation dimension and correlation integral approximately satisfy the log-linear relation $D(m_0) = \ln C(r) / \ln r$; fitting this slope yields the estimate corresponding to $m_0$;

step 4), estimate the embedding dimension: increase $m_0$, substitute it into steps 2) and 3), and repeat the calculation until $D(m_0)$ converges to a saturation value, i.e. the correlation dimension of the system no longer changes as $m_0$ grows; the corresponding $m_0$ is the final embedding dimension.
Application CN202011198847.7A, priority date 2020-10-31, filing date 2020-10-31: Dynamic multi-band nonlinear speech feature extraction method (publication CN112562642A, Pending)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011198847.7A | 2020-10-31 | 2020-10-31 | Dynamic multi-band nonlinear speech feature extraction method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011198847.7A | 2020-10-31 | 2020-10-31 | Dynamic multi-band nonlinear speech feature extraction method

Publications (1)

Publication Number | Publication Date
CN112562642A | 2021-03-26

Family

ID=75041322

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011198847.7A (publication CN112562642A, Pending) | Dynamic multi-band nonlinear speech feature extraction method | 2020-10-31 | 2020-10-31

Country Status (1)

Country | Link
CN | CN112562642A


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020059029A1 * 1999-01-11 2002-05-16 Doran Todder Method for the diagnosis of thought states by analysis of interword silences
CN102646415A * 2012-04-10 2012-08-22 苏州大学 (Suzhou University) Method for extracting characteristic parameters in speech recognition
CN109065073A * 2018-08-16 2018-12-21 太原理工大学 (Taiyuan University of Technology) Speech emotion recognition method based on deep SVM network model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HOU Limin et al., "Improving Speaker Recognition Performance with Nonlinear Speech Features", Pattern Recognition and Artificial Intelligence (模式识别与人工智能), vol. 19, no. 6, pp. 776-781 *
ZHOU Qiang et al., "Recognition of Vocal-Fold Pathology Based on Multiband Nonlinear Analysis of Voice", Acta Acustica (声学学报), vol. 39, no. 1, pp. 111-118 *
ZHANG Xiaojun et al., "Research on Pathological Voice Recognition Based on Nonlinear Methods", Information Security and Communications Privacy (信息安全与通信保密), no. 3, pp. 113-115 *
ZHANG Xiaojun et al., "Research on Speech Feature Parameters Based on Multi-Feature Combination Optimization", Communications Technology (通信技术), vol. 45, no. 12, pp. 98-100 *

Similar Documents

Publication Publication Date Title
TW546630B (en) Optimized local feature extraction for automatic speech recognition
Winursito et al. Improvement of MFCC feature extraction accuracy using PCA in Indonesian speech recognition
EP2507790B1 (en) Method and system for robust audio hashing.
Sarikaya et al. High resolution speech feature parametrization for monophone-based stressed speech recognition
US7082394B2 (en) Noise-robust feature extraction using multi-layer principal component analysis
Daqrouq et al. Average framing linear prediction coding with wavelet transform for text-independent speaker identification system
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
KR100930060B1 (en) Recording medium on which a signal detecting method, apparatus and program for executing the method are recorded
CN112735460B (en) Beam forming method and system based on time-frequency masking value estimation
CN110800048B (en) Processing of multichannel spatial audio format input signals
CN102646415A (en) Method for extracting characteristic parameters in speech recognition
WO2022141868A1 (en) Method and apparatus for extracting speech features, terminal, and storage medium
Verma et al. Smartphone application for fault recognition
CN111968651A (en) WT (WT) -based voiceprint recognition method and system
CN112562642A (en) Dynamic multi-band nonlinear speech feature extraction method
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
CN110600038A (en) Audio fingerprint dimension reduction method based on discrete kini coefficient
Huang et al. Perceptual speech hashing authentication algorithm based on linear prediction analysis
CN110197657A (en) A kind of dynamic speech feature extracting method based on cosine similarity
CN103180847A (en) Music query method and apparatus
Noyum et al. Boosting the predictive accurary of singer identification using discrete wavelet transform for feature extraction
Chen et al. Feature Analysis and Optimization of Underwater Target Radiated Noise Based on t-SNE
Faek et al. Speaker recognition from noisy spoken sentences
Li et al. A reliable voice perceptual hash authentication algorithm
TWI749547B (en) Speech enhancement system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination