CN112562642A - Dynamic multi-band nonlinear speech feature extraction method - Google Patents
- Publication number
- CN112562642A (application CN202011198847.7A)
- Authority
- CN
- China
- Prior art keywords
- frequency
- correlation
- dimension
- point
- extracting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
Abstract
The invention discloses a dynamic multi-band nonlinear speech feature extraction method. A speech sample is filtered and divided into frequency bands by a Bark filter bank based on the auditory characteristics of the human ear. From the 24 divided band signals, an adaptive frequency-division factor α is obtained by computing the zero-crossing rate of each band. Then, in bands 0 to α, the spectrum and its logarithm are computed and Bark-frequency cepstral coefficient (BFCC) features are extracted by discrete cosine transform; in bands α+1 to 24, the signals are embedded in phase space, the largest Lyapunov exponent and correlation dimension features are extracted, and the features are then normalized. Because the adaptive division factor and band-by-band processing make the processed signal conform more closely to human auditory characteristics and to the actual signal, speech feature parameters with better performance can be extracted.
Description
Technical Field
The invention relates to speech recognition methods, and in particular to a dynamic multi-band nonlinear speech feature extraction method.
Background
Language is the most natural and convenient communication tool for human beings. Speech recognition technology enables a computer to simulate the human recognition process and convert a human speech signal into corresponding text or commands. Its basic aim is a machine with human-like hearing: one that can receive human speech, understand human intent, and respond accordingly, thereby providing great help to human development. In recent years, with the rapid development of IT industries such as the internet, computers, mobile phones, and communications, many application systems require simple, efficient, and friendly human-computer interaction, so natural spoken communication between humans and computers has become an important research subject.
Existing speech recognition systems depend strongly on environmental conditions, which causes variation in the extracted speech feature parameters; improving the robustness of these parameters is therefore key to improving the speech recognition rate.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method for extracting feature parameters for speech recognition. It divides a speech sample into 24 frequency bands using a Bark filter bank that conforms to the auditory characteristics of the human ear, so that the processed signal matches the human auditory system and speech feature parameters with better performance can be extracted.
In order to achieve the technical purpose, the invention is realized by the following technical scheme:
the invention provides a dynamic multiband nonlinear speech feature extraction method, which comprises the following steps:
filtering and frequency dividing are carried out on the voice sample by adopting a bark filter bank based on the auditory characteristic of human ears, and 24 frequency band signals after frequency dividing are self-adaptive to obtain a frequency dividing factor alpha; then the following steps are carried out:
(1) in the frequency bands from 0 to alpha, after the voice logarithm operation of the voice signal, discrete cosine transform is adopted to extract the Barker frequency cepstrum coefficient characteristics, the mean value of each order of parameters is calculated and arranged;
(2) embedding the signals into a phase space in frequency bands from alpha +1 to 24, extracting a maximum Lyapunov exponent and associated dimensional characteristics, solving the mean value of each order of parameters, and arranging;
(3) and integrating the Barker frequency cepstrum coefficient characteristic, the maximum Lyapunov exponent and the associated dimensional characteristic into a dynamic multiband nonlinear characteristic parameter.
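The adaptive division-factor step can be illustrated with a small sketch. The patent obtains α from per-band zero-crossing rates but does not state the exact decision rule; the threshold rule below is an assumption for illustration only.

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of adjacent-sample sign changes in a signal."""
    return float(np.mean(np.signbit(x[:-1]) != np.signbit(x[1:])))

def division_factor(band_signals, zcr_threshold=0.25):
    """Pick the adaptive division factor alpha from per-band zero-crossing
    rates: alpha is the highest of the 24 Bark-band indices whose ZCR stays
    below the threshold.  The threshold rule is an assumption, not from
    the patent."""
    alpha = 0
    for i, band in enumerate(band_signals, start=1):
        if zero_crossing_rate(band) < zcr_threshold:
            alpha = i
    return alpha

# toy example: 12 slowly oscillating bands (low ZCR), then 12 noisy bands (high ZCR)
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1000, endpoint=False)
bands = [np.sin(2 * np.pi * 5 * (i + 1) * t) for i in range(12)]
bands += [rng.standard_normal(1000) for _ in range(12)]
alpha = division_factor(bands)
print(alpha)  # → 12
```

On this toy input the sinusoidal bands have ZCR well below 0.25 while the noise bands sit near 0.5, so the sketch returns α = 12.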
Further, in step (1) of the dynamic multi-band nonlinear speech feature extraction method provided by the invention, the Bark-frequency cepstral coefficient feature parameters are extracted by the following steps:
step 1) construct the auditory-perception wavelet function on the perception domain, with Δb = (b_2 − b_1)/(K − 1), where K is a scale parameter, [b_1, b_2] is the auditory-perception frequency bandwidth, and b denotes the auditory-perception frequency;
step 2) introduce the functional relation between linear frequency f and auditory-perception frequency b:
b = 6.7·asinh[(f − 20)/600], where asinh denotes the inverse hyperbolic sine function;
step 3) substitute the relation of step 2) into the perception-domain function of step 1) to obtain the expression of the auditory-perception wavelet function at linear frequency;
step 4) after computing the speech energy, pass it through the Bark filter bank BW_m(k), 1 ≤ m ≤ 24, and extract the Bark-frequency cepstral parameters by a discrete cosine transform of the log energies.
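Step 4) — log band energies followed by a discrete cosine transform — can be sketched as follows. This is a minimal illustration that assumes precomputed Bark-band energies; it does not implement the patent's auditory-perception wavelet filter bank.

```python
import numpy as np

def bfcc(band_energies, num_coeffs=12):
    """Bark-frequency cepstral coefficients: DCT-II of the log Bark-band
    energies.  A minimal sketch; the filter-bank stage is assumed done
    and the number of coefficients (12) is an illustrative choice."""
    log_e = np.log(np.asarray(band_energies, dtype=float) + 1e-12)
    M = len(log_e)
    n = np.arange(M)
    # DCT-II basis written out explicitly to keep the sketch dependency-free
    coeffs = np.array([
        np.sum(log_e * np.cos(np.pi * k * (2 * n + 1) / (2 * M)))
        for k in range(num_coeffs)
    ])
    return coeffs

energies = np.linspace(1.0, 24.0, 24)  # toy energies for the 24 Bark bands
c = bfcc(energies)
print(c.shape)  # → (12,)
```

The zeroth coefficient is simply the sum of the log energies, which gives a quick sanity check on the transform.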
Further, in step (2) of the dynamic multi-band nonlinear speech feature extraction method provided by the present invention, the largest Lyapunov exponent is extracted with the Wolf algorithm, specifically comprising the following steps:
step 1) for a discrete time series x_1, x_2, x_3, …, x_N, determine the reconstruction dimension m with the G-P algorithm and the delay interval τ with the average mutual information method, and reconstruct the phase space X(t) = (x_t, x_{t−τ}, …, x_{t−(m−1)τ}); the number of phase points is N′ = N − (m−1)τ, where N is the total number of points in the series;
step 2) among the N′ phase points, take the initial phase point x_0 as the base point, select the point x_1 nearest to x_0 as the end point to form an initial vector, and record the Euclidean distance between base point and end point as L(t_0);
step 3) set a time step (evolution time) t, evolve the initial vector along the trajectory to obtain a new vector, record the Euclidean distance between the corresponding points as L(t_1), and record the exponential growth rate over the corresponding interval as λ_i = ln[L(t_1)/L(t_0)]/t;
step 4) iterate until all phase points have been traversed, and take the mean of the exponential growth rates as the estimate of the largest Lyapunov exponent (LLE).
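A simplified version of the Wolf-style estimate described in steps 1)–4) might look like this. It omits the replacement-vector bookkeeping of the full Wolf method, and the embedding parameters are fixed by hand rather than chosen by the G-P and mutual-information procedures:

```python
import numpy as np

def largest_lyapunov_wolf(x, m=2, tau=1, t_evolve=1):
    """Wolf-style LLE estimate: delay-embed the series, track how each
    phase point diverges from its nearest neighbour over t_evolve steps,
    and average the log growth rates (simplified sketch)."""
    N = len(x)
    n_points = N - (m - 1) * tau
    # delay embedding: row t is (x_t, x_{t-tau}, ..., x_{t-(m-1)tau})
    X = np.array([[x[t - j * tau] for j in range(m)]
                  for t in range((m - 1) * tau, N)])
    rates = []
    for i in range(n_points - t_evolve):
        d = np.linalg.norm(X - X[i], axis=1)
        d[max(0, i - 1):i + 2] = np.inf   # exclude self and immediate neighbours
        j = int(np.argmin(d))
        if j + t_evolve >= n_points:
            continue
        L0 = np.linalg.norm(X[j] - X[i])
        L1 = np.linalg.norm(X[j + t_evolve] - X[i + t_evolve])
        if L0 > 0 and L1 > 0:
            rates.append(np.log(L1 / L0) / t_evolve)
    return float(np.mean(rates))

# sanity check on the logistic map at r = 4, whose known LLE is ln 2 ≈ 0.693
x = np.empty(2000); x[0] = 0.3
for k in range(1999):
    x[k + 1] = 4 * x[k] * (1 - x[k])
lle = largest_lyapunov_wolf(x, m=2, tau=1, t_evolve=1)
print(lle)
```

A positive estimate near ln 2 on the logistic map is the expected behaviour; speech bands would be embedded the same way with the patent's m and τ.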
Further, in step (2) of the dynamic multi-band nonlinear speech feature extraction method provided by the present invention, the correlation dimension parameters are extracted by the following steps:
step 1) reconstruct the phase space: for a given one-dimensional time series x_1, x_2, x_3, …, x_N, select an appropriate embedding dimension m_0 and delay τ, and construct an m_0-dimensional phase space of delay vectors;
step 2) compute the correlation integral C(r), in which ‖x_i − x_j‖ denotes the Euclidean distance between state vectors x_i and x_j and θ(u) is the step function (θ(u) = 1 for u ≥ 0, θ(u) = 0 for u < 0); C(r) is the fraction of point pairs on the phase-space attractor whose distance is less than r, and reflects the degree of convergence and divergence of the phase points;
step 3) estimate the correlation dimension D: as the series length N → ∞ and the distance r → 0, if the correlation integral C(r) obeys the power law C(r) ∝ r^D, the attractor has fractal character, and D and C(r) approximately satisfy the log-linear relation D(m_0) = ln C(r)/ln r, so the estimate corresponding to m_0 can be obtained by fitting;
step 4) estimate the embedding dimension: increase m_0, substitute into steps 2) and 3), and repeat until D(m_0) converges to a saturation value that no longer changes as m_0 increases; this saturated m_0 is the final embedding dimension.
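The correlation-integral and slope-fitting steps can be sketched as follows. The test data (points on a line segment, whose correlation dimension is 1) and the radius range are illustrative choices, not from the patent:

```python
import numpy as np

def correlation_integral(X, r):
    """C(r): fraction of distinct point pairs in phase space closer than r."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    pairs = d[np.triu_indices(n, k=1)]
    return float(np.mean(pairs < r))

def correlation_dimension(X, radii):
    """Grassberger-Procaccia estimate: slope of ln C(r) versus ln r."""
    C = np.array([correlation_integral(X, r) for r in radii])
    mask = C > 0
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(C[mask]), 1)
    return float(slope)

# points uniform on a line segment embedded in 3-D: correlation dimension ~ 1
rng = np.random.default_rng(1)
t = rng.uniform(0, 1, 400)
X = np.stack([t, 2 * t, -t], axis=1)
seg_len = np.linalg.norm([1.0, 2.0, -1.0])
radii = np.logspace(-2, -0.5, 8) * seg_len
D = correlation_dimension(X, radii)
print(D)
```

Saturation of this slope as the embedding dimension grows is what step 4) uses to pick the final m_0.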
Compared with the prior art, the technical scheme adopted by the invention has the following beneficial effects:
the invention adopts the Barker filter bank to divide, obtains the frequency division factor in a self-adaptive way, processes the characteristics of the voice signal by frequency bands, and leads the processed signal to be more in line with the auditory characteristics of human beings, thereby being capable of extracting the voice characteristic parameters with more excellent performance.
Drawings
Fig. 1 is a flow chart of the extraction of the frequency-division factor.
Fig. 2 is a flow chart of dynamic multiband nonlinear speech feature parameter extraction.
Detailed Description
The technical scheme of the invention is explained in further detail below with reference to the drawings:
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring to fig. 1, the present invention provides a method for extracting dynamic multi-band nonlinear speech feature parameters. A speech sample is filtered and segmented by a Bark filter bank conforming to the auditory characteristics of the human ear, and the frequency-division factor α is obtained adaptively. According to the energy distribution of the speech, bands 0 to α are described by Bark-frequency cepstral feature parameters; bands α+1 to 24 are described by the largest Lyapunov exponent and correlation dimension of nonlinear dynamics. In this embodiment, a speech corpus is used as the experimental object; the specific method is as follows:
A. Extraction of the Bark-frequency cepstral parameters, comprising the following steps:
step 1) construct the auditory-perception wavelet function on the perception domain, with Δb = (b_2 − b_1)/(K − 1), where K is a scale parameter and [b_1, b_2] is the auditory-perception frequency bandwidth;
step 2) introduce the functional relation between linear frequency and auditory-perception frequency given by Telien Miller:
b = 6.7·asinh[(f − 20)/600];
step 3) substitute this relation to obtain the expression of the auditory-perception wavelet function at linear frequency;
step 4) compute the speech energy and pass it through the Bark filter bank BW_m(k), 1 ≤ m ≤ 24, then extract the Bark-frequency cepstral parameters by a discrete cosine transform of the log energies;
B. Extraction of the largest Lyapunov exponent, using the classical Wolf algorithm, comprising the following steps:
step 1) for a discrete time series x_1, x_2, x_3, …, x_N, determine the reconstruction dimension m with the G-P algorithm and the delay interval τ with the average mutual information method, and reconstruct the phase space X(t) = (x_t, x_{t−τ}, …, x_{t−(m−1)τ}); the number of phase points is N′ = N − (m−1)τ;
step 2) among the N′ phase points, take the initial phase point x_0 as the base point, select the point x_1 nearest to x_0 as the end point to form an initial vector, and record the Euclidean distance between base point and end point as L(t_0);
step 3) set a time step (evolution time) t, evolve the initial vector along the trajectory to obtain a new vector, record the Euclidean distance between the corresponding points as L(t_1), and record the exponential growth rate over the corresponding interval as λ_i = ln[L(t_1)/L(t_0)]/t;
step 4) iterate until all phase points have been traversed, and take the mean of the exponential growth rates as the estimate of the LLE;
C. Extraction of the correlation dimension parameters, comprising the following steps:
step 1) reconstruct the phase space: for a given one-dimensional time series x_1, x_2, x_3, …, x_N, select an appropriate embedding dimension m_0 and delay τ, and construct an m_0-dimensional phase space of delay vectors;
step 2) compute the correlation integral C(r), in which ‖x_i − x_j‖ denotes the Euclidean distance between state vectors x_i and x_j and θ(u) is the step function (θ(u) = 1 for u ≥ 0, θ(u) = 0 for u < 0); C(r) is the fraction of point pairs on the phase-space attractor whose distance is less than r, and reflects the degree of convergence and divergence of the phase points;
step 3) estimate the correlation dimension D: as the series length N → ∞ and the distance r → 0, if the correlation integral C(r) obeys the power law C(r) ∝ r^D, the attractor has fractal character, and D and C(r) approximately satisfy the log-linear relation D(m_0) = ln C(r)/ln r, so the estimate corresponding to m_0 can be obtained by fitting;
step 4) estimate the embedding dimension: increase m_0, substitute into steps 2) and 3), and repeat until D(m_0) converges to a saturation value that no longer changes as m_0 increases; this saturated m_0 is the final embedding dimension.
D. Unified characterization, comprising the following steps:
1. Obtain the frequency-division factor α from the 24 divided band signals. In bands 0 to α, extract the BFCC features of the signals, compute the mean of each order of parameters, and arrange them.
2. In bands α+1 to 24, extract the largest Lyapunov exponent and correlation dimension feature parameters of the signals.
3. The arrangement is shown schematically as follows:
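Since the schematic is given as a figure, the arrangement can also be sketched in code: concatenate the per-band BFCC means with the Lyapunov-exponent and correlation-dimension values into one vector. The min-max normalisation below is an assumption; the abstract only says the features are normalised.

```python
import numpy as np

def assemble_feature_vector(bfcc_means, lles, corr_dims):
    """Concatenate the per-order BFCC means (bands 0..alpha) with the
    largest-Lyapunov-exponent and correlation-dimension values (bands
    alpha+1..24), then min-max normalise.  The normalisation choice is
    an assumption, not specified by the patent."""
    v = np.concatenate([np.asarray(bfcc_means, float),
                        np.asarray(lles, float),
                        np.asarray(corr_dims, float)])
    vmin, vmax = v.min(), v.max()
    return (v - vmin) / (vmax - vmin) if vmax > vmin else np.zeros_like(v)

# toy values: 12 BFCC means plus 6 LLE and 6 correlation-dimension entries
feat = assemble_feature_vector(bfcc_means=np.arange(12.0),
                               lles=[0.4] * 6, corr_dims=[2.1] * 6)
print(feat.shape)  # → (24,)
```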
Further, the dynamic multi-band nonlinear features are fed to Bayesian network, k-nearest-neighbor, multilayer neural network, and support vector machine classifiers, and performance is tested with 10-fold cross-validation.
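The 10-fold protocol can be sketched with a plain nearest-neighbour stand-in for the classifiers named above (the description names Bayesian network, kNN, neural network, and SVM classifiers; the classifier and toy data below are illustrative substitutes):

```python
import numpy as np

def one_nn_predict(train_X, train_y, test_X):
    """1-nearest-neighbour classifier (stand-in for the patent's classifiers)."""
    d = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=-1)
    return train_y[np.argmin(d, axis=1)]

def k_fold_accuracy(X, y, k=10, seed=0):
    """Plain k-fold cross-validation accuracy, matching the 10-fold protocol."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    accs = []
    for f in folds:
        train = np.setdiff1d(idx, f)          # everything outside this fold
        pred = one_nn_predict(X[train], y[train], X[f])
        accs.append(np.mean(pred == y[f]))
    return float(np.mean(accs))

# two well-separated Gaussian classes as toy 4-D feature vectors
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (50, 4)), rng.normal(3, 0.3, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
acc = k_fold_accuracy(X, y, k=10)
print(acc)
```

Well-separated classes should score near-perfect accuracy, confirming the evaluation harness itself before real speech features are plugged in.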
The results of the experiment are shown in the following table:
the foregoing is only a partial embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.
Claims (4)
1. A dynamic multi-band nonlinear speech feature extraction method, characterized in that:
the speech sample is filtered and divided into frequency bands by a Bark filter bank based on the auditory characteristics of the human ear, and an adaptive frequency-division factor α is obtained from the 24 divided band signals; then the following steps are carried out:
(1) in bands 0 to α, after taking the logarithm of the speech spectrum, Bark-frequency cepstral coefficient (BFCC) features are extracted by discrete cosine transform, and the mean of each order of parameters is computed and arranged;
(2) in bands α+1 to 24, the signals are embedded in phase space, the largest Lyapunov exponent and correlation dimension features are extracted, and the mean of each order of parameters is computed and arranged;
(3) the BFCC features, the largest Lyapunov exponent, and the correlation dimension are combined into a dynamic multi-band nonlinear feature parameter.
2. The dynamic multi-band nonlinear speech feature extraction method according to claim 1, wherein in step (1) the Bark-frequency cepstral coefficient feature parameters are extracted by the following steps:
step 1) construct the auditory-perception wavelet function on the perception domain, with Δb = (b_2 − b_1)/(K − 1), where K is a scale parameter, [b_1, b_2] is the auditory-perception frequency bandwidth, and b denotes the auditory-perception frequency;
step 2) introduce the functional relation between linear frequency f and auditory-perception frequency b:
b = 6.7·asinh[(f − 20)/600], where asinh denotes the inverse hyperbolic sine function;
step 3) substitute the relation of step 2) into the perception-domain function of step 1) to obtain the expression of the auditory-perception wavelet function at linear frequency;
step 4) after computing the speech energy, pass it through the Bark filter bank BW_m(k), 1 ≤ m ≤ 24, and extract the Bark-frequency cepstral parameters by a discrete cosine transform of the log energies.
3. The dynamic multi-band nonlinear speech feature extraction method according to claim 1, wherein in step (2) the largest Lyapunov exponent is extracted with the Wolf algorithm, specifically comprising the following steps:
step 1) for a discrete time series x_1, x_2, x_3, …, x_N, determine the reconstruction dimension m with the G-P algorithm and the delay interval τ with the average mutual information method, and reconstruct the phase space X(t) = (x_t, x_{t−τ}, …, x_{t−(m−1)τ}); the number of phase points is N′ = N − (m−1)τ, where N is the total number of points in the series;
step 2) among the N′ phase points, take the initial phase point x_0 as the base point, select the point x_1 nearest to x_0 as the end point to form an initial vector, and record the Euclidean distance between base point and end point as L(t_0);
step 3) set a time step (evolution time) t, evolve the initial vector along the trajectory to obtain a new vector, record the Euclidean distance between the corresponding points as L(t_1), and record the exponential growth rate over the corresponding interval as λ_i = ln[L(t_1)/L(t_0)]/t.
4. The dynamic multi-band nonlinear speech feature extraction method according to claim 1, wherein in step (2) the correlation dimension feature parameters are extracted by the following steps:
step 1) reconstruct the phase space: for a given one-dimensional time series x_1, x_2, x_3, …, x_N, select an appropriate embedding dimension m_0 and delay τ, and construct an m_0-dimensional phase space of delay vectors;
step 2) compute the correlation integral C(r), in which ‖x_i − x_j‖ denotes the Euclidean distance between state vectors x_i and x_j and θ(u) is the step function (θ(u) = 1 for u ≥ 0, θ(u) = 0 for u < 0); C(r) is the fraction of point pairs on the phase-space attractor whose distance is less than r, and reflects the degree of convergence and divergence of the phase points;
step 3) estimate the correlation dimension D: as the series length N → ∞ and the distance r → 0, if the correlation integral C(r) obeys the power law C(r) ∝ r^D, the attractor has fractal character, and D and C(r) approximately satisfy the log-linear relation D(m_0) = ln C(r)/ln r, so the estimate corresponding to m_0 can be obtained by fitting;
step 4) estimate the embedding dimension: increase m_0, substitute into steps 2) and 3), and repeat until D(m_0) converges to a saturation value that no longer changes as m_0 increases; this saturated m_0 is the final embedding dimension.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011198847.7A CN112562642A (en) | 2020-10-31 | 2020-10-31 | Dynamic multi-band nonlinear speech feature extraction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112562642A true CN112562642A (en) | 2021-03-26 |
Family
ID=75041322
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020059029A1 (en) * | 1999-01-11 | 2002-05-16 | Doran Todder | Method for the diagnosis of thought states by analysis of interword silences |
CN102646415A (en) * | 2012-04-10 | 2012-08-22 | 苏州大学 | Method for extracting characteristic parameters in speech recognition |
CN109065073A (en) * | 2018-08-16 | 2018-12-21 | 太原理工大学 | Speech-emotion recognition method based on depth S VM network model |
Non-Patent Citations (4)
Title |
---|
Hou Limin et al., "Improving Speaker Recognition Performance Using Nonlinear Speech Features", Pattern Recognition and Artificial Intelligence, vol. 19, no. 6, pages 776-781 *
Zhou Qiang et al., "Vocal Cord Pathology Recognition by Multi-band Nonlinear Analysis of Voice", Acta Acustica, vol. 39, no. 1, pages 111-118 *
Zhang Xiaojun et al., "Research on Pathological Voice Recognition Based on Nonlinear Methods", Information Security and Communications Privacy, no. 3, pages 113-115 *
Zhang Xiaojun et al., "Research on Speech Feature Parameters Using Optimized Multi-Feature Combination", Communications Technology, vol. 45, no. 12, pages 98-100 *
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |