CN112562642A - Dynamic multi-band nonlinear speech feature extraction method - Google Patents
- Publication number
- CN112562642A (application CN202011198847.7A)
- Authority
- CN
- China
- Prior art keywords
- frequency
- correlation
- dimension
- point
- extracting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
Abstract
The invention discloses a dynamic multi-band nonlinear speech feature extraction method. A speech sample is filtered and divided into frequency bands by a Bark filter bank based on the auditory characteristics of the human ear. From the 24 divided band signals, an adaptive frequency-division factor α is obtained by computing the zero-crossing rate of each band. Then, in bands 0 to α, the spectrum and its logarithm are computed and Bark-frequency cepstral coefficient (BFCC) features are extracted by discrete cosine transform; in bands α+1 to 24, the signals are embedded in phase space, the largest Lyapunov exponent and correlation dimension features are extracted, and the features are then normalized. Because the adaptive division factor and band-by-band processing make the processed signal conform more closely to human auditory characteristics and to the actual signal, speech feature parameters with better performance can be extracted.
Description
Technical Field
The invention relates to speech recognition methods, and in particular to a dynamic multi-band nonlinear speech feature extraction method.
Background
Language is the most natural and convenient communication tool for human beings. Speech recognition technology enables a computer to simulate the human recognition process and convert a human speech signal into corresponding text or commands. Its basic aim is a machine with human-like hearing: one that can receive human speech, understand human intent, and respond accordingly, thereby providing great help to human development. In recent years, with the rapid development of IT industries such as the internet, computers, mobile phones, and communications, many application systems require simple, efficient, and friendly human-computer interaction, so natural spoken communication between humans and computers has become an important research subject.
Existing speech recognition systems depend strongly on environmental conditions, which causes variation in the extracted speech feature parameters; improving the robustness of these parameters is therefore key to improving the speech recognition rate.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method for extracting feature parameters for speech recognition. It divides a speech sample into 24 frequency bands using a Bark filter bank that conforms to the auditory characteristics of the human ear, so that the processed signal matches the human auditory system and speech feature parameters with better performance can be extracted.
In order to achieve the technical purpose, the invention is realized by the following technical scheme:
the invention provides a dynamic multiband nonlinear speech feature extraction method, which comprises the following steps:
filtering and frequency dividing are carried out on the voice sample by adopting a bark filter bank based on the auditory characteristic of human ears, and 24 frequency band signals after frequency dividing are self-adaptive to obtain a frequency dividing factor alpha; then the following steps are carried out:
(1) in the frequency bands from 0 to alpha, after the voice logarithm operation of the voice signal, discrete cosine transform is adopted to extract the Barker frequency cepstrum coefficient characteristics, the mean value of each order of parameters is calculated and arranged;
(2) embedding the signals into a phase space in frequency bands from alpha +1 to 24, extracting a maximum Lyapunov exponent and associated dimensional characteristics, solving the mean value of each order of parameters, and arranging;
(3) and integrating the Barker frequency cepstrum coefficient characteristic, the maximum Lyapunov exponent and the associated dimensional characteristic into a dynamic multiband nonlinear characteristic parameter.
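The adaptive division-factor step can be illustrated with a small sketch. The patent obtains α from per-band zero-crossing rates but does not state the exact decision rule; the threshold rule below is an assumption for illustration only.

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of adjacent-sample sign changes in a signal."""
    return float(np.mean(np.signbit(x[:-1]) != np.signbit(x[1:])))

def division_factor(band_signals, zcr_threshold=0.25):
    """Pick the adaptive division factor alpha from per-band zero-crossing
    rates: alpha is the highest of the 24 Bark-band indices whose ZCR stays
    below the threshold.  The threshold rule is an assumption, not from
    the patent."""
    alpha = 0
    for i, band in enumerate(band_signals, start=1):
        if zero_crossing_rate(band) < zcr_threshold:
            alpha = i
    return alpha

# toy example: 12 slowly oscillating bands (low ZCR), then 12 noisy bands (high ZCR)
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1000, endpoint=False)
bands = [np.sin(2 * np.pi * 5 * (i + 1) * t) for i in range(12)]
bands += [rng.standard_normal(1000) for _ in range(12)]
alpha = division_factor(bands)
print(alpha)  # → 12
```

On this toy input the sinusoidal bands have ZCR well below 0.25 while the noise bands sit near 0.5, so the sketch returns α = 12.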
Further, in step (1) of the dynamic multi-band nonlinear speech feature extraction method provided by the invention, the Bark-frequency cepstral coefficient feature parameters are extracted by the following steps:
step 1) construct the auditory-perception wavelet function on the perception domain, with Δb = (b_2 − b_1)/(K − 1), where K is a scale parameter, [b_1, b_2] is the auditory-perception frequency bandwidth, and b denotes the auditory-perception frequency;
step 2) introduce the functional relation between linear frequency f and auditory-perception frequency b:
b = 6.7·asinh[(f − 20)/600], where asinh denotes the inverse hyperbolic sine function;
step 3) substitute the relation of step 2) into the perception-domain function of step 1) to obtain the expression of the auditory-perception wavelet function at linear frequency;
step 4) after computing the speech energy, pass it through the Bark filter bank BW_m(k), 1 ≤ m ≤ 24, and extract the Bark-frequency cepstral parameters by a discrete cosine transform of the log energies.
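Step 4) — log band energies followed by a discrete cosine transform — can be sketched as follows. This is a minimal illustration that assumes precomputed Bark-band energies; it does not implement the patent's auditory-perception wavelet filter bank.

```python
import numpy as np

def bfcc(band_energies, num_coeffs=12):
    """Bark-frequency cepstral coefficients: DCT-II of the log Bark-band
    energies.  A minimal sketch; the filter-bank stage is assumed done
    and the number of coefficients (12) is an illustrative choice."""
    log_e = np.log(np.asarray(band_energies, dtype=float) + 1e-12)
    M = len(log_e)
    n = np.arange(M)
    # DCT-II basis written out explicitly to keep the sketch dependency-free
    coeffs = np.array([
        np.sum(log_e * np.cos(np.pi * k * (2 * n + 1) / (2 * M)))
        for k in range(num_coeffs)
    ])
    return coeffs

energies = np.linspace(1.0, 24.0, 24)  # toy energies for the 24 Bark bands
c = bfcc(energies)
print(c.shape)  # → (12,)
```

The zeroth coefficient is simply the sum of the log energies, which gives a quick sanity check on the transform.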
Further, in step (2) of the dynamic multi-band nonlinear speech feature extraction method provided by the present invention, the largest Lyapunov exponent is extracted with the Wolf algorithm, specifically comprising the following steps:
step 1) for a discrete time series x_1, x_2, x_3, …, x_N, determine the reconstruction dimension m with the G-P algorithm and the delay interval τ with the average mutual information method, and reconstruct the phase space X(t) = (x_t, x_{t−τ}, …, x_{t−(m−1)τ}); the number of phase points is N′ = N − (m−1)τ, where N is the total number of points in the series;
step 2) among the N′ phase points, take the initial phase point x_0 as the base point, select the point x_1 nearest to x_0 as the end point to form an initial vector, and record the Euclidean distance between base point and end point as L(t_0);
step 3) set a time step (evolution time) t, evolve the initial vector along the trajectory to obtain a new vector, record the Euclidean distance between the corresponding points as L(t_1), and record the exponential growth rate over the corresponding interval as λ_i = ln[L(t_1)/L(t_0)]/t;
step 4) iterate until all phase points have been traversed, and take the mean of the exponential growth rates as the estimate of the largest Lyapunov exponent (LLE).
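A simplified version of the Wolf-style estimate described in steps 1)–4) might look like this. It omits the replacement-vector bookkeeping of the full Wolf method, and the embedding parameters are fixed by hand rather than chosen by the G-P and mutual-information procedures:

```python
import numpy as np

def largest_lyapunov_wolf(x, m=2, tau=1, t_evolve=1):
    """Wolf-style LLE estimate: delay-embed the series, track how each
    phase point diverges from its nearest neighbour over t_evolve steps,
    and average the log growth rates (simplified sketch)."""
    N = len(x)
    n_points = N - (m - 1) * tau
    # delay embedding: row t is (x_t, x_{t-tau}, ..., x_{t-(m-1)tau})
    X = np.array([[x[t - j * tau] for j in range(m)]
                  for t in range((m - 1) * tau, N)])
    rates = []
    for i in range(n_points - t_evolve):
        d = np.linalg.norm(X - X[i], axis=1)
        d[max(0, i - 1):i + 2] = np.inf   # exclude self and immediate neighbours
        j = int(np.argmin(d))
        if j + t_evolve >= n_points:
            continue
        L0 = np.linalg.norm(X[j] - X[i])
        L1 = np.linalg.norm(X[j + t_evolve] - X[i + t_evolve])
        if L0 > 0 and L1 > 0:
            rates.append(np.log(L1 / L0) / t_evolve)
    return float(np.mean(rates))

# sanity check on the logistic map at r = 4, whose known LLE is ln 2 ≈ 0.693
x = np.empty(2000); x[0] = 0.3
for k in range(1999):
    x[k + 1] = 4 * x[k] * (1 - x[k])
lle = largest_lyapunov_wolf(x, m=2, tau=1, t_evolve=1)
print(lle)
```

A positive estimate near ln 2 on the logistic map is the expected behaviour; speech bands would be embedded the same way with the patent's m and τ.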
Further, in step (2) of the dynamic multi-band nonlinear speech feature extraction method provided by the present invention, the correlation dimension parameters are extracted by the following steps:
step 1) reconstruct the phase space: for a given one-dimensional time series x_1, x_2, x_3, …, x_N, select an appropriate embedding dimension m_0 and delay τ, and construct an m_0-dimensional phase space of delay vectors;
step 2) compute the correlation integral C(r), in which ‖x_i − x_j‖ denotes the Euclidean distance between state vectors x_i and x_j and θ(u) is the step function (θ(u) = 1 for u ≥ 0, θ(u) = 0 for u < 0); C(r) is the fraction of point pairs on the phase-space attractor whose distance is less than r, and reflects the degree of convergence and divergence of the phase points;
step 3) estimate the correlation dimension D: as the series length N → ∞ and the distance r → 0, if the correlation integral C(r) obeys the power law C(r) ∝ r^D, the attractor has fractal character, and D and C(r) approximately satisfy the log-linear relation D(m_0) = ln C(r)/ln r, so the estimate corresponding to m_0 can be obtained by fitting;
step 4) estimate the embedding dimension: increase m_0, substitute into steps 2) and 3), and repeat until D(m_0) converges to a saturation value that no longer changes as m_0 increases; this saturated m_0 is the final embedding dimension.
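The correlation-integral and slope-fitting steps can be sketched as follows. The test data (points on a line segment, whose correlation dimension is 1) and the radius range are illustrative choices, not from the patent:

```python
import numpy as np

def correlation_integral(X, r):
    """C(r): fraction of distinct point pairs in phase space closer than r."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    pairs = d[np.triu_indices(n, k=1)]
    return float(np.mean(pairs < r))

def correlation_dimension(X, radii):
    """Grassberger-Procaccia estimate: slope of ln C(r) versus ln r."""
    C = np.array([correlation_integral(X, r) for r in radii])
    mask = C > 0
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(C[mask]), 1)
    return float(slope)

# points uniform on a line segment embedded in 3-D: correlation dimension ~ 1
rng = np.random.default_rng(1)
t = rng.uniform(0, 1, 400)
X = np.stack([t, 2 * t, -t], axis=1)
seg_len = np.linalg.norm([1.0, 2.0, -1.0])
radii = np.logspace(-2, -0.5, 8) * seg_len
D = correlation_dimension(X, radii)
print(D)
```

Saturation of this slope as the embedding dimension grows is what step 4) uses to pick the final m_0.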
Compared with the prior art, the technical scheme adopted by the invention has the following beneficial effects:
the invention adopts the Barker filter bank to divide, obtains the frequency division factor in a self-adaptive way, processes the characteristics of the voice signal by frequency bands, and leads the processed signal to be more in line with the auditory characteristics of human beings, thereby being capable of extracting the voice characteristic parameters with more excellent performance.
Drawings
Fig. 1 is a flow chart of the extraction of the frequency-division factor.
Fig. 2 is a flow chart of dynamic multiband nonlinear speech feature parameter extraction.
Detailed Description
The technical scheme of the invention is explained in further detail below with reference to the drawings:
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring to fig. 1, the present invention provides a method for extracting dynamic multi-band nonlinear speech feature parameters. A speech sample is filtered and segmented by a Bark filter bank conforming to the auditory characteristics of the human ear, and the frequency-division factor α is obtained adaptively. According to the energy distribution of the speech, bands 0 to α are described by Bark-frequency cepstral feature parameters; bands α+1 to 24 are described by the largest Lyapunov exponent and correlation dimension of nonlinear dynamics. In this embodiment, a speech corpus is used as the experimental object; the specific method is as follows:
A. Extraction of the Bark-frequency cepstral parameters, comprising the following steps:
step 1) construct the auditory-perception wavelet function on the perception domain, with Δb = (b_2 − b_1)/(K − 1), where K is a scale parameter and [b_1, b_2] is the auditory-perception frequency bandwidth;
step 2) introduce the functional relation between linear frequency and auditory-perception frequency given by Telien Miller:
b = 6.7·asinh[(f − 20)/600];
step 3) substitute this relation to obtain the expression of the auditory-perception wavelet function at linear frequency;
step 4) compute the speech energy and pass it through the Bark filter bank BW_m(k), 1 ≤ m ≤ 24, then extract the Bark-frequency cepstral parameters by a discrete cosine transform of the log energies;
B. Extraction of the largest Lyapunov exponent, using the classical Wolf algorithm, comprising the following steps:
step 1) for a discrete time series x_1, x_2, x_3, …, x_N, determine the reconstruction dimension m with the G-P algorithm and the delay interval τ with the average mutual information method, and reconstruct the phase space X(t) = (x_t, x_{t−τ}, …, x_{t−(m−1)τ}); the number of phase points is N′ = N − (m−1)τ;
step 2) among the N′ phase points, take the initial phase point x_0 as the base point, select the point x_1 nearest to x_0 as the end point to form an initial vector, and record the Euclidean distance between base point and end point as L(t_0);
step 3) set a time step (evolution time) t, evolve the initial vector along the trajectory to obtain a new vector, record the Euclidean distance between the corresponding points as L(t_1), and record the exponential growth rate over the corresponding interval as λ_i = ln[L(t_1)/L(t_0)]/t;
step 4) iterate until all phase points have been traversed, and take the mean of the exponential growth rates as the estimate of the LLE;
C. Extraction of the correlation dimension parameters, comprising the following steps:
step 1) reconstruct the phase space: for a given one-dimensional time series x_1, x_2, x_3, …, x_N, select an appropriate embedding dimension m_0 and delay τ, and construct an m_0-dimensional phase space of delay vectors;
step 2) compute the correlation integral C(r), in which ‖x_i − x_j‖ denotes the Euclidean distance between state vectors x_i and x_j and θ(u) is the step function (θ(u) = 1 for u ≥ 0, θ(u) = 0 for u < 0); C(r) is the fraction of point pairs on the phase-space attractor whose distance is less than r, and reflects the degree of convergence and divergence of the phase points;
step 3) estimate the correlation dimension D: as the series length N → ∞ and the distance r → 0, if the correlation integral C(r) obeys the power law C(r) ∝ r^D, the attractor has fractal character, and D and C(r) approximately satisfy the log-linear relation D(m_0) = ln C(r)/ln r, so the estimate corresponding to m_0 can be obtained by fitting;
step 4) estimate the embedding dimension: increase m_0, substitute into steps 2) and 3), and repeat until D(m_0) converges to a saturation value that no longer changes as m_0 increases; this saturated m_0 is the final embedding dimension.
D. Unified characterization, comprising the following steps:
1. Obtain the frequency-division factor α from the 24 divided band signals. In bands 0 to α, extract the BFCC features of the signals, compute the mean of each order of parameters, and arrange them.
2. In bands α+1 to 24, extract the largest Lyapunov exponent and correlation dimension feature parameters of the signals.
3. The arrangement is shown schematically as follows:
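Since the schematic is given as a figure, the arrangement can also be sketched in code: concatenate the per-band BFCC means with the Lyapunov-exponent and correlation-dimension values into one vector. The min-max normalisation below is an assumption; the abstract only says the features are normalised.

```python
import numpy as np

def assemble_feature_vector(bfcc_means, lles, corr_dims):
    """Concatenate the per-order BFCC means (bands 0..alpha) with the
    largest-Lyapunov-exponent and correlation-dimension values (bands
    alpha+1..24), then min-max normalise.  The normalisation choice is
    an assumption, not specified by the patent."""
    v = np.concatenate([np.asarray(bfcc_means, float),
                        np.asarray(lles, float),
                        np.asarray(corr_dims, float)])
    vmin, vmax = v.min(), v.max()
    return (v - vmin) / (vmax - vmin) if vmax > vmin else np.zeros_like(v)

# toy values: 12 BFCC means plus 6 LLE and 6 correlation-dimension entries
feat = assemble_feature_vector(bfcc_means=np.arange(12.0),
                               lles=[0.4] * 6, corr_dims=[2.1] * 6)
print(feat.shape)  # → (24,)
```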
Further, the dynamic multi-band nonlinear features are fed to Bayesian network, k-nearest-neighbor, multilayer neural network, and support vector machine classifiers, and performance is tested with 10-fold cross-validation.
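The 10-fold protocol can be sketched with a plain nearest-neighbour stand-in for the classifiers named above (the description names Bayesian network, kNN, neural network, and SVM classifiers; the classifier and toy data below are illustrative substitutes):

```python
import numpy as np

def one_nn_predict(train_X, train_y, test_X):
    """1-nearest-neighbour classifier (stand-in for the patent's classifiers)."""
    d = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=-1)
    return train_y[np.argmin(d, axis=1)]

def k_fold_accuracy(X, y, k=10, seed=0):
    """Plain k-fold cross-validation accuracy, matching the 10-fold protocol."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    accs = []
    for f in folds:
        train = np.setdiff1d(idx, f)          # everything outside this fold
        pred = one_nn_predict(X[train], y[train], X[f])
        accs.append(np.mean(pred == y[f]))
    return float(np.mean(accs))

# two well-separated Gaussian classes as toy 4-D feature vectors
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (50, 4)), rng.normal(3, 0.3, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
acc = k_fold_accuracy(X, y, k=10)
print(acc)
```

Well-separated classes should score near-perfect accuracy, confirming the evaluation harness itself before real speech features are plugged in.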
The results of the experiment are shown in the following table:
the foregoing is only a partial embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.
Claims (4)
1. A dynamic multi-band nonlinear speech feature extraction method, characterized in that:
the speech sample is filtered and divided into frequency bands by a Bark filter bank based on the auditory characteristics of the human ear, and an adaptive frequency-division factor α is obtained from the 24 divided band signals; then the following steps are carried out:
(1) in bands 0 to α, after taking the logarithm of the speech spectrum, Bark-frequency cepstral coefficient (BFCC) features are extracted by discrete cosine transform, and the mean of each order of parameters is computed and arranged;
(2) in bands α+1 to 24, the signals are embedded in phase space, the largest Lyapunov exponent and correlation dimension features are extracted, and the mean of each order of parameters is computed and arranged;
(3) the BFCC features, the largest Lyapunov exponent, and the correlation dimension are combined into a dynamic multi-band nonlinear feature parameter.
2. The dynamic multi-band nonlinear speech feature extraction method according to claim 1, wherein in step (1) the Bark-frequency cepstral coefficient feature parameters are extracted by the following steps:
step 1) construct the auditory-perception wavelet function on the perception domain, with Δb = (b_2 − b_1)/(K − 1), where K is a scale parameter, [b_1, b_2] is the auditory-perception frequency bandwidth, and b denotes the auditory-perception frequency;
step 2) introduce the functional relation between linear frequency f and auditory-perception frequency b:
b = 6.7·asinh[(f − 20)/600], where asinh denotes the inverse hyperbolic sine function;
step 3) substitute the relation of step 2) into the perception-domain function of step 1) to obtain the expression of the auditory-perception wavelet function at linear frequency;
step 4) after computing the speech energy, pass it through the Bark filter bank BW_m(k), 1 ≤ m ≤ 24, and extract the Bark-frequency cepstral parameters by a discrete cosine transform of the log energies.
3. The dynamic multi-band nonlinear speech feature extraction method according to claim 1, wherein in step (2) the largest Lyapunov exponent is extracted with the Wolf algorithm, specifically comprising the following steps:
step 1) for a discrete time series x_1, x_2, x_3, …, x_N, determine the reconstruction dimension m with the G-P algorithm and the delay interval τ with the average mutual information method, and reconstruct the phase space X(t) = (x_t, x_{t−τ}, …, x_{t−(m−1)τ}); the number of phase points is N′ = N − (m−1)τ, where N is the total number of points in the series;
step 2) among the N′ phase points, take the initial phase point x_0 as the base point, select the point x_1 nearest to x_0 as the end point to form an initial vector, and record the Euclidean distance between base point and end point as L(t_0);
step 3) set a time step (evolution time) t, evolve the initial vector along the trajectory to obtain a new vector, record the Euclidean distance between the corresponding points as L(t_1), and record the exponential growth rate over the corresponding interval as λ_i = ln[L(t_1)/L(t_0)]/t.
4. The dynamic multi-band nonlinear speech feature extraction method according to claim 1, wherein in step (2) the correlation dimension feature parameters are extracted by the following steps:
step 1) reconstruct the phase space: for a given one-dimensional time series x_1, x_2, x_3, …, x_N, select an appropriate embedding dimension m_0 and delay τ, and construct an m_0-dimensional phase space of delay vectors;
step 2) compute the correlation integral C(r), in which ‖x_i − x_j‖ denotes the Euclidean distance between state vectors x_i and x_j and θ(u) is the step function (θ(u) = 1 for u ≥ 0, θ(u) = 0 for u < 0); C(r) is the fraction of point pairs on the phase-space attractor whose distance is less than r, and reflects the degree of convergence and divergence of the phase points;
step 3) estimate the correlation dimension D: as the series length N → ∞ and the distance r → 0, if the correlation integral C(r) obeys the power law C(r) ∝ r^D, the attractor has fractal character, and D and C(r) approximately satisfy the log-linear relation D(m_0) = ln C(r)/ln r, so the estimate corresponding to m_0 can be obtained by fitting;
step 4) estimate the embedding dimension: increase m_0, substitute into steps 2) and 3), and repeat until D(m_0) converges to a saturation value that no longer changes as m_0 increases; this saturated m_0 is the final embedding dimension.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011198847.7A CN112562642A (en) | 2020-10-31 | 2020-10-31 | Dynamic multi-band nonlinear speech feature extraction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112562642A true CN112562642A (en) | 2021-03-26 |
Family
ID=75041322
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020059029A1 (en) * | 1999-01-11 | 2002-05-16 | Doran Todder | Method for the diagnosis of thought states by analysis of interword silences |
CN102646415A (en) * | 2012-04-10 | 2012-08-22 | 苏州大学 | Method for extracting characteristic parameters in speech recognition |
CN109065073A (en) * | 2018-08-16 | 2018-12-21 | 太原理工大学 | Speech-emotion recognition method based on depth S VM network model |
Non-Patent Citations (4)
Title |
---|
Hou Limin et al., "Improving Speaker Recognition Performance Using Nonlinear Speech Features", Pattern Recognition and Artificial Intelligence, vol. 19, no. 6, pages 776-781 *
Zhou Qiang et al., "Vocal Cord Pathology Recognition by Multi-band Nonlinear Analysis of Voice", Acta Acustica, vol. 39, no. 1, pages 111-118 *
Zhang Xiaojun et al., "Research on Pathological Voice Recognition Based on Nonlinear Methods", Information Security and Communications Privacy, no. 3, pages 113-115 *
Zhang Xiaojun et al., "Research on Speech Feature Parameters Using Optimized Multi-Feature Combination", Communications Technology, vol. 45, no. 12, pages 98-100 *
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |