CN108766430A - A speech feature mapping method and system based on Bhattacharyya distance - Google Patents
A speech feature mapping method and system based on Bhattacharyya distance
- Publication number
- CN108766430A (publication) · CN201810572146.1A (application)
- Authority
- CN
- China
- Prior art keywords
- mapping
- characteristic
- distance
- gmm
- clean
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
The invention belongs to the field of speech recognition/speaker recognition technology and discloses a speech feature mapping method and system based on the Bhattacharyya distance. First, features are extracted from a speech signal recorded in a complex environment and from a clean speech signal, respectively. Next, the mapping features are initialized using the complex features and a feature mapping formula, and GMM models of the mapping features and of the clean features are established respectively. Then the minimum Bhattacharyya distance between the two GMM models is estimated iteratively with the EM algorithm, yielding the final mapping features. Finally, pattern matching and recognition are performed between the mapping features and the trained clean-environment speech signal model. By minimizing the Bhattacharyya distance between the complex-feature GMM and the clean-feature GMM, the invention maps the complex features to obtain mapping features, which are then matched and recognized against the clean model; replacing the complex features with the mapping features effectively improves the accuracy of speech recognition.
Description
Technical Field
The invention belongs to the technical field of speech recognition/speaker recognition, and particularly relates to a speech feature mapping method and system based on the Bhattacharyya distance.
Background
Currently, the state of the art commonly used in the industry is as follows:
with the development of computer technology toward human-computer interaction, voice interaction is increasingly applied in real scenes. Voice interaction is interaction between a human and a machine by means of speech; this mode of interaction is closer to interaction between humans and better matches human interaction habits. It lets a user interact with a machine more comfortably, so that the interaction proceeds in a simpler, quicker, and more efficient manner, makes the human-computer interaction process more humane, and highlights the leading role of the person throughout the interaction. Voice interaction also frees people's hands: both hands can still carry out other operations while interacting with the machine, which offers great convenience in aspects such as immersion and safety.
Speech recognition is the entrance that starts the voice interaction process and an important component of voice interaction technology, and its result directly influences the performance of the whole interaction. In speech interaction, the machine must first be able to "understand" human language; only then can the interaction proceed. Speech recognition is therefore the basis and precondition of voice interaction. It means converting the content of a person's speech into the corresponding words through a series of processing steps, i.e. converting speech into text. In effect, it gives the machine the function of human ears, so that the machine can simulate human hearing.
Current speech recognition achieves good results in fairly ideal environments. However, once voice interaction is disturbed by the external environment, the recognition rate drops sharply, which seriously affects the interaction. Worse, the real environment we live in is a complex one, surrounded by all kinds of noise: both natural noise (wind, rain, thunder, running water, etc.) and artificial noise (surrounding voices, machine sounds, etc.) can degrade voice interaction and thus the user experience. Effectively removing noise from a noisy speech signal and building a recognition system with better noise robustness is therefore an urgent problem.
The prior art generally starts from three aspects: speech enhancement techniques applied at the signal stage, feature mapping techniques applied at the feature stage, and model compensation techniques applied at the model stage. Feature mapping is the most commonly used for improving recognition accuracy in complex environments, because features are the most suitable means of representing speech signals, and suitable features can achieve good results. The present invention, a feature mapping method based on the Bhattacharyya distance, belongs to this family. Feature mapping estimates the parameters of a mapping function, thereby determining the function used to map the features. In the prior art, parameter estimation mostly approximates the distribution of the reference features with the true distribution of the observed features. The invention instead adds prior information by describing the distribution of the complex features with a GMM; this GMM is not fixed but is continuously adjusted by minimizing the Bhattacharyya distance between it and the reference-feature GMM, so that the real complex feature distribution is mapped toward the feature distribution in the ideal state, achieving the goal of feature mapping.
In summary, the problems of the prior art are as follows:
(1) The prior art uses only the true distribution of the features and begins the mapping without any assumption about, or operation on, that distribution, so the direction of the mapping during the mapping process is difficult to guarantee under such methods.
(2) The prior art gives little consideration to how the complex feature distribution and the reference feature distribution are measured against each other. The Bhattacharyya distance is a good choice for the distance between two distributions, yet the prior art does not exploit it.
(3) The prior art does not add prior information when processing complex features.
The difficulty and significance for solving the technical problems are as follows:
the purpose of feature mapping is to normalize the true distribution of complex features so that the complex features can reach an ideal state, i.e., the mapped feature distribution can conform to the distribution of clean features. This is also a difficult point for feature mapping.
If a theoretical distribution can be reasonably assumed before mapping, this amounts to adding prior information. By comparing the assumed theoretical distribution with the clean feature distribution and continuously optimizing it by minimizing the distance between the two, the assumed distribution becomes the true ideal distribution when that distance is minimized. Solving this problem is important for feature mapping.
For measuring the distance between two distributions, the Bhattacharyya distance is a good choice: it is a statistic specifically designed to assess the distance between two distributions, and using it to optimize the distance between the two distributions above simplifies the calculation.
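For concreteness, the standard definition of the Bhattacharyya distance, together with its closed form for two Gaussian distributions, is as follows (these are textbook formulas given for reference; the patent's own GMM-level expression may differ in detail):

```latex
% General Bhattacharyya distance between distributions p and q
D_B(p, q) = -\ln \int \sqrt{p(x)\, q(x)}\, dx

% Closed form for two Gaussians N(\mu_1, \Sigma_1), N(\mu_2, \Sigma_2),
% with the averaged covariance \Sigma = \tfrac{1}{2}(\Sigma_1 + \Sigma_2)
D_B = \tfrac{1}{8}\,(\mu_1 - \mu_2)^{\top} \Sigma^{-1} (\mu_1 - \mu_2)
      + \tfrac{1}{2}\,\ln \frac{\det \Sigma}{\sqrt{\det \Sigma_1 \, \det \Sigma_2}}
```

The first term grows with the separation of the means, the second with the mismatch of the covariances, which is why minimizing this distance pulls one distribution toward the other.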
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a speech feature mapping method and system based on the Bhattacharyya distance. The feature mapping of the present invention is a solution that starts from the speech features: through the feature mapping function, speech signal features obtained in a complex environment are converted into features that can be regarded approximately as features obtained in a clean environment. The method greatly reduces the influence of noise on the speech signal, improves the accuracy of speech recognition in practical applications, and enhances the robustness of the speech recognition system.
The invention is realized in such a way that a speech feature mapping method based on Bhattacharyya distance comprises the following steps:
firstly, respectively extracting the features of a speech signal in a complex environment and of a clean speech signal; secondly, initializing the mapping features by using the complex features and the feature mapping formula, and respectively establishing GMM models of the mapping features and the clean features; then, iteratively estimating the minimum Bhattacharyya distance between the two GMM models by using the EM (Expectation-Maximization) algorithm, and obtaining the final mapping features; and finally, performing pattern matching and recognition between the mapping features and the trained voice signal model in the clean environment.
The method specifically comprises the following steps:
step 1: extracting voice features in a clean environment;
step 2: extracting the characteristics of the voice in the complex environment;
step 3: obtaining an initialized mapping characteristic from the complex characteristic through a characteristic mapping formula, and respectively establishing a GMM model of the initialized mapping characteristic and a GMM model of the clean environment voice characteristic;
step 4: introducing a Bhattacharyya distance, and obtaining a mapping characteristic by minimizing the Bhattacharyya distance between two GMMs;
step 5: carrying out pattern matching and recognition on the mapping characteristics and the trained voice signal model in the clean environment.
Further, the specific implementation of step 1 comprises the following substeps:
step 1.1: preprocessing a voice signal obtained in a clean environment, wherein the preprocessing comprises pre-emphasis, framing and windowing;
step 1.2: extracting Mel cepstral coefficient features MFCC from the signal preprocessed in step 1.1.
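The preprocessing of step 1.1 can be sketched in a few lines of NumPy. The frame length, hop size, and pre-emphasis coefficient below are typical values assumed for illustration; they are not specified in the patent:

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing and Hamming windowing, as in step 1.1."""
    # Pre-emphasis: s'[n] = s[n] - alpha * s[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split the signal into overlapping frames
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: apply a Hamming window to every frame
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(16000))  # 1 s of audio at 16 kHz
print(frames.shape)  # (98, 400)
```

MFCC extraction (step 1.2) then applies an FFT, Mel filterbank, log, and DCT to each windowed frame; standard toolkits such as HTK perform this step in the patent's experiments.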
Further, the specific implementation of step 2 comprises the following substeps:
step 2.1: preprocessing a voice signal obtained in a complex environment, wherein the preprocessing comprises pre-emphasis, framing and windowing;
step 2.2: extracting Mel cepstral coefficient features from the signal preprocessed in step 2.1, recorded as X = [x_1, x_2, ..., x_t, ..., x_n], x_t ∈ X.
Further, the specific implementation of step 3 comprises the following substeps:
step 3.1: the feature mapping formula is:

y_t = W^T · x̃_t = Σ_{l=-L}^{L} A_l · x_{t+l} + b

where x_t denotes the t-th frame of the input features, y_t the t-th frame of the output features, A the gain matrices, b the offset term, and W the parameters of the mapping function. A is a matrix sequence formed by 2L+1 matrices, L is a non-negative integer, and the dimension of each matrix is the same as that of one frame of the input features; the input x̃_t is formed from 2L+1 frames of features. W is thus composed of the 2L+1 matrices and a column vector, and x̃_t is a column vector consisting of the 2L+1 frames followed by a 1;
step 3.2: let L = 1; in W, A_0 is an identity matrix and A_{-1}, A_1 are zero matrices; the initialized mapping features y_t are constructed in turn, so that the initialized mapping features are essentially the complex features;
step 3.3: establishing GMM models of the initialized mapping features and of the clean features.
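A minimal NumPy sketch of the mapping formula of step 3.1 and the initialization of step 3.2 follows. The feature dimension and the zero-padding of the edge frames are assumptions made for illustration:

```python
import numpy as np

def map_features(X, A_list, b, L=1):
    """Affine context mapping y_t = sum_l A_l x_{t+l} + b (step 3.1)."""
    n, d = X.shape
    # Zero-pad L frames on each side so edge frames also have full context
    Xpad = np.vstack([np.zeros((L, d)), X, np.zeros((L, d))])
    Y = np.empty_like(X)
    for t in range(n):
        Y[t] = sum(A_list[l + L] @ Xpad[t + l + L]
                   for l in range(-L, L + 1)) + b
    return Y

d = 13                                    # e.g. 13-dimensional MFCCs
rng = np.random.default_rng(0)
X = rng.standard_normal((50, d))          # 50 frames of complex features
A = [np.zeros((d, d)), np.eye(d), np.zeros((d, d))]  # A_-1 = 0, A_0 = I, A_1 = 0
b = np.zeros(d)
Y0 = map_features(X, A, b)                # the initialization of step 3.2
print(np.allclose(Y0, X))                 # True: the initial mapping is the identity
```

This confirms the remark in step 3.2 that the initialized mapping features are essentially the complex features themselves; the EM iterations of step 4 then move W away from this identity initialization.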
Further, the specific implementation of step 4 comprises the following substeps:
step 4.1: the Bhattacharyya distance between the two GMMs is expressed according to the Bhattacharyya distance formula

D_B(a, b) = -ln ∫ √(p_a(x) · p_b(x)) dx

where N_i^a and N_i^b denote the i-th Gaussian components of the clean-feature GMM and of the mapping-feature GMM, respectively;
step 4.2: converting the formula in step 4.1 according to the Gaussian formula to construct a loss function F_c, in which ||·||_F is the second-order standard (Frobenius) matrix norm; β and λ are two adjustable parameters controlling the influence of the Frobenius-norm term and of the GMM model distributions; y is the input feature, i.e. the mapping feature newly obtained at each iteration; T is the number of frames of the input feature, and y_t is its t-th frame. The remaining term y_t(i) has an expression in which M is the number of Gaussians in the GMM and ω_i is the weight of the i-th Gaussian;
step 4.3: introducing the EM algorithm to iterate on the Bhattacharyya distance of step 4.2 and compute its minimum, together with the parameter W at which the minimum is attained;
step 4.4: substituting the parameter W into the feature mapping formula of step 3.1 to obtain the mapping feature y.
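Step 4.1 pairs the i-th Gaussian components of the two GMMs. For a single pair of Gaussians the Bhattacharyya distance has the textbook closed form below; this is the standard formula, not necessarily the exact per-component expression used in the patent:

```python
import numpy as np

def bhattacharyya_gaussians(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two Gaussians."""
    cov = 0.5 * (cov1 + cov2)                     # averaged covariance
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)          # mean term
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2                          # covariance term added

mu = np.zeros(2)
I = np.eye(2)
print(bhattacharyya_gaussians(mu, I, mu, I))          # 0.0: identical Gaussians
print(bhattacharyya_gaussians(mu, I, np.ones(2), I))  # 0.25: grows with mean gap
```

Minimizing a sum of such per-component distances over the mapping parameter W is what drives the mapped-feature GMM toward the clean-feature GMM.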
Further, in step 5, the clean features are used for training, usually with an HMM-GMM model, and model matching is performed with the mapping features obtained in step 4 to carry out speech recognition/speaker recognition.
It is a further object of the present invention to provide a computer program implementing said Bhattacharyya distance-based speech feature mapping method.
Another object of the present invention is to provide a speech recognition/speaker recognition system implementing the Bhattacharyya distance-based speech feature mapping method.
It is another object of the present invention to provide a computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the Bhattacharyya distance-based speech feature mapping method.
In summary, the advantages and positive effects of the invention are:
The invention maps the complex features by minimizing the Bhattacharyya distance between the complex-feature GMM and the clean-feature GMM to obtain the mapping features, and performs pattern matching and recognition between the mapping features and the clean model. Replacing the complex features with the mapping features effectively improves the accuracy of speech recognition.
The mapping method provided by the invention improves the recognition accuracy of a speech recognition system by mapping the speech features obtained in a complex environment, and has stronger robustness;
the feature mapping method provided by the invention is based on the Bhattacharyya distance between the complex-feature and clean-feature distributions, and incorporates prior information: it is assumed that the distribution of the complex features conforms to a GMM, and this prior makes the complex feature distribution gradually approach the clean feature distribution. The characteristics of the feature distributions are thus used reasonably, achieving a better effect;
the invention provides a broadly applicable algorithm for improving the accuracy of speech recognition.
Aiming at the low accuracy of speech recognition in complex environments, the invention provides a feature mapping method based on the Bhattacharyya distance. It is compared through experiments with the original data without feature mapping and with two existing techniques; the experimental results show that sentence accuracy and word accuracy are effectively improved and the method performs better.
Table 1: Comparison of experimental results with the original data

| Method | Sentence accuracy | Word accuracy |
| --- | --- | --- |
| Raw data | 50% | 81.08% |
| Parameter-estimation feature mapping | 62% | 85.14% |
| KL-divergence feature mapping | 66% | 85.81% |
| Bhattacharyya-distance feature mapping | 68% | 87.16% |
Drawings
FIG. 1 is a flowchart of the speech feature mapping method based on the Bhattacharyya distance according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the speech feature mapping method based on the Bhattacharyya distance according to an embodiment of the present invention.
Fig. 3 is a flowchart of an experimental implementation provided in an embodiment of the present invention.
Fig. 4 is a flowchart of extracting Mel-frequency cepstral coefficient features MFCC from a signal obtained by preprocessing a speech signal in a clean environment according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The speech feature mapping method based on the Bhattacharyya distance provided by the embodiment of the invention first extracts the features of a speech signal in a complex environment and of a clean speech signal, respectively; second, it initializes the mapping features using the complex features and the feature mapping formula, and establishes GMM models of the mapping features and of the clean features respectively; then it iteratively estimates the minimum Bhattacharyya distance between the two GMM models using the EM (Expectation-Maximization) algorithm and obtains the final mapping features; finally, it performs pattern matching and recognition between the mapping features and the trained clean-environment speech signal model. By minimizing the Bhattacharyya distance between the complex-feature GMM and the clean-feature GMM, the invention maps the complex features so that the obtained mapping features can be regarded approximately as speech signal features from a clean environment, which greatly improves the purity of the speech features in complex environments, improving speech recognition accuracy and enhancing the robustness of the speech recognition system.
Referring to figs. 1-2, the present invention provides a speech feature mapping method based on the Bhattacharyya distance, which includes the following steps:
step 1: extracting the characteristics of the voice in a clean environment;
the specific implementation comprises the following substeps:
step 1.1: preprocessing a voice signal obtained in a clean environment, wherein the preprocessing comprises pre-emphasis, framing and windowing;
step 1.2: Mel cepstral coefficient features MFCC are extracted from the signal preprocessed in step 1.1 (shown in fig. 4).
Step 2: extracting the characteristics of the voice in the complex environment;
the specific implementation comprises the following substeps:
step 2.1: preprocessing a voice signal obtained in a complex environment, wherein the preprocessing comprises pre-emphasis, framing and windowing;
step 2.2: extracting Mel cepstral coefficient features from the signal preprocessed in step 2.1, recorded as X = [x_1, x_2, ..., x_t, ..., x_n], x_t ∈ X.
step 3: obtaining an initialized mapping feature from the complex features through the feature mapping formula, and respectively establishing a GMM model of the initialized mapping features and a GMM model of the clean-environment speech features;
the specific implementation comprises the following substeps:
step 3.1: determining the feature mapping formula as:

y_t = W^T · x̃_t = Σ_{l=-L}^{L} A_l · x_{t+l} + b

where x_t denotes the t-th frame of the input features, y_t the t-th frame of the output features, A the gain matrices, b the offset term, and W the parameters of the mapping function. A is a matrix sequence formed by 2L+1 matrices, L is a non-negative integer, and the dimension of each matrix is the same as that of one frame of the input features; the input x̃_t is formed from 2L+1 frames of features. W is thus composed of the 2L+1 matrices and a column vector, and x̃_t is a column vector consisting of the 2L+1 frames followed by a 1;
step 3.2: let L = 1; in W, A_0 is an identity matrix and A_{-1}, A_1 are zero matrices; the initialized mapping features y_t are constructed in turn;
step 3.3: establishing GMM models of the initialized mapping features and of the clean features.
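Step 3.3 fits GMMs to the two feature sets. As a self-contained illustration of the E/M alternation involved, here is a minimal one-dimensional EM fit; the patent presumably uses full multivariate GMMs via standard toolkits, and all data and initialization choices below are assumptions for the example:

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=100):
    """Minimal EM for a one-dimensional k-component GMM."""
    w = np.full(k, 1.0 / k)                        # component weights
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread initial means over the data
    var = np.full(k, np.var(x))                    # start from the global variance
    for _ in range(iters):
        # E-step: responsibility of component i for sample t
        dens = (w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = gamma.sum(axis=0)
        w = nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])
w, mu, var = fit_gmm_1d(x)
print(np.sort(mu))  # the two component means are recovered near -3 and 3
```

In the patent's pipeline one such GMM describes the clean features (fixed after fitting) and another describes the mapped features (re-fitted as the mapping parameter W changes).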
step 4: introducing the Bhattacharyya distance, and obtaining the mapping features by minimizing the Bhattacharyya distance between the two GMMs;
the specific implementation comprises the following substeps:
step 4.1: the Bhattacharyya distance between the two GMMs is expressed according to the Bhattacharyya distance formula

D_B(a, b) = -ln ∫ √(p_a(x) · p_b(x)) dx

where N_i^a and N_i^b denote the i-th Gaussian components of the clean-feature GMM and of the mapping-feature GMM, respectively;
step 4.2: converting the formula in step 4.1 according to the Gaussian formula to construct a loss function F_c, in which ||·||_F is the second-order standard (Frobenius) matrix norm; β and λ are two adjustable parameters controlling the influence of the Frobenius-norm term and of the GMM model distributions; y is the input feature, i.e. the mapping feature newly obtained at each iteration; T is the number of frames of the input feature, and y_t is its t-th frame. The remaining term y_t(i) has an expression in which M is the number of Gaussians in the GMM and ω_i is the weight of the i-th Gaussian;
step 4.3: according to the formula

W* = argmin_W F_c(W)

there exists a parameter W at which the loss function F_c(W) attains its minimum; this W is the parameter to be solved for, and it yields the minimum of F_c(W) from step 4.2;
step 4.4: introducing an auxiliary function Q(W, W̄) by means of the EM algorithm, where W and W̄ denote the newly estimated parameter and the current parameter, respectively; M and T denote the number of Gaussians in the GMM and the number of frames of the input feature y; μ_i^a and μ_i^b denote the means of the i-th components of GMM models a and b; Σ_m^a and Σ_m^b denote the diagonal covariance matrices derived from the variances of the m-th Gaussians of GMM models a and b, respectively; d denotes the dimension, i.e. the row index, of the parameter W, and W^(d) its d-th row; G and p are intermediate parameters of the auxiliary function;
step 4.5: deriving the gradient function of the auxiliary function Q with respect to the d-th row W^(d) of the parameter W;
step 4.6: iterating using the L-BFGS algorithm in combination with the gradient function in step 4.5 to obtain the minimum value of the auxiliary function in step 4.4 and the value of the parameter W at which the minimum value exists;
step 4.7: assigning the W value obtained in step 4.6 to W̄, and repeating steps 4.4, 4.5 and 4.6, iterating W until the difference between the newly obtained W and the W of the previous iteration is smaller than a fixed threshold, or until the number of iterations reaches a fixed limit;
step 4.8: substituting the W value obtained in step 4.7 into the feature mapping formula of step 3.1 to obtain the mapping feature y.
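Steps 4.4-4.7 form an iterate-until-converged loop around a minimizer. The following is a hedged sketch of that outer loop: plain gradient descent stands in for the L-BFGS algorithm of step 4.6, and a toy quadratic stands in for the auxiliary function Q, so only the loop structure and the convergence rule of step 4.7 are being illustrated:

```python
import numpy as np

def minimize_iteratively(grad, W0, lr=0.1, tol=1e-6, max_iter=1000):
    """Iterate W until the change in W falls below a threshold or the
    iteration count reaches a limit (the convergence rule of step 4.7)."""
    W = W0.copy()
    for it in range(max_iter):
        W_new = W - lr * grad(W)      # L-BFGS would also use curvature estimates
        if np.linalg.norm(W_new - W) < tol:
            return W_new, it + 1      # converged: successive iterates barely move
        W = W_new
    return W, max_iter

# Toy stand-in: minimize F(W) = ||W - target||^2, whose gradient is 2(W - target)
target = np.array([1.0, -2.0, 0.5])
W, n_iter = minimize_iteratively(lambda W: 2 * (W - target), np.zeros(3))
print(np.round(W, 4))  # converges to target well before max_iter
```

In the patent the gradient comes from step 4.5's derivative of Q with respect to each row W^(d), and the converged W is substituted back into the mapping formula of step 3.1.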
step 5: performing pattern matching and recognition between the mapping features and the trained clean-environment speech signal model.
The invention is further described below in conjunction with specific examples/experiments/simulation analyses.
The effectiveness of the invention is demonstrated through experiments. Data support is provided by the Aurora2 standard database. The experiments are assisted by MATLAB simulation software, the HTK speech toolkit, and the voicebox toolkit. The software and hardware environment of the experiments is as follows:
hardware environment:
Intel Core i3-4150 CPU, 3.5 GHz, dual core
8 GB RAM
Dell E1916HV
Software environment:
MATLAB 2014a
voicebox
HTK3.5
Aurora2
In the experiment, MFCC (Mel-frequency cepstral coefficient) features are first extracted from the training and test data in Aurora2 with the HTK (Hidden Markov Model Toolkit) speech toolkit, giving the reference features and test features; MATLAB scripts then call the voicebox toolkit to establish a GMM model for the reference features; further scripts establish the test-feature GMM model and obtain the mapped test-feature MFCCs by minimizing the Bhattacharyya distance between the test-feature GMM and the reference-feature GMM; finally the mapped features are recognized with HTK to obtain the recognition rate. The recognition-rate screenshot obtained under HTK and the experimental implementation flowchart are shown in fig. 3.
In the above embodiments, the implementation may be realized wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, it can be realized wholly or partially as a computer program product comprising one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another via a wired connection (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (10)
1. A Bhattacharyya distance-based speech feature mapping method, characterized in that the method comprises the following steps:
firstly, extracting features from a speech signal in a complex environment and from a clean speech signal, respectively; secondly, initializing the mapping features from the complex features through a feature mapping formula, and establishing GMM models of the mapping features and of the clean features, respectively;
then, iteratively minimizing the Bhattacharyya distance between the two GMM models by using the EM (Expectation-Maximization) algorithm to obtain the final mapping features;
and finally, performing pattern matching and recognition between the mapping features and the speech signal model trained in the clean environment.
2. The Bhattacharyya distance-based speech feature mapping method according to claim 1, characterized in that the method specifically comprises:
Step 1: extracting speech features in a clean environment;
Step 2: extracting speech features in a complex environment;
Step 3: obtaining initialized mapping features from the complex features through a feature mapping formula, and establishing a GMM model of the initialized mapping features and a GMM model of the clean-environment speech features, respectively;
Step 4: introducing the Bhattacharyya distance, and obtaining the mapping features by minimizing the Bhattacharyya distance between the two GMMs;
Step 5: performing pattern matching and recognition between the mapping features and the speech signal model trained in the clean environment.
3. The Bhattacharyya distance-based speech feature mapping method according to claim 2, characterized in that step 1 specifically comprises the following steps:
Step 1.1: preprocessing the speech signal obtained in the clean environment, the preprocessing comprising pre-emphasis, framing, and windowing;
Step 1.2: extracting Mel-frequency cepstral coefficient (MFCC) features from the signal preprocessed in step 1.1.
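The preprocessing in steps 1.1–1.2 can be sketched in plain NumPy as follows; the frame length (256), frame shift (128), and pre-emphasis coefficient (0.97) are common defaults assumed for illustration, not values specified by the patent.

```python
# Sketch of pre-emphasis, framing, and windowing (steps 1.1-1.2).
# Parameter values are assumed defaults, not from the patent.
import numpy as np

def preprocess(signal, frame_len=256, frame_shift=128, pre_emph=0.97):
    # Pre-emphasis: s'[n] = s[n] - a * s[n-1], boosting high frequencies
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # Framing: slice the signal into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    frames = emphasized[idx]
    # Windowing: apply a Hamming window to each frame
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(4000))
print(frames.shape)  # (30, 256)
```

MFCC extraction proper (filterbank, log, DCT) would follow on these windowed frames.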
4. The Bhattacharyya distance-based speech feature mapping method according to claim 2, characterized in that step 2 specifically comprises the following steps:
Step 2.1: preprocessing the speech signal obtained in the complex environment, the preprocessing comprising pre-emphasis, framing, and windowing;
Step 2.2: extracting Mel-frequency cepstral coefficient (MFCC) features X = [x1, x2, ..., xt, ..., xn], xt ∈ X, from the signal preprocessed in step 2.1.
5. The Bhattacharyya distance-based speech feature mapping method according to claim 2, characterized in that step 3 specifically comprises the following steps:
Step 3.1: obtaining the initialized mapping features from the complex features through a feature mapping formula, the mapping formula being:

yt = W·x̂t = A−L·xt−L + ... + A0·xt + ... + AL·xt+L + b,

wherein xt denotes the tth frame of the input features and yt the tth frame of the output features, A denotes the gain matrices, b denotes the offset vector, and W denotes the parameter of the mapping function; A is a sequence of 2L+1 matrices A−L, ..., AL, L being a nonnegative integer, each matrix having the same dimension as one frame of the input features; the input features form a group of vectors from 2L+1 frames; W = [A−L, ..., AL, b] is composed of the 2L+1 matrices and the column vector b, and x̂t is the column vector formed by the 2L+1 frames and a constant 1;
Step 3.2: letting L = 1, with A0 in W being the identity matrix and A−1, A1 being zero matrices, constructing the initialized mapping features yt in sequence, so that the initialized mapping features equal the complex features;
Step 3.3: processing the speech signal and establishing GMM models of the initialized mapping MFCC features and of the clean features.
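The identity initialization of steps 3.1–3.2 can be sketched as follows: with L = 1, A0 = I and A−1 = A1 = 0 (and b = 0), the initial mapped features equal the complex features. Function names and the edge-padding choice at the utterance boundaries are illustrative assumptions, not details from the patent.

```python
# Sketch of the mapping y_t = W * [x_{t-L}; ...; x_{t+L}; 1] with identity
# initialization (L=1, A0=I, A-1=A1=0, b=0). Names are illustrative.
import numpy as np

def init_W(dim, L=1):
    # W: dim x (dim*(2L+1) + 1); the A0 block is identity, the rest zero
    W = np.zeros((dim, dim * (2 * L + 1) + 1))
    W[:, L * dim:(L + 1) * dim] = np.eye(dim)
    return W

def apply_mapping(W, X, L=1):
    # Stack each frame with its L neighbors on each side plus a constant 1,
    # repeating edge frames at the boundaries, then apply W.
    T, dim = X.shape
    padded = np.pad(X, ((L, L), (0, 0)), mode="edge")
    Y = np.empty_like(X)
    for t in range(T):
        context = padded[t:t + 2 * L + 1].reshape(-1)
        Y[t] = W @ np.append(context, 1.0)
    return Y

X = np.random.randn(50, 13)
Y = apply_mapping(init_W(13), X)
print(np.allclose(X, Y))  # True: identity initialization leaves X unchanged
```

Subsequent EM iterations would update W away from this identity starting point.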
6. The Bhattacharyya distance-based speech feature mapping method according to claim 2, characterized in that step 4 specifically comprises the following steps:
Step 4.1: expressing the Bhattacharyya distance between the two GMMs according to the Bhattacharyya distance formula as a weighted sum over matched Gaussian components:

DB(GMMclean, GMMmap) = Σ(i=1..M) ωi·DB(Ni^clean, Ni^map),

wherein Ni^clean and Ni^map respectively denote the ith Gaussian components of the clean-feature GMM and the mapping-feature GMM;
step 4.2: converting the formula in the step 4.1 according to a Gaussian formula to construct a loss function Fc:
Wherein,being the second order standard norm of the Flobenius matrix, β and λ are to control the Flobenius normAnd two adjustable parameters of GMM model distribution influence degree, y is input characteristic, namely mapping characteristic newly obtained each time, T is frame number of input characteristic, y istI.e. the t-th frame of the input feature, another parameter yt(i) The expression of (a) is:
m is the number of Gauss in GMM, omegaiA weight of the ith Gaussian;
Step 4.3: introducing the EM (Expectation-Maximization) algorithm to iteratively minimize the Bhattacharyya-distance loss of step 4.2, and solving for the parameter W at the minimum;
Step 4.4: substituting the parameter W into the feature mapping formula of step 3.1 to obtain the mapping features y.
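The distance minimized in step 4 can be illustrated for diagonal-covariance Gaussians. The closed-form per-component Bhattacharyya distance below is standard; the weighted component-wise sum over the two GMMs is one common approximation and stands in for the patent's exact formula, whose image is not reproduced in this text.

```python
# Sketch of the Bhattacharyya distance between matched diagonal-covariance
# Gaussian components, summed with weights over the M components.
import numpy as np

def bhattacharyya_gauss(mu1, var1, mu2, var2):
    # D_B = 1/8 (mu1-mu2)^T Sigma^-1 (mu1-mu2)
    #     + 1/2 ln(|Sigma| / sqrt(|Sigma1||Sigma2|)), Sigma = (Sigma1+Sigma2)/2,
    # specialized here to diagonal covariances (element-wise variances)
    avg = (var1 + var2) / 2.0
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / avg)
    term2 = 0.5 * np.sum(np.log(avg) - 0.5 * (np.log(var1) + np.log(var2)))
    return term1 + term2

def gmm_bhattacharyya(weights, mus1, vars1, mus2, vars2):
    # Weighted sum over matched components (a common approximation)
    return sum(w * bhattacharyya_gauss(m1, v1, m2, v2)
               for w, m1, v1, m2, v2 in zip(weights, mus1, vars1, mus2, vars2))

mu = np.zeros(13); var = np.ones(13)
print(bhattacharyya_gauss(mu, var, mu, var))  # 0.0 for identical Gaussians
```

Minimizing this quantity over W (step 4.3) drives the mapped-feature GMM toward the clean-feature GMM.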
7. The Bhattacharyya distance-based speech feature mapping method according to claim 2, characterized in that step 5 specifically comprises: training on the clean features, and performing model matching between the mapping features obtained in step 4 and the HMM-GMM model to carry out speech recognition/speaker recognition.
8. A computer program for implementing the Bhattacharyya distance-based speech feature mapping method according to any one of claims 1 to 7.
9. A speech recognition/speaker recognition system for implementing the Bhattacharyya distance-based speech feature mapping method according to any one of claims 1 to 7.
10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the Bhattacharyya distance-based speech feature mapping method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810572146.1A CN108766430B (en) | 2018-06-06 | 2018-06-06 | Speech feature mapping method and system based on Bhattacharyya distance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108766430A true CN108766430A (en) | 2018-11-06 |
CN108766430B CN108766430B (en) | 2020-08-04 |
Family
ID=63999822
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810572146.1A Active CN108766430B (en) | 2018-06-06 | 2018-06-06 | Speech feature mapping method and system based on Bhattacharyya distance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108766430B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111816187A (en) * | 2020-07-03 | 2020-10-23 | 中国人民解放军空军预警学院 | Deep neural network-based voice feature mapping method in complex environment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1521729A (en) * | 2003-01-21 | 2004-08-18 | Method of speech recognition using hidden trajectory hidden markov models | |
US7529666B1 (en) * | 2000-10-30 | 2009-05-05 | International Business Machines Corporation | Minimum bayes error feature selection in speech recognition |
CN104900232A (en) * | 2015-04-20 | 2015-09-09 | 东南大学 | Isolation word identification method based on double-layer GMM structure and VTS feature compensation |
CN106782520A (en) * | 2017-03-14 | 2017-05-31 | 华中师范大学 | Phonetic feature mapping method under a kind of complex environment |
US20170263269A1 (en) * | 2016-03-08 | 2017-09-14 | International Business Machines Corporation | Multi-pass speech activity detection strategy to improve automatic speech recognition |
Non-Patent Citations (2)
Title |
---|
DUC HOANG HA NGUYEN: "Feature Adaptation Using Linear Spectro-Temporal Transform for Robust Speech Recognition", 《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》 * |
TAN PING, XING YUJUAN, GAO XIANG: "Research and Analysis of Speaker Model Clustering Algorithms", 《China Building Materials Science & Technology》 * |
Also Published As
Publication number | Publication date |
---|---|
CN108766430B (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7177167B2 (en) | Mixed speech identification method, apparatus and computer program | |
CN106683680B (en) | Speaker recognition method and device, computer equipment and computer readable medium | |
CN107564513B (en) | Voice recognition method and device | |
CN109360572B (en) | Call separation method and device, computer equipment and storage medium | |
WO2019237519A1 (en) | General vector training method, voice clustering method, apparatus, device and medium | |
CN112053695A (en) | Voiceprint recognition method and device, electronic equipment and storage medium | |
TWI475558B (en) | Method and apparatus for utterance verification | |
JP5932869B2 (en) | N-gram language model unsupervised learning method, learning apparatus, and learning program | |
CN108922543B (en) | Model base establishing method, voice recognition method, device, equipment and medium | |
CN107093422B (en) | Voice recognition method and voice recognition system | |
JP7124427B2 (en) | Multi-view vector processing method and apparatus | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
WO2019232826A1 (en) | I-vector extraction method, speaker recognition method and apparatus, device, and medium | |
CN109192200A (en) | A kind of audio recognition method | |
Sivaraman et al. | Personalized speech enhancement through self-supervised data augmentation and purification | |
CN112053694A (en) | Voiceprint recognition method based on CNN and GRU network fusion | |
CN114550703A (en) | Training method and device of voice recognition system, and voice recognition method and device | |
CN111161713A (en) | Voice gender identification method and device and computing equipment | |
CN113178192A (en) | Training method, device and equipment of speech recognition model and storage medium | |
CN114203154A (en) | Training method and device of voice style migration model and voice style migration method and device | |
CN112002307A (en) | Voice recognition method and device | |
CN113421573B (en) | Identity recognition model training method, identity recognition method and device | |
KR100897555B1 (en) | Apparatus and method of extracting speech feature vectors and speech recognition system and method employing the same | |
CN112133293A (en) | Phrase voice sample compensation method based on generation countermeasure network and storage medium | |
CN112017676A (en) | Audio processing method, apparatus and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||