CN108766430A - A speech feature mapping method and system based on Bhattacharyya distance - Google Patents
A speech feature mapping method and system based on Bhattacharyya distance
- Publication number
- CN108766430A (publication) · CN201810572146.1A (application)
- Authority
- CN
- China
- Prior art keywords
- mapping
- characteristic
- distance
- gmm
- clean
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
The invention belongs to the field of speech recognition/speaker recognition technology and discloses a speech feature mapping method and system based on the Bhattacharyya distance. First, features are extracted from a speech signal recorded in a complex environment and from a clean speech signal, respectively. Next, the mapping features are initialized using the complex features and a feature mapping formula, and GMM models of the mapping features and of the clean features are established respectively. Then the minimum Bhattacharyya distance between the two GMM models is estimated iteratively with the EM algorithm, yielding the final mapping features. Finally, pattern matching and recognition are performed between the mapping features and the trained clean-environment speech signal model. By minimizing the Bhattacharyya distance between the complex-feature GMM and the clean-feature GMM, the invention maps the complex features to obtain mapping features, which are then matched and recognized against the clean model; replacing the complex features with the mapping features effectively improves the accuracy of speech recognition.
Description
Technical Field
The invention belongs to the technical field of speech recognition/speaker recognition, and particularly relates to a speech feature mapping method and system based on the Bhattacharyya distance.
Background
Currently, the state of the art commonly used in the industry is as follows:
with the development of computer technology toward human-computer interaction, voice interaction is increasingly applied in real scenes. Voice interaction is interaction between a human and a machine by means of speech; this mode of interaction is closer to interaction between humans and better matches human interaction habits. It lets a user interact with a machine more comfortably, so that the interaction proceeds in a simpler, quicker, and more efficient manner, makes the human-computer interaction process more humane, and highlights the leading role of the person throughout the interaction. Voice interaction also frees people's hands: both hands can still carry out other operations while interacting with the machine, which offers great convenience in aspects such as immersion and safety.
Speech recognition is the entrance that starts the voice interaction process and an important component of voice interaction technology, and its result directly influences the performance of the whole interaction. In speech interaction, the machine must first be able to "understand" human language; only then can the interaction proceed. Speech recognition is therefore the basis and precondition of voice interaction. It means converting the content of a person's speech into the corresponding words through a series of processing steps, i.e. converting speech into text. In effect, it gives the machine the function of human ears, so that the machine can simulate human hearing.
Current speech recognition achieves good results in fairly ideal environments. However, once voice interaction is disturbed by the external environment, the recognition rate drops sharply, which seriously affects the interaction. Worse, the real environment we live in is a complex one, surrounded by all kinds of noise: both natural noise (wind, rain, thunder, running water, etc.) and artificial noise (surrounding voices, machine sounds, etc.) can degrade voice interaction and thus the user experience. Effectively removing noise from a noisy speech signal and building a recognition system with better noise robustness is therefore an urgent problem.
The prior art generally starts from three aspects: speech enhancement techniques applied at the signal stage, feature mapping techniques applied at the feature stage, and model compensation techniques applied at the model stage. Feature mapping is the most commonly used for improving recognition accuracy in complex environments, because features are the most suitable means of representing speech signals, and suitable features can achieve good results. The present invention, a feature mapping method based on the Bhattacharyya distance, belongs to this family. Feature mapping estimates the parameters of a mapping function, thereby determining the function used to map the features. In the prior art, parameter estimation mostly approximates the distribution of the reference features with the true distribution of the observed features. The invention instead adds prior information by describing the distribution of the complex features with a GMM; this GMM is not fixed but is continuously adjusted by minimizing the Bhattacharyya distance between it and the reference-feature GMM, so that the real complex feature distribution is mapped toward the feature distribution in the ideal state, achieving the goal of feature mapping.
In summary, the problems of the prior art are as follows:
(1) The prior art uses only the true distribution of the features and begins the mapping without any assumption about, or operation on, that distribution, so the direction of the mapping during the mapping process is difficult to guarantee under such methods.
(2) The prior art gives little consideration to how the complex feature distribution and the reference feature distribution are measured against each other. The Bhattacharyya distance is a good choice for the distance between two distributions, yet the prior art does not exploit it.
(3) The prior art does not add prior information when processing complex features.
The difficulty and significance for solving the technical problems are as follows:
the purpose of feature mapping is to normalize the true distribution of complex features so that the complex features can reach an ideal state, i.e., the mapped feature distribution can conform to the distribution of clean features. This is also a difficult point for feature mapping.
If a theoretical distribution can be reasonably assumed before mapping, this amounts to adding prior information. By comparing the assumed theoretical distribution with the clean feature distribution and continuously optimizing it by minimizing the distance between the two, the assumed distribution becomes the true ideal distribution when that distance is minimized. Solving this problem is important for feature mapping.
For measuring the distance between two distributions, the Bhattacharyya distance is a good choice: it is a statistic specifically designed to assess the distance between two distributions, and using it to optimize the distance between the two distributions above simplifies the calculation.
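For concreteness, the standard definition of the Bhattacharyya distance, together with its closed form for two Gaussian distributions, is as follows (these are textbook formulas given for reference; the patent's own GMM-level expression may differ in detail):

```latex
% General Bhattacharyya distance between distributions p and q
D_B(p, q) = -\ln \int \sqrt{p(x)\, q(x)}\, dx

% Closed form for two Gaussians N(\mu_1, \Sigma_1), N(\mu_2, \Sigma_2),
% with the averaged covariance \Sigma = \tfrac{1}{2}(\Sigma_1 + \Sigma_2)
D_B = \tfrac{1}{8}\,(\mu_1 - \mu_2)^{\top} \Sigma^{-1} (\mu_1 - \mu_2)
      + \tfrac{1}{2}\,\ln \frac{\det \Sigma}{\sqrt{\det \Sigma_1 \, \det \Sigma_2}}
```

The first term grows with the separation of the means, the second with the mismatch of the covariances, which is why minimizing this distance pulls one distribution toward the other.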
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a speech feature mapping method and system based on the Bhattacharyya distance. The feature mapping of the present invention is a solution that starts from the speech features: through the feature mapping function, speech signal features obtained in a complex environment are converted into features that can be regarded approximately as features obtained in a clean environment. The method greatly reduces the influence of noise on the speech signal, improves the accuracy of speech recognition in practical applications, and enhances the robustness of the speech recognition system.
The invention is realized in such a way that a speech feature mapping method based on Bhattacharyya distance comprises the following steps:
firstly, respectively extracting the features of a speech signal in a complex environment and of a clean speech signal; secondly, initializing the mapping features by using the complex features and the feature mapping formula, and respectively establishing GMM models of the mapping features and the clean features; then, iteratively estimating the minimum Bhattacharyya distance between the two GMM models by using the EM (Expectation-Maximization) algorithm, and obtaining the final mapping features; and finally, performing pattern matching and recognition between the mapping features and the trained voice signal model in the clean environment.
The method specifically comprises the following steps:
step 1: extracting voice features in a clean environment;
step 2: extracting the characteristics of the voice in the complex environment;
step 3: obtaining an initialized mapping characteristic from the complex characteristic through a characteristic mapping formula, and respectively establishing a GMM model of the initialized mapping characteristic and a GMM model of the clean environment voice characteristic;
step 4: introducing a Bhattacharyya distance, and obtaining a mapping characteristic by minimizing the Bhattacharyya distance between two GMMs;
step 5: carrying out pattern matching and recognition on the mapping characteristics and the trained voice signal model in the clean environment.
Further, the specific implementation of step 1 comprises the following substeps:
step 1.1: preprocessing a voice signal obtained in a clean environment, wherein the preprocessing comprises pre-emphasis, framing and windowing;
step 1.2: extracting Mel cepstral coefficient features MFCC from the signal preprocessed in step 1.1.
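The preprocessing of step 1.1 can be sketched in a few lines of NumPy. The frame length, hop size, and pre-emphasis coefficient below are typical values assumed for illustration; they are not specified in the patent:

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing and Hamming windowing, as in step 1.1."""
    # Pre-emphasis: s'[n] = s[n] - alpha * s[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split the signal into overlapping frames
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: apply a Hamming window to every frame
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(16000))  # 1 s of audio at 16 kHz
print(frames.shape)  # (98, 400)
```

MFCC extraction (step 1.2) then applies an FFT, Mel filterbank, log, and DCT to each windowed frame; standard toolkits such as HTK perform this step in the patent's experiments.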
Further, the specific implementation of step 2 comprises the following substeps:
step 2.1: preprocessing a voice signal obtained in a complex environment, wherein the preprocessing comprises pre-emphasis, framing and windowing;
step 2.2: extracting Mel cepstral coefficient features from the signal preprocessed in step 2.1, recorded as X = [x_1, x_2, ..., x_t, ..., x_n], x_t ∈ X.
Further, the specific implementation of step 3 comprises the following substeps:
step 3.1: the feature mapping formula is:

y_t = W^T · x̃_t = Σ_{l=-L}^{L} A_l · x_{t+l} + b

where x_t denotes the t-th frame of the input features, y_t the t-th frame of the output features, A the gain matrices, b the offset term, and W the parameters of the mapping function. A is a matrix sequence formed by 2L+1 matrices, L is a non-negative integer, and the dimension of each matrix is the same as that of one frame of the input features; the input x̃_t is formed from 2L+1 frames of features. W is thus composed of the 2L+1 matrices and a column vector, and x̃_t is a column vector consisting of the 2L+1 frames followed by a 1;
step 3.2: let L = 1; in W, A_0 is an identity matrix and A_{-1}, A_1 are zero matrices; the initialized mapping features y_t are constructed in turn, so that the initialized mapping features are essentially the complex features;
step 3.3: establishing GMM models of the initialized mapping features and of the clean features.
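A minimal NumPy sketch of the mapping formula of step 3.1 and the initialization of step 3.2 follows. The feature dimension and the zero-padding of the edge frames are assumptions made for illustration:

```python
import numpy as np

def map_features(X, A_list, b, L=1):
    """Affine context mapping y_t = sum_l A_l x_{t+l} + b (step 3.1)."""
    n, d = X.shape
    # Zero-pad L frames on each side so edge frames also have full context
    Xpad = np.vstack([np.zeros((L, d)), X, np.zeros((L, d))])
    Y = np.empty_like(X)
    for t in range(n):
        Y[t] = sum(A_list[l + L] @ Xpad[t + l + L]
                   for l in range(-L, L + 1)) + b
    return Y

d = 13                                    # e.g. 13-dimensional MFCCs
rng = np.random.default_rng(0)
X = rng.standard_normal((50, d))          # 50 frames of complex features
A = [np.zeros((d, d)), np.eye(d), np.zeros((d, d))]  # A_-1 = 0, A_0 = I, A_1 = 0
b = np.zeros(d)
Y0 = map_features(X, A, b)                # the initialization of step 3.2
print(np.allclose(Y0, X))                 # True: the initial mapping is the identity
```

This confirms the remark in step 3.2 that the initialized mapping features are essentially the complex features themselves; the EM iterations of step 4 then move W away from this identity initialization.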
Further, the specific implementation of step 4 comprises the following substeps:
step 4.1: the Bhattacharyya distance between the two GMMs is expressed according to the Bhattacharyya distance formula

D_B(a, b) = -ln ∫ √(p_a(x) · p_b(x)) dx

where N_i^a and N_i^b denote the i-th Gaussian components of the clean-feature GMM and of the mapping-feature GMM, respectively;
step 4.2: converting the formula in step 4.1 according to the Gaussian formula to construct a loss function F_c, in which ||·||_F is the second-order standard (Frobenius) matrix norm; β and λ are two adjustable parameters controlling the influence of the Frobenius-norm term and of the GMM model distributions; y is the input feature, i.e. the mapping feature newly obtained at each iteration; T is the number of frames of the input feature, and y_t is its t-th frame. The remaining term y_t(i) has an expression in which M is the number of Gaussians in the GMM and ω_i is the weight of the i-th Gaussian;
step 4.3: introducing the EM algorithm to iterate on the Bhattacharyya distance of step 4.2 and compute its minimum, together with the parameter W at which the minimum is attained;
step 4.4: substituting the parameter W into the feature mapping formula of step 3.1 to obtain the mapping feature y.
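Step 4.1 pairs the i-th Gaussian components of the two GMMs. For a single pair of Gaussians the Bhattacharyya distance has the textbook closed form below; this is the standard formula, not necessarily the exact per-component expression used in the patent:

```python
import numpy as np

def bhattacharyya_gaussians(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two Gaussians."""
    cov = 0.5 * (cov1 + cov2)                     # averaged covariance
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)          # mean term
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2                          # covariance term added

mu = np.zeros(2)
I = np.eye(2)
print(bhattacharyya_gaussians(mu, I, mu, I))          # 0.0: identical Gaussians
print(bhattacharyya_gaussians(mu, I, np.ones(2), I))  # 0.25: grows with mean gap
```

Minimizing a sum of such per-component distances over the mapping parameter W is what drives the mapped-feature GMM toward the clean-feature GMM.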
Further, in step 5, the clean features are used for training, usually with an HMM-GMM model, and model matching is performed with the mapping features obtained in step 4 to carry out speech recognition/speaker recognition.
It is a further object of the present invention to provide a computer program implementing said Bhattacharyya distance-based speech feature mapping method.
Another object of the present invention is to provide a speech recognition/speaker recognition system implementing the Bhattacharyya distance-based speech feature mapping method.
It is another object of the present invention to provide a computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the Bhattacharyya distance-based speech feature mapping method.
In summary, the advantages and positive effects of the invention are:
The invention maps the complex features by minimizing the Bhattacharyya distance between the complex-feature GMM and the clean-feature GMM to obtain the mapping features, and performs pattern matching and recognition between the mapping features and the clean model. Replacing the complex features with the mapping features effectively improves the accuracy of speech recognition.
The mapping method provided by the invention improves the recognition accuracy of a speech recognition system by mapping the speech features obtained in a complex environment, and has stronger robustness;
the feature mapping method provided by the invention is based on the Bhattacharyya distance between the complex-feature and clean-feature distributions, and incorporates prior information: it is assumed that the distribution of the complex features conforms to a GMM, and this prior makes the complex feature distribution gradually approach the clean feature distribution. The characteristics of the feature distributions are thus used reasonably, achieving a better effect;
the invention provides a broadly applicable algorithm for improving the accuracy of speech recognition.
Aiming at the low accuracy of speech recognition in complex environments, the invention provides a feature mapping method based on the Bhattacharyya distance. It is compared through experiments with the original data without feature mapping and with two existing techniques; the experimental results show that sentence accuracy and word accuracy are effectively improved and the method performs better.
Table 1: Comparison of experimental results with the original data

| Method | Sentence accuracy | Word accuracy |
| --- | --- | --- |
| Raw data | 50% | 81.08% |
| Parameter-estimation feature mapping | 62% | 85.14% |
| KL-divergence feature mapping | 66% | 85.81% |
| Bhattacharyya-distance feature mapping | 68% | 87.16% |
Drawings
FIG. 1 is a flowchart of the speech feature mapping method based on the Bhattacharyya distance according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the speech feature mapping method based on the Bhattacharyya distance according to an embodiment of the present invention.
Fig. 3 is a flowchart of an experimental implementation provided in an embodiment of the present invention.
Fig. 4 is a flowchart of extracting Mel-frequency cepstral coefficient features MFCC from a signal obtained by preprocessing a speech signal in a clean environment according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The speech feature mapping method based on the Bhattacharyya distance provided by the embodiment of the invention first extracts the features of a speech signal in a complex environment and of a clean speech signal, respectively; second, it initializes the mapping features using the complex features and the feature mapping formula, and establishes GMM models of the mapping features and of the clean features respectively; then it iteratively estimates the minimum Bhattacharyya distance between the two GMM models using the EM (Expectation-Maximization) algorithm and obtains the final mapping features; finally, it performs pattern matching and recognition between the mapping features and the trained clean-environment speech signal model. By minimizing the Bhattacharyya distance between the complex-feature GMM and the clean-feature GMM, the invention maps the complex features so that the obtained mapping features can be regarded approximately as speech signal features from a clean environment, which greatly improves the purity of the speech features in complex environments, improving speech recognition accuracy and enhancing the robustness of the speech recognition system.
Referring to figs. 1-2, the present invention provides a speech feature mapping method based on the Bhattacharyya distance, which includes the following steps:
step 1: extracting the characteristics of the voice in a clean environment;
the specific implementation comprises the following substeps:
step 1.1: preprocessing a voice signal obtained in a clean environment, wherein the preprocessing comprises pre-emphasis, framing and windowing;
step 1.2: Mel cepstral coefficient features MFCC are extracted from the signal preprocessed in step 1.1 (shown in fig. 4).
Step 2: extracting the characteristics of the voice in the complex environment;
the specific implementation comprises the following substeps:
step 2.1: preprocessing a voice signal obtained in a complex environment, wherein the preprocessing comprises pre-emphasis, framing and windowing;
step 2.2: extracting Mel cepstral coefficient features from the signal preprocessed in step 2.1, recorded as X = [x_1, x_2, ..., x_t, ..., x_n], x_t ∈ X.
step 3: obtaining an initialized mapping feature from the complex features through the feature mapping formula, and respectively establishing a GMM model of the initialized mapping features and a GMM model of the clean-environment speech features;
the specific implementation comprises the following substeps:
step 3.1: determining the feature mapping formula as:

y_t = W^T · x̃_t = Σ_{l=-L}^{L} A_l · x_{t+l} + b

where x_t denotes the t-th frame of the input features, y_t the t-th frame of the output features, A the gain matrices, b the offset term, and W the parameters of the mapping function. A is a matrix sequence formed by 2L+1 matrices, L is a non-negative integer, and the dimension of each matrix is the same as that of one frame of the input features; the input x̃_t is formed from 2L+1 frames of features. W is thus composed of the 2L+1 matrices and a column vector, and x̃_t is a column vector consisting of the 2L+1 frames followed by a 1;
step 3.2: let L = 1; in W, A_0 is an identity matrix and A_{-1}, A_1 are zero matrices; the initialized mapping features y_t are constructed in turn;
step 3.3: establishing GMM models of the initialized mapping features and of the clean features.
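Step 3.3 fits GMMs to the two feature sets. As a self-contained illustration of the E/M alternation involved, here is a minimal one-dimensional EM fit; the patent presumably uses full multivariate GMMs via standard toolkits, and all data and initialization choices below are assumptions for the example:

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=100):
    """Minimal EM for a one-dimensional k-component GMM."""
    w = np.full(k, 1.0 / k)                        # component weights
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread initial means over the data
    var = np.full(k, np.var(x))                    # start from the global variance
    for _ in range(iters):
        # E-step: responsibility of component i for sample t
        dens = (w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = gamma.sum(axis=0)
        w = nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])
w, mu, var = fit_gmm_1d(x)
print(np.sort(mu))  # the two component means are recovered near -3 and 3
```

In the patent's pipeline one such GMM describes the clean features (fixed after fitting) and another describes the mapped features (re-fitted as the mapping parameter W changes).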
step 4: introducing the Bhattacharyya distance, and obtaining the mapping features by minimizing the Bhattacharyya distance between the two GMMs;
the specific implementation comprises the following substeps:
step 4.1: the Bhattacharyya distance between the two GMMs is expressed according to the Bhattacharyya distance formula

D_B(a, b) = -ln ∫ √(p_a(x) · p_b(x)) dx

where N_i^a and N_i^b denote the i-th Gaussian components of the clean-feature GMM and of the mapping-feature GMM, respectively;
step 4.2: converting the formula in step 4.1 according to the Gaussian formula to construct a loss function F_c, in which ||·||_F is the second-order standard (Frobenius) matrix norm; β and λ are two adjustable parameters controlling the influence of the Frobenius-norm term and of the GMM model distributions; y is the input feature, i.e. the mapping feature newly obtained at each iteration; T is the number of frames of the input feature, and y_t is its t-th frame. The remaining term y_t(i) has an expression in which M is the number of Gaussians in the GMM and ω_i is the weight of the i-th Gaussian;
step 4.3: according to the formula

W* = argmin_W F_c(W)

there exists a parameter W at which the loss function F_c(W) attains its minimum; this W is the parameter to be solved for, and it yields the minimum of F_c(W) from step 4.2;
step 4.4: introducing an auxiliary function Q(W, W̄) by means of the EM algorithm, where W and W̄ denote the newly estimated parameter and the current parameter, respectively; M and T denote the number of Gaussians in the GMM and the number of frames of the input feature y; μ_i^a and μ_i^b denote the means of the i-th components of GMM models a and b; Σ_m^a and Σ_m^b denote the diagonal covariance matrices derived from the variances of the m-th Gaussians of GMM models a and b, respectively; d denotes the dimension, i.e. the row index, of the parameter W, and W^(d) its d-th row; G and p are intermediate parameters of the auxiliary function;
step 4.5: deriving the gradient function of the auxiliary function Q with respect to the d-th row W^(d) of the parameter W;
step 4.6: iterating using the L-BFGS algorithm in combination with the gradient function in step 4.5 to obtain the minimum value of the auxiliary function in step 4.4 and the value of the parameter W at which the minimum value exists;
step 4.7: assigning the W value obtained in step 4.6 to W̄, and repeating steps 4.4, 4.5 and 4.6, iterating W until the difference between the newly obtained W and the W of the previous iteration is smaller than a fixed threshold, or until the number of iterations reaches a fixed limit;
step 4.8: substituting the W value obtained in step 4.7 into the feature mapping formula of step 3.1 to obtain the mapping feature y.
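Steps 4.4-4.7 form an iterate-until-converged loop around a minimizer. The following is a hedged sketch of that outer loop: plain gradient descent stands in for the L-BFGS algorithm of step 4.6, and a toy quadratic stands in for the auxiliary function Q, so only the loop structure and the convergence rule of step 4.7 are being illustrated:

```python
import numpy as np

def minimize_iteratively(grad, W0, lr=0.1, tol=1e-6, max_iter=1000):
    """Iterate W until the change in W falls below a threshold or the
    iteration count reaches a limit (the convergence rule of step 4.7)."""
    W = W0.copy()
    for it in range(max_iter):
        W_new = W - lr * grad(W)      # L-BFGS would also use curvature estimates
        if np.linalg.norm(W_new - W) < tol:
            return W_new, it + 1      # converged: successive iterates barely move
        W = W_new
    return W, max_iter

# Toy stand-in: minimize F(W) = ||W - target||^2, whose gradient is 2(W - target)
target = np.array([1.0, -2.0, 0.5])
W, n_iter = minimize_iteratively(lambda W: 2 * (W - target), np.zeros(3))
print(np.round(W, 4))  # converges to target well before max_iter
```

In the patent the gradient comes from step 4.5's derivative of Q with respect to each row W^(d), and the converged W is substituted back into the mapping formula of step 3.1.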
step 5: performing pattern matching and recognition between the mapping features and the trained clean-environment speech signal model.
The invention is further described below in conjunction with specific examples/experiments/simulation analyses.
The effectiveness of the invention is demonstrated through experiments. Data support is provided by the Aurora2 standard database. The experiments are assisted by MATLAB simulation software, the HTK speech toolkit, and the voicebox toolkit. The software and hardware environment of the experiments is as follows:
hardware environment:
Intel Core i3-4150 CPU, 3.5 GHz, dual core
8 GB RAM
Dell E1916HV
Software environment:
MATLAB 2014a
voicebox
HTK3.5
Aurora2
In the experiment, MFCC (Mel-frequency cepstral coefficient) features are first extracted from the training and test data in Aurora2 with the HTK (Hidden Markov Model Toolkit) speech toolkit, giving the reference features and test features; MATLAB scripts then call the voicebox toolkit to establish a GMM model for the reference features; further scripts establish the test-feature GMM model and obtain the mapped test-feature MFCCs by minimizing the Bhattacharyya distance between the test-feature GMM and the reference-feature GMM; finally the mapped features are recognized with HTK to obtain the recognition rate. The recognition-rate screenshot obtained under HTK and the experimental implementation flowchart are shown in fig. 3.
In the above embodiments, the implementation may be realized wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, it can be realized wholly or partially as a computer program product comprising one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another via a wired connection (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (10)
1. A Bhattacharyya distance-based speech feature mapping method, characterized in that the method comprises the following steps:
firstly, extracting features from a speech signal in a complex environment and from a clean speech signal, respectively; secondly, initializing the mapping features from the complex features through a feature mapping formula, and establishing GMM models of the mapping features and of the clean features, respectively;
then, iteratively minimizing the Bhattacharyya distance between the two GMM models by using the EM (Expectation-Maximization) algorithm to obtain the final mapping features;
and finally, performing pattern matching and recognition between the mapping features and the speech signal model trained in the clean environment.
2. The Bhattacharyya distance-based speech feature mapping method according to claim 1, characterized in that the method specifically comprises:
Step 1: extracting speech features in a clean environment;
Step 2: extracting speech features in a complex environment;
Step 3: obtaining initialized mapping features from the complex features through a feature mapping formula, and establishing a GMM model of the initialized mapping features and a GMM model of the clean-environment speech features, respectively;
Step 4: introducing the Bhattacharyya distance, and obtaining the mapping features by minimizing the Bhattacharyya distance between the two GMMs;
Step 5: performing pattern matching and recognition between the mapping features and the speech signal model trained in the clean environment.
3. The Bhattacharyya distance-based speech feature mapping method according to claim 2, characterized in that step 1 specifically comprises the following steps:
Step 1.1: preprocessing the speech signal obtained in the clean environment, the preprocessing comprising pre-emphasis, framing, and windowing;
Step 1.2: extracting Mel-frequency cepstral coefficient (MFCC) features from the signal preprocessed in step 1.1.
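The preprocessing in steps 1.1–1.2 can be sketched in plain NumPy as follows; the frame length (256), frame shift (128), and pre-emphasis coefficient (0.97) are common defaults assumed for illustration, not values specified by the patent.

```python
# Sketch of pre-emphasis, framing, and windowing (steps 1.1-1.2).
# Parameter values are assumed defaults, not from the patent.
import numpy as np

def preprocess(signal, frame_len=256, frame_shift=128, pre_emph=0.97):
    # Pre-emphasis: s'[n] = s[n] - a * s[n-1], boosting high frequencies
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # Framing: slice the signal into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    frames = emphasized[idx]
    # Windowing: apply a Hamming window to each frame
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(4000))
print(frames.shape)  # (30, 256)
```

MFCC extraction proper (filterbank, log, DCT) would follow on these windowed frames.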
4. The Bhattacharyya distance-based speech feature mapping method according to claim 2, characterized in that step 2 specifically comprises the following steps:
Step 2.1: preprocessing the speech signal obtained in the complex environment, the preprocessing comprising pre-emphasis, framing, and windowing;
Step 2.2: extracting Mel-frequency cepstral coefficient (MFCC) features X = [x1, x2, ..., xt, ..., xn], xt ∈ X, from the signal preprocessed in step 2.1.
5. The Bhattacharyya distance-based speech feature mapping method according to claim 2, characterized in that step 3 specifically comprises the following steps:
Step 3.1: obtaining the initialized mapping features from the complex features through a feature mapping formula, the mapping formula being:

yt = W·x̂t = A−L·xt−L + ... + A0·xt + ... + AL·xt+L + b,

wherein xt denotes the tth frame of the input features and yt the tth frame of the output features, A denotes the gain matrices, b denotes the offset vector, and W denotes the parameter of the mapping function; A is a sequence of 2L+1 matrices A−L, ..., AL, L being a nonnegative integer, each matrix having the same dimension as one frame of the input features; the input features form a group of vectors from 2L+1 frames; W = [A−L, ..., AL, b] is composed of the 2L+1 matrices and the column vector b, and x̂t is the column vector formed by the 2L+1 frames and a constant 1;
Step 3.2: letting L = 1, with A0 in W being the identity matrix and A−1, A1 being zero matrices, constructing the initialized mapping features yt in sequence, so that the initialized mapping features equal the complex features;
Step 3.3: processing the speech signal and establishing GMM models of the initialized mapping MFCC features and of the clean features.
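The identity initialization of steps 3.1–3.2 can be sketched as follows: with L = 1, A0 = I and A−1 = A1 = 0 (and b = 0), the initial mapped features equal the complex features. Function names and the edge-padding choice at the utterance boundaries are illustrative assumptions, not details from the patent.

```python
# Sketch of the mapping y_t = W * [x_{t-L}; ...; x_{t+L}; 1] with identity
# initialization (L=1, A0=I, A-1=A1=0, b=0). Names are illustrative.
import numpy as np

def init_W(dim, L=1):
    # W: dim x (dim*(2L+1) + 1); the A0 block is identity, the rest zero
    W = np.zeros((dim, dim * (2 * L + 1) + 1))
    W[:, L * dim:(L + 1) * dim] = np.eye(dim)
    return W

def apply_mapping(W, X, L=1):
    # Stack each frame with its L neighbors on each side plus a constant 1,
    # repeating edge frames at the boundaries, then apply W.
    T, dim = X.shape
    padded = np.pad(X, ((L, L), (0, 0)), mode="edge")
    Y = np.empty_like(X)
    for t in range(T):
        context = padded[t:t + 2 * L + 1].reshape(-1)
        Y[t] = W @ np.append(context, 1.0)
    return Y

X = np.random.randn(50, 13)
Y = apply_mapping(init_W(13), X)
print(np.allclose(X, Y))  # True: identity initialization leaves X unchanged
```

Subsequent EM iterations would update W away from this identity starting point.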
6. The Bhattacharyya distance-based speech feature mapping method according to claim 2, characterized in that step 4 specifically comprises the following steps:
Step 4.1: expressing the Bhattacharyya distance between the two GMMs according to the Bhattacharyya distance formula as a weighted sum over matched Gaussian components:

DB(GMMclean, GMMmap) = Σ(i=1..M) ωi·DB(Ni^clean, Ni^map),

wherein Ni^clean and Ni^map respectively denote the ith Gaussian components of the clean-feature GMM and the mapping-feature GMM;
step 4.2: converting the formula in the step 4.1 according to a Gaussian formula to construct a loss function Fc:
Wherein,being the second order standard norm of the Flobenius matrix, β and λ are to control the Flobenius normAnd two adjustable parameters of GMM model distribution influence degree, y is input characteristic, namely mapping characteristic newly obtained each time, T is frame number of input characteristic, y istI.e. the t-th frame of the input feature, another parameter yt(i) The expression of (a) is:
m is the number of Gauss in GMM, omegaiA weight of the ith Gaussian;
Step 4.3: introducing the EM (Expectation-Maximization) algorithm to iteratively minimize the Bhattacharyya-distance loss of step 4.2, and solving for the parameter W at the minimum;
Step 4.4: substituting the parameter W into the feature mapping formula of step 3.1 to obtain the mapping features y.
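The distance minimized in step 4 can be illustrated for diagonal-covariance Gaussians. The closed-form per-component Bhattacharyya distance below is standard; the weighted component-wise sum over the two GMMs is one common approximation and stands in for the patent's exact formula, whose image is not reproduced in this text.

```python
# Sketch of the Bhattacharyya distance between matched diagonal-covariance
# Gaussian components, summed with weights over the M components.
import numpy as np

def bhattacharyya_gauss(mu1, var1, mu2, var2):
    # D_B = 1/8 (mu1-mu2)^T Sigma^-1 (mu1-mu2)
    #     + 1/2 ln(|Sigma| / sqrt(|Sigma1||Sigma2|)), Sigma = (Sigma1+Sigma2)/2,
    # specialized here to diagonal covariances (element-wise variances)
    avg = (var1 + var2) / 2.0
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / avg)
    term2 = 0.5 * np.sum(np.log(avg) - 0.5 * (np.log(var1) + np.log(var2)))
    return term1 + term2

def gmm_bhattacharyya(weights, mus1, vars1, mus2, vars2):
    # Weighted sum over matched components (a common approximation)
    return sum(w * bhattacharyya_gauss(m1, v1, m2, v2)
               for w, m1, v1, m2, v2 in zip(weights, mus1, vars1, mus2, vars2))

mu = np.zeros(13); var = np.ones(13)
print(bhattacharyya_gauss(mu, var, mu, var))  # 0.0 for identical Gaussians
```

Minimizing this quantity over W (step 4.3) drives the mapped-feature GMM toward the clean-feature GMM.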
7. The Bhattacharyya distance-based speech feature mapping method according to claim 2, characterized in that step 5 specifically comprises: training on the clean features, and performing model matching between the mapping features obtained in step 4 and the HMM-GMM model to carry out speech recognition/speaker recognition.
8. A computer program for implementing the Bhattacharyya distance-based speech feature mapping method according to any one of claims 1 to 7.
9. A speech recognition/speaker recognition system for implementing the Bhattacharyya distance-based speech feature mapping method according to any one of claims 1 to 7.
10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the Bhattacharyya distance-based speech feature mapping method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810572146.1A CN108766430B (en) | 2018-06-06 | 2018-06-06 | Speech feature mapping method and system based on Bhattacharyya distance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108766430A true CN108766430A (en) | 2018-11-06 |
CN108766430B CN108766430B (en) | 2020-08-04 |
Family
ID=63999822
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810572146.1A Active CN108766430B (en) | 2018-06-06 | 2018-06-06 | Speech feature mapping method and system based on Bhattacharyya distance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108766430B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111816187A (en) * | 2020-07-03 | 2020-10-23 | 中国人民解放军空军预警学院 | Deep neural network-based voice feature mapping method in complex environment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1521729A (en) * | 2003-01-21 | 2004-08-18 | Method of speech recognition using hidden trajectory hidden markov models | |
US7529666B1 (en) * | 2000-10-30 | 2009-05-05 | International Business Machines Corporation | Minimum bayes error feature selection in speech recognition |
CN104900232A (en) * | 2015-04-20 | 2015-09-09 | 东南大学 | Isolation word identification method based on double-layer GMM structure and VTS feature compensation |
CN106782520A (en) * | 2017-03-14 | 2017-05-31 | 华中师范大学 | Phonetic feature mapping method under a kind of complex environment |
US20170263269A1 (en) * | 2016-03-08 | 2017-09-14 | International Business Machines Corporation | Multi-pass speech activity detection strategy to improve automatic speech recognition |
Non-Patent Citations (2)
Title |
---|
DUC HOANG HA NGUYEN: "Feature Adaptation Using Linear Spectro-Temporal Transform for Robust Speech Recognition", 《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》 * |
TAN PING, XING YUJUAN, GAO XIANG: "Research and Analysis of Speaker Model Clustering Algorithms", 《China Building Materials Science & Technology》 * |
Also Published As
Publication number | Publication date |
---|---|
CN108766430B (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7177167B2 (en) | Mixed speech identification method, apparatus and computer program | |
CN106683680B (en) | Speaker recognition method and device, computer equipment and computer readable medium | |
CN107564513B (en) | Voice recognition method and device | |
CN109360572B (en) | Call separation method and device, computer equipment and storage medium | |
WO2019237519A1 (en) | General vector training method, voice clustering method, apparatus, device and medium | |
CN112053695A (en) | Voiceprint recognition method and device, electronic equipment and storage medium | |
TWI475558B (en) | Method and apparatus for utterance verification | |
JP5932869B2 (en) | N-gram language model unsupervised learning method, learning apparatus, and learning program | |
CN108922543B (en) | Model base establishing method, voice recognition method, device, equipment and medium | |
CN107093422B (en) | Voice recognition method and voice recognition system | |
JP7124427B2 (en) | Multi-view vector processing method and apparatus | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
WO2019232826A1 (en) | I-vector extraction method, speaker recognition method and apparatus, device, and medium | |
CN109192200A (en) | A kind of audio recognition method | |
Sivaraman et al. | Personalized speech enhancement through self-supervised data augmentation and purification | |
CN112053694A (en) | Voiceprint recognition method based on CNN and GRU network fusion | |
CN114550703A (en) | Training method and device of voice recognition system, and voice recognition method and device | |
CN111161713A (en) | Voice gender identification method and device and computing equipment | |
CN113178192A (en) | Training method, device and equipment of speech recognition model and storage medium | |
CN114203154A (en) | Training method and device of voice style migration model and voice style migration method and device | |
CN112002307A (en) | Voice recognition method and device | |
CN113421573B (en) | Identity recognition model training method, identity recognition method and device | |
KR100897555B1 (en) | Apparatus and method of extracting speech feature vectors and speech recognition system and method employing the same | |
CN112133293A (en) | Phrase voice sample compensation method based on generation countermeasure network and storage medium | |
CN112017676A (en) | Audio processing method, apparatus and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||