CN108766430A - Speech feature mapping method and system based on Bhattacharyya distance - Google Patents

Speech feature mapping method and system based on Bhattacharyya distance

Info

Publication number
CN108766430A
Authority
CN
China
Prior art keywords
mapping
characteristic
distance
gmm
clean
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810572146.1A
Other languages
Chinese (zh)
Other versions
CN108766430B (en)
Inventor
王志锋
左明章
宁国勤
叶俊民
闵秋莎
田元
夏丹
陈迪
罗恒
姚璜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong Normal University
Original Assignee
Huazhong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong Normal University
Priority to CN201810572146.1A
Publication of CN108766430A
Application granted
Publication of CN108766430B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of speech recognition/speaker recognition and discloses a speech feature mapping method and system based on the Bhattacharyya distance. First, features are extracted from a speech signal recorded in a complex environment and from a clean speech signal, respectively. Next, the mapped features are initialized from the complex features via a feature mapping formula, and GMM models of the mapped features and of the clean features are established. Then the minimum Bhattacharyya distance between the two GMM models is estimated iteratively with the EM algorithm, yielding the final mapped features. Finally, the mapped features are pattern-matched against the trained clean-environment speech signal model for recognition. By minimizing the Bhattacharyya distance between the complex-feature GMM and the clean-feature GMM, the invention maps the complex features to obtain the mapped features, and performs pattern matching and recognition between the mapped features and the clean model; replacing the complex features with the mapped features effectively improves speech recognition accuracy.

Description

Speech feature mapping method and system based on Bhattacharyya distance
Technical Field
The invention belongs to the technical field of speech recognition/speaker recognition, and particularly relates to a speech feature mapping method and system based on the Bhattacharyya distance.
Background
The current state of the art commonly used in the industry is as follows:
as computer technology pushes human-computer interaction forward, voice interaction is applied ever more widely in real-world scenarios. Voice interaction is interaction between human and machine by speech; this mode is closer to interaction between humans and better matches human interaction habits. It lets users interact with machines more comfortably, in a simpler, quicker and more efficient way, makes the human-computer interaction process more natural, and keeps the person in the leading role throughout the interaction. Voice interaction also frees the hands: users can still interact with the machine while performing other tasks, which brings great convenience and benefits in aspects such as immersion and safety.
Speech recognition is the entry point of the voice interaction process and an important component of voice interaction technology; its result directly affects the performance of the entire interaction. In voice interaction, the machine must first be able to "understand" human language; only then can the interaction proceed. Speech recognition is thus the basis and precondition of voice interaction. Speech recognition means converting the content of a person's speech into the corresponding text through a series of processing steps, i.e., converting speech to text. Informally, it gives the machine the function of ears, allowing it to emulate human hearing.
Current speech recognition achieves good results in near-ideal environments. However, once voice interaction is interfered with by the external environment, the recognition rate drops sharply, severely affecting the interaction. Worse, the real world is a complex environment surrounded by all kinds of noise; both natural noise (wind, rain, thunder, running water, etc.) and artificial noise (surrounding voices, machine noise, etc.) degrade voice interaction and thus the user experience. Effectively removing noise from a noisy speech signal and building a recognition system with good anti-noise performance is therefore an urgent problem.
The prior art generally starts from three directions: speech enhancement applied at the signal stage, feature mapping applied at the feature stage, and model compensation applied at the model stage. Feature mapping is the technique most commonly used to improve recognition accuracy in complex environments, because features are the most suitable representation of the speech signal and well-chosen features yield good results. The present invention is a feature mapping method based on the Bhattacharyya distance, i.e., a feature mapping technique. Feature mapping estimates the parameters of a mapping function and thereby determines the function used to map the features. In the prior art, parameter estimation is mostly performed by approximating the distribution of the reference features with the empirical distribution of the features. The present invention adds prior information by describing the distribution of the complex features with a GMM; this GMM is not fixed but is continuously adjusted by minimizing the Bhattacharyya distance to the reference-feature GMM, thereby mapping the true complex-feature distribution onto the ideal feature distribution and achieving feature mapping.
In summary, the problems of the prior art are as follows:
(1) The prior art uses only the empirical distribution of the features and begins mapping without any assumption on, or treatment of, that distribution, so the mapping direction is hard to guarantee during the mapping process.
(2) The prior art gives little consideration to how the complex-feature distribution and the reference-feature distribution are compared. The Bhattacharyya distance is a good measure of the distance between two distributions, yet the prior art does not exploit it.
(3) The prior art adds no prior information when processing the complex features.
The difficulty and significance of solving these technical problems are as follows:
The purpose of feature mapping is to normalize the true distribution of the complex features so that they reach an ideal state, i.e., so that the mapped feature distribution conforms to the distribution of the clean features. This is also the difficult part of feature mapping.
If a theoretical distribution can be reasonably assumed before mapping, prior information is thereby added. The assumed theoretical distribution is compared with the clean-feature distribution and continuously optimized by minimizing the distance between the two; when that distance reaches its minimum, the assumed distribution has become the true ideal distribution. Solving this problem is important for feature mapping.
For measuring the distance between two distributions, the Bhattacharyya distance is a good choice: it is a statistic designed specifically to assess the distance between two distributions, and using it to optimize the distance between the two distributions above simplifies the computation.
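As an illustration of this computational convenience, the Bhattacharyya distance between two multivariate Gaussians has a simple closed form. The following minimal Python sketch (NumPy only; the function name and the toy values are illustrative, not taken from the patent) evaluates it:

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two multivariate Gaussians."""
    cov = 0.5 * (cov1 + cov2)                  # average covariance
    diff = mu1 - mu2
    # First term: Mahalanobis-style separation of the means
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Second term: mismatch between the covariance shapes (log-determinants)
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term_cov = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term_mean + term_cov

# Toy example with two 2-D Gaussians
mu_a, cov_a = np.zeros(2), np.eye(2)
mu_b, cov_b = np.array([1.0, 0.5]), np.diag([2.0, 0.5])
print(bhattacharyya_gaussian(mu_a, cov_a, mu_b, cov_b))  # > 0; 0 iff identical
```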
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a speech feature mapping method and system based on the Bhattacharyya distance. The feature mapping of the present invention is a solution that starts from the speech features. Through the feature mapping function, speech-signal features captured in a complex environment are converted into features that can be regarded approximately as clean-environment features. This greatly reduces the influence of noise on the speech signal, improves recognition accuracy in practical applications, and enhances the robustness of the speech recognition system.
The invention is realized as follows. A speech feature mapping method based on the Bhattacharyya distance comprises the following steps:
First, features are extracted from a speech signal recorded in a complex environment and from a clean speech signal, respectively. Second, the mapped features are initialized from the complex features via a feature mapping formula, and GMM models of the mapped features and of the clean features are established. Then the minimum Bhattacharyya distance between the two GMM models is estimated iteratively with the EM (Expectation-Maximization) algorithm, and the final mapped features are obtained. Finally, the mapped features are pattern-matched against the trained clean-environment speech signal model for recognition.
The method specifically comprises the following steps:
Step 1: extracting the speech features in a clean environment;
Step 2: extracting the speech features in a complex environment;
Step 3: obtaining initialized mapped features from the complex features via a feature mapping formula, and establishing a GMM model of the initialized mapped features and a GMM model of the clean-environment speech features;
Step 4: introducing the Bhattacharyya distance, and obtaining the mapped features by minimizing the Bhattacharyya distance between the two GMMs;
Step 5: performing pattern matching and recognition between the mapped features and the trained clean-environment speech signal model.
Further, the specific implementation of step 1 comprises the following substeps:
Step 1.1: preprocessing the speech signal obtained in a clean environment, wherein the preprocessing comprises pre-emphasis, framing and windowing;
Step 1.2: extracting Mel-frequency cepstral coefficient (MFCC) features from the signal preprocessed in step 1.1.
Further, the specific implementation of step 2 comprises the following substeps:
Step 2.1: preprocessing the speech signal obtained in a complex environment, wherein the preprocessing comprises pre-emphasis, framing and windowing;
Step 2.2: extracting Mel-frequency cepstral coefficient features from the signal preprocessed in step 2.1, denoted $X = [x_1, x_2, \ldots, x_t, \ldots, x_n]$, $x_t \in X$.
Further, the specific implementation of step 3 comprises the following substeps:
Step 3.1: the feature mapping formula is
$$y_t = \sum_{l=-L}^{L} A_l\,x_{t+l} + b = W\,\hat{x}_t,\qquad \hat{x}_t = \left[x_{t-L}^{\top},\ldots,x_t^{\top},\ldots,x_{t+L}^{\top},\;1\right]^{\top},\qquad W = \left[A_{-L},\ldots,A_L,\;b\right],$$
where $x_t$ denotes the $t$-th frame of the input features, $y_t$ the $t$-th frame of the output features, $A$ the gain matrix, $b$ the offset matrix, and $W$ the parameter of the mapping function. $A$ is a sequence of $2L+1$ matrices, $L$ is a non-negative integer, and each matrix has the same dimension as one frame of the input features; the input to the mapping is a vector composed of $2L+1$ frames. $W$ therefore consists of $2L+1$ matrices and a column vector, and $\hat{x}_t$ is a column vector composed of $2L+1$ frames and a 1;
Step 3.2: let $L = 1$; set $A_0$ in $W$ to the identity matrix and $A_{-1}$, $A_1$ to zero matrices, and construct the initialized mapped features $y_t$ in turn; the initialized mapped features are essentially the complex features;
Step 3.3: establish GMM models of the initialized features and of the clean features.
Further, the specific implementation of step 4 comprises the following substeps:
Step 4.1: the Bhattacharyya distance between the two GMMs is written according to the Bhattacharyya distance formula
$$D_B(g_a, g_b) = -\ln \int \sqrt{g_a(y)\,g_b(y)}\;dy,$$
where $g_a$ and $g_b$ are the clean-feature GMM and the mapped-feature GMM, and $N_i^a$ and $N_i^b$ denote their $i$-th Gaussian components, respectively;
Step 4.2: the formula of step 4.1 is converted according to the Gaussian density to construct a loss function $F_c(W)$, where $\|W\|_F$ is the Frobenius norm of the matrix $W$, $\beta$ and $\lambda$ are two adjustable parameters that control the influence of the Frobenius-norm term and of the GMM distribution term, $y$ is the input feature, i.e., the newly obtained mapped feature at each iteration, $T$ is the number of frames of the input feature, $y_t$ is its $t$-th frame, and the further quantity $y_t(i)$ is expressed in terms of the $M$ Gaussians of the GMM, where $M$ is the number of Gaussians in the GMM and $\omega_i$ is the weight of the $i$-th Gaussian;
Step 4.3: the EM algorithm is introduced to iterate the Bhattacharyya distance of step 4.2 to its minimum, and the parameter $W$ at which the minimum is attained is computed;
Step 4.4: the parameter $W$ is substituted into the feature mapping formula of step 3.1 to obtain the mapped feature $y$.
Further, in step 5, a model, usually an HMM-GMM model, is trained on the clean features, and model matching is performed with the mapped features obtained in step 4 to carry out speech recognition/speaker recognition.
It is a further object of the present invention to provide a computer program implementing the Bhattacharyya distance-based speech feature mapping method.
Another object of the present invention is to provide a speech recognition/speaker recognition system implementing the Bhattacharyya distance-based speech feature mapping method.
It is another object of the present invention to provide a computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the Bhattacharyya distance-based speech feature mapping method.
In summary, the advantages and positive effects of the invention are as follows:
The invention maps the complex features by minimizing the Bhattacharyya distance between the complex-feature GMM and the clean-feature GMM to obtain the mapped features, and performs pattern matching and recognition between the mapped features and the clean model. Replacing the complex features with the mapped features effectively improves speech recognition accuracy.
By mapping the speech features captured in a complex environment, the proposed mapping method improves the recognition accuracy of the speech recognition system and has stronger robustness.
The proposed feature mapping method is based on the Bhattacharyya distance between the complex-feature and clean-feature distributions and incorporates prior information, namely the assumption that the distribution of the complex features conforms to a GMM; this prior gradually draws the complex-feature distribution toward the clean-feature distribution. The distributional characteristics of the features are thus exploited to good effect.
The invention provides a general algorithm for improving speech recognition accuracy.
Addressing the low speech recognition accuracy in complex environments, the invention proposes a feature mapping method based on the Bhattacharyya distance and compares it experimentally with the raw data without feature mapping and with two existing techniques. The experimental results show that both sentence accuracy and word accuracy are effectively improved.
Table 1: Comparison of the experimental results with the raw data

Method                                     Sentence accuracy    Word accuracy
Raw data (no feature mapping)              50%                  81.08%
Parameter-estimation feature mapping       62%                  85.14%
KL-divergence feature mapping              66%                  85.81%
Bhattacharyya-distance feature mapping     68%                  87.16%
Drawings
FIG. 1 is a flowchart of the Bhattacharyya distance-based speech feature mapping method provided by an embodiment of the present invention.
Fig. 2 is a schematic diagram of the Bhattacharyya distance-based speech feature mapping method provided by an embodiment of the present invention.
Fig. 3 is a flowchart of an experimental implementation provided in an embodiment of the present invention.
Fig. 4 is a flowchart of extracting Mel-frequency cepstral coefficient features MFCC from a signal obtained by preprocessing a speech signal in a clean environment according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here are merely illustrative of the invention and are not intended to limit it.
The Bhattacharyya distance-based speech feature mapping method provided by the embodiment of the invention first extracts features from a speech signal recorded in a complex environment and from a clean speech signal, respectively; second, it initializes the mapped features from the complex features via the feature mapping formula and establishes GMM models of the mapped features and of the clean features; then it iteratively estimates the minimum Bhattacharyya distance between the two GMM models using the EM (Expectation-Maximization) algorithm and obtains the final mapped features; finally, it performs pattern matching and recognition between the mapped features and the trained clean-environment speech signal model. By minimizing the Bhattacharyya distance between the complex-feature GMM and the clean-feature GMM, the obtained mapped features can be regarded approximately as clean-environment speech features, which greatly improves the purity of the speech features in a complex environment, improves recognition accuracy, and enhances the robustness of the speech recognition system.
Referring to figs. 1-2, the present invention provides a speech feature mapping method based on the Bhattacharyya distance, which includes the following steps:
step 1: extracting the characteristics of the voice in a clean environment;
the specific implementation comprises the following substeps:
step 1.1: preprocessing a voice signal obtained in a clean environment, wherein the preprocessing comprises pre-emphasis, framing and windowing;
step 1.2: the signal preprocessed in step 1.1 is extracted as Mel cepstral coefficient features MFCC (shown in fig. 4).
Step 2: extract the speech features in the complex environment.
The specific implementation comprises the following substeps:
Step 2.1: preprocess the speech signal obtained in a complex environment; the preprocessing comprises pre-emphasis, framing and windowing;
Step 2.2: extract Mel-frequency cepstral coefficient features from the signal preprocessed in step 2.1, denoted $X = [x_1, x_2, \ldots, x_t, \ldots, x_n]$, $x_t \in X$.
Step 3: obtain initialized mapped features from the complex features via the feature mapping formula, and establish a GMM model of the initialized mapped features and a GMM model of the clean-environment speech features.
The specific implementation comprises the following substeps:
Step 3.1: the feature mapping formula is determined as
$$y_t = \sum_{l=-L}^{L} A_l\,x_{t+l} + b = W\,\hat{x}_t,\qquad \hat{x}_t = \left[x_{t-L}^{\top},\ldots,x_t^{\top},\ldots,x_{t+L}^{\top},\;1\right]^{\top},\qquad W = \left[A_{-L},\ldots,A_L,\;b\right],$$
where $x_t$ denotes the $t$-th frame of the input features, $y_t$ the $t$-th frame of the output features, $A$ the gain matrix, $b$ the offset matrix, and $W$ the parameter of the mapping function. $A$ is a sequence of $2L+1$ matrices, $L$ is a non-negative integer, and each matrix has the same dimension as one frame of the input features; the input to the mapping is a vector composed of $2L+1$ frames. $W$ therefore consists of $2L+1$ matrices and a column vector, and $\hat{x}_t$ is a column vector composed of $2L+1$ frames and a 1.
Step 3.2: let $L = 1$; set $A_0$ in $W$ to the identity matrix and $A_{-1}$, $A_1$ to zero matrices, and construct the initialized mapped features $y_t$ in turn.
Step 3.3: establish GMM models of the initialized features and of the clean features.
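A minimal sketch of steps 3.1-3.3, assuming the mapping form $y_t = W\hat{x}_t$ reconstructed above; scikit-learn's GaussianMixture serves as a stand-in for the patent's GMM training, and the mixture size (8 diagonal-covariance components) and random stand-in data are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def context_expand(X, L=1):
    """Build x̂_t: 2L+1 stacked neighbouring frames plus a trailing 1 (step 3.1)."""
    n, d = X.shape
    Xp = np.pad(X, ((L, L), (0, 0)), mode='edge')   # replicate boundary frames
    return np.asarray([np.hstack([Xp[t + l] for l in range(2 * L + 1)] + [[1.0]])
                       for t in range(n)])          # shape: (n, d*(2L+1)+1)

def init_W(d, L=1):
    """W = [A_-L ... A_0 ... A_L  b] with A_0 = I, the rest 0 (step 3.2)."""
    W = np.zeros((d, d * (2 * L + 1) + 1))
    W[:, d * L:d * (L + 1)] = np.eye(d)             # identity in the A_0 slot
    return W

d = 13
X = np.random.randn(200, d)     # stand-in for complex-environment MFCCs
C = np.random.randn(200, d)     # stand-in for clean MFCCs

Xhat = context_expand(X, L=1)
W = init_W(d, L=1)
Y0 = Xhat @ W.T                 # initialized mapped features: Y0 equals X

# Step 3.3: GMMs of the initialized mapped features and of the clean features
gmm_map = GaussianMixture(n_components=8, covariance_type='diag').fit(Y0)
gmm_clean = GaussianMixture(n_components=8, covariance_type='diag').fit(C)
```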
Step 4: introduce the Bhattacharyya distance, and obtain the mapped features by minimizing the Bhattacharyya distance between the two GMMs.
The specific implementation comprises the following substeps:
Step 4.1: the Bhattacharyya distance between the two GMMs is written according to the Bhattacharyya distance formula
$$D_B(g_a, g_b) = -\ln \int \sqrt{g_a(y)\,g_b(y)}\;dy,$$
where $g_a$ and $g_b$ are the clean-feature GMM and the mapped-feature GMM, and $N_i^a$ and $N_i^b$ denote their $i$-th Gaussian components, respectively;
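Between two full GMMs this integral has no closed form (the patent converts it into the loss $F_c$ in step 4.2 below). Purely as an illustration of the quantity being minimized, it can be estimated by Monte Carlo with importance sampling under $g_a$; the sketch assumes scikit-learn GaussianMixture objects like those fitted in step 3.3:

```python
import numpy as np
from scipy.special import logsumexp

def bhattacharyya_gmm_mc(gmm_a, gmm_b, n_samples=20000):
    """Estimate D_B(g_a, g_b) = -ln E_{y~g_a}[ sqrt(g_b(y)/g_a(y)) ]."""
    y, _ = gmm_a.sample(n_samples)
    log_a = gmm_a.score_samples(y)        # log g_a(y) at the samples
    log_b = gmm_b.score_samples(y)        # log g_b(y) at the samples
    # Numerically stable log-mean-exp of 0.5*(log_b - log_a)
    log_bc = logsumexp(0.5 * (log_b - log_a)) - np.log(n_samples)
    return -log_bc                        # non-negative; 0 iff the mixtures coincide

# d_b = bhattacharyya_gmm_mc(gmm_clean, gmm_map)
```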
step 4.2: converting the formula in the step 4.1 according to a Gaussian formula to construct a loss function Fc
Wherein,is a second order standard norm of a Frobenius matrix (Frobenius matrix)beta and lambda are used to control the Frobenius normAnd two adjustable parameters of GMM model distribution influence degree, y is input characteristic, namely mapping characteristic newly obtained each time, T is frame number of input characteristic, y istI.e. the t-th frame of the input feature, another parameter yt(i) The expression of (a) is:
where M is the number of gaussians in the GMM, ωiA weight of the ith Gaussian;
step 4.3: according to the formula:
there is a parameter W such that the loss function Fc(W) minimum value is obtained, and the parameter W is the parameter W to be obtained, so F in step 4.2 is obtainedcA minimum value of (W);
step 4.4: the auxiliary function is introduced by means of the EM algorithm:
the value of W in the auxiliary function,respectively representing a newly estimated parameter W and a current parameter W; m and T respectively represent the Gaussian number in the GMM and the frame number of the input feature y;andrespectively representing the mean values of ith components of the GMM models a and b;andrepresenting the diagonal covariance matrices derived from the variance in the mth gaussians of GMM models a and b, respectively, as: d represents the dimension, i.e. the number of rows, of the parameter W(d)Line d representing the parameter W; and the G and p parameters are expressed as follows:
step 4.5: the gradient function of the auxiliary function Q for the d-th line of the parameter W is found:
step 4.6: iterating using the L-BFGS algorithm in combination with the gradient function in step 4.5 to obtain the minimum value of the auxiliary function in step 4.4 and the value of the parameter W at which the minimum value exists;
step 4.7: assigning the W value obtained in step 4.6 toRepeating the steps 4.4, 4.5 and 4.6, and iterating the W until the difference between the newly obtained W and the W obtained in the last iteration is less than a fixed value or the iteration times reaches a fixed value;
step 4.8: and substituting the W value obtained in the step 4.7 into the characteristic mapping formula in the step 3.1 to obtain the mapping characteristic y.
Step 5: perform pattern matching and recognition between the mapped features and the trained clean-environment speech signal model.
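A minimal sketch of step 5 using hmmlearn's GMMHMM as a stand-in for the HTK HMM-GMM training used in the experiments; the topology (5 states, 3 mixtures per state) and the dictionary-of-sequences data layout are assumptions:

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_models(clean_feats_by_word, n_states=5, n_mix=3):
    """Train one HMM-GMM per word on clean MFCC sequences."""
    models = {}
    for word, seqs in clean_feats_by_word.items():
        X = np.vstack(seqs)                  # concatenated frames
        lengths = [len(s) for s in seqs]     # per-utterance frame counts
        m = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type='diag', n_iter=20)
        m.fit(X, lengths)
        models[word] = m
    return models

def recognize(models, mapped_feats):
    """Pick the word whose clean-trained model best explains the mapped features."""
    return max(models, key=lambda w: models[w].score(mapped_feats))
```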
The invention is further described below in conjunction with specific examples/experiments/simulation analyses.
The effectiveness of the invention is demonstrated through experiments. The experiments use data from the standard Aurora2 database. MATLAB simulation software, the HTK speech toolkit and the voicebox toolkit were used to carry out the experiments. The software and hardware environment of the experiments was as follows:
Hardware environment:
Intel Core i3-4150 CPU, 3.5 GHz, dual core
8 GB RAM
Dell E1916HV monitor
Software environment:
MATLAB 2014a
voicebox
HTK 3.5
Aurora2
In the experiments, MFCC (Mel-frequency cepstral coefficient) features are first extracted from the training and test data in Aurora2 with the HTK (Hidden Markov Model Toolkit) speech toolkit to obtain the reference features and the test features; then MATLAB scripts calling the voicebox toolkit build a GMM model for the reference features; further scripts build the test-feature GMM model and obtain the mapped test-feature MFCCs by minimizing the Bhattacharyya distance between the test-feature GMM and the reference-feature GMM; finally, the mapped features are recognized with HTK to obtain the recognition rate. The recognition-rate output obtained under HTK and the experimental implementation flow are shown in fig. 3.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When software is used wholly or partially, the implementation may take the form of a computer program product comprising one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server or data center to another by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center containing one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A Bhattacharyya distance-based speech feature mapping method, characterized in that the method comprises the following steps:
first, extracting features of a speech signal recorded in a complex environment and of a clean speech signal, respectively; second, initializing the mapped features from the complex features via a feature mapping formula, and establishing GMM models of the mapped features and of the clean features, respectively;
then, iteratively estimating the minimum Bhattacharyya distance between the two GMM models using the EM (Expectation-Maximization) algorithm, and obtaining the final mapped features;
finally, performing pattern matching and recognition between the mapped features and the trained clean-environment speech signal model.
2. The Bhattacharyya distance-based speech feature mapping method according to claim 1, characterized in that the method specifically comprises:
Step 1: extracting the speech features in a clean environment;
Step 2: extracting the speech features in a complex environment;
Step 3: obtaining initialized mapped features from the complex features via a feature mapping formula, and establishing a GMM model of the initialized mapped features and a GMM model of the clean-environment speech features;
Step 4: introducing the Bhattacharyya distance, and obtaining the mapped features by minimizing the Bhattacharyya distance between the two GMMs;
Step 5: performing pattern matching and recognition between the mapped features and the trained clean-environment speech signal model.
3. The Bhattacharyya distance-based speech feature mapping method according to claim 2, characterized in that step 1 comprises the following steps:
Step 1.1: preprocessing the speech signal obtained in a clean environment, wherein the preprocessing comprises pre-emphasis, framing and windowing;
Step 1.2: extracting Mel-frequency cepstral coefficient (MFCC) features from the signal preprocessed in step 1.1.
4. The Bhattacharyya distance-based speech feature mapping method according to claim 2, characterized in that step 2 specifically comprises the following steps:
Step 2.1: preprocessing the speech signal obtained in a complex environment, wherein the preprocessing comprises pre-emphasis, framing and windowing;
Step 2.2: extracting Mel-frequency cepstral coefficient features from the signal preprocessed in step 2.1, denoted $X = [x_1, x_2, \ldots, x_t, \ldots, x_n]$, $x_t \in X$.
5. The Bhattacharyya distance-based speech feature mapping method according to claim 2, characterized in that step 3 specifically comprises the following steps:
Step 3.1: obtaining the initialized mapped features from the complex features via the feature mapping formula
$$y_t = \sum_{l=-L}^{L} A_l\,x_{t+l} + b = W\,\hat{x}_t,\qquad \hat{x}_t = \left[x_{t-L}^{\top},\ldots,x_t^{\top},\ldots,x_{t+L}^{\top},\;1\right]^{\top},\qquad W = \left[A_{-L},\ldots,A_L,\;b\right],$$
wherein $x_t$ denotes the $t$-th frame of the input features, $y_t$ the $t$-th frame of the output features, $A$ the gain matrix, $b$ the offset matrix, and $W$ the parameter of the mapping function; $A$ is a sequence of $2L+1$ matrices, $L$ is a non-negative integer, and each matrix has the same dimension as one frame of the input features; the input to the mapping is a vector composed of $2L+1$ frames; $W$ consists of $2L+1$ matrices and a column vector, and $\hat{x}_t$ is a column vector composed of $2L+1$ frames and a 1;
Step 3.2: letting $L = 1$, setting $A_0$ in $W$ to the identity matrix and $A_{-1}$, $A_1$ to zero matrices, and constructing the initialized mapped features $y_t$ in turn, the initialized mapped features being the complex features;
Step 3.3: processing the speech signal and establishing GMM models of the initialized MFCC features and of the clean features.
6. The Bhattacharyya distance-based speech feature mapping method according to claim 2, wherein step 4 specifically comprises the following steps:
Step 4.1: expressing the Bhattacharyya distance between the two GMMs according to the Bhattacharyya distance formula
$$D_B(g_a, g_b) = -\ln \int \sqrt{g_a(y)\,g_b(y)}\;dy,$$
wherein $N_i^a$ and $N_i^b$ denote the $i$-th Gaussian components of the clean-feature GMM $g_a$ and the mapped-feature GMM $g_b$, respectively;
Step 4.2: converting the formula of step 4.1 according to the Gaussian density to construct a loss function $F_c(W)$, wherein $\|W\|_F$ is the Frobenius norm of the matrix $W$, $\beta$ and $\lambda$ are two adjustable parameters that control the influence of the Frobenius-norm term and of the GMM distribution term, $y$ is the input feature, i.e., the newly obtained mapped feature at each iteration, $T$ is the number of frames of the input feature, $y_t$ is its $t$-th frame, and the further quantity $y_t(i)$ is expressed in terms of the $M$ Gaussians of the GMM, $M$ being the number of Gaussians in the GMM and $\omega_i$ the weight of the $i$-th Gaussian;
Step 4.3: introducing the EM algorithm to iterate the Bhattacharyya distance of step 4.2 to its minimum, and computing the parameter $W$ at which the minimum is attained;
Step 4.4: substituting the parameter $W$ into the feature mapping formula of step 3.1 to obtain the mapped feature $y$.
7. The Bhattacharyya distance-based speech feature mapping method according to claim 2, wherein step 5 specifically comprises: training an HMM-GMM model on the clean features, and performing model matching between the mapped features obtained in step 4 and the model to carry out speech recognition/speaker recognition.
8. A computer program implementing the Bhattacharyya distance-based speech feature mapping method according to any one of claims 1 to 7.
9. A speech recognition/speaker recognition system implementing the Bhattacharyya distance-based speech feature mapping method according to any one of claims 1 to 7.
10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the Bhattacharyya distance-based speech feature mapping method according to any one of claims 1 to 7.
CN201810572146.1A 2018-06-06 2018-06-06 Speech feature mapping method and system based on Bhattacharyya distance Active CN108766430B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810572146.1A CN108766430B (en) 2018-06-06 2018-06-06 Speech feature mapping method and system based on Bhattacharyya distance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810572146.1A CN108766430B (en) 2018-06-06 2018-06-06 Speech feature mapping method and system based on Bhattacharyya distance

Publications (2)

Publication Number Publication Date
CN108766430A true CN108766430A (en) 2018-11-06
CN108766430B CN108766430B (en) 2020-08-04

Family

ID=63999822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810572146.1A Active CN108766430B (en) 2018-06-06 2018-06-06 Speech feature mapping method and system based on Bhattacharyya distance

Country Status (1)

Country Link
CN (1) CN108766430B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7529666B1 (en) * 2000-10-30 2009-05-05 International Business Machines Corporation Minimum bayes error feature selection in speech recognition
CN1521729A (en) * 2003-01-21 2004-08-18 Method of speech recognition using hidden trajectory hidden markov models
CN104900232A (en) * 2015-04-20 2015-09-09 东南大学 Isolation word identification method based on double-layer GMM structure and VTS feature compensation
US20170263269A1 (en) * 2016-03-08 2017-09-14 International Business Machines Corporation Multi-pass speech activity detection strategy to improve automatic speech recognition
CN106782520A (en) * 2017-03-14 2017-05-31 华中师范大学 Phonetic feature mapping method under a kind of complex environment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DUC HOANG HA NGUYEN: "Feature Adaptation Using Linear Spectro-Temporal Transform for Robust Speech Recognition", IEEE/ACM Transactions on Audio, Speech, and Language Processing *
TAN Ping, XING Yujuan, GAO Xiang: "Research and Analysis of Speaker Model Clustering Algorithms", China Building Materials Science & Technology *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816187A (en) * 2020-07-03 2020-10-23 中国人民解放军空军预警学院 Deep neural network-based voice feature mapping method in complex environment

Also Published As

Publication number Publication date
CN108766430B (en) 2020-08-04

Similar Documents

Publication Publication Date Title
JP7177167B2 (en) Mixed speech identification method, apparatus and computer program
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN107564513B (en) Voice recognition method and device
CN109360572B (en) Call separation method and device, computer equipment and storage medium
WO2019237519A1 (en) General vector training method, voice clustering method, apparatus, device and medium
CN112053695A (en) Voiceprint recognition method and device, electronic equipment and storage medium
TWI475558B (en) Method and apparatus for utterance verification
JP5932869B2 (en) N-gram language model unsupervised learning method, learning apparatus, and learning program
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
CN107093422B (en) Voice recognition method and voice recognition system
JP7124427B2 (en) Multi-view vector processing method and apparatus
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
WO2019232826A1 (en) I-vector extraction method, speaker recognition method and apparatus, device, and medium
CN109192200A (en) A kind of audio recognition method
Sivaraman et al. Personalized speech enhancement through self-supervised data augmentation and purification
CN112053694A (en) Voiceprint recognition method based on CNN and GRU network fusion
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN111161713A (en) Voice gender identification method and device and computing equipment
CN113178192A (en) Training method, device and equipment of speech recognition model and storage medium
CN114203154A (en) Training method and device of voice style migration model and voice style migration method and device
CN112002307A (en) Voice recognition method and device
CN113421573B (en) Identity recognition model training method, identity recognition method and device
KR100897555B1 (en) Apparatus and method of extracting speech feature vectors and speech recognition system and method employing the same
CN112133293A (en) Phrase voice sample compensation method based on generation countermeasure network and storage medium
CN112017676A (en) Audio processing method, apparatus and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant