CN107919115B - Characteristic compensation method based on nonlinear spectral transformation - Google Patents
- Publication number
- CN107919115B (application CN201711112747.6A)
- Authority
- CN
- China
- Prior art keywords
- voice
- nonlinear
- output probability
- mfcc
- gmm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
The invention discloses a feature compensation method based on nonlinear spectral transformation. In the training stage, a Gaussian mixture model (GMM) is trained on standard speech from a large number of speakers. In the testing stage, the magnitude spectrum of each frame of the target speaker's speech undergoes nonlinear frequency transformation under a range of transformation parameters so as to maximize the output probability of the GMM, and the Mel-frequency cepstral coefficients (MFCC) yielding the maximum output probability are taken as the compensated target speech feature parameters. The invention matches the target speaker's speech features to a pre-trained acoustic model, reduces the impact of speaker mismatch on the speech recognition system, and offers good real-time performance and independence from the back-end recognizer.
Description
Technical Field
The invention belongs to the field of speech recognition, and in particular relates to a nonlinear feature compensation method that applies nonlinear frequency transformation to the magnitude spectrum of each frame of the target speaker's speech so that its spectral characteristics match a pre-trained acoustic model.
Background
In speech recognition systems, the Hidden Markov Model (HMM) acoustic model of each speech unit is typically trained on speech from a large number of speakers. This covers the pronunciation characteristics of many speakers, but it also degrades the system's recognition performance for a particular speaker or class of speakers. Moreover, most speakers deviate from standard pronunciation, some with heavy accents. In practical applications it is therefore necessary to compensate either the target speaker's speech features or the acoustic model parameters, reducing the influence of the mismatch and improving recognition performance.
Speaker adaptation is a commonly used robust speech recognition method: based on a small amount of the target speaker's speech in the test environment, it adjusts the parameters of a pre-trained acoustic model to match the current speaker's pronunciation characteristics. In general, large-vocabulary speech recognition systems have many speech units and very limited adaptation data, so most Gaussian units of the acoustic model lack sufficient data to estimate their means and variances. It is therefore usually assumed that all Gaussian units, or each class of them, share the same linear transformation, and their data are pooled to estimate a single transformation matrix applied to all Gaussian units within the class. This reduces the accuracy of speaker adaptation and leaves system performance far from that of an ideal system trained on a large amount of target-speaker data. In addition, speaker adaptation transforms every Gaussian unit of the acoustic model, which involves complex matrix operations and high computational cost.
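For contrast, the class-based linear transformation that such adaptation estimates amounts to the sketch below; the single shared matrix, the function name, and the toy dimensions are illustrative assumptions, not part of the patent:

```python
import numpy as np

def adapt_means(means, A, b):
    """Apply one shared linear transform mu' = A @ mu + b to every
    Gaussian mean in a class (the pooled estimation the text describes)."""
    return means @ A.T + b

# Toy example: 3 Gaussian means of dimension 2, one shared transform.
means = np.arange(6.0).reshape(3, 2)
A = np.eye(2) * 2
b = np.array([1.0, -1.0])
adapted = adapt_means(means, A, b)
```

Because every mean in the class is moved by the same (A, b), speakers whose deviation is not well modeled by one linear map are adapted poorly, which is the accuracy limitation the paragraph above points out.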
Disclosure of Invention
Purpose of the invention: to address the problems in the prior art, the invention provides a feature compensation method based on nonlinear spectral transformation. In this method, a Gaussian mixture model (GMM) is trained on standard speech from a large number of speakers; in the testing stage, the magnitude spectrum of each frame of the target speaker's speech undergoes nonlinear frequency transformation under a range of transformation parameters so as to maximize the output probability of the GMM, and the Mel-frequency cepstral coefficients (MFCC) yielding the maximum output probability are taken as the compensated target speech feature parameters.
The method comprises the following specific steps:
(1) extract standard MFCCs from the training speech of a large number of speakers and train a Gaussian mixture model;
(2) window and frame the target speaker's speech and apply the Fast Fourier Transform (FFT) to obtain the magnitude spectrum of each frame;
(3) apply frequency transformation to the magnitude spectrum of each frame;
(4) apply Mel filtering to the transformed magnitude spectrum, take the logarithm, and apply the Discrete Cosine Transform (DCT) to obtain the nonlinearly frequency-transformed MFCC;
(5) acoustically decode the transformed MFCC with the GMM and record the output probability;
(6) change the frequency transformation parameter and repeat steps (3) to (5);
(7) compare the output probabilities for all frequency transformation parameters and select the MFCC whose parameter gives the maximum output probability as the compensated target speech feature parameter.
Drawings
Fig. 1 is a general framework diagram of a nonlinear spectral transform-based feature compensation system, which mainly includes FFT, frequency transform, Mel filtering, logarithm taking, DCT and GMM decoding modules.
Detailed Description
The present invention is further illustrated by the following examples. These examples are purely illustrative and do not limit the scope of the invention; various equivalent modifications that occur to those skilled in the art upon reading this disclosure fall within the scope of the appended claims.
As shown in fig. 1, the characteristic compensation method based on nonlinear spectral transformation mainly includes FFT, frequency transformation, Mel filtering, logarithm taking, DCT and GMM decoding modules. The specific embodiments of the various main modules in the drawings are described in detail below:
1. model training
Standard MFCCs are extracted from the training speech of a large number of speakers and used to train a Gaussian mixture model.
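As an illustrative sketch only (the patent does not prescribe a training algorithm or toolkit), a diagonal-covariance GMM of the kind used here can be trained with a few iterations of EM; the function name, component count, and toy data below are assumptions:

```python
import numpy as np

def train_gmm(X, n_components=4, n_iter=50, seed=0):
    """Minimal EM for a diagonal-covariance GMM (illustrative stand-in
    for full-scale training on many speakers' MFCCs)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, n_components, replace=False)]  # init on samples
    variances = np.ones((n_components, d))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: per-component weighted log densities, shape (n, K).
        log_p = (np.log(weights)
                 - 0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                          + (((X[:, None, :] - means) ** 2) / variances).sum(axis=2)))
        resp = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True))
        # M-step: responsibility-weighted parameter updates.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ X ** 2) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances
```

A production system would train on MFCC vectors from many hours of standard speech; the EM loop above only shows the shape of the model being fit.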
2. FFT
The target speaker's input speech is windowed, framed, and transformed with the FFT to obtain the magnitude spectrum of each frame.
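A minimal sketch of this step; the Hamming window, 256-sample frames, and 50% hop are illustrative choices the patent does not fix:

```python
import numpy as np

def magnitude_spectrum(signal, frame_len=256, hop=128, n_fft=256):
    """Frame the signal with a Hamming window and take |FFT| per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * np.hamming(frame_len)
                       for i in range(n_frames)])
    # Keep the non-negative-frequency half of the spectrum (N/2 + 1 bins).
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
```

For example, a 1 kHz tone sampled at 8 kHz should peak at bin 32 of a 256-point FFT (1000 / 8000 × 256 = 32).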
3. Frequency transformation
Generally, different speakers produce the same speech unit with different formant positions; transforming the frequency axis with a bilinear transformation brings the target speaker's formant structure closer to the standard spectrum of the training speech. Let k denote the digital frequency variable of the original target speech magnitude spectrum; the new digital frequency variable l is obtained by

l = round( (N / 2π) · ( ω_k + 2 · arctan( (a · sin ω_k) / (1 − a · cos ω_k) ) ) ),  ω_k = 2πk / N,

where a is the frequency transformation parameter, round() is the rounding function, and N is the length (number of points) of the discrete Fourier transform.
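The digital frequency mapping k -> l described above can be sketched as follows; applying the map as a source-bin lookup (rather than interpolating) is an assumption, since the patent does not specify that detail:

```python
import numpy as np

def warp_indices(N, a):
    """Bilinear frequency warping of digital frequency index k -> l
    (the standard all-pass warp consistent with the symbols a, round(), N)."""
    k = np.arange(N // 2 + 1)
    omega = 2 * np.pi * k / N
    omega_w = omega + 2 * np.arctan2(a * np.sin(omega), 1 - a * np.cos(omega))
    return np.clip(np.round(N * omega_w / (2 * np.pi)).astype(int), 0, N // 2)

def warp_spectrum(mag, a):
    """Warp a (frames x bins) magnitude half-spectrum by source-bin lookup."""
    N = 2 * (mag.shape[1] - 1)
    return mag[:, warp_indices(N, a)]
```

For a = 0 the map is the identity, and for |a| < 1 it is monotone, so the warped spectrum preserves the ordering of frequencies while compressing or stretching the formant region.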
4. Feature extraction
Mel filtering is applied to the transformed magnitude spectrum, followed by taking the logarithm and applying the DCT, yielding the nonlinearly frequency-transformed MFCC.
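The Mel-filter / log / DCT chain of this step might look as follows; the filter count (26), the 8 kHz sampling rate, and 13 cepstral coefficients are illustrative assumptions:

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=256, fs=8000):
    """Triangular Mel filterbank over the FFT half-spectrum."""
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_from_mag(mag, n_ceps=13, fs=8000):
    """Mel filtering -> log -> DCT-II, as in step (4) of the method."""
    fb = mel_filterbank(n_fft=2 * (mag.shape[1] - 1), fs=fs)
    logmel = np.log(mag @ fb.T + 1e-10)
    n = fb.shape[0]
    # DCT-II matrix built explicitly (avoids a scipy dependency).
    dct = np.cos(np.pi / n * np.outer(np.arange(n), np.arange(n) + 0.5))
    return logmel @ dct[:n_ceps].T
```

Running this on the warped magnitude spectrum of each frame gives the candidate MFCC vectors that are scored by the GMM in the next step.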
5. GMM decoding
The nonlinearly frequency-transformed MFCC is acoustically decoded with the pre-trained GMM and the output probability is recorded. The frequency transformation parameter a takes several equally spaced values in the interval [-1, 1]; for each value of a, frequency transformation, feature extraction, and GMM decoding are performed and the output probability is recorded. Once all values of a have been processed, the output probabilities are compared and the MFCC corresponding to the value of a with the maximum output probability is selected as the compensated target speech feature parameter.
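Putting the pieces together, the parameter search of this step can be sketched end to end; the feature function, GMM parameters, and grid of a values below are placeholders (the patent samples a at equal intervals in [-1, 1]; a narrower illustrative grid is used here):

```python
import numpy as np

def warp(mag, a):
    """Bilinear warp of a (frames x bins) magnitude half-spectrum."""
    N = 2 * (mag.shape[1] - 1)
    w = 2 * np.pi * np.arange(mag.shape[1]) / N
    l = np.round(N / (2 * np.pi)
                 * (w + 2 * np.arctan2(a * np.sin(w), 1 - a * np.cos(w)))).astype(int)
    return mag[:, np.clip(l, 0, mag.shape[1] - 1)]

def diag_gmm_score(X, weights, means, variances):
    """Total log output probability of X under a diagonal-covariance GMM."""
    lp = (np.log(weights)
          - 0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                   + (((X[:, None, :] - means) ** 2) / variances).sum(axis=2)))
    return np.logaddexp.reduce(lp, axis=1).sum()

def compensate(mag, feat_fn, gmm, a_grid=np.linspace(-0.4, 0.4, 9)):
    """Search over warp parameters a; keep the features that maximize
    the GMM output probability (steps (3)-(7) of the method)."""
    score, a_best = max((diag_gmm_score(feat_fn(warp(mag, a)), *gmm), a)
                        for a in a_grid)
    return feat_fn(warp(mag, a_best)), a_best
```

In use, `feat_fn` would be a full MFCC extractor and `gmm` the model trained on standard speech; the search itself is just an argmax over a one-dimensional grid, which is what gives the method its low computational cost relative to transforming every Gaussian unit of the acoustic model.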
Claims (4)
1. A feature compensation method based on nonlinear spectral transformation, characterized in that: a Gaussian mixture model (GMM) is trained on standard speech from a large number of speakers; in the testing stage, the magnitude spectrum of each frame of the target speaker's speech undergoes nonlinear frequency transformation under a range of transformation parameters so as to maximize the output probability of the GMM, and the Mel-frequency cepstral coefficients (MFCC) yielding the maximum output probability are taken as the compensated target speech feature parameters.
2. The feature compensation method based on nonlinear spectral transformation according to claim 1, specifically comprising:
(1) extracting standard MFCC from training voices of a large number of speakers, and training to generate a Gaussian mixture model;
(2) windowing the voice of a target speaker, framing, and performing Fast Fourier Transform (FFT) to obtain the amplitude spectrum of each frame of voice signal;
(3) carrying out frequency conversion on the amplitude spectrum of each frame of voice signal;
(4) performing Mel filtering on the transformed magnitude spectrum, taking logarithm, and performing Discrete Cosine Transform (DCT) to obtain MFCC after nonlinear frequency transformation;
(5) performing acoustic decoding on the MFCC subjected to the nonlinear frequency conversion by using the GMM, and recording the output probability;
(6) replacing the frequency conversion parameters, and repeating the steps (3) to (5);
(7) and comparing the output probability corresponding to each frequency conversion parameter, and selecting the MFCC corresponding to the conversion parameter with the maximum output probability as the compensated target voice characteristic parameter.
3. The feature compensation method based on nonlinear spectral transformation according to claim 2, characterized in that: the nonlinear transformation of the digital frequency is performed by

l = round( (N / 2π) · ( ω_k + 2 · arctan( (a · sin ω_k) / (1 − a · cos ω_k) ) ) ),  ω_k = 2πk / N,

wherein k and l respectively denote the digital frequency of the speech magnitude spectrum before and after transformation, a is the frequency transformation parameter, round() is the rounding function, and N is the length of the discrete Fourier transform.
4. The feature compensation method based on nonlinear spectral transformation according to claim 2, characterized in that: the nonlinearly frequency-transformed MFCC is acoustically decoded with the pre-trained GMM and the output probability recorded; the frequency transformation parameter a takes several equally spaced values in the interval [-1, 1], and for each value of a, frequency transformation, feature extraction, and GMM decoding are performed and the output probability recorded; after all values of a have been processed, the output probabilities are compared and the MFCC corresponding to the value of a with the maximum output probability is selected as the compensated target speech feature parameter.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711112747.6A CN107919115B (en) | 2017-11-13 | 2017-11-13 | Characteristic compensation method based on nonlinear spectral transformation |
Publications (2)
Publication Number | Publication Date
---|---
CN107919115A | 2018-04-17
CN107919115B | 2021-07-27
Family
ID=61896268
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711112747.6A Active CN107919115B (en) | 2017-11-13 | 2017-11-13 | Characteristic compensation method based on nonlinear spectral transformation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107919115B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108877784B (en) * | 2018-09-05 | 2022-12-06 | 河海大学 | Robust speech recognition method based on accent recognition |
CN108986794B (en) * | 2018-09-19 | 2023-02-28 | 河海大学 | Speaker compensation method based on power function frequency transformation |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102324232A (en) * | 2011-09-12 | 2012-01-18 | 辽宁工业大学 | Method for recognizing sound-groove and system based on gauss hybrid models |
US8160877B1 (en) * | 2009-08-06 | 2012-04-17 | Narus, Inc. | Hierarchical real-time speaker recognition for biometric VoIP verification and targeting |
KR20120130371A (en) * | 2011-05-23 | 2012-12-03 | 수원대학교산학협력단 | Method for recogning emergency speech using gmm |
CN107103914A (en) * | 2017-03-20 | 2017-08-29 | 南京邮电大学 | A kind of high-quality phonetics transfer method |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7165028B2 (en) * | 2001-12-12 | 2007-01-16 | Texas Instruments Incorporated | Method of speech recognition resistant to convolutive distortion and additive distortion |
JP2010078650A (en) * | 2008-09-24 | 2010-04-08 | Toshiba Corp | Speech recognizer and method thereof |
KR101014321B1 (en) * | 2009-02-24 | 2011-02-14 | 한국전자통신연구원 | Method for emotion recognition based on Minimum Classification Error |
CN102664010B (en) * | 2012-05-04 | 2014-04-16 | 山东大学 | Robust speaker distinguishing method based on multifactor frequency displacement invariant feature |
CN103000174B (en) * | 2012-11-26 | 2015-06-24 | 河海大学 | Feature compensation method based on rapid noise estimation in speech recognition system |
- 2017-11-13: Application CN201711112747.6A filed (CN107919115B, Active)
Non-Patent Citations (5)
Title |
---|
"Cubic spline approximation of bilinear frequency unwarping"; Pramod H. Kachare et al.; 2015 International Conference on Industrial Instrumentation and Control (ICIC); 2015-12-31; full text *
"Parametric Voice Conversion Based on Bilinear Frequency Warping Plus Amplitude Scaling"; Daniel Erro et al.; IEEE Transactions on Audio, Speech, and Language Processing, Vol. 21, Issue 3, March 2013; 2013-03-31; full text *
"A nonlinear frequency scale transformation suitable for speaker recognition"; Yu Yibiao et al.; Acta Acustica; 2008-09-30; Sections 2.2, 3 and 4 *
"Controlling formants in voice conversion by spectral shifting"; Peng Bai et al.; Speech Technology; 2007-01-31; full text *
"Adaptive Gaussian mixture model and its application to speaker recognition"; Wang Yunqi et al.; Communications Technology; 2014-07-31; full text *
Also Published As
Publication number | Publication date |
---|---|
CN107919115A (en) | 2018-04-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8438026B2 (en) | Method and system for generating training data for an automatic speech recognizer | |
Boril et al. | Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments | |
CN108877784B (en) | Robust speech recognition method based on accent recognition | |
Aggarwal et al. | Performance evaluation of sequentially combined heterogeneous feature streams for Hindi speech recognition system | |
CN104392718A (en) | Robust voice recognition method based on acoustic model array | |
Chuang et al. | Speaker-Aware Deep Denoising Autoencoder with Embedded Speaker Identity for Speech Enhancement. | |
Ismail et al. | Mfcc-vq approach for qalqalahtajweed rule checking | |
Rajnoha et al. | ASR systems in noisy environment: Analysis and solutions for increasing noise robustness | |
Alam et al. | Robust feature extraction based on an asymmetric level-dependent auditory filterbank and a subband spectrum enhancement technique | |
Chauhan et al. | Speech to text converter using Gaussian Mixture Model (GMM) | |
CN107919115B (en) | Characteristic compensation method based on nonlinear spectral transformation | |
KR101236539B1 (en) | Apparatus and Method For Feature Compensation Using Weighted Auto-Regressive Moving Average Filter and Global Cepstral Mean and Variance Normalization | |
Hachkar et al. | A comparison of DHMM and DTW for isolated digits recognition system of Arabic language | |
Chavan et al. | Speech recognition in noisy environment, issues and challenges: A review | |
Alam et al. | Regularized minimum variance distortionless response-based cepstral features for robust continuous speech recognition | |
Kaur et al. | Power-Normalized Cepstral Coefficients (PNCC) for Punjabi automatic speech recognition using phone based modelling in HTK | |
Upadhyay et al. | Robust recognition of English speech in noisy environments using frequency warped signal processing | |
Akhter et al. | An analysis of performance evaluation metrics for voice conversion models | |
Xiao | Robust speech features and acoustic models for speech recognition | |
Wu et al. | An environment-compensated minimum classification error training approach based on stochastic vector mapping | |
CN108986794B (en) | Speaker compensation method based on power function frequency transformation | |
Singh et al. | A comparative study of recognition of speech using improved MFCC algorithms and Rasta filters | |
Dutta et al. | A comparison of three spectral features for phone recognition in sub-optimal environments | |
Kathania et al. | Experiments on children's speech recognition under acoustically mismatched conditions | |
Harshita et al. | Speech Recognition with Frequency Domain Linear Prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |