CN107919115B - Characteristic compensation method based on nonlinear spectral transformation - Google Patents

Info

Publication number: CN107919115B
Application number: CN201711112747.6A
Authority: CN (China)
Prior art keywords: voice, nonlinear, output probability, MFCC, GMM
Prior art date: 2017-11-13
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN107919115A
Inventor: 吕勇 (Lü Yong)
Current assignee: Hohai University (HHU)
Original assignee: Hohai University (HHU)
Application filed by Hohai University (HHU) on 2017-11-13; priority date 2017-11-13
Publication of CN107919115A: 2018-04-17
Application granted; publication of CN107919115B: 2021-07-27

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 Adaptation
    • G10L15/07 Adaptation to the speaker
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Techniques characterised by the type of extracted parameters
    • G10L25/24 Techniques in which the extracted parameters are the cepstrum

Abstract

The invention discloses a characteristic compensation method based on nonlinear spectral transformation. In the training stage, a Gaussian mixture model (GMM) is trained on standard speech from a large number of speakers. In the testing stage, a nonlinear frequency transformation is applied to the magnitude spectrum of each frame of the target speaker's speech with multiple candidate transformation parameters so as to maximize the output probability of the GMM, and the Mel-frequency cepstral coefficients (MFCC) at the maximum output probability are taken as the compensated target speech feature parameters. The invention matches the speech features of the target speaker to the pre-trained acoustic model, reduces the influence of speaker mismatch on the speech recognition system, and offers good real-time performance and independence from the back-end recognizer.

Description

Characteristic compensation method based on nonlinear spectral transformation
Technical Field
The invention belongs to the field of speech recognition, and particularly relates to a nonlinear feature compensation method that applies a nonlinear frequency transformation to the magnitude spectrum of each frame of the target speaker's speech so that its spectral features match a pre-trained acoustic model.
Background
In speech recognition systems, the Hidden Markov Model (HMM) acoustic model of each speech unit is typically trained on the speech of a large number of speakers. This covers the pronunciation characteristics of many speakers, but it also reduces the recognition performance of the system for any particular speaker or class of speakers. Moreover, most speakers deviate from standard pronunciation, some with heavy accents. In practical applications it is therefore necessary to compensate the speech features of the target speaker or the parameters of the acoustic model, so as to reduce the influence of this mismatch and improve the recognition performance of the speech recognition system.
Speaker adaptation is a commonly used robust speech recognition method: based on a small amount of speech from the target speaker in the test environment, it adjusts the parameters of a pre-trained acoustic model to match the current speaker's pronunciation characteristics. Large-vocabulary speech recognition systems, however, have many speech units and very limited adaptation data, so most Gaussian units of the acoustic model lack sufficient data to estimate their means and variances. It is therefore usually assumed that all of the Gaussian units, or each class of them, share the same linear transformation, and their data are pooled to estimate a single transformation matrix that is applied to every Gaussian unit in the class. This reduces the accuracy of speaker adaptation and leaves system performance far below that of an ideal system trained on large amounts of target-speaker data. In addition, speaker adaptation transforms all Gaussian units of the acoustic model, which involves complex matrix operations and has high computational complexity.
Disclosure of Invention
Purpose of the invention: to address the above problems in the prior art, the invention provides a characteristic compensation method based on nonlinear spectral transformation. In the training stage, a Gaussian mixture model (GMM) is trained on standard speech from a large number of speakers; in the testing stage, a nonlinear frequency transformation is applied to the magnitude spectrum of each frame of the target speaker's speech with multiple candidate transformation parameters so as to maximize the output probability of the GMM, and the Mel-frequency cepstral coefficients (MFCC) at the maximum output probability are taken as the compensated target speech feature parameters.
The method comprises the following specific steps:
(1) extracting standard MFCC from the training speech of a large number of speakers and training a Gaussian mixture model;
(2) windowing and framing the speech of the target speaker, and performing the fast Fourier transform (FFT) to obtain the magnitude spectrum of each frame of the speech signal;
(3) performing the frequency transformation on the magnitude spectrum of each frame of the speech signal;
(4) performing Mel filtering on the transformed magnitude spectrum, taking the logarithm, and performing the discrete cosine transform (DCT) to obtain the nonlinearly frequency-transformed MFCC;
(5) acoustically decoding the nonlinearly frequency-transformed MFCC with the GMM and recording the output probability;
(6) changing the frequency transformation parameter and repeating steps (3) to (5);
(7) comparing the output probabilities corresponding to the frequency transformation parameters and selecting the MFCC corresponding to the transformation parameter with the maximum output probability as the compensated target speech feature parameters.
Drawings
Fig. 1 is the overall framework of the feature compensation system based on nonlinear spectral transformation, which mainly comprises the FFT, frequency transformation, Mel filtering, logarithm, DCT and GMM decoding modules.
Detailed Description
The present invention is further illustrated by the following examples, which are intended only to illustrate the invention and not to limit its scope; after reading this disclosure, those skilled in the art may make various equivalent modifications to the invention, and such modifications fall within the scope of the appended claims.
As shown in Fig. 1, the characteristic compensation method based on nonlinear spectral transformation mainly comprises the FFT, frequency transformation, Mel filtering, logarithm, DCT and GMM decoding modules. Specific embodiments of each of the main modules are described in detail below:
1. model training
A Gaussian mixture model is generated by extracting standard MFCC from the training speech of a large number of speakers and training on the pooled features.
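By way of illustration only, a minimal training-stage sketch follows; the pooled feature matrix, the 64-component diagonal-covariance GMM, and the use of scikit-learn's GaussianMixture are assumptions of the sketch, since the invention does not prescribe a particular GMM size or toolkit.

```python
# Sketch of the training stage. Assumptions (not fixed by the invention):
# MFCC frames from many speakers are already pooled into one matrix, and a
# 64-component diagonal-covariance GMM is trained with scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(mfcc_frames: np.ndarray, n_components: int = 64) -> GaussianMixture:
    """Train a speaker-independent GMM on pooled standard MFCC frames.

    mfcc_frames: array of shape (num_frames, num_cepstra).
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(mfcc_frames)
    return gmm
```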
2. FFT
Windowing, framing and FFT are carried out on the input speech of the target speaker to obtain the magnitude spectrum of each frame of the speech signal.
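A possible front-end sketch is shown below; the 16 kHz sampling rate, 25 ms frames with a 10 ms hop, Hamming window, and 512-point FFT are common defaults assumed here, not values mandated by the invention.

```python
import numpy as np

def magnitude_spectra(signal: np.ndarray, n_fft: int = 512,
                      frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Window and frame the input speech, then take the magnitude of an
    N-point FFT of each frame. Defaults assume 16 kHz speech with 25 ms
    frames and a 10 ms hop (illustrative choices)."""
    assert len(signal) >= frame_len, "signal shorter than one frame"
    num_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(num_frames)])
    # Positive-frequency magnitude spectrum: shape (num_frames, n_fft//2 + 1)
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
```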
3. Frequency conversion
In general, different speakers produce different formants for the same speech unit. By warping the frequency axis with a bilinear transformation, the formant characteristics of the target speaker can be brought closer to the standard spectrum of the training speech. Let k be the digital frequency variable of the original target speech magnitude spectrum; the digital frequency transformation below yields the new digital frequency variable l:
l = round( (N/2π)·[ ω_k + 2·arctan( a·sin ω_k / (1 - a·cos ω_k) ) ] ),  ω_k = 2πk/N

where a is the frequency transformation parameter, round(·) is the rounding function, and N is the length (number of points) of the discrete Fourier transform.
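A sketch of this warping applied to the positive-frequency bins is given below. The mapping implements the bilinear warping formula above; applying it by simple bin lookup (each output bin reads the input at its warped index) is one convention among equivalents and is an assumption of the sketch, as is clipping to the valid bin range.

```python
import numpy as np

def warp_bins(n_fft: int, a: float) -> np.ndarray:
    """Bilinear warping of digital frequency: bin k -> bin l, following
    l = round((N/2pi) * [w_k + 2*arctan(a*sin(w_k) / (1 - a*cos(w_k)))])."""
    k = np.arange(n_fft // 2 + 1)
    omega = 2.0 * np.pi * k / n_fft
    warped = omega + 2.0 * np.arctan(a * np.sin(omega) / (1.0 - a * np.cos(omega)))
    l = np.round(warped * n_fft / (2.0 * np.pi)).astype(int)
    return np.clip(l, 0, n_fft // 2)  # keep indices in the valid range

def warp_spectrum(mag: np.ndarray, a: float) -> np.ndarray:
    """Resample magnitude spectra onto the warped frequency axis by bin
    lookup (a simple convention; interpolation would also be possible)."""
    n_fft = 2 * (mag.shape[-1] - 1)
    return mag[..., warp_bins(n_fft, a)]
```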
4. Feature extraction
Mel filtering, the logarithm, and the DCT are applied to the transformed magnitude spectrum to obtain the nonlinearly frequency-transformed MFCC.
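A compact sketch of this feature-extraction step follows; the 26-filter Mel bank, 13 cepstral coefficients, 16 kHz sampling rate, and the small logarithm floor are assumed defaults rather than values given by the invention.

```python
import numpy as np
from scipy.fft import dct

def mel_filterbank(n_mels: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular Mel filterbank over the positive-frequency FFT bins."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_from_mag(mag: np.ndarray, n_mels: int = 26,
                  n_ceps: int = 13, sr: int = 16000) -> np.ndarray:
    """Mel filtering -> logarithm -> DCT, yielding MFCC for each frame."""
    n_fft = 2 * (mag.shape[-1] - 1)
    fb = mel_filterbank(n_mels, n_fft, sr)
    log_mel = np.log(mag @ fb.T + 1e-10)  # small floor avoids log(0)
    return dct(log_mel, type=2, axis=-1, norm="ortho")[..., :n_ceps]
```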
5. GMM decoding
The nonlinearly frequency-transformed MFCC are acoustically decoded with the pre-trained GMM and the output probability is recorded. The frequency transformation parameter a takes several equally spaced values in the interval [-1, 1]; for each value of a, frequency transformation, feature extraction, and GMM decoding are performed and the output probability is recorded. After all values of a have been processed, the output probabilities corresponding to all values of a are compared, and the MFCC corresponding to the value of a with the maximum output probability are selected as the compensated target speech feature parameters.
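Tying the sketches above together, a hypothetical end-to-end search over a might look as follows. The 19-point grid that stops just short of a = ±1 (where the bilinear warp degenerates) and the utterance-level rather than frame-level comparison of log-probabilities are assumptions of this sketch; the patent itself samples [-1, 1] at equal intervals.

```python
import numpy as np

def compensate_features(signal: np.ndarray, gmm, n_candidates: int = 19):
    """Grid-search the warp parameter a, per steps (3)-(7): warp, extract
    MFCC, score with the GMM, and keep the MFCC with the highest score.
    Uses the helper sketches above (magnitude_spectra, warp_spectrum,
    mfcc_from_mag) and a pre-trained sklearn GaussianMixture `gmm`."""
    mag = magnitude_spectra(signal)
    best_a, best_score, best_mfcc = None, -np.inf, None
    # Equal-interval grid; endpoints a = +/-1 are excluded here because the
    # bilinear warp collapses the spectrum there (an assumption of the sketch).
    for a in np.linspace(-0.9, 0.9, n_candidates):
        mfcc = mfcc_from_mag(warp_spectrum(mag, a))
        score = gmm.score_samples(mfcc).sum()  # total frame log-likelihood
        if score > best_score:
            best_a, best_score, best_mfcc = a, score, mfcc
    return best_mfcc, best_a
```

The compensated MFCC returned here would then be handed to the back-end recognizer unchanged, consistent with the stated independence from the recognizer; returning best_a as well makes the chosen warp inspectable.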

Claims (4)

1. A characteristic compensation method based on nonlinear spectral transformation, characterized in that: in the training stage, a Gaussian mixture model (GMM) is trained on standard speech from a large number of speakers; in the testing stage, a nonlinear frequency transformation is applied to the magnitude spectrum of each frame of the target speaker's speech with multiple candidate transformation parameters so as to maximize the output probability of the GMM, and the Mel-frequency cepstral coefficients (MFCC) at the maximum output probability are taken as the compensated target speech feature parameters.
2. The characteristic compensation method based on nonlinear spectral transformation according to claim 1, specifically comprising:
(1) extracting standard MFCC from the training speech of a large number of speakers and training a Gaussian mixture model;
(2) windowing and framing the speech of the target speaker, and performing the fast Fourier transform (FFT) to obtain the magnitude spectrum of each frame of the speech signal;
(3) performing the frequency transformation on the magnitude spectrum of each frame of the speech signal;
(4) performing Mel filtering on the transformed magnitude spectrum, taking the logarithm, and performing the discrete cosine transform (DCT) to obtain the nonlinearly frequency-transformed MFCC;
(5) acoustically decoding the nonlinearly frequency-transformed MFCC with the GMM and recording the output probability;
(6) changing the frequency transformation parameter and repeating steps (3) to (5);
(7) comparing the output probabilities corresponding to the frequency transformation parameters and selecting the MFCC corresponding to the transformation parameter with the maximum output probability as the compensated target speech feature parameters.
3. The characteristic compensation method based on nonlinear spectral transformation according to claim 2, wherein the nonlinear transformation of the digital frequency is performed by:
l = round( (N/2π)·[ ω_k + 2·arctan( a·sin ω_k / (1 - a·cos ω_k) ) ] ),  ω_k = 2πk/N

wherein k and l respectively denote the digital frequency of the speech magnitude spectrum before and after the transformation, a is the frequency transformation parameter, round(·) is the rounding function, and N is the number of points of the discrete Fourier transform.
4. The characteristic compensation method based on nonlinear spectral transformation according to claim 2, wherein: the nonlinearly frequency-transformed MFCC are acoustically decoded with the pre-trained GMM and the output probability is recorded; the frequency transformation parameter a takes several equally spaced values in the interval [-1, 1]; frequency transformation, feature extraction and GMM decoding are performed for each value of a, and the output probability is recorded; after all values of a have been processed, the output probabilities corresponding to all values of a are compared, and the MFCC corresponding to the value of a with the maximum output probability is selected as the compensated target speech feature parameters.
CN201711112747.6A 2017-11-13 2017-11-13 Characteristic compensation method based on nonlinear spectral transformation Active CN107919115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711112747.6A CN107919115B (en) 2017-11-13 2017-11-13 Characteristic compensation method based on nonlinear spectral transformation

Publications (2)

Publication Number Publication Date
CN107919115A CN107919115A (en) 2018-04-17
CN107919115B 2021-07-27

Family

Family ID: 61896268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711112747.6A Active CN107919115B (en) 2017-11-13 2017-11-13 Characteristic compensation method based on nonlinear spectral transformation

Country Status (1)

Country Link
CN (1) CN107919115B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108877784B (en) * 2018-09-05 2022-12-06 河海大学 Robust speech recognition method based on accent recognition
CN108986794B (en) * 2018-09-19 2023-02-28 河海大学 Speaker compensation method based on power function frequency transformation

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US7165028B2 (en) * 2001-12-12 2007-01-16 Texas Instruments Incorporated Method of speech recognition resistant to convolutive distortion and additive distortion
JP2010078650A (en) * 2008-09-24 2010-04-08 Toshiba Corp Speech recognizer and method thereof
KR101014321B1 (en) * 2009-02-24 2011-02-14 한국전자통신연구원 Method for emotion recognition based on Minimum Classification Error
CN102664010B (en) * 2012-05-04 2014-04-16 山东大学 Robust speaker distinguishing method based on multifactor frequency displacement invariant feature
CN103000174B (en) * 2012-11-26 2015-06-24 河海大学 Feature compensation method based on rapid noise estimation in speech recognition system

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
US8160877B1 (en) * 2009-08-06 2012-04-17 Narus, Inc. Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
KR20120130371A (en) * 2011-05-23 2012-12-03 수원대학교산학협력단 Method for recogning emergency speech using gmm
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
CN107103914A (en) * 2017-03-20 2017-08-29 南京邮电大学 A kind of high-quality phonetics transfer method

Non-Patent Citations (5)

Title
"Cubic spline approximation of bilinear frequency unwarping"; Pramod H. Kachare et al.; 2015 International Conference on Industrial Instrumentation and Control (ICIC); 2015; entire document. *
"Parametric Voice Conversion Based on Bilinear Frequency Warping Plus Amplitude Scaling"; Daniel Erro et al.; IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 3; March 2013; entire document. *
"A nonlinear frequency scale transformation suitable for speaker recognition" (一种适于说话人识别的非线性频率尺度变换); Yu Yibiao (俞一彪) et al.; Acta Acustica (声学学报); September 2008; sections 2.2, 3 and 4. *
"Controlling formants in voice conversion by spectrum shifting" (利用频谱搬移控制语音转换中的共振峰); Peng Bai (彭柏) et al.; Speech Technology (语音技术); January 2007; entire document. *
"Adaptive Gaussian mixture model and its application to speaker recognition" (自适应高斯混合模型及说话人识别应用); Wang Yunqi (王韵琪) et al.; Communications Technology (通信技术); July 2014; entire document. *

Also Published As

Publication number Publication date
CN107919115A (en) 2018-04-17

Similar Documents

Publication Publication Date Title
US8438026B2 (en) Method and system for generating training data for an automatic speech recognizer
Boril et al. Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments
CN108877784B (en) Robust speech recognition method based on accent recognition
Aggarwal et al. Performance evaluation of sequentially combined heterogeneous feature streams for Hindi speech recognition system
CN104392718A (en) Robust voice recognition method based on acoustic model array
Chuang et al. Speaker-Aware Deep Denoising Autoencoder with Embedded Speaker Identity for Speech Enhancement.
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
Rajnoha et al. ASR systems in noisy environment: Analysis and solutions for increasing noise robustness
Alam et al. Robust feature extraction based on an asymmetric level-dependent auditory filterbank and a subband spectrum enhancement technique
Chauhan et al. Speech to text converter using Gaussian Mixture Model (GMM)
CN107919115B (en) Characteristic compensation method based on nonlinear spectral transformation
KR101236539B1 (en) Apparatus and Method For Feature Compensation Using Weighted Auto-Regressive Moving Average Filter and Global Cepstral Mean and Variance Normalization
Hachkar et al. A comparison of DHMM and DTW for isolated digits recognition system of Arabic language
Chavan et al. Speech recognition in noisy environment, issues and challenges: A review
Alam et al. Regularized minimum variance distortionless response-based cepstral features for robust continuous speech recognition
Kaur et al. Power-Normalized Cepstral Coefficients (PNCC) for Punjabi automatic speech recognition using phone based modelling in HTK
Upadhyay et al. Robust recognition of English speech in noisy environments using frequency warped signal processing
Akhter et al. An analysis of performance evaluation metrics for voice conversion models
Xiao Robust speech features and acoustic models for speech recognition
Wu et al. An environment-compensated minimum classification error training approach based on stochastic vector mapping
CN108986794B (en) Speaker compensation method based on power function frequency transformation
Singh et al. A comparative study of recognition of speech using improved MFCC algorithms and Rasta filters
Dutta et al. A comparison of three spectral features for phone recognition in sub-optimal environments
Kathania et al. Experiments on children's speech recognition under acoustically mismatched conditions
Harshita et al. Speech Recognition with Frequency Domain Linear Prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant