CN108877784B - Robust speech recognition method based on accent recognition - Google Patents
- Publication number
- CN108877784B (application CN201811030962.6A)
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
Abstract
The invention discloses a robust speech recognition method based on accent recognition, which predicts the accent characteristics of a target speaker using acoustic models of multiple accents. In the training stage, accents with similar pronunciation characteristics are merged into one class, and a Gaussian mixture model (GMM) and a group of hidden Markov models (HMMs) are trained for each accent class. In the testing stage, formants are first extracted from the test speech of the target speaker; the speaker's accent is then identified from the formant characteristics, the acoustic model corresponding to that accent is selected according to the identification result, and the model's parameters are adjusted to match the target speaker's pronunciation characteristics. Finally, the adapted acoustic model is used to recognize the test speech feature vectors and obtain the recognition result. The method reduces the influence of accent on a speech recognition system and improves the accuracy of model adaptation under accent variation.
Description
Technical Field
The invention belongs to the field of speech recognition, and in particular relates to a robust speech recognition method that uses a Gaussian mixture model to describe the formant-vector distribution of each accent, uses the pre-trained Gaussian mixture models to recognize the accent of the test speech in the test environment, selects the acoustic model that best matches the current speaker's accent, and performs speaker adaptation on that model's parameters to obtain an acoustic model for the test environment.
Background
Speech recognition systems generally use Mel-frequency cepstral coefficients (MFCCs) as feature vectors and hidden Markov models (HMMs) as acoustic models. To capture the characteristics of the target speaker, the acoustic model is generally trained on speech from a large number of speakers. However, it is very difficult to reduce the impact of speaker variation simply by adding training speech. Different speakers have different speaking styles, and the number of possible speakers is so large that the training stage cannot cover them all. On the other hand, training on too many speakers makes the acoustic model overly flat, widening the gap between the model and each individual speaker's characteristics and reducing the system's recognition rate.
Currently, most speech recognition systems achieve high recognition rates for standard Mandarin pronunciation. In real life, however, few people speak perfectly standard Mandarin, and most pronounce with some degree of regional accent. Speaker adaptation transforms the parameters of a pre-trained acoustic model according to a small amount of test speech from the test environment, so that the model matches that environment as closely as possible. However, the transformation between the training environment and the test environment is unknown and nonlinear. For ease of implementation, speaker adaptation generally assumes that this mapping is a linear transformation. This assumption can leave a large difference between the adapted acoustic model and the ideal one, and the difference is especially pronounced when the pronunciation characteristics of the training speech and the target speaker differ significantly.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a robust speech recognition method based on accent recognition.
The technical scheme is as follows: in a robust speech recognition method based on accent recognition, accents with similar pronunciation characteristics are merged into one class in the training stage, and a Gaussian mixture model (GMM) and a group of hidden Markov models (HMMs) are trained for each accent class. In the testing stage, formants are first extracted from the test speech of the target speaker; the speaker's accent is then identified from the formant characteristics, the acoustic model corresponding to that accent is selected according to the identification result, and the model's parameters are adjusted to match the target speaker's pronunciation characteristics. Finally, the adapted acoustic model is used to recognize the test speech feature vectors and obtain the recognition result.
The method comprises the following specific steps:
(1) Obtaining training speech of multiple accent classes;
(2) Windowing and framing the training speech of each accent class to obtain frame signals;
(3) Extracting formants from the voiced frame signals of each class of training speech, the first three formants forming a formant vector;
(4) Performing GMM training on the formant vectors of each class of training speech to obtain the GMM of that accent;
(5) Performing feature extraction on each class of training speech to obtain Mel-frequency cepstral coefficients (MFCCs), and performing HMM training to obtain the HMM (acoustic model) of each speech unit of that accent;
(6) Windowing and framing the test speech of the target speaker to obtain frame signals of the test speech;
(7) Extracting formant vectors from the voiced frame signals of the target speaker;
(8) Performing accent recognition on the target speaker's formant vectors with the pre-trained GMMs to obtain the target speaker's accent information;
(9) Selecting the acoustic model of the identified accent class according to the target speaker's accent information, and adjusting its parameters to match the target speaker's pronunciation characteristics, yielding an adapted acoustic model; this matching is an approximate process that improves the degree of match, and hence the recognition rate, without achieving an exact match;
(10) Performing feature extraction on the target speaker's frame signals to obtain the target speaker's MFCCs;
(11) Acoustically decoding the target speaker's MFCCs with the adapted acoustic model to obtain the recognition result.
By adopting the above technical scheme, the invention has the following beneficial effects: the method reduces the influence of accent on the speech recognition system, improves the accuracy of model adaptation under accent variation, and enhances the recognition performance of the speech recognition system.
Drawings
Fig. 1 is a general framework diagram of a robust speech recognition method based on accent recognition according to an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and not to limit the scope of the invention; various equivalent modifications of the invention that occur to those skilled in the art upon reading this disclosure likewise fall within the scope of the appended claims.
A robust speech recognition method based on accent recognition mainly comprises preprocessing, formant extraction, GMM training, feature extraction, HMM training, accent recognition, model adaptation and acoustic decoding.
1. Preprocessing
In the training and testing stages, the training speech and the test speech are respectively windowed and framed to generate frame signals. The sampling frequency of the speech signal is 8000 Hz, the window function is a Hamming window, the frame length is 256 samples, and the frame shift is 128 samples.
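As an illustration, the windowing and framing step can be sketched as follows; the 8000 Hz sampling rate, Hamming window, frame length of 256, and frame shift of 128 are the values stated above:

```python
import math

def frame_signal(signal, frame_len=256, frame_shift=128):
    """Split a signal into overlapping frames and apply a Hamming window.

    Parameter values follow the patent's setup: 8000 Hz sampling,
    frame length 256 samples (32 ms), frame shift 128 samples (16 ms).
    """
    # Hamming window coefficients
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        # Element-wise product of the frame and the window
        frames.append([s * w for s, w in
                       zip(signal[start:start + frame_len], window)])
        start += frame_shift
    return frames

# One second of a 440 Hz tone at 8000 Hz yields floor((8000-256)/128)+1 = 61 frames.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
frames = frame_signal(tone)
```

With these parameters adjacent frames overlap by 50%, which is the usual trade-off between temporal resolution and spectral stability.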
2. Formant extraction
In the training and testing stages, formants are extracted from the voiced frame signals of the training speech and the test speech, respectively, and the first three formants are combined into a formant vector.
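The text does not fix an extraction algorithm for the formants; one common realization is LPC root-finding, sketched below. The LPC order of 10 is a conventional choice for 8 kHz speech and not a value taken from the text, and the synthetic test frame is purely illustrative.

```python
import numpy as np

def formant_vector(frame, fs=8000, lpc_order=10):
    """Estimate up to the first three formants of a voiced frame via
    LPC root-finding (one common technique for this step)."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation method: normal equations for the LPC coefficients.
    r = np.array([frame[:len(frame) - k] @ frame[k:]
                  for k in range(lpc_order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(lpc_order)]
                  for i in range(lpc_order)])
    a = np.linalg.solve(R, r[1:])                     # predictor coefficients
    # Roots of the prediction polynomial A(z) = 1 - a1 z^-1 - ... - ap z^-p
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[roots.imag > 1e-6]                  # one root per conjugate pair
    freqs = np.sort(np.angle(roots)) * fs / (2 * np.pi)
    return list(freqs[:3])                            # lowest three resonances

# Synthetic voiced-like frame: noise plus strong components near 500 and 1500 Hz.
rng = np.random.default_rng(0)
t = np.arange(256)
frame = (np.sin(2 * np.pi * 500 * t / 8000)
         + 0.5 * np.sin(2 * np.pi * 1500 * t / 8000)
         + 0.05 * rng.standard_normal(256))
formants = formant_vector(frame)
```

In practice the root angles are converted to hertz as above, and roots with very small bandwidth or near 0/fs/2 are often filtered out before picking the first three.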
3. Feature extraction
In the training and testing stages, each frame signal of the training speech and the test speech is subjected to a fast Fourier transform, Mel filtering, a logarithmic transform, and a discrete cosine transform to generate Mel-frequency cepstral coefficients (MFCCs).
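The four transforms above can be sketched for a single windowed frame as follows. The filterbank size (26) and cepstral order (13) are common defaults that the text does not specify.

```python
import numpy as np

def mfcc(frame, fs=8000, n_filters=26, n_ceps=13):
    """MFCCs for one windowed frame: FFT -> Mel filterbank -> log -> DCT."""
    frame = np.asarray(frame, dtype=float)
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2        # power spectrum
    # Mel-spaced triangular filterbank
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    mel_pts = np.linspace(mel(0), mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * imel(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for i in range(n_filters):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, ctr):                      # rising slope
            fbank[i, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):                      # falling slope
            fbank[i, k] = (hi - k) / max(hi - ctr, 1)
    energies = np.log(fbank @ spectrum + 1e-10)       # log Mel energies
    # Type-II DCT decorrelates the log energies into cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ energies

frame = np.hamming(256) * np.sin(2 * np.pi * 1000 * np.arange(256) / 8000)
ceps = mfcc(frame)
```

The DCT at the end compacts most of the spectral-envelope information into the low-order coefficients, which is why 13 coefficients usually suffice.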
4. GMM training
GMM training is performed on all the training-speech formant vectors of each accent class to generate the GMM of that accent.
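A minimal stand-in for this per-accent GMM training step is expectation-maximization on a diagonal-covariance mixture, sketched below. The component count, iteration budget, and synthetic formant clusters are illustrative choices, not values from the text.

```python
import numpy as np

def train_gmm(X, n_comp=4, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM to formant vectors with EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(n_comp, 1.0 / n_comp)
    means = X[rng.choice(n, n_comp, replace=False)]   # init means from data
    variances = np.tile(X.var(axis=0) + 1e-6, (n_comp, 1))
    for _ in range(n_iter):
        # E-step: responsibilities of each Gaussian (log domain for stability)
        log_p = np.stack([
            -0.5 * (((X - means[k]) ** 2 / variances[k]).sum(axis=1)
                    + np.log(2 * np.pi * variances[k]).sum())
            for k in range(n_comp)], axis=1) + np.log(weights)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        # M-step: re-estimate weights, means, variances
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ X ** 2) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances

def gmm_log_likelihood(X, gmm):
    """Total log-likelihood of data X under a trained diagonal GMM."""
    weights, means, variances = gmm
    log_p = np.stack([
        -0.5 * (((X - means[k]) ** 2 / variances[k]).sum(axis=1)
                + np.log(2 * np.pi * variances[k]).sum())
        for k in range(len(weights))], axis=1) + np.log(weights)
    return np.logaddexp.reduce(log_p, axis=1).sum()

# Illustrative "formant vectors" drawn around two pronunciation modes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal((500, 1500, 2500), 60, size=(100, 3)),
               rng.normal((800, 1900, 2800), 60, size=(100, 3))])
gmm = train_gmm(X)
```

The log-likelihood function is reused at test time to score a speaker's formant vectors against each accent's GMM.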
5. HMM training
HMM training is performed on the MFCCs of all training speech of each speech unit of each accent class to obtain the HMM of that speech unit. All the HMMs of an accent class together constitute the acoustic model of that accent.
6. Accent recognition
The formant vectors of the target speaker's test speech are input to the GMM of each accent class, and the output probability of each GMM is computed. The accent corresponding to the GMM with the largest output probability is taken as the accent of the target speaker.
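The maximum-probability selection above can be sketched as follows. Each GMM is represented as a list of (weight, mean, variance) diagonal-Gaussian components, and the accent labels are purely illustrative.

```python
import math

def identify_accent(formant_vectors, accent_gmms):
    """Pick the accent whose GMM gives the test formant vectors the highest
    total log-probability, mirroring the selection rule described above."""
    def log_gauss(x, mean, var):
        # Log-density of a diagonal Gaussian
        return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                   for xi, m, v in zip(x, mean, var))

    def score(vectors, gmm):
        total = 0.0
        for x in vectors:
            comp = [math.log(w) + log_gauss(x, m, v) for w, m, v in gmm]
            mx = max(comp)                            # log-sum-exp, stably
            total += mx + math.log(sum(math.exp(c - mx) for c in comp))
        return total

    return max(accent_gmms,
               key=lambda accent: score(formant_vectors, accent_gmms[accent]))

# Two single-component toy GMMs with hypothetical accent labels.
accent_gmms = {
    "accent_A": [(1.0, (600.0, 1600.0, 2600.0), (1e4, 1e4, 1e4))],
    "accent_B": [(1.0, (900.0, 1900.0, 2900.0), (1e4, 1e4, 1e4))],
}
test_vectors = [(610.0, 1590.0, 2605.0), (595.0, 1610.0, 2598.0)]
chosen = identify_accent(test_vectors, accent_gmms)
```

Summing per-frame log-probabilities is equivalent to multiplying the frame probabilities, i.e. to scoring the whole utterance under each accent model.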
7. Model adaptation
According to the accent information of the target speaker obtained by accent recognition, the acoustic model of that accent is selected, and its parameters are transformed with the maximum likelihood linear regression (MLLR) algorithm so that they better match the pronunciation characteristics of the target speaker.
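A heavily simplified sketch of the mean-transform idea behind MLLR follows. It assumes unit covariances and exactly one aligned adaptation vector per Gaussian mean, under which the maximum-likelihood affine transform reduces to ordinary least squares; real MLLR weights each term by state occupancy and inverse covariance, and the example data are hypothetical.

```python
import numpy as np

def mllr_adapt_means(means, observations):
    """Adapt HMM Gaussian means with one global affine transform
    (simplified MLLR: unit covariances, one observation per mean)."""
    mu = np.asarray(means, dtype=float)            # (n, d) original means
    o = np.asarray(observations, dtype=float)      # (n, d) adaptation data
    xi = np.hstack([np.ones((len(mu), 1)), mu])    # extended means [1, mu]
    # Least-squares estimate of W in o ~= xi @ W (the MLLR transform)
    W, *_ = np.linalg.lstsq(xi, o, rcond=None)
    return xi @ W                                   # transformed means

# If the speaker's features are the training means shifted by a constant
# offset, the affine family contains the exact answer.
means = np.array([[500., 1520., 2500.],
                  [610., 1700., 2610.],
                  [700., 1890., 2720.],
                  [540., 1630., 2450.],
                  [660., 1810., 2640.]])
obs = means + np.array([80., -40., 25.])
adapted = mllr_adapt_means(means, obs)
```

Because the transform is shared across all means, a small amount of adaptation speech moves every Gaussian in the model, which is the property the method relies on.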
8. Acoustic decoding
The MFCCs of the target speaker are acoustically decoded with the adapted accent acoustic model to obtain the recognition result.
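The core of HMM acoustic decoding is the Viterbi algorithm, sketched below for a single HMM; a full recognizer would decode over a network of per-unit HMMs built from the adapted acoustic model, and the toy observation scores are illustrative.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely HMM state sequence given per-frame state log-likelihoods.

    log_obs[t, s] is the log-probability of frame t under state s;
    log_trans[i, j] is the log transition probability from i to j."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]               # best score ending in each state
    back = np.zeros((T, S), dtype=int)          # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # scores[from_state, to_state]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]                # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two-state left-to-right toy example: the first two frames fit state 0,
# the last two fit state 1.
log_obs = np.array([[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]])
log_trans = np.log(np.array([[0.7, 0.3], [1e-10, 1.0]]))
log_init = np.array([0.0, -10.0])
best_path = viterbi(log_obs, log_trans, log_init)
```

The tiny 1e-10 entry stands in for a forbidden backward transition, keeping the log well-defined while making the left-to-right topology effectively hard.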
Claims (2)
1. A robust speech recognition method based on accent recognition, characterized in that, in the training stage, training speech of multiple accents is obtained, and a Gaussian mixture model (GMM) and a group of hidden Markov models are trained for each accent; in the testing stage, formants are first extracted from the test speech of a target speaker; the speaker's accent is then identified from the formant characteristics, the acoustic model corresponding to that accent is selected according to the identification result, and the model's parameters are adjusted to match the target speaker's pronunciation characteristics, yielding an adapted acoustic model; finally, the adapted acoustic model is used to recognize the test speech feature vectors and obtain the recognition result;
the specific method for generating a GMM model and an HMM model for each type of accent training comprises the following steps:
(1) Combining the accents with similar pronunciation characteristics into one class to obtain training voices of various accents;
(2) Windowing the training voice of each type of accent, and framing to obtain frame signals;
(3) Extracting formants from voiced frame signals of each type of training speech, and forming formant vectors by the first three formants;
(4) GMM training is carried out on the formant vector of each type of training voice to obtain a GMM model of the accent;
(5) Performing feature extraction on each type of training speech to obtain a Mel Frequency Cepstrum Coefficient (MFCC), and performing HMM training to obtain an HMM model of each speech unit of the accent, wherein the HMM model is an acoustic model;
in the testing stage, the process of firstly identifying from the testing voice of the target speaker is as follows:
(1) Windowing the test voice of the target speaker, and framing to obtain a frame signal of the test voice;
(2) Extracting formant vectors from voiced frame signals of a target speaker;
(3) Performing accent recognition on the formant vector of the target speaker by using a pre-trained GMM to obtain accent information of the target speaker;
(4) Selecting an acoustic model of the accent according to the accent information of the target speaker, and adjusting parameters of the acoustic model to obtain a self-adaptive acoustic model so as to enable the acoustic model to be matched with the pronunciation characteristics of the target speaker;
(5) Extracting features from the frame signal of the target speaker to obtain the MFCC of the target speaker;
(6) And performing acoustic decoding on the MFCC of the target speaker by using the adaptive acoustic model to obtain a recognition result.
2. The robust speech recognition method based on accent recognition of claim 1, wherein the sampling frequency of both the training speech and the test speech signals is 8000 Hz, the window function is a Hamming window, the frame length is 256 samples, and the frame shift is 128 samples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811030962.6A CN108877784B (en) | 2018-09-05 | 2018-09-05 | Robust speech recognition method based on accent recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108877784A CN108877784A (en) | 2018-11-23 |
CN108877784B true CN108877784B (en) | 2022-12-06 |
Family
ID=64323254
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811030962.6A Active CN108877784B (en) | 2018-09-05 | 2018-09-05 | Robust speech recognition method based on accent recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108877784B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220189463A1 (en) * | 2020-12-16 | 2022-06-16 | Samsung Electronics Co., Ltd. | Electronic device and operation method thereof |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109961794B (en) * | 2019-01-14 | 2021-07-06 | 湘潭大学 | Method for improving speaker recognition efficiency based on model clustering |
CN112116909A (en) * | 2019-06-20 | 2020-12-22 | 杭州海康威视数字技术股份有限公司 | Voice recognition method, device and system |
CN110648654A (en) * | 2019-10-09 | 2020-01-03 | 国家电网有限公司客户服务中心 | Speech recognition enhancement method and device introducing language vectors |
CN112233659A (en) * | 2020-10-14 | 2021-01-15 | 河海大学 | Quick speech recognition method based on double-layer acoustic model |
CN112466056B (en) * | 2020-12-01 | 2022-04-05 | 上海旷日网络科技有限公司 | Self-service cabinet pickup system and method based on voice recognition |
CN112599118B (en) * | 2020-12-30 | 2024-02-13 | 中国科学技术大学 | Speech recognition method, device, electronic equipment and storage medium |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080147404A1 (en) * | 2000-05-15 | 2008-06-19 | Nusuara Technologies Sdn Bhd | System and methods for accent classification and adaptation |
CN101123648B (en) * | 2006-08-11 | 2010-05-12 | 中国科学院声学研究所 | Self-adapted method in phone voice recognition |
CN102881284B (en) * | 2012-09-03 | 2014-07-09 | 江苏大学 | Unspecific human voice and emotion recognition method and system |
CN103474061A (en) * | 2013-09-12 | 2013-12-25 | 河海大学 | Automatic distinguishing method based on integration of classifier for Chinese dialects |
CN104392718B (en) * | 2014-11-26 | 2017-11-24 | 河海大学 | A kind of robust speech recognition methods based on acoustic model array |
CN104485108A (en) * | 2014-11-26 | 2015-04-01 | 河海大学 | Noise and speaker combined compensation method based on multi-speaker model |
CN105355198B (en) * | 2015-10-20 | 2019-03-12 | 河海大学 | It is a kind of based on multiple adaptive model compensation audio recognition method |
CN105632501B (en) * | 2015-12-30 | 2019-09-03 | 中国科学院自动化研究所 | A kind of automatic accent classification method and device based on depth learning technology |
CN106251859B (en) * | 2016-07-22 | 2019-05-31 | 百度在线网络技术(北京)有限公司 | Voice recognition processing method and apparatus |
CN106531157B (en) * | 2016-10-28 | 2019-10-22 | 中国科学院自动化研究所 | Regularization accent adaptive approach in speech recognition |
CN107919115B (en) * | 2017-11-13 | 2021-07-27 | 河海大学 | Characteristic compensation method based on nonlinear spectral transformation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108877784B (en) | Robust speech recognition method based on accent recognition | |
KR100679051B1 (en) | Apparatus and method for speech recognition using a plurality of confidence score estimation algorithms | |
Zen et al. | Continuous stochastic feature mapping based on trajectory HMMs | |
Aggarwal et al. | Performance evaluation of sequentially combined heterogeneous feature streams for Hindi speech recognition system | |
EP1675102A2 (en) | Method for extracting feature vectors for speech recognition | |
Das et al. | Bangladeshi dialect recognition using Mel frequency cepstral coefficient, delta, delta-delta and Gaussian mixture model | |
Nanavare et al. | Recognition of human emotions from speech processing | |
Ranjan et al. | Isolated word recognition using HMM for Maithili dialect | |
Bhukya | Effect of gender on improving speech recognition system | |
KR101236539B1 (en) | Apparatus and Method For Feature Compensation Using Weighted Auto-Regressive Moving Average Filter and Global Cepstral Mean and Variance Normalization | |
US11929058B2 (en) | Systems and methods for adapting human speaker embeddings in speech synthesis | |
Hachkar et al. | A comparison of DHMM and DTW for isolated digits recognition system of Arabic language | |
CN107919115B (en) | Characteristic compensation method based on nonlinear spectral transformation | |
Singh et al. | A critical review on automatic speaker recognition | |
Manjutha et al. | Automated speech recognition system—A literature review | |
Vuppala et al. | Recognition of consonant-vowel (CV) units under background noise using combined temporal and spectral preprocessing | |
Jayanna et al. | Multiple frame size and rate analysis for speaker recognition under limited data condition | |
Dey et al. | Content normalization for text-dependent speaker verification | |
Chakroun et al. | An improved approach for text-independent speaker recognition | |
Salman et al. | Speaker verification using boosted cepstral features with gaussian distributions | |
Biswas et al. | Speaker identification using Cepstral based features and discrete Hidden Markov Model | |
Khalifa et al. | Statistical modeling for speech recognition | |
Galić et al. | Speaker dependent recognition of whispered speech based on MLLR adaptation | |
CN108986794B (en) | Speaker compensation method based on power function frequency transformation | |
Kishimoto et al. | Model training using parallel data with mismatched pause positions in statistical esophageal speech enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||