CN108877784B - Robust speech recognition method based on accent recognition - Google Patents
- Publication number
- CN108877784B (application CN201811030962.6A)
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
Abstract
The invention discloses a robust speech recognition method based on accent recognition, which predicts the accent characteristics of a target speaker using acoustic models of multiple accents. In the training stage, accents with similar pronunciation characteristics are merged into one class, and a Gaussian mixture model (GMM) and a group of hidden Markov models (HMMs) are trained for each accent class. In the testing stage, formants are first extracted from the test speech of the target speaker; the speaker's accent is then identified from the formant characteristics, the acoustic model corresponding to that accent is selected according to the identification result, and the model's parameters are adjusted to match the target speaker's pronunciation characteristics. Finally, the adapted acoustic model is used to recognize the test speech feature vectors and obtain the recognition result. The method reduces the influence of accent on a speech recognition system and improves the accuracy of model adaptation under accent variation.
Description
Technical Field
The invention belongs to the field of speech recognition, and in particular relates to a robust speech recognition method that uses a Gaussian mixture model to describe the formant-vector distribution of each accent, uses the pre-trained Gaussian mixture models to recognize the accent of the test speech in the test environment, selects the acoustic model that best matches the current speaker's accent, and performs speaker adaptation on that model's parameters to obtain an acoustic model for the test environment.
Background
Speech recognition systems generally use Mel-frequency cepstral coefficients (MFCCs) as feature vectors and hidden Markov models (HMMs) as acoustic models. To capture the characteristics of the target speaker, the acoustic model is generally trained on speech from a large number of speakers. However, it is very difficult to reduce the impact of speaker variation simply by adding training speech. Different speakers have different speaking styles, and the number of possible speakers is so large that the training stage cannot cover them all. On the other hand, training on too many speakers makes the acoustic model overly flat, widening the gap between the model and each individual speaker's characteristics and reducing the system's recognition rate.
Currently, most speech recognition systems achieve high recognition rates for standard Mandarin pronunciation. In real life, however, few people speak perfectly standard Mandarin, and most pronounce with some degree of regional accent. Speaker adaptation transforms the parameters of a pre-trained acoustic model according to a small amount of test speech from the test environment, so that the model matches that environment as closely as possible. However, the transformation between the training environment and the test environment is unknown and nonlinear. For ease of implementation, speaker adaptation generally assumes that this mapping is a linear transformation. This assumption can leave a large difference between the adapted acoustic model and the ideal one, and the difference is especially pronounced when the pronunciation characteristics of the training speech and the target speaker differ significantly.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a robust speech recognition method based on accent recognition.
The technical scheme is as follows: in a robust speech recognition method based on accent recognition, accents with similar pronunciation characteristics are merged into one class in the training stage, and a Gaussian mixture model (GMM) and a group of hidden Markov models (HMMs) are trained for each accent class. In the testing stage, formants are first extracted from the test speech of the target speaker; the speaker's accent is then identified from the formant characteristics, the acoustic model corresponding to that accent is selected according to the identification result, and the model's parameters are adjusted to match the target speaker's pronunciation characteristics. Finally, the adapted acoustic model is used to recognize the test speech feature vectors and obtain the recognition result.
The method comprises the following specific steps:
(1) Obtaining training speech of multiple accent classes;
(2) Windowing and framing the training speech of each accent class to obtain frame signals;
(3) Extracting formants from the voiced frame signals of each class of training speech, the first three formants forming a formant vector;
(4) Performing GMM training on the formant vectors of each class of training speech to obtain the GMM of that accent;
(5) Performing feature extraction on each class of training speech to obtain Mel-frequency cepstral coefficients (MFCCs), and performing HMM training to obtain the HMM (acoustic model) of each speech unit of that accent;
(6) Windowing and framing the test speech of the target speaker to obtain frame signals of the test speech;
(7) Extracting formant vectors from the voiced frame signals of the target speaker;
(8) Performing accent recognition on the target speaker's formant vectors with the pre-trained GMMs to obtain the target speaker's accent information;
(9) Selecting the acoustic model of the identified accent class according to the target speaker's accent information, and adjusting its parameters to match the target speaker's pronunciation characteristics, yielding an adapted acoustic model; this matching is an approximate process that improves the degree of match, and hence the recognition rate, without achieving an exact match;
(10) Performing feature extraction on the target speaker's frame signals to obtain the target speaker's MFCCs;
(11) Acoustically decoding the target speaker's MFCCs with the adapted acoustic model to obtain the recognition result.
By adopting the above technical scheme, the invention has the following beneficial effects: the method reduces the influence of accent on the speech recognition system, improves the accuracy of model adaptation under accent variation, and enhances the recognition performance of the speech recognition system.
Drawings
Fig. 1 is a general framework diagram of a robust speech recognition method based on accent recognition according to an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and not to limit the scope of the invention; various equivalent modifications of the invention that occur to those skilled in the art upon reading this disclosure likewise fall within the scope of the appended claims.
A robust speech recognition method based on accent recognition mainly comprises preprocessing, formant extraction, GMM training, feature extraction, HMM training, accent recognition, model adaptation and acoustic decoding.
1. Preprocessing
In the training and testing stages, the training speech and the test speech are respectively windowed and framed to generate frame signals. The sampling frequency of the speech signal is 8000 Hz, the window function is a Hamming window, the frame length is 256 samples, and the frame shift is 128 samples.
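As an illustration, the windowing and framing step can be sketched as follows; the 8000 Hz sampling rate, Hamming window, frame length of 256, and frame shift of 128 are the values stated above:

```python
import math

def frame_signal(signal, frame_len=256, frame_shift=128):
    """Split a signal into overlapping frames and apply a Hamming window.

    Parameter values follow the patent's setup: 8000 Hz sampling,
    frame length 256 samples (32 ms), frame shift 128 samples (16 ms).
    """
    # Hamming window coefficients
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        # Element-wise product of the frame and the window
        frames.append([s * w for s, w in
                       zip(signal[start:start + frame_len], window)])
        start += frame_shift
    return frames

# One second of a 440 Hz tone at 8000 Hz yields floor((8000-256)/128)+1 = 61 frames.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
frames = frame_signal(tone)
```

With these parameters adjacent frames overlap by 50%, which is the usual trade-off between temporal resolution and spectral stability.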
2. Formant extraction
In the training and testing stages, formants are extracted from the voiced frame signals of the training speech and the test speech, respectively, and the first three formants are combined into a formant vector.
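The text does not fix an extraction algorithm for the formants; one common realization is LPC root-finding, sketched below. The LPC order of 10 is a conventional choice for 8 kHz speech and not a value taken from the text, and the synthetic test frame is purely illustrative.

```python
import numpy as np

def formant_vector(frame, fs=8000, lpc_order=10):
    """Estimate up to the first three formants of a voiced frame via
    LPC root-finding (one common technique for this step)."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation method: normal equations for the LPC coefficients.
    r = np.array([frame[:len(frame) - k] @ frame[k:]
                  for k in range(lpc_order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(lpc_order)]
                  for i in range(lpc_order)])
    a = np.linalg.solve(R, r[1:])                     # predictor coefficients
    # Roots of the prediction polynomial A(z) = 1 - a1 z^-1 - ... - ap z^-p
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[roots.imag > 1e-6]                  # one root per conjugate pair
    freqs = np.sort(np.angle(roots)) * fs / (2 * np.pi)
    return list(freqs[:3])                            # lowest three resonances

# Synthetic voiced-like frame: noise plus strong components near 500 and 1500 Hz.
rng = np.random.default_rng(0)
t = np.arange(256)
frame = (np.sin(2 * np.pi * 500 * t / 8000)
         + 0.5 * np.sin(2 * np.pi * 1500 * t / 8000)
         + 0.05 * rng.standard_normal(256))
formants = formant_vector(frame)
```

In practice the root angles are converted to hertz as above, and roots with very small bandwidth or near 0/fs/2 are often filtered out before picking the first three.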
3. Feature extraction
In the training and testing stages, each frame signal of the training speech and the test speech is subjected to a fast Fourier transform, Mel filtering, a logarithmic transform, and a discrete cosine transform to generate Mel-frequency cepstral coefficients (MFCCs).
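The four transforms above can be sketched for a single windowed frame as follows. The filterbank size (26) and cepstral order (13) are common defaults that the text does not specify.

```python
import numpy as np

def mfcc(frame, fs=8000, n_filters=26, n_ceps=13):
    """MFCCs for one windowed frame: FFT -> Mel filterbank -> log -> DCT."""
    frame = np.asarray(frame, dtype=float)
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2        # power spectrum
    # Mel-spaced triangular filterbank
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    mel_pts = np.linspace(mel(0), mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * imel(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for i in range(n_filters):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, ctr):                      # rising slope
            fbank[i, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):                      # falling slope
            fbank[i, k] = (hi - k) / max(hi - ctr, 1)
    energies = np.log(fbank @ spectrum + 1e-10)       # log Mel energies
    # Type-II DCT decorrelates the log energies into cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ energies

frame = np.hamming(256) * np.sin(2 * np.pi * 1000 * np.arange(256) / 8000)
ceps = mfcc(frame)
```

The DCT at the end compacts most of the spectral-envelope information into the low-order coefficients, which is why 13 coefficients usually suffice.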
4. GMM training
GMM training is performed on all the training-speech formant vectors of each accent class to generate the GMM of that accent.
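A minimal stand-in for this per-accent GMM training step is expectation-maximization on a diagonal-covariance mixture, sketched below. The component count, iteration budget, and synthetic formant clusters are illustrative choices, not values from the text.

```python
import numpy as np

def train_gmm(X, n_comp=4, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM to formant vectors with EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(n_comp, 1.0 / n_comp)
    means = X[rng.choice(n, n_comp, replace=False)]   # init means from data
    variances = np.tile(X.var(axis=0) + 1e-6, (n_comp, 1))
    for _ in range(n_iter):
        # E-step: responsibilities of each Gaussian (log domain for stability)
        log_p = np.stack([
            -0.5 * (((X - means[k]) ** 2 / variances[k]).sum(axis=1)
                    + np.log(2 * np.pi * variances[k]).sum())
            for k in range(n_comp)], axis=1) + np.log(weights)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        # M-step: re-estimate weights, means, variances
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ X ** 2) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances

def gmm_log_likelihood(X, gmm):
    """Total log-likelihood of data X under a trained diagonal GMM."""
    weights, means, variances = gmm
    log_p = np.stack([
        -0.5 * (((X - means[k]) ** 2 / variances[k]).sum(axis=1)
                + np.log(2 * np.pi * variances[k]).sum())
        for k in range(len(weights))], axis=1) + np.log(weights)
    return np.logaddexp.reduce(log_p, axis=1).sum()

# Illustrative "formant vectors" drawn around two pronunciation modes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal((500, 1500, 2500), 60, size=(100, 3)),
               rng.normal((800, 1900, 2800), 60, size=(100, 3))])
gmm = train_gmm(X)
```

The log-likelihood function is reused at test time to score a speaker's formant vectors against each accent's GMM.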
5. HMM training
HMM training is performed on the MFCCs of all training speech of each speech unit of each accent class to obtain the HMM of that speech unit. All the HMMs of an accent class together constitute the acoustic model of that accent.
6. Accent recognition
The formant vectors of the target speaker's test speech are input to the GMM of each accent class, and the output probability of each GMM is computed. The accent corresponding to the GMM with the largest output probability is taken as the accent of the target speaker.
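The maximum-probability selection above can be sketched as follows. Each GMM is represented as a list of (weight, mean, variance) diagonal-Gaussian components, and the accent labels are purely illustrative.

```python
import math

def identify_accent(formant_vectors, accent_gmms):
    """Pick the accent whose GMM gives the test formant vectors the highest
    total log-probability, mirroring the selection rule described above."""
    def log_gauss(x, mean, var):
        # Log-density of a diagonal Gaussian
        return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                   for xi, m, v in zip(x, mean, var))

    def score(vectors, gmm):
        total = 0.0
        for x in vectors:
            comp = [math.log(w) + log_gauss(x, m, v) for w, m, v in gmm]
            mx = max(comp)                            # log-sum-exp, stably
            total += mx + math.log(sum(math.exp(c - mx) for c in comp))
        return total

    return max(accent_gmms,
               key=lambda accent: score(formant_vectors, accent_gmms[accent]))

# Two single-component toy GMMs with hypothetical accent labels.
accent_gmms = {
    "accent_A": [(1.0, (600.0, 1600.0, 2600.0), (1e4, 1e4, 1e4))],
    "accent_B": [(1.0, (900.0, 1900.0, 2900.0), (1e4, 1e4, 1e4))],
}
test_vectors = [(610.0, 1590.0, 2605.0), (595.0, 1610.0, 2598.0)]
chosen = identify_accent(test_vectors, accent_gmms)
```

Summing per-frame log-probabilities is equivalent to multiplying the frame probabilities, i.e. to scoring the whole utterance under each accent model.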
7. Model adaptation
According to the accent information of the target speaker obtained by accent recognition, the acoustic model of that accent is selected, and its parameters are transformed with the maximum likelihood linear regression (MLLR) algorithm so that they better match the pronunciation characteristics of the target speaker.
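A heavily simplified sketch of the mean-transform idea behind MLLR follows. It assumes unit covariances and exactly one aligned adaptation vector per Gaussian mean, under which the maximum-likelihood affine transform reduces to ordinary least squares; real MLLR weights each term by state occupancy and inverse covariance, and the example data are hypothetical.

```python
import numpy as np

def mllr_adapt_means(means, observations):
    """Adapt HMM Gaussian means with one global affine transform
    (simplified MLLR: unit covariances, one observation per mean)."""
    mu = np.asarray(means, dtype=float)            # (n, d) original means
    o = np.asarray(observations, dtype=float)      # (n, d) adaptation data
    xi = np.hstack([np.ones((len(mu), 1)), mu])    # extended means [1, mu]
    # Least-squares estimate of W in o ~= xi @ W (the MLLR transform)
    W, *_ = np.linalg.lstsq(xi, o, rcond=None)
    return xi @ W                                   # transformed means

# If the speaker's features are the training means shifted by a constant
# offset, the affine family contains the exact answer.
means = np.array([[500., 1520., 2500.],
                  [610., 1700., 2610.],
                  [700., 1890., 2720.],
                  [540., 1630., 2450.],
                  [660., 1810., 2640.]])
obs = means + np.array([80., -40., 25.])
adapted = mllr_adapt_means(means, obs)
```

Because the transform is shared across all means, a small amount of adaptation speech moves every Gaussian in the model, which is the property the method relies on.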
8. Acoustic decoding
The MFCCs of the target speaker are acoustically decoded with the adapted accent acoustic model to obtain the recognition result.
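The core of HMM acoustic decoding is the Viterbi algorithm, sketched below for a single HMM; a full recognizer would decode over a network of per-unit HMMs built from the adapted acoustic model, and the toy observation scores are illustrative.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely HMM state sequence given per-frame state log-likelihoods.

    log_obs[t, s] is the log-probability of frame t under state s;
    log_trans[i, j] is the log transition probability from i to j."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]               # best score ending in each state
    back = np.zeros((T, S), dtype=int)          # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # scores[from_state, to_state]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]                # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two-state left-to-right toy example: the first two frames fit state 0,
# the last two fit state 1.
log_obs = np.array([[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]])
log_trans = np.log(np.array([[0.7, 0.3], [1e-10, 1.0]]))
log_init = np.array([0.0, -10.0])
best_path = viterbi(log_obs, log_trans, log_init)
```

The tiny 1e-10 entry stands in for a forbidden backward transition, keeping the log well-defined while making the left-to-right topology effectively hard.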
Claims (2)
1. A robust speech recognition method based on accent recognition, characterized in that, in the training stage, training speech of multiple accents is obtained, and a Gaussian mixture model (GMM) and a group of hidden Markov models are trained for each accent; in the testing stage, formants are first extracted from the test speech of a target speaker; the speaker's accent is then identified from the formant characteristics, the acoustic model corresponding to that accent is selected according to the identification result, and the model's parameters are adjusted to match the target speaker's pronunciation characteristics, yielding an adapted acoustic model; finally, the adapted acoustic model is used to recognize the test speech feature vectors and obtain the recognition result;
the specific method for generating a GMM model and an HMM model for each type of accent training comprises the following steps:
(1) Combining the accents with similar pronunciation characteristics into one class to obtain training voices of various accents;
(2) Windowing the training voice of each type of accent, and framing to obtain frame signals;
(3) Extracting formants from voiced frame signals of each type of training speech, and forming formant vectors by the first three formants;
(4) GMM training is carried out on the formant vector of each type of training voice to obtain a GMM model of the accent;
(5) Performing feature extraction on each type of training speech to obtain a Mel Frequency Cepstrum Coefficient (MFCC), and performing HMM training to obtain an HMM model of each speech unit of the accent, wherein the HMM model is an acoustic model;
in the testing stage, the process of firstly identifying from the testing voice of the target speaker is as follows:
(1) Windowing the test voice of the target speaker, and framing to obtain a frame signal of the test voice;
(2) Extracting formant vectors from voiced frame signals of a target speaker;
(3) Performing accent recognition on the formant vector of the target speaker by using a pre-trained GMM to obtain accent information of the target speaker;
(4) Selecting an acoustic model of the accent according to the accent information of the target speaker, and adjusting parameters of the acoustic model to obtain a self-adaptive acoustic model so as to enable the acoustic model to be matched with the pronunciation characteristics of the target speaker;
(5) Extracting features from the frame signal of the target speaker to obtain the MFCC of the target speaker;
(6) And performing acoustic decoding on the MFCC of the target speaker by using the adaptive acoustic model to obtain a recognition result.
2. The robust speech recognition method based on accent recognition of claim 1, wherein the sampling frequency of both the training speech and the test speech signals is 8000 Hz, the window function is a Hamming window, the frame length is 256 samples, and the frame shift is 128 samples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811030962.6A CN108877784B (en) | 2018-09-05 | 2018-09-05 | Robust speech recognition method based on accent recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108877784A CN108877784A (en) | 2018-11-23 |
CN108877784B true CN108877784B (en) | 2022-12-06 |
Family
ID=64323254
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811030962.6A Active CN108877784B (en) | 2018-09-05 | 2018-09-05 | Robust speech recognition method based on accent recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108877784B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220189463A1 (en) * | 2020-12-16 | 2022-06-16 | Samsung Electronics Co., Ltd. | Electronic device and operation method thereof |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109961794B (en) * | 2019-01-14 | 2021-07-06 | 湘潭大学 | Method for improving speaker recognition efficiency based on model clustering |
CN112116909A (en) * | 2019-06-20 | 2020-12-22 | 杭州海康威视数字技术股份有限公司 | Voice recognition method, device and system |
CN110648654A (en) * | 2019-10-09 | 2020-01-03 | 国家电网有限公司客户服务中心 | Speech recognition enhancement method and device introducing language vectors |
CN112233659A (en) * | 2020-10-14 | 2021-01-15 | 河海大学 | Quick speech recognition method based on double-layer acoustic model |
CN112466056B (en) * | 2020-12-01 | 2022-04-05 | 上海旷日网络科技有限公司 | Self-service cabinet pickup system and method based on voice recognition |
CN112599118B (en) * | 2020-12-30 | 2024-02-13 | 中国科学技术大学 | Speech recognition method, device, electronic equipment and storage medium |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080147404A1 (en) * | 2000-05-15 | 2008-06-19 | Nusuara Technologies Sdn Bhd | System and methods for accent classification and adaptation |
CN101123648B (en) * | 2006-08-11 | 2010-05-12 | 中国科学院声学研究所 | Self-adapted method in phone voice recognition |
CN102881284B (en) * | 2012-09-03 | 2014-07-09 | 江苏大学 | Unspecific human voice and emotion recognition method and system |
CN103474061A (en) * | 2013-09-12 | 2013-12-25 | 河海大学 | Automatic distinguishing method based on integration of classifier for Chinese dialects |
CN104392718B (en) * | 2014-11-26 | 2017-11-24 | 河海大学 | A kind of robust speech recognition methods based on acoustic model array |
CN104485108A (en) * | 2014-11-26 | 2015-04-01 | 河海大学 | Noise and speaker combined compensation method based on multi-speaker model |
CN105355198B (en) * | 2015-10-20 | 2019-03-12 | 河海大学 | It is a kind of based on multiple adaptive model compensation audio recognition method |
CN105632501B (en) * | 2015-12-30 | 2019-09-03 | 中国科学院自动化研究所 | A kind of automatic accent classification method and device based on depth learning technology |
CN106251859B (en) * | 2016-07-22 | 2019-05-31 | 百度在线网络技术(北京)有限公司 | Voice recognition processing method and apparatus |
CN106531157B (en) * | 2016-10-28 | 2019-10-22 | 中国科学院自动化研究所 | Regularization accent adaptive approach in speech recognition |
CN107919115B (en) * | 2017-11-13 | 2021-07-27 | 河海大学 | Characteristic compensation method based on nonlinear spectral transformation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108877784B (en) | Robust speech recognition method based on accent recognition | |
KR100679051B1 (en) | Apparatus and method for speech recognition using a plurality of confidence score estimation algorithms | |
Zen et al. | Continuous stochastic feature mapping based on trajectory HMMs | |
Aggarwal et al. | Performance evaluation of sequentially combined heterogeneous feature streams for Hindi speech recognition system | |
EP1675102A2 (en) | Method for extracting feature vectors for speech recognition | |
Das et al. | Bangladeshi dialect recognition using Mel frequency cepstral coefficient, delta, delta-delta and Gaussian mixture model | |
Nanavare et al. | Recognition of human emotions from speech processing | |
Ranjan et al. | Isolated word recognition using HMM for Maithili dialect | |
Bhukya | Effect of gender on improving speech recognition system | |
KR101236539B1 (en) | Apparatus and Method For Feature Compensation Using Weighted Auto-Regressive Moving Average Filter and Global Cepstral Mean and Variance Normalization | |
US11929058B2 (en) | Systems and methods for adapting human speaker embeddings in speech synthesis | |
Hachkar et al. | A comparison of DHMM and DTW for isolated digits recognition system of Arabic language | |
CN107919115B (en) | Characteristic compensation method based on nonlinear spectral transformation | |
Singh et al. | A critical review on automatic speaker recognition | |
Manjutha et al. | Automated speech recognition system—A literature review | |
Vuppala et al. | Recognition of consonant-vowel (CV) units under background noise using combined temporal and spectral preprocessing | |
Jayanna et al. | Multiple frame size and rate analysis for speaker recognition under limited data condition | |
Dey et al. | Content normalization for text-dependent speaker verification | |
Chakroun et al. | An improved approach for text-independent speaker recognition | |
Salman et al. | Speaker verification using boosted cepstral features with gaussian distributions | |
Biswas et al. | Speaker identification using Cepstral based features and discrete Hidden Markov Model | |
Khalifa et al. | Statistical modeling for speech recognition | |
Galić et al. | Speaker dependent recognition of whispered speech based on MLLR adaptation | |
CN108986794B (en) | Speaker compensation method based on power function frequency transformation | |
Kishimoto et al. | Model training using parallel data with mismatched pause positions in statistical esophageal speech enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||