KR20050063299A

KR20050063299A - Method for speaker adaptation based on maximum a posteriori eigenspace

Info

Publication number: KR20050063299A
Application number: KR1020030094690A
Authority: KR
Inventors: 김동국; 김승희
Original assignee: 한국전자통신연구원
Priority date: 2003-12-22
Filing date: 2003-12-22
Publication date: 2005-06-28

Abstract

The present invention relates to a speaker adaptation method based on a maximum a posteriori (MAP) eigenspace. It addresses the degradation in speech recognition performance caused by the mismatch between the training speakers and the test speaker. When only a small amount of adaptation data is available, the method improves on speaker-space-based adaptation by combining prior speaker-space information, obtained from the training speakers or from a specific speaker, with the new adaptation data. This improves recognition performance when training and test speakers are mismatched and improves the recognition system's adaptation to a specific speaker.

Description

Speaker adaptation method based on maximum a posteriori eigenspace {METHOD FOR SPEAKER ADAPTATION BASED ON MAXIMUM A POSTERIORI EIGENSPACE}

The present invention relates to a speaker adaptation method in the field of speech recognition. More specifically, it relates to a speaker adaptation method based on a maximum a posteriori eigenspace that adapts the model using adaptation data provided by the test speaker, improving recognition performance that would otherwise degrade because of the difference between the training speakers and the test speaker.

As is well known, current speech recognition technology is dominated by speaker-independent systems, which are deployed in real environments and used by many people in a variety of practical applications.

However, recognizer performance drops sharply when the training and recognition environments differ, especially when the test speaker was not among the training speakers. To overcome this, speaker adaptation techniques have been widely studied: the model is adapted using a small amount of adaptation data from the test speaker, and the adapted model is then used to improve recognition performance.

The goal of speaker adaptation is to approach speaker-dependent recognition performance using as little data as possible from the specific speaker.

Recent recognition systems use statistical models based on the Hidden Markov Model (HMM). Accordingly, many model-based adaptation algorithms that update the HMM parameters according to the adaptation data have been proposed and are in use.

Model-based speaker adaptation adapts by changing the parameters of the HMM observation probability distribution according to the adaptation data. Model-based adaptation techniques fall into three classes: maximum a posteriori (MAP) adaptation; transformation-based adaptation, such as Maximum Likelihood Linear Regression (MLLR); and speaker-space-based adaptation.

The object of the present invention is to improve speech recognition performance across a variety of speakers by collecting a small amount of adaptation data from the test speaker and adapting the recognizer, in cases where a mismatch exists between the training and recognition environments, in particular a mismatch caused by inter-speaker and intra-speaker pronunciation variability.

To this end, the invention adapts to each training speaker's training data using the speaker space technique, extracts information for a prior space model from the results, and uses that information during adaptation to improve adaptation performance.

To achieve this object, the speaker adaptation method according to the present invention comprises: assuming a Gaussian prior density for the space coefficients in an adaptation technique based on the speaker space technique for speaker adaptation in speech recognition, and combining it with that technique in maximum a posteriori form; and adapting the model based on the coefficients estimated by the combined maximum a posteriori adaptation technique.

Hereinafter, the speaker adaptation method based on a maximum a posteriori eigenspace according to the present invention is described with reference to the accompanying drawings.

Most current speech recognition systems use the Hidden Markov Model (HMM) to represent the statistical characteristics of the speech signal. An HMM represents a speech signal on the basis of a statistical signal model. Recent adaptation techniques take an acoustic model containing HMM parameters and derive an adapted acoustic model from it according to the adaptation data. The HMM of the acoustic model that is updated during adaptation consists of states, transitions between states, and an output distribution (output probability) that describes the output values in each state. The element with the greatest influence on recognition and adaptation performance is the output distribution function, which is expressed by Gaussian density functions. The output distribution is generally a continuous Gaussian mixture density, and such a model is called a Continuous Density HMM (CDHMM). Given a sequence of T independent and identically distributed observation vectors X = (X₁, …, X_T), the observation probability distribution of the CDHMM is given by Equation 1 below.

Here, ω_jk is the weight of mixture component k in state j, and the weights of each state sum to one (Σ_k ω_jk = 1). Each mixture component is a multivariate Gaussian distribution with a d-dimensional mean vector μ_jk and a d×d covariance matrix Σ_jk. The CDHMM parameter vector is given by Equation 2 below.
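The mixture output density described above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes diagonal covariance matrices (the text allows full d×d matrices), and the names `log_gaussian_diag` and `mixture_density` and the toy numbers are hypothetical, not taken from the patent.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density at x of a Gaussian with diagonal covariance (assumption)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def mixture_density(x, weights, means, variances):
    """Output density of one state j: sum over k of w_jk * N(x; mu_jk, Sigma_jk)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    return sum(
        w * math.exp(log_gaussian_diag(x, m, v))
        for w, m, v in zip(weights, means, variances)
    )

# Toy state with two mixture components in d = 2 dimensions.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [2.0, 2.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
density = mixture_density([0.0, 0.0], weights, means, variances)
```

In a real recognizer this evaluation would be done in the log domain for numerical stability; the plain form is kept here so the correspondence with the weights and Gaussians in the text stays visible.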

Many model-based adaptation algorithms that update the HMM parameters according to the adaptation data have recently been proposed. Model-based speaker adaptation adapts by changing the parameter λ of the CDHMM observation probability distribution according to the adaptation data. Adaptation techniques are generally classified into maximum a posteriori (MAP) adaptation; transformation-based adaptation, such as Maximum Likelihood Linear Regression (MLLR); and adaptation based on a speaker space. The adaptation method proposed in the present invention combines MAP adaptation with the speaker space technique: the information for the prior space model needed by MAP is extracted from the training speakers' data and combined with the speaker space technique in MAP form.

MAP adaptation incorporates the prior knowledge contained in the training data into a prior density function and combines it with the adaptation data in an optimal way. In MAP, the parameter λ is assumed to be a random variable with some distribution. If λ has a prior probability density function with hyperparameter ψ and is estimated from an observation sequence with a given likelihood, the MAP estimate is defined as the posterior mode of λ, as in Equation 3 below.

The prior density function carries the statistical characteristics of the parameter of interest before any observation sequence is given, and thereby constrains the values the parameter can take.
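To make the role of the prior concrete, the following minimal Python sketch shows MAP estimation in the simplest textbook case: a scalar Gaussian mean with known variance and a Gaussian prior on that mean. It is an illustration of the MAP principle, not a formula from the patent. The names `map_mean`, `prior_mean`, and `tau` are hypothetical; `tau` is assumed to be the ratio of the observation variance to the prior variance, so a larger `tau` means a stronger prior.

```python
def map_mean(samples, prior_mean, tau):
    """Posterior mode of a Gaussian mean under a Gaussian prior.

    tau = observation variance / prior variance (assumed weighting).
    tau = 0 ignores the prior and returns the ML estimate (sample mean).
    """
    n = len(samples)
    return (tau * prior_mean + sum(samples)) / (tau + n)

samples = [2.0, 4.0]
ml_estimate = map_mean(samples, prior_mean=0.0, tau=0.0)
map_estimate = map_mean(samples, prior_mean=0.0, tau=2.0)
```

With only two samples the prior pulls the estimate from the sample mean toward the prior mean; as the number of samples grows, the data term dominates and the MAP estimate approaches the ML estimate.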

The speaker space technique makes use of prior knowledge about the training speakers. This prior knowledge is defined by representative speaker vectors extracted from speaker-dependent models or speaker-dependent transformation parameters. When adaptation data for a new speaker is given, the adapted model is expressed as a linear combination of the speaker vectors within the speaker space. Because very few parameters need to be estimated, the technique is well suited to rapid speaker adaptation. The most widely used speaker-space-based adaptation technique is eigenvoice.

Eigenvoice obtains the speaker vectors of the speaker space from speaker-dependent models by Principal Component Analysis (PCA), and adapts by estimating the linear combination coefficients of the speaker vectors under the Maximum Likelihood (ML) criterion.

Let {μ₁, …, μ_R} be R well-trained speaker-dependent models obtained from the training data. Each μ_r is a D-dimensional supervector formed by concatenating all the Gaussian mean vectors of one speaker-dependent model. Eigenvoice derives a P-dimensional linear space (the eigenvoices) U from all the training speaker models by PCA. The newly adapted model is then obtained as a linear combination of the P principal speaker vectors, as in Equation 4 below.

Equation 4 involves the mean of {μ_r} and a weight vector y. The weight vector y is the quantity to be estimated; it is treated as an unknown constant and is obtained by the MLED (ML Eigen-Decomposition) technique as in Equation 5 below.
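The linear combination described here can be sketched as follows. This is a minimal sketch under stated assumptions: the adapted supervector is taken to be the mean supervector plus U times y, the basis vectors are given directly rather than computed by PCA, and the names `mean_supervector` and `adapt_supervector` and the toy values (D = 3, P = 2) are hypothetical.

```python
def mean_supervector(supervectors):
    """Average of the R speaker-dependent supervectors, element by element."""
    r = len(supervectors)
    return [sum(col) / r for col in zip(*supervectors)]

def adapt_supervector(mean_sv, basis, y):
    """Return mean_sv + U y, where basis lists the P columns of U (each of length D)."""
    adapted = list(mean_sv)
    for coeff, vec in zip(y, basis):
        for i, v in enumerate(vec):
            adapted[i] += coeff * v
    return adapted

# Toy example: R = 2 speaker-dependent supervectors, D = 3, P = 2.
speaker_models = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]
mean_sv = mean_supervector(speaker_models)
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # assumed basis, not derived by PCA here
y = [0.5, -1.0]
adapted = adapt_supervector(mean_sv, basis, y)
```

Only the P entries of `y` are estimated from the new speaker's data, which is why the approach needs so little adaptation data compared with re-estimating all D supervector entries.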

Eigenvoice works very well with a small amount of adaptation data, but because the adapted model is constrained to lie in the speaker space, further performance gains are hard to obtain when a large amount of data is available.

FIG. 1 illustrates the concepts of the speaker space, the speaker vectors, and speaker adaptation by eigenvoice, for D = 3 and P = 2.

The adaptation technique proposed in the present invention, the maximum a posteriori eigenspace (MAP Eigenspace; MAPES) technique, assumes that the weight vector y is a random variable with a Gaussian probability distribution, that is, y has a Gaussian prior. The new weights are then obtained under the MAP criterion as in Equation 6 below.

Here, ψ is the hyperparameter. With λ = y, Equation 6 can be solved using the EM algorithm as in Equation 7 below.

Here, X is the observed sequence of speech feature vectors, and S and K denote the state sequence and the mixture component sequence, respectively. The parameter y is estimated by maximizing this function. The EM technique estimates the parameter by applying the E-step and the M-step iteratively. Expanding Equation 7 gives Equation 8 below.

Here, the occupation term is the posterior probability that mixture component k of state j generated the observation vector X_t. U_jk is the submatrix of U corresponding to state j and mixture component k, and τ is a parameter that controls the contribution of the prior probability. The y that maximizes Equation 8 is given by Equation 9 below.

Expressed in the form of Equation 9, the ML solution of eigenvoice is given by Equation 10 below.
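The equations themselves are not reproduced in this text, so the following Python sketch shows one plausible form of the MAP weight estimate, derived under stated assumptions rather than copied from the patent. It assumes the adapted mean of component (j, k) is the mean-supervector block plus U_jk y, diagonal covariances, a Gaussian prior on y with mean `prior_mean` and inverse covariance `prior_prec`, and that τ simply scales the prior term. Setting the gradient of the resulting quadratic objective to zero gives a P×P linear system. All function and variable names are hypothetical.

```python
def solve(a, b):
    """Solve a y = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def mapes_weights(stats, prior_mean, prior_prec, tau):
    """Estimate y from occupation-weighted statistics plus a scaled Gaussian prior.

    stats: iterable of (gamma, x, mean_jk, inv_var_jk, u_jk), where u_jk is a
    d x P submatrix (list of d rows) and inv_var_jk is a diagonal inverse
    covariance of length d (assumption).
    """
    p = len(prior_mean)
    # Prior contribution: tau * Prec and tau * Prec * prior_mean.
    a = [[tau * prior_prec[i][j] for j in range(p)] for i in range(p)]
    b = [tau * sum(prior_prec[i][j] * prior_mean[j] for j in range(p)) for i in range(p)]
    # Data contribution: gamma * U^T Sigma^-1 U and gamma * U^T Sigma^-1 (x - mean).
    for gamma, x, mean, inv_var, u in stats:
        for row, xi, mi, iv in zip(u, x, mean, inv_var):
            for i in range(p):
                b[i] += gamma * row[i] * iv * (xi - mi)
                for j in range(p):
                    a[i][j] += gamma * row[i] * iv * row[j]
    return solve(a, b)

identity = [[1.0, 0.0], [0.0, 1.0]]
stats = [(1.0, [2.0, 4.0], [0.0, 0.0], [1.0, 1.0], identity)]
y_ml = mapes_weights(stats, [0.0, 0.0], identity, tau=0.0)   # prior switched off
y_map = mapes_weights(stats, [0.0, 0.0], identity, tau=1.0)  # prior pulls toward 0
```

With `tau=0` the prior terms vanish and the system reduces to a maximum likelihood estimate, which mirrors the relationship between the MAP and ML solutions stated in the text.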

In other words, the MAPES technique is a generalization of the existing eigenvoice technique. Finally, the parameter estimated by MAPES is used to obtain the adapted model in the same way as in eigenvoice.

The proposed MAP eigenvoice, as given in Equation 11 above, can be applied to self-adaptation. Self-adaptation means first adapting the model using the very data that is to be recognized, and then recognizing that data again with the adapted model.

According to the present invention, MAPES-based self-adaptation is carried out through two processes, shown in FIG. 2 and FIG. 3.

FIG. 2 shows the process of obtaining the prior distribution over the eigenspace for MAP.

First, features are extracted for each sentence of each training speaker (S101), and supervised adaptation is performed by eigenvoice adaptation using the transcription information (S103).

Next, the eigenspace coefficients for each sentence are obtained, and the mean and variance of the prior distribution of the coefficients (the speaker space model) are computed from the eigenspace coefficients of all sentences (S105). The prior distributions obtained in this way for all training speakers are used during the adaptation process.
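Step S105 can be sketched as follows. The sketch assumes that one prior is kept per training speaker and that the variance is diagonal (one value per coefficient); the text only says "mean and variance", so both choices are assumptions. The function name `estimate_prior` and the toy coefficient values are hypothetical.

```python
def estimate_prior(coeff_vectors):
    """Per-dimension mean and variance of a speaker's per-sentence coefficient vectors."""
    n = len(coeff_vectors)
    p = len(coeff_vectors[0])
    mean = [sum(v[i] for v in coeff_vectors) / n for i in range(p)]
    var = [sum((v[i] - mean[i]) ** 2 for v in coeff_vectors) / n for i in range(p)]
    return mean, var

# Hypothetical eigenspace coefficients, one vector per sentence, grouped by speaker.
sentence_coeffs = {
    "spk_a": [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]],
    "spk_b": [[0.0, 1.0], [2.0, 3.0]],
}
priors = {spk: estimate_prior(vs) for spk, vs in sentence_coeffs.items()}
```

In practice a variance floor would likely be needed, since a speaker with few sentences can produce a near-zero variance in some dimension.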

FIG. 3 shows the process of selecting the probability distribution of a particular prior speaker space during self-adaptation, and the adaptation process itself.

First, features are extracted from the input speech (S201) and recognized by the speaker-independent recognizer, and the state sequence information (transcription information) recognized during decoding is obtained (S203).

Next, using the recognized state sequence information and the prior distribution information (speaker space model) obtained from each training speaker, the prior distribution that yields the largest probability value for the current input speech is selected (S205).

Then, MAPES-based speaker adaptation is performed using the selected prior distribution (S207), and recognition is performed again with the adapted model (S203).
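The text does not spell out how the probability value in S205 is computed. One possible reading, shown here purely as an assumption, is to compute an initial coefficient estimate from the first recognition pass and pick the training speaker whose Gaussian prior assigns it the highest log density. The names `log_prior` and `select_prior`, the diagonal-variance priors, and the toy values are all hypothetical.

```python
import math

def log_prior(y, mean, var):
    """Log density of coefficient vector y under a diagonal Gaussian prior (assumption)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v
        for yi, m, v in zip(y, mean, var)
    )

def select_prior(y_initial, priors):
    """Return the key of the prior with the highest log density at y_initial."""
    return max(priors, key=lambda name: log_prior(y_initial, *priors[name]))

# Hypothetical priors (mean, variance) per training speaker.
priors = {
    "spk_a": ([0.0, 0.0], [1.0, 1.0]),
    "spk_b": ([2.0, 2.0], [1.0, 1.0]),
}
y_initial = [1.8, 2.1]  # assumed estimate from the first recognition pass
best = select_prior(y_initial, priors)
```

The selected prior would then supply the mean and variance used in the MAPES estimate, after which recognition is run again with the adapted model.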

When a speech recognition system is used, the present invention improves recognition performance under mismatch between the training speakers and the test speaker, and improves the recognition system's adaptation to a specific speaker.

In addition, even when only a very small amount of adaptation data is available from a specific speaker, the invention uses the information of the training speakers and the test speaker more effectively and therefore adapts very quickly.

FIG. 1 is a diagram illustrating the concepts of the speaker space, the speaker vectors, and speaker adaptation by eigenvoice;

FIG. 2 is a diagram showing the process of obtaining the prior speaker space model of the MAPES technique from training speaker sentences in the present invention;

FIG. 3 is a diagram showing the process of selecting the optimal speaker space model from the input speech and performing MAPES adaptation with it in the present invention.

Claims

A speaker adaptation method comprising:

assuming a Gaussian prior density for the space coefficients in an adaptation technique based on the speaker space technique for speaker adaptation in speech recognition, and combining it with that technique in maximum a posteriori form; and

adapting a model based on the coefficients estimated by the combined maximum a posteriori adaptation technique.

The method of claim 1, wherein obtaining the prior distribution over the eigenspace for maximum a posteriori estimation comprises:

extracting features for each sentence of each training speaker and performing supervised adaptation through eigenvoice adaptation using transcription information; and

obtaining the eigenspace coefficients for each sentence, computing the mean and variance for the prior distribution of the coefficients from the eigenspace coefficients of all sentences, and using the prior distributions of all training speakers in the adaptation process.

The method of claim 2, further comprising:

extracting features of an input speech during the adaptation process and obtaining the state sequence information recognized by a speaker-independent recognizer;

selecting the prior distribution that yields the largest probability value for the current input speech, using the recognized state sequence information and the prior distribution information obtained from each training speaker; and

performing maximum a posteriori eigenspace-based speaker adaptation using the selected prior distribution, and performing recognition again based on the adapted model.