KR101001684B1

KR101001684B1 - System and Method for Speaker Adaptation using Bilinear Model

Info

Publication number: KR101001684B1
Application number: KR1020090016903A
Authority: KR
Inventors: 김형순; 송화전
Original assignee: 부산대학교 산학협력단
Priority date: 2009-02-27
Filing date: 2009-02-27
Publication date: 2010-12-15
Also published as: KR20100097982A

Abstract

The present invention provides a speaker adaptation system and method using a bilinear model that performs speaker adaptation with a new speaker's speech data and adjusts the number of estimated parameters according to the amount of adaptation data, thereby improving speaker adaptation performance. The method comprises: building speaker-dependent models for each speaker, selectively using a hidden Markov model (HMM) or a Gaussian mixture model (GMM) composed of Gaussian distributions, and constructing an observation matrix from their mean vectors; fitting a bilinear model to the observation matrix; and performing speaker adaptation using the content basis vectors of the bilinear model.

Bilinear model, speaker adaptation, HMM, GMM, speaker adaptation model

Description

System and Method for Speaker Adaptation using Bilinear Model

The present invention relates to speech recognition technology, and more specifically to a speaker adaptation system and method using a bilinear model that performs speaker adaptation with a new speaker's speech data and improves adaptation performance by allowing the number of estimated parameters to be adjusted according to the amount of adaptation data.

With the recent spread of portable terminals, car navigation systems, and intelligent robots, voice interfaces have been studied extensively. In these applications the primary user is often fixed, and a fast speaker adaptation technique can improve recognition performance with only a small amount of adaptation data.

FIG. 1 is a schematic configuration diagram of a general speech recognition system.

A speech recognition system generally has the configuration shown in FIG. 1.

First, when a speech signal is input at the speech signal stage, the speech detection stage detects only the portions actually uttered by a person. Feature vectors that characterize the speech are then extracted from the detected speech at the feature extraction stage.

The extracted feature vectors then pass through a similarity measurement stage, in which they are compared with reference speech models, and a recognition decision stage.

Speaker adaptation is a representative method for improving the performance of a speech recognizer.

The most representative speaker adaptation methods are the MAP family, the MLLR family, and the eigenvoice family. Of these, the MLLR and eigenvoice families give good performance gains when a moderate amount of adaptation data is available and when very little is available, respectively.

MLLR expresses the adapted model as a linear transformation, specifically an affine transformation, of each reference model. The transformation matrix is estimated from the adaptation data by linear regression.

When a moderate amount of adaptation data is available, the transformation matrix can be estimated reliably and recognizer performance improves. When the adaptation data is very scarce, however, the transformation matrix is poorly estimated and recognizer performance instead degrades rapidly.

To compensate for this, the dimension of the speech feature vectors may be reduced, for example by principal component analysis (PCA).
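As a rough illustration of the MLLR idea described above, the following sketch applies one shared affine transformation to a set of reference mean vectors. The function name `apply_affine_transform` and the values of `W` and `bias` are assumptions made for illustration; here the transformation is set by hand rather than estimated from adaptation data by linear regression.

```python
import numpy as np

def apply_affine_transform(means, W, bias):
    """Return the adapted means W @ mu + bias for every row mu of `means`."""
    means = np.asarray(means, dtype=float)
    return means @ np.asarray(W, dtype=float).T + np.asarray(bias, dtype=float)

# Two 2-dimensional reference means (hypothetical values).
reference_means = np.array([[1.0, 0.0],
                            [0.0, 2.0]])
W = np.array([[2.0, 0.0],
              [0.0, 0.5]])    # linear part of the transformation (assumed)
bias = np.array([1.0, -1.0])  # offset part of the transformation (assumed)

adapted_means = apply_affine_transform(reference_means, W, bias)
print(adapted_means)
```

A full affine transformation of D-dimensional means has D x (D + 1) free entries, which gives a sense of why it is hard to estimate reliably from very little data.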

Finally, the eigenvoice-based approach derives basis vectors for the deviations of the SD (speaker-dependent) models of the training data from their mean, and then expresses the new speaker's model as the mean plus a weighted sum of the basis vectors, with the weights estimated from the adaptation data.

That is, the mean represents the content factor, and the basis vectors of the deviations represent the differences between speakers (the style factor). Because only the weights of the basis vectors for these style differences are estimated, the number of parameters is small, which is very advantageous for fast speaker adaptation.

In this method the weights are also obtained by linear regression. Its drawback is that performance stops improving even as the amount of adaptation data increases.
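The eigenvoice representation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the SD models are random stand-in supervectors, the basis is taken from an SVD of the deviations from the mean, and the weights are fixed by hand instead of being estimated from adaptation data. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# 5 training speakers, each summarized by a 6-dimensional supervector (toy data).
sd_supervectors = rng.normal(size=(5, 6))

mean = sd_supervectors.mean(axis=0)
# Principal directions of the deviations from the mean (rows of vt are orthonormal).
_, _, vt = np.linalg.svd(sd_supervectors - mean, full_matrices=False)

K = 2                       # number of basis vectors kept (assumed)
eigenvoices = vt[:K]        # K x 6

weights = np.array([0.5, -1.0])   # would normally be estimated from adaptation data
new_speaker = mean + weights @ eigenvoices
print(new_speaker)
```

Only K weights are estimated per speaker, which is why the approach suits very small amounts of adaptation data.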

The present invention addresses these problems of prior-art speaker adaptation methods. One object is to provide a speaker adaptation system and method using a bilinear model that performs speaker adaptation with a new speaker's speech data and improves adaptation performance by allowing the number of estimated parameters to be adjusted according to the amount of adaptation data.

Another object is to provide a speaker adaptation system and method using a bilinear model that, in the training stage, constructs style factors and content factors for the speakers who took part in training, and then uses adaptation data to estimate the style factor of a new speaker relative to the speaker-independent content factors, thereby estimating a model for the new speaker and improving adaptation performance.

Another object is to provide a speaker adaptation system and method using a bilinear model that maintains high adaptation performance even when adaptation data is very scarce, by adjusting the number of content basis vectors of the bilinear model.

Another object is to provide a speaker adaptation system and method using a bilinear model that allows the dimension of the content basis vectors to be increased, so that the number of estimated parameters can be adjusted according to the amount of adaptation data.

To achieve these objects, a speaker adaptation system using a bilinear model according to the present invention comprises: an observation matrix construction unit that builds $S$ SD (speaker-dependent) models, one per speaker, selectively using an HMM (hidden Markov model) or a GMM (Gaussian mixture model) composed of $C$ $D$-dimensional Gaussians, and constructs an observation matrix from the SD models considering only their mean vectors; a bilinear model construction unit that obtains the asymmetric bilinear model parameters by applying SVD (singular value decomposition) to the constructed matrix and constructs a bilinear model by linear transformation into a space that reflects the speaker's style; a speaker adaptation unit that, when adaptation data from a new speaker arrives, estimates only the style factor using the constructed bilinear model to construct a speaker-adapted model; and a speech recognition unit that recognizes the user's test speech using the speaker-adapted model.

To achieve another object, in a speaker adaptation method using a bilinear model according to the present invention, the step of constructing an observation matrix for speaker adaptation builds $S$ SD (speaker-dependent) models, selectively using an HMM (hidden Markov model) or a GMM (Gaussian mixture model) composed of $C$ $D$-dimensional Gaussians, and expresses the observation matrix of the $s$-th speaker using the Gaussian mean vectors. The observation matrix has size $SD \times C$ (that is, $S \cdot D$ rows and $C$ columns) and is normalized by its mean.

The speaker adaptation system and method using a bilinear model according to the present invention have the following effects.

First, speaker adaptation is performed using a new speaker's speech data, and the number of estimated parameters can be adjusted according to the amount of adaptation data, which improves speaker adaptation performance.

Second, adaptation data is used to estimate the style factor of a new speaker relative to the speaker-independent content factors, which yields a model for the new speaker and improves speaker adaptation performance.

Third, by adjusting the number of content basis vectors of the bilinear model, high speaker adaptation performance can be maintained even when adaptation data is very scarce.

A preferred embodiment of the speaker adaptation system and method using a bilinear model according to the present invention is described in detail below.

The features and advantages of the speaker adaptation system and method using a bilinear model according to the present invention will become apparent from the detailed description of each embodiment below.

FIG. 2 is a block diagram of a speaker adaptation system using a bilinear model according to the present invention.

The present invention performs speaker adaptation using a new speaker's speech data and allows the number of estimated parameters to be adjusted according to the amount of adaptation data, thereby improving speaker adaptation performance.

To solve the problems of speaker adaptation methods such as MLLR (maximum likelihood linear regression) and EV (eigenvoice) and to improve recognition performance, the present invention proposes using a bilinear model for speaker adaptation.

The present invention combines the advantage of EV, which suits fast speaker adaptation, with the advantage of MLLR, which gives large gains when a moderate amount of adaptation data is available, in order to achieve large gains regardless of the amount of adaptation data.

To this end, the present invention uses a bilinear model to separate the model parameters into a style factor and a content factor.

In the training stage, the model is constructed from these two factors. A small amount of adaptation data from a new speaker is then used to estimate that speaker's style factor, which is combined with the already constructed content factor, which does not depend on style. This amounts to a speaker adaptation method.

The reason is that when a new speaker uses the speech recognizer, recognition performance can be improved by transforming the model to the new speaker's style. The two factors can be represented effectively and handled in a mathematically simple way using a bilinear model.

A bilinear model consists of a style factor, a content factor, and a bilinear mapping function.

The bilinear mapping function maps the two factor spaces bilinearly onto the observation vector space.

With this model, the information obtained from each speaker (style and content) can be represented effectively, and a model for a new speaker can be constructed by estimating that speaker's style factor from a small amount of adaptation data.

Bilinear models are broadly divided into symmetric bilinear models and asymmetric bilinear models. The present invention proposes a speaker adaptation method based on the asymmetric bilinear model.

Speaker adaptation based on the asymmetric bilinear model can be interpreted as a generalization of MLLR.

The speaker adaptation system using a bilinear model according to the present invention includes: a per-speaker model construction unit (21) that, from training data comprising $S$ speakers, builds $S$ SD models, one per speaker, using an HMM or GMM composed of $C$ $D$-dimensional Gaussians; an observation matrix construction unit (22) that constructs an observation matrix from the SD models, considering only the mean vectors among the Gaussian parameters; a bilinear model construction unit (23) that obtains the asymmetric bilinear model parameters and constructs the bilinear model by linearly transforming vectors in the content space into a space that reflects the speaker's style, which is independent of the content space; a speaker adaptation unit (24) that, when adaptation data from a new speaker arrives, estimates only the style factor using the constructed bilinear model to construct a speaker-adapted model; and a speech recognition unit (25) that recognizes the user's test speech using the speaker-adapted model.

The speaker adaptation method using the bilinear model is described in detail below.

The method consists of an asymmetric bilinear model construction stage and a speaker adaptation stage that uses the constructed model.

To apply the bilinear model, an observation matrix must first be constructed from the training database.

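First, from training data comprising $S$ speakers, the per-speaker model construction unit (21) builds $S$ SD models, one per speaker, using an HMM or GMM composed of $C$ $D$-dimensional Gaussians. Writing $\mu_c^s$ for the mean vector of the $c$-th Gaussian of the $s$-th speaker, the observation matrix, whose $s$-th block row holds the means of the $s$-th speaker, is expressed as

$$Y = \begin{bmatrix} \mu_1^1 & \mu_2^1 & \cdots & \mu_C^1 \\ \vdots & \vdots & \ddots & \vdots \\ \mu_1^S & \mu_2^S & \cdots & \mu_C^S \end{bmatrix} \qquad (1)$$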

Here, only the mean vectors among the Gaussian parameters are considered.

Once the per-speaker models have been built in this way, the observation matrix construction unit (22) constructs the observation matrix from the SD models.

The size of the observation matrix is $SD \times C$, that is, $S \cdot D$ rows by $C$ columns. Besides applying the bilinear model to the differences among the individual models, style and content factors may also be constructed, for example, for the differences between an SI (speaker-independent) model and the SD models.
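The construction of the observation matrix can be sketched as below. This is a minimal illustration, assuming the per-speaker Gaussian means are already available as an array of shape (S, C, D); the function name `build_observation_matrix` and the toy values are hypothetical.

```python
import numpy as np

def build_observation_matrix(speaker_means):
    """Stack per-speaker D x C mean matrices into one (S*D) x C matrix.

    speaker_means has shape (S, C, D): for each of S speakers, the C Gaussian
    mean vectors of dimension D.
    """
    speaker_means = np.asarray(speaker_means, dtype=float)
    S, C, D = speaker_means.shape
    # Each speaker contributes a D x C block: one column per Gaussian mean vector.
    blocks = [speaker_means[s].T for s in range(S)]
    return np.vstack(blocks)

S, C, D = 3, 4, 2
means = np.arange(S * C * D, dtype=float).reshape(S, C, D)  # toy SD-model means
Y = build_observation_matrix(means)
print(Y.shape)
```

Rows `s*D` to `(s+1)*D - 1` of `Y` belong to speaker `s`, and column `c` belongs to Gaussian (content) `c`.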

A mean is computed for each content factor in Equation 1 and subtracted from the observation matrix. Basis factors can then be constructed for each factor from the resulting differences between the mean and each SD model (that is, how far each SD model lies from the mean).

The present invention uses the sample mean as the mean vector.

The observation matrix is thus normalized by subtracting this mean (Equation 3).
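A minimal sketch of this normalization follows. It assumes that the mean is taken per content (per Gaussian) across speakers, which is one reading of "the mean for each content factor"; the function name `normalize_observation_matrix` and the random data are hypothetical.

```python
import numpy as np

def normalize_observation_matrix(Y, num_speakers):
    """Subtract, from every speaker block, the mean block taken over speakers."""
    Y = np.asarray(Y, dtype=float)
    rows, C = Y.shape
    D = rows // num_speakers
    blocks = Y.reshape(num_speakers, D, C)   # one D x C block per speaker
    mean_block = blocks.mean(axis=0)         # per-content mean vectors, D x C
    Y_norm = (blocks - mean_block).reshape(rows, C)
    return Y_norm, mean_block

rng = np.random.default_rng(0)
S, D, C = 3, 2, 4
Y = rng.normal(size=(S * D, C))              # toy (S*D) x C observation matrix
Y_norm, mean_block = normalize_observation_matrix(Y, S)
```

The returned `mean_block` has to be kept, because it is added back when mean vectors are reconstructed from the normalized model.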

The bilinear model construction unit (23) then performs the following processing.

First, SVD (singular value decomposition) is used to obtain the asymmetric bilinear model parameters.

Applying SVD to the observation matrix $Y$ of Equation 1 gives

$$Y = U \mathbf{S} V^{T}.$$

Here, $U$ is the $SD \times SD$ matrix of eigenvectors of $Y Y^{T}$, $V$ is the $C \times C$ matrix of eigenvectors of $Y^{T} Y$, and $\mathbf{S}$ is the $SD \times C$ matrix whose singular values lie on the main diagonal of its $\min(SD, C) \times \min(SD, C)$ square submatrix, with zeros elsewhere. In these sizes, $SD$ denotes the product of the number of speakers $S$ and the dimension $D$.

Using $J$ eigenvectors, a style-specific matrix $A$ and a content basis matrix $B$ are then defined (Equation 5).

Here, $A^s$ is the style-specific matrix of the $s$-th speaker within $A$, which has size $SD \times J$, and $b_c$ is the $c$-th content basis vector within $B$, which has size $J \times C$. Each $A^s$ is therefore a $D \times J$ block of $A$, and each $b_c$ is a $J$-dimensional column of $B$.
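The decomposition can be sketched with NumPy as follows. How the singular values are divided between the two factors is an assumption here: the sketch scales the first J left singular vectors by the singular values to form the style matrix and takes the first J rows of the transposed right singular vectors as the content basis, which is a common convention for asymmetric bilinear models. The function name `fit_asymmetric_bilinear` is hypothetical.

```python
import numpy as np

def fit_asymmetric_bilinear(Y, J):
    """Return A ((S*D) x J) and B (J x C) from a rank-J truncated SVD of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    A = U[:, :J] * s[:J]   # style-specific matrices of all speakers, stacked
    B = Vt[:J, :]          # content basis vectors as columns
    return A, B

rng = np.random.default_rng(0)
S, D, C = 3, 2, 4
Y = rng.normal(size=(S * D, C))      # toy observation matrix
A, B = fit_asymmetric_bilinear(Y, J=2)
A_full, B_full = fit_asymmetric_bilinear(Y, J=C)
```

With this convention the rows of `B` are orthonormal, and keeping all components (`J = C` here) reproduces `Y` exactly.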

Using these, the $c$-th content mean vector of the $s$-th speaker is obtained as

$$\mu_c^s \approx A^s b_c, \qquad (6)$$

with equality when $J = C$.

That is, a vector in the content space, which is independent of style and common to all speakers, is linearly transformed into a space that reflects the style of the speaker, which is independent of the content.

These spaces are orthogonal to each other.

This is the biggest difference from the way the linear transformation matrix is obtained in conventional MLLR.

When the mean-normalized observation matrix of Equation 3 is used, the mean removed by the normalization is added back to the product $A^s b_c$.
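The effect of the number of components J on the reconstructed mean vectors can be checked numerically. The sketch below is illustrative only: it uses a random stand-in observation matrix, assumes the style matrix is formed from the left singular vectors scaled by the singular values, and uses hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(1)
S, D, C = 4, 3, 5
Y = rng.normal(size=(S * D, C))              # toy (S*D) x C observation matrix
U, sing, Vt = np.linalg.svd(Y, full_matrices=False)

def speaker_content_mean(s, c, J):
    """Reconstruct the c-th mean vector of speaker s from J components."""
    A = U[:, :J] * sing[:J]
    B = Vt[:J, :]
    A_s = A[s * D:(s + 1) * D, :]            # D x J style matrix of speaker s
    b_c = B[:, c]                            # J-dimensional content basis vector
    return A_s @ b_c

def reconstruction_error(J):
    """Frobenius norm of Y minus its rank-J bilinear reconstruction."""
    A = U[:, :J] * sing[:J]
    return np.linalg.norm(Y - A @ Vt[:J, :])

errors = [reconstruction_error(J) for J in range(1, C + 1)]
print(errors)
```

The error shrinks as J grows and vanishes at J = C for this toy matrix, matching the statement that equality holds when J = C.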

The speaker adaptation unit (24) then performs the following process.

When adaptation data from a new speaker arrives, the content factor of the constructed bilinear model does not change for the new speaker, so only the style factor needs to be estimated.

The present invention uses the asymmetric bilinear model for speaker adaptation, and the speaker-adapted model is obtained by writing Equation 6 for the new speaker:

$$\mu_c^{\mathrm{new}} = A^{\mathrm{new}} b_c. \qquad (7)$$

Here, $A^{\mathrm{new}}$ denotes the $D \times J$ style-specific matrix of the new speaker, $b_c$ is the $c$-th content basis vector as defined in Equation 5, and $\mu_c^{\mathrm{new}}$ is the $c$-th content vector of the new speaker.

The style-specific matrix $A^{\mathrm{new}}$ of the new speaker is estimated from the adaptation data, an observation vector sequence $O = \{o_1, \ldots, o_T\}$. The present invention uses MLE (maximum likelihood estimation) for this.

To this end, each content is assumed to follow a Gaussian distribution over the observation vectors. Let $\lambda$ denote the current model parameter set and $\bar{\lambda}$ the re-estimated model parameter set.

Given the new speaker's adaptation data $O = \{o_1, \ldots, o_T\}$, the total likelihood $p(O \mid \lambda)$ is maximized by iteratively maximizing the auxiliary function

$$Q(\lambda, \bar{\lambda}) = \mathrm{const} - \frac{1}{2} \sum_{t=1}^{T} \sum_{c=1}^{C} \gamma_c(t) \left[ \log \lvert \bar{\Sigma}_c \rvert + (o_t - \bar{\mu}_c)^{T} \bar{\Sigma}_c^{-1} (o_t - \bar{\mu}_c) \right]. \qquad (8)$$

Here, $o_t$ is the $t$-th $D$-dimensional adaptation vector of the new speaker, and $\gamma_c(t)$ is the posterior probability of being in content $c$ at time $t$, given the new speaker's observation vector sequence $O$ and the current model $\lambda$.

The covariance matrix $\bar{\Sigma}_c$ of the re-estimated model is assumed to be the same as that of the SI model, and the linear transformation is applied only to the mean vectors $\bar{\mu}_c$.
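The posterior probabilities used in the auxiliary function can be computed as sketched below for the GMM case. This is an assumption-laden illustration: it treats the contents as components of a GMM with diagonal covariances and known mixture weights, whereas an HMM would need the forward-backward algorithm instead. The function name `content_posteriors` and all values are hypothetical.

```python
import numpy as np

def content_posteriors(obs, means, variances, weights):
    """gamma[t, c] = P(content c | o_t) for a diagonal-covariance GMM."""
    obs = np.asarray(obs, dtype=float)              # T x D
    means = np.asarray(means, dtype=float)          # C x D
    variances = np.asarray(variances, dtype=float)  # C x D
    weights = np.asarray(weights, dtype=float)      # C
    diff = obs[:, None, :] - means[None, :, :]      # T x C x D
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_gauss         # T x C
    log_joint -= log_joint.max(axis=1, keepdims=True)  # for numerical stability
    gamma = np.exp(log_joint)
    return gamma / gamma.sum(axis=1, keepdims=True)

means = np.array([[0.0, 0.0], [4.0, 4.0]])
variances = np.ones((2, 2))
weights = np.array([0.5, 0.5])
obs = np.array([[0.1, -0.2], [3.9, 4.2], [2.0, 2.0]])
gamma = content_posteriors(obs, means, variances, weights)
print(gamma)
```

Each row of `gamma` sums to one; frames close to a component's mean are assigned almost entirely to that component.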

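Substituting Equation 7 into Equation 8, the $A^{\mathrm{new}}$ that maximizes $Q(\lambda, \bar{\lambda})$ for the adaptation data $O$ is found by setting the derivative with respect to $A^{\mathrm{new}}$ to zero:

$$\sum_{t=1}^{T} \sum_{c=1}^{C} \gamma_c(t)\, \Sigma_c^{-1} \left( o_t - A^{\mathrm{new}} b_c \right) b_c^{T} = 0,$$

where $\gamma_c(t)$ is the posterior probability of content $c$ at time $t$ and $\Sigma_c$ is the covariance matrix of the $c$-th Gaussian.

Rearranging,

$$\sum_{t=1}^{T} \sum_{c=1}^{C} \gamma_c(t)\, \Sigma_c^{-1} o_t b_c^{T} = \sum_{t=1}^{T} \sum_{c=1}^{C} \gamma_c(t)\, \Sigma_c^{-1} A^{\mathrm{new}} b_c b_c^{T}. \qquad (10)$$

If $\Sigma_c$ is diagonal, Equation 10 can be solved for $A^{\mathrm{new}}$ on a row-by-row basis, using Gaussian elimination or LU decomposition. This is the same procedure as estimating the transformation matrix in MLLR speaker adaptation.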
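The row-by-row estimation with diagonal covariances can be sketched as follows. This is a minimal illustration rather than the patented procedure itself: it assumes the maximum likelihood condition for Gaussian means of the form (style matrix) x (content basis vector) with fixed diagonal covariances, takes the posteriors as given, and uses hypothetical names throughout. `np.linalg.solve` performs the LU-based linear solve.

```python
import numpy as np

def estimate_style_matrix(obs, gamma, B, variances):
    """Estimate the D x J style matrix row by row.

    obs: T x D adaptation vectors, gamma: T x C posteriors,
    B: J x C content basis matrix, variances: C x D diagonal covariances.
    """
    obs = np.asarray(obs, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    B = np.asarray(B, dtype=float)
    variances = np.asarray(variances, dtype=float)
    D = obs.shape[1]
    J = B.shape[0]
    occupancy = gamma.sum(axis=0)     # C: sum over t of gamma_c(t)
    weighted_obs = gamma.T @ obs      # C x D: sum over t of gamma_c(t) * o_t
    A = np.zeros((D, J))
    for i in range(D):
        inv_var = 1.0 / variances[:, i]
        G = (B * (occupancy * inv_var)) @ B.T      # J x J system matrix for row i
        z = B @ (weighted_obs[:, i] * inv_var)     # J right-hand side for row i
        A[i] = np.linalg.solve(G, z)
    return A

# Noise-free toy check: data generated from a known style matrix.
rng = np.random.default_rng(0)
D, J, C, T = 3, 2, 4, 40
B = rng.normal(size=(J, C))
A_true = rng.normal(size=(D, J))
labels = np.arange(T) % C
gamma = np.eye(C)[labels]                 # T x C hard assignments
obs = (A_true @ B[:, labels]).T           # T x D observations
variances = rng.uniform(0.5, 2.0, size=(C, D))
A_hat = estimate_style_matrix(obs, gamma, B, variances)
```

Each row requires solving only a J x J system, so a smaller J means both fewer parameters and a cheaper, better conditioned solve.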

Among bilinear models, the asymmetric model used in the present invention, when applied to speaker adaptation, is a generalized form of MLLR. That is, MLLR corresponds to the asymmetric bilinear model in which the $J$-dimensional content basis vectors are replaced by the SI model, whose dimension $D$ equals that of the observation vectors.

In MLLR this dimension is therefore fixed at $D$. To reduce it, the dimension of the observation vectors themselves must be reduced, for example by principal component analysis.

In the bilinear model, by contrast, the content basis vectors are built, unlike in MLLR but as in EV (eigenvoice), as basis vectors from the variation among the training speakers (the style factor). The most important difference is that there is no need to reduce the observation vector dimension: it suffices to adjust the number of basis vectors.

That is, $J$ can be varied in the range $1 \le J \le C$. In particular, when $J = 1$ only $D$ parameters need to be estimated, so the style-specific matrix $A^{\mathrm{new}}$ of the new speaker can be estimated reliably even when the adaptation data is very scarce.
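The relation between J and the number of parameters to estimate is simple arithmetic, sketched below. The feature dimension of 39 is a hypothetical value chosen for illustration, and `num_style_parameters` is a hypothetical helper name.

```python
def num_style_parameters(D, J):
    """Number of free entries in a D x J style-specific matrix."""
    return D * J

D = 39  # hypothetical feature dimension
for J in (1, 5, 20):
    print(J, num_style_parameters(D, J))

# For comparison: a D x D matrix, the size that results when the
# J-dimensional basis is replaced by D-dimensional vectors as in MLLR.
full_transform_parameters = num_style_parameters(D, D)
```

Choosing a small J for very short adaptation sessions and a larger J as more data arrives is the adjustment of the number of estimated parameters that the text describes.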

The speaker adaptation method according to the present invention therefore works well not only when a moderate amount of adaptation data is available, as with MLLR, but also when the adaptation data is very scarce, as with EV (eigenvoice).

The speech recognition unit (25) then recognizes the user's test speech using the speaker-adapted model.

As described above, the speaker adaptation system and method using a bilinear model according to the present invention perform speaker adaptation using a new speaker's speech data and allow the number of estimated parameters to be adjusted according to the amount of adaptation data, thereby improving speaker adaptation performance.

From the above description, those skilled in the art will appreciate that various changes and modifications are possible without departing from the technical idea of the present invention.

Therefore, the technical scope of the present invention is not limited to what is described in the embodiments but should be determined by the claims.

FIG. 1 is a schematic configuration diagram of a general speech recognition system.

FIG. 2 is a block diagram of a speaker adaptation system using a bilinear model according to the present invention.

Explanation of reference numerals for the main parts of the drawings

21. Per-speaker model construction unit
22. Observation matrix construction unit
23. Bilinear model construction unit
24. Speaker adaptation unit
25. Speech recognition unit

Claims

1. A speaker adaptation system using a bilinear model, comprising:

an observation matrix construction unit that builds $S$ SD (speaker-dependent) models, one per speaker, selectively using an HMM (hidden Markov model) or a GMM (Gaussian mixture model) composed of $C$ $D$-dimensional Gaussians, and constructs an observation matrix from the SD models considering only their mean vectors;

a bilinear model construction unit that obtains asymmetric bilinear model parameters by applying SVD (singular value decomposition) to the constructed matrix and constructs a bilinear model by linear transformation into a space that reflects the speaker's style;

a speaker adaptation unit that, when adaptation data from a new speaker is input, estimates only the style factor using the constructed bilinear model to construct a speaker-adapted model; and

a speech recognition unit that recognizes the user's test speech using the speaker-adapted model.

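2. A speaker adaptation method using a bilinear model, wherein, in the step of constructing an observation matrix for speaker adaptation using the bilinear model, $S$ SD (speaker-dependent) models are built, selectively using an HMM (hidden Markov model) or a GMM (Gaussian mixture model) composed of $C$ $D$-dimensional Gaussians; the observation matrix of the $s$-th speaker is expressed using the Gaussian mean vectors; the size of the observation matrix is $SD \times C$; and the observation matrix is normalized by its mean.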

3. The method of claim 2, wherein applying SVD (singular value decomposition) to the observation matrix $Y$ decomposes it as

$$Y = U \mathbf{S} V^{T},$$

where $U$ is the $SD \times SD$ matrix of eigenvectors of $Y Y^{T}$, $V$ is the $C \times C$ matrix of eigenvectors of $Y^{T} Y$, and $\mathbf{S}$ is the $SD \times C$ matrix whose singular values lie on the main diagonal of its $\min(SD, C) \times \min(SD, C)$ square submatrix;

a style-specific matrix $A$ and a content basis matrix $B$ are defined using the $J$ ($\le C$) dominant eigenvectors among all eigenvectors, where $A^s$ is the style-specific matrix of the $s$-th speaker within $A$, which has size $SD \times J$, and $b_c$ is the $c$-th content basis vector within $B$, which has size $J \times C$;

the $c$-th content mean vector of the $s$-th speaker is obtained as

$$\mu_c^s \approx A^s b_c,$$

with equality when $J = C$; and

when the same SVD, style-specific matrix $A$, and content basis matrix $B$ are applied to the normalized observation matrix, the $c$-th content mean vector of the $s$-th speaker is expressed as the product $A^s b_c$ plus the mean removed by the normalization.

4. The method of claim 2, wherein, when adaptation data from a new speaker is received, the speaker-adapted model is defined using the asymmetric bilinear model as

$$\mu_c^{\mathrm{new}} = A^{\mathrm{new}} b_c,$$

where $A^{\mathrm{new}}$ is the $D \times J$ style-specific matrix of the new speaker, $b_c$ is the $c$-th content basis vector, and $\mu_c^{\mathrm{new}}$ is the $c$-th content vector of the new speaker.

5. The method of claim 4, wherein, given the adaptation data $O$ of the new speaker, the total likelihood $p(O \mid \lambda)$ is maximized by iteratively maximizing the auxiliary function $Q(\lambda, \bar{\lambda})$,

where $\lambda$ and $\bar{\lambda}$ are the current and re-estimated models, respectively, $o_t$ is the $t$-th $D$-dimensional adaptation vector of the new speaker, and $\gamma_c(t)$ is the posterior probability of being in content $c$ at time $t$, given the new speaker's observation vector sequence $O$ and the model $\lambda$;

the covariance matrix $\Sigma_c$ of the re-estimated model is assumed to be the same as that of the SI (speaker-independent) model, and the linear transformation is applied only to the mean vectors; and

the style-specific matrix $A^{\mathrm{new}}$ that maximizes $Q(\lambda, \bar{\lambda})$ for the adaptation data $O$ is obtained from

$$\sum_{t=1}^{T} \sum_{c=1}^{C} \gamma_c(t)\, \Sigma_c^{-1} o_t b_c^{T} = \sum_{t=1}^{T} \sum_{c=1}^{C} \gamma_c(t)\, \Sigma_c^{-1} A^{\mathrm{new}} b_c b_c^{T}.$$